Biogeobears Validation

Validation tests of BioGeoBEARS — Introduction

(link to this section)

Here I post some validation examples for BioGeoBEARS.

I have left out comparisons to Lagrange, which are documented in Matzke (2014, Systematic Biology), and which test:

  1. the downpass log-likelihood calculations under the DEC model,
  2. the parameter inferences for d and e, and
  3. the maximum likelihood (ML) optimizer.

Although under the DEC model, the ML log-likelihood and parameter estimates are identical, the ancestral states estimates are not directly comparable between Lagrange and BioGeoBEARS, because:

  • Lagrange estimated node state/range probabilities under "local optimization", i.e. for every possible ancestral range at every node, Lagrange runs a new ML optimization of the parameters while fixing that range at that node.
  • BioGeoBEARS estimates ancestral states/ranges under the globally optimum ML model.

Local and global ancestral state/range probabilities will typically be similar but not identical.

In this page I focus on:

  1. Comparison of BioGeoBEARS ancestral states to Liam Revell's phytools, for a case where a BioGeoBEARS model can be converted into the (simple) model used by phytools
  2. Comparison of BioGeoBEARS Biogeographical Stochastic Mapping (BSM) to Liam Revell's phytools stochastic mapping, for a case where a BioGeoBEARS model can be converted into the (simple) model used by phytools
  3. Comparison of BioGeoBEARS log-likelihoods to ClaSSE (in diversitree), showing how the ClaSSE model can be simplified into a model identical with DEC or DEC+J, etc.
  4. Comparison of BioGeoBEARS ancestral state probabilities in time-stratified analyses to BioGeoBEARS ancestral state probabilities in non-time-stratified analyses. The latter are much simpler to implement. By setting up a time-stratified analysis, but set up the constraints matrices so that there is no actual change in geography between time slices, we can test that time-stratified analyses retrieve the same log-likelihoods, ML inferences, ancestral state probabilities, and stochastic mapping results

Note: I use the terms "state", "range", and "geographic range" interchangeably throughout, because in the framework of discrete historical biogeography, each possible geographic range is a state in the state space. Transitions between states are controlled by the anagenetic and cladogenetic transition matrices, which themselves are determined by the model parameters of the BioGeoBEARS supermodel.

Abbreviations:

  • ML = maximum likelihood
  • BSM = Biogeographical Stochastic Map(ping)(s)

BioGeoBEARS DEC and DEC+J — ancestral state/range probabilities versus BSMs

(link to this section)

NOTE: For an introduction to Biogeographical Stochastic Mapping (BSM), please see the section introducing Biogeographical Stochastic Mapping, and the Biogeographical Stochastic Mapping Example Script.

The input/output/script files for this section can be found in: Psychotria_M0_unconstrained_BSMs.zip

Here, we check that the ancestral state probabilities estimated under BioGeoBEARS DEC and DEC+J (non-stratified, unconstrained) are approximated by averaging many Biogeographical Stochastic Maps (BSMs).

DEC

(link to this section)

M0_unconstr_DEC_checkBSM_v1_MLunconstr_marginals_allNodes.png
Figure Caption: Ancestral state/range probabilities versus mean of stochastically mapped states. Unconstrained DEC model (non-stratified) on the Psychotria dataset. x-axis: analytic ancestral state/range probabilities of each state/range at each internal node. y-axis: mean of 200 stochastic maps for each state/range at each node. Vertical lines are the 95% confidence intervals under the multinomial distribution (calculated with the multinomialCI function of the MultinomialCI package).

DEC+J

(link to this section)

M0_unconstr_DEC%2BJ_checkBSM_v1_MLunconstr_marginals_allNodes.png
Figure Caption: Ancestral state/range probabilities versus mean of stochastically mapped states. Unconstrained DEC+J model (non-stratified) on the Psychotria dataset. x-axis: analytic ancestral state/range probabilities of each state/range at each internal node. y-axis: mean of 200 stochastic maps for each state/range at each node. Vertical lines are the 95% confidence intervals under the multinomial distribution (calculated with the multinomialCI function of the MultinomialCI package).

The BSMs line up well with the analytic probabilities for both DEC and DEC+J. This is also the case for time-stratified analyses and other models, although I am not posting these for the moment due to the lateness of the hour.

BioGeoBEARS stratified analyses — ancestral state/range probabilities versus BSMs

(link to this section)

The input/output/script files for this section can be found in: Psychotria_M0_stratified_BSMs.zip (6 subdirectories, one for each model)

Here, for the 6 standard [[BioGeoBEARS]] models, analyses that are time-stratified (but with constant geography) are run, to check that the ancestral state probabilities match the mean of 200 Biogeographic Stochastic Maps (BSMs). This is a double-check on the ancestral state probabilities and the BSM algorithms in a time-stratified analysis.

Time-stratified DEC and DEC+J

(link to this section)

M0strat_DEC_checkBSM_v1_MLstrat_marginals_allNodes.png
M0strat_DEC%2BJ_checkBSM_v1_MLstrat_marginals_allNodes.png
Figure Caption: Ancestral state/range probabilities versus mean of stochastically mapped states. Time-stratified DEC model (but unconstrained, constant geography) on the Psychotria dataset. x-axis: analytic ancestral state/range probabilities of each state/range at each internal node. y-axis: mean of 200 stochastic maps for each state/range at each node. Vertical lines are the 95% confidence intervals under the multinomial distribution (calculated with the multinomialCI function of the MultinomialCI package). Figure Caption: Ancestral state/range probabilities versus mean of stochastically mapped states. Time-stratified DEC+J model (but unconstrained, constant geography) on the Psychotria dataset. x-axis: analytic ancestral state/range probabilities of each state/range at each internal node. y-axis: mean of 200 stochastic maps for each state/range at each node. Vertical lines are the 95% confidence intervals under the multinomial distribution (calculated with the multinomialCI function of the MultinomialCI package).

Time-stratified DIVALIKE and DIVALIKE+J

(link to this section)

M0strat_DIVALIKE_checkBSM_v1_MLstrat_marginals_allNodes.png
M0strat_DIVALIKE%2BJ_checkBSM_v1_MLstrat_marginals_allNodes.png
Figure Caption: Ancestral state/range probabilities versus mean of stochastically mapped states. Time-stratified DIVALIKE model (but unconstrained, constant geography) on the Psychotria dataset. x-axis: analytic ancestral state/range probabilities of each state/range at each internal node. y-axis: mean of 200 stochastic maps for each state/range at each node. Vertical lines are the 95% confidence intervals under the multinomial distribution (calculated with the multinomialCI function of the MultinomialCI package). Figure Caption: Ancestral state/range probabilities versus mean of stochastically mapped states. Time-stratified DIVALIKE+J model (but unconstrained, constant geography) on the Psychotria dataset. x-axis: analytic ancestral state/range probabilities of each state/range at each internal node. y-axis: mean of 200 stochastic maps for each state/range at each node. Vertical lines are the 95% confidence intervals under the multinomial distribution (calculated with the multinomialCI function of the MultinomialCI package).

Time-stratified BAYAREALIKE and BAYAREALIKE+J

(link to this section)

M0strat_BAYAREALIKE_checkBSM_v1_MLstrat_marginals_allNodes.png
M0strat_BAYAREALIKE%2BJ_checkBSM_v1_MLstrat_marginals_allNodes.png
Figure Caption: Ancestral state/range probabilities versus mean of stochastically mapped states. Time-stratified BAYAREALIKE model (but unconstrained, constant geography) on the Psychotria dataset. x-axis: analytic ancestral state/range probabilities of each state/range at each internal node. y-axis: mean of 200 stochastic maps for each state/range at each node. Vertical lines are the 95% confidence intervals under the multinomial distribution (calculated with the multinomialCI function of the MultinomialCI package). Figure Caption: Ancestral state/range probabilities versus mean of stochastically mapped states. Time-stratified BAYAREALIKE+J model (but unconstrained, constant geography) on the Psychotria dataset. x-axis: analytic ancestral state/range probabilities of each state/range at each internal node. y-axis: mean of 200 stochastic maps for each state/range at each node. Vertical lines are the 95% confidence intervals under the multinomial distribution (calculated with the multinomialCI function of the MultinomialCI package).

BioGeoBEARS stratified vs. non-stratified analyses (ancestral state/range probabilities)

(link to this section)

The input/output/script files for this section can be found in: Psychotria_M0_unconstrained.zip and Psychotria_M0_stratified.zip

Programming a time-stratified analysis is much more complicated than a non-stratified analysis: imagine taking a timetree, slicing it into pieces wherever it crosses a stratum border, and then keeping track of all of the pieces as probabilities are passed around during the downpass and uppass. Then imagine a changing state space, changing constraints, etc. This is the kind of thing where a real object-oriented programming language with a real phylogeny structure based on node objects would have a major advantage over R (In R, I have to construct a big table and continually query it to keep track of all of the tree pieces and what time slice the calculations are taking place in.) It gets even more fun if there are fossil branches with tips that do not reach up to the present (0 million years ago).

To check that this has all been done correctly, we can compare non-stratified BioGeoBEARS analysis results to time-stratified analyses where each time-period has the same constraints as the recent. Here, I used the standard Hawaii island ages to do the time-stratification, but in each time period the manual dispersal matrix was all 1s.

ML, LnL, parameters

(link to this section)

Here are the parameter inferences for the 6 standard models, in the time-stratified analysis:

                    LnL numparams            d            e         j
DEC           -34.54313         2 3.501898e-02 2.840040e-02 0.0000000
DEC+J         -20.94759         3 1.000000e-12 1.000000e-12 0.1142670
DIVALIKE      -33.15172         2 4.472860e-02 1.000000e-12 0.0000000
DIVALIKE+J    -21.08621         3 1.000000e-12 1.000000e-12 0.1157244
BAYAREALIKE   -40.33461         2 1.863147e-02 3.060137e-01 0.0000000
BAYAREALIKE+J -21.55265         3 1.000000e-07 1.000000e-07 0.1081164

Here are the inferences in the non-stratified analysis (specifically, the analysis in the standard example script):

                    LnL numparams            d            e         j
DEC           -34.54196         2 3.504546e-02 2.835632e-02 0.0000000
DEC+J         -20.94759         3 1.000000e-12 1.000000e-12 0.1142757
DIVALIKE      -33.14968         2 4.474416e-02 1.000000e-12 0.0000000
DIVALIKE+J    -21.08621         3 1.000000e-12 1.000000e-12 0.1156998
BAYAREALIKE   -40.33704         2 1.738085e-02 3.040188e-01 0.0000000
BAYAREALIKE+J -21.55265         3 1.000000e-07 1.000000e-07 0.1081169

Pretty good. This just tests the downpass calculations and the ML optimizer on stratified vs. non-stratified analyses. Let's also compare the ancestral state probabilities under stratified and non-stratified analyses:

DEC

(link to this section)

Left: Non-stratified DEC M0 unconstrained; Right: Stratified DEC M0 unconstrained

Psychotria_DEC_M0_unconstrained_states.png
Psychotria_DEC_M0_stratified_states.png
Psychotria_DEC_M0_unconstrained_pies.png
Psychotria_DEC_M0_stratified_pies.png

DEC+J

(link to this section)

Left: Non-stratified DEC+J M0 unconstrained; Right: Stratified DEC+J M0 unconstrained

Psychotria_DEC+J_M0_unconstrained_states.png
Psychotria_DEC+J_M0_stratified_states.png
Psychotria_DEC+J_M0_unconstrained_pies.png
Psychotria_DEC+J_M0_stratified_pies.png

DIVALIKE

(link to this section)

Left: Non-stratified DIVALIKE M0 unconstrained; Right: Stratified DIVALIKE M0 unconstrained

Psychotria_DIVALIKE_M0_unconstrained_states.png
Psychotria_DIVALIKE_M0_stratified_states.png
Psychotria_DIVALIKE_M0_unconstrained_pies.png
Psychotria_DIVALIKE_M0_stratified_pies.png

DIVALIKE+J

(link to this section)

Left: Non-stratified DIVALIKE+J M0 unconstrained; Right: Stratified DIVALIKE+J M0 unconstrained

Psychotria_DIVALIKE+J_M0_unconstrained_states.png
Psychotria_DIVALIKE+J_M0_stratified_states.png
Psychotria_DIVALIKE+J_M0_unconstrained_pies.png
Psychotria_DIVALIKE+J_M0_stratified_pies.png

BAYAREALIKE

(link to this section)

Left: Non-stratified BAYAREALIKE M0 unconstrained; Right: Stratified BAYAREALIKE M0 unconstrained

Psychotria_BAYAREALIKE_M0_unconstrained_states.png
Psychotria_BAYAREALIKE_M0_stratified_states.png
Psychotria_BAYAREALIKE_M0_unconstrained_pies.png
Psychotria_BAYAREALIKE_M0_stratified_pies.png

BAYAREALIKE+J

(link to this section)

Left: Non-stratified BAYAREALIKE+J M0 unconstrained; Right: Stratified BAYAREALIKE+J M0 unconstrained

Psychotria_BAYAREALIKE+J_M0_unconstrained_states.png
Psychotria_BAYAREALIKE+J_M0_stratified_states.png
Psychotria_BAYAREALIKE+J_M0_unconstrained_pies.png
Psychotria_BAYAREALIKE+J_M0_stratified_pies.png

Phytools rerootingMethod vs. BioGeoBEARS BAYAREALIKE+a-d-e model (non-stratified BSMs and ancestral state/range probabilities)

(link to this section)

The input/output/script files for this section can be found in: Psychotria_M0_unconstrained_BAYAREALIKE+a.zip

To facilitate comparison with the ancestral states estimation and stochastic mapping algorithms in phytools, SIMMAP, etc., we need to create a "standard unordered character model with equal rates" in the BioGeoBEARS model object. This is simply done with the following code. The model ends up being the BAYAREALIKE model, with parameters d and e fixed to 0, and a (the "anagenetic range-switching" parameter) set to be freely estimated.

Specification of BAYAREALIKE+a-d-e model

(link to this section)

Specification of the BAYAREALIKE+a-d-e model is done as follows:

# Set up BAYAREALIKE model
# No subset sympatry
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["s","type"] = "fixed"
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["s","init"] = 0.0
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["s","est"] = 0.0

# No vicariance
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["v","type"] = "fixed"
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["v","init"] = 0.0
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["v","est"] = 0.0

# No jump dispersal/founder-event speciation
# BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["j","type"] = "free"
# BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["j","init"] = 0.01
# BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["j","est"] = 0.01

# Adjust linkage between parameters
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["ysv","type"] = "1-j"
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["ys","type"] = "ysv*1/1"
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["y","type"] = "1-j"

# Only sympatric/range-copying (y) events allowed, and with 
# exact copying (both descendants always the same size as the ancestor)
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["mx01y","type"] = "fixed"
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["mx01y","init"] = 0.9999
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["mx01y","est"] = 0.9999

# Turn off d and e; turn on a
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["d","type"] = "fixed"
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["d","init"] = 0.0
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["d","est"] = 0.0

BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["e","type"] = "fixed"
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["e","init"] = 0.0
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["e","est"] = 0.0

BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["a","type"] = "free"
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["a","init"] = 0.03
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["a","est"] = 0.03

The model is thus really "BAYAREALIKE+a-d-e", i.e. "the BAYAREALIKE model, plus a, minus d, minus e". Alternatively we could call it the "instantaneous area switching model", the "anagenetic island model", a Markov-k (Mk model), the BEAST discrete phylogeography model with equal rates, the Jukes-Cantor DNA model, or an unordered character model — these are all equivalent. This is, by the way, the model that is being assumed when biogeography is treated as a standard unordered character. If the rates of different range-switches were different free parameters, this would essentially be a DNA GTR model, as in Sanmartin's island-switching paper.

Comparison of non-stratified BGB state probabilities and non-stratified BGB BSMs

(link to this section)

M0_unconstr_BAYAREALIKE%2Ba_checkBSM_v1_MLunconstr_marginals_allNodes.png
Figure Caption: Comparison of BioGeoBEARS analytic ancestral state/range probability estimates (x-axis), and the probabilities by taking the mean of 200 stochastic maps (y-axis). BioGeoBEARS implementation of the BAYAREALIKE+a-d-e model on the Psychotria dataset, non-stratified. Vertical lines are the 95% confidence intervals under the multinomial distribution (calculated with the multinomialCI function of the MultinomialCI package).

Comparison of phytools state probabilities and non-stratified BGB BSMs

(link to this section)

We can also compare the average of many stochastic maps to the analytic ancestral state probability estimates as estimated by the phytools function rerootingMethod.

M0_unconstr_BAYAREALIKE%2Ba_checkBSMunconstr_v1_phytools_rerootingMethod_allNodes.png

Figure Caption: Comparison of phytools rerootingMethod analytic ancestral state/range probability estimates (x-axis), and the probabilities by taking the mean of 200 stochastic maps (y-axis). BioGeoBEARS implementation of the BAYAREALIKE+a-d-e model on the Psychotria dataset, non-stratified. Vertical lines are the 95% confidence intervals under the multinomial distribution (calculated with the multinomialCI function of the MultinomialCI package).

This graphic is the same as the previous one, because the same set of stochastic maps is being used for the y-axis, and it turns out that the phytools rerootingMethod ancestral state probabilities are nearly identical to the BioGeoBEARS ancestral state probabilities.

Comparison of phytools state probabilities and BGB non-stratified state probabilities

(link to this section)

We can also compare the log-likelihoods and parameter inferences between phytools rerootingMethod and BioGeoBEARS ancestral states:

#######################################################
# phytools ancestral states (ER model)
#######################################################

library(ape)
library(phytools)
library(BioGeoBEARS)

trfn = "tree.newick"

# Look at the raw Newick file:
moref(trfn)

# Look at your phylogeny:
tr = read.tree(trfn)
tr

tipnodenums = 1:length(tr$tip.label)
rootnodenum = length(tr$tip.label) + 1
intnodenums = (length(tr$tip.label)+1):(length(tr$tip.label) + tr$Nnode)
intnodenums
geogfn = "geog_data.txt"

# Look at the raw geography text file:
moref(geogfn)

# Look at your geographic range data:
tipranges = getranges_from_LagrangePHYLIP(lgdata_fn=geogfn)
tipranges

areanames = names(tipranges@df)
areanums = 1:length(areanames)
tipvals = rep(NA, nrow(tipranges@df))
for (i in 1:nrow(tipranges@df))
    {
    tipvals[i] = areanums[unlist(tipranges@df[i,])==1]
    }
names(tipvals) = row.names(tipranges@df)
tipvals
phytools_res = rerootingMethod(tree=tr, x=tipvals, model="ER")
phytools_res
phytools_ML = phytools_res$marginal.anc

#######################################################
# Store in a BioGeoBEARS-type format
#######################################################
resBAYAREALIKEcopy = resBAYAREALIKE
resBAYAREALIKEcopy$ML_marginal_prob_each_state_at_branch_top_AT_node[intnodenums, 2:5] = phytools_ML

#######################################################
# START VALIDATION CHECK
#######################################################
# Log-likelihood DOES match:
phytools_res$loglik
# -22.97566
resBAYAREALIKE$total_loglikelihood
# -22.97566

# The rate parameter DOES match
phytools_res$Q[1,2]
# 0.07572288
resBAYAREALIKE$outputs@params_table["a","est"]
# 0.07575708
#######################################################
# END VALIDATION CHECK
#######################################################

We can plot the analytic estimates from phytools and the BioGeoBEARS non-stratified analysis:

M0_unconstr_BAYAREALIKE%2Ba_checkMLunconstr_v1_phytools_rerootingMethod_allNodes.png
Figure Caption: Comparison of analytic ancestral state/range probabilities under phytools rerootingMethod (x-axis) or a BioGeoBEARS non-stratified analysis with the BAYAREALIKE+a-d-e model (y-axis). Psychotria dataset, non-stratified.

The correlation between the two algorithms is high (R2=1), basically identical.

Finally, as expected, if we do more stochastic maps, the mean state probabilities will (slowly) approach the analytic estimates. Here, I do this for the DEC model.

M0_DEC_checkBSM_v1_ML_marginals_allNodes_10000_BSMs.png
Figure Caption: Psychotria dataset, non-stratified, DEC model. x-axis: analytic ancestral state/range probabilities of each state/range at each internal node. Y-axis: mean of 10,000 stochastic maps for each state/range at each node. Vertical lines are the 95% confidence intervals under the multinomial distribution (calculated with the multinomialCI function of the MultinomialCI package).

Phytools rerootingMethod vs. BioGeoBEARS BAYAREALIKE+a-d-e model (time-stratified BSMs and ancestral state/range probabilities)

(link to this section)

The input/output/script files for this section can be found in: Psychotria_M0_stratified_BAYAREALIKE+a.zip

To facilitate comparison with the ancestral states estimation and stochastic mapping algorithms in phytools, SIMMAP, etc., we need to create a "standard unordered character model with equal rates" in the BioGeoBEARS model object. This is simply done with the following code. The model ends up being the BAYAREALIKE model, with parameters d and e fixed to 0, and a (the "anagenetic range-switching" parameter) set to be freely estimated.

Specification of BAYAREALIKE+a-d-e model

(link to this section)

Specification of the BAYAREALIKE+a-d-e model is done as follows:

# Set up BAYAREALIKE model
# No subset sympatry
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["s","type"] = "fixed"
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["s","init"] = 0.0
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["s","est"] = 0.0

# No vicariance
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["v","type"] = "fixed"
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["v","init"] = 0.0
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["v","est"] = 0.0

# No jump dispersal/founder-event speciation
# BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["j","type"] = "free"
# BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["j","init"] = 0.01
# BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["j","est"] = 0.01

# Adjust linkage between parameters
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["ysv","type"] = "1-j"
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["ys","type"] = "ysv*1/1"
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["y","type"] = "1-j"

# Only sympatric/range-copying (y) events allowed, and with 
# exact copying (both descendants always the same size as the ancestor)
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["mx01y","type"] = "fixed"
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["mx01y","init"] = 0.9999
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["mx01y","est"] = 0.9999

# Turn off d and e; turn on a
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["d","type"] = "fixed"
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["d","init"] = 0.0
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["d","est"] = 0.0

BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["e","type"] = "fixed"
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["e","init"] = 0.0
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["e","est"] = 0.0

BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["a","type"] = "free"
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["a","init"] = 0.03
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["a","est"] = 0.03

The model is thus really "BAYAREALIKE+a-d-e", i.e. "the BAYAREALIKE model, plus a, minus d, minus e". Alternatively we could call it the "instantaneous area switching model", the "anagenetic island model", a Markov-k (Mk model), the BEAST discrete phylogeography model with equal rates, the Jukes-Cantor DNA model, or an unordered character model — these are all equivalent. This is, by the way, the model that is being assumed when biogeography is treated as a standard unordered character. If the rates of different range-switches were different free parameters, this would essentially be a DNA GTR model, as in Sanmartin's island-switching paper.

Comparison of stratified BGB state probabilities and stratified BGB BSMs

(link to this section)

M0strat_BAYAREALIKE+a_checkBSM_v1_MLstrat_marginals_allNodes_200BSMs.png
Figure Caption: Comparison of BioGeoBEARS analytic ancestral state/range probability estimates (x-axis), and the probabilities by taking the mean of 200 stochastic maps (y-axis). BioGeoBEARS implementation of the BAYAREALIKE+a-d-e model on the Psychotria dataset, time-stratified with constant geography. Vertical lines are the 95% confidence intervals under the multinomial distribution (calculated with the multinomialCI function of the MultinomialCI package).

Comparison of phytools state probabilities and stratified BGB BSMs

(link to this section)

We can also compare the average of many stochastic maps to the analytic ancestral state probability estimates as estimated by the phytools function rerootingMethod.

M0strat_BAYAREALIKE+a_checkBSMstrat_v1_phytools_rerootingMethod_allNodes_200BSMs.png

Figure Caption: Comparison of phytools rerootingMethod analytic ancestral state/range probability estimates (x-axis), and the probabilities by taking the mean of 200 stochastic maps (y-axis). BioGeoBEARS implementation of the BAYAREALIKE+a-d-e model on the Psychotria dataset, time-stratified with constant geography. Vertical lines are the 95% confidence intervals under the multinomial distribution (calculated with the multinomialCI function of the MultinomialCI package).

This graphic is the same as the previous one, because the same set of stochastic maps is being used for the y-axis, and it turns out that the phytools rerootingMethod ancestral state probabilities are nearly identical to the BioGeoBEARS ancestral state probabilities.

Comparison of phytools state probabilities and BGB stratified state probabilities

(link to this section)

We can also compare the log-likelihoods and parameter inferences between phytools rerootingMethod and BioGeoBEARS ancestral states:

#######################################################
# phytools ancestral states (ER model)
#######################################################

library(ape)
library(phytools)
library(BioGeoBEARS)

trfn = "tree.newick"

# Look at the raw Newick file:
moref(trfn)

# Look at your phylogeny:
tr = read.tree(trfn)
tr

tipnodenums = 1:length(tr$tip.label)
rootnodenum = length(tr$tip.label) + 1
intnodenums = (length(tr$tip.label)+1):(length(tr$tip.label) + tr$Nnode)
intnodenums
geogfn = "geog_data.txt"

# Look at the raw geography text file:
moref(geogfn)

# Look at your geographic range data:
tipranges = getranges_from_LagrangePHYLIP(lgdata_fn=geogfn)
tipranges

areanames = names(tipranges@df)
areanums = 1:length(areanames)
tipvals = rep(NA, nrow(tipranges@df))
for (i in 1:nrow(tipranges@df))
    {
    tipvals[i] = areanums[unlist(tipranges@df[i,])==1]
    }
names(tipvals) = row.names(tipranges@df)
tipvals
phytools_res = rerootingMethod(tree=tr, x=tipvals, model="ER")
phytools_res
phytools_ML = phytools_res$marginal.anc

#######################################################
# Store in a BioGeoBEARS-type format
#######################################################
resBAYAREALIKEcopy = resBAYAREALIKE
resBAYAREALIKEcopy$ML_marginal_prob_each_state_at_branch_top_AT_node[intnodenums, 2:5] = phytools_ML

#######################################################
# START VALIDATION CHECK
#######################################################
# Log-likelihood DOES match:
phytools_res$loglik
# -22.97551
resBAYAREALIKE$total_loglikelihood
# -22.97551

# The rate parameter DOES match
phytools_res$Q[1,2]
# 0.07572161
resBAYAREALIKE$outputs@params_table["a","est"]
# 0.07570332
#######################################################
# END VALIDATION CHECK
#######################################################

We can plot the analytic estimates from phytools and the BioGeoBEARS time-stratified analysis:

M0strat_BAYAREALIKE+a_checkMLstrat_v1_phytools_rerootingMethod_allNodes.png
Figure Caption: Comparison of analytic ancestral state/range probabilities under phytools rerootingMethod (x-axis) or a BioGeoBEARS time-stratified analysis with the BAYAREALIKE+a-d-e model (y-axis). Psychotria dataset, time-stratified with constant geography.

Clearly the correlation between the two algorithms is high (R2=1). I consider it "good enough." The probabilities are not quite identical for a few state/node combinations. This might be due to the slight differences in the ML parameter estimates between the two methods, and/or rounding and exponentiation errors (in particular, the BioGeoBEARS time-stratified analysis adds a bunch of extra exponentiations due to the fact that the phylogeny is sectioned into many pieces when the time-slicing is done).

Finally, as expected, if we do more stochastic maps, the mean state probabilities will (slowly) approach the analytic estimates.

MLstrat_vs_BSMstrat_5000_allNodes.png
Figure Caption: Psychotria dataset, time-stratified with constant geography. x-axis: analytic ancestral state/range probabilities of each state/range at each internal node. Y-axis: mean of 5000 stochastic maps for each state/range at each node.

BioGeoBEARS DEC and DEC+J as special cases of ClaSSE

(link to this section)

(this was originally posted in 2015 — I showed this graphic at Evolution 2015)

The BioGeoBEARS models can be shown to be special cases of Goldberg & Igic's ClaSSE model. ClaSSE is implemented in Rich Fitzjohn's diversitree package.

Here, I made a simple phylogeny and a 4-state geographic range space (2 areas, states are null, A, B, AB). I then set up a 4-state model for ClaSSE, and set ClaSSE parameters to be specified by values of d, e, and/or j (this requires writing some functions). Then, for various values of these parameters, I calculated the likelihood of the data in BioGeoBEARS and ClaSSE. To make the likelihoods comparable, the (state-independent) yule tree likelihood has to be added to the BioGeoBEARS likelihood, and the root-state likelihoods have to be removed from the ClaSSE likelihood. If this is done, it can be seen that there is very good agreement between the likelihoods produced by BioGeoBEARS DEC or DEC+J and ClaSSE:

(Note: This is NOT a demonstration that DEC, DEC+J etc. are "equivalent" to ClaSSE or GeoSSE! DEC is a special case of GeoSSE and ClaSSE, and GeoSSE, DEC, and DEC+J are all special cases of ClaSSE. "X is a special case of Y" means that by fixing the values of certain parameters in model X, you produce model Y. In the case below, if you take ClaSSE, fix extinction to 0 (eliminating the "E" in "SSE") and allow state-dependence in speciation only in the specific (equal per-event-weights) way that DEC or DEC+J specify the conditional probability of cladogenesis events (strongly constraining the "SS" part of "SSE"), then you get the same likelihoods.)

compare_many_LnLs_DEC_DECj_vs_claSSE_WORKS_ARCHIVE.png
Figure Caption: Log-likelihoods of a simple dataset (2 areas, 4 states/ranges) calculated under BioGeoBEARS (x-axis) and ClaSSE (y-axis). In the plot, blue "d" points indicate "DEC-e" models (d is positive, e=0, j=0). Black "e" points indicate traditional DEC models (d is positive, e is positive, j=0). Green "j" points represent the DEC+J model (all 3 parameters positive). A grid of possible d, e, and j parameter values was input into the BioGeoBEARS and ClaSSE likelihood calculators. These parameter value are not displayed, but the parameter combinations conferring the highest likelihood on the data, i.e. approaching the ML solution, are the points in the upper right of the plot.

The remaining small differences are due to the numeric integration steps in the ClaSSE ODE calculations. This is strong evidence that BioGeoBEARS and ClaSSE are doing the same math under these constraints, which is encouraging as they were programmed completely independently.

Why don't we just use ClaSSE for everything? As explained in Matzke (2013, Frontiers in Biogeography), as the state space increases at numstates=2areas, the number of parameters for the cladogenetic process in ClaSSE increases at something like (numstates3)/2. Even if many of these parameters were set to zero or equal to each other, I believe they still have to be represented in memory, and used in the numeric calculation. This probably means there are fairly strict limits to how many states can be used in a ClaSSE model, but the problem has not been thoroughly explored.

Note, 2017-06-24: the script for this graphic is here: attached script

DEC and DEC+J can be statistically compared

(link to this section)

Below is a brief response to the abstract of Ree and Sanmartín, scheduled at Evolution 2017. I have not seen the talk yet, of course, but the abstract contains sufficiently specific statements to offer a rebuttal.

I do, of course, welcome any and all criticisms of BioGeoBEARS models — I have offered a number of criticisms of these models myself! — but I do think they should be accurate. I hope there will be a more complete discussion of these issues in the literature, in the future.

I paste the abstract below, as it is impossible to give a direct link to abstracts on the meeting website:

https://www.xcdsystem.com/evolution/program/WVk3X5P/index.cfm?pgid=570&search=1&qtype=keyword&q=biogeography&submit=Go

Model selection in historical biogeography: why DEC and DEC+J likelihoods are not statistically comparable

Day: Saturday
Time: 4:15 PM - 4:29 PM
Location:_ A106
Back
Talk type:
Regular

Title:
Model selection in historical biogeography: why DEC and DEC+J likelihoods are not statistically comparable

Abstract:

The DEC+J model of geographic range evolution is a variant of DEC that allows "jump dispersal" or "founder event" speciation scenarios at internal phylogenetic nodes. It has become increasingly common for studies to conduct statistical model selection between DEC and DEC+J, even when jump dispersal is not integral to the hypotheses of interest. This is unfortunate because (a) DEC+J is a poor model for jump dispersal as a stochastic process, and (b) likelihoods of DEC and DEC+J are not statistically comparable. The primary purpose of modeling range evolution is to explain observed data via stochastic processes that generate biogeographic events with well-defined probabilities with respect to time. However, DEC and DEC+J do not describe distinct stochastic time-dependent processes. Instead, they differ in their ability to explain the data via cladogenetic events that have no time-dependent probabilities. Cladogenetic and anagenetic events of range evolution are not equivalent, and changing their relative contributions to the likelihood violates assumptions of model selection. Hypotheses of dispersal-driven speciation are much better served by models that treat cladogenetic events at phylogenetic nodes as stochastic outcomes of a lineage birth-death process, such as ClaSSE.

Keyword 1:
Biogeography

Keyword 2:
Phylogenetic comparative methods

Keyword 3:
Modeling

Talk Recorded:
Yes

Authors:
Richard Ree
Field Museum of Natural History
Isabel Sanmartín

As I explain in my talk (pdf available), DEC and DEC+J can be statistically compared because both models put valid conditional probabilities (likelihoods) on the same dataset (or, technically, they produce probabilities times (number of states-1), but this is a constant so it can be ignored in comparing models).

Conditional probabilities can be compared, as they represent the different fit of each model to the dataset.

I demonstrate the validity of these likelihoods in two ways:

1. Likelihoods for discrete data should add up to 1, when summed across all possible datasets (aka all possible data patterns). (If base frequencies are left out of the likelihood calculation, as is the case in Lagrange/BioGeoBEARS, it will be 1 times the number of possible ancestral states at the root.) I show that this is true for DEC and DEC+J in this script (zipfile).

2. DEC and DEC+J are special cases of the ClaSSE model. To show this, the DEC/DEC+J likelihood needs to be added to the likelihood of a Yule process on the input tree (again, this is an identical constant between the DEC and DEC+J models, so can be ignored in model comparisons) — and, the state frequencies at the root have to be set to be equal in ClaSSE, and then taken out of the likelihood (DEC and DEC+J assume equal state frequencies at the root, and, again, this is a constant that can be ignored when comparing the models).

I have demonstrated this in the attached script. (See also the ClaSSE section above, posted in 2015.) Likelihoods are not exactly identical between BioGeoBEARS and diversitree's ClaSSE — but ClaSSE is using a numerical approximation to calculate the likelihood, so this is expected. The correlation is extremely high (see graphic).

Now, there are numerous other issues that are worth discussing regarding DEC, DEC+J, and other biogeographical models. The biggest — as I have mentioned in Matzke (2014) and in many places on the BioGeoBEARS help pages — is that extinction is ignored in these models. They are effectively assuming a Yule process. This assumption is obviously wrong (Marshall 2017), but if we started ruling out methods that assumed or inferred a Yule process (many methods in phylogenetics underestimate extinction, even when they allow it), this area of science would have never gotten started.

In any case, a Yule process is a statistically valid model, even if it is a model that misses an important aspect of reality. The interesting question is this: what kind of data (or priors), and computationally feasible models, will it take to allow us to routinely infer the role of extinction in an not-too-biased way?

It would be great, of course, if we could snap our fingers and make a full ClaSSE-type model easy to use routinely for typical biogeographical problems. But the numerical calculations involved seem to get very slow for realistic problems. The best attempt I've seen at ClaSSE on a large state space is by Will Freyman and Sebastian Hoehna — some of the best programmers around — who implement a special case of a ClaSSE model to model polyploidy, in RevBayes. I believe their state space was of size 61 (51 chromosomes was the maximum observed, +10) — but this would still be a small biogeographical program (a 6-area problem would have 64 possible ranges in the state space).

However, improving our model realism, and our computational abilities, is what we are doing science for! I would just emphasize that we are not well served by trying to rule certain model comparisons invalid, if their internal math is correct. All models are wrong, but we can do a lot of good science by comparing imperfect models by their relative fit to the data — the idea that any model devised by biologists is going to be the "true" model is a phantom (Burnham and Anderson 2002). The whole point of AIC and related procedures is to quantify the relative degree of fit to an unknown reality, and try to devise models that capture key features of interest with high explanatory value — i.e., find the large effects (Burnham and Anderson 2002). I think jump dispersal/founder-event speciation is likely to be one of these, for many datasets (not all datasets, of course — several cases are known where non-J models are a better fit to a particular dataset).

Finally, as I have said many times, I view the DEC/DEC+J model comparison as the beginning of statistical model comparison in historical biogeography, not the end of it! In subsequent work, collaborators and I have explored the utility of using parameters for distance-dependent dispersal (the +x model variant, Van Dam and Matzke 2016, J. Biogeography), parameterizing the influence of manual-dispersal multipliers, rather than relying on subjectively-determined fixed multipliers (the +w model variant, Dupin, Matzke et al. 2016, J. Biogeography), trait-dependent dispersal (various talks and papers in submission), fossils and fossil detection models (my Evolution 2017 talk), etc.

Updates added post-talks

(link to this section)

Updates, post-talks: Nothing I saw in the talks or in post-talk discussions, so far at least, has rebutted the fundamental points about the likelihoods being valid probabilities summing to 1, or DEC and DEC+J being submodels of ClaSSE. Unless something changes on that score, it's hard to see how the claim that it is invalid to statistically compare the models can be sustained. Other kinds of discussions, e.g. whether there are better models out there, whether we should consider priors on models and how those should be justified, etc., are interesting but basically routine discussions in statistical model comparison.

Once we agree that statistical likelihood — Prob(data|model) — serves as the basis of statistical comparison of models against a common dataset (perhaps with a prior added), the rest follows.

See also

Advice on Statistical Model Comparison In BioGeoBEARS

References

Burnham, Kenneth P.; Anderson, David R. (1998, first edition; 2002, second edition). Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach. New York, Springer. doi: [10.1007/b97636 http://dx.doi.org/10.1007/b97636]
Link: http://www.springer.com/gb/book/9780387953649

Dupin, Julia; Matzke, Nicholas J.; Sarkinen, Tiina; Knapp, Sandra; Olmstead, Richard; Bohs, Lynn; Smith, Stacey (2016). Bayesian estimation of the global biogeographic history of the Solanaceae. Journal of Biogeography, 44(4), 887-899. http://dx.doi.org/10.1111/jbi.12898

Marshall, Charles R. (2017). Five palaeobiological laws needed to understand the evolution of the living biota. Nature Ecology & Evolution, 1, 0165. doi: 10.1038/s41559-017-0165

Matzke, Nicholas J. (2014). "Model Selection in Historical Biogeography Reveals that Founder-event Speciation is a Crucial Process in Island Clades." Systematic Biology, 63(6), 951–970. doi: 10.1093/sysbio/syu056

Van Dam, Matthew; Matzke, Nicholas J. (2016). Evaluating the influence of connectivity and distance on biogeographic patterns in the south-western deserts of North America. Journal of Biogeography. 43(8):1514–1532. http://dx.doi.org/10.1111/jbi.12727

Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License