Biogeobears Validation

Validation tests of BioGeoBEARS — Introduction

(link to this section)

Here I post some validation examples for BioGeoBEARS.

I have left out comparisons to Lagrange, which are documented in Matzke (2014, Systematic Biology), and which test:

  1. the downpass log-likelihood calculations under the DEC model,
  2. the parameter inferences for d and e, and
  3. the maximum likelihood (ML) optimizer.

Although under the DEC model, the ML log-likelihood and parameter estimates are identical, the ancestral states estimates are not directly comparable between Lagrange and BioGeoBEARS, because:

  • Lagrange estimated node state/range probabilities under "local optimization", i.e. for every possible ancestral range at every node, Lagrange runs a new ML optimization of the parameters while fixing that range at that node.
  • BioGeoBEARS estimates ancestral states/ranges under the globally optimum ML model.

Local and global ancestral state/range probabilities will typically be similar but not identical.

In this page I focus on:

  1. Comparison of BioGeoBEARS ancestral states to Liam Revell's phytools, for a case where a BioGeoBEARS model can be converted into the (simple) model used by phytools
  2. Comparison of BioGeoBEARS Biogeographical Stochastic Mapping (BSM) to Liam Revell's phytools stochastic mapping, for a case where a BioGeoBEARS model can be converted into the (simple) model used by phytools
  3. Comparison of BioGeoBEARS log-likelihoods to ClaSSE (in diversitree), showing how the ClaSSE model can be simplified into a model identical with DEC or DEC+J, etc.
  4. Comparison of BioGeoBEARS ancestral state probabilities in time-stratified analyses to BioGeoBEARS ancestral state probabilities in non-time-stratified analyses. The latter are much simpler to implement. By setting up a time-stratified analysis, but set up the constraints matrices so that there is no actual change in geography between time slices, we can test that time-stratified analyses retrieve the same log-likelihoods, ML inferences, ancestral state probabilities, and stochastic mapping results

Note: I use the terms "state", "range", and "geographic range" interchangeably throughout, because in the framework of discrete historical biogeography, each possible geographic range is a state in the state space. Transitions between states are controlled by the anagenetic and cladogenetic transition matrices, which themselves are determined by the model parameters of the BioGeoBEARS supermodel.

Abbreviations:

  • ML = maximum likelihood
  • BSM = Biogeographical Stochastic Map(ping)(s)

BioGeoBEARS DEC and DEC+J — ancestral state/range probabilities versus BSMs

(link to this section)

The input/output/script files for this section can be found in: Psychotria_M0_unconstrained_BSMs.zip

Here, we check that the ancestral state probabilities estimated under BioGeoBEARS DEC and DEC+J (non-stratified, unconstrained) are approximated by averaging many Biogeographical Stochastic Maps (BSMs).

DEC

(link to this section)

M0_unconstr_DEC_checkBSM_v1_MLunconstr_marginals_allNodes.png
Figure Caption: Ancestral state/range probabilities versus mean of stochastically mapped states. Unconstrained DEC model (non-stratified) on the Psychotria dataset. x-axis: analytic ancestral state/range probabilities of each state/range at each internal node. y-axis: mean of 200 stochastic maps for each state/range at each node. Vertical lines are the 95% confidence intervals under the multinomial distribution (calculated with the multinomialCI function of the MultinomialCI package).

DEC+J

(link to this section)

M0_unconstr_DEC%2BJ_checkBSM_v1_MLunconstr_marginals_allNodes.png
Figure Caption: Ancestral state/range probabilities versus mean of stochastically mapped states. Unconstrained DEC+J model (non-stratified) on the Psychotria dataset. x-axis: analytic ancestral state/range probabilities of each state/range at each internal node. y-axis: mean of 200 stochastic maps for each state/range at each node. Vertical lines are the 95% confidence intervals under the multinomial distribution (calculated with the multinomialCI function of the MultinomialCI package).

The BSMs line up well with the analytic probabilities for both DEC and DEC+J. This is also the case for time-stratified analyses and other models, although I am not posting these for the moment due to the lateness of the hour.

BioGeoBEARS stratified analyses — ancestral state/range probabilities versus BSMs

(link to this section)

The input/output/script files for this section can be found in: Psychotria_M0_stratified_BSMs.zip (6 subdirectories, one for each model)

Here, for the 6 standard [[BioGeoBEARS]] models, analyses that are time-stratified (but with constant geography) are run, to check that the ancestral state probabilities match the mean of 200 Biogeographic Stochastic Maps (BSMs). This is a double-check on the ancestral state probabilities and the BSM algorithms in a time-stratified analysis.

Time-stratified DEC and DEC+J

(link to this section)

M0strat_DEC_checkBSM_v1_MLstrat_marginals_allNodes.png
M0strat_DEC%2BJ_checkBSM_v1_MLstrat_marginals_allNodes.png
Figure Caption: Ancestral state/range probabilities versus mean of stochastically mapped states. Time-stratified DEC model (but unconstrained, constant geography) on the Psychotria dataset. x-axis: analytic ancestral state/range probabilities of each state/range at each internal node. y-axis: mean of 200 stochastic maps for each state/range at each node. Vertical lines are the 95% confidence intervals under the multinomial distribution (calculated with the multinomialCI function of the MultinomialCI package). Figure Caption: Ancestral state/range probabilities versus mean of stochastically mapped states. Time-stratified DEC+J model (but unconstrained, constant geography) on the Psychotria dataset. x-axis: analytic ancestral state/range probabilities of each state/range at each internal node. y-axis: mean of 200 stochastic maps for each state/range at each node. Vertical lines are the 95% confidence intervals under the multinomial distribution (calculated with the multinomialCI function of the MultinomialCI package).

Time-stratified DIVALIKE and DIVALIKE+J

(link to this section)

M0strat_DIVALIKE_checkBSM_v1_MLstrat_marginals_allNodes.png
M0strat_DIVALIKE%2BJ_checkBSM_v1_MLstrat_marginals_allNodes.png
Figure Caption: Ancestral state/range probabilities versus mean of stochastically mapped states. Time-stratified DIVALIKE model (but unconstrained, constant geography) on the Psychotria dataset. x-axis: analytic ancestral state/range probabilities of each state/range at each internal node. y-axis: mean of 200 stochastic maps for each state/range at each node. Vertical lines are the 95% confidence intervals under the multinomial distribution (calculated with the multinomialCI function of the MultinomialCI package). Figure Caption: Ancestral state/range probabilities versus mean of stochastically mapped states. Time-stratified DIVALIKE+J model (but unconstrained, constant geography) on the Psychotria dataset. x-axis: analytic ancestral state/range probabilities of each state/range at each internal node. y-axis: mean of 200 stochastic maps for each state/range at each node. Vertical lines are the 95% confidence intervals under the multinomial distribution (calculated with the multinomialCI function of the MultinomialCI package).

Time-stratified BAYAREALIKE and BAYAREALIKE+J

(link to this section)

M0strat_BAYAREALIKE_checkBSM_v1_MLstrat_marginals_allNodes.png
M0strat_BAYAREALIKE%2BJ_checkBSM_v1_MLstrat_marginals_allNodes.png
Figure Caption: Ancestral state/range probabilities versus mean of stochastically mapped states. Time-stratified BAYAREALIKE model (but unconstrained, constant geography) on the Psychotria dataset. x-axis: analytic ancestral state/range probabilities of each state/range at each internal node. y-axis: mean of 200 stochastic maps for each state/range at each node. Vertical lines are the 95% confidence intervals under the multinomial distribution (calculated with the multinomialCI function of the MultinomialCI package). Figure Caption: Ancestral state/range probabilities versus mean of stochastically mapped states. Time-stratified BAYAREALIKE+J model (but unconstrained, constant geography) on the Psychotria dataset. x-axis: analytic ancestral state/range probabilities of each state/range at each internal node. y-axis: mean of 200 stochastic maps for each state/range at each node. Vertical lines are the 95% confidence intervals under the multinomial distribution (calculated with the multinomialCI function of the MultinomialCI package).

BioGeoBEARS stratified vs. non-stratified analyses (ancestral state/range probabilities)

(link to this section)

The input/output/script files for this section can be found in: Psychotria_M0_unconstrained.zip and Psychotria_M0_stratified.zip

Programming a time-stratified analysis is much more complicated than a non-stratified analysis: imagine taking a timetree, slicing it into pieces wherever it crosses a stratum border, and then keeping track of all of the pieces as probabilities are passed around during the downpass and uppass. Then imagine a changing state space, changing constraints, etc. This is the kind of thing where a real object-oriented programming language with a real phylogeny structure based on node objects would have a major advantage over R (In R, I have to construct a big table and continually query it to keep track of all of the tree pieces and what time slice the calculations are taking place in.) It gets even more fun if there are fossil branches with tips that do not reach up to the present (0 million years ago).

To check that this has all been done correctly, we can compare non-stratified BioGeoBEARS analysis results to time-stratified analyses where each time-period has the same constraints as the recent. Here, I used the standard Hawaii island ages to do the time-stratification, but in each time period the manual dispersal matrix was all 1s.

ML, LnL, parameters

(link to this section)

Here are the parameter inferences for the 6 standard models, in the time-stratified analysis:

                    LnL numparams            d            e         j
DEC           -34.54313         2 3.501898e-02 2.840040e-02 0.0000000
DEC+J         -20.94759         3 1.000000e-12 1.000000e-12 0.1142670
DIVALIKE      -33.15172         2 4.472860e-02 1.000000e-12 0.0000000
DIVALIKE+J    -21.08621         3 1.000000e-12 1.000000e-12 0.1157244
BAYAREALIKE   -40.33461         2 1.863147e-02 3.060137e-01 0.0000000
BAYAREALIKE+J -21.55265         3 1.000000e-07 1.000000e-07 0.1081164

Here are the inferences in the non-stratified analysis (specifically, the analysis in the standard example script):

                    LnL numparams            d            e         j
DEC           -34.54196         2 3.504546e-02 2.835632e-02 0.0000000
DEC+J         -20.94759         3 1.000000e-12 1.000000e-12 0.1142757
DIVALIKE      -33.14968         2 4.474416e-02 1.000000e-12 0.0000000
DIVALIKE+J    -21.08621         3 1.000000e-12 1.000000e-12 0.1156998
BAYAREALIKE   -40.33704         2 1.738085e-02 3.040188e-01 0.0000000
BAYAREALIKE+J -21.55265         3 1.000000e-07 1.000000e-07 0.1081169

Pretty good. This just tests the downpass calculations and the ML optimizer on stratified vs. non-stratified analyses. Let's also compare the ancestral state probabilities under stratified and non-stratified analyses:

DEC

(link to this section)

Left: Non-stratified DEC M0 unconstrained; Right: Stratified DEC M0 unconstrained

Psychotria_DEC_M0_unconstrained_states.png
Psychotria_DEC_M0_stratified_states.png
Psychotria_DEC_M0_unconstrained_pies.png
Psychotria_DEC_M0_stratified_pies.png

DEC+J

(link to this section)

Left: Non-stratified DEC+J M0 unconstrained; Right: Stratified DEC+J M0 unconstrained

Psychotria_DEC+J_M0_unconstrained_states.png
Psychotria_DEC+J_M0_stratified_states.png
Psychotria_DEC+J_M0_unconstrained_pies.png
Psychotria_DEC+J_M0_stratified_pies.png

DIVALIKE

(link to this section)

Left: Non-stratified DIVALIKE M0 unconstrained; Right: Stratified DIVALIKE M0 unconstrained

Psychotria_DIVALIKE_M0_unconstrained_states.png
Psychotria_DIVALIKE_M0_stratified_states.png
Psychotria_DIVALIKE_M0_unconstrained_pies.png
Psychotria_DIVALIKE_M0_stratified_pies.png

DIVALIKE+J

(link to this section)

Left: Non-stratified DIVALIKE+J M0 unconstrained; Right: Stratified DIVALIKE+J M0 unconstrained

Psychotria_DIVALIKE+J_M0_unconstrained_states.png
Psychotria_DIVALIKE+J_M0_stratified_states.png
Psychotria_DIVALIKE+J_M0_unconstrained_pies.png
Psychotria_DIVALIKE+J_M0_stratified_pies.png

BAYAREALIKE

(link to this section)

Left: Non-stratified BAYAREALIKE M0 unconstrained; Right: Stratified BAYAREALIKE M0 unconstrained

Psychotria_BAYAREALIKE_M0_unconstrained_states.png
Psychotria_BAYAREALIKE_M0_stratified_states.png
Psychotria_BAYAREALIKE_M0_unconstrained_pies.png
Psychotria_BAYAREALIKE_M0_stratified_pies.png

BAYAREALIKE+J

(link to this section)

Left: Non-stratified BAYAREALIKE+J M0 unconstrained; Right: Stratified BAYAREALIKE+J M0 unconstrained

Psychotria_BAYAREALIKE+J_M0_unconstrained_states.png
Psychotria_BAYAREALIKE+J_M0_stratified_states.png
Psychotria_BAYAREALIKE+J_M0_unconstrained_pies.png
Psychotria_BAYAREALIKE+J_M0_stratified_pies.png

Phytools rerootingMethod vs. BioGeoBEARS BAYAREALIKE+a-d-e model (non-stratified BSMs and ancestral state/range probabilities)

(link to this section)

The input/output/script files for this section can be found in: Psychotria_M0_unconstrained_BAYAREALIKE+a.zip

To facilitate comparison with the ancestral states estimation and stochastic mapping algorithms in phytools, SIMMAP, etc., we need to create a "standard unordered character model with equal rates" in the BioGeoBEARS model object. This is simply done with the following code. The model ends up being the BAYAREALIKE model, with parameters d and e fixed to 0, and a (the "anagenetic range-switching" parameter) set to be freely estimated.

Specification of BAYAREALIKE+a-d-e model

(link to this section)

Specification of the BAYAREALIKE+a-d-e model is done as follows:

# Set up BAYAREALIKE model
# No subset sympatry
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["s","type"] = "fixed"
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["s","init"] = 0.0
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["s","est"] = 0.0

# No vicariance
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["v","type"] = "fixed"
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["v","init"] = 0.0
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["v","est"] = 0.0

# No jump dispersal/founder-event speciation
# BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["j","type"] = "free"
# BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["j","init"] = 0.01
# BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["j","est"] = 0.01

# Adjust linkage between parameters
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["ysv","type"] = "1-j"
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["ys","type"] = "ysv*1/1"
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["y","type"] = "1-j"

# Only sympatric/range-copying (y) events allowed, and with 
# exact copying (both descendants always the same size as the ancestor)
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["mx01y","type"] = "fixed"
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["mx01y","init"] = 0.9999
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["mx01y","est"] = 0.9999

# Turn off d and e; turn on a
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["d","type"] = "fixed"
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["d","init"] = 0.0
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["d","est"] = 0.0

BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["e","type"] = "fixed"
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["e","init"] = 0.0
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["e","est"] = 0.0

BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["a","type"] = "free"
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["a","init"] = 0.03
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["a","est"] = 0.03

The model is thus really "BAYAREALIKE+a-d-e", i.e. "the BAYAREALIKE model, plus a, minus d, minus e". Alternatively we could call it the "instantaneous area switching model", the "anagenetic island model", a Markov-k (Mk model), the BEAST discrete phylogeography model with equal rates, the Jukes-Cantor DNA model, or an unordered character model — these are all equivalent. This is, by the way, the model that is being assumed when biogeography is treated as a standard unordered character. If the rates of different range-switches were different free parameters, this would essentially be a DNA GTR model, as in Sanmartin's island-switching paper.

Comparison of non-stratified BGB state probabilities and non-stratified BGB BSMs

(link to this section)

M0_unconstr_BAYAREALIKE%2Ba_checkBSM_v1_MLunconstr_marginals_allNodes.png
Figure Caption: Comparison of BioGeoBEARS analytic ancestral state/range probability estimates (x-axis), and the probabilities by taking the mean of 200 stochastic maps (y-axis). BioGeoBEARS implementation of the BAYAREALIKE+a-d-e model on the Psychotria dataset, non-stratified. Vertical lines are the 95% confidence intervals under the multinomial distribution (calculated with the multinomialCI function of the MultinomialCI package).

Comparison of phytools state probabilities and non-stratified BGB BSMs

(link to this section)

We can also compare the average of many stochastic maps to the analytic ancestral state probability estimates as estimated by the phytools function rerootingMethod.

M0_unconstr_BAYAREALIKE%2Ba_checkBSMunconstr_v1_phytools_rerootingMethod_allNodes.png

Figure Caption: Comparison of phytools rerootingMethod analytic ancestral state/range probability estimates (x-axis), and the probabilities by taking the mean of 200 stochastic maps (y-axis). BioGeoBEARS implementation of the BAYAREALIKE+a-d-e model on the Psychotria dataset, non-stratified. Vertical lines are the 95% confidence intervals under the multinomial distribution (calculated with the multinomialCI function of the MultinomialCI package).

This graphic is the same as the previous one, because the same set of stochastic maps is being used for the y-axis, and it turns out that the phytools rerootingMethod ancestral state probabilities are nearly identical to the BioGeoBEARS ancestral state probabilities.

Comparison of phytools state probabilities and BGB non-stratified state probabilities

(link to this section)

We can also compare the log-likelihoods and parameter inferences between phytools rerootingMethod and BioGeoBEARS ancestral states:

#######################################################
# phytools ancestral states (ER model)
#######################################################

library(ape)
library(phytools)
library(BioGeoBEARS)

trfn = "tree.newick"

# Look at the raw Newick file:
moref(trfn)

# Look at your phylogeny:
tr = read.tree(trfn)
tr

tipnodenums = 1:length(tr$tip.label)
rootnodenum = length(tr$tip.label) + 1
intnodenums = (length(tr$tip.label)+1):(length(tr$tip.label) + tr$Nnode)
intnodenums
geogfn = "geog_data.txt"

# Look at the raw geography text file:
moref(geogfn)

# Look at your geographic range data:
tipranges = getranges_from_LagrangePHYLIP(lgdata_fn=geogfn)
tipranges

areanames = names(tipranges@df)
areanums = 1:length(areanames)
tipvals = rep(NA, nrow(tipranges@df))
for (i in 1:nrow(tipranges@df))
    {
    tipvals[i] = areanums[unlist(tipranges@df[i,])==1]
    }
names(tipvals) = row.names(tipranges@df)
tipvals
phytools_res = rerootingMethod(tree=tr, x=tipvals, model="ER")
phytools_res
phytools_ML = phytools_res$marginal.anc

#######################################################
# Store in a BioGeoBEARS-type format
#######################################################
resBAYAREALIKEcopy = resBAYAREALIKE
resBAYAREALIKEcopy$ML_marginal_prob_each_state_at_branch_top_AT_node[intnodenums, 2:5] = phytools_ML

#######################################################
# START VALIDATION CHECK
#######################################################
# Log-likelihood DOES match:
phytools_res$loglik
# -22.97566
resBAYAREALIKE$total_loglikelihood
# -22.97566

# The rate parameter DOES match
phytools_res$Q[1,2]
# 0.07572288
resBAYAREALIKE$outputs@params_table["a","est"]
# 0.07575708
#######################################################
# END VALIDATION CHECK
#######################################################

We can plot the analytic estimates from phytools and the BioGeoBEARS non-stratified analysis:

M0_unconstr_BAYAREALIKE%2Ba_checkMLunconstr_v1_phytools_rerootingMethod_allNodes.png
Figure Caption: Comparison of analytic ancestral state/range probabilities under phytools rerootingMethod (x-axis) or a BioGeoBEARS non-stratified analysis with the BAYAREALIKE+a-d-e model (y-axis). Psychotria dataset, non-stratified.

The correlation between the two algorithms is high (R2=1), basically identical.

Finally, as expected, if we do more stochastic maps, the mean state probabilities will (slowly) approach the analytic estimates. Here, I do this for the DEC model.

M0_DEC_checkBSM_v1_ML_marginals_allNodes_10000_BSMs.png
Figure Caption: Psychotria dataset, non-stratified, DEC model. x-axis: analytic ancestral state/range probabilities of each state/range at each internal node. Y-axis: mean of 10,000 stochastic maps for each state/range at each node. Vertical lines are the 95% confidence intervals under the multinomial distribution (calculated with the multinomialCI function of the MultinomialCI package).

Phytools rerootingMethod vs. BioGeoBEARS BAYAREALIKE+a-d-e model (time-stratified BSMs and ancestral state/range probabilities)

(link to this section)

The input/output/script files for this section can be found in: Psychotria_M0_stratified_BAYAREALIKE+a.zip

To facilitate comparison with the ancestral states estimation and stochastic mapping algorithms in phytools, SIMMAP, etc., we need to create a "standard unordered character model with equal rates" in the BioGeoBEARS model object. This is simply done with the following code. The model ends up being the BAYAREALIKE model, with parameters d and e fixed to 0, and a (the "anagenetic range-switching" parameter) set to be freely estimated.

Specification of BAYAREALIKE+a-d-e model

(link to this section)

Specification of the BAYAREALIKE+a-d-e model is done as follows:

# Set up BAYAREALIKE model
# No subset sympatry
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["s","type"] = "fixed"
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["s","init"] = 0.0
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["s","est"] = 0.0

# No vicariance
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["v","type"] = "fixed"
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["v","init"] = 0.0
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["v","est"] = 0.0

# No jump dispersal/founder-event speciation
# BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["j","type"] = "free"
# BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["j","init"] = 0.01
# BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["j","est"] = 0.01

# Adjust linkage between parameters
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["ysv","type"] = "1-j"
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["ys","type"] = "ysv*1/1"
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["y","type"] = "1-j"

# Only sympatric/range-copying (y) events allowed, and with 
# exact copying (both descendants always the same size as the ancestor)
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["mx01y","type"] = "fixed"
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["mx01y","init"] = 0.9999
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["mx01y","est"] = 0.9999

# Turn off d and e; turn on a
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["d","type"] = "fixed"
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["d","init"] = 0.0
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["d","est"] = 0.0

BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["e","type"] = "fixed"
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["e","init"] = 0.0
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["e","est"] = 0.0

BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["a","type"] = "free"
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["a","init"] = 0.03
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["a","est"] = 0.03

The model is thus really "BAYAREALIKE+a-d-e", i.e. "the BAYAREALIKE model, plus a, minus d, minus e". Alternatively we could call it the "instantaneous area switching model", the "anagenetic island model", a Markov-k (Mk model), the BEAST discrete phylogeography model with equal rates, the Jukes-Cantor DNA model, or an unordered character model — these are all equivalent. This is, by the way, the model that is being assumed when biogeography is treated as a standard unordered character. If the rates of different range-switches were different free parameters, this would essentially be a DNA GTR model, as in Sanmartin's island-switching paper.

Comparison of stratified BGB state probabilities and stratified BGB BSMs

(link to this section)

M0strat_BAYAREALIKE+a_checkBSM_v1_MLstrat_marginals_allNodes_200BSMs.png
Figure Caption: Comparison of BioGeoBEARS analytic ancestral state/range probability estimates (x-axis), and the probabilities by taking the mean of 200 stochastic maps (y-axis). BioGeoBEARS implementation of the BAYAREALIKE+a-d-e model on the Psychotria dataset, time-stratified with constant geography. Vertical lines are the 95% confidence intervals under the multinomial distribution (calculated with the multinomialCI function of the MultinomialCI package).

Comparison of phytools state probabilities and stratified BGB BSMs

(link to this section)

We can also compare the average of many stochastic maps to the analytic ancestral state probability estimates as estimated by the phytools function rerootingMethod.

M0strat_BAYAREALIKE+a_checkBSMstrat_v1_phytools_rerootingMethod_allNodes_200BSMs.png

Figure Caption: Comparison of phytools rerootingMethod analytic ancestral state/range probability estimates (x-axis), and the probabilities by taking the mean of 200 stochastic maps (y-axis). BioGeoBEARS implementation of the BAYAREALIKE+a-d-e model on the Psychotria dataset, time-stratified with constant geography. Vertical lines are the 95% confidence intervals under the multinomial distribution (calculated with the multinomialCI function of the MultinomialCI package).

This graphic is the same as the previous one, because the same set of stochastic maps is being used for the y-axis, and it turns out that the phytools rerootingMethod ancestral state probabilities are nearly identical to the BioGeoBEARS ancestral state probabilities.

Comparison of phytools state probabilities and BGB stratified state probabilities

(link to this section)

We can also compare the log-likelihoods and parameter inferences between phytools rerootingMethod and BioGeoBEARS ancestral states:

#######################################################
# phytools ancestral states (ER model)
#######################################################

library(ape)
library(phytools)
library(BioGeoBEARS)

trfn = "tree.newick"

# Look at the raw Newick file:
moref(trfn)

# Look at your phylogeny:
tr = read.tree(trfn)
tr

tipnodenums = 1:length(tr$tip.label)
rootnodenum = length(tr$tip.label) + 1
intnodenums = (length(tr$tip.label)+1):(length(tr$tip.label) + tr$Nnode)
intnodenums
geogfn = "geog_data.txt"

# Look at the raw geography text file:
moref(geogfn)

# Look at your geographic range data:
tipranges = getranges_from_LagrangePHYLIP(lgdata_fn=geogfn)
tipranges

areanames = names(tipranges@df)
areanums = 1:length(areanames)
tipvals = rep(NA, nrow(tipranges@df))
for (i in 1:nrow(tipranges@df))
    {
    tipvals[i] = areanums[unlist(tipranges@df[i,])==1]
    }
names(tipvals) = row.names(tipranges@df)
tipvals
phytools_res = rerootingMethod(tree=tr, x=tipvals, model="ER")
phytools_res
phytools_ML = phytools_res$marginal.anc

#######################################################
# Store in a BioGeoBEARS-type format
#######################################################
resBAYAREALIKEcopy = resBAYAREALIKE
resBAYAREALIKEcopy$ML_marginal_prob_each_state_at_branch_top_AT_node[intnodenums, 2:5] = phytools_ML

#######################################################
# START VALIDATION CHECK
#######################################################
# Log-likelihood DOES match:
phytools_res$loglik
# -22.97551
resBAYAREALIKE$total_loglikelihood
# -22.97551

# The rate parameter DOES match
phytools_res$Q[1,2]
# 0.07572161
resBAYAREALIKE$outputs@params_table["a","est"]
# 0.07570332
#######################################################
# END VALIDATION CHECK
#######################################################

We can plot the analytic estimates from phytools and the BioGeoBEARS time-stratified analysis:

M0strat_BAYAREALIKE+a_checkMLstrat_v1_phytools_rerootingMethod_allNodes.png
Figure Caption: Comparison of analytic ancestral state/range probabilities under phytools rerootingMethod (x-axis) or a BioGeoBEARS time-stratified analysis with the BAYAREALIKE+a-d-e model (y-axis). Psychotria dataset, time-stratified with constant geography.

Clearly the correlation between the two algorithms is high (R2=1). I consider it "good enough." The probabilities are not quite identical for a few state/node combinations. This might be due to the slight differences in the ML parameter estimates between the two methods, and/or rounding and exponentiation errors (in particular, the BioGeoBEARS time-stratified analysis adds a bunch of extra exponentiations due to the fact that the phylogeny is sectioned into many pieces when the time-slicing is done).

Finally, as expected, if we do more stochastic maps, the mean state probabilities will (slowly) approach the analytic estimates.

MLstrat_vs_BSMstrat_5000_allNodes.png
Figure Caption: Psychotria dataset, time-stratified with constant geography. x-axis: analytic ancestral state/range probabilities of each state/range at each internal node. Y-axis: mean of 5000 stochastic maps for each state/range at each node.

BioGeoBEARS DEC and DEC+J as special cases of ClaSSE

(link to this section)

The BioGeoBEARS models can be shown to be special cases of Goldberg & Igic's ClaSSE model. ClaSSE is implemented in Rich Fitzjohn's diversitree package.

Here, I made a simple phylogeny and a 4-state geographic range space (2 areas, states are null, A, B, AB). I then set up a 4-state model for ClaSSE, and set ClaSSE parameters to be specified by values of d, e, and/or j (this requires writing some functions). Then, for various values of these parameters, I calculated the likelihood of the data in BioGeoBEARS and ClaSSE. To make the likelihoods comparable, the (state-independent) yule tree likelihood has to be added to the BioGeoBEARS likelihood, and the root-state likelihoods have to be removed from the ClaSSE likelihood. If this is done, it can be seen that there is very good agreement between the likelihoods produced by BioGeoBEARS DEC or DEC+J and ClaSSE:

(Note: This is NOT a demonstration that DEC, DEC+J etc. are "equivalent" to ClaSSE or GeoSSE! DEC is a special case of GeoSSE and ClaSSE, and GeoSSE, DEC, and DEC+J are all special cases of ClaSSE. "X is a special case of Y" means that by fixing the values of certain parameters in model X, you produce model Y. In the case below, if you take ClaSSE, fix extinction to 0 (eliminating the "E" in "SSE") and allow state-dependence in speciation only in the specific (equal per-event-weights) way that DEC or DEC+J specify the conditional probability of cladogenesis events (strongly constraining the "SS" part of "SSE"), then you get the same likelihoods.)

compare_many_LnLs_DEC_DECj_vs_claSSE_WORKS_ARCHIVE.png
Figure Caption: Log-likelihoods of a simple dataset (2 areas, 4 states/ranges) calculated under BioGeoBEARS (x-axis) and ClaSSE (y-axis). In the plot, blue "d" points indicate "DEC-e" models (d is positive, e=0, j=0). Black "e" points indicate traditional DEC models (d is positive, e is positive, j=0). Green "j" points represent the DEC+J model (all 3 parameters positive). A grid of possible d, e, and j parameter values was input into the BioGeoBEARS and ClaSSE likelihood calculators. These parameter value are not displayed, but the parameter combinations conferring the highest likelihood on the data, i.e. approaching the ML solution, are the points in the upper right of the plot.

The remaining small differences are due to the numeric integration steps in the ClaSSE ODE calculations. This is strong evidence that BioGeoBEARS and ClaSSE are doing the same math under these constraints, which is encouraging as they were programmed completely independently.

Why don't we just use ClaSSE for everything? As explained in Matzke (2013, Frontiers in Biogeography), as the state space increases at numstates=2areas, the number of parameters for the cladogenetic process in ClaSSE increases at something like (numstates3)/2. Even if many of these parameters were set to zero or equal to each other, I believe they still have to be represented in memory, and used in the numeric calculation. This probably means there are fairly strict limits to how many states can be used in a ClaSSE model, but the problem has not been thoroughly explored.

Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License