Biogeobears Mistakes To Avoid

Introduction

(link to this section)

I have found that researchers tend to get excited once they get BioGeoBEARS installed and running.

The inputs are simple — a Newick tree file, a geography text file, and perhaps other text files. Often people have some initial difficulties due to errors in their input files or R script, but once those are resolved, a broad field of hypothesis-testing and ancestral range estimation opens up!

However, it is always important that researchers remember to think carefully about what they are doing with their input data, and what assumptions they are implicitly making, just by the choices of what sorts of phylogenies they use in the analysis.

Similarly, certain terms and thought patterns that are common in the history of phylogenetics and historical biogeography can be problematic if applied uncritically to likelihood/Bayesian analyses of biogeography. Often these are derived either from (a) intuitions about parsimony analysis, or (b) intuitions based on continuous-time models (e.g. for DNA or amino acids).

Below I highlight problems that I have occasionally seen in peoples' draft (or occasionally published!) BioGeoBEARS data and analyses. Please note that I am happy to look over draft manuscripts if you would like a second look before peer-review — email me at gro.soibmin|ekztam#gro.soibmin|ekztam (although please read the below first!).

This page is dedicated to concerns about implicit assumptions, concepts, terminology, and interpretation — for help with straight-up technical problems (e.g. your your input files won't load because you haven't specified the correct working directory), crashes due to underflow problems, etc., your best resource is searching/asking the BioGeoBEARS google group.

Don't confuse "ranges" and "areas"

(link to this section)

A ubiquitous confusion in discussions of historical biogeography involves the words "ranges" and "areas". Through conditioning, tradition, or imprecision, many people use these words rather randomly and interchangeably, leading to no end of confusion. The common term "ancestral area reconstruction" helps promote this confusion (on the critique of this term, see "Ancestral range estimation, not ancestral area reconstruction", below).

To clear it up:

  • A "range" is a geographic range. A geographic range can be made up of one or more areas.
    • (Actually, a range can be made of zero or more areas, because in DEC-type models the "null range" — a range of 0 areas — is a state in the transition matrix of the model.)
  • If you have areas A and B, then the possible ranges are A, B, and AB (and null). Thus, two areas makes for 4 possible ranges. 3 areas makes 8 possible ranges, etc.
  • If your analysis suggests that, say, ABC is the most probable ancestral range, it is pointless to speculate about "the" ancestral "area". The analysis is telling you the ancestral range wasn't a single area, it was an ancestral range covering 3 areas. Obviously, this model could be incorrect, but that is a different discussion.
  • Now, often enough (especially in island systems), the most probable ancestral range might in fact be a range of a single area. If this happens, great, but it does not mean the terms "area" and "range" should be mixed up in general.

Finally, everyone should understand the terms "state" and "state space". Biogeography models are particular cases of the generic class of models that we can call "discrete Markov models." Discrete Markov models assume that data can take a variety of discrete "states", and that these states interconvert stochastically with some probability (Markov process = stochastic process).

In a binary character model, there are 2 states (typically coded 0 and 1), and thus the "state space" is of size 2.

In a DNA model, there are 4 states (A, C, G, T), and thus the "state space" is of size 4.

In a discrete biogeography model, the "states" are the geographic ranges. Because the number of possible ranges doubles every time an area is added, the number of states in the model equals 2^(number of areas).

  • For 2 areas (A and B), there are 4 states/ranges (null, A, B, and AB).
  • For 3 areas (A, B, C), there are 8 states/ranges (null, A, B, C, AB, AC, BC, ABC)

Main point #1: states = ranges. I use these terms interchangeably.
Main point #2: Any area is also a range/state, but there are many ranges/states that are not single areas.

For more, see: Ancestral range estimation, not ancestral area reconstruction, and the BioGeoBEARS function numstates_from_numareas.

Use dated phylogenies

(link to this section)

The phylogenies used in BioGeoBEARS should generally be dated. They do not have to be ultrametric, because they might include fossil species/populations (see Fossil data in biogeographical analysis in BioGeoBEARS). (The phylogenies do have to be dichotomous; if you have a part of the phylogeny that is highly unresolved, you should run several possible trees.)

If the user inputs an undated tree (with branch lengths in terms of expected amount of molecular change, for instance), BioGeoBEARS will accept it, and likelihood calculations will be performed, and ancestral states estimated, under these assumptions, without throwing an error. HOWEVER, what is happening here is that BioGeoBEARS is assuming that the highest tip in the rooted tree (as read by APE when APE reads in the Newick file) is at time 0 (the present), and all lower tips in the phylogeny are interpreted as fossils, dated according to the branchlengths below the highest tip at time 0. This makes no difference to the likelihood calculations in an unconstrained, non-time-stratified analysis, so the analysis will run, but it is unclear what the point would be. The rates that you get out of such an analysis (for d, e, or a, the 3 anagenetic parameters) will not be rates per million years, they will be rates per unit branchlength. If you have some reason to think that this is more useful than rates in terms of time, you are free to do the analysis that way.

Of course, if you are doing time-stratified constraints, etc., you will have pathological results if you use an undated phylogeny.

Watch out for dated trees that are too short

(link to this section)

Recently, I have seen several users who have timetrees (dated phylogenies, i.e. chronograms) that are "very short." I find this very strange — first, that some program is producing these trees, and then it is also strange that people are trying to use these trees without noticing the weirdness at any point.

Background: A "very short" tree is one where all of the branches are very short — say, less than 0.1. Often, when all of the branch lengths are added together, the total length of all of the branches is less than 1, and the "height" of the tree (the distance from the root to the highest tips) is ~0.1. Unless you have a phylogeny of viruses or something else that evolves very quickly (Lake Victoria cichlids perhaps?), these are implausibly short branch lengths.

I am not sure where people are getting these dated, but very short, trees. I suspect that the culprit is that people are doing BEAST dating analyses, but with no actual dating information in terms of node priors or clock rates. THIS IS HAZARDOUS AND WILL RESULT IN TREES THAT ARE NOT PUBLISHABLE, IF THE PEER-REVIEWERS ARE PAYING ATTENTION. This procedure may not even give you correct *relative* dates, because when the tree model prior (the birth-rates, death rates, etc.) are dominating the analysis, weird things can happen.

If you have no good external dating information, but you want to estimate relative dates anyway (which is legitimate, if you state this clearly in your writeup), then in the BEAST setup, you should just put a tight normal prior on the date of the root, and give it a mean of, say, 1 or 10 or some such. Then you just recognize that all of your dates are relative to that prior constraint, and interpret your results accordingly.

Anyway, super-short trees may cause several problems for BioGeoBEARS users:

  1. Problems hitting the min/max limits on the parameter search. In the default BioGeoBEARS setup (see the example script), the default settings for the limits on rate parameters (the "min" and "max" columns of the parameters table, stored in BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table) are set up to make sense for a typical dated phylogeny, i.e., one which spans millions of years. The rate parameters (namely, parameters d, e, and/or a) are in units of events per million years. If your tree is only 0.001 million years long, then the rate parameter estimates may hit their max limit and still never reach their Maximum Likelihood (ML) values. This will mean that you don't estimate the ML, which means any model comparisons etc. are invalid. Crude fix: change the min and max so they make sense for your tree. Real fix: get more realistic branch lengths/ages on your tree.
  2. Ultra-short branches being mis-interpreted as direct ancestors. Another problem is that ultra-short branches might get misinterpretated as "hooks", i.e. direct ancestors, when you didn't mean this (see the rest of this page). Crude fix: Manually edit just those branches so they are longer than 0.0000001. Real fix: get more realistic branch lengths/ages on your tree.
  3. Machine precision and underflow or other calculation errors producing NaNs and crashes. Yet another problem is that machine precision can become dicey if the ML search function (optim or optimx) tries a low rate value on a super-short branch. This might calculate a probability that is so low that it goes below machine precision and results in a negative probability or some other problem that causes a NaN error and crashes the run. (NOTE: I have not really replicated this error, I can never seem to produce it on my brand-new, 64-bit Mac pro. But on an older machine, 32-bit, Windows, etc., it seems to sometimes be an issue.) Crude fix: Raise the "min" values for the parameters so that the optim/optimx function is less likely to try a super-low rate value. Real fix: get more realistic branch lengths/ages on your tree.
  4. Fundamental scientific / BEAST / dating problems. Apart from all of the technical issues, the more fundamental problem is that if your dating analysis was too casual/lazy/poorly thought out, the resulting tree will not be publishable, and then you are basically wasting your time. At the very least, make an effort to do correct relative dating (see above), using a tight normal prior on the root age in your BEAST analysis. For example, set the root age prior to Normal(mean=10, standard deviation=0.01). There is NO POINT to having high uncertainty on the root age in a relative dating analysis, since all of the dates are relative to that age anyway. It is, however, somewhat interesting to see whether or not nodes in the middle of the tree have tight or uncertain relative date estimates, since that tells you something about how strict or relaxed of a clock model is being estimated for your sequence data.

How to check if your tree is super-short

(link to this section)

How to check if your tree is super-short:

library(ape)

# Specify tree filename (trfn)
trfn = "tree.newick"

# Read tree and see original branch lengths
tr = read.tree(file=trfn)
tr$edge.length

# Plot the tree
plot(tr)
# Add the "time" (or relative time) axis
axisPhylo()

# Now, LOOK at the plot of the phylogeny.  If the x-axis spans, say, 5 or 10 or 50 time units 
# (typically with phylogenies, the units are millions of years), then your tree is not super-short. 
# But, if the x-axis spans only, say 0.06 units, then your tree is super-short, then you 
# probably have a problem.

Another method is to sum up the branch lengths in your tree:

library(ape)

# Specify tree filename (trfn)
trfn = "tree.newick"

# Read tree and see original branch lengths
tr = read.tree(file=trfn)
tr$edge.length

# Sum up the branch lengths
sum(tr$edge.length)

# You can also look at the minimum branch length
min(tr$edge.length)

If the sum of all of the branch lengths is <1, you probably have a very short tree (at least, assuming you have a reasonable number of species). All of this is a minor example of my overall philosophy, "Think about your data and what you are doing, don't just push buttons."

How to manually change the tree length

(link to this section)

If you want/need to manually make your tree longer, this is easy. Note that this will NOT fix any scientific/BEAST problems you might have had with your dating analysis in the first place. But it will get your branch lengths to a more reasonable length (BioGeoBEARS defaults assume most branches are between about 0.1 and, say, 10 or 50, time-units long).

# Modify branch lengths
tr2 = tr
tr2$edge.length = tr$edge.length * 1000
tr2$edge.length

# Write to new file
write.tree(phy=tr2, file="tree2.newick")

A tree with relative dating can be used in a non-time-stratified analysis, since the inference, likelihoods, etc. will be identical whether the dates are absolute or relative. The only thing that will change is the rate parameter ML estimates — if the branches are all 100 times longer, then the rate parameter estimates will be 100 times lower. The cladogenesis parameter estimates (e.g., j), will not change, because they are per-event-weights-per-cladogenesis event, not rates-per-million years or rates-per-time-unit.

Of course, if you are doing time-stratified constraints, etc., you will have pathological results if you use relative-dated phylogeny, because time-stratification requires absolute dates to specify geographical / geological hypotheses.

OTUs of the phylogeny should be species/populations, not specimens

(link to this section)

See also: How (and whether) to collapse tips to prune a tree

The OTUs (operational taxonomic units) at the tips of the input phylogeny should be species or monophyletic populations. A raw tree where the OTUs are individual specimens should NOT be used, unless each specimen can "stand in" for their respective species/populations. If an input phylogeny has multiple specimens for each species/monophyletic population, this will severely bias the results: a specimen can only live in a single area, so a phylogeny of specimens will have all of its tips living in single areas. When all of the tips inhabit single areas, this typically strongly favors the DEC+J model over the DEC model. This is fine if each species/monophyletic population really does live in a single area, but it's a big problem if species/monophyletic populations actually live in multiple areas, but you have input a specimen tree that forces each tip to live in a single area. Furthermore, if multiple specimens from the same species/population are in the phylogeny, a large number of fake "speciation"/"cladogenesis" events are introduced.

This subtle issue is often ignored, because most phylogenetic models (for DNA, amino acids, Mkv model for morphology, etc.) assume that nothing special happens at cladogenesis: whatever state is in the ancestor is instantaneously passed on to both daughters after cladogenesis. They are "continuous-time" models. In these models, all that matters is branchlength, not the number of cladogenesis events, so a specimen tree can be used without problems (or at least without major problems).

In contrast, in biogeographical models of geographic range evolution (Lagrange DEC, BioGeoBEARS DEC, DEC+J, etc.), the "character" (geographic range) changes according to an anagenetic model along branches, and changes according to cladogenetic model at splitting events (see the Figure at: http://phylo.wikidot.com/biogeobears#BioGeoBEARS_supermodel_graphic for a summary of the processes used by different models). Thus, it is problematic to introduce fake cladogenesis events that just represent the divergence of specimens, rather than the divergence of monophyletic populations/species.

The solution, if you have a specimen tree, is to collapse it to a species or population tree. Then, in any case where specimens from a particular population/species occupy multiple areas, the relevant tip is coded as living in e.g. 2 areas.

Code for collapsing a specimen tree to a species tree, given a specimen phylogeny, and an Excel table specifying which specimens go into which species/population OTUs, is found in the function prune_specimens_to_species.

See also: How (and whether) to collapse tips to prune a tree

Note: If the whole topic of coalescent gene trees versus phylogenies of populations/species is mysterious to you, you should probably read up on this topic before proceeding!

A simple Mk-like model, for biogeography on specimen trees or gene trees

(link to this section)

Despite my above statements, it can still be meaningful to do biogeography on a specimen tree — it is just not that meaningful to use models that allow gene lineages to exist in multiple areas. The actual physical ancestor of any particular genetic locus only lived in one location at any particular point in time. (There may well have been identical copies of that genetic locus that lived in many different places, but that is a different question.)

We can use BioGeoBEARS to make a simple model where any gene lineage can only live in one location at a time. This model is essentially the Lewis (2001) Markov-k (Mk) model for an unordered character with k states. If k=2, you have a binary character, if k=3 you have a multistate character with 3 states, etc.

The Mk model is available in many R packages (e.g. APE, phytools) and other software (SIMMAP), and treating biogeography as an unordered discrete character has a long history [Note 1] in biogeography (although not without controversy, especially in the case where you think that populations/species might sometimes occupy multiple areas).

However, the advantage of doing an Mk-like model in BioGeoBEARS is that you can make use of model additions like time stratification, manual dispersal multipliers and the +w model variant, distance matrices and the +x model, the +J model (although it is debatable if is a useful addition in the Mk-model context), etc. Technically these model additions make the model used something other that Mk…they could be called Mk+x, Mk+w, Mk+x+w, etc.

The other advantage of doing an Mk-like model in BioGeoBEARS is that you can easily make the standard graphics, use Biogeographical Stochastic Mapping, etc.

Of course, if desired, this could also be used to run non-biogeography characters in BioGeoBEARS (this may be particularly useful if you have, say, a character with hundreds of states).

The Mk model as BAYAREALIKE*+a-d-e

(link to this section)

In order to make BioGeoBEARS produce an Mk-like model, we need to do the following things:

  1. Set the maximum range size to 1 area by setting max_range_size=1
  2. Eliminate the null range (no areas occupied) by setting include_null_range=FALSE
  3. Set the parameter d (rate of range-expansion dispersal) to be fixed to 0
  4. Set the parameter e (rate of "extinction", i.e. rate of range-contraction/local extirpation) to be fixed to 0
  5. Set the parameter a (rate of range-switching, e.g. A->B) to be free
  6. Set the cladogenetic range-change model to be BAYAREALIKE (the range just before speciation is always copied to both descendants instantly after speciation)

Eliminating the null range is the "*" modification of DEC-type models (Massana et al., bioArXiv). Adding the "a" parameter is +a, and eliminating d and e makes the model "-d-e". So, if you like, another way to describe the Mk model for an unordered character is BAYAREALIKE*+a-d-e.

I give the commands for doing this, below:

# (you will want to splice these commands in at the appropriate place in the example script)
# 1. Set the maximum range size to 1 area by setting {{max_range_size=1}}
max_range_size = 1

# 2. Eliminate the null range (no areas occupied) by setting {{include_null_range=FALSE}}
BioGeoBEARS_run_object = define_BioGeoBEARS_run()

# Input the maximum range size
BioGeoBEARS_run_object$max_range_size = max_range_size

# set to FALSE for e.g. DEC* model, DEC*+J, etc.
BioGeoBEARS_run_object$include_null_range = FALSE

# 3. Set the parameter d (rate of range-expansion dispersal) to be fixed to 0
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["d","type"] = "fixed"
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["d","init"] = 0.0
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["d","est"] = 0.0

# 4. Set the parameter e (rate of "extinction", i.e. rate of range-contraction/local extirpation) to be fixed to 0
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["e","type"] = "fixed"
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["e","init"] = 0.0
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["e","est"] = 0.0

# 5. Set the parameter a (rate of range-switching, e.g. A->B) to be free
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["a","type"] = "free"
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["a","init"] = 0.0
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["a","est"] = 0.0

# 6. Set the cladogenetic range-change model to be BAYAREALIKE (the range just before speciation is always copied to both descendants instantly after speciation)
# Set up BAYAREALIKE model
# No subset sympatry
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["s","type"] = "fixed"
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["s","init"] = 0.0
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["s","est"] = 0.0

# No vicariance
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["v","type"] = "fixed"
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["v","init"] = 0.0
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["v","est"] = 0.0

# No jump dispersal/founder-event speciation
# BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["j","type"] = "free"
# BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["j","init"] = 0.01
# BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["j","est"] = 0.01

# Adjust linkage between parameters
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["ysv","type"] = "1-j"
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["ys","type"] = "ysv*1/1"
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["y","type"] = "1-j"

# Only sympatric/range-copying (y) events allowed, and with 
# exact copying (both descendants always the same size as the ancestor)
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["mx01y","type"] = "fixed"
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["mx01y","init"] = 0.9999
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["mx01y","est"] = 0.9999

Note 1: See Matzke (2013), and Figure 1, column "Character mapping"; see figure here: http://phylo.wikidot.com/biogeobears#BioGeoBEARS_supermodel_graphic

References

(link to this section)

Lewis, Paul O. (2001). "A Likelihood Approach to Estimating Phylogeny from Discrete Morphological Character Data." Systematic Biology, 50(6): 913-925. DOI: https://doi.org/10.1080/106351501753462876

Massana, Kathryn A.; Beaulieu, Jeremy M.; Matzke, Nicholas J.; O'Meara, Brian C. (2015). "Non-null Effects of the Null Range in Biogeographic Models: Exploring Parameter Estimation in the DEC Model." bioArXiv, doi: https://doi.org/10.1101/026914. http://biorxiv.org/content/early/2015/09/16/026914

Matzke, Nicholas J. (2013). "Probabilistic historical biogeography: new models for founder-event speciation, imperfect detection, and fossils allow improved accuracy and model-testing." Frontiers of Biogeography, 5(4), 242-248. ( Journal | Scholar)

OTUs of the phylogeny should be species/populations, not genera or other higher clades

(link to this section)

I am basically morally opposed to genus-level analyses in DEC, DEC+J, and other biogeography models, where the tips have the ranges of multiple species, instead of individual species. Perhaps there is an argument that would convince me otherwise, but as far as I can conceive of the issue, the biogeographical models explicitly assume that ranges along branches, and range-changes at splits, are processes that happen to species and at speciation. One can argue about the reality of species etc., but that decision has already been made once one signs up to use one of the biogeographical models. It's not clear at all what is happening if the tips are genera instead of species, and the ranges are genus ranges instead of species ranges. For example, what do divergences on the tree represent? The splitting of the species ancestral to both tip genera? If so, then why do the tips have the ranges of multiple species combined? If the ancestral ranges on the tree represent ancestral genus ranges, and divergences on the tree represent the splitting of ancestral genera, what does that even mean?

So, I recommend avoiding genus-level analyses in BioGeoBEARS wherever possible.

That said, if all one has available is basically a genus-level phylogeny, there are some imaginable alternatives that might be better. However, they are not necessarily easy. For example, a researcher could:

a. simulate random dated relationships of the species within each genus (if, say, you had geography but not the phylogeny for some species)
b. pick one species within each genus and use that species range
c. reconstruct / guess the ancestral range of the LCA (last common ancestor) of each genus, and put in that range on a tip with the age of the ancestral node (as if it were a fossil)

If all one is interested in is the deeper biogeographical events, (b) is fairly acceptable, at least if the species are reasonably randomly sampled, geographically speaking. It's not much different than any other analysis, where we know there has been extinction, and thus missing tips, but we ignore it because we have no choice.

It is true to say that the practical effect of using genus ranges will mostly depend on the average difference between genus ranges and (b). If many/most genus ranges have much wider ranges than individual species, I think that this will create a really strong bias towards widespread ancestors, low estimates of J, and very high uncertainty about many ancestral nodes. But if many of the tip ranges are about like what the individual species ranges will be, the practical strategy will be close to (b) anyway.

At the very least, if one wants to use a genus-level analysis in a paper using BioGeoBEARS, I would strongly recommend that the paper include a good discussion of why this was done, what assumptions genus-level analyses involve, and how this method might impact the results.

Watch out for ultra-short branches

(link to this section)

In order allow the possibility of including "direct ancestors" in an analysis (either because you have a complete fossil lineage, or you wish to calculate the likelihood of the data under a particular hypothesis where you assume, for the sake of argument, that a particular lineage at a particular time had a particular geographic range), BioGeoBEARS interprets very short branches as "hooks". "Hooks" are very short branches that contain a direct ancestor, but represent it as a short side-branch so that trees with the ancestor can be saved to and loaded from standard Newick files.

Basically, this means you should make sure your phylogeny has no branches shorter than the min_branchlength input. By default, min_branchlength=0.000001. If (as is typical) your dated tree is in units of millions of years, a branch of length 0.000001 represents a branch length of 1 year. Presumably this is impossible to get in a standard dating analysis, so this should be a safe cutoff. However, there are rare-but-imaginable situations (e.g. virus evolution) where ultrashort branches are possible. Also, some users might accidentally divide all their branchlengths by 1,000,000 in some other program, or you could have really bad luck and have e.g. a BEAST MCC tree calculate a very short average branchlength (sometimes negative branchlengths are produced by the MCC procedure, although BioGeoBEARS will throw an error for these).

If BioGeoBEARS detects branches shorter than min_branchlength, it will print a warning to screen, but it will keep running on the assumption that you meant to have short branches representing direct ancestors. If you did not intend this, you should modify your input tree (e.g., change the units and multiply all branch lengths by 1,000) to fix the problem.

To see more discussion of the hooks method for representing fossils, see Fossil Data In Biogeographical Analysis In BioGeoBEARS: Including "direct ancestors" as "hooks".

Code for adjusting branchlengths in a Newick file:

library(ape)

# Specify tree filename (trfn)
trfn = "tree.newick"

# Read tree and see original branch lengths
tr = read.tree(file=trfn)
tr$edge.length

# Modify branch lengths
tr2 = tr
tr2$edge.length = tr$edge.length * 1000
tr2$edge.length

# Write to new file
write.tree(phy=tr2, file="tree2.newick")

Ancestral range estimation, not ancestral area reconstruction

(link to this section)

The term "ancestral area reconstruction" (AAR) is common in historical biogeography analyses. It often gets used indiscriminately for any sort of inference of ancestral biogeography, whether done with parsimony, likelihood/Bayesian methods, or something else.

However, the term is problematic for a few reasons. First, "reconstruction" of ancestral states (or ranges) is classically the result of a parsimony analysis, where character states are mapped on the tree and (often) a single most parsimonious reconstructed history of that character is found. In a likelihood-based analysis, however, everything is about probabilities, and about estimating the probability of every possible ancestral state at each node, given all histories that are possible explanations of the data under a specified model and parameters.

All too often, including in biogeography, researchers will take the single most-probable ancestral state (or geographic range) at each node, depict that in a graphic, and interpret it as if it was a parsimony reconstruction with no ambiguity. While this is fine for nodes that estimate a particular ancestral range with high probability (>50%), it is quite common in likelihood analysis for some nodes (particularly deep in the tree) to have high ambiguity about the exact ancestral range. If the single-most-probable ancestral range at the root node only has 5% of the total probability, it probably doesn't make sense to treat it as if it were "the" "ancestral area reconstruction". What you really have is a highly ambiguous estimate of the ancestral range at that node, with the probability distribution spread out over many possible ancestral ranges. I feel that the term "ancestral range estimation" (ARE) helps us keep this in mind.

A second problem with "ancestral area reconstruction" is that it seems to suggest that the focus of the inference is presence/absence in particular single areas, rather than inference of geographic ranges. As far as the likelihood-based statistical models function, however, the "character states" are not areas, but "geographic ranges". Geographic ranges are just characters with a large number of character states (every allowed combination of presence/absence in an area) and a complex model describing the probability of moving between these states). I suspect that the focus on single areas, rather than on ranges, is part of what inspires the popularity of certain biogeographical models that are, at the very least, debatable (e.g., treating biogeography as a unordered character where lineages instantaneously switch between regions, e.g. when DNA models are applied to 4-island systems; or the Bayesian Binary Method of RASP, which treats presence/absence in each region as an independent binary character).

Finally, using the word "estimation" in ancestral range estimation helps us remember that we are making a statistical estimate of history, and that the statistical tools that are widely used elsewhere (e.g. Likelihood Ratio Test, AIC, AICc, etc.) can also be used in biogeography. I find it quite surprising that, until BioGeoBEARS, there was not a major move to do statistical model choice in biogeography (although you can find some hints at this in the literature), despite the advent of likelihood-based, parametric methods with the publication of Lagrange (Ree et al. 2005, Ree & Smith 2008). Perhaps some of the delay was due to people thinking about biogeography in terms of Ancestral Area Reconstruction rather than Ancestral Range Estimation.

(Note: I have to credit Brian Moore who long ago first explained to me that "estimation" was really a better word than "reconstruction" for what we are doing in biogeography.)

(Note also: I will reliably suggest this change in terminology in peer-reviews, if I happen to be the reviewer on your paper. You have been warned!)

The plot of most-probable states is not the same thing as the best joint history

(link to this section)

When BioGeoBEARS runs, one of the outputs is a PDF graphic. The first page shows the single-most probable state (geographic range) at each node and corner (corners represent the states instantaneously after cladogenesis, nodes instantaneously before). (Note: if there is a tie for most-probable state, the first state in the states list is shown. This is not desirable but programming a workable graphic solution, beyond the pie charts, is more difficult than it seems.)

The second page shows the pie charts giving the relative probability of each state.

There is a temptation to just look at the first page, which has a simple output, and to read it naively as the "best estimate of history". While this isn't completely a horrible thing to do, since in many cases the most-probable ancestral states will probably represent the true ancestral states, there are some problems with it. First, careful readers may sometimes notice that there can be "conflicts" between the most probable ancestral states at nodes and at the corners just above them. E.g., perhaps the most probable state at a node is range "AB", and both corners have a most-probable state of range "A". This is an impossible joint history under the DEC or DEC+J models. At least two things might be going on: (1) ties in the probability of states "A" and "B", mentioned above; or (2) at both corners, the alternatives to range "A" are many, but none of them singly has as much probability as range "A". This could fairly easily happen at both corners, especially when probabilities are spread out over many ranges.

The way to avoid errors in interpretation is to remember the difference between estimates of the most probable states/ranges at each node taken by itself, and the most probable joint history across all nodes. See Felsenstein 2004 for further discussion. Also, always be sure to consult the pie-charts to double-check your intuitions about the ancestral range estimates. While they are more complex, the pie charts are giving you the more complete picture of the ancestral range estimates.

How do I get the best joint history

(link to this section)

Jedi mind push: The best joint history is not the history you are looking for.

There is currently no method in BioGeoBEARS for the best joint history. Such a thing might be theoretically possible, but off the top of my head, I don't know how to do it. Felsenstein (2004), Inferring Phylogenies, discusses the possibility for the simpler case of DNA models, but I don't know of an implementation even for the DNA case (it may exist).

In any case, the best joint history would have a tiny tiny tiny probability of being the correct history, except in super-simple cases. For example, if you define "a history" as a state at each node, and if you have 6 areas, and ranges can have up to 3 areas maximum, you have:

numstates_from_numareas(6,3)

…= 41 possible states/ranges at each node. If we exclude the null range, that's 40 possible states at each node. The list of possible histories would be this long: (40^(num_nodes)). For a 6 species tree, that is 102,400,000 imaginable histories. Each will have some probability, even if tiny. Some histories will be WAY more probable, but it might well be that none of them have even >1% of the probability. It would get even more complex if you included the corner states (just after speciation) in the history. Is it meaningful to present the single best joint history, even if the highest-probability history has only a 0.05% chance of being the correct history?

Stochastic mapping gives you a probabilistic sample of these histories, conditional on a tree, parameters, and tip data. For questions like counting (estimating really!) the timing of events, or e.g. the number of disperal events, I recommend that.

The default plots, showing the single-most probable state at each node/corner, are good enough at giving you the general picture of the probable history. Looking at the pie-charts is NECESSARY for correct interpretation (whether or not you need to include them in a publication is a different question that depends on space/importance/etc.), because it will alert you to cases where the single-most probable state nevertheless has low probability.

The pie charts also help researchers keep in mind that the plot of the single-most-probable state at each node is not literally a joint history (so, it can have inconsistences), much less is it the best joint history, much less is it The One True History That We Know For Sure Now That We've Run BioGeoBEARS. These are probabilistic methods and models, all inferences and conclusions are thus probabilistic as well.

The plot of most-probable states is not the same thing as an estimate of event counts

(link to this section)

Similarly, it is tempting to look at the most-probable states/ranges plot and read off the apparent dispersal/subset sympatry/vicariance/etc. events. Sometimes, in more advanced analyses, researchers will count these events on a large tree or across several clades, and plot number of dispersal events between different regions through time. Again, this is not a horrible thing to do — it bears some relationship to the truth. However, it's sort of a "parsimony-thinking" way to do things. In reality, the likelihood analysis took into account all histories that were possible under the model, including things like double events (two changes on a branch), which you will not take into account when you just look at the most-probable states plot.

The right way to do event counts would be to use biogeographical stochastic mapping. For an example, and updates on progress in adding it to BioGeoBEARS (I am still working on it), see: Stochastic mapping under biogeographical models (in progress), and The right way to count events -- Biogeographical Stochastic Mapping (BSM)).

"Founder-event speciation" versus "founder-effect speciation"

(link to this section)

Please see the mini-essay, BioGeoBEARS deals with "Founder-event speciation" not "founder-effect speciation".

Issues with Maximum Likelihood (ML) optimization

(link to this section)

For models with 2 or 3 free parameters, I have generally found the optimx ML search function generally pretty reliable at actually finding the ML solution. However, it is not perfect (e.g. sometimes problems seem to arise when j is very close to 0 anyway, and thus not adding anything to DEC except making it harder to find the DEC parameters). If you have 4 or more free parameters, you will have to work harder to ensure you are finding the ML solution.

Things specifically to watch out for:

  • Optimization check #1: The three-parameter model should always outperform (get higher likelihoods, i.e. less negative LnL) than the 2-parameter models nested within it. (This statement only applies when the models are nested.) If DEC gives LnL=-10, and DEC+J gives -15, you have a problem with optimization.
  • Optimization check #2: The LnLs are printed to screen during the ML search. They should, with some noise, climb rapidly and then bounce around near a peak. If they don't, you may be stuck in a flat region of parameter space.
  • Optimization check #3: Searches with different starting parameter values should reach the same ML solution.
  • Optimization check #4: It is worth reading the help file for optimx, particularly the "Value" section which describes the optimx output, and the codes indicating whether convergence occurred.
  • Optimization check #5: For more complex models, I have found that GenSA (Generalized Simulated Annealing) usually seems to do a better (although slower) search. In general, comparing two runs with identical settings, except one has BioGeoBEARS_run_object$use_optimx = TRUE, versus BioGeoBEARS_run_object$use_optimx = "GenSA", can be a useful check.
  • Optimization check #6: Sometimes, turning on parameter rescaling, or using speedup=FALSE, can improve the search, although in my experience it seems quite rare that these make a difference.
  • Optimization check #7: An excellent strategy is to gradually build up to the most complex models, using the ML parameters inferred from each simpler model as the starting parameter values for the next most complex model (i.e., the model with one more parameter). As the most complex model can be reached this way through several paths, running the optimizations progressively through two or more different "paths" of models is a pretty good check.

Nothing is ever 100% perfect in the world of numerical optimization, but with careful study and critical thinking about our results, we can do pretty well. For more detailed advice on these issues, see Advice On Statistical Model Comparison In Biogeobears.

Issues with interpretation of Maximum Likelihood (ML) statistical methods

(link to this section)

Corrections to some somewhat-common misunderstandings:

  • The Likelihood Ratio Test (LRT) can be used to compare, pairwise, two models. One model must be nested within the other. The LRT gives a p-value on the null hypothesis, which is that the two models confer equal likelihood on the data. If p is less than your cutoff, say, p<0.05, then you can say your data reject the null hypothesis.
  • AIC, AICc, etc. can be used to compare ANY models, whether or not they are nested, and can compare a bunch of models at once by calculating the relative likelihood of each model. However, there is no statistical frequentist theory yielding a p-value cutoff for AIC/AICc. They just give you a relative sense of model fit.
  • LRT, AIC, etc. are used to compare DIFFERENT MODELS on the SAME data. You cannot run model 1 on dataset A and model 2 on dataset B and then draw a conclusion about model 1 vs. model 2.

For more detailed advice on these issues, see Advice On Statistical Model Comparison In Biogeobears.

The acronym for BioGeoBEARS

(link to this section)

The correct acronym of BioGeoBEARS is "BioGeography with Bayesian (and likelihood) Evolutionary Analysis in R Scripts", not "BioGeography with Bayesian Evolutionary Analysis in R Scripts". The "and likelihood" is important, because so far the vast majority of BioGeoBEARS usage has been with just the likelihood-based inferences. Philosophically, I think Bayesian analysis is preferable, and I give an example in my thesis and my Systematic Biology submission, but realistically (a) for the questions most people are interested in, Bayesian vs. likelihood analysis will give about the same result, and (b) the likelihood analysis is faster and simpler.

I didn't put "likelihood" into the acronym, solely because "BioGeoBLEARS" just doesn't have the same ring to it, and the program was developed at U.C. Berkeley, home of the Cal Bears. I make this note just to avoid the possibility that someone in the future will think I don't know the difference between Bayesian and likelihood-based analyses. :-)

FAQ — Frequently Asked Questions

(link to this section)

Here, I post a few answers to questions that have come up many times.

Question: Too many areas / too many states / BioGeoBEARS too slow. Answers: diverse

(link to this section)

Question on the BioGeoBEARS listserv:

I am new to your program and am trying to analyze a relatively large data set consisting of 1277 species classified into 21 areas. The maximum number of areas occupied by a species is 16. I was hoping that you could provide some advice regarding parameter settings in BioGeoBEARS. I ran a preliminary unconstrained DEC analysis in RASP that finished in less than a day. However, I had to constrain the maximum number of areas to a very low value (e.g. 2). I am pretty sure that no analysis will finish if I specify a max areas value of 16. An obvious solution is to reduce the number of areas. However, we are a bit hesitant to do this as we would lose power. There is obviously a trade-off between power and resolution and being able to run a meaningful analysis to completion. Any thoughts are welcome.

My answer, slightly updated:

Hi — I'm traveling to Australia, so I'm only partially able to access things. But briefly, the key thing is the number of states. The most states I've ever done is about 2000. This took a few hours for a 12-taxon tree, on ~12 cores. A 1200-taxon tree would take about 100 times longer.

You can check the number of states with:

numstates_from_numareas(numareas=21, maxareas=16, include_null_range=TRUE)

…you can see that's 2,089,605 states. That means the transition matrix is 2089605x2089605 — that's too big to hold in memory, let alone exponentiate.

So, your only choices are:

1. Reduce areas or max number of areas. Focus on what you think are the "most discrete" features in the geography, i.e. geographical features that are clearly conserved on the phylogeny. IMHO there isn't a huge point in doing ancestral state inference except on things that are pretty conserved on the phylogeny anyway.

1a. One option is to just model the centroid point of each species, thus each species occupies 1 area. You could do up to 1000 or 2000 areas this way, but you are sacrificing whatever is useful about modeling species as widespread (e.g., information about the cladogenetic process). Whether this is important depends on how likely you think it is you will successfully get such information anyway…

2. Use BayArea (the C++ program, not the BAYAREALIKE model in BioGeoBEARS). This means you are assuming BayArea's cladogenesis model (ancestral range always copied exact to both descendants, e.g. assuming widespread sympatry). RevBayes can combine BayArea-type Bayesian history sampling with cladogenesis models, but my understanding is that sampling cladogenetic histories effectively is very difficult, at least for larger problems.

3. Get sparse matrix exponentiation working in BioGeoBEARS or your own program (I can probably get sparse matrix stuff working, it would take me about a solid week which I won't have for awhile. And it might not solve your scale of problem anyway).

4. Be a genius and solve the computational problems in some new way.

5. One option is subdividing the problem into subregions and doing each separately.

Sorry I don't have an easier answer!

Cheers, Nick

Question: Function not found, or other crash. Answer: SOURCE() ALL PATCHES/UPDATES

(link to this section)

Here is a question that looks to be due to a failure to source() all of the latest patches. The errors this can cause are diverse, but this often crashes (if e.g. the patch moves some code to a new function, which you don't have):

I am also running BioGeoBEARS on a large dataset (900 terminals and 11 areas - maximum number of occupied areas=9).

On a linux cluster (120 cores - 64 GB memory) I manage to run part of DEC analysis. However, before it completes the analysis and loads results on .Rdata, I get this error message:

> Uppass starting for marginal ancestral states estimation!
> Error in if (sum(relprob_each_split_scenario) > 0) { :
> missing value where TRUE/FALSE needed
> Calls: bears_optim_run … calc_uppass_probs_new2 -> calc_uppass_scenario_probs_new2
> In addition: Warning message:
> In (function (npt = min(n + 2L, 2L * n), rhobeg = NA, rhoend = NA, :
> unused control arguments ignored
> Execution halted


(link to this section)

The answer is MAKE SURE YOU SOURCE() ALL OF THE PATCHES. My reply:

It is likely due to not sourcing all of the updated functions (I did some updates back in May/early June). Just make sure the *current*, *complete* list of source commands is being used. They are listed here:

http://phylo.wikidot.com/biogeobears#script

The list of source() commands expanded in the last update, so an older script won't have them all.

You will have to run these source command *every* time after "library(BioGeoBEARS)" is being used. Alternatively, you can download each file and source them locally — again, every time after "library(BioGeoBEARS)" is being used.

Yes, I should do a CRAN update or at least github, but I've been moving to Australia and enough updates have been made now that just doing the documentation is a bit of a project. It will happen eventually!

I just ran the Psychotria example, both stratified and non-stratified, and both worked.

If this does not work, please tell me!

Thanks!
Nick

Mistakes in the literature

(link to this section)

I will accumulate mistakes that I spot in the literature, especially pertaining to likelihood, the Likelihood Ratio Test, AIC, and optimization problems, in the following page: Mistakes in the Literature. I may write a review on the topic at some point for a journal where basic mistakes seem particularly frequent.

Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License