Table of Contents
|
I have been getting emailed questions about analyses using "*" models — DECstar/DEC*, or other +* models. I have answered these one-by-one, but I might as well provide the same info to everybody. I give several attempts below. Please note that these are my opinions, other authors on Massana et al. might have a different take!
Key points about +* models
1. REQUIREMENT: Read Massana et al. very carefully, and try hard to understand it, before trying to interpret model comparison results that include +* models.
1a. Here is the paper:
Kathryn A Massana, Jeremy M Beaulieu, Nicholas J Matzke, Brian C O'Meara (2015). Non-null Effects of the Null Range in Biogeographic Models: Exploring Parameter Estimation in the DEC Model. bioRxiv. http://dx.doi.org/10.1101/026914 http://biorxiv.org/content/early/2015/09/16/026914
1b. ESPECIALLY do this before trying to publish. I have seen some common mistakes that are fatal if peer-reviewers are paying attention.
2. REQUIREMENT: Check if your parameter estimates have hit the maximum. For example: BioGeoBEARS has default limits on parameters. These normally help to improve the search by limiting the difference in scaling parameters. For d and e, the default maximum is ~5. If your Maximum Likelihood (ML) parameter estimates hit 5 or are very close to it, this means your search stopped because it hit this default limit, NOT because you've actually gotten the ML result.
2a. If you don't have the actual ML result, then your AIC, AICc, statistical model comparison results, etc. are invalid. This is a rejection-worthy mistake!
2b. The solution is to change the max for d and e, and re-run. If you hit the max again, you still have the same problem. Raise the max again and re-run.
2c. You can look at the min/max settings like this:
BioGeoBEARS_run_object@BioGeoBEARS_model_object@params_table
2d. You can change the max like this:
# Defaults
BioGeoBEARS_run_object@BioGeoBEARS_model_object@params_table["d","max"] = 5
BioGeoBEARS_run_object@BioGeoBEARS_model_object@params_table["e","max"] = 5
# Raising the max
BioGeoBEARS_run_object@BioGeoBEARS_model_object@params_table["d","max"] = 100
BioGeoBEARS_run_object@BioGeoBEARS_model_object@params_table["e","max"] = 100
3. REQUIREMENT: Think carefully about what ultra-high rate estimates mean. Inferred rate parameters are not just yadda yadda yadda, they are telling you how the model is behaving while fitting the data. Sufficiently huge differences in rates (e.g. rates of zero, or super-high rates) effectively can produce entirely different models from what you might have in your personal conceptual understanding of what you think a model is doing.
3a. For example, DEC* with moderate d and ultra-high e is very similar to an Mk model (Markov-k by Lewis 2001; this is a model for an unordered character, or a biogeographical model where the only ranges allowed are single-area ranges).
3b. Advice for thinking about rates: If your tree's branch lengths are in millions of years, your rates are in expected number of changes per million years. Typically, if a character (or geographic range) is conserved, the rate of change will be lower than the diversification rate. This means that changes are typically rarer than branching events on the tree, which means that the character is conserved, which means that there is some phylogenetic signal in the character.
You can check the diversification rate of your phylogeny with a small script like this (this loads the example Psychotria dataset in BioGeoBEARS):
library(ape)
library(BioGeoBEARS)
# Load the Psychotria tree
extdata_dir = np(system.file("extdata", package="BioGeoBEARS"))
trfn = np(paste(addslash(extdata_dir), "Psychotria_5.2.newick", sep=""))
tr = read.tree(trfn)
# Estimate the rates (this is a simple method in APE,
# see geiger for more sophisticated options)
# birthrate means: speciation rate
# deathrate means: extinction rate
#
# Note: the lineage extinction rate is a different thing
# from the "e" in DEC, which is a range of range
# contration
#
birthdeath_results = birthdeath(tr)
birthdeath_results
diversification_rate = birthdeath_results$para["b-d"]
relative_extinction_rate = birthdeath_results$para["d/b"]
birthrate = diversification_rate / (1-relative_extinction_rate)
deathrate = relative_extinction_rate * birthrate
# Look at the estimates
# Diversification rate (speciation - extinction)
diversification_rate
# Relative extinction rate (extinction / speciation)
relative_extinction_rate
# Speciation rate
birthrate
# Extinction rate
deathrate
In Hawaiian Psychotria, the estimated speciation rate is ~0.32. This is also the diversification rate, because extinction is estimated at 0 (it is very common for diversification inference models to infer a lineage extinction of 0 as the ML estimate, even though the true rate of extinction is likely to be much higher).
Note that the speciation rate (0.32) is much higher than the estimated d and e under the DEC model (d~0.034, e~0.025). This is typical (although, e is often estimated as 0 for many clades).
3c. ZEN VERSION. Here is some zen on this topic — if you don't get this, you won't understand what is going on with these model/parameter combinations:
Remember, if d is moderate and e is very, very high, then any time there is a "d event" (range expansion), this would be nearly-instantaneously followed by an "e event" (range contraction). Therefore, this is basically a "jump" process (a lineage moves to a new region and leaves behind no ancestors).
Therefore, the DEC* model with the following parameter combinations are pretty much really different things:
DEC*-with-moderate-d-and-moderate-e
DEC*-with-moderate-d-and-near-infinite-e
As I think we say in Massana et al., this weird behavior results from treatment of the "null range". Really, neither DEC, DEC*, or DEC+J (or the other available biogeographical models) deal with this at all well, for typical datasets. The fundamental problem with all of them is that they leave out lineage extinction. Lineage extinction would be great to include, but it requires an SSE model, and these models get too slow above about 2 or 3 areas. So, we are stuck with approximations.
Background: the "DEC*" model
The "*" model variant is introduced and discussed in:
Kathryn A Massana, Jeremy M Beaulieu, Nicholas J Matzke, Brian C O'Meara (2015). Non-null Effects of the Null Range in Biogeographic Models: Exploring Parameter Estimation in the DEC Model. bioRxiv. http://dx.doi.org/10.1101/026914 http://biorxiv.org/content/early/2015/09/16/026914
The DEC* model can be run in the modified version of Lagrange discussed in Massana et al. Alternatively, +"*" model variants can be run in BioGeoBEARS by changing BioGeoBEARS_run_object$include_null_range to FALSE for each model in the example script.
What the "*" model does is remove the null range from the state space. In biogeography analyses, the "state space" is the list of possible ranges (please note: ranges and areas are different things! Ranges are combinations of areas).
So, if you have 2 areas, A and B, you have 4 possible ranges in the state space of DEC:
DEC ranges: null, A, B, AB
This means the anagenetic transition matrix is 4x4.
But in DEC*, the null range is removed:
DEC* ranges: A, B, AB
This means the anagenetic transition matrix is 3x3.
It turns out that including or excluding the null range from the state space can have huge effects on parameter inference, and model interpretation. If the null range is included, then e has to be small, because the null range is never observed in the data, and the best way to avoid producing the null range is to have a low e.
However, if the null range is excluded, then there is nothing pushing e down. If the observed data at the tips are mostly single-area ranges, then these data are most likely if e is very, very high — the higher e is, the shorter any widespread ranges will persist, and the lower the probability is of there being any widespread ranges at the tips.
Note: IF the tip-data are NOT mostly single-area ranges, you will get different behavior from DEC* etc.! You won't get e shooting towards infinity, you won't get an Mk-like model, etc. Model behavior is data-dependent, there is not "one" behavior! I have seen this mistake at Evolution 2017!
For more discussion, see the Massana et al. paper.
Answer to a question
Answer to a question:
Ok, so just to see if I understood everything, Massana et al. suggested that the exclusion of the null range in the anagenesis transition matrix would improve the estimation of the parameters, especially e, because it was being underestimated in "regular" DEC analysis. In the paper, she is suggesting to use it only on DEC models, which makes a lot of sense to me, however, in your talk last year on Evolution2015, you suggested that it could be applied to all other models, and it should work in the same way, improving the parameters estimation during the analysis. So there is no problem with this, right?
Hi! Adding the "*" may well improve the log-likelihood — this is often observed.
Whether or not it "improves the parameter estimation" is a totally different question, and it depends what the true model is out there in real life, which we don't know. I tend to think e=5 or e=100 is a pretty ridiculous parameter "estimate" — remember, that number means "events per million years". So e=100 means "100 range contraction events per million years".
Typically, rate parameters are only reliably inferred when they are substantially less than the diversification rate. A typical diversification rate is something like 0.3 speciation events per million years (you could estimate it on this tree using various packages, e.g. birthdeath in APE). If events are rarer than speciation, e.g. d=0.05, then you can expect that changes will happen less frequently than speciation events, and there will be phylogenetic signal to estimate the parameter. But any rate much greater than the speciation rate is going to indicate an event that happens so frequently that there is no phylogenetic signal. You can infer that the rate is high, but the exact inference is probably highly uncertain.
Massana et al. shows:
A. If the true (simulation) model is DEC*, plus moderate d, plus moderate e, then you get many tips living in multiple areas, and then inference under DEC* is better than inference under DEC.
B. It also shows that, for many empirical datasets, few or no tips live in multiple areas. In this situation,
B1. DEC+J will often estimate d~0, e~0, j=moderate
B2. DEC* will often estimate d~moderate, e=infinity (effectively)
Both DEC+J and DEC*-with-high-e, when simulated under, produce most tips with single areas, so they fit the data better than standard DEC, and also better than DEC*-with-moderate-d-and-e.
Combining DEC* and DEC+J into DEC*+J sometimes gets a higher LnL than either alone, so this is worth trying.
However, both B1 and B2 are essentially proposing "jump" models, so in the situation where these kinds of parameters estimates are made, what you have learned from your data is mainly: your data do not support a slow, gradual process where ranges expand once every few million years per lineage, and then contract at similar rates. Instead, they support some kind of "jump" process. Whether DEC*-with-high-e, or DEC+J, fit better, is probably a matter of whether range-switches at the tips are more correlated with the branchlength between tips, or the number of nodes between tips.
Summary
My summary: Try out the +* models if you are interested in seeing the effect of excluding the null range from the state space. But, be sure to think carefully about the parameter estimates that result, and ask yourself what they mean! Biogeographical Stochastic Mapping (BSM) is a possible method to visualize more directly what exactly a particular set of model parameters is suggesting about the biogeographic history.