Fossil Data In Biogeographical Analysis In Biogeobears

As BioGeoBEARS has developed, I have programmed a number of methods for including fossil data. The initial steps are described in chapter 4 of my Ph.D., and additional methods have been developed to help collaborators deal with the specific fossil data available to them. It will be some time before all of these methods are formally described and published, as first I have to get all of my other Ph.D. chapters accepted. However, the methods exist, so they should be used if needed. (Whether or not they are easy to figure out and use is another question — in some cases it will be awhile before I can work up proper tutorials/examples, and sometimes the methods may need further refinement to deal with particular datasets; I am open to collaborations if a researcher finds that useful.)

In the meantime, I will briefly describe the options for fossil data that currently exist in the BioGeoBEARS package / the additional source files which will soon be added to the package.

(Back-link to BioGeoBEARS)

Fossil data have many well-known issues — incompleteness, uncertain taxonomy, uncertain dating, etc. These difficulties apply to any usage of fossils in phylogenetic comparative methods, ancestral state estimation methods, etc.

Furthermore, specifically with the question of biogeographical analysis and ancestral range estimation, fossil data have additional peculiarities that should be considered, the primary one being that a fossil specimen is only ever found in one time and place. Estimating a geographic range inherently requires the identification of a number of specimens belonging to the same species/clade.

In some cases, sufficient fossils exist to coherently describe the "range" of a fossil species, but it is very common for fossil species to be known from only one locality or region. The true range is not known — the species may have lived elsewhere, but has not been found yet, or perhaps no rock record of the correct age/environment/region exists.

Despite the difficulties — and actually, many of these difficulties may also exist for living taxa more than some might wish — it still seems that we will do better biogeography with fossil data included than by ignoring it. Furthermore, sometimes paleontologists have all-fossil datasets, and they may wish to ask the same sorts of questions about ancestral ranges, the importance of different processes, statistical model selection, etc., that neontologists ask about living taxa.

I have developed a number of methods to enable various possible ways of including fossil information. It will be some time before a really thorough overview and tutorial exists (and hopefully I, or I and collaborators, will be able to publish thorough descriptions and tests of the methods). The question of which method is "best" will depend on the researcher's question and dataset. And there are probably additional methods I haven't thought of.

As always, if you have specific questions, email me at gro.soibmin|ekztam#gro.soibmin|ekztam, or the BioGeoBEARS google group. I am busy enough that for serious help assistance/analysis set up, I tend to prefer an official collaboration, especially as these are new methods that require some care in use and interpretation; but I can give short hints otherwise to people who want to try stuff out and experiment for themselves.

Constraining the ranges at nodes

One way to include fossil information in BioGeoBEARS analyses is with the "fixnode" and "fixlikes" options in the BioGeoBEARS_run_object. Here, you fix the likelihoods under different states/ranges for a particular node or nodes. One or more internal nodes to fix are listed in "fixnode" (numbered according to APE's node-numbering scheme, see APE's "nodelabels" function), and the corresponding likelihoods under each state are assigned to fixlikes (either a list of likelihoods for a single fixed node, or a matrix for multiple fixed nodes, with #rows=number of fixed nodes, #columns=number of states/ranges in the state space.

For example, let's say you have an analysis with 2 areas, A and B. You have 2^2=4 possible ranges: {null, A, B, or AB}. Let's say that a fossil from your taxon is found in region "A", and has the correct age and characters to correspond to, say, node 20.

(Perhaps the fossil is similar enough to be considered "close enough", whether or not you think the fossil really came from the exact population representing the exact ancestral population corresponding to node 20. Of course, the question of direct ancestry is fraught, and such decisions would be a matter of expert opinion and somewhat subjective.)

In this situation, you would say fixnode = c(20), and fixlikes = c(0,1,0,1) for the ranges {null, A, B, AB}, respectively. That is, at node 20, you are fixing:

  • the likelihood of the data, under the hypothesis that the range is null, equal to 0 (as it is impossible that a direct ancestor lived nowhere; this hypothesis always conveys 0 likelihood under default LAGRANGE-DEC / BioGeoBEARS analyses)
  • the likelihood of the data, under the hypothesis that the range is "A", is 1
  • the likelihood of the data, under the hypothesis that the range is "B", is 0, as this is an impossible explanation, given the fossil in A
  • the likelihood of the data, under the hypothesis that the range is "AB", is also 1

As with uncertainty in base-calls in nucleotide data, likelihood "1" is assigned to any state that is a possible explanation of the observed data. If the fossil lives in "A", the true range could be either "A" or "AB". The likelihood of the data under each hypothesis is what is being specified. Determining the relative probability of each hypothesis, given the data, is a different question, and this answer is the product of the analysis, not the input. See Felsenstein 2004's discussion of the likelihoods assigned to ambiguous states in DNA for further discussion.

Of course, reviewers may well object to identifying a fossil with a particular node in a tree, since the obvious question is, "How do you know that fossil goes with that node?" So, one would want to make a decent argument for this strategy. The main argument would be that it is better than *not* including the fossil information in some fashion. And, of course, it would be simple to place the fossil at various different nodes and observe the impact on the ancestral range estimates, and the overall data likelihood.

Actually, the main use of the fixnode/fixlikes option was simply to allow hypothesis testing for historical scenarios. For example, one could constrain node 20 to have to include area "A", and then constrain it to exclude "A", and then compare the likelihoods under these two scenarios. The difference in likelihoods would indicate whether your data have a strong preference for one of the hypotheses. Probably this will remain the main purpose of fixnode/fixlikes, but it is one simple way to try and include fossil data in an analysis, if not a particularly satisfactory one.

Including fossils as terminal taxa

Wood, Matzke, Gillespie, & Griswold (2013, Systematic Biology) pioneered the combination of tip-dating and parametric biogeography (with LAGRANGE DEC). To my knowledge, only one earlier paper used LAGRANGE on fossils (Nesbitt et al. (2009), Science, DOI: 10.1126/science.1180350; although I also just came across: Loewen et al. (2013), PLoS One, DOI: 10.1371/journal.pone.0079420)

"Tip-dating" is also known as total-evidence dating, or Bayesian phylogenetic dating including fossils as terminal taxa; see Pyron (2011), Ronquist et al. (2012) for the two main methods (MrBayes and BEAST), and for applications, see e.g. Wood et al. (2013), or Alexandrou, Swartz, Matzke & Oakley (2013).

LAGRANGE seems to have been conceived for dated, molecular trees, and thus Python LAGRANGE threw a warning message if a non-ultrametric tree was input. However, the likelihood calculations will work on any tree, including a tree with fossils as terminal taxa. (In trees including fossils, branch lengths are still in units of time. If you have a tree with branch lengths in some other units, e.g. molecular branchlengths, LAGRANGE and BioGeoBEARS will still run, but it is not clear how to interpret the d and e rates.)

Including "direct ancestors" as "hooks"

In rare situations, you might find it justifiable to hypothesize that a fossil/collection of fossils comes from a lineage that is a direct ancestor of an extant clade (e.g., some fossil mammal species are given this interpretation by experts). These can be included in BioGeoBEARS as "hooks", by which I mean ultra-short side branches. BioGeoBEARS distinguishes hooks from regular branches by their length: any branch shorter than the "min_branchlength" input of the calc_loglike_sp function (default min_branchlength=0.000001) is interpreted as a hook, and the program prints a warning to warn users in case this was not intended.

When the likelihood is calculated at the node just below the hook tip, no cladogenesis model is applied. This, and the very short branchlength, effectively multiplies the likelihoods at that node by whatever likelihoods were input for the hook tip. If only one state/geographic range is possible according to the hook tip likelihoods, that state will be forced at the node just below the tip. This effectively means that you think you know the geographic range of the lineage at whatever date you've assigned to the hook. Alternatively, you could code partial range information (see above) into that tip likelihood.

Obviously, whatever decision you make, you should be prepared to defend to reviewers. The cladistic prohibition against ever inferring direct ancestors has become something of a dogma, and one that needs to be questioned in some cases (for starters, anyone who believes that species have stratigraphic ranges has already accepted that some fossils are members of populations ancestral to later populations). And, probably we will soon have proposals for Bayesian inference of direct ancestors, where some fossils will be inferred a nonzero probability of being direct ancestors. However, it is true that in most situations (sparse fossil record, one or a few fossil specimens per species, very incomplete sampling), assumptions about direct ancestry will be difficult to justify, and in those situations you should include fossils as actual side branches if possible.

Validation of hooks with/without time-stratification

In Feb. 2016, I discovered a bug in time-stratified analyses when using hooks to represent direct ancestors. Basically, the time-stratified functions had a lower hard-coded default for direct ancestors (1e-21), thus, even when a non-stratified and a time-stratified analysis *should* have given identical results (e.g., time-stratified dispersal matrix with all 1s), it didn't, because the time-stratified analysis was failing to treat the hook as a direct ancestor in parts of the calculation. This has now been fixed, and the code uploaded to the main BioGeoBEARS page. The validation script is in "Files", below, on this page.

Coding no-data areas as "?"

A common situation with fossil data is that you will have information about presence, but not about absence. If you find a fossil in region A, you know that the geographic range must include region A. However, in the other regions, you may have no fossils of that species, or no fossils at all (perhaps there is no exposed rock of the correct age). One simple way to include this information is to code the no-fossils areas as "?" in your geography data input file. Then, you set useAmbiguities=TRUE when calculating the starting tip likelihoods with tipranges_to_tip_condlikes_of_data_on_each_state.

If you've coded region A as "1", the "?" in other regions mean that any geographic range including "A" will have a tip likelihood of 1 (A, AB, ABC, ABCD, etc.), and any geographic range leaving out "A" (B, BD, etc.) will have a tip likelihood of 0. Thus your fossil represents a positive constraint.

This "positive constraint" strategy is maximally conservative about what your data says, which is an advantage. However, my limited experience with using it, thus far, indicates that it may be too conservative. Assigning likelihood 1 to any range that includes region "A" in effect gives substantial weight to widespread ranges, since there are many more possible ranges that are widespread (A, AB, AC, AD, ABC, ACD, ABD, ABCD) than there are single-area ranges (A). In most analyses, many species live in a single area, and a few are widespread, and it would be nice to have that information included. This would probably require a fully Bayesian model, with a prior on range size, and should be the topic of further work.

In any event, users can run their data with the positive constraint strategy, using "?" for no-fossil areas, and run it again with the assumption that fossil ranges are the true ranges, coding no-fossil areas as "0". The effects of these assumptions can then be compared.

(An aside: one should also be able to include a negative constraint in the geography file. E.g., a range of "0???" would mean that you are certain area "A" was unoccupied, and you have no information about the other areas. This is programmed, but not tested. A negative constraint represents a strong statement about your knowledge, but might be justified in situations where many fossils of the correct size/taphonomy to be detected have been collected, but the taxon in question has not.)

Imperfect detection model and taphonomic controls

I believe that the ideal way to include biogeographic data from fossils would be with a model of imperfect detection, using taphonomic controls to indicate sampling availability/effort. This is the topic of chapter 4 of my Ph.D., and the mf, dp, and fdp parameters in BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table.

This chapter needs revision and submission to a journal. It is on the back-burner until the other chapters are published. And, it would need a bit of a tutorial to make it easy to use. However, if you read Chapter 4 and find the approach interesting and would like to apply it, email me.

Matzke, Nicholas J. (2013). Probabilistic historical biogeography: New models for founder-event speciation, imperfect detection, and fossils allow improved accuracy and model-testing. Department of Integrative Biology, University of California, Berkeley. Order No. 3616487, ProQuest Dissertations and Theses, pp. 1-251. Open-access at:

Random addition of taxa of uncertain date/position

This is work being done in collaboration with Ruud Scharn. It deals with the common situations of fossils with (1) uncertain dating, (2) some constraint on phylogenetic position — the exact phylogenetic position of the fossil is unknown, but it is known to be in a clade's crown, stem, or either.

Basically, the user sets up a list of fossil taxa in Excel, lists the date range for each, specifies the clade and the type of constraint (any of the options described above), specifies whether the fossil's geography information is a constraint or the complete range, etc. Then the R function adds these fossils to a tree randomly, given the constraints. This can be repeated over many trees (e.g., a collection of trees from a BEAST analysis), biogeographical inference is run on each tree, and then summarized on a master tree.

Setting this up was a fairly massive amount of work, and it may be some time before I can write a tutorial that will make it easy (the publication is higher priority). Email me if interested.

Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License