Beastmaster Test Files

Beast2 XML has a great many moving parts, and it is Very Nontrivial to have an R script like BEASTmasteR correctly read in the dataset and output correct XML that will actually run in Beast2. I usually consider it an achievement when I get BEASTmasteR working on my own dataset. But, the code has now advanced to the point where it is good to have some simple validation examples in addition to the example script on the main page. I am putting these in: BEASTmasteR test files.

The functioning examples (with inputs files and output XMLs) are available as zipfiles in "Files" at the bottom of this page.

My short "readme" files for each run are inside each zipfile. But, I will also post them below for google-ability.

WARNING #1: JUST BECAUSE THE XML RUNS, and BEASTmasteR is "tested" in that sense, DOES NOT MEAN THE SCRIPT HAS INCLUDED EVERYTHING THAT "SHOULD" HAVE BEEN INCLUDED IN THE ANALYSIS. IT IS YOUR RESPONSIBILITY TO CHECK THESE THINGS AS WELL AND THINK FOR YOURSELF

WARNING #2: EVEN IF YOU HAVE A PERFECTLY SET-UP XML FILE, THIS DOES NOT GUARANTEE THAT YOUR BAYESIAN MCMC ANALYSIS WILL GIVE VALID RESULTS. Many of these models are very new and not well explored; combining models in novel ways can sometimes help and sometimes hurt; and as always you (and your reviewers) have to think carefully about run times, number of MCMC samples collected, ESS values, parameter identifiability, priors, the raw data, the models and their assumptions, etc.

LW12.zip: Basic tip-dating (morphology model and BDSKY tree prior)

This shows a basic tip-dating analysis for the morphology-only dataset published by Lee & Worthy (2012), examining the phylogenetic position of Archaeopteryx amongst the theropod dinosaurs.

The tree prior is Birth-Death with Serial Skyline Sampling (BDSKY). This allows you to have different rates of speciation, extinction, and sampling in different time bins. The "treemodel" worksheet shows 2 time bins each for birthRate, deathRate, and samplingRate.

Whether or not you can successfully estimate the rate parameters for multiple time-bins is a different (very important) question. The BDKSY prior is pretty new and most of the interesting questions about how to use it are unexplored.

In some cases, you may want to just fix certain parameters. E.g., perhaps one might fix samplingRate to 0 in a time-bin where no fossils were discoverable / available for your study — note that you should not fix samplingRate to 0 just because there are no fossils! If many people have looked for fossils in that time-bin, but not found any, that is information that is relevant to dating!

You can also put standard prior distributions on these parameters, or, (more interestingly) relative priors. Relative priors are a cool idea but they do increase autocorrelation and thus cause poorer mixing and require substantially longer runs.

I set the starting samplingRate between 0.001-0 mya to 0 to ensure that psi-sampling (sampling through time with samplingRate) was not being confused with rho-sampling (instantaneously sampling many lineages at the same time, i.e. the present). When the starting value is 0, the parameter will stay at 0 throughout the analysis, since the scaling operator just multiplies/divides the value.

REFERENCES

SAMPLED ANCESTOR TREES IN BEAST
1 September 2014 by Remco Bouckaert
http://blog.beast2.org/2014/09/02/sampled-ancestor-trees-in-beast/

Bayesian inference of sampled ancestor trees for epidemiology and fossil calibration
Alexandra Gavryushkina, David Welch, Tanja Stadler, Alexei Drummond
http://arxiv.org/abs/1406.4573

BDSKY tutorial
http://bdssm-beast2.googlecode.com/files/bdsky_tutorial.pdf

And everything by Tanja Stadler

NOTES

- Make sure the originTime is substantially older than the root of your starting tree (or your oldest dating constraint). It is a nuisance parameter that imagines the sampling process starting on the branch below the root; if originTime is too close to the present you will get -Inf log-likelihood and the run will crash.

- Make sure you think about the rho parameter, which represents the proportion of living species which you actually have in the tree. Typically this is a fixed parameter — work by Tanya Stadler states that if rho is free as well as birthRate etc., the parameters become non-identifiable.

- I have not implemented plain-old Birth-Death, Yule priors, etc., but these can be easily generated by setting some BDSKY parameters to 0.

- It appears that anything making use of the time-bins / Skyline feature of the BDSKY prior will crash Beast2.2.0 / Beast2.2.1 (I posted a question on beast-users, but no luck so far: https://groups.google.com/forum/#!topic/beast-users/uuZvsjTGPus ). I am therefore sticking with Beast2.1.3 (and also, Beast2.1.3 is currently what is available on CIPRES).

LW12_SABDSKY.zip: Sampled Ancestor BDSKY process

Here, I have changed the BDSKY tree prior to SABDSKY — a Sampled Ancestor, Birth-Death Skyline with Serial Sampling process (phew!).

The cool thing about SABDSKY is that it allows sampling of trees where a fossil might be a direct ancestor (i.e., a zero-length branch).

This process was promoted in the "fossilized birth-death process" work of e.g. Tracy Heath and John Huelsenbeck and others. The original concept used the fossils, their dates, and information about what clade they are in, but not characters. But, in Beast2, we can add characters.

Again, this is all quite new and the success / utility / etc. of the available implementation has yet to be worked out in any detail. Subjectively, I have noticed that dates can change a lot with the SABDSKY prior. This seems good if you think there is a high probability you have direct ancestors amongst your fossils (perhaps with e.g. well-sampled large mammals), but bad if you don't (almost everything else).

It is also a totally open question of whether the standard summary procedures (e.g. MCC trees) and display programs (e.g. FigTree) will be able to represent the result well in the hypothetical case where there is a high probability of a fossil being a direct ancestor. Think critically about your tools and what you are doing at all times!

REFERENCES

SAMPLED ANCESTOR TREES IN BEAST
1 September 2014 by Remco Bouckaert
http://blog.beast2.org/2014/09/02/sampled-ancestor-trees-in-beast/

Bayesian inference of sampled ancestor trees for epidemiology and fossil calibration
Alexandra Gavryushkina, David Welch, Tanja Stadler, Alexei Drummond
http://arxiv.org/abs/1406.4573

BDSKY tutorial
http://bdssm-beast2.googlecode.com/files/bdsky_tutorial.pdf

And everything by Tanja Stadler

LW12_BDSKY_plus_continuous.zip: Adding continuous characters

This shows the setup for a BDSKY model, with discrete morphological data and 2 continuous characters. As described in:

"_readme_LW12_plus_fakeDNA_AAs_plusContinuous.txt"

…getting the continuous characters to work is nontrivial, and requires non-hard priors. And, when I ran a simulation test, I had the disturbing result that running with 100 continuous traits did much worse than running with 10 traits. I don't know what the issue is, and it would take some serious time to figure out, and it there might be something Fundamentally Bad about combining the continuous-trait and trees-with-fossils code as they currently exist in Beast2. So this should have a great big CAUTION warning all over it at the moment.

LW12_SABDSKY_plus_continuous_DOESNT_WORK.zip: SABSKY+continuous data doesn't work

It appears that combining continuous characters (which involves initializing a tree and then copying it) and the SABDSKY tree process (which has its own special tree structures) cannot be done at the moment. At least, I couldn't figure it out. I haven't bugged the beast-users group about it, as I feel like I'm pushing it already as both should be regarded as experimental by themselves, let alone combined.

But, here are the error messages, for the sake of Google-ability, and if anyone else wants to try playing with the XML further:

Attempt #1:

<tree id="shared_tree" name="stateNode" taxonset="@list_of_OTUs"/>

======================================================
Warning: state contains a node birthRateChangeTimes_tops for which there is no operator.
Warning: state contains a node deathRateChangeTimes_tops for which there is no operator.
Warning: state contains a node samplingRateChangeTimes_tops for which there is no operator.
Warning: state contains a node rho for which there is no operator.
Warning: state contains a node rhoSamplingTimes for which there is no operator.
Warning: state contains a node reverseTimeArrays for which there is no operator.
Start likelihood: -3.958409033737448E10 
Warning: Overwriting file traceLog.txt
         Sample      posterior          prior     likelihood logP(mr__LCA)) mrcatim_p_LCA) monophy_group) mrcatim_group) ucldMea__clock  morph_relRate seqs_mo_lihood seqs_mo_lihood seqs_mo_lihood seqs_mo_lihood BDSKY_p_d_tree     originTime     birthRate1     birthRate2     deathRate1     deathRate2  samplingRate1  samplingRate2
Warning: Overwriting file treeLog.txt
Warning: Overwriting file subsLog.txt
              0 -3.95840903E10 -2.398974346E7 -3.95601005E10        -5.5377       150.5453              0       150.5453         0.01           1.0        -4482.1041     -1404.9168      -411.2176       -33.2041      -855.8409       500.0            0.3            0.3            0.29           0.29           0.001          0.0    --
java.lang.ClassCastException: beast.evolution.tree.Tree cannot be cast to beast.evolution.tree.ZeroBranchSATree
    at beast.evolution.operators.UniformForZeroBranchSATrees.proposal(Unknown Source)
    at beast.core.Operator.proposal(Unknown Source)
    at beast.core.MCMC.doLoop(Unknown Source)
    at beast.core.MCMC.run(Unknown Source)
    at beast.app.BeastMCMC.run(Unknown Source)
    at beast.app.beastapp.BeastMain.<init>(Unknown Source)
    at beast.app.beastapp.BeastMain.main(Unknown Source)

Attempt #2:

<tree id="shared_tree" name="stateNode" taxonset="@list_of_OTUs" spec="beast.evolution.tree.ZeroBranchSATree"/>

======================================================
Warning: state contains a node birthRateChangeTimes_tops for which there is no operator.
Warning: state contains a node deathRateChangeTimes_tops for which there is no operator.
Warning: state contains a node samplingRateChangeTimes_tops for which there is no operator.
Warning: state contains a node rho for which there is no operator.
Warning: state contains a node rhoSamplingTimes for which there is no operator.
Warning: state contains a node reverseTimeArrays for which there is no operator.
Start likelihood: -3.958409033737448E10 
Warning: Overwriting file traceLog.txt
         Sample      posterior          prior     likelihood logP(mr__LCA)) mrcatim_p_LCA) monophy_group) mrcatim_group) ucldMea__clock  morph_relRate seqs_mo_lihood seqs_mo_lihood seqs_mo_lihood seqs_mo_lihood BDSKY_p_d_tree     originTime     birthRate1     birthRate2     deathRate1     deathRate2  samplingRate1  samplingRate2
Warning: Overwriting file treeLog.txt
Warning: Overwriting file subsLog.txt
              0 -3.95840903E10 -2.398974346E7 -3.95601005E10        -5.5377       150.5453              0       150.5453         0.01           1.0        -4482.1041     -1404.9168      -411.2176       -33.2041      -855.8409       500.0            0.3            0.3            0.29           0.29           0.001          0.0    --
java.lang.ClassCastException: beast.evolution.tree.Node cannot be cast to beast.evolution.tree.ZeroBranchSANode
    at beast.evolution.tree.ZeroBranchSATree.getDirectAncestorNodeCount(Unknown Source)
    at beast.evolution.operators.UniformForZeroBranchSATrees.proposal(Unknown Source)
    at beast.core.Operator.proposal(Unknown Source)
    at beast.core.MCMC.doLoop(Unknown Source)
    at beast.core.MCMC.run(Unknown Source)
    at beast.app.BeastMCMC.run(Unknown Source)
    at beast.app.beastapp.BeastMain.<init>(Unknown Source)
    at beast.app.beastapp.BeastMain.main(Unknown Source)

LW12_plus_fakeDNA_AAs.zip - Adding DNA and/or Amino Acids (AAs)

Here, I have added some (fake) DNA and AA NEXUS files. I include the XML for these runs:

discrete morphology + DNA under JC69
discrete morphology + DNA under HKY
discrete morphology + DNA under GTR
discrete morphology + DNA under GTR + Amino Acids (AAs) under the WAG model

The other Amino Acid models should work just as well (all the XML does is switch the name), but I haven't tested them at all. The "ratesQ" option should let users input a manually-specified AA rate matrix, but I haven't tested this either.

These datasets are all added by

- changing the "use" column to "yes" for the relevant rows in the worksheet "data"
- changing the "model" column to "GTR" (or whatever) for the relevant rows in the worksheet "data"

Of course, you will have to make sure your start position (startchar), end position (endchar), filenames, etc., are correct.

Some examples of how to specify partitions are show in the worksheet "data", although these were designed for an unpublished dataset, so you will have to change them to experiment with them on your own dataset.

LW12_plus_fakeDNA_AAs_plusContinuous.zip — All 4 data types

ADDING CONTINUOUS DATA TO A TIP-DATING ANALYSIS

Here, I have turned on some fake continuous traits in addition to the discrete morphology, fake DNA, and fake amino acids. The fake traits are specified in the "traits" column.

Unfortunately, continuous traits currently require that a tree be initialized in the XML and then copied. See discussion with Remco here:
https://groups.google.com/d/msg/beast-users/7KFzWd4PVeQ/cpAdmyIwUhMJ

Presumably this will be improved on in future versions of Beast2, so I have not worried too much beyond getting it to physically run.

EDITS TO THE EXCEL SPREADSHEET

1. In the "data" worksheet, I set "use" to "yes" for row #11. As the word "traits" is there, the script will look in the worksheet called "traits". A tab-delimited file could have also been used.

NOTE: For whatever reason, it appears that the initial tree is random, and thus any hard constraints (monophyly requirements, or uniform priors on ages) will work poorly.

To get it running, therefore, in the Excel settings file I:

2. nodes worksheet: changed "ingroup" monophyly column ("mono") to "no".

3. "OTUs" worksheet: I kept all the Uniform priors, but I set the "convert_to_normal" column to "yes". This column overrides whatever other priors have been specified, and converts the prior into a fairly tight normal distribution with the same mean.

The "convert_to_normal" column can be EXTREMELY helpful for getting an analysis running, either to get a starting tree, or here, when you are forced to use a random starting tree but still want the prior information on the node dates. This is much better than manually editing dozens of dating priors, when the uniform distributions you originally painstakingly entered cause some problem for the starting tree!

There is also a "convert_to_normal" column in the "nodes" worksheet, for the same reasons.

Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License