Beastmaster
BeastmasteR_cover.jpg

The purpose of "BEASTmasteR" is to convert NEXUS data file(s) (DNA, amino acids, discrete morphological characters, and/or continuous traits), plus an Excel settings file, into Beast2 XML format.

The author is Nicholas J. Matzke, ORCID: http://orcid.org/0000-0002-8698-7656 .

BEASTmasteR is primarily aimed at enabling tip-dating analyses using fossils as dated terminal taxa. The Birth-Death-Skyline-Serial-Sampling tree model is used (or the Sampled-Ancestors variant), enabling different birth, death, and sampling rates through time. However, it may be useful for setting up molecular-only analyses (particularly automated variants), for learning Beast2 XML (since the BEASTmasteR XML output is annotated and much easier for human reading than the BEAUTi output), or for automating such analyses. BEASTmasteR also contains functions for parsing Beast2 output, e.g. plotting a dated tree in R, with posterior probabilities, 95% HPDs on node dates, and also 95% HPDs on tip dates, if available.

If by chance you somehow missed the movie reference, I can recommend this review (warning: some profanity, provoked by various ludicrous scenes) and the original preview on YouTube.

Publications using/citing BEASTmasteR

(link to this section)

As of:

2016-01-25: 3 publications on Google Scholar
2016-02-02: found a fourth

Citation

(link to this section)

BEASTmasteR will eventually become an R package and have a publication associated with it. Until then, please cite:

Matzke, Nicholas J. (2015). "BEASTmasteR: R tools for automated conversion of NEXUS data to BEAST2 XML format, for fossil tip-dating and other uses." Instructions at PhyloWiki, http://phylo.wikidot.com/beastmaster. Accessed (access date).

Matzke, Nicholas J. (2015). BEASTmasteR code archive. Github: https://github.com/nmatzke/BEASTmasteR . Accessed (access date). Release: XX. DOI: XX.

Ideally, there will be a release with a DOI, but it may be you just use the most up-to-date commit. E.g. pre-release 0.1 is at https://github.com/nmatzke/BEASTmasteR/releases/tag/0.1 , and has DOI http://dx.doi.org/10.5281/zenodo.31927 .

Some of the functions in "tree_utils_v1.R", e.g. read_beast_prt, used for extracting the bracketed statistics from BEAST NEXUS tree files (MCC files) are copied/lightly modified/heavily modified from the R package " phyloch", by Christoph Heibl, licensed under GPL (>=2), available at: http://www.christophheibl.de/Rpackages.html. Please cite phyloch also, also if you use this feature.

See Also

(link to this section)

Introduction

(link to this section)

BEASTmasteR was originally aimed at easing my own tip-dating work in Beast2, and at the participants of the Tip-Dating Workshop and Putting Fossils in Trees Symposium at the 2014 meeting of the Society for Vertebrate Paleontology, in Berlin.

I am happy for BEASTmasteR to be a resource for the phylogenetics community, however. Of course, I cannot promise that I can give everyone outside a workshop detailed help — producing an R package, documentation, and a publication about BEASTmasteR would have priority, as would help about my main package, BioGeoBEARS. If you want any serious help/feature additions with BEASTmasteR, for me, given my current commitments, that feels more like a coauthorship-type situation than a generic-free-help situation.

Of course, identified bugs I will always try to fix or post work-arounds. And, feature additions can be suggested, and questions asked, and I will get to these when I can. All such matters are best posted to the BEASTmasteR Google Group. PLEASE don't be shy — it helps everyone to have questions/answers posted publicly. I will also answer questions from workshop participants on the BEASTmasteR Google Group.

Handy Features

(link to this section)

Some of the handy features of BEASTmasteR:

  • Excel settings file. All settings, priors, etc. are contained in an Excel file that is easy to save, copy, and edit.
    • This is particularly handy for complex analyses where you would like to copy/paste lists of OTUs, make group changes to age priors, etc.
  • Columns specifying "use". Very often, you would like to add or remove an OTU, a clade constraint, or a data partition, either for debugging or to see the effect. Rather than re-doing the entire Beast2 setup from scratch (a common situation in BEAUTi, although it depends on whether you can get saving/loading of settings to work in BEAUTi), with BEASTmasteR, you just change the relevant cell in the "use" column to "yes" or "no". This works in the "OTUs", "nodes", and "data" worksheets.
  • Auto-change node-age and tip-age to calibrations to tight "normals" for starting trees. The single most common issue users have with Beast2 is getting the analysis to start. The most common problem is that there is some conflict between the user's starting tree (either randomly generated, or pre-specified) and the user's node age calibrations.
    • If your node-age calibration has "hard" edges (e.g., uniform(5,10)), and your starting tree has that node outside that range (e.g., age=11), then the prior probability of the tree is 0, ln(prob)=-Inf, and the analysis crashes. The solution is to change your age calibration to something like normal(7.5, 0.0075). This has no hard edges, so any starting tree will work. And, the distribution is quite tight, so the analysis should soon converge to a tree with the node age within your original prior range of uniform(5,10).
    • Once this has occurred, you can take one of the sampled posterior trees, and use this as a new starting tree. You can then switch back to your originally-intended prior distributions, with their hard edges, and (hopefully) everything now works.
    • Changing age priors to tight normals, and back. Change the relevant cell in the "convert_to_normal" column in your Excel to "yes". BEASTmasteR will then take your original prior and calculate a tight normal distribution within it. Switching to "no" (or leaving blank) means that your original prior will be used.
    • You could do all of the above by manually changing all of the priors in BEAUTi or via direct editing of the XML, but it was a huge pain.
  • Data types. BEASTmasteR will now read & process:
    • discrete morphology (unordered and ordered, Mk and Mkv),
    • DNA (JC, HKY, GTR),
    • amino acids (all Beast2 models), and
    • continuous data (Brownian motion; experimental, but seems to work for BDSKY; Beast2 seems to crash if continuous data are combined with SABD, however).
    • Different data types should be in different source NEXUS files.
    • Continuous data can be put in the "traits" worksheet or in a tab-delimited text file. Currently, each continuous character is independent and gets its own rate (but is still subject to the treewide relaxed clock). Continuous traits are *very* experimental; but thanks to Remco for help getting them to work at all.
  • Plots of Beast2 MCC trees in R, with node-age and tip-age uncertainty.
  • BDSS, BDSKY, and SABD tree priors made easy. An crucial feature of tip-dating is the tree prior. Beast1 tip-dating analyses originally used inappropriate priors (Yule or BD) as that was all that was available. Beast1 later added BDSS, but is unlikely to add anything else. MrBayes originally had just the "uniform tree prior", which many users have found highly problematic. Beast2 was the first to implement time-constant and time-stratified ("skyline") versions of BDSS and SABD. However, their BEAUTi implementation was set up for pathogen epidemiology analyses, so fossil analyses could only be implemented by hand-editing of the XML, including several cryptically-documented settings.
    • In BEASTmasteR, the choice of prior (BDSKY or SABD) is made in the "treemodels" worksheet of the Excel settings file. If the birthRate, deathRate, or samplingRate are going to change in different time-bins, this is accomplished by adding new lines to the treemodels worksheet.
    • BDSKY = Birth-Death with Serial Sampling, with rate parameters changing in "Skyline" fashion (in discrete time-bins).
    • SABDSKY = Sampled Ancestor, Birth-Death with serial sampling, with rate parameters changing in "Skyline" fashion (in discrete time-bins).
    • BDSS = Birth-Death with Serial Sampling — this is just BDSKY when you use only one time-bin.
    • Note "rho" parameter carefully!! The parameter "rho" is sampling proportion in the present. It is crucial for BDSKY/SABDSKY-type analyses. Make sure you think about this and put in a reasonable value. If you have all currently-living species in your clade present in your data matrix, then rho=1, but if not, put something else.
      • The "rho" parameter is located all the way at the right edge of the "treemodels" worksheet, so it is easy to miss. But don't!
  • My goal is to track the version of Beast2 available on CIPRES. Currently this is version 2.1.3.

Updates

(link to this section)

BEASTmasteR on GitHub

(link to this section)

2015-10-08: The BEASTmasteR code has been moved to GitHub (https://github.com/nmatzke/BEASTmasteR ). Look there for the newest code. I will continue to maintain instructions & updates at PhyloWiki. To source the newest code at any point, use these functions:

# Online (you can also download each and source locally)
library(devtools,httr)
source_gist("https://gist.github.com/nmatzke/b2dbf78532eca734881a")
sources = sourceall_git(repo="nmatzke/BEASTmasteR")

Better "use" columns

(link to this section)

I have improved the treatment of the "use" columns in the "OTUs", "taxa", and "data" worksheets of the Excel settings file. They all should work together now — e.g., you can exclude a species from the analysis by changing "use" to "no" in the OTUs worksheet, and this should be also be removed from the XML clade constraints (and if a clade constraint has no used tips, the whole constraint is left out), and also removed from the DNA/morphology output to the XML.

The point of this, of course, is that you don't have to continually edit and track different NEXUS files, XML files, clade constraints in BEAUTi, etc., as you add/remove taxa. This can be particularly important e.g. when you are building a fossils + DNA analysis, where the taxonomy is often in flux, you or collaborators may discover/recognize new taxa from the literature, etc. It would be a huge pain to re-input everything via BEAUTi every time this happened, but with BEASTmasteR, you should be able to just change the "use" cell for the OTU and have it pop in or out of the analysis.

Tip-dating plots in R: showing uncertainty in tip dates and node dates

(link to this section)

FigTree is a great, fantastic program, but if you do enough BEAST analyses, you will eventually become tired of having to click all of the options every time you load up a new MCC tree. 99% of the time, all you want is:

  • The MCC tree
  • With tip labels that you can read
  • With a reasonable timescale at the bottom, large enough to read
  • With blue node bars showing the 95% HPD of the node dates
  • With posterior probabilities between 0-1 plotted above the branches. With 2 digits (who needs more??)
  • In a PDF for sharing, archiving, or publication
  • Ideally, you could plot the 95% HPD of the tip dates, also, when the fossil tip dates are uncertain.

I have combined all of these in the new function plotMCC, which you can use by source()-ing the script file plotBEAST_v1.R. The resulting tree looks like this:

LW12_treeLog.mcc_tipdates_node_bars_and_tip_bars.png

Figure: Example plot of an MCC tree resulting from a tip-dating analysis of the Lee & Worthy (2012) theropods character matrix, using the plotMCC function.

The tutorial script is here. The zipped directory containing all the inputs is here: http://phylo.wikidot.com/local--files/beastmaster/LW12.zip — see the example_outputs subfolder.

Updates 2015-03-16 and before

(link to this section)

2015-03-16: I have done a pretty massive update to the BEASTmasteR code (see "Files" at the bottom of this page), fixing some bugs, improving error checking (although it's impossible to think of everything), and adding many new features that I, at least, find interesting.

  • Fixed a bug in read_nexus_data2() and parse_datasets() involving the handling of ambiguous characters (e.g. (0 1)) in morphology datasets. You should still spot-check yourself to see if it is being done right with your data, there's always a new weird feature in someone's "simplified-NEXUS"-formatted file I haven't dealt with yet!
  • The Excel settings file now allows the user to set the prior distributions on birthRate, deathRate, samplingRate, clockrate, clockSD, etc. Tip dates can also be uncertain and can be fixed, or be given prior distributions. Node priors work as before. Relative priors (relpriors) can also be experimented with (e.g. the cross-calibration method of Shih & Matzke 2013, PNAS), but this is an advanced topic.
  • In most cases, the priors available are Uniform, Normal, Exponential, and LogNormal. If you need something else, you can edit the XML code manually — accounting for all possible priors at all possible points in the XML is not my goal.
  • BEASTmasteR will now read & process: discrete morphology (unordered and ordered, Mk and Mkv), DNA (JC, HKY, GTR), amino acids (all Beast2 models), and continuous data (Brownian motion). Different data types should be in different source NEXUS files. Continuous data can be put in the "traits" worksheet or in a tab-delimited text file. Currently, each continuous character is independent and gets its own rate (but is still subject to the treewide relaxed clock). Continuous traits are *very* experimental; but thanks to Remco for help getting them to work at all.
  • Most worksheets have a "use" column, allowing you to keep settings, filenames, partitions, OTUs, node calibrations, etc. in the worksheet without using them by specifying "no". Typically "yes" is the default if blank.
  • Some completeness statistics on your morphology dataset
  • Some experimentation with counting fossils in various timebins (e.g. Geological Time Scale 2012 periods)
  • CAUTIONS: I have pushed this about as far as I would like it to go, given the current Beast2 models etc., and my own research goals. This iteration of BEASTmasteR allows pretty complex setups, thus also increasing the potential for disaster. Also, I am one coder who spent one particularly un-fun weekend hashing this out. NO WARRANTY. Testing etc. is limited, let alone the scientific issues involved in tip-dating (mostly unexplored) even if everything is set up correctly in a tip-dating analysis. So, remember: (1) It is your responsibility to *think* about what you are doing at all steps of your analysis, e.g. converting to XML and making sure your data, ideas, etc. are being converted accurately, as well as modeling decisions, validating models, inspecting runs for convergence etc. (2) Making intelligent decisions about the priors (especially the tree model parameters, and the clock model parameters) is one of the most important steps in getting tip-dating analyses that give reasonable results. Defaults may not work well for a given dataset! You need to understand the models and make a deliberate decision! (3) The goal of BEASTmasteR is just to ease the process of tip-dating in Beast2, it is not to solve all problems. (4) It is best to seek help, post bugs, etc. on the BEASTmasteR Google Group. I will do the best I can to help, but my priority at this point in my postdoc is authoring/coauthoring publications rather than user support :-).
  • Twitter announcement: https://twitter.com/NickJMatzke/status/577600138392268801

BEASTmasteR strategy

(link to this section)

Writing an XML file that will actually function in Beast2 is a complex task. Doing it by hand is incredibly tedious. Even doing it with BEAUTI, despite many useful BEAUTI features, can cause numerous headaches. (These include, e.g., not saving your setup; manually entering clades, dates, etc; bewildering variety of options; lack of features needed for fossil tip-dating; and occasional bugs. This is not to criticize the authors of BEAUTI! It is just a very difficult task to write a general-purpose XML writer that will cover everything a user might want, and do so in the face of continuing evolution of needs, Beast code base, packages, etc.).

Writing a generic R program to do EVERYTHING would also be very diffcult. However, for tip-dating, we don't need to do everything — we just need an analysis with a shared tree, a shared relaxed clock with relative rates for each partition, some common character/DNA evolution models, the ability to input tip dates and node dates, and other basic features.

This is the goal of BEASTmasteR. The strategy adopted here is:

  1. Keep the sequence / character data in the original format, NEXUS data files — NOTE: THESE SHOULD BE SIMPLIFIED NEXUS, AS EXPORTED BY MESQUITE AND READ BY E.G. MRBAYES.
  2. Have the run settings, references to files, model parameters, etc., all specified in an Excel file, by default named settings_v1.xlsx
  3. Have an R script that reads the Excel file, then reads in the data specified within the Excel file, and writes out the XML.

The strategy for step 3 is to maintain a list of XML sections (by default called "xml"). These correspond to the major sections of a Beast2 XML input.

To see the major sections, type into R:

xml = setup_master_XML()
print_master_XML_section_headers(xml)

You should see:

  XML SECTION 1: Header and Beast2 Java class references / mappings 
  XML SECTION 2: Taxa and clade definitions (and tip-dates if desired) 
  XML SECTION 3: Sequence alignments (e.g. DNA, morphology); filtered in later section to produce partitions 
  XML SECTION 4: Partitions (filters applied to the sequence alignments produce partitions) 
  XML SECTION 5: Miscellaneous                                         
  XML SECTION 6: Site Models (sequence/morphology evolution models for each partition) 
  XML SECTION 7: Shared clock model (strict, relaxed, etc.) 
  XML SECTION 8: Starting tree (random, user-fixed, etc.) 
  XML SECTION 9: Shared tree model (Yule, Birth-Death (BD), Birth-Death Skyline (BDSKY), etc.) 
  XML SECTION 10: MCMC states: the state of the MCMC chain saved at various timepoints 
  XML SECTION 11: Priors 
  XML SECTION 12: Likelihoods 
  XML SECTION 13: Operators. These specify how the parameter values can change at MCMC time-steps. 
  XML SECTION 14: Trace log. This logs the parameters in a .log file, which may be viewed in Tracer. 
  XML SECTION 15: Screen log. Which parameters print to screen, and how often? 
  XML SECTION 16: Tree log. How often should trees from the MCMC chain be saved, and what data should they have?

Typing:

names(xml)

…will show you the names of each item of the list:

"header" "taxa" "sequences" "partitions" "misc"
"sitemodels" "clock" "starting_tree" "tree" "state"        
"priors" "likes" "operators" "tracelog" "screenlog" "treelog"

The commands in the example script add XML to each of these sections. At the end, the parse_run() command takes all of the XML items and outputs them to an XML file, with some reasonably nice formatting, spacing, and comments (something lacking in BEAUTI output!).

Note that this means that if you re-run commands without starting over, you may well duplicate some XML sections, which will cause Beast2 errors.

Validation and testing

(link to this section)

Beast2 XML has a great many moving parts, and it is Very Nontrivial to have an R script correctly read in the dataset and output correct XML that will actually run in Beast2. I usually consider it an achievement when I get BEASTmasteR working on my own dataset. But, the code has now advanced to the point where it is good to have some simple validation examples in addition to the example script on the main page. I am putting these in: BEASTmasteR test files.

BEASTmasteR workshop setup

(link to this section)

#######################################################
# BEASTmasteR EXAMPLE SETUP
#######################################################
# One-time actions for BEASTmasteR:
#######################################################
#
# 1. Install R on your computer, preferably the newest release: http://www.r-project.org/
#
#   Alternatively, RStudio Desktop is popular: http://www.rstudio.com/products/RStudio/#Desk 
#
# 2. If you are totally new to R, spend a few minutes learning the basics:
#
#    Absolute basics
#    http://www.r-tutor.com/r-introduction
#    
#    R and phylogenetics, starting from scratch
#    http://phylo.wikidot.com/introduction-to-r
# 
# 3. Install some R packages we will use:
# 
#    install.packages("ape")    # for phylogenetics
#    install.packages("XML")    # for reading/writing XML
#    install.packages("gdata")    # for trim
#    install.packages("XLConnect") # for readWorksheetFromFile
#    install.packages(c("devtools", "httr")) # for reading from GitHub and Gist
#
# 4. After issues with gdata's read.xls and the Perl that it uses, I have switched to
#     the R package XLConnect. It seems to work much better.
# 
# 5. Download and unzip the example data (TAKE NOTE of 
#    WHERE YOU DO THIS, AS THIS WILL BE YOUR WORKING
#    DIRECTORY, aka "wd")
#
#    Example data is in the zipfiles in "Files", at the bottom of:
#    http://phylo.wikidot.com/beastmaster
#
#######################################################
# One-time actions for Beast2:
#######################################################
#
# 6. After BEASTmasteR has produced an XML file to run in 
#    Beast2, you will presumably want to run it! So, 
#    download and install Beast2: http://beast2.org/
# 
#    On Macs, it will install into something like:
# 
#    /Applications/BEAST 2.1.3/
#
#    I recommend that you manually remove spaces: 
#    "BEAST 2.1.3" to "BEAST_2.1.3"
#
# 7. Once you've installed BEAST (which also installs BEAUTI),
#    open BEAUTI and go to File->Manage Packages. Install 
#    the following add-on packages, one at a time, in this 
#    order: BEASTlabs, BDSKY, SA, 
#    phylodynamics, CA. It may be we only need BDSKY for 
#    this lab, but the others may be useful for fossil tip-dating
#    in the future.
#
# NOTE: I've taken out BEAST_CLASSIC, it seems to cause problems
# with some versions of Java, and I think we don't
# strictly need it.
#   
# 8. You will also want to download/install:
#    Tracer: http://tree.bio.ed.ac.uk/software/tracer/
#    FigTree: http://tree.bio.ed.ac.uk/software/figtree/
#
# 9. GET A GOOD PLAIN-TEXT EDITOR. Your life, and mine, 
#    will be better if you do this.
#    
#    "Plain-text" means no formatting, nothing but ASCII
#    text. The best ones will recognize R, XML, etc.,
#    and color in the text according to the type 
#    (comments, functions, strings, etc.)
#
#    GOOD:
#    Mac: BBedit, or the free version, TextWrangler
#    Windows: Notepad++ http://notepad-plus-plus.org/
#    Both: RStudio's text editor
#
#    OK:
#    Windows: default Notepad (free)
#
#    BAD: Word, Wordpad, Pages, and other "word processors". 
#         These are for writing, not coding.
#
# 10. The NSF's XSEDE ("exceed", https://www.xsede.org/) supercomputing resource has a phylogenetic portal named 
#     CIPRES (http://www.phylo.org). It can run large Beast2 jobs much faster than a desktop or laptop (and MrBayes 
#     and many other programs). We will demo a CIPRES run, but if you want to do a 
#     CIPRES run yourself, you should register for a free account at:
#     https://www.phylo.org/portal2/login!input.action
#  
#######################################################

#######################################################
# BEASTmasteR EXAMPLE SETUP: Setting the working directory
#######################################################
# Assuming you have done the setup, above, you are 
# ready to run the example script. I have set it up
# so that you can copy-and-paste these commands into
# the R window.
#
# The one command that YOU WILL NEED TO MODIFY is this one:
#
# wd = "something"
# 
# In this script, "wd" means "working directory".  Inside 
# the quotes, put the full path to wherever you unzipped 
# the data files.
#
# BEAST analyses produce many files, particularly as one
# tries different settings, priors, etc. It is easier to
# copy and rename directories than it is to rename all 
# of the input and output files in the scripts and XML.
#
# So, I recommend just changing "something" when you set
# up a new analysis, and leave everything else the same.
########################################################}}

Example: Lee & Worthy (2012), Archaeopteryx and relatives

(link to this section)

Here, we will run a tip-dating analysis using the data from:

Lee, Michael S.Y.; Worthy, Trevor H. (2012). "Likelihood reinstates Archaeopteryx as a primitive bird." Biology Letters, 8(2), 299-303. Published online: October 26, 2011. http://dx.doi.org/10.1098/rsbl.2011.0884 http://rsbl.royalsocietypublishing.org/content/8/2/299

Before running the script:

  1. Make sure you did the one-time setup steps, above.
  2. Get the data and Excel settings file: download and unzip: http://phylo.wikidot.com/local--files/beastmaster/LW12.zip
  3. Note the path of the unzipped directory. You will need to paste this in, in place of the default working directory (wd) below.

Example BEASTmasteR script is below. Copy-and-paste the script into R, a few lines at a time. You can also paste the script into a text editor for further editing / note-taking / experimentation. (The file "make_XML_outline_v1.R" in the zipfile duplicates the script below.)

# Load the R libraries for dealing with phylogenies & XML
library(XML)
library(ape)   # for read.tree
library(gdata) # for trim
library(XLConnect)    # for readWorksheetFromFile (seems to work better than read.xls)
library(devtools, httr) # for GitHub & Gist

#####################################################
# Source BEASTmasteR code via the internet 
# (You could also save the R files locally and source() their locations on your hard disk. 
# This will be especially handy if your internet sucks. I have archived the R code in a 
# dated zipfile, see "Files" at the bottom of the page):
#####################################################

# On Nick's development computer:
# sourceall("/GitHub/BEASTmasteR/")

# Online (you can also download each and source locally)
library(devtools,httr)
source_gist("https://gist.github.com/nmatzke/b2dbf78532eca734881a")
sources = sourceall_git(repo="nmatzke/BEASTmasteR")

#######################################################
# CHANGE SCRIPT HERE: USE YOUR OWN WORKING DIRECTORY
# NOTE: Windows uses "\\" instead of "/"
#######################################################
#wd = "/drives/Dropbox/_njm/__packages/BEASTmasteR_permahelp/examples/ex_basic_venerid_morphDNA_v4/SABD_tipsVary_wOutg_v1/"

wd = "/GDrive/__github/BEASTmasteR/inst/extdata/LW12/"
wd = "~/Downloads/LW12"
setwd(wd)

# The name of the Excel settings file
xlsfn = "settings_v1.xlsx"

#######################################################
# Double-check your working directory with getwd()
#######################################################
getwd()

#######################################################
# Double-check that you have the right files with list.files()
#######################################################
list.files()

#######################################################
# YOU SHOULD BE ABLE TO CUT-N-PASTE CODE FROM HERE DOWN
# (But, do it line-by-line, so you can see what is going on!)
#######################################################

#######################################################
# Required change to R settings (run every time)
#######################################################
# Turn off R's default stringsAsFactors silliness
options(stringsAsFactors = FALSE)

# When you have a starting tree named tree.newick, you could view it by un-commenting the code below:
# trfn = "tree.newick"
# tr = read.tree(trfn)
# tr_table = prt(tr)
# tr_table[,c("label","time_bp")]
# names(tr_table)

#######################################################
# Setup the overall xml structure
#######################################################
xml = setup_master_XML()
print_master_XML_section_headers(xml)

tree_name = "shared_tree"
clockModel_name = "shared_clock"

# XML SECTION 1: Header and Beast2 Java class references / mappings 
# Done below

# XML SECTION 2: Taxa and clade definitions (and tip-dates if desired) 
#######################################################
# Get the OTUs from the Excel settings file
#######################################################
OTUs_df = readWorksheetFromFile(file=xlsfn, sheet="OTUs", startRow=15, startCol=1, header=TRUE)
head(OTUs_df)

# Remove OTUs with "no" in "use" column of "OTUs" worksheet
OTUs_df$use[isblank_TF(OTUs_df$use)] = "yes"
keepTF = OTUs_df$use != "no"
OTUs_df = OTUs_df[keepTF,]
OTUs = OTUs_df$OTUs
OTUs

#######################################################
# Analyze tipDates for gaps in sampling (experimental, just to look at)
#######################################################
# GTS2012 = 2012 Geological Time Scale
gts2012 = get_GTS2012(quiet=TRUE)
head(gts2012)

check_tipdates(OTUs_df)

bintops=seq(0,200,by=10)
find_sampling_gaps_from_tipdates(OTUs_df, bintops=bintops)

bintops = gts2012$tops[gts2012$tops < 200]
find_sampling_gaps_from_tipdates(OTUs_df, bintops=bintops, collapse=TRUE)

# Conversion of nonuniform date priors to (approximate 95%) uniform prior
convert_nonUniform_dates_to_uniform(OTUs_df, CI=0.999)

# Conversion of dates to a reasonable normal (might be useful
# for e.g. a starting tree)
convert_dates_to_normal(OTUs_df)

# The "count_overlap" column is really a sum( max within each sub-bin )
counts_bybin2 = find_bins_sampled_from_tipdate_ranges(OTUs_df, bintops=seq(0,200,10), bot_of_bins=1000, collapse="count_overlap", psiSamplingStopsAt=0.01, min_precision=0.1)
counts_bybin2

# You can see that you get somewhat different timebins when collapsing with tipdates
counts_bybin2 = find_bins_sampled_from_tipdate_ranges(OTUs_df, bintops=seq(0,200,10), bot_of_bins=1000, collapse="count_tipdates", psiSamplingStopsAt=0.01, min_precision=0.1)
counts_bybin2

# E.g. you might chose these bin tops:
counts_bybin2$tops

# Look at your sampling according to 
# Geological Time Scale 2012 (GTS2012) bins
# Since, usually, fossil dates are assigned by bin
# (tight radiometric dates are typically rare)
# Note that your fossil date range can depend on
# what the GTS bins were when it was published!

# Examples
# Count tipdates by bin, AND collapse the timebins into groups that do, 
# or don't, have the starting point tipdates (0 versus some sampling)
# This gives you blocks with or without sampling
counts_bybin2 = find_bins_sampled_from_tipdate_ranges(OTUs_df, bintops=bintops, bot_of_bins=1000, collapse="count_tipdates", psiSamplingStopsAt=0.01, min_precision=0.1)
rownums_bintops = match(counts_bybin2$tops, gts2012$tops)
rownums_binbots = match(counts_bybin2$bots, gts2012$tops) - 1
if (is.na(rownums_bintops[1]))
    {
    rownums_bintops[1] = 1
    } # END if (is.na(rownums_bintops[1]))
rownums_binbots
rownames_bintops = paste(gts2012$Series.Epoch[rownums_bintops], gts2012$Stage.Age[rownums_bintops], sep="/")
rownames_binbots = paste(gts2012$Series.Epoch[rownums_binbots], gts2012$Stage.Age[rownums_binbots], sep="/")
rownames_binbots[length(rownames_binbots)] = "older"
row.names(counts_bybin2) = paste(rownames_bintops, rownames_binbots, sep=" -> ")
counts_bybin2

# Attempt to count starting tipdates, taking into 
# account their approximate time-ranges (hard!)
# The "count_overlap" column is really a sum( max within each sub-bin )
bintops = gts2012$tops[gts2012$tops < 200]
counts_bybin2 = find_bins_sampled_from_tipdate_ranges(OTUs_df, bintops=bintops, bot_of_bins=1000, collapse="count_overlap", psiSamplingStopsAt=0.01, min_precision=0.1)
rownums_bintops = match(counts_bybin2$tops, gts2012$tops)
rownums_binbots = match(counts_bybin2$bots, gts2012$tops) - 1
if (is.na(rownums_bintops[1]))
    {
    rownums_bintops[1] = 1
    } # END if (is.na(rownums_bintops[1]))
rownums_binbots
rownames_bintops = paste(gts2012$Series.Epoch[rownums_bintops], gts2012$Stage.Age[rownums_bintops], sep="/")
rownames_binbots = paste(gts2012$Series.Epoch[rownums_binbots], gts2012$Stage.Age[rownums_binbots], sep="/")
rownames_binbots[length(rownames_binbots)] = "older"
row.names(counts_bybin2) = paste(rownames_bintops, rownames_binbots, sep=" -> ")
counts_bybin2

# Counting starting tipdates in manual bins (collapse=FALSE)
bintops = c(0.01, 41.20, 83.60)
counts_bybin2 = find_bins_sampled_from_tipdate_ranges(OTUs_df, bintops=bintops, bot_of_bins=1000, collapse=FALSE, psiSamplingStopsAt=0.01, min_precision=0.1)
rownums_bintops = match(counts_bybin2$tops, gts2012$tops)
rownums_binbots = match(counts_bybin2$bots, gts2012$tops) - 1
if (is.na(rownums_bintops[1]))
    {
    rownums_bintops[1] = 1
    } # END if (is.na(rownums_bintops[1]))
rownums_binbots
rownames_bintops = paste(gts2012$Series.Epoch[rownums_bintops], gts2012$Stage.Age[rownums_bintops], sep="/")
rownames_binbots = paste(gts2012$Series.Epoch[rownums_binbots], gts2012$Stage.Age[rownums_binbots], sep="/")
rownames_binbots[length(rownames_binbots)] = "older"
row.names(counts_bybin2) = paste(rownames_bintops, rownames_binbots, sep=" -> ")
counts_bybin2
#######################################################
# END Analyze tipDates for gaps in sampling
#######################################################

#######################################################
# Write OTUs to ="list_of_OTUs"
#######################################################

tmpXML = make_XML_tipdate_priors(OTUs_df, min_precision=0.01, tree_name="shared_tree", xml=NULL)
tmpXML$priors
tmpXML$operators
tmpXML$tracelog
tmpXML$tipdatelog

XML_list_of_OTUs = make_XMLs_for_OTUs(OTUs, OTU_idref=FALSE)
list_of_OTUs_XML = make_XML_taxon_block(taxon_name="list_of_OTUs", XML_list_of_OTUs=XML_list_of_OTUs)
list_of_OTUs_XML

# Add to the list of taxa/clades (manually, as an example)
xml$taxa = c(xml$taxa, list_of_OTUs_XML)

#######################################################
# Add taxonSets additional clades/taxa specified in "taxa"
#######################################################
taxa_df = readWorksheetFromFile(file=xlsfn, sheet="taxa", startRow=15, startCol=1, header=TRUE)
head(taxa_df)

# Add the taxon groups (monophyletic clades and perhaps non-monophyletic groups of interest)
xml = make_taxa_groups(taxa_df, OTUs=OTUs, xml=xml)

# Add the genera if desired (for logging monophyly, if possible; if not, log ages)
#genera = get_genera_from_OTUs(OTUs, mintaxa=2, split="_")
#genera
# Don't add genera to XML (you would only do it if you were SURE they were monophyletic)
#xml = make_XML_for_genera_from_OTUs(xml, OTUs, mintaxa=2, split="_")
#xml

#######################################################
# Add priors for the node constraints
#######################################################
nodes_df = readWorksheetFromFile(file=xlsfn, sheet="nodes", startRow=15, startCol=1, header=TRUE)
head(nodes_df)

# Make the clade priors
xml = make_cladePrior_XMLs(nodes_df, xml=xml, list_of_empty_taxa=xml$list_of_empty_taxa)

# Add to the logs
xml = make_cladePrior_logs(nodes_df, xml=xml, tree_name=tree_name, trace_or_screen="both", list_of_empty_taxa=xml$list_of_empty_taxa)

# XML SECTION 3: Sequence alignments (e.g. DNA, morphology); filtered in later section to produce partitions 
seqs_df = readWorksheetFromFile(file=xlsfn, sheet="data", startRow=15, startCol=1, header=TRUE)
seqs_df = seqs_df[seqs_df$use == "yes", ]
head(seqs_df)

# See starting length of the XML
orig_length_xml = length(xml)

#######################################################
# Add priors on the stem ages of terminal branches, if any
# (specified in OTUs_df)
#######################################################
# # Default is no, so only pull out "yes"
# stem_ages_df = OTUs_df[OTUs_df$stem_make_age_prior=="yes",]
# for (i in 1:nrow(stem_ages_df))
#     {
#     dfline = stem_ages_df[i,]
#     param_name = paste0("stem_age_of_", dfline$OTUs)
#     stemXML = make_generic_XML_prior(dfline, colname_prefix="stem", param_name=param_name, header_scheme=1, stem=TRUE, distrib=NULL, param1=NULL, param2=NULL, tmp_offset=NULL, meanInRealSpace=NULL)
#     stemXML
#     }
# stemXML

#######################################################
# Add relative priors, if any
# (specified in OTUs_df and/or nodes_df)
#######################################################
xml = make_generic_difference_statistic(nodes_df=nodes_df, OTUs_df=OTUs_df, xml=xml)
xml

################################################################
# PARSING NEXUS FILES WITH parse_datasets()
################################################################
#
# NOTE: THIS FUNCTION WILL PRODUCE A LOT OF SCREEN
# OUTPUT, UNLESS YOU SET printall="none". THE SCREEN
# OUTPUT IS GOOD, IT SHOWS YOU ALL THE REFORMATTING
# OF THE CHARACTERS TO MAKE THEM ACCEPTABLE TO
# BEAST. 
# 
# E.g., a particular character must have its
# lowest state be state 0, it must not skip states
# (no characters with only states 0, 3, and 4), and
# the observed number of distinct character states
# must match the number of claimed character states.
# These are typically true for original datasets,
# but after researchers cut out taxa, re-code
# certain states, etc., sometimes it is no longer
# true. The script counts the observed character
# states, re-numbers them if needed, classifies them
# by number of character states, and then reformats
# for Beast2 (e.g., for a 2-state character, "?",
# "-", "(0 1)", "(1 0)", "{0 1}", and {1 0} would
# all be converted to "Z").
# 
# For your own data, you may discover yet more weird
# features of NEXUS data file formatting that I have
# not coded for. Email them to the
# beastmaster_package Google Group and I will try to
# fix them.
# 
# NEXUS files must be: (1) Simplified NEXUS format
# (as exported from Mesquite), and (2) the taxon
# names must have no spaces, quotes, or other weird
# characters. Use underscores ("_") instead.
# 
# Tracking characters in the NEXUS file: as a
# double-check, just to make sure the data is not
# being fundamentally changed, I have the script
# write into XML: (1) the original morphology data
# matrix, (2) the modified data matrix, and (3) the
# character numbers (original numbers or indices
# with each data section, depending) that have been
# placed into each morphology data section. This
# occurs because Beast2 requires that the 2-state,
# 3-state, 4-state, etc. characters each be coded in
# the XML in separate sections.
################################################################
# Run the parsing of NEXUS file(s)

# Save old xml
pre_parsing_xml = xml
# xml = pre_parsing_xml

# Once you've run parse_datasets successfully, you can
# change runslow=TRUE for future runs to save time
# (as long as you have data_XML.Rdata in the working
# directory).
runslow = TRUE
data_XML_fn = "data_XML.Rdata"
if (runslow)
    {
    data_XMLs = parse_datasets(seqs_df, xml=NULL, add_morphLength=TRUE, OTUs=OTUs, add_morphList=TRUE, printall="short", xlsfn=xlsfn, return_charsdf=TRUE)
    save(data_XMLs, file=data_XML_fn)
    } else {
    # Loads to "data_XML"
    load(file=data_XML_fn)
    }

# If you have problems, read the help paragraphs above, 
# carefully look at the messages printed to screen, the E R R O R
# messages, and carefully look at your input NEXUS file.
# 
# It can also help to run read_nexus_data2 by itself:
# E.g. get first used NEXUS file in 'data' worksheet:
#
# Un-comment to run:
# nexus_filename_to_read = seqs_df$filename[seqs_df$use==TRUE][1]
# read_nexus_data2(file=nexus_filename_to_read, check_ambig_chars=TRUE, convert_ambiguous_to=NULL, printall="short", convert_ambiguous_to_IUPAC=FALSE) 

# Show characters with <2 morphological states
data_XMLs$chars_wLT_2_states_df_list

# Show the stats
data_XMLs$morphstats

# Show the number of data matrices
length(data_XMLs$charsdf_list)
# Size of matrix #1
data_XMLs$charsdf_list[[1]]
data_XMLs$charsdf_list[[1]][1:5,1:5]
#charslist = read_nexus_data2(file=seqs_df$filename[1], check_ambig_chars=TRUE, convert_ambiguous_to=NULL, printall="short", convert_ambiguous_to_IUPAC=FALSE) 

#######################################################
# Continue with converting morphology
#######################################################
xml$sequences = c(xml$sequences, data_XMLs$data_XML)
xml$misc = c(xml$misc, data_XMLs$misc_XML)
xml$sitemodels = c(xml$sitemodels, data_XMLs$sitemodels_XML)
xml$state = c(xml$state, data_XMLs$state_XML)
xml$priors = c(xml$priors, data_XMLs$prior_XML)
xml$likes = c(xml$likes, data_XMLs$likes_XMLs)
xml$operators = c(xml$operators, data_XMLs$operator_XMLs)
xml$tracelog = c(xml$tracelog, data_XMLs$tracelog_XMLs)
xml$screenlog = c(xml$screenlog, data_XMLs$screenlog_XMLs)

xml$morphLengths = data_XMLs$morphLengths
xml$morphList = data_XMLs$morphList
xml$morphList 

# The number of (used!) characters
# of each morphology DATASET (not partition)
morphLengths = xml$morphLengths
morphLengths

# The number of (used!) characters in each morphology "partition" (section)
morphList = xml$morphList
sum(xml$morphList$numchars)
morphList

# Check that xml changed:
orig_length_xml
length(xml)

# XML SECTION 4: Partitions 
# (filters applied to the sequence alignments produce partitions)

# For non-morphology datasets, extract partitions from the alignments
xml = make_partitions_XML(seqs_df, xml=xml, add_partitionLength=TRUE)
xml$partitions

partitionLengths = xml$partitionLengths
partitionLengths

#######################################################
# Add the tip-dates (assume age = 0 if none)
# All blanks/NAs are converted to 0
#######################################################
OTUs = OTUs_df$OTUs
tipdates = OTUs_df$tipdate

# Get the alignment that will serve as a source of OTUs to reference with "@" 
# in the tipdates 'taxa' tag
alignment_name_w_taxa = pick_a_partitionName_for_taxa_list(seqs_df=seqs_df, morphList=morphList)

# Make the tipdates
OTUs_df$use[isblank_TF(OTUs_df$use)] = "yes"
XML_tipdates = make_XML_tipdates(name="tipDates", OTUs_df, alignment_name_w_taxa=alignment_name_w_taxa, backward=TRUE, xml=NULL)

# Add to the list of taxa/clades (here, manually, just to show the process)
xml$taxa = c(xml$taxa, XML_tipdates)
xml$taxa

#######################################################
# Add a taxonSet for the fossils, if there are any
#######################################################
xml = add_fossils_taxon_to_xml(OTUs=OTUs, tipdates=tipdates, xml=xml)

#######################################################
# Add tipdate uncertainty, sampling, and logging, if desired
#######################################################
xml = make_XML_tipdate_priors(OTUs_df, min_precision=0.01, tree_name=tree_name, xml=xml)

# XML SECTION 5: Miscellaneous
# (none yet)

#####################################################
# SECTIONS 6-7: Partitions and clock models
#####################################################
# Load the sequence partitions and their clocks;
# Remember to subset to just the ones where use=yes
seqs_df = readWorksheetFromFile(file=xlsfn, sheet="data", startRow=15, startCol=1, header=TRUE)
seqs_df = seqs_df[seqs_df$use == "yes", ]

# Get the number of taxa
ntaxa = length(OTUs_df$OTUs)

# XML SECTION 6: Site Models (sequence/morphology evolution models for each partition) 
xml = parse_DNA_AA_siteModels(seqs_df, xml=xml, tree_name=tree_name)
xml$sitemodels

# XML SECTION 7: Shared clock model (relaxed ucld currently; also, the strict clock would be where stdev=~0) 

# Get the names of each clock (usually there is just 1)
clockModel_names = get_clock_names(seqs_df)
i=1
for (i in 1:length(clockModel_names))
    {
    clockModel_name = clockModel_names[i]
    # New, more flexible
    xml = define_a_shared_clock(seqs_df, ntaxa, clockModel_name=clockModel_name, tree_name=tree_name, xml=xml)
    # define_a_shared_clock(seqs_df, ntaxa, clockModel_name=clockModel_name, tree_name=tree_name, xml=NULL)
    # Old
    #xml = define_logNormal_shared_clock(clockModel_name=clockModel_name, tree_name=tree_name, ntaxa=ntaxa, xml=xml)
    }
define_a_shared_clock(seqs_df, ntaxa, clockModel_name=clockModel_name, tree_name=tree_name, xml=NULL)

# Do the XML for relativeMutationRates
# (the relative rate of each partition)

# For multiple clocks: Put this in a loop with 
# different partitionLengths (DNA) and morphLengths (morphology)
# for each clock
partitionLengths = xml$partitionLengths
morphLengths = xml$morphLengths
xml = make_relativeMutationRate_operator(seqs_df, partitionLengths=partitionLengths, morphLengths=morphLengths, clockModel_name=NULL, xml=xml)
xml$operators

# XML for morphology models, priors, likelihoods
if (!is.null(morphList))
    {
    xml = make_Beast2_morph_models(morphList, morphRate_name="morph_relRate", morphGamma_name="morph_gammaShape", clockModel_name=clockModel_name, tree_name=tree_name, xml=xml)

    # For e.g. 2 morphology matrices, try this:
    #xml = make_Beast2_morph_models(morphList[1:4,], morphRate_name="morphPr_relRate", morphGamma_name="morphPr_gammaShape", clockModel_name=clockModel_name, tree_name=tree_name, xml=xml)
    #xml = make_Beast2_morph_models(morphList[5:7,], morphRate_name="morphSl_relRate", morphGamma_name="morphSl_gammaShape", clockModel_name=clockModel_name, tree_name=tree_name, xml=xml)
    }

# XML SECTION 8: Shared tree model (Yule, Birth-Death (BD), Birth-Death Skyline (BDSKY), etc.) 
strat_df = readWorksheetFromFile(file=xlsfn, sheet="treemodel", startRow=15, startCol=1, header=TRUE)

# Get a partition for the taxa when building the starting tree
alignment_name_w_taxa = pick_a_partitionName_for_taxa_list(seqs_df=seqs_df, morphList=morphList)

xmltmp = make_BDSKY_model(strat_df, tree_name=tree_name, clockModel_name=clockModel_name, alignment_name_w_taxa=alignment_name_w_taxa, tipDates_id="tipDates", xml=NULL, XML_mod_for_cont_chars=FALSE)

# Does the dataset include continuous characters?  The XML for the tree needs to 
# be different depending on whether or not this is so!  See: 
# https://groups.google.com/forum/#!topic/beast-users/7KFzWd4PVeQ
TF = seqs_df$use != "no"
if (("continuous" %in% seqs_df$type[TF]) == TRUE)
    {
    XML_mod_for_cont_chars = TRUE
    } else {
    XML_mod_for_cont_chars = FALSE
    }

xml = make_BDSKY_model(strat_df, tree_name=tree_name, clockModel_name=clockModel_name, alignment_name_w_taxa=alignment_name_w_taxa, tipDates_id="tipDates", xml=xml, XML_mod_for_cont_chars=XML_mod_for_cont_chars)

# These are all done above, with further additions by parse_run()
# XML SECTION 9: Starting tree (random, user-fixed, etc.) 
# XML SECTION 10: MCMC states: the state of the MCMC chain is saved at various timepoints, 
# XML SECTION 11: Priors 
# XML SECTION 12: Likelihoods 
# XML SECTION 13: Operators. These specify how the parameter values can change at MCMC time-steps. 
# XML SECTION 14: Trace log. This logs the parameters in a .log file, which may be viewed in Tracer. 
# XML SECTION 15: Screen log. Which parameters print to screen, and how often? 
# XML SECTION 16: Tree log. How often should trees from the MCMC chain be saved, and what data should they have? 

# STICK IT ALL TOGETHER
run_df = readWorksheetFromFile(file=xlsfn, sheet="run", startRow=15, startCol=1, header=TRUE)
run_df

# Save to outfn
# This is where it will get saved:
getwd()
# outfn="run_df" means the output filename will is specified in the run_df worksheet

# Input the dataset source
dataset_source = ' Lee, Michael S.Y.; Worthy, Trevor H. (2012). "Likelihood reinstates Archaeopteryx as a primitive bird." Biology Letters, 8(2), 299-303. Published online: October 26, 2011.  http://dx.doi.org/10.1098/rsbl.2011.0884 http://rsbl.royalsocietypublishing.org/content/8/2/299 '

outfn = parse_run(run_df, xml, outfn="run_df", dataset_source=dataset_source,  printall="short")

# Here is the filename
outfn

# Glance at the XML file in R (will take up many screens)
#moref(outfn)

# Save the morphology matrix stats to text file(s):
outfns = save_data_stats(data_XMLs)

After BEASTmasteR: Running Beast2, TreeAnnotator, etc., from Terminal

(link to this section)

Once you have an XML file, you can run it in Beast2. The "common" way to run Beast2 is by double-clicking on the Beast.app icon (in a Mac; something similar in Windows), hunting through the menus and folders to find your XML file, and running that. It is Objectively Better to just run Beast2 from command line. Reasons:

  1. It's faster to copy-paste one command into Terminal, than to do a bunch of clicking
  2. You don't have to keep remembering to click certain settings (random number seed, write over old files, etc.)
  3. When (not if) Beast2 crashes, the Beast.app version will often crash entirely, so fast that you can't see the error message. Beast2 errors and crashes can be caused by many things — most commonly, your constraints are incompatible with your starting tree (either the one you specified, or a random tree if you specified that). So, often you need to fiddle some with the XML file to get it to work. It gets exhausting to keep clicking around to close Beast2, re-open it, etc. The Terminal command is much easier.
  4. (Also: If you are wondering why a run crashed, the most common problem is Uniform distributions and other date priors that have "hard" edges. E.g., if you have a root date prior of Uniform(30,40) million years ago, but your starting tree root age is only 20 mya, the tree will have a 0 likelihood, i.e. a -Inf log-likelihood (LnL), so the analysis will crash. Topology constraints can cause a similar problem, if the starting tree doesn't obey those constraints.)
  5. Note: In BEASTmasteR, you can easily convert Uniform prior distributions etc. to a tight Normal distribution (which has no hard edges), in order to get a starting tree. This is accomplished by changing the relevant line in the "convert_to_normal" column to "yes". Topology priors can be turned off by changing "use" to "no" or "mono" to "no". This keeps your original setup saved, but allows for the easy generation of analyses that run, so that hopefully you can do a short run to generate a starting tree, then do a full run with your preferred priors. This was mostly motivated by the maddening experience of trying to set up slightly different analyses in BEAUTi and remember everything you did, save/load old BEAUTi setups, etc.
  6. All of the above saves effort, time, and carpal tunnel.

Here are example Terminal (the operating system command line, not inside of R) commands. You may have to modify the working directory (cd = change directory), the location of the executable program files, and you may have to change the filenames etc. depending on what you want the files named (note: it is easiest to keep default filenames, and just have different directories with different names for different variants of an analysis).

The commands may be slightly different on e.g. Windows; google it or check the beast-help Google Group if you can't figure it out.

==================================================
Note: BEAST analysis gets easier when you learn the command-line commands.

Here are some likely commands, assuming you have Mac OS X 10.9, and are using Terminal (usually found in: /Applications/Utilities/Terminal.app). These may not be perfect -- it all depends on the locations of each file/program. And YMMV (your mileage may vary) on Windows etc.

(1) cd (change directory) to wherever you saved outfn. Probably your working directory:
>>> cd /NIMBioS_projects/example/

(2) Run Beast2.
>>> java -Xms512m -Xmx512m -jar /Applications/BEAST_2.1.3/lib/beast.jar -java -seed 754321 -overwrite runthis_v1.xml

-java means you don't need Beagle, but makes it slower.
-Xms512m and -Xmx512 control how much RAM is used; double etc., if you need more.
-seed is the starting seed so you can replicate the analysis.
-overwrite means you will overwrite the output file. Rename it if you don't want to do this!

(3) This runs the help for Beast. See also google and the Beast2 website and beast-users google group!
>>> java -Xms512m -Xmx512m -jar /Applications/BEAST_2.1.3/lib/beast.jar -help

(4) This runs TreeAnnotator, which goes through your post-burnin tree samples and calculates an MCC (Maximum Clade Credibility) tree. Note that this is not the same thing as e.g. a majority-rule 'consensus' tree. See: http://en.wikipedia.org/wiki/Maximum_clade_credibility_tree
>>> java -Xms512m -Xmx512m -jar /Applications/BEAST_2.1.3/treeannotator.jar -heights median -burnin 1000 -limit 0 treeLog.txt treeLog.mcc

(5) This runs the help for TreeAnnotator. See also google and the Beast2 website and beast-users google group!
>>> java -Xms512m -Xmx512m -jar /Applications/BEAST_2.1.3/lib/treeannotator.jar -help

(6) This opens the MCC tree in FigTree. You should also inspect the trace log files with Tracer.app!!
>>> open -n /Applications/beast/FigTree_v1.4.0.app --args treeLog.mcc
==================================================

==================================================
All commands, without comments:
cd /NIMBioS_projects/example/
java -Xms512m -Xmx512m -jar /Applications/BEAST_2.1.3/lib/beast.jar -java -seed 754321 -overwrite runthis_v1.xml
java -Xms512m -Xmx512m -jar /Applications/BEAST_2.1.3/lib/beast.jar -help
java -Xms512m -Xmx512m -jar /Applications/BEAST_2.1.3/treeannotator.jar -heights median -burnin 1000 -limit 0 treeLog.txt treeLog.mcc
java -Xms512m -Xmx512m -jar /Applications/BEAST_2.1.3/lib/treeannotator.jar -help
open -n /Applications/beast/FigTree_v1.4.0.app --args treeLog.mcc
==================================================

==================================================
BEASTmasteR is writing Beast2 XML to:

runthis_v1.xml 

Your current working directory (getwd()) is:

 /drives/SkyDrive/NIMBioS_projects/2015-03-18_Tumamoc/doggies/r2 
==================================================

====================================================================================================
====================================================================================================
IF YOU USE BEASTmasteR IN YOUR PUBLICATIONS, PLEASE CITE IT: http://phylo.wikidot.com/beastmaster#citation
====================================================================================================
====================================================================================================

Example script: Extracting statistics from a Beast2 MCC tree

(link to this section)

This code chunk should run "out of the box" (just cut and paste; you will have to modify the file location for your own files, of course).

This script reads in the MCC tree, and outputs a table containing all of the stored statistics for each node/branch.

The comments also contain hints, and links, on (a) plotting APE node numbers, (b) interpreting the tree table, and (c) plotting HPDs, posterior probabilities, etc., on a plot.

########################################################
# Extracting HPDs etc. to a table, from a Beast MCC tree
########################################################

# Some of the functions in "tree_utils_v1.R", e.g. read_beast_prt, used for extracting the 
# bracketed statistics from BEAST NEXUS tree files (MCC files) are copied/lightly modified/heavily 
# modified from the R package " phyloch", by Christoph Heibl, licensed under GPL (>=2), available 
# at: http://www.christophheibl.de/Rpackages.html. Please cite phyloch also, also if you use this 
# feature.

# Load the R libraries for dealing with phylogenies & XML
library(XML)
library(ape)   # for read.tree
library(gdata) # for trim
library(XLConnect)    # for readWorksheetFromFile (seems to work better than read.xls)
library(devtools, httr) # for GitHub & Gist
library(BioGeoBEARS) # for sourceall

# Online source for BEASTmasteR (yes, I will make an official R package eventually, 
# but I am busy)
# Online (you can also download each and source locally)
library(devtools,httr)
source_gist("https://gist.github.com/nmatzke/b2dbf78532eca734881a")
sources = sourceall_git(repo="nmatzke/BEASTmasteR")

#########################################################
# BACKGROUND: the function 'read_beast_prt'
#########################################################
# read_beast_prt uses:
#
# - the BioGeoBEARS 'prt' (print tree to table) function, 
#   ...and combines it with...
# - 'extractBEASTstats2', heavily modified from phyloch
# 
# ...to produce a table (an R data.frame).
# 
# This table contains the tree in table format, 
# with each branch as a row.  The rows are in the same 
# order as the default node numbers in R -- that is, in the table:
# 
# - Rows 1-ntips are the tip nodes (and the branches below each tip node)
# - Row (ntips+1) is the root node. 
#      (Note: Typically, the root node is the only one that 
#             doesn't have a branch below it. It can have a branch, 
#             if root.edge is specified in the APE tree structure. 
#             Typically, MCC trees don't have this.)
# - The rest of the rows are internal nodes.
#

#########################################################
# SEEING THE 'APE' NODE NUMBERING SCHEME IN A PLOT
#########################################################
# See the first part of this BioGeoBEARS script:
#
# http://phylo.wikidot.com/example-biogeobears-scripts#node_numbering
# 

#########################################################
# EXAMPLE TREE FILE: 
#########################################################
#
# 'treeLog.mcc', in
# http://phylo.wikidot.com/local--files/beastmaster/LW12.zip
#
# ...or...
#
# http://phylo.wdfiles.com/local--files/beastmaster/dino_treeLog.mcc

# NEXUS tree file name (fn)
# E.g., an MCC tree from Beast/TreeAnnotator

# Example of a locally-stored MCC file
# nexfn = "treeLog.mcc"

# Note: check your working directory with
# getwd()
# list.files()

# Here, we'll use the remotely-stored MCC file
nexfn = "http://phylo.wdfiles.com/local--files/beastmaster/dino_treeLog.mcc"

# Read the tree to an APE tree object
tr = read.nexus(nexfn)

# Here's a tree table
trtable = prt(tr, printflag=FALSE)
head(trtable)

# What fields are there in the table?
names(prt)

# Here's the part of the table with the last tip nodes, the root node, and the
# beginning of internal nodes
trtable[87:93,]

# The tree has 89 tips and 88 internal nodes, so the table should have 89+88=177 nodes (rows)
dim(trtable)
# yep

###############################################
# EXTRACTING STATISTICS FROM A BEAST2 MCC TREE
###############################################
# Getting the tree table, with 
output = read_beast_prt(file=nexfn, digits = NULL, get_tipnames=TRUE, printflag=FALSE, include_rates=FALSE) 

# What's in output?
names(output)

# Here's the tree again
output$tr

# Here's the tree table, with stats added
head(output$prt_beast_nodestats)

# The tree has 89 tips and 88 internal nodes, so the stats table should have 89+88=177 nodes (rows)
prt_beast_nodestats = output$prt_beast_nodestats

# What fields are there?
names(prt_beast_nodestats)

# Q: But I don't see HPDs, or posterior probabilities,
#    when I use the head() command!
head(prt_beast_nodestats)

# A: TreeAnnotator only records HPD and posterior probabilities 
#    for internal branches/nodes.
# 
#    The posterior probability of a terminal branch (one immediately
#    below a tip node) is always 1, obviously.
#
#    The date of a tip node can vary (*if* you are doing tip-dating), 
#    but you can only access this tip-date and build a 95% HPD if you 
#    store the tip-dates in your Beast2 logfile, and then extract
#    them with some other BEASTmasteR functions. This is done 
#    in the example at: 
#    http://phylo.wikidot.com/beastmaster#tipdate_plots_script
# 

# See the stats appear for the internal nodes:
prt_beast_nodestats[87:93,]

###############################################
# BUT I WANT TO PLOT THE HPDs ON A TREE!
###############################################
# See this script / example:
# http://phylo.wikidot.com/beastmaster#tipdate_plots_script

###############################################
# Using the extracted data: example
###############################################

# Example plot: It is often the case that dating uncertainty
# increases as the median node age gets older. You can 
# check this with a command such as:

# internal node numbers
internal_nodenums = (length(tr$tip.label)+1) : (length(tr$tip.label)+tr$Nnode)
internal_nodenums

# Plot:
# x: time_bp (node time before present) 
# y: height_HPD_width (upper 95% cred. interval, minus lower 95% cred. interval)
#
# We are adding 66 my to the node ages, since the "top" of this tree is at the K-T boundary
plot(x=66+prt_beast_nodestats$time_bp[internal_nodenums], y=prt_beast_nodestats$height_HPD_width[internal_nodenums], xlab="node age (Ma)", ylab="HPD width", ylim=c(0, 50), xlim=c(250,65))
title("Theropod tree: Median node age vs. 95% HPD width")

# Uncertainty increases somewhat as we go back in time, but here the 
# effect is less than usual, probably because we have many fossils 
# throughout the study group

##############################################################################################
# IF YOU USE BEASTmasteR IN YOUR PUBLICATIONS, 
# PLEASE CITE IT: http://phylo.wikidot.com/beastmaster#citation
##############################################################################################

Example script: Tip-dating plots in R: showing uncertainty in tip dates and node dates

(link to this section)

FigTree is a great, fantastic program, but if you do enough BEAST analyses, you will eventually become tired of having to click all of the options every time you load up a new MCC tree. 99% of the time, all you want is:

  • The MCC tree
  • With tip labels that you can read
  • With a reasonable timescale at the bottom, large enough to read
  • With blue node bars showing the 95% HPD of the node dates
  • With posterior probabilities between 0-1 plotted above the branches. With 2 digits (who needs more??)
  • In a PDF for sharing, archiving, or publication
  • Ideally, you could plot the 95% HPD of the tip dates, also, when the fossil tip dates are uncertain.

I have combined all of these in the new function plotMCC, which you can use by source()-ing the script file plotBEAST_v1.R. The resulting tree looks like this:

LW12_treeLog.mcc_tipdates_node_bars_and_tip_bars.png

Figure: Example plot of an MCC tree resulting from a tip-dating analysis of the Lee & Worthy (2012) theropods character matrix, using the plotMCC function.

The tutorial script is below. The zipped directory containing all the inputs is here: http://phylo.wikidot.com/local--files/beastmaster/LW12.zip (see the example_outputs subfolder.

Example code

(link to this section)

# Hint: you could save this script in a file like plotMCC_Rscript.R 
# and then use "Rscript plotMCC_Rscript.R" from the Terminal command line.

# Hint: See the CHANGE bits for...bits you should change.

# plotMCC_Rscript.R

library(ape)
library(BioGeoBEARS)

# CHANGE: sourceall() is a fast way to source everything, but it assumes you've saved to hard disk
# CHANGE- all of the source code listed at: http://phylo.wikidot.com/beastmaster
# CHANGE- You have to change the directory to match whereever you downloaded.
# CHANGE- Or, you can paste in the source() commands and source directly from the website. This is
# CHANGE- slower and you'd have to do it each time.
sourceall("/GitHub/BEASTmasteR/R/")

#######################################################
# Plot the FigTree-type tree
#######################################################
defaults = '
nexfn
pdffn=TRUE
tipnames_right_justified=TRUE
plot_node_heights=TRUE
plotPP=TRUE
minage=0
ladderize_tree=TRUE
fliptree=FALSE
tips_to_rotate=NULL
tipcex=1
italics_tiplabels = TRUE
digits=2
xmin=-5
xmax_mult=1.3
pdfheight=11
pdfwidth=9
newick=FALSE
tipdates_table=NULL
xtext="millions of years ago"
vlines=NULL
vline_col="grey50"
vline_type="dotted"
xtext="years ago"
space_tipnames=NULL
space_spaces=NULL
'

###########################################
# Inputs for the plot
###########################################
# CHANGE: change nexfn to match your MCC tree's filename. Also, check your
# CHANGE: working directory with getwd(), and set it with setwd("whatever/directory/you/like")
nexfn = "treelog.mcc"    # MCC tree from TreeAnnotator (NEXUS format)

# CHANGE: Any of these options, as you see fit.
titletxt = "draft dated phylogeny"
pdffn=TRUE              # If TRUE, output will be nexfn.pdf; 
                        # if FALSE, plot to screen; 
                        # anything else interpreted as output filename
tipnames_right_justified=TRUE    # Plot tipnames at 0 mya instead of at branch tips;
                                 # connected by dotted grey lines
plot_node_heights=TRUE  # Plot blue bars for node height variation
plotPP=TRUE             # Print posterior probabilities at middle of branches
minage=0                # Should the right edge of the plot be 0 mya or some other value?
                        # NOTE: For calendar years, e.g. "2013 2014 2015 2016", 
                        # set minage=-2016.
ladderize_tree=TRUE     # ladderize the tree, e.g. as in FigTree "increasing"
fliptree=FALSE          # flip the whole tree? (e.g. FigTree "decreasing")
tips_to_rotate = NULL   # If you want to manually rotate some nodes, put the 
                        # tipnames here. The first 2 will specify the first node
                        # to rotate, the next 2 the next node, etc. 
                        # Obviously, you must have an even number of tipnames.
                        # e.g.: tips_to_rotate = c("sp1", "sp2")
tipcex = 0.50           # cex (character expansion) on the tipnames
italics_tiplabels=TRUE  # When plotting tipnames, remove underscores and plot in italics?
digits=2                # Number of digits for e.g. Bayesian posterior probabilities
xmin=-5                 # Minimum of the x-axis (typically, xmin=0=root node age, which 
                        # cuts off the bar on the bottom/root node. If so, make 
                        # xmin more negative
xmax_mult=1.3           # Extend the x-axis maximum by multiplying the tree height by
                        # this value (helps avoid getting the tipnames cut off)
pdfheight=16            # PDF height (in inches) ('Merica!!)
pdfwidth=8.5            # PDF width (in inches) ('Merica!!)
newick=FALSE            # Set newick=TRUE if nexfn is Newick instead of NEXUS. Node bars
                        # not available for Newick unless you write your own function
tipdates_table=NULL     # If you want bars on the tipdates also, give the data.frame here.
                        # See formatting below. Use get_tipdates_from_logfile() to get it.
xtext="millions of years ago"  # If you want something other than the default x-axis
                               # label. E.g., "mya", "years ago", "mega-annum", etc.
vlines=NULL             # Vertical lines, if desired
vline_col="grey50"      # Color of the vertical lines
vline_type="dashed"     # lty (line type) of the vertical lines
xtext="years ago"       # If you want something other than the default x-axis
                        # label. E.g., "mya", "years ago", "mega-annum", etc.
space_tipnames=NULL     # tipnames to which to add extra spaces
space_spaces=NULL       # number of spaces for each tipname

# Manual setup of tipdates_table:
# 
# row1 = c("outgroup", 35, 45)
# row2 = c("spA", 32, 36)
# row3 = c("spB", 2, 5)
# tipdates_table = as.data.frame(rbind(row1, row2, row3), stringsAsFactors=FALSE, row.names=FALSE)
# names(tipdates_table) = c("tipname", "min_tipage", "max_tipage")
# tipdates_table

# Automatic extraction of the tipdates from the logfile:
logfn = "tracelog.txt"    # Filename of the Beast2 log file
tr = read.nexus(nexfn)    # Tree object required (for the tipnames)
burnin_skipnum=500        # Number of trees to skip as burnin
sample_every=1            # After excluding burnin, subsample if desired

# Run it:
res = get_tipdates_from_logfile(logfn, tr, burnin_skipnum, sample_every)

# Check out the summary!
res$summary_tipdates

# Extract from results (res)
sampled_tipdates = res$sampled_tipdates
summary_tipdates = res$summary_tipdates
tipdates_table = res$tipdates_table

plotMCC(nexfn=nexfn, 
titletxt=titletxt, pdffn=pdffn, 
tipnames_right_justified=tipnames_right_justified, 
plot_node_heights=plot_node_heights, 
plotPP=plotPP, 
minage=minage, 
ladderize_tree=ladderize_tree, 
fliptree=fliptree, 
tips_to_rotate=tips_to_rotate, 
tipcex=tipcex, 
italics_tiplabels=italics_tiplabels, 
digits=digits,
xmin=xmin,
xmax_mult=xmax_mult,
pdfheight=pdfheight,
pdfwidth=pdfwidth,
newick=newick,
tipdates_table=tipdates_table,
vlines=vlines,
vline_col=vline_col,
vline_type=vline_type,
xtext=xtext,
space_tipnames=space_tipnames,
space_spaces=space_spaces)

As noted in the script, you could save this script in a file like plotMCC_Rscript.R and then use "Rscript plotMCC_Rscript.R" from the Terminal command line. (You would still have to set whatever settings you want by editing plotMCC_Rscript.R).

Many more options could be imagined, but programming every imaginable thing (as in FigTree!) is a huge pain. If you need something specific, let me know — I am more likely to put work in for collaborations, of course.

BEASTmasteR current and planned features

(link to this section)

Update 2015-04-28: All of the below is now done, except Mk-parsinf, about which I am 90% sure is Probably Mostly A Bad Idea for several reasons, and at least naively it appears that the problem blows up computationally, even if it is a good idea; email me if interested.

As the main goal was just to produce a functioning analysis, BEASTmasteR does not do everything everyone might want. Some features will be added in the future, but others you might have to produce by manually editing the XML code, based on sample XML produced by BEAUTi, other Beast2 example XML files, etc.

I'll keep a list of what is implemented and what is planned:

Site models, DNA: HKY. Plan to add JC, GTR. +I (invariant sites) I may not add, it mostly seems to cause problems.
Site models, Amino acids: WAG. May add others.
Site models, Morphology: Mk. May add: Mk-v, Mk-parsinf

Base frequencies, DNA: Estimated. Plan to add equal.
Base frequencies, Amino acids: Estimated. Plan to add equal.
Base frequencies, Morphology: Equal. May add the dirichlet prior if that becomes available.

Clock models: Lognormal relaxed clock, one clock shared across all partitions. Plan to add strict clock. May add multiple clocks, other models.

Tree models: BDSKY. Really, most of the other phylogenetic models are just special cases of this one, so there is not too much point in having separate models for them, just fix the parameters accordingly. May add BDSKY with Sampled Ancestors, though.

Tip dates: Fixed. Plan to add variable tip dates.

I may add linked node dates (Shih & Matzke 2013) and tip dates (Wood, Matzke et al. 2012), if I can figure out how to do it in Beast2 — these require putting priors on the differences in dates, which have to be calculated at each step of the MCMC, requiring some math calculator that produces a result that can then be assessed with a prior. May require BEAST_CLASSIC XML.

BEASTmasteR errors and warnings

(link to this section)

Some errors and warning messages are predictable, so I might as well list them and how to fix or ignore them.

Errors and fixes

(link to this section)

Problem: java.lang.OutOfMemoryError: Java heap space

If you get something like this:

TreeLikelihood uses beast.evolution.likelihood.BeerLikelihoodCore
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at beast.evolution.likelihood.BeerLikelihoodCore.createNodePartials(Unknown Source)
    at beast.evolution.likelihood.TreeLikelihood.initCore(Unknown Source)
    at beast.evolution.likelihood.TreeLikelihood.initAndValidate(Unknown Source)
    at beast.util.XMLParser.initPlugins(Unknown Source)
    at beast.util.XMLParser.parse(Unknown Source)
    at beast.util.XMLParser.parseFile(Unknown Source)
    at beast.app.BeastMCMC.parseArgs(Unknown Source)
    at beast.app.beastapp.BeastMain.main(Unknown Source)

…then increase the memory by e.g. doubling the RAM you allow BEAST to use. E.g., change java -Xms512m -Xmx512m… to java -Xms1024m -Xmx1024m… in the Terminal command. For huge datasets (100,000 nucleotides+) you may need something more than a desktop computer to effectively run it. (The simplest option is the CIPRES server; but be sure to follow their guidelines etc.)

Problem: no Perl. If you get this:

> OTUs_df = read.xls(xls=xlsfn, sheet="OTUs", skip=14, header=TRUE, stringsAsFactors=FALSE, fill=TRUE, blank.lines.skip=FALSE)

Error in findPerl(verbose = verbose) :
  perl executable not found. Use perl= argument to specify the correct path.
Error in file.exists(tfn) : invalid 'file' argument

It means that you don't have Perl installed, probably because you are on a Windows machine, and Windows (Unlike Mac and Linux) doesn't come with Perl automatically installed. Download/install a Perl binary from: https://www.perl.org/get.html . When done, type "perl -v" in a command-line window (not in R, this is a terminal command, not an R command) to double-check that Perl now works.

An imaginable workaround if you can't get read.xls to work would be to save each Excel worksheet to a separate text file and then read them into R with read.table. However, I'm not sure how this would work (the Excel worksheets have a lot of empty lines / rows / columns, for formatting and variable entries). And, Perl is used elsewhere in BEASTmasteR anyway, e.g. to convert DNA entries from non-IUPAC (e.g. "(A G)") to IUPAC ("R"), and to convert "<!— blank line —>" to an actual blank line in the XML, increasing XML readability.

Problem: 50 or more warnings, stringsAsFactors. If you get this:

There were 50 or more warnings (use warnings() to see the first 50)

> warnings()

Warning messages:
1: In `[<-.factor`(`*tmp*`, iseq, value = "4") :
  invalid factor level, NA generated
2: In `[<-.factor`(`*tmp*`, iseq, value = "2") :
  invalid factor level, NA generated
 [...]

This probably means that R's "stringsAsFactors" setting is set to TRUE, meaning that strings are being read as factors, which are really numbers, which leads to mistakes when they are translated back to strings etc., resulting in NAs, which is Very Bad. The solution should be to make sure you run this at the top of your script:

# Turn off R's default stringsAsFactors silliness
options(stringsAsFactors = FALSE)

Check the setting with:

options("stringsAsFactors")

You should get FALSE:

> options("stringsAsFactors")
$stringsAsFactors
[1] FALSE

Warnings to ignore

(link to this section)

Non-problem: treeLog.mcc: No such file or directory. This warning can be ignored:

Warning message:
In normalizePath(path, ...) :
  path[1]="/Users/matzke/Downloads/Lee_Worthy_2012_theropods_to_BEASTmasteR/treeLog.mcc": No such file or directory

…because all that it means is that you haven't yet made an MCC tree from the Beast2 posterior sample of trees.

Non-problem: state contains a node birthRateChangeTimes_tops for which there is no operator

During the Beast2 run, you will see these warnings:

===============
Warning: state contains a node birthRateChangeTimes_tops for which there is no operator.
Warning: state contains a node deathRateChangeTimes_tops for which there is no operator.
Warning: state contains a node samplingRateChangeTimes_tops for which there is no operator.
Warning: state contains a node rho for which there is no operator.
Warning: state contains a node rhoSamplingTimes for which there is no operator.
Warning: state contains a node reverseTimeArrays for which there is no operator.
===============

These warnings can be ignored. In Beast2 Skyline models, the time-borders between time-bins are "parameters", but I have only ever seen people used fixed borders for time-bins (presumably reflecting known geological events, etc.). In BEASTmasteR, if you didn't do anything special, you by default just have one time-bin, i.e. the only time border is at time 0.

(For the sampling-through-time process, i.e. psi-sampling, I actually have 2 time bins by default, just to make absolutely sure that sampling in the Recent is separate from sampling in the past — this can get confused if e.g. a starting tree has living tips that don't quite come up to time=0 because of minor issues. Anyways…)

It is imaginable that one could have the time-borders between time bins be free parameters that are estimated, so the Beast2 developers programmed it that way. But at the moment, everyone is still trying to figure out when/under what conditions tip-dating gives non-crazy results, so for now BEASTmasteR let users have fixed time-borders if they want, but not estimated time-borders. Since they are fixed, they are parameters without operators to change them, thus the warning. Which you can ignore.

BEASTmasteR Google Group

(link to this section)

To join the BEASTmasteR Google Group, please request to be added here: https://groups.google.com/forum/?hl=en#!forum/beastmaster_package

License

(link to this section)

BEASTmasteR is released under GPL-3: http://cran.r-project.org/web/licenses/GPL-3 . A May 14 code archive is in this zipfile: http://phylo.wikidot.com/local--files/beastmaster/2015-05-14_BEASTmasteR_archive.zip or, more conveniently, the latest code can be accessed via the source() commands in the example script below. It is in development, so use at your own risk, your mileage may vary, etc. etc.

Some of the functions in "tree_utils_v1.R", e.g. read_beast_prt, used for extracting the bracketed statistics from BEAST NEXUS tree files (MCC files) are copied/lightly modified/heavily modified from the R package " phyloch", by Christoph Heibl, licensed under GPL (>=2), available at: http://www.christophheibl.de/Rpackages.html. Please cite phyloch also, also if you use this feature.

Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License