BioGeoBEARS
Table of Contents

Introduction

(link to this section)

BioGeoBEARS is an R package, authored by Nicholas J. Matzke, that is designed to perform inference of biogeographic history on phylogenies, and also model testing and model choice of the many different possible models of how biogeography may evolve on a phylogeny (dispersal, vicariance, founder-event speciation, DEC, DIVA, BAYAREA, etc.)

"BioGeoBEARS" is short for "BioGeography with Bayesian (and likelihood) Evolutionary Analysis in R Scripts". It implements the LAGRANGE (Ree & Smith 2008) DEC model (2 free parameters) as well as models with fewer or more free parameters. Standard model-testing procedures may then be applied.

The default implementation is actually ML (maximum likelihood), for comparison with LAGRANGE, but there is a Bayesian option making use of LaplacesDemon — contact me at gro.soibmin|ekztam#gro.soibmin|ekztam, if you can't figure it out from the functions.

See Matzke (2013), Frontiers of Biogeography, for an introduction. See also my IBS poster.

Author: Nicholas J. Matzke

See also

(link to this section)

Most important: Example script and most-recent updates/fixes to the R code.
Other example BioGeoBEARS scripts.
See also: Per-event weights of cladogenetic range inheritance events in BioGeoBEARS -- the parameters y, s, v, and j
See also: Biogeographical Stochastic Mapping, w. example code.
See also: Advice on statistical model comparison in BioGeoBEARS.
See also: BioGeoBEARS mistakes to avoid.
See also: BioGeoBEARS warnings to ignore, mostly
See also: BioGeoBEARS errors to NOT ignore
See also: DECstar and related models, excluding the null range from the state space
See also: Testing for, and fixing, common tree file problems in BioGeoBEARS
See also: BioGeoBEARS stuff that mostly doesn't work, yet
See also: Publications using BioGeoBEARS.
See also: BioGeoBEARS workshops
See also: BioGeoBEARS validation

License

(link to this section)

The license for BioGeoBEARS and all updates / additions / example scripts is the same as for the package on CRAN:

License: GPL-2 | GPL-3 [expanded from: GPL (≥ 2)]

The GPL-3 license can be found at: http://cran.r-project.org/web/licenses/GPL-3

Installation

(link to this section)

To install and run BioGeoBEARS, and the required supporting packages rexpokit and cladoRcpp, use install.packages as follows:

#######################################################
# Installing BioGeoBEARS
#######################################################
# Notes:
# 1. Lines staring with "#" are comments
# 2. I request that you use the "0-cloud" R repository at "http://cran.rstudio.com" as it is 
# the only one that keeps download statistics. However, this is up to you.
#######################################################

1. Before starting package installation, please make sure your version of R is relatively up-to-date (hopefully, R 3.3.2 or higher).  Google "install R" to find the appropriate current version for your machine.

2. Try the simple setup first. Paste these commands into R, one line at a time.  Look for errors (a "warning" is OK, an "error" is not):

SIMPLE BIOGEOBEARS SETUP
======================================================
# Install optimx
install.packages("optimx", dependencies=TRUE, repos="http://cran.rstudio.com")

# Also get snow (for parallel processing)
install.packages("snow")
library(snow)

# Install phylobase
install.packages("phylobase", dependencies=TRUE, repos="http://cran.rstudio.com")

# From October 2018 onwards, install BioGeoBEARS from GitHub:
# https://github.com/nmatzke/BioGeoBEARS
install.packages("devtools")
library(devtools)
devtools::install_github(repo="nmatzke/BioGeoBEARS", INSTALL_opts="--byte-compile", dependencies=FALSE)
======================================================

3. If you have an issue with the simple setup, try the more step-by-step setup below.

In my experience, the most common problems occur with getting phylobase or one of its dependencies to install.  If your install crashes on a dependency, try to install that dependency from source:

install.packages("package_that_caused_problem", type="source")

Or, google the package, go to its CRAN page, download a recent version, and try to install that. If you get another dependency problem, repeat until you've worked through them all.

INSTALLING A DOWNLOADED PACKAGE SOURCE FILE (*.gz)
======================================================
install.packages(pkgs="full/path/to/packagename.gz", repos=NULL, type='source', configure.args=list(R_KEEP_PKG_SOURCE=TRUE))
======================================================

STEP-BY-STEP INSTALLATION OF DEPENDENCIES FOR BIOGEOBEARS
======================================================
# Install optimx
install.packages("optimx", dependencies=TRUE, repos="http://cran.rstudio.com")

# Also get snow (for parallel processing)
install.packages("snow")
library(snow)

# Install BioGeoBEARS from CRAN 0-cloud:
install.packages("BioGeoBEARS", dependencies=TRUE, repos="http://cran.rstudio.com")

# If default install for BioGeoBEARS doesn't work, install dependencies in this order, 
# then install BioGeoBEARS from source:
install.packages("Rcpp", dependencies=TRUE)
install.packages("RcppArmadillo", dependencies=TRUE)
install.packages("gdata", dependencies=TRUE)
install.packages("gtools", dependencies=TRUE)
install.packages("xtable", dependencies=TRUE)
install.packages("plotrix", dependencies=TRUE)
install.packages("vegan", dependencies=TRUE)
install.packages("FD", dependencies=TRUE)
install.packages("SparseM", dependencies=TRUE)
install.packages("ape", dependencies=TRUE)
install.packages("phylobase", dependencies=TRUE)
install.packages("rexpokit", dependencies=TRUE)
install.packages("cladoRcpp", dependencies=TRUE)

# From October 2018 onwards, install BioGeoBEARS from GitHub:
# https://github.com/nmatzke/BioGeoBEARS
install.packages("devtools")
library(devtools)
devtools::install_github(repo="nmatzke/BioGeoBEARS", INSTALL_opts="--byte-compile", dependencies=FALSE)
======================================================

4. If you are a total novice at R, please try to work through this quick intro-to-R tutorial:
http://phylo.wikidot.com/2014-summer-research-experiences-sre-at-nimbios-for-undergra

#######################################################
#
# IMPORTANT NOTE: ONCE BIOGEOBEARS IS INSTALLED, 
# USE THE SOURCE() COMMANDS IN THE EXAMPLE SCRIPT AT
#
# http://phylo.wikidot.com/biogeobears#script
#
# TO GET BUG FIXES AND UPDATES.
#
# YOU WILL HAVE TO RUN THESE SOURCE COMMANDS 
# AFTER EACH TIME YOU RUN library(BioGeoBEARS). 
# 
# (you can either source them online, or download the .R files 
#  and source them from your hard drive)
#
#######################################################

The packages can also be downloaded from the CRAN links below. R can be freely downloaded from: R-Project.org

rexpokit: http://cran.r-project.org/web/packages/rexpokit/index.html

cladoRcpp: http://cran.r-project.org/web/packages/cladoRcpp/index.html

BioGeoBEARS: http://cran.r-project.org/web/packages/BioGeoBEARS/index.html

The PhyloWiki page you are reading (http://phylo.wikidot.com/biogeobears) is currently the one I keep most up-to-date. To get the most up-to-date fixes and additions to the code, be sure to run the source() commands in the example script at the bottom of this page. Alternatively, you can save the .R files from those URLs and source() them locally.

At some point (after a major new CRAN update) I will move updates to github (when that is the case, see https://github.com/nmatzke/BioGeoBEARS; it is easier to use library(devtools) and install it directly within R (see the install_github command).

How to get help

(link to this section)

After reading this page and trying the example script, researchers sometimes still have difficulties getting their data to run. This is usually due to either unfamiliarity with R or issues with formatting the input files.

Many of these questions are easily answered, and I have created a listserv for BioGeoBEARS users to ask questions of me and each other. While for the moment I am happy to answer questions on email as well, it is more efficient for everybody if this happens on the listserv, where answers are archived for future users. So don't be shy! Any question or difficulty you have encountered has likely been encountered by someone else also. I will sometimes repost email answers to the google group for the benefit of other users.

Joining the list requires my approval (this avoids spambots). You should be able to request to join the group via google groups, or if not, email me at gro.soibmin|ekztam#gro.soibmin|ekztam to be added. Once you are added, you can post by emailing: biogeobearsATgooglegroups.com .

The listserv is publicly-readable at: https://groups.google.com/forum/#!forum/biogeobears

The introductory welcome message is: https://groups.google.com/forum/#!topic/biogeobears/H6PpmRJNxeg

In short, I recommend:

  1. Work through the example script at the bottom of this page, using the default example files, and reading the comments
  2. Then, copy and modify the script to use your own working directory and data files.
  3. If you have a problem, search the google group.
  4. If you don't find an answer, ask the google group or email me. First, double-check the most common problems: (1) Make sure your R version and packages are up-to-date; (2) use the most recent source() commands in the example script, and (2)Make sure you've included all key commands in the example script. When you post about errors/crashes, it is helpful to give the a description of the problem and the data/setup, the error messages, your operating system, and the results of typing sessionInfo() into R (this prints out the package versions etc. that you have installed).
  5. If none of this works, email me your input files. I have gotten quite good at finding the (sometimes subtle) issues with the input files that can cause problems.

Trait-based dispersal model (added Feb. 5, 2016)

(link to this section)

February 2016: I have recently invented a trait-based dispersal model (for non-stratified analyses only, at the moment). The idea is that a trait like flight/flightlessness should influence dispersal rates (and jump dispersal weights), for obvious reasons. The basics of how it works are described in my poster, listed in my Presentations page as:

Matzke, Nicholas J. (2016). Trait-dependent dispersal models for phylogenetic biogeography, in the R package BioGeoBEARS. Speciation and Biogeography session, Society of Integrative and Comparative Biology Annual Meeting, Jan. 3-8, 2016. Wednesday, January 6, 3:30-4:30 pm. 3:30 pm, Wed. Jan. 6, Poster P3-12. (Presentations page | Poster PDF)

I am still working on the model, so I have not made an example script. However, if you have a dataset that you would be interested in trying, and would like to collaborate, feel free to email me.

Getting the trait-based dispersal model to work requires that both the anagenetic and cladogenetic transition matrices be doubled in size (for a 2-state trait), or tripled (for a 3-state trait), etc. This means a change to cladoRcpp. So, I've added a new source() command for a revised cladoRcpp.R, as well as one for the traits model, BioGeoBEARS_traits_v1.R. IF YOU FAIL TO SOURCE cladoRcpp.R, and do so BEFORE the other updates, YOU MAY GET CRASHES DUE TO FUNCTIONS WITH MISSING OPTIONS FOR FUNCTIONS. Therefore, I've updated the lists of source commands everywhere on the main BioGeoBEARS example script and home page: http://phylo.wikidot.com/biogeobears.

Update: Trait-based dispersal model

Dispersal probability a function of distance (the "+x" model) (updated Jan. 2016)

(link to this section)

January 2016: The first manuscript using the distance-based model has been accepted. Please cite:

  • Van Dam, Matthew; Matzke, Nicholas J. (2016). Evaluating the influence of connectivity and distance on biogeographic patterns in the south-western deserts of North America. Journal of Biogeography. Special paper, published online 3 March 2016.

http://dx.doi.org/10.1111/jbi.12727 http://dx.doi.org/10.1111/jbi.12727

April 2015 update: there are now more options for exponents on distance, environmental distance, or on the dispersal multipliers; see this BioGeoBEARS Google Group post: BioGeoBEARS new stuff

Many have noticed that the BioGeoBEARS supermodel contains an "x" parameter that can be used to estimate dispersal probability as a function of distance. The model works, computationally, but such models and their properties have not yet been thoroughly explored, so here are some notes/cautions to guide researchers who wish to try out the model.

  • If a distance matrix is specified, the rates/weights of dispersal processes (e.g., d, j) between two areas (e.g., A, B) will be multiplied by (distance{A to B})^x.
  • By default, x is fixed to 0, which is equivalent to distance having no effect (any distance^0 = 1).
  • To use the distances model, set parameter "x" to be "free". If "x" ends up being e.g. -1, then this amounts to an inverse distance model for dispersal probability. If it is e.g. -2, this would be sometimes-suggested inverse-square law.
  • While the distances model is intuitively very appealing, getting evidence for it might not be so easy. In likelihood-based discrete historical biogeography analyses, we are limited to a relatively small number of areas, and thus to a relatively small number of distances. If you only have e.g. 5 different distances in your geographic setup, you will probably have limited ability estimate the distance-dispersal function. (Note that Landis, Matzke, Moore & Huelsenbeck (2013) did find support for a distance effect in southeast asian Rhododendrons, in a 20-area analysis, using the BayArea program.)
  • So, as always, model-choice procedures (LRT, AIC, AICc, etc.) should be used to determine if your data supports the more complex "+x" model. (See advice on statistical model choice in biogeography, in the last section of this script.)
  • Note that if the distance between A and B is set to 0, and x is e.g. 01, then the modifier is 0.0^-1 = Inf. So, if some areas are touching and others are not, you have to make some decision about what the correct "distance" should be between the touching areas. Your guess is as good as mine. An appealing model would be that there is a dispersal rate between touching areas, and then a modifier that is applied as distance increases above 0. You would get this if you set the distance between touching areas to "1 unit" (since 1 to any power still equals 1).
  • Note the units you use for distance may be important. If distance is in km, and x=-1, then a 100 km distance adjusts dispersal probability by 1/100. If distance was in meters, 100,000 meters would cause an adjustment of 1/100,000. Now, because d, e, etc. are free parameters, in theory they should increase to account for this big decline in dispersal probability, and you should end up with the same likelihoods regardless of the units of distance.
  • However, because we have default limits on d and e, it's possible that some combinations of distance and x would require ML values of d & e that are outside of the limits. And, the fact that d and x probably interact may make the ML search more difficult (although in my experience the +x models do seem to converge — usually.
  • Short version: you should be especially careful about these issues with +x models. Options to reduce the risk of optimization failure include: (1) rescaling your distances, e.g. divide all distances by the largest or smallest distance, so that the largest or smallest distance is 1. (2) widening the limits on d, e, etc. (3) repeating ML searches with different starting points. (4) repeating ML searches with different units on distance.
  • Note that "j" event probabilities are adjusted the same as "d" with the distances model, even though "j" is a weight rather than a rate. j has intrinsic limits (depending on the weights regime set up for other cladogenetic range-changing processes), and changes to j automatically effect the probability of v, s, and y events. So, j may have a more limited ability to adjust in +J+x models than d and e, and rescaling distances may be even more important (rescaling distance so that the smallest distance is 1, i.e. dividing all distances by the smallest nonzero distance in the distance matrix; and, you should have zeros only on the diagonal).
  • You should probably always set "speedup" equal to FALSE with +x models, so that you do the full default optimx search.
  • The format for the distance matrix file is the same as for manual dispersal multipliers. given here: https://groups.google.com/d/msg/biogeobears/XXaJqmI232o/0OOsXBaqNeIJ
  • As you can see from the above, the +x model is new and not yet well-tested, so it should be viewed as experimental. Researchers should THINK about what the +x model is doing and what the results are saying, and do a variety of analyses to test their intuitions. I do believe the model is valid, and people should publish with it if they like, it just should not yet be viewed as a standard model that can be run blindly on defaults.
  • The distances model is being proposed in Matt van Damm & Nicholas J. Matzke, in preparation, email us if interested.

Fossils in BioGeoBEARS (update June 2014)

(link to this section)

As BioGeoBEARS has developed, I have programmed a number of methods for including fossil data. The initial steps are described in chapter 4 of my Ph.D., and additional methods have been developed to help collaborators deal with the specific fossil data available to them. It will be some time before all of these methods are formally described and published, as first I have to get all of my other Ph.D. chapters accepted. However, the methods exist, so they should be used if needed. (Whether or not they are easy to figure out and use is another question — in some cases it will be awhile before I can work up proper tutorials/examples, and sometimes the methods may need further refinement to deal with particular datasets; I am open to collaborations if a researcher finds that useful.)

In the meantime, I will briefly describe the options for fossil data that currently exist in the BioGeoBEARS package / the additional source files which will soon be added to the package. See: Fossil data in biogeographical analysis in BioGeoBEARS

Biogeographic Stochastic Mapping (BSM), under various biogeographical models (updated May 2016)

(link to this section)

May 2016 update: Example tutorial, supplemented with more detailed explanations of the outputs, has been posted here: Biogeographical Stochastic Mapping Example Script

Validation was posted here in 2015: BioGeoBEARS validation

Note: The stochastic mapping code assumes that all of the areas have single-letter codes — without that assumption, there is no efficient way to store/list ranges of multiple areas. So: *always* use single-letter codes for areas, if you ever plan to do stochastic mapping.

I have created functions for stochastic mapping under the various BioGeoBEARS models, and for plotting the resulting simulated histories. Stochastic mapping simulations are simulations conditional on the observed tree and a given model+model parameters. If you average many stochastic mapping simulations, you should get the same probabilities of each state that are produced in ancestral states estimation. However, the individual stochastic maps constitute possible histories, and it can be useful to count stochastically mapped events, their dates, etc., as it is more straightforward to compare counts and dates of events between analyses, than to compare the ancestral state probabilities.

Here are small preview graphics:

DEC vs. DEC+J ML models on unconstrained ("M0") Psychotria analysis:

stochastic_maps_DEC_vs_DECj_M0_v1_WORKED_ARCHIVE_50dpi.gif

DEC vs. DEC+J ML models on a time-stratified ("M3") Psychotria analysis. Here, lineages cannot occupy islands before they rise above the ocean.

stochastic_maps_DEC_vs_DECj_M3_v1_WORKED_ARCHIVE_yes_dotted_lines_50dpi.gif

Caption: Stochastic mapping of approximately equiprobable alternative histories under each model. Left: DEC model. Right: DEC+J, which includes founder-event speciation. Key: Blue, K=Kauai. Green: O=Oahu. Yellow: M=Maui-Nui. Red: H=Hawaii Big Island. Kauai is the oldest high island (~5.2 Ma), the Big Island is the youngest island (~0.5 Ma).

Here are larger versions of the graphics:

http://phylo.wdfiles.com/local--files/biogeobears/stochastic_maps_DEC_vs_DECj_M0_v1_WORKED_ARCHIVE_100dpi.gif

http://phylo.wdfiles.com/local--files/biogeobears/stochastic_maps_DEC_vs_DECj_M3_v1_WORKED_ARCHIVE_yes_dotted_lines_100dpi.gif

Note 2015-04-15: The stochastic mapping algorithm has been validated (see BioGeoBEARS validation) and has been added to the source() commands list. The best example code to use is in the validation of DEC and DEC+J BSMs, available at: http://phylo.wikidot.com/biogeobears-validation#BGB_nonstrat_DEC_DECj

Note 2014-05-28: Fixed issue with stochastic mapping not happening on the root branch below subtrees within a stratum. So, adding areas through time works. Removing areas through time may be a slightly different case and should be double-checked at some point.

Citation of Biogeographical Stochastic Mapping

(link to this section)

Note 2016: The best paper to cite regarding Biogeographical Stochastic Mapping is Dupin et al. (2016, //Journal of Biogeography//), where we first describe the basics of the method, and its utility. Eventually there may be a more specialist application note on the method, but this is behind several other projects.

Citation of Biogeographical Stochastic Mapping:

1. Dupin, Julia; Matzke, Nicholas J.; Sarkinen, Tiina; Knapp, Sandra; Olmstead, Richard; Bohs, Lynn; Smith, Stacey (2016).
"Bayesian estimation of the global biogeographic history of the Solanaceae. Journal of Biogeography, 44(4), 887-899.
http://dx.doi.org/10.1111/jbi.12898

2. Matzke, Nicholas J. (2016). "Stochastic mapping under biogeographical models." PhyloWiki BioGeoBEARS website, 2016,
http://phylo.wikidot.com/biogeobears#stochastic_mapping.

The code is now in the source() commands at the beginning of the example script. See also Biogeographical Stochastic Mapping example script.

SSE Simulation results and code (update September 2014)

(link to this section)

DEC, DEC+J, and most methods in use in historical biogeography implicitly assume that the observed tree is the complete tree, and that lineage-level speciation and extinction are independent of the range evolution process*. Models that correct for these factors are SSE (state-dependent speciation-extinction models), such as BiSSE, GeoSSE, and ClaSSE, but the SSE methods have not yet been widely applied in biogeography, probably because of the large number of parameters they involve (which will increase dramatically as the number of areas and the number of ranges increase), and the typical small size of biogeographic datasets.

Nevertheless, it is interesting to ask whether or not SSE-type processes, as well as missing speciation & extinction events in the observed tree, might cause a bias in inference under DEC or DEC+J, or a problem in distinguishing between these models.

It is also interesting to ask under what values of d, e, and j parameter inference, model choice, and ancestral range inference are reliable.

Therefore, in Matzke (2014), I performed forward simulations under the following macroevolutionary models:

1. Yule, birthrate=0.3
2. BD, birthrate=0.3, deathrate=0.1
3. BD, birthrate=0.3, deathrate=0.3 (this one dies out a lot)
4. SSE, birthrate=0.3, deathrate=0, prob(birth)~rangesize^1
5. SSE, birthrate=0.3, deathrate=0.1, prob(birth)~rangesize^1, prob(death)~rangesize^-1
6. SSE, birthrate=0.3, deathrate=0.3, prob(birth)~rangesize^1, prob(death)~rangesize^-1

The latter three models make speciation more probable in lineages with large geographic ranges, and make extinction less probable in lineages with large geographic ranges.

For each of these macroevolutionary models, geographic range evolved simultaneously with the lineage birth-death process, under a variety of values of d, e and j:

d = 0, 0.03, 0.15
e = 0, 0.03, 0.15
j = 0, 0.02, 0.1, 0.3 (these are per-event weights, not rates)

Parameter combinations were excluded when: deathrate > birthrate, and when e > d. This left 138 parameter combinations. For each of these combinations, the phylogeny+geographical history was simulated 100 times, with each simulation running until the clade had 50 living species.

The complete trees (including extinct lineages) and complete simulated geographic histories were saved, and then the trees were pruned to be "observed" trees. Inference was then conducted under DEC and DEC+J.

These results are discussed in Matzke (2014), Systematic Biology, but the short version is:

1. DEC and DEC+J are highly distinguishable at moderate values** of d and e (0, 0.03) which are often observed in empirical inference. At very high values (d, e=0.15, i.e. on average a d or e event on every branch), this breaks down somewhat, especially for low j, although high j can often still be detected. However, even DEC inference breaks down at high d and e, as would any phylogenetic method where the rates of change are high per branch.

2. Inference of e is always poor, under both DEC+J and DEC (as shown in the original DEC papers)

3. Inference of d and inference of j do not interfere with each other under reasonable parameter values. They are distinguishable. If anything, DEC infers a d too high when the generating model is DEC+J.

4. The effects of BD or SSE on inference appear moderate in most situations.

PDF Files of simulation results

(link to this section)

A. DEC_DECJ_inf_boxplots_ALL.pdf — Simulations when starting range is a single area

B. startABCD_DEC_DECJ_inf_boxplots_ALL.pdf — Simulations when the starting range is all areas

C. simtree_stats_boxplots_ALL2.pdf — Tree statistics, range sizes, number of events, etc., under the simulations. These are in the same order as in A, and correspond to the same parameter combinations.

R functions for SSE simulations

(link to this section)

I thought I had put the SSE simulation code in the Matzke (2014) Dryad archive, but apparently this got dropped between the various rounds of revisions. The code is here as a zipfile: http://phylo.wikidot.com/local--files/biogeobears/Matzke_2014_simcode.zip . Eventually it will be included with documentation in the next major CRAN BioGeoBEARS update. Until then, you can (a) try and figure the functions out (it shouldn't be too hard), or (b) try to get me to help you (more likely if we are collaborating).

Notes

Note * — DEC & DEC+J models do have the process where, if e is positive, range can evolve to 0 size through a range-contraction event. This is usually not inferred, because usually all species in an analysis are living and occupy a nonzero range. Whether or not a continuous-time range-contraction process really causes extinction in real life is debatable. In my simulations, extinction occurs according to the background deathrate (mu), and e can cause additional extinction through the range-contraction process.

So, a "high" rate of d or e is e.g. 50% of the speciation rate. A "low" rate might be 1/10 the speciation rate.

Note ** — The values of d and e only have meaning in relation to the speciation rate. For Hawaiian Psychotria this is ~0.33 (geiger package, birthdeath function). To compare the simulation results to your own data, you would want to estimate your speciation rate, and scale everything mentally. E.g., if your speciation rate is 0.6, your equivalent values of d and e would be doubled from the simulations.

Abbreviations in plots:

birthrate = lambda
deathrate = mu
exponent on rangesize multiplier for speciation = alpha
exponent on rangesize multiplier for extinction = omega

Update to simulation results: Animations

(link to this section)

I have developed plotting routines (not in the package yet) for the above simulations. These plot the biogeographic ranges on top of the tree. Here, line thickness = range size (from 1 to 4 areas). The colors are the same as for the stochastic maps, and in this version of the simulations, the simulation always starts in area "A", or, if you like, "K" for "Kauai", colored blue. There is no constraint or stratigraphy in these simulations, however. The color light grey = all 4 areas occupied.

simhistory_allsims_observed_nostates_ALL_ARCHIVE_WORKED_50dpi.gif

Here is a PDF, and a zoomed-in GIF, of the simulations. Because some of the BD trees are very large, all plots have an x-axis of 100 million years for comparison. This does indicate real variability in the simulations, and thus robustness in the conclusions about accurately inferring DEC versus DEC+J models.

http://phylo.wikidot.com/local--files/biogeobears/simhistory_allsims_observed_nostates_ALL_ARCHIVE_WORKED.pdf

http://phylo.wikidot.com/local--files/biogeobears/simhistory_allsims_observed_nostates_ALL_ARCHIVE_WORKED_100dpi.gif

Citing BioGeoBEARS, DEC+J, and other models (update September 27, 2014)

(link to this section)

Citation for the DEC+J model/founder-event speciation, and statistical model selection in historical biogeography:

  1. Matzke, N.J. (2014). "Model Selection in Historical Biogeography Reveals that Founder-event Speciation is a Crucial Process in Island Clades." Systematic Biology. 63(6): 951-970. http://dx.doi.org/10.1093/sysbio/syu056 http://sysbio.oxfordjournals.org/content/early/2014/09/21/sysbio.syu056.full — Supplementary data on Dryad: http://datadryad.org/resource/doi:10.5061/dryad.2mc1t

Citation for the +x model (distance-based dispersal):

  1. Van Dam, Matthew; Matzke, Nicholas J. (2016). Evaluating the influence of connectivity and distance on biogeographic patterns in the south-western deserts of North America. Journal of Biogeography. Special paper, published online 3 March 2016.

http://dx.doi.org/10.1111/jbi.12727 http://dx.doi.org/10.1111/jbi.12727

On BioGeoBEARS:

  1. Matzke, Nicholas J. (2013). "Probabilistic Historical Biogeography: New Models for Founder-Event Speciation, Imperfect Detection, and Fossils Allow Improved Accuracy and Model-Testing." Frontiers of Biogeography. 5(4), 242-248. (Scholar | Journal)
  2. Also cite R packages that you use (see below).

Other models and pieces of BioGeoBEARS are found in my Ph.D. and various articles in various stages of submission. These are listed below.

I have posted the approved version of my Ph.D. in "Files", below (scroll to the bottom of the page and click "Files"). Here is the reference, for citation purposes. (Note: I've also added the ProQuest version, which is the permanent repository, and open-access. However, email me for supplemental material etc.; I believe I uploaded this to ProQuest, but I don't see how to access it.)

  • Matzke, Nicholas Joseph (2013). Probabilistic Historical Biogeography: New models for founder-event speciation, imperfect detection, and fossils allow improved accuracy and model-testing. Ph.D. thesis, Department Integrative Biology and Designated Emphasis in Computational and Genomic Biology, University of California, Berkeley. Pages 1-240. August 2013. Available at: http://phylo.wikidot.com/local--files/biogeobears/Matzke_2013_PhD_Proquest_biogeography_wSigs.pdf
  • Matzke, Nicholas J. (2013). Probabilistic Historical Biogeography: New models for founder-event speciation, imperfect detection, and fossils allow improved accuracy and model-testing. Department of Integrative Biology, University of California, Berkeley. August 2013. Order No. 3616487, ProQuest Dissertations and Theses, pp. 1-240. Open-access at: http://search.proquest.com/docview/1526024556

Which items you decide to cite depends on which you used, and what your journal's rules are for citing dissertations, articles in review, etc. Ideally, I would eventually like to see people cite at least the Methods in Ecology and Evolution article for the R package, the Systematic Biology article for the DEC+J model, and the Journal of Biogeography article for the likelihood versions of DIVA, DIVA+J, BAYAREA, and BAYAREA+J, but it is hard to predict when each article will be accepted. In the meantime, users can cited whatever has been published, the thesis itself, or articles in in review/in prep (I am happy to send these if you don't have them).

Apart from journals, it has become common practice to cite R packages that are important in an analysis. Every R package has a CITATION file. Those citations can be found with e.g. "citation("BioGeoBEARS")", and are given below.

The articles and their statuses are:

  1. Matzke, Nicholas J. (2014). "Stochastic mapping in historical biogeography." Bioinformatics. In revision. (draft PDF)
  2. Matzke, Nicholas J. (2014). "BioGeoBEARS: an R package for model testing and ancestral state estimation in historical biogeography." Methods in Ecology and Evolution, revising & improving package documentation for resubmission.
  3. Matzke, Nicholas J. (2014). "Model selection reveals differences in cladogenesis processes operating in island versus continental clades." Journal of Biogeography, in revision.

For the distance-dependent dispersal model:

  1. Van Dam, Matthew; Matzke, Nicholas J. (2014). Evaluating the influence of connectivity and distance on biogeographic patterns in the Southwestern deserts of North America. Journal of Biogeography, in review.

Each article has various supplemental materials etc. I am happy to share them, email me at gro.soibmin|ekztam#gro.soibmin|ekztam if you would like to see those. Thanks! Nick

Citation of R packages

(link to this section)

Apart from journals, every R package has a CITATION file. Those citations can be found with "citation("BioGeoBEARS")", and are given below. To get the current version and date, type e.g. "library(help = BioGeoBEARS)".

Matzke, Nicholas J. (2013). BioGeoBEARS: BioGeography with Bayesian (and Likelihood) Evolutionary Analysis in R Scripts. R package, version 0.2.1, published July 27, 2013 at: http://CRAN.R-project.org/package=BioGeoBEARS

Matzke, Nicholas J., and Sidje, Roger B. (2013). rexpokit: R wrappers for EXPOKIT. R package, version 0.24.2, published July 15, 2013 at: http://CRAN.R-project.org/package=rexpokit

Matzke, Nicholas J. (2013). cladoRcpp: C++ implementations of phylogenetic calculations. R package, version 0.14.2, published July 15, 2013 at: http://CRAN.R-project.org/package=cladoRcpp

Update November 12, 2013: BioGeoBEARS in one slide

(link to this section)

Here is an attempt to represent the BioGeoBEARS supermodel in one slide, comparing it to other historical biogeography methods in use.

NOTE: On smaller/laptop screens, some people aren't seeing the Parameters column. Zoom out if necessary (Apple-minus on a Mac, or View menu) to see the whole graphic. Or go to the graphic by itself.

BioGeoBEARS_supermodel.png

Figure: The processes assumed by different historical biogeography methods. Each of these processes is controlled by the specified parameter(s) in the BioGeoBEARS supermodel, allowing them to be turned on or off, or estimated from the data. Note that whether or not the data support a particular free parameter is an empirical question that should be tested with model choice procedures. Note also that this graphic deals only with the range-changing processes assumed by the different methods. BioGeoBEARS does not attempt to replicate e.g. the parsimony aspect of DIVA, just the processes allowed by DIVA (the BioGeoBEARS "DIVA" model could be called "DIVALIKE" to emphasize that it is a likelihood implementation of the processes assumed by DIVA). Similarly, BioGeoBEARS does not yet implement the "SSE" (state-based speciation and estimation rates) features of the GeoSSE model (Goldberg et al. 2011) of diversity. The ClaSSE model (Goldberg & Igić, 2012) can in theory use a parameter to represent the probability of each possible combination of ancestor range, left descendant range, and right descendant range. In that sense ClaSSE is the ultimate supermodel, although users would have to develop their own parameterizations to produce a reasonable biogeographic model, and the number of parameters inflates dramatically with number of areas — on defaults, 9 areas means 29-1=511 possible ranges, and this means 511x511x511 = 133,432,831 possible combinations of ancestor/left descendant/right descendant. The cladoRcpp R package, a dependency of BioGeoBEARS, is designed to efficiently calculate probabilities for these combinations, under the implemented biogeography models.

Time-stratification, dispersal multiplier matrices, distance matrices, areas allowed matrices, and area-of-areas matrices

(link to this section)

I introduce how to run analyses including time-stratification, dispersal multiplier matrices, distance matrices, areas allowed matrices, and area-of-areas matrices [in this post on the BioGeoBEARS Google Group].

I provide a basic description of the different matrices, and what they do below. Below that, there are example files.

Manual dispersal multipliers (as people did with Lagrange)

(link to this section)

With manual dispersal multipliers, dispersal probability (the dispersal rate for d, the dispersal weight for j) is multiplied by the manual_dispersal_multiplier.

This feature was available in Lagrange (for parameter d only).

NOTE CAREFULLY that changing dispersal multipliers is NOT the same thing as changing which areas are allowed.

An additional option in BioGeoBEARS is that, if desired, the user can set the "w" parameter to be free, and then:

d_actual = d_base * manual_dispersal_multiplier^w
j_actual = j_base * manual_dispersal_multiplier^w

By default, w is fixed to 1, but by setting w to be free, the user can attempt to infer the optimal dispersal_multiplier matrix. There are still various issues* with this strategy, but it might be better than just deciding manual_dispersal_multipliers subjectively, which is the traditional strategy.

Good points to have w be free:

  • You can subject your manual dispersal multiplier matrix to model testing
  • Your data can tell you if your manual dispersal multiplier matrix is worthless (if w is inferred to be 0, then the

dispersal matrix has no effect, as anything^0 = 1)

  • You can attempt to infer the optimal dispersal multipliers

However, there are some issues I have noticed:

  • You have to be much more careful than usual to ensure the maximum likelihood optimization is actually finding the maximum likelihood. Manual dispersal multipliers are often a poor fit to the data, and the scaling of w may be different than other parameters. So, watch like a hawk and follow the advice at: http://phylo.wikidot.com/advice-on-statistical-model-comparison-in-biogeobears#ML
  • Although +w eliminates some of the subjectivity, there is still subjectivity in choosing the levels of the manual dispersal multiplier matrix. E.g., should you have base dispersal multipliers of 1, 0.5, and 0.1 (for easy, medium, and hard dispersal), or should they be 1, 0.5, and 0.001? The +w model doesn't completely fix this issue. (One could easily start to use more matrices, though, and give them each an exponent parameter.)

Distance matrix

(link to this section)

With distance matrices, dispersal probability is multiplied by distance^x.

d_actual = d_base * distance^x
j_actual = j_base * distance^x

When x is 0 (the default), distance has no effect. When x is inferred to be negative This allows you to set up statistical model choice on distances, infer the influence of distance, etc. If you have 2 areas with zero distance between them, I recommend setting that distance to be 1.

The minimum distance should NEVER be 0, because this would mean 0 distance = 0 dispersal probability, which is probably not what you want. (Distance matrices can have 0s on the diagonal (distance-to-self), these are ignored.)

It is probably best to re-scale all of your distances so that the you do not have huge numbers. Theoretically, the model could adjust to any scaling of distance, but because we have default min/max values on the parameters, it is much safer to scale the distances to something reasonable (numbers less than 100 or 1000). One way to do this is to divide everything by the shortest distance, so that the shortest distance becomes 1 and gets d_base.

I have sometimes seen ML optimization issues with the +x issue, mostly when x and d are interacting (indicating a likelihood ridge). This can be improved by setting limits on x (e.g., set max x=0 and min x=-2), and re-scaling the distances.

There is no particular reason that a power function (distance^x) is going to be the best dispersal-distance function. Others should be tried (but would take new programming).

Environmental distance matrix

(link to this section)

With environmental distance matrices, dispersal probability is multiplied by environmental distance^n.

d_actual = d_base * environmental_distance^n
j_actual = j_base * environmental_distance^n

This model works just like the distance model. Of course, BioGeoBEARS doesn't care what distances you put in, so you could use +x and +n models to test the effect of two different kinds of geographical distance, or two environmental distances, or whatever. You could even use the manual_dispersal_multipliers matrix to make a third distance if you wanted.

Areas-allowed matrix

(link to this section)

The areas_allowed matrix DOES NOT modify dispersal rates. Instead, it CHANGES THE STATE SPACE ("state space" = list of allowed states. In a DNA analysis, the states are A, C, G, T. In a biogeographical analysis, the states are the ranges NULL, A, B, AB, ABC, etc.). It disallows certain areas at certain timepoints. The classic case is Hawaiian islands sinking below the ocean as you go back in time. See the example areas_allowed.txt file for how this works.

For historical reasons, this is specified as a matrix of 1s and 0s (a list of 1s and 0s would have made more sense, but we aren't changing it now).

As discussed below for the areas-adjacency matrix, only simple scenarios (islands disappearing back in time, for instance) can be specified with the areas_allowed strategy. For anything more complex, use the custom-modified states_list strategy, described here: http://phylo.wikidot.com/biogeobears#better_than_adjacency

Areas-adjacency matrix

(link to this section)

The areas_adjacency matrix DOES NOT modifiy dispersal rates. Instead, it CHANGES THE STATE SPACE ("state space" = list of allowed states. In a DNA analysis, the states are A, C, G, T. In a biogeographical analysis, the states are the ranges NULL, A, B, AB, ABC, etc.).

Specifically, the areas-adjacency matrix tries to describe which ranges (which combinations of areas) are allowed states in the state space. The usual idea is that a species can live in multiple areas *if* those areas are adjacent.

NOTE CAREFULLY: BioGeoBEARS doesn't "know" what you "mean" by adjacency. It is perfectly possible to have two areas be adjacent in the areas_adjacency matrix, but have a huge geographical distance in the distances matrix.

NOTE CAREFULLY: For historical reasons, areas_adjacency was set up as a matrix, for some simple cases. But it is impossible to specify all of the possible combinations of allowed/disallowed areas/states that someone might want to do, with just square 1/0 matrices. You can manually edit the states_list, however, to set up whatever list of allowed states you like.

This would be done in the R script and not in a text file. See various discussions on the BioGeoBEARS google group, e.g.
https://groups.google.com/forum/#!searchin/biogeobears/areas_allowed$20adjacency/biogeobears/7e7U9NtPTBU/pnNtR9M7migJ

Manually modifying the list of states/ranges — A BETTER SOLUTION than the areas-adjacency matrix

(link to this section)

Except for the simplest cases, manual modification of the states list will work better/be more flexible than attempting to use the areas_adjacency matrix strategy. These links explain how to do it:

Manual modification of states_list (list of geographic ranges, a.k.a. the list of states in the state space)
http://phylo.wikidot.com/example-biogeobears-scripts#manual_modify_states

Re-using the manually modified of states_list
http://phylo.wikidot.com/example-biogeobears-scripts#reuse_modified_states

Manually modify the states list in time-stratified analyses
http://phylo.wikidot.com/example-biogeobears-scripts#modified_states_stratified

EXAMPLE FILES

(link to this section)

How to find example files

(link to this section)

You can see example input files in several ways:

  1. Example files are available in the extdata directory of the BioGeoBEARS installation. In R, type np(system.file("extdata", package="BioGeoBEARS")) to see where this is.
    1. Then navigate to that directory with your file browser (Windows Explorer in Windows — often WindowsKey+E; or Finder on a Mac).
    2. Alternatively, go to a terminal or command prompt and type "open directoryname" (replace "directoryname" with the path).
    3. You can also try, in R, pathname=np(system.file("extdata", package="BioGeoBEARS")); system(paste0("open ", pathname))
    4. All of the above could be screwed up by space or differences in operating systems. Note that Windows uses backslashes in directory paths (\like\this\), Macs and UNIX/Linux use forward (/like/this/) — although Windows may cope with forward slashes now. Also, spaces can cause problems, you might have to backslash spaces: /like\ this/seendttoohard/. These things are "obvious" to experienced computational people, but can be mystifying to beginners, and it is difficult to think of all of the possible issues on all of the different computer setups people might have. So: focus on figuring out your computer's directory structure, and how to navigate it.
  2. I have put the example files in "Files" at the VERY BOTTOM of the PhyloWiki page you are reading right not. Scroll to the bottom, click "Files". Direct links are here:

Text files should be viewed/edited WITH A REAL PLAIN-TEXT EDITOR

(link to this section)

In computational biology, data, inputs, code, etc., are typically contained in PLAIN-TEXT (ASCII) files. These files have *no* fonts, text-sizes, invisible formatting, weird character codes, etc.

DO NOT EXPECT TO BE ABLE TO READ/WRITE YOUR DATA/INPUT FILES IN WORD, OR WHATEVER, AND HAVE THE FILES COME OUT CORRECTLY FORMATTED AND COMPUTER-READABLE BY BioGeoBEARS.

Plain-text files are best viewed in an ASCII text viewer, so that you can see the difference between spaces and tabs, you can see blank lines at the end of some files, etc.

Note also that THE DIFFERENCE BETWEEN TABS AND SPACES IS IMPORTANT IN THESE TEXT FILES. COMMON PROBLEMS INCLUDE:

  • Tabs / spaces at the end of a line, after any text. Delete these in the text-editor.
  • A line in the text file that looks blank, but is actually 8 tabs (or whatever). This can be produced by e.g. copying/pasting a series of distance matrices from Excel to text. When you do this, just check the blank lines for tabs, and delete these tabs in the text-editor so the line is actually blank.

Good programs:
TextWranger (mac), BBedit (mac), Notetab (windows), Rstudio or R.app editors, any code editor

Bad (bad Bad BAD!!) programs:
Word/Office (mac/windows), WordPad (windows), TextEdit (Mac)

Marginal:
Notepad (windows)

Links to files

(link to this section)

A Newick-formatted tree of Hawaiian Psychotria:
http://phylo.wdfiles.com/local--files/biogeobears/Psychotria_5.2.newick

A PHYLIP-formatted geography file (C++ Lagrange format)
http://phylo.wdfiles.com/local--files/biogeobears/Psychotria_geog.data

For a time-stratified analysis, you have to add a timeperiods file:
Note: the oldest age much be below the age of the root of your tree,
and there can only be one such age
http://phylo.wdfiles.com/local--files/biogeobears/timeperiods.txt

Manual dispersal multipliers (as people did with Lagrange)
http://phylo.wdfiles.com/local--files/biogeobears/manual_dispersal_multipliers_with_0s.txt

Manual dispersal multipliers (no time-stratification)
http://phylo.wdfiles.com/local--files/biogeobears/manual_dispersal_multipliers_with_0s_single_timebin.txt

Sometimes 0s can create "impossible" data (likelihood 0, log-likelihood -Inf) or underflow problems
that crash the run. If that happens, an equivalent strategy is to use very small
values instead of 0
http://phylo.wdfiles.com/local--files/biogeobears/manual_dispersal_multipliers_without_0s.txt

With the "x" parameter turned on, dispersal probability will be modified by distance^x
Here is a stratified distances file for Hawaii
(Note: I did not check this rigorously for accuracy in Hawaii)
You can get distances from a map with various web tools:
https://www.google.com.au/search?q=distances+between+points+on+a+map&oq=distances+between+points+on+a+map&aqs=chrome..69i57j0l5.7536j0j7&sourceid=chrome&es_sm=119&ie=UTF-8

Example distances file (in km)
http://phylo.wdfiles.com/local--files/biogeobears/distances.txt

NOTE HOW THE COLUMNS GET HEADERS, BUT THE ROWS DO NOT. The rows are assumed to have the same order as the columns.

NOTE: THE AREAS IN THE DISTANCES FILE SHOULD BE IN THE SAME ORDER AS IN THE HEADER OF THE GEOGRAPHY FILE. That means *both* the columns *and* rows should be in the order of the areas in the geography file.

You probably want to consider rescaling distance so that the smallest distance is 1
(and thus dispersal between these regions equals the base dispersal rate, d,
and/or the base jump dispersal weight, j)
http://phylo.wdfiles.com/local--files/biogeobears/distances_rescaled.txt

Distances can also be used in non-stratified analyses:
http://phylo.wdfiles.com/local--files/biogeobears/distances_nonstratified.txt
http://phylo.wdfiles.com/local--files/biogeobears/distances_nonstratified_rescaled.txt

Example areas_allowed matrices:
http://phylo.wdfiles.com/local--files/biogeobears/areas_allowed.txt

An areas-allowed matrix of "all 1s" is equivalent to specifying no areas_allowed matrix (and should return identical likelihoods
http://phylo.wdfiles.com/local--files/biogeobears/areas_allowed_all1s.txt

One might also want to specify an areas_adjacency matrix (which combinations of areas are allowed states; note that this is a DIFFERENT thing than modifying dispersal probabilities):
http://phylo.wdfiles.com/local--files/biogeobears/areas_adjacency.txt

NOTE CAREFULLY: It is impossible to specify all of the possible combinations of allowed/disallowed areas/states that someone might want to do, with just square 1/0 matrices. These are intended for simple cases (like islands sinking as you go back in time).

You can manually edit the states_list, however, to set up whatever list of allowed states you like. This would be done in the R script and not in a text file. See various discussions on the BioGeoBEARS google group, e.g.
https://groups.google.com/forum/#!searchin/biogeobears/areas_allowed$20adjacency/biogeobears/7e7U9NtPTBU/pnNtR9M7migJ

How to decide your geographic areas

(link to this section)

The geography files look like this:

http://phylo.wikidot.com/biogeobears#links_to_files

Specifically:
http://phylo.wdfiles.com/local--files/biogeobears/Psychotria_geog.data

How to decide on your geographic areas: I recommend that the geographic regions be "biogeographically significant" — i.e., not political boundaries. Hopefully the discrete areas you choose reflect some kind of real discreteness out there in the world that the organisms are responding to — geographical, environmental, etc. This is admittedly somewhat subjective, but thinking about it, with the goal of capturing the main features of the biogeographical history, at the scale you are looking at, is probably still the best strategy.

If it's a global analysis, the regions are typically continents or large ecoregions. At smaller scales, they may be ecoregions or physiographic provinces within a continent, island groups, or islands.

The computational limit is the "state space" — the number of states in the transition rate matrix. DNA transition rate matrices are 4x4, because DNA has 4 states (A, C, G, T). Protein analyses have 20 states, because there are 20 amino acids. However, biogeographical analyses can have hundreds of states, and working with these larger state spaces becomes slower and slower for the computer.

In a biogeographical analysis, each "state" corresponds to a "geographic range" — a a specific combination of presences and absences in the geographic areas that have been set up.

Computationally, the number of states/ranges should be roughly:

  • less than 1000 if you want the analysis to run in under a day
  • less than 1500 if you want the analysis to run in under a week, and
  • less than 2500 if you want it to run at all

(the details might depend on the amount of computation available — if you have thousands of cores available, you might be able to push it a bit)

The number of states depends on the number of areas in the analysis, and the maximum range size (the number of areas any species is allowed to occupy). You can check this with numstates_from_numareas():

library(BioGeoBEARS)
numstates_from_numareas(numareas=3, maxareas=3, include_null_range=TRUE)
numstates_from_numareas(numareas=10, maxareas=10, include_null_range=TRUE)
numstates_from_numareas(numareas=12, maxareas=12, include_null_range=TRUE)
numstates_from_numareas(numareas=12, maxareas=2, include_null_range=TRUE)

Results:

> numstates_from_numareas(numareas=3, maxareas=3, include_null_range=TRUE)
[1] 8
> numstates_from_numareas(numareas=10, maxareas=10, include_null_range=TRUE)
[1] 1024
> numstates_from_numareas(numareas=12, maxareas=12, include_null_range=TRUE)
[1] 4096
> numstates_from_numareas(numareas=12, maxareas=2, include_null_range=TRUE)
[1] 79

An analysis with 12 areas (and 12 max) and thus 4096 states will never complete in a reasonable time, but the others are fine.

SCRIPT: EXAMPLE THAT SHOULD RUN OUT OF THE BOX: Hawaiian Psychotria (revised and improved, 2015-04-15)

(link to this section)

#######################################################
# This is an introductory example script for the 
# R package "BioGeoBEARS" by Nick Matzke
# 
# All scripts are copyright Nicholas J. Matzke, 
# please cite if you use. License: GPL-3
# http://cran.r-project.org/web/licenses/GPL-3
# 
# I am happy to answer questions at matzke@nimbios.org, but
# I am more happy to answer questions on the 
# BioGeoBEARS google group
#
# The package is designed for ML and Bayesian inference
# of 
# 
# (a) ancestral geographic ranges, and 
# 
# (b) perhaps more importantly, models for the 
#     evolution of geographic range across a phylogeny.
#
# The example below implements and compares:
# 
# (1) The standard 2-parameter DEC model implemented in 
#     the program LAGRANGE (Ree & Smith 2008); users will
#     notice that the ML parameter inference and log-
#     likelihoods are identical
#
# (2) A DEC+J model implemented in BioGeoBEARS, wherein
#     a third parameter, j, is added, representing the 
#     relative per-event weight of founder-event / jump
#     speciation events at cladogenesis events.  The 
#     higher j is, the more probability these events have,
#     and the less probability the standard LAGRANGE
#     cladogenesis events have.
#
# (3) Some standard model-testing (LRT and AIC) is 
#     implemented at the end so that users may compare models
#
# (4) The script does similar tests of a DIVA-like model (Ronquist 1997)
#     and a BAYAREA-like model (Landis, Matzke, Moore, & Huelsenbeck, 2013)
# 
#######################################################

#######################################################
# Installing BioGeoBEARS from GitHub latest version
#######################################################
# CUT 2018: The old instructions to source() online upgrade .R files have been deleted,
#         all updates are now on the GitHub version of the package, version 1.1+
#######################################################
# Paste the stuff below, INSIDE the single-quote (') marks
# but NOT the single-quote marks themselves.
install_cmds_that_work_as_of_2023='

# Installation-from-scratch commands
install.packages("devtools")
install.packages("ape")
install.packages("optimx")
install.packages("GenSA")
install.packages("rexpokit")   
install.packages("cladoRcpp")
install.packages("snow")
install.packages("MultinomialCI")

library(devtools)
devtools::install_github(repo="nmatzke/BioGeoBEARS")

# NOTE: If you get a message like this
# * select "2. CRAN packages only" on "3. None"
# * If you get asked about "binaries" vs. "source", choose "binaries" 
#   (binaries are precompiled and easy to install; installing from source
#    requires that your computer have the correct compilers, which can be
#    challenging if you are not fairly expert)
' # END install_cmds_that_work_as_of_2023

#######################################################
#######################################################

#######################################################
# SETUP -- libraries/BioGeoBEARS updates
#######################################################

# Load the package (after installation, see above).
library(ape)
library(optimx)   # optimx seems better than R's default optim()
library(GenSA)    # GenSA seems better than optimx (but slower) on 5+ parameters, 
                  # seems to sometimes fail on simple problems (2-3 parameters)
library(rexpokit)
library(cladoRcpp)
library(snow)     # (if you want to use multicore functionality; some systems/R versions prefer library(parallel), try either)
library(parallel)
library(BioGeoBEARS)

#######################################################
# SETUP: YOUR WORKING DIRECTORY
#######################################################
# You will need to set your working directory to match your local system

# Note these very handy functions!
# Command "setwd(x)" sets your working directory
# Command "getwd()" gets your working directory and tells you what it is.
# Command "list.files()" lists the files in your working directory
# To get help on any command, use "?".  E.g., "?list.files"

# Set your working directory for output files
# default here is your home directory ("~")
# Change this as you like
wd = np("~")
setwd(wd)

# Double-check your working directory with getwd()
getwd()

#######################################################
# SETUP: Extension data directory
#######################################################
# When R packages contain extra files, they are stored in the "extdata" directory 
# inside the installed package.
#
# BioGeoBEARS contains various example files and scripts in its extdata directory.
# 
# Each computer operating system might install BioGeoBEARS in a different place, 
# depending on your OS and settings. 
# 
# However, you can find the extdata directory like this:
extdata_dir = np(system.file("extdata", package="BioGeoBEARS"))
extdata_dir
list.files(extdata_dir)

# "system.file" looks in the directory of a specified package (in this case BioGeoBEARS)
# The function "np" is just a shortcut for normalizePath(), which converts the 
# path to the format appropriate for your system (e.g., Mac/Linux use "/", but 
# Windows uses "\\", if memory serves).

# Even when using your own data files, you should KEEP these commands in your 
# script, since the plot_BioGeoBEARS_results function needs a script from the 
# extdata directory to calculate the positions of "corners" on the plot. This cannot
# be made into a straight up BioGeoBEARS function because it uses C routines 
# from the package APE which do not pass R CMD check for some reason.

#######################################################
# SETUP: YOUR TREE FILE AND GEOGRAPHY FILE
#######################################################
# Example files are given below. To run your own data,
# make the below lines point to your own files, e.g.
# trfn = "/mydata/frogs/frogBGB/tree.newick"
# geogfn = "/mydata/frogs/frogBGB/geog.data"

#######################################################
# Phylogeny file
# Notes: 
# 1. Must be binary/bifurcating: no polytomies
# 2. No negative branchlengths (e.g. BEAST MCC consensus trees sometimes have negative branchlengths)
# 3. Be careful of very short branches, as BioGeoBEARS will interpret ultrashort branches as direct ancestors
# 4. You can use non-ultrametric trees, but BioGeoBEARS will interpret any tips significantly below the 
#    top of the tree as fossils!  This is only a good idea if you actually do have fossils in your tree,
#    as in e.g. Wood, Matzke et al. (2013), Systematic Biology.
# 5. The default settings of BioGeoBEARS make sense for trees where the branchlengths are in units of 
#    millions of years, and the tree is 1-1000 units tall. If you have a tree with a total height of
#    e.g. 0.00001, you will need to adjust e.g. the max values of d and e, or (simpler) multiply all
#    your branchlengths to get them into reasonable units.
# 6. DON'T USE SPACES IN SPECIES NAMES, USE E.G. "_"
#######################################################
# This is the example Newick file for Hawaiian Psychotria
# (from Ree & Smith 2008)
# "trfn" = "tree file name"
trfn = np(paste(addslash(extdata_dir), "Psychotria_5.2.newick", sep=""))

# Look at the raw Newick file:
moref(trfn)

# Look at your phylogeny (plots to a PDF, which avoids issues with multiple graphics in same window):
pdffn = "tree.pdf"
pdf(file=pdffn, width=9, height=12)

tr = read.tree(trfn)
tr
plot(tr)
title("Example Psychotria phylogeny from Ree & Smith (2008)")
axisPhylo() # plots timescale

dev.off()
cmdstr = paste0("open ", pdffn)
system(cmdstr)

#######################################################
# Geography file
# Notes:
# 1. This is a PHYLIP-formatted file. This means that in the 
#    first line, 
#    - the 1st number equals the number of rows (species)
#    - the 2nd number equals the number of columns (number of areas)
#    - after a tab, put the areas in parentheses, with spaces: (A B C D)
#
# 1.5. Example first line:
#    10    4    (A B C D)
# 
# 2. The second line, and subsequent lines:
#    speciesA    0110
#    speciesB    0111
#    speciesC    0001
#         ...
# 
# 2.5a. This means a TAB between the species name and the area 0/1s
# 2.5b. This also means NO SPACE AND NO TAB between the area 0/1s.
# 
# 3. See example files at:
#    http://phylo.wikidot.com/biogeobears#files
# 
# 4. Make you understand what a PLAIN-TEXT EDITOR is:
#    http://phylo.wikidot.com/biogeobears#texteditors
#
# 3. The PHYLIP format is the same format used for C++ LAGRANGE geography files.
#
# 4. All names in the geography file must match names in the phylogeny file.
#
# 5. DON'T USE SPACES IN SPECIES NAMES, USE E.G. "_"
#
# 6. Operational taxonomic units (OTUs) should ideally be phylogenetic lineages, 
#    i.e. genetically isolated populations.  These may or may not be identical 
#    with species.  You would NOT want to just use specimens, as each specimen 
#    automatically can only live in 1 area, which will typically favor DEC+J 
#    models.  This is fine if the species/lineages really do live in single areas,
#    but you wouldn't want to assume this without thinking about it at least. 
#    In summary, you should collapse multiple specimens into species/lineages if 
#    data indicates they are the same genetic population.
######################################################

# This is the example geography file for Hawaiian Psychotria
# (from Ree & Smith 2008)
geogfn = np(paste(addslash(extdata_dir), "Psychotria_geog.data", sep=""))

# Look at the raw geography text file:
moref(geogfn)

# Look at your geographic range data:
tipranges = getranges_from_LagrangePHYLIP(lgdata_fn=geogfn)
tipranges

# Maximum range size observed:
max(rowSums(dfnums_to_numeric(tipranges@df)))

# Set the maximum number of areas any species may occupy; this cannot be larger 
# than the number of areas you set up, but it can be smaller.
max_range_size = 4

####################################################
####################################################
# KEY HINT: The number of states (= number of different possible geographic ranges)
# depends on (a) the number of areas and (b) max_range_size.
# If you have more than about 500-600 states, the calculations will get REALLY slow,
# since the program has to exponentiate a matrix of e.g. 600x600.  Often the computer
# will just sit there and crunch, and never get through the calculation of the first
# likelihood.
# 
# (this is also what is usually happening when LAGRANGE hangs: you have too many states!)
#
# To check the number of states for a given number of ranges, try:
numstates_from_numareas(numareas=4, maxareas=4, include_null_range=TRUE)
numstates_from_numareas(numareas=4, maxareas=4, include_null_range=FALSE)
numstates_from_numareas(numareas=4, maxareas=3, include_null_range=TRUE)
numstates_from_numareas(numareas=4, maxareas=2, include_null_range=TRUE)

# Large numbers of areas have problems:
numstates_from_numareas(numareas=10, maxareas=10, include_null_range=TRUE)

# ...unless you limit the max_range_size:
numstates_from_numareas(numareas=10, maxareas=2, include_null_range=TRUE)
####################################################
####################################################

#######################################################
#######################################################
# DEC AND DEC+J ANALYSIS
#######################################################
#######################################################
# NOTE: The BioGeoBEARS "DEC" model is identical with 
# the Lagrange DEC model, and should return identical
# ML estimates of parameters, and the same 
# log-likelihoods, for the same datasets.
#
# Ancestral state probabilities at nodes will be slightly 
# different, since BioGeoBEARS is reporting the 
# ancestral state probabilities under the global ML
# model, and Lagrange is reporting ancestral state
# probabilities after re-optimizing the likelihood
# after fixing the state at each node. These will 
# be similar, but not identical. See Matzke (2014),
# Systematic Biology, for discussion.
#
# Also see Matzke (2014) for presentation of the 
# DEC+J model.
#######################################################
#######################################################

#######################################################
#######################################################

#######################################################
# Run DEC
#######################################################

# Intitialize a default model (DEC model)
BioGeoBEARS_run_object = define_BioGeoBEARS_run()

# Give BioGeoBEARS the location of the phylogeny Newick file
BioGeoBEARS_run_object$trfn = trfn

# Give BioGeoBEARS the location of the geography text file
BioGeoBEARS_run_object$geogfn = geogfn

# Input the maximum range size
BioGeoBEARS_run_object$max_range_size = max_range_size

BioGeoBEARS_run_object$min_branchlength = 0.000001    # Min to treat tip as a direct ancestor (no speciation event)
BioGeoBEARS_run_object$include_null_range = TRUE    # set to FALSE for e.g. DEC* model, DEC*+J, etc.
# (For DEC* and other "*" models, please cite: Massana, Kathryn A.; Beaulieu, 
#  Jeremy M.; Matzke, Nicholas J.; O’Meara, Brian C. (2015). Non-null Effects of 
#  the Null Range in Biogeographic Models: Exploring Parameter Estimation in the 
#  DEC Model. bioRxiv,  http://biorxiv.org/content/early/2015/09/16/026914 )
# Also: search script on "include_null_range" for other places to change

# Set up a time-stratified analysis:
# 1. Here, un-comment ONLY the files you want to use.
# 2. Also un-comment "BioGeoBEARS_run_object = section_the_tree(...", below.
# 3. For example files see (a) extdata_dir, 
#  or (b) http://phylo.wikidot.com/biogeobears#files
#  and BioGeoBEARS Google Group posts for further hints)
#
# Uncomment files you wish to use in time-stratified analyses:
#BioGeoBEARS_run_object$timesfn = "timeperiods.txt"
#BioGeoBEARS_run_object$dispersal_multipliers_fn = "manual_dispersal_multipliers.txt"
#BioGeoBEARS_run_object$areas_allowed_fn = "areas_allowed.txt"
#BioGeoBEARS_run_object$areas_adjacency_fn = "areas_adjacency.txt"
#BioGeoBEARS_run_object$distsfn = "distances_matrix.txt"
# See notes on the distances model on PhyloWiki's BioGeoBEARS updates page.

# Speed options and multicore processing if desired
BioGeoBEARS_run_object$on_NaN_error = -1e50    # returns very low lnL if parameters produce NaN error (underflow check)
BioGeoBEARS_run_object$speedup = TRUE          # shorcuts to speed ML search; use FALSE if worried (e.g. >3 params)
BioGeoBEARS_run_object$use_optimx = TRUE    # if FALSE, use optim() instead of optimx();
# if "GenSA", use Generalized Simulated Annealing, which seems better on high-dimensional
# problems (5+ parameters), but seems to sometimes fail to optimize on simple problems
BioGeoBEARS_run_object$num_cores_to_use = 1
# (use more cores to speed it up; this requires
# library(parallel) and/or library(snow). The package "parallel" 
# is now default on Macs in R 3.0+, but apparently still 
# has to be typed on some Windows machines. Note: apparently 
# parallel works on Mac command-line R, but not R.app.
# BioGeoBEARS checks for this and resets to 1
# core with R.app)

# Sparse matrix exponentiation is an option for huge numbers of ranges/states (600+)
# I have experimented with sparse matrix exponentiation in EXPOKIT/rexpokit,
# but the results are imprecise and so I haven't explored it further.
# In a Bayesian analysis, it might work OK, but the ML point estimates are
# not identical.
# Also, I have not implemented all functions to work with force_sparse=TRUE.
# Volunteers are welcome to work on it!!
BioGeoBEARS_run_object$force_sparse = FALSE    # force_sparse=TRUE causes pathology & isn't much faster at this scale

# This function loads the dispersal multiplier matrix etc. from the text files into the model object. Required for these to work!
# (It also runs some checks on these inputs for certain errors.)
BioGeoBEARS_run_object = readfiles_BioGeoBEARS_run(BioGeoBEARS_run_object)

# Divide the tree up by timeperiods/strata (uncomment this for stratified analysis)
#BioGeoBEARS_run_object = section_the_tree(inputs=BioGeoBEARS_run_object, make_master_table=TRUE, plot_pieces=FALSE, fossils_older_than=0.001, cut_fossils=FALSE)
# The stratified tree is described in this table:
#BioGeoBEARS_run_object$master_table

# Good default settings to get ancestral states
BioGeoBEARS_run_object$return_condlikes_table = TRUE
BioGeoBEARS_run_object$calc_TTL_loglike_from_condlikes_table = TRUE
BioGeoBEARS_run_object$calc_ancprobs = TRUE    # get ancestral states from optim run

# Set up DEC model
# (nothing to do; defaults)

# Look at the BioGeoBEARS_run_object; it's just a list of settings etc.
BioGeoBEARS_run_object

# This contains the model object
BioGeoBEARS_run_object$BioGeoBEARS_model_object

# This table contains the parameters of the model 
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table

# Run this to check inputs. Read the error messages if you get them!
BioGeoBEARS_run_object = fix_BioGeoBEARS_params_minmax(BioGeoBEARS_run_object=BioGeoBEARS_run_object)
check_BioGeoBEARS_run(BioGeoBEARS_run_object)

# For a slow analysis, run once, then set runslow=FALSE to just 
# load the saved result.
runslow = TRUE
resfn = "Psychotria_DEC_M0_unconstrained_v1.Rdata"
if (runslow)
    {
    res = bears_optim_run(BioGeoBEARS_run_object)
    res    

    save(res, file=resfn)
    resDEC = res
    } else {
    # Loads to "res"
    load(resfn)
    resDEC = res
    }

#######################################################
# Run DEC+J
#######################################################
BioGeoBEARS_run_object = define_BioGeoBEARS_run()
BioGeoBEARS_run_object$trfn = trfn
BioGeoBEARS_run_object$geogfn = geogfn
BioGeoBEARS_run_object$max_range_size = max_range_size
BioGeoBEARS_run_object$min_branchlength = 0.000001    # Min to treat tip as a direct ancestor (no speciation event)
BioGeoBEARS_run_object$include_null_range = TRUE    # set to FALSE for e.g. DEC* model, DEC*+J, etc.
# (For DEC* and other "*" models, please cite: Massana, Kathryn A.; Beaulieu, 
#  Jeremy M.; Matzke, Nicholas J.; O’Meara, Brian C. (2015). Non-null Effects of 
#  the Null Range in Biogeographic Models: Exploring Parameter Estimation in the 
#  DEC Model. bioRxiv,  http://biorxiv.org/content/early/2015/09/16/026914 )
# Also: search script on "include_null_range" for other places to change

# Set up a time-stratified analysis:
#BioGeoBEARS_run_object$timesfn = "timeperiods.txt"
#BioGeoBEARS_run_object$dispersal_multipliers_fn = "manual_dispersal_multipliers.txt"
#BioGeoBEARS_run_object$areas_allowed_fn = "areas_allowed.txt"
#BioGeoBEARS_run_object$areas_adjacency_fn = "areas_adjacency.txt"
#BioGeoBEARS_run_object$distsfn = "distances_matrix.txt"
# See notes on the distances model on PhyloWiki's BioGeoBEARS updates page.

# Speed options and multicore processing if desired
BioGeoBEARS_run_object$on_NaN_error = -1e50    # returns very low lnL if parameters produce NaN error (underflow check)
BioGeoBEARS_run_object$speedup = TRUE          # shorcuts to speed ML search; use FALSE if worried (e.g. >3 params)
BioGeoBEARS_run_object$use_optimx = TRUE    # if FALSE, use optim() instead of optimx();
# if "GenSA", use Generalized Simulated Annealing, which seems better on high-dimensional
# problems (5+ parameters), but seems to sometimes fail to optimize on simple problems
BioGeoBEARS_run_object$num_cores_to_use = 1
BioGeoBEARS_run_object$force_sparse = FALSE    # force_sparse=TRUE causes pathology & isn't much faster at this scale

# This function loads the dispersal multiplier matrix etc. from the text files into the model object. Required for these to work!
# (It also runs some checks on these inputs for certain errors.)
BioGeoBEARS_run_object = readfiles_BioGeoBEARS_run(BioGeoBEARS_run_object)

# Divide the tree up by timeperiods/strata (uncomment this for stratified analysis)
#BioGeoBEARS_run_object = section_the_tree(inputs=BioGeoBEARS_run_object, make_master_table=TRUE, plot_pieces=FALSE, fossils_older_than=0.001, cut_fossils=FALSE)
# The stratified tree is described in this table:
#BioGeoBEARS_run_object$master_table

# Good default settings to get ancestral states
BioGeoBEARS_run_object$return_condlikes_table = TRUE
BioGeoBEARS_run_object$calc_TTL_loglike_from_condlikes_table = TRUE
BioGeoBEARS_run_object$calc_ancprobs = TRUE    # get ancestral states from optim run

# Set up DEC+J model
# Get the ML parameter values from the 2-parameter nested model
# (this will ensure that the 3-parameter model always does at least as good)
dstart = resDEC$outputs@params_table["d","est"]
estart = resDEC$outputs@params_table["e","est"]
jstart = 0.0001

# Input starting values for d, e
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["d","init"] = dstart
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["d","est"] = dstart
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["e","init"] = estart
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["e","est"] = estart

# Add j as a free parameter
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["j","type"] = "free"
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["j","init"] = jstart
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["j","est"] = jstart

BioGeoBEARS_run_object = fix_BioGeoBEARS_params_minmax(BioGeoBEARS_run_object=BioGeoBEARS_run_object)
check_BioGeoBEARS_run(BioGeoBEARS_run_object)

resfn = "Psychotria_DEC+J_M0_unconstrained_v1.Rdata"
runslow = TRUE
if (runslow)
    {
    #sourceall("/Dropbox/_njm/__packages/BioGeoBEARS_setup/")

    res = bears_optim_run(BioGeoBEARS_run_object)
    res    

    save(res, file=resfn)

    resDECj = res
    } else {
    # Loads to "res"
    load(resfn)
    resDECj = res
    }

#######################################################
# PDF plots
#######################################################
pdffn = "Psychotria_DEC_vs_DEC+J_M0_unconstrained_v1.pdf"
pdf(pdffn, height=6, width=6)

#######################################################
# Plot ancestral states - DEC
#######################################################
analysis_titletxt ="BioGeoBEARS DEC on Psychotria M0_unconstrained"

# Setup
results_object = resDEC
scriptdir = np(system.file("extdata/a_scripts", package="BioGeoBEARS"))

# States
res2 = plot_BioGeoBEARS_results(results_object, analysis_titletxt, addl_params=list("j"), plotwhat="text", label.offset=0.45, tipcex=0.7, statecex=0.7, splitcex=0.6, titlecex=0.8, plotsplits=TRUE, cornercoords_loc=scriptdir, include_null_range=TRUE, tr=tr, tipranges=tipranges)

# Pie chart
plot_BioGeoBEARS_results(results_object, analysis_titletxt, addl_params=list("j"), plotwhat="pie", label.offset=0.45, tipcex=0.7, statecex=0.7, splitcex=0.6, titlecex=0.8, plotsplits=TRUE, cornercoords_loc=scriptdir, include_null_range=TRUE, tr=tr, tipranges=tipranges)

#######################################################
# Plot ancestral states - DECJ
#######################################################
analysis_titletxt ="BioGeoBEARS DEC+J on Psychotria M0_unconstrained"

# Setup
results_object = resDECj
scriptdir = np(system.file("extdata/a_scripts", package="BioGeoBEARS"))

# States
res1 = plot_BioGeoBEARS_results(results_object, analysis_titletxt, addl_params=list("j"), plotwhat="text", label.offset=0.45, tipcex=0.7, statecex=0.7, splitcex=0.6, titlecex=0.8, plotsplits=TRUE, cornercoords_loc=scriptdir, include_null_range=TRUE, tr=tr, tipranges=tipranges)

# Pie chart
plot_BioGeoBEARS_results(results_object, analysis_titletxt, addl_params=list("j"), plotwhat="pie", label.offset=0.45, tipcex=0.7, statecex=0.7, splitcex=0.6, titlecex=0.8, plotsplits=TRUE, cornercoords_loc=scriptdir, include_null_range=TRUE, tr=tr, tipranges=tipranges)

dev.off()  # Turn off PDF
cmdstr = paste("open ", pdffn, sep="")
system(cmdstr) # Plot it

#######################################################
#######################################################
# DIVALIKE AND DIVALIKE+J ANALYSIS
#######################################################
#######################################################
# NOTE: The BioGeoBEARS "DIVALIKE" model is not identical with 
# Ronquist (1997)'s parsimony DIVA. It is a likelihood
# interpretation of DIVA, constructed by modelling DIVA's
# processes the way DEC does, but only allowing the 
# processes DIVA allows (widespread vicariance: yes; subset
# sympatry: no; see Ronquist & Sanmartin 2011, Figure 4).
#
# DIVALIKE is a likelihood interpretation of parsimony
# DIVA, and it is "like DIVA" -- similar to, but not
# identical to, parsimony DIVA.
#
# I thus now call the model "DIVALIKE", and you should also. ;-)
#######################################################
#######################################################

#######################################################
# Run DIVALIKE
#######################################################
BioGeoBEARS_run_object = define_BioGeoBEARS_run()
BioGeoBEARS_run_object$trfn = trfn
BioGeoBEARS_run_object$geogfn = geogfn
BioGeoBEARS_run_object$max_range_size = max_range_size
BioGeoBEARS_run_object$min_branchlength = 0.000001    # Min to treat tip as a direct ancestor (no speciation event)
BioGeoBEARS_run_object$include_null_range = TRUE    # set to FALSE for e.g. DEC* model, DEC*+J, etc.
# (For DEC* and other "*" models, please cite: Massana, Kathryn A.; Beaulieu, 
#  Jeremy M.; Matzke, Nicholas J.; O’Meara, Brian C. (2015). Non-null Effects of 
#  the Null Range in Biogeographic Models: Exploring Parameter Estimation in the 
#  DEC Model. bioRxiv,  http://biorxiv.org/content/early/2015/09/16/026914 )
# Also: search script on "include_null_range" for other places to change

# Set up a time-stratified analysis:
#BioGeoBEARS_run_object$timesfn = "timeperiods.txt"
#BioGeoBEARS_run_object$dispersal_multipliers_fn = "manual_dispersal_multipliers.txt"
#BioGeoBEARS_run_object$areas_allowed_fn = "areas_allowed.txt"
#BioGeoBEARS_run_object$areas_adjacency_fn = "areas_adjacency.txt"
#BioGeoBEARS_run_object$distsfn = "distances_matrix.txt"
# See notes on the distances model on PhyloWiki's BioGeoBEARS updates page.

# Speed options and multicore processing if desired
BioGeoBEARS_run_object$on_NaN_error = -1e50    # returns very low lnL if parameters produce NaN error (underflow check)
BioGeoBEARS_run_object$speedup = TRUE          # shorcuts to speed ML search; use FALSE if worried (e.g. >3 params)
BioGeoBEARS_run_object$use_optimx = TRUE    # if FALSE, use optim() instead of optimx();
# if "GenSA", use Generalized Simulated Annealing, which seems better on high-dimensional
# problems (5+ parameters), but seems to sometimes fail to optimize on simple problems
BioGeoBEARS_run_object$num_cores_to_use = 1
BioGeoBEARS_run_object$force_sparse = FALSE    # force_sparse=TRUE causes pathology & isn't much faster at this scale

# This function loads the dispersal multiplier matrix etc. from the text files into the model object. Required for these to work!
# (It also runs some checks on these inputs for certain errors.)
BioGeoBEARS_run_object = readfiles_BioGeoBEARS_run(BioGeoBEARS_run_object)

# Divide the tree up by timeperiods/strata (uncomment this for stratified analysis)
#BioGeoBEARS_run_object = section_the_tree(inputs=BioGeoBEARS_run_object, make_master_table=TRUE, plot_pieces=FALSE, fossils_older_than=0.001, cut_fossils=FALSE)
# The stratified tree is described in this table:
#BioGeoBEARS_run_object$master_table

# Good default settings to get ancestral states
BioGeoBEARS_run_object$return_condlikes_table = TRUE
BioGeoBEARS_run_object$calc_TTL_loglike_from_condlikes_table = TRUE
BioGeoBEARS_run_object$calc_ancprobs = TRUE    # get ancestral states from optim run

# Set up DIVALIKE model
# Remove subset-sympatry
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["s","type"] = "fixed"
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["s","init"] = 0.0
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["s","est"] = 0.0

BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["ysv","type"] = "2-j"
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["ys","type"] = "ysv*1/2"
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["y","type"] = "ysv*1/2"
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["v","type"] = "ysv*1/2"

# Allow classic, widespread vicariance; all events equiprobable
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["mx01v","type"] = "fixed"
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["mx01v","init"] = 0.5
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["mx01v","est"] = 0.5

# No jump dispersal/founder-event speciation
# BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["j","type"] = "free"
# BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["j","init"] = 0.01
# BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["j","est"] = 0.01

BioGeoBEARS_run_object = fix_BioGeoBEARS_params_minmax(BioGeoBEARS_run_object=BioGeoBEARS_run_object)
check_BioGeoBEARS_run(BioGeoBEARS_run_object)

runslow = TRUE
resfn = "Psychotria_DIVALIKE_M0_unconstrained_v1.Rdata"
if (runslow)
    {
    res = bears_optim_run(BioGeoBEARS_run_object)
    res    

    save(res, file=resfn)
    resDIVALIKE = res
    } else {
    # Loads to "res"
    load(resfn)
    resDIVALIKE = res
    }

#######################################################
# Run DIVALIKE+J
#######################################################
BioGeoBEARS_run_object = define_BioGeoBEARS_run()
BioGeoBEARS_run_object$trfn = trfn
BioGeoBEARS_run_object$geogfn = geogfn
BioGeoBEARS_run_object$max_range_size = max_range_size
BioGeoBEARS_run_object$min_branchlength = 0.000001    # Min to treat tip as a direct ancestor (no speciation event)
BioGeoBEARS_run_object$include_null_range = TRUE    # set to FALSE for e.g. DEC* model, DEC*+J, etc.
# (For DEC* and other "*" models, please cite: Massana, Kathryn A.; Beaulieu, 
#  Jeremy M.; Matzke, Nicholas J.; O’Meara, Brian C. (2015). Non-null Effects of 
#  the Null Range in Biogeographic Models: Exploring Parameter Estimation in the 
#  DEC Model. bioRxiv,  http://biorxiv.org/content/early/2015/09/16/026914 )
# Also: search script on "include_null_range" for other places to change

# Set up a time-stratified analysis:
#BioGeoBEARS_run_object$timesfn = "timeperiods.txt"
#BioGeoBEARS_run_object$dispersal_multipliers_fn = "manual_dispersal_multipliers.txt"
#BioGeoBEARS_run_object$areas_allowed_fn = "areas_allowed.txt"
#BioGeoBEARS_run_object$areas_adjacency_fn = "areas_adjacency.txt"
#BioGeoBEARS_run_object$distsfn = "distances_matrix.txt"
# See notes on the distances model on PhyloWiki's BioGeoBEARS updates page.

# Speed options and multicore processing if desired
BioGeoBEARS_run_object$on_NaN_error = -1e50    # returns very low lnL if parameters produce NaN error (underflow check)
BioGeoBEARS_run_object$speedup = TRUE          # shorcuts to speed ML search; use FALSE if worried (e.g. >3 params)
BioGeoBEARS_run_object$use_optimx = TRUE    # if FALSE, use optim() instead of optimx();
# if "GenSA", use Generalized Simulated Annealing, which seems better on high-dimensional
# problems (5+ parameters), but seems to sometimes fail to optimize on simple problems
BioGeoBEARS_run_object$num_cores_to_use = 1
BioGeoBEARS_run_object$force_sparse = FALSE    # force_sparse=TRUE causes pathology & isn't much faster at this scale

# This function loads the dispersal multiplier matrix etc. from the text files into the model object. Required for these to work!
# (It also runs some checks on these inputs for certain errors.)
BioGeoBEARS_run_object = readfiles_BioGeoBEARS_run(BioGeoBEARS_run_object)

# Divide the tree up by timeperiods/strata (uncomment this for stratified analysis)
#BioGeoBEARS_run_object = section_the_tree(inputs=BioGeoBEARS_run_object, make_master_table=TRUE, plot_pieces=FALSE, fossils_older_than=0.001, cut_fossils=FALSE)
# The stratified tree is described in this table:
#BioGeoBEARS_run_object$master_table

# Good default settings to get ancestral states
BioGeoBEARS_run_object$return_condlikes_table = TRUE
BioGeoBEARS_run_object$calc_TTL_loglike_from_condlikes_table = TRUE
BioGeoBEARS_run_object$calc_ancprobs = TRUE    # get ancestral states from optim run

# Set up DIVALIKE+J model
# Get the ML parameter values from the 2-parameter nested model
# (this will ensure that the 3-parameter model always does at least as good)
dstart = resDIVALIKE$outputs@params_table["d","est"]
estart = resDIVALIKE$outputs@params_table["e","est"]
jstart = 0.0001

# Input starting values for d, e
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["d","init"] = dstart
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["d","est"] = dstart
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["e","init"] = estart
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["e","est"] = estart

# Remove subset-sympatry
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["s","type"] = "fixed"
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["s","init"] = 0.0
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["s","est"] = 0.0

BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["ysv","type"] = "2-j"
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["ys","type"] = "ysv*1/2"
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["y","type"] = "ysv*1/2"
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["v","type"] = "ysv*1/2"

# Allow classic, widespread vicariance; all events equiprobable
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["mx01v","type"] = "fixed"
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["mx01v","init"] = 0.5
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["mx01v","est"] = 0.5

# Add jump dispersal/founder-event speciation
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["j","type"] = "free"
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["j","init"] = jstart
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["j","est"] = jstart

# Under DIVALIKE+J, the max of "j" should be 2, not 3 (as is default in DEC+J)
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["j","min"] = 0.00001
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["j","max"] = 1.99999

BioGeoBEARS_run_object = fix_BioGeoBEARS_params_minmax(BioGeoBEARS_run_object=BioGeoBEARS_run_object)
check_BioGeoBEARS_run(BioGeoBEARS_run_object)

resfn = "Psychotria_DIVALIKE+J_M0_unconstrained_v1.Rdata"
runslow = TRUE
if (runslow)
    {
    #sourceall("/Dropbox/_njm/__packages/BioGeoBEARS_setup/")

    res = bears_optim_run(BioGeoBEARS_run_object)
    res    

    save(res, file=resfn)

    resDIVALIKEj = res
    } else {
    # Loads to "res"
    load(resfn)
    resDIVALIKEj = res
    }

pdffn = "Psychotria_DIVALIKE_vs_DIVALIKE+J_M0_unconstrained_v1.pdf"
pdf(pdffn, height=6, width=6)

#######################################################
# Plot ancestral states - DIVALIKE
#######################################################
analysis_titletxt ="BioGeoBEARS DIVALIKE on Psychotria M0_unconstrained"

# Setup
results_object = resDIVALIKE
scriptdir = np(system.file("extdata/a_scripts", package="BioGeoBEARS"))

# States
res2 = plot_BioGeoBEARS_results(results_object, analysis_titletxt, addl_params=list("j"), plotwhat="text", label.offset=0.45, tipcex=0.7, statecex=0.7, splitcex=0.6, titlecex=0.8, plotsplits=TRUE, cornercoords_loc=scriptdir, include_null_range=TRUE, tr=tr, tipranges=tipranges)

# Pie chart
plot_BioGeoBEARS_results(results_object, analysis_titletxt, addl_params=list("j"), plotwhat="pie", label.offset=0.45, tipcex=0.7, statecex=0.7, splitcex=0.6, titlecex=0.8, plotsplits=TRUE, cornercoords_loc=scriptdir, include_null_range=TRUE, tr=tr, tipranges=tipranges)

#######################################################
# Plot ancestral states - DIVALIKE+J
#######################################################
analysis_titletxt ="BioGeoBEARS DIVALIKE+J on Psychotria M0_unconstrained"

# Setup
results_object = resDIVALIKEj
scriptdir = np(system.file("extdata/a_scripts", package="BioGeoBEARS"))

# States
res1 = plot_BioGeoBEARS_results(results_object, analysis_titletxt, addl_params=list("j"), plotwhat="text", label.offset=0.45, tipcex=0.7, statecex=0.7, splitcex=0.6, titlecex=0.8, plotsplits=TRUE, cornercoords_loc=scriptdir, include_null_range=TRUE, tr=tr, tipranges=tipranges)

# Pie chart
plot_BioGeoBEARS_results(results_object, analysis_titletxt, addl_params=list("j"), plotwhat="pie", label.offset=0.45, tipcex=0.7, statecex=0.7, splitcex=0.6, titlecex=0.8, plotsplits=TRUE, cornercoords_loc=scriptdir, include_null_range=TRUE, tr=tr, tipranges=tipranges)

dev.off()
cmdstr = paste("open ", pdffn, sep="")
system(cmdstr)

#######################################################
#######################################################
# BAYAREALIKE AND BAYAREALIKE+J ANALYSIS
#######################################################
#######################################################
# NOTE: As with DIVA, the BioGeoBEARS BayArea-like model is 
# not identical with the full Bayesian model implemented 
# in the "BayArea" program of Landis et al. (2013). 
#
# Instead, this is a simplified likelihood interpretation
# of the model.  Basically, in BayArea and BioGeoBEARS-BAYAREALIKE, 
# "d" and "e" work like they do in the DEC model of Lagrange 
# (and BioGeoBEARS), and then BayArea's cladogenesis assumption
# (which is that nothing in particular happens at cladogenesis) is 
# replicated by BioGeoBEARS.
#
# This leaves out 3 important things that are in BayArea:
# 1. Distance dependence (you can add this with a distances 
#    matrix + the "x" parameter in BioGeoBEARS, however)
# 2. A correction for disallowing "e" events that drive
#    a species extinct (a null geographic range)
# 3. The neat Bayesian sampling of histories, which allows
#    analyses on large numbers of areas.
#
# The main purpose of having a "BAYAREALIKE" model is 
# to test the importance of the cladogenesis model on 
# particular datasets. Does it help or hurt the data 
# likelihood if there is no special cladogenesis process?
# 
# BAYAREALIKE is a likelihood interpretation of BayArea,
# and it is "like BayArea" -- similar to, but not
# identical to, Bayesian BayArea.
# I thus now call the model "BAYAREALIKE", and you should also. ;-)
#######################################################
#######################################################

#######################################################
# Run BAYAREALIKE
#######################################################
BioGeoBEARS_run_object = define_BioGeoBEARS_run()
BioGeoBEARS_run_object$trfn = trfn
BioGeoBEARS_run_object$geogfn = geogfn
BioGeoBEARS_run_object$max_range_size = max_range_size
BioGeoBEARS_run_object$min_branchlength = 0.000001    # Min to treat tip as a direct ancestor (no speciation event)
BioGeoBEARS_run_object$include_null_range = TRUE    # set to FALSE for e.g. DEC* model, DEC*+J, etc.
# (For DEC* and other "*" models, please cite: Massana, Kathryn A.; Beaulieu, 
#  Jeremy M.; Matzke, Nicholas J.; O’Meara, Brian C. (2015). Non-null Effects of 
#  the Null Range in Biogeographic Models: Exploring Parameter Estimation in the 
#  DEC Model. bioRxiv,  http://biorxiv.org/content/early/2015/09/16/026914 )
# Also: search script on "include_null_range" for other places to change

# Set up a time-stratified analysis:
#BioGeoBEARS_run_object$timesfn = "timeperiods.txt"
#BioGeoBEARS_run_object$dispersal_multipliers_fn = "manual_dispersal_multipliers.txt"
#BioGeoBEARS_run_object$areas_allowed_fn = "areas_allowed.txt"
#BioGeoBEARS_run_object$areas_adjacency_fn = "areas_adjacency.txt"
#BioGeoBEARS_run_object$distsfn = "distances_matrix.txt"
# See notes on the distances model on PhyloWiki's BioGeoBEARS updates page.

# Speed options and multicore processing if desired
BioGeoBEARS_run_object$on_NaN_error = -1e50    # returns very low lnL if parameters produce NaN error (underflow check)
BioGeoBEARS_run_object$speedup = TRUE          # shorcuts to speed ML search; use FALSE if worried (e.g. >3 params)
BioGeoBEARS_run_object$use_optimx = TRUE    # if FALSE, use optim() instead of optimx();
# if "GenSA", use Generalized Simulated Annealing, which seems better on high-dimensional
# problems (5+ parameters), but seems to sometimes fail to optimize on simple problems
BioGeoBEARS_run_object$num_cores_to_use = 1
BioGeoBEARS_run_object$force_sparse = FALSE    # force_sparse=TRUE causes pathology & isn't much faster at this scale

# This function loads the dispersal multiplier matrix etc. from the text files into the model object. Required for these to work!
# (It also runs some checks on these inputs for certain errors.)
BioGeoBEARS_run_object = readfiles_BioGeoBEARS_run(BioGeoBEARS_run_object)

# Divide the tree up by timeperiods/strata (uncomment this for stratified analysis)
#BioGeoBEARS_run_object = section_the_tree(inputs=BioGeoBEARS_run_object, make_master_table=TRUE, plot_pieces=FALSE, fossils_older_than=0.001, cut_fossils=FALSE)
# The stratified tree is described in this table:
#BioGeoBEARS_run_object$master_table

# Good default settings to get ancestral states
BioGeoBEARS_run_object$return_condlikes_table = TRUE
BioGeoBEARS_run_object$calc_TTL_loglike_from_condlikes_table = TRUE
BioGeoBEARS_run_object$calc_ancprobs = TRUE    # get ancestral states from optim run

# Set up BAYAREALIKE model
# No subset sympatry
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["s","type"] = "fixed"
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["s","init"] = 0.0
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["s","est"] = 0.0

# No vicariance
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["v","type"] = "fixed"
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["v","init"] = 0.0
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["v","est"] = 0.0

# No jump dispersal/founder-event speciation
# BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["j","type"] = "free"
# BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["j","init"] = 0.01
# BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["j","est"] = 0.01

# Adjust linkage between parameters
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["ysv","type"] = "1-j"
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["ys","type"] = "ysv*1/1"
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["y","type"] = "1-j"

# Only sympatric/range-copying (y) events allowed, and with 
# exact copying (both descendants always the same size as the ancestor)
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["mx01y","type"] = "fixed"
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["mx01y","init"] = 0.9999
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["mx01y","est"] = 0.9999

# Check the inputs; fixing any initial ("init") values outside min/max
BioGeoBEARS_run_object = fix_BioGeoBEARS_params_minmax(BioGeoBEARS_run_object=BioGeoBEARS_run_object)
check_BioGeoBEARS_run(BioGeoBEARS_run_object)

runslow = TRUE
resfn = "Psychotria_BAYAREALIKE_M0_unconstrained_v1.Rdata"
if (runslow)
    {
    res = bears_optim_run(BioGeoBEARS_run_object)
    res    

    save(res, file=resfn)
    resBAYAREALIKE = res
    } else {
    # Loads to "res"
    load(resfn)
    resBAYAREALIKE = res
    }

#######################################################
# Run BAYAREALIKE+J
#######################################################
BioGeoBEARS_run_object = define_BioGeoBEARS_run()
BioGeoBEARS_run_object$trfn = trfn
BioGeoBEARS_run_object$geogfn = geogfn
BioGeoBEARS_run_object$max_range_size = max_range_size
BioGeoBEARS_run_object$min_branchlength = 0.000001    # Min to treat tip as a direct ancestor (no speciation event)
BioGeoBEARS_run_object$include_null_range = TRUE    # set to FALSE for e.g. DEC* model, DEC*+J, etc.
# (For DEC* and other "*" models, please cite: Massana, Kathryn A.; Beaulieu, 
#  Jeremy M.; Matzke, Nicholas J.; O’Meara, Brian C. (2015). Non-null Effects of 
#  the Null Range in Biogeographic Models: Exploring Parameter Estimation in the 
#  DEC Model. bioRxiv,  http://biorxiv.org/content/early/2015/09/16/026914 )
# Also: search script on "include_null_range" for other places to change

# Set up a time-stratified analysis:
#BioGeoBEARS_run_object$timesfn = "timeperiods.txt"
#BioGeoBEARS_run_object$dispersal_multipliers_fn = "manual_dispersal_multipliers.txt"
#BioGeoBEARS_run_object$areas_allowed_fn = "areas_allowed.txt"
#BioGeoBEARS_run_object$areas_adjacency_fn = "areas_adjacency.txt"
#BioGeoBEARS_run_object$distsfn = "distances_matrix.txt"
# See notes on the distances model on PhyloWiki's BioGeoBEARS updates page.

# Speed options and multicore processing if desired
BioGeoBEARS_run_object$on_NaN_error = -1e50    # returns very low lnL if parameters produce NaN error (underflow check)
BioGeoBEARS_run_object$speedup = TRUE          # shorcuts to speed ML search; use FALSE if worried (e.g. >3 params)
BioGeoBEARS_run_object$use_optimx = "GenSA"
BioGeoBEARS_run_object$num_cores_to_use = 1
BioGeoBEARS_run_object$force_sparse = FALSE    # force_sparse=TRUE causes pathology & isn't much faster at this scale

# This function loads the dispersal multiplier matrix etc. from the text files into the model object. Required for these to work!
# (It also runs some checks on these inputs for certain errors.)
BioGeoBEARS_run_object = readfiles_BioGeoBEARS_run(BioGeoBEARS_run_object)

# Divide the tree up by timeperiods/strata (uncomment this for stratified analysis)
#BioGeoBEARS_run_object = section_the_tree(inputs=BioGeoBEARS_run_object, make_master_table=TRUE, plot_pieces=FALSE, fossils_older_than=0.001, cut_fossils=FALSE)
# The stratified tree is described in this table:
#BioGeoBEARS_run_object$master_table

# Good default settings to get ancestral states
BioGeoBEARS_run_object$return_condlikes_table = TRUE
BioGeoBEARS_run_object$calc_TTL_loglike_from_condlikes_table = TRUE
BioGeoBEARS_run_object$calc_ancprobs = TRUE    # get ancestral states from optim run

# Set up BAYAREALIKE+J model
# Get the ML parameter values from the 2-parameter nested model
# (this will ensure that the 3-parameter model always does at least as good)
dstart = resBAYAREALIKE$outputs@params_table["d","est"]
estart = resBAYAREALIKE$outputs@params_table["e","est"]
jstart = 0.0001

# Input starting values for d, e
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["d","init"] = dstart
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["d","est"] = dstart
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["e","init"] = estart
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["e","est"] = estart

# No subset sympatry
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["s","type"] = "fixed"
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["s","init"] = 0.0
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["s","est"] = 0.0

# No vicariance
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["v","type"] = "fixed"
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["v","init"] = 0.0
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["v","est"] = 0.0

# *DO* allow jump dispersal/founder-event speciation (set the starting value close to 0)
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["j","type"] = "free"
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["j","init"] = jstart
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["j","est"] = jstart

# Under BAYAREALIKE+J, the max of "j" should be 1, not 3 (as is default in DEC+J) or 2 (as in DIVALIKE+J)
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["j","max"] = 0.99999

# Adjust linkage between parameters
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["ysv","type"] = "1-j"
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["ys","type"] = "ysv*1/1"
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["y","type"] = "1-j"

# Only sympatric/range-copying (y) events allowed, and with 
# exact copying (both descendants always the same size as the ancestor)
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["mx01y","type"] = "fixed"
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["mx01y","init"] = 0.9999
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["mx01y","est"] = 0.9999

# NOTE (NJM, 2014-04): BAYAREALIKE+J seems to crash on some computers, usually Windows 
# machines. I can't replicate this on my Mac machines, but it is almost certainly
# just some precision under-run issue, when optim/optimx tries some parameter value 
# just below zero.  The "min" and "max" options on each parameter are supposed to
# prevent this, but apparently optim/optimx sometimes go slightly beyond 
# these limits.  Anyway, if you get a crash, try raising "min" and lowering "max" 
# slightly for each parameter:
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["d","min"] = 0.0000001
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["d","max"] = 4.9999999

BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["e","min"] = 0.0000001
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["e","max"] = 4.9999999

BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["j","min"] = 0.00001
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["j","max"] = 0.99999

BioGeoBEARS_run_object = fix_BioGeoBEARS_params_minmax(BioGeoBEARS_run_object=BioGeoBEARS_run_object)
check_BioGeoBEARS_run(BioGeoBEARS_run_object)

resfn = "Psychotria_BAYAREALIKE+J_M0_unconstrained_v1.Rdata"
runslow = TRUE
if (runslow)
    {
    res = bears_optim_run(BioGeoBEARS_run_object)
    res    

    save(res, file=resfn)

    resBAYAREALIKEj = res
    } else {
    # Loads to "res"
    load(resfn)
    resBAYAREALIKEj = res
    }

pdffn = "Psychotria_BAYAREALIKE_vs_BAYAREALIKE+J_M0_unconstrained_v1.pdf"
pdf(pdffn, height=6, width=6)

#######################################################
# Plot ancestral states - BAYAREALIKE
#######################################################
analysis_titletxt ="BioGeoBEARS BAYAREALIKE on Psychotria M0_unconstrained"

# Setup
results_object = resBAYAREALIKE
scriptdir = np(system.file("extdata/a_scripts", package="BioGeoBEARS"))

# States
res2 = plot_BioGeoBEARS_results(results_object, analysis_titletxt, addl_params=list("j"), plotwhat="text", label.offset=0.45, tipcex=0.7, statecex=0.7, splitcex=0.6, titlecex=0.8, plotsplits=TRUE, cornercoords_loc=scriptdir, include_null_range=TRUE, tr=tr, tipranges=tipranges)

# Pie chart
plot_BioGeoBEARS_results(results_object, analysis_titletxt, addl_params=list("j"), plotwhat="pie", label.offset=0.45, tipcex=0.7, statecex=0.7, splitcex=0.6, titlecex=0.8, plotsplits=TRUE, cornercoords_loc=scriptdir, include_null_range=TRUE, tr=tr, tipranges=tipranges)

#######################################################
# Plot ancestral states - BAYAREALIKE+J
#######################################################
analysis_titletxt ="BioGeoBEARS BAYAREALIKE+J on Psychotria M0_unconstrained"

# Setup
results_object = resBAYAREALIKEj
scriptdir = np(system.file("extdata/a_scripts", package="BioGeoBEARS"))

# States
res1 = plot_BioGeoBEARS_results(results_object, analysis_titletxt, addl_params=list("j"), plotwhat="text", label.offset=0.45, tipcex=0.7, statecex=0.7, splitcex=0.6, titlecex=0.8, plotsplits=TRUE, cornercoords_loc=scriptdir, include_null_range=TRUE, tr=tr, tipranges=tipranges)

# Pie chart
plot_BioGeoBEARS_results(results_object, analysis_titletxt, addl_params=list("j"), plotwhat="pie", label.offset=0.45, tipcex=0.7, statecex=0.7, splitcex=0.6, titlecex=0.8, plotsplits=TRUE, cornercoords_loc=scriptdir, include_null_range=TRUE, tr=tr, tipranges=tipranges)

dev.off()
cmdstr = paste("open ", pdffn, sep="")
system(cmdstr)

#########################################################################
#########################################################################
#########################################################################
#########################################################################
# 
# CALCULATE SUMMARY STATISTICS TO COMPARE
# DEC, DEC+J, DIVALIKE, DIVALIKE+J, BAYAREALIKE, BAYAREALIKE+J
# 
#########################################################################
#########################################################################
#########################################################################
#########################################################################

#########################################################################
#########################################################################
# REQUIRED READING:
#
# Practical advice / notes / basic principles on statistical model 
#    comparison in general, and in BioGeoBEARS:
# http://phylo.wikidot.com/advice-on-statistical-model-comparison-in-biogeobears
#########################################################################
#########################################################################

# Set up empty tables to hold the statistical results
restable = NULL
teststable = NULL

#######################################################
# Statistics -- DEC vs. DEC+J
#######################################################
# We have to extract the log-likelihood differently, depending on the 
# version of optim/optimx
LnL_2 = get_LnL_from_BioGeoBEARS_results_object(resDEC)
LnL_1 = get_LnL_from_BioGeoBEARS_results_object(resDECj)

numparams1 = 3
numparams2 = 2
stats = AICstats_2models(LnL_1, LnL_2, numparams1, numparams2)
stats

# DEC, null model for Likelihood Ratio Test (LRT)
res2 = extract_params_from_BioGeoBEARS_results_object(results_object=resDEC, returnwhat="table", addl_params=c("j"), paramsstr_digits=4)
# DEC+J, alternative model for Likelihood Ratio Test (LRT)
res1 = extract_params_from_BioGeoBEARS_results_object(results_object=resDECj, returnwhat="table", addl_params=c("j"), paramsstr_digits=4)

# The null hypothesis for a Likelihood Ratio Test (LRT) is that two models
# confer the same likelihood on the data. See: Brian O'Meara's webpage:
# http://www.brianomeara.info/tutorials/aic
# ...for an intro to LRT, AIC, and AICc

rbind(res2, res1)
tmp_tests = conditional_format_table(stats)

restable = rbind(restable, res2, res1)
teststable = rbind(teststable, tmp_tests)

#######################################################
# Statistics -- DIVALIKE vs. DIVALIKE+J
#######################################################
# We have to extract the log-likelihood differently, depending on the 
# version of optim/optimx
LnL_2 = get_LnL_from_BioGeoBEARS_results_object(resDIVALIKE)
LnL_1 = get_LnL_from_BioGeoBEARS_results_object(resDIVALIKEj)

numparams1 = 3
numparams2 = 2
stats = AICstats_2models(LnL_1, LnL_2, numparams1, numparams2)
stats

# DIVALIKE, null model for Likelihood Ratio Test (LRT)
res2 = extract_params_from_BioGeoBEARS_results_object(results_object=resDIVALIKE, returnwhat="table", addl_params=c("j"), paramsstr_digits=4)
# DIVALIKE+J, alternative model for Likelihood Ratio Test (LRT)
res1 = extract_params_from_BioGeoBEARS_results_object(results_object=resDIVALIKEj, returnwhat="table", addl_params=c("j"), paramsstr_digits=4)

rbind(res2, res1)
conditional_format_table(stats)

tmp_tests = conditional_format_table(stats)

restable = rbind(restable, res2, res1)
teststable = rbind(teststable, tmp_tests)

#######################################################
# Statistics -- BAYAREALIKE vs. BAYAREALIKE+J
#######################################################
# We have to extract the log-likelihood differently, depending on the 
# version of optim/optimx
LnL_2 = get_LnL_from_BioGeoBEARS_results_object(resBAYAREALIKE)
LnL_1 = get_LnL_from_BioGeoBEARS_results_object(resBAYAREALIKEj)

numparams1 = 3
numparams2 = 2
stats = AICstats_2models(LnL_1, LnL_2, numparams1, numparams2)
stats

# BAYAREALIKE, null model for Likelihood Ratio Test (LRT)
res2 = extract_params_from_BioGeoBEARS_results_object(results_object=resBAYAREALIKE, returnwhat="table", addl_params=c("j"), paramsstr_digits=4)
# BAYAREALIKE+J, alternative model for Likelihood Ratio Test (LRT)
res1 = extract_params_from_BioGeoBEARS_results_object(results_object=resBAYAREALIKEj, returnwhat="table", addl_params=c("j"), paramsstr_digits=4)

rbind(res2, res1)
conditional_format_table(stats)

tmp_tests = conditional_format_table(stats)

restable = rbind(restable, res2, res1)
teststable = rbind(teststable, tmp_tests)

#########################################################################
# ASSEMBLE RESULTS TABLES: DEC, DEC+J, DIVALIKE, DIVALIKE+J, BAYAREALIKE, BAYAREALIKE+J
#########################################################################
teststable$alt = c("DEC+J", "DIVALIKE+J", "BAYAREALIKE+J")
teststable$null = c("DEC", "DIVALIKE", "BAYAREALIKE")
row.names(restable) = c("DEC", "DEC+J", "DIVALIKE", "DIVALIKE+J", "BAYAREALIKE", "BAYAREALIKE+J")
restable = put_jcol_after_ecol(restable)
restable

# Look at the results!!
restable
teststable

#######################################################
# Save the results tables for later -- check for e.g.
# convergence issues
#######################################################

# Loads to "restable"
save(restable, file="restable_v1.Rdata")
load(file="restable_v1.Rdata")

# Loads to "teststable"
save(teststable, file="teststable_v1.Rdata")
load(file="teststable_v1.Rdata")

# Also save to text files
write.table(restable, file="restable.txt", quote=FALSE, sep="\t")
write.table(unlist_df(teststable), file="teststable.txt", quote=FALSE, sep="\t")

#######################################################
# Model weights of all six models
#######################################################
restable2 = restable

# With AICs:
AICtable = calc_AIC_column(LnL_vals=restable$LnL, nparam_vals=restable$numparams)
restable = cbind(restable, AICtable)
restable_AIC_rellike = AkaikeWeights_on_summary_table(restable=restable, colname_to_use="AIC")
restable_AIC_rellike = put_jcol_after_ecol(restable_AIC_rellike)
restable_AIC_rellike

# With AICcs -- factors in sample size
samplesize = length(tr$tip.label)
AICtable = calc_AICc_column(LnL_vals=restable$LnL, nparam_vals=restable$numparams, samplesize=samplesize)
restable2 = cbind(restable2, AICtable)
restable_AICc_rellike = AkaikeWeights_on_summary_table(restable=restable2, colname_to_use="AICc")
restable_AICc_rellike = put_jcol_after_ecol(restable_AICc_rellike)
restable_AICc_rellike

# Also save to text files
write.table(restable_AIC_rellike, file="restable_AIC_rellike.txt", quote=FALSE, sep="\t")
write.table(restable_AICc_rellike, file="restable_AICc_rellike.txt", quote=FALSE, sep="\t")

# Save with nice conditional formatting
write.table(conditional_format_table(restable_AIC_rellike), file="restable_AIC_rellike_formatted.txt", quote=FALSE, sep="\t")
write.table(conditional_format_table(restable_AICc_rellike), file="restable_AICc_rellike_formatted.txt", quote=FALSE, sep="\t")

#######################################################
# Your results should be:
#######################################################

# > restable
# 
#                 LnL numparams            d            e         j
# DEC           -34.5         2 3.504546e-02 2.835632e-02 0.0000000
# DEC+J         -20.9         3 1.000000e-12 1.000000e-12 0.1142811
# DIVALIKE      -33.1         2 4.474416e-02 1.000000e-12 0.0000000
# DIVALIKE+J    -21.1         3 2.001000e-09 1.000000e-12 0.1157199
# BAYAREALIKE   -40.3         2 1.738085e-02 3.040188e-01 0.0000000
# BAYAREALIKE+J -21.6         3 1.000000e-12 1.000000e-12 0.1081158

#######################################################
# The p-value of the LRT (Likelihood Ratio Test) tells you whether or not you can reject the
# null hypothesis that DEC and DEC+J confer equal likelihoods on the data
#
# AIC and AICc model weights are also shown, giving a sense of the relative probability of the two models.
#
# (One could easily do model weights between all 6 models, but this is not done here; s
#  see Brian O'Meara's AIC webpage: http://www.brianomeara.info/tutorials/aic )
#######################################################

# > teststable
# 
#              alt        null LnLalt LnLnull DFalt DFnull DF Dstatistic    pval        test       tail  AIC1  AIC2 AICwt1  AICwt2 AICweight_ratio_model1 AICweight_ratio_model2
# 1          DEC+J         DEC -20.95  -34.54     3      2  1      27.19 1.8e-07 chi-squared one-tailed  47.9 73.08   1.00 3.4e-06                 294893                3.4e-06
# 11    DIVALIKE+J    DIVALIKE -21.09  -33.15     3      2  1      24.13 9.0e-07 chi-squared one-tailed 48.17  70.3   1.00 1.6e-05                  63797                1.6e-05
# 12 BAYAREALIKE+J BAYAREALIKE -21.09  -33.15     3      2  1      24.13 9.0e-07 chi-squared one-tailed 48.17  70.3   1.00 1.6e-05                  63797                1.6e-05
#

Old updates

(link to this section)

Manual construction of BioGeoBEARS legends (update February 2014)

(link to this section)

I have added a script on how to build your own legend after running the example script. This could then be manually added to a graphic e.g. in Powerpoint, Photoshop or some other graphics editor. here is the link.

BioGeoBEARS Google Group (update October 1, 2013)

(link to this section)

There was substantial interest in the listserv, so I've created one, and added those who were interested who I remembered. It is invite-only so that we don't get robospam. If I forgot you, or you are interested, email me at gro.soibmin|ekztam#gro.soibmin|ekztam to be added. Once you are added, you can post by emailing: biogeobearsATgooglegroups.com .

The listserv is publicly-readable at: https://groups.google.com/forum/#!forum/biogeobears

The introductory welcome message is: https://groups.google.com/forum/#!topic/biogeobears/H6PpmRJNxeg

Interested in listserv? (update September 10, 2013)

(link to this section)

Even while I've been moving, I have gotten many emails about BioGeoBEARS. I am happy to answer them individually, but it might benefit people if there was an email listserv. Email me (gro.soibmin|ekztam#gro.soibmin|ekztam) if you would like me to put you on such a listserv. It would be used for (1) BioGeoBEARS updates and announcements and (2) answers to questions.

Solved issue with "optimx" dependency (resolved Jan. 2014)

(link to this section)

The optimx package, which BioGeoBEARS uses, has been updated and this breaks a final step in the BioGeoBEARS function that assembles the output from optimx. I will fix this when I am able (I have just moved from California to a postdoc at NIMBioS, see National Institute of Mathematical and Biological Synthesis, www.nimbios.org). For now, if you hit an error message (not just a warning) after the maximum likelihood search has run, do the following:

  • 2. Exit R and restart R.
  • 3. Type:
remove.packages("optimx")

…to get rid of the old version of the package.

  • 4. Either (a) use the clickable menus to install the 2012 version of optimx (e.g. in the Macintosh R.app, do: Menu bar —> Packages & Data —> Package Installer —> change "CRAN (binaries)" to "Local Source Package" (or to "Local Package Directory" if you unzipped it) —> click "Install" —> navigate to the 2012 optimx download and click "Open"

…or…

(b) Via command-line, try something like:

install.packages(pkgs="PUT_THE_FULL_PATH_AND_FILENAME_OF_2012_OPTIMX_HERE", repos=NULL, type='source')

library(optimx)
  • 5. You can check which version of optimx you have library-ed by typing:
sessionInfo()

For now: optimx_2012.04.01=GOOD, optimx_2013.08.06=BAD.

The fix is not hard, I will do it once I move into my new office.

UPDATE 7/31/2013: Version 0.2.1 is on CRAN, version 0.2.2-2 is available on github or at the bottom of this page.

UPDATE 7/31/2013: I have updated the Psychotria example to show these models: DEC, DEC+J, DIVA-like, DIVA-like+J, BAYAREA-like, and BAYAREA-like+J.

UPDATE 6/23/2013: I have just uploaded my Evolution 2013 talk, which was Sunday, 11:45 am at the Ernst Mayr symposium. The PDF can be found at the bottom of the file. I have also uploaded a zipped source package for BioGeoBEARS. This is more up-to-date than github. It is the same as what I just submitted to CRAN as 0.1.1.

See also by iEvoBio Lightning Talk, 1:30 pm Tuesday.

temp_run

File nameFile typeSize
2015-01-10_Matzke_IBS_stochastic_mapping_v3.pdfPDF document7.59 MBInfo
2017-01-08_Matzke_BioGeoBEARS_traits_v5.pdfPDF document23.19 MBInfo
anolis_morph_BDSKY_1birthRate_WORKS.xmlXML document text32.56 kBInfo
anolis_morph_BDSKY_2birthRates_BROKEN.xmlXML document text32.56 kBInfo
areas_adjacency.txtASCII text209 BytesInfo
areas_allowed_all1s.txtASCII text209 BytesInfo
areas_allowed.txtASCII text209 BytesInfo
BioGeoBEARS_add_fossils_randomly_v1.RASCII English text1.97 kBInfo
BioGeoBEARS_basics_v1.RASCII English text1.97 kBInfo
BioGeoBEARS_calc_transition_matrices_v1.RASCII English text1.97 kBInfo
BioGeoBEARS_classes_v1.RASCII English text1.97 kBInfo
BioGeoBEARS_detection_v1.RASCII English text1.97 kBInfo
BioGeoBEARS_DNA_cladogenesis_sim_v1.RASCII English text1.97 kBInfo
BioGeoBEARS_extract_Qmat_COOmat_v1.RASCII English text1.97 kBInfo
BioGeoBEARS_generics_v1.RASCII English text1.97 kBInfo
BioGeoBEARS_models_v1.RASCII English text1.97 kBInfo
BioGeoBEARS_on_multiple_trees_v1.RASCII English text1.97 kBInfo
BioGeoBEARS_plots_v1.RASCII English text1.97 kBInfo
BioGeoBEARS_readwrite_v1.RASCII English text1.97 kBInfo
BioGeoBEARS_simulate_v1.RASCII English text1.97 kBInfo
BioGeoBEARS_SSEsim_makePlots_v1.RASCII English text1.97 kBInfo
BioGeoBEARS_SSEsim_v1.RASCII English text1.97 kBInfo
BioGeoBEARS_stochastic_mapping_v1.RASCII English text1.97 kBInfo
BioGeoBEARS_stratified_v1.RASCII English text1.97 kBInfo
BioGeoBEARS_supermodel.pngPNG image data92.67 kBInfo
BioGeoBEARS_traits_v1.RASCII English text1.97 kBInfo
BioGeoBEARS_univ_model_v1.RASCII English text1.97 kBInfo
calc_loglike_sp_v01.RASCII English text1.97 kBInfo
calc_uppass_probs_v1.RASCII English text1.97 kBInfo
cladoRcpp.RASCII English text1.97 kBInfo
DEC_DECJ_inf_boxplots_ALL.pdfPDF document129.31 kBInfo
distances_nonstratified_rescaled.txtASCII text53 BytesInfo
distances_nonstratified.txtASCII text63 BytesInfo
distances_rescaled.txtASCII text249 BytesInfo
distances.txtASCII text299 BytesInfo
get_stratified_subbranch_top_downpass_likelihoods_v1.RASCII English text1.97 kBInfo
Hawaii_abbreviations_table.txtASCII text94 BytesInfo
manual_dispersal_multipliers_with_0s_single_timebin.txtASCII text45 BytesInfo
manual_dispersal_multipliers_with_0s.txtASCII text209 BytesInfo
manual_dispersal_multipliers_without_0s.txtASCII text529 BytesInfo
_matrix_utils_v1.RASCII English text1.97 kBInfo
Matzke_2013_PhD_Proquest_biogeography_wSigs.pdfPDF document8.16 MBInfo
Matzke_2014_simcode.zipZip archive data42.81 kBInfo
Psychotria_5.2.newickASCII text816 BytesInfo
Psychotria_geog.dataASCII text441 BytesInfo
runBSM_v1.RASCII English text1.97 kBInfo
running_BioGeoBEARS_multiple_trees.zipZip archive data8.49 MBInfo
simhistory_allsims_observed_nostates_ALL_ARCHIVE_WORKED_100dpi.gifGIF image data16 MBInfo
simhistory_allsims_observed_nostates_ALL_ARCHIVE_WORKED_50dpi.gifGIF image data8.17 MBInfo
simhistory_allsims_observed_nostates_ALL_ARCHIVE_WORKED.pdfPDF document9.19 MBInfo
simtree_stats_boxplots_ALL2.pdfPDF document546.95 kBInfo
startABCD_DEC_DECJ_inf_boxplots_ALL.pdfPDF document295.92 kBInfo
stochastic_map_given_inputs.RASCII English text1.97 kBInfo
stochastic_maps_DEC_vs_DECj_M0_v1_WORKED_ARCHIVE_100dpi.gifGIF image data1.86 MBInfo
stochastic_maps_DEC_vs_DECj_M0_v1_WORKED_ARCHIVE_50dpi.gifGIF image data739.39 kBInfo
stochastic_maps_DEC_vs_DECj_M3_v1_WORKED_ARCHIVE_yes_dotted_lines_100dpi.gifGIF image data1.92 MBInfo
stochastic_maps_DEC_vs_DECj_M3_v1_WORKED_ARCHIVE_yes_dotted_lines_50dpi.gifGIF image data839.17 kBInfo
summarize_BSM_tables_v1.RASCII English text1.97 kBInfo
timeperiods.txtASCII text19 BytesInfo
Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License