Statistical estimation of models via maximum likelihood (ML) and statistical model choice using the likelihood ratio test, AIC, AICc, etc., have become ubiquitous in phylogenetics and many other fields, but they are still somewhat new in historical biogeography. Here I summarize the basic principles and address a few common misunderstandings.
I do not introduce any equations here; rather, my goal is to briefly explain the terminology and concepts as they are used in discussions of BioGeoBEARS.
NOTE: I have also added an Excel spreadsheet, with the basic AIC, AICc, and LRT calculations done by hand, to compare to an example BioGeoBEARS output. See "Files" at the bottom of the page, or this link: http://phylo.wikidot.com/localfiles/adviceonstatisticalmodelcomparisoninbiogeobears/LRT_AIC_tables_v2.xlsx
Terminology
Here are the basic terms used in discussions of likelihood-based estimation and model testing.
 Likelihood = P(data|model) = probability of the data, given a model. Note that this is a technical definition and not identical with colloquial understandings of the word "likelihood".
 LnL = log of the likelihood. Note this is always the NATURAL log.
 Maximum Likelihood (ML) = a statistical technique for estimating model parameters by finding those parameters that maximize the probability of the data under the model (the likelihood). For simple problems, the ML solution can be found analytically by taking the equation that gives the likelihood as a function of a parameter, taking the derivative of the equation, setting it equal to 0, and solving (the derivative is 0 at the point where the slope is flat, presumably the maximum of the curve). For more complex problems, an iterative "hill-climbing" routine is used to find the likelihood peak. There is a wide variety of such algorithms, and some will work better on some problems than others.
 Parameter = A number in the likelihood equation which may vary (if it's a free parameter) or not (if it's a fixed parameter).
 Data = Observations, which do not vary once they are collected.
 AIC = Akaike Information Criterion, a likelihood-based measure of model fit that penalizes more complex models (more complex = more free parameters).
 AICc = Akaike Information Criterion, with correction for sample size (corrected AIC, or second-order Akaike Information Criterion)
Higher LnL corresponds to higher data probability, so higher (less negative) LnLs are better, representing better model fit to the data. LnL = -2 is better than LnL = -5.
To convert LnL to plain likelihood, take e^{LnL}. In R, this is the exp() function; see script below.
For calculation of AIC etc., see Brian O'Meara's webpage and references therein: http://www.brianomeara.info/tutorials/aic
Note that with AIC, AICc, etc., lower is better. Lower AIC indicates better model fit. So, AIC = 10 is better than AIC = 100.
Data have likelihood, not models
It is fairly common to hear researchers talking about the "likelihood of models", or "model A has higher likelihood than model B". While this is not completely horrible, it is not technically correct, and can cause far-reaching confusions in studenthood and beyond. Technically, data have likelihood, and models confer likelihood on data. That is, given a particular probabilistic model and model parameter values, there is a certain probability of producing the data you have observed. The likelihood is the probability of the data given the model, NOT the probability of the model, given the data. To get the latter, you need to use Bayes' Theorem and a Bayesian analysis.
AIC or AICc can be used to calculate model weights and relative likelihoods, which do give a sense of which models are better supported by the data. However, here you are effectively assuming equal prior probability of all of the considered models, and a zero prior probability on any models not considered.
Likelihood and log-likelihood (R script)
Short R script to show how to convert between LnL and likelihood
# Convert a log-likelihood to a plain likelihood/probability:
LnL = -2
likelihood = exp(LnL)
likelihood
# These are the same, showing that the default of log() is base e (i.e., this is the natural log)
log(likelihood)
log(likelihood, base=exp(1))
# exp(1) equals e:
exp(1)
# Compare two likelihoods
LnL1 = -2
likelihood1 = exp(LnL1)
LnL2 = -5
likelihood2 = exp(LnL2)
likelihood1
likelihood2
# You should get:
# > likelihood1
# [1] 0.1353353
# > likelihood2
# [1] 0.006737947
# You can see that #1 is higher than #2, whether you are looking at straight likelihood/probability, or log-likelihood (LnL)
Basics
Goal
The goal of statistical model comparison is to compare the fit of DIFFERENT models to the SAME data. This means that likelihoods, AIC values, etc., can only be compared on the SAME data. Comparisons of these values across different data have no meaning. This is because likelihood means "probability of the data given a model".
(You can, of course, say something like "DEC+J is better than DEC on dataset 1, and DEC is better than DEC+J on dataset 2." You just can't conclude anything from "DEC+J confers LnL -10 on dataset 1, DEC+J confers LnL -100 on dataset 2." Dataset 2 could have lower likelihood just because it is a bigger dataset, for instance. A specific sequence of 100 coin flips has a lower probability than a specific sequence of 10 coin flips.)
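The coin-flip point can be checked with a couple of lines of R. This is only a sketch, assuming a fair coin (probability 0.5 per flip); the variable names are made up for the illustration.

```r
# Log-likelihood of one specific sequence of fair coin flips.
# Each flip has probability 0.5, so n flips give LnL = n * log(0.5).
LnL_10_flips = 10 * log(0.5)
LnL_100_flips = 100 * log(0.5)
LnL_10_flips
LnL_100_flips
# The longer sequence has the much lower LnL, even though the model
# (a fair coin) is exactly the same in both cases.
LnL_100_flips < LnL_10_flips
```

The difference in LnL here says nothing about model fit; it only reflects the amount of data.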
Nesting of models
Model A is nested inside Model B when fixing a parameter in Model B results in a model identical to Model A. For example, in BioGeoBEARS, DEC has two free parameters, d and e, and the parameter j is fixed to 0. DEC+J has three free parameters, d, e, and j. When j=0, DEC+J reduces to DEC. So DEC is nested inside DEC+J, DIVALIKE is nested inside DIVALIKE+J, and BAYAREALIKE nests inside BAYAREALIKE+J.
Any other model comparisons are not nested; e.g., DEC is not nested inside DIVALIKE+J, even though DEC has 2 free parameters and DIVALIKE+J has 3 free parameters. The non-nesting occurs because these models have different fixed assumptions (different fixed parameter values, in BioGeoBEARS) controlling differences in e.g. vicariance, subset sympatry, etc.
Likelihood Ratio Test (LRT)
When two models are nested, the Likelihood Ratio Test (LRT) can be used to test the null hypothesis that the two models confer the same likelihood on the data. This test is just a chi-squared test, with the test statistic (D) being 2*(the difference in log-likelihood), and the degrees of freedom being the difference in the number of free parameters. In Excel, the function to calculate the p-value is CHISQ.DIST.RT.
The LRT can only be used in pairwise fashion, to compare two models. If you have a larger number of models, you could do an LRT on each pair of models (as long as one of these models nests within the other). Note that with this strategy you have to start worrying about multiple-testing bias.
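The LRT calculation described above can be sketched in R as follows. The log-likelihoods are hypothetical numbers invented for illustration, not output from any real BioGeoBEARS run, and the variable names are assumptions. The base R function pchisq() with lower.tail=FALSE gives the right-tail chi-squared probability, which is the same quantity the Excel function returns.

```r
# Hypothetical log-likelihoods, invented for illustration only
LnL_null = -40.0     # simpler model, e.g. DEC (2 free parameters)
LnL_alt = -34.5      # more complex model, e.g. DEC+J (3 free parameters)
numparams_null = 2
numparams_alt = 3

# Test statistic: 2 * (difference in log-likelihood)
D = 2 * (LnL_alt - LnL_null)

# Degrees of freedom: difference in the number of free parameters
deg_free = numparams_alt - numparams_null

# Right-tail probability of the chi-squared distribution
pval = pchisq(D, df=deg_free, lower.tail=FALSE)
D
deg_free
pval
```

A small p-value rejects the null hypothesis that the two models confer the same likelihood on the data. The calculation is only meaningful when the simpler model is nested inside the more complex one.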
Akaike Information Criterion
AIC and AICc can be used to compare two models (or any number of models), whether they are nested or not. There is no statistical theory to identify a strict pvalue cutoff for significance when using AIC or AICc. Instead, AIC/AICc give a measure of relative model probability. See: http://en.wikipedia.org/wiki/Akaike_information_criterion and: http://www.brianomeara.info/tutorials/aic
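As a rough illustration, the standard AIC and AICc formulas and the resulting model weights can be computed in a few lines of R. Everything numeric here is hypothetical: the log-likelihoods are invented, and the sample size n is an assumed value (how sample size should be counted for a particular dataset is a separate judgment call that this sketch does not address).

```r
# Hypothetical inputs, invented for illustration only
LnLs = c(DEC=-40.0, DECj=-34.5, DIVALIKE=-41.2)
numparams = c(DEC=2, DECj=3, DIVALIKE=2)
n = 20   # assumed sample size

# AIC = -2*LnL + 2*k, where k is the number of free parameters
AIC_vals = -2 * LnLs + 2 * numparams

# AICc adds a small-sample correction to AIC
AICc_vals = AIC_vals + (2 * numparams * (numparams + 1)) / (n - numparams - 1)

# Relative likelihoods and model weights, from differences to the best model
delta = AICc_vals - min(AICc_vals)
rel_lik = exp(-0.5 * delta)
weights = rel_lik / sum(rel_lik)

AIC_vals
AICc_vals
round(weights, 3)
```

The weights sum to 1 across the models considered, which is where the assumption of zero prior probability on any unconsidered model comes in.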
Maximum Likelihood (ML)
ML optimization routines and their pitfalls
The LRT, AIC, and AICc all assume that you have actually found the Maximum Likelihood (ML) solution, that is, the model parameter values that confer the maximum likelihood on the data. Functions to search the parameter space for the values that maximize the likelihood of the data are well-developed (see the optim() function in base R, and the optimx R package).
However, these optimization algorithms do not always work perfectly, particularly when the likelihood surface is very flat. This can occur when a model is a very poor fit, or when your starting values for a search are far from the ML values, or when you have accidentally set the bounds of your parameter search such that the ML values of the parameters are outside the bounds. (In the last case, if you are lucky, the ML search will keep hitting this limit, indicating that there might be higher likelihoods attainable beyond the bound.)
Another problem can occur when two parameters are non-identifiable: you will get a "likelihood ridge", and the true values of the parameters might be anywhere along this ridge.
All of these issues get worse as more and more free parameters are added. In BioGeoBEARS, so far we have just been using models with 2 and 3 parameters, and optim/optimx seem to work well in general.
The one optimization problem I have sometimes noticed (in perhaps 1% of datasets) is that the +J model has a lower likelihood under ML search than the corresponding 2-parameter model where j has been fixed to 0. If you see a 3-parameter model with lower likelihood than a 2-parameter model that nests inside it (the nesting is crucial), you have immediate evidence for a problem in optimization, because the 3-parameter model should always be able to get at least equal likelihood to the nested 2-parameter model, since the 3-parameter model contains the 2-parameter model.
This situation probably occurs when the ML value of j is close to 0, but the likelihood surface for j is flat and is interacting with d and e.
The fix, however, is easy: just use the ML parameter estimates from the 2-parameter model as the starting values for the ML optimization of the 3-parameter model. I have now made this the default setup in the example script.
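The idea of seeding the more complex search with the nested model's ML estimates can be sketched with base R's optim() on a toy problem. This is not BioGeoBEARS code: the data are simulated, and the "models" are just a normal distribution with the mean fixed to 0 (simpler model) or free (more complex model), standing in for j fixed to 0 versus j free.

```r
set.seed(1)
x = rnorm(50, mean=0.3, sd=1)   # simulated toy data

# Negative log-likelihood (optim minimizes by default)
negLnL = function(params, x) {
    -sum(dnorm(x, mean=params[1], sd=params[2], log=TRUE))
}

# Simpler, nested model: mean fixed to 0, only sd is free
fit_simple = optim(par=1, fn=function(s, x) negLnL(c(0, s), x), x=x,
                   method="L-BFGS-B", lower=1e-6, upper=10)

# More complex model: start from the simpler model's ML estimates
start_vals = c(0, fit_simple$par)
fit_complex = optim(par=start_vals, fn=negLnL, x=x,
                    method="L-BFGS-B", lower=c(-10, 1e-6), upper=c(10, 10))

LnL_simple = -fit_simple$value
LnL_complex = -fit_complex$value

# Sanity check: the complex model should never do worse than the nested one
LnL_complex >= LnL_simple
```

Because the complex search begins exactly at the nested model's solution, it starts at the nested model's likelihood and can only climb from there.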
Double-checking ML
The more general solution to ML optimization problems is to repeat the search with a variety of different starting parameter values, and see if you keep getting the same ML parameter estimates.
I have not automated this procedure, as most of the time it isn't necessary, and it significantly slows the run time (10 searches will take 10 times longer than 1). BioGeoBEARS makes it easy to change the starting values of parameters, however (the "init" column in the params_table).
This may be particularly useful as a double-check, for example if reviewers ask you to confirm that you are getting the ML parameter estimates.
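A multi-start check of this kind can be sketched generically with base R's optim(). This is a toy example on simulated data, not the BioGeoBEARS procedure (which uses the "init" column of the params_table); the starting values below are arbitrary choices made for the illustration.

```r
set.seed(2)
x = rnorm(50, mean=0.3, sd=1)   # simulated toy data

# Negative log-likelihood of a normal model with free mean and sd
negLnL = function(params, x) {
    -sum(dnorm(x, mean=params[1], sd=params[2], log=TRUE))
}

# Several arbitrary starting points: c(mean, sd)
starts = list(c(0, 1), c(2, 2), c(-2, 3), c(1, 0.5))

# Run one search per starting point
fits = lapply(starts, function(st) {
    optim(par=st, fn=negLnL, x=x, method="L-BFGS-B",
          lower=c(-10, 1e-3), upper=c(10, 10))
})

# Collect estimates and LnL; one row per search
results = t(sapply(fits, function(f) c(mean=f$par[1], sd=f$par[2], LnL=-f$value)))
results

# If all searches found the same peak, this spread should be near 0
max(results[, "LnL"]) - min(results[, "LnL"])
```

If the rows disagree noticeably, the search with the highest LnL is the best candidate, and the disagreement itself is a warning that the likelihood surface is difficult.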
Improving the chance of finding the true ML solution
In most real-life problems (where there is no analytical solution), you can never be absolutely 100% sure that your hill-climbing ML algorithm has found the parameter values that maximize the likelihood of the data. However, there are a number of steps that can be taken if you have indications that optimization is problematic, or if you simply have reason to be worried (e.g., models with >3 free parameters in BioGeoBEARS) or cautious.
None of these strategies are guarantees, of course. The most important strategy is to always look at your data and your various ML results (parameter values, LnL, ancestral states, etc.) and ask yourself if they make sense.
Strategies for improving the search for the ML parameter values include:
 Look at the parameter values and resulting LnL as they are printed to screen during the BioGeoBEARS search. If the likelihood climbs rapidly as the parameters shift, you probably have a strong likelihood gradient. If the parameters and LnL remain stuck near the starting values, your starting parameter values may have been in a flat region of the likelihood surface, causing an optimization problem.
 Set speedup=FALSE in the BioGeoBEARS_run_object. To speed up the optimx/optim search, I modified the optimx/optim defaults to have a higher tolerance and a faster cutoff. When speedup=TRUE, this can cut search times by about 50%, as often, a lot of the search time is spent bouncing around in the tiny space near the peak, estimating the 3rd, 4th, and 5th significant digits of the parameters, which are not particularly useful or relevant. Setting speedup=FALSE will run the optimx/optim defaults, and may fix some optimization problems, but definitely not all of them.
 Start searches for more complex models with the ML parameter values from a simpler, nested model (see above).
 Start searches from a variety of plausible and extreme parameter values, and different combinations of parameter values. If all searches hit the same ML solution, you can be highly confident your search is finding the ML parameters.
 Look at the saved optim_result in the results_object and, in the help for optimx and optim, consult the "Value" description (the "Value" section describes the "value" resulting from the function, i.e. the output/results). This will indicate whether or not the optimx/optim algorithm detected a problem.
 The most thorough thing you can do is a "gridded search", where you take, say, 25 possible values of each free parameter, and then calculate the likelihood of the data for each combination of these values. You can then plot the likelihood surface yourself (e.g. a 3D plot or a contour plot for a 2-parameter problem, or a colored ternary diagram for a 3-parameter search) and see where the peak is, or find ridges or multiple peaks, if those exist. Adding more parameter combinations can explore other regions of parameter space, or increase the resolution in a region of interest (e.g. near the peak). This strategy will work as long as your sampled parameter values span the actual likelihood peak in a reasonable way.
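The gridded search in the last item can be sketched in base R for a 2-parameter toy problem. The model (a normal distribution fitted to simulated data) and the grid ranges are assumptions chosen for illustration; for a real analysis the likelihood function would be the biogeographic model's, and the ranges would need to span plausible parameter values.

```r
set.seed(3)
x = rnorm(50, mean=0.3, sd=1)   # simulated toy data

# Log-likelihood of the data for one combination of parameter values
LnL_fn = function(mu, sigma, x) {
    sum(dnorm(x, mean=mu, sd=sigma, log=TRUE))
}

# 25 values of each free parameter = 625 combinations
mu_vals = seq(-1, 1, length.out=25)
sigma_vals = seq(0.5, 2, length.out=25)
grid = expand.grid(mu=mu_vals, sigma=sigma_vals)
grid$LnL = mapply(LnL_fn, grid$mu, grid$sigma, MoreArgs=list(x=x))

# Grid point with the highest LnL
best = grid[which.max(grid$LnL), ]
best

# Contour plot of the likelihood surface (rows = mu, columns = sigma)
LnL_matrix = matrix(grid$LnL, nrow=length(mu_vals), ncol=length(sigma_vals))
if (interactive()) {
    contour(mu_vals, sigma_vals, LnL_matrix, xlab="mu", ylab="sigma")
}
```

The grid's best point is only as precise as the grid spacing, so it is most useful as a check on where the optimizer ended up, or as a source of starting values, rather than as the final estimate.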
Zen and the Art of Statistical Model Selection
1. The standard (ubiquitous) advice in statistical model choice is "All models are wrong, but some models are useful." (George Box)
2. In other words, even your best-fitting model is probably still wrong: at best, it is a decent approximation of the true model. At worst, it is a horrible, poorly-fitting model. If you only ever use one model, you don't know when the model is fitting relatively well and when it is fitting poorly. The main point of BioGeoBEARS was to enable the creation of different models, so that these issues became testable against datasets.
3. Although you will never know for sure if you have the "true" model (except in cases where the data has been simulated by a computer program with a known model), scientists can use model comparison and model selection procedures to at least determine which models are better than others. By designing models to include or exclude processes that we think might be important, we can let the data tell us which models and which processes seem to be well supported / good fits to the data.
4. The models I initially set up in the example script were just chosen to imitate models/programs currently in use in the literature (see Matzke 2013, Figure 1) — DEC, DIVALIKE, and BAYAREALIKE. Instead of just running different methods, observing that the results differ, and shrugging, I advocate for the position that we should use model choice procedures to choose the models that fit the best.
The "+J" version of each model just adds founder-event speciation, a process which had been ignored in some models, probably because of the residual effects of the vicariance biogeography tradition.
This created six models, and this was a big enough step that it will probably take some time for researchers to try out the models and the model selection procedures and get a sense of their utility. But I actually designed BioGeoBEARS as a supermodel, such that specifying certain parameters can create a variety of other models. Adding constraints on dispersal, distance, changing geography, etc., adds yet more models. So, there are many more models that could be imagined and tested! Remember, it took a few decades for DNA substitution models to evolve from Jukes-Cantor to the full-blown GTR+I+gamma model (and even more recent, more sophisticated models).
(That said, new base models beyond the basic 6 have not been tested much if at all, so users should be aware of possible issues, and take them as experimental until they or others have done some serious study of the performance of the new model. See "issues to watch out for", below.)
Issues to watch out for
1. The optimx/optim optimizers seem to find the ML solution quite reliably on models with 2 or 3 parameters, but the searches will become slower and less reliable with more parameters. One way to check ML searches is to start from different starting values and see if the inference ends up at the same peak.
1a. My general experience with ML optimizers in R is that the base R function optim() (BioGeoBEARS_run_object$use_optimx = FALSE) is somewhat slower and perhaps a bit less successful than optimx. The R package for generalized simulated annealing, GenSA (BioGeoBEARS_run_object$use_optimx = "GenSA"), seems to do best at the more complex problems with many parameters, but it is significantly slower.
2. Another test is to see if the more complex models reliably equal or exceed the likelihood of simpler models nested within them — if they don't, you've got optimization problems.
3. Some combinations of free parameters might be non-identifiable.
4. Some models might be physically absurd.
5. See also BioGeoBEARS mistakes to avoid for other sorts of conceptual mistakes to avoid.
Key Reference: Burnham and Anderson (2002)
The "Bible" of AIC/AICc is:
Burnham, Kenneth P.; Anderson, David R. (1998, first edition; 2002, second edition). Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach. New York, Springer. doi: 10.1007/b97636
Link: http://www.springer.com/gb/book/9780387953649
I like to point out that it has 36,458 citations (as of 2017-04-18): https://scholar.google.com.au/scholar?hl=en&q=Burnham%2C+Kenneth+P.%2C+Anderson%2C+David+R.&btnG=&as_sdt=1%2C5&as_sdtp=
On occasion, I see reviewers or editors that are confused/ambivalent about AIC. Probably this is just due to unfamiliarity: entire generations of scientists were trained in nothing but traditional "frequentist" statistics, and they may get nervous if they do not see a P-value. In such situations, I recommend that authors respond by pointing out that AIC, AICc, etc. are now standard, well-known techniques in science, even if they are new in biogeography. A statistical method with 36,000+ citations should not need to be re-explained, from scratch, in every new scientific article using the method. Citation of Burnham and Anderson (2002) should be enough, and readers who are confused about AIC and related approaches should be referred to Burnham and Anderson (2002), and/or the references below.
References
Below, I list some references that are particularly useful and/or important for work with Akaike Information Criterion and related approaches.
1. Akaike, Hirotugu (1974). "A new look at the statistical model identification." IEEE Transactions on Automatic Control 19(6): 716-723. doi: 10.1109/TAC.1974.1100705 Link: http://dx.doi.org/10.1109/TAC.1974.1100705
 Abstract: The history of the development of statistical hypothesis testing in time series analysis is reviewed briefly and it is pointed out that the hypothesis testing procedure is not adequately defined as the procedure for statistical model identification. The classical maximum likelihood estimation procedure is reviewed and a new estimate minimum information theoretical criterion (AIC) estimate (MAICE) which is designed for the purpose of statistical identification is introduced. When there are several competing models the MAICE is defined by the model and the maximum likelihood estimates of the parameters which give the minimum of AIC defined by AIC = (-2)log(maximum likelihood) + 2(number of independently adjusted parameters within the model). MAICE provides a versatile procedure for statistical model identification which is free from the ambiguities inherent in the application of conventional hypothesis testing procedure. The practical utility of MAICE in time series analysis is demonstrated with some numerical examples.
2. Burnham, Kenneth P.; Anderson, David R. (1998, first edition; 2002, second edition). Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach. New York, Springer. doi: 10.1007/b97636
Link: http://www.springer.com/gb/book/9780387953649
 Abstract: We wrote this book to introduce graduate students and research workers in various scientific disciplines to the use of information-theoretic approaches in the analysis of empirical data. These methods allow the data-based selection of a "best" model and a ranking and weighting of the remaining models in a predefined set. Traditional statistical inference can then be based on this selected best model. However, we now emphasize that information-theoretic approaches allow formal inference to be based on more than one model (multimodel inference). Such procedures lead to more robust inferences in many cases, and we advocate these approaches throughout the book. The second edition was prepared with three goals in mind. First, we have tried to improve the presentation of the material. Boxes now highlight essential expressions and points. Some reorganization has been done to improve the flow of concepts, and a new chapter has been added. Chapters 2 and 4 have been streamlined in view of the detailed theory provided in Chapter 7. Second, concepts related to making formal inferences from more than one model (multimodel inference) have been emphasized throughout the book, but particularly in Chapters 4, 5, and 6. Third, new technical material has been added to Chapters 5 and 6. Well over 100 new references to the technical literature are given. These changes result primarily from our experiences while giving several seminars, workshops, and graduate courses on material in the first edition.
3. Anderson, David R.; Burnham, Kenneth P. (1999). "Understanding information criteria for selection among capture-recapture or ring recovery models." Bird Study 46(sup1): S14-S21. doi: 10.1080/00063659909477227 Link: http://dx.doi.org/10.1080/00063659909477227
 Abstract: We provide background information to allow a heuristic understanding of two types of criteria used in selecting a model for making inferences from ringing data. The first type of criteria (e.g. AIC, AICc, QAICc and TIC) are estimates of (relative) Kullback-Leibler information or distance and attempt to select a good approximating model for inference, based on the principle of parsimony. The second type of criteria (e.g. BIC, MDL, HQ) are 'dimension consistent' in that they attempt to consistently estimate the dimension of the true model. These latter criteria assume that a true model exists, that it is in the set of candidate models and that the goal of model selection is to find the true model, which in turn requires that the sample size is very large. The Kullback-Leibler based criteria do not assume a true model exists, let alone that it is in the set of models being considered. Based on a review of these criteria, we recommend use of criteria that are based on Kullback-Leibler information in the biological sciences.
4. Franklin, Alan B.; Shenk, Tanya M.; Anderson, David R.; Burnham, Kenneth P. (2001). "Statistical Model Selection: An Alternative to Null Hypothesis Testing." Modeling in Natural Resource Management: Development, Interpretation, and Application. Edited by T. M. Shenk and A. B. Franklin. Washington, Island Press: 75-90
Link: https://books.google.com.au/books?id=Uk7rZ7DCvY4C&dq=burnham+and+anderson&lr=&source=gbs_navlinks_s
 Summary: A useful short introduction to AIC and an example of its use, with AIC model weights.
5. Burnham, Kenneth P.; Anderson, David R. (2001). "Kullback-Leibler information as a basis for strong inference in ecological studies." Wildlife Research 28: 111-119. doi: 10.1071/WR99107 Link: http://www.publish.csiro.au/wr/WR99107
 Abstract: We describe an information-theoretic paradigm for analysis of ecological data, based on Kullback–Leibler information, that is an extension of likelihood theory and avoids the pitfalls of null hypothesis testing. Information-theoretic approaches emphasise a deliberate focus on the a priori science in developing a set of multiple working hypotheses or models. Simple methods then allow these hypotheses (models) to be ranked from best to worst and scaled to reflect a strength of evidence using the likelihood of each model (g_{i}), given the data and the models in the set (i.e. L(g_{i} | data)). In addition, a variance component due to model-selection uncertainty is included in estimates of precision. There are many cases where formal inference can be based on all the models in the a priori set and this multimodel inference represents a powerful, new approach to valid inference. Finally, we strongly recommend inferences based on a priori considerations be carefully separated from those resulting from some form of data dredging. An example is given for questions related to age- and sex-dependent rates of tag loss in elephant seals (Mirounga leonina).
6. Anderson, David R.; Burnham, Kenneth P. (2002). "Avoiding Pitfalls When Using Information-Theoretic Methods." The Journal of Wildlife Management 66(3): 912-918. doi: 10.2307/3803155 Link: http://www.jstor.org/stable/3803155
 Abstract: We offer suggestions to avoid misuse of information-theoretic methods in wildlife laboratory and field studies. Our suggestions relate to basic science issues and the need to ask deeper questions (4 problems are noted), errors in the way that analytical methods are used (7 problems), and outright mistakes seen commonly in the published literature (5 problems). We assume that readers are familiar with the information-theoretic approaches and provide several examples of misuse. Any method can be misused; our purpose here is to suggest constructive ways to avoid misuse.
7. Burnham, Kenneth P.; Anderson, David R. (2004). "Multimodel Inference: Understanding AIC and BIC in Model Selection." Sociological Methods & Research 33(2): 261-304. doi: 10.1177/0049124104268644 Link: http://smr.sagepub.com/content/33/2/261.abstract
 Abstract: The model selection literature has been generally poor at reflecting the deep foundations of the Akaike information criterion (AIC) and at making appropriate comparisons to the Bayesian information criterion (BIC). There is a clear philosophy, a sound criterion based in information theory, and a rigorous statistical foundation for AIC. AIC can be justified as Bayesian using a “savvy” prior on models that is a function of sample size and the number of model parameters. Furthermore, BIC can be derived as a non-Bayesian result. Therefore, arguments about using AIC versus BIC for model selection cannot be from a Bayes versus frequentist perspective. The philosophical context of what is assumed about reality, approximating models, and the intent of model-based inference should determine whether AIC or BIC is used. Various facets of such multimodel inference are presented here, particularly methods of model averaging.
8. Posada, David; Buckley, T. R. (2004). "Model Selection and Model Averaging in Phylogenetics: Advantages of Akaike Information Criterion and Bayesian Approaches Over Likelihood Ratio Tests." Systematic Biology 53(5): 793-808. doi: 10.1080/10635150490522304 Link: http://sysbio.oxfordjournals.org/content/53/5/793.abstract
 Abstract: Model selection is a topic of special relevance in molecular phylogenetics that affects many, if not all, stages of phylogenetic inference. Here we discuss some fundamental concepts and techniques of model selection in the context of phylogenetics. We start by reviewing different aspects of the selection of substitution models in phylogenetics from a theoretical, philosophical and practical point of view, and summarize this comparison in table format. We argue that the most commonly implemented model selection approach, the hierarchical likelihood ratio test, is not the optimal strategy for model selection in phylogenetics, and that approaches like the Akaike Information Criterion (AIC) and Bayesian methods offer important advantages. In particular, the latter two methods are able to simultaneously compare multiple nested or non-nested models, assess model selection uncertainty, and allow for the estimation of phylogenies and model parameters using all available models (model-averaged inference or multimodel inference). We also describe how the relative importance of the different parameters included in substitution models can be depicted. To illustrate some of these points, we have applied AIC-based model averaging to 37 mitochondrial DNA sequences from the subgenus Ohomopterus (genus Carabus) ground beetles described by Sota and Vogler (2001).
9. Anderson, David R. (2008). Model Based Inference in the Life Sciences: A Primer on Evidence. New York, Springer. doi: 10.1007/9780387740751 Link: http://www.springer.com/gp/book/9780387740737
 Abstract: The abstract concept of "information" can be quantified and this has led to many important advances in the analysis of data in the empirical sciences. This text focuses on a science philosophy based on "multiple working hypotheses" and statistical models to represent them. The fundamental science question relates to the empirical evidence for hypotheses in this set: a formal strength of evidence. Kullback-Leibler information is the information lost when a model is used to approximate full reality. Hirotugu Akaike found a link between KL information (a cornerstone of information theory) and the maximized log-likelihood (a cornerstone of mathematical statistics). This combination has become the basis for a new paradigm in model based inference. The text advocates formal inference from all the hypotheses/models in the a priori set: multimodel inference. This compelling approach allows a simple ranking of the science hypothesis and their models. Simple methods are introduced for computing the likelihood of model i, given the data; the probability of model i, given the data; and evidence ratios. These quantities represent a formal strength of evidence and are easy to compute and understand, given the estimated model parameters and associated quantities (e.g., residual sum of squares, maximized log-likelihood, and covariance matrices). Additional forms of multimodel inference include model averaging, unconditional variances, and ways to rank the relative importance of predictor variables. This textbook is written for people new to the information-theoretic approaches to statistical inference, whether graduate students, postdocs, or professionals in various universities, agencies or institutes. Readers are expected to have a background in general statistical principles, regression analysis, and some exposure to likelihood methods. This is not an elementary text as it assumes reasonable competence in modeling and parameter estimation.
10. Anderson, David R.; Burnham, Kenneth P. (2006). "AIC Myths and Misunderstandings." Retrieved 2017-04-18 from https://sites.warnercnr.colostate.edu/anderson/wpcontent/uploads/sites/26/2016/11/AICMythsandMisunderstandings.pdf
 Abstract: "Produced and posted by David Anderson and Kenneth Burnham. This site will be updated occasionally. The site is a commentary; we have not spent a great deal of time and effort to refine the wording or be comprehensive in any respect. It is informal and we hope people will benefit from our quick thoughts on various matters."
11. Bolker, Ben M.; Brooks, M. E.; Clark, C. J.; Geange, S. W.; Poulsen, J. R.; Stevens, M. H. H.; White, J.-S. S. (2009). "Generalized linear mixed models: a practical guide for ecology and evolution." Trends in Ecology & Evolution 24(3): 127-135. doi: 10.1016/j.tree.2008.10.008 Link: http://www.sciencedirect.com/science/article/pii/S0169534709000196
 Abstract: How should ecologists and evolutionary biologists analyze nonnormal data that involve random effects? Nonnormal data such as counts or proportions often defy classical statistical procedures. Generalized linear mixed models (GLMMs) provide a more flexible approach for analyzing nonnormal data when random effects are present. The explosion of research on GLMMs in the last decade has generated considerable uncertainty for practitioners in ecology and evolution. Despite the availability of accurate techniques for estimating GLMM parameters in simple cases, complex GLMMs are challenging to fit and statistical inference such as hypothesis testing remains difficult. We review the use (and misuse) of GLMMs in ecology and evolution, discuss estimation and inference and summarize ‘best-practice’ data analysis procedures for scientists facing this challenge.
12. Burnham, Kenneth P.; Anderson, David R.; Huyvaert, K. P. (2011). "AIC model selection and multimodel inference in behavioral ecology: some background, observations, and comparisons." Behavioral Ecology and Sociobiology 65(1): 23-35. doi: 10.1007/s0026501010296 Link: http://dx.doi.org/10.1007/s0026501010296
 Abstract: We briefly outline the information-theoretic (IT) approaches to valid inference including a review of some simple methods for making formal inference from all the hypotheses in the model set (multimodel inference). The IT approaches can replace the usual t tests and ANOVA tables that are so inferentially limited, but still commonly used. The IT methods are easy to compute and understand and provide formal measures of the strength of evidence for both the null and alternative hypotheses, given the data. We give an example to highlight the importance of deriving alternative hypotheses and representing these as probability models. Fifteen technical issues are addressed to clarify various points that have appeared incorrectly in the recent literature. We offer several remarks regarding the future of empirical science and data analysis under an IT framework.
13. Cobb, George (2015). "Mere Renovation is Too Little Too Late: We Need to Rethink our Undergraduate Curriculum from the Ground Up." The American Statistician 69(4): 266-282. doi: 10.1080/00031305.2015.1093029 Link: http://dx.doi.org/10.1080/00031305.2015.1093029
 Abstract: The last half-dozen years have seen The American Statistician publish well-argued and provocative calls to change our thinking about statistics and how we teach it, among them Brown and Kass, Nolan and Temple Lang, and Legler et al. Within this past year, the ASA has issued a new and comprehensive set of guidelines for undergraduate programs (ASA, Curriculum Guidelines for Undergraduate Programs in Statistical Science). Accepting (and applauding) all this as background, the current article argues the need to rethink our curriculum from the ground up, and offers five principles and two caveats intended to help us along the path toward a new synthesis. These principles and caveats rest on my sense of three parallel evolutions: the convergence of trends in the roles of mathematics, computation, and context within statistics education. These ongoing changes, together with the articles cited above and the seminal provocation by Leo Breiman call for a deep rethinking of what we teach to undergraduates. In particular, following Brown and Kass, we should put priority on two goals, to make fundamental concepts accessible and to minimize prerequisites to research.