2014 Summer Research Experiences (SRE) at NIMBioS for Undergraduates and Teachers

This is my tutorial for the 2014 Summer Research Experiences (SRE) at NIMBioS for Undergraduates and Teachers.

Reference: Matzke, Nicholas J. (2014). "Introduction to R, starting from scratch." 2014 Summer Research Experiences (SRE) at NIMBioS for Undergraduates and Teachers. Given Monday, June 16, 9-11 am, Claxton 206, National Institute for Mathematical and Biological Synthesis (NIMBioS). http://nimbios.org/sre/sre2014

```r
# This is a comment
# We will paste in the example R script from:
# http://phylo.wikidot.com/2014-summer-research-experiences-sre-at-nimbios-for-undergra
# Instructions:
# 1. Open a *plain-text* editor (Mac: TextWrangler, BBEdit,
# R.app, RStudio) (Windows: Notepad++, RStudio)
#
# 2. Copy/paste in the example script
#
# 3. Save it as a .R file, in a
# directory like REU_example
#
# 4. Have R open on the left, and the
# text file open on the right
##############################################################
# =============================================
# Introduction to R, starting from scratch
# =============================================
# by Nick Matzke (and whoever else adds to this PhyloWiki page)
# Copyright 2014-infinity
# matzkeATberkeley.edu
# Last update: June 2014
#
# Please link/cite if you use this, email me if you have
# thoughts/improvements/corrections.
#
##############################################################
#
# Reference: Matzke, Nicholas J. (2014). "Introduction to R, starting from scratch."
# 2014 Summer Research Experiences (SRE) at NIMBioS for Undergraduates and Teachers.
# Given Monday, June 16, 9-11 am, Claxton 206, National Institute
# for Mathematical and Biological Synthesis (NIMBioS).
# http://nimbios.org/sre/sre2014
#
# Free to use/redistribute under:
# Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
#
# This program is free software; you can redistribute it and/or
# modify it under the terms of the above license, linked here:
#
# http://creativecommons.org/licenses/by-nc-sa/3.0/
#
# Summary:
#
# You are free:
#
# * to Share — to copy, distribute and transmit the work
# * to Remix — to adapt the work
#
# Under the following conditions:
#
# * Attribution — You must attribute the work in the manner
# specified by the author or licensor (but not in any way that
# suggests that they endorse you or your use of the work).
# * Noncommercial — You may not use this work for commercial purposes.
#
# * Share Alike — If you alter, transform, or build upon this work,
# you may distribute the resulting work only under the same or
# similar license to this one.
#
# http://creativecommons.org/licenses/by-nc-sa/3.0/
#
###################################################################
#######################################################
# CHAPTER 1: PROPAGANDA FOR R
#
# R is a programming language designed primarily for
# data analysis and statistics.
#
# The big advantages of R are:
#
# 1. It is free.
# 2. It is easy.
#
# Point #2 sometimes takes some convincing, especially
# if you haven't programmed before. But, trust me, R
# is WAY easier than ANY other programming language
# I have ever tried that you could also do serious
# science with.
#
# MATLAB is probably the only other competitor for ease
# of use and scientific ability, but MATLAB costs
# hundreds of dollars, and hundreds of dollars more for
# the various extensions (e.g. for statistics, image
# analysis, etc.). This works great when your institution
# has a site license for MATLAB, but it sucks when you
# move to a new school/job.
#
# R is easy because most of the "computer science
# details" -- how to represent numbers and other
# objects in the computer as binary bits/bytes,
# how to manage memory, how to cast and type variables,
# blah blah blah, are done automatically behind the
# scenes.
#
# This means almost anyone can get going with R in
# minutes, by just typing in commands and not having
# to spend days learning the difference between a
# short and long integer, blah blah blah.
#
# That said, the cost of this automation is that R
# is slower than other programming languages. However,
# this doesn't matter for common, basic sorts of
# statistical analyses -- say, linear regression with
# 1,000 data observations. It DOES matter if you are
# dealing with huge datasets -- say, large satellite
# images, or whole genomes.
#
# In these situations, you should use specialist
# software, which is typically written in Python
# (for manipulating textual data, e.g. genome files)
# or Java, C, or C++ (for high-powered computing).
#
# (Although, in many situations, the slow parts of
# R can be re-programmed in C++, and accessed from
# R.)
#
# R is also pretty bad for large, complex programming
# projects. Python and C++ are "object-oriented."
# In computer-programming, "objects" help organize
# your data and tasks. For example, if you are
# writing a video game, you might want to program
# many different monsters. However, you don't want to
# re-program the behavior of each monster from scratch.
# Instead, you create a general object, "monster", and
# give it attributes (speed, armor, etc.). The "monster"
# object takes inputs (like what enemies are close to
# it) and produces outputs (motion or attacks in a
# certain direction).
#
# Each specific type of monster would be an instance
# of the monster class of objects. Each individual
# monster of a specific type would be its own object,
# keeping track of hit points, etc.
#
# You can see that, for serious programming, this
# object-oriented style would be the way to go. Therefore,
# "real" computer-science classes teach you this way
# of programming. This is great if you want to
# go work in the video game industry and devote your
# life to coding.
#
# However, if you just want to plot some data and
# run some statistical tests and do some science,
# you don't want to have to go through a bunch of
# rigamarole first. You just want to load the data
# and plot it and be done. This is what R is for.
#
#######################################################
#######################################################
# CHAPTER 2: GETTING R
#######################################################
#
# R is free and available for all platforms. You can
# download it here:
#
# http://www.r-project.org/
#
# Tip for free, scientific software:
#
# Unless you are doing something advanced, you will want
# the "binary" file rather than the source code.
#
# Programmers write source code in text files.
#
# A compiler program turns this into a "binary" which
# actually executes (runs) on a computer.
#
# Compiling from source code can take minutes or hours,
# and sometimes will crash if your computer & compiler
# are not set up right.
#
# A binary should just work, once you have installed it,
# assuming you've got the binary for your machine.
#
# ASSIGNMENT: Once you have R installed (it will appear in
# "Applications" on a Mac, or "Program Files" on a
# Windows machine), open it to make sure it works.
# Then, return to this tutorial.
#
########################################################
#######################################################
# CHAPTER 3: GET A *PLAIN*-TEXT EDITOR
#######################################################
#
# Many people make the mistake of typing commands
# into R, but not saving those commands.
#
# *ALWAYS* SAVE YOUR COMMANDS IN A TEXT FILE!!
# *ALWAYS* SAVE YOUR COMMANDS IN A TEXT FILE!!
# *ALWAYS* SAVE YOUR COMMANDS IN A TEXT FILE!!
#
# Got it? Good.
#
# The next mistake people make is to use Word or
# some other monstrosity to save their commands.
# You can do this if you want, but the formatting
# etc. just gets in the way.
#
# Find or download a PLAIN-TEXT editor (aka ASCII
# text editor). Common examples:
#
# Mac: TextWrangler (free) or BBEdit
#
# Windows: Notepad (free, search Programs) or Notepad++
#
# Or: versions of R that have a GUI (GUI=Graphical User
# Interface) also have a built-in editor.
#
#
# WHY SAVE YOUR COMMANDS?
#
# Because you can come back in 6 months and run the
# same analysis again, just by pasting the commands
# back into R.
#
# Trust me, this is MUCH better than trying to remember
# what buttons to click in some software.
#
# And, anytime
# you need to do something more than a few times,
# it gets super-annoying to click all of the buttons
# again and again.
#
# This is why most serious scientific software is
# command-line, rather than menu-driven.
#
#
# HOW TO TAKE NOTES IN R SCRIPTS
#
# Put a "#" symbol in front of your comments. Like I
# did here. COMMENTS ARE GOOD! COMMENT EVERYTHING!
#
#
# ASSIGNMENT: Once you've found a plain-text editor,
# return to this tutorial.
#######################################################
#######################################################
# CHAPTER 4: R BASICS
#######################################################
#
# There are three major hurdles in learning R:
#
# 1. Getting/setting your working directory.
#
# 2. Loading your data.
#
# 3. Learning the commands to do what you want.
#
# Points #1 and #2 are easy to learn -- just don't
# forget! You can never get anything significant
# done in R if you can't get your data loaded.
#
# Point #3 -- No one knows "all" of R's commands. As
# we will see, every package and function adds
# additional commands.
#
# Your goal is just to learn the basics, and then learn
# how to find the commands you need.
#
# ASSIGNMENT: Type/paste in each of the commands below
# into your text file, then into R. Take notes as
# you go.
#######################################################
#######################################################
# Working directories:
#
# One of the first things you want to do, usually, is
# decide on your working directory.
#
# You should create a new directory using:
#
# Mac: Finder
# Windows: Windows Explorer (or File Manager or
# whatever it's called these days)
#
# ROOLZ FOR FILES AND DIRECTORIES IN R
#
# 1. Put the directory somewhere you will find it
# again.
#
# 2. Never use spaces in filenames.
#
# 3. Never use spaces in directory names.
#
# 4. Never use spaces in anything involving
# files/directories.
#
# 5. Never! It just causes problems later. The
# problems are fixable, but it's easier to
# just never use spaces.
#
# 6. Use underscore ("_") instead of spaces.
#
#
# FINDING MAC/WINDOWS DIRECTORIES IN R
#
# Usually, you can drag/drop the file or directory
# into R to see the full path to the file.
# Copy this into the 'here' in wd="here", below.
#
#
# CHANGE FILE SETTINGS IN MACS/WINDOWS
#
# Modern Macs/Windows hide a lot of information
# from you. This makes life easier for John Q. Public,
# but makes it harder for scientists.
#
# Good preferences for your file viewer:
#
# * Turn ON viewing file extensions (.txt, .docx, etc.)
# * Turn ON viewing of hidden files
# * Change file viewing to "list" format
#
# See Preferences in Mac Finder or Windows Explorer.
#
########################################################
# On my Mac, this is a working directory I have chosen.
# Pick ONE of the lines below, uncomment it if needed, and
# change it to a directory that exists on YOUR machine:
#wd = "/Users/nickm/Desktop/Rintro/"
#wd = "~/Desktop/Rintro/"
wd="/drives/GDrive/REU_example"
# On a PC, you might have to specify paths like this:
#wd = "c:\\Users\\nick\\Desktop\\_ib200a\\ib200b_sp2011\\lab03"
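# Optional aside: you can also create the working directory
# from inside R, in case it doesn't exist yet (file.exists()
# works on directories as well as files)
if (file.exists(wd) == FALSE)
{
dir.create(wd, recursive=TRUE)
}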
# setwd: set working directory
setwd(wd)
# getwd: get working directory
getwd()
# list.files: list the files in the directory
list.files()
#######################################################
# PLAYING WITH R
#######################################################
# (Preliminary: this might be useful; uncomment if so)
# options(stringsAsFactors = FALSE)
# combine values into a vector with c() ("c" for concatenate)
student_names = c("Nick", "Hillary", "Sonal")
# describe what the function "c" does:
#
grade1 = c(37, 100, 60)
grade2 = c(43, 80, 70)
grade3 = c(100, 90, 100)
grade1
grade2
grade3
print(grade3)
# column bind (cbind)
temp_table = cbind(student_names, grade1, grade2, grade3)
class(temp_table)
# convert to data frame
grade_data = as.data.frame(temp_table)
class(grade_data)
# Don't convert to factors
grade_data = as.data.frame(temp_table, stringsAsFactors=FALSE)
# add column headings
col_headers = c("names", "test1", "test2", "test3")
names(grade_data) = col_headers
print(grade_data)
# change the column names back
old_names = c("student_names", "grade1", "grade2", "grade3")
names(grade_data) = old_names
grade_data$grade1
# Let's calculate some means
# mean of one column
mean(grade_data$grade1)
# R can be very annoying in certain situations, e.g. treating numbers as character data
# What does as.numeric do?
#
as.numeric(grade_data$grade1)
grade_data$grade1 = as.numeric(as.character(grade_data$grade1))
grade_data$grade2 = as.numeric(as.character(grade_data$grade2))
grade_data$grade3 = as.numeric(as.character(grade_data$grade3))
print(grade_data)
# mean of one column
mean(grade_data$grade1)
# apply the mean function over the columns (MARGIN=2), for just the number columns (2, 3, and 4)
apply(X=grade_data[ , 2:4], MARGIN=2, FUN=mean)
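# (A quick sketch of what MARGIN means, on a tiny example
# matrix: MARGIN=1 applies the function across each row,
# MARGIN=2 down each column)
tmpmat = matrix(1:6, nrow=2)
tmpmat
apply(X=tmpmat, MARGIN=1, FUN=sum)
# row sums: 9 12
apply(X=tmpmat, MARGIN=2, FUN=sum)
# column sums: 3 7 11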
# Why doesn't this work?
mean(grade_data)
# What caused the warning message in mean(grade_data)?
# How about this?
colMeans(grade_data[,2:4])
# More functions
sum(grade_data$grade1)
median(grade_data$grade1)
# standard deviation of each column
apply(X=grade_data[ , 2:4], MARGIN=2, FUN=sd)
# store the means and st. devs., and multiply the st. devs. by 2
mean_values = apply(grade_data[ , 2:4], 2, mean)
sd_values = apply(grade_data[ , 2:4], 2, sd)
2 * sd_values
# print to screen even within a function:
print(sd_values)
# row bind (rbind)
grade_data2 = rbind(grade_data, c("means", mean_values), c("stds", sd_values))
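# (An aside: each data.frame column holds only one type, so
# rbind-ing the text labels "means"/"stds" in with the numbers
# coerces those columns to character; check for yourself:)
grade_data2
class(grade_data2$grade1)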
#######################################################
# GETTING DATA
#######################################################
#
# Let's download some data. Francis Galton was one
# of the founders of statistics. He was also
# the cousin of Charles Darwin. Galton invented the
# term "regression". These days, "regression" means
# fitting the best-fit line to a series of x and y
# data points.
#
# But, why is the weird term "regression" used for this?
# What is regressing?
#
# Let's look at Galton's original dataset: the heights
# of parents and children.
#
# Use your web browser to navigate here:
#
# http://www.math.uah.edu/stat/data/Galton.html
#
# ...and save Galton.txt (right-click, save) to your
# working directory.
#
# After doing this, double-click on Galton.txt and
# view the file, just to see what's in there.
#
#######################################################
# Before proceeding, double-check that your data file
# is in the working directory:
getwd()
list.files()
# Let's store the filename in a variable
#
# Note: In Nick's head:
#
# "wd" means "working directory"
# "fn" means "filename"
#
#wd = "/drives/Dropbox/_njm/__packages/Rintro/"
#setwd(wd)
fn = "Galton.txt"
# Now, read the file into a data.frame
heights = read.table(file=fn, header=TRUE, sep="\t")
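# (A useful habit: str() gives a compact summary of any object's
# structure -- column names, types, and the first few values)
str(heights)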
# Now, look at "heights"
heights
# Whoops, that went by fast! Let's just look at the
# top of the data table
head(heights)
# Let's get other information on the data.table
# Column names
names(heights)
# Dimensions (rows, columns)
dim(heights)
# Class (data.frame, matrix, character, numeric, list, etc.)
class(heights)
# The heights data is the adult height of a child (in inches),
# and the "midparent" height -- the mean of the two parents.
# QUESTION: Do the means of parent and child height differ?
# Means
colMeans(heights)
colMeans(heights[,-4])
# Standard deviations
apply(X=heights[,-4], MARGIN=2, FUN=sd)
# Min & Max
apply(X=heights[,-4], MARGIN=2, FUN=min)
apply(X=heights[,-4], MARGIN=2, FUN=max)
# They seem pretty close, but let's do a test
# Make sure numbers columns are numeric
heights$Family = as.numeric(heights$Family)
heights$Father = as.numeric(heights$Father)
heights$Height = as.numeric(heights$Height)
heights$Kids = as.numeric(heights$Kids)
# Let's add the Midparent column
heights[,c("Father","Mother")]
# Take the mean of Father and Mother columns, store in column "Midparent"
heights$Midparent = apply(X=heights[,c("Father","Mother")], MARGIN=1, FUN=mean)
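# (The same calculation, sketched with the shortcut function
# rowMeans(); it should match the Midparent column just created)
head(rowMeans(heights[,c("Father","Mother")]))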
# View the new column
head(heights)
# Population Mean Between Two Independent Samples
# http://www.r-tutor.com/elementary-statistics/inference-about-two-populations/population-mean-between-two-independent-samples
# (change "Child" to "Height")
ttest_result1 = t.test(x=heights$Midparent, y=heights$Height, paired=FALSE, alternative="two.sided")
ttest_result1
# But wait, this test assumes that the samples from each population
# are independent. Do you think parent heights and child heights are
# independent?
# Probably not. Actually, these samples are paired, so let's
# check that:
# Population Mean Between Two Matched Samples
# http://www.r-tutor.com/elementary-statistics/inference-about-two-populations/population-mean-between-two-matched-samples
ttest_result2 = t.test(x=heights$Midparent, y=heights$Height, paired=TRUE, alternative="two.sided")
ttest_result2
# Compare the two:
ttest_result1
ttest_result2
# Interestingly, it looks like parents are slightly taller than the children!
#
# Is this statistically significant?
#
# But is it a large effect? Is it *practically* significant?
#
# Let's plot the histograms
hist(heights$Midparent)
hist(heights$Height)
# That's a little hard to compare, due to the different
# automated scaling of the x-axis.
# Let's fix the x-axis to be (5 feet, 7 feet)
xlims = c(5*12, 7*12)
hist(heights$Midparent, xlim=xlims)
hist(heights$Height, xlim=xlims)
# And fix the y-axis
# Let's fix the y-axis to be (0, 220)
ylims = c(0, 220)
hist(heights$Midparent, xlim=xlims, ylim=ylims)
hist(heights$Height, xlim=xlims, ylim=ylims)
# Let's plot the means and 95% confidence intervals on top
# Midparent values
hist(heights$Midparent, xlim=xlims, ylim=ylims)
# Plot the mean
abline(v=mean(heights$Midparent), lty="dashed", lwd=2, col="blue")
# Plot the interval expected to hold the central 95% of values
# (mean +/- 1.96 SD; strictly, this is the spread of the data,
# not a confidence interval for the mean)
CI_025 = mean(heights$Midparent) - 1.96*sd(heights$Midparent)
CI_975 = mean(heights$Midparent) + 1.96*sd(heights$Midparent)
abline(v=CI_025, lty="dotted", lwd=2, col="blue")
abline(v=CI_975, lty="dotted", lwd=2, col="blue")
# Child values
hist(heights$Height, xlim=xlims, ylim=ylims)
# Plot the mean
abline(v=mean(heights$Height), lty="dashed", lwd=2, col="blue")
# Plot the central 95% interval of the data (mean +/- 1.96 SD)
CI_025 = mean(heights$Height) - 1.96*sd(heights$Height)
CI_975 = mean(heights$Height) + 1.96*sd(heights$Height)
abline(v=CI_025, lty="dotted", lwd=2, col="blue")
abline(v=CI_975, lty="dotted", lwd=2, col="blue")
# Let's put it all in a nice PDF format to save it
# Open a PDF for writing
pdffn = "Galton_height_histograms_v1.pdf"
pdf(file=pdffn, width=8, height=10)
# Do 2 subplots
par(mfrow=c(2,1))
# Midparent values
hist(heights$Midparent, xlim=xlims, ylim=ylims, xlab="height (inches)", ylab="Count", main="Midparent heights")
# Plot the mean
abline(v=mean(heights$Midparent), lty="dashed", lwd=2, col="blue")
# Plot the central 95% interval of the data (mean +/- 1.96 SD)
CI_025 = mean(heights$Midparent) - 1.96*sd(heights$Midparent)
CI_975 = mean(heights$Midparent) + 1.96*sd(heights$Midparent)
abline(v=CI_025, lty="dotted", lwd=2, col="blue")
abline(v=CI_975, lty="dotted", lwd=2, col="blue")
# Child values
hist(heights$Height, xlim=xlims, ylim=ylims, xlab="height (inches)", ylab="Count", main="Child heights")
# Plot the mean
abline(v=mean(heights$Height), lty="dashed", lwd=2, col="blue")
# Plot the central 95% interval of the data (mean +/- 1.96 SD)
CI_025 = mean(heights$Height) - 1.96*sd(heights$Height)
CI_975 = mean(heights$Height) + 1.96*sd(heights$Height)
abline(v=CI_025, lty="dotted", lwd=2, col="blue")
abline(v=CI_975, lty="dotted", lwd=2, col="blue")
# Close the PDF writing
dev.off()
# Write a system command as a text string
cmdstr = paste("open ", pdffn, sep="")
cmdstr
# Send the command to the computer system's Terminal/Command Line
system(cmdstr)
# The PDF should hopefully pop up, e.g. in Preview or the free
# Adobe Reader. (Note: "open" is a Mac command; on Windows,
# shell.exec(pdffn) does the same job.)
# The difference in means is very small, even though it appears to be
# statistically significant.
#
# This is a VERY IMPORTANT lesson:
#
# "statistically significant" DOES NOT ALWAYS MEAN "practically
# significant", "interesting", "scientifically relevant", etc.
#
#
# The difference may have to do with:
#
# * Galton's 'method' of dealing with the fact that
# male and female children have different average heights --
# he multiplied the female heights by 1.08!
#
# * Different nutrition between the generations
#
# * Maybe the adult children weren't quite all fully grown
#
# * Chance rejection of the null
#
# Who knows?
# You may have noticed that the standard deviations look to be
# a lot different. Can we test for this?
# Yes! The null hypothesis is that the ratio of the
# variances is 1:
Ftest_result = var.test(x=heights$Midparent, y=heights$Height, ratio=1, alternative="two.sided")
Ftest_result
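# (A check by hand: the F statistic reported above is just the
# ratio of the two sample variances)
var(heights$Midparent) / var(heights$Height)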
# We get extremely significant rejection of the null. What is
# the likely cause of the lower variance in the midparent data?
#
# For the complex story of Galton's original data, see:
#
# http://www.medicine.mcgill.ca/epidemiology/hanley/galton/
#
# James A. Hanley (2004). 'Transmuting' women into men:
# Galton's family data on human stature. The American Statistician, 58(3) 237-243.
# http://www.medicine.mcgill.ca/epidemiology/hanley/reprints/hanley_article_galton_data.pdf
#
# BTW, Galton was both a genius, and promoted some deeply flawed ideas
# like eugenics:
# http://isteve.blogspot.com/2013/01/regression-toward-mean-and-francis.html
#
# We noted before that child and parent heights might not be
# independent. Let's test this!
# QUESTION: is there a relationship?
# Start by plotting the data:
plot(x=heights$Midparent, y=heights$Height)
# It looks like there is a positive relationship:
# taller parents have taller children.
# However, it's a little bit hard to tell for
# sure, because Galton's data is only measured
# to the half-inch, so many dots are plotting
# on top of each other. We can fix this by
# "jittering" the data:
# Plot the data, with a little jitter
plot(x=jitter(heights$Midparent), y=jitter(heights$Height))
# It looks like there's a positive relationship, which makes
# sense. Can we confirm this with a statistical test?
# Let's build a linear model (lm)
lm_result = lm(formula=Height~Midparent, data=heights)
lm_result
# This just has the coefficients; it doesn't tell us much
# What's in the linear model? A list of items:
names(lm_result)
# See the statistical results
summary(lm_result)
# Analysis of variance (ANOVA)
anova(lm_result)
# You can get some standard diagnostic regression plots with:
plot(lm_result)
# Let's plot the regression line on top of the points
intercept_value = lm_result$coefficients["(Intercept)"]
slope_value = lm_result$coefficients["Midparent"]
# Plot the points
plot(x=jitter(heights$Midparent), y=jitter(heights$Height))
# Add the line
abline(a=intercept_value, b=slope_value, col="blue", lwd=2, lty="dashed")
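# (An aside: predict() gives the fitted child height for any
# midparent height; the column name in newdata must match the
# predictor in the formula)
predict(lm_result, newdata=data.frame(Midparent=70))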
# It's a little hard to tell if the slope is 1:1 or not,
# because the x-axis and y-axis aren't the same.
# Let's fix this
# Plot the points
xlims = c(5*12, 6.5*12)
ylims = c(5*12, 6.5*12)
plot(x=jitter(heights$Midparent, factor=3), y=jitter(heights$Height, factor=3), xlab="Midparent height", ylab="Child height", xlim=xlims, ylim=ylims)
title("Galton's height data")
# Add the regression line
abline(a=intercept_value, b=slope_value, col="blue", lwd=2, lty="dashed")
# Add the 1:1 line
abline(a=0, b=1, col="darkgreen", lwd=2, lty="dashed")
# Is the slope statistically different from 1:1?
# We can test this by subtracting a 1:1 relationship from the data, and seeing if
# the result has a slope different from 0
child_minus_1to1 = heights$Height - (1/1*heights$Midparent)
heights2 = heights
heights2 = cbind(heights2, child_minus_1to1)
# Let's build a linear model (lm)
lm_result2 = lm(formula=child_minus_1to1~Midparent, data=heights2)
lm_result2
# This just has the coefficients; it doesn't tell us much
# What's in the linear model? A list of items:
names(lm_result2)
# See the statistical results
summary(lm_result2)
# Analysis of variance (ANOVA)
anova(lm_result2)
# You can get some standard diagnostic regression plots with:
plot(lm_result2)
# Let's plot the regression line on top of the points
intercept_value = lm_result2$coefficients["(Intercept)"]
slope_value = lm_result2$coefficients["Midparent"]
# Plot the points
plot(x=jitter(heights2$Midparent), y=jitter(heights2$child_minus_1to1), xlim=xlims, xlab="Midparent heights", ylab="Child heights minus 1:1 line", main="Relationship after subtracting 1:1 line")
# Add the regression line
abline(a=intercept_value, b=slope_value, col="blue", lwd=2, lty="dashed")
# Add the expected line if the relationship was 1:1
abline(a=0, b=0, col="darkgreen", lwd=2, lty="dashed")
# Yep, the relationship is definitely different than 1:1
# Why is the relationship between parent height and offspring
# height LESS THAN 1:1???
#
# Why do tall parents tend to produce offspring shorter
# than themselves? Why does height seem to "regress"?
# What about the children of short parents? Do they
# 'regress'?
#
# What are possible statistical consequences/hazards of this?
#
# Why is all of this rarely explained when regression
# is taught?
#
#######################################################
# CHAPTER 5: MAKE YOUR OWN FUNCTIONS, AND DO MAXIMUM LIKELIHOOD
#
# R has many good functions, but it is easy to make your
# own! In fact, this is necessary for some applications.
#
#######################################################
# Let's consider some coin-flip data.
#
# Here are 100 coin flips:
coin_flips = c('H','T','H','T','H','H','T','H','H','H','T','H','H','T','T','T','T','H','H','H','H','H','H','H','H','H','H','H','H','H','H','H','H','T','T','T','H','T','T','T','H','T','T','T','H','H','H','T','T','H','H','H','T','H','H','H','T','T','H','H','H','H','H','H','H','T','T','H','H','H','H','T','T','H','H','H','T','T','H','H','H','H','H','H','T','T','T','H','H','H','H','H','H','T','H','T','H','H','T','T')
coin_flips
# What is your guess at "P_heads", the probability of heads?
#
# What do you think the Maximum Likelihood (ML) estimate would be?
#
# In the case of binomial data, we actually have a formula to calculate
# the ML estimate:
# Find the heads
heads_TF = (coin_flips == "H")
heads_TF
# Find the tails
tails_TF = (coin_flips == "T")
tails_TF
numHeads = sum(heads_TF)
numHeads
numTails = sum(tails_TF)
numTails
numTotal = length(coin_flips)
numTotal
# Here's the formula:
P_heads_ML_estimate = numHeads / numTotal
P_heads_ML_estimate
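# (A cross-check with R's built-in binomial density, dbinom():
# the probability of getting numHeads heads in numTotal flips
# should be higher at the ML estimate than at, say, 0.5)
dbinom(x=numHeads, size=numTotal, prob=P_heads_ML_estimate)
dbinom(x=numHeads, size=numTotal, prob=0.5)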
# Well, duh, that seems pretty obvious. At least it would have been, if we
# weren't thinking of coins, where we have a strong prior belief that the
# coin is probably fair.
# What does it mean to say that this is the "maximum likelihood" estimate of P_heads?
#
# "Likelihood", in statistics, means "the probability of the data under the model"
#
# This technical definition has some interesting consequences:
#
# * Data has likelihood, models do not.
# * The likelihood of noise in my attic, under the model that gremlins
# are having a party up there, is 1.
# Let's calculate the probability of the coin flip data under the
# hypothesis/model that P_heads is 0.5
# We'll be very inefficient, and use a for-loop, and
# if/else statements
# Loop through all 100 flips
# Make a list of the probability of
# each datum
P_heads_guess = 0.5
# Empty list of probabilities
probs_list = rep(NA, times=length(coin_flips))
probs_list
for (i in 1:length(coin_flips))
{
# Print an update
cat("\nAnalysing coin flip #", i, "/", length(coin_flips), sep="")
# Get the current coin flip
coin_flip = coin_flips[i]
# If the coin flip is heads, give that datum
# probability P_heads_guess.
# If tails, give it (1-P_heads_guess)
if (coin_flip == "H")
{
probs_list[i] = P_heads_guess
} # End if heads
if (coin_flip == "T")
{
probs_list[i] = (1-P_heads_guess)
} # End if tails
} # End for-loop
# Look at the resulting probabilities
probs_list
# We get the probability of all the data by multiplying
# all the probabilities
likelihood_of_data_given_P_heads_guess1 = prod(probs_list)
likelihood_of_data_given_P_heads_guess1
# That's a pretty small number! You'll see that it's
# just 0.5^100:
0.5^100
# A probability of 0.5 is not small, but multiply
# 100 values of 0.5 together, and you get a small value.
# That's the probability of that specific sequence of
# heads/tails, given the hypothesis that the true
# probability is P_heads_guess.
# Let's try another probability:
# Loop through all 100 flips
# Make a list of the probability of
# each datum
P_heads_guess = 0.7
# Empty list of probabilities
probs_list = rep(NA, times=length(coin_flips))
probs_list
for (i in 1:length(coin_flips))
{
# Print an update
cat("\nAnalysing coin flip #", i, "/", length(coin_flips), sep="")
# Get the current coin flip
coin_flip = coin_flips[i]
# If the coin flip is heads, give that datum
# probability P_heads_guess.
# If tails, give it (1-P_heads_guess)
if (coin_flip == "H")
{
probs_list[i] = P_heads_guess
} # End if heads
if (coin_flip == "T")
{
probs_list[i] = (1-P_heads_guess)
} # End if tails
} # End for-loop
# Look at the resulting probabilities
probs_list
# We get the probability of all the data by multiplying
# all the probabilities
likelihood_of_data_given_P_heads_guess2 = prod(probs_list)
likelihood_of_data_given_P_heads_guess2
# We got a different likelihood. It's also very small.
# But that's not important. What's important is,
# how many times higher is it?
likelihood_of_data_given_P_heads_guess2 / likelihood_of_data_given_P_heads_guess1
# Whoa! That's a lot higher! This means the coin flip data is 54 times more
# probable under the hypothesis that P_heads=0.7 than under the
# hypothesis that P_heads=0.5.
# Maximum likelihood: You can see that the BEST explanation of the data
# would be the one with the value of P_heads that maximized the probability
# of the data. This would be the Maximum Likelihood solution.
# We could keep copying and pasting code, but that seems annoying. Let's make a function
# instead:
# Function that calculates the probability of coin flip data
# given a value of P_heads_guess
calc_prob_coin_flip_data <- function(P_heads_guess, coin_flips)
{
# Empty list of probabilities
probs_list = rep(NA, times=length(coin_flips))
probs_list
for (i in 1:length(coin_flips))
{
# Print an update
#cat("\nAnalysing coin flip #", i, "/", length(coin_flips), sep="")
# Get the current coin flip
coin_flip = coin_flips[i]
# If the coin flip is heads, give that datum
# probability P_heads_guess.
# If tails, give it (1-P_heads_guess)
if (coin_flip == "H")
{
probs_list[i] = P_heads_guess
} # End if heads
if (coin_flip == "T")
{
probs_list[i] = (1-P_heads_guess)
} # End if tails
} # End for-loop
# Look at the resulting probabilities
probs_list
# We get the probability of all the data by multiplying
# all the probabilities
likelihood_of_data_given_P_heads_guess = prod(probs_list)
# Return result
return(likelihood_of_data_given_P_heads_guess)
}
# Now, we can just use this function:
calc_prob_coin_flip_data(P_heads_guess=0.5, coin_flips=coin_flips)
calc_prob_coin_flip_data(P_heads_guess=0.6, coin_flips=coin_flips)
calc_prob_coin_flip_data(P_heads_guess=0.7, coin_flips=coin_flips)
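# (An aside: the same likelihood can be computed without a
# for-loop, by vectorizing with ifelse(); this sketch should
# match the function's output for the same P_heads_guess)
P_heads_guess = 0.7
prod(ifelse(coin_flips == "H", P_heads_guess, 1 - P_heads_guess))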
# Look at that! We did all of that work in a split-second.
# In fact, we can make another for-loop, and search for the ML
# value of P_heads by trying all of the values and plotting them.
# Sequence of 50 possible values of P_heads between 0 and 1
P_heads_values_to_try = seq(from=0, to=1, length.out=50)
likelihoods = rep(NA, times=length(P_heads_values_to_try))
for (i in 1:length(P_heads_values_to_try))
{
# Get the current guess at P_heads_guess
P_heads_guess = P_heads_values_to_try[i]
# Calculate likelihood of the coin flip data under
# this value of P_heads
likelihood = calc_prob_coin_flip_data(P_heads_guess=P_heads_guess, coin_flips=coin_flips)
# Store the likelihood value
likelihoods[i] = likelihood
} # End for-loop
# Here are the resulting likelihoods:
likelihoods
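# Side note: the for-loop above can also be written in one line with
# sapply(), which applies a function to each element of a vector
# (extra arguments like coin_flips get passed along):
likelihoods_check = sapply(X=P_heads_values_to_try, FUN=calc_prob_coin_flip_data, coin_flips=coin_flips)
# likelihoods_check should match the likelihoods from the loop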
# Let's try plotting the likelihoods to see if there's a peak
plot(x=P_heads_values_to_try, y=likelihoods)
lines(x=P_heads_values_to_try, y=likelihoods)
# Whoa! That's quite a peak! You can see that the likelihoods
# vary over several orders of magnitude.
#
# Partially because of this extreme variation, we often use the
# log-likelihood (natural log, here) instead of the raw
# likelihood.
#
# (Other reasons: machines have a minimum precision, log-likelihoods
# can be added instead of multiplied, AIC is calculated from
# log-likelihood, etc.)
#
#
log_likelihoods = log(likelihoods, base=exp(1))
plot(x=P_heads_values_to_try, y=log_likelihoods)
lines(x=P_heads_values_to_try, y=log_likelihoods)
# Let's plot these together
par(mfrow=c(2,1))
plot(x=P_heads_values_to_try, y=likelihoods, main="Likelihood (L) of the data")
lines(x=P_heads_values_to_try, y=likelihoods)
plot(x=P_heads_values_to_try, y=log_likelihoods, main="Log-likelihood (LnL) of the data")
lines(x=P_heads_values_to_try, y=log_likelihoods)
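# Side note: log() is monotonic (always increasing), so the likelihood
# and the log-likelihood peak at the same value of P_heads.
# which.max() returns the position of the largest element:
P_heads_values_to_try[which.max(likelihoods)]
P_heads_values_to_try[which.max(log_likelihoods)]
# ...these two values should be identical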
# Maximum likelihood optimization
#
# You can see that the maximum likelihood of the data occurs when
# P_heads is somewhere around 0.6 or 0.7. What is it
# exactly?
#
# We could just keep trying more values until we find whatever
# precision we desire. But, R has a function for
# maximum likelihood optimization!
#
# It's called optim(). optim() takes a function as an input.
# Fortunately, we've already written a function!
#
# Let's modify our function a bit to return the log-likelihood,
# and print the result:
# Function that calculates the probability of coin flip data
# given a value of P_heads_guess
calc_prob_coin_flip_data2 <- function(P_heads_guess, coin_flips)
{
# Empty list of probabilities
probs_list = rep(NA, times=length(coin_flips))
probs_list
for (i in 1:length(coin_flips))
{
# Print an update
#cat("\nAnalysing coin flip #", i, "/", length(coin_flips), sep="")
# Get the current coin flip
coin_flip = coin_flips[i]
# If the coin flip is heads, give that datum
# probability P_heads_guess.
# If tails, give it (1-P_heads_guess)
if (coin_flip == "H")
{
probs_list[i] = P_heads_guess
} # End if heads
if (coin_flip == "T")
{
probs_list[i] = (1-P_heads_guess)
} # End if tails
} # End for-loop
# Look at the resulting probabilities
probs_list
# We get the probability of all the data by multiplying
# all the probabilities
likelihood_of_data_given_P_heads_guess = prod(probs_list)
# Get the log-likelihood
LnL = log(likelihood_of_data_given_P_heads_guess)
LnL
# Error correction: if -Inf, reset to a low value
if (is.finite(LnL) == FALSE)
{
LnL = -1000
}
# Print some output
print_txt = paste("\nWhen P_heads=", P_heads_guess, ", LnL=", LnL, sep="")
cat(print_txt)
# Return result
return(LnL)
}
# Try the function out:
LnL = calc_prob_coin_flip_data2(P_heads_guess=0.1, coin_flips=coin_flips)
LnL = calc_prob_coin_flip_data2(P_heads_guess=0.2, coin_flips=coin_flips)
LnL = calc_prob_coin_flip_data2(P_heads_guess=0.3, coin_flips=coin_flips)
# Looks like it works! Let's use optim() to search for the
# best P_heads value:
# Set a starting value of P_heads
starting_value = 0.1
# Set the limits of the search
limit_bottom = 0
limit_top = 1
optim_result = optim(par=starting_value, fn=calc_prob_coin_flip_data2, coin_flips=coin_flips, method="L-BFGS-B", lower=limit_bottom, upper=limit_top, control=list(fnscale=-1))
# You can see the search print out as it proceeds.
# Let's see what ML search decided on:
optim_result
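# Side note: for a 1-parameter problem like this, R also has optimize(),
# which searches a single interval directly. A sketch of the
# equivalent call:
optimize_result = optimize(f=calc_prob_coin_flip_data2, interval=c(0,1), coin_flips=coin_flips, maximum=TRUE)
optimize_result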
# Let's compare the ML estimate of P_heads from the search
# with the analytical formula (the observed fraction of heads):
optim_result$par
# Here's the formula:
P_heads_ML_estimate = numHeads / numTotal
P_heads_ML_estimate
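# Why the formula works (a sketch, with made-up example counts):
# with h heads and t tails, L = p^h * (1-p)^t, so
# LnL = h*log(p) + t*log(1-p). Setting the derivative
# h/p - t/(1-p) to zero and solving gives p = h/(h+t),
# i.e. the observed fraction of heads. Quick numeric check:
example_h = 7	# example count of heads (made up)
example_t = 3	# example count of tails (made up)
p_formula = example_h / (example_h + example_t)
example_LnL = function(p) { example_h*log(p) + example_t*log(1-p) }
# p_formula should beat nearby values:
example_LnL(p_formula) >= example_LnL(p_formula - 0.01)
example_LnL(p_formula) >= example_LnL(p_formula + 0.01)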
# Wow! Pretty good!
#
# But -- why would anyone ever go through all the rigamarole, when they could
# just calculate P_head directly?
#
# Well, only in simple cases do we have a formula for the maximum
# likelihood estimate. The optim() strategy works whether or not
# there is a simple formula.
#
# In real-life science, ML optimization gets used A LOT, but most scientists
# don't learn it until graduate school, if then.
#
# For a real-life example of ML analysis, try the tutorial for my biogeography
# R package, BioGeoBEARS:
#
# http://phylo.wikidot.com/biogeobears#toc16
#
# NOTE: BAYESIAN METHODS
# By the way, having done this ML search, we are very close to being able
# to do a Bayesian MCMC (Markov chain Monte Carlo) analysis. However,
# we don't have time for this today. Come talk to me this
# summer if you are interested!
#######################################################
# CHAPTER 6 (BONUS): R PACKAGES AND MORE R
#
# Many good functions are found in base R, but there
# are many, many more in other R packages
#######################################################
#############################################
# install a package (only have to do once)
#############################################
# Type this:
odd(13)
# What happened?
#
# Now do this:
install.packages("gtools")
# gtools contains functions for odd/even (and many other things)
# after a package is installed, you have to library() it to use its
# functions during an R session
library(gtools)
# Now type this:
odd(13)
# For loops
# Here, we are using for loops, if statements, and the gtools function "odd"
# What does the code in this for loop do?
#
for (i in 1:10)
{
if (odd(i) == TRUE)
{
print(paste(i, "is odd!", sep=" "))
}
else
{
print("Blah!")
}
}
#
paste("This", "is", "fun", sep=" ")
# print() output can be annoying (it adds [1] labels); cat() is cleaner
for (i in 1:10)
{
if (odd(i) == TRUE)
{
cat(paste(i, "is odd!", "\n", sep=" "))
}
else
{
cat("Blah!\n" )
}
}
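# Side note: if you don't want to install gtools, base R's modulo
# operator %% (the remainder after division) tests oddness just as well:
for (i in 1:10)
{
if (i %% 2 == 1)
{
cat(i, "is odd!\n")
} else {
cat("Blah!\n")
}
}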
# How to make your own function
# (Functions can also be loaded with source(), which runs a .R
# script file from your computer or from a URL)
get_square <- function(x)
{
output = x^2
return(output)
}
x = 4
#
(newval = get_square(x))
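# Side note: function arguments can have default values; here
# "power" defaults to 2 unless you say otherwise:
get_power <- function(x, power=2)
{
return(x^power)
}
get_power(4)	# uses the default, same as get_square(4)
get_power(4, power=3)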
# Write to tab-delimited text file
fn = "grade_data.txt"
write.table(grade_data, file=fn, quote=FALSE, sep="\t", row.names=TRUE, col.names=TRUE)
# Read it back in
new_data = read.table(fn, header=TRUE, sep="\t", quote="", stringsAsFactors=FALSE, strip.white=TRUE, fill=TRUE)
# plots and stats
#
plot(new_data$grade1, new_data$grade2)
#
title("Hi Tom!")
#
plot(new_data$grade1, new_data$grade2, xlab="scores for grade #1", ylab="scores for grade #2")
#
lines(new_data$grade1, new_data$grade2)
#
pairs(new_data[, 2:4])
#
cor(new_data[, 2:4])
# A function I stole from somewhere and put in genericR_v1:
pairs_with_hist(new_data[, 2:4])
# CTRL-left or right to arrow through plots
# help on any function
?mean
?sd # (R's standard-deviation function is sd(), not std)
# once you are looking at the help page, go to the bottom & click index to see all the options for the package
# or just e.g.
?gtools
# search for text in help
# (marginally useful)
??histogram
# I like to google: ' "r-help" something something '
# ...since someone has always asked my question on the r-help listserv
###############################################
# Basic crash course in APE
# The R Package APE: Analysis of Phylogenetics
# and Evolution
#
# Paradis's book on APE is linked from the
# course website:
# http://ib.berkeley.edu/courses/ib200b/IB200B_SyllabusHandouts.shtml
# (for Feb. 3)
###############################################
# Install APE
install.packages("ape")
# (This should install some other needed packages also)
library(ape)
# This is what a Newick string looks like:
newick_str = "(((Humans, Chimps), Gorillas), Orangs);"
tr = read.tree(text=newick_str)
plot(tr)
# What is the data class of "tr"?
#
class(tr)
# Is there any difference in the graphic produced by these two commands?
#
plot(tr)
plot.phylo(tr)
# What is the difference in the result of these two help commands?
#
?plot
?plot.phylo
# What are we adding to the tree and the plot of the tree, this time?
#
newick_str = "(((Humans:6.0, Chimps:6.0):1.0, Gorillas:7.0):1.0, Orangs:8.0):1.0;"
tr = read.tree(text=newick_str)
plot(tr)
# What are we adding to the tree and the plot of the tree, this time?
#
newick_str = "(((Humans:6.0, Chimps:6.0)LCA_humans_chimps:1.0, Gorillas:7.0)LCA_w_gorillas:1.0, Orangs:8.0)LCA_w_orangs:1.0;"
tr = read.tree(text=newick_str)
plot(tr, show.node.label=TRUE)
# More on Newick format, which, annoyingly, is sometimes inconsistent:
# http://en.wikipedia.org/wiki/Newick_format
# Have a look at how the tree is stored in R
tr
#
tr$tip.label
#
tr$edge
#
tr$edge.length
#
tr$node.label
# If you forget how to find these, you can use the "attributes" function
#
attributes(tr)
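# Side note: ape also has handy counting functions for trees:
Ntip(tr)	# number of tips
Nnode(tr)	# number of internal nodes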
# Now plot the tree in different ways:
# (CTRL-right or CTRL-left to flip between the trees in the graphics window)
plot(tr, type="phylogram")
#
plot(tr, type="phylogram", direction="rightwards")
plot(tr, type="phylogram", direction="leftwards")
plot(tr, type="phylogram", direction="upwards")
plot(tr, type="phylogram", direction="downwards")
#
plot(tr, type="cladogram")
plot(tr, type="fan")
plot(tr, type="unrooted")
plot(tr, type="radial")
#
plot(tr, type="unrooted", edge.width=5)
plot(tr, type="unrooted", edge.width=5, edge.color="blue")
plot(tr, type="unrooted", edge.width=5, edge.color="blue", lab4ut="horizontal")
plot(tr, type="unrooted", edge.width=5, edge.color="blue", lab4ut="axial")
# In R GUI, you can save any displayed tree to PDF, or do a screen capture etc.
# you can also save a tree to PDF as follows:
pdffn = "homstree.pdf"
pdf(file=pdffn)
plot(tr, type="unrooted", edge.width=5, edge.color="blue", lab4ut="axial")
dev.off()
# On Macs, this will open the PDF from R
# (on Windows, try shell.exec(pdffn) instead):
cmdstr = paste("open ", pdffn, sep="")
system(cmdstr)
# How to save the tree as text files
#
newick_fn = "homstree.newick"
write.tree(tr, file=newick_fn)
#
nexus_fn = "homstree.nexus"
write.nexus(tr, file=nexus_fn)
moref(nexus_fn)
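# Note: moref() is not a base-R function (it comes from helper code
# loaded earlier); if it errors, base R can display the file like this:
cat(readLines(nexus_fn), sep="\n")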
# To conclude the lab, I wanted to find, download, and display
# a "tree of life".
#
# To do this, I went to the TreeBase search page:
# http://www.treebase.org/treebase-web/search/studySearch.html
#
# ...and searched on studies with the title "tree of life"
#
# Annoyingly, the fairly famous tree from:
#
# Ciccarelli F.D. et al. (2006). "Toward automatic reconstruction of
# a highly resolved tree of life." Science, 311:1283-1287.
# http://www.sciencemag.org/content/311/5765/1283.abstract
#
# ...was not online, as far as I could tell. And a lot of these are the "turtle trees of life", etc.
# Lame. But this one was a tree covering the root of known
# cellular life.
#
# Caetano-Anollés G. et al. (2002). "Evolved RNA secondary structure
# and the rooting of the universal tree of life." Journal of
# Molecular Evolution.
#
# Search for study S796, then click over to the "Trees" tab to get the
# tree...
#
# http://www.phylowidget.org/full/?tree=%27http://www.treebase.org/treebase-web/tree_for_phylowidget/TB2:Tr3931%27
#
# Or, download the tree from our website, here:
# http://ib.berkeley.edu/courses/ib200b/labs/Caetano-anolles_2002_JME_ToL.newick
# load the tree and play with it:
newick_fn = "Caetano-anolles_2002_JME_ToL.newick"
tree_of_life = read.tree(newick_fn)
plot(tree_of_life, type="cladogram")
plot(tree_of_life, type="phylogram")
plot(tree_of_life, type="unrooted", lab4ut="axial")
# aw, no branch lengths in TreeBase! Topology only! Lame!
#######################################################
# CHAPTER 7 (BONUS): AN EXAMPLE USING ALL OF THE ABOVE
# WITH BIOGEOGRAPHY
#
# See BioGeoBEARS tutorial:
#
# http://phylo.wikidot.com/biogeobears#toc16
#
#######################################################
```