Biosci220 Lab 4
#######################################################
# WEEK 4 LAB
#######################################################

#######################################################
# PART 1: DISCUSSION OF MODELS
#######################################################
#
# Discuss the assigned readings (Asimov and O'Neill)
# for 10 minutes.  Keep two lists:
#
# - a list of possible advantages of relying on models
# - a list of possible disadvantages of relying on models
#
# Then, spend 10 minutes reading these articles:
#
# Why outbreaks like coronavirus spread exponentially, and
# how to "flatten the curve" (free online)
# By Harry Stevens
# March 14, 2020
# https://www.washingtonpost.com/graphics/2020/world/corona-simulator/
#
# NextStrain Covid-19 webpage:
# https://nextstrain.org/ncov
#
# Now, discuss what you have heard about
# epidemiological models in the science/public
# policy/public communication arenas.
#
# Once you have some items, go up to the whiteboard and,
# using the shared pen, write SHORT (i.e. 1-5 words)
# descriptions of your advantages and disadvantages.
# If an item is already on the board, add a tick-mark
# next to it.
#
# We will then review this as a class.
#
#######################################################

#######################################################
# PART 2: INTRODUCTION TO ML INFERENCE
########################################################
#
# We will begin our exploration of model-based inference
# with a very simple dataset and model.
#
# Many biological problems, in many different fields,
# from ecology to epidemiology, are concerned with
# trying to estimate the prevalence or frequency of
# something.
#
# For example:
#
# - Out of 1000 ponds in a region, how many contain a
#   particular endangered frog species?
#
# - Out of a million people, how many are resistant
#   to measles (due to e.g. vaccination or previous
#   infection)?
#
# These examples (leaving out numerous real-world
# complexities) boil down to this: there are 2
# possible outcomes, and we are interested in how
# often each outcome occurs.
#
# Another process with 2 outcomes is coin-flipping.
#
# Coin-flipping, and any other 2-outcome process,
# can be described with a one-parameter model, using
# a binomial distribution ("bi-nomial" = "two names").
#
# (We will use coin-flipping as an example, but you can
# mentally substitute any other binomial process if
# you like.)
#
# The single parameter describing a binomial process
# is "p", the probability of one of the outcomes. The
# probability of the other outcome is 1-p.
#
# Because there are so many different probabilities and
# "p" terms used in probability
# and statistics, I will use "P_heads" - the probability
# that a coin will produce "Heads" on a single flip - to
# describe this parameter.
#
########################################################
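# Before looking at real data, here is a minimal sketch of simulating
# a 2-outcome (binomial) process in R. The value P_heads=0.65 and the
# 10-flip size are just illustrative choices, not part of the lab data.

```r
# Simulate 10 flips of a biased coin with sample()
P_heads = 0.65
flips = sample(x=c("H","T"), size=10, replace=TRUE, prob=c(P_heads, 1-P_heads))
flips
```

# Each run produces a different sequence of "H" and "T"; the long-run
# fraction of "H" approaches P_heads.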

# Let's consider some coin-flip data.
#
# Here are 100 coin flips:

coin_flips = c('H','T','H','T','H','H','T','H','H','H','T','H','H','T','T','T','T','H','H','H','H','H','H','H','H','H','H','H','H','H','H','H','H','T','T','T','H','T','T','T','H','T','T','T','H','H','H','T','T','H','H','H','T','H','H','H','T','T','H','H','H','H','H','H','H','T','T','H','H','H','H','T','T','H','H','H','T','T','H','H','H','H','H','H','T','T','T','H','H','H','H','H','H','T','H','T','H','H','T','T')

# Look at the data
coin_flips

#

# In the case of binomial data, we actually have a simple
# formula to calculate the best estimate of P_heads:

# Find the tails
tails_TF = (coin_flips == "T")
tails_TF

numTails = sum(tails_TF)
numTails

numTotal = length(coin_flips)
numTotal

# Here's the formula:
P_heads_ML_estimate = (numTotal - numTails) / numTotal
P_heads_ML_estimate

# Well, duh, that seems pretty obvious.  At least it would have been, if we
# weren't thinking of coins, where we have a strong prior belief that the
# coin is probably fair.

# It turns out that this formula can be justified through a technique
# known as Maximum Likelihood.
#
# What does it mean to say we have a "maximum likelihood" estimate of P_heads?
#
# "Likelihood", in statistics, means "the probability of the data under the model"
#
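# A minimal sketch with a tiny made-up dataset: for three flips
# H, H, T under the model P_heads=0.5, each flip has probability 0.5,
# so the probability of the whole dataset is the product:

```r
# Likelihood of the specific sequence H, H, T when P_heads = 0.5:
# P(H) * P(H) * P(T) = 0.5 * 0.5 * 0.5
0.5 * 0.5 * 0.5   # = 0.125
```
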

# Let's calculate the probability of the coin flip data under the
# hypothesis/model that P_heads is 0.5
P_heads_guess = 0.5

# We'll be very inefficient, and use a for-loop, and
# if/else statements

# Loop through all 100 flips
# Make a list of the probability of
# each datum

# Empty list of probabilities
probs_list = rep(NA, times=length(coin_flips))
probs_list

for (i in 1:length(coin_flips))
{
# Print an update
cat("\nAnalysing coin flip #", i, "/", length(coin_flips), sep="")

# Get the current coin flip
coin_flip = coin_flips[i]

# If the coin flip is heads, give that datum probability P_heads_guess
# If tails, give it (1-P_heads_guess)

if (coin_flip == "H")
{
probs_list[i] = P_heads_guess
} # End if heads

if (coin_flip == "T")
{
probs_list[i] = 1 - P_heads_guess
} # End if tails
} # End for-loop

# Look at the resulting probabilities
probs_list

# We get the probability of all the data by multiplying
# all the probabilities, with the prod() function.
prod(probs_list)

# That's a pretty small number!  You'll see that it's
# just 0.5^100:
0.5^100

# A probability of 0.5 is not small, but multiply
# 100 values of 0.5 together, and you get a small value.
# That's the probability of that specific sequence of
# heads/tails, given the hypothesis that the true
# P_heads is 0.5.

# Let's try another parameter value:
P_heads_guess = 0.7

# Loop through all 100 flips
# Make a list of the probability of
# each datum

# Empty list of probabilities
probs_list = rep(NA, times=length(coin_flips))
probs_list

for (i in 1:length(coin_flips))
{
# Print an update
cat("\nAnalysing coin flip #", i, "/", length(coin_flips), sep="")

# Get the current coin flip
coin_flip = coin_flips[i]

# If the coin flip is heads, give that datum probability P_heads_guess
# If tails, give it (1-P_heads_guess)

if (coin_flip == "H")
{
probs_list[i] = P_heads_guess
} # End if heads

if (coin_flip == "T")
{
probs_list[i] = 1 - P_heads_guess
} # End if tails
} # End for-loop

# Look at the resulting probabilities
probs_list

# We get the probability of all the data by multiplying
# all the probabilities
prod(probs_list)

# We got a different likelihood. It's also very small.
# But that's not important. What's important is,
# how many times higher is it than the likelihood
# under P_heads=0.5?  (The data contain 65 heads and 35 tails.)
(0.7^65 * 0.3^35) / 0.5^100

# Whoa!  That's a lot higher!  This means the coin flip data is 54 times more
# probable under the hypothesis that P_heads=0.7 than under the
# hypothesis that P_heads=0.5.

# Maximum likelihood: You can see that the BEST explanation of the data
# would be the one with the value of P_heads that maximized the probability
# of the data.  This would be the Maximum Likelihood solution.

# We could keep copying and pasting code, but that seems annoying.  Let's make a function

# Function that calculates the probability of coin flip data
# given a value of P_heads_guess
calc_prob_coin_flip_data = function(P_heads_guess, coin_flips)
{
# Empty list of probabilities
probs_list = rep(NA, times=length(coin_flips))

for (i in 1:length(coin_flips))
{
# Print an update
#cat("\nAnalysing coin flip #", i, "/", length(coin_flips), sep="")

# Get the current coin flip
coin_flip = coin_flips[i]

# If the coin flip is heads, give that datum probability P_heads_guess
# If tails, give it (1-P_heads_guess)

if (coin_flip == "H")
{
probs_list[i] = P_heads_guess
} # End if heads

if (coin_flip == "T")
{
probs_list[i] = 1 - P_heads_guess
} # End if tails
} # End for-loop

# We get the probability of all the data by multiplying
# all the probabilities
likelihood = prod(probs_list)

# Return result
return(likelihood)
}

# Now, we can just use this function:
calc_prob_coin_flip_data(P_heads_guess=0.5, coin_flips=coin_flips)
calc_prob_coin_flip_data(P_heads_guess=0.7, coin_flips=coin_flips)

# Look at that!  We did all of that work in a split-second.

# In fact, we can make another for-loop, and search for the ML
# value of P_heads by trying all of the values and plotting them.

# Sequence of 50 possible values of P_heads between 0 and 1
P_heads_values_to_try = seq(from=0, to=1, length.out=50)

# Empty list of likelihoods
likelihoods = rep(NA, times=length(P_heads_values_to_try))

for (i in 1:length(P_heads_values_to_try))
{
# Get the current guess at P_heads_guess
P_heads_guess = P_heads_values_to_try[i]

# Calculate likelihood of the coin flip data under
# this guess
likelihood = calc_prob_coin_flip_data(P_heads_guess=P_heads_guess, coin_flips=coin_flips)

# Store the likelihood value
likelihoods[i] = likelihood
} # End for-loop

# Here are the resulting likelihoods:
likelihoods

# Let's try plotting the likelihoods to see if there's a peak
plot(x=P_heads_values_to_try, y=likelihoods, main="Likelihood (L) of the data")

# Whoa! That's quite a peak!  You can see that the likelihoods
# vary over several orders of magnitude.
#
# Partially because of this extreme variation, we often use the
# log-likelihood (natural log, here) instead of the raw
# likelihood.
#
# (Other reasons: computers have a minimum numerical precision, so
#  products of many small likelihoods can underflow to zero, whereas
#  log-likelihoods can be summed rather than multiplied; etc.)
#
#
log_likelihoods = log(likelihoods, base=exp(1))

# Let's plot these together
par(mfrow=c(2,1))
plot(x=P_heads_values_to_try, y=likelihoods, main="Likelihood (L) of the data")

plot(x=P_heads_values_to_try, y=log_likelihoods, main="Log-likelihood (LnL) of the data")
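# The numerical-precision point above can be seen directly. This is
# just an illustrative sketch using 2000 made-up probabilities of 0.5
# each (not the lab data):

```r
# Multiplying many small probabilities underflows to zero in
# standard double-precision arithmetic...
probs = rep(0.5, times=2000)
prod(probs)        # underflows to exactly 0

# ...but the sum of the logs stays finite and usable:
sum(log(probs))    # = 2000 * log(0.5), about -1386.3
```
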

# Maximum likelihood optimization
#
# You can see that the maximum likelihood of the data occurs when
# P_heads is somewhere around 0.6 or 0.7.  What is it
# exactly?
#
# We could just keep trying more values until we find whatever
# precision we desire.  But, R has a function for
# maximum likelihood optimization!
#
# It's called optim().  optim() takes a function as an input.
# Fortunately, we've already written a function!
#
# Let's modify our function a bit to return the log-likelihood,
# and print the result:

# Function that calculates the log-likelihood of coin flip data
# given a value of P_heads_guess
calc_prob_coin_flip_data2 = function(P_heads_guess, coin_flips)
{
# Empty list of probabilities
probs_list = rep(NA, times=length(coin_flips))

for (i in 1:length(coin_flips))
{
# Print an update
#cat("\nAnalysing coin flip #", i, "/", length(coin_flips), sep="")

# Get the current coin flip
coin_flip = coin_flips[i]

# If the coin flip is heads, give that datum probability P_heads_guess
# If tails, give it (1-P_heads_guess)

if (coin_flip == "H")
{
probs_list[i] = P_heads_guess
} # End if heads

if (coin_flip == "T")
{
probs_list[i] = 1 - P_heads_guess
} # End if tails
} # End for-loop

# We get the probability of all the data by multiplying
# all the probabilities; taking the log gives the log-likelihood
LnL = log(prod(probs_list))

# Error correction: if -Inf, reset to a low value
if (is.finite(LnL) == FALSE)
{
LnL = -1000
}

# Print some output
print_txt = paste0("\nWhen P_heads=", P_heads_guess, ", LnL=", LnL)
cat(print_txt)

# Return result
return(LnL)
}

# Try the function out:
calc_prob_coin_flip_data2(P_heads_guess=0.5, coin_flips=coin_flips)
calc_prob_coin_flip_data2(P_heads_guess=0.7, coin_flips=coin_flips)

# Looks like it works!  Let's use optim() to search for the
# maximum likelihood value of P_heads.

# Set a starting value of P_heads
starting_value = 0.1

# Set the limits of the search
limit_bottom = 0
limit_top = 1

optim_result = optim(par=starting_value, fn=calc_prob_coin_flip_data2, coin_flips=coin_flips, method="L-BFGS-B", lower=limit_bottom, upper=limit_top, control=list(fnscale=-1))

# You can see the search print out as it proceeds.

# Let's see what ML search decided on:
optim_result

# Let's compare the ML estimate of P_heads from the optim() search
# with the simple binomial formula
optim_result$par

# Here's the formula:
P_heads_ML_estimate = (numTotal - numTails) / numTotal
P_heads_ML_estimate

# Wow!  Pretty good!
#
# Where does that formula for P_heads_ML_estimate come from?  We will discuss
# this in lab on the board, but here is a summary.
#
# It turns out
# that, in simple cases, we can use calculus to derive "analytical solutions"
# for maximizing the likelihood.  Basically, this involves:
#
# - taking the derivative of the likelihood function
#   (the derivative is the slope of a function)
# - setting the formula for the derivative to equal 0
#   (because, when the derivative=0, you are at a maximum or minimum)
# - solving for the parameter value that causes the derivative to
#   equal 0.
#
# This same basic calculus-based procedure provides a justification for
# the formulas that are used in e.g. linear regression.
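# As a numerical check on the calculus result (a sketch, not part of
# the original lab exercise), R's dbinom() function can confirm that
# the likelihood of 65 heads in 100 flips (the counts in the data
# above) peaks at heads/total:

```r
# Grid of candidate P_heads values
p_grid = seq(from=0.01, to=0.99, by=0.01)

# Log-likelihood of 65 heads out of 100 flips at each candidate value
LnLs = dbinom(x=65, size=100, prob=p_grid, log=TRUE)

# The grid value with the highest log-likelihood:
p_grid[which.max(LnLs)]   # 0.65, i.e. numHeads/numTotal
```
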
#
# The nice things about calculus-derived formulas for maximizing likelihood
# -- known as "analytical solutions" -- include:
#
# - they are very fast
# - the solution can be found without a computer (this was useful back in
#   the 20th century, when statistics was being developed without computers)
# - there is no chance of optimization error (optimization errors can occur
#   if likelihood surfaces are too flat, too bumpy, etc.)
#
# This raises the question:
# Why would anyone ever go through all the optimization rigamarole, when
# they could just calculate P_heads directly?
#
# Well, only in simple cases do we have a formula for the maximum likelihood
# estimate.  The optim() strategy works whether or not
# there is a simple formula.
#
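# A self-contained sketch of that point: here we hand optim() an
# anonymous log-likelihood function built from dbinom() (the 65-heads
# count matches the flips above), pretending no analytical formula
# exists. The bounds 0.0001/0.9999 avoid the -Inf log-likelihood at the
# exact edges:

```r
# Maximize the log-likelihood of 65 heads in 100 flips over P_heads
optim_sketch = optim(par=0.1,
    fn=function(p) dbinom(x=65, size=100, prob=p, log=TRUE),
    method="L-BFGS-B", lower=0.0001, upper=0.9999,
    control=list(fnscale=-1))
optim_sketch$par   # converges close to 0.65
```
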
# In real life science, with complex models of biological processes (such as
# epidemiological models), ML optimization gets used A LOT, but most scientists
# don't learn it until graduate school, if then.
#

# WHAT WE LEARNED
#
# You can now see that there is nothing magical, or even particularly
# complicated, about the following concepts:
#
# - likelihood
# - log-likelihood
# - maximum likelihood (ML)
# - inferring a parameter through ML

# IMPLICATIONS
#
# It turns out that likelihood is one of the core concepts in model-
# based inference, and statistics generally.  Bayesian analyses
# (which we will meet later), and methods from introductory
# statistics (linear models etc.), all rely on likelihood.

#######################################################
# PART 3: BUILDING BLOCKS OF MODELS IN R
#######################################################

# R has a number of functions that speed up the process
# of building models, and simulating from them.

# For example, for binomial processes, these are:

# dbinom(x, size, prob, log) -- the probability density
# (the likelihood) of x "heads"
#
# size = the number of tries (the number of coin flips)
#
# prob = the parameter value (e.g. P_heads)
#
# log = should the likelihood be log-transformed when output?
#

# The other functions are:
#
# rbinom(n, size, prob) -- simulate n binomial outcomes,
# using "size" and "prob" as above
#
# pbinom(q, ...) - used for calculating tail probabilities
# (the proportion of the distribution less than or equal to the value q)
#
# qbinom(p, ...) - the quantile function, i.e. the reverse of pbinom()
#

# Any time you hear about "probability densities", "distributions"
# "random number generators", "random draws", or "simulations"
# in R, functions like these are being used.
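# A quick sketch of the four binomial functions together. The 65/100
# numbers echo the coin-flip counts above; the other values are just
# illustrative:

```r
dbinom(x=65, size=100, prob=0.5)   # likelihood of exactly 65 heads in 100 flips
rbinom(n=5, size=100, prob=0.5)    # simulate 5 experiments of 100 flips each
pbinom(q=50, size=100, prob=0.5)   # P(50 or fewer heads in 100 flips)
qbinom(p=0.5, size=100, prob=0.5)  # smallest q with pbinom(q) >= 0.5 (here, 50)
```
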

# For uniform distributions, the equivalent functions are:
#
# dunif, runif, punif, qunif

# For normal (=Gaussian/bell-curve) distributions, the equivalents are:
#
#
# dnorm, rnorm, pnorm, qnorm

# Random seeds, and random number generators

# Random numbers in computers are not truly random; they are
# "pseudorandom".  To make random numbers repeatable, a "seed" number
# is used to initialise the random number generator.

# Compare:

rnorm(n=1, mean=6, sd=1)
rnorm(n=1, mean=6, sd=1)
rnorm(n=1, mean=6, sd=1)

set.seed(54321)
rnorm(n=1, mean=6, sd=1)
rnorm(n=1, mean=6, sd=1)

set.seed(54321)
rnorm(n=1, mean=6, sd=1)
rnorm(n=1, mean=6, sd=1)

set.seed(54321)
rnorm(n=1, mean=6, sd=1)
rnorm(n=1, mean=6, sd=1)

# By re-setting the seed, we re-set the random number generator

#######################################################
# LAB QUIZ: REFER TO THE QUIZ ON CANVAS
#######################################################