Biosci220 Lab 4
#######################################################
# WEEK 4 LAB
#######################################################

#######################################################
# PART 1: ADVANTAGES/DISADVANTAGES OF MODELS
#######################################################
# 
# Please form groups of 2-4 people, and discuss the readings
# (Asimov and O'Neill) for 10 minutes.  Keep two lists:
#
# - a list of possible advantages of relying on models
# - a list of possible disadvantages of relying on models
#
# Then, spend 10 minutes reading these articles:
# 
# Why outbreaks like coronavirus spread exponentially, and
# how to "flatten the curve" (free online)
# By Harry Stevens
# March 14, 2020
# https://www.washingtonpost.com/graphics/2020/world/corona-simulator/
#
# NextStrain Covid-19 webpage:
# https://nextstrain.org/ncov
# 
# Now, discuss what you have heard about models in 
# connection with COVID-19, and add to your list any 
# advantages/disadvantages you see in the use of 
# epidemiological models in the science/public 
# policy/public communication arenas.
#
# Once you have some items, go up to the whiteboard and
# write (SHORT, i.e. 1-5 words) descriptions of the
# most important advantages/disadvantages. If you see
# something similar already written, just add a 
# tick-mark.
# 
# Please also use sanitiser and/or wash your hands after
# using the shared pen.
#
# We will then review this as a class.
# 
#######################################################

#######################################################
# PART 2: INTRODUCTION TO ML INFERENCE
########################################################
# 
# We will begin our exploration of model-based inference
# with a very simple dataset and model.
# 
# Many biological problems, in many different fields,
# from ecology to epidemiology, are concerned with 
# trying to estimate the prevalence or frequency of
# something. 
#
# For example:
#
# - Out of 1000 ponds in a region, how many contain a
#   particular endangered frog species?
#
# - Out of a million people, how many are resistant
#   to measles (due to e.g. vaccination or previous 
#   infection)?
#
# These examples (leaving out numerous real-world
# complexities) boil down to this: there are 2 
# possible outcomes, and we are interested in how
# often each outcome occurs.
#
# Another process with 2 outcomes is coin-flipping.
#
# Coin-flipping, and any other 2-outcome process,
# can be described with a one-parameter model, using
# a binomial distribution ("bi-nomial" = "two names").
#
# (We will use coin-flipping as an example, but you can
# mentally substitute any other binomial process if
# you like.)
#
# The single parameter describing a binomial process
# is "p", the probability of one of the outcomes. The
# probability of the other outcome is 1-p.
#
# Because so many different probabilities and "p" terms are used in
# probability and statistics, I will use "P_heads" - the probability
# that a coin will produce "Heads" on a single flip - to 
# describe this parameter.
# 
########################################################

# Let's consider some coin-flip data.  
#
# Here are 100 coin flips:

coin_flips = c('H','T','H','T','H','H','T','H','H','H','T','H','H','T','T','T','T','H','H','H','H','H','H','H','H','H','H','H','H','H','H','H','H','T','T','T','H','T','T','T','H','T','T','T','H','H','H','T','T','H','H','H','T','H','H','H','T','T','H','H','H','H','H','H','H','T','T','H','H','H','H','T','T','H','H','H','T','T','H','H','H','H','H','H','T','T','T','H','H','H','H','H','H','T','H','T','H','H','T','T')

# Look at the data
coin_flips

# What is your guess at "P_heads", the probability of heads?
#

# In the case of binomial data, we actually have a simple 
# formula to calculate the best estimate of P_heads:

# Find the heads
heads_TF = (coin_flips == "H")
heads_TF

# Find the tails
tails_TF = (coin_flips == "T")
tails_TF

numHeads = sum(heads_TF)
numHeads

numTails = sum(tails_TF)
numTails

numTotal = length(coin_flips)
numTotal

# Here's the formula:
P_heads_ML_estimate = numHeads / numTotal
P_heads_ML_estimate

# Well, duh, that seems pretty obvious.  At least it would have been, if we 
# weren't thinking of coins, where we have a strong prior belief that the
# coin is probably fair.

# It turns out that this formula can be justified through a technique
# known as Maximum Likelihood.
#
# What does it mean to say we have a "maximum likelihood" estimate of P_heads?
# 
# "Likelihood", in statistics, means "the probability of the data under the model"
#

# Let's calculate the probability of the coin flip data under the 
# hypothesis/model that P_heads is 0.5

# We'll be very inefficient, and use a for-loop and
# if statements

# Loop through all 100 flips
# Make a list of the probability of 
# each datum
P_heads_guess = 0.5

# Empty list of probabilities
probs_list = rep(NA, times=length(coin_flips))
probs_list

for (i in 1:length(coin_flips))
    {
    # Print an update
    cat("\nAnalysing coin flip #", i, "/", length(coin_flips), sep="")

    # Get the current coin flip
    coin_flip = coin_flips[i]

    # If the coin flip is heads, give that datum
    # probability P_heads_guess.
    # If tails, give it (1-P_heads_guess)

    if (coin_flip == "H")
        {
        probs_list[i] = P_heads_guess
        } # End if heads

    if (coin_flip == "T")        
        {
        probs_list[i] = (1-P_heads_guess)
        } # End if tails
    } # End for-loop

# Look at the resulting probabilities
probs_list

# We get the probability of all the data by multiplying
# all the probabilities, with the prod() function.
likelihood_of_data_given_P_heads_guess1 = prod(probs_list)
likelihood_of_data_given_P_heads_guess1

# That's a pretty small number!  You'll see that it's 
# just 0.5^100:
0.5^100

# A probability of 0.5 is not small, but multiply 
# 100 values of 0.5 together, and you get a small value.
# That's the probability of that specific sequence of 
# heads/tails, given the hypothesis that the true
# probability is P_heads_guess.
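
# Equivalently, since the order of multiplication doesn't matter, the
# likelihood of the whole sequence is just
# P_heads_guess^numHeads * (1-P_heads_guess)^numTails.
# A quick check, using the numHeads and numTails we computed above:
P_heads_guess^numHeads * (1-P_heads_guess)^numTails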

# Let's try another parameter value:

# Loop through all 100 flips
# Make a list of the probability of 
# each datum
P_heads_guess = 0.7

# Empty list of probabilities
probs_list = rep(NA, times=length(coin_flips))
probs_list

for (i in 1:length(coin_flips))
    {
    # Print an update
    cat("\nAnalysing coin flip #", i, "/", length(coin_flips), sep="")

    # Get the current coin flip
    coin_flip = coin_flips[i]

    # If the coin flip is heads, give that datum
    # probability P_heads_guess.
    # If tails, give it (1-P_heads_guess)

    if (coin_flip == "H")
        {
        probs_list[i] = P_heads_guess
        } # End if heads

    if (coin_flip == "T")        
        {
        probs_list[i] = (1-P_heads_guess)
        } # End if tails
    } # End for-loop

# Look at the resulting probabilities
probs_list

# We get the probability of all the data by multiplying
# all the probabilities
likelihood_of_data_given_P_heads_guess2 = prod(probs_list)
likelihood_of_data_given_P_heads_guess2

# We got a different likelihood. It's also very small.
# But that's not important. What's important is, 
# how many times higher is it?

likelihood_of_data_given_P_heads_guess2 / likelihood_of_data_given_P_heads_guess1

# Whoa!  That's a lot higher!  This means the coin flip data is 54 times more
# probable under the hypothesis that P_heads=0.7 than under the 
# hypothesis that P_heads=0.5.

# Maximum likelihood: You can see that the BEST explanation of the data 
# would be the one with the value of P_heads that maximized the probability 
# of the data.  This would be the Maximum Likelihood solution.

# We could keep copying and pasting code, but that seems annoying.  Let's make a function 
# instead:

# Function that calculates the probability of coin flip data
# given a value of P_heads_guess
calc_prob_coin_flip_data <- function(P_heads_guess, coin_flips)
    {
    # Empty list of probabilities
    probs_list = rep(NA, times=length(coin_flips))
    probs_list

    for (i in 1:length(coin_flips))
        {
        # Print an update
        #cat("\nAnalysing coin flip #", i, "/", length(coin_flips), sep="")

        # Get the current coin flip
        coin_flip = coin_flips[i]

        # If the coin flip is heads, give that datum
        # probability P_heads_guess.
        # If tails, give it (1-P_heads_guess)

        if (coin_flip == "H")
            {
            probs_list[i] = P_heads_guess
            } # End if heads

        if (coin_flip == "T")        
            {
            probs_list[i] = (1-P_heads_guess)
            } # End if tails
        } # End for-loop

    # Look at the resulting probabilities
    probs_list

    # We get the probability of all the data by multiplying
    # all the probabilities
    likelihood_of_data_given_P_heads_guess = prod(probs_list)

    # Return result
    return(likelihood_of_data_given_P_heads_guess)
    }

# Now, we can just use this function:
calc_prob_coin_flip_data(P_heads_guess=0.5, coin_flips=coin_flips)
calc_prob_coin_flip_data(P_heads_guess=0.6, coin_flips=coin_flips)
calc_prob_coin_flip_data(P_heads_guess=0.7, coin_flips=coin_flips)

# Look at that!  We did all of that work in a split-second.

# In fact, we can make another for-loop, and search for the ML
# value of P_heads by trying all of the values and plotting them.

# Sequence of 50 possible values of P_heads between 0 and 1
P_heads_values_to_try = seq(from=0, to=1, length.out=50)
likelihoods = rep(NA, times=length(P_heads_values_to_try))

for (i in 1:length(P_heads_values_to_try))
    {
    # Get the current guess at P_heads_guess
    P_heads_guess = P_heads_values_to_try[i]

    # Calculate likelihood of the coin flip data under
    # this value of P_heads
    likelihood = calc_prob_coin_flip_data(P_heads_guess=P_heads_guess, coin_flips=coin_flips)

    # Store the likelihood value
    likelihoods[i] = likelihood
    } # End for-loop

# Here are the resulting likelihoods:
likelihoods

# Let's try plotting the likelihoods to see if there's a peak
plot(x=P_heads_values_to_try, y=likelihoods)
lines(x=P_heads_values_to_try, y=likelihoods)

# Whoa! That's quite a peak!  You can see that the likelihoods
# vary over several orders of magnitude.
#
# Partially because of this extreme variation, we often use the 
# log-likelihood (natural log, here) instead of the raw
# likelihood.
#
# (Other reasons: computers have a minimum numerical precision, so very
#  small likelihoods can "underflow" to zero; log-likelihoods
#  can be added instead of multiplied; AIC is calculated from the
#  log-likelihood; etc.)
#
#
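# First, a quick check that taking logs turns the product of the per-flip
# probabilities into a sum (using the probs_list left over from the last
# for-loop above); the two numbers below should match:
log(prod(probs_list))
sum(log(probs_list))
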
log_likelihoods = log(likelihoods, base=exp(1))

plot(x=P_heads_values_to_try, y=log_likelihoods)
lines(x=P_heads_values_to_try, y=log_likelihoods)

# Let's plot these together
par(mfrow=c(2,1))
plot(x=P_heads_values_to_try, y=likelihoods, main="Likelihood (L) of the data")
lines(x=P_heads_values_to_try, y=likelihoods)

plot(x=P_heads_values_to_try, y=log_likelihoods, main="Log-likelihood (LnL) of the data")
lines(x=P_heads_values_to_try, y=log_likelihoods)

# Maximum likelihood optimization
# 
# You can see that the maximum likelihood of the data occurs when 
# P_heads is somewhere around 0.6 or 0.7.  What is it 
# exactly?
#
# We could just keep trying more values until we find whatever
# precision we desire.  But, R has a function for
# maximum likelihood optimization!
# 
# It's called optim().  Optim() takes a function as an input.
# Fortunately, we've already written a function!
#
# Let's modify our function a bit to return the log-likelihood,
# and print the result:

# Function that calculates the probability of coin flip data
# given a value of P_heads_guess
calc_prob_coin_flip_data2 <- function(P_heads_guess, coin_flips)
    {
    # Empty list of probabilities
    probs_list = rep(NA, times=length(coin_flips))
    probs_list

    for (i in 1:length(coin_flips))
        {
        # Print an update
        #cat("\nAnalysing coin flip #", i, "/", length(coin_flips), sep="")

        # Get the current coin flip
        coin_flip = coin_flips[i]

        # If the coin flip is heads, give that datum
        # probability P_heads_guess.
        # If tails, give it (1-P_heads_guess)

        if (coin_flip == "H")
            {
            probs_list[i] = P_heads_guess
            } # End if heads

        if (coin_flip == "T")        
            {
            probs_list[i] = (1-P_heads_guess)
            } # End if tails
        } # End for-loop

    # Look at the resulting probabilities
    probs_list

    # We get the probability of all the data by multiplying
    # all the probabilities
    likelihood_of_data_given_P_heads_guess = prod(probs_list)

    # Get the log-likelihood
    LnL = log(likelihood_of_data_given_P_heads_guess)
    LnL

    # Error correction: if -Inf, reset to a low value
    if (is.finite(LnL) == FALSE)
        {
        LnL = -1000
        }

    # Print some output
    print_txt = paste("\nWhen P_heads=", P_heads_guess, ", LnL=", LnL, sep="")
    cat(print_txt)

    # Return result
    return(LnL)
    }

# Try the function out:
LnL = calc_prob_coin_flip_data2(P_heads_guess=0.1, coin_flips=coin_flips)
LnL = calc_prob_coin_flip_data2(P_heads_guess=0.2, coin_flips=coin_flips)
LnL = calc_prob_coin_flip_data2(P_heads_guess=0.3, coin_flips=coin_flips)

# Looks like it works!  Let's use optim() to search for the 
# best P_heads value:

# Set a starting value of P_heads
starting_value = 0.1

# Set the limits of the search
limit_bottom = 0
limit_top = 1
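
# Note: optim() minimizes by default, so control=list(fnscale=-1) is used to
# tell it to maximize our log-likelihood function instead.  method="L-BFGS-B"
# allows us to set "lower" and "upper" bounds on the parameter being searched.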

optim_result = optim(par=starting_value, fn=calc_prob_coin_flip_data2, coin_flips=coin_flips, method="L-BFGS-B", lower=limit_bottom, upper=limit_top, control=list(fnscale=-1))

# You can see the search print out as it proceeds.

# Let's see what ML search decided on:
optim_result

# Let's compare the ML estimate of P_heads from the optim() search with the
# simple formula (the observed fraction of heads):
optim_result$par

# Here's the formula:
P_heads_ML_estimate = numHeads / numTotal
P_heads_ML_estimate

# Wow!  Pretty good!  
#
# Where does that formula for P_heads_ML_estimate come from?  We will discuss
# this in lab on the board, but here is a summary.
#
# It turns out 
# that, in simple cases, we can use calculus to derive "analytical solutions"
# for maximizing the likelihood.  Basically, this involves:
#
# - taking the derivative of the likelihood function
#   (the derivative is the slope of a function)
# - setting the formula for the derivative to equal 0
#   (because, when the derivative=0, you are at a maximum or minimum)
# - solving for the parameter value that causes the derivative to 
#   equal 0.
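#
# For example, here is a sketch of that derivation for the coin-flip model,
# with h heads and t tails out of n = h + t flips:
#
#   likelihood:       L(p)    = p^h * (1-p)^t
#   log-likelihood:   LnL(p)  = h*log(p) + t*log(1-p)
#   derivative:       dLnL/dp = h/p - t/(1-p)
#
# Setting h/p - t/(1-p) = 0 and solving for p gives p = h/(h+t), i.e. the
# observed fraction of heads.  A quick numerical check with our data:
numHeads / (numHeads + numTails)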
#
# This same basic calculus-based procedure provides a justification for
# the formulas that are used in e.g. linear regression.
#
# The nice things about calculus-derived formulas for maximizing the likelihood 
# -- known as "analytic solutions" -- include:
# 
# - they are very fast
# - the solution can be found without a computer (this was useful back in 
#   the 20th century, when statistics was being developed without computers)
# - there is no chance of optimization error (optimization errors can occur
#   if likelihood surfaces are too flat, too bumpy, etc.)
# 
# This raises the question:
# Why would anyone ever go through all the optimization rigamarole, when 
# they could just calculate P_heads directly?
#
# Well, only in simple cases do we have a formula for the maximum likelihood 
# estimate.  The optim() strategy works whether or not
# there is a simple formula.
#
# In real-life science, with complex models of biological processes (such as 
# epidemiological models), ML optimization gets used A LOT, but most scientists
# don't learn it until graduate school, if then.
# 

# WHAT WE LEARNED
#
# You can now see that there is nothing magical, or even particularly
# difficult, about these concepts:
#
# - likelihood
# - log-likelihood
# - maximum likelihood (ML)
# - inferring a parameter through ML

# IMPLICATIONS
#
# It turns out that likelihood is one of the core concepts in model-
# based inference, and in statistics generally.  Bayesian analyses 
# have likelihood as a core component (we will talk about this more
# later), and many of the methods from introductory statistics (linear 
# models etc.) can also be justified in terms of maximum likelihood.

#######################################################
# PART 3: BUILDING BLOCKS OF MODELS IN R
#######################################################

# R has a number of functions that speed up the process 
# of building models, and simulating from them.

# For example, for binomial processes, these are:

# dbinom(x, size, prob, log) -- the probability density 
# (the likelihood) of x "heads"
# 
# size = the number of tries (the number of coin flips)
#
# prob = the parameter value (e.g. P_heads)
#
# log = should the likelihood be log-transformed when output?
# 
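# For example, we can compare dbinom() to our hand-rolled calculation from
# Part 2 (re-using numHeads, numTotal, and P_heads_ML_estimate from above).
# Note that dbinom() includes the binomial coefficient choose(size, x) -- the
# number of possible orderings of x heads among size flips -- so it is larger
# than our sequence-specific likelihood by that constant factor (the factor
# doesn't depend on prob, so it doesn't change where the maximum is):

dbinom(x=numHeads, size=numTotal, prob=P_heads_ML_estimate, log=FALSE)
dbinom(x=numHeads, size=numTotal, prob=P_heads_ML_estimate, log=TRUE)
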

# The other functions are:
# 
# rbinom(n, size, prob) -- simulate n binomial outcomes,
# using "size" and "prob" as above
# 
# pbinom(q, ...) - used for calculating tail probabilities 
# (the proportion of the distribution less than or equal to the value q)
#
# qbinom(p, ...) - the inverse of pbinom(): the smallest value q such that
# pbinom(q, ...) is at least p
#
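# For example (the numbers below are just illustrative values):

rbinom(n=5, size=100, prob=0.65)    # 5 simulated experiments of 100 flips each
                                    # (each result = the number of heads)
pbinom(q=50, size=100, prob=0.65)   # probability of getting 50 or fewer heads
qbinom(p=0.5, size=100, prob=0.65)  # the median number of heads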

# Any time you hear about "probability densities", "distributions",
# "random number generators", "random draws", or "simulations"
# in R, functions like these are being used.

# For uniform distributions, the equivalent functions are:
# 
# dunif, runif, punif, qunif

# For normal (=Gaussian/bell-curve) distributions, the equivalents are:
#
#
# dnorm, rnorm, pnorm, qnorm
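
# For example:
runif(n=3, min=0, max=1)     # 3 random draws from a uniform distribution on [0,1]
dnorm(x=0, mean=0, sd=1)     # height of the standard normal bell curve at x=0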

# Random seeds, and random number generators

# Random numbers in computers are not truly random - they are produced by a
# deterministic algorithm.  Setting a "seed" number makes the sequence of
# "random" numbers repeatable.

# Compare:

rnorm(n=1, mean=6, sd=1)
rnorm(n=1, mean=6, sd=1)
rnorm(n=1, mean=6, sd=1)

set.seed(54321)
rnorm(n=1, mean=6, sd=1)
rnorm(n=1, mean=6, sd=1)

set.seed(54321)
rnorm(n=1, mean=6, sd=1)
rnorm(n=1, mean=6, sd=1)

set.seed(54321)
rnorm(n=1, mean=6, sd=1)
rnorm(n=1, mean=6, sd=1)

# By re-setting the seed, we re-set the random number generator, so it
# produces the same sequence of "random" numbers again.

#######################################################
# LAB QUIZ: REFER TO THE QUIZ ON CANVAS
#######################################################