Biosci220 Lab 4
#######################################################
# WEEK 4 LAB
#######################################################

#######################################################
# PART 1: DISCUSSION OF MODELS
#######################################################
#
# Discuss the assigned readings (Asimov and O'Neill)
# for 10 minutes.  Keep two lists:
#
# - a list of possible advantages of relying on models
# - a list of possible disadvantages of relying on models
#
# Then, spend 10 minutes reading these articles:
#
# Why outbreaks like coronavirus spread exponentially, and
# how to "flatten the curve" (free online)
# By Harry Stevens
# March 14, 2020
# https://www.washingtonpost.com/graphics/2020/world/corona-simulator/
#
# NextStrain Covid-19 webpage:
# https://nextstrain.org/ncov
#
# Now, discuss what you have heard about
# epidemiological models in the science/public
# policy/public communication arenas.
#
# Once you have some items, go up to the whiteboard and,
# using the shared pen, write SHORT (i.e. 1-5 words)
# descriptions of your advantages and disadvantages.
# If an item is already on the board, add a tick-mark
# next to it.
#
# We will then review this as a class.
#
#######################################################

#######################################################
# PART 2: INTRODUCTION TO ML INFERENCE
########################################################
#
# We will begin our exploration of model-based inference
# with a very simple dataset and model.
#
# Many biological problems, in many different fields,
# from ecology to epidemiology, are concerned with
# trying to estimate the prevalence or frequency of
# something.
#
# For example:
#
# - Out of 1000 ponds in a region, how many contain a
#   particular endangered frog species?
#
# - Out of a million people, how many are resistant
#   to measles (due to e.g. vaccination or previous
#   infection)?
#
# These examples (leaving out numerous real-world
# complexities) boil down to this: there are 2
# possible outcomes, and we are interested in how
# often each outcome occurs.
#
# Another process with 2 outcomes is coin-flipping.
#
# Coin-flipping, and any other 2-outcome process,
# can be described with a one-parameter model, using
# a binomial distribution ("bi-nomial" = "two names").
#
# (We will use coin-flipping as an example, but you can
# mentally substitute any other binomial process if
# you like.)
#
# The single parameter describing a binomial process
# is "p", the probability of one of the outcomes. The
# probability of the other outcome is 1-p.
#
# Because there are so many different probabilities and
# "p" terms used in probability
# and statistics, I will use "P_heads" - the probability
# that a coin will produce "Heads" on a single flip - to
# describe this parameter.
#
########################################################
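# Before looking at real data, here is a minimal sketch of simulating
# a 2-outcome (binomial) process in R. The value P_heads=0.65 and the
# 10-flip size are just illustrative choices, not part of the lab data.

```r
# Simulate 10 flips of a biased coin with sample()
P_heads = 0.65
flips = sample(x=c("H","T"), size=10, replace=TRUE, prob=c(P_heads, 1-P_heads))
flips
```

# Each run produces a different sequence of "H" and "T"; the long-run
# fraction of "H" approaches P_heads.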

# Let's consider some coin-flip data.
#
# Here are 100 coin flips:

coin_flips = c('H','T','H','T','H','H','T','H','H','H','T','H','H','T','T','T','T','H','H','H','H','H','H','H','H','H','H','H','H','H','H','H','H','T','T','T','H','T','T','T','H','T','T','T','H','H','H','T','T','H','H','H','T','H','H','H','T','T','H','H','H','H','H','H','H','T','T','H','H','H','H','T','T','H','H','H','T','T','H','H','H','H','H','H','T','T','T','H','H','H','H','H','H','T','H','T','H','H','T','T')

# Look at the data
coin_flips

#

# In the case of binomial data, we actually have a simple
# formula to calculate the best estimate of P_heads:

# Find the tails
tails_TF = (coin_flips == "T")
tails_TF

numTails = sum(tails_TF)
numTails

numTotal = length(coin_flips)
numTotal

# Here's the formula:
P_heads_ML_estimate = (numTotal - numTails) / numTotal
P_heads_ML_estimate

# Well, duh, that seems pretty obvious.  At least it would have been, if we
# weren't thinking of coins, where we have a strong prior belief that the
# coin is probably fair.

# It turns out that this formula can be justified through a technique
# known as Maximum Likelihood.
#
# What does it mean to say we have a "maximum likelihood" estimate of P_heads?
#
# "Likelihood", in statistics, means "the probability of the data under the model"
#
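# A minimal sketch with a tiny made-up dataset: for three flips
# H, H, T under the model P_heads=0.5, each flip has probability 0.5,
# so the probability of the whole dataset is the product:

```r
# Likelihood of the specific sequence H, H, T when P_heads = 0.5:
# P(H) * P(H) * P(T) = 0.5 * 0.5 * 0.5
0.5 * 0.5 * 0.5   # = 0.125
```
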

# Let's calculate the probability of the coin flip data under the
# hypothesis/model that P_heads is 0.5
P_heads_guess = 0.5

# We'll be very inefficient, and use a for-loop, and
# if/else statements

# Loop through all 100 flips
# Make a list of the probability of
# each datum

# Empty list of probabilities
probs_list = rep(NA, times=length(coin_flips))
probs_list

for (i in 1:length(coin_flips))
{
# Print an update
cat("\nAnalysing coin flip #", i, "/", length(coin_flips), sep="")

# Get the current coin flip
coin_flip = coin_flips[i]

# If the coin flip is heads, give that datum probability P_heads_guess
# If tails, give it (1-P_heads_guess)

if (coin_flip == "H")
{
probs_list[i] = P_heads_guess
} # End if heads

if (coin_flip == "T")
{
probs_list[i] = 1 - P_heads_guess
} # End if tails
} # End for-loop

# Look at the resulting probabilities
probs_list

# We get the probability of all the data by multiplying
# all the probabilities, with the prod() function.
prod(probs_list)

# That's a pretty small number!  You'll see that it's
# just 0.5^100:
0.5^100

# A probability of 0.5 is not small, but multiply
# 100 values of 0.5 together, and you get a small value.
# That's the probability of that specific sequence of
# heads/tails, given the hypothesis that the true
# P_heads is 0.5.

# Let's try another parameter value:
P_heads_guess = 0.7

# Loop through all 100 flips
# Make a list of the probability of
# each datum

# Empty list of probabilities
probs_list = rep(NA, times=length(coin_flips))
probs_list

for (i in 1:length(coin_flips))
{
# Print an update
cat("\nAnalysing coin flip #", i, "/", length(coin_flips), sep="")

# Get the current coin flip
coin_flip = coin_flips[i]

# If the coin flip is heads, give that datum probability P_heads_guess
# If tails, give it (1-P_heads_guess)

if (coin_flip == "H")
{
probs_list[i] = P_heads_guess
} # End if heads

if (coin_flip == "T")
{
probs_list[i] = 1 - P_heads_guess
} # End if tails
} # End for-loop

# Look at the resulting probabilities
probs_list

# We get the probability of all the data by multiplying
# all the probabilities
prod(probs_list)

# We got a different likelihood. It's also very small.
# But that's not important. What's important is,
# how many times higher is it than the likelihood
# under P_heads=0.5?  (The data contain 65 heads and 35 tails.)
(0.7^65 * 0.3^35) / 0.5^100

# Whoa!  That's a lot higher!  This means the coin flip data is 54 times more
# probable under the hypothesis that P_heads=0.7 than under the
# hypothesis that P_heads=0.5.

# Maximum likelihood: You can see that the BEST explanation of the data
# would be the one with the value of P_heads that maximized the probability
# of the data.  This would be the Maximum Likelihood solution.

# We could keep copying and pasting code, but that seems annoying.  Let's make a function

# Function that calculates the probability of coin flip data
# given a value of P_heads_guess
calc_prob_coin_flip_data = function(P_heads_guess, coin_flips)
{
# Empty list of probabilities
probs_list = rep(NA, times=length(coin_flips))

for (i in 1:length(coin_flips))
{
# Print an update
#cat("\nAnalysing coin flip #", i, "/", length(coin_flips), sep="")

# Get the current coin flip
coin_flip = coin_flips[i]

# If the coin flip is heads, give that datum probability P_heads_guess
# If tails, give it (1-P_heads_guess)

if (coin_flip == "H")
{
probs_list[i] = P_heads_guess
} # End if heads

if (coin_flip == "T")
{
probs_list[i] = 1 - P_heads_guess
} # End if tails
} # End for-loop

# We get the probability of all the data by multiplying
# all the probabilities
likelihood = prod(probs_list)

# Return result
return(likelihood)
}

# Now, we can just use this function:
calc_prob_coin_flip_data(P_heads_guess=0.5, coin_flips=coin_flips)
calc_prob_coin_flip_data(P_heads_guess=0.7, coin_flips=coin_flips)

# Look at that!  We did all of that work in a split-second.

# In fact, we can make another for-loop, and search for the ML
# value of P_heads by trying all of the values and plotting them.

# Sequence of 50 possible values of P_heads between 0 and 1
P_heads_values_to_try = seq(from=0, to=1, length.out=50)

# Empty list of likelihoods
likelihoods = rep(NA, times=length(P_heads_values_to_try))

for (i in 1:length(P_heads_values_to_try))
{
# Get the current guess at P_heads_guess
P_heads_guess = P_heads_values_to_try[i]

# Calculate likelihood of the coin flip data under
# this guess
likelihood = calc_prob_coin_flip_data(P_heads_guess=P_heads_guess, coin_flips=coin_flips)

# Store the likelihood value
likelihoods[i] = likelihood
} # End for-loop

# Here are the resulting likelihoods:
likelihoods

# Let's try plotting the likelihoods to see if there's a peak
plot(x=P_heads_values_to_try, y=likelihoods, main="Likelihood (L) of the data")

# Whoa! That's quite a peak!  You can see that the likelihoods
# vary over several orders of magnitude.
#
# Partially because of this extreme variation, we often use the
# log-likelihood (natural log, here) instead of the raw
# likelihood.
#
# (Other reasons: computers have a minimum numerical precision, so
#  products of many small likelihoods can underflow to zero, whereas
#  log-likelihoods can be summed rather than multiplied; etc.)
#
#
log_likelihoods = log(likelihoods, base=exp(1))

# Let's plot these together
par(mfrow=c(2,1))
plot(x=P_heads_values_to_try, y=likelihoods, main="Likelihood (L) of the data")

plot(x=P_heads_values_to_try, y=log_likelihoods, main="Log-likelihood (LnL) of the data")
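# The numerical-precision point above can be seen directly. This is
# just an illustrative sketch using 2000 made-up probabilities of 0.5
# each (not the lab data):

```r
# Multiplying many small probabilities underflows to zero in
# standard double-precision arithmetic...
probs = rep(0.5, times=2000)
prod(probs)        # underflows to exactly 0

# ...but the sum of the logs stays finite and usable:
sum(log(probs))    # = 2000 * log(0.5), about -1386.3
```
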

# Maximum likelihood optimization
#
# You can see that the maximum likelihood of the data occurs when
# P_heads is somewhere around 0.6 or 0.7.  What is it
# exactly?
#
# We could just keep trying more values until we find whatever
# precision we desire.  But, R has a function for
# maximum likelihood optimization!
#
# It's called optim().  optim() takes a function as an input.
# Fortunately, we've already written a function!
#
# Let's modify our function a bit to return the log-likelihood,
# and print the result:

# Function that calculates the log-likelihood of coin flip data
# given a value of P_heads_guess
calc_prob_coin_flip_data2 = function(P_heads_guess, coin_flips)
{
# Empty list of probabilities
probs_list = rep(NA, times=length(coin_flips))

for (i in 1:length(coin_flips))
{
# Print an update
#cat("\nAnalysing coin flip #", i, "/", length(coin_flips), sep="")

# Get the current coin flip
coin_flip = coin_flips[i]

# If the coin flip is heads, give that datum probability P_heads_guess
# If tails, give it (1-P_heads_guess)

if (coin_flip == "H")
{
probs_list[i] = P_heads_guess
} # End if heads

if (coin_flip == "T")
{
probs_list[i] = 1 - P_heads_guess
} # End if tails
} # End for-loop

# We get the probability of all the data by multiplying
# all the probabilities; taking the log gives the log-likelihood
LnL = log(prod(probs_list))

# Error correction: if -Inf, reset to a low value
if (is.finite(LnL) == FALSE)
{
LnL = -1000
}

# Print some output
print_txt = paste0("\nWhen P_heads=", P_heads_guess, ", LnL=", LnL)
cat(print_txt)

# Return result
return(LnL)
}

# Try the function out:
calc_prob_coin_flip_data2(P_heads_guess=0.5, coin_flips=coin_flips)
calc_prob_coin_flip_data2(P_heads_guess=0.7, coin_flips=coin_flips)

# Looks like it works!  Let's use optim() to search for the
# maximum likelihood value of P_heads.

# Set a starting value of P_heads
starting_value = 0.1

# Set the limits of the search
limit_bottom = 0
limit_top = 1

optim_result = optim(par=starting_value, fn=calc_prob_coin_flip_data2, coin_flips=coin_flips, method="L-BFGS-B", lower=limit_bottom, upper=limit_top, control=list(fnscale=-1))

# You can see the search print out as it proceeds.

# Let's see what ML search decided on:
optim_result

# Let's compare the ML estimate of P_heads from the optim() search
# with the simple binomial formula
optim_result$par

# Here's the formula:
P_heads_ML_estimate = (numTotal - numTails) / numTotal
P_heads_ML_estimate

# Wow!  Pretty good!
#
# Where does that formula for P_heads_ML_estimate come from?  We will discuss
# this in lab on the board, but here is a summary.
#
# It turns out
# that, in simple cases, we can use calculus to derive "analytical solutions"
# for maximizing the likelihood.  Basically, this involves:
#
# - taking the derivative of the likelihood function
#   (the derivative is the slope of a function)
# - setting the formula for the derivative to equal 0
#   (because, when the derivative=0, you are at a maximum or minimum)
# - solving for the parameter value that causes the derivative to
#   equal 0.
#
# This same basic calculus-based procedure provides a justification for
# the formulas that are used in e.g. linear regression.
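# As a numerical check on the calculus result (a sketch, not part of
# the original lab exercise), R's dbinom() function can confirm that
# the likelihood of 65 heads in 100 flips (the counts in the data
# above) peaks at heads/total:

```r
# Grid of candidate P_heads values
p_grid = seq(from=0.01, to=0.99, by=0.01)

# Log-likelihood of 65 heads out of 100 flips at each candidate value
LnLs = dbinom(x=65, size=100, prob=p_grid, log=TRUE)

# The grid value with the highest log-likelihood:
p_grid[which.max(LnLs)]   # 0.65, i.e. numHeads/numTotal
```
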
#
# The nice things about calculus-derived formulas for maximizing likelihood
# -- known as "analytical solutions" -- include:
#
# - they are very fast
# - the solution can be found without a computer (this was useful back in
#   the 20th century, when statistics was being developed without computers)
# - there is no chance of optimization error (optimization errors can occur
#   if likelihood surfaces are too flat, too bumpy, etc.)
#
# This raises the question:
# Why would anyone ever go through all the optimization rigamarole, when
# they could just calculate P_heads directly?
#
# Well, only in simple cases do we have a formula for the maximum likelihood
# estimate.  The optim() strategy works whether or not
# there is a simple formula.
#
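# A self-contained sketch of that point: here we hand optim() an
# anonymous log-likelihood function built from dbinom() (the 65-heads
# count matches the flips above), pretending no analytical formula
# exists. The bounds 0.0001/0.9999 avoid the -Inf log-likelihood at the
# exact edges:

```r
# Maximize the log-likelihood of 65 heads in 100 flips over P_heads
optim_sketch = optim(par=0.1,
    fn=function(p) dbinom(x=65, size=100, prob=p, log=TRUE),
    method="L-BFGS-B", lower=0.0001, upper=0.9999,
    control=list(fnscale=-1))
optim_sketch$par   # converges close to 0.65
```
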
# In real life science, with complex models of biological processes (such as
# epidemiological models), ML optimization gets used A LOT, but most scientists
# don't learn it until graduate school, if then.
#

# WHAT WE LEARNED
#
# You can now see that there is nothing magical, or even particularly
# complicated, about the following concepts:
#
# - likelihood
# - log-likelihood
# - maximum likelihood (ML)
# - inferring a parameter through ML

# IMPLICATIONS
#
# It turns out that likelihood is one of the core concepts in model-
# based inference, and statistics generally.  Bayesian analyses
# (which we will meet later), and methods from introductory
# statistics (linear models etc.), all rely on likelihood.

#######################################################
# PART 3: BUILDING BLOCKS OF MODELS IN R
#######################################################

# R has a number of functions that speed up the process
# of building models, and simulating from them.

# For example, for binomial processes, these are:

# dbinom(x, size, prob, log) -- the probability density
# (the likelihood) of x "heads"
#
# size = the number of tries (the number of coin flips)
#
# prob = the parameter value (e.g. P_heads)
#
# log = should the likelihood be log-transformed when output?
#

# The other functions are:
#
# rbinom(n, size, prob) -- simulate n binomial outcomes,
# using "size" and "prob" as above
#
# pbinom(q, ...) - used for calculating tail probabilities
# (the proportion of the distribution less than or equal to the value q)
#
# qbinom(p, ...) - the quantile function, i.e. the reverse of pbinom()
#

# Any time you hear about "probability densities", "distributions"
# "random number generators", "random draws", or "simulations"
# in R, functions like these are being used.
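# A quick sketch of the four binomial functions together. The 65/100
# numbers echo the coin-flip counts above; the other values are just
# illustrative:

```r
dbinom(x=65, size=100, prob=0.5)   # likelihood of exactly 65 heads in 100 flips
rbinom(n=5, size=100, prob=0.5)    # simulate 5 experiments of 100 flips each
pbinom(q=50, size=100, prob=0.5)   # P(50 or fewer heads in 100 flips)
qbinom(p=0.5, size=100, prob=0.5)  # smallest q with pbinom(q) >= 0.5 (here, 50)
```
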

# For uniform distributions, the equivalent functions are:
#
# dunif, runif, punif, qunif

# For normal (=Gaussian/bell-curve) distributions, the equivalents are:
#
#
# dnorm, rnorm, pnorm, qnorm

# Random seeds, and random number generators

# Random numbers in computers are not truly random; they are
# "pseudorandom".  To make random numbers repeatable, a "seed" number
# is used to initialise the random number generator.

# Compare:

rnorm(n=1, mean=6, sd=1)
rnorm(n=1, mean=6, sd=1)
rnorm(n=1, mean=6, sd=1)

set.seed(54321)
rnorm(n=1, mean=6, sd=1)
rnorm(n=1, mean=6, sd=1)

set.seed(54321)
rnorm(n=1, mean=6, sd=1)
rnorm(n=1, mean=6, sd=1)

set.seed(54321)
rnorm(n=1, mean=6, sd=1)
rnorm(n=1, mean=6, sd=1)

# By re-setting the seed, we re-set the random number generator

#######################################################
# LAB QUIZ: REFER TO THE QUIZ ON CANVAS
#######################################################