Basic Introduction to R

#######################################################
# CHAPTER 1: PROPAGANDA FOR R
# 
# R is a programming language designed primarily for 
# data analysis and statistics.
#
# The big advantages of R are:
#
# 1. It is free.
# 2. It is easy.
#
# Point #2 sometimes takes some convincing, especially
# if you haven't programmed before. But, trust me, R
# is WAY easier than ANY other programming language
# I have ever tried, which you could also do serious
# science with.
#
# MATLAB is probably the only other competitor for ease 
# of use and scientific ability, but Matlab costs 
# hundreds of dollars, and hundreds of dollars more for
# the various extensions (for e.g. statistics, image 
# analysis, etc.).  This works great when your institution
# has a site license for Matlab, but it suck when you 
# move to a new school/job.
#
# R is easy because most of the "computer science 
# details" -- how to represent numbers and other 
# objects in the computer as binary bits/bytes,
# how to manage memory, how to cast and type variables,
# blah blah blah, are done automatically behind the 
# scenes. 
# 
# This means almost anyone can get going with R in 
# minutes, by just typing in commands and not having
# to spend days learning the difference between a 
# short and long integer, blah blah blah.
#
# That said, the cost of this automation is that R
# is slower than other programming languages.  However, 
# this doesn't matter for common, basic sorts of 
# statistical analyses -- say, linear regression with 
# 1,000 data observations.  It DOES matter if you are 
# dealing with huge datasets -- say, large satellite
# images, or whole genomes.
#
# In these situations, you should use specialist 
# software, which is typically written in Python
# (for manipulating textual data, e.g. genome files)
# or Java, C, or C++ (for high-powered computing).
#
# (Although, in many situations, the slow parts of 
#  R can be re-programmed in C++, and accessed from
#  R.)
#
# R is also pretty bad for large, complex programming
# projects.  Python and C++ are "object-oriented."
# In computer-programming, "objects" help organize
# your data and tasks.  For example, if you are 
# writing a video game, you might want to program 
# many different monsters. However, you don't want to
# re-program the behavior of each monster from scratch.
# Instead, you create a general object, "monster", and 
# give it attributes (speed, armor, etc.).  The "monster"
# object takes inputs (like what enemies are close to 
# it) and produces outputs (motion or attacks in a 
# certain direction).
#
# Each specific type of monster would be an instance 
# of the monster class of objects.  Each individual
# monster of a specific type would be its own object,
# keeping track of hit points, etc.
#
# You can see that, for serious programming, this 
# object-oriented style would be the way to go. Therefore,
# "real" computer-science classes teach you this way
# of programming. This is great if you want to 
# go work in the video game industry and devote your
# life to coding.
#
# However, if you just want to plot some data and 
# run some statistical tests and do some science, 
# you don't want to have to go through a bunch of 
# rigamarole first. You just want to load the data
# and plot it and be done.  This is what R is for.
#
#######################################################

#######################################################
# CHAPTER 2: GETTING R
#######################################################
# 
# R is free and available for all platforms. You can 
# download it here.:
# 
# http://www.r-project.org/
# 
# Tip for free, scientific software: 
# 
# Unless you are doing something expert, you will want 
# the "binary" file rather than the source code.
#
# Programmers write source code in text files.
# 
# A compiler program turns this into a "binary" which 
# actually executes (runs) on a computer.
# 
# Compiling from source code can take minutes or hours, 
# and sometimes will crash if your computer & compiler
# are not set up right.
# 
# A binary should just work, once you have installed it,
# assuming you've got the binary for your machine.
#
# ASSIGNMENT: Once you have R installed (it appear in 
# "Applications" on a Mac, or "Program Files" on a 
# Windows machine), open it to make sure it works. 
# Then, return to this tutorial.
# 
########################################################

#######################################################
# CHAPTER 3: GET A *PLAIN*-TEXT EDITOR
#######################################################
#
# Many people make the mistake of typing commands 
# into R, but not saving those commands.
# 
# *ALWAYS* SAVE YOUR COMMANDS IN A TEXT FILE!!
# *ALWAYS* SAVE YOUR COMMANDS IN A TEXT FILE!!
# *ALWAYS* SAVE YOUR COMMANDS IN A TEXT FILE!!
#
# Got it?  Good.
#
# The next mistake people make is to use Word or 
# some other monstrosity to save their commands. 
# You can do this if you want, but the formatting 
# etc. just gets in the way.  
#
# Find or download a PLAIN-TEXT editor (aka ASCII
# text editor). Common examples:
#
# Mac: TextWrangler (free) or BBedit 
#
# Windows: Notepad (free, search Programs) or Notetab
# 
# Or: versions of R that have a GUI (GUI=Graphical User
# Interface) also have a built-in editor.
# 
#
# WHY SAVE YOUR COMMANDS?
# 
# Because you can come back in 6 months and run the 
# same analysis again, just by pasting the commands
# back into R.
#
# Trust me, this is MUCH better than trying to remember
# what buttons to click in some software. 
# 
# And, anytime
# you need to do something more than a few times, 
# it gets super-annoying to click all of the buttons
# again and again. 
# 
# This is why most serious scientific software is 
# command-line, rather than menu-driven.
# 
#
# HOW TO TAKE NOTES IN R SCRIPTS
# 
# Put a "#" symbol in front of your comments. Like I 
# did here. COMMENTS ARE GOOD! COMMENT EVERYTHING!
# 
#
# ASSIGNMENT: Once you've found a plain-text editor, 
# return to this tutorial.
#######################################################

#######################################################
# CHAPTER 4: R BASICS
#######################################################
# 
# There are two major hurdles in learning R:
# 
# 1. Getting/setting your working directory.
#
# 2. Loading your data
#
# 3. Learning the commands to do what you want.
#
# Points #1 and #2 are easy to learn -- just don't
# forget!  You can never get anything significant
# done in R if you can't get your data loaded.
# 
# Point #3 -- No one knows "all" of R's commands. As
# we see, every package and function creates 
# additional commands.
# 
# Your goal is just to learn the basics, and then learn
# how to find the commands you need.
# 
# ASSIGNMENT: Type/paste in each of the commands below
# into your text file, then into R. Take notes as 
# you go.
#######################################################

#######################################################
# Working directories:
#
# One of the first things you want to do, usually, is 
# decide on your working directory.
# 
# You should create a new directory using:
#
# Mac: Finder
# Windows: Windows Explorer (or File Manager or
# whatever it's called these days)
#
# ROOLZ FOR FILES AND DIRECTORIES IN R
# 
# 1. Put the directory somewhere you will find it 
#    again.
#
# 2. Never use spaces in filenames.
# 
# 3. Never use spaces in directory names.
# 
# 4. Never use spaces in anything involving 
#    files/directories.
#
# 5. Never!  It just causes problems later. The 
#    problems are fixable, but it's easier to 
#    just never use spaces.  
# 
# 6. Use underscore ("_") instead of spaces.
# 
#
# FINDING MAC/WINDOWS DIRECTORIES IN R
#
# Usually, you can drag/drop the file or directory
# into R to see the full path to the file.
# Copy this into the 'here' in wd="here", below.
#
# 
# CHANGE FILE SETTINGS IN MACS/WINDOWS
# 
# Modern Macs/Windows hide a lot of information 
# from you.  This makes life easier for John Q. Public,
# but makes it harder for scientists.  
# 
# Good preferences for your file viewer:
#
# * Turn ON viewing file extensions (.txt, .docx, etc.)
# * Turn ON viewing of hidden files 
# * Change file viewing to "list" format
#
# See Preferences in Mac Finder or Windows Explorer.
# 
########################################################

# On my Mac, this is a working directory I have chosen
# (change it to be yours)
wd = "/Users/nickm/Desktop/Rintro/"
wd = "~/Desktop/Rintro/"
wd="/drives/GDrive/REU_example"

# On a PC, you might have to specify paths like this:
#wd = "c:\\Users\\nick\\Desktop\\_ib200a\\ib200b_sp2011\\lab03"

# setwd: set working directory
setwd(wd)

# getwd: get working directory
getwd()

# list.file: list the files in the directory
list.files()

#######################################################
# PLAYING WITH R
#######################################################

# (Preliminary: this might be useful; uncomment if so)
# options(stringsAsFactors = FALSE)

# concatentate to a list with c()
student_names = c("Nick", "Hillary", "Sonal")

# describe what the function "c" does:
#
grade1 = c(37, 100, 60)
grade2 = c(43, 80, 70)
grade3 = c(100, 90, 100)
grade1
grade2
grade3
print(grade3)

# column bind (cbind)
temp_table = cbind(student_names, grade1, grade2, grade3)
class(temp_table)

# convert to data frame
grade_data = as.data.frame(temp_table)
class(grade_data)

# Don't convert to factors
grade_data = as.data.frame(temp_table, stringsAsFactors=FALSE)

# add column headings
col_headers = c("names", "test1", "test2", "test3")
names(grade_data) = col_headers
print(grade_data)

# change the column names back
old_names = c("student_names", "grade1", "grade2", "grade3")
names(grade_data) = old_names

grade_data$grade1

# Let's calculate some means
# mean of one column
mean(grade_data$grade1)

# R can be very annoying in certain situations, e.g. treating numbers as character data
# What does as.numeric do?
#
as.numeric(grade_data$grade1)
grade_data$grade1 = as.numeric(as.character(grade_data$grade1))
grade_data$grade2 = as.numeric(as.character(grade_data$grade2))
grade_data$grade3 = as.numeric(as.character(grade_data$grade3))
print(grade_data)

# mean of one column
mean(grade_data$grade1)

# apply the mean function over the rows, for just the numbers columns (2, 3, and 4)
apply(X=grade_data[ , 2:4], MARGIN=2, FUN=mean)

# Why doesn't this work?
mean(grade_data)
# What caused the warning message in mean(grade_data)?

# How about this?
colMeans(grade_data[,2:4])

# How about this?
colMeans(grade_data[,2:4])

# More functions
sum(grade_data$grade1)
median(grade_data$grade1)

# standard deviation
apply(X=grade_data[ , 2:4], MARGIN=1, FUN=sd)

# store st. dev and multiply by 2
mean_values = apply(grade_data[ , 2:4], 1, mean)
sd_values = apply(grade_data[ , 2:4], 1, sd)
2 * sd_values

# print to screen even within a function:
print(sd_values)

# row bind (rbind)
grade_data2 = rbind(grade_data, c("means", mean_values), c("stds", sd_values))

#######################################################
# GETTING DATA
#######################################################
#
# Let's download some data.  Francis Galton was one 
# of the founders of statistics.  He was also 
# the cousin of Charles Darwin. Galton invented the 
# term "regression".  These days, "regression" means
# fitting the best-fit line to a series of x and y
# data points.
# 
# But, why is the weird term "regression" used for this?
# What is regressing?
#
# Let's look at Galton's original dataset: the heights
# of parents and children.
#
# Use your web browser to navigate here:
#
# http://www.randomservices.org/random/data/Galton.html
# 
# ...and save "Galton's height data" as Galton.txt 
# (right-click, save) to your
# working directory.
#
# After doing this, double-click on Galton.txt and 
# view the file, just to see what's in there.
#
#######################################################

# Before proceeding, double-check that your data file
# is in the working directory:
getwd()
list.files()

# Let's store the filename in a variable
# 
# Note: In Nick's head:
# 
# "wd" means "working directory"
# "fn" means "filename"
# 
#wd = "/drives/Dropbox/_njm/__packages/Rintro/"
#setwd(wd)
fn = "Galton.txt"

# Now, read the file into a data.frame
heights = read.table(file=fn, header=TRUE, sep="\t")

# Now, look at "heights"
heights

# Whoops, that went by fast!  Let's just look at the 
# top of the data table
head(heights)

# Let's get other information on the data.table

# Column names
names(heights)

# Dimensions (rows, columns)
dim(heights)

# Class (data.frame, matrix, character, numeric, list, etc.)
class(heights)

# The heights data is the adult height of a child (in inches),
# and the "midparent" height -- the mean of the two parents.

# QUESTION: Do the means of parent and child height differ?

# Means
colMeans(heights)

colMeans(heights[,-4])

# Standard deviations
apply(X=heights[,-4], MARGIN=2, FUN=sd)

# Min & Max
apply(X=heights[,-4], MARGIN=2, FUN=min)
apply(X=heights[,-4], MARGIN=2, FUN=max)

# They seem pretty close, but let's do a test

# Make sure numbers columns are numeric
heights$Family = as.numeric(heights$Family)
heights$Father = as.numeric(heights$Father)
heights$Height = as.numeric(heights$Height)
heights$Kids = as.numeric(heights$Kids)

# Let's add the Midparent column
heights[,c("Father","Mother")]

# Take the mean of Father and Mother columns, store in column "Midparent"
heights$Midparent = apply(X=heights[,c("Father","Mother")], MARGIN=1, FUN=mean)

# View the new column
head(heights)

# Population Mean Between Two Independent Samples
# http://www.r-tutor.com/elementary-statistics/inference-about-two-populations/population-mean-between-two-independent-samples

# (change "Child" to "Height")
ttest_result1 = t.test(x=heights$Midparent, y=heights$Height, paired=FALSE, alternative="two.sided")
ttest_result1

# But wait, this test assumes that the samples from each population 
# are independent. Do you think parent heights and child heights are 
# independent?

# Probably not.  Actually, these samples are paired, so let's
# check that:

# Population Mean Between Two Matched Samples
# http://www.r-tutor.com/elementary-statistics/inference-about-two-populations/population-mean-between-two-matched-samples

ttest_result2 = t.test(x=heights$Midparent, y=heights$Height, paired=TRUE, alternative="two.sided")
ttest_result2

# Compare the two:
ttest_result1
ttest_result2

# Interestingly, it looks like parents are slightly taller than the children!
# 
# Is this statistically significant?
#
# But is it a large effect?  Is it *practically* significant?
#

# Let's plot the histograms
hist(heights$Midparent)
hist(heights$Height)

# That's a little hard to compare, due to the different 
# automated scaling of the x-axis.

# Let's fix the x-axis to be (5 feet, 7 feet)
xlims = c(5*12, 7*12)
hist(heights$Midparent, xlim=xlims)
hist(heights$Height, xlim=xlims)

# And fix the y-axis
# Let's fix the y-axis to be (0, 220)
ylims = c(0, 220)
hist(heights$Midparent, xlim=xlims, ylim=ylims)
hist(heights$Height, xlim=xlims, ylim=ylims)

# Let's plot the means and 95% confidence intervals on top

# Midparent values
hist(heights$Midparent, xlim=xlims, ylim=ylims)

# Plot the mean
abline(v=mean(heights$Midparent), lty="dashed", lwd=2, col="blue")

# Plot the 95% confidence interval (2.5% - 97.5%)
CI_025 = mean(heights$Midparent) - 1.96*sd(heights$Midparent)
CI_975 = mean(heights$Midparent) + 1.96*sd(heights$Midparent)
abline(v=CI_025, lty="dotted", lwd=2, col="blue")
abline(v=CI_975, lty="dotted", lwd=2, col="blue")

# Child values
hist(heights$Height, xlim=xlims, ylim=ylims)

# Plot the mean
abline(v=mean(heights$Height), lty="dashed", lwd=2, col="blue")

# Plot the 95% confidence interval (2.5% - 97.5%)
CI_025 = mean(heights$Height) - 1.96*sd(heights$Height)
CI_975 = mean(heights$Height) + 1.96*sd(heights$Height)
abline(v=CI_025, lty="dotted", lwd=2, col="blue")
abline(v=CI_975, lty="dotted", lwd=2, col="blue")

# Let's put it all in a nice PDF format to save it

# Open a PDF for writing
pdffn = "Galton_height_histograms_v1.pdf"
pdf(file=pdffn, width=8, height=10)

# Do 2 subplots
par(mfrow=c(2,1))

# Midparent values
hist(heights$Midparent, xlim=xlims, ylim=ylims, xlab="height (inches)", ylab="Count", main="Midparent heights")

# Plot the mean
abline(v=mean(heights$Midparent), lty="dashed", lwd=2, col="blue")

# Plot the 95% confidence interval (2.5% - 97.5%)
CI_025 = mean(heights$Midparent) - 1.96*sd(heights$Midparent)
CI_975 = mean(heights$Midparent) + 1.96*sd(heights$Midparent)
abline(v=CI_025, lty="dotted", lwd=2, col="blue")
abline(v=CI_975, lty="dotted", lwd=2, col="blue")

# Child values
hist(heights$Height, xlim=xlims, ylim=ylims, xlab="height (inches)", ylab="Count", main="Child heights")

# Plot the mean
abline(v=mean(heights$Height), lty="dashed", lwd=2, col="blue")

# Plot the 95% confidence interval (2.5% - 97.5%)
CI_025 = mean(heights$Height) - 1.96*sd(heights$Height)
CI_975 = mean(heights$Height) + 1.96*sd(heights$Height)
abline(v=CI_025, lty="dotted", lwd=2, col="blue")
abline(v=CI_975, lty="dotted", lwd=2, col="blue")

# Close the PDF writing
dev.off()

# Write a system command as a text string
cmdstr = paste("open ", pdffn, sep="")
cmdstr

# Send the command to the computer system's Terminal/Command Line
system(cmdstr)
# The PDF should hopefully pop up, e.g. if you have the free Adobe Reader

# The difference in means is very small, even though it appears to be 
# statistically significant. 
#
# This is a VERY IMPORTANT lesson: 
#
# "statistically significant" DOES NOT ALWAYS MEAN "practically "significant",
# "interesting", "scientifically relevant", etc.
# 
# 
# The difference may have to do with:
# 
# * Galton's 'method' of dealing with the fact that 
# male and female children have different average heights -- 
# he multiplied the female heights by 1.08!
#
# * Different nutrition between the generations
#
# * Maybe the adult children weren't quite all fully grown
#
# * Chance rejection of the null
#
# Who knows?

# You may have noticed that the standard deviations look to be 
# a lot different.  Can we test for this?

# Yes! The null hypothesis is that the ratio of the 
# variances is 1:
Ftest_result = var.test(x=heights$Midparent, y=heights$Height, ratio=1, alternative="two.sided")
Ftest_result

# We get extremely significant rejection of the null.  What is 
# the likely cause of the lower variance in the midparent data?

#
# For the complex story of Galton's original data, see:
# 
# http://www.medicine.mcgill.ca/epidemiology/hanley/galton/
#
# James A. Hanley (2004). 'Transmuting' women into men: 
# Galton's family data on human stature. The American Statistician, 58(3) 237-243. 
# http://www.medicine.mcgill.ca/epidemiology/hanley/reprints/hanley_article_galton_data.pdf
# 
# BTW, Galton was both a genius, and promoted some deeply flawed ideas
# like eugenics:
# http://isteve.blogspot.com/2013/01/regression-toward-mean-and-francis.html
# 

# We noted before that child and parent heights might not be 
# independent.  Let's test this!

# QUESTION: is there a relationship?

# Start by plotting the data:
plot(x=heights$Midparent, y=heights$Height)

# It looks like there is a positive relationship: 
# taller parents have taller children.

# However, it's a little bit hard to tell for 
# sure, because Galton's data is only measured
# to the half-inch, so many dots are plotting
# on top of each other.  We can fix this by
# "jittering" the data:

# Plot the data, with a little jitter
plot(x=jitter(heights$Midparent), y=jitter(heights$Height))

# It looks like there's a positive relationship, which makes
# sense.  Can we confirm this with a statistical test?

# Let's build a linear model (lm)
lm_result = lm(formula=Height~Midparent, data=heights)
lm_result

# This just has the coefficients, this doesn't tell us much
# What's in the linear model? A list of items:
names(lm_result)

# See the statistical results
summary(lm_result)

# Analysis of variance (ANOVA)
anova(lm_result)

# You can get some standard diagnostic regression plots with:
plot(lm_result)

# Let's plot the regression line on top of the points
intercept_value = lm_result$coefficients["(Intercept)"]
slope_value = lm_result$coefficients["Midparent"]

# Plot the points
plot(x=jitter(heights$Midparent), y=jitter(heights$Height))

# Add the line
abline(a=intercept_value, b=slope_value, col="blue", lwd=2, lty="dashed")

# It's a little hard to tell if the slope is 1:1 or not,
# Because the x-axis and y-axis aren't the same
# Let's fix this

# Plot the points
xlims = c(5*12, 6.5*12)
ylims = c(5*12, 6.5*12)
plot(x=jitter(heights$Midparent, factor=3), y=jitter(heights$Height, factor=3), xlab="Midparent height", ylab="Child height", xlim=xlims, ylim=ylims)
title("Galton's height data")

# Add the regression line
abline(a=intercept_value, b=slope_value, col="blue", lwd=2, lty="dashed")

# Add the 1:1 line
abline(a=0, b=1, col="darkgreen", lwd=2, lty="dashed")

# Is the slope statistically different from 1:1?

# We can test this by subtracting a 1:1 relationship from the data, and seeing if 
# the result has a slope different from 0
child_minus_1to1 = heights$Height - (1/1*heights$Midparent)
heights2 = heights
heights2 = cbind(heights2, child_minus_1to1)

# Let's build a linear model (lm)
lm_result2 = lm(formula=child_minus_1to1~Midparent, data=heights2)
lm_result2

# This just has the coefficients, this doesn't tell us much
# What's in the linear model? A list of items:
names(lm_result2)

# See the statistical results
summary(lm_result2)

# Analysis of variance (ANOVA)
anova(lm_result2)

# You can get some standard diagnostic regression plots with:
plot(lm_result2)

# Let's plot the regression line on top of the points
intercept_value = lm_result2$coefficients["(Intercept)"]
slope_value = lm_result2$coefficients["Midparent"]

# Plot the points
plot(x=jitter(heights2$Midparent), y=jitter(heights2$child_minus_1to1), xlim=xlims, xlab="Midparent heights", ylab="Child heights minus 1:1 line", main="Relationship after subtracting 1:1 line")

# Add the regression line
abline(a=intercept_value, b=slope_value, col="blue", lwd=2, lty="dashed")

# Add the expected line if the relationship was 1:1
abline(a=0, b=0, col="darkgreen", lwd=2, lty="dashed")

# Yep, the relationship is definitely different than 1:1

# Why is the relationship between parent height and offspring
# height LESS THAN 1:1???
#
# Why do tall parents tend to produce offspring shorter
# than themselves?   Why does height seem to "regress"?
# What about the children of short parents?  Do they 
# 'regress'?
# 
# What are possible statistical consequences/hazards of this?
#
# Why is all of this rarely explained when regression 
# is taught?
#
PhyloWiki

Assisting research and education in phylogenetics and evolution

Page tags

Add a new page