Explore Genome Size Data

This R script is part of a lab exercise for BioSci210: Evolution, at the University of Auckland.

It is designed to be run simply by copying-and-pasting from the script, into R. The necessary files are hosted online, so you will have to be connected to the internet for this to work.

Copy and paste the script in chunks, reading the comments as you go, and looking (and taking the time to understand) each plot that that is being produced. Be careful not to skip sections, this may cause errors, forcing you to go back and re-do the pasting again.

Note for running offline: If you want to run the script offline, you can save each file to a working directory on your local hard drive, and then change the R script to look at this working directory (wd="working directory") using the setwd() and getwd() functions, and editing the trfn and datafn lines to refer to the local copies of the files. You can also download the R functions from gist, and source() the local versions of those files.)

Main Goals:

The main goals of this exercise include:

  • Observe the diversity of genome sizes
  • Understand scatter plots and regressions
  • Understand log-scales for the x- and y-axis of scatter plots
  • Understand where human genome size fits in the diversity of genome size
  • Understand the difference between simple linear regression, and Phylogenetic Independent Contrasts
  • Understand that correlations that may appear significant when phylogeny is ignored, may disappear once phylogeny is taken into account
  • Observe some phylogenetically-structured patterns in genome size, and speculate about possible causes
  • Ask yourself if the genome size of humans appears "special" or "optimal" in any obvious way

Side Goals

  • This lab is not intended to train you in R, but it will give you some exposure to it. If you are interested in starting to learning R (which is a great job skill for biologists and any other scientists), I would recommend (1) working through this introductory tutorial: http://phylo.wikidot.com/2014-summer-research-experiences-sre-at-nimbios-for-undergra ; (2) working through the genome-size script slowly, trying to understand the major actions being done; (3) take a workshop or a biostatistics/bioinformatics course with a heavy R focus.
  • Similarly, this lab is not intended to train you in statistics, but it will give you some additional exposure.

Genome size analysis script (should run "out of the box")

Download the R script, "_lab4_exploring_genome_size_data_v4.R", from this link, or from in Files, at the bottom of this webpage.

The script is also pasted below, so you can copy-paste it from there if you prefer.

#######################################################
# BIOSCI210, Lab 4: Exploring Genome Size Variation
#
# In this lab, we will explore genome size, and how it
# varies across vertebrates.
#
# Rather than give you a large written introduction to
# the material, please be sure to watch these lectures before 
# finishing the lab:
#
# Lecture 25: Quantitative methods in macroevolution
# Lecture 28: Nonadaptive evolution
#
# The purpose of this lab is to *explore* the data,
# basically by running the given code, looking at
# the plots, and *thinking* about the plots
# and what they tell you.
#
# The purpose is not to teach you how to code in R,
# although it will give you a little more experience
# in R (perhaps building on other courses). All code should
# run "out of the box" by copy-paste into the R
# window (or highlighting code in RStudio and
# hitting CTRL-RETURN), as long as:
# 
# * you have R packages "ape" and "phytools" installed.
#
# * you run the commands in order -- if you accidentally
#  skip some commands, later commands may give errors.
#
# Also: You might find this script useful as a starting 
# point, if you ever have to do a phylogenetically-aware analysis of
# multispecies data in future work.
#
# Answer the questions in the Canvas quiz. They are 
# marked in this script with "Q:". Choose the best 
# answer!
#######################################################

#######################################################
# Genome Size dataset
# 
# Our genome size data are taken (with permission) from:
# 
# Gregory, T. Ryan (2020). Animal Genome Size Database. http://www.genomesize.com .
# 
# C-value is the haploid genome size in picograms. 1 picogram of DNA is roughly
# 1 billion bases, i.e. 1 Gigabase or Gb. The human haploid genome size is
# about 3.3 Gb.
#
#
# (Actually 1 pg DNA is about 0.978 Gb.)
#######################################################

#######################################################
# 
# LAB 4 SETUP INSTRUCTIONS
# 
# For lab 4, you will need access to these:
#
# (Note: all of these are available in the computer lab):
#
# 1. Web browser, and access to CANVAS (for quizzes,
#    lectures, online disucssions etc.).
#
# 2. Ability to copy/paste text. This is possible even 
#    on a smartphone, but probably is much easier on
#    a desktop/laptop/tablet.
#
# 3. You need to have access to R, and most students 
#    like to use RStudio. These are installed on the
#    lab computers. These programs are free, so you can 
#    also install these programs on your own computer, 
#    or use them through a browser with the 
#    "RStudio Cloud" website.
#
# Download R:
# http://www.r-project.org/
#
# Download RStudio:
# https://rstudio.com/products/rstudio/
#
# RStudio Cloud (running RStudio through a website)
# https://rstudio.cloud/
# (you will want to register a free account)
#
# 4. You will need 2 R packages installed. These
#    come installed on the lab computers, but otherwise
#    you may have to install them. Un-comment these
#    lines and run these commands in R:
# 
# install.packages("ape")
# install.packages("phytools")
# library(ape)
# library(phytools)
#
# 5. The TAs and I will take questions on Piazza all 
#    week. I also plan to do live Zoom help sessions,
#    so if you want to access these, please install 
#    Zoom. 
#
#######################################################

#######################################################
# NOTE: Phylogenetics, and R, are two of the areas 
# for which New Zealand and the University of 
# Auckland are well-known in science. For example,
# the original publication of R, Ihaka & 
# Gentleman (1996), now has over 10,000 citations
# in the scientific literature. The authors, Ross
# Ihaka and Robert Gentleman, are from the 
# Department of Statistics at the 
# University of Auckland.
#
# Ihaka, Ross; Gentleman, Robert (1996). "R: A Language 
# for Data Analysis and Graphics." Journal of 
# Computational and Graphical Statistics. Volume 5,
# Issue 3, pages 299-314.
# https://amstat.tandfonline.com/doi/abs/10.1080/10618600.1996.10474713
#
# More on Ross Ihaka:
#
# Middleton, Juliet (2009). "Academic unfazed by rock star 
# status." New Zealand Herald, Jan. 10, 2009.
# https://www.nzherald.co.nz/nz/academic-unfazed-by-rock-star-status/LMU5EIMP7QCMZFCBM3UNW4C7CI/
# 
# Ihaka Lecture Series at the University of Auckland
# https://www.auckland.ac.nz/en/science/about-the-faculty/department-of-statistics/ihaka-lecture-series.html
# 
#######################################################

#######################################################
# Background on exercises using R
#
# Real-life computational biologists tend to be 
# pluralists -- they will use whatever combination of
# programs, websites, and programming languages 
# "works" for their current tasks.
#
# Due to the Covid situation, we may not have access
# to the computer lab.  In some ways, this is an 
# advantage!  By being forced to install some programs
# on your own computers (with a variety of different
# operating systems, etc.), and figuring out how to 
# make them work on your computer, you are doing
# something much closer to the day-to-day work
# of a computational biologist.
#
# The main thing you need to make this work is: 
# patience.  When you get stuck, 
# 
# 1. First, take a deep breath, and see if you can 
#    identify the problem.
#
# 2. Second, google the problem (e.g., google the 
#    error message). If it's an installation 
#    problem, google your question along with the 
#    name of your operating system. (E.g., "problem 
#    installing AliView on Windows" is better than
#    just "problem installing AliView)
# 
# 3. If you can't figure out the issue after a few
#    minutes, POST YOUR QUESTION TO PIAZZA.
#
#    a. Note: Your questions will get better answers
#       if you include sufficient information in your
#       question. "It's not working!" does not give
#       us any ability to help. We instructors can't 
#       see what you are seeing, unless you show us!
#
#    b. Therefore, include screenshots of error messages,
#       or (better) copy-paste the commands you used, AND
#       the compete error messages that resulted. This 
#       way, other people can search and find the 
#       questions/answers for their own questions.
# 
# 4. If you find a solution, please post that! This 
#    will also help others. 
#
#######################################################

########################################################
# Advice on navigating the file system
#
# Every operating system (OS) has a different file 
# viewer. This is the best place to see where your
# folders and files are, and to arrange files/folders
# so that you can find them.
#
# On Windows, the file viewer is Windows Explorer, and
# can usually be opened with WindowsKey+E
#
# On Macs, the file viewer is Finder, and can be opened
# by clicking the little blue faces icon: 
# https://www.lifewire.com/finders-icon-view-options-2260725
#
# On Linux -- if you have Linux, you probably know 
# all about file viewing already.
#
########################################################

########################################################
# Advice on filenames / directory names
#
# Short version: STICK TO PLAIN-TEXT, NO SPACES
#
# While modern operating systems often allow filenames
# and file paths (directory names) to have spaces in 
# them, as well as apostrophes, inverted commas, 
# regular commas, etc., you should avoid these.
#
# This is because R, and many other pieces of scientific
# software, react badly to filenames with spaces, 
# special characters, etc.  This creates endless 
# difficulties.
# 
# Therefore, keep it simple:
#
# BAD PATH AND FILENAME:
# /Nick's cool documents/Nick's lab 3.txt
#
# GOOD PATH AND FILENAME
# /Documents/Nicks_lab3.txt
# 
########################################################

########################################################
# Advice on plain-text editors:
########################################################
#
# In computational biology, data, inputs, code, etc., are typically
# contained in PLAIN-TEXT (ASCII) files. These files have *no* fonts,
# text-sizes, invisible formatting, weird character codes, etc.
# 
# Plain-text files are best viewed in an ASCII text viewer, so that you can
# see the difference between spaces and tabs, you can see blank lines at
# the end of some files, etc.
# 
# Note also that THE DIFFERENCE BETWEEN TABS AND SPACES IS IMPORTANT IN
# THESE TEXT FILES. COMMON PROBLEMS INCLUDE:
# 
# Tabs / spaces at the end of a line, after any text. Delete these in the
# text-editor.
# A line in the text file that looks blank, but is actually 8 tabs (or
# whatever). This can be produced by e.g. copying/pasting a series of
# distance matrices from Excel to text. When you do this, just check the
# blank lines for tabs, and delete these tabs in the text-editor so the
# line is actually blank.
#
# Good programs:
# 
# TextWranger (mac), BBedit (mac), Notetab (windows), Rstudio or R.app
# editors, any code editor
# 
# Bad (bad Bad BAD!!) programs:
# Word/Office (mac/windows), WordPad (windows), TextEdit (Mac)
# 
# Marginal:
# Notepad (windows)
# 
########################################################

#######################################################
#######################################################
# R script for analyzing genome size data for vertebrates.
# Should run "out of the box", if you are online and
# have the R packages "ape" and "phytools" installed.
#######################################################
#######################################################
# ABBREVIATIONS
#    NOTES ON READING NICK'S CODE:
#       "wd" means "working directory"
#       "fn" means "filename"
#       "trfn" means "tree filename"
#       "data_fn" means "data filename"
#       "TF" stores a series of TRUE/FALSE results
#       "pch" means "point character", e.g. for plotting points
#       "cex" means "character expansion", e.g. char. size
# 
#       Scientific notation in R:
#       3.65e2 means 3.65 x 10^2 = 365
#       3.65e-2 means 3.65 x 10^-2 = 0.0365

#######################################################

# Get the working directory (set it to a local directory, if you desire)
wd = getwd()

# Specifying the working directory on Nick's machine
# wd = "/drives/GDrive/__classes/BIOSCI210/lab4_genome_size/"

# See the working directory
wd

# Tell R what the working directory is, with the setwd command
setwd(wd)

# Nick's machine - make PDFs of all plots
# pdffn = "Lab4_plots_v1.pdf"
# pdf(file=pdffn, width=6, height=6)

# Double-check the working directory
getwd()

###############################################################
# Setup
###############################################################
# Load the APE and BioGeoBEARS packages
# (APE = Analysis of Phylogenetics and Evolution)
# If the library command doesn't work, first type 
# install.packages("ape")
# install.packages("phytools")
library(ape)
library(phytools)

# Get a linear regression function from Nick Matzke's "gist" archive
source("https://gist.githubusercontent.com/nmatzke/b21fcccb09b42959897d/raw/5f461b8eb1032fa7aad2e0ae5c13e9f9a1783cdb/_genericR_v1.R")

# Get a tree-table function from Nick Matzke's "gist" archive
source("https://gist.githubusercontent.com/nmatzke/a40dc1291115072889c81e416953edad/raw/bc3f657bc80aafa885bd39c3fdcd41b5d52d9816/prt_alone_v1.R")

###############################################################
# Files -- these are online links; if you like, 
# download to your working directory (you will then have to 
# change trfn and data_fn to refer to that local directory)
###############################################################
trfn = "http://phylo.wikidot.com/local--files/explore-genome-size-data/birds_mammals_subset_tree.newick"
data_fn = "http://phylo.wikidot.com/local--files/explore-genome-size-data/birds_mammals_subset_Cvalue_bodysize.txt"

###############################################################
# Read in the tree file
###############################################################
# Read a Newick-formatted phylogeny file (which has been subset to birds and mammals 
# found in the C-value data file) to an APE tree object 
tr = read.tree(trfn)

# Look at the first few tip labels.
tr$tip.label[1:10]

# Plot the phylogeny (skipping the tip names, as we have so many)
plot(tr, show.tip.label=FALSE)
axisPhylo()
mtext(text="Millions of years ago (Ma)", side=1, line=2.5)
title("Phylogeny of (selected) birds and mammals")

# Put a blue asterisk where humans are:
# Get the tip number for humans:
speciesname = "Homo_sapiens"
tipnum_of_Homo_sapiens = (1:length(tr$tip.label))[tr$tip.label == speciesname]
# Get the height of the tree
tr_height = get_max_height_tree(tr)
# Plot the blue asterisk
points(x=tr_height+4, y=tipnum_of_Homo_sapiens, pch="*", col="blue", cex=1.5)

###############################################################
# Read in the data file
###############################################################
# Read a table-delimited data file (bird and mammal C-values and body sizes) 
# to an R data.frame named "datadf"
datadf = read.table(file=data_fn, header=TRUE, sep="\t", stringsAsFactors=FALSE)

# Order the table to match the phylogeny tip labels
datadf = datadf[match(tr$tip.label, table=datadf$name),]

# Look at the first few species in the table
datadf$name[1:10]
head(datadf)

# Q1: Approximately what age is the common ancestor of birds and mammals, according
# to this phylogeny?

# Q2: How many species are in this phylogeny? (Hint: each species is 
# a "tip" in the tree, you can get the species names with
# tr$tip.label, and the length() function will count 
# how many items an input has.)

# Q3: Is this a mostly-complete phylogeny? (i.e. does it contain most of 
#     they known species of birds and mammals?)

#######################################################
# Plot body size versus C-value
#######################################################

# Raw data
plot(x=datadf$mass_g, y=datadf$Cvalue, xlab="Body mass (grams)", ylab="C-value (picograms)")
title("Birds+mammals: Body mass versus genome size")

# Q4: Is it safe to conclude there is no relationship between genome size
# and body mass based on this plot?

# Q5: What is the mass of that huge animal, in kilograms?  Hint: the "mass_g" column of datadf contains the mass in grams, and the "max" command gives you the maximum value in the input. Try: max(datadf$mass_g)

# Q6: Name the species with the largest body size in the data table.

# Which row matches the maximum mass?
TF = datadf$mass_g == max(datadf$mass_g)

# View the matching row:
datadf[TF,]

# Plot with log(body mass) on x-axis
plot(x=datadf$mass_g, y=datadf$Cvalue, xlab="Body mass (grams)", ylab="C-value (picograms)", log="x")
title("Birds+mammals: Body mass (log-scaled) versus genome size")

# Put a blue asterisk where humans are:
rownum = (1:nrow(datadf))[datadf$name == "Homo_sapiens"]
points(x=datadf$mass_g[rownum], y=datadf$Cvalue[rownum], pch="*", col="blue", cex=5)

# Q7: Do humans have a particularly large genome size?

# Q8: Does this plot suggest a relationship between body mass and genome size?

# The linear_regression_plot() function (which you loaded from Matzke's 
# code archive with the source() function, above) plots your "x" and "y" data,
# and also calculates & plots a standard linear regression,
# returning a slope, intercept, and p-value.
#
# IMPORTANT CRASH COURSE ON P-VALUES:
# 
# P-values are measures of *inconsistency* with the null
# hypothesis. They measure the probability that some
# statistic (or a more extreme value of that statistic)
# would be produced if the null model is true.
#
# For these regressions, the null model is that the 
# slope, m, equals 0.0 (a flat line). 
#
# By common convention, p-values less than 0.05 are taken
# to be "statistically significant" evidence against the 
# null hypothesis, because they indicate that the observed
# data would be relatively improbable under the null hypothesis.
#
# Under this cutoff, P-values > 0.05 mean that we "fail to reject the null
# model", which is different than "we accept the null model."
#
# P-values are commonly misunderstood and mis-used, and they 
# don't tell you what you intuitively want to know: "is my
# model true?".  So, there is large debate in statistics and 
# science over when/whether P-values should be replaced 
# with alternatives. Google will bring up many discussions.
# 

# Here, we regress log10(body mass) against genome size 
# (C-value = haploid genome size in picograms)
linear_regression_plot(x=log10(datadf$mass_g), y=datadf$Cvalue, xlabel="Birds+mammals: Body mass (log10 of grams)", ylabel="C-value (picograms)", tmppch=1, printall=TRUE, tmplinecol="black", tmplty=1, tmplwd=1, plottext=TRUE, legend_x="topleft", legend_y=NULL, xlim=minmax_pretty(log10(datadf$mass_g)), ylim=c(0,10), slope1=FALSE, intercept_in_legend=TRUE)

points(x=log10(datadf$mass_g[rownum]), y=datadf$Cvalue[rownum], pch="*", col="blue", cex=5)
title("Birds+mammals: Body mass (log-scaled) versus genome size")

# Q9: Is the inferred relationship between body mass and genome size positive or negative?

# Q10: What is the estimated slope of this relationship?

# Q11: The null model is that the slope m = 0.0, i.e. no relationship between body mass and genome size. Is the estimated slope statistically significantly different from 0.0?

# Q12: What is the reported p-value?  Does does this p-value indicate "barely significant", e.g. near the p=0.05 cutoff, or does it indicate "highly significant," i.e. a p-value very far below the 0.05 cutoff?

#######################################################
# Phylogenetic independent contracts
#######################################################
# Let's take the same data, but this time plot the 
# "Phylogenetically Independent Contrasts" (PICs) of 
# log(body mass) versus C-value.
#
# Contrasts are just *differences* between pairs of species
# (or between pairs of clades).
#
# See the lectures for more on Phylogenetic Independent Contrasts.
#
# We calculate the contrasts with APE's "pic" function.
#
# Then we plot the contrasts, and do the linear regression 
# on the contrasts, in the usual way.
# 
bodysize_PICs = pic(x=log10(datadf$mass_g), phy=tr)
Cvalue_PICs = pic(x=datadf$Cvalue, phy=tr)

linear_regression_plot(x=bodysize_PICs, y=Cvalue_PICs, xlabel="P.I.C.s of body mass (log10 of grams)", ylabel="P.I.C.s of C-value (picograms)", tmppch=1, printall=TRUE, tmplinecol="black", tmplty=1, tmplwd=1, plottext=TRUE, legend_x="topleft", legend_y=NULL, xlim=minmax_pretty(bodysize_PICs), ylim=minmax_pretty(Cvalue_PICs), slope1=FALSE, intercept_in_legend=TRUE)
title("Birds+mammals: PICs of log(body mass) vs. genome size")

# Q13: What is the estimated slope of this relationship?

# Q14: The null model is that the slope m = 0.0, i.e. no relationship between 
# contrast in body mass and contrast in genome size. Is the estimated slope 
# statistically significantly different from 0.0?

# Q15: What is the reported p-value?  Does does this p-value indicate a 
# statistically significant relationship?

# Q16: How is it possible that plotting body mass versus C-value could 
# produce an extremely statistically significant relationship, but plotting body-size 
# contrasts versus C-value contrasts could produce no statistically significant relationship?

# Color dots by Linnaean class
cols = rep("black", times=nrow(datadf))
TF = datadf$Class == "Mammalia"
cols[TF] = "blue"
TF = datadf$Class == "Aves"
cols[TF] = "red"
linear_regression_plot(x=log10(datadf$mass_g), y=datadf$Cvalue, xlabel="Birds+mammals: Body mass (log10 of grams)", ylabel="C-value (picograms)", tmppch=1, printall=TRUE, tmplinecol="black", tmplty=1, tmplwd=1, plottext=TRUE, legend_x="topleft", legend_y=NULL, xlim=minmax_pretty(log10(datadf$mass_g)), ylim=c(0,10), slope1=FALSE, intercept_in_legend=TRUE, col=cols)
title("Birds+Mammals: body mass vs. C-value")

# Q17: How many species are included in datadf?
nrow(datadf)

# Q18: What may caused the apparently highly significant relationship between 
#      body mass and C-value in the raw data?

#####################################################################
# PART II - DO YOUR OWN INTERPRETATION
# 
#
# Below, we do a similar analysis within mammals, and 
# within birds.
#
# Then we do an analysis comparing red-blood-cell size
# with C-value, focusing on amphibians.
#
# Examine the results, and in the short answer box, answer these questions:
# 
# Q19: Answer with 1-3 sentences each. (Paste the questions in the 
#      short answer box, then insert your answer after each one.)
#
# 19A. What do the raw-data regressions in within mammals 
# and within birds seem to suggest about the relationship between
# body mass and C-value?
#
# 19B. What do the PIC analyses suggest about these relationships?
#
# 19C. Do the PIC analyses falsify the idea that body mass might have
#      some relationship to C-value? Why or why not?
#
# 19D. What do the raw-data and PIC regressions suggest about the
#      relationship between cell volume and genome size in amphibians?
#
# 19E. Do you think cell volume and genome size have about the same 
#      relationship across vertebrate classes, or do different rules
#      seem to apply in different groups? What might explain the 
#      differences?
#
# 19F. How would you reply to the assertion "humans have very big genomes,
#      this must have something to do with why they are so advanced and complex"?
#      
#####################################################################

#######################################################
# Subset for just mammals or just birds
#######################################################
# Load mammal-specific and bird-specific trees
mammstr = read.tree(file="http://phylo.wikidot.com/local--files/explore-genome-size-data/mammal_tree.newick")
birdstr = read.tree(file="http://phylo.wikidot.com/local--files/explore-genome-size-data/bird_tree.newick")

# Subset and order tables
TF = datadf$Class == "Mammalia"
mammsdf = datadf[TF,]
mammsdf = mammsdf[match(mammstr$tip.label, table=mammsdf$name),]
mammsdf$name[1:10]
mammstr$tip.label[1:10]

TF = datadf$Class == "Aves"
birdsdf = datadf[TF,]
birdsdf = birdsdf[match(birdstr$tip.label, table=birdsdf$name),]
birdsdf$name[1:10]
birdstr$tip.label[1:10]

#######################################################
# Mammals-only analysis
#######################################################
cols = rep("black", times=nrow(mammsdf))
linear_regression_plot(x=log10(mammsdf$mass_g), y=mammsdf$Cvalue, xlabel="Body mass (log10 of grams)", ylabel="C-value (picograms)", tmppch=1, printall=TRUE, tmplinecol="black", tmplty=1, tmplwd=1, plottext=TRUE, legend_x="topleft", legend_y=NULL, xlim=minmax_pretty(log10(mammsdf$mass_g)), ylim=c(0,10), slope1=FALSE, intercept_in_legend=TRUE, col=cols)
title("Mammals: body mass vs. C-value")

#######################################################
# Phylogenetic independent contracts - within mammals
#######################################################
bodysize_PICs = pic(x=log10(mammsdf$mass_g), phy=mammstr)
Cvalue_PICs = pic(x=mammsdf$Cvalue, phy=mammstr)

linear_regression_plot(x=bodysize_PICs, y=Cvalue_PICs, xlabel="P.I.C.s of body mass (log10 of grams)", ylabel="P.I.C.s of C-value (picograms)", tmppch=1, printall=TRUE, tmplinecol="black", tmplty=1, tmplwd=1, plottext=TRUE, legend_x="topleft", legend_y=NULL, xlim=minmax_pretty(bodysize_PICs), ylim=minmax_pretty(Cvalue_PICs), slope1=FALSE, intercept_in_legend=TRUE)
title("Mammals: body mass contrasts vs. C-value contrasts")

# Color in a particular clade
cols = rep("black", times=nrow(mammsdf))
cols[mammsdf$Order == "Chiroptera"] = "blue"

linear_regression_plot(x=log10(mammsdf$mass_g), y=mammsdf$Cvalue, xlabel="Body mass (log10 of grams)", ylabel="C-value (picograms)", tmppch=1, printall=TRUE, tmplinecol="black", tmplty=1, tmplwd=1, plottext=TRUE, legend_x="topleft", legend_y=NULL, xlim=minmax_pretty(log10(mammsdf$mass_g)), ylim=c(0,10), slope1=FALSE, intercept_in_legend=TRUE, col=cols)
title("Mammals: body mass vs. C-value")

# Add a legend
xval = 6
yval = 9
points(x=xval, y=yval, col="blue")
text(x=xval, y=yval, col="blue", label="Chiroptera", pos=4)
points(x=xval, y=yval+0.4, col="black")
text(x=xval, y=yval+0.4, col="black", label="other mammals", pos=4)
points(x=log10(mammsdf$mass_g)[rownum], y=mammsdf$Cvalue[rownum], pch="*", col="green3", cex=4)
points(x=xval, y=yval-0.4, pch="*", col="green3", cex=4)
text(x=xval, y=yval-0.4, col="green3", label="Homo sapiens", pos=4)

#######################################################
# Birds-only analysis
#######################################################
cols = rep("black", times=nrow(birdsdf))
linear_regression_plot(x=log10(birdsdf$mass_g), y=birdsdf$Cvalue, xlabel="Body mass (log10 of grams)", ylabel="C-value (picograms)", tmppch=1, printall=TRUE, tmplinecol="black", tmplty=1, tmplwd=1, plottext=TRUE, legend_x="topleft", legend_y=NULL, xlim=minmax_pretty(log10(birdsdf$mass_g)), ylim=minmax_pretty(birdsdf$Cvalue), slope1=FALSE, intercept_in_legend=TRUE, col=cols)
title("Birds: body mass vs. C-value")

#######################################################
# Phylogenetic independent contracts
#######################################################
bodysize_PICs = pic(x=log10(birdsdf$mass_g), phy=birdstr)
Cvalue_PICs = pic(x=birdsdf$Cvalue, phy=birdstr)

linear_regression_plot(x=bodysize_PICs, y=Cvalue_PICs, xlabel="P.I.C.s of body mass (log10 of grams)", ylabel="P.I.C.s of C-value (picograms)", tmppch=1, printall=TRUE, tmplinecol="black", tmplty=1, tmplwd=1, plottext=TRUE, legend_x="topleft", legend_y=NULL, xlim=minmax_pretty(bodysize_PICs), ylim=minmax_pretty(Cvalue_PICs), slope1=FALSE, intercept_in_legend=TRUE)
title("Birds: body mass contrasts vs. C-value contrasts")

#######################################################
# Birds-only analysis - coloring in a particular clade
#######################################################
# Color in a particular clade
cols = rep("black", times=nrow(birdsdf))
cols[birdsdf$Family == "Trochilidae"] = "blue"
linear_regression_plot(x=log10(birdsdf$mass_g), y=birdsdf$Cvalue, xlabel="Body mass (log10 of grams)", ylabel="C-value (picograms)", tmppch=1, printall=TRUE, tmplinecol="black", tmplty=1, tmplwd=1, plottext=TRUE, legend_x="topleft", legend_y=NULL, xlim=minmax_pretty(log10(birdsdf$mass_g)), ylim=minmax_pretty(birdsdf$Cvalue), slope1=FALSE, intercept_in_legend=TRUE, col=cols)
title("Birds: body mass vs. C-value")

# Add a legend
xval = 4.5
yval = 2.0
points(x=xval, y=yval, col="blue")
text(x=xval, y=yval, col="blue", label="Trochilidae", pos=4)
points(x=xval, y=yval-0.05, col="black")
text(x=xval, y=yval-0.05, col="black", label="other birds", pos=4)

#######################################################
# Birds-only analysis - coloring in a few more clades
#######################################################

# Color by group
cols = rep("black", times=nrow(birdsdf))
cols[birdsdf$Family == "Trochilidae"] = "blue"

# point characters (pch)
pchs = rep("o", times=nrow(birdsdf))

# Highlight a few of the big birds
cols[birdsdf$Family == "Struthionidae"] = "green4"
pchs[birdsdf$Family == "Struthionidae"] = "S"
cols[birdsdf$Family == "Dromaiidae"] = "green2"
pchs[birdsdf$Family == "Dromaiidae"] = "D"

linear_regression_plot(x=log10(birdsdf$mass_g), y=birdsdf$Cvalue, xlabel="Body mass (log10 of grams)", ylabel="C-value (picograms)", tmppch=pchs, printall=TRUE, tmplinecol="black", tmplty=1, tmplwd=1, plottext=TRUE, legend_x="topleft", legend_y=NULL, xlim=minmax_pretty(log10(birdsdf$mass_g)), ylim=minmax_pretty(birdsdf$Cvalue), slope1=FALSE, intercept_in_legend=TRUE, col=cols)
title("Birds: body mass vs. C-value")

# Add a legend
xval = 4.5
yval = 2.0
points(x=xval, y=yval, col="blue")
text(x=xval, y=yval, col="blue", label="Trochilidae", pos=4)
points(x=xval, y=yval-0.05, col="green4")
text(x=xval, y=yval-0.05, col="green4", label="S=Struthionidae", pos=4)
points(x=xval, y=yval-0.10, col="green2")
text(x=xval, y=yval-0.10, col="green2", label="D=Dromaiidae", pos=4)
points(x=xval, y=yval-0.15, col="black")
text(x=xval, y=yval-0.15, col="black", label="other birds", pos=4)

#######################################################
# Does C-value correlate with Cell Size?
#######################################################
# 
# Here is a dataset of species for which C-value Cell-size were
# both available. This data is derived from T. Ryan Gregory's
# Genome Size Database, used with permission:
#
# Gregory, T. Ryan (2020). Animal Genome Size Database. http://www.genomesize.com .
# 
# C-value is the haploid genome size in picograms. 1 picogram of DNA is roughly
# 1 billion bases, i.e. 1 Gigabase or Gb. The human haploid genome size is
# about 3.3 Gb.
#
# Cell Size is measured as mean corpuscular volume ("MCV"). "Corpuscles" is another
# word for erythrocytes, i.e. red blood cells.
#
# We are just going to use MCV, but below are some notes from Gregory's database
# to indicate how cell size is measured:
#######################################################

# Notes on measuring cell size from the Animal Genome Size Database:
# ====================================================================
# http://www.genomesize.com/cellsize/amphibians.htm
#
# NOTE: These are size measurements for amphibian erythrocytes (red blood cells).
# Erythrocytes in amphibians are elliptical, and as such two 
# different diameters are provided:
# 
# CLD ("cell long diameter")
# CSD ("cell short diameter")
# NLD ("nucleus long diameter") 
# NSD ("nucleus short diameter") are also provided where available.  
#
# In both cases, the diameters were measured from dry smears.  These were used to calculate # dry cell area (CA) and 
# dry nucleus area (NA)
# using the equation for elliptical area: CA = pi * (CLD/2) * (CSD/2).  
#
# MCV ("mean corpuscular volume") is wet volume and is usually measured 
# with a Coulter counter.  
# DCV ("dry cell volume") and 
# DNV ("dry nucleus volume")
#
# have occasionally been measured from dry smears, and are also provided.  
# These various measurements (e.g., CA and MCV) should not be compared to each other, 
# so pick only one for any analyses you wish to do, or run them separately.  
#
# Diameters are in µm, areas are in µm2, and volumes are in µm3.
#
#
# Mammal erythrocyte sizes
# http://www.genomesize.com/cellsize/mammals.htm
#
# NOTE: These are size measurements for mammalian erythrocytes (red blood cells).  
# Erythrocytes in mammals are unique among vertebrates in that they are enucleated 
# (i.e., contain no nuclei) and circular rather than elliptical (with the exception 
# of some artiodactyls).  
# 
# Two different measurements are given.  
# First, DCD ("dry cell diamater"), measured from dry blood smears, and 
# MCV ("mean corpuscular volume"), is wet volume 
# and is usually measured with a Coulter counter.   
# The few diameters measured from wet smears have been converted 
# to estimated dry values as in Gregory (2000).  Diameters are in µm and 
# volumes are in µm3.
# ====================================================================

#######################################################
# Plotting cell size vs. genome size
#######################################################

# Download the data
cell_size_Cvalue_fn = "http://phylo.wdfiles.com/local--files/explore-genome-size-data/verts_cs_cv_v1.txt"
verts_cs_cv = read.table(file=cell_size_Cvalue_fn, header=TRUE, strip.white=TRUE, sep="\t", stringsAsFactors=FALSE)
head(verts_cs_cv)
dim(verts_cs_cv)

# Plot the data (non-logged)
colvals = rep("black", times=nrow(verts_cs_cv))  # point colors (col)
pchvals = rep(1, times=nrow(verts_cs_cv))        # point characters (pch)
plot(x=verts_cs_cv$cs, y=verts_cs_cv$cv, pch=pchvals, xlab="Cell size (µm3)", ylab="C-value", col=colvals)
title("Erythrocyte cell volume vs. C-value")

# Q: Does there appear to be a relationship between cell volume and C-value?

# Plot the data (logging the axes)
colvals = rep("black", times=nrow(verts_cs_cv))  # point colors (col)
pchvals = rep(1, times=nrow(verts_cs_cv))        # point characters (pch)
plot(x=verts_cs_cv$cs, y=verts_cs_cv$cv, pch=pchvals, xlab="Cell size (µm3)", ylab="C-value", col=colvals, log="xy")
title("Erythrocyte cell volume vs. C-value (axes logged)")

# Q: Does there appear to be a relationship between cell volume and C-value?

colvals = rep("black", times=nrow(verts_cs_cv))  # point colors (col)
pchvals = rep(1, times=nrow(verts_cs_cv))        # point characters (pch)

pchvals[verts_cs_cv$Class == "Fishes"] = "F"
colvals[verts_cs_cv$Class == "Fishes"] = "black"
pchvals[verts_cs_cv$Class == "Amphibians"] = "A"
colvals[verts_cs_cv$Class == "Amphibians"] = "green3"
pchvals[verts_cs_cv$Class == "Reptiles"] = "R"
colvals[verts_cs_cv$Class == "Reptiles"] = "red"
pchvals[verts_cs_cv$Class == "Birds"] = "B"
colvals[verts_cs_cv$Class == "Birds"] = "blue"
pchvals[verts_cs_cv$Class == "Mammals"] = "M"
colvals[verts_cs_cv$Class == "Mammals"] = "grey50"

# Plot the data, with letter representing the taxa
plot(x=verts_cs_cv$cs, y=verts_cs_cv$cv, pch=pchvals, xlab="Cell size (µm3)", ylab="C-value", col=colvals)
title("Erythrocyte cell volume vs. C-value")

# Plot the data, with letter representing the taxa (logging the axes)
plot(x=verts_cs_cv$cs, y=verts_cs_cv$cv, pch=pchvals, xlab="Cell size (µm3)", ylab="C-value", col=colvals, log="xy")
title("Erythrocyte cell volume vs. C-value (axes logged)")

# Add an orange asterisk for humans
points(x=90, y=3.3, pch="*", col="orange2", cex=5)
text(x=90, y=20.0, labels="YOU ARE HERE", pos=3, col="orange2", cex=2.0)
arrows(x0=90, x1=90, y0=20, y1=4.5, col="orange2", lwd=3, length=0.1)

# Amphibian tree
atr_fn = "http://phylo.wikidot.com/local--files/explore-genome-size-data/Pyron_amphibian_tree_subset.newick"
atr = read.tree(atr_fn)
atr

# Let's look at the relationship between cell size and cell volume in amphibians
#######################################################
# Amphibians analysis
#######################################################

spnames = verts_cs_cv$Species
spnames = gsub(pattern=" ", replacement="_", x=spnames) # add underscores to species names
rows_in_tr_TF = spnames %in% atr$tip.label  # TRUE/FALSE test for spnames in atr
spnames = spnames[rows_in_tr_TF]
verts_cs_cv$Species[rows_in_tr_TF]
cellsize = verts_cs_cv$cs[rows_in_tr_TF]
names(cellsize) = spnames
cval = verts_cs_cv$cv[rows_in_tr_TF]
names(cval) = spnames

#######################################################
# Plot cell-size vs. C-value
#######################################################
cols = rep("black", times=length(cellsize)) # dot colors
linear_regression_plot(x=cellsize, y=cval, xlabel="Cell size (µm3)", ylabel="C-value (picograms)", tmppch=1, printall=TRUE, tmplinecol="black", tmplty=1, tmplwd=1, plottext=TRUE, legend_x="topleft", legend_y=NULL, xlim=minmax_pretty(cellsize), ylim=minmax_pretty(cval), slope1=FALSE, intercept_in_legend=TRUE, col=cols)
title("Amphibians: cell size vs. C-value")

#######################################################
# Phylogenetic independent contracts
#######################################################
cellsize_PICs = pic(x=cellsize, phy=atr)
Cvalue_PICs = pic(x=cval, phy=atr)

linear_regression_plot(x=cellsize_PICs, y=Cvalue_PICs, xlabel="P.I.C.s of cell size (µm3)", ylabel="P.I.C.s of C-value (picograms)", tmppch=1, printall=TRUE, tmplinecol="black", tmplty=1, tmplwd=1, plottext=TRUE, legend_x="topleft", legend_y=NULL, xlim=minmax_pretty(cellsize_PICs), ylim=minmax_pretty(Cvalue_PICs), slope1=FALSE, intercept_in_legend=TRUE)
title("Amphibians: cell-size contrasts vs. C-value contrasts")

#######################################################
# Plot log10(cell-size) vs. log10(C-value)
#######################################################
spnames = verts_cs_cv$Species
spnames = gsub(pattern=" ", replacement="_", x=spnames) # add underscores to species names
rows_in_tr_TF = spnames %in% atr$tip.label  # TRUE/FALSE test for spnames in atr
spnames = spnames[rows_in_tr_TF]
verts_cs_cv$Species[rows_in_tr_TF]
cellsize = log10(verts_cs_cv$cs[rows_in_tr_TF])
names(cellsize) = spnames
cval = log10(verts_cs_cv$cv[rows_in_tr_TF])
names(cval) = spnames

cols = rep("black", times=length(cellsize)) # dot colors
linear_regression_plot(x=cellsize, y=cval, xlabel="log of Cell size (µm3)", ylabel="log of C-value (picograms)", tmppch=1, printall=TRUE, tmplinecol="black", tmplty=1, tmplwd=1, plottext=TRUE, legend_x="topleft", legend_y=NULL, xlim=minmax_pretty(cellsize), ylim=minmax_pretty(cval), slope1=FALSE, intercept_in_legend=TRUE, col=cols)
title("Amphibians: log10(cell size) vs. log10(C-value)")

#######################################################
# Phylogenetic independent contracts
# Plot constrasts of log10(cell-size) vs. constrasts of log10(C-value)
#######################################################

#######################################################
cellsize_PICs = pic(x=cellsize, phy=atr)
Cvalue_PICs = pic(x=cval, phy=atr)

linear_regression_plot(x=cellsize_PICs, y=Cvalue_PICs, xlabel="P.I.C.s of log cell size (µm3)", ylabel="P.I.C.s of log C-value (picograms)", tmppch=1, printall=TRUE, tmplinecol="black", tmplty=1, tmplwd=1, plottext=TRUE, legend_x="topleft", legend_y=NULL, xlim=minmax_pretty(cellsize_PICs), ylim=minmax_pretty(Cvalue_PICs), slope1=FALSE, intercept_in_legend=TRUE)
title("Amphibians: log10(cell-size) contrasts vs. log10(C-value) contrasts")

#######################################################
# Ancestral character estimation (ACE) for mammals
#######################################################

library(phytools)

# Get mammal data, remove NAs
NAs_TF1 = is.na(verts_cs_cv$cs) == FALSE
NAs_TF2 = is.na(verts_cs_cv$cv) == FALSE

spnames = verts_cs_cv$Species
spnames = gsub(pattern=" ", replacement="_", x=spnames) # add underscores to species names
rows_in_tr_TF = spnames %in% mammstr$tip.label  # TRUE/FALSE test for spnames in mammstr
keepTF = (NAs_TF1 + NAs_TF2 + rows_in_tr_TF) == 3
spnames = spnames[keepTF]
cellsize = verts_cs_cv$cs[keepTF]
names(cellsize) = spnames
cval = verts_cs_cv$cv[keepTF]
names(cval) = spnames

# Subset mammal tree
tips_to_keep_TF = mammstr$tip.label %in% spnames
mtr = drop.tip(mammstr, tip=mammstr$tip.label[tips_to_keep_TF==FALSE])

# Using phytools, estimate ancestral cell size values,
# under the Brownian motion model for continuous traits
fit = phytools::fastAnc(tree=mtr, x=cellsize)

# projection of the reconstruction onto the edges of the tree
obj = contMap(tree=mtr, x=cellsize, plot=FALSE)
obj

# Plot the tree, with colors representing mean
# ancestral state values
tree=mtr
plot(x=obj, type="phylogram", legend=0.7*max(nodeHeights(tree)), sig=2, fsize=c(0.7,0.9), leg.txt="Cell volume (µm3)")
title("Mammals: Ancest. state estimation for cell volume (µm3)", cex=0.15, line=-0.75)

# Using phytools, estimate ancestral genome size values,
# under the Brownian motion model for continuous traits
fit = phytools::fastAnc(tree=mtr, x=cval)

# projection of the reconstruction onto the edges of the tree
obj = contMap(tree=mtr, x=cval, plot=FALSE)
obj

# Plot the tree, with colors representing mean
# ancestral state values
tree=mtr
plot(x=obj, type="phylogram", legend=0.7*max(nodeHeights(tree)), sig=2, fsize=c(0.7,0.9), leg.txt="C-value (genome size)")
title("Mammals: Ancest. state estimation for genome size (C-value)", cex=0.15, line=-0.75)

#######################################################
# Ancestral character estimation (ACE) for birds
#######################################################

library(phytools)

# Get mammal data, remove NAs
NAs_TF1 = is.na(verts_cs_cv$cs) == FALSE
NAs_TF2 = is.na(verts_cs_cv$cv) == FALSE

spnames = verts_cs_cv$Species
spnames = gsub(pattern=" ", replacement="_", x=spnames) # add underscores to species names
rows_in_tr_TF = spnames %in% birdstr$tip.label  # TRUE/FALSE test for spnames in birdstr
keepTF = (NAs_TF1 + NAs_TF2 + rows_in_tr_TF) == 3
spnames = spnames[keepTF]
cellsize = verts_cs_cv$cs[keepTF]
names(cellsize) = spnames
cval = verts_cs_cv$cv[keepTF]
names(cval) = spnames

# Subset mammal tree
tips_to_keep_TF = birdstr$tip.label %in% spnames
btr = drop.tip(birdstr, tip=birdstr$tip.label[tips_to_keep_TF==FALSE])

# Using phytools, estimate ancestral cell size values,
# under the Brownian motion model for continuous traits
fit = phytools::fastAnc(tree=btr, x=cellsize)

# projection of the reconstruction onto the edges of the tree
obj = contMap(tree=btr, x=cellsize, plot=FALSE)
obj

# Plot the tree, with colors representing mean
# ancestral state values
tree=btr
plot(x=obj, type="phylogram", legend=0.7*max(nodeHeights(tree)), sig=2, fsize=c(0.7,0.9), leg.txt="Cell volume (µm3)")
title("Birds: Ancest. state estimation for cell volume (µm3)", cex=0.15, line=-0.75)

# Using phytools, estimate ancestral genome size values,
# under the Brownian motion model for continuous traits
fit = phytools::fastAnc(tree=btr, x=cval)

# projection of the reconstruction onto the edges of the tree
obj = contMap(tree=btr, x=cval, plot=FALSE)
obj

# Plot the tree, with colors representing mean
# ancestral state values
tree=btr
plot(x=obj, type="phylogram", legend=0.7*max(nodeHeights(tree)), sig=2, fsize=c(0.7,0.9), leg.txt="C-value (genome size)")
title("Birds: Ancest. state estimation for genome size (C-value)", cex=0.15, line=-0.75)

#######################################################
# Ancestral character estimation (ACE) for amphibians
#######################################################

library(phytools)

# Re-set the data
spnames = verts_cs_cv$Species
spnames = gsub(pattern=" ", replacement="_", x=spnames) # add underscores to species names
rows_in_tr_TF = spnames %in% atr$tip.label  # TRUE/FALSE test for spnames in atr
spnames = spnames[rows_in_tr_TF]
verts_cs_cv$Species[rows_in_tr_TF]
cellsize = verts_cs_cv$cs[rows_in_tr_TF]
names(cellsize) = spnames
cval = verts_cs_cv$cv[rows_in_tr_TF]
names(cval) = spnames

# Using phytools, estimate ancestral cell size values,
# under the Brownian motion model for continuous traits
fit = phytools::fastAnc(tree=atr, x=cellsize)

# projection of the reconstruction onto the edges of the tree
obj = contMap(tree=atr, x=cellsize, plot=FALSE)
obj

# Plot the tree, with colors representing mean
# ancestral state values
tree=atr
plot(x=obj, type="phylogram", legend=0.7*max(nodeHeights(tree)), sig=2, fsize=c(0.7,0.9), leg.txt="Cell volume (µm3)")
title("Amphibians: Ancest. state estimation for cell volume (µm3)", cex=0.15, line=-0.75)

# Using phytools, estimate ancestral genome size values,
# under the Brownian motion model for continuous traits
fit = phytools::fastAnc(tree=atr, x=cval)

# projection of the reconstruction onto the edges of the tree
obj = contMap(tree=atr, x=cval, plot=FALSE)
obj

# Plot the tree, with colors representing mean
# ancestral state values
tree=atr
plot(x=obj, type="phylogram", legend=0.7*max(nodeHeights(tree)), sig=2, fsize=c(0.7,0.9), leg.txt="C-value (genome size)")
title("Amphibians: Ancest. state estimation for genome size (C-value)", cex=0.15, line=-0.75)

# Nick's machine - make PDFs of all plots
# (closing PDFs)
# dev.off()
# cmdstr = paste0("open ", pdffn)
# system(cmdstr)
File nameFile typeSize
birds_mammals_subset_Cvalue_bodysize.txtASCII text63.05 kBInfo
birds_mammals_subset_tree.newickASCII text46.63 kBInfo
bird_tree.newickASCII text28.33 kBInfo
_lab4_exploring_genome_size_data_v4.RUTF-8 Unicode English text42.57 kBInfo
Lab4_plots_v1.pdfPDF document233.93 kBInfo
mammal_tree.newickASCII text18.28 kBInfo
Pyron_amphibian_tree_dated.newickASCII text141.45 kBInfo
Pyron_amphibian_tree_subset.newickASCII text1.47 kBInfo
verts_cs_cv_v1.txtASCII text15.77 kBInfo
Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License