Thursday, February 26, 2015

Setting up a virtual machine to run a linux operating system

I need linux so I can start learning how to use bgc, a program for analyzing genomic clines.  As I don't have any spare computers around to install linux on (and it seemed easier than dual-booting one of my main working computers), I decided to try a virtual machine.

First, download Oracle's free VirtualBox and your operating system of choice.  From the many flavors of linux, I started with Ubuntu as it seemed least intimidating and likely to do be sufficient for my purpose of running the one program.

Install VirtualBox first.  Once you've set up your first machine (I let it do all the default settings), copy the Ubuntu .iso file into your VM's folder.  Mine was stored at C:/Users/Username/VirtualBox VMs/yourvirtualmachinesname/ and I placed the .iso file in that folder.

Run your VM.  It'll ask you from what optical drive you want to boot.  Set the CD/DVD optical drive to the .iso file.  If you get a fatal boot error because you didn't do this the first time, just set the drive, then restart, and it should boot as if you had an operating system disk in a real computer.  It took a while to get to the install screen.  I let it run in the background and worked on other stuff.  Same for the Install Ubuntu wizard.  Periodically check and answer its questions.  It took a while (started 1pm, finished around 4pm), just like a real operating system install.

I found Ubuntu ran pretty slow in the VM on my aging tiny laptop, and XUbuntu was suggested to me as a less graphics-intensive alternative.  That's what I'm using now (just create a new VM in VirtualBox and follow the same steps) and that install also took less time (maybe an hour or two with me ignoring it as I did other tasks).  Stay tuned as I learn how to install some prerequisites for bgc (GNU Scientific Library and HDF5).

Monday, February 23, 2015

Beginning R: exploratory data analysis (includes basic plotting and testing for normality)

This continues my posts from co-teaching quantitative methods at the University of Manitoba.
#Start by setting your workspace and bringing in your data,
#and checking to make sure the variables are imported correctly
#with numbers as numeric, categories as factors, etc.
setwd("C:\\Username\\Dropbox\\")
#Your directory here.
#Or use the menu...
#Session>Set Working Directory>To Source File Location [or you can choose the directory]

#Import your data.
titmouse<- read.table(file="blog_banding_data.csv", header=TRUE, sep=",")
str(titmouse)
#shows the data types; always do this check.


#############################
##Exploratory data analysis##
#############################
#First, eyeball for weird values
View(titmouse) #You can also click on the dataframe in the environment panel.
#Check to see that all values filled correctly.
#Missing data: NA vs. NaN vs. Inf.
?NA #Missing data
?NaN #'Not a number', more complex.  Help file discusses +/- Inf as well.
#Inf usually encountered in calculations (such as for log transformations).

hist(titmouse) #Can't just use this plotting function on a data frame.
#Use titmouse$[columnyouwant] to plot an individual variable.
hist(titmouse$crestlength) #This gives the default axes and title.
hist(titmouse$crestlength,
     xlab="Crest length (mm)",
#label the x-axis with units
     main="") #Empty the main title
hist(titmouse[titmouse$indexsum==0,"crestlength"], #select rows where indexsum=0 and column is crestlength
     xlab="Crest length (mm)",
     main="Tufted Titmouse")

par(mfrow=c(1,3))
#par does lots of things to the plotting region.
#Here we tell it to do one row and three columns.
hist(titmouse[titmouse$indexsum==0,"crestlength"], #select rows where indexsum=0 and column is crestlength
     xlab="Crest length (mm)",
     main="Tufted Titmouse")
hist(titmouse[titmouse$indexsum>0&titmouse$indexsum<6,"crestlength"],
#select rows where indexsum=0 and column is crestlength
     xlab="Crest length (mm)",
     main="Hybrid titmouse")
hist(titmouse[titmouse$indexsum==6,"crestlength"],
#select rows where indexsum=0 and column is crestlength
     xlab="Crest length (mm)",
     main="Black-crested Titmouse")
par(mfrow=c(1,1))
#set it back to a 1 row, 1 column plot.

#Boxplots
boxplot(titmouse$crestlength) #no axes!  This is terrible!
boxplot(titmouse$crestlength,
        ylab="Crest length (mm)",
        xlab="All the titmice")
par(mfrow=c(2,1))
#two rows, one column for the plots.
boxplot(titmouse[titmouse$indexsum==0,"crestlength"])
boxplot(titmouse[titmouse$indexsum==6,"crestlength"])

#looks funky.
par(mfrow=c(1,2))
boxplot(titmouse[titmouse$indexsum==0,"crestlength"],
        xlab="Tufted Titmouse",
        ylab="Crest length (mm)")
boxplot(titmouse[titmouse$indexsum==6,"crestlength"],
        xlab="Black-crested Titmouse",
        ylab="Crest length (mm)",
        lty="solid")

#Much nicer!
par(mfrow=c(1,1)) #Reset par back to one column, one row plotting region.
#Let's do some categories like last time.
titmouse$species<-titmouse$indexsum # create a new column with indexsum data copied.
#Always copy your column or dataframe when doing something big like this.
#While it won't mess up your original file (this is only in your R session),
#it will save you trouble from having to reimport your file if you mess things up.
titmouse$species<-sub(0,"Tufted",titmouse$species)
titmouse$species<-sub(6,"Black-crested",titmouse$species)
titmouse[titmouse$species>0&titmouse$species<6,]$species = "hybrid"


plot(y=titmouse$crestlength, x=titmouse$species)
#Errors!  Why?
#It doesn't like x as "chr" or character.  You have to have it as a factor.
plot(y=titmouse$crestlength, x=titmouse$state) #Test it with a factor.
titmouse$species<-as.factor(titmouse$species) #Convert desired variable to a factor.
plot(y=titmouse$crestlength, x=titmouse$species) #It works, and plot also makes boxplots.
plot(titmouse$crestlength~titmouse$species) #This does the same thing and introduces R's formula notation, which you will need for many statistical functions.

#Scatterplots.  Plot is very flexible and also does scatterplots if your two variables are continuous.
plot(x=titmouse$taillength, y=titmouse$crestlength)
plot(titmouse$crestlength~titmouse$taillength)
#same as previous, formula notation.
plot(crestlength~taillength, data=titmouse) #same as previous, but referencing the dataframe in the data argument.
#(Each bit between the commas is an "argument" in the function.  So this is a plot function,
#with a formula argument, and a data argument.  We can add the axis label arguments too.

#So, make it have nice axis labels.
plot(crestlength~taillength,
     data=titmouse,
     xlab="Tail length (mm)",
     ylab="Crest length (mm)")


#Get subsets of data.
plot(crestlength~taillength,
     data=titmouse[titmouse$species=="Tufted",],
     xlab="Tail length (mm)",
     ylab="Crest length (mm)")

#You can combine or elaborate these plots, too, by adding points.
points(crestlength~taillength,
     data=titmouse[titmouse$species=="Black-crested",],
     xlab="Tail length (mm)",
     ylab="Crest length (mm)",

     pch=21, #choose the type of point.  ?pch shows all the possible values.
     bg="black") #choose the fill color.

#Default R colors are shown in a pdf here.
colours()
colors()
#These will list off the names (also shown in the pdf).
#There are also ways to set exact colors.

points(crestlength~taillength,
       data=titmouse[titmouse$species=="hybrid",],
       xlab="Tail length (mm)",
       ylab="Crest length (mm)",
       pch=2, #Another example of a symbol
       col=rgb(red=200,green=30,blue=30,
               maxColorValue=255))
#example of using rgb() function to generate color.
?par #"Color Specification" (near the end of the file) discusses other ways change color,
#including default and custom palettes.

#Draw some arbitrary lines in an empty plot.
plot(x=0,
     y=0,
     type="n",
#no symbols drawn
     pch=NULL,
     xlim=c(0,1),
     ylim=c(0,1))

#These are arbitrary values to demonstrate line drawing capabilities.
abline(h=0.5, v=0.25) #give horizontal line or vertical line, can do these separately as well.
abline(a=0, b=0.25,
       lty="dashed")

#give slope and intercept, in the formula for a straight line: y=bx+a.
#Then give the line type (lty=).  ?par lists the possible types.

#How to combine or elaborate these plots, say to get the boxplot with the histogram in the textbook.
#This code is adapted from here.

#Here we use par() to adjust the plot again, but remember to save the defaults first.
pardefault <- par(no.readonly = T)
#From this link (which contains other variations as well)
par(fig=c(0,0.8,0,0.8)) #Here par is used to adjust the plot area.
hist(titmouse$crestlength,
     xlab="Crest length (mm)",
     main="")
#an empty main title
par(fig=c(0,0.8,0.55,1), new=TRUE)
boxplot(titmouse$crestlength,
        horizontal=TRUE,
        axes=FALSE)


#Add to a scatterplot
par(fig=c(0,0.8,0,0.8))
plot(crestlength~taillength,
     data=titmouse,
     xlab="Tail length (mm)",
     ylab="Crest length (mm)",
     pch=21,
     bg="gray",
     col="black")
par(fig=c(0.65,1,0,0.8),new=TRUE)
boxplot(titmouse$crestlength, axes=FALSE)
par(fig=c(0,0.8,0.55,1), new=TRUE)
boxplot(titmouse$taillength,
        horizontal=TRUE,
        axes=FALSE)
par(pardefault)
#reset the default plotting parameters.

#############################
####Graphical diagnostics####
#############################
#Residuals, influence, outliers, and Q-Q plots.

#To get some residuals we'll run a simple linear regression model.
bill.wing.reg<-lm(titmouse$billlength~titmouse$wingchordr)
#Note the use of the formula here again... y~x.
summary(bill.wing.reg) #Get some basic data from it. 
plot(bill.wing.reg)
#look at residuals vs. predicted ("Fitted") values for homoscedascity.
#The first plot puts row numbers or label numbers on outliers.
#The second plot is the Q-Q plot.
#The third is standardized residuals (not discussed in lecture).
#The fourth plots Cook's distance.  Dashed red lines show the boundary of 0.5 Cook's D.

#Get residuals to do additional plots.
rawres<-residuals(bill.wing.reg) #raw residuals
stres<-rstudent(bill.wing.reg) #studentized residuals

#Plot residual vs. independent variable.
plot(rawres~na.omit(titmouse$wingchordr))
#we use na.omit because lm automatically removed NAs.

#Plot residuals in a histogram.
hist(rawres)

#Plot the data with the regression line.
plot(titmouse$billlength~titmouse$wingchordr)
abline(bill.wing.reg)

#Cook's distance
#Help says, "Cases which are influential with
#respect to any of these measures are marked with an asterisk."
im<-influence.measures(bill.wing.reg) #Cook's D is cook.d column.
str(im) #new format... a list.
im[1] #lists are accessed via the level of the list.
im[2] #is it an outlier by the measures in the table?
cookd<-im[[1]][,"cook.d"] #get column "cook.d" from the first item in the list.
plot(cookd, type="h",
     main="Observations",
     ylab="Cook's D",
     xlab="Index (=row number)")
abline(h=4/length(cookd), col="gray")

#This gets us a plot of Cook's D similar to the one SAS generates (shown in Nicky's lecture).

#Q-Q plot for just a raw variable.
qqnorm(titmouse$billlength)
qqline(titmouse$billlength)

#If you want to transform data,
#here's how (though none of our example data need it).
#Create a new column with an appropriate name.
titmouse$lncrestlength<-log(titmouse$crestlength)
#BEWARE!  R uses log() for the natural logarithm.  You must use log10() if you want base 10 logarithms.

#If you have a variable with a zero, you need to add a small constant first.
(titmouse$lnindexsum<-log(titmouse$indexsum+0.001))
#If you don't, you'll get Inf for any zero values.
(log(titmouse$indexsum))
#We'll go into more transformations in a later lab.

#############################
####Tests for normality, ####
####skewness, and kurtosis###
#############################
#Time to install a package.
#Click on "Packages" tab and then "Install".
#Install package "moments" from the CRAN repository (will only work with internet connection).
library(moments) #invoke the library

skewness(titmouse$crestlength, na.rm=TRUE)
kurtosis(titmouse$crestlength, na.rm=TRUE)
shapiro.test(titmouse$crestlength)

library(e1071)

#An error message about masking similarly named functions appears.
#This tells us we know that we are using e1071's functions now.
kurtosis(titmouse$crestlength,
         na.rm=TRUE,
         type=2)
skewness(titmouse$crestlength,
         na.rm=TRUE,
         type=2)

#This one provides more options, gives the algorithms in help,
#and can be compared with SAS.  Let's use this library and remove the 'moments' library.
detach("package:moments")


#Now make some plots
hist(titmouse$crestlength, prob=TRUE,
     xlab="Crest length for titmice",
     ylim=c(0,0.3))
lines(density(na.omit(titmouse$crestlength)))

#Add the normal distribution for your mean/sd.
curve(dnorm(x,
            mean=mean(titmouse$crestlength, na.rm=TRUE),
            sd=sd(titmouse$crestlength, na.rm=TRUE)),
      lty="dashed",
      col="red",
      add=TRUE)



#For the residuals
skewness(stres, type=2) #Let's use the studentized residuals for fun.
kurtosis(stres, type=2) #make sure type=2 if you want to compare to SAS.
shapiro.test(stres)
hist(stres, prob=TRUE,
     breaks=10,
#change the number of histogram bars if you want, see how it changes the graph
     main="Studentized residuals",
     xlab="Wing vs. bill residuals",
     ylim=c(0,0.5))
#This lengthens the y-axis so our density lines won't get cut off in next step.
lines(density(stres)) #Adds the kernel density function of your data.

#Add the normal distribution for your mean/sd.
curve(dnorm(x,
         mean=mean(stres),
         sd=sd(stres)),
      lty="dashed",
      col="red",
      add=TRUE)


##############################
###Testing for collinearity###
######(i.e. correlation)######
##############################

#How to test for correlation in the data.
cor(x=titmouse$crestlength,
    y=titmouse$indexsum,
    use="pairwise.complete.obs",
#another different way of getting rid of NAs
    method="pearson")
cor.test(x=titmouse$crestlength,
         y=titmouse$indexsum,
         na.action=na.rm,
         method="pearson")
#other methods are available, see the help file.

#Correlate more than one at a time, though you only get the correlation coefficient.
cor(titmouse[,c("wingchordr","taillength","crestlength","billlength")],
    use="pairwise.complete.obs")



##############################
######Saving your plots#######
##############################

#Can use RStudio's export feature.
#Easier for reproducibility to just directly save into preferred format.
#Set width, height, etc. FOR ANY PLOT THAT YOU MAY HAVE TO RE-DO!
#Examples: publication, dissertation, thesis, reports, anything where having
#similarly formatted figures are important
#(so later fixes don't show up as different).

#I used the instructions from a previous blog post, so I won't repeat that here.

Thursday, February 12, 2015

Dreaming of spring

Yellow Prairie Violet (Viola nutallii), near Brooks, Alberta.  21 May 2014.

Monday, February 09, 2015

Beginning R: setting your workspace, importing your data, and running basic descriptive statistics

#Step 1: Set your working directory.
#This allows you to import, export, and save plots all in the folder of your choice,
#ideally one that you can remember how to find again.

#You can set your working directory via the RStudio menus,
#but I suggest you hardcode it so that you don't have to remember
#all the steps every time or where you keep the files.

#What is your directory now?
getwd()

#Let's change it.
setwd("C:\Users\Username\Dropbox")
#gives an error.
#If you don't know what the problem is, copy the error message to your search engine.
#I find from the first link that R needs either forward slashes OR
#double backslashes.  It can't use single backslashes.

setwd("C:/Users/Username/Dropbox")
setwd("C:\\Users\\Username\\Dropbox")
setwd("C:\\Users\\Username\\Dropbox\\")
getwd()
#Both work!

#Step 2: Import your data.
#Download these data in .csv format and place the file in your working directory.
titmouse<- read.table(file="blog_banding_data.csv", header=TRUE, sep=",")
#These data are part of that we used in Curry and Patten (2014).
names(titmouse) #lists all the variable (column) names.
str(titmouse) #shows the data types.
#There are factors, integers ("int"), and numbers (num).

#Later, how will you bring your own data into R?
#It's best to have your data formatted as .csv/.txt as we do here.
#These are readable without any special programs and useable in any program.
#Most programs also allow export in some variation of those formats.
#It is possible to import from an Excel file but potentially difficult
#I find it much simpler to just save any spreadsheets I have as .csv or .txt,
#and it's a good habit to know how to work with those formats as they will be used
#in many programs as a basic file format.
?read.table #gives additional options, most notable read.csv, for importing text files.
#You can change the delimiter (the text that separates each column) to match your file.
#This can often be one of the most annoying/frustrating steps if the data aren't formatted just so,
#so don't feel silly if your data end up giving you trouble here.

#Step 3: Some descriptive statistics.
summary(titmouse) #lists all the summaries for each variable.
#Some are irrelevant (like datebanded, which is not really a factor, and bandnumber, which is arbitrary).
#First, we need some more categories.  Indexsum represents a hybrid index.  0 is one species and 6 is the other.
titmouse$species<-titmouse$indexsum # create a new column with indexsum data copied.
#Always copy your column or dataframe when doing something big like this.
#While it won't mess up your original file (this is only in your R session),
#it will save you trouble from having to reimport your file if you mess things up.
titmouse$species<-sub(0,"Tufted",titmouse$species)
titmouse$species<-sub(6,"Black-crested",titmouse$species)

#Hmm, that's pretty inefficient though it works for small numbers of changes.
titmouse[titmouse$species>0&titmouse$species<6,]$species = "hybrid"
str(titmouse)
#Now we have a nice species-based grouping variable.

#Mean
mean(titmouse$crestlength)
#Why does it give us NA?
?mean
(mean.crest<-mean(titmouse$crestlength, na.rm=TRUE))

#Yay!  It works when we remove NAs (missing data).
#You can also try out other variations on selecting data to get means.
mean(titmouse$crestlength[titmouse$indexsum==0])

#To get means for more than one column, there's a handy colMeans function.
colMeans(titmouse, na.rm=TRUE)
#Another error.  "x" must be numeric.  We have factors in this data frame.
colMeans(titmouse[,c("wingchordl","wingchordr","taillength","crestlength")], na.rm=TRUE)
#Oddly, several other functions use "na.action" with different options,
#but you can find those in the help files as needed.
#What about means for each group?

tapply(titmouse$crestlength, titmouse$species, FUN=mean)
#HEY IT DIDN'T WORK AGAIN.
#Wait, down in the help file
?tapply
#You can see an example that includes na.rm=TRUE.
tapply(titmouse$crestlength, #the variable we're interested in
       titmouse$species, #index (i.e., the grouping variable)
       FUN=mean, na.rm=TRUE)
#Now it works.  You also see in the help file several other options for sorting by
#a grouping (or "INDEX" as this function calls it).  Let's try aggregate.
?aggregate
#We'll test the third method, with formulas, to get you used to the common R formula.
aggregate(titmouse$crestlength~titmouse$species,
          #We want to see crest length grouped by species.
          #In later formulas (like for plotting and statistics),
          #it means that crest length is a function of species.
          data=titmouse,
          FUN= mean,
          na.action=na.omit)

#Ooh la la.  Data in a nice format!  Skimming help is always a good idea.

#Sample size/"length"
length(titmouse$crestlength)
#That includes all observations, even NAs.
length(titmouse$crestlength, na.rm=TRUE)
#Oh no!  It won't let us see how many samples we have without NAs.
length(na.omit(titmouse$crestlength))
#Go look at the data frame in the global environment, and you can see that indeed two rows
#have NA for crest length.

#Median
median(titmouse$crestlength, na.rm=TRUE)

#Variance
var(titmouse$crestlength, na.rm=TRUE)
aggregate(titmouse$crestlength~titmouse$species,
          data=titmouse,
          FUN=var,
          na.action=na.omit)

#Standard deviation
(sd.crest<-sd(titmouse$crestlength, na.rm=TRUE))

#Coefficient of variation
(cv.crest<-sd.crest*100/mean(titmouse$crestlength, na.rm=TRUE))
#Standard error
#Googling suggests that you'd have to install a package to get this automatically.
#Let's try making a function instead!
se <- function(x) sd(x)/sqrt(length(x))
se(titmouse$crestlength)

#Hmm, those pesky NAs.
se(titmouse$crestlength, na.rm=TRUE)
#This doesn't work either, because we didn't code in na.rm=TRUE into the function.
se.na.rm<-function(x) sd(x,na.rm=TRUE)/sqrt(length(x))
se.na.rm(titmouse$crestlength)
#It works, but what if we want the option to code na.rm=FALSE?
se.best<-function(x,na.rm) sd(x,na.rm=na.rm)/sqrt(length(x))
(se.crest<-se.best(x=titmouse$crestlength, na.rm=TRUE))

#perfect!

#Confidence interval (95%)
#For fitted models, R will do this automatically, but it's good to know how they are
#calculated by hand for the occasions you might need them
#(say, putting confidence intervals on a graph).
(upper95<-mean(mean.crest)+1.96*se.crest)
(lower95<-mean(mean.crest)-1.96*se.crest)
plot(mean.crest)
#A very simple graph!
points(upper95)  #add another point!
points(lower95)  #and another one!
#Obviously this is not what we'd want a confidence interval to look like, but we'll deal with those later.


#From where do people get 1.96?
n<-length(titmouse$crestlength) #what's our sample size?  We calculated that earlier, too.
qt(1-0.05/2, df=n-1) #t distribution, for our exact degrees of freedom.
qt(1-0.05/2, df=Inf) #the "idealized" case where we get the 1.96, as for the normal distribution next.
qnorm(1-0.05/2) #normal distribution.


#Hypothesis testing is discussed with t-tests.  R has good options in t.test().
t.test(crestlength~species,
       data=titmouse[titmouse$species!="hybrid",],
       na.action=na.omit
       )
?t.test
#indicates you can select the other options from the textbook, such as
#one-sided or two-sided, paired t-tests, and equal variance or unequal variance.
#Sample from a population.

#Step 4: Exiting without losing your work.
#Save early and often on your .R file (this is your code and you don't want to lose it!)
#When it's time to exit RStuido, save the workspace in the RStudio menu
#by "Session">"Save workspace as".  You can reload the .RData file later.
#It automatically goes to your working directory but you can move it if you want.

#Finally, a note on careful coding.  I have on occasion worked with a bunch of objects
#and values, then closed my code, and opened it later to get a different result.
#Somewhere I left an old variable value.  So, now I close RStudio once I'm done,
#never save the workspace, and then re-run the code from the start to make sure I
#didn't miss any new bits of code I might have added to correct problems.
#Finally, always comment!  It's easy to forget what certain parts of the code do,
#and commenting will save you days of trying to figure it out.
#Plus, if someone else needs to use your code/dataset, or you submit to a journal that
#wants data and/or code, you'll be ready!

#Sneak preview: next week we'll look at plots!
boxplot(titmouse$indexsum~titmouse$species)
plot(titmouse$crestlength~titmouse$wingchordr, data=na.omit(titmouse))
plot(titmouse$taillength~titmouse$wingchordr, data=na.omit(titmouse))

Monday, February 02, 2015