Monday, February 09, 2015

Beginning R: setting your workspace, importing your data, and running basic descriptive statistics

#Step 1: Set your working directory.
#This allows you to import, export, and save plots all in the folder of your choice,
#ideally one that you can remember how to find again.

#You can set your working directory via the RStudio menus,
#but I suggest you hardcode it so that you don't have to remember
#all the steps every time or where you keep the files.

#What is your directory now?

#Let's change it.
#gives an error.
#If you don't know what the problem is, copy the error message to your search engine.
#I find from the first link that R needs either forward slashes OR
#double backslashes.  It can't use single backslashes.

#Both work!

#Step 2: Import your data.
#Download these data in .csv format and place the file in your working directory.
titmouse<- read.table(file="blog_banding_data.csv", header=TRUE, sep=",")
#These data are part of that we used in Curry and Patten (2014).
names(titmouse) #lists all the variable (column) names.
str(titmouse) #shows the data types.
#There are factors, integers ("int"), and numbers (num).

#Later, how will you bring your own data into R?
#It's best to have your data formatted as .csv/.txt as we do here.
#These are readable without any special programs and useable in any program.
#Most programs also allow export in some variation of those formats.
#It is possible to import from an Excel file but potentially difficult
#I find it much simpler to just save any spreadsheets I have as .csv or .txt,
#and it's a good habit to know how to work with those formats as they will be used
#in many programs as a basic file format.
?read.table #gives additional options, most notable read.csv, for importing text files.
#You can change the delimiter (the text that separates each column) to match your file.
#This can often be one of the most annoying/frustrating steps if the data aren't formatted just so,
#so don't feel silly if your data end up giving you trouble here.

#Step 3: Some descriptive statistics.
summary(titmouse) #lists all the summaries for each variable.
#Some are irrelevant (like datebanded, which is not really a factor, and bandnumber, which is arbitrary).
#First, we need some more categories.  Indexsum represents a hybrid index.  0 is one species and 6 is the other.
titmouse$species<-titmouse$indexsum # create a new column with indexsum data copied.
#Always copy your column or dataframe when doing something big like this.
#While it won't mess up your original file (this is only in your R session),
#it will save you trouble from having to reimport your file if you mess things up.

#Hmm, that's pretty inefficient though it works for small numbers of changes.
titmouse[titmouse$species>0&titmouse$species<6,]$species = "hybrid"
#Now we have a nice species-based grouping variable.

#Why does it give us NA?
(mean.crest<-mean(titmouse$crestlength, na.rm=TRUE))

#Yay!  It works when we remove NAs (missing data).
#You can also try out other variations on selecting data to get means.

#To get means for more than one column, there's a handy colMeans function.
colMeans(titmouse, na.rm=TRUE)
#Another error.  "x" must be numeric.  We have factors in this data frame.
colMeans(titmouse[,c("wingchordl","wingchordr","taillength","crestlength")], na.rm=TRUE)
#Oddly, several other functions use "na.action" with different options,
#but you can find those in the help files as needed.
#What about means for each group?

tapply(titmouse$crestlength, titmouse$species, FUN=mean)
#Wait, down in the help file
#You can see an example that includes na.rm=TRUE.
tapply(titmouse$crestlength, #the variable we're interested in
       titmouse$species, #index (i.e., the grouping variable)
       FUN=mean, na.rm=TRUE)
#Now it works.  You also see in the help file several other options for sorting by
#a grouping (or "INDEX" as this function calls it).  Let's try aggregate.
#We'll test the third method, with formulas, to get you used to the common R formula.
          #We want to see crest length grouped by species.
          #In later formulas (like for plotting and statistics),
          #it means that crest length is a function of species.
          FUN= mean,

#Ooh la la.  Data in a nice format!  Skimming help is always a good idea.

#Sample size/"length"
#That includes all observations, even NAs.
length(titmouse$crestlength, na.rm=TRUE)
#Oh no!  It won't let us see how many samples we have without NAs.
#Go look at the data frame in the global environment, and you can see that indeed two rows
#have NA for crest length.

median(titmouse$crestlength, na.rm=TRUE)

var(titmouse$crestlength, na.rm=TRUE)

#Standard deviation
(sd.crest<-sd(titmouse$crestlength, na.rm=TRUE))

#Coefficient of variation
(cv.crest<-sd.crest*100/mean(titmouse$crestlength, na.rm=TRUE))
#Standard error
#Googling suggests that you'd have to install a package to get this automatically.
#Let's try making a function instead!
se <- function(x) sd(x)/sqrt(length(x))

#Hmm, those pesky NAs.
se(titmouse$crestlength, na.rm=TRUE)
#This doesn't work either, because we didn't code in na.rm=TRUE into the function.<-function(x) sd(x,na.rm=TRUE)/sqrt(length(x))$crestlength)
#It works, but what if we want the option to code na.rm=FALSE?<-function(x,na.rm) sd(x,na.rm=na.rm)/sqrt(length(x))
(se.crest<$crestlength, na.rm=TRUE))


#Confidence interval (95%)
#For fitted models, R will do this automatically, but it's good to know how they are
#calculated by hand for the occasions you might need them
#(say, putting confidence intervals on a graph).
#A very simple graph!
points(upper95)  #add another point!
points(lower95)  #and another one!
#Obviously this is not what we'd want a confidence interval to look like, but we'll deal with those later.

#From where do people get 1.96?
n<-length(titmouse$crestlength) #what's our sample size?  We calculated that earlier, too.
qt(1-0.05/2, df=n-1) #t distribution, for our exact degrees of freedom.
qt(1-0.05/2, df=Inf) #the "idealized" case where we get the 1.96, as for the normal distribution next.
qnorm(1-0.05/2) #normal distribution.

#Hypothesis testing is discussed with t-tests.  R has good options in t.test().
#indicates you can select the other options from the textbook, such as
#one-sided or two-sided, paired t-tests, and equal variance or unequal variance.
#Sample from a population.

#Step 4: Exiting without losing your work.
#Save early and often on your .R file (this is your code and you don't want to lose it!)
#When it's time to exit RStuido, save the workspace in the RStudio menu
#by "Session">"Save workspace as".  You can reload the .RData file later.
#It automatically goes to your working directory but you can move it if you want.

#Finally, a note on careful coding.  I have on occasion worked with a bunch of objects
#and values, then closed my code, and opened it later to get a different result.
#Somewhere I left an old variable value.  So, now I close RStudio once I'm done,
#never save the workspace, and then re-run the code from the start to make sure I
#didn't miss any new bits of code I might have added to correct problems.
#Finally, always comment!  It's easy to forget what certain parts of the code do,
#and commenting will save you days of trying to figure it out.
#Plus, if someone else needs to use your code/dataset, or you submit to a journal that
#wants data and/or code, you'll be ready!

#Sneak preview: next week we'll look at plots!
plot(titmouse$crestlength~titmouse$wingchordr, data=na.omit(titmouse))
plot(titmouse$taillength~titmouse$wingchordr, data=na.omit(titmouse))

No comments:

Post a Comment

Comments and suggestions welcome.