Saturday, April 26, 2014

Developing functions, getting results out of them, and running with an example on canonical correlation analysis

Functions, lists, canonical correlation analysis (CCA)... how much more fun can one person have?

R comes with a variety of built-in functions.  What if you have a transformation you want to make manually, though?  Or an equation?  Or a series of functions that always need performed together?  You can easily make your own function.  Here I'll show you how to how to develop a function that gets you all of the results if you need to run multiple CCAs.  You then can run a CCA on your own using the one-time code, using the function, or make your own function for something else.

#First, we need some sample data.
yourdata<-data.frame(
                     c(rep(c("place1", "place2"), each=10)),
                     c(rep(c(1.2,2.7,3.3,4.9), each=5)),
                     c(rnorm(20,mean=0, sd=1)),
                     c(rep(c(1.2,4.4,2.9,7.6), 5)),
                     c(rep(c(5.7,6.2,7.1,8.1), each=5))
                     )


#Generating four columns of random and non-random numbers and
#one column of grouping variables for later subsetting of the data.

colnames(yourdata)<-c("subset", "var1", "var2", "with1", "with2")
#Naming the columns for later use.

#Now test out canonical correlation.
vars<-as.matrix(cbind(yourdata[,c("var1","var2")]))
withs<-as.matrix(cbind(yourdata[,c("with1","with2")]))
results<-cancor(vars, withs)
results
#view the results

#Okay, we get some results.
#But what about when we want to run this with different variables?
#It might be easier, if we're going to need to do the same set of commands
#repeatedly, to make a function.  Functions automate a process.

#For example, let's say we always want to add two variables (a and b).
#Every time, we could just do this:

a<-1
b<-2

c<-a+b


#But what about next time when we need a=2 and b=5?
#We'd have to re-set the variables or make new ones.
#It'd be easier to create an adding function.

adding<-function(a,b){  #name the function ('adding')
  #function with the variables you can input in parentheses (here a and b)
  #followed by a curly bracket.
  a+b #This tells you what the function will do with your input variables.
} #Close the curly brackets.

adding(a=2,b=5)

#This is a very simple example, but the concept holds.
#Take a process and automate it.  Now we'll automate our CCA code from above.

canonical.corr.simple<-function (var, with, dataframe) {
  #function name and the variables we can input.
  #The next four lines are the same thing we did earlier to get CCA results,
  #but with the standardized variable names.
  var.original<-as.matrix(cbind(dataframe[,var]))
  with.original<-as.matrix(cbind(dataframe[,with]))
  results.cancor<-cancor(var.original,with.original)
  results.cancor
}

#Now you can run it using input variables.

var.variables<-c("var1", "var2")
with.variables<-c("with1", "with2")

function.results<-canonical.corr.simple(var=var.variables,
                                        with=with.variables,
                                        dataframe=yourdata)
function.results

function.results$cor
#You can access different pieces of the function results this way.
function.results$cor[1]
#Use this list notation (I'll get more into it later) to get very specific pieces of the results.

#This is useful because now you can run the same function
#more than once using different input objects.

#However, you'll notice that if you try to access any objects
#internal to the function, you get an error.
#This is annoying if we want to keep our results,
#especially once we make a more complicated function where we need results
#that are not in the final printout.

var.original

#Error: object 'var.original' not found

#This means that the internal objects haven't been created in our workspace.
#To access them for later analysis, you'll need to return these values as a list.

canonical.corr.listreturn<-function (var, with, dataframe) {
  var.original<-as.matrix(cbind(dataframe[,var]))
  with.original<-as.matrix(cbind(dataframe[,with]))
  results.cancor<-cancor(var.original,with.original)
  results.list<-list(results.cancor,
                     var.original,
                     with.original)
  return(results.list)
  }



#Test out the function!
results.yeah<-canonical.corr.listreturn(var=var.variables,
                                        with=with.variables,
                                        dataframe=yourdata)

#If you do not assign your results to an object,
#it will just print them and you can't access the list.
#Sometimes I forget this and try to ask for "results.list"
#in the function, but that won't work if you
#don't assign your results to an object like I did here.
#So then type out your object name.  It is now a list with three main elements.
results.yeah

#Yep.  Lots of info.  But how to we get individual pieces of the list?
#First it helps to know the structure of the list.
str(results.yeah)
#List of 3
#$ :List of 5
#..$ cor    : num [1:2] 0.98 0.104
#..$ xcoef  : num [1:2, 1:2] -0.1714 0.0226 -0.0144 0.1979
#.. ..- attr(*, "dimnames")=List of 2
#.. .. ..$ : chr [1:2] "var1" "var2"
#.. .. ..$ : NULL
#..$ ycoef  : num [1:2, 1:2] -0.00089 -0.24404 0.09633 -0.04365
#.. ..- attr(*, "dimnames")=List of 2
#.. .. ..$ : chr [1:2] "with1" "with2"
#.. .. ..$ : NULL
#..$ xcenter: Named num [1:2] 3.025 0.0977
#.. ..- attr(*, "names")= chr [1:2] "var1" "var2"
#..$ ycenter: Named num [1:2] 4.03 6.78
#.. ..- attr(*, "names")= chr [1:2] "with1" "with2"
#$ : num [1:20, 1:2] 1.2 1.2 1.2 1.2 1.2 2.7 2.7 2.7 2.7 2.7 ...
#..- attr(*, "dimnames")=List of 2
#.. ..$ : NULL
#.. ..$ : chr [1:2] "var1" "var2"
#$ : num [1:20, 1:2] 1.2 4.4 2.9 7.6 1.2 4.4 2.9 7.6 1.2 4.4 ...
#..- attr(*, "dimnames")=List of 2
#.. ..$ : NULL
#.. ..$ : chr [1:2] "with1" "with2"


#So, how do we get stuff out of this structure?

results.yeah[1]
#This is the whole first element.
#You can see its structure, too:
str(results.yeah[1])
#It has five elements, and more within each of those.
#To go farther into the list, use the square brackets.
#You need two around each except for the last element.
results.yeah[[1]][1]
#We got $cor!  Woohoo!  To get only the first $cor, try this:
results.yeah[[1]][[1]][1]
#Now just test out which pieces of the list you want using the data from str().
results.yeah[[1]][2]
#For example, the second element at this level is $xcoef.
#Sometimes I have to experiment with it a little to find what I want, but just remember
#it's got a structure, so you will be able to find what you need.

#Once you've messed around with that a bit to see
#how you can get to the various pieces you want,
#let's move on to making a more complete CCA function to use.
#This one will have spots for transformed and untransformed variables
#in case you need to transform any of your data and
#will output additional statistics to include in your work.

yourdata.transforms<-yourdata
yourdata.transforms$lnwith2<-log(yourdata.transforms$with2)

#R uses the natural log
#Now we have example data with one transformed variable.

#Select the columns you want.
var.variables<-c("var1","var2")
with.variables<-c("with1","lnwith2")
#note I've selected one transformed variable.
#These are your variables.  You are correlating the with and var variate
#for the with and var variables.  That is, the two var variables will be combined into a single
#variate.  The same thing is done to the with variables (they are combined into a with variate
#that extracts as much variance as possible from the variables).

#Select columns for untransformed variables.
var.variables.untransformed<-c("var1","var2")
with.variables.untransformed<-c("with1","with2")

#note I've selected both untransformed variables.
#You need to do this so that you can get loadings of the variables on the variates
#for interpretation of the variates. 

#Select any subsetting.  Here I will use data from either place1 or place2
#(all the data), but you can use this to select data from particular sites or groups.
subsetting.all<-c(yourdata.transforms$subset=='place1'|yourdata.transforms$subset=='place2') 
#That vertical line is "or".

#Make some variable names for when we get to graphing just to keep track.
graphing.variable.names<-c("var-variate", "with-variate")


#Make sure you have downloaded and installed the packages 'yacca' and 'CCP'
#if you don't already have them.
canonical.corr<- function (var,
                           with,
                           var.untransformed,
                           with.untransformed,
                           variablenames,
                           dataframe,
                           subsetting) {
 
 
  var.original<-as.matrix(cbind(dataframe[subsetting,var])) 
#subsetting selects rows
  with.original<-as.matrix(cbind(dataframe[subsetting,with]))
  results.cancor<-cancor(var.original,with.original)
  rho<-cancor(var.original,with.original)$cor
  N<-dim(var.original)[1]
  p<-dim(var.original)[2]
  q<-dim(with.original)[2]

  #These four lines are needed for the below two, to get a Wilks' lambda.
  library(CCP)
  cancorresults<-p.asym(rho,N,p,q,tstat="Wilks")
 

  library(yacca)
  results.yacca<-cca(var.original,with.original)

  #easiest version of cancor!  gives correlations.
  #Use CCP library for additional test stats if needed, because cca() will only give Rao's
  #(related to and same value as Wilks).
 
  #To see structural correlations (loadings) for how X and Y variables change with variates,
  #get redundancy.  Interpret loadings that are |r| equal to or greater than 0.33.
  #If input variables are transformed, then need to do a Pearson's correlation on original variable
  #and appropriate canonical variate.
  var_expl<-cbind('x.redundancy'=results.yacca$xvrd, 'y.redundancy'=results.yacca$yvrd)
  #Canonical redundancies for x variables
  #(i.e., total fraction of x variance accounted for by y variables, through each canonical variate)
  #yvrd = Canonical redundancies for y variables (i.e., total fraction of y variance accounted for by x variables, through each canonical variate).
  #y is with, so y.redundancy is amount of with variance accounted for by var variance.
  #x is var, so x.redundancy is amount of var variance accounted for by with variance.
  var_expl #prints
 
  cancor.variates<-cbind('canonical.correlation'=results.yacca$corr, 
#report this
                         'overlappingvariance.corrsq'=results.yacca$corrsq,
#canonical correlation with THIS % overlapping variance
                         'variance.in.x.from.x.cc'=results.yacca$xcanvad,
#amount of x variate accounted for by x variables
                         'variance.in.y.from.y.cc'=results.yacca$ycanvad
# amount of y variate accounted for by y variables
  )
 
 
  CC1x<-as.matrix(results.yacca$canvarx)
  CC1x
  CC1y<-as.matrix(results.yacca$canvary)
  with.loadings<-cor(CC1y,dataframe[subsetting,with.untransformed],method="pearson")

  # this gives hybrid index (y, with) loadings for untransformed data.
  #is same as loadings in the original redundancy analysis because there was no transformation.
  var.loadings<-cor(CC1x,dataframe[subsetting,var.untransformed],method="pearson")
  # this gives song (x, var) loadings for untransformed data. 
  #Used in interpreting graph because of transformation.
 
  Yplot<-data.frame(CC1x[,1])
  Xplot<-data.frame(CC1y[,1])
 

  graphing<-cbind(Xplot, Yplot) #This gives you data you can plot.
  colnames(graphing)<-variablenames #This gives you names of your variables.
  results.list<-list(cancorresults,
                     var_expl,
                     var.loadings,
                     with.loadings,
                     graphing,
                     cancor.variates)
 
  return(results.list)
}

long.results<-canonical.corr(var=var.variables,
                             with=with.variables,
                             var.untransformed=var.variables.untransformed,
                             with.untransformed=with.variables.untransformed,
                             variablenames=graphing.variable.names,
                             dataframe=yourdata.transforms,
                             subsetting=subsetting.all
)

long.results #look at our results here.  You can see the range of results we extracted.
str(long.results) #and the structure.


#First, is it significant?
long.results[[1]]$p.value[1] #This gets the first variate.
#long.results[[1]]$p.value[2] gets the second.
#long.results[[1]]$p.value gets all the variates' p-values.
#long.results[[1]] shows you all the degrees of freedom and Wilks test statistic that you need to report.

long.results[[2]] #shows the redundancy analysis.  How much of X is explained by y?
#This is what I commented in the function, but I'll repeat here:
#total fraction of y or x variance accounted for by x or y variables,
#through each canonical variate).
#y is with, so y.redundancy is amount of with variance accounted for by var variance.
#x is var, so x.redundancy is amount of var variance accounted for by with variance.

#Loadings for each variate.  Interpret loadings higher than |0.33| for significant variates.
long.results[[3]] #Loadings for var variables on each canonical variate (CV1 and CV2 in this case)
long.results[[4]] #Loadings for with variables on each canonical variate (CV1 and CV2 in this case)

#How about graphing?  Here's a very simple plot.  It uses the variable names we made earlier
#and plots CCI of each variate.
plot(long.results[[5]])

long.results[[6]] #Finally some more interpretative stats.
#The first two are the canonical correlation and overlapping variance (correlation squared).
#The last two are the variance explained by the variables within its own variate.
#variance.in.x.from.x.cc is how much of the "var" variables (x) is explained by its variate.
#Same for y ("with") variance.in.y.from.y.cc.


This last function (canonical.corr) is the code I use to run canonical correlation analysis in my work currently.  You can read the help files for the CCA functions within it if you need to extract additional statistics for your particular analysis; just write them into the function and add them to the results list that is returned.

Saturday, April 12, 2014

Generate a series of random numbers

I'm back, and now am the proud possessor of a doctorate!  I have a really quick post for now, and soon should get a much longer post up on functions, loops, and canonical correlation.

If you want to generate a series of random integers, you can use the sample function.

sample(10, 20, replace=FALSE)
#The first number is how many items you want to end up with
#Here we want 10 random numbers.
#The second number is the number of items to choose from.
#Here we sample out of the integers 1-20.
#replace=FALSE sets it so that we sample without replacement.
#If you draw 6, for example, you won't get 6 again.
#If you set both numbers to be the same thing, you get all the numbers
#in your series drawn in a random order.