## Saturday, April 26, 2014

### Developing functions, getting results out of them, and running with an example on canonical correlation analysis

Functions, lists, canonical correlation analysis (CCA)... how much more fun can one person have?

R comes with a variety of built-in functions.  What if you have a transformation you want to make manually, though?  Or an equation?  Or a series of functions that always need performed together?  You can easily make your own function.  Here I'll show you how to how to develop a function that gets you all of the results if you need to run multiple CCAs.  You then can run a CCA on your own using the one-time code, using the function, or make your own function for something else.

#First, we need some sample data.
yourdata<-data.frame(
c(rep(c("place1", "place2"), each=10)),
c(rep(c(1.2,2.7,3.3,4.9), each=5)),
c(rnorm(20,mean=0, sd=1)),
c(rep(c(1.2,4.4,2.9,7.6), 5)),
c(rep(c(5.7,6.2,7.1,8.1), each=5))
)

#Generating four columns of random and non-random numbers and
#one column of grouping variables for later subsetting of the data.

colnames(yourdata)<-c("subset", "var1", "var2", "with1", "with2")
#Naming the columns for later use.

#Now test out canonical correlation.
vars<-as.matrix(cbind(yourdata[,c("var1","var2")]))
withs<-as.matrix(cbind(yourdata[,c("with1","with2")]))
results<-cancor(vars, withs)
results
#view the results

#Okay, we get some results.
#But what about when we want to run this with different variables?
#It might be easier, if we're going to need to do the same set of commands
#repeatedly, to make a function.  Functions automate a process.

#For example, let's say we always want to add two variables (a and b).
#Every time, we could just do this:

a<-1
b<-2

c<-a+b

#But what about next time when we need a=2 and b=5?
#We'd have to re-set the variables or make new ones.
#It'd be easier to create an adding function.

#function with the variables you can input in parentheses (here a and b)
#followed by a curly bracket.
a+b #This tells you what the function will do with your input variables.
} #Close the curly brackets.

#This is a very simple example, but the concept holds.
#Take a process and automate it.  Now we'll automate our CCA code from above.

canonical.corr.simple<-function (var, with, dataframe) {
#function name and the variables we can input.
#The next four lines are the same thing we did earlier to get CCA results,
#but with the standardized variable names.
var.original<-as.matrix(cbind(dataframe[,var]))
with.original<-as.matrix(cbind(dataframe[,with]))
results.cancor<-cancor(var.original,with.original)
results.cancor
}

#Now you can run it using input variables.

var.variables<-c("var1", "var2")
with.variables<-c("with1", "with2")

function.results<-canonical.corr.simple(var=var.variables,
with=with.variables,
dataframe=yourdata)
function.results

function.results\$cor
#You can access different pieces of the function results this way.
function.results\$cor
#Use this list notation (I'll get more into it later) to get very specific pieces of the results.

#This is useful because now you can run the same function
#more than once using different input objects.

#However, you'll notice that if you try to access any objects
#internal to the function, you get an error.
#This is annoying if we want to keep our results,
#especially once we make a more complicated function where we need results
#that are not in the final printout.

var.original

#This means that the internal objects haven't been created in our workspace.
#To access them for later analysis, you'll need to return these values as a list.

canonical.corr.listreturn<-function (var, with, dataframe) {
var.original<-as.matrix(cbind(dataframe[,var]))
with.original<-as.matrix(cbind(dataframe[,with]))
results.cancor<-cancor(var.original,with.original)
results.list<-list(results.cancor,
var.original,
with.original)
return(results.list)
}

#Test out the function!
results.yeah<-canonical.corr.listreturn(var=var.variables,
with=with.variables,
dataframe=yourdata)

#If you do not assign your results to an object,
#it will just print them and you can't access the list.
#Sometimes I forget this and try to ask for "results.list"
#in the function, but that won't work if you
#don't assign your results to an object like I did here.
#So then type out your object name.  It is now a list with three main elements.
results.yeah

#Yep.  Lots of info.  But how to we get individual pieces of the list?
#First it helps to know the structure of the list.
str(results.yeah)
#List of 3
#\$ :List of 5
#..\$ cor    : num [1:2] 0.98 0.104
#..\$ xcoef  : num [1:2, 1:2] -0.1714 0.0226 -0.0144 0.1979
#.. ..- attr(*, "dimnames")=List of 2
#.. .. ..\$ : chr [1:2] "var1" "var2"
#.. .. ..\$ : NULL
#..\$ ycoef  : num [1:2, 1:2] -0.00089 -0.24404 0.09633 -0.04365
#.. ..- attr(*, "dimnames")=List of 2
#.. .. ..\$ : chr [1:2] "with1" "with2"
#.. .. ..\$ : NULL
#..\$ xcenter: Named num [1:2] 3.025 0.0977
#.. ..- attr(*, "names")= chr [1:2] "var1" "var2"
#..\$ ycenter: Named num [1:2] 4.03 6.78
#.. ..- attr(*, "names")= chr [1:2] "with1" "with2"
#\$ : num [1:20, 1:2] 1.2 1.2 1.2 1.2 1.2 2.7 2.7 2.7 2.7 2.7 ...
#..- attr(*, "dimnames")=List of 2
#.. ..\$ : NULL
#.. ..\$ : chr [1:2] "var1" "var2"
#\$ : num [1:20, 1:2] 1.2 4.4 2.9 7.6 1.2 4.4 2.9 7.6 1.2 4.4 ...
#..- attr(*, "dimnames")=List of 2
#.. ..\$ : NULL
#.. ..\$ : chr [1:2] "with1" "with2"

#So, how do we get stuff out of this structure?

results.yeah
#This is the whole first element.
#You can see its structure, too:
str(results.yeah)
#It has five elements, and more within each of those.
#To go farther into the list, use the square brackets.
#You need two around each except for the last element.
results.yeah[]
#We got \$cor!  Woohoo!  To get only the first \$cor, try this:
results.yeah[][]
#Now just test out which pieces of the list you want using the data from str().
results.yeah[]
#For example, the second element at this level is \$xcoef.
#Sometimes I have to experiment with it a little to find what I want, but just remember
#it's got a structure, so you will be able to find what you need.

#Once you've messed around with that a bit to see
#how you can get to the various pieces you want,
#let's move on to making a more complete CCA function to use.
#This one will have spots for transformed and untransformed variables
#in case you need to transform any of your data and

yourdata.transforms<-yourdata
yourdata.transforms\$lnwith2<-log(yourdata.transforms\$with2)

#R uses the natural log
#Now we have example data with one transformed variable.

#Select the columns you want.
var.variables<-c("var1","var2")
with.variables<-c("with1","lnwith2")
#note I've selected one transformed variable.
#These are your variables.  You are correlating the with and var variate
#for the with and var variables.  That is, the two var variables will be combined into a single
#variate.  The same thing is done to the with variables (they are combined into a with variate
#that extracts as much variance as possible from the variables).

#Select columns for untransformed variables.
var.variables.untransformed<-c("var1","var2")
with.variables.untransformed<-c("with1","with2")

#note I've selected both untransformed variables.
#You need to do this so that you can get loadings of the variables on the variates
#for interpretation of the variates.

#Select any subsetting.  Here I will use data from either place1 or place2
#(all the data), but you can use this to select data from particular sites or groups.
subsetting.all<-c(yourdata.transforms\$subset=='place1'|yourdata.transforms\$subset=='place2')
#That vertical line is "or".

#Make some variable names for when we get to graphing just to keep track.
graphing.variable.names<-c("var-variate", "with-variate")

#Make sure you have downloaded and installed the packages 'yacca' and 'CCP'
#if you don't already have them.
canonical.corr<- function (var,
with,
var.untransformed,
with.untransformed,
variablenames,
dataframe,
subsetting) {

var.original<-as.matrix(cbind(dataframe[subsetting,var]))
#subsetting selects rows
with.original<-as.matrix(cbind(dataframe[subsetting,with]))
results.cancor<-cancor(var.original,with.original)
rho<-cancor(var.original,with.original)\$cor
N<-dim(var.original)
p<-dim(var.original)
q<-dim(with.original)

#These four lines are needed for the below two, to get a Wilks' lambda.
library(CCP)
cancorresults<-p.asym(rho,N,p,q,tstat="Wilks")

library(yacca)
results.yacca<-cca(var.original,with.original)

#easiest version of cancor!  gives correlations.
#Use CCP library for additional test stats if needed, because cca() will only give Rao's
#(related to and same value as Wilks).

#To see structural correlations (loadings) for how X and Y variables change with variates,
#get redundancy.  Interpret loadings that are |r| equal to or greater than 0.33.
#If input variables are transformed, then need to do a Pearson's correlation on original variable
#and appropriate canonical variate.
var_expl<-cbind('x.redundancy'=results.yacca\$xvrd, 'y.redundancy'=results.yacca\$yvrd)
#Canonical redundancies for x variables
#(i.e., total fraction of x variance accounted for by y variables, through each canonical variate)
#yvrd = Canonical redundancies for y variables (i.e., total fraction of y variance accounted for by x variables, through each canonical variate).
#y is with, so y.redundancy is amount of with variance accounted for by var variance.
#x is var, so x.redundancy is amount of var variance accounted for by with variance.
var_expl #prints

cancor.variates<-cbind('canonical.correlation'=results.yacca\$corr,
#report this
'overlappingvariance.corrsq'=results.yacca\$corrsq,
#canonical correlation with THIS % overlapping variance
#amount of x variate accounted for by x variables
# amount of y variate accounted for by y variables
)

CC1x<-as.matrix(results.yacca\$canvarx)
CC1x
CC1y<-as.matrix(results.yacca\$canvary)

#is same as loadings in the original redundancy analysis because there was no transformation.
#Used in interpreting graph because of transformation.

Yplot<-data.frame(CC1x[,1])
Xplot<-data.frame(CC1y[,1])

graphing<-cbind(Xplot, Yplot) #This gives you data you can plot.
colnames(graphing)<-variablenames #This gives you names of your variables.
results.list<-list(cancorresults,
var_expl,
graphing,
cancor.variates)

return(results.list)
}

long.results<-canonical.corr(var=var.variables,
with=with.variables,
var.untransformed=var.variables.untransformed,
with.untransformed=with.variables.untransformed,
variablenames=graphing.variable.names,
dataframe=yourdata.transforms,
subsetting=subsetting.all
)

long.results #look at our results here.  You can see the range of results we extracted.
str(long.results) #and the structure.

#First, is it significant?
long.results[]\$p.value #This gets the first variate.
#long.results[]\$p.value gets the second.
#long.results[]\$p.value gets all the variates' p-values.
#long.results[] shows you all the degrees of freedom and Wilks test statistic that you need to report.

long.results[] #shows the redundancy analysis.  How much of X is explained by y?
#This is what I commented in the function, but I'll repeat here:
#total fraction of y or x variance accounted for by x or y variables,
#through each canonical variate).
#y is with, so y.redundancy is amount of with variance accounted for by var variance.
#x is var, so x.redundancy is amount of var variance accounted for by with variance.

long.results[] #Loadings for var variables on each canonical variate (CV1 and CV2 in this case)
long.results[] #Loadings for with variables on each canonical variate (CV1 and CV2 in this case)

#How about graphing?  Here's a very simple plot.  It uses the variable names we made earlier
#and plots CCI of each variate.
plot(long.results[])

long.results[] #Finally some more interpretative stats.
#The first two are the canonical correlation and overlapping variance (correlation squared).
#The last two are the variance explained by the variables within its own variate.
#variance.in.x.from.x.cc is how much of the "var" variables (x) is explained by its variate.
#Same for y ("with") variance.in.y.from.y.cc.

This last function (canonical.corr) is the code I use to run canonical correlation analysis in my work currently.  You can read the help files for the CCA functions within it if you need to extract additional statistics for your particular analysis; just write them into the function and add them to the results list that is returned.