## Tuesday, April 21, 2015

### Beginning R: selected multivariate statistics, part 2 of 3 (MANOVA and DFA)

This is the second part of a quick run-through of multivariate statistics I did for the quantitative methods class I'm co-teaching with Dr. Nicola Koper.

Go here for a link to the data and data import code.

######################################
##  Multivariate ANOVA (MANOVA) and ##
#Discriminant Function Analysis (DFA)#
######################################
##MANOVA##
##########
#Codes adapted from help file and here.

birdsize.manova<-lm(cbind(titmouse.na.omit\$wingchordr, titmouse.na.omit\$mass)~titmouse.na.omit\$species,
data=titmouse.na.omit)

#A few ways to do this.
summary(manova(birdsize.manova), test="Wilks")
#manova gives type I ANOVA.
#manova by itself gives the univariate ANOVAs but without significance testing.
manova(birdsize.manova)
#the 'car' package will also give you type II and type III MANOVAs.
library(car)
Anova(birdsize.manova, type=3, test="Wilks")
summary(Manova(birdsize.manova)) #gives you all the different test statistics.
#Default is type II without specifying test type.
#To look at which dependent variables are contributing, do all the univariate ANOVAs,
#or maybe the DFA (next up).

##########
###DFA####
##########

#use MANOVA first.  Remember DFA and MANOVA are two versions of the same analysis.
dfa.manova<-lm(cbind(titmouse.na.omit\$mass,
titmouse.na.omit\$wingchordr,
titmouse.na.omit\$taillength)~titmouse.na.omit\$species,
data=titmouse.na.omit)

library(candisc)
#note this masks your cancor that is used in the canonical correlation function,
#so we'll need to detach the library once we are done with the DFA if planning on using CCA again.

candiscrimin<-candisc(dfa.manova) #This function takes the MANOVA model and runs a canonical discriminant analysis on it.

candiscrimin #get the test statistics for each discriminant function.  Only the first one is significant.
#CanRsq is how much of the total relationship between groups and predictors.
#DF1 accounts for about 42.2% of the total relationship (overlapping variance)
#as in CCA (square root of this value would be canonical correlation),
#and 97.3% of between-groups variability.
#Annoyingly, the "eigenvalue" here is the "eigenvalue" used in SAS, I believe,
#which is not defined as in Tabachnick and Fidell (2007) but some sort of inverted matrix thing
#(birdsize.dfa below shows a slightly different number, about 98%, but I believe they use slightly different calculation methods).

summary(candiscrimin,
means=TRUE, #mean scores for each group
scores=TRUE, #the scores that will be plotted

plot(candiscrimin)
#Is in SAS, but same procedure.

#detach the package since we use cancor() that's not from the candisc package
#in our CCA function.

#Use the more specific discriminant function analysis in lda() to get classifications.
library(MASS)
birdsize.dfa <- lda(species~ mass+wingchordr+taillength,
data = titmouse.na.omit,
prior = c(1/3, 1/3, 1/3))
#lda stands for linear discriminant analysis.
#prior, from help: "the prior probabilities of class membership.
#If unspecified, the class proportions for the training set are used.
#If present, the probabilities should be specified in the order of the factor levels."
#Because we don't want the birds weighted by frequency of each type,
#just use 1/k (number of groups) k times in the order of the factor levels.
#You can find out your the order of levels in your factor using levels().
levels(titmouse.na.omit\$species)

plot(birdsize.dfa, dimen=1, type="b")      #Plots the 1st discriminant function scores in a histogram with a density line.
plot(birdsize.dfa) #Plots your discriminant functions with group names as the symbols
print(birdsize.dfa)     # Proportion of trace is the proportion of between-groups variance explained by adding in each DF.
#It's similar to what's calculated above in candisc but this method uses slightly different calculations.
predict(birdsize.dfa)\$x  #The individual DF scores.

#Let's test our classifications.
dfa.predict <- predict(birdsize.dfa,
prior=c(1/3,1/3,1/3))
#If you add a newdata= argument then you can a subset of your data as training data and another set as validation data,
#or predict from a totally new dataset.

#Store the results in a new list.
#The predicted groups are a factor.
dfa.predict\$class

#Get out the probabilities.
dfa.predict\$posterior # posterior probabilities of group classifications
dfa.predict\$x         # DF scores, same as you got earlier.
#If the newdata argument is not specified, it classifies the original dataset and classification rates may be high.

cor(dfa.predict\$x, titmouse.na.omit[,c("mass",
"wingchordr",
"taillength")])

(dfa.accuracy<-table("actually was"=titmouse.na.omit\$species,
"classified as"=dfa.predict\$class)) # accuracy of classification

sum(dfa.accuracy[row(dfa.accuracy) == col(dfa.accuracy)]) / sum(dfa.accuracy) #proportion of correctly classified rows.
accuracy.table<-diag(prop.table(dfa.accuracy,1)) #proportion of correctly classified individuals in each group.
#hybrids seem pretty hard to classify based on body size and tail length.

#Do jackknifing to test classifications with "leave-one-out".
birdsize.dfa.jk <- lda(species~ mass+wingchordr+taillength,
data = titmouse.na.omit,
prior = c(1/3, 1/3, 1/3),
CV=TRUE) #CV=TRUE specifies cross-validation (aka jackknifing)
birdsize.dfa.jk\$class #show the new jackknifed classifications.
(dfa.accuracy.jk<-table("actually was"=titmouse.na.omit\$species,
"jk classified as"=birdsize.dfa.jk\$class)) # accuracy of classification
jackknifing.accuracy.table<-diag(prop.table(dfa.accuracy.jk,1))

chisq.test(dfa.accuracy.jk) #better than random

The last of the multivariate code will be principle components analysis (PCA) next week!