This is the second part of a quick run-through of multivariate statistics I did for the quantitative methods class I'm co-teaching with Dr. Nicola Koper.

Go here for a link to the data and data import code.

######################################

## Multivariate ANOVA (MANOVA) and ##

#Discriminant Function Analysis (DFA)#

######################################

##MANOVA##

##########

#Codes adapted from help file and here.

birdsize.manova<-lm(cbind(titmouse.na.omit$wingchordr, titmouse.na.omit$mass)~titmouse.na.omit$species,

data=titmouse.na.omit)

#A few ways to do this.

summary(manova(birdsize.manova), test="Wilks")

#manova gives type I ANOVA.

#manova by itself gives the univariate ANOVAs but without significance testing.

manova(birdsize.manova)

#the 'car' package will also give you type II and type III MANOVAs.

library(car)

Anova(birdsize.manova, type=3, test="Wilks")

summary(Manova(birdsize.manova)) #gives you all the different test statistics.

#Default is type II without specifying test type.

#To look at which dependent variables are contributing, do all the univariate ANOVAs,

#or maybe the DFA (next up).

##########

###DFA####

##########

#Code adapted from four sources.

#use MANOVA first. Remember DFA and MANOVA are two versions of the same analysis.

dfa.manova<-lm(cbind(titmouse.na.omit$mass,

titmouse.na.omit$wingchordr,

titmouse.na.omit$taillength)~titmouse.na.omit$species,

data=titmouse.na.omit)

library(candisc)

#note this masks your cancor that is used in the canonical correlation function,

#so we'll need to detach the library once we are done with the DFA if planning on using CCA again.

candiscrimin<-candisc(dfa.manova) #This function takes the MANOVA model and runs a canonical discriminant analysis on it.

candiscrimin #get the test statistics for each discriminant function. Only the first one is significant.

#CanRsq is how much of the total relationship between groups and predictors.

#DF1 accounts for about 42.2% of the total relationship (overlapping variance)

#as in CCA (square root of this value would be canonical correlation),

#and 97.3% of between-groups variability.

#Annoyingly, the "eigenvalue" here is the "eigenvalue" used in SAS, I believe,

#which is not defined as in Tabachnick and Fidell (2007) but some sort of inverted matrix thing

#that we won't worry about.

#(birdsize.dfa
below shows a slightly different number, about 98%, but I believe they
use slightly different calculation methods).

summary(candiscrimin,

means=TRUE, #mean scores for each group

scores=TRUE, #the scores that will be plotted

coef=c("structure")) #loading matrix.

plot(candiscrimin)

#More info on understanding output: http://www.ats.ucla.edu/stat/sas/output/sas_discrim.htm

#Is in SAS, but same procedure.

detach(package:candisc, unload=TRUE)

#detach the package since we use cancor() that's not from the candisc package

#in our CCA function.

#Use the more specific discriminant function analysis in lda() to get classifications.

library(MASS)

birdsize.dfa <- lda(species~ mass+wingchordr+taillength,

data = titmouse.na.omit,

prior = c(1/3, 1/3, 1/3))

#lda stands for linear discriminant analysis.

#prior, from help: "the prior probabilities of class membership.

#If unspecified, the class proportions for the training set are used.

#If present, the probabilities should be specified in the order of the factor levels."

#Because we don't want the birds weighted by frequency of each type,

#just use 1/k (number of groups) k times in the order of the factor levels.

#You can find out your the order of levels in your factor using levels().

levels(titmouse.na.omit$species)

plot(birdsize.dfa, dimen=1, type="b") #Plots the 1st discriminant function scores in a histogram with a density line.

plot(birdsize.dfa) #Plots your discriminant functions with group names as the symbols

print(birdsize.dfa) # Proportion of trace is the proportion of between-groups variance explained by adding in each DF.

#It's similar to what's calculated above in candisc but this method uses slightly different calculations.

predict(birdsize.dfa)$x #The individual DF scores.

#Let's test our classifications.

dfa.predict <- predict(birdsize.dfa,

prior=c(1/3,1/3,1/3))

#If you add a newdata= argument then you can a subset of your data as training data and another set as validation data,

#or predict from a totally new dataset.

#Store the results in a new list.

#The predicted groups are a factor.

dfa.predict$class

#Get out the probabilities.

dfa.predict$posterior # posterior probabilities of group classifications

dfa.predict$x # DF scores, same as you got earlier.

#If the newdata argument is not specified, it classifies the original dataset and classification rates may be high.

#produce the loadings manually. This is similar to the loadings generated by candisc().

cor(dfa.predict$x, titmouse.na.omit[,c("mass",

"wingchordr",

"taillength")])

(dfa.accuracy<-table("actually was"=titmouse.na.omit$species,

"classified as"=dfa.predict$class)) # accuracy of classification

sum(dfa.accuracy[row(dfa.accuracy) == col(dfa.accuracy)]) / sum(dfa.accuracy) #proportion of correctly classified rows.

accuracy.table<-diag(prop.table(dfa.accuracy,1)) #proportion of correctly classified individuals in each group.

#hybrids seem pretty hard to classify based on body size and tail length.

#Do jackknifing to test classifications with "leave-one-out".

birdsize.dfa.jk <- lda(species~ mass+wingchordr+taillength,

data = titmouse.na.omit,

prior = c(1/3, 1/3, 1/3),

CV=TRUE) #CV=TRUE specifies cross-validation (aka jackknifing)

birdsize.dfa.jk$class #show the new jackknifed classifications.

(dfa.accuracy.jk<-table("actually was"=titmouse.na.omit$species,

"jk classified as"=birdsize.dfa.jk$class)) # accuracy of classification

jackknifing.accuracy.table<-diag(prop.table(dfa.accuracy.jk,1))

chisq.test(dfa.accuracy.jk) #better than random

The last of the multivariate code will be principle components analysis (PCA) next week!

## No comments:

## Post a Comment

Comments and suggestions welcome.