Sunday, January 19, 2014

Profile analysis in SAS with R-generated figure

I was tickled to find a set of code to do a profile analysis in R hiding in a larger set of code examples.  A profile analysis is a multivariate ANOVA for repeated measures (Tabachnick and Fidell 2007).  I haven't been able to make the flatness test in SAS and R match yet using the code from the linked site, though the tests for parallelism and levels give the same results.  (I'll update the post if I figure that out.)  Regardless, I really liked how the linked site made an interaction plot for a profile analysis, with pale lines for individuals shown in the background behind averages for each level.  So, what I've been doing for my profile analyses lately is using R to format the data, running the profile analysis in SAS, and then graphing in R.


1. R code for formatting the data

My data are usually in column form, with each level and group as factors in two additional columns.  To get to the one-column-per-group format needed for a profile analysis, I use the reshape2 package.  Follow the instructions in the melting post below.  (I'll use that dataset here.)
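As a quick reminder (full details are in the melting post below), the reshaping boils down to a melt followed by a dcast.  Here is a minimal sketch, assuming your long-form dataset is called original.data as in that post:

library(reshape2)
melt.data<-melt(original.data, id.vars=c("levels", "groups", "individual"))
melt.data$variable<-melt.data$groups #the factor that becomes the new columns
data.columns<-dcast(melt.data, individual+levels~variable, mean) #one column per group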

Once you have your data in the correct format, you can export it to a .csv text file.  Make sure you have your working directory set to where you want the file to go.  (I usually do this step at the beginning when I import data, but since we generated our dataset in R, the working directory needs to be set here.)

setwd("C:\\Users\\Claire\\Documents\\")
#make sure you use two backslashes, not one.
write.csv(data.columns,file="20140116_profile_analysis.csv", na=".")
#Writes a file that can easily be imported into SAS.
#Note the na="." part, which gives periods for missing data as SAS prefers.
#My sample dataset doesn't have this, but I've left it in for when I do have a set with missing data.
#(SAS will not use those individuals in the profile analysis.)
#The file will be written to your working directory.
#You can now import it into SAS.
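If you want to double-check the file before heading over to SAS, you can read it straight back into R; na.strings="." turns the periods back into NAs.

check<-read.csv("20140116_profile_analysis.csv", na.strings=".")
head(check) #should match data.columns, plus an extra row-number column
#(write.csv includes row names by default).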

 

2. SAS code for analysis


The code is adapted with little change from Tabachnick and Fidell (2007), who also explain further tests you can do for your analysis.

proc glm data=whereyoukeepyourdata.dataset;
/*The profile analysis uses proc glm.*/
class levels;

/*In the class statement, levels is the variable that defines your levels.*/
model group1 group2 group3 group4 = levels;
/*The model statement tells SAS what you are testing: here, the four test grouping variables as a function of levels.*/
repeated teststimulus 4 profile; 

 /*This names the test1-4 variables (grouping/subtest/whatever you want to call it), so pick a descriptive name that will make interpreting your results easier.  The number after it (4) is how many groups you have.  The 'profile' turns it into a profile analysis */
title 'profile analysis';

/*This line is not required, but it makes the results easier to find if you are running more than one analysis.*/
run;

 /*runs the test*/
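For comparison, here is a minimal sketch of what I believe is the equivalent repeated-measures MANOVA in R, run on the data.columns dataset from step 1 using the car package.  This is my own attempt at a matching model, not the linked site's code, and as I said above, I haven't yet gotten the flatness test to agree between the two programs.

library(car) #install.packages("car") if you don't have it
#Multivariate linear model: the four group columns as responses, levels as the predictor.
mlm<-lm(cbind(group1, group2, group3, group4)~levels, data=data.columns)
#idata/idesign describe the within-subject factor (the repeated measurements).
idata<-data.frame(teststimulus=factor(c("group1", "group2", "group3", "group4")))
fit<-Anova(mlm, idata=idata, idesign=~teststimulus, type="III")
summary(fit, multivariate=TRUE)
#teststimulus = flatness, teststimulus:levels = parallelism,
#and the between-subjects levels term = levels test.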


Here is the output.  (In the original post I highlighted the flatness test in gray, parallelism in blue, and levels in yellow; in the plain-text listing below, these are the teststimulus effect, the teststimulus*levels effect, and the between-subjects levels effect, respectively.)

MANOVA Test Criteria and Exact F Statistics for the Hypothesis of no teststimulus Effect
                             H = Type III SSCP Matrix for teststimulus
                                       E = Error SSCP Matrix

                                        S=1    M=0.5    N=2

          Statistic                        Value    F Value    Num DF    Den DF    Pr > F

          Wilks' Lambda               0.37200666       3.38         3         6    0.0955
          Pillai's Trace              0.62799334       3.38         3         6    0.0955
          Hotelling-Lawley Trace      1.68812389       3.38         3         6    0.0955
          Roy's Greatest Root         1.68812389       3.38         3         6    0.0955



MANOVA Test Criteria and Exact F Statistics for the Hypothesis of no teststimulus*levels Effect

                         H = Type III SSCP Matrix for teststimulus*levels
                                       E = Error SSCP Matrix

                                        S=1    M=0.5    N=2

          Statistic                        Value    F Value    Num DF    Den DF    Pr > F

          Wilks' Lambda               0.37476423       3.34         3         6    0.0975
          Pillai's Trace              0.62523577       3.34         3         6    0.0975
          Hotelling-Lawley Trace      1.66834432       3.34         3         6    0.0975
          Roy's Greatest Root         1.66834432       3.34         3         6    0.0975


The GLM Procedure
Repeated Measures Analysis of Variance

Tests of Hypotheses for Between Subjects Effects

        Source                      DF     Type III SS     Mean Square    F Value    Pr > F

        levels                       1     57.34512196     57.34512196      19.96    0.0021
        Error                        8     22.98499258      2.87312407


You then use the numbers from each test to calculate your effect sizes.


Flatness:
η² = 1 - Λ
η² = 1 - 0.37200666
η² = 0.6279933

Parallelism:
partial η² = 1 - Λ^(1/2)
partial η² = 1 - 0.37476423^(1/2)
partial η² = 0.3878201

Levels:
η² = SS between groups / (SS between groups + SS within groups)
η² = 57.34512196 / (57.34512196 + 22.98499258)
η² = 0.7138683
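If you'd rather not do that arithmetic by hand, the same three calculations are a few lines in R.  (The Λ and sums-of-squares values are copied from the SAS output above.)

lambda.flatness<-0.37200666
lambda.parallelism<-0.37476423
ss.between<-57.34512196
ss.within<-22.98499258

(eta2.flatness<-1-lambda.flatness) #0.6279933
(eta2.parallelism<-1-lambda.parallelism^(1/2)) #0.3878201
(eta2.levels<-ss.between/(ss.between+ss.within)) #0.7138683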

3. R code for figure


This is based on the U. Florida code-examples page (the linked site) that I mentioned above.

#First take the data.columns and add some information on any symbols or color scheme.
#Here I'm coloring the two levels and
#adding line types (even-numbered individuals are dashed, odd are solid).

data.columns$levels.color<-data.columns$levels
data.columns$levels.color<-sub("A","blue",data.columns$levels.color)
data.columns$levels.color<-sub("B","orange",data.columns$levels.color)
data.columns$additional.linetype<-rep(c("solid", "dashed"), 5)


figure.profile<-data.columns[,c("group1",
"group2", "group3", "group4",
"additional.linetype", "levels.color")]
#Make a smaller data frame with just the variables needed, using the melted dataset from step 1.
#I also include two additional variables for graphing: linetype and color.
#You can code these into a new column using your factors from the original dataset.


#Some objects need to be calculated for later plotting.
(p<-ncol(figure.profile[,1:4]))
#This counts number of subscores/groups (test stimuli),
#but only the first four columns because the last two are not the data.

#Now you can create the main body of the figure.
matplot(1:p, t(as.matrix(figure.profile[,1:4])),
#1:p counts the groups, here used as the number of positions on the x-axis.
#t() transposes the matrix (created from the data frame by as.matrix),
#and the column selection [,1:4] makes sure you only select the data columns
#and not our symbol-information columns.
type="l", lwd=1, # says this is a line graph (type="l") with line width of 1.
lty=as.character(figure.profile$additional.linetype),
#the line type is given by a variable in my dataset.
#In this case, it is not part of the profile analysis,
#but is there just for visual interpretation.
#You could also use different line types to represent
#the different levels instead of color.  
#That would be particularly good
#if you need a black-and-white figure for publication.
#as.character is needed so the line types are read correctly.
col=as.character(figure.profile$levels.color), 
#Again, another symbol defined by a column in the dataset.
#I needed as.character to make sure it reads the factor as 
#characters, otherwise it (sometimes) gives an error
#depending on the data type in the column.


xaxt="n", #no x-axis (you'll be adding this manually soon)
main="Profile analysis",
xlab="Subtest/the repeated measurement/groups",

ylab="Dependent variable")

axis(1, 1:p, colnames(figure.profile[,1:4]))
# Add x-axis labels.
#1 is the side (the bottom axis), 1:p gives the tick positions,
#and colnames takes the labels from the dataset's grouping columns.
#You could also replace colnames(figure.profile[,1:4]) with your own vector of labels
#if your column names aren't tidy enough:
#axis(1, 1:p, c("Group 1", "Group 2", "Group 3", "Group 4"))



(x.bar <- by(figure.profile[,1:4], data.columns$levels, colMeans))

#Take the mean for each level across the columns 1:4 (the groups) of the matrix from above.
#The original code says to use mean but mean only works with vectors;
#colMeans works on a dataframe, which is what we have.
lines(1:p, x.bar[[1]], lwd=3, col="blue")
lines(1:p, x.bar[[2]], lwd=3, col="orange")


#These have thicker line widths (lwd=3 instead of 1) to make them stand out
#from the lines that we plotted to represent each sample.

legend("topleft", c("A", "B"), col=c("blue","orange"), lwd=3, title="Levels")
#If appropriate, add a legend to some position (here top left, but you can change this).


You now have significance tests for parallelism, levels, and flatness (reported below), and a clear figure showing the levels and groups for each sample and the averages.  (Your data will turn out a little different because the dataset was randomly generated in R, though the approximate differences should be the same since we set the means and standard deviations at the same values each time.)

There is no significant interaction between levels and groups (parallelism test) (Wilks' Λ=0.37, df=3,6, p=0.10, partial η²=0.39).  There is no significant difference between responses to the test groups (flatness test) (Wilks' Λ=0.37, df=3,6, p=0.10, η²=0.63).  There is a significant difference between levels (difference in levels test) (F1,8=19.96, p=0.002, η²=0.71).


Monday, January 13, 2014

Melting your data with FIRE!

Actually, it's just with reshape2, a package for restructuring your data.  This is handy if you have an analysis that requires data in a format other than how you have it, or if you are given data in a format not to your liking.  So far my main use of this has been to shift my data for profile analysis (multivariate repeated measures).

First, let's generate a dataset.

groups<-rep(c("group1","group2","group3","group4"),10)
individual<-rep(1:10,each=4)
levels<-rep(c("A","B"),each=20)
response<-c((rnorm(20, mean = 2.3, sd = 1)),rnorm(20, mean = 4.7, sd = 2.3))

#Generating some random normal distributions using rnorm.
#That first number is the length (i.e., how many numbers you want in each set).
(original.data<-data.frame(individual,levels,groups, response))

#So, here's how I have my data to begin with:


#   individual levels groups   response
#1           1      A group1  3.9294112
#2           1      A group2  1.7416195
#3           1      A group3  3.2474873
# etc. to row 40 (40 observations of 10 individuals with
#4 observations each and 1 observation per group).


#Use str(original.data) to check the structure: response should be numeric,
#and levels and groups should be factors (individual will be an integer, which is fine here).
str(original.data) #yep!

#And here's what I want it to look like:


#   individual levels    group1    group2   group3   group4
#1           1      A  2.510969 2.6601334 2.968813 4.294844
#2           2      A  3.096240 2.5438054 2.316189 2.561755
#3           3      A  2.641018 1.7682642 1.540212 1.674456
#4           4      A  3.720303 1.7542154 2.829152 3.165893
#5           5      A  2.339273 0.7974890 3.084249 3.243764
#6           6      B  6.960439 8.5711567 5.805974 7.056487
#7           7      B  5.642962 4.5452340 6.922783 5.644429
#8           8      B  4.449354 6.0973501 5.214315 3.128638
#9           9      B  5.358301 5.5186123 6.792033 5.655760
#10         10      B -2.564175 0.8425047 7.772034 5.613354

 
#Start the reshape2 package.
library(reshape2)

melt.data<-melt(original.data, id.vars=c("levels", "groups", "individual"),
  measure.vars=c("response")) #the measured variable is response
melt.data #show the data



#   levels groups individual variable      value
#1       A group1          1 response  0.8034790
#2       A group2          1 response  2.7244518
#3       A group3          1 response  2.0623880
#Etc. on to 40 rows.

melt.data$variable<-melt.data$groups
#Set the variable here.  "groups" is the factor that we want to make up
#our four new columns.


melt.data
#Here is what melt.data looks like now:


#   levels groups individual variable      value
#1       A group1          1   group1  0.8034790
#2       A group2          1   group2  2.7244518
#3       A group3          1   group3  2.0623880
#Etc. on to 40 rows.
  
data.columns<-dcast(melt.data, individual+levels~variable,mean)
#List the factors that do NOT become the new columns.
#Because we want groups to be the new columns, you list only individual and levels.
#Then after the ~ put "variable", and give mean as the aggregation function:
#you want the mean of this measurement for each combination of individual and levels.
#In this example, you just have one observation for each group in each individual in each level,
#so it should be the same as your original data.
#If you took more than one observation for each individual in each level and each group,
#and didn't include it in the list with individual and levels,
#using mean would allow you to average across those.
#This might happen if you took multiple observations and hadn't averaged them yet,
#or had an additional factor that you are not examining in this analysis.

#Use dcast if you want a data frame as the resulting object;
#use acast if you want a matrix or vector.
#("cast" is the old function from the original reshape package;
#with the old function you had to specify data.frame if you wanted a data frame).
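#To see the averaging described above in action, here is a small
#hypothetical illustration: stack two copies of melt.data so each
#individual+levels+group combination has two observations, then shift
#the copies so the two observations differ.
doubled<-rbind(melt.data, melt.data)
doubled$value[41:80]<-doubled$value[41:80]+1 #make the copies differ
dcast(doubled, individual+levels~variable, mean)
#Each cell is now the mean of the two observations (original value + 0.5).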


data.columns
#shows your dataset, now with measurements for each individual in a row,

#with one column each for group1, group2, group3, and group4
#instead of classified by factors in the old "groups" column.

#It's just like we wanted at the beginning!


#Did you start with data in the column form?
#Or do you want to put your data back the way it was,
#but now as means for each combination of factors?
#For example, one use I've found for this is to
#put my data in the multi-column form, use na.omit
#to get rid of any individuals with missing data,
#and then convert back to long form (a sketch of that is below).
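#A sketch of the na.omit step (hypothetical here, since our sample dataset
#has no missing values, so na.omit returns it unchanged):
data.complete<-na.omit(data.columns) #drops any row (individual) containing an NA
#You could then melt data.complete as shown next.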
 
data.longform<-data.frame(melt(data.columns, id=c("individual","levels")))
data.longform #view the dataset back in a long one-column form.



#   individual levels variable      value
#1           1      A   group1  2.5109692
#2           2      A   group1  3.0962398
#3           3      A   group1  2.6410184
#4           4      A   group1  3.7203027

#Etc. until row 40 (end of data).

Friday, January 10, 2014

Calculating effect sizes using the compute.es mes() function

I wanted to make a table of effect sizes for a series of comparisons among three groups.  I use the package compute.es to do this.  If you have a lot of comparisons, it'd probably be easier to do a loop (see the sketch at the end of this post), but this is what I have for now.  Of course, you can modify this so you are only comparing two groups as well, or compare more than three groups.




library(compute.es)

#Sample data on which to test the code.

group1<-c(0,1,2,2,3,5,2)
group2<-c(2,4,6,5,7,8,9)
group3<-c(2,3,4,5,6,7,8)

#Calculate the necessary mean, standard deviation, and sample size for all three datasets.
#You could have your data in columns, too; this is just how I set up the sample data.

m1<-mean(group1)
sd1<-sd(group1)
l1<-length(group1)

m2<-mean(group2)
sd2<-sd(group2)
l2<-length(group2)

m3<-mean(group3)
sd3<-sd(group3)
l3<-length(group3)


#The objects you've created will be used as follows:
#mes(meanA, meanB, standarddeviationA, standarddeviationB, samplesizeA, samplesizeB)

mes(m1,m2,sd1,sd2,l1,l2) #compares group1 and group2
mes(m1,m3,sd1,sd3,l1,l3) #compares group1 and group3
mes(m2,m3,sd2,sd3,l2,l3) #compares group2 and group3

Results for the last comparison (group2 vs. group3) look like this:


    EFFECT SIZE CALCULATION (FOR SINGLE INPUT)

Mean Differences ES: 
 
 d [ 95 %CI] = 0.37 [ -0.8 , 1.55 ] 
  var(d) = 0.29 
  p-value(d) = 0.5 
  U3(d) = 64.6 % 
  CLES(d) = 60.44 % 
  Cliff's Delta = 0.21 
 
 g [ 95 %CI] = 0.35 [ -0.75 , 1.45 ] 
  var(g) = 0.25 
  p-value(g) = 0.5 
  U3(g) = 63.71 % 
  CLES(g) = 59.79 % 
 
 Correlation ES: 
 
 r [ 95 %CI] = 0.18 [ -0.44 , 0.69 ] 
  var(r) = 0.07 
  p-value(r) = 0.55 
 
 z [ 95 %CI] = 0.19 [ -0.47 , 0.84 ] 
  var(z) = 0.09 
  p-value(z) = 0.55 
 
 Odds Ratio ES: 
 
 OR [ 95 %CI] = 1.97 [ 0.23 , 16.61 ] 
  p-value(OR) = 0.5 
 
 Log OR [ 95 %CI] = 0.68 [ -1.45 , 2.81 ] 
  var(lOR) = 0.96 
  p-value(Log OR) = 0.5 
 
 Other: 
 
 NNT = 4.79 
 Total N = 14
 



For my application, I used Cohen's d (the first entry, d, in the output above).  For additional information on effect sizes (not just Cohen's d), Cohen (1992) is a very useful reference.
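As for the loop I mentioned at the start, here is one sketch of how it might look, using combn() to generate every pairwise comparison of the three sample groups above.  (The all.groups list is just scaffolding I made up for this example.)

all.groups<-list(group1=group1, group2=group2, group3=group3)
pairs<-combn(names(all.groups), 2) #a 2-row matrix of group-name pairs
for (i in 1:ncol(pairs)) {
  a<-all.groups[[pairs[1,i]]]
  b<-all.groups[[pairs[2,i]]]
  cat("\n", pairs[1,i], "vs.", pairs[2,i], "\n")
  mes(mean(a), mean(b), sd(a), sd(b), length(a), length(b))
}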

Monday, January 06, 2014

Pretty colors

[Photos: Black-crested Titmouse (Baeolophus atricristatus), hybrid index = 6, and Tufted Titmouse (Baeolophus bicolor), hybrid index = 0.]
I wanted a grayscale color ramp to use in my figures to color symbols to match the hybrid index used for my two study species.  This gray ramp should end up having black for Black-crested Titmouse (hybrid index of 6) and white for Tufted (hybrid index of 0), with hybrid index values 1-5 represented as shades of gray.

First, in case you want to go back to the way things were, save your default colors before you change them.

(default.colors<-palette()) #saves the color scheme you already have.
#The surrounding parentheses also print it so you can see what you currently have.

#[1] "black"   "red"     "green3"  "blue"    "cyan"    "magenta" "yellow"  "gray"

#Now it's safe to change to what you want.  You can always go back if needed now.

#This function, gray.colors, asks for
#([number of colors], start=[start of gray ramp], end=[ending value of gray ramp],
#alpha=[optional opacity/transparency value, which I did not use in this example]).
#The gray ramp start and end values must be between 0 (black) and 1 (white).
#Here I had mine start with white (because I wanted "1" to equal "white"
#and "7" to equal "black" in the finished ramp).
#Essentially, I've asked for a seven-unit gray scale going from white to black.

(gray<-gray.colors(7, start=1, end=0))

#[1] "#FFFFFF" "#EBEBEB" "#D4D4D4" "#BABABA" "#9B9B9B" "#717171" "#000000"
#These are in hex codes, though some will show up as descriptions later.

palette(gray)
#the argument to palette() is a vector of colors; here we use the gray vector created above.

#Since the hybrid index goes from 0 to 6, if I use the palette without adding 1 to my index values,
#a 0 will be drawn in the background color (i.e., invisible) and no bird will get 7 (black).
#This is easily solved by pretending the hybrid index values are 1 through 7.
#In my example, this can be done with indexsum+1 when I'm inputting color options in the plotting function.
#You can make whatever adjustments are needed to your variable that specifies color
#to get it to match the ramp.  Copy your color variable first and make changes on the copy.

palette()#shows your current palette (either description or hex codes).

#[1] "white"   "gray92"  "gray83"  "gray73"  "#9B9B9B" "#717171" "black"

#I believe it changes the hex codes to named colors where available.
#It will be used until you say otherwise.
#If you want to go back to the original color palette after this section of code is over,
#just use your default that you saved at the start.

palette(default.colors) #return to original color scheme.  You can check this using...
palette() #See, it works!

Posting conventions I'll use for R

Since my main goal here is to provide usable code snippets, I'll try to standardize how I present said code.

I'll put all code in the Courier font.

citation() #the pound sign here is used for commenting in R code,
#so I will comment the code extensively so you can directly copy it into your 
#R script to modify as you need.
#this command in R gives you the citation for the R program.
#You can use it to get citations for individual packages that you have installed, too.
citation("lme4") #this cites the package lme4 if you have it installed.

#The quotes seem to be needed.  If you do not use the quotes around the package name,
#it gives this error message:
#>Error in system.file(package = package, lib.loc = lib.loc) :
#>  object 'lme4' not found

#That brings me to output; I'll use ">" to indicate lines of output when it appears that way in R
#(as I just showed with the error message).  If you ask to see data,
#the output lines start with row names or a bracketed index number.

test.data<-c(0,1,2,3,4)
test.data
#[1] 0 1 2 3 4
#I'll usually comment out the output (and leave all the comments in and around the code)
#so you can copy everything directly into your script and edit the comments as you need
#without having to copy around them.

R.Version() #the caps are important.  R is case-sensitive.

I am currently using RStudio 0.97.314 with R 3.0.1, so my code is written and run under these conditions.  To find out what version you are using (say, mid-session, when the startup message has scrolled out of view), the above function lets you know.


Blog purpose and background


My goal is to provide code samples and other tutorials to help others in their analyses.  Right now my coding is in three programs/languages: R, SAS, and PostgreSQL.  I will also post photos from my field work and other outdoor adventures from time to time.

Where I started for each and what I'd do differently


R


I opened up A Beginner's Guide to R (Use R!) on a rainy day in spring 2012 and spent an afternoon following the examples.  Since then, I've used a combination of Google searches for "what I want to do" + "in R" and a few books.  I'm pretty happy with my approach so far.

Details on how I'll present R code on this blog are here.

Relational databases


I grappled with this one for a while and still do, so I have no good beginner book recommendations other than: stay away from the pretty graphs in Access.  I started with Microsoft Access and those point-and-click queries in ... 2006?  Anyway, point-and-click got frustrating quickly, as it never did quite what I wanted.  I was told to query properly.  So I used JetSQL in Access.  A few years ago, on the advice of a real programmer, I migrated to PostgreSQL and pgAdmin.

What I'd do differently: start with PostgreSQL instead of Access.  It's more complicated, but it's so much easier not to have to convert your database from the partially automated drop-down menus in Access to an easier-to-query format later.  PostgreSQL is also more powerful, which I ran into recently while trying to help a colleague use JetSQL: some things I was used to doing in PostgreSQL were more complex or impossible in JetSQL.  I'd also have read up a little on database structuring and thought out my database structure more.  I've had to do several big restructurings of the architecture since I started.  I think I've finally got it to where I can easily integrate new tables if needed.

Long story short: it's easier and more efficient in the long run if you just go ahead and learn to query using SQL (Structured Query Language).

SAS


I learned SAS using code snippets from a class in 2008 and, since then, deep ventures into the help files.  I am mostly trying to convert to R, so mainly I will post SAS code to compare with R code.  Using Multivariate Statistics (Tabachnick and Fidell 2007) has lots of good SAS code to use, so sometimes it's simplest just to follow their examples if you have that book and a problem it addresses.

Friday, January 03, 2014

Hello, world

I am an evolutionary ecologist learning new ways to perform analyses.  I've found various websites and blogs with sample code to be extremely valuable in conducting my own analyses.  Here, I am freeing some of my code and new-programmer experiences into the wild in hopes that it might help you, as others' sharing has helped me.