## Tuesday, February 18, 2014

### Getting summary statistics for datasets by factor in R

If you have a dataset with factors, you might want to get some descriptive summaries grouped by each factor. This took me a while to figure out in R but turned out to be reasonably simple.

categories<-rep(c("a","b"), 4)
morecategories<-rep(c("this", "that"), each=4)
thing1<-c(1,2,3,4,5,6,7,8)
thing2<-c(2,3,4,5,6,NA,8,9)
thing3<-c(3,4,5,6,7,8,9,12)
data<-data.frame(categories,
morecategories,
thing1,
thing2,
thing3,
stringsAsFactors=TRUE) #R 4.0+ keeps strings as character unless you ask for factors

#The above lines generate a small example dataset.

#View your data to ensure factors are factors, numbers are numeric, and so on.
str(data)

#The first way to get some general summary data is to use the summary() function.
summary(data)

#It doesn't give you anything grouped by your categories ("categories" and "morecategories") though.
#aggregate() will do this.
#I show it here in the formula version, with the function as mean.
#You can also use sd (standard deviation).
aggregate(data$thing2~data$categories, FUN=mean)
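
#A minimal sketch of grouping by both factors at once, assuming the same example data.
#It uses the data= argument of aggregate() so the formula can use bare column names.
#The names dat and bygroup are made up for illustration.

```r
#Rebuild the relevant columns so this block runs on its own.
categories<-rep(c("a","b"), 4)
morecategories<-rep(c("this", "that"), each=4)
thing2<-c(2,3,4,5,6,NA,8,9)
dat<-data.frame(categories, morecategories, thing2)

#Adding a second factor with + gives one mean per combination of the two factors.
bygroup<-aggregate(thing2~categories+morecategories, data=dat, FUN=mean)
bygroup
```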

#An example with a different function (length) and at least one NA in the data.
#Note that the formula version of aggregate() automatically drops the row where thing2 is NA;
#you can tell because length returns 3 for category "b" rather than 4.
#(You can also use length to get the sample size.)
aggregate(data$thing2~data$categories, FUN=length)
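
#A sketch of how the NA handling can be changed, assuming the same thing2 values.
#na.action=na.pass keeps the NA row instead of dropping it.
#The names dat, n.dropped, n.kept, and m.kept are made up for illustration.

```r
#Rebuild the relevant columns so this block runs on its own.
categories<-rep(c("a","b"), 4)
thing2<-c(2,3,4,5,6,NA,8,9)
dat<-data.frame(categories, thing2)

#Default (na.action=na.omit): the NA row is dropped before length runs.
n.dropped<-aggregate(thing2~categories, data=dat, FUN=length)

#na.action=na.pass keeps the NA row, so length counts it.
n.kept<-aggregate(thing2~categories, data=dat, FUN=length, na.action=na.pass)

#With the NA kept, mean needs na.rm=TRUE (passed on to FUN) to return a number.
m.kept<-aggregate(thing2~categories, data=dat, FUN=mean, na.rm=TRUE, na.action=na.pass)

n.dropped
n.kept
m.kept
```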

#What if we need to average the thing columns?
#(I've needed to do this if I take more than one measurement,
#such as north, south, east, and west measurements and then average them.)

#First get the columns you want.
things<-c("thing1", "thing2", "thing3")

#data$avgthing is the new column you are putting your summary into.
#The 'things' object you just created selects columns
#(that's why it goes after the comma within the square brackets;
#before the comma selects rows).
#Use na.rm=TRUE if you have NAs; otherwise any row
#that contains an NA gets NA as its mean.
data$avgthing<-rowMeans(data[,things], na.rm=TRUE)

#Base R has no row-wise counterpart to rowMeans for standard deviation, so use apply.
#MARGIN=1 means you are applying across rows.  (2 would mean columns.)
data$sd.thing<-apply(data[,things], MARGIN=1, sd, na.rm=TRUE)

#View the dataset including the new variables you've generated (mean and standard deviation).

data
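
#A small sketch of what the MARGIN argument to apply changes, assuming the same thing columns.
#The names dat, byrow, bycol, and same are made up for illustration.

```r
#Rebuild the numeric columns so this block runs on its own.
thing1<-c(1,2,3,4,5,6,7,8)
thing2<-c(2,3,4,5,6,NA,8,9)
thing3<-c(3,4,5,6,7,8,9,12)
dat<-data.frame(thing1, thing2, thing3)

#MARGIN=1: one standard deviation per row (8 rows here).
byrow<-apply(dat, MARGIN=1, sd, na.rm=TRUE)

#MARGIN=2: one standard deviation per column (3 columns here).
bycol<-apply(dat, MARGIN=2, sd, na.rm=TRUE)

#apply with mean across rows should match rowMeans; unname() drops any row labels first.
same<-all.equal(unname(apply(dat, MARGIN=1, mean, na.rm=TRUE)),
                unname(rowMeans(dat, na.rm=TRUE)))

length(byrow)
length(bycol)
same
```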