If you have a dataset with factors, you might want to get some descriptive summaries grouped by each factor. This took me a while to figure out in R but turned out to be reasonably simple.

categories<-rep(c("a","b"), 4)

morecategories<-rep(c("this", "that"), each=4)

thing1<-c(1,2,3,4,5,6,7,8)

thing2<-c(2,3,4,5,6,NA,8,9)

thing3<-c(3,4,5,6,7,8,9,12)

adifferentone<-c(10,1,20,4,19,2,34,1)

data<-data.frame(categories,
                 morecategories,
                 thing1,
                 thing2,
                 thing3,
                 adifferentone,
                 stringsAsFactors=TRUE) #In R 4.0.0 and later, text columns only become factors if you ask for it.

#The above lines generate a small example dataset.

#Check the structure of your data to make sure factors are factors, numbers are numeric, and so on.

str(data)
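
#A minimal sketch of what to do if str() reports a grouping column as chr instead of Factor

#(the default since R 4.0.0 unless stringsAsFactors=TRUE is set).

#'demo' is a made-up data frame for illustration only, not part of the dataset above.

demo<-data.frame(group=c("a","b","a","b"), value=c(1,2,3,4), stringsAsFactors=FALSE)

str(demo)

#Convert the character column to a factor, then check again.

demo$group<-factor(demo$group)

str(demo)

levels(demo$group)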

#The first way to get some general summary data is to use the summary() function.

summary(data)

#It doesn't give you anything grouped by your categories ("categories" and "morecategories") though.

#aggregate() will do this.

#I show it here in the formula version, with the function as mean.

#You can also use sd (standard deviation).

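#Passing data= lets you use bare column names in the formula, which also keeps the output column names readable.

aggregate(adifferentone~categories+morecategories, data=data, FUN=mean)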
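
#A sketch of the sd version mentioned above, plus one possible way to get mean and sd in a single call.

#'demo' is a small stand-in data frame (same values as the example data) so this runs on its own.

demo<-data.frame(categories=rep(c("a","b"), 4), adifferentone=c(10,1,20,4,19,2,34,1))

#data= tells aggregate where to find the columns named in the formula.

sd.by.group<-aggregate(adifferentone~categories, data=demo, FUN=sd)

sd.by.group

#FUN can return more than one value; the result is then a matrix column inside the data frame.

both<-aggregate(adifferentone~categories, data=demo, FUN=function(x) c(mean=mean(x), sd=sd(x)))

both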

#An example with a different formula and at least one NA in the data.

#Note that the formula version drops rows containing an NA by default (na.action=na.omit),

#so the mean for "b" is computed from three values instead of four.

aggregate(thing2~categories, data=data, FUN=mean)

#You can confirm this with the length function, which also gives you the sample size for each group.

aggregate(thing2~categories, data=data, FUN=length)
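
#A sketch for checking the NA handling yourself; 'demo' is a stand-in data frame with the same thing2 values.

demo<-data.frame(categories=rep(c("a","b"), 4), thing2=c(2,3,4,5,6,NA,8,9))

#Default (na.action=na.omit): the row with the NA is dropped before FUN is called.

n.dropped<-aggregate(thing2~categories, data=demo, FUN=length)

n.dropped

#na.action=na.pass keeps that row, so length counts it.

n.kept<-aggregate(thing2~categories, data=demo, FUN=length, na.action=na.pass)

n.kept

#With na.pass, mean needs na.rm=TRUE (extra arguments are passed on to FUN).

mean.kept<-aggregate(thing2~categories, data=demo, FUN=mean, na.action=na.pass, na.rm=TRUE)

mean.kept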

#What if we need to average the thing columns?

#(I've needed to do this if I take more than one measurement,

#such as north, south, east, and west measurements and then average them.)

#First get the columns you want.

things<-c("thing1", "thing2", "thing3")

#data$avgthing is the new column you are putting your summary into.

#The 'things' object you just created selects columns

#(that's why it goes after the comma within the square brackets;

#before the comma selects rows.)

#Use na.rm=TRUE if you have NAs; otherwise it'll

#give you an NA for the whole row that contains the NA.

data$avgthing<-rowMeans(data[,things], na.rm=TRUE)
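
#A quick check of what na.rm changes, using a made-up two-row matrix 'm' (not part of the dataset).

m<-rbind(c(1,2,3), c(4,NA,8))

#Without na.rm, the second row comes back as NA.

with.na<-rowMeans(m)

with.na

#With na.rm=TRUE, the second row is the mean of 4 and 8 only.

without.na<-rowMeans(m, na.rm=TRUE)

without.na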

#Base R has no row-wise standard deviation function to match rowMeans, so use apply for that.

#MARGIN=1 means you are applying the function to each row. (2 would mean each column.)

data$sd.thing<-apply(data[,things], MARGIN=1, sd, na.rm=TRUE)
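
#A small sketch of the two MARGIN settings on a made-up matrix 'm' with 2 rows and 3 columns.

m<-rbind(c(1,3,5), c(2,4,9))

#MARGIN=1: one standard deviation per row, so the result has 2 values.

row.sd<-apply(m, MARGIN=1, sd)

row.sd

#MARGIN=2: one standard deviation per column, so the result has 3 values.

col.sd<-apply(m, MARGIN=2, sd)

col.sd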

#View the dataset, including the new variables you've generated (mean and standard deviation).

data

