Wednesday, October 15, 2014

Using R to work through Sokal and Rohlf's Biometry: Chapter 4 (Descriptive Statistics), sections 4.6-4.9

Previously in this series: Chapter 4 (sections 4.1-4.5 in Sokal and Rohlf's Biometry).  The data for aphid.femur and birthweights are available in that post.

Section 4.6: the range

#This section begins our study of dispersion and spread.
#The range is the difference between the largest and smallest items in a sample,
#so it should be pretty easy to calculate.
max(aphid.femur)-min(aphid.femur) #bottom of pg. 49
max(birthweights$classmark)-min(birthweights$classmark) #pg. 50

#R has a function for this too, though it gives the minimum and maximum values
#(not the difference between them).
#If your data contain any NA ("not available") or NaN ("not a number") values and na.rm=FALSE,
#then NAs will be returned, so go ahead and set na.rm=TRUE if you want to see the min and max.

range(aphid.femur)
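A quick sketch of the NA behavior described above, using a few made-up values (not part of the aphid data):

```r
x <- c(4.1, NA, 3.6, 4.4)     # hypothetical values with one missing entry
range(x)                      # NA NA, because of the missing value
range(x, na.rm = TRUE)        # 3.6 4.4
diff(range(x, na.rm = TRUE))  # diff() turns range()'s two values into the single range statistic
```

The last line is a handy idiom: `diff()` on `range()`'s output gives the range in the book's sense (max minus min) as one number.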

Section 4.7: the standard deviation

#The standard deviation measures a distance from the center of the distribution.
#Table 4.1 calculates the basics using our aphid data.

#The first column is each observed value.
#We need to move our set of numbers into a column,
#which can be done here by simply transforming it into a data frame.
#In more complex cases we might need to use a matrix and set dimensions,
#but let's not worry about it here when we can do it a simpler way.

aphid.femur.sd<-data.frame(aphid.femur)

#The deviates are individual observations minus the mean.

(aphid.femur.sd$deviates<-aphid.femur.sd$aphid.femur-mean(aphid.femur))
#The authors show sums in Table 4.1 as well.

sum(aphid.femur.sd$aphid.femur)

#The observations sum to 100.1.
#Perhaps summing deviates will get us a measure of dispersion.

sum(aphid.femur.sd$deviates)

#Nope!  As the book points out, this sum is always zero apart from rounding error.
#Appendix A.1 on pg. 869 shows a nice proof of this.
#We'll now make the absolute value column in Table 4.1.

aphid.femur.sd$abs.deviates<-abs(aphid.femur.sd$deviates)
sum(aphid.femur.sd$abs.deviates)
mean(aphid.femur.sd$abs.deviates)

#This average deviate is apparently not a very common measure of dispersion though.
#Instead we use, as you might have guessed from the section title,
#standard deviations, which, the authors state, have nifty statistical properties based
#on the sum of squares.  Presumably they will explain said properties in a later chapter.

aphid.femur.sd$sqr.deviates<-(aphid.femur.sd$deviates)^2
sum(aphid.femur.sd$sqr.deviates)

#Variance is the mean of the squared deviates.
mean(aphid.femur.sd$sqr.deviates)
#or equivalently,
(variance.aphid<-sum(aphid.femur.sd$sqr.deviates)/length(aphid.femur.sd$sqr.deviates))
#The sum of the items divided by the number of the items is the mean,
#hence the mean of the squared deviates.
#The square root of this is the standard deviation.
(sqrt(variance.aphid))
#The standard deviation is in the original units of measurement
#because the square root brings us back to them.

#Bias-corrected variance and standard deviation are apparently needed,
#and are shown in the bottom of Table 4.1 and on pg. 53
(bias.corrected.variance.aphid<-sum(aphid.femur.sd$sqr.deviates)/
(length(aphid.femur.sd$sqr.deviates)-1))
#equation 4.6
(sd.aphid<-sqrt(bias.corrected.variance.aphid)) #equation 4.7

#The differences between bias-corrected and uncorrected variance and
#standard deviation will decline as sample size increases.
#The quantity n-1 (which we are showing as length(oursample)-1)
#is the degrees of freedom.  The authors state that we should
#only use n to get uncorrected variance and standard deviation
#when evaluating a parameter (i.e. we have all items from the population).
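To see that claim about sample size concretely: the uncorrected variance equals the bias-corrected variance times (n-1)/n, and that ratio heads to 1 as n grows.

```r
# Uncorrected variance = bias-corrected variance * (n - 1) / n,
# so the two converge as the sample gets larger
n <- c(5, 25, 100, 1000)
data.frame(n = n, ratio = (n - 1) / n)  # 0.8, 0.96, 0.99, 0.999
```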

#There is an additional and even more accurate correction for estimating standard deviation.
#Multiply the standard deviation by the correction factor found in the Statistical Tables book,
#Statistical Table II (pg. 204).  This is Gurland and Tripathi's correction, and approaches 1
#above a sample size of 100.
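Statistical Table II isn't reproduced here, but if (an assumption on my part) its correction factor is the standard normal-theory unbiasing constant often written c4, it can be computed directly from the gamma function rather than looked up:

```r
# c4(n) is the expected value of s/sigma for normal samples of size n;
# multiplying sd() by 1/c4 removes the bias.  Assumption: this matches
# the Gurland and Tripathi factor tabulated in Statistical Table II.
c4 <- function(n) sqrt(2 / (n - 1)) * gamma(n / 2) / gamma((n - 1) / 2)
1 / c4(25)   # roughly 1.01 for our n = 25 aphid sample
1 / c4(100)  # close to 1, consistent with the text above
```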
#R allows easy calculation of both variance and standard deviation.
sd(aphid.femur) #standard deviation
var(aphid.femur) #variance
#Be careful with var(), as the R function will also work with matrices.

#On pg. 54 the authors suggest using the midrange to estimate where your mean should be.
#You can use this to see if your mean calculations look correct ("detect gross errors in computation").
#Just average the largest and smallest values.

mean(range(aphid.femur))

#On pg. 54, what if you want to get an approximation of mean and standard deviation?
#This is useful to see if your calculated values are about right.
#For the mean, obtain the midrange.
(min(aphid.femur)+max(aphid.femur))/2

#To estimate the standard deviation, you should be able to use their little table on pg. 54.
#There are 25 observations in aphid.femur.  This is closest to 30, so divide the range by 4.
(max(aphid.femur)-min(aphid.femur))/4
#They also mention Statistical Table I as a source of mean ranges for various sample sizes with a normal distribution.
#The statistical tables come as a separate book: Statistical tables, 4th ed
#(table I is found pp. 62-63).
#http://www.amazon.com/Statistical-Tables-F-James-Rohlf/dp/1429240318/ref=sr_1_1?ie=UTF8&qid=1404660488&sr=8-1&keywords=biometry+statistical+tables
#That table provides a more accurate method than the pg. 54 Biometry table,
#though they are based on similar assumptions.
#The table is read by selecting your sample size from the vertical and horizontal margins.
#For a sample size of 25, as in the aphid example, select 20 from the vertical margin and
#5 from the horizontal margin.  The mean range in a normal distribution from a sample of
#this size is 3.931.
(max(aphid.femur)-min(aphid.femur))/3.931
#The answer is similar to the actual value of 0.3657.

Section 4.8: coding data before computation

#I can't really think of many reasons to use coded averaging with a computer,
#but let's try it anyway.  This box also appears on pg. 43 with the non-coded application of averaging.

#The original coding to 0-14 is made by subtracting the lowest class mark
#and then dividing (by 8, in this case) so that the series runs from 0 to n (pg. 55).
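Rather than typing 0-14 by hand, the coding can be derived from the class marks themselves. This assumes the birthweights class marks as set up in the earlier post (59.5 to 171.5 oz in 8-oz steps):

```r
# Class marks from the earlier post's birthweight frequency table
classmarks <- seq(from = 59.5, to = 171.5, by = 8)
# Subtract the lowest class mark, then divide by the class interval
(coded.from.marks <- (classmarks - min(classmarks)) / 8)  # 0 1 2 ... 14
```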

coded.classmarks<-seq(from=0, to=14, by=1)
#We still use the frequencies and sample size as before.
coded.classsums<-coded.classmarks*frequencies
(coded.summing<-sum(coded.classsums))

#Sure enough, we get 59629.
#Divide by sample size and get the coded average of 6.300.
(coded.mean.birthweights<-coded.summing/samplesize)
#(It's 6.299947 in R; the book rounds to 6.300.)

#Box 4.2 showed how to calculate mean and dispersion from frequency distributions,
#both coded and uncoded.
#Continue on to coded standard deviation and variance.
(SS.coded<-sum(frequencies*(coded.classmarks-(coded.summing/samplesize))^2)) #sum of squares
(variance.box4.2<-SS.coded/(sum(frequencies)-1)) #variance coded
(coded.sd<-sqrt(variance.box4.2)) #standard deviation coded

#To decode, refer to Appendix A.2 which discusses multiplicative,
#additive, and combination properties of codes for means and dispersion.
#In the box 4.2 example, to decode the mean, you just reverse the coding
#using algebra.
(decoded.mean<-8*coded.mean.birthweights+59.5)
#The sum of squares, standard deviation, and variance follow slightly different rules,
#per the proofs (pg. 871).  Simple additive coding would not require any changes,
#but here we also need to undo the multiplicative coding: divide by the factor
#we coded with (1/8), which is the same as multiplying by 8 (the additive part is ignored).
(decoded.sd<-8*coded.sd)
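As a check on the decoding rules, here is the whole Box 4.2 round trip in one self-contained piece (frequencies and class marks reproduced from the earlier post). The decoded mean reverses both coding steps, while the variance picks up the square of the multiplicative factor, and the answers match working directly with the uncoded class marks:

```r
# Birthweight frequency table from the earlier post
frequencies <- c(2, 6, 39, 385, 888, 1729, 2240, 2007, 1233, 641,
                 201, 74, 14, 5, 1)
classmarks <- seq(59.5, 171.5, by = 8)
coded <- (classmarks - 59.5) / 8
n <- sum(frequencies)                        # 9465
coded.mean <- sum(coded * frequencies) / n   # ~6.2999
coded.var <- sum(frequencies * (coded - coded.mean)^2) / (n - 1)
# Decode: mean reverses both steps; dispersion ignores the additive step,
# so variance scales by 8^2 and standard deviation by 8
decoded.mean <- 8 * coded.mean + 59.5        # ~109.9 oz
decoded.var <- 8^2 * coded.var
decoded.sd <- 8 * sqrt(coded.var)
# Same answers as computing directly from the uncoded class marks:
direct.mean <- sum(classmarks * frequencies) / n
direct.var <- sum(frequencies * (classmarks - direct.mean)^2) / (n - 1)
```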

Section 4.9: the coefficient of variation

#The coefficient of variation is the standard deviation as a percentage of the mean.
#This allows comparison of variation in different populations
#which might have larger or smaller absolute values of measurements.

(coefficient.variation.aphid<-sd.aphid*100/mean(aphid.femur))
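To see why this is useful, a toy comparison with hypothetical numbers (not from the book): two samples on very different absolute scales can still be compared on relative variability.

```r
# Hypothetical measurements, purely for illustration
small.things <- c(9.8, 10.1, 10.4, 9.7, 10.0)
big.things <- c(980, 1010, 1040, 970, 1000)
cv <- function(x) 100 * sd(x) / mean(x)
cv(small.things)  # same CV despite a hundredfold difference in scale,
cv(big.things)    # since big.things is just small.things scaled up
```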

#As with variance and standard deviation, there is a bias correction to use with small samples.

(corrected.cv.aphid<-coefficient.variation.aphid*(1+(1/(4*(length(aphid.femur))))))
#If you use Statistical Table II and its correction factor to calculate a standard deviation,
#as discussed on the lower part of pg. 53,
#then do not use the corrected coefficient of variation.
#If you look at Statistical table II, you'll see that the correction factor for n>30
#is similar to the correction factor in equation 4.10.