Wednesday, October 15, 2014

Using R to work through Sokal and Rohlf's Biometry: Chapter 4 (Descriptive Statistics), sections 4.6-4.9

Previously in this series: Chapter 4 (sections 4.1-4.5 in Sokal and Rohlf's Biometry).  The data for aphid.femur and birthweights are available in that post.

Section 4.6: the range

#This section begins our study of dispersion and spread.
#The range is the difference between the largest and smallest items in a sample,
#so it should be pretty easy to calculate.
max(aphid.femur)-min(aphid.femur) #bottom of pg. 49
max(birthweights$classmark)-min(birthweights$classmark) #pg. 50

#R has a function for this too which gives the minimum and maximum values (not the difference between them).
#If your data contain any NA ("not available") or NaN ("not a number"), and na.rm=FALSE,
#then NA values will be returned, so go ahead and set na.rm=TRUE if you need want to see min and max
#AFTER ignoring your NA/NaN values.


Section 4.7: the standard deviation

#The standard deviation measures a distance from the center of the distribution.
#Table 4.1 calculates the basics using our aphid data.

#The first column is each observed value.
#We need to move our set of numbers into a column,
#which can be done here by simply transforming it into a data frame.
#In more complex cases we might need to use a matrix and set dimensions,
#but let's not worry about it here when we can do it a simpler way.<-data.frame(aphid.femur)

#The deviates are individual observations minus the mean.

#The authors show sums in Table 4.1 as well.


#they sum to 100.1.
#Perhaps summing deviates will get us a measure of dispersion.


#Nope!  As the book points out, this is essentially zero if not for rounding errors.
#Appendix A.1 on pg. 869 shows a nice proof of this.
#We'll now make the absolute value column in Table 4.1.$abs.deviates<-abs($deviates)

#This average deviate is apparently not a very common measure of dispersion though.
#Instead we use, as you might have guessed from the section title,
#standard deviations, which, the authors state, have nifty statistical properties based
#on the sum of squares.  Presumably they will explain said properties in a later chapter.$sqr.deviates<-($deviates)^2

#Variance is the mean of the squared deviates.
#or equivalently,
#The sum of the items divided by the number of the items is the mean,
#hence the mean of the squared deviates.
#The square root of this is the standard deviation.
#It is in original units of measurement
#because we brought it back to that by the square root.

#Bias-corrected variance and standard deviation are apparently needed,
#and are shown in the bottom of Table 4.1 and on pg. 53
#equation 4.6
(sd.aphid<-sqrt(bias.corrected.variance.aphid)) #equation 4.7

#The differences between bias-corrected and uncorrected variance and
#standard deviation will decline as sample size decreases.
#The quantity of n-1 (which we are showing as length(oursample)-1)
#is the degrees of freedom.  The authors state that we are
#only to use n to get uncorrected variance and standard deviation
#if evaluating a parameter (i.e. we have all items from the population).

#There is an additional and even more accurate correction for estimating standard deviation.
#Multiply the standard deviation by the correction factor found in the Statistical Tables book,
#Statistical Table II (pg. 204).  This is Gurland and Tripathi's correction, and approaches 1
#above a sample size of 100.
#R allows easy calculation of both variance and standard devation.
sd(aphid.femur) #standard devation
var(aphid.femur) #variance
#Be careful with var(), as the R function will also work with matrices.
#See ?var for more info.

#On pg. 54 the authors suggest using the midrange to estimate where your mean should be.
#You can use this to see if your mean calculations look correct ("detect gross errors in computation").
#Just average the largest and smallest values.


#On pg. 54, what if you want to get an approximation of mean and standard deviation?
#This is useful to see if your calculated values are about right.
#For the mean, obtain the midrange.

#To estimate the standard deviation, you should be able to use their little table on pg. 54.
#There are 25 aphid.femur samples.  This is closest to 30, so divide the range by 4.
#They also mention Statistical Table I as a source of mean ranges for various sample sizes with a normal distribution.
#The statistical tables come as a separate book: Statistical tables, 4th ed
#(table I is found pp. 62-63).
#That table provides a more accurate method than the pg. 54 Biometry table,
#though they are based on similar assumptions.
#The table is read by selecting your sample size from the vertical and horizontal margins.
#For a sample size of 25, as in the aphid example, select 20 from the vertical margin and
#5 from the horizontal margin.  The mean range in a normal distribution from a sample of
#this size is 3.931.
#The answer is similar to the actual value of 0.3657.

Section 4.8: coding data before computation

#I can't really think of many applications for why you'd use coded averaging with a computer,
#but let's try it anyway.  This box is also on pg. 43 with the non-coding application of averaging.

#The original coding to 0-14 iS made by subtracting the lowest class mark
#and then dividing (by 8, in this case) so that the series is 0-n (pg. 55).

coded.classmarks<-seq(from=0, to=14, by=1)
#We still use the frequencies and sample size as before.

#Sure enough, we get 59629.
#Divide by sample size and get the coded average of 6.300.
#(It's 6.299947 but apparently the book was the one to round.)

#Box 4.2 showed how to calculate mean and dispersion from frequency distributions,
#both coded and uncoded.
#Continue on to coded standard deviation and variance.
(SS.coded<-sum(frequencies*(coded.classmarks-(coded.summing/samplesize))^2)) #sum of squares
(variance.box4.2<-SS.coded/(sum(frequencies)-1)) #variance coded
(<-sqrt(variance.box4.2)) #standard deviation coded

#To decode, refer to Appendix A.2 which discusses multiplicative,
#additive, and combination properties of codes for means and dispersion.
#In the box 4.2 example, to decode the mean, you just reverse the coding
#using algebra.
#The sum of squares, standard deviation, and variance follow slightly different rules,
#per the proofs (pg. 871).  Simple additive coding would not require any changes.
#Here we need to divide by the factor we coded with (1/8) (remember, additive is ignored).
#(Dividing by 1/8 equals multiplying by 8.)

Section 4.9: the coefficient of variation

#The coefficient of variation is the standard deviation as a percentage of the mean.
#This allows comparision of variation in different populations
#which might have larger or smaller absolute values of measurements.


#As for variance and standard deviation, there is a bias correction to use with small samples.

#If you use Statistical Table II and its correction factor to calculate a standard deviation,
#as discussed on the lower part of pg. 53,
#then do not use the corrected coefficient of variation.
#If you look at Statistical table II, you'll see that the correction factor for n>30
#is similar to the correction factor in equation 4.10.

No comments:

Post a Comment

Comments and suggestions welcome.