Claire M. Curry: October 2014

This weekend it was sunny and relatively pleasant (64ºF/18ºC), so I took my faithful canine companion out and about. We went to the Assiniboine Forest which has quite a few kilometers of paved, graveled, and woodchipped trails. Although there were lots of people and other dogs out, it was quieter and more peaceful on the farther trails. I saw a few critters out: two Mourning Cloak butterflies, one sulphur butterfly that flew away before I could photograph it, an American Robin, and several fine Black-capped Chickadees.

Fortunately the trail was not nearly so threatening as the sign implied. Assiniboine Forest, Winnipeg, Manitoba, 19 October 2014.

The dog is for scale. Assiniboine Forest, Winnipeg, Manitoba, 19 October 2014.

Previously in this series: Chapter 4 (sections 4.1-4.5 in Sokal and Rohlf's Biometry). The data for aphid.femur and birthweights are available in that post.

Section 4.6: the range

#This section begins our study of dispersion and spread.
#The range is the difference between the largest and smallest items in a sample,
#so it should be pretty easy to calculate.
max(aphid.femur)-min(aphid.femur) #bottom of pg. 49
max(birthweights$classmark)-min(birthweights$classmark) #pg. 50

#R has a function for this too which gives the minimum and maximum values (not the difference between them).
#If your data contain any NA ("not available") or NaN ("not a number"), and na.rm=FALSE,
#then NA values will be returned, so go ahead and set na.rm=TRUE if you need want to see min and max
#AFTER ignoring your NA/NaN values.

range(aphid.femur)

Section 4.7: the standard deviation

#The standard deviation measures a distance from the center of the distribution.
#Table 4.1 calculates the basics using our aphid data.

#The first column is each observed value.
#We need to move our set of numbers into a column,
#which can be done here by simply transforming it into a data frame.
#In more complex cases we might need to use a matrix and set dimensions,
#but let's not worry about it here when we can do it a simpler way.

aphid.femur.sd<-data.frame(aphid.femur)

#The deviates are individual observations minus the mean.

(aphid.femur.sd$deviates<-aphid.femur.sd$aphid.femur-mean(aphid.femur))
#The authors show sums in Table 4.1 as well.

sum(aphid.femur.sd$aphid.femur)

#they sum to 100.1.
#Perhaps summing deviates will get us a measure of dispersion.

sum(aphid.femur.sd$deviates)

#Nope! As the book points out, this is essentially zero if not for rounding errors.
#Appendix A.1 on pg. 869 shows a nice proof of this.
#We'll now make the absolute value column in Table 4.1.

aphid.femur.sd$abs.deviates<-abs(aphid.femur.sd$deviates)
sum(aphid.femur.sd$abs.deviates)
mean(aphid.femur.sd$abs.deviates)

#This average deviate is apparently not a very common measure of dispersion though.
#Instead we use, as you might have guessed from the section title,
#standard deviations, which, the authors state, have nifty statistical properties based
#on the sum of squares. Presumably they will explain said properties in a later chapter.

aphid.femur.sd$sqr.deviates<-(aphid.femur.sd$deviates)^2
sum(aphid.femur.sd$sqr.deviates)

#Variance is the mean of the squared deviates.
mean(aphid.femur.sd$sqr.deviates)
#or equivalently,
(variance.aphid<-sum(aphid.femur.sd$sqr.deviates)/length(aphid.femur.sd$sqr.deviates))
#The sum of the items divided by the number of the items is the mean,
#hence the mean of the squared deviates.
#The square root of this is the standard deviation.
(sqrt(variance.aphid))
#It is in original units of measurement
#because we brought it back to that by the square root.

#Bias-corrected variance and standard deviation are apparently needed,
#and are shown in the bottom of Table 4.1 and on pg. 53
(bias.corrected.variance.aphid<-sum(aphid.femur.sd$sqr.deviates)/
(length(aphid.femur.sd$sqr.deviates)-1)) #equation 4.6
(sd.aphid<-sqrt(bias.corrected.variance.aphid)) #equation 4.7

#The differences between bias-corrected and uncorrected variance and
#standard deviation will decline as sample size decreases.
#The quantity of n-1 (which we are showing as length(oursample)-1)
#is the degrees of freedom. The authors state that we are
#only to use n to get uncorrected variance and standard deviation
#if evaluating a parameter (i.e. we have all items from the population).

#There is an additional and even more accurate correction for estimating standard deviation.
#Multiply the standard deviation by the correction factor found in the Statistical Tables book,
#Statistical Table II (pg. 204). This is Gurland and Tripathi's correction, and approaches 1
#above a sample size of 100.
#R allows easy calculation of both variance and standard devation.
sd(aphid.femur) #standard devation
var(aphid.femur) #variance
#Be careful with var(), as the R function will also work with matrices.
#See ?var for more info.

#On pg. 54 the authors suggest using the midrange to estimate where your mean should be.
#You can use this to see if your mean calculations look correct ("detect gross errors in computation").
#Just average the largest and smallest values.

mean(range(aphid.femur))

#On pg. 54, what if you want to get an approximation of mean and standard deviation?
#This is useful to see if your calculated values are about right.
#For the mean, obtain the midrange.
(min(aphid.femur)+max(aphid.femur))/2

#To estimate the standard deviation, you should be able to use their little table on pg. 54.
#There are 25 aphid.femur samples. This is closest to 30, so divide the range by 4.
(max(aphid.femur)-min(aphid.femur))/4
#They also mention Statistical Table I as a source of mean ranges for various sample sizes with a normal distribution.
#The statistical tables come as a separate book: Statistical tables, 4th ed
#(table I is found pp. 62-63).
#http://www.amazon.com/Statistical-Tables-F-James-Rohlf/dp/1429240318/ref=sr_1_1?ie=UTF8&qid=1404660488&sr=8-1&keywords=biometry+statistical+tables
#That table provides a more accurate method than the pg. 54 Biometry table,
#though they are based on similar assumptions.
#The table is read by selecting your sample size from the vertical and horizontal margins.
#For a sample size of 25, as in the aphid example, select 20 from the vertical margin and
#5 from the horizontal margin. The mean range in a normal distribution from a sample of
#this size is 3.931.
(max(aphid.femur)-min(aphid.femur))/3.931
#The answer is similar to the actual value of 0.3657.

Section 4.8: coding data before computation

#I can't really think of many applications for why you'd use coded averaging with a computer,
#but let's try it anyway. This box is also on pg. 43 with the non-coding application of averaging.

#The original coding to 0-14 iS made by subtracting the lowest class mark
#and then dividing (by 8, in this case) so that the series is 0-n (pg. 55).

coded.classmarks<-seq(from=0, to=14, by=1)
#We still use the frequencies and sample size as before.
coded.classsums<-coded.classmarks*frequencies
(coded.summing<-sum(coded.classsums))
#Sure enough, we get 59629.
#Divide by sample size and get the coded average of 6.300.
(coded.mean.birthweights<-coded.summing/samplesize)
#(It's 6.299947 but apparently the book was the one to round.)

#Box 4.2 showed how to calculate mean and dispersion from frequency distributions,
#both coded and uncoded.
#Continue on to coded standard deviation and variance.
(SS.coded<-sum(frequencies*(coded.classmarks-(coded.summing/samplesize))^2)) #sum of squares
(variance.box4.2<-SS.coded/(sum(frequencies)-1)) #variance coded
(coded.sd<-sqrt(variance.box4.2)) #standard deviation coded

#To decode, refer to Appendix A.2 which discusses multiplicative,
#additive, and combination properties of codes for means and dispersion.
#In the box 4.2 example, to decode the mean, you just reverse the coding
#using algebra.
(decoded.mean<-8*coded.mean.birthweights+59.5)
#The sum of squares, standard deviation, and variance follow slightly different rules,
#per the proofs (pg. 871). Simple additive coding would not require any changes.
#Here we need to divide by the factor we coded with (1/8) (remember, additive is ignored).
#(Dividing by 1/8 equals multiplying by 8.)
(decoded.sd<-8*coded.sd)

Section 4.9: the coefficient of variation

#The coefficient of variation is the standard deviation as a percentage of the mean.
#This allows comparision of variation in different populations
#which might have larger or smaller absolute values of measurements.

(coefficient.variation.aphid<-sd.aphid*100/mean(aphid.femur))

#As for variance and standard deviation, there is a bias correction to use with small samples.

(corrected.cv.aphid<-coefficient.variation.aphid*(1+(1/(4*(length(aphid.femur))))))
#If you use Statistical Table II and its correction factor to calculate a standard deviation,
#as discussed on the lower part of pg. 53,
#then do not use the corrected coefficient of variation.
#If you look at Statistical table II, you'll see that the correction factor for n>30
#is similar to the correction factor in equation 4.10.

Claire M. Curry

Pages

Friday, October 31, 2014

The taming of the shrew

Monday, October 27, 2014

Most of the leaves are gone

Friday, October 24, 2014

White and pink yarrow in Alberta

Monday, October 20, 2014

Exploring the Assiniboine Forest

Friday, October 17, 2014

More from the summer: prairie fog

Wednesday, October 15, 2014

Using R to work through Sokal and Rohlf's Biometry: Chapter 4 (Descriptive Statistics), sections 4.6-4.9

Section 4.6: the range

Section 4.7: the standard deviation

Section 4.8: coding data before computation

Section 4.9: the coefficient of variation

Monday, October 13, 2014

Happy Canadian Thanksgiving!

Sunday, October 12, 2014

Reviving of the blog

Pages

Friday, October 31, 2014

Monday, October 27, 2014

Friday, October 24, 2014

Monday, October 20, 2014

Friday, October 17, 2014

Wednesday, October 15, 2014

Section 4.6: the range

Section 4.7: the standard deviation

Section 4.8: coding data before computation

Section 4.9: the coefficient of variation

Monday, October 13, 2014

Sunday, October 12, 2014

Subscribe To