### Section 4.6: the range

#This section begins our study of dispersion and spread.

#The range is the difference between the largest and smallest items in a sample,

#so it should be pretty easy to calculate.

max(aphid.femur)-min(aphid.femur) #bottom of pg. 49

max(birthweights$classmark)-min(birthweights$classmark) #pg. 50

#R has a function for this too, which gives the minimum and maximum values (not the difference between them).

#If your data contain any NA ("not available") or NaN ("not a number") values, and na.rm=FALSE,

#then NAs will be returned, so set na.rm=TRUE if you want to see the min and max

#AFTER ignoring your NA/NaN values.

range(aphid.femur)
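
#A quick demonstration of the na.rm behavior with a toy vector (made-up values, not the book's data):

(with.missing<-c(3.8, 3.6, NA, 4.3))

range(with.missing) #returns NA NA because of the missing value

range(with.missing, na.rm=TRUE) #min and max after dropping the NA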

### Section 4.7: the standard deviation

#The standard deviation measures a distance from the center of the distribution.

#Table 4.1 calculates the basics using our aphid data.

#The first column is each observed value.

#We need to move our set of numbers into a column,

#which can be done here by simply transforming it into a data frame.

#In more complex cases we might need to use a matrix and set dimensions,

#but let's not worry about it here when we can do it a simpler way.

aphid.femur.sd<-data.frame(aphid.femur)

#The deviates are individual observations minus the mean.

(aphid.femur.sd$deviates<-aphid.femur.sd$aphid.femur-mean(aphid.femur))

#The authors show sums in Table 4.1 as well.

sum(aphid.femur.sd$aphid.femur)

#They sum to 100.1.

#Perhaps summing deviates will get us a measure of dispersion.

sum(aphid.femur.sd$deviates)

#Nope! As the book points out, this is essentially zero if not for rounding errors.

#Appendix A.1 on pg. 869 shows a nice proof of this.

#We'll now make the absolute value column in Table 4.1.

aphid.femur.sd$abs.deviates<-abs(aphid.femur.sd$deviates)

sum(aphid.femur.sd$abs.deviates)

mean(aphid.femur.sd$abs.deviates)

#This average deviate is apparently not a very common measure of dispersion though.

#Instead we use, as you might have guessed from the section title,

#standard deviations, which, the authors state, have nifty statistical properties based

#on the sum of squares. Presumably they will explain said properties in a later chapter.

aphid.femur.sd$sqr.deviates<-(aphid.femur.sd$deviates)^2

sum(aphid.femur.sd$sqr.deviates)

#Variance is the mean of the squared deviates.

mean(aphid.femur.sd$sqr.deviates)

#or equivalently,

(variance.aphid<-sum(aphid.femur.sd$sqr.deviates)/length(aphid.femur.sd$sqr.deviates))

#The sum of the items divided by the number of the items is the mean,

#hence the mean of the squared deviates.

#The square root of this is the standard deviation.

(sqrt(variance.aphid))

#It is in original units of measurement

#because we brought it back to that by the square root.

#Bias-corrected variance and standard deviation are apparently needed,

#and are shown in the bottom of Table 4.1 and on pg. 53

(bias.corrected.variance.aphid<-sum(aphid.femur.sd$sqr.deviates)/

(length(aphid.femur.sd$sqr.deviates)-1)) #equation 4.6

(sd.aphid<-sqrt(bias.corrected.variance.aphid)) #equation 4.7

#The differences between bias-corrected and uncorrected variance and

#standard deviation will decline as sample size increases.

#The quantity n-1 (which we are showing as length(oursample)-1)

#is the degrees of freedom. The authors state that we are

#only to use n to get uncorrected variance and standard deviation

#if evaluating a parameter (i.e. we have all items from the population).
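
#Note that R's var() and sd() always use n-1. If you really do have the entire population,

#you can rescale to the uncorrected version -- a toy check (made-up values, not the book's):

pop<-c(3.8, 3.6, 4.3, 3.5) #pretend this is a whole population

n<-length(pop)

var(pop)*(n-1)/n #uncorrected (parametric) variance

mean((pop-mean(pop))^2) #same thing, computed directly as the mean squared deviate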

#There is an additional and even more accurate correction for estimating standard deviation.

#Multiply the standard deviation by the correction factor found in the Statistical Tables book,

#Statistical Table II (pg. 204). This is Gurland and Tripathi's correction, and approaches 1

#above a sample size of 100.
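
#If you don't have the tables handy, the standard c4 constant (my substitution -- Biometry only

#cites the table) gives a closed-form factor built from the gamma function:

c4<-function(n) sqrt(2/(n-1))*exp(lgamma(n/2)-lgamma((n-1)/2))

1/c4(25) #correction factor for n=25; multiply the standard deviation by this

1/c4(100) #nearly 1, matching the behavior described above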

#R allows easy calculation of both variance and standard deviation.

sd(aphid.femur) #standard deviation

var(aphid.femur) #variance

#Be careful with var(), as the R function will also work with matrices.

#See ?var for more info.
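
#For example, given a two-column matrix, var() returns a 2x2 covariance matrix

#rather than a single number (toy data, not the book's):

m<-cbind(x=c(1,2,3,4,5), y=c(2,4,6,8,10))

var(m) #variances on the diagonal, covariances off it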

#On pg. 54 the authors suggest using the midrange to estimate where your mean should be.

#You can use this to see if your mean calculations look correct ("detect gross errors in computation").

#Just average the largest and smallest values.

mean(range(aphid.femur))

#On pg. 54, what if you want to get an approximation of mean and standard deviation?

#This is useful to see if your calculated values are about right.

#For the mean, obtain the midrange.

(min(aphid.femur)+max(aphid.femur))/2

#To estimate the standard deviation, you should be able to use their little table on pg. 54.

#There are 25 observations in aphid.femur. This is closest to 30, so divide the range by 4.

(max(aphid.femur)-min(aphid.femur))/4

#They also mention Statistical Table I as a source of mean ranges for various sample sizes with a normal distribution.

#The statistical tables come as a separate book: Statistical tables, 4th ed

#(table I is found pp. 62-63).

#http://www.amazon.com/Statistical-Tables-F-James-Rohlf/dp/1429240318/ref=sr_1_1?ie=UTF8&qid=1404660488&sr=8-1&keywords=biometry+statistical+tables

#That table provides a more accurate method than the pg. 54 Biometry table,

#though they are based on similar assumptions.

#The table is read by selecting your sample size from the vertical and horizontal margins.

#For a sample size of 25, as in the aphid example, select 20 from the vertical margin and

#5 from the horizontal margin. The mean range in a normal distribution from a sample of

#this size is 3.931.

(max(aphid.femur)-min(aphid.femur))/3.931

#The answer is similar to the actual value of 0.3657.

### Section 4.8: coding data before computation

#I can't really think of many applications for coded averaging with a computer,

#but let's try it anyway. This box is also on pg. 43 with the non-coded application of averaging.

#The original coding to 0-14 is made by subtracting the lowest class mark

#and then dividing (by 8, in this case) so that the series runs 0-n (pg. 55).

coded.classmarks<-seq(from=0, to=14, by=1)
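
#As a check, applying that coding to the original class marks (which run 59.5-171.5 in

#steps of 8, consistent with the decoding step further down) reproduces 0-14:

original.classmarks<-seq(from=59.5, to=171.5, by=8)

(original.classmarks-min(original.classmarks))/8 #gives 0 through 14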

#We still use the frequencies and sample size as before.

coded.classsums<-coded.classmarks*frequencies

(coded.summing<-sum(coded.classsums))

#Sure enough, we get 59629.

#Divide by sample size and get the coded average of 6.300.

(coded.mean.birthweights<-coded.summing/samplesize)

#(It's actually 6.299947; the book rounded.)

#Box 4.2 showed how to calculate mean and dispersion from frequency distributions,

#both coded and uncoded.

#Continue on to coded standard deviation and variance.

(SS.coded<-sum(frequencies*(coded.classmarks-(coded.summing/samplesize))^2)) #sum of squares

(variance.box4.2<-SS.coded/(sum(frequencies)-1)) #variance coded

(coded.sd<-sqrt(variance.box4.2)) #standard deviation coded

#To decode, refer to Appendix A.2 which discusses multiplicative,

#additive, and combination properties of codes for means and dispersion.

#In the box 4.2 example, to decode the mean, you just reverse the coding

#using algebra.

(decoded.mean<-8*coded.mean.birthweights+59.5)

#The sum of squares, standard deviation, and variance follow slightly different rules,

#per the proofs (pg. 871). Simple additive coding would not require any changes.

#Here we need to divide by the factor we coded with (1/8); the additive part of the code drops out for dispersion measures.

#(Dividing by 1/8 equals multiplying by 8.)

(decoded.sd<-8*coded.sd)
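
#The Appendix A.2 rules are easy to verify on a toy vector (made-up values): an additive

#shift leaves the standard deviation alone, while a multiplicative factor scales it.

toy<-c(63.5, 79.5, 95.5, 111.5)

toy.coded<-(toy-59.5)/8 #same style of coding as above

8*sd(toy.coded) #matches sd of the uncoded values

sd(toy)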

### Section 4.9: the coefficient of variation

#The coefficient of variation is the standard deviation expressed as a percentage of the mean.

#This allows comparison of variation in different populations

#which might have larger or smaller absolute values of measurements.

(coefficient.variation.aphid<-sd.aphid*100/mean(aphid.femur))
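
#Because the coefficient of variation is unit-free, rescaling the data (say, converting

#units of measurement) leaves it unchanged -- a toy check (made-up values):

cv<-function(v) 100*sd(v)/mean(v) #hypothetical helper, same formula as above

toy<-c(3.8, 3.6, 4.3, 3.5)

cv(toy)

cv(toy*25.4) #e.g. after a unit conversion; identical CV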

#As with variance and standard deviation, there is a bias correction to use with small samples.

(corrected.cv.aphid<-coefficient.variation.aphid*(1+(1/(4*(length(aphid.femur))))))

#If you use Statistical Table II and its correction factor to calculate a standard deviation,

#as discussed on the lower part of pg. 53,

#then do not use the corrected coefficient of variation.

#If you look at Statistical table II, you'll see that the correction factor for n>30

#is similar to the correction factor in equation 4.10.


Comments and suggestions welcome.