Section 4.6: the range
#This section begins our study of dispersion and spread.
#The range is the difference between the largest and smallest items in a sample,
#so it should be pretty easy to calculate.
max(aphid.femur)-min(aphid.femur) #bottom of pg. 49
max(birthweights$classmark)-min(birthweights$classmark) #pg. 50
#R also has range(), which returns the minimum and maximum values (not the difference between them).
#If your data contain any NA ("not available") or NaN ("not a number") and na.rm=FALSE (the default),
#then NA (or NaN) will be returned, so set na.rm=TRUE if you want to see the min and max
#AFTER ignoring your NA/NaN values.
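#A minimal sketch of that behaviour. toy.sample is a made-up vector (hypothetical,
#not data from the book) with one NA in it.

```r
# Hypothetical sample with one missing value (not from the book)
toy.sample <- c(2, 4, 4, NA, 5, 5, 7, 9)

range(toy.sample)                             # both values are NA because of the missing item
toy.range <- range(toy.sample, na.rm = TRUE)  # minimum and maximum after dropping the NA
toy.range
diff(toy.range)                               # the range as a single number (max minus min)
```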
Section 4.7: the standard deviation
#The standard deviation measures a distance from the center of the distribution.
#Table 4.1 calculates the basics using our aphid data.
#The first column is each observed value.
#We need to move our set of numbers into a column,
#which can be done here by simply transforming it into a data frame.
#In more complex cases we might need to use a matrix and set dimensions,
#but let's not worry about it here when we can do it a simpler way.
#The deviates are individual observations minus the mean.
#The authors show column sums in Table 4.1 as well.
#The observed values sum to 100.1.
#Perhaps summing deviates will get us a measure of dispersion.
#Nope! As the book points out, this is essentially zero if not for rounding errors.
#Appendix A.1 on pg. 869 shows a nice proof of this.
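#A minimal sketch of the first two columns of such a table. toy.sample and toy.table
#are hypothetical stand-ins (made-up numbers, not the aphid data).

```r
# Hypothetical sample (not from the book)
toy.sample <- c(2, 4, 4, 4, 5, 5, 7, 9)

# One observation per row: turning the vector into a data frame gives us a column
toy.table <- data.frame(observed = toy.sample)

# Deviates: each observation minus the sample mean
toy.table$deviates <- toy.table$observed - mean(toy.table$observed)

sum(toy.table$observed)                  # column sum of the observations
sum.deviates <- sum(toy.table$deviates)  # column sum of the deviates
sum.deviates                             # zero, apart from possible floating-point rounding
```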
#We'll now make the absolute value column in Table 4.1.
#This average deviate is apparently not a very common measure of dispersion though.
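#A sketch of the average deviate on a made-up sample (toy.sample is hypothetical,
#not the aphid data): take absolute values of the deviates, then average them.

```r
# Hypothetical sample (not from the book)
toy.sample <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Absolute deviates: the distance of each observation from the mean, ignoring sign
abs.deviates <- abs(toy.sample - mean(toy.sample))

# Average deviate: the mean of the absolute deviates
average.deviate <- sum(abs.deviates) / length(abs.deviates)
average.deviate
```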
#Instead we use, as you might have guessed from the section title,
#standard deviations, which, the authors state, have nifty statistical properties based
#on the sum of squares. Presumably they will explain said properties in a later chapter.
#Variance is the mean of the squared deviates.
#The sum of the items divided by the number of the items is the mean,
#hence the mean of the squared deviates.
#The square root of this is the standard deviation.
#It is in original units of measurement
#because we brought it back to that by the square root.
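#A sketch of those steps on a made-up sample (toy.sample is hypothetical, not the aphid data).
#Note that this is the uncorrected version, dividing by n.

```r
# Hypothetical sample (not from the book)
toy.sample <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Squared deviates and their sum (the sum of squares)
sqr.deviates <- (toy.sample - mean(toy.sample))^2
sum.of.squares <- sum(sqr.deviates)

# Uncorrected variance: the mean of the squared deviates (divide by n)
uncorrected.variance <- sum.of.squares / length(toy.sample)

# Uncorrected standard deviation: back in the original units via the square root
uncorrected.sd <- sqrt(uncorrected.variance)

c(SS = sum.of.squares, variance = uncorrected.variance, sd = uncorrected.sd)
```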
#Bias-corrected variance and standard deviation are apparently needed,
#and are shown in the bottom of Table 4.1 and on pg. 53
(bias.corrected.variance.aphid<-sum(aphid.femur.sd$sqr.deviates)/
   (length(aphid.femur.sd$sqr.deviates)-1)) #equation 4.6
(sd.aphid<-sqrt(bias.corrected.variance.aphid)) #equation 4.7
#The differences between bias-corrected and uncorrected variance and
#standard deviation will decline as sample size increases.
#The quantity of n-1 (which we are showing as length(oursample)-1)
#is the degrees of freedom. The authors state that we are
#only to use n to get uncorrected variance and standard deviation
#if evaluating a parameter (i.e. we have all items from the population).
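#A sketch of how the choice of n or n-1 matters less as n grows. The uncorrected standard
#deviation equals the corrected one times sqrt((n-1)/n), so that ratio is all we need.
#The sample sizes and toy.sample below are arbitrary, hypothetical choices.

```r
# Ratio of uncorrected (divide by n) to corrected (divide by n-1) standard deviation
n <- c(5, 25, 100, 1000)
sd.ratio <- sqrt((n - 1) / n)
data.frame(n = n, ratio = round(sd.ratio, 4))  # creeps toward 1 as n increases

# Check on a hypothetical sample: sd() uses n-1, so rescaling it gives the n version
toy.sample <- c(2, 4, 4, 4, 5, 5, 7, 9)
toy.n <- length(toy.sample)
uncorrected.sd <- sd(toy.sample) * sqrt((toy.n - 1) / toy.n)
uncorrected.sd
```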
#There is an additional and even more accurate correction for estimating standard deviation.
#Multiply the standard deviation by the correction factor found in the Statistical Tables book,
#Statistical Table II (pg. 204). This is Gurland and Tripathi's correction, and approaches 1
#above a sample size of 100.
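#If you don't have the tables handy, here is a hedged sketch. ASSUMPTION: the tabled factor
#is the standard normal-theory unbiasing constant for the n-1 standard deviation,
#sqrt((n-1)/2)*gamma((n-1)/2)/gamma(n/2). I have not checked this against the printed table.
#sd.correction is a hypothetical helper name, and 1+1/(4(n-1)) is a common approximation to it.

```r
# ASSUMED form of the correction factor (normal-theory unbiasing constant);
# lgamma() is used instead of gamma() to avoid overflow at large n
sd.correction <- function(n) {
  sqrt((n - 1) / 2) * exp(lgamma((n - 1) / 2) - lgamma(n / 2))
}

n <- c(5, 10, 25, 30, 100)
exact.factor <- sd.correction(n)
approx.factor <- 1 + 1 / (4 * (n - 1))   # simple approximation to the same quantity

data.frame(n = n,
           exact = round(exact.factor, 4),
           approximate = round(approx.factor, 4))

# Usage: multiply a bias-corrected standard deviation by the factor for its sample size,
# e.g. sd(your.sample) * sd.correction(length(your.sample))
```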
#R allows easy calculation of both variance and standard deviation.
var(aphid.femur) #variance
sd(aphid.femur) #standard deviation
#Be careful with var(), as the R function will also work with matrices.
#See ?var for more info.
#On pg. 54 the authors suggest quick approximations of the mean and standard deviation.
#These are useful for checking whether your calculated values are about right
#("detect gross errors in computation").
#For the mean, obtain the midrange: just average the largest and smallest values.
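#A sketch of the midrange check on a made-up sample (toy.sample is hypothetical,
#not the aphid data).

```r
# Hypothetical sample (not from the book)
toy.sample <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Midrange: the average of the smallest and largest values
toy.midrange <- mean(range(toy.sample))

toy.midrange        # quick approximation
mean(toy.sample)    # the calculated mean it is meant to sanity-check
```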
#To estimate the standard deviation, you should be able to use their little table on pg. 54.
#There are 25 observations in aphid.femur. Of the sample sizes in that table, this is closest to 30, so divide the range by 4.
#They also mention Statistical Table I as a source of mean ranges for various sample sizes with a normal distribution.
#The statistical tables come as a separate book: Statistical tables, 4th ed
#(table I is found pp. 62-63).
#That table provides a more accurate method than the pg. 54 Biometry table,
#though they are based on similar assumptions.
#The table is read by selecting your sample size from the vertical and horizontal margins.
#For a sample size of 25, as in the aphid example, select 20 from the vertical margin and
#5 from the horizontal margin. The mean range in a normal distribution from a sample of
#this size is 3.931.
#The answer is similar to the actual value of 0.3657.
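#A sketch of both range-based estimates. The divisors 4 and 3.931 are the ones quoted above
#for a sample of about 25; sim.sample is a hypothetical simulated normal sample
#(not the aphid data), so the numbers it prints are only illustrative.

```r
# Hypothetical normal sample of 25 items (simulated, not from the book)
set.seed(1)
sim.sample <- rnorm(25, mean = 4, sd = 0.4)

sample.range <- diff(range(sim.sample))  # largest minus smallest

sd.rough <- sample.range / 4             # rule-of-thumb divisor from the pg. 54 table
sd.table <- sample.range / 3.931         # mean range for n = 25 from Statistical Table I

c(rough = sd.rough, table = sd.table, calculated = sd(sim.sample))
```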
Section 4.8: coding data before computation
#I can't really think of many reasons to use coded averaging with a computer,
#but let's try it anyway. This box is also on pg. 43 with the non-coding application of averaging.
#The original coding to 0-14 is made by subtracting the lowest class mark
#and then dividing (by 8, in this case) so that the series is 0-n (pg. 55).
coded.classmarks<-seq(from=0, to=14, by=1)
#We still use the frequencies and sample size as before.
#Sure enough, we get 59629.
#Divide by sample size and get the coded average of 6.300.
#(It's 6.299947 but apparently the book was the one to round.)
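#A sketch of the same steps on a tiny made-up frequency table. Every toy.* object is
#hypothetical (these are not the Box 4.2 birth weights); only the class width of 8 is borrowed.

```r
# Hypothetical frequency distribution with class marks 8 units apart
toy.classmarks <- c(10, 18, 26, 34)
toy.frequencies <- c(1, 3, 4, 2)
toy.samplesize <- sum(toy.frequencies)

# Code the class marks: subtract the lowest class mark, then divide by the class width
toy.coded <- (toy.classmarks - min(toy.classmarks)) / 8
toy.coded                                   # 0, 1, 2, 3

# Sum of frequency times coded class mark, then the coded mean
toy.coded.sum <- sum(toy.frequencies * toy.coded)
toy.coded.mean <- toy.coded.sum / toy.samplesize
c(sum = toy.coded.sum, coded.mean = toy.coded.mean)
```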
#Box 4.2 showed how to calculate mean and dispersion from frequency distributions,
#both coded and uncoded.
#Continue on to coded standard deviation and variance.
(SS.coded<-sum(frequencies*(coded.classmarks-(coded.summing/samplesize))^2)) #sum of squares
(variance.box4.2<-SS.coded/(sum(frequencies)-1)) #variance coded
(coded.sd<-sqrt(variance.box4.2)) #standard deviation coded
#To decode, refer to Appendix A.2 which discusses multiplicative,
#additive, and combination properties of codes for means and dispersion.
#In the box 4.2 example, to decode the mean, you just reverse the coding:
#multiply by 8, then add back the lowest class mark.
#The sum of squares, standard deviation, and variance follow slightly different rules,
#per the proofs (pg. 871). Simple additive coding would not require any changes.
#Here we need to divide by the factor we coded with (1/8) (remember, additive is ignored).
#(Dividing by 1/8 equals multiplying by 8.)
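#A sketch of decoding on a tiny made-up frequency table, checked against the uncoded
#calculation. Every toy.* object is hypothetical (not the Box 4.2 data).

```r
# Hypothetical frequency distribution, coded by subtracting 10 and dividing by 8
toy.classmarks <- c(10, 18, 26, 34)
toy.frequencies <- c(1, 3, 4, 2)
toy.n <- sum(toy.frequencies)
toy.coded <- (toy.classmarks - 10) / 8

# Coded mean and coded (bias-corrected) standard deviation
toy.coded.mean <- sum(toy.frequencies * toy.coded) / toy.n
toy.coded.sd <- sqrt(sum(toy.frequencies * (toy.coded - toy.coded.mean)^2) / (toy.n - 1))

# Decode: the mean reverses both steps; the sd only needs the multiplicative step,
# and the variance needs the multiplicative factor squared
toy.decoded.mean <- toy.coded.mean * 8 + 10
toy.decoded.sd <- toy.coded.sd * 8
toy.decoded.variance <- toy.coded.sd^2 * 8^2

# Compare with the uncoded calculation on the expanded data
toy.expanded <- rep(toy.classmarks, times = toy.frequencies)
c(decoded.mean = toy.decoded.mean, direct.mean = mean(toy.expanded))
c(decoded.sd = toy.decoded.sd, direct.sd = sd(toy.expanded))
```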
Section 4.9: the coefficient of variation
#The coefficient of variation is the standard deviation as a percentage of the mean.
#This allows comparison of variation in different populations
#which might have larger or smaller absolute values of measurements.
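#A sketch of the calculation on a made-up sample (toy.sample is hypothetical, not book data).
#Multiplying every value by 10 changes the mean and standard deviation but not the
#coefficient of variation, which is why it is handy for comparisons.

```r
# Hypothetical sample (not from the book)
toy.sample <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Coefficient of variation: standard deviation as a percentage of the mean
toy.cv <- 100 * sd(toy.sample) / mean(toy.sample)

# Same sample on a scale ten times larger
toy.scaled <- 10 * toy.sample
toy.cv.scaled <- 100 * sd(toy.scaled) / mean(toy.scaled)

c(original = toy.cv, scaled = toy.cv.scaled)
```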
#As for variance and standard deviation, there is a bias correction to use with small samples.
#If you use Statistical Table II and its correction factor to calculate a standard deviation,
#as discussed on the lower part of pg. 53,
#then do not use the corrected coefficient of variation.
#If you look at Statistical table II, you'll see that the correction factor for n>30
#is similar to the correction factor in equation 4.10.
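#A hedged sketch of that comparison. ASSUMPTIONS: equation 4.10 multiplies the coefficient of
#variation by (1 + 1/(4n)), and the Table II factor is the normal-theory unbiasing constant
#for the standard deviation. Both are my recollection, so check them against the book
#before relying on them. toy.sample is a made-up sample.

```r
# Compare the two ASSUMED correction factors at a few sample sizes
n <- c(30, 50, 100)
cv.factor <- 1 + 1 / (4 * n)                                               # assumed eq. 4.10
sd.factor <- sqrt((n - 1) / 2) * exp(lgamma((n - 1) / 2) - lgamma(n / 2))  # assumed Table II
data.frame(n = n, cv.factor = round(cv.factor, 4), sd.factor = round(sd.factor, 4))

# Corrected coefficient of variation for a hypothetical small sample
toy.sample <- c(2, 4, 4, 4, 5, 5, 7, 9)
toy.cv <- 100 * sd(toy.sample) / mean(toy.sample)
toy.cv.corrected <- toy.cv * (1 + 1 / (4 * length(toy.sample)))
c(cv = toy.cv, corrected = toy.cv.corrected)
```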