Tuesday, December 20, 2016

Using R to work through Sokal and Rohlf's Biometry: Chapter 6 (The Normal Probability Distribution), sections 6.1-6.2

Previously in this series: Chapter 5 (Descriptive Statistics), section 5.5 and selected exercises.

Chapter 6 moves on from discrete distributions to the normal distribution,
#a continuous distribution that will be the basis of most statistics in the book.

Section 6.1 Frequency Distributions of Continuous Variables

A brief discussion of the section header's topic, nothing to code here.

Section 6.2 Properties of the Normal Distribution

#Let's check where it is.

#We can draw the normal distribution in a custom function as a plot using curve()
#and equation 6.1 on page 95.
normal.manual<-function(mean, sd){
      -4, 4, #go from -4 to +4 standard deviations.
      add = FALSE,
      type = "l")

#You can get this automatically in R with dnorm.
curve(dnorm(x, mean=0, sd=1),
      -4, 4,
#I added a white dashed line over the original black line so you can see that they follow the same path.

#You can draw any of the distributions in Figure 6.2 using these functions
#(either the R one or the one we just made).

#Figure 6.3 shows the cumulative normal distribution function.

#Let's do that and then plot the normal probability density function on it.
#Cumulative first since that plot will go from 0 to 1.
curve(pnorm(x, mean=0, sd=1),
      -4, 4,

#Then add the normal probability density function (like the one we made earlier, just with the default line color).
curve(dnorm(x, mean=0, sd=1),
      -4, 4,

#To get out some of the values  shown in page 96, we use pnorm().

pnorm(0, mean=0, sd=1)
#Just adding one value gives you a point estimate.
#50% will be found up to zero (the mean).

#The other values shown in Figure 6.3
#0 and 1 are the default values for mean and standard deviation, so you can leave them out if you want in this case.
#If you add more than one, it will tell you the values.
#Use diff as suggested here to get the area under the curve.
#This works because it uses the cumulative function to calculate it.
#If you go from -1 to 1 you can add them up to get the 68.27% shown under Figure 6.2 and on page 96.

#To go the other way, i.e. to see where 50% of the items fall, you have to use qnorm.
#Let's see what qnorm looks like.
#It is essentially tipping Figure 6.3 on its side.
#The frequency is now x on the x-axis, and the standard deviations are the y axis.
#You can get the point values in Figure 6.3 this way:
#The last two are close with rounding since those values in Figure 6.3 were rounded.
#How to get the values on page 96, of where 50%, 95%, etc of items are found?
#The values that you input are not the percent values given on page 96.
#You put them in proportions (the quantile you want to get) PLUS
#half of the remaining value.
#If you put in 0.95, it will go from 0 to 0.95.  That isn't centered at the mean of 0,
#where the 50% quantile is.  There is 0.05 left at the end.  So we divide it by two
#to put it on either end.
#You would get the negative number for the standard deviation if you put in the opposite value.
#This also explains it nicely down where it starts talking about qnorm.

#On pg. 98, they show calculation of standard deviates.
#This is described in such a way that they seem to be z scores although they are never named as such in the book.
#You can easily calculate this manually.

#You can also use the scale() function.

#help notes that scale=TRUE divides by the standard deviation when
#center=TRUE, and center=TRUE substracts each number by the column mean.
#This is the same thing we did in the z-score function.

#Table 6.1 has expected frequencies for the normal distribution in column 2 for a sample of 1000 individuals.
#We can generate this with pnorm() and thinking about what the class marks mean.
#Because the class marks are separated by 0.5, we need to go 0.5 around each class mark and start at -5.25.

      #this takes the last entry off because we only need the lower bound for each class,
      #and adds 0.25 to get the class mark.
      round(expected.freqs*1000, 1))
#Then you can have a look at the table and see that it gives the same results as column 2 in Table 6.1.

No comments:

Post a Comment

Comments and suggestions welcome.