Tuesday, January 24, 2017

Using R to work through Sokal and Rohlf's Biometry: Chapter 6 (The Normal Probability Distribution), section 6.6


Section 6.6 Skewness and Kurtosis

#Box 6.1 shows how to compute g1 (skewness) and g2 (kurtosis) from a frequency distribution.
#This is unlike to be how one would do it with your own table of data, 
#but it is a helpful exercise in understanding how these moment statistics work and coding.
#This section assumes you have loaded the birthweights data from the last post.

mean.bw<-sum(birthweights[-16, "frequencies"]*(birthweights[-16, "classmark"]))/samplesize
yfreq<-(
    birthweights[-16, "classmark"]-mean.bw #This is deviation from the mean (see pg 51, section 4.7)
  )

(g1<-(
        samplesize*
        sum(birthweights[-16, "frequencies"]*yfreq^3)
      )/
    (
      (samplesize-1)*(samplesize-2)*birthweights.sd^3
      )
)


(g2<-(
  (
  (samplesize+1)*samplesize*sum(birthweights[-16, "frequencies"]*yfreq^4)
   )/
  (
    (samplesize-1)*(samplesize-2)*(samplesize-3)*(birthweights.sd^4)
    )
  )-
  (
    (
      3*(samplesize-1)^2
      )/
      (
        (samplesize-2)*(samplesize-3)
        )
    )
)

#As an interesting side note, if you use the value of the mean given in the book,
birthweights.mean
#which is rounded to four decimal places, the calculation for cubing 

#(and raising to the power of 4) both were off.  The power of three one was off by 118!

Tuesday, January 17, 2017

Git Large File System

One of my projects needs some pretty big datasets that github can't handle by itself.  Its error message helpfully suggested I install Git Large File System (LFS).

So, I downloaded and installed it.  Then you need to initialize it.  I tried it first in the git shell from RStudio but that didn't work (said it was not a command), so I tried opening a regular command prompt.  There "git lfs install" worked (it said "Git LFS initialized").

Next, I needed to track some giant .csv files, but I didn't want to track all .csv files.  The main instructions for using LFS call for tracking given file extensions.

This part was me noodling around and it didn't work, but has other ideas that might work for other situations.  I found a modification for a single file tracking though, so I can track only some files by putting them in a folder (all the big ones).  I tried following instructions for uploading a file here, but nothing happened when I tried the first file.  I realized it might be too small, so I put a bigger file in (>100 MB) and it still didn't work.  I tried just doing a single file and that didn't do it.  ls-files and lfs status still doesn't show it.  Modifying just to do all .csv files seems to modify them, which is not what I wanted.  So I went back to the last good version (using git reset --hard SHAnumber).  Unfortunately, somehow reset kept locking up RStudio.  I finally got it to open when I paused my computer's Dropbox syncing (I was on a very slow connection) and also opened the project not from the recent projects menu.  Not sure if one or both of those helped.  Then I got the .csv files that had been modified to go away when I deleted the .gitattributes (not my custom.gitattributes file) that had LFS settings in it (apparently track/untrack just writes to .gitattributes).

Once I was reset back to where I was before I started messing with everything, I tried again with tracking the folder AND ITS CONTENTS.  This does it!

git lfs track "myfolder/**"

However, it won't show anything staged in regular git status or git lfs ls-files until commited.  you have to do git lfs status.

Tuesday, January 10, 2017

Using R to work through Sokal and Rohlf's Biometry: Chapter 6 (The Normal Probability Distribution), sections 6.3-6.5

Previously in this series: Chapter 6 (The Normal Probability Distribution), sections 6.1-6.2.


Section 6.3 A Model for the Normal Distribution

#This section describes what produces a normal distribution and
#a heuristic showing how it is related to the binomial distribution.

Section 6.4 Applications of the Normal Distribution

#This section describes how to do what we did to get Table 6.1,
#but with a different set of data back from Section 4.2.

mean.bw<-109.9
sd.bw<-13.593

(sdeviate.bw<-(151-mean.bw)/sd.bw)

pnorm(sdeviate.bw) #This value is more precise
pnorm(3.02) #But the book uses this rounded value so we will too here.
#We can do some of the calculations the book does on pp. 103-104.
#Only a very few individuals are higher than 151 oz.
1-pnorm(3.02)
#If we want to look both directions we can double this number since the distribution is symmetrical.
2*(1-pnorm(3.02))

#By default, lower.tail=TRUE in pnorm.
pnorm(3.02,
      lower.tail=FALSE)
#It is the same as our earlier value of
1-pnorm(3.02)
#Here is the default setting for comparison.
pnorm(3.02,
      lower.tail=TRUE)
#This simply tells use which direction we want to look at,
#the upper or lower tail of the distribution from our value of standard deviate.  

Section 6.5 Fitting a Normal Distribution to Observed Data


#These data are again from Section 4.2.
classmark<-seq(from=59.5, to=171.5, by=8)
frequencies<-c(2,6,39,385,888,1729,2240,2007,1233,641,201,74,14,5,1)
samplesize<-sum(frequencies) #This confirms that we entered the data correctly, and gets our sample size.
#Multiply classmark and frequencies to get the sums for each class.
classsums<-classmark*frequencies

#To look at all this stuff together, combine it into a dataset.
birthweights<-data.frame(cbind(classmark, frequencies, classsums))

#Add on a row of the next class up which contains 0 individuals.
(birthweights<-rbind(birthweights, c(179.5, 0, 0)))


#On page 104, equation 6.2 is like equation 6.1 above but with sample size (n) and i (class intervals).
normal.manual.applied<-function(mean, sd, n, i){
  curve(((1/(sd*sqrt(2*pi)))*exp((-((x-mean)/sd)^2)/2))*n*i,
        -4, 4, #go from -4 to +4 standard deviations.
        add = FALSE,
        ylab="freq",
        xlab="Y",
        type = "l")
}

normal.manual.applied(mean=0, sd=1, n=1000, i=0.5)
#This gives the curve that Table 6.1 also has (same class intervals of 0.5, 1000 samples, and mean=0 with sd=1).
#Let's do this with the birthweights data.

birthweights.mean<-109.8996
birthweights.sd<-13.5942

#Like above we need the lower boundaries of the class marks.
birthweights$boundaries<-birthweights$classmark-4

#Get the expected frequencies for the class boundaries with the mean and sd of our dataset.
birthweights$pnorm.results<-pnorm(birthweights$boundaries,
                     mean=birthweights.mean,
                     sd=birthweights.sd)

#Then, take the difference of the first row minus the next row.
#The last row will not have anything, which is why we needed to add the lower boundary of
#the next class mark, which has a frequency of zero.  Thus, this calculation generates a vector of length 14. 
#We need 15, so we just add a zero on for the last difference as they do in Table 6.2
birthweights$expected.freqs<-c(abs(diff(birthweights$pnorm.results)),0) #add a zero on for the last difference

#Multipy the frequencies by the sample size to get the expected frequencies for a sample of this size.
#Round as in the table.
birthweights$expected.freqs.values<-round(birthweights$expected.freqs*samplesize, 1)

#We can even add the plus and minus signs using ifelse and sign() to see in which direction the differences are.
birthweights$departure.signs<-ifelse(sign(birthweights$frequencies-birthweights$expected.freqs.values)==-1,
                                     "-", #if -1, then write "-"
                                     "+") #else if not -1, write "+"

#View the table to confirm it has the same data as Table 6.2.
birthweights