Tuesday, December 20, 2016

Using R to work through Sokal and Rohlf's Biometry: Chapter 6 (The Normal Probability Distribution), sections 6.1-6.2

Previously in this series: Chapter 5 (Introduction to Probability Distributions: Binomial and Poisson), section 5.4 and selected exercises.

Chapter 6 moves on from discrete distributions to the normal distribution, a continuous distribution that will be the basis of most statistics in the book.

Section 6.1 Frequency Distributions of Continuous Variables

A brief discussion of the section header's topic, nothing to code here.

Section 6.2 Properties of the Normal Distribution


#Let's check where the normal distribution is in R's help.
?Distributions
?dnorm

#We can draw the normal distribution in a custom function as a plot using curve()
#and equation 6.1 on page 95.
normal.manual<-function(mean, sd){
curve((1/(sd*sqrt(2*pi)))*exp((-((x-mean)/sd)^2)/2),
      -4, 4, #plot from -4 to +4 on the x axis (i.e., -4 to +4 standard deviations when mean=0 and sd=1).
      add = FALSE,
      ylab="freq",
      xlab="Y",
      type = "l")
}
normal.manual(0,1)

#You can get this automatically in R with dnorm.
curve(dnorm(x, mean=0, sd=1),
      -4, 4,
      col="white",
      lty="dashed",
      add=TRUE)
#I added a white dashed line over the original black line so you can see that they follow the same path.
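#As a numerical check (my own addition, not something the book asks for), the manual formula
#and dnorm() agree to machine precision:
all.equal(dnorm(0:3),
          (1/sqrt(2*pi))*exp(-((0:3)^2)/2))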

#You can draw any of the distributions in Figure 6.2 using these functions
#(either the R one or the one we just made).

#Figure 6.3 shows the cumulative normal distribution function.

#Let's do that and then plot the normal probability density function on it.
#Cumulative first since that plot will go from 0 to 1.
curve(pnorm(x, mean=0, sd=1),
      -4, 4,
      lty="solid")

#Then add the normal probability density function (like the one we made earlier, just with the default line color).
curve(dnorm(x, mean=0, sd=1),
      -4, 4,
      lty="dashed",
      add=TRUE)

#To get some of the values shown on page 96, we use pnorm().
#http://stackoverflow.com/questions/34236648/r-function-to-calculate-area-under-the-normal-curve-between-adjacent-standard-de

pnorm(0, mean=0, sd=1)
#Supplying a single value gives the cumulative probability up to that point.
#50% of the area falls at or below zero (the mean).

#The other values shown in Figure 6.3
pnorm(-1)
pnorm(-2)
#0 and 1 are the default values for mean and standard deviation, so you can leave them out if you want in this case.
#If you supply more than one value, it returns the cumulative probability for each.
(pnorm(0:1))
#Use diff as suggested here to get the area under the curve.
diff(pnorm(0:1))
#This works because it uses the cumulative function to calculate it.
#If you go from -1 to 1 you can add them up to get the 68.27% shown under Figure 6.2 and on page 96.
sum(diff(pnorm(-1:1)))
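#The same approach gives the areas within two and three standard deviations
#(an extra check, not asked for in the book): about 95.45% and 99.73%.
diff(pnorm(c(-2, 2)))
diff(pnorm(c(-3, 3)))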

#To go the other way, i.e. to see where 50% of the items fall, you have to use qnorm.
#Let's see what qnorm looks like.
curve(qnorm(x))
#It essentially tips Figure 6.3 on its side.
#The cumulative proportion is now on the x-axis, and the standard deviations are on the y-axis.
#You can get the point values in Figure 6.3 this way:
qnorm(0.5)
qnorm(0.1587)
qnorm(0.0228)
#The last two are only approximately -1 and -2 because the values given in Figure 6.3 were rounded.
#How to get the values on page 96, of where 50%, 95%, etc of items are found?
qnorm(0.5+0.5*0.5)
qnorm(0.95+0.5*0.05)
qnorm(0.99+0.5*0.01)
#The values you put in are not the percentages given on page 96 directly.
#qnorm() takes a cumulative proportion, so you put in the proportion (the central interval
#you want) PLUS half of the remaining value.
#If you put in 0.95 on its own, you get the cutoff below which the lowest 95% of the
#distribution falls.  That interval isn't centered at the mean of 0, where the 50% quantile
#sits.  There is 0.05 left over, so we divide it by two to put half in each tail, which
#centers the interval.
#You get the negative deviate (the lower end of the interval) if you put in the complementary value.
qnorm(0.5*0.01)
#This also explains it nicely down where it starts talking about qnorm.
#https://cran.r-project.org/web/packages/tigerstats/vignettes/qnorm.html
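#As a quick sanity check (my own addition), feeding those deviates back into pnorm()
#recovers the central proportions:
diff(pnorm(c(-1, 1)*qnorm(0.95+0.5*0.05))) #about 0.95
diff(pnorm(c(-1, 1)*qnorm(0.99+0.5*0.01))) #about 0.99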

#On pg. 98, they show calculation of standard deviates.
#This is described in such a way that they seem to be z scores although they are never named as such in the book.
#You can easily calculate this manually.
z.score<-function(x){
  (x-mean(x))/sd(x)
}

x<-c(1,2,3)
z.score(x)
#You can also use the scale() function.
#https://www.r-bloggers.com/r-tutorial-series-centering-variables-and-generating-z-scores-with-the-scale-function/

scale(x,
      center=TRUE,
      scale=TRUE)
#The help page notes that scale=TRUE divides each (centered) column by its standard deviation
#when center=TRUE, and center=TRUE subtracts the column mean from each number.
#This is the same thing we did in the z-score function.
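#A quick check (my own, not from the book) that the two approaches agree:
all.equal(as.vector(scale(x)), z.score(x)) #TRUE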

#Table 6.1 has expected frequencies for the normal distribution in column 2 for a sample of 1000 individuals.
#We can generate this with pnorm() and thinking about what the class marks mean.
#Because the class marks are separated by 0.5, the class boundaries sit 0.25 on either side of each class mark, so we start at -5.25.
boundaries<-seq(from=-5.25,
                to=5.25,
                by=0.5)

pnorm.results<-pnorm(boundaries)
expected.freqs<-abs(diff(pnorm.results))
cbind(boundaries[-length(boundaries)]+0.25,
      #this takes the last entry off because we only need the lower bound for each class,
      #and adds 0.25 to get the class mark.
      round(expected.freqs*1000, 1))
#Then you can have a look at the table and see that it gives the same results as column 2 in Table 6.1.

Tuesday, November 29, 2016

Using R to work through Sokal and Rohlf's Biometry: Chapter 5 (Introduction to Probability Distributions: Binomial and Poisson), section 5.4 and selected exercises

Previously in this series: Chapter 5 (Introduction to Probability Distributions: Binomial and Poisson), section 5.3.


Section 5.4: Other Discrete Probability Distributions

?dhyper #hypergeometric
?dmultinom #multinomial
?dnbinom #negative binomial
#There are also options in various packages for regressions that use these other distributions,
#although of these I've only ever used the negative binomial.

#The logarithmic distribution requires an extra package.
#https://cran.r-project.org/web/views/Distributions.html
#extraDistr, VGAM, or gamlss.dst
#I installed extraDistr to test it out.
library(extraDistr)
?LogSeries
#and
?dlgser #The distributions in this package follow the same use of d, p, and q prefixes.


Exercises 5

#I'm doing the exercises that require new coding beyond what we have done already.

#Exercise 5.5
#The organism is present or absent in any given slide, assuming the person is diseased
#(if the person has the disease, their samples contain the organism but it's not very common).
#We want a false negative <1% of the time.  The organism is visible in 20% of the slides.
#At first I thought this was a Bayesian problem but I don't see how to do it that way.
#p=0.2 (organism visible), thus q=0.8 (organism present but not visible).
#We need to find the power of q=0.8 that equals 0.01 or smaller
#(1% false negative, which would be present but not visible).
#So, I made a while loop that cycles through powers of 0.8 to find the power that brings it
#down to 0.01.  This explanation helped me figure out how to do this type of loop.

n<-0
i<-1
print(n)
while (i > 0.01) {
  print(i)
  print(n)
  i<-0.8^n #probability of missing the organism on all n slides
  n<-n+1
}
#the last number printed is the number of slides needed.
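#As a cross-check (a sketch using logarithms instead of the loop), we can solve
#0.8^n <= 0.01 for n directly:
log(0.01)/log(0.8) #about 20.6, so...
ceiling(log(0.01)/log(0.8)) #...21 slides are needed, matching the loop.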

Tuesday, November 15, 2016

Using R to work through Sokal and Rohlf's Biometry: Chapter 5 (Introduction to Probability Distributions: Binomial and Poisson), section 5.3

Previously in this series: Chapter 5 (Introduction to Probability Distributions: Binomial and Poisson), sections 5.1-5.2.

Section 5.3: the Poisson distribution

?Distributions #Let's look in help again to find the Poisson distribution.
?dpois #We can see that R does similar calculations for Poisson in addition to the binomial distribution.

#To get column 3 (expected absolute frequencies) in table 5.5,
#there is the recursion formula given in equations 5.11 and 5.12.
#5.12 is for calculations with sample means to get the relative expected frequency,
#and the text below it notes what to multiply by (the sample size, 400 here) to get the absolute frequencies given in column 3 of Table 5.5.

abs.expected.freq.pois<-400*dpois(c(0:9),
                                  lambda=1.8,
                                  log=FALSE)

#I also want to do equation 5.12 manually for relative expected frequencies.
#This is not normally needed because R does it so nicely with the base stats function dpois().

#Some reading:
#http://stackoverflow.com/questions/5273857/are-recursive-functions-used-in-r
#this one has a simple one for factorials that makes it clearest to me:
#http://www.programiz.com/r-programming/recursion

rel.exp.freq.pois<-function(samplemean, i) {
  if (i==0) return (exp(-samplemean))
  else      return (rel.exp.freq.pois(samplemean,
                                      i-1)*(samplemean/i))
}

#To get absolute frequencies, multiply by 400.
#For one example:
400*rel.exp.freq.pois(1.8, 1)

#To get all the frequencies at once, use lapply.
rel.freq.pois<-unlist(lapply(c(0:9),
                                     rel.exp.freq.pois,
                                     samplemean=1.8))
(abs.freq.pois<-400*rel.freq.pois)

#On page 82, they show how to calculate the coefficient of dispersion.  Here is a function that will do it.
coefficient.of.dispersion<-function(data) {
  var(data)/mean(data)
}
#You can input any set of data here and get the coefficient of dispersion.
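#For example (an illustration with simulated data, not from the book), Poisson-distributed
#counts should give a coefficient of dispersion near 1:
set.seed(1)
coefficient.of.dispersion(rpois(1000, lambda=1.8))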

#Figure 5.3 shows Poisson distributions with different means.
#We can make this with the dpois() function generating y data.
#To add the extra lines, we can use points() with type="l" (lines() would do the same thing).
#Supplying 0:18 as the x values keeps the x axis starting at 0 events
#rather than at the default plotting index of 1.
plot(0:18,
     dpois(0:18, lambda=0.1),
     type="l",
     ylab="Relative expected frequency",
     xlab="Number of rare events per sample")
points(0:18,
       dpois(0:18, lambda=1),
       type="l")
points(0:18,
       dpois(0:18, lambda=2),
       type="l")
points(0:18,
       dpois(0:18, lambda=3),
       type="l")
points(0:18,
       dpois(0:18, lambda=10),
       type="l")

#The remaining tables and examples in this section do not add anything new to code.

Tuesday, November 08, 2016

Using R to work through Sokal and Rohlf's Biometry: Chapter 5 (Introduction to Probability Distributions: Binomial and Poisson), sections 5.1-5.2

Previously in this series: Chapter 4 exercises (in Sokal and Rohlf's Biometry).

Chapter 5 mostly covers the binomial and Poisson distributions, and briefly mentions hypergeometric, multinomial, negative binomial, and logarithmic.


Section 5.1

This section talks about probability theory, unions, intersections, independence, conditional probability, and Bayes' theorem (which apparently will not be used in the rest of the book).

Section 5.2: the binomial distribution

?Distributions #to see distributions available in 'stats'.
?dbinom #to see what you can do with binomial distribution in particular.

#Let's generate table 5.1 and figure 5.2 (bar graph side by side).
#The remaining tables in section 5.2 would use similar code.

#You can get the binomial coefficient from equation 5.9 by hand using factorial(),
#which lets you populate a table with the coefficients and powers of p and q (columns 2-4 in Table 5.1).
#For example, on page 71 in Table 5.1, they give an example of a sample of 5 containing 2 of the infected insects.
factorial(5)/(factorial(2)*factorial(5-2))
#Yes, this gives us 10 just like in table 5.1, column 2, for 2 infected insects.

#You can do this more simply using the choose() function.
#Beware the different use of k in the R function versus the book.
choose(n=5, #n is the size of the sample.  "k" in book.
       k=2) #k is the count of interest (here the number of infected insects), starting at 0; "Y" in book.
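
#Since choose() is vectorized over k, you can also get the whole of column 2 at once
#(a small extra illustration):
choose(5, 0:5) #1 5 10 10 5 1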

#To get the values in columns 3 and 4 of Table 5.1 (except for the power of 0, which is 1 and not calculated here):
poly(0.4, degree=5, raw=T)
poly(0.6, degree=5, raw=T)
#You will note that the numbers for 0.6 (the powers of q) appear in reverse order in column 4.
#This makes more sense once you note that the number of uninfected insects per sample runs
#opposite to the number of infected insects (column 1).
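#As a check (a sketch combining the pieces by hand), multiplying the coefficients by the
#matching powers of p and q reproduces the relative expected frequencies that dbinom() gives below.
choose(5, 0:5)*0.4^(0:5)*0.6^(5:0)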

#To get the relative expected frequencies (column 5), use dbinom and pbinom.
#http://www.r-tutor.com/elementary-statistics/probability-distributions/binomial-distribution

(rel.expected.freq<-dbinom(c(0,1,2,3,4,5), size=5, prob=0.4))
#The density from dbinom() corresponds to the book's relative expected frequencies.
(pbinom(c(0,1,2,3,4,5), size=5, prob=0.4))
#For contrast, pbinom() adds up the densities/relative frequencies cumulatively.

#To get column 6, the absolute expected frequencies, multiply rel.expected.freq by the actual sample size.
(abs.expected.freq<-2423*rel.expected.freq)

#Now we'll use the abs.expected.freq plus the observed frequencies together to make a side-by-side barplot.
obs.freq<-c(202,
            643,
            817,
            535,
            197,
            29)

infected.freq<-rbind(obs.freq,
                 abs.expected.freq)
colnames(infected.freq)<-0:5
rownames(infected.freq)<-c("Observed frequencies",
                           "Expected frequencies")
#Adding colnames and rownames gives the proper legend and x axis labels in the plot.

#If you want to make a side-by-side barplot,
#you need adjacent columns, the barplot() function,
#and to specify beside=TRUE.
barplot(infected.freq,
        beside = TRUE,
        ylab="Frequency",
        xlab="Number of infected insects per sample",
        axes=TRUE,
        legend.text = TRUE,
        args.legend=list(bty="n"))

Tuesday, October 04, 2016

New paper out: Cattle grazing intensity and duration have varied effects on songbird nest survival in mixed-grass prairies

I'm a co-author on a new paper out in Rangeland Ecology and Management.  It's currently up online, though it hasn't been assigned a volume/issue yet.  We looked at the effects of different grazing regimes (differing in duration and stocking rate) on songbird nest survival in Grasslands National Park (Saskatchewan, Canada) and nearby areas, and found few effects overall.  A few species, such as Sprague's Pipit, Chestnut-collared Longspur, and Vesper Sparrow, had lowered nest survival in some years at low to moderate grazing intensities.  Check it out!

Nestling Chestnut-collared Longspurs, near Brooks, Alberta, Canada.  14 June 2016.

Tuesday, September 20, 2016

Basic version control with Git and GitHub in RStudio in both Windows and Ubuntu

I don't remember why, but a few months ago I decided I finally had to try out this "version control" thing.  I love it SO MUCH.  Github and git can be a bit confusing initially, especially as I had the added layers of making it work with RStudio and on two different operating systems, but now that I've got it going it's great.

I got started by installing git and following this tutorial for using RStudio with git/github on Windows.  To understand how push and pull work (as this confused me), I read through several pages.  I started with this graphic of the workflow.  That one ignores RStudio and is just about git and github.  This essay has great illustrations and more details (code to use in the shell/terminal in RStudio).  Finally this one is github/git only (no RStudio), but I still found it helpful to read about push/pull/branching in yet another way.

Next problem was that when I opened up my project on Ubuntu, git said all lines in files were modified when I had not actually changed all lines.  This turned out to be an issue with the different line endings used by Windows and Linux-type systems, and it would hide any actual changes if I had only made a few, so just committing every time was not a good option.

I decided to go with adding a gitattributes file in each repository (so all users will get the same line endings regardless of who I share with).  I added a gitattributes file (I combined general settings and the R settings available here) into the main project directory.  For a directory that I had already used a lot (so lots of files with the wrong line endings) I followed these steps for the per-repository settings, including opening a shell in RStudio to input their code for refreshing my repository.  (I backed up on a separate hard drive first in case I totally messed something up.)

At first this didn't work because, according to this top answer, I have a dual-boot situation rather than the files being used on two separate computers.  I ran the two lines of code in the top answer's update and that fixed it.
$ git config core.eol lf
$ git config core.autocrlf input

I now copy the custom attributes file into any new R repository I have, run those two lines, and haven't had any problems going between operating systems since.  Keep in mind this is where I have two different operating systems accessing the SAME FILES on the SAME drive/partition.  If I understand correctly, the gitattributes file should work fine without the extra lines if you have different computers, but I have not yet tested this.
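
For reference, a minimal gitattributes file along these lines might look something like the following (an illustrative sketch, not my exact file, which combined settings from the links above):
* text=auto
*.R text
*.Rmd text
*.Rdata binary
*.rds binary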

Finally, when I tried pushing in RStudio to github, I kept getting two errors:
error: unable to read askpass response from 'rpostback-askpass'
fatal: could not read Username for 'https://github.com': No such device or address

As I had already done the git config steps to add my github username and github email address, I looked up the error and found that I needed to create a private/public key pair, because RStudio apparently has issues authenticating over https.  I followed the top answer's links, which resulted in my changing the url in the repository's .git/config file and then generating and linking an SSH key.  I did this on both Ubuntu and Windows following the same instructions, resulting in two new key pairs (one for Ubuntu, one for Windows).
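
For anyone doing the same thing, the url change itself boils down to switching the remote from HTTPS to SSH, which you can also do with a single git command (username/repo here is just a placeholder for your own repository):
$ git remote set-url origin git@github.com:username/repo.git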

2016/11/04 edit: I got the askpass response error again in a new repository, but on the same computer that already had the SSH keys.  In this case I only changed the url in the .git/config file and then it worked swimmingly.  I've also had consistent success with line endings using the custom .gitattributes file followed by the two git config commands above.

Tuesday, July 19, 2016

Dual-boot Ubuntu 16.04 with Windows 7, then upgrading to Windows 10

I have a small Lenovo (ThinkPad X100e) running 32-bit Windows 7 Home edition.  I wanted to add a dual-boot setup, then upgrade to Windows 10 to test that it wouldn't erase my Ubuntu partition.

I first tried just running a bootable Ubuntu 16.04 LTS (long-term support) USB, and it got me into the Ubuntu installer, but when I got to the partitioning step the computer said okay, then shut down, and it never seemed to install.  I suspected I might need to partition first.  I pre-emptively went ahead with shrinking the Windows partition before repartitioning, following the same steps as in my previous post, just looking for the settings in their Windows 7 locations.  System protection was already off and I skipped CHKDSK because Windows ran it on its own after my failed original Ubuntu automatic repartitioning attempts.  My computer also insisted on rebooting between steps 4 and 5.  Disk cleanup (step 5) took less than 20 minutes.  Disk shrinking was maybe less than an hour.

I also realized during the Windows shrinking that I was trying to install 64-bit Ubuntu (the main file that downloads) onto a 32-bit computer.  I downloaded the 32-bit version from a mirror (there didn't seem to be an easy link on the main Ubuntu site; I had to look around for it on their site) and made a new bootable USB.  I tried again.  At the partitioning step, I got a scary-looking message about whether I wanted to continue with automatic partitioning (it said it would change and format some things).  I answered continue, and then I went on with the setup.  That worked, and I now have Ubuntu 16.04 LTS dual-booting with Windows 7.

This went a lot faster than with my last dual-boot setup.  I think this was because I had Windows 7, an older laptop, and a bit more experience, and probably mostly because I cared less about whether I ruined my Windows setup.  This older computer also didn't have the weird boot settings that the newer one did (secure boot), nor did it have Windows 8 hibernation and fast shutdown/startup to worry about.  I also did not want to set up a shared data partition on this laptop, so that did not take up any time.  So I did less prep, and it still turned out okay.  Your results may vary.

Now, the Windows 10 upgrade over the dual-boot system.  I ended up not having enough space on my Windows partition because I had some duplicate users and all their data.  I uninstalled a few large programs that I didn't use any more and then removed the two duplicate users and their data (be careful with this!  I was very sure I did not need those user profiles).  After following the various on-screen prompts, I eventually got a message that I had to log into my administrator account.  I did so, waited for the download again, and then the Windows 10 upgrade just... failed, every time, even though I was booting back into Windows 7 and not Ubuntu.

With the error code, I found that I should be rebooting into the system recovery partition.  I did that and sure enough, it worked!  The install crashed once because my laptop overheated, but that's my laptop's problem.  I set it on an air vent in the house and the AC kept it cool enough to restore Windows 7 (Windows did this on its own).  I then started the Windows 10 upgrade again and it finished the install.  I missed a few reboots and it booted into Ubuntu, but in those instances I just rebooted, watched for the boot menu so I could select the Windows system recovery partition, and it resumed where it had left off.  Windows 10 did not erase my programs or apps, so I am pleased with how this went and might even try it on my main working computer now.