Tuesday, March 07, 2017

New paper out: Ability to alter song in two grassland songbirds exposed to simulated anthropogenic noise is not related to pre-existing variability

Exciting news!  The first paper from my experimental work as a post-doc at the University of Manitoba has been published in Bioacoustics.  We compared how Baird's and Savannah sparrows alter their songs in the presence of simulated oil well drilling noise, and found that both species altered their songs in response to the noise.  This was in spite of the fact that the Savannah Sparrow has much more species-level variation in song than the Baird's Sparrow.  Read all about it!

Tuesday, January 24, 2017

Using R to work through Sokal and Rohlf's Biometry: Chapter 6 (The Normal Probability Distribution), section 6.6


Section 6.6 Skewness and Kurtosis

#Box 6.1 shows how to compute g1 (skewness) and g2 (kurtosis) from a frequency distribution.
#This is unlikely to be how you would do it with your own table of raw data,
#but it is a helpful coding exercise for understanding how these moment statistics work.
#This section assumes you have loaded the birthweights data from the last post.

mean.bw<-sum(birthweights[-16, "frequencies"]*(birthweights[-16, "classmark"]))/samplesize
yfreq<-(
    birthweights[-16, "classmark"]-mean.bw #This is deviation from the mean (see pg 51, section 4.7)
  )

(g1<-(
        samplesize*
        sum(birthweights[-16, "frequencies"]*yfreq^3)
      )/
    (
      (samplesize-1)*(samplesize-2)*birthweights.sd^3
      )
)


(g2<-(
  (
  (samplesize+1)*samplesize*sum(birthweights[-16, "frequencies"]*yfreq^4)
   )/
  (
    (samplesize-1)*(samplesize-2)*(samplesize-3)*(birthweights.sd^4)
    )
  )-
  (
    (
      3*(samplesize-1)^2
      )/
      (
        (samplesize-2)*(samplesize-3)
        )
    )
)
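As a cross-check on the frequency-table formula, you can expand a frequency table into raw observations with rep() and compute g1 directly from those. This is a minimal sketch using a made-up toy table (not the birthweights data), so the numbers themselves are only illustrative:

```r
#Toy frequency table, invented for illustration only.
classmark   <- c(1, 2, 3, 4, 5)
frequencies <- c(2, 5, 9, 5, 1)
n <- sum(frequencies)

#Expand to one observation per counted individual.
raw <- rep(classmark, times = frequencies)
y   <- classmark - mean(raw)   #deviations of the class marks from the mean
s   <- sd(raw)

#g1 from the frequency table (Box 6.1 style)...
g1.freq <- n * sum(frequencies * y^3) / ((n - 1) * (n - 2) * s^3)
#...and g1 from the raw observations directly.
g1.raw  <- n * sum((raw - mean(raw))^3) / ((n - 1) * (n - 2) * sd(raw)^3)

all.equal(g1.freq, g1.raw)  #TRUE: the two routes agree
```

The frequency-weighted sum of cubed class-mark deviations is exactly the sum of cubed deviations over the expanded raw data, which is why the two agree.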

#As an interesting side note, if you use the value of the mean given in the book,
birthweights.mean
#which is rounded to four decimal places, the sums of the cubed (and fourth-power)
#deviations both come out wrong.  The cubed one was off by 118!

Tuesday, January 17, 2017

Git Large File Storage

One of my projects needs some pretty big datasets that GitHub can't handle by itself.  Its error message helpfully suggested I install Git Large File Storage (LFS).

So, I downloaded and installed it.  Then you need to initialize it.  I tried it first in the Git shell from RStudio, but that didn't work (it said it was not a command), so I opened a regular command prompt instead.  There, "git lfs install" worked (it said "Git LFS initialized").

Next, I needed to track some giant .csv files, but I didn't want to track all .csv files.  The main instructions for using LFS call for tracking given file extensions.

This part was me noodling around and it didn't work, but it has ideas that might work in other situations.  I found a modification for tracking a single file, so I figured I could track only some files by putting all the big ones in one folder.  I tried following instructions for uploading a file here, but nothing happened with the first file.  I realized it might be too small, so I tried a bigger file (>100 MB) and that didn't work either.  I tried tracking just the single file and that didn't do it; git lfs ls-files and git lfs status still didn't show it.  Modifying the pattern to track all .csv files seemed to modify all of them, which is not what I wanted.

So I went back to the last good version (using git reset --hard SHAnumber).  Unfortunately, the reset kept locking up RStudio.  I finally got the project to open when I paused my computer's Dropbox syncing (I was on a very slow connection) and also opened the project not from the recent-projects menu; I'm not sure whether one or both of those helped.  Then I got the modified .csv files to go away by deleting the .gitattributes file (not my custom .gitattributes) that had the LFS settings in it.  Apparently track/untrack just writes to .gitattributes.

Once I was reset back to where I was before I started messing with everything, I tried again with tracking the folder AND ITS CONTENTS.  This does it!

git lfs track "myfolder/**"

However, nothing will show as staged in regular git status, and git lfs ls-files won't show anything until the files are committed.  You have to use git lfs status.

Tuesday, January 10, 2017

Using R to work through Sokal and Rohlf's Biometry: Chapter 6 (The Normal Probability Distribution), sections 6.3-6.5

Previously in this series: Chapter 6 (The Normal Probability Distribution), sections 6.1-6.2.


Section 6.3 A Model for the Normal Distribution

#This section describes what produces a normal distribution and
#a heuristic showing how it is related to the binomial distribution.

Section 6.4 Applications of the Normal Distribution

#This section describes how to do what we did to get Table 6.1,
#but with a different set of data back from Section 4.2.

mean.bw<-109.9
sd.bw<-13.593

(sdeviate.bw<-(151-mean.bw)/sd.bw)

pnorm(sdeviate.bw) #This value is more precise
pnorm(3.02) #But the book uses this rounded value so we will too here.
#We can do some of the calculations the book does on pp. 103-104.
#Only a very few individuals are higher than 151 oz.
1-pnorm(3.02)
#If we want to look both directions we can double this number since the distribution is symmetrical.
2*(1-pnorm(3.02))

#By default, lower.tail=TRUE in pnorm.
pnorm(3.02,
      lower.tail=FALSE)
#It is the same as our earlier value of
1-pnorm(3.02)
#Here is the default setting for comparison.
pnorm(3.02,
      lower.tail=TRUE)
#This simply tells us which direction we want to look in:
#the upper or lower tail of the distribution from our standard deviate.
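Because the normal distribution is symmetric, the upper tail beyond a positive standard deviate equals the lower tail below its negative. A quick sketch to confirm, using the same standard deviate as above:

```r
#The area above +3.02 equals the area below -3.02.
upper <- pnorm(3.02, lower.tail = FALSE)
lower <- pnorm(-3.02)
all.equal(upper, lower)  #TRUE
```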

Section 6.5 Fitting a Normal Distribution to Observed Data


#These data are again from Section 4.2.
classmark<-seq(from=59.5, to=171.5, by=8)
frequencies<-c(2,6,39,385,888,1729,2240,2007,1233,641,201,74,14,5,1)
samplesize<-sum(frequencies) #This confirms that we entered the data correctly, and gets our sample size.
#Multiply classmark and frequencies to get the sums for each class.
classsums<-classmark*frequencies

#To look at all this stuff together, combine it into a dataset.
birthweights<-data.frame(classmark, frequencies, classsums) #data.frame() takes the vectors directly; no cbind() needed.

#Add on a row of the next class up which contains 0 individuals.
(birthweights<-rbind(birthweights, c(179.5, 0, 0)))


#On page 104, equation 6.2 is like equation 6.1 above but multiplied by the sample size (n) and the class interval (i).
normal.manual.applied<-function(mean, sd, n, i){
  curve(((1/(sd*sqrt(2*pi)))*exp((-((x-mean)/sd)^2)/2))*n*i,
        -4, 4, #go from -4 to +4 standard deviations.
        add = FALSE,
        ylab="freq",
        xlab="Y",
        type = "l")
}

normal.manual.applied(mean=0, sd=1, n=1000, i=0.5)
#This gives the curve that Table 6.1 also has (same class intervals of 0.5, 1000 samples, and mean=0 with sd=1).
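If you would rather not hand-code equation 6.2, dnorm() scaled by n and i gives the same heights. A quick numeric check at a few points, with the same parameters as the plot above:

```r
x <- seq(-4, 4, by = 0.5)   #a few standard deviates
#Equation 6.2 by hand (mean 0, sd 1, n = 1000, i = 0.5)...
manual    <- (1 / (1 * sqrt(2 * pi))) * exp(-((x - 0) / 1)^2 / 2) * 1000 * 0.5
#...versus dnorm() times the same scaling factors.
via.dnorm <- dnorm(x, mean = 0, sd = 1) * 1000 * 0.5
all.equal(manual, via.dnorm)  #TRUE
```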
#Let's do this with the birthweights data.

birthweights.mean<-109.8996
birthweights.sd<-13.5942

#Like above we need the lower boundaries of the class marks.
birthweights$boundaries<-birthweights$classmark-4

#Get the expected frequencies for the class boundaries with the mean and sd of our dataset.
birthweights$pnorm.results<-pnorm(birthweights$boundaries,
                     mean=birthweights.mean,
                     sd=birthweights.sd)

#Then, take the differences between successive rows with diff().
#The last row would have nothing after it to difference against, which is why we added
#the lower boundary of the next class mark, with a frequency of zero.  diff() on the
#16 rows generates a vector of length 15.  We need 16 to match the data frame,
#so we add a zero on for the last difference, as they do in Table 6.2.
birthweights$expected.freqs<-c(abs(diff(birthweights$pnorm.results)),0) #add a zero on for the last difference

#Multiply the frequencies by the sample size to get the expected frequencies for a sample of this size.
#Round as in the table.
birthweights$expected.freqs.values<-round(birthweights$expected.freqs*samplesize, 1)

#We can even add the plus and minus signs using ifelse and sign() to see in which direction the differences are.
birthweights$departure.signs<-ifelse(sign(birthweights$frequencies-birthweights$expected.freqs.values)==-1,
                                     "-", #if -1, then write "-"
                                     "+") #else if not -1, write "+"

#View the table to confirm it has the same data as Table 6.2.
birthweights
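As a rough sanity check on the reconstruction, the expected frequencies should sum to nearly the observed sample size, since only the tiny tail areas beyond the table are missing. A self-contained sketch repeating the key steps:

```r
classmark   <- seq(from = 59.5, to = 171.5, by = 8)
frequencies <- c(2,6,39,385,888,1729,2240,2007,1233,641,201,74,14,5,1)
samplesize  <- sum(frequencies)   #9465

#Lower class boundaries, including the empty class added at the top.
boundaries <- c(classmark, 179.5) - 4
expected   <- diff(pnorm(boundaries, mean = 109.8996, sd = 13.5942)) * samplesize

sum(expected)   #just short of 9465; the tails beyond the table account for the rest
```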

Tuesday, December 20, 2016

Using R to work through Sokal and Rohlf's Biometry: Chapter 6 (The Normal Probability Distribution), sections 6.1-6.2

Previously in this series: Chapter 5 (Introduction to Probability Distributions), section 5.4 and selected exercises.

Chapter 6 moves on from discrete distributions to the normal distribution,
a continuous distribution that will be the basis of most statistics in the book.

Section 6.1 Frequency Distributions of Continuous Variables

A brief discussion of the section header's topic, nothing to code here.

Section 6.2 Properties of the Normal Distribution


#Let's check where the normal distribution lives in R's help.
?Distributions
?dnorm

#We can draw the normal distribution in a custom function as a plot using curve()
#and equation 6.1 on page 95.
normal.manual<-function(mean, sd){
curve((1/(sd*sqrt(2*pi)))*exp((-((x-mean)/sd)^2)/2),
      -4, 4, #go from -4 to +4 standard deviations.
      add = FALSE,
      ylab="freq",
      xlab="Y",
      type = "l")
}
normal.manual(0,1)

#You can get this automatically in R with dnorm.
curve(dnorm(x, mean=0, sd=1),
      -4, 4,
      col="white",
      lty="dashed",
      add=TRUE)
#I added a white dashed line over the original black line so you can see that they follow the same path.
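The overlap is exact, not just visual: the hand-coded density from equation 6.1 matches dnorm() at every point, and for any mean and standard deviation. A quick numeric check with arbitrary example parameters (mean 2, sd 3):

```r
x <- seq(-4, 4, by = 0.1)
#Equation 6.1 by hand...
manual    <- (1 / (3 * sqrt(2 * pi))) * exp(-((x - 2) / 3)^2 / 2)
#...versus R's built-in density.
via.dnorm <- dnorm(x, mean = 2, sd = 3)
all.equal(manual, via.dnorm)  #TRUE
```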

#You can draw any of the distributions in Figure 6.2 using these functions
#(either the R one or the one we just made).

#Figure 6.3 shows the cumulative normal distribution function.

#Let's do that and then plot the normal probability density function on it.
#Cumulative first since that plot will go from 0 to 1.
curve(pnorm(x, mean=0, sd=1),
      -4, 4,
      lty="solid")

#Then add the normal probability density function (like the one we made earlier, just with the default line color).
curve(dnorm(x, mean=0, sd=1),
      -4, 4,
      lty="dashed",
      add=TRUE)

#To get some of the values shown on page 96, we use pnorm().
#http://stackoverflow.com/questions/34236648/r-function-to-calculate-area-under-the-normal-curve-between-adjacent-standard-de

pnorm(0, mean=0, sd=1)
#Just adding one value gives you a point estimate.
#50% will be found up to zero (the mean).

#The other values shown in Figure 6.3
pnorm(-1)
pnorm(-2)
#0 and 1 are the default values for mean and standard deviation, so you can leave them out if you want in this case.
#If you pass more than one value, it returns the cumulative probability for each.
(pnorm(0:1))
#Use diff as suggested here to get the area under the curve.
diff(pnorm(0:1))
#This works because it uses the cumulative function to calculate it.
#If you go from -1 to 1 you can add them up to get the 68.27% shown under Figure 6.2 and on page 96.
sum(diff(pnorm(-1:1)))
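Extending the same idea to ±2 and ±3 standard deviations recovers the familiar 68-95-99.7 rule:

```r
#Area within k standard deviations of the mean, for k = 1, 2, 3.
areas <- sapply(1:3, function(k) diff(pnorm(c(-k, k))))
areas  #about 0.6827, 0.9545, and 0.9973
```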

#To go the other way, i.e. to see where 50% of the items fall, you have to use qnorm.
#Let's see what qnorm looks like.
curve(qnorm(x))
#It is essentially Figure 6.3 tipped on its side.
#The cumulative probability is now on the x-axis, and the standard deviates are on the y-axis.
#You can get the point values in Figure 6.3 this way:
qnorm(0.5)
qnorm(0.1587)
qnorm(0.0228)
#The last two are close with rounding since those values in Figure 6.3 were rounded.
#How do we get the values on page 96 for where 50%, 95%, etc. of items are found?
qnorm(0.5+0.5*0.5)
qnorm(0.95+0.5*0.05)
qnorm(0.99+0.5*0.01)
#The values you input are not the percent values given on page 96.
#You enter them as proportions: the quantile you want PLUS half of the remainder.
#If you put in 0.95, it covers 0 to 0.95.  That isn't centered on the mean of 0,
#where the 50% quantile is.  There is 0.05 left over at the end, so we divide it by two
#to put half on either tail.
#You get the negative standard deviate if you put in the opposite value.
qnorm(0.5*0.01)
#This page also explains it nicely, starting where it discusses qnorm.
#https://cran.r-project.org/web/packages/tigerstats/vignettes/qnorm.html
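Putting pnorm() and qnorm() together: they are inverses of each other, and the half-of-the-remainder trick gives the familiar two-tailed cutoffs. A small sketch:

```r
z95 <- qnorm(0.95 + 0.5 * 0.05)   #the 95% two-tailed cutoff, about 1.96
pnorm(z95)                        #round trip back to 0.975
diff(pnorm(c(-z95, z95)))         #0.95 of the area lies between -z95 and z95
```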

#On pg. 98, they show calculation of standard deviates.
#This is described in such a way that they seem to be z scores although they are never named as such in the book.
#You can easily calculate this manually.
z.score<-function(x){
  (x-mean(x))/sd(x)
}

x<-c(1,2,3)
z.score(x)
#You can also use the scale() function.
#https://www.r-bloggers.com/r-tutorial-series-centering-variables-and-generating-z-scores-with-the-scale-function/

scale(x,
      center=TRUE,
      scale=TRUE)
#The help notes that scale=TRUE divides by the standard deviation when
#center=TRUE, and center=TRUE subtracts the column mean from each number.
#This is the same thing we did in the z-score function.
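A quick check that scale() and the hand-rolled z-score calculation agree; note that scale() returns a one-column matrix, so we drop it to a plain vector first:

```r
x <- c(1, 2, 3, 4, 10)   #arbitrary example values
manual    <- (x - mean(x)) / sd(x)
via.scale <- as.vector(scale(x, center = TRUE, scale = TRUE))
all.equal(manual, via.scale)  #TRUE
```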

#Table 6.1 has expected frequencies for the normal distribution in column 2 for a sample of 1000 individuals.
#We can generate this with pnorm() by thinking about what the class marks mean.
#Because the class marks are separated by 0.5, the class boundaries sit 0.25 on
#either side of each class mark, starting at -5.25.
boundaries<-seq(from=-5.25,
                to=5.25,
                by=0.5)

pnorm.results<-pnorm(boundaries)
expected.freqs<-abs(diff(pnorm.results))
cbind(boundaries[-length(boundaries)]+0.25,
      #this takes the last entry off because we only need the lower bound for each class,
      #and adds 0.25 to get the class mark.
      round(expected.freqs*1000, 1))
#Then you can have a look at the table and see that it gives the same results as column 2 in Table 6.1.
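A sanity check on the Table 6.1 reconstruction: the expected frequencies should account for essentially all 1000 individuals, since almost no area lies beyond ±5.25 standard deviations:

```r
boundaries     <- seq(from = -5.25, to = 5.25, by = 0.5)
expected.freqs <- diff(pnorm(boundaries))
sum(expected.freqs * 1000)   #a whisker under 1000
```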

Tuesday, November 29, 2016

Using R to work through Sokal and Rohlf's Biometry: Chapter 5 (Introduction to Probability Distributions), section 5.4 and selected exercises

Previously in this series: Chapter 5 (Introduction to Probability Distributions), section 5.3.


Section 5.4: Other Discrete Probability Distributions

?dhyper #hypergeometric
?dmultinom #multinomial
?dnbinom #negative binomial
#There are also options in various packages later for regression to use different distributions,
#although of these I've only ever used negative binomial.

#The logarithmic distribution requires an extra package.
#https://cran.r-project.org/web/views/Distributions.html
#extraDistr, VGAM, or gamlss.dst
#I installed extraDistr to test it out.
library(extraDistr)
?LogSeries
#and
?dlgser #The distributions in this package follow the same use of the d, p, and q prefixes.


Exercises 5

#I'm doing the exercises that require new coding beyond what we have done already.

#Exercise 5.5
#The organism is present or absent in any given slide, assuming the person is diseased
#(if the person has the disease, their samples contain the organism but it's not very common).
#We want a false negative <1% of the time.  The organism is visible in 20% of the slides.
#At first I thought this was a Bayesian problem but I don't see how to do it that way.
#p=0.2 (organism visible), thus q=0.8 (organism present but not visible).
#We need to find the power of q=0.8 that equals 0.01 or smaller
#(1% false negative, which would be present but not visible).
#So, I made a while loop that cycles through to find what power we need to raise 0.8 to
#in order to get down to 0.01.  This explanation helped me figure out how to do this type of loop.

n<-0
i<-1
while (i > 0.01) {
  print(n)
  i<-0.8^n
  n<-n+1
}
#The last number printed is the number of slides needed.
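The loop's answer can be cross-checked in closed form: we want the smallest n with 0.8^n <= 0.01, and taking logs of both sides gives n >= log(0.01)/log(0.8):

```r
(n.slides <- ceiling(log(0.01) / log(0.8)))  #21
0.8^n.slides        #at or below 0.01
0.8^(n.slides - 1)  #still above 0.01, so one fewer slide wouldn't be enough
```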

Tuesday, November 15, 2016

Using R to work through Sokal and Rohlf's Biometry: Chapter 5 (Introduction to Probability Distributions), section 5.3

Previously in this series: Chapter 5 (Introduction to Probability Distributions: Binomial and Poisson), sections 5.1-5.2.

Section 5.3: the Poisson distribution

?Distributions #Let's look in help again to find the Poisson distribution.
?dpois #We can see that R does similar calculations for Poisson in addition to the binomial distribution.

#To get column 3 (expected absolute frequencies) in table 5.5,
#there is the recursion formula given in equations 5.11 and 5.12.
#5.12 is for calculations with sample means to get the relative expected frequency,
#and in the text below it notes what term to add to get the absolute ones given in column 3 of table 5.5

abs.expected.freq.pois<-400*dpois(c(0:9),
                                  lambda=1.8,
                                  log=FALSE)

#I also want to do equation 5.12 manually for relative expected frequencies.
#This is not normally needed because R does it so nicely with the base stats function dpois().

#Some reading:
#http://stackoverflow.com/questions/5273857/are-recursive-functions-used-in-r
#this one has a simple one for factorials that makes it clearest to me:
#http://www.programiz.com/r-programming/recursion

rel.exp.freq.pois<-function(samplemean, i) {
  if (i==0) return (exp(-samplemean))
  else      return (rel.exp.freq.pois(samplemean,
                                      i-1)*(samplemean/i))
}

#To get absolute frequencies, multiply by 400.
#For one example:
400*rel.exp.freq.pois(1.8, 1)

#To get all the frequencies at once, use lapply.
rel.freq.pois<-unlist(lapply(c(0:9),
                                     rel.exp.freq.pois,
                                     samplemean=1.8))
(abs.freq.pois<-400*rel.freq.pois)
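The recursion should reproduce dpois() exactly; here is a self-contained check (using sapply() as a shortcut for unlist(lapply(...))):

```r
rel.exp.freq.pois <- function(samplemean, i) {
  if (i == 0) return(exp(-samplemean))
  rel.exp.freq.pois(samplemean, i - 1) * (samplemean / i)
}

#Because samplemean is matched by name, sapply feeds 0:9 into i.
manual <- sapply(0:9, rel.exp.freq.pois, samplemean = 1.8)
all.equal(manual, dpois(0:9, lambda = 1.8))  #TRUE
```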

#On page 82, they show how to calculate the coefficient of dispersion.  Here is a function that will do it.
coefficient.of.dispersion<-function(data) {
  var(data)/mean(data)
}
#You can input any set of data here and get the coefficient of dispersion.
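For example (with simulated data, since the section's table isn't entered here), counts drawn from a Poisson distribution should give a coefficient of dispersion close to 1, because the Poisson variance equals its mean:

```r
coefficient.of.dispersion <- function(data) {
  var(data) / mean(data)
}

set.seed(42)                      #for reproducibility
x <- rpois(10000, lambda = 1.8)   #simulated Poisson counts
coefficient.of.dispersion(x)      #close to 1, as expected for Poisson data
```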

#Figure 5.3 shows Poisson distributions with different means.
#We can make this with dpois() generating the y data.
#Oddly, to add the extra curves, it is easiest to use points() with
#type="l" (for lines).
plot(dpois(c(0:18),
      lambda=0.1,
      log=FALSE),
     type="l",
     ylab="Relative expected frequency",
     xlab="Number of rare events per sample")
#Add the remaining curves in a loop rather than repeating the call.
for (lam in c(1, 2, 3, 10)) {
  points(dpois(c(0:18),
               lambda=lam,
               log=FALSE),
         type="l")
}

#The remaining tables and examples in this section do not add anything new to code.