I wanted to make a legend that included both points and lines. I wasn't sure how hard that would be (maybe it would take two legends, who knew!). A quick Google search turned up a simple way to do it.
You need to have made a plot already, but once you have one, this code works.
legend( x="topright", #places it in the top right; see help via ?legend for other options.
bty="n", #no box around legend
cex=0.9, #slightly smaller than the actual plot symbols and text
legend=c(expression("Critical" ~ chi^2), #Fancy symbols and superscripts in this one!
"Pointy points",
"Gray line",
"Round points"), #Legend specifies what text shows.
col=c("black","black",
"gray","black"), #Colors for each symbol.
lwd=2, #Line widths and border widths for both points and lines.
lty=c(1,NA,1,NA), #The line types for each line.
pch=c(NA,5,NA,1), #The symbols for each point
merge=FALSE ) #does NOT merge points and lines.
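The legend call assumes an existing plot. Here is a minimal, made-up plot (my own sketch, not from the original post) with four series matching the four legend entries, so the legend call above can be drawn on top of it:

```r
#Hypothetical data purely to give the legend something to sit on.
x <- seq(from=2, to=10, by=0.5)
plot(x, dchisq(x, df=3), type="l", lty=1, lwd=2, col="black",
     xlab="Statistic", ylab="Density")       #solid black line (the "Critical" chi-squared entry)
points(x, dchisq(x, df=4), pch=5)            #pointy (diamond) symbols
lines(x, dchisq(x, df=5), lwd=2, col="gray") #gray line
points(x, dchisq(x, df=6), pch=1)            #round (circle) symbols
```

Running the legend() call above on top of this plot produces a single legend listing all four entries.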
Monday, July 28, 2014
Friday, July 25, 2014
A sunny, warm morning on the prairie
Wednesday, July 23, 2014
Another copper butterfly
This Gray Copper (Lycaena dione) was one of many in a fine prairie filled with Baird's Sparrows and Chestnut-collared Longspurs. It was a good day for insects as I also saw the Peck's Skipper and several odonates (I am still working on identifying those). Near Brooks, Alberta, 13 July 2014.
Saturday, July 19, 2014
"Small classy skipper"
Monday, July 14, 2014
Morning thunderstorm
Saturday, July 12, 2014
Ruddy Copper butterfly
It's finally getting warm enough for butterflies to be out on the regular. This Ruddy Copper (Lycaena rubidus) is a gossamer-winged butterfly (family Lycaenidae), the family that also includes the blues. Near Brooks, Alberta, 10 July 2014.
This is the underside of the same individual. Very pretty fellow and a new species for me. I think we are somewhat near the edge of its range here in Alberta. Near Brooks, Alberta, 10 July 2014.
Thursday, July 10, 2014
Changing axis tick mark labels to arbitrary values
Sometimes when you are making a figure in R, the automatic axis tick marks won't look like you want. Let's generate some sample data and a histogram and see its automatic labeling.
decimals<-c(0.12,0.15,0.22,0.37,0.25,0.4,0.5,0.2)
hist(decimals,
breaks=seq(from=0.05, to=0.55, by=0.1),
col="gray",
main="",
xlab="Proportion of things")
Now try it while suppressing the x-axis so we can add our own labels.
hist(decimals,
breaks=seq(from=0.05, to=0.55, by=0.1),
col="gray",
main="",
xlab="Percent of things",
xaxt="n") #suppress the x-axis.
axis(side=1, #from help: "1=below, 2=left, 3=above and 4=right."
las=1,
at=c(0.1,0.3,0.5), #desired intervals if you want to change that too
labels=c("10%", "30%", "50%"))#re-add the x-axis with desired labels at whatever intervals (here every 0.2).
#You can make the labels any arbitrary value you wish,
# so you could label 1-12 as January-December too.
#If you want to automate labeling, use seq().
hist(decimals,
breaks=seq(from=0.05, to=0.55, by=0.1),
col="gray",
main="",
xlab="Percent of things",
xaxt="n") #suppress the x-axis.
axis(side=1,
las=1,
at=c(seq(from=0.1, to=0.5, by=0.1)),
#Use seq in both at and labels (the intervals must match)
#or you get an error: "'at' and 'labels' lengths differ"
labels=c(100*seq(from=0.1, to=0.5, by=0.1)))
#Multiply the sequence you created in "at" to get percentages.
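As the comments above note, the labels can be anything. Here is a small made-up example (not from the original post) that labels ticks 1 through 12 with month names, using base R's built-in month.abb vector:

```r
#Hypothetical monthly counts, just to illustrate arbitrary labels.
counts <- c(2,4,7,12,15,18,17,14,9,5,3,1)
plot(1:12, counts, type="b", xlab="", ylab="Number of things", xaxt="n")
axis(side=1,
     at=1:12,          #one tick per month number
     labels=month.abb, #"Jan" through "Dec", built into base R
     las=2)            #rotate labels perpendicular to the axis
```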
Tuesday, July 08, 2014
That bastard (toadflax)
This mysterious forb with a berry-like fruit looked so familiar. Near Brooks, Alberta, 02 July 2014.
Eventually I realized the leaves looked rather like that of Bastard Toadflax (Comandra umbellata). Mom checked it out for me in the north-central Texas flora. Sure enough, that was it. Here are the flowers a month earlier. Also somewhere near Brooks, Alberta, 02 June 2014.
Sunday, July 06, 2014
Wild rose country
Friday, July 04, 2014
Brown-headed Cowbird hatching at a Vesper Sparrow nest
Wednesday, July 02, 2014
Using R to work through Sokal and Rohlf's Biometry: Chapter 4 (Descriptive Statistics), sections 4.1-4.5
Previously in this series: Chapters 2 and 3 in Sokal and Rohlf's Biometry.
I've decided to start putting these up in sections instead of by chapter. Chapter 4 covers descriptive statistics: statistics of location (such as the mean and median, which show the central tendency of your sample) and statistics of dispersion (such as the range and standard deviation). I'll cover sections 4.1-4.5, which deal with measures of central tendency and a bit about sample statistics.
The book notes that it will use "ln" for the natural logarithm and "log"
for the common base-10 logarithm. Recall that R's log() gives the natural logarithm;
log10() gives the base-10 logarithm.
Section 4.1: arithmetic mean
#The arithmetic mean is the sum of all the sample values
#divided by the sample size (number of items in the sample).
#We'll test it using the aphid femur data from Chapter 2.
aphid.femur<-c(3.8,3.6,4.3,3.5,4.3,
3.3,4.3,3.9,4.3,3.8,
3.9,4.4,3.8,4.7,3.6,
4.1,4.4,4.5,3.6,3.8,
4.4,4.1,3.6,4.2,3.9)
aphid.average<-sum(aphid.femur)/length(aphid.femur)
#sum() adds up everything in the vector. length() tells you how many values are in it
#(i.e., the sample size).
#Now that we know the mechanics, let's use R's function for it.
(aphid.Raverage<-mean(aphid.femur))
#Same value! Yay! Box 4.1 does additional things with these data beyond mean
#(sum of squares, variance, and standard deviation),
#but we won't worry about this until later in the chapter.
#Next in the text they refer to Box 4.2,
#where we can calculate means from a frequency distribution.
#They give two options for these calculations: non-coded and coded (we'll go through
#this later in section 4.8).
#For non-coded, sum the class marks by frequency.
#The easiest way to do this seemed to be to make a vector of both and multiply and sum.
classmark<-seq(from=59.5, to=171.5, by=8)
frequencies<-c(2,6,39,385,888,1729,2240,2007,1233,641,201,74,14,5,1)
samplesize<-sum(frequencies) #This confirms that we entered the data correctly, and gets our sample size.
#Multiply classmark and frequencies to get the sums for each class.
classsums<-classmark*frequencies
#To look at all this stuff together, combine it into a dataset.
birthweights<-data.frame(cbind(classmark, frequencies, classsums))
summing<-sum(classmark*frequencies)
#Confusingly, R displays the answer as 1040200, but it is only rounding for display
#(seven significant digits by default). RStudio's global environment shows that R
#has stored the full value, 1040199.5, and that is what it uses in further calculations.
(mean.birthweights<-summing/samplesize)
#This is confirmed when our average birthweight calculation is 109.8996, as the book shows.
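To see the stored value without RStudio, you can simply ask print() for more digits (a quick check of my own, not from the book):

```r
classmark<-seq(from=59.5, to=171.5, by=8)
frequencies<-c(2,6,39,385,888,1729,2240,2007,1233,641,201,74,14,5,1)
summing<-sum(classmark*frequencies)
print(summing)            #shows 1040200: rounded to the default 7 significant digits
print(summing, digits=10) #shows 1040199.5, the value R actually stored
```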
#Weighted averages are up next. You use these if means or other stats are
#based on different sample sizes or for "other reasons".
#I'd assume the "other reasons" are things like weighting values for study-design
#or reliability considerations, which are beyond the scope of this simple discussion
#of how to calculate a weighted average.
#We can do it by hand using equation 4.2 as given on pg. 43.
values<-c(3.85,5.21,4.70)
weights<-c(12,25,8)
(weighted.mean<-sum(weights*values)/sum(weights))
#Yep, right answer. R has a function for this, though.
r.weighted.mean<-weighted.mean(values, weights)
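A quick sanity check of my own (not in the book) that the hand calculation and the built-in agree:

```r
values<-c(3.85,5.21,4.70)
weights<-c(12,25,8)
by.hand<-sum(weights*values)/sum(weights)
#The built-in result should match the hand calculation exactly.
stopifnot(isTRUE(all.equal(by.hand, weighted.mean(values, weights))))
round(by.hand, 4) #4.7567
```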
Section 4.2: other means
#This section covers the geometric and harmonic means.
#The geometric mean is the nth root of the product of all the values.
#The product is usually a huge number, so logarithms are used instead.
#Base R does not have a built-in function for the geometric mean.
geometric.mean.myfunction<-function (dataset) {
exp(mean(log(dataset)))
}
#Test it with the aphid femur data; answers are on pg. 45.
geometric.mean.myfunction(aphid.femur)
#It works! This website http://stackoverflow.com/questions/2602583/geometric-mean-is-there-a-built-in points out that 0 values will give -Inf,
#but it gives code to fix it if you need that.
#The book indicates that geometric means are useful for growth rates measured at equal time intervals.
#Harmonic means are used for rates given as reciprocals such as meters per second.
harmonic.mean.myfunction<-function (dataset) {
1/(
sum(1/dataset)/length(dataset)
)
}
harmonic.mean.myfunction(aphid.femur)
#Yep, works too!
#The R package psych has both built in.
library(psych)
geometric.mean(aphid.femur)
harmonic.mean(aphid.femur)
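As a sanity check of my own (not in the book or the post), the log form gives the same number as the literal nth root of the product:

```r
aphid.femur<-c(3.8,3.6,4.3,3.5,4.3, 3.3,4.3,3.9,4.3,3.8,
               3.9,4.4,3.8,4.7,3.6, 4.1,4.4,4.5,3.6,3.8,
               4.4,4.1,3.6,4.2,3.9)
direct<-prod(aphid.femur)^(1/length(aphid.femur)) #literal nth root of the product
via.logs<-exp(mean(log(aphid.femur)))             #the log form used above
stopifnot(isTRUE(all.equal(direct, via.logs)))    #identical up to floating-point error
```

The log form matters for larger samples, where prod() can overflow even though the geometric mean itself is a modest number.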
Section 4.3: the median
#The median is simple to get (using their two examples on pg. 45):
median(c(14,15,16,19,23))
median(c(14,15,16,19))
#The example on pg. 46 shows how to find the median if your data are shown
#in a frequency table (in classes) already. If we counted to the middle class,
#we'd get 4.0 as the median. When you use their cumulative frequency, however,
#you correctly get 3.9 as median as we do when using R's function.
median(aphid.femur)
#This section also discusses other ways to divide the data.
#Quartiles cut the data into 25% sections. The second quartile is the median.
#You can get this with the summary() function (along with the minimum, maximum, and mean).
summary(aphid.femur)
#R has an additional function called quantile().
quantile(aphid.femur, probs=c(0.25,.5,0.75))
#It can get surprisingly complicated. If you'll look at the help file
?quantile
#there are nine types of quantiles to calculate.
#Type 7 is the default. The help file indicates that type 3 gives you results like you'd get from SAS.
#This website compares results for each from a sample small dataset: http://tolstoy.newcastle.edu.au/R/e17/help/att-1067/Quartiles_in_R.pdf
#Interestingly, boxplot(), which graphs the quartiles, doesn't always show quite the
#same values as summary() or any of the quantile types (its hinges are computed slightly differently).
boxplot(aphid.femur)
#If you read the help file, you'll see that these different quantiles result from different estimates of
#the frequency distribution of the data.
#The help file cites the paper comparing these definitions (Hyndman & Fan 1996).
#Help says that type 8 is the one recommended in that paper
#(you'll also note that Hyndman is one of the authors of this function).
#The default is type 7, however. Just to see what we get:
quantile(aphid.femur, probs=c(0.25,0.5,0.75), type=8)
#It's slightly different from our original result.
#Biometry doesn't go into any of these different estimators, and I am not currently qualified
#to suggest any, so I'd either go with one of the defaults or the one suggested by Hyndman and Fan in their paper.
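To see how much the nine algorithms actually disagree on these data, a one-liner of my own computes the first quartile under every type:

```r
aphid.femur<-c(3.8,3.6,4.3,3.5,4.3, 3.3,4.3,3.9,4.3,3.8,
               3.9,4.4,3.8,4.7,3.6, 4.1,4.4,4.5,3.6,3.8,
               4.4,4.1,3.6,4.2,3.9)
#First quartile under each of the nine algorithms; unname() drops the "25%" label.
q1<-sapply(1:9, function(tp) unname(quantile(aphid.femur, probs=0.25, type=tp)))
q1
```

On a sample this small the types differ by at most a few hundredths, which is why the choice rarely matters in practice.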
Section 4.4: the mode
#I love that the mode is so named because it is the "most 'fashionable' value".
#In other words, the number that shows up the most wins here.
#R's mode() function returns an object's storage mode, not the statistical mode,
#so we must resort to other methods. I found this suggestion online.
Mode <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
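Before applying it to real data, a toy check of my own shows what the function returns and how it breaks ties (which.max() picks the first maximum):

```r
#Same helper as above, repeated so this snippet stands alone.
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
Mode(c(1,2,2,3))   #returns 2, the most frequent value
Mode(c(3,3,2,2,1)) #tie between 3 and 2: returns 3, the first to reach the max count
```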
#Figure 4.1 shows how the mean, median, and mode can differ using the Canadian cattle butterfat example
#that we first used in Chapter 2, exercise 7.
e2.7<-c(4.32, 4.25, 4.82, 4.17, 4.24, 4.28, 3.91, 3.97, 4.29, 4.03, 4.71, 4.20,
4.00, 4.42, 3.96, 4.51, 3.96, 4.09, 3.66, 3.86, 4.48, 4.15, 4.10, 4.36,
3.89, 4.29, 4.38, 4.18, 4.02, 4.27, 4.16, 4.24, 3.74, 4.38, 3.77, 4.05,
4.42, 4.49, 4.40, 4.05, 4.20, 4.05, 4.06, 3.56, 3.87, 3.97, 4.08, 3.94,
4.10, 4.32, 3.66, 3.89, 4.00, 4.67, 4.70, 4.58, 4.33, 4.11, 3.97, 3.99,
3.81, 4.24, 3.97, 4.17, 4.33, 5.00, 4.20, 3.82, 4.16, 4.60, 4.41, 3.70,
3.88, 4.38, 4.31, 4.33, 4.81, 3.72, 3.70, 4.06, 4.23, 3.99, 3.83, 3.89,
4.67, 4.00, 4.24, 4.07, 3.74, 4.46, 4.30, 3.58, 3.93, 4.88, 4.20, 4.28,
3.89, 3.98, 4.60, 3.86, 4.38, 4.58, 4.14, 4.66, 3.97, 4.22, 3.47, 3.92,
4.91, 3.95, 4.38, 4.12, 4.52, 4.35, 3.91, 4.10, 4.09, 4.09, 4.34, 4.09)
Mode(e2.7) #define the data before calling Mode() on it
hist(e2.7,
breaks=seq(from=3.45, to=5.05, length.out=17),
#To match the figure in the book, the end classes are centered on 3.5 and 5.0,
#hence classes run from 3.45-3.55 up to 4.95-5.05.
#That makes 16 classes (17 breaks); "by=0.1" would do the same job as length.out.
main="",
xlab="Percent butterfat",
ylab="Frequency",
ylim=c(0,20))
abline(v=mean(e2.7), lty="longdash")
abline(v=median(e2.7), lty="dashed")
abline(v=Mode(e2.7), lty="dotdash")
Section 4.5: sample statistics and parameters
#This section is here to remind you that we are calculating sample statistics
#as estimates of population parameters. The number we get for the mean,
#for example, is not the actual population mean unless we measured
#every single individual in the population.