Saturday, February 22, 2014

Workflow going from Geneious to R to Inkscape for a haplotype network

I wanted to make a network of mtDNA haplotypes showing the number of mutation steps between haplotypes and showing sites or other attributes as pie charts at each node.

 

1. Creating the haplotypes


To get the haplotypes, I aligned my sequences in Geneious.  Presumably you could do this in whatever your program of choice is.  I made sure my sequences had appropriate sample numbers (to merge with attributes later in R).  I trimmed the sequences so they matched in length.  End gaps could be counted as the same length, but then you'll get lots of haplotypes that are simply different lengths of gaps at the end.  I exported the data from Geneious as a .fasta file.


2. Creating the network



Now you have a .fasta file of sequences that are all the same length.  I used code from this blog post to create the haplotype network.  It worked great using functions in the R package pegas.  The code there uses site to make up the slices of the pie, but you can also use other attributes (I used phenotype and transect) of the sampled individuals.


3a. Manually changing the colors


The haplotype network produced from step 2 has a rainbow color palette, which is pretty, but it may not fit with your desired color schemes or with black-and-white printing in a publication.  Some of the colors can be hard to tell apart too.  In my case, it was hard to interpret because I needed light and dark versions of two sets of birds within the haplotypes.  I was having trouble with the igraph package that was suggested to edit colors, so eventually I decided to just export the figure as a scalable vector graphic (.svg).

I then edited the .svg file's colors in Inkscape, a free, open-source vector graphics editor.  Once you open your figure in Inkscape, click on the figure.  That selects it.  Then you can right-click and find "ungroup" in the menu or go to the Object menu for "ungroup".  Usually then you can select individual pieces of the figure, but if not, keep ungrouping pieces of the figure until you can.

Now, you should be able to grab pieces of the pie and change their color.  You can do more than one by holding down the shift key while selecting additional pie slices and the legend (to make sure the legend color matches your pie slice color).  Just double-click on the color you want in the bar on the bottom part of the screen.  Repeat as needed for the other pie slices.

One problem I ran into was distinguishing a few of the colors on the original figure.  You can use the ink dropper tool (press F7 or find it in the toolbar on the left side of the screen) to make sure you are correctly identifying particular colors as the same ones and not accidentally merging different sites or attributes in your final figure.  By holding the ink dropper tool over a color, you show the color's hexadecimal code in a gray area below the color palette bar on the bottom of the screen.

From within Inkscape, you can then use "save as" to get your image in another vector graphic format (.ps or .eps) for publication, or use "export bitmap" (also in the File menu) to create a new raster graphic .png for publication or presentation.

3b. Updated 13 Feb. 2018: Automated colors!


I have figured out how to use the R package igraph to automate the coloring of the pie slices.

Unfortunately I haven't figured out how to put the tick marks back on for number of changes between haplotypes, but the line lengths are still proportional to number of changes.  (If anyone knows how to do that, please let me know.)

Below is a complete example that will take you from creating the network to the final plot.


library(pegas)
library(igraph)
library(plyr)
library(reshape)

########################
#Step 1a.  Load your alignment and use haploNet() to create the haplotype network.
#Load the pegas package sample data "woodmouse".
#This step is adapted from pegas help.
data(woodmouse)
set.seed(42)
x <- woodmouse[sample(15, size = 30, replace = TRUE), ]
h <- haplotype(x)
(net <- haploNet(h))
plot(net)

########################
#Step 1b.  Have sites.
#I'm going to add in some arbitrary sample attributes that we'll use for pie chart testing.
#For your actual dataset, you would load a dataframe of localities for each sample.
#The samples and their sites should be in the same order as the samples in the file you used for haplotypes in 1a.
#Here, we randomly generate assignments to sites A, B, and C for the woodmouse dataset, which now has 30 samples.
sitenames <- c("A", "B", "C")
set.seed(84)
sites <- data.frame("site" = sample(sitenames,
                                    size = 30, 
                                    replace = TRUE))
########################
#Step 2. Assign haplotypes to your site dataframe.
#This code is adapted from a blog post by Jimmy O'Donnell (jodonnellbio at gmail) with help from Kim Tenggardjaja.
#Original posting: http://jimmyodonnell.wordpress.com/2013/11/23/haplotype-networks-in-r-updated/
#Adaptations to this code by Claire Curry (curryclairem@gmail.com)

#Now, fill in the haplotypes for each site.
#You will have to do this with your real dataset.
#The site and haplotype dataframes must be in the SAME SAMPLE ORDER.

#Add empty column to sites dataframe for haplotypes
sites<-cbind(sites,
             haplotype=rep(NA,
                           nrow(sites)))
#Assign haplotypes to the samples automatically using a loop
for (i in 1:length(labels(h))){
  sites$haplotype[attr(h,
                       "index")[[i]]]<-i
}
########################
#Step 3. Convert to igraph format from haplonet.
#Here's my new adaptation to plot colors automatically in igraph, as suggested on the original blog post.
#Get igraph network
net.igraph <- as.igraph(net,
                        directed = FALSE, #Leave network arrows off as they are not relevant to our case.
                        use.labels = TRUE,
                        altlinks = FALSE) #only get main links between haplotypes or you'll end up with tons of extra lines

#Get the haplotype numbers in their weird order so you can order colors and sizes correctly.
newnums <- as.roman(as.character(V(net.igraph)$name))
V(net.igraph)$name <- newnums

########################
#Step 4.Count up haplotype frequencies and site frequencies.
#This code is adapted from a blog post by Jimmy O'Donnell (jodonnellbio at gmail) with help from Kim Tenggardjaja.
#Original posting: http://jimmyodonnell.wordpress.com/2013/11/23/haplotype-networks-in-r-updated/
#Adaptations to this code by Claire Curry (curryclairem@gmail.com)
#Count how many individuals for each haplotype
count.hap0<-ddply(sites,
                  .(haplotype),
                  summarise,
                  freq=length(haplotype))

#Be sure to add in our new ordering!
count.hap <- count.hap0[order(match(rownames(count.hap0),
                                    newnums)),"freq"]


#for each plot, build frequency matrix that tells how to fill the pies by site frequency.
dfcount_site<-ddply(sites,
                  .(site, haplotype),
                  summarise,
                  freq=length(site))
dc.site00 <- cast(haplotype ~ site,
               data = dfcount_site, 
               value = "freq", 
               fill = 0)
dc.site0 <-dc.site00
########################
#Step 5. Convert the site frequencies into the correct list format and the correct order to match the igraph object.
#Now, onto more igraph adaptations.
dc.site0$haplotype<-NULL #Remove the haplotype row before turning this into a matrix.
#Turn the haplotype frequencies into a matrix in the NEW IGRAPH ORDER.
dc.site1<-t(as.matrix(dc.site0[order(match(rownames(dc.site0),
                                       newnums)),]))
#You must do this step to match the ordering of the cast dataframe haplotypes to the ordering used in igraph,
#otherwise the colors and sizes will be off.

#Then split the matrix out into a list.
dc.site <- split(dc.site1, 
               rep(1:ncol(dc.site1), 
                   each = nrow(dc.site1)))


#Remember, if you go back to your haplotype counts to check that colors and sizes are correct,
#that we rearranged dc.site to be in the order that the haplotypes are stored
#You can check V(net.igraph)$name to see them again.
#dc.site0's haplotype column shows the haplotype numbers, but doing order(match() then puts dc.site1 and dc.site into
#the order the haplotypes are stored in the igraph network.
#Hence, we use the following step to make it more obvious by naming each list element with the haplotype number.
names(dc.site) <- V(net.igraph)$name

########################
#Step 6. Plot the network.
plot(net.igraph, #the igraph object is always a list of 10 so far as I can tell, regardless of how many haplotypes you used.
     vertex.shape="pie",
     vertex.pie = dc.site,
     #vertex.pie specifies where you are getting the pie slice counts.
     vertex.pie.color=list(c("black",
                             "orange",
                             "blue")),
     #The colors are the number of categories you have, must be in a list, in the same order as your sites.
     vertex.size = 5*(count.hap), 
     #number of samples for each haplotype, multiplied to make bigger if desired.
     #vertex.label = NA)#, #for no vertex labels (ie no hapotype numbers added)
     #vertex.label.cex = 1)#, #if you want bigger haplotype numbers.
     vertex.label.dist = 2.5) #how far from pie you want the haplotype numbers)

#Add a legend.
legend(x= "bottomleft",
       bty="n",
       pt.bg=c("black",
                "orange",
                "blue"),
       col="black",
       pch=22,
       legend=c("Site A",
                "Site B",
                "Site C"))

Tuesday, February 18, 2014

Getting summary statistics for datasets by factor in R

If you have a dataset with factors, you might want to get some descriptive summaries grouped by each factor. This took me a while to figure out in R but turned out to be reasonably simple.


categories<-rep(c("a","b"), 4)
morecategories<-rep(c("this", "that"), each=4)
thing1<-c(1,2,3,4,5,6,7,8)
thing2<-c(2,3,4,5,6,NA,8,9)
thing3<-c(3,4,5,6,7,8,9,12)
adifferentone<-c(10,1,20,4,19,2,34,1)
data<-data.frame(categories,
                 morecategories,
                       thing1,
                       thing2,
                       thing3,
                       adifferentone
                 )


#The above lines generate a small example dataset.

#View your data to ensure factors are factors, numbers are numeric, and so on.
str(data)

#The first way to get some general summary data is to use the summary() function.
summary(data)

#It doesn't give you anything grouped by your categories ("categories" and "morecategories") though.
#aggregate() will do this.
#I show it here in the formula version, with the function as mean.
#You can also use sd (standard deviation).
aggregate(data$adifferentone~data$categories+data$morecategories, FUN=mean)
aggregate(data$thing2~data$categories, FUN=mean)

#An example with a different formula and at least one NA in the data.
#Note that this automatically removes the NA from thing2; you can tell using the length function.
#(You can also use length to get the sample size.)
aggregate(data$thing2~data$categories, FUN=length)

#What if we need to average the thing columns?
#(I've needed to do this if I take more than one measurement,
#such as north, south, east, and west measurements and then average them.)

#First get the columns you want.
things<-c("thing1", "thing2", "thing3")

#data$avgthing is the new column you are putting your summary into.
#The 'things' object you just created selects columns
#(that's why it goes after the comma within the square brackets;
#before the comma selects rows.)
#Use na.rm=TRUE if you have NAs; otherwise it'll
#give you an NA for the whole row that contains the NA.
data$avgthing<-rowMeans(data[,things], na.rm=TRUE)
data$sd.thing<-apply(data[,things], MARGIN=1, sd, na.rm=TRUE)

#To get standard deviation, you need to use apply.
#MARGIN=1 means you are applying across rows.  (2 would mean columns.)
#view the dataset including the new variables you've generated (mean and standard deviation).
 
data

Tuesday, February 11, 2014

Using Windows command prompt to merge text files for later use

I'm super-busy getting ready to defend my dissertation, but I'm getting a lot of good material saved up from the process (I've gotten a lot of practice with loops lately).  Meanwhile, here's a quick trick I learned when desperately hoping I wouldn't have to concatenate 240 text files by hand.  (Spoiler: I didn't!)

First, gather up your text files into one folder.  I did a search in the folder where mine were stored in numerous subfolders using *spacing*.txt to get all .txt files with "spacing" in the name.  This got me my 240 files.  I copied them into a new folder in case something went wrong and to get them all into one place for the next step.

Open up the Windows command prompt (cmd.exe).  Navigate to the folder where you are storing all of your text files using the cd command.  I had my desired folder open, copied the folder address, and pasted it into the command prompt.  I had to right-click to paste (ctrl-v didn't work).  Here is the command prompt when I've pasted into the folder address.  "cd" is needed before your address.







 Press enter, and then we've navigated to the desired folder:






Now, input your command:


copy /b *.txt merged.txt

The purpose of  the /b stumped for me for a while (it was just in the original instructions that I found), but it appears that it specifies a binary file.  You can find more info on variations on copy that might suit your needs here.

You should end up with all the data from your original files (which should be still intact in that folder) into a single new text file named "merged.txt".  Hours of copy/paste saved!  Yay!  (I was happy, for sure.)

In my case, every single text file already had a heading, so I opened the new merged file in LibreOffice Calc (Microsoft Excel or any other spreadsheet does just as well) and did a variety of sorts to clean up the new file and delete most of those heading rows.  Then my file was ready for other uses!

2015/11/06 Edited to add: Now that I have a computer with a partitioned hard drive, I discovered I couldn't use cd to get to my E:/ drive.  A quick search indicated I just need to type the E: and press enter, then proceed as above.