Tuesday, November 10, 2015

Sorting, grouping, and selecting data in R

I got started sorting data in SQL, with its nice SELECT statements: you ask for the rows where some variable equals some value, and you can get distinct (unique) values. R confused me. I was delighted to find an R package that allows the use of SQL selects in R, but it can occasionally be a bit clumsy due to differences in table and object naming. I kept seeing references to dplyr as the modern way to organize data natively in R, so I have decided it is time to start learning dplyr, and it is already helping a ton.

I recommend starting with the dplyr vignette. I also found this tutorial, which uses different sample data, to be helpful. I tried out chaining (the mysterious %>% operator I have occasionally seen lurking in code) and it was fantastic. No more weird intermediate variables! The tutorial describes chaining as it comes from a different package, 'magrittr', but dplyr makes the same operator available (the help file says it was formerly '%.%' but the '%>%' version has become standard), so it worked fine even though I hadn't installed or loaded 'magrittr' myself.

So far the biggest help to me is the distinct() function, which returns the unique combinations of the columns I ask for, just as SELECT DISTINCT does in SQL. When I subset columns in base R, I instead get repeated rows of categorical data, because those rows differ only in other variables that I am not currently interested in.
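As a rough sketch of what chaining and distinct() look like together, here is a small example on R's built-in mtcars data. The data set, the column names, and the mpg cutoff are my own choices for illustration and are not taken from the vignette or the tutorial. It assumes dplyr is already installed.

```r
# Assumes dplyr is installed; mtcars ships with base R.
library(dplyr)

# Chaining: each step hands its result to the next one,
# so no intermediate variables are needed.
combos <- mtcars %>%
  filter(mpg > 20) %>%    # roughly: WHERE mpg > 20
  select(cyl, gear) %>%   # roughly: SELECT cyl, gear
  distinct() %>%          # roughly: SELECT DISTINCT
  arrange(cyl, gear)      # roughly: ORDER BY cyl, gear

print(combos)

# Base R column subsetting keeps one row per car,
# so the same cyl/gear combination shows up repeatedly.
base_subset <- mtcars[mtcars$mpg > 20, c("cyl", "gear")]
print(nrow(base_subset))
print(nrow(combos))
```

The base R result can be de-duplicated with unique(base_subset), but with the chained version the filtering, column selection, and de-duplication read top to bottom in one statement, much like a SQL query.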

