A Simple Guide to Data Wrangling with the tidyverse
Set up
As always, load the package first
# install.packages("tidyverse")
library(tidyverse)The tidyverse is a collection of the following packages:
tidyverse_packages()## [1] "broom" "cli" "crayon" "dbplyr"
## [5] "dplyr" "dtplyr" "forcats" "googledrive"
## [9] "googlesheets4" "ggplot2" "haven" "hms"
## [13] "httr" "jsonlite" "lubridate" "magrittr"
## [17] "modelr" "pillar" "purrr" "readr"
## [21] "readxl" "reprex" "rlang" "rstudioapi"
## [25] "rvest" "stringr" "tibble" "tidyr"
## [29] "xml2" "tidyverse"
Learn more: https://www.tidyverse.org/packages
Tibbles
“Modern data frames”
Full definition: https://tibble.tidyverse.org
Creating a tibble
Example: Population (mil. people) and land area (mil. of km2) of North American countries
- “Column-wise” creation (similar to old
data.frame())
country <- tibble(
country = c("usa", "canada", "mexico"),
population = c(331.00, 37.74, 128.93),
area = c(9.09, 9.15, 1.94)
)country| country | population | area |
|---|---|---|
| usa | 331.00 | 9.09 |
| canada | 37.74 | 9.15 |
| mexico | 128.93 | 1.94 |
- “Row-wise” creation
country <- tribble(
~country, ~population, ~area,
"usa", 331.00, 9.09,
"canada", 37.74, 9.15,
"mexico", 128.93, 1.94
)country| country | population | area |
|---|---|---|
| usa | 331.00 | 9.09 |
| canada | 37.74 | 9.15 |
| mexico | 128.93 | 1.94 |
If you have data to import, use read_csv() (for comma-delimited files) which stores the data as a tibble. read.csv() in base gives a data.frame object instead.
read_csv("your_file_path/data_file_name.csv")For other types of data file (tsv, fwf…), see https://readr.tidyverse.org/.
%>%
The “pipe” operator
- From
maggritrpackage, automatically loaded when loading the wholetidyverse
- From
Perform a sequence of multiple operations
# the workflow
object %>%
operation1(...) %>%
operation2(...) %>%
.
.
.More on this later…
The following data verbs come from the
dplyrpackage of thetidyverse.
mutate()
Create new variables from previously-existed variables
mutate(new_variable = f(old variables))What could the function
fbe?Math (arithmetic)
Logical condition
Text (strings) operations (with
stringr)Variable type conversion (e.g. character to factor, numeric to factor…)
… and many more tasks
Example: country
Convert area from km2 to mi2 and then calculate the population density (people per km2)
country %>%
mutate(area_mi = area * 0.386102,
density = population / area)| country | population | area | area_mi | density |
|---|---|---|---|---|
| usa | 331.00 | 9.09 | 3.5096672 | 36.41364 |
| canada | 37.74 | 9.15 | 3.5328333 | 4.12459 |
| mexico | 128.93 | 1.94 | 0.7490379 | 66.45876 |
Create an indicator (dummy varible) for whether the population is over 100 mil people
(Here let’s also store the new data into a new object, called country_new, as you might want to do that in some situations.)
?if_elsecountry_new <- country %>%
mutate(hundred = if_else(population > 100, "yes", "no"))country_new| country | population | area | hundred |
|---|---|---|---|
| usa | 331.00 | 9.09 | yes |
| canada | 37.74 | 9.15 | no |
| mexico | 128.93 | 1.94 | yes |
If you have more than two categories, you can still use if_else, but now the statements are “nested”
Divide the population into 3 categories: Less than 100, 100-200, and More than 200
country %>%
mutate(pop_cat = if_else(population < 100, "Less than 100",
if_else(population > 200, "More than 200", "100-200")))| country | population | area | pop_cat |
|---|---|---|---|
| usa | 331.00 | 9.09 | More than 200 |
| canada | 37.74 | 9.15 | Less than 100 |
| mexico | 128.93 | 1.94 | 100-200 |
When there’s a large number of categories, a better solution here is to use case_when()
?case_whencountry %>%
mutate(pop_cat = case_when(population < 100 ~ "Less than 100",
population >= 100 & population <= 200 ~ "100-200",
population > 200 ~ "More than 200"))| country | population | area | pop_cat |
|---|---|---|---|
| usa | 331.00 | 9.09 | More than 200 |
| canada | 37.74 | 9.15 | Less than 100 |
| mexico | 128.93 | 1.94 | 100-200 |
Convert country from character to factor
country %>%
mutate(country = factor(country))| country | population | area |
|---|---|---|
| usa | 331.00 | 9.09 |
| canada | 37.74 | 9.15 |
| mexico | 128.93 | 1.94 |
Working with strings
stringrpackage of thetidyversemakes working with strings easy.
Example: FIFA (football/soccer) rankings
fifa <- tibble(
country = c("Belgium (1.)",
"Brazil (2.)",
"France (3.)",
"England (4.)",
"Italy (5.)")
)fifa| country |
|---|
| Belgium (1.) |
| Brazil (2.) |
| France (3.) |
| England (4.) |
| Italy (5.) |
Get the first three-letters abbreviation for the countries
fifa %>%
mutate(abbrev = str_sub(country, start = 1, end = 3))| country | abbrev |
|---|---|
| Belgium (1.) | Bel |
| Brazil (2.) | Bra |
| France (3.) | Fra |
| England (4.) | Eng |
| Italy (5.) | Ita |
Remove the space between the name and parentheses from the country names
str_remove()
fifa %>%
mutate(country = str_remove(country, " "))| country |
|---|
| Belgium(1.) |
| Brazil(2.) |
| France(3.) |
| England(4.) |
| Italy(5.) |
(Note that naming the new variable the same as the old variable will overwrite the original column)
or str_replace
# replace " " with an empty character
fifa %>%
mutate(country = str_replace(country, " ", ""))| country |
|---|
| Belgium(1.) |
| Brazil(2.) |
| France(3.) |
| England(4.) |
| Italy(5.) |
If there are multiple spaces and we’d like to remove them all, use str_remove_all() or str_replace_all().
Regular expression also works here
# \s: "whitespace character"
fifa %>%
mutate(country = str_remove(country, "\\s")) | country |
|---|
| Belgium(1.) |
| Brazil(2.) |
| France(3.) |
| England(4.) |
| Italy(5.) |
Learn more about stringr and regex in R.
select()
- Subsetting columns
Example: country
Select the population and area (i.e. omit the country names)
country %>%
select(population, area)| population | area |
|---|---|
| 331.00 | 9.09 |
| 37.74 | 9.15 |
| 128.93 | 1.94 |
country %>%
select(-country)| population | area |
|---|---|
| 331.00 | 9.09 |
| 37.74 | 9.15 |
| 128.93 | 1.94 |
Selection helpers: https://tidyselect.r-lib.org/reference
(I should start using these more)
filter()
- Subsetting rows where conditions are true
What are the countries with more than 2 mil km2 in land area?
country %>%
filter(area > 2)| country | population | area |
|---|---|---|
| usa | 331.00 | 9.09 |
| canada | 37.74 | 9.15 |
What are the countries with more than 2 mil km2 in land area AND population of more than 100 mil people?
# and
country %>%
filter(area > 2 & population > 100)| country | population | area |
|---|---|---|
| usa | 331 | 9.09 |
What are the countries with more than 2 mil km2 in land area OR population of more than 100 mil people?
# or
country %>%
filter(area > 2 | population > 100)| country | population | area |
|---|---|---|
| usa | 331.00 | 9.09 |
| canada | 37.74 | 9.15 |
| mexico | 128.93 | 1.94 |
- Remove missing values
Example: New version of FIFA rankings data, where country and rank have their own columns
fifa_new <- tibble(
country = c("BEL", "BRA", "FRA", NA, "ITA"),
rank = c(1, 2, NA, 4, 5)
)fifa_new| country | rank |
|---|---|
| BEL | 1 |
| BRA | 2 |
| FRA | NA |
| NA | 4 |
| ITA | 5 |
Remove rows with missing rank value
fifa_new %>%
filter(!is.na(rank))| country | rank |
|---|---|
| BEL | 1 |
| BRA | 2 |
| NA | 4 |
| ITA | 5 |
%in%operator (one of my favorites)
Keep countries with ranks 1, 2, and 5
fifa_new %>%
filter(rank %in% c(1, 2, 5))| country | rank |
|---|---|
| BEL | 1 |
| BRA | 2 |
| ITA | 5 |
which is equivalent to
fifa_new %>%
filter(rank == 1 | rank == 2 | rank == 5)| country | rank |
|---|---|
| BEL | 1 |
| BRA | 2 |
| ITA | 5 |
arrange()
- Order/sort rows by columns
Example: country
Arrange the cases by population (ascending order is the default argument)
country %>%
arrange(population)| country | population | area |
|---|---|---|
| canada | 37.74 | 9.15 |
| mexico | 128.93 | 1.94 |
| usa | 331.00 | 9.09 |
Descending order instead - use desc()
country %>%
arrange(desc(population))| country | population | area |
|---|---|---|
| usa | 331.00 | 9.09 |
| mexico | 128.93 | 1.94 |
| canada | 37.74 | 9.15 |
Arrange by multiple variables: population low-high (first sort), then area high-low (second sort), variable order matters
country %>%
arrange(population, desc(area))| country | population | area |
|---|---|---|
| canada | 37.74 | 9.15 |
| mexico | 128.93 | 1.94 |
| usa | 331.00 | 9.09 |
Arrange the countries by name in alphabetical order (or reverse)
country %>%
arrange(country)| country | population | area |
|---|---|---|
| canada | 37.74 | 9.15 |
| mexico | 128.93 | 1.94 |
| usa | 331.00 | 9.09 |
country %>%
arrange(desc(country))| country | population | area |
|---|---|---|
| usa | 331.00 | 9.09 |
| mexico | 128.93 | 1.94 |
| canada | 37.74 | 9.15 |
%>% revisited
So far I’ve introduced a few data wrangling functions. Let’s now go back to the idea of piping and try performing multiple operations all in one procedure.
Example: Student info - name, gender, height (cm), weight (kg)
students <- tibble(
name = c("a", "b", "c", "x", "y", "z"),
gender = c("f", "f", "m", "f", "f", "m"),
height = c(180, 170, 190, 162, 173, 155),
weight = c(76, 65, 75, 55, 69, 50)
)students| name | gender | height | weight |
|---|---|---|---|
| a | f | 180 | 76 |
| b | f | 170 | 65 |
| c | m | 190 | 75 |
| x | f | 162 | 55 |
| y | f | 173 | 69 |
| z | m | 155 | 50 |
Who are the female students that are at least 170 cm tall and weigh at most 75 kg? Arrange the student by height, from tall to short. Do not include the gender column in the output.
What should we do here? - First filter(), then arrange(), and finally select()
students %>%
filter(gender == "f" & height >= 170 & weight <= 75) %>%
arrange(desc(height)) %>%
select(-gender)| name | height | weight |
|---|---|---|
| y | 173 | 69 |
| b | 170 | 65 |
group_by() and summarize()
group_by(): perform group-wise operations defined by variable(s)summarize(): provide a row of summary for each combination of grouping variable(s)
For each gender, give summaries of number of observations, average height, and minimum weight
students %>%
group_by(gender) %>%
summarize(n_obs = n(),
avg_height = mean(height),
min_weight = min(weight)) %>%
ungroup() # ungroup unless you want to continue aggregating the data by group| gender | n_obs | avg_height | min_weight |
|---|---|---|---|
| f | 4 | 171.25 | 55 |
| m | 2 | 172.50 | 50 |
group_by()also works withmutate().
Suppose we want to add a column of average weight for each gender to the original data.
students %>%
group_by(gender) %>%
mutate(avg_weight = mean(weight)) %>%
ungroup()| name | gender | height | weight | avg_weight |
|---|---|---|---|---|
| a | f | 180 | 76 | 66.25 |
| b | f | 170 | 65 | 66.25 |
| c | m | 190 | 75 | 62.50 |
| x | f | 162 | 55 | 66.25 |
| y | f | 173 | 69 | 66.25 |
| z | m | 155 | 50 | 62.50 |
- Note: if there’s no grouping variables,
summarize()will give a summary for the overall data.
students %>%
summarize(corr = cor(height, weight))| corr |
|---|
| 0.9461962 |
base vs tidyverse
base |
tidyverse |
|---|---|
country$density <- country$population / country$area |
country %>% mutate(density = population / area) |
fifa$country <- gsub(" ", "", fifa$country) |
fifa %>% mutate(country = str_replace(country, " ", "")) |
country[country$population > 100, ] |
country %>% filter(population > 100) |
country[, c("population", "area")] |
country %>% select(population, area) |
country[order(country$population), ] |
country %>% arrange(population) |
Which one is nicer, more consistent, and more readable?
Bonus
Here are some of my favorite data wrangling functions/tricks with dplyr/tidyverse.
glimpse(): data preview
students %>%
glimpse()## Rows: 6
## Columns: 4
## $ name <chr> "a", "b", "c", "x", "y", "z"
## $ gender <chr> "f", "f", "m", "f", "f", "m"
## $ height <dbl> 180, 170, 190, 162, 173, 155
## $ weight <dbl> 76, 65, 75, 55, 69, 50
slice_head()/slice_tail(): select the first and last rows (tidyverseversions ofhead()andtail())
students %>%
slice_head(n = 3)| name | gender | height | weight |
|---|---|---|---|
| a | f | 180 | 76 |
| b | f | 170 | 65 |
| c | m | 190 | 75 |
students %>%
slice_tail(n = 3)| name | gender | height | weight |
|---|---|---|---|
| x | f | 162 | 55 |
| y | f | 173 | 69 |
| z | m | 155 | 50 |
slice_sample(): randomly samples rows from data (by number or proportion of rows)
set.seed(2)
students %>%
slice_sample(n = 2)| name | gender | height | weight |
|---|---|---|---|
| y | f | 173 | 69 |
| a | f | 180 | 76 |
set.seed(3)
students %>%
slice_sample(prop = 0.5) # helpful when doing cross-validation| name | gender | height | weight |
|---|---|---|---|
| y | f | 173 | 69 |
| b | f | 170 | 65 |
| x | f | 162 | 55 |
across(): apply a function across multiple columns
Example: students
Factor all character variables and take the log of all numeric variables
students %>%
mutate(across(where(is.character), as.factor),
across(where(is.numeric), log))| name | gender | height | weight |
|---|---|---|---|
| a | f | 5.192957 | 4.330733 |
| b | f | 5.135798 | 4.174387 |
| c | m | 5.247024 | 4.317488 |
| x | f | 5.087596 | 4.007333 |
| y | f | 5.153292 | 4.234107 |
| z | m | 5.043425 | 3.912023 |
add_count(): adds a column of counts based on the number of items within a (multiple) specified group(s).
students %>%
add_count(gender, name = "gender_count")| name | gender | height | weight | gender_count |
|---|---|---|---|---|
| a | f | 180 | 76 | 4 |
| b | f | 170 | 65 | 4 |
| c | m | 190 | 75 | 2 |
| x | f | 162 | 55 | 4 |
| y | f | 173 | 69 | 4 |
| z | m | 155 | 50 | 2 |
which is similar to… but cleaner.
students %>%
group_by(gender) %>%
mutate(gender_count = n()) %>%
ungroup()| name | gender | height | weight | gender_count |
|---|---|---|---|---|
| a | f | 180 | 76 | 4 |
| b | f | 170 | 65 | 4 |
| c | m | 190 | 75 | 2 |
| x | f | 162 | 55 | 4 |
| y | f | 173 | 69 | 4 |
| z | m | 155 | 50 | 2 |