A Simple Guide to Data Wrangling with the tidyverse
Set up
As always, load the package first
# install.packages("tidyverse")
library(tidyverse)
The tidyverse
is a collection of the following packages:
tidyverse_packages()
## [1] "broom" "cli" "crayon" "dbplyr"
## [5] "dplyr" "dtplyr" "forcats" "googledrive"
## [9] "googlesheets4" "ggplot2" "haven" "hms"
## [13] "httr" "jsonlite" "lubridate" "magrittr"
## [17] "modelr" "pillar" "purrr" "readr"
## [21] "readxl" "reprex" "rlang" "rstudioapi"
## [25] "rvest" "stringr" "tibble" "tidyr"
## [29] "xml2" "tidyverse"
Learn more: https://www.tidyverse.org/packages
Tibbles
“Modern data frames”
Full definition: https://tibble.tidyverse.org
Creating a tibble
Example: Population (mil. people) and land area (mil. of \(km^2\)) of North American countries
- “Column-wise” creation (similar to old
data.frame()
)
<- tibble(
country country = c("usa", "canada", "mexico"),
population = c(331.00, 37.74, 128.93),
area = c(9.09, 9.15, 1.94)
)
country
country | population | area |
---|---|---|
usa | 331.00 | 9.09 |
canada | 37.74 | 9.15 |
mexico | 128.93 | 1.94 |
- “Row-wise” creation
<- tribble(
country ~country, ~population, ~area,
"usa", 331.00, 9.09,
"canada", 37.74, 9.15,
"mexico", 128.93, 1.94
)
country
country | population | area |
---|---|---|
usa | 331.00 | 9.09 |
canada | 37.74 | 9.15 |
mexico | 128.93 | 1.94 |
If you have data to import, use read_csv()
(for comma-delimited files) which stores the data as a tibble
. read.csv()
in base
gives a data.frame
object instead.
read_csv("your_file_path/data_file_name.csv")
For other types of data file (tsv, fwf…), see https://readr.tidyverse.org/.
%>%
The “pipe” operator
- From
maggritr
package, automatically loaded when loading the wholetidyverse
- From
Perform a sequence of multiple operations
# the workflow
%>%
object operation1(...) %>%
operation2(...) %>%
.
. .
More on this later…
The following data verbs come from the
dplyr
package of thetidyverse
.
mutate()
Create new variables from previously-existed variables
mutate(new_variable = f(old variables))
What could the function
f
be?Math (arithmetic)
Logical condition
Text (strings) operations (with
stringr
)Variable type conversion (e.g. character to factor, numeric to factor…)
… and many more tasks
Example: country
Convert area from \(km^2\) to \(mi^2\) and then calculate the population density (people per \(km^2\))
%>%
country mutate(area_mi = area * 0.386102,
density = population / area)
country | population | area | area_mi | density |
---|---|---|---|---|
usa | 331.00 | 9.09 | 3.5096672 | 36.41364 |
canada | 37.74 | 9.15 | 3.5328333 | 4.12459 |
mexico | 128.93 | 1.94 | 0.7490379 | 66.45876 |
Create an indicator (dummy varible) for whether the population is over 100 mil people
(Here let’s also store the new data into a new object, called country_new
, as you might want to do that in some situations.)
?if_else
<- country %>%
country_new mutate(hundred = if_else(population > 100, "yes", "no"))
country_new
country | population | area | hundred |
---|---|---|---|
usa | 331.00 | 9.09 | yes |
canada | 37.74 | 9.15 | no |
mexico | 128.93 | 1.94 | yes |
If you have more than two categories, you can still use if_else
, but now the statements are “nested”
Divide the population into 3 categories: Less than 100, 100-200, and More than 200
%>%
country mutate(pop_cat = if_else(population < 100, "Less than 100",
if_else(population > 200, "More than 200", "100-200")))
country | population | area | pop_cat |
---|---|---|---|
usa | 331.00 | 9.09 | More than 200 |
canada | 37.74 | 9.15 | Less than 100 |
mexico | 128.93 | 1.94 | 100-200 |
When there’s a large number of categories, a better solution here is to use case_when()
?case_when
%>%
country mutate(pop_cat = case_when(population < 100 ~ "Less than 100",
>= 100 & population <= 200 ~ "100-200",
population > 200 ~ "More than 200")) population
country | population | area | pop_cat |
---|---|---|---|
usa | 331.00 | 9.09 | More than 200 |
canada | 37.74 | 9.15 | Less than 100 |
mexico | 128.93 | 1.94 | 100-200 |
Convert country
from character to factor
%>%
country mutate(country = factor(country))
country | population | area |
---|---|---|
usa | 331.00 | 9.09 |
canada | 37.74 | 9.15 |
mexico | 128.93 | 1.94 |
Working with strings
stringr
package of thetidyverse
makes working with strings easy.
Example: FIFA (football/soccer) rankings
<- tibble(
fifa country = c("Belgium (1.)",
"Brazil (2.)",
"France (3.)",
"England (4.)",
"Italy (5.)")
)
fifa
country |
---|
Belgium (1.) |
Brazil (2.) |
France (3.) |
England (4.) |
Italy (5.) |
Get the first three-letters abbreviation for the countries
%>%
fifa mutate(abbrev = str_sub(country, start = 1, end = 3))
country | abbrev |
---|---|
Belgium (1.) | Bel |
Brazil (2.) | Bra |
France (3.) | Fra |
England (4.) | Eng |
Italy (5.) | Ita |
Remove the space between the name and parentheses from the country names
str_remove()
%>%
fifa mutate(country = str_remove(country, " "))
country |
---|
Belgium(1.) |
Brazil(2.) |
France(3.) |
England(4.) |
Italy(5.) |
(Note that naming the new variable the same as the old variable will overwrite the original column)
or str_replace
# replace " " with an empty character
%>%
fifa mutate(country = str_replace(country, " ", ""))
country |
---|
Belgium(1.) |
Brazil(2.) |
France(3.) |
England(4.) |
Italy(5.) |
If there are multiple spaces and we’d like to remove them all, use str_remove_all()
or str_replace_all()
.
Regular expression also works here
# \s: "whitespace character"
%>%
fifa mutate(country = str_remove(country, "\\s"))
country |
---|
Belgium(1.) |
Brazil(2.) |
France(3.) |
England(4.) |
Italy(5.) |
Learn more about stringr
and regex in R
.
select()
- Subsetting columns
Example: country
Select the population and area (i.e. omit the country names)
%>%
country select(population, area)
population | area |
---|---|
331.00 | 9.09 |
37.74 | 9.15 |
128.93 | 1.94 |
%>%
country select(-country)
population | area |
---|---|
331.00 | 9.09 |
37.74 | 9.15 |
128.93 | 1.94 |
Selection helpers: https://tidyselect.r-lib.org/reference
(I should start using these more)
filter()
- Subsetting rows where conditions are true
What are the countries with more than 2 mil \(km^2\) in land area?
%>%
country filter(area > 2)
country | population | area |
---|---|---|
usa | 331.00 | 9.09 |
canada | 37.74 | 9.15 |
What are the countries with more than 2 mil \(km^2\) in land area AND population of more than 100 mil people?
# and
%>%
country filter(area > 2 & population > 100)
country | population | area |
---|---|---|
usa | 331 | 9.09 |
What are the countries with more than 2 mil \(km^2\) in land area OR population of more than 100 mil people?
# or
%>%
country filter(area > 2 | population > 100)
country | population | area |
---|---|---|
usa | 331.00 | 9.09 |
canada | 37.74 | 9.15 |
mexico | 128.93 | 1.94 |
- Remove missing values
Example: New version of FIFA rankings data, where country and rank have their own columns
<- tibble(
fifa_new country = c("BEL", "BRA", "FRA", NA, "ITA"),
rank = c(1, 2, NA, 4, 5)
)
fifa_new
country | rank |
---|---|
BEL | 1 |
BRA | 2 |
FRA | NA |
NA | 4 |
ITA | 5 |
Remove rows with missing rank value
%>%
fifa_new filter(!is.na(rank))
country | rank |
---|---|
BEL | 1 |
BRA | 2 |
NA | 4 |
ITA | 5 |
%in%
operator (one of my favorites)
Keep countries with ranks 1, 2, and 5
%>%
fifa_new filter(rank %in% c(1, 2, 5))
country | rank |
---|---|
BEL | 1 |
BRA | 2 |
ITA | 5 |
which is equivalent to
%>%
fifa_new filter(rank == 1 | rank == 2 | rank == 5)
country | rank |
---|---|
BEL | 1 |
BRA | 2 |
ITA | 5 |
arrange()
- Order/sort rows by columns
Example: country
Arrange the cases by population (ascending order is the default argument)
%>%
country arrange(population)
country | population | area |
---|---|---|
canada | 37.74 | 9.15 |
mexico | 128.93 | 1.94 |
usa | 331.00 | 9.09 |
Descending order instead - use desc()
%>%
country arrange(desc(population))
country | population | area |
---|---|---|
usa | 331.00 | 9.09 |
mexico | 128.93 | 1.94 |
canada | 37.74 | 9.15 |
Arrange by multiple variables: population low-high (first sort), then area high-low (second sort), variable order matters
%>%
country arrange(population, desc(area))
country | population | area |
---|---|---|
canada | 37.74 | 9.15 |
mexico | 128.93 | 1.94 |
usa | 331.00 | 9.09 |
Arrange the countries by name in alphabetical order (or reverse)
%>%
country arrange(country)
country | population | area |
---|---|---|
canada | 37.74 | 9.15 |
mexico | 128.93 | 1.94 |
usa | 331.00 | 9.09 |
%>%
country arrange(desc(country))
country | population | area |
---|---|---|
usa | 331.00 | 9.09 |
mexico | 128.93 | 1.94 |
canada | 37.74 | 9.15 |
%>%
revisited
So far I’ve introduced a few data wrangling functions. Let’s now go back to the idea of piping and try performing multiple operations all in one procedure.
Example: Student info - name, gender, height (cm), weight (kg)
<- tibble(
students name = c("a", "b", "c", "x", "y", "z"),
gender = c("f", "f", "m", "f", "f", "m"),
height = c(180, 170, 190, 162, 173, 155),
weight = c(76, 65, 75, 55, 69, 50)
)
students
name | gender | height | weight |
---|---|---|---|
a | f | 180 | 76 |
b | f | 170 | 65 |
c | m | 190 | 75 |
x | f | 162 | 55 |
y | f | 173 | 69 |
z | m | 155 | 50 |
Who are the female students that are at least 170 cm tall and weigh at most 75 kg? Arrange the student by height, from tall to short. Do not include the gender column in the output.
What should we do here? - First filter()
, then arrange()
, and finally select()
%>%
students filter(gender == "f" & height >= 170 & weight <= 75) %>%
arrange(desc(height)) %>%
select(-gender)
name | height | weight |
---|---|---|
y | 173 | 69 |
b | 170 | 65 |
group_by()
and summarize()
group_by()
: perform group-wise operations defined by variable(s)summarize()
: provide a row of summary for each combination of grouping variable(s)
For each gender, give summaries of number of observations, average height, and minimum weight
%>%
students group_by(gender) %>%
summarize(n_obs = n(),
avg_height = mean(height),
min_weight = min(weight)) %>%
ungroup() # ungroup unless you want to continue aggregating the data by group
gender | n_obs | avg_height | min_weight |
---|---|---|---|
f | 4 | 171.25 | 55 |
m | 2 | 172.50 | 50 |
group_by()
also works withmutate()
.
Suppose we want to add a column of average weight for each gender to the original data.
%>%
students group_by(gender) %>%
mutate(avg_weight = mean(weight)) %>%
ungroup()
name | gender | height | weight | avg_weight |
---|---|---|---|---|
a | f | 180 | 76 | 66.25 |
b | f | 170 | 65 | 66.25 |
c | m | 190 | 75 | 62.50 |
x | f | 162 | 55 | 66.25 |
y | f | 173 | 69 | 66.25 |
z | m | 155 | 50 | 62.50 |
- Note: if there’s no grouping variables,
summarize()
will give a summary for the overall data.
%>%
students summarize(corr = cor(height, weight))
corr |
---|
0.9461962 |
base
vs tidyverse
base |
tidyverse |
---|---|
country$density <- country$population / country$area |
country %>% mutate(density = population / area) |
fifa$country <- gsub(" ", "", fifa$country) |
fifa %>% mutate(country = str_replace(country, " ", "")) |
country[country$population > 100, ] |
country %>% filter(population > 100) |
country[, c("population", "area")] |
country %>% select(population, area) |
country[order(country$population), ] |
country %>% arrange(population) |
Which one is nicer, more consistent, and more readable?
Bonus
Here are some of my favorite data wrangling functions/tricks with dplyr
/tidyverse
.
glimpse()
: data preview
%>%
students glimpse()
## Rows: 6
## Columns: 4
## $ name <chr> "a", "b", "c", "x", "y", "z"
## $ gender <chr> "f", "f", "m", "f", "f", "m"
## $ height <dbl> 180, 170, 190, 162, 173, 155
## $ weight <dbl> 76, 65, 75, 55, 69, 50
slice_head()
/slice_tail()
: select the first and last rows (tidyverse
versions ofhead()
andtail()
)
%>%
students slice_head(n = 3)
name | gender | height | weight |
---|---|---|---|
a | f | 180 | 76 |
b | f | 170 | 65 |
c | m | 190 | 75 |
%>%
students slice_tail(n = 3)
name | gender | height | weight |
---|---|---|---|
x | f | 162 | 55 |
y | f | 173 | 69 |
z | m | 155 | 50 |
slice_sample()
: randomly samples rows from data (by number or proportion of rows)
set.seed(2)
%>%
students slice_sample(n = 2)
name | gender | height | weight |
---|---|---|---|
y | f | 173 | 69 |
a | f | 180 | 76 |
set.seed(3)
%>%
students slice_sample(prop = 0.5) # helpful when doing cross-validation
name | gender | height | weight |
---|---|---|---|
y | f | 173 | 69 |
b | f | 170 | 65 |
x | f | 162 | 55 |
across()
: apply a function across multiple columns
Example: students
Factor all character variables and take the log of all numeric variables
%>%
students mutate(across(where(is.character), as.factor),
across(where(is.numeric), log))
name | gender | height | weight |
---|---|---|---|
a | f | 5.192957 | 4.330733 |
b | f | 5.135798 | 4.174387 |
c | m | 5.247024 | 4.317488 |
x | f | 5.087596 | 4.007333 |
y | f | 5.153292 | 4.234107 |
z | m | 5.043425 | 3.912023 |
add_count()
: adds a column of counts based on the number of items within a (multiple) specified group(s).
%>%
students add_count(gender, name = "gender_count")
name | gender | height | weight | gender_count |
---|---|---|---|---|
a | f | 180 | 76 | 4 |
b | f | 170 | 65 | 4 |
c | m | 190 | 75 | 2 |
x | f | 162 | 55 | 4 |
y | f | 173 | 69 | 4 |
z | m | 155 | 50 | 2 |
which is similar to… but cleaner.
%>%
students group_by(gender) %>%
mutate(gender_count = n()) %>%
ungroup()
name | gender | height | weight | gender_count |
---|---|---|---|---|
a | f | 180 | 76 | 4 |
b | f | 170 | 65 | 4 |
c | m | 190 | 75 | 2 |
x | f | 162 | 55 | 4 |
y | f | 173 | 69 | 4 |
z | m | 155 | 50 | 2 |