A Simple Guide to Data Wrangling with the tidyverse

Set up

As always, load the package first

# install.packages("tidyverse")
library(tidyverse)

The tidyverse is a collection of the following packages:

tidyverse_packages()
##  [1] "broom"         "cli"           "crayon"        "dbplyr"       
##  [5] "dplyr"         "dtplyr"        "forcats"       "googledrive"  
##  [9] "googlesheets4" "ggplot2"       "haven"         "hms"          
## [13] "httr"          "jsonlite"      "lubridate"     "magrittr"     
## [17] "modelr"        "pillar"        "purrr"         "readr"        
## [21] "readxl"        "reprex"        "rlang"         "rstudioapi"   
## [25] "rvest"         "stringr"       "tibble"        "tidyr"        
## [29] "xml2"          "tidyverse"

Learn more: https://www.tidyverse.org/packages


Tibbles

Example: Population (mil. people) and land area (mil. of \(km^2\)) of North American countries

  • “Column-wise” creation (similar to old data.frame())
country <- tibble(
  country = c("usa", "canada", "mexico"),
  population = c(331.00, 37.74, 128.93),
  area = c(9.09, 9.15, 1.94)
)
country
country population area
usa 331.00 9.09
canada 37.74 9.15
mexico 128.93 1.94
  • “Row-wise” creation
country <- tribble(
  ~country, ~population, ~area,
  "usa", 331.00, 9.09,
  "canada", 37.74, 9.15,
  "mexico", 128.93, 1.94
)
country
country population area
usa 331.00 9.09
canada 37.74 9.15
mexico 128.93 1.94

If you have data to import, use read_csv() (for comma-delimited files) which stores the data as a tibble. read.csv() in base gives a data.frame object instead.

read_csv("your_file_path/data_file_name.csv")

For other types of data file (tsv, fwf…), see https://readr.tidyverse.org/.


%>%

  • The “pipe” operator

    • From maggritr package, automatically loaded when loading the whole tidyverse


  • Perform a sequence of multiple operations

# the workflow
object %>%
  operation1(...) %>% 
  operation2(...) %>% 
  .
  .
  .
  • More on this later…

  • The following data verbs come from the dplyr package of the tidyverse.


mutate()

  • Create new variables from previously-existed variables

  • mutate(new_variable = f(old variables))

  • What could the function f be?

    • Math (arithmetic)

    • Logical condition

    • Text (strings) operations (with stringr)

    • Variable type conversion (e.g. character to factor, numeric to factor…)

    • … and many more tasks

Example: country

Convert area from \(km^2\) to \(mi^2\) and then calculate the population density (people per \(km^2\))

country %>% 
  mutate(area_mi = area * 0.386102,
         density = population / area)
country population area area_mi density
usa 331.00 9.09 3.5096672 36.41364
canada 37.74 9.15 3.5328333 4.12459
mexico 128.93 1.94 0.7490379 66.45876

Create an indicator (dummy varible) for whether the population is over 100 mil people

(Here let’s also store the new data into a new object, called country_new, as you might want to do that in some situations.)

?if_else
country_new <- country %>% 
  mutate(hundred = if_else(population > 100, "yes", "no"))
country_new
country population area hundred
usa 331.00 9.09 yes
canada 37.74 9.15 no
mexico 128.93 1.94 yes

If you have more than two categories, you can still use if_else, but now the statements are “nested”

Divide the population into 3 categories: Less than 100, 100-200, and More than 200

country %>% 
  mutate(pop_cat = if_else(population < 100, "Less than 100",
                           if_else(population > 200, "More than 200", "100-200")))
country population area pop_cat
usa 331.00 9.09 More than 200
canada 37.74 9.15 Less than 100
mexico 128.93 1.94 100-200

When there’s a large number of categories, a better solution here is to use case_when()

?case_when
country %>% 
  mutate(pop_cat = case_when(population < 100 ~ "Less than 100",
                             population >= 100 & population <= 200 ~ "100-200",
                             population > 200 ~ "More than 200"))
country population area pop_cat
usa 331.00 9.09 More than 200
canada 37.74 9.15 Less than 100
mexico 128.93 1.94 100-200

Convert country from character to factor

country %>% 
  mutate(country = factor(country))
country population area
usa 331.00 9.09
canada 37.74 9.15
mexico 128.93 1.94

Working with strings

  • stringr package of the tidyverse makes working with strings easy.

Example: FIFA (football/soccer) rankings

fifa <- tibble(
  country = c("Belgium (1.)", 
              "Brazil (2.)", 
              "France (3.)", 
              "England (4.)", 
              "Italy (5.)")
)
fifa
country
Belgium (1.)
Brazil (2.)
France (3.)
England (4.)
Italy (5.)

Get the first three-letters abbreviation for the countries

fifa %>% 
  mutate(abbrev = str_sub(country, start = 1, end = 3))
country abbrev
Belgium (1.) Bel
Brazil (2.) Bra
France (3.) Fra
England (4.) Eng
Italy (5.) Ita

Remove the space between the name and parentheses from the country names

str_remove()

fifa %>% 
  mutate(country = str_remove(country, " "))
country
Belgium(1.)
Brazil(2.)
France(3.)
England(4.)
Italy(5.)

(Note that naming the new variable the same as the old variable will overwrite the original column)

or str_replace

# replace " " with an empty character
fifa %>% 
  mutate(country = str_replace(country, " ", ""))
country
Belgium(1.)
Brazil(2.)
France(3.)
England(4.)
Italy(5.)

If there are multiple spaces and we’d like to remove them all, use str_remove_all() or str_replace_all().

Regular expression also works here

# \s: "whitespace character"
fifa %>% 
  mutate(country = str_remove(country, "\\s")) 
country
Belgium(1.)
Brazil(2.)
France(3.)
England(4.)
Italy(5.)

Learn more about stringr and regex in R.


select()

  • Subsetting columns

Example: country

Select the population and area (i.e. omit the country names)

country %>% 
  select(population, area)
population area
331.00 9.09
37.74 9.15
128.93 1.94
country %>% 
  select(-country)
population area
331.00 9.09
37.74 9.15
128.93 1.94

Selection helpers: https://tidyselect.r-lib.org/reference

(I should start using these more)


filter()

  • Subsetting rows where conditions are true

What are the countries with more than 2 mil \(km^2\) in land area?

country %>% 
  filter(area > 2)
country population area
usa 331.00 9.09
canada 37.74 9.15

What are the countries with more than 2 mil \(km^2\) in land area AND population of more than 100 mil people?

# and
country %>% 
  filter(area > 2 & population > 100)
country population area
usa 331 9.09

What are the countries with more than 2 mil \(km^2\) in land area OR population of more than 100 mil people?

# or
country %>% 
  filter(area > 2 | population > 100)
country population area
usa 331.00 9.09
canada 37.74 9.15
mexico 128.93 1.94
  • Remove missing values

Example: New version of FIFA rankings data, where country and rank have their own columns

fifa_new <- tibble(
  country = c("BEL", "BRA", "FRA", NA, "ITA"),
  rank = c(1, 2, NA, 4, 5)
)
fifa_new
country rank
BEL 1
BRA 2
FRA NA
NA 4
ITA 5

Remove rows with missing rank value

fifa_new %>% 
  filter(!is.na(rank))
country rank
BEL 1
BRA 2
NA 4
ITA 5
  • %in% operator (one of my favorites)

Keep countries with ranks 1, 2, and 5

fifa_new %>% 
  filter(rank %in% c(1, 2, 5))
country rank
BEL 1
BRA 2
ITA 5

which is equivalent to

fifa_new %>% 
  filter(rank == 1 | rank == 2 | rank == 5)
country rank
BEL 1
BRA 2
ITA 5

arrange()

  • Order/sort rows by columns

Example: country

Arrange the cases by population (ascending order is the default argument)

country %>% 
  arrange(population)
country population area
canada 37.74 9.15
mexico 128.93 1.94
usa 331.00 9.09

Descending order instead - use desc()

country %>% 
  arrange(desc(population))
country population area
usa 331.00 9.09
mexico 128.93 1.94
canada 37.74 9.15

Arrange by multiple variables: population low-high (first sort), then area high-low (second sort), variable order matters

country %>% 
  arrange(population, desc(area))
country population area
canada 37.74 9.15
mexico 128.93 1.94
usa 331.00 9.09

Arrange the countries by name in alphabetical order (or reverse)

country %>% 
  arrange(country)
country population area
canada 37.74 9.15
mexico 128.93 1.94
usa 331.00 9.09
country %>% 
  arrange(desc(country))
country population area
usa 331.00 9.09
mexico 128.93 1.94
canada 37.74 9.15

%>% revisited

So far I’ve introduced a few data wrangling functions. Let’s now go back to the idea of piping and try performing multiple operations all in one procedure.

Example: Student info - name, gender, height (cm), weight (kg)

students <- tibble(
  name = c("a", "b", "c", "x", "y", "z"),
  gender = c("f", "f", "m", "f", "f", "m"),
  height = c(180, 170, 190, 162, 173, 155),
  weight = c(76, 65, 75, 55, 69, 50)
)
students
name gender height weight
a f 180 76
b f 170 65
c m 190 75
x f 162 55
y f 173 69
z m 155 50

Who are the female students that are at least 170 cm tall and weigh at most 75 kg? Arrange the student by height, from tall to short. Do not include the gender column in the output.

What should we do here? - First filter(), then arrange(), and finally select()

students %>% 
  filter(gender == "f" & height >= 170 & weight <= 75) %>% 
  arrange(desc(height)) %>% 
  select(-gender)
name height weight
y 173 69
b 170 65

group_by() and summarize()

  • group_by(): perform group-wise operations defined by variable(s)

  • summarize(): provide a row of summary for each combination of grouping variable(s)

For each gender, give summaries of number of observations, average height, and minimum weight

students %>% 
  group_by(gender) %>%
  summarize(n_obs = n(),
            avg_height = mean(height),
            min_weight = min(weight)) %>%
  ungroup() # ungroup unless you want to continue aggregating the data by group
gender n_obs avg_height min_weight
f 4 171.25 55
m 2 172.50 50
  • group_by() also works with mutate().

Suppose we want to add a column of average weight for each gender to the original data.

students %>% 
  group_by(gender) %>% 
  mutate(avg_weight = mean(weight)) %>% 
  ungroup()
name gender height weight avg_weight
a f 180 76 66.25
b f 170 65 66.25
c m 190 75 62.50
x f 162 55 66.25
y f 173 69 66.25
z m 155 50 62.50
  • Note: if there’s no grouping variables, summarize() will give a summary for the overall data.
students %>% 
  summarize(corr = cor(height, weight))
corr
0.9461962

base vs tidyverse

base tidyverse
country$density <- country$population / country$area country %>% mutate(density = population / area)
fifa$country <- gsub(" ", "", fifa$country) fifa %>% mutate(country = str_replace(country, " ", ""))
country[country$population > 100, ] country %>% filter(population > 100)
country[, c("population", "area")] country %>% select(population, area)
country[order(country$population), ] country %>% arrange(population)

Which one is nicer, more consistent, and more readable?

Bonus

Here are some of my favorite data wrangling functions/tricks with dplyr/tidyverse.

  • glimpse(): data preview
students %>% 
  glimpse()
## Rows: 6
## Columns: 4
## $ name   <chr> "a", "b", "c", "x", "y", "z"
## $ gender <chr> "f", "f", "m", "f", "f", "m"
## $ height <dbl> 180, 170, 190, 162, 173, 155
## $ weight <dbl> 76, 65, 75, 55, 69, 50
  • slice_head()/slice_tail(): select the first and last rows (tidyverse versions of head() and tail())
students %>% 
  slice_head(n = 3)
name gender height weight
a f 180 76
b f 170 65
c m 190 75
students %>% 
  slice_tail(n = 3)
name gender height weight
x f 162 55
y f 173 69
z m 155 50
  • slice_sample(): randomly samples rows from data (by number or proportion of rows)
set.seed(2)
students %>% 
  slice_sample(n = 2)
name gender height weight
y f 173 69
a f 180 76
set.seed(3)
students %>% 
  slice_sample(prop = 0.5) # helpful when doing cross-validation
name gender height weight
y f 173 69
b f 170 65
x f 162 55
  • across(): apply a function across multiple columns

Example: students

Factor all character variables and take the log of all numeric variables

students %>% 
  mutate(across(where(is.character), as.factor),
         across(where(is.numeric), log))
name gender height weight
a f 5.192957 4.330733
b f 5.135798 4.174387
c m 5.247024 4.317488
x f 5.087596 4.007333
y f 5.153292 4.234107
z m 5.043425 3.912023
  • add_count(): adds a column of counts based on the number of items within a (multiple) specified group(s).
students %>% 
  add_count(gender, name = "gender_count")
name gender height weight gender_count
a f 180 76 4
b f 170 65 4
c m 190 75 2
x f 162 55 4
y f 173 69 4
z m 155 50 2

which is similar to… but cleaner.

students %>% 
  group_by(gender) %>% 
  mutate(gender_count = n()) %>% 
  ungroup()
name gender height weight gender_count
a f 180 76 4
b f 170 65 4
c m 190 75 2
x f 162 55 4
y f 173 69 4
z m 155 50 2