A Simple Guide to Data Wrangling with the `tidyverse`

Set up

As always, load the package first

# install.packages("tidyverse")
library(tidyverse)

The tidyverse is a collection of the following packages:

tidyverse_packages()

##  [1] "broom"         "cli"           "crayon"        "dbplyr"       
##  [5] "dplyr"         "dtplyr"        "forcats"       "googledrive"  
##  [9] "googlesheets4" "ggplot2"       "haven"         "hms"          
## [13] "httr"          "jsonlite"      "lubridate"     "magrittr"     
## [17] "modelr"        "pillar"        "purrr"         "readr"        
## [21] "readxl"        "reprex"        "rlang"         "rstudioapi"   
## [25] "rvest"         "stringr"       "tibble"        "tidyr"        
## [29] "xml2"          "tidyverse"

Learn more: https://www.tidyverse.org/packages

Tibbles

“Modern data frames”
Full definition: https://tibble.tidyverse.org
Creating a tibble

Example: Population (mil. people) and land area (mil. of $km^2$) of North American countries

“Column-wise” creation (similar to old data.frame())

country <- tibble(
  country = c("usa", "canada", "mexico"),
  population = c(331.00, 37.74, 128.93),
  area = c(9.09, 9.15, 1.94)
)

country

country	population	area
usa	331.00	9.09
canada	37.74	9.15
mexico	128.93	1.94

“Row-wise” creation

country <- tribble(
  ~country, ~population, ~area,
  "usa", 331.00, 9.09,
  "canada", 37.74, 9.15,
  "mexico", 128.93, 1.94
)

country

country	population	area
usa	331.00	9.09
canada	37.74	9.15
mexico	128.93	1.94

If you have data to import, use read_csv() (for comma-delimited files) which stores the data as a tibble. read.csv() in base gives a data.frame object instead.

read_csv("your_file_path/data_file_name.csv")

For other types of data file (tsv, fwf…), see https://readr.tidyverse.org/.

`%>%`

The “pipe” operator
- From maggritr package, automatically loaded when loading the whole tidyverse
Perform a sequence of multiple operations

# the workflow
object %>%
  operation1(...) %>% 
  operation2(...) %>% 
  .
  .
  .

More on this later…
The following data verbs come from the dplyr package of the tidyverse.

`mutate()`

Create new variables from previously-existed variables
mutate(new_variable = f(old variables))
What could the function f be?
- Math (arithmetic)
- Logical condition
- Text (strings) operations (with stringr)
- Variable type conversion (e.g. character to factor, numeric to factor…)
- … and many more tasks

Example: country

Convert area from $km^2$ to $mi^2$ and then calculate the population density (people per $km^2$)

country %>% 
  mutate(area_mi = area * 0.386102,
         density = population / area)

country	population	area	area_mi	density
usa	331.00	9.09	3.5096672	36.41364
canada	37.74	9.15	3.5328333	4.12459
mexico	128.93	1.94	0.7490379	66.45876

Create an indicator (dummy varible) for whether the population is over 100 mil people

(Here let’s also store the new data into a new object, called country_new, as you might want to do that in some situations.)

?if_else

country_new <- country %>% 
  mutate(hundred = if_else(population > 100, "yes", "no"))

country_new

country	population	area	hundred
usa	331.00	9.09	yes
canada	37.74	9.15	no
mexico	128.93	1.94	yes

If you have more than two categories, you can still use if_else, but now the statements are “nested”

Divide the population into 3 categories: Less than 100, 100-200, and More than 200

country %>% 
  mutate(pop_cat = if_else(population < 100, "Less than 100",
                           if_else(population > 200, "More than 200", "100-200")))

country	population	area	pop_cat
usa	331.00	9.09	More than 200
canada	37.74	9.15	Less than 100
mexico	128.93	1.94	100-200

When there’s a large number of categories, a better solution here is to use case_when()

?case_when

country %>% 
  mutate(pop_cat = case_when(population < 100 ~ "Less than 100",
                             population >= 100 & population <= 200 ~ "100-200",
                             population > 200 ~ "More than 200"))

country	population	area	pop_cat
usa	331.00	9.09	More than 200
canada	37.74	9.15	Less than 100
mexico	128.93	1.94	100-200

Convert country from character to factor

country %>% 
  mutate(country = factor(country))

country	population	area
usa	331.00	9.09
canada	37.74	9.15
mexico	128.93	1.94

Working with strings

stringr package of the tidyverse makes working with strings easy.

Example: FIFA (football/soccer) rankings

fifa <- tibble(
  country = c("Belgium (1.)", 
              "Brazil (2.)", 
              "France (3.)", 
              "England (4.)", 
              "Italy (5.)")
)

fifa

country
Belgium (1.)
Brazil (2.)
France (3.)
England (4.)
Italy (5.)

Get the first three-letters abbreviation for the countries

fifa %>% 
  mutate(abbrev = str_sub(country, start = 1, end = 3))

country	abbrev
Belgium (1.)	Bel
Brazil (2.)	Bra
France (3.)	Fra
England (4.)	Eng
Italy (5.)	Ita

Remove the space between the name and parentheses from the country names

str_remove()

fifa %>% 
  mutate(country = str_remove(country, " "))

country
Belgium(1.)
Brazil(2.)
France(3.)
England(4.)
Italy(5.)

(Note that naming the new variable the same as the old variable will overwrite the original column)

or str_replace

# replace " " with an empty character
fifa %>% 
  mutate(country = str_replace(country, " ", ""))

country
Belgium(1.)
Brazil(2.)
France(3.)
England(4.)
Italy(5.)

If there are multiple spaces and we’d like to remove them all, use str_remove_all() or str_replace_all().

Regular expression also works here

# \s: "whitespace character"
fifa %>% 
  mutate(country = str_remove(country, "\\s"))

country
Belgium(1.)
Brazil(2.)
France(3.)
England(4.)
Italy(5.)

Learn more about stringr and regex in R.

`select()`

Subsetting columns

Example: country

Select the population and area (i.e. omit the country names)

country %>% 
  select(population, area)

population	area
331.00	9.09
37.74	9.15
128.93	1.94

country %>% 
  select(-country)

population	area
331.00	9.09
37.74	9.15
128.93	1.94

Selection helpers: https://tidyselect.r-lib.org/reference

(I should start using these more)

`filter()`

Subsetting rows where conditions are true

What are the countries with more than 2 mil $km^2$ in land area?

country %>% 
  filter(area > 2)

country	population	area
usa	331.00	9.09
canada	37.74	9.15

What are the countries with more than 2 mil $km^2$ in land area AND population of more than 100 mil people?

# and
country %>% 
  filter(area > 2 & population > 100)

country	population	area
usa	331	9.09

What are the countries with more than 2 mil $km^2$ in land area OR population of more than 100 mil people?

# or
country %>% 
  filter(area > 2 | population > 100)

country	population	area
usa	331.00	9.09
canada	37.74	9.15
mexico	128.93	1.94

Remove missing values

Example: New version of FIFA rankings data, where country and rank have their own columns

fifa_new <- tibble(
  country = c("BEL", "BRA", "FRA", NA, "ITA"),
  rank = c(1, 2, NA, 4, 5)
)

fifa_new

country	rank
BEL	1
BRA	2
FRA	NA
NA	4
ITA	5

Remove rows with missing rank value

fifa_new %>% 
  filter(!is.na(rank))

country	rank
BEL	1
BRA	2
NA	4
ITA	5

%in% operator (one of my favorites)

Keep countries with ranks 1, 2, and 5

fifa_new %>% 
  filter(rank %in% c(1, 2, 5))

country	rank
BEL	1
BRA	2
ITA	5

which is equivalent to

fifa_new %>% 
  filter(rank == 1 | rank == 2 | rank == 5)

country	rank
BEL	1
BRA	2
ITA	5

`arrange()`

Order/sort rows by columns

Example: country

Arrange the cases by population (ascending order is the default argument)

country %>% 
  arrange(population)

country	population	area
canada	37.74	9.15
mexico	128.93	1.94
usa	331.00	9.09

Descending order instead - use desc()

country %>% 
  arrange(desc(population))

country	population	area
usa	331.00	9.09
mexico	128.93	1.94
canada	37.74	9.15

Arrange by multiple variables: population low-high (first sort), then area high-low (second sort), variable order matters

country %>% 
  arrange(population, desc(area))

country	population	area
canada	37.74	9.15
mexico	128.93	1.94
usa	331.00	9.09

Arrange the countries by name in alphabetical order (or reverse)

country %>% 
  arrange(country)

country	population	area
canada	37.74	9.15
mexico	128.93	1.94
usa	331.00	9.09

country %>% 
  arrange(desc(country))

country	population	area
usa	331.00	9.09
mexico	128.93	1.94
canada	37.74	9.15

`%>%` revisited

So far I’ve introduced a few data wrangling functions. Let’s now go back to the idea of piping and try performing multiple operations all in one procedure.

Example: Student info - name, gender, height (cm), weight (kg)

students <- tibble(
  name = c("a", "b", "c", "x", "y", "z"),
  gender = c("f", "f", "m", "f", "f", "m"),
  height = c(180, 170, 190, 162, 173, 155),
  weight = c(76, 65, 75, 55, 69, 50)
)

students

name	gender	height	weight
a	f	180	76
b	f	170	65
c	m	190	75
x	f	162	55
y	f	173	69
z	m	155	50

Who are the female students that are at least 170 cm tall and weigh at most 75 kg? Arrange the student by height, from tall to short. Do not include the gender column in the output.

What should we do here? - First filter(), then arrange(), and finally select()

students %>% 
  filter(gender == "f" & height >= 170 & weight <= 75) %>% 
  arrange(desc(height)) %>% 
  select(-gender)

name	height	weight
y	173	69
b	170	65

`group_by()` and `summarize()`

group_by(): perform group-wise operations defined by variable(s)
summarize(): provide a row of summary for each combination of grouping variable(s)

For each gender, give summaries of number of observations, average height, and minimum weight

students %>% 
  group_by(gender) %>%
  summarize(n_obs = n(),
            avg_height = mean(height),
            min_weight = min(weight)) %>%
  ungroup() # ungroup unless you want to continue aggregating the data by group

gender	n_obs	avg_height	min_weight
f	4	171.25	55
m	2	172.50	50

group_by() also works with mutate().

Suppose we want to add a column of average weight for each gender to the original data.

students %>% 
  group_by(gender) %>% 
  mutate(avg_weight = mean(weight)) %>% 
  ungroup()

name	gender	height	weight	avg_weight
a	f	180	76	66.25
b	f	170	65	66.25
c	m	190	75	62.50
x	f	162	55	66.25
y	f	173	69	66.25
z	m	155	50	62.50

Note: if there’s no grouping variables, summarize() will give a summary for the overall data.

students %>% 
  summarize(corr = cor(height, weight))

corr
0.9461962

`base` vs `tidyverse`

`base`	`tidyverse`
`country$density <- country$population / country$area`	`country %>% mutate(density = population / area)`
`fifa$country <- gsub(" ", "", fifa$country)`	`fifa %>% mutate(country = str_replace(country, " ", ""))`
`country[country$population > 100, ]`	`country %>% filter(population > 100)`
`country[, c("population", "area")]`	`country %>% select(population, area)`
`country[order(country$population), ]`	`country %>% arrange(population)`

Which one is nicer, more consistent, and more readable?

Bonus

Here are some of my favorite data wrangling functions/tricks with dplyr/tidyverse.

glimpse(): data preview

students %>% 
  glimpse()

## Rows: 6
## Columns: 4
## $ name   <chr> "a", "b", "c", "x", "y", "z"
## $ gender <chr> "f", "f", "m", "f", "f", "m"
## $ height <dbl> 180, 170, 190, 162, 173, 155
## $ weight <dbl> 76, 65, 75, 55, 69, 50

slice_head()/slice_tail(): select the first and last rows (tidyverse versions of head() and tail())

students %>% 
  slice_head(n = 3)

name	gender	height	weight
a	f	180	76
b	f	170	65
c	m	190	75

students %>% 
  slice_tail(n = 3)

name	gender	height	weight
x	f	162	55
y	f	173	69
z	m	155	50

slice_sample(): randomly samples rows from data (by number or proportion of rows)

set.seed(2)
students %>% 
  slice_sample(n = 2)

name	gender	height	weight
y	f	173	69
a	f	180	76

set.seed(3)
students %>% 
  slice_sample(prop = 0.5) # helpful when doing cross-validation

name	gender	height	weight
y	f	173	69
b	f	170	65
x	f	162	55

across(): apply a function across multiple columns

Example: students

Factor all character variables and take the log of all numeric variables

students %>% 
  mutate(across(where(is.character), as.factor),
         across(where(is.numeric), log))

name	gender	height	weight
a	f	5.192957	4.330733
b	f	5.135798	4.174387
c	m	5.247024	4.317488
x	f	5.087596	4.007333
y	f	5.153292	4.234107
z	m	5.043425	3.912023

add_count(): adds a column of counts based on the number of items within a (multiple) specified group(s).

students %>% 
  add_count(gender, name = "gender_count")

name	gender	height	weight	gender_count
a	f	180	76	4
b	f	170	65	4
c	m	190	75	2
x	f	162	55	4
y	f	173	69	4
z	m	155	50	2

which is similar to… but cleaner.

students %>% 
  group_by(gender) %>% 
  mutate(gender_count = n()) %>% 
  ungroup()

name	gender	height	weight	gender_count
a	f	180	76	4
b	f	170	65	4
c	m	190	75	2
x	f	162	55	4
y	f	173	69	4
z	m	155	50	2

A Simple Guide to Data Wrangling with the tidyverse

Set up

Tibbles

%>%

mutate()

select()

filter()

arrange()

%>% revisited

group_by() and summarize()

base vs tidyverse