Advanced Data Wrangling in R using dplyr

Xiaorui (Jeremy) Zhu

01/20/2026

Why do we need dplyr

Making data manipulation easier and more intuitive .
It simplifies the process of data manipulation by providing a consistent grammar for interacting with data.
People can easily filter, arrange, summarize, and extract specific parts of their data without complex code.

Why do we need dplyr

For example, you have a data frame with information about customers, including their names, ages, spending and satisfaction level.

You want to filter out customers who are

above a certain age,
arrange the data by grade,
calculate the average satisfaction,
and extract specific columns of interest
accomplish these task just a few clear lines of code.

Key functions in `dplyr` package

You learned six key dplyr functions that allow you to solve the vast majority of your data manipulation challenges. What do each do?

filter:
arrange:
select:
mutate:
summarize:
group_by:

Key functions in `dplyr` package

You learned six key dplyr functions that allow you to solve the vast majority of your data manipulation challenges. What do each do?

filter: pick observations based on values
arrange: reorder data
select: pick variables
mutate: create new variables
summarize: summarize data by functions of choice - group_by: group data by categorical levels

Details of `dplyr` package and functions

Are there vignettes for the dplyr package?
Can you find additional documentation explaining the flights data set?

Details of `dplyr` package and functions

# are there vignettes for the dplyr package —> yes, 5 of them
vignette(package = "dplyr")
browseVignettes(package = "dplyr")  # shows hyperlinks in browser

# additional documentation for the mpg data set
?flights

Case study: `dplyr` package for the “flights” dataset

The “flights” dataset in the “nycflights13” R package is a popular dataset containing information about flights departing from New York City airports in 2013. It includes details such as flight dates, departure and arrival times, airlines, distances, and delays.

It’s great for exploring patterns in flight data, understanding the factors affecting delays, and building predictive models.

Question for EducateUs: can you briefly explain the “flights” dataset in the “nycflights13” R package?

You will learn six key dplyr functions that allow you to solve the vast majority of your data manipulation challenges.

# load packages
library(nycflights13)
library(tidyverse)
# your turn!

Case study: `dplyr` package for the “flights” dataset

If you want to filter out the flights that are

departure on January of 2013
arrange these flights by the departure delay
and extract a few specific columns
and calculate the flying speed

The filter() function —————————————————

filter() function: filter values based on defined conditions

Filter based on one or more variables

# basic filtering
filter(flights, month == 1)

## # A tibble: 27,004 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ℹ 26,994 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

filter(flights, month == 1, day == 1)

## # A tibble: 842 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ℹ 832 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

filter(flights, month == 1, day == 1, dep_delay > 0)

## # A tibble: 352 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      601            600         1      844            850
##  5  2013     1     1      608            600         8      807            735
##  6  2013     1     1      611            600        11      945            931
##  7  2013     1     1      613            610         3      925            921
##  8  2013     1     1      623            610        13      920            915
##  9  2013     1     1      632            608        24      740            728
## 10  2013     1     1      644            636         8      931            940
## # ℹ 342 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

Save filtered new data frame

Save filter data frame using assignment operator (<-)
Save and view filtered data frame by wrapping entire function with ()

# save a new data frame
jan1 <- filter(flights, month == 1, day == 1)
(dec25 <- filter(flights, month == 12, day == 25))

## # A tibble: 719 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013    12    25      456            500        -4      649            651
##  2  2013    12    25      524            515         9      805            814
##  3  2013    12    25      542            540         2      832            850
##  4  2013    12    25      546            550        -4     1022           1027
##  5  2013    12    25      556            600        -4      730            745
##  6  2013    12    25      557            600        -3      743            752
##  7  2013    12    25      557            600        -3      818            831
##  8  2013    12    25      559            600        -1      855            856
##  9  2013    12    25      559            600        -1      849            855
## 10  2013    12    25      600            600         0      850            846
## # ℹ 709 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

logical tests for filtering

# logical tests
12 == 12
12 <= c(12, 11)
12 %in% c(12, 11, 8)
x <- c(12, NA, 11, NA, 8)
is.na(x)

Using logical tests for filtering

What will these operations produce?

filter(flights, month == 12)
filter(flights, month != 12)
filter(flights, month %in% c(11, 12))
filter(flights, arr_delay <= 120)
filter(flights, !(arr_delay <= 120))
filter(flights, is.na(tailnum))

multiple logical tests for filtering

12 == 12 & 12 < 14
12 == 12 & 12 < 10
12 == 12 | 12 < 10
any(12 == 12, 12 < 10)
all(12 == 12, 12 < 10)

multiple comparisons for filtering

# Using comma is same as using &
filter(flights, month == 12, day == 25)
filter(flights, month == 12 & day == 25)

# Use %in% as a shortcut for |
filter(flights, month == 11 | month == 12)
filter(flights, month %in% c(11, 12))

# Use not operator !
filter(flights, !(arr_delay > 120 | dep_delay > 120))
filter(flights, arr_delay <= 120, dep_delay <= 120)

The filter() exercise: your turn!

Find the number of flights that

Had an arrival delay of two or more hours
Flew to Houston (IAH or HOU)
Arrived more than two hours late, but didn’t leave late

The arrange() function ————————————————–

arrange() function: reorder data

ordering data

arrange(flights, dep_delay)

## # A tibble: 336,776 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013    12     7     2040           2123       -43       40           2352
##  2  2013     2     3     2022           2055       -33     2240           2338
##  3  2013    11    10     1408           1440       -32     1549           1559
##  4  2013     1    11     1900           1930       -30     2233           2243
##  5  2013     1    29     1703           1730       -27     1947           1957
##  6  2013     8     9      729            755       -26     1002            955
##  7  2013    10    23     1907           1932       -25     2143           2143
##  8  2013     3    30     2030           2055       -25     2213           2250
##  9  2013     3     2     1431           1455       -24     1601           1631
## 10  2013     5     5      934            958       -24     1225           1309
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

arrange(flights, dep_delay, arr_delay)

## # A tibble: 336,776 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013    12     7     2040           2123       -43       40           2352
##  2  2013     2     3     2022           2055       -33     2240           2338
##  3  2013    11    10     1408           1440       -32     1549           1559
##  4  2013     1    11     1900           1930       -30     2233           2243
##  5  2013     1    29     1703           1730       -27     1947           1957
##  6  2013     8     9      729            755       -26     1002            955
##  7  2013     3    30     2030           2055       -25     2213           2250
##  8  2013    10    23     1907           1932       -25     2143           2143
##  9  2013     5     5      934            958       -24     1225           1309
## 10  2013     9    18     1631           1655       -24     1812           1845
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

arrange(flights, desc(dep_delay))

## # A tibble: 336,776 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     9      641            900      1301     1242           1530
##  2  2013     6    15     1432           1935      1137     1607           2120
##  3  2013     1    10     1121           1635      1126     1239           1810
##  4  2013     9    20     1139           1845      1014     1457           2210
##  5  2013     7    22      845           1600      1005     1044           1815
##  6  2013     4    10     1100           1900       960     1342           2211
##  7  2013     3    17     2321            810       911      135           1020
##  8  2013     6    27      959           1900       899     1236           2226
##  9  2013     7    22     2257            759       898      121           1026
## 10  2013    12     5      756           1700       896     1058           2020
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

missing values always sorted at the end

df <- tibble(x = c(5, 2, 5, NA))
arrange(df, x)

## # A tibble: 4 × 1
##       x
##   <dbl>
## 1     2
## 2     5
## 3     5
## 4    NA

arrange(df, desc(x))

## # A tibble: 4 × 1
##       x
##   <dbl>
## 1     5
## 2     5
## 3     2
## 4    NA

The arrange() exercise: your turn!

Sort flights to find those with largest departure delays
Find the flights that left earliest based on departure time
Which flights traveled the longest distance?
Which traveled the shortest?

The select() function —————————————————

Select variables of concern

selecting variables

select(flights, year, month, day)

## # A tibble: 336,776 × 3
##     year month   day
##    <int> <int> <int>
##  1  2013     1     1
##  2  2013     1     1
##  3  2013     1     1
##  4  2013     1     1
##  5  2013     1     1
##  6  2013     1     1
##  7  2013     1     1
##  8  2013     1     1
##  9  2013     1     1
## 10  2013     1     1
## # ℹ 336,766 more rows

select(flights, year:day)

## # A tibble: 336,776 × 3
##     year month   day
##    <int> <int> <int>
##  1  2013     1     1
##  2  2013     1     1
##  3  2013     1     1
##  4  2013     1     1
##  5  2013     1     1
##  6  2013     1     1
##  7  2013     1     1
##  8  2013     1     1
##  9  2013     1     1
## 10  2013     1     1
## # ℹ 336,766 more rows

select(flights, -(year:day))

## # A tibble: 336,776 × 16
##    dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
##       <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>  
##  1      517            515         2      830            819        11 UA     
##  2      533            529         4      850            830        20 UA     
##  3      542            540         2      923            850        33 AA     
##  4      544            545        -1     1004           1022       -18 B6     
##  5      554            600        -6      812            837       -25 DL     
##  6      554            558        -4      740            728        12 UA     
##  7      555            600        -5      913            854        19 B6     
##  8      557            600        -3      709            723       -14 EV     
##  9      557            600        -3      838            846        -8 B6     
## 10      558            600        -2      753            745         8 AA     
## # ℹ 336,766 more rows
## # ℹ 9 more variables: flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Helper functions with the select() function

select(flights, ends_with("time"))

## # A tibble: 336,776 × 5
##    dep_time sched_dep_time arr_time sched_arr_time air_time
##       <int>          <int>    <int>          <int>    <dbl>
##  1      517            515      830            819      227
##  2      533            529      850            830      227
##  3      542            540      923            850      160
##  4      544            545     1004           1022      183
##  5      554            600      812            837      116
##  6      554            558      740            728      150
##  7      555            600      913            854      158
##  8      557            600      709            723       53
##  9      557            600      838            846      140
## 10      558            600      753            745      138
## # ℹ 336,766 more rows

select(flights, c(carrier, ends_with("time"), contains("delay")))

## # A tibble: 336,776 × 8
##    carrier dep_time sched_dep_time arr_time sched_arr_time air_time dep_delay
##    <chr>      <int>          <int>    <int>          <int>    <dbl>     <dbl>
##  1 UA           517            515      830            819      227         2
##  2 UA           533            529      850            830      227         4
##  3 AA           542            540      923            850      160         2
##  4 B6           544            545     1004           1022      183        -1
##  5 DL           554            600      812            837      116        -6
##  6 UA           554            558      740            728      150        -4
##  7 B6           555            600      913            854      158        -5
##  8 EV           557            600      709            723       53        -3
##  9 B6           557            600      838            846      140        -3
## 10 AA           558            600      753            745      138        -2
## # ℹ 336,766 more rows
## # ℹ 1 more variable: arr_delay <dbl>

select(flights, time_hour, air_time, everything())

## # A tibble: 336,776 × 19
##    time_hour           air_time  year month   day dep_time sched_dep_time
##    <dttm>                 <dbl> <int> <int> <int>    <int>          <int>
##  1 2013-01-01 05:00:00      227  2013     1     1      517            515
##  2 2013-01-01 05:00:00      227  2013     1     1      533            529
##  3 2013-01-01 05:00:00      160  2013     1     1      542            540
##  4 2013-01-01 05:00:00      183  2013     1     1      544            545
##  5 2013-01-01 06:00:00      116  2013     1     1      554            600
##  6 2013-01-01 05:00:00      150  2013     1     1      554            558
##  7 2013-01-01 06:00:00      158  2013     1     1      555            600
##  8 2013-01-01 06:00:00       53  2013     1     1      557            600
##  9 2013-01-01 06:00:00      140  2013     1     1      557            600
## 10 2013-01-01 06:00:00      138  2013     1     1      558            600
## # ℹ 336,766 more rows
## # ℹ 12 more variables: dep_delay <dbl>, arr_time <int>, sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>, origin <chr>,
## #   dest <chr>, distance <dbl>, hour <dbl>, minute <dbl>

The select() function: your turn!

What happens if you include the name of a variable multiple times in a select() call?
What does the one_of() function do? Why might it be helpful in conjunction with this vector?

vars <- c("MONTH", "month", "day", "dep_delay", "arr_delay")

Does the result of running the following code surprise you? How do the select helpers deal with case by default? How can you change that default?

The rename() function

rename(flights, ANNOYING = dep_delay)

## # A tibble: 336,776 × 19
##     year month   day dep_time sched_dep_time ANNOYING arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>    <dbl>    <int>          <int>
##  1  2013     1     1      517            515        2      830            819
##  2  2013     1     1      533            529        4      850            830
##  3  2013     1     1      542            540        2      923            850
##  4  2013     1     1      544            545       -1     1004           1022
##  5  2013     1     1      554            600       -6      812            837
##  6  2013     1     1      554            558       -4      740            728
##  7  2013     1     1      555            600       -5      913            854
##  8  2013     1     1      557            600       -3      709            723
##  9  2013     1     1      557            600       -3      838            846
## 10  2013     1     1      558            600       -2      753            745
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

The mutate() function —————————————————

Create new variables with functions of existing variables

The mutate() function: create new variables based on existing variables

mutate() creates new variables with functions of existing variables:

# create a smaller data set to work with
flights_sml <- select(
  flights, 
  year:day, 
  ends_with("delay"), 
  distance, 
  air_time
)
flights_sml

## # A tibble: 336,776 × 7
##     year month   day dep_delay arr_delay distance air_time
##    <int> <int> <int>     <dbl>     <dbl>    <dbl>    <dbl>
##  1  2013     1     1         2        11     1400      227
##  2  2013     1     1         4        20     1416      227
##  3  2013     1     1         2        33     1089      160
##  4  2013     1     1        -1       -18     1576      183
##  5  2013     1     1        -6       -25      762      116
##  6  2013     1     1        -4        12      719      150
##  7  2013     1     1        -5        19     1065      158
##  8  2013     1     1        -3       -14      229       53
##  9  2013     1     1        -3        -8      944      140
## 10  2013     1     1        -2         8      733      138
## # ℹ 336,766 more rows

mutate(flights_sml,
       # arr_delay=arr_delay, dep_delay=dep_delay,
       hours = air_time / 60,
       gain = arr_delay - dep_delay,
       speed = distance / air_time * 60,
       gain_per_hour = gain / hours)

## # A tibble: 336,776 × 11
##     year month   day dep_delay arr_delay distance air_time hours  gain speed
##    <int> <int> <int>     <dbl>     <dbl>    <dbl>    <dbl> <dbl> <dbl> <dbl>
##  1  2013     1     1         2        11     1400      227 3.78      9  370.
##  2  2013     1     1         4        20     1416      227 3.78     16  374.
##  3  2013     1     1         2        33     1089      160 2.67     31  408.
##  4  2013     1     1        -1       -18     1576      183 3.05    -17  517.
##  5  2013     1     1        -6       -25      762      116 1.93    -19  394.
##  6  2013     1     1        -4        12      719      150 2.5      16  288.
##  7  2013     1     1        -5        19     1065      158 2.63     24  404.
##  8  2013     1     1        -3       -14      229       53 0.883   -11  259.
##  9  2013     1     1        -3        -8      944      140 2.33     -5  405.
## 10  2013     1     1        -2         8      733      138 2.3      10  319.
## # ℹ 336,766 more rows
## # ℹ 1 more variable: gain_per_hour <dbl>

transmute() function: only keep new variables

If you only want to keep the new variables use transmute():

transmute(
  flights,
  gain = arr_delay - dep_delay,
  hours = air_time / 60,
  gain_per_hour = gain / hours
)

## # A tibble: 336,776 × 3
##     gain hours gain_per_hour
##    <dbl> <dbl>         <dbl>
##  1     9 3.78           2.38
##  2    16 3.78           4.23
##  3    31 2.67          11.6 
##  4   -17 3.05          -5.57
##  5   -19 1.93          -9.83
##  6    16 2.5            6.4 
##  7    24 2.63           9.11
##  8   -11 0.883        -12.5 
##  9    -5 2.33          -2.14
## 10    10 2.3            4.35
## # ℹ 336,766 more rows

The mutate() exercise: your turn!

Create a new variable distance_km that converts distance in miles to kilometers
Create a time_per_km variable based on air_time and distance_km.

The summarize() or summarise() function ———————————

Collapse many values down to a single summary statistic

summarizing data

summarise(flights, dep_delay_mean = mean(dep_delay, na.rm = TRUE))

## # A tibble: 1 × 1
##   dep_delay_mean
##            <dbl>
## 1           12.6

summarise(flights, 
          dep_delay_mean = mean(dep_delay, na.rm = TRUE),
          dep_delay_sd = sd(dep_delay, na.rm = TRUE))

## # A tibble: 1 × 2
##   dep_delay_mean dep_delay_sd
##            <dbl>        <dbl>
## 1           12.6         40.2

summarise(flights, 
          dep_delay_mean = mean(dep_delay, na.rm = TRUE),
          dep_delay_sd = sd(dep_delay, na.rm = TRUE),
          n = n())

## # A tibble: 1 × 3
##   dep_delay_mean dep_delay_sd      n
##            <dbl>        <dbl>  <int>
## 1           12.6         40.2 336776

The group_by() function ———————————

by_day <- group_by(flights, year, month, day)
summarise(by_day, delay = mean(dep_delay, na.rm = TRUE))

## # A tibble: 365 × 4
## # Groups:   year, month [12]
##     year month   day delay
##    <int> <int> <int> <dbl>
##  1  2013     1     1 11.5 
##  2  2013     1     2 13.9 
##  3  2013     1     3 11.0 
##  4  2013     1     4  8.95
##  5  2013     1     5  5.73
##  6  2013     1     6  7.15
##  7  2013     1     7  5.42
##  8  2013     1     8  2.55
##  9  2013     1     9  2.28
## 10  2013     1    10  2.84
## # ℹ 355 more rows

The group_by() exercise: your turn!

Which carrier had the largest mean departure delay? Smallest?
Which carrier had the largest difference between their max and min departure delay?
Which month has the largest standard deviation for arrival delays?

The pipe operator %>%

Recall the tasks we discussed. If you want to filter out the flights that are

departure on January of 2013
arrange these flights by the departure delay
and extract a few specific columns
and calculate the flying speed

sel_flights <- flights %>%
  # Extract flights departing in January 2013
  filter(month == 1, year == 2013) %>%
  
  # Arrange the flights by departure delay
  arrange(dep_delay) %>%
  
  # extract a few specific columns
  select(year:day, dep_time, arr_time, dep_delay, distance, air_time) %>%
  
  # Calculate the flying speed as distance divided by air_time
  mutate(flying_speed = distance / air_time)

# View the resulting dataset with extracted flights and calculated metrics
print(sel_flights)

## # A tibble: 27,004 × 9
##     year month   day dep_time arr_time dep_delay distance air_time flying_speed
##    <int> <int> <int>    <int>    <int>     <dbl>    <dbl>    <dbl>        <dbl>
##  1  2013     1    11     1900     2233       -30     1010      139         7.27
##  2  2013     1    29     1703     1947       -27     1620      250         6.48
##  3  2013     1    12     1354     1606       -22      762      110         6.93
##  4  2013     1    21     2137     2232       -22      269       38         7.08
##  5  2013     1    20      704     1025       -21     2402      348         6.90
##  6  2013     1    12     2050     2310       -20      937      123         7.62
##  7  2013     1    12     2134        4       -20     1065      135         7.89
##  8  2013     1    14     2050     2329       -20      937      135         6.94
##  9  2013     1     4     2140     2241       -19      269       45         5.98
## 10  2013     1    11     1947     2209       -18      647      102         6.34
## # ℹ 26,994 more rows

If these are separate tasks, it should not be too hard. But what if you need to conduct all these tasks all at once, how can you efficiently do it and organize the codes.

go to top

go back to course website

Advanced Data Wrangling in R using dplyr

Why do we need dplyr

Why do we need dplyr

Key functions in dplyr package

Key functions in dplyr package

Details of dplyr package and functions

Details of dplyr package and functions

Case study: dplyr package for the “flights” dataset

Case study: dplyr package for the “flights” dataset

The filter() function —————————————————

Filter based on one or more variables

Save filtered new data frame

logical tests for filtering

Using logical tests for filtering

multiple logical tests for filtering

multiple comparisons for filtering

The filter() exercise: your turn!

The arrange() function ————————————————–

ordering data

missing values always sorted at the end

The arrange() exercise: your turn!

The select() function —————————————————

selecting variables

Helper functions with the select() function

The select() function: your turn!

The rename() function

The mutate() function —————————————————

The mutate() function: create new variables based on existing variables

transmute() function: only keep new variables

The mutate() exercise: your turn!

The summarize() or summarise() function ———————————

summarizing data

The group_by() function ———————————

The group_by() exercise: your turn!

The pipe operator %>%

Key functions in `dplyr` package

Key functions in `dplyr` package

Details of `dplyr` package and functions

Details of `dplyr` package and functions

Case study: `dplyr` package for the “flights” dataset

Case study: `dplyr` package for the “flights” dataset