Xiaorui (Jeremy) Zhu
01/20/2026
Making data manipulation easier and more intuitive .
It simplifies the process of data manipulation by providing a consistent grammar for interacting with data.
People can easily filter, arrange, summarize, and extract specific parts of their data without complex code.
For example, you have a data frame with information about customers, including their names, ages, spending and satisfaction level.
You want to filter out customers who are
dplyr packageYou learned six key dplyr functions that allow you to solve the vast majority of your data manipulation challenges. What do each do?
dplyr packageYou learned six key dplyr functions that allow you to solve the vast majority of your data manipulation challenges. What do each do?
dplyr package and functionsdplyr package and functionsdplyr package for the “flights”
datasetThe “flights” dataset in the “nycflights13” R package is a popular dataset containing information about flights departing from New York City airports in 2013. It includes details such as flight dates, departure and arrival times, airlines, distances, and delays.
It’s great for exploring patterns in flight data, understanding the factors affecting delays, and building predictive models.
Question for EducateUs: can you briefly explain the “flights” dataset in the “nycflights13” R package?
You will learn six key dplyr functions that allow you to
solve the vast majority of your data manipulation challenges.
dplyr package for the “flights”
datasetIf you want to filter out the flights that are
filter() function: filter values based on defined conditions
## # A tibble: 27,004 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # ℹ 26,994 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
## # A tibble: 842 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # ℹ 832 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
## # A tibble: 352 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 601 600 1 844 850
## 5 2013 1 1 608 600 8 807 735
## 6 2013 1 1 611 600 11 945 931
## 7 2013 1 1 613 610 3 925 921
## 8 2013 1 1 623 610 13 920 915
## 9 2013 1 1 632 608 24 740 728
## 10 2013 1 1 644 636 8 931 940
## # ℹ 342 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
# save a new data frame
jan1 <- filter(flights, month == 1, day == 1)
(dec25 <- filter(flights, month == 12, day == 25))## # A tibble: 719 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 12 25 456 500 -4 649 651
## 2 2013 12 25 524 515 9 805 814
## 3 2013 12 25 542 540 2 832 850
## 4 2013 12 25 546 550 -4 1022 1027
## 5 2013 12 25 556 600 -4 730 745
## 6 2013 12 25 557 600 -3 743 752
## 7 2013 12 25 557 600 -3 818 831
## 8 2013 12 25 559 600 -1 855 856
## 9 2013 12 25 559 600 -1 849 855
## 10 2013 12 25 600 600 0 850 846
## # ℹ 709 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
What will these operations produce?
filter(flights, month == 12)
filter(flights, month != 12)
filter(flights, month %in% c(11, 12))
filter(flights, arr_delay <= 120)
filter(flights, !(arr_delay <= 120))
filter(flights, is.na(tailnum))12 == 12 & 12 < 14
12 == 12 & 12 < 10
12 == 12 | 12 < 10
any(12 == 12, 12 < 10)
all(12 == 12, 12 < 10)# Using comma is same as using &
filter(flights, month == 12, day == 25)
filter(flights, month == 12 & day == 25)
# Use %in% as a shortcut for |
filter(flights, month == 11 | month == 12)
filter(flights, month %in% c(11, 12))
# Use not operator !
filter(flights, !(arr_delay > 120 | dep_delay > 120))
filter(flights, arr_delay <= 120, dep_delay <= 120)Find the number of flights that
Had an arrival delay of two or more hours
Flew to Houston (IAH or HOU)
Arrived more than two hours late, but didn’t leave late
arrange() function: reorder data
## # A tibble: 336,776 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 12 7 2040 2123 -43 40 2352
## 2 2013 2 3 2022 2055 -33 2240 2338
## 3 2013 11 10 1408 1440 -32 1549 1559
## 4 2013 1 11 1900 1930 -30 2233 2243
## 5 2013 1 29 1703 1730 -27 1947 1957
## 6 2013 8 9 729 755 -26 1002 955
## 7 2013 10 23 1907 1932 -25 2143 2143
## 8 2013 3 30 2030 2055 -25 2213 2250
## 9 2013 3 2 1431 1455 -24 1601 1631
## 10 2013 5 5 934 958 -24 1225 1309
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
## # A tibble: 336,776 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 12 7 2040 2123 -43 40 2352
## 2 2013 2 3 2022 2055 -33 2240 2338
## 3 2013 11 10 1408 1440 -32 1549 1559
## 4 2013 1 11 1900 1930 -30 2233 2243
## 5 2013 1 29 1703 1730 -27 1947 1957
## 6 2013 8 9 729 755 -26 1002 955
## 7 2013 3 30 2030 2055 -25 2213 2250
## 8 2013 10 23 1907 1932 -25 2143 2143
## 9 2013 5 5 934 958 -24 1225 1309
## 10 2013 9 18 1631 1655 -24 1812 1845
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
## # A tibble: 336,776 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 9 641 900 1301 1242 1530
## 2 2013 6 15 1432 1935 1137 1607 2120
## 3 2013 1 10 1121 1635 1126 1239 1810
## 4 2013 9 20 1139 1845 1014 1457 2210
## 5 2013 7 22 845 1600 1005 1044 1815
## 6 2013 4 10 1100 1900 960 1342 2211
## 7 2013 3 17 2321 810 911 135 1020
## 8 2013 6 27 959 1900 899 1236 2226
## 9 2013 7 22 2257 759 898 121 1026
## 10 2013 12 5 756 1700 896 1058 2020
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
## # A tibble: 4 × 1
## x
## <dbl>
## 1 2
## 2 5
## 3 5
## 4 NA
## # A tibble: 4 × 1
## x
## <dbl>
## 1 5
## 2 5
## 3 2
## 4 NA
Select variables of concern
## # A tibble: 336,776 × 3
## year month day
## <int> <int> <int>
## 1 2013 1 1
## 2 2013 1 1
## 3 2013 1 1
## 4 2013 1 1
## 5 2013 1 1
## 6 2013 1 1
## 7 2013 1 1
## 8 2013 1 1
## 9 2013 1 1
## 10 2013 1 1
## # ℹ 336,766 more rows
## # A tibble: 336,776 × 3
## year month day
## <int> <int> <int>
## 1 2013 1 1
## 2 2013 1 1
## 3 2013 1 1
## 4 2013 1 1
## 5 2013 1 1
## 6 2013 1 1
## 7 2013 1 1
## 8 2013 1 1
## 9 2013 1 1
## 10 2013 1 1
## # ℹ 336,766 more rows
## # A tibble: 336,776 × 16
## dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
## <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 517 515 2 830 819 11 UA
## 2 533 529 4 850 830 20 UA
## 3 542 540 2 923 850 33 AA
## 4 544 545 -1 1004 1022 -18 B6
## 5 554 600 -6 812 837 -25 DL
## 6 554 558 -4 740 728 12 UA
## 7 555 600 -5 913 854 19 B6
## 8 557 600 -3 709 723 -14 EV
## 9 557 600 -3 838 846 -8 B6
## 10 558 600 -2 753 745 8 AA
## # ℹ 336,766 more rows
## # ℹ 9 more variables: flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
## # A tibble: 336,776 × 5
## dep_time sched_dep_time arr_time sched_arr_time air_time
## <int> <int> <int> <int> <dbl>
## 1 517 515 830 819 227
## 2 533 529 850 830 227
## 3 542 540 923 850 160
## 4 544 545 1004 1022 183
## 5 554 600 812 837 116
## 6 554 558 740 728 150
## 7 555 600 913 854 158
## 8 557 600 709 723 53
## 9 557 600 838 846 140
## 10 558 600 753 745 138
## # ℹ 336,766 more rows
## # A tibble: 336,776 × 8
## carrier dep_time sched_dep_time arr_time sched_arr_time air_time dep_delay
## <chr> <int> <int> <int> <int> <dbl> <dbl>
## 1 UA 517 515 830 819 227 2
## 2 UA 533 529 850 830 227 4
## 3 AA 542 540 923 850 160 2
## 4 B6 544 545 1004 1022 183 -1
## 5 DL 554 600 812 837 116 -6
## 6 UA 554 558 740 728 150 -4
## 7 B6 555 600 913 854 158 -5
## 8 EV 557 600 709 723 53 -3
## 9 B6 557 600 838 846 140 -3
## 10 AA 558 600 753 745 138 -2
## # ℹ 336,766 more rows
## # ℹ 1 more variable: arr_delay <dbl>
## # A tibble: 336,776 × 19
## time_hour air_time year month day dep_time sched_dep_time
## <dttm> <dbl> <int> <int> <int> <int> <int>
## 1 2013-01-01 05:00:00 227 2013 1 1 517 515
## 2 2013-01-01 05:00:00 227 2013 1 1 533 529
## 3 2013-01-01 05:00:00 160 2013 1 1 542 540
## 4 2013-01-01 05:00:00 183 2013 1 1 544 545
## 5 2013-01-01 06:00:00 116 2013 1 1 554 600
## 6 2013-01-01 05:00:00 150 2013 1 1 554 558
## 7 2013-01-01 06:00:00 158 2013 1 1 555 600
## 8 2013-01-01 06:00:00 53 2013 1 1 557 600
## 9 2013-01-01 06:00:00 140 2013 1 1 557 600
## 10 2013-01-01 06:00:00 138 2013 1 1 558 600
## # ℹ 336,766 more rows
## # ℹ 12 more variables: dep_delay <dbl>, arr_time <int>, sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>, origin <chr>,
## # dest <chr>, distance <dbl>, hour <dbl>, minute <dbl>
What happens if you include the name of a variable multiple times in a select() call?
What does the one_of() function do? Why might it be helpful in conjunction with this vector?
vars <- c("MONTH", "month", "day", "dep_delay", "arr_delay")
## # A tibble: 336,776 × 19
## year month day dep_time sched_dep_time ANNOYING arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
Create new variables with functions of existing variables
mutate() creates new variables with functions of existing variables:
# create a smaller data set to work with
flights_sml <- select(
flights,
year:day,
ends_with("delay"),
distance,
air_time
)
flights_sml## # A tibble: 336,776 × 7
## year month day dep_delay arr_delay distance air_time
## <int> <int> <int> <dbl> <dbl> <dbl> <dbl>
## 1 2013 1 1 2 11 1400 227
## 2 2013 1 1 4 20 1416 227
## 3 2013 1 1 2 33 1089 160
## 4 2013 1 1 -1 -18 1576 183
## 5 2013 1 1 -6 -25 762 116
## 6 2013 1 1 -4 12 719 150
## 7 2013 1 1 -5 19 1065 158
## 8 2013 1 1 -3 -14 229 53
## 9 2013 1 1 -3 -8 944 140
## 10 2013 1 1 -2 8 733 138
## # ℹ 336,766 more rows
mutate(flights_sml,
# arr_delay=arr_delay, dep_delay=dep_delay,
hours = air_time / 60,
gain = arr_delay - dep_delay,
speed = distance / air_time * 60,
gain_per_hour = gain / hours)## # A tibble: 336,776 × 11
## year month day dep_delay arr_delay distance air_time hours gain speed
## <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2013 1 1 2 11 1400 227 3.78 9 370.
## 2 2013 1 1 4 20 1416 227 3.78 16 374.
## 3 2013 1 1 2 33 1089 160 2.67 31 408.
## 4 2013 1 1 -1 -18 1576 183 3.05 -17 517.
## 5 2013 1 1 -6 -25 762 116 1.93 -19 394.
## 6 2013 1 1 -4 12 719 150 2.5 16 288.
## 7 2013 1 1 -5 19 1065 158 2.63 24 404.
## 8 2013 1 1 -3 -14 229 53 0.883 -11 259.
## 9 2013 1 1 -3 -8 944 140 2.33 -5 405.
## 10 2013 1 1 -2 8 733 138 2.3 10 319.
## # ℹ 336,766 more rows
## # ℹ 1 more variable: gain_per_hour <dbl>
If you only want to keep the new variables use transmute():
transmute(
flights,
gain = arr_delay - dep_delay,
hours = air_time / 60,
gain_per_hour = gain / hours
)## # A tibble: 336,776 × 3
## gain hours gain_per_hour
## <dbl> <dbl> <dbl>
## 1 9 3.78 2.38
## 2 16 3.78 4.23
## 3 31 2.67 11.6
## 4 -17 3.05 -5.57
## 5 -19 1.93 -9.83
## 6 16 2.5 6.4
## 7 24 2.63 9.11
## 8 -11 0.883 -12.5
## 9 -5 2.33 -2.14
## 10 10 2.3 4.35
## # ℹ 336,766 more rows
Create a new variable distance_km that converts distance in miles to kilometers
Create a time_per_km variable based on air_time and distance_km.
Collapse many values down to a single summary statistic
## # A tibble: 1 × 1
## dep_delay_mean
## <dbl>
## 1 12.6
summarise(flights,
dep_delay_mean = mean(dep_delay, na.rm = TRUE),
dep_delay_sd = sd(dep_delay, na.rm = TRUE))## # A tibble: 1 × 2
## dep_delay_mean dep_delay_sd
## <dbl> <dbl>
## 1 12.6 40.2
summarise(flights,
dep_delay_mean = mean(dep_delay, na.rm = TRUE),
dep_delay_sd = sd(dep_delay, na.rm = TRUE),
n = n())## # A tibble: 1 × 3
## dep_delay_mean dep_delay_sd n
## <dbl> <dbl> <int>
## 1 12.6 40.2 336776
by_day <- group_by(flights, year, month, day)
summarise(by_day, delay = mean(dep_delay, na.rm = TRUE))## # A tibble: 365 × 4
## # Groups: year, month [12]
## year month day delay
## <int> <int> <int> <dbl>
## 1 2013 1 1 11.5
## 2 2013 1 2 13.9
## 3 2013 1 3 11.0
## 4 2013 1 4 8.95
## 5 2013 1 5 5.73
## 6 2013 1 6 7.15
## 7 2013 1 7 5.42
## 8 2013 1 8 2.55
## 9 2013 1 9 2.28
## 10 2013 1 10 2.84
## # ℹ 355 more rows
Recall the tasks we discussed. If you want to filter out the flights that are
sel_flights <- flights %>%
# Extract flights departing in January 2013
filter(month == 1, year == 2013) %>%
# Arrange the flights by departure delay
arrange(dep_delay) %>%
# extract a few specific columns
select(year:day, dep_time, arr_time, dep_delay, distance, air_time) %>%
# Calculate the flying speed as distance divided by air_time
mutate(flying_speed = distance / air_time)
# View the resulting dataset with extracted flights and calculated metrics
print(sel_flights)## # A tibble: 27,004 × 9
## year month day dep_time arr_time dep_delay distance air_time flying_speed
## <int> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl>
## 1 2013 1 11 1900 2233 -30 1010 139 7.27
## 2 2013 1 29 1703 1947 -27 1620 250 6.48
## 3 2013 1 12 1354 1606 -22 762 110 6.93
## 4 2013 1 21 2137 2232 -22 269 38 7.08
## 5 2013 1 20 704 1025 -21 2402 348 6.90
## 6 2013 1 12 2050 2310 -20 937 123 7.62
## 7 2013 1 12 2134 4 -20 1065 135 7.89
## 8 2013 1 14 2050 2329 -20 937 135 6.94
## 9 2013 1 4 2140 2241 -19 269 45 5.98
## 10 2013 1 11 1947 2209 -18 647 102 6.34
## # ℹ 26,994 more rows
If these are separate tasks, it should not be too hard. But what if you need to conduct all these tasks all at once, how can you efficiently do it and organize the codes.