Data Wrangling in R

Xiaorui (Jeremy) Zhu

01/08/2026

Importance of Data Wrangling!

“Unless the data is something I’ve analyzed a lot before, I usually feel like the blind men and the elephant.” – Jeff Leek

Most Time-Consuming Data Science Task: Data Cleaning

80% of the work - Data preparation accounts for about 80% of the work of data scientists

Data Cleaning: ensure \(\color{red}{\text{reliability}}\) and \(\color{red}{\text{quality}}\)

In business analytics, data comes from various sources such as databases, spreadsheets, and external sources. These sources often contain missing values, duplicate entries, outliers, and formatting inconsistencies, which can strongly skew the analysis results and lead to incorrect business decisions if not addressed properly.

Data cleaning is a crucial step in the process of preparing data for analysis in the field of business analytics. It involves identifying and correcting errors, inconsistencies, and inaccuracies in the dataset to ensure its reliability and quality.

WHY data cleaning is a crucial step

Do you enjoy it?

It is important and time-consuming,

but, do you enjoy data cleaning?

Data Wrangling in R

In R programming language, data wrangling skills are essential for efficiently manipulating and preparing data for analysis. Some important data wrangling skills in R include:

  1. Data Importing
  2. Data Cleaning
  3. Data Transformation
  4. Data Manipulation
  5. Data Visualization
  6. Handling Dates and Times
  7. Handling Strings
  8. Looping and Conditional Statements
  9. Efficient Coding Practices

By performing thorough data cleaning, business analysts can improve the quality and reliability of their analyses, leading to more accurate insights and better-informed business decisions. It lays the foundation for successful data analysis and ensures that businesses can trust the results derived from their data-driven initiatives.

Data Importing

R supports the importing from various sources such as CSV files, Excel spreadsheets, databases (e.g., MySQL, PostgreSQL), web APIs, and other formats.

Option Function Speed in seconds
Base R read.csv("data/flights.csv") ~ 4 seconds
library(readr) read_csv("data/flights.csv") ~ 0.8 seconds
library(data.table) fread("data/flights.csv") ~ 0.3 seconds
library(readxl) read_excel()
Database: library(DBI) dbSendQuery()
Web APIs: library(httr) GET(url = NULL,...)

Data Importing

What about other flat files?

Examples: read.csv() and read_excel()

We focus on read.csv() function to read in .csv files, and read_excel() to read in .xlsx files.

  1. Example 1: Importing data from a CSV file (Auto.csv)
Auto <- read.csv("data/Auto.csv")
# View(Auto)
head(Auto)
##   mpg cylinders displacement horsepower weight acceleration year origin
## 1  18         8          307        130   3504         12.0   70      1
## 2  15         8          350        165   3693         11.5   70      1
## 3  18         8          318        150   3436         11.0   70      1
## 4  16         8          304        150   3433         12.0   70      1
## 5  17         8          302        140   3449         10.5   70      1
## 6  15         8          429        198   4341         10.0   70      1
##                        name
## 1 chevrolet chevelle malibu
## 2         buick skylark 320
## 3        plymouth satellite
## 4             amc rebel sst
## 5               ford torino
## 6          ford galaxie 500
  1. Example 2: Importing data from an Excel spreadsheet

(RetailSales2018.xlsx); (RetailSales2018_clean.xlsx)

library(readxl)
Retail <- read_xlsx(path = "data/RetailSales2018.xlsx")
## New names:
## • `` -> `...2`
## • `` -> `...3`
## • `` -> `...4`
## • `` -> `...5`
head(Retail)
## # A tibble: 6 × 5
##   RETAIL & FOOD SERVICES(http://www.census.gov/retail/…¹ ...2  ...3  ...4  ...5 
##   <chr>                                                  <chr> <chr> <chr> <chr>
## 1 SEASONAL FACTORS(http://www.census.gov/retail/marts/w… <NA>  <NA>  <NA>  <NA> 
## 2 CPI Table 24(http://www.bls.gov/cpi)                   <NA>  <NA>  <NA>  <NA> 
## 3 <NA>                                                   <NA>  <NA>  <NA>  <NA> 
## 4 <NA>                                                   <NA>  <NA>  <NA>  <NA> 
## 5 Year                                                   Month Sales S_Fa… CPI  
## 6 1992                                                   JAN   1471… 0.89… 138.1
## # ℹ abbreviated name:
## #   ¹​`RETAIL & FOOD SERVICES(http://www.census.gov/retail/marts/www/timeseries.html)`

Learn about the data

Data Cleaning

Skills to clean and preprocess data, which may involve handling missing values (NA), removing duplicates, converting data types, dealing with outliers, and detecting and correcting errors in the dataset using functions like na.omit(), complete.cases(), duplicated(), and various functions from the dplyr and tidyr packages.

Data Cleaning

df <- data.frame(
  x = c(1, 2, NA),
  y = c(4, NA, 6)
)

na.omit(df)
##   x y
## 1 1 4

Data Cleaning

complete.cases(df)
## [1]  TRUE FALSE FALSE

Data Cleaning

x <- c(1, 2, 2, 3, 1)
duplicated(x)
## [1] FALSE FALSE  TRUE FALSE  TRUE

Data Cleaning

Example 1: Removing useless rows

Retail2 <- Retail[-c(1:4),]
head(Retail2)
## # A tibble: 6 × 5
##   RETAIL & FOOD SERVICES(http://www.census.gov/retail/…¹ ...2  ...3  ...4  ...5 
##   <chr>                                                  <chr> <chr> <chr> <chr>
## 1 Year                                                   Month Sales S_Fa… CPI  
## 2 1992                                                   JAN   1471… 0.89… 138.1
## 3 1992                                                   FEB   1470… 0.89… 138.6
## 4 1992                                                   MAR   1596… 0.97… 139.…
## 5 1992                                                   APR   1636… 0.99… 139.5
## 6 1992                                                   MAY   1700… 1.02… 139.…
## # ℹ abbreviated name:
## #   ¹​`RETAIL & FOOD SERVICES(http://www.census.gov/retail/marts/www/timeseries.html)`
# Change colomn names
names(Retail2) <- as.character(Retail2[1, ])
Retail2 <- Retail2[-1, ]
Retail2[1:3,]
## # A tibble: 3 × 5
##   Year  Month Sales              S_Factor            CPI               
##   <chr> <chr> <chr>              <chr>               <chr>             
## 1 1992  JAN   147182.451         0.89700000000000002 138.1             
## 2 1992  FEB   147012.70000000001 0.89500000000000002 138.6             
## 3 1992  MAR   159653.32499999998 0.97499999999999998 139.30000000000001

Data Cleaning

Example 2: Handling missing values

clean_Retail <- na.omit(Retail2)
dim(clean_Retail)
## [1] 318   5

Data Cleaning

Example 3: Removing duplicates:

clean_Retail2 <- clean_Retail[!duplicated(clean_Retail), ]
dim(clean_Retail2)
## [1] 318   5

Data Transformation

Ability to transform data by creating new variables, reshaping data from wide to long format (and vice versa), and summarizing data using functions like mutate(), select(), filter(), arrange(), group_by(), summarize() from the dplyr package, and functions like pivot_longer() and pivot_wider() from the tidyr package.

Data Manipulation

Skills to perform various data manipulation tasks such as merging/joining datasets, splitting data into subsets, and reshaping data using functions like merge(), join(), split(), reshape(), rbind(), cbind().

Data Visualization

For quick data exploration, base R plotting functions can provide an expeditious and straightforward approach to understanding your data. Data visualization skills are the abilities to visualize data to explore patterns, relationships, and trends using functions like ggplot2 for creating sophisticated and customizable plots, plot() for basic plots, and ggplotly() from the plotly package for interactive plots.

Handling Dates and Times

Skills to work with date and time data, including parsing, formatting, extracting components (e.g., year, month, day), and performing date arithmetic using functions like as.Date(), as.POSIXct(), strftime(), strptime(), lubridate package.

Handling Strings

Ability to manipulate and process character strings, including pattern matching, substring extraction, and string manipulation using functions like grep(), sub(), gsub(), strsplit(), and stringr package.

Looping and Conditional Statements

Understanding of control structures like loops (for, while) and conditional statements (if, else, ifelse) to automate repetitive tasks and apply conditional operations on data.

Efficient Coding Practices

Familiarity with efficient coding practices such as vectorization, using R’s apply family of functions (apply(), lapply(), sapply(), vapply(), mapply()), and leveraging the capabilities of packages like dplyr and tidyr for faster and more concise code.

These skills are crucial for effectively managing and preparing data for analysis and visualization tasks in R. Continuous practice and exploration of R packages and functions related to data wrangling will enhance proficiency in handling diverse datasets and extracting meaningful insights.

Time to flex our new knowledge muscles!

Let’s apply what you’ve learned to a real case.

go to top

go back to course website