Data Cleaning: ensure \(\color{red}{\text{reliability}}\) and \(\color{red}{\text{quality}}\)

In business analytics, data comes from various sources such as databases, spreadsheets, and external sources. These sources often contain missing values, duplicate entries, outliers, and formatting inconsistencies, which can strongly skew the analysis results and lead to incorrect business decisions if not addressed properly.

Data cleaning is a crucial step in the process of preparing data for analysis in the field of business analytics. It involves identifying and correcting errors, inconsistencies, and inaccuracies in the dataset to ensure its reliability and quality.

WHY data cleaning is a crucial step

Organizational data often represent multiple truths
Organizations are increasingly integrating internal and external data sources
Analysts need to understand what assumptions or nuances are represented by the data

Do you enjoy it?

It is important and time-consuming,

but, do you enjoy data cleaning?

Data Wrangling in R

In R programming language, data wrangling skills are essential for efficiently manipulating and preparing data for analysis. Some important data wrangling skills in R include:

Data Importing
Data Cleaning
Data Transformation
Data Manipulation
Data Visualization
Handling Dates and Times
Handling Strings
Looping and Conditional Statements
Efficient Coding Practices

By performing thorough data cleaning, business analysts can improve the quality and reliability of their analyses, leading to more accurate insights and better-informed business decisions. It lays the foundation for successful data analysis and ensures that businesses can trust the results derived from their data-driven initiatives.

Data Importing

R supports the importing from various sources such as CSV files, Excel spreadsheets, databases (e.g., MySQL, PostgreSQL), web APIs, and other formats.

Option	Function	Speed in seconds
Base R	`read.csv("data/flights.csv")`	~ 4 seconds
`library(readr)`	`read_csv("data/flights.csv")`	~ 0.8 seconds
`library(data.table)`	`fread("data/flights.csv")`	~ 0.3 seconds
`library(readxl)`	`read_excel()`
Database: `library(DBI)`	`dbSendQuery()`
Web APIs: `library(httr)`	`GET(url = NULL,...)`

Data Importing

What about other flat files?

Examples: `read.csv()` and `read_excel()`

We focus on read.csv() function to read in .csv files, and read_excel() to read in .xlsx files.

Example 1: Importing data from a CSV file (Auto.csv)

Auto <- read.csv("data/Auto.csv")
# View(Auto)
head(Auto)

##   mpg cylinders displacement horsepower weight acceleration year origin
## 1  18         8          307        130   3504         12.0   70      1
## 2  15         8          350        165   3693         11.5   70      1
## 3  18         8          318        150   3436         11.0   70      1
## 4  16         8          304        150   3433         12.0   70      1
## 5  17         8          302        140   3449         10.5   70      1
## 6  15         8          429        198   4341         10.0   70      1
##                        name
## 1 chevrolet chevelle malibu
## 2         buick skylark 320
## 3        plymouth satellite
## 4             amc rebel sst
## 5               ford torino
## 6          ford galaxie 500

Example 2: Importing data from an Excel spreadsheet

(RetailSales2018.xlsx); (RetailSales2018_clean.xlsx)

library(readxl)
Retail <- read_xlsx(path = "data/RetailSales2018.xlsx")

## New names:
## • `` -> `...2`
## • `` -> `...3`
## • `` -> `...4`
## • `` -> `...5`

head(Retail)

## # A tibble: 6 × 5
##   RETAIL & FOOD SERVICES(http://www.census.gov/retail/…¹ ...2  ...3  ...4  ...5 
##   <chr>                                                  <chr> <chr> <chr> <chr>
## 1 SEASONAL FACTORS(http://www.census.gov/retail/marts/w… <NA>  <NA>  <NA>  <NA> 
## 2 CPI Table 24(http://www.bls.gov/cpi)                   <NA>  <NA>  <NA>  <NA> 
## 3 <NA>                                                   <NA>  <NA>  <NA>  <NA> 
## 4 <NA>                                                   <NA>  <NA>  <NA>  <NA> 
## 5 Year                                                   Month Sales S_Fa… CPI  
## 6 1992                                                   JAN   1471… 0.89… 138.1
## # ℹ abbreviated name:
## #   ¹`RETAIL & FOOD SERVICES(http://www.census.gov/retail/marts/www/timeseries.html)`

Learn about the data

So what are the first things we want to know about our data?
?
?
?
?

Data Cleaning

Skills to clean and preprocess data, which may involve handling missing values (NA), removing duplicates, converting data types, dealing with outliers, and detecting and correcting errors in the dataset using functions like na.omit(), complete.cases(), duplicated(), and various functions from the dplyr and tidyr packages.

Data Cleaning

na.omit() : Removes any rows that contain at least one NA.

df <- data.frame(
  x = c(1, 2, NA),
  y = c(4, NA, 6)
)

na.omit(df)

##   x y
## 1 1 4

Data Cleaning

complete.cases() : Returns a logical vector indicating which rows have no missing values.

complete.cases(df)

## [1]  TRUE FALSE FALSE

Data Cleaning

duplicated(): Identifies duplicate values or rows, marking all but the first occurrence as TRUE.

x <- c(1, 2, 2, 3, 1)
duplicated(x)

## [1] FALSE FALSE  TRUE FALSE  TRUE

Data Cleaning

Example 1: Removing useless rows

Retail2 <- Retail[-c(1:4),]
head(Retail2)

## # A tibble: 6 × 5
##   RETAIL & FOOD SERVICES(http://www.census.gov/retail/…¹ ...2  ...3  ...4  ...5 
##   <chr>                                                  <chr> <chr> <chr> <chr>
## 1 Year                                                   Month Sales S_Fa… CPI  
## 2 1992                                                   JAN   1471… 0.89… 138.1
## 3 1992                                                   FEB   1470… 0.89… 138.6
## 4 1992                                                   MAR   1596… 0.97… 139.…
## 5 1992                                                   APR   1636… 0.99… 139.5
## 6 1992                                                   MAY   1700… 1.02… 139.…
## # ℹ abbreviated name:
## #   ¹`RETAIL & FOOD SERVICES(http://www.census.gov/retail/marts/www/timeseries.html)`

# Change colomn names
names(Retail2) <- as.character(Retail2[1, ])
Retail2 <- Retail2[-1, ]
Retail2[1:3,]

## # A tibble: 3 × 5
##   Year  Month Sales              S_Factor            CPI               
##   <chr> <chr> <chr>              <chr>               <chr>             
## 1 1992  JAN   147182.451         0.89700000000000002 138.1             
## 2 1992  FEB   147012.70000000001 0.89500000000000002 138.6             
## 3 1992  MAR   159653.32499999998 0.97499999999999998 139.30000000000001

Data Cleaning

Example 2: Handling missing values

clean_Retail <- na.omit(Retail2)
dim(clean_Retail)

## [1] 318   5

Data Cleaning

Example 3: Removing duplicates:

clean_Retail2 <- clean_Retail[!duplicated(clean_Retail), ]
dim(clean_Retail2)

## [1] 318   5

Data Transformation

Ability to transform data by creating new variables, reshaping data from wide to long format (and vice versa), and summarizing data using functions like mutate(), select(), filter(), arrange(), group_by(), summarize() from the dplyr package, and functions like pivot_longer() and pivot_wider() from the tidyr package.

Data Manipulation

Skills to perform various data manipulation tasks such as merging/joining datasets, splitting data into subsets, and reshaping data using functions like merge(), join(), split(), reshape(), rbind(), cbind().

Data Visualization

For quick data exploration, base R plotting functions can provide an expeditious and straightforward approach to understanding your data. Data visualization skills are the abilities to visualize data to explore patterns, relationships, and trends using functions like ggplot2 for creating sophisticated and customizable plots, plot() for basic plots, and ggplotly() from the plotly package for interactive plots.

Handling Dates and Times

Skills to work with date and time data, including parsing, formatting, extracting components (e.g., year, month, day), and performing date arithmetic using functions like as.Date(), as.POSIXct(), strftime(), strptime(), lubridate package.

Handling Strings

Ability to manipulate and process character strings, including pattern matching, substring extraction, and string manipulation using functions like grep(), sub(), gsub(), strsplit(), and stringr package.

Looping and Conditional Statements

Understanding of control structures like loops (for, while) and conditional statements (if, else, ifelse) to automate repetitive tasks and apply conditional operations on data.

Efficient Coding Practices

Familiarity with efficient coding practices such as vectorization, using R’s apply family of functions (apply(), lapply(), sapply(), vapply(), mapply()), and leveraging the capabilities of packages like dplyr and tidyr for faster and more concise code.

These skills are crucial for effectively managing and preparing data for analysis and visualization tasks in R. Continuous practice and exploration of R packages and functions related to data wrangling will enhance proficiency in handling diverse datasets and extracting meaningful insights.

Data Wrangling in R

Importance of Data Wrangling!

Most Time-Consuming Data Science Task: Data Cleaning

Data Cleaning: ensure \(\color{red}{\text{reliability}}\) and \(\color{red}{\text{quality}}\)

WHY data cleaning is a crucial step

Do you enjoy it?

Data Wrangling in R

Data Importing

Data Importing

What about other flat files?

Examples: `read.csv()` and `read_excel()`

Learn about the data

Data Cleaning

Data Cleaning

Data Cleaning

Data Cleaning

Data Cleaning

Example 1: Removing useless rows

Data Cleaning

Example 2: Handling missing values

Data Cleaning

Example 3: Removing duplicates:

Data Transformation

Data Manipulation

Data Visualization

Handling Dates and Times

Handling Strings

Looping and Conditional Statements

Efficient Coding Practices

Time to flex our new knowledge muscles!

Data Wrangling in R

Importance of Data Wrangling!

Most Time-Consuming Data Science Task: Data Cleaning

Data Cleaning: ensure \(\color{red}{\text{reliability}}\) and \(\color{red}{\text{quality}}\)

WHY data cleaning is a crucial step

Do you enjoy it?

Data Wrangling in R

Data Importing

Data Importing

What about other flat files?

Examples: read.csv() and read_excel()

Learn about the data

Data Cleaning

Data Cleaning

Data Cleaning

Data Cleaning

Data Cleaning

Example 1: Removing useless rows

Data Cleaning

Example 2: Handling missing values

Data Cleaning

Example 3: Removing duplicates:

Data Transformation

Data Manipulation

Data Visualization

Handling Dates and Times

Handling Strings

Looping and Conditional Statements

Efficient Coding Practices

Time to flex our new knowledge muscles!

Examples: `read.csv()` and `read_excel()`