Exploratory Data Analysis (EDA) is usually the first step when you analyze data, because you want to know what information the dataset carries. In this lab, we introduce basic R functions for EDA using both quantitative and graphical approaches. At the end, we will also learn some useful functions for data manipulation, which is often necessary in data analysis.
Before we start, always do the following.
Let’s first load the Iris dataset. This is a very famous dataset in almost all data mining and machine learning courses, and it is a built-in R dataset. The dataset consists of 50 samples from each of three species of Iris flowers (Iris setosa, Iris virginica, and Iris versicolor). Four features (variables) were measured from each sample: the length and the width of the sepal and petal, in centimeters. It was introduced by Sir Ronald Fisher in 1936.
The iris flower data set is included in R. It is a data frame with 150 cases (rows) and 5 variables (columns) named Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species.
First, load the iris data into the current workspace.
data(iris)
iris
You can use head() or tail() to print the first or last few rows of a dataset:
head(iris)
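Similarly, tail() shows the last few rows:
tail(iris)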
Check the dimensionality: the dataset has 150 rows (observations) and 5 columns (variables).
dim(iris)
## [1] 150 5
Another way to get the dimensions is to use ncol() or nrow():
ncol(iris)
## [1] 5
nrow(iris)
## [1] 150
Variable names (column names):
names(iris)
## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
You can also use this command
colnames(iris)
Structure of the data frame. Note the difference between num and Factor:
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
By default (in R versions before 4.0.0), R treats strings as factors (categorical variables). In many situations (for example, building a regression model) this is what you want, because R can automatically create “dummy variables” from the factors. However, when merging data from different sources this can cause errors. In that case you can use the stringsAsFactors = FALSE option in read.table() or read.csv().
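As a quick illustration (this call is not part of the original lab code), here is how you could read the customer data used later in this lab while keeping string columns as plain characters. Note that since R 4.0.0, stringsAsFactors = FALSE is already the default in read.table() and read.csv().
# keep strings as characters rather than converting them to factors
customer_chr <- read.csv(file = "https://xiaoruizhu.github.io/Data-Mining-R/lecture/data/CustomerData.csv",
                         stringsAsFactors = FALSE)
class(customer_chr$MaritalStatus)  # "character" instead of "factor"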
class(iris[,1])
## [1] "numeric"
class(iris[,5])
## [1] "factor"
Try the summary() function.
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100 setosa :50
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300 versicolor:50
## Median :5.800 Median :3.000 Median :4.350 Median :1.300 virginica :50
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
It only produces location statistics for continuous variables and counts for categorical variables. What about the standard deviation, another important summary statistic?
sd(iris$Sepal.Length)
## [1] 0.8280661
quantile(iris$Sepal.Length)
## 0% 25% 50% 75% 100%
## 4.3 5.1 5.8 6.4 7.9
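quantile() is not limited to the default quartiles; you can request any probabilities, for example the 10th and 90th percentiles:
quantile(iris$Sepal.Length, probs = c(0.1, 0.9))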
Use apply() to calculate a particular statistic for multiple variables at the same time.
apply(iris[,1:4], 2, sd) # "2" means "by column"
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 0.8280661 0.4358663 1.7652982 0.7622377
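apply() works with any function, including one you write yourself. For instance (an extra illustration, not from the original lab), an anonymous function can return the mean and standard deviation of each column at once:
# each column of the result holds the mean and sd of one variable
apply(iris[, 1:4], 2, function(x) c(mean = mean(x), sd = sd(x)))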
Use the aggregate() function to find summary statistics by group.
# group mean
aggregate(.~Species, iris, mean)
# group standard deviation
aggregate(.~Species, iris, sd)
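The formula interface also works for a single variable. As an extra example, the group medians of Sepal.Length:
# group median of a single variable
aggregate(Sepal.Length ~ Species, iris, median)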
What if there are multiple “factor” variables?
Let’s first artificially create a new column that categorizes “Sepal.Length” by quantile.
# use the cut function for categorization based on quantiles
Cate.SepalLength <- cut(iris$Sepal.Length, breaks = quantile(iris$Sepal.Length), include.lowest = TRUE)
# Add the created categorical variable to the data
iris1<- iris
iris1$Sepal.Length.Cate <- Cate.SepalLength
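A quick check of the new variable (an extra step, not in the original lab): tabulating it shows that each quantile bin contains roughly a quarter of the observations.
table(iris1$Sepal.Length.Cate)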
Average value of the numerical variables by two categorical variables, Species and Sepal.Length.Cate:
aggregate(.~Species+Sepal.Length.Cate, iris1, mean)
# One-way count table
table(iris1$Species)
##
## setosa versicolor virginica
## 50 50 50
# Two-way count table
table(iris1$Species, iris1$Sepal.Length.Cate)
##
## [4.3,5.1] (5.1,5.8] (5.8,6.4] (6.4,7.9]
## setosa 36 14 0 0
## versicolor 4 20 17 9
## virginica 1 5 18 26
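If proportions are more useful than raw counts (for example, when comparing rates across groups, as in the exercise below), prop.table() converts a count table into proportions; margin = 1 gives row proportions:
# proportion of each Sepal.Length category within each species
prop.table(table(iris1$Species, iris1$Sepal.Length.Cate), margin = 1)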
Download the customer data and read it into R.
customer <- read.csv(file = "https://xiaoruizhu.github.io/Data-Mining-R/lecture/data/CustomerData.csv")
- How many rows and columns does the dataset have?
- Print the first few rows of the dataset.
- Obtain the summary statistics (Min, Median, Max, Mean and Std.) for Age, EducationYears, HHIncome, and CreditDebt.
- Obtain the mean of HHIncome by MaritalStatus.
- Obtain a pivot table of LoanDefault vs. JobCategory. Which job category has the highest and lowest loan default rate?
Sorting by one or more variables is a common operation that you can perform on datasets. With RStudio version 0.99+, you can sort a dataset when viewing it by clicking a column header.
To do it with code, let’s suppose that you would like to find the top 5 rows of the iris dataset with the largest Sepal.Length.
iris[order(iris$Sepal.Length, decreasing = TRUE)[1:5], ]
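Sorting by more than one variable works the same way: additional keys passed to order() break ties. For example (a variant of the call above):
# sort by Sepal.Length, breaking ties with Sepal.Width (both descending)
iris[order(iris$Sepal.Length, iris$Sepal.Width, decreasing = TRUE)[1:5], ]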
The syntax is cleaner with the arrange() function in the dplyr package:
library(dplyr)
arrange(iris, desc(Sepal.Length))[1:5, ]
If you want to select one or more variables of a data frame, there are two ways to do that. The first is indexing with “[]”; the second is the select() function in dplyr, shown below. For example, suppose we want to select the variable “Sepal.Length”:
iris[, "Sepal.Length"]
or alternatively select two variables, “Sepal.Length” and “Sepal.Width”:
iris[, c("Sepal.Length", "Sepal.Width")]
Recall the customer dataset we used in the previous exercise.
customer <- read.csv("data/CustomerData.csv")
# How many missing values are in customer dataset?
sum(is.na(customer))
## [1] 76
# How many missing values are in each variable?
data.frame(num_missing=colSums(is.na(customer)))
# Simply delete rows with missing values
clean_customer<- na.omit(customer)
nrow(clean_customer)
## [1] 4940
# Impute missing values with the median
medHS <- median(customer$HouseholdSize, na.rm = TRUE)
customer$HouseholdSize[is.na(customer$HouseholdSize)] <- medHS
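As a quick sanity check (an extra step), recount the missing values in that column after imputation; the result should now be zero:
sum(is.na(customer$HouseholdSize))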