We introduce package tidyverse, and some basic functions in the sub-packages for EDA. For more details, please see https://www.tidyverse.org/. This section is based on Dr. Bradley Boehmke’s short course for MSBA students at Lindner College of Business. The course materials can be downloaded from here.
install.packages("tidyverse")
library(tidyverse)
## -- Attaching packages ------------------------------------------------------------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.0 v purrr 0.3.3
## v tibble 2.1.3 v dplyr 0.8.4
## v tidyr 1.0.2 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## -- Conflicts ---------------------------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
We introduce dplyr package with some very user-friendly functions for data manipulation. These functions are:
filter()
select()
arrange()
rename()
mutate()
Here I introduce 4 ways to get subsets of data that satisfy certain logical conditions: subset()
, logical vectors, SQL, and filter()
. These kind of operations are called filtering in Excel. Knowing any one of these well is enough. Do not worry about memorizing the syntax, you can always look them up.
Suppose we want to get the observations that have Sepal.Length > 5 and Sepal.Width > 4. We can use logical operators: != not equal to; == equal to; | or; & and.
data(iris)
subset(x = iris, subset = Sepal.Length > 5 & Sepal.Width > 4)
You can omit the x = and subset = part
subset(iris, Sepal.Length > 5 & Sepal.Width > 4)
iris[(iris$Sepal.Length > 5 & iris$Sepal.Width > 4), ]
install.packages('sqldf')
library(sqldf)
sqldf('select * from iris where `Sepal.Length` > 5 and `Sepal.Width` > 4')
In earlier version of sqldf all dots(.) in variable names need to be changed to underscores(_).
filter()
is a power function in package dplyr to perform fitering like Excel Filter.# filter by row observations
data(iris)
iris_filter <- filter(iris, Sepal.Length<=5 & Sepal.Width>3)
iris_filter2 <- filter(iris, Species=="setosa", Sepal.Width<=3 | Sepal.Width>=4)
The following code random sample (without replacement) 90% of the original dataset and assgin them to a new variable iris_sample.
iris_sample <- iris[sample(x = nrow(iris), size = nrow(iris)*0.90),]
The dplyr
package provides more convinient ways for generating random samples. You can take a fixed number of samples using sample_n()
or a fraction using sample_frac()
as follows
install.packages('dplyr')
library(dplyr)
iris_sample <- sample_frac(iris, 0.9)
The dplyr
package provides more convinient ways for generating random samples. You can take a fixed number of samples using sample_n()
or a fraction using sample_frac()
as follows
install.packages('dplyr')
library(dplyr)
iris_sample <- sample_frac(iris, 0.9)
# using dplyr for logical subsetting
filter(iris, Sepal.Length> 5, Sepal.Width > 4)
I recommend you to go through the dplyr
tutorial and lubridate tutorial. They make common data manipulation tasks and dealing with time-date much easier in R.
Sorting by one or more variables is a common operation that you can do with datasets. With RStudio version 0.99+, you can sort a dataset when viewing it by clicking column header.
To do it with code, let’s suppose that you would like to find the top 5 rows in iris
dataset with largest Sepal.Length
.
iris[order(iris$Sepal.Length, decreasing = TRUE)[1:5], ]
The syntax is cleaner with the arrange()
function in the dplyr
package:
arrange(iris, desc(Sepal.Length))[1:5, ]
If you want to select one or more variables of a data frame, there are two ways to do that. First is using indexing by “[]”. Second is select()
function in dplyr. For example, suppose we want to select variable “Sepal.Length”:
iris[, "Sepal.Length"]
or alternatively select two variables: “Sepal.Length”, “Sepal.Width”
iris[, c("Sepal.Length", "Sepal.Width")]
On the other hand, select()
in dplyr package can be used to filter by column, i.e., selecting or dropping variables.
# Keep the variable Sepal.Length, Sepal.Width
varname <- c("Sepal.Length", "Sepal.Width")
iris_select <- select(iris, varname)
## Note: Using an external vector in selections is ambiguous.
## i Use `all_of(varname)` instead of `varname` to silence this message.
## i See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
# verify if we did correctly
names(iris_select)
## [1] "Sepal.Length" "Sepal.Width"
# This is equivalent to
iris_select <- iris[,varname]
What about dropping variables?
iris_select2 <- select(iris, -Sepal.Length, -Petal.Length, -Species)
names(iris_select2)
## [1] "Sepal.Width" "Petal.Width"
This is equivalent to
varname <- c("Sepal.Length", "Petal.Length", "Species")
iris_select2 <- iris[,!names(iris) %in% varname]
names(iris_select2)
## [1] "Sepal.Width" "Petal.Width"
It would be easier if you know the order of the variables that you want to drop or keep. Try to obtain iris_select and iris_select2 by using “dataname[, "variable_index"].”
Sorting by one or more variables is a common operation that you can do with datasets. With RStudio version 0.99+, you can sort a dataset when viewing it by clicking column header.
To do it with code, let’s suppose that you would like to find the top 5 rows in iris
dataset with largest Sepal.Length
.
iris[order(iris$Sepal.Length, decreasing = TRUE)[1:5], ]
The syntax is cleaner with the arrange()
function in the dplyr
package:
arrange(iris, desc(Sepal.Length))[1:5, ]
# re-ordering the columns
iris_order <- select(iris, Species, Petal.Width, everything())
names(iris_order)
## [1] "Species" "Petal.Width" "Sepal.Length" "Sepal.Width" "Petal.Length"
# sorting rows by particular variable
iris_sort<- arrange(iris, Sepal.Length)
# sorting by more than one variable
iris_sort2<- arrange(iris, Sepal.Length, Sepal.Width)
# descending order
iris_sort_desc<- arrange(iris, desc(Sepal.Length))
Note that missing values are always sorted at the end.
iris_rename<- rename(iris, SL=Sepal.Length, SW=Sepal.Width)
names(iris_rename)
## [1] "SL" "SW" "Petal.Length" "Petal.Width" "Species"
iris_newvar<- mutate(iris, Sepal.L_W=Sepal.Length/Sepal.Width)
names(iris_newvar)
## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species" "Sepal.L_W"
Try to obtain iris_newvar WITHOUT using mutate()
function. (You may need multiple steps, so mutate()
is very useful especially you need to create many new variables.)
ggplot2 is a plotting system for R, based on the grammar of graphics, which tries to take the good parts of base and lattice graphics and none of the bad parts. It takes care of many of the fiddly details that make plotting a hassle (like drawing legends) as well as providing a powerful model of graphics that makes it easy to produce complex multi-layered graphics. More details can be found at http://ggplot2.org/. Here is a very good tutorial.
dplyr
packages.