We introduce the **tidyverse** package and some basic functions from its sub-packages for exploratory data analysis (EDA). For more details, please see https://www.tidyverse.org/. This section is based on Dr. Bradley Boehmke’s short course for MSBA students at Lindner College of Business. The course materials can be downloaded from here.

`install.packages("tidyverse")`

`library(tidyverse)`

```
## -- Attaching packages ------------------------------------------------------------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.0 v purrr 0.3.3
## v tibble 2.1.3 v dplyr 0.8.4
## v tidyr 1.0.2 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
```

```
## -- Conflicts ---------------------------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
```

We introduce the **dplyr** package, which provides some very user-friendly functions for data manipulation. These functions are:

- `filter()`
- `select()`
- `arrange()`
- `rename()`
- `mutate()`

Here I introduce four ways to get subsets of data that satisfy certain logical conditions: `subset()`, logical vectors, SQL, and `filter()`. This kind of operation is called filtering in Excel. Knowing any one of these well is enough. Do not worry about memorizing the syntax; you can always look it up.

Suppose we want to get the **observations that have Sepal.Length > 5 and Sepal.Width > 4**. We can build the condition from comparison and logical operators: `!=` (not equal to), `==` (equal to), `|` (or), and `&` (and).

- Use the `subset()` function

```
data(iris)
subset(x = iris, subset = Sepal.Length > 5 & Sepal.Width > 4)
```

You can omit the `x =` and `subset =` parts:

`subset(iris, Sepal.Length > 5 & Sepal.Width > 4)`

- Use logical vectors

`iris[(iris$Sepal.Length > 5 & iris$Sepal.Width > 4), ]`
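To make the logical-vector approach more concrete, the sketch below stores the condition in a vector first and then uses it as a row index. The names `keep`, `by_index`, and `by_subset` are introduced here only for illustration.

```r
data(iris)

# Each comparison yields one TRUE/FALSE value per row; `&` combines them element-wise
keep <- iris$Sepal.Length > 5 & iris$Sepal.Width > 4

# TRUE counts as 1, so the sum is the number of rows meeting both conditions
sum(keep)

# Using the logical vector as a row index keeps only the TRUE rows
by_index  <- iris[keep, ]
by_subset <- subset(iris, Sepal.Length > 5 & Sepal.Width > 4)

# Both approaches should return the same rows
all.equal(by_index, by_subset)
```

The two approaches differ when the condition evaluates to `NA`: `subset()` drops such rows, while logical indexing returns them as rows filled with `NA`. `iris` has no missing values, so the difference does not show up here.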

- Use an SQL statement

```
install.packages('sqldf')
library(sqldf)
sqldf('select * from iris where `Sepal.Length` > 5 and `Sepal.Width` > 4')
```

In earlier versions of sqldf, all dots (.) in variable names needed to be changed to underscores (_).

`filter()` is a powerful function in the **dplyr** package for filtering rows, much like the Filter feature in Excel.

```
# filter by row observations
data(iris)
iris_filter <- filter(iris, Sepal.Length <= 5 & Sepal.Width > 3)
iris_filter2 <- filter(iris, Species == "setosa", Sepal.Width <= 3 | Sepal.Width >= 4)
```
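The two calls above combine conditions in two different ways: with `&` inside one argument, and with commas between arguments. As a small sketch (the object names are illustrative), the code below checks that the two forms give the same result, and shows the explicit `dplyr::filter()` spelling, which may help given that `dplyr::filter()` masks `stats::filter()` as reported in the conflicts message.

```r
library(dplyr)
data(iris)

# Conditions separated by commas are combined with AND
with_commas <- filter(iris, Sepal.Length <= 5, Sepal.Width > 3)
with_and    <- filter(iris, Sepal.Length <= 5 & Sepal.Width > 3)
identical(with_commas, with_and)

# The namespace prefix makes sure the dplyr version is called,
# whatever other packages are attached
setosa_extreme <- dplyr::filter(iris, Species == "setosa",
                                Sepal.Width <= 3 | Sepal.Width >= 4)
nrow(setosa_extreme)
```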

The following code randomly samples (without replacement) 90% of the rows of the original dataset and assigns them to a new variable *iris_sample*.

`iris_sample <- iris[sample(x = nrow(iris), size = nrow(iris)*0.90),]`

The `dplyr` package provides more convenient ways of generating random samples. You can take a fixed number of rows using `sample_n()` or a fraction of the rows using `sample_frac()` as follows:

```
install.packages('dplyr')
library(dplyr)
iris_sample <- sample_frac(iris, 0.9)
# using dplyr for logical subsetting
filter(iris, Sepal.Length > 5, Sepal.Width > 4)
```
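Random samples change from run to run unless the random seed is fixed. The sketch below assumes dplyr 1.0.0 or later, where `slice_sample()` supersedes `sample_n()` and `sample_frac()`; that is newer than the dplyr 0.8.4 shown in the attach message above. The seed value and object names are arbitrary choices for illustration.

```r
library(dplyr)
data(iris)

# Fixing the seed makes the random draw repeatable
set.seed(2024)
iris_sample_a <- slice_sample(iris, prop = 0.9)

set.seed(2024)
iris_sample_b <- slice_sample(iris, prop = 0.9)

# Same seed, same call, so the same rows are drawn
identical(iris_sample_a, iris_sample_b)

# A fixed number of rows instead of a fraction
iris_sample_10 <- slice_sample(iris, n = 10)
nrow(iris_sample_10)
```

The same `set.seed()` call placed before `sample()` or `sample_frac()` makes those draws repeatable as well.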

I recommend going through the `dplyr` tutorial and the lubridate tutorial. These packages make common data manipulation tasks and working with dates and times much easier in R.

Sorting by one or more variables is a common operation on datasets. With RStudio version 0.99+, you can sort a dataset when viewing it by clicking a column header.

To do it with code, suppose that you would like to find the 5 rows in the `iris` dataset with the largest `Sepal.Length`.

`iris[order(iris$Sepal.Length, decreasing = TRUE)[1:5], ] `

The syntax is cleaner with the `arrange()` function in the `dplyr` package:

`arrange(iris, desc(Sepal.Length))[1:5, ]`
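The same top-5 query can also be written as a chain of steps with the pipe operator `%>%`, which is available once dplyr (or tidyverse) is attached. This is only a sketch of one possible style, and the object names are illustrative.

```r
library(dplyr)
data(iris)

# Base R: order() returns row positions, here from largest to smallest
top5_base <- iris[order(iris$Sepal.Length, decreasing = TRUE)[1:5], ]

# dplyr: %>% passes the result on its left as the first argument of the next call
top5_dplyr <- iris %>%
  arrange(desc(Sepal.Length)) %>%
  head(5)

# Both versions contain the same five Sepal.Length values
top5_base$Sepal.Length
top5_dplyr$Sepal.Length
```

When several rows tie on `Sepal.Length`, which of the tied rows lands at the cutoff depends on the sorting method, so comparing the sorted values is safer than comparing row names.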

If you want to select one or more variables of a data frame, there are two ways to do that. The first is indexing with `[]`. The second is the `select()` function in *dplyr*. For example, suppose we want to select the variable “Sepal.Length”:

`iris[, "Sepal.Length"]`

or alternatively select two variables, “Sepal.Length” and “Sepal.Width”:

`iris[, c("Sepal.Length", "Sepal.Width")]`
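One detail of bracket indexing is worth seeing in code: selecting a single column of a data frame with `[, ]` returns a plain vector, while selecting two or more columns returns a data frame. The sketch below shows this and two ways to keep a data frame; the object names are illustrative.

```r
data(iris)

# A single column selected with [, ] is simplified to a vector
one_col <- iris[, "Sepal.Length"]
is.data.frame(one_col)

# drop = FALSE keeps the result as a one-column data frame
one_col_df <- iris[, "Sepal.Length", drop = FALSE]
is.data.frame(one_col_df)

# Indexing without the comma also returns a data frame
one_col_df2 <- iris["Sepal.Length"]
identical(one_col_df, one_col_df2)

# Two or more columns always come back as a data frame
two_cols <- iris[, c("Sepal.Length", "Sepal.Width")]
dim(two_cols)
```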

On the other hand, `select()` in the *dplyr* package can be used to filter by column, i.e., to select or drop variables.

```
# Keep the variable Sepal.Length, Sepal.Width
varname <- c("Sepal.Length", "Sepal.Width")
iris_select <- select(iris, varname)
```

```
## Note: Using an external vector in selections is ambiguous.
## i Use `all_of(varname)` instead of `varname` to silence this message.
## i See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
```

```
# verify if we did correctly
names(iris_select)
```

`## [1] "Sepal.Length" "Sepal.Width"`

```
# This is equivalent to
iris_select <- iris[,varname]
```
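The note printed above suggests wrapping the external vector in `all_of()`. A minimal sketch of that form follows, assuming a dplyr/tidyselect version that provides `all_of()` (the note itself indicates it is available); `iris_drop` is an illustrative name.

```r
library(dplyr)
data(iris)

varname <- c("Sepal.Length", "Sepal.Width")

# all_of() states explicitly that varname is a character vector of column names,
# not a column of iris that happens to be called "varname"
iris_select <- select(iris, all_of(varname))
names(iris_select)

# The same helper works for dropping the listed columns
iris_drop <- select(iris, -all_of(varname))
names(iris_drop)
```

`all_of()` raises an error if any listed name is missing from the data, which helps catch typos in `varname`.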

What about dropping variables?

```
iris_select2 <- select(iris, -Sepal.Length, -Petal.Length, -Species)
names(iris_select2)
```

`## [1] "Sepal.Width" "Petal.Width"`

This is equivalent to

```
varname <- c("Sepal.Length", "Petal.Length", "Species")
iris_select2 <- iris[,!names(iris) %in% varname]
names(iris_select2)
```

`## [1] "Sepal.Width" "Petal.Width"`

It is easier if you know the positions of the variables that you want to keep or drop. Try to obtain *iris_select* and *iris_select2* using `dataname[, variable_index]`, where `variable_index` is a vector of column positions.

```
# re-ordering the columns
iris_order <- select(iris, Species, Petal.Width, everything())
names(iris_order)
```

`## [1] "Species" "Petal.Width" "Sepal.Length" "Sepal.Width" "Petal.Length"`

```
# sorting rows by particular variable
iris_sort <- arrange(iris, Sepal.Length)
# sorting by more than one variable
iris_sort2 <- arrange(iris, Sepal.Length, Sepal.Width)
# descending order
iris_sort_desc <- arrange(iris, desc(Sepal.Length))
```

Note that missing values are always sorted at the end.
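Since `iris` has no missing values, the behavior is easier to see on a small made-up data frame. `toy` and its values are hypothetical and exist only for this illustration.

```r
library(dplyr)

# Toy data with one missing value in x
toy <- data.frame(id = 1:4, x = c(2, NA, 3, 1))

# Ascending order: the NA row comes last
asc <- arrange(toy, x)
asc$x

# Descending order: the NA row still comes last
dsc <- arrange(toy, desc(x))
dsc$x
```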

```
iris_rename <- rename(iris, SL = Sepal.Length, SW = Sepal.Width)
names(iris_rename)
```

`## [1] "SL" "SW" "Petal.Length" "Petal.Width" "Species"`

```
iris_newvar <- mutate(iris, Sepal.L_W = Sepal.Length / Sepal.Width)
names(iris_newvar)
```

`## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species" "Sepal.L_W"`

Try to obtain *iris_newvar* WITHOUT using the `mutate()` function. (You may need multiple steps, so `mutate()` is very useful, especially when you need to create many new variables.)
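To illustrate the point about creating many new variables, here is a sketch of a single `mutate()` call that adds several columns at once. The column names `Petal.L_W` and `Ratio.Diff` and the object name `iris_ratios` are made up for this example.

```r
library(dplyr)
data(iris)

iris_ratios <- mutate(
  iris,
  Sepal.L_W  = Sepal.Length / Sepal.Width,
  Petal.L_W  = Petal.Length / Petal.Width,
  # A column created earlier in the same call can be used right away
  Ratio.Diff = Sepal.L_W - Petal.L_W
)

names(iris_ratios)
```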

ggplot2 is a plotting system for R, based on the grammar of graphics, which tries to take the good parts of base and lattice graphics and none of the bad parts. It takes care of many of the fiddly details that make plotting a hassle (like drawing legends) as well as providing a powerful model of graphics that makes it easy to produce complex multi-layered graphics. More details can be found at http://ggplot2.org/. Here is a very good tutorial.
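As a minimal sketch of the layered approach described above, the following code builds a scatter plot of the `iris` data in three steps: the data and aesthetic mapping, a point layer, and labels. The title and axis labels are illustrative choices.

```r
library(ggplot2)
data(iris)

# Step 1: data and aesthetic mappings; step 2: a layer of points; step 3: labels
p <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point() +
  labs(title = "Sepal dimensions by species",
       x = "Sepal length", y = "Sepal width")

# Printing the object draws the plot; the legend for Species is added automatically
print(p)
```

Further layers, such as `geom_smooth()`, can be added to `p` with `+` in the same way.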

- How to obtain basic summary statistics
- Summary statistics by groups
- Pivot table
- Use of “[ ]” for subsetting and indexing
- Functions in the `dplyr` package
- Basic R graphics