Let’s first load the Iris dataset. It is one of the most famous datasets in data mining and machine learning courses, and it is a built-in dataset in R. The dataset consists of 50 samples from each of three species of Iris flowers (Iris setosa, Iris virginica, and Iris versicolor). Four features (variables) were measured from each sample: the length and the width of the sepal and the petal, in centimeters. It was introduced by Sir Ronald Fisher in 1936.
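
Before plotting, it can help to confirm that the data match this description. The following is a minimal check using only base R functions:

```r
data(iris)
dim(iris)             # number of rows and columns
str(iris)             # variable names and types
table(iris$Species)   # number of samples per species
```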

1 Exploratory Data Analysis by Visualization

1.1 Histogram

A histogram is one of the simplest ways to show how a numerical variable is distributed.

1.1.0.1 Produce a single histogram

data(iris)
hist(iris$Sepal.Length, col="green", breaks=20)

You may change “breaks=” and “col=” to get a different appearance.
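
As a rough sketch of how “breaks=” behaves: when it is a single number, R treats it only as a suggestion and chooses "pretty" break points, so the actual number of bins may differ. Passing a vector of break points gives exact control. The bin width of 0.25 and the colors below are arbitrary choices for illustration.

```r
# A single number is only a suggestion for the number of bins
h1 <- hist(iris$Sepal.Length, breaks = 5, col = "lightblue")

# A vector of break points is used exactly as given
h2 <- hist(iris$Sepal.Length, breaks = seq(4, 8, by = 0.25), col = "orange")

# hist() returns an object that records the break points and bin counts
h2$breaks
sum(h2$counts)   # total number of observations placed in bins
```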

1.1.0.2 Density plot – Fitted curve for histogram

A density plot shows a kernel density estimate, which is a nonparametric smooth fit to the distribution of the data.

plot(density(iris$Sepal.Length))
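
How smooth the curve looks depends on the bandwidth. The sketch below uses the `adjust=` argument of density(), which scales the default bandwidth: values below 1 give a wigglier curve and values above 1 a smoother one. The factors 0.5 and 2 are arbitrary choices for illustration.

```r
d_default <- density(iris$Sepal.Length)
d_rough   <- density(iris$Sepal.Length, adjust = 0.5)  # half the default bandwidth
d_smooth  <- density(iris$Sepal.Length, adjust = 2)    # twice the default bandwidth

plot(d_rough, col = "red", main = "Effect of bandwidth on the density estimate")
lines(d_default, col = "black", lwd = 2)
lines(d_smooth, col = "blue")
legend("topright", legend = c("adjust = 0.5", "default", "adjust = 2"),
       col = c("red", "black", "blue"), lty = 1)
```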

1.1.0.3 Combine the histogram and the density chart.

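You can make the plot more elegant with different options, for example by adding a title, adjusting the axis range, or renaming the axis labels.

You can also add curves on top of an existing plot using the lines() or abline() functions. Note that the prob= argument draws the histogram on the density scale rather than as counts, so the density curve lines up with the bars.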

hist(iris$Sepal.Length, prob=TRUE, col="green", breaks=20,
     main="Histogram and Density of Sepal Length", xlim=c(3,9), xlab="Sepal Length")
lines(density(iris$Sepal.Length), col="red", lwd=2)

# Add a vertical line that indicates the average of Sepal Length
abline(v=mean(iris$Sepal.Length), col="blue", lty=2, lwd=1.5)

1.2 Bar Chart

A bar chart is produced from a vector of single data points, which is often a vector of summary statistics. Therefore, you need to preprocess your data and compute the summary statistics before drawing the bar chart.

# bar chart for average of the 4 quantitative variables
aveg<- apply(iris[,1:4], 2, mean)
barplot(aveg, ylab = "Average")
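
The same preprocessing step applies to grouped summaries. As a sketch, tapply() computes the mean sepal length for each species, and the resulting named vector is passed to barplot(). The variable name `avg_by_species` and the color are chosen here for illustration only.

```r
# Summary statistic first: one mean per species
avg_by_species <- tapply(iris$Sepal.Length, iris$Species, mean)
avg_by_species

# Then draw one bar per summary value
barplot(avg_by_species, ylab = "Average Sepal Length", col = "steelblue")
```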

1.2.0.1 Use `?barplot` or a Google search to produce the following bar chart.

1.3 Pie Chart

A pie chart is commonly used to visualize the proportions of different categories. It is similar to a bar chart: you have to use a vector of single data points to produce a pie chart.

pie(table(iris$Species), col=rainbow(3))
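
Since a pie chart is about proportions, it is often useful to print them on the slices. The following sketch builds the labels from the same frequency table; the name `slice_labels` is introduced here for illustration.

```r
counts <- table(iris$Species)                  # vector of single data points
pct <- round(100 * prop.table(counts), 1)      # share of each species in percent
slice_labels <- paste0(names(counts), " (", pct, "%)")
pie(counts, labels = slice_labels, col = rainbow(3))
```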

1.4 Box plot

A box plot can only be drawn for a numerical variable.

# box plot of Sepal.Length
boxplot(iris$Sepal.Length)
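
A box plot is a drawing of a five-number summary. As a sketch, boxplot.stats() returns the numbers that boxplot() uses for the whiskers, the box, and the median line, along with any points that would be drawn as outliers.

```r
x <- iris$Sepal.Length
stats <- boxplot.stats(x)
stats$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
stats$out     # points drawn individually as outliers (may be empty)
```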

1.4.0.1 Draw box plot of multiple variables into one figure

boxplot(iris[,1:4], notch=TRUE, col=c("red", "blue", "yellow", "grey"))

1.4.0.2 Box plot by group

boxplot(iris[,1]~iris[,5], notch=TRUE, ylab="Sepal Length", col="blue")
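
An equivalent and arguably more readable way to write the grouped box plot is the formula interface with a `data=` argument, which refers to columns by name instead of by position. This is a sketch of the same plot, with an x-axis label added:

```r
b <- boxplot(Sepal.Length ~ Species, data = iris, notch = TRUE,
             xlab = "Species", ylab = "Sepal Length", col = "blue")
b$names   # the groups, in the order they are drawn
```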

1.5 Scatter Plot

1.5.1 Simple Scatter plot of two numerical variables

plot(iris$Sepal.Length, iris$Sepal.Width, xlab = "Length", ylab = "Width", main = "Sepal")
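
A common refinement is to color the points by group. The sketch below maps each species to a color and adds a legend; the specific colors and the names `species_cols` and `point_cols` are arbitrary choices for illustration.

```r
species_cols <- c("red", "darkgreen", "blue")
# One color per observation, looked up by the integer code of its species
point_cols <- species_cols[as.integer(iris$Species)]

plot(iris$Sepal.Length, iris$Sepal.Width, col = point_cols, pch = 19,
     xlab = "Length", ylab = "Width", main = "Sepal")
legend("topright", legend = levels(iris$Species), col = species_cols, pch = 19)
```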

1.5.2 Scatter plot matrix (all paired variables)

pairs(iris[,1:4])

1.6 Parallel Coordinates

library(MASS)
parcoord(iris[,1:4],col=iris$Species)

1.7 R Graphic Options

You may display multiple plots in one window (one figure).

# set arrangement of multiple plots
par(mfrow=c(2,2))
# set margins
par(mar=c(4.5, 4.2, 3, 1.5)) 
hist(iris$Sepal.Length, xlab = "Sepal Length", cex.lab=1.5)
hist(iris$Sepal.Width, xlab = "Sepal Width", col = "red")
plot(iris$Sepal.Length, iris$Sepal.Width, xlab = "Length", ylab = "Width", main= "Sepal", pch=17)
boxplot(iris[,1:4], notch=TRUE, col=c("red", "blue", "yellow", "grey"))
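
Settings made with par() stay in effect for later plots on the same device. A minimal sketch of one way to undo them: par() returns the previous values of the parameters it changes, so they can be saved and restored afterwards. The 1-by-2 layout and the name `op` are chosen here for illustration.

```r
# Save the old settings while applying the new ones
op <- par(mfrow = c(1, 2), mar = c(4.5, 4.2, 3, 1.5))
hist(iris$Petal.Length, xlab = "Petal Length", main = "Petal Length")
hist(iris$Petal.Width, xlab = "Petal Width", main = "Petal Width")
# Restore the previous layout and margins
par(op)
```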

There are many more options that can improve your plots. You can learn about them here, in `?par`, or by asking your best friend, Google.

Details about figure margins can be found here.
