1 Import data and representation

First install and load packages needed for text mining.

install.packages(c('tm', 'SnowballC', 'wordcloud', 'topicmodels'))

library(tm)

## Loading required package: NLP

library(SnowballC)
library(wordcloud)

## Loading required package: RColorBrewer

Next we load the movie review dataset(Download Here)

reviews = read.csv("movie_reviews.csv", stringsAsFactors = F, row.names = 1)

The review dataset has two variables: content of the review, and polarity of the review (0 or 1). To use the tm package we first transfrom the dataset to a corpus:

review_corpus = Corpus(VectorSource(reviews$content))

Next we normalize the texts in the reviews using a series of pre-processing steps: 1. Switch to lower case 2. Remove numbers 3. Remove punctuation marks and stopwords 4. Remove extra whitespaces

review_corpus = tm_map(review_corpus, content_transformer(tolower))

## Warning in tm_map.SimpleCorpus(review_corpus,
## content_transformer(tolower)): transformation drops documents

review_corpus = tm_map(review_corpus, removeNumbers)

## Warning in tm_map.SimpleCorpus(review_corpus, removeNumbers):
## transformation drops documents

review_corpus = tm_map(review_corpus, removePunctuation)

## Warning in tm_map.SimpleCorpus(review_corpus, removePunctuation):
## transformation drops documents

review_corpus = tm_map(review_corpus, removeWords, c("the", "and", stopwords("english")))

## Warning in tm_map.SimpleCorpus(review_corpus, removeWords, c("the",
## "and", : transformation drops documents

review_corpus =  tm_map(review_corpus, stripWhitespace)

## Warning in tm_map.SimpleCorpus(review_corpus, stripWhitespace):
## transformation drops documents

After the above transformations the first review looks like

inspect(review_corpus[1])

## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 1
## 
## [1] plot two teen couples go church party drink drive get accident one guys dies girlfriend continues see life nightmares whats deal watch movie sorta find critique mindfuck movie teen generation touches cool idea presents bad package makes review even harder one write since generally applaud films attempt break mold mess head lost highway memento good bad ways making types films folks just didnt snag one correctly seem taken pretty neat concept executed terribly problems movie well main problem simply jumbled starts normal downshifts fantasy world audience member idea whats going dreams characters coming back dead others look like dead strange apparitions disappearances looooot chase scenes tons weird things happen simply explained now personally dont mind trying unravel film every now give clue get kind fed films biggest problem obviously got big secret hide seems want hide completely final five minutes make things entertaining thrilling even engaging meantime really sad part arrow dig flicks like actually figured halfway point strangeness start make little bit sense still didnt make film entertaining guess bottom line movies like always make sure audience even given secret password enter world understanding mean showing melissa sagemiller running away visions minutes throughout movie just plain lazy okay get people chasing dont know really need see giving us different scenes offering insight strangeness going movie apparently studio took film away director chopped shows mightve pretty decent teen mindfuck movie somewhere guess suits decided turning music video little edge make sense actors pretty good part although wes bentley just seemed playing exact character american beauty new neighborhood biggest kudos go sagemiller holds throughout entire film actually feeling characters unraveling overall film doesnt stick doesnt entertain confusing rarely excites feels pretty redundant runtime despite pretty cool ending explanation craziness came oh way horror teen slasher flick just packaged look way someone apparently assuming genre still hot kids also wrapped production two years ago sitting shelves ever since whatever skip wheres joblo coming nightmare elm street blair witch crow crow salvation lost highway memento others stir echoes

To analyze the textual data, we use a Document-Term Matrix (DTM) representation: documents as the rows, terms/words as the columns, frequency of the term in the document as the entries. Because the number of unique words in the corpus the dimension can be large.

review_dtm <- DocumentTermMatrix(review_corpus)
review_dtm

## <<DocumentTermMatrix (documents: 2000, terms: 46460)>>
## Non-/sparse entries: 538003/92381997
## Sparsity           : 99%
## Maximal term length: 61
## Weighting          : term frequency (tf)

inspect(review_dtm[500:505, 500:505])

## <<DocumentTermMatrix (documents: 6, terms: 6)>>
## Non-/sparse entries: 0/36
## Sparsity           : 100%
## Maximal term length: 12
## Weighting          : term frequency (tf)
## Sample             :
##      Terms
## Docs  accidentally adults adventures anastasia angry animation
##   500            0      0          0         0     0         0
##   501            0      0          0         0     0         0
##   502            0      0          0         0     0         0
##   503            0      0          0         0     0         0
##   504            0      0          0         0     0         0
##   505            0      0          0         0     0         0

To reduce the dimension of the DTM, we can emove the less frequent terms such that the sparsity is less than 0.95

review_dtm = removeSparseTerms(review_dtm, 0.99)
review_dtm

## <<DocumentTermMatrix (documents: 2000, terms: 4366)>>
## Non-/sparse entries: 400173/8331827
## Sparsity           : 95%
## Maximal term length: 17
## Weighting          : term frequency (tf)

The first review now looks like

inspect(review_dtm[1,1:20])

## <<DocumentTermMatrix (documents: 1, terms: 20)>>
## Non-/sparse entries: 20/0
## Sparsity           : 0%
## Maximal term length: 10
## Weighting          : term frequency (tf)
## Sample             :
##     Terms
## Docs accident actors actually ago also apparently audience away bad
##    1        1      1        2   1    1          2        2    2   2
##     Terms
## Docs biggest
##    1       2

We can draw a simple word cloud

findFreqTerms(review_dtm, 1000)

##  [1] "also"       "back"       "bad"        "character"  "characters"
##  [6] "director"   "doesnt"     "dont"       "even"       "film"      
## [11] "films"      "get"        "good"       "just"       "know"      
## [16] "life"       "like"       "little"     "make"       "movie"     
## [21] "movies"     "new"        "one"        "people"     "plot"      
## [26] "really"     "scenes"     "see"        "seems"      "still"     
## [31] "two"        "way"        "well"       "action"     "another"   
## [36] "hes"        "much"       "story"      "time"       "work"      
## [41] "best"       "can"        "first"      "man"        "end"       
## [46] "something"  "will"       "love"       "many"       "never"     
## [51] "great"      "made"       "scene"

freq = data.frame(sort(colSums(as.matrix(review_dtm)), decreasing=TRUE))
wordcloud(rownames(freq), freq[,1], max.words=50, colors=brewer.pal(1, "Dark2"))

One may argue that in the wordcloud, words such as one, film, movie do not carry too much meaning in the setting, since we know that the entire corpus is about movies. Therefore sometimes it is necessary to use the tf–idf(term frequency–inverse document frequency) instead of the frequencies of the term as entries, tf-idf measures the relative importance of a word to a document.

review_dtm_tfidf <- DocumentTermMatrix(review_corpus, control = list(weighting = weightTfIdf))
review_dtm_tfidf = removeSparseTerms(review_dtm_tfidf, 0.95)
review_dtm_tfidf

## <<DocumentTermMatrix (documents: 2000, terms: 963)>>
## Non-/sparse entries: 254264/1671736
## Sparsity           : 87%
## Maximal term length: 14
## Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)

# The first document
inspect(review_dtm_tfidf[1,1:20])

## <<DocumentTermMatrix (documents: 1, terms: 20)>>
## Non-/sparse entries: 20/0
## Sparsity           : 0%
## Maximal term length: 10
## Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)
## Sample             :
##     Terms
## Docs   actually        ago apparently     attempt  audience       away
##    1 0.01088143 0.01054396 0.02084643 0.009503339 0.0102235 0.01187785
##     Terms
## Docs         bad   biggest      break       came
##    1 0.008411705 0.0234983 0.01146785 0.01079597

Is the new word cloud more informative?

freq = data.frame(sort(colSums(as.matrix(review_dtm_tfidf)), decreasing=TRUE))
wordcloud(rownames(freq), freq[,1], max.words=100, colors=brewer.pal(1, "Dark2"))

1.1 Predictive modeling

To predict the polarity (sentiment) of a review, we can make use of a precompiled list of words with positive and negative meanings (Source, negative-words.txt, positive-words.txt)

neg_words = read.table("negative-words.txt", header = F, stringsAsFactors = F)[, 1]
pos_words = read.table("positive-words.txt", header = F, stringsAsFactors = F)[, 1]

As simple indicators, we create two variables (neg, pos) that contain the number of positive and negative words in each document

reviews$neg = tm_term_score(DocumentTermMatrix(review_corpus), neg_words)
reviews$pos = tm_term_score(DocumentTermMatrix(review_corpus), pos_words)

Let’s remove the actual texual content for statistical model building

reviews$content = NULL

Now we can combine the tf-idf matrix with the sentiment polarity according to the sentiment lists.

reviews = cbind(reviews, as.matrix(review_dtm_tfidf))
reviews$polarity = as.factor(reviews$polarity)

Split to testing and training set

id_train <- sample(nrow(reviews),nrow(reviews)*0.80)
reviews.train = reviews[id_train,]
reviews.test = reviews[-id_train,]

The rest should be natural for you by this point. We can compare the performance of logistic regression, decision tree, SVM, and neural network models.

install.packages(c('rpart', 'rpart.plot', 'e1071', 'nnet'))

library(rpart)
library(rpart.plot)
library(e1071)
library(nnet)

Train models:

reviews.tree = rpart(polarity~.,  method = "class", data = reviews.train);  
prp(reviews.tree)

reviews.glm = glm(polarity~ ., family = "binomial", data =reviews.train, maxit = 100);  
reviews.svm = svm(polarity~., data = reviews.train);
reviews.nnet = nnet(polarity~., data=reviews.train, size=1, maxit=500)

## # weights:  968
## initial  value 1115.551086 
## iter  10 value 912.730585
## iter  20 value 789.464411
## iter  30 value 107.359804
## iter  40 value 0.224776
## iter  50 value 0.005437
## iter  60 value 0.000158
## iter  60 value 0.000098
## iter  60 value 0.000077
## final  value 0.000077 
## converged

Evaluate performance with the test set:

pred.tree = predict(reviews.tree, reviews.test,  type="class")
table(reviews.test$polarity,pred.tree,dnn=c("Obs","Pred"))

##    Pred
## Obs   0   1
##   0 153  45
##   1  78 124

mean(ifelse(reviews.test$polarity != pred.tree, 1, 0))

## [1] 0.3075

pred.glm = as.numeric(predict(reviews.glm, reviews.test, type="response") > 0.5)
table(reviews.test$polarity,pred.glm,dnn=c("Obs","Pred"))

##    Pred
## Obs   0   1
##   0 134  64
##   1  62 140

mean(ifelse(reviews.test$polarity != pred.glm, 1, 0))

## [1] 0.315

pred.svm = predict(reviews.svm, reviews.test)
table(reviews.test$polarity,pred.svm,dnn=c("Obs","Pred"))

##    Pred
## Obs   0   1
##   0 167  31
##   1  27 175

mean(ifelse(reviews.test$polarity != pred.svm, 1, 0))

## [1] 0.145

prob.nnet= predict(reviews.nnet,reviews.test)
pred.nnet = as.numeric(prob.nnet > 0.5)
table(reviews.test$polarity, pred.nnet, dnn=c("Obs","Pred"))

##    Pred
## Obs   0   1
##   0 156  42
##   1  32 170

mean(ifelse(reviews.test$polarity != pred.nnet, 1, 0))

## [1] 0.185

Basic Text Mining with R

1 Import data and representation

1.1 Predictive modeling