In this lab we will demonstrate the basic steps of using R to build a predictive model for movie review sentiments. Source of the data is from http://www.cs.cornell.edu/people/pabo/movie-review-data/.
First install and load packages needed for text mining.
install.packages(c('tm', 'SnowballC', 'wordcloud', 'topicmodels'))
library(tm)
## Loading required package: NLP
library(SnowballC)
library(wordcloud)
## Loading required package: RColorBrewer
Next we load the movie review dataset(Download Here)
reviews = read.csv("movie_reviews.csv", stringsAsFactors = F, row.names = 1)
The review dataset has two variables: content of the review, and polarity of the review (0 or 1). To use the tm package we first transfrom the dataset to a corpus:
review_corpus = Corpus(VectorSource(reviews$content))
Next we normalize the texts in the reviews using a series of pre-processing steps: 1. Switch to lower case 2. Remove numbers 3. Remove punctuation marks and stopwords 4. Remove extra whitespaces
review_corpus = tm_map(review_corpus, content_transformer(tolower))
## Warning in tm_map.SimpleCorpus(review_corpus,
## content_transformer(tolower)): transformation drops documents
review_corpus = tm_map(review_corpus, removeNumbers)
## Warning in tm_map.SimpleCorpus(review_corpus, removeNumbers):
## transformation drops documents
review_corpus = tm_map(review_corpus, removePunctuation)
## Warning in tm_map.SimpleCorpus(review_corpus, removePunctuation):
## transformation drops documents
review_corpus = tm_map(review_corpus, removeWords, c("the", "and", stopwords("english")))
## Warning in tm_map.SimpleCorpus(review_corpus, removeWords, c("the",
## "and", : transformation drops documents
review_corpus = tm_map(review_corpus, stripWhitespace)
## Warning in tm_map.SimpleCorpus(review_corpus, stripWhitespace):
## transformation drops documents
After the above transformations the first review looks like
inspect(review_corpus[1])
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 1
##
## [1] plot two teen couples go church party drink drive get accident one guys dies girlfriend continues see life nightmares whats deal watch movie sorta find critique mindfuck movie teen generation touches cool idea presents bad package makes review even harder one write since generally applaud films attempt break mold mess head lost highway memento good bad ways making types films folks just didnt snag one correctly seem taken pretty neat concept executed terribly problems movie well main problem simply jumbled starts normal downshifts fantasy world audience member idea whats going dreams characters coming back dead others look like dead strange apparitions disappearances looooot chase scenes tons weird things happen simply explained now personally dont mind trying unravel film every now give clue get kind fed films biggest problem obviously got big secret hide seems want hide completely final five minutes make things entertaining thrilling even engaging meantime really sad part arrow dig flicks like actually figured halfway point strangeness start make little bit sense still didnt make film entertaining guess bottom line movies like always make sure audience even given secret password enter world understanding mean showing melissa sagemiller running away visions minutes throughout movie just plain lazy okay get people chasing dont know really need see giving us different scenes offering insight strangeness going movie apparently studio took film away director chopped shows mightve pretty decent teen mindfuck movie somewhere guess suits decided turning music video little edge make sense actors pretty good part although wes bentley just seemed playing exact character american beauty new neighborhood biggest kudos go sagemiller holds throughout entire film actually feeling characters unraveling overall film doesnt stick doesnt entertain confusing rarely excites feels pretty redundant runtime despite pretty cool ending explanation craziness came oh way horror teen slasher flick just packaged look way someone apparently assuming genre still hot kids also wrapped production two years ago sitting shelves ever since whatever skip wheres joblo coming nightmare elm street blair witch crow crow salvation lost highway memento others stir echoes
To analyze the textual data, we use a Document-Term Matrix (DTM) representation: documents as the rows, terms/words as the columns, frequency of the term in the document as the entries. Because the number of unique words in the corpus the dimension can be large.
review_dtm <- DocumentTermMatrix(review_corpus)
review_dtm
## <<DocumentTermMatrix (documents: 2000, terms: 46460)>>
## Non-/sparse entries: 538003/92381997
## Sparsity : 99%
## Maximal term length: 61
## Weighting : term frequency (tf)
inspect(review_dtm[500:505, 500:505])
## <<DocumentTermMatrix (documents: 6, terms: 6)>>
## Non-/sparse entries: 0/36
## Sparsity : 100%
## Maximal term length: 12
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs accidentally adults adventures anastasia angry animation
## 500 0 0 0 0 0 0
## 501 0 0 0 0 0 0
## 502 0 0 0 0 0 0
## 503 0 0 0 0 0 0
## 504 0 0 0 0 0 0
## 505 0 0 0 0 0 0
To reduce the dimension of the DTM, we can emove the less frequent terms such that the sparsity is less than 0.95
review_dtm = removeSparseTerms(review_dtm, 0.99)
review_dtm
## <<DocumentTermMatrix (documents: 2000, terms: 4366)>>
## Non-/sparse entries: 400173/8331827
## Sparsity : 95%
## Maximal term length: 17
## Weighting : term frequency (tf)
The first review now looks like
inspect(review_dtm[1,1:20])
## <<DocumentTermMatrix (documents: 1, terms: 20)>>
## Non-/sparse entries: 20/0
## Sparsity : 0%
## Maximal term length: 10
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs accident actors actually ago also apparently audience away bad
## 1 1 1 2 1 1 2 2 2 2
## Terms
## Docs biggest
## 1 2
We can draw a simple word cloud
findFreqTerms(review_dtm, 1000)
## [1] "also" "back" "bad" "character" "characters"
## [6] "director" "doesnt" "dont" "even" "film"
## [11] "films" "get" "good" "just" "know"
## [16] "life" "like" "little" "make" "movie"
## [21] "movies" "new" "one" "people" "plot"
## [26] "really" "scenes" "see" "seems" "still"
## [31] "two" "way" "well" "action" "another"
## [36] "hes" "much" "story" "time" "work"
## [41] "best" "can" "first" "man" "end"
## [46] "something" "will" "love" "many" "never"
## [51] "great" "made" "scene"
freq = data.frame(sort(colSums(as.matrix(review_dtm)), decreasing=TRUE))
wordcloud(rownames(freq), freq[,1], max.words=50, colors=brewer.pal(1, "Dark2"))
One may argue that in the wordcloud, words such as one, film, movie do not carry too much meaning in the setting, since we know that the entire corpus is about movies. Therefore sometimes it is necessary to use the tf–idf(term frequency–inverse document frequency) instead of the frequencies of the term as entries, tf-idf measures the relative importance of a word to a document.
review_dtm_tfidf <- DocumentTermMatrix(review_corpus, control = list(weighting = weightTfIdf))
review_dtm_tfidf = removeSparseTerms(review_dtm_tfidf, 0.95)
review_dtm_tfidf
## <<DocumentTermMatrix (documents: 2000, terms: 963)>>
## Non-/sparse entries: 254264/1671736
## Sparsity : 87%
## Maximal term length: 14
## Weighting : term frequency - inverse document frequency (normalized) (tf-idf)
# The first document
inspect(review_dtm_tfidf[1,1:20])
## <<DocumentTermMatrix (documents: 1, terms: 20)>>
## Non-/sparse entries: 20/0
## Sparsity : 0%
## Maximal term length: 10
## Weighting : term frequency - inverse document frequency (normalized) (tf-idf)
## Sample :
## Terms
## Docs actually ago apparently attempt audience away
## 1 0.01088143 0.01054396 0.02084643 0.009503339 0.0102235 0.01187785
## Terms
## Docs bad biggest break came
## 1 0.008411705 0.0234983 0.01146785 0.01079597
Is the new word cloud more informative?
freq = data.frame(sort(colSums(as.matrix(review_dtm_tfidf)), decreasing=TRUE))
wordcloud(rownames(freq), freq[,1], max.words=100, colors=brewer.pal(1, "Dark2"))
To predict the polarity (sentiment) of a review, we can make use of a precompiled list of words with positive and negative meanings (Source, negative-words.txt, positive-words.txt)
neg_words = read.table("negative-words.txt", header = F, stringsAsFactors = F)[, 1]
pos_words = read.table("positive-words.txt", header = F, stringsAsFactors = F)[, 1]
As simple indicators, we create two variables (neg, pos) that contain the number of positive and negative words in each document
reviews$neg = tm_term_score(DocumentTermMatrix(review_corpus), neg_words)
reviews$pos = tm_term_score(DocumentTermMatrix(review_corpus), pos_words)
Let’s remove the actual texual content for statistical model building
reviews$content = NULL
Now we can combine the tf-idf matrix with the sentiment polarity according to the sentiment lists.
reviews = cbind(reviews, as.matrix(review_dtm_tfidf))
reviews$polarity = as.factor(reviews$polarity)
Split to testing and training set
id_train <- sample(nrow(reviews),nrow(reviews)*0.80)
reviews.train = reviews[id_train,]
reviews.test = reviews[-id_train,]
The rest should be natural for you by this point. We can compare the performance of logistic regression, decision tree, SVM, and neural network models.
install.packages(c('rpart', 'rpart.plot', 'e1071', 'nnet'))
library(rpart)
library(rpart.plot)
library(e1071)
library(nnet)
Train models:
reviews.tree = rpart(polarity~., method = "class", data = reviews.train);
prp(reviews.tree)
reviews.glm = glm(polarity~ ., family = "binomial", data =reviews.train, maxit = 100);
reviews.svm = svm(polarity~., data = reviews.train);
reviews.nnet = nnet(polarity~., data=reviews.train, size=1, maxit=500)
## # weights: 968
## initial value 1115.551086
## iter 10 value 912.730585
## iter 20 value 789.464411
## iter 30 value 107.359804
## iter 40 value 0.224776
## iter 50 value 0.005437
## iter 60 value 0.000158
## iter 60 value 0.000098
## iter 60 value 0.000077
## final value 0.000077
## converged
Evaluate performance with the test set:
pred.tree = predict(reviews.tree, reviews.test, type="class")
table(reviews.test$polarity,pred.tree,dnn=c("Obs","Pred"))
## Pred
## Obs 0 1
## 0 153 45
## 1 78 124
mean(ifelse(reviews.test$polarity != pred.tree, 1, 0))
## [1] 0.3075
pred.glm = as.numeric(predict(reviews.glm, reviews.test, type="response") > 0.5)
table(reviews.test$polarity,pred.glm,dnn=c("Obs","Pred"))
## Pred
## Obs 0 1
## 0 134 64
## 1 62 140
mean(ifelse(reviews.test$polarity != pred.glm, 1, 0))
## [1] 0.315
pred.svm = predict(reviews.svm, reviews.test)
table(reviews.test$polarity,pred.svm,dnn=c("Obs","Pred"))
## Pred
## Obs 0 1
## 0 167 31
## 1 27 175
mean(ifelse(reviews.test$polarity != pred.svm, 1, 0))
## [1] 0.145
prob.nnet= predict(reviews.nnet,reviews.test)
pred.nnet = as.numeric(prob.nnet > 0.5)
table(reviews.test$polarity, pred.nnet, dnn=c("Obs","Pred"))
## Pred
## Obs 0 1
## 0 156 42
## 1 32 170
mean(ifelse(reviews.test$polarity != pred.nnet, 1, 0))
## [1] 0.185