Multi Class Classification in Text using R: Predicting Ted Talk Ratings

Multi Class Classification in Text

This blog is in continuation to my NLP blog series. In the previous blogs, I discussed data pre-processing steps in R and recognizing emotions present in ted talks. In this blog, I am going to predict the ratings of the ted talks given by viewers. This would require Multi Class Classification and quite a bit of data cleaning and preprocessing. We will discuss each step in detail below. So, let’s dive in.

The ratings column in the ted talk dataset looks like the image below.

 Ratings Column

It was of paramount importance to clean and parse the column. The screenshot of the column above signifies how many people have rated a particular talk to be “Inspiring”, “Beautiful”, “Ingenious”, “Persuasive”, etc. The count key in the JSON signifies the value of the rating given to the talk. The objective is to get the rating with the highest count for each talk.

One of the primary concern is to parse the ratings column. I have used the gsub function for replacing the single quote with double quotes. I found the details and explanation about gsub here.

library(jsonlite)

formatted_ted_ratings <- gsub(“‘”,‘”‘,ted_talks$ratings)

The next step was to parse the JSON to get a list of id, rating name and rating count. The jsonlite library in R, provides functions to stream, validate and prettify JSON data. The fromJSON function is used to de-serialize JSON objects into R objects. Finally, purrr::map function, applies a function (fromJSON in our case) to each element in the list. The documentation and implementation can be read here.


ted_ratings <- purrr::map(formatted_ted_ratings, jsonlite::fromJSON)

The code block above gives us a neat list of parsed rating column. It looks like:

Parsed Rating Column

In the next step, I am going to make a new column “highest_rating” in which I will store the rating with highest count for every talk. Post that, I will convert the column into factor which will effectively give us 13 unique factors (ratings) to deal with. I hope, by this point you must have got a hint that this highest_rating column with 13 factor variables will be used for Multi Class Classification. In the sense, a binary classification problem has two classes to classify a data point, e.g. True and False. Whereas, in this problem we have to deal with the classification of a data point into one of the 13 classes and hence, this is a Multi Class Classification problem.


for (i in (1:length(ted_ratings))) {
 ted_ratings_df <- ted_ratings[[i]]
 highest_rating_count <- ted_ratings_df[which(ted_ratings_df$count == max(ted_ratings_df$count)), ]
 ted_talks$highest_rating[i] <- highest_rating_count$name
}

ted_talks$highest_rating = as.factor(ted_talks$highest_rating)

Our dataset preparation is now complete. We will now split our dataset into training and test. I have divided my dataset in 60:40 ratio.

trainObs <- sample(nrow(ted_talks), .6 * nrow(ted_talks), replace = FALSE)
testObs <- sample(nrow(ted_talks), .4 * nrow(ted_talks), replace = FALSE)

train_dat <- ted_talks[trainObs,]
test_dat <- ted_talks[testObs,]

I will now apply all the pre-processing steps to my training and test data (separately). Somehow, I was in dual mindset: whether to split the DTM into train and test or split the dataset and then prepare their DTM individually. Somehow, I chose the latter option. You can try with the former option and let me know if it works out fine for you.

I also took care of sparsity, something which I discussed in good detail in my blog. I also renamed my target variable as “y” instead of highest_rating for better intuitiveness.

train_corpus <- VCorpus(VectorSource(train_dat$transcript))

##Removing Punctuation
train_corpus <- tm_map(train_corpus, content_transformer(removePunctuation))


##Removing numbers
train_corpus <- tm_map(train_corpus, removeNumbers)


##Converting to lower case
train_corpus <- tm_map(train_corpus, content_transformer(tolower))

##Removing stop words
train_corpus <- tm_map(train_corpus, content_transformer(removeWords), stopwords(“english”))

##Stemming
train_corpus <- tm_map(train_corpus, stemDocument)

##Whitespace
train_corpus <- tm_map(train_corpus, stripWhitespace)

# Create Document Term Matrix
dtm_train <- DocumentTermMatrix(train_corpus)   

train_corpus <- removeSparseTerms(dtm_train, 0.4)

dtm_train_matrix <- as.matrix(train_corpus)
dtm_train_matrix <- cbind(dtm_train_matrix, train_dat$highest_rating)

colnames(dtm_train_matrix)[ncol(dtm_train_matrix)] <- “y”

training_set_ted_talk <- as.data.frame(dtm_train_matrix)
training_set_ted_talk$y <- as.factor(training_set_ted_talk$y)

Now that we have our training dataset ready, we can train our model. I am using caret package and svmLinear3 method in caret. svmLinear3 provides L2 regularization in SVM with Linear Kernel. Agreed, that’s a lot of technical jargon which I am purposely not explaining here because that’s for another blog altogether. Meanwhile, I am going to leave some links for you to understand L2 regularization, and SVM with Linear Kernel.

library(caret)

review_ted_model <- train(y ~., data = training_set_ted_talk, method = ‘svmLinear3’)

Preparing our test data. It’s the same repetitive procedure.

test_corpus <- VCorpus(VectorSource(test_dat$transcript))

##Removing Punctuation
test_corpus <- tm_map(test_corpus, content_transformer(removePunctuation))

##Removing numbers
test_corpus <- tm_map(test_corpus, removeNumbers)

##Converting to lower case
test_corpus <- tm_map(test_corpus, content_transformer(tolower))

##Removing stop words
test_corpus <- tm_map(test_corpus, content_transformer(removeWords), stopwords(“english”))

##Stemming
test_corpus <- tm_map(test_corpus, stemDocument)

##Whitespace
test_corpus <- tm_map(test_corpus, stripWhitespace)

# Create Document Term Matrix
dtm_test <- DocumentTermMatrix(test_corpus)   

test_corpus <- removeSparseTerms(dtm_test, 0.4)

dtm_test_matrix <- as.matrix(test_corpus)

Now, I will check the accuracy/performance of our model on test data.

#Build the prediction  
model_ted_talk_result <- predict(review_ted_model, newdata = dtm_test_matrix)

check_accuracy <- as.data.frame(cbind(prediction = model_ted_talk_result, rating = test_dat$highest_rating))

library(dplyr)
check_accuracy <- check_accuracy %>% mutate(prediction = as.integer(prediction) – 1)

check_accuracy$accuracy <- if_else(check_accuracy$prediction == check_accuracy$rating, 1, 0)
round(prop.table(table(check_accuracy$accuracy)), 3)

 

library(performanceEstimation)
classificationMetrics(as.integer(test_dat$highest_rating), model_ted_talk_result)


most_common_misclassified_ratings = check_accuracy %>% filter(check_accuracy$accuracy == 0) %>%
                                   group_by(rating) %>%
                                   summarise(Count = n()) %>%
                                   arrange(desc(Count)) %>%
                                   head(3)

##Most commong missclassified rating
levels(train_dat$highest_rating)[most_common_misclassified_ratings$rating]

The model metrics are:

acc err microF macroF macroRec macroPrec
0.513 0.486 0.513 0.425 0.452 0.495

The top 3 most commonly misclassified ratings are: “Inspiring”, “Informative”, “Fascinating”.

You can read more about micro and macro F1 scores from here and here. The scores above are heavily reliant on the method we use in the train method. I used SVM with Linear Kernel and L2 regularization which in itself was computationally heavy. There are other methods which you can try, but computational resources could be an issue.

Please do let me know what other method you used and the scores (accuracy and F1) that you get.

We will discuss a new topic related to text analytics in my next blog. Until then, stay tuned!

Leave a Reply

Your email address will not be published. Required fields are marked *