Exploring Sentiment Analysis Performance – A Comparative Study of Machine Learning Models on Movie Reviews

A Machine Learning Experiment for MSc Computer Science, University of Surrey

Introduction

As part of my MSc coursework at the University of Surrey, I conducted a comprehensive experiment comparing different machine learning models for sentiment analysis on movie reviews. This post details the methodology, experimental setup, and findings of the investigation, which I carried out in June 2021.

Problem Statement

Sentiment analysis remains a crucial task in natural language processing, with applications ranging from product review analysis to social media monitoring. I aimed to compare the performance of traditional machine learning algorithms against more recent deep learning approaches on the IMDb movie review dataset.

Dataset

I used the IMDb Large Movie Review Dataset, containing 50,000 movie reviews equally split between positive and negative sentiments. The dataset was divided as follows:

  • Training set: 20,000 reviews (10,000 positive, 10,000 negative)
  • Validation set: 5,000 reviews (2,500 positive, 2,500 negative)
  • Test set: 25,000 reviews (12,500 positive, 12,500 negative)
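The 5,000-review validation set was held out from the official 25,000-review IMDb training split. A minimal sketch with scikit-learn, assuming reviews and labels have already been loaded from the dataset's train directory:

    from sklearn.model_selection import train_test_split

    # Stratified split keeps the 50/50 class balance in both partitions
    train_reviews, val_reviews, y_train, y_val = train_test_split(
        reviews, labels, test_size=5000, stratify=labels, random_state=42)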

Methodology

Data Preprocessing

I implemented several preprocessing steps:

  1. Text Cleaning: Removed HTML tags and special characters, and converted text to lowercase
  2. Tokenization: Split reviews into individual words
  3. Stopword Removal: Eliminated common English stopwords
  4. Feature Engineering:
    • For traditional ML models: TF-IDF vectorization with a maximum of 5,000 features
    • For deep learning models: Word embeddings using pre-trained GloVe vectors (100 dimensions)
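A minimal sketch of steps 1–4 for the traditional pipeline, assuming NLTK's English stopword list (the GloVe embedding lookup for the LSTM is omitted here):

    import re
    from nltk.corpus import stopwords  # requires nltk.download('stopwords')
    from sklearn.feature_extraction.text import TfidfVectorizer

    STOPWORDS = set(stopwords.words('english'))

    def clean_review(text):
        text = re.sub(r'<[^>]+>', ' ', text)           # strip HTML tags
        text = re.sub(r'[^a-z\s]', ' ', text.lower())  # drop special chars, lowercase
        tokens = text.split()                          # tokenize on whitespace
        return ' '.join(t for t in tokens if t not in STOPWORDS)

    # TF-IDF features for the traditional models, capped at 5,000 terms
    vectorizer = TfidfVectorizer(max_features=5000)
    X_train = vectorizer.fit_transform([clean_review(r) for r in train_reviews])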

Models Tested

I implemented and compared five different models:

  1. Naive Bayes (Multinomial)
  2. Support Vector Machine (Linear SVM)
  3. Random Forest Classifier
  4. Logistic Regression
  5. LSTM Neural Network
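For reference, the five models can be set up along these lines; the hyperparameters shown are illustrative defaults rather than the exact tuned values:

    import tensorflow as tf
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.svm import LinearSVC
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression

    models = {
        'Naive Bayes': MultinomialNB(),
        'Linear SVM': LinearSVC(),
        'Random Forest': RandomForestClassifier(n_estimators=100),
        'Logistic Regression': LogisticRegression(max_iter=1000),
    }

    # LSTM over frozen 100-d GloVe embeddings; vocab_size, max_len and
    # embedding_matrix are assumed to come from the preprocessing stage
    lstm = tf.keras.Sequential([
        tf.keras.layers.Embedding(
            vocab_size, 100, input_length=max_len, trainable=False,
            embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix)),
        tf.keras.layers.LSTM(128),
        tf.keras.layers.Dense(1, activation='sigmoid'),
    ])
    lstm.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])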

Experimental Setup

All experiments were conducted on Google Colab with the following specifications:

  • Python 3.7
  • scikit-learn 0.24.2
  • TensorFlow 2.5.0
  • Training time limit: 2 hours per model
  • Hardware: Tesla K80 GPU (for LSTM only)

Results

Performance Metrics

I evaluated each model using accuracy, precision, recall, and F1-score on the test set:

+--------------------+----------+-----------+--------+----------+---------------+
| Model              | Accuracy | Precision | Recall | F1-Score | Training Time |
+--------------------+----------+-----------+--------+----------+---------------+
| Naive Bayes        |  84.3%   |   83.9%   | 84.7%  |  84.3%   |  12 seconds   |
| Linear SVM         |  87.2%   |   87.5%   | 86.8%  |  87.1%   |  3.5 minutes  |
| Random Forest      |  85.1%   |   85.4%   | 84.6%  |  85.0%   |  18 minutes   |
| Logistic Regression|  86.8%   |   86.6%   | 87.0%  |  86.8%   |  45 seconds   |
| LSTM               |  88.9%   |   89.2%   | 88.6%  |  88.9%   |  94 minutes   |
+--------------------+----------+-----------+--------+----------+---------------+
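The figures above come from an evaluation loop of roughly this shape (a sketch; test_reviews and y_test are assumed loaded as before):

    from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

    X_test = vectorizer.transform([clean_review(r) for r in test_reviews])
    for name, model in models.items():
        preds = model.predict(X_test)
        print(f'{name}: acc={accuracy_score(y_test, preds):.3f}  '
              f'prec={precision_score(y_test, preds):.3f}  '
              f'rec={recall_score(y_test, preds):.3f}  '
              f'f1={f1_score(y_test, preds):.3f}')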

Confusion Matrices

The LSTM model showed the best overall performance with the following confusion matrix on the test set:

                 Predicted
                 Positive  Negative
Actual Positive   11,075    1,425
       Negative    1,352   11,148
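This matrix can be reproduced with scikit-learn (a sketch; X_test_seq denotes the padded test sequences fed to the LSTM):

    from sklearn.metrics import confusion_matrix

    # Threshold the sigmoid outputs at 0.5; labels=[1, 0] puts the
    # positive class first, matching the layout above
    lstm_preds = (lstm.predict(X_test_seq) > 0.5).astype(int).ravel()
    print(confusion_matrix(y_test, lstm_preds, labels=[1, 0]))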

Learning Curves

I plotted learning curves for each model to understand their behavior with varying training set sizes. The LSTM showed steady improvement with more data, while traditional models plateaued earlier. The SVM demonstrated the most stable learning curve among traditional approaches.
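For the scikit-learn models the curves were generated with learning_curve; a sketch for the SVM (the fold count and size grid are illustrative):

    import numpy as np
    from sklearn.model_selection import learning_curve
    from sklearn.svm import LinearSVC

    # Cross-validated accuracy at eight increasing training-set sizes
    sizes, train_scores, val_scores = learning_curve(
        LinearSVC(), X_train, y_train,
        train_sizes=np.linspace(0.1, 1.0, 8), cv=3, scoring='accuracy')
    print(dict(zip(sizes, val_scores.mean(axis=1).round(3))))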

Error Analysis

I analyzed 100 misclassified reviews from each model and identified common failure patterns:

  1. Sarcasm and Irony: All models struggled with sarcastic reviews like “Oh great, another masterpiece of terrible acting”
  2. Mixed Sentiments: Reviews containing both positive and negative aspects confused most models
  3. Domain-Specific Language: Technical film terminology and references to directors or actors degraded performance across all models

Key Findings

  1. LSTM Superiority: The LSTM network achieved the highest accuracy (88.9%), benefiting from its ability to capture sequential dependencies in text.
  2. SVM Efficiency: Linear SVM provided the best trade-off between performance (87.2%) and training time (3.5 minutes), making it suitable for real-time applications.
  3. Naive Bayes Baseline: Despite its simplicity, Naive Bayes achieved respectable performance (84.3%) with minimal computational requirements.
  4. Feature Importance: Analysis of the logistic regression coefficients revealed that words like “awful,” “brilliant,” “masterpiece,” and “waste” were among the strongest predictors.
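The coefficient inspection behind finding 4 is straightforward to reproduce, assuming the fitted vectorizer and models dict from earlier:

    import numpy as np

    log_reg = models['Logistic Regression']  # fitted earlier
    feature_names = np.array(vectorizer.get_feature_names())  # get_feature_names_out() in newer scikit-learn
    coefs = log_reg.coef_.ravel()
    print('most negative:', feature_names[np.argsort(coefs)[:10]])
    print('most positive:', feature_names[np.argsort(coefs)[-10:]])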

Challenges and Limitations

  1. Computational Resources: Limited GPU access restricted extensive hyperparameter tuning for the LSTM model
  2. Dataset Bias: The IMDb dataset may not generalize well to other domains
  3. Binary Classification: Real-world sentiment often exists on a spectrum rather than binary labels

Conclusion

This experiment demonstrated that while deep learning models like the LSTM achieve superior performance in sentiment analysis, traditional machine learning algorithms remain competitive, especially when computational efficiency matters. The choice of model should therefore depend on the application's requirements, balancing accuracy against resource constraints.

Future work could explore transformer-based models like BERT, which have shown promising results in NLP tasks since their introduction. Additionally, investigating multi-class sentiment classification and cross-domain transfer learning would provide valuable insights.

Code Repository

The complete code for this experiment, including data preprocessing scripts, model implementations, and visualization notebooks, is available at: github.com/dostogircse171/sentiment-analysis-comparison


This experiment was conducted as part of the Machine Learning module (COM3025) at the University of Surrey.

Author

  • Mohammad Golam Dostogir, Software Engineer specializing in Python, Django, and AI solutions. Active contributor to open-source projects and tech communities, with experience delivering applications for global companies.