Comparative Study of Machine Learning Algorithms for Sentiment Analysis

Logistic Regression vs Naive Bayes vs Random Forest vs SVM

πŸ‘©β€πŸ’» Leila Soltani β€’ πŸ“Š IMDB Dataset (50K Reviews) β€’ 🎯 84.5% Best Accuracy
IMDB Sentiment Analysis Banner

Introduction

Background

Sentiment analysis is one of the most common real-world applications of Natural Language Processing (NLP). From product reviews to social media posts, understanding how people feel about something can drive better decisions.

In this project, I built and compared four popular machine learning models β€” Logistic Regression, Naive Bayes, Random Forest, and Support Vector Machine (SVM) β€” to classify movie reviews from the IMDB dataset as positive or negative.

The Dataset

The data comes from the IMDB movie reviews dataset on Kaggle. It contains 50,000 labeled reviews, evenly split between positive and negative sentiment.

Exploring the Data with Word Clouds

Before training any models, I wanted to get a quick, intuitive sense of what the reviews look like and which words appear most often in positive and negative reviews.

A simple way to do this is with word clouds.

Word clouds visualize the most frequent words in a corpus, where larger words appear more often in the text. While word clouds are not a quantitative analysis tool, they are useful for building intuition about the dataset.

I generated two word clouds:

Word Clouds for Positive and Negative Reviews

Word clouds showing distinctive words in positive (left) and negative (right) reviews

What We Can Observe

From the positive reviews word cloud, we often see words such as:

great, excellent, love, amazing, best, wonderful

From the negative reviews word cloud, common words include:

bad, worst, boring, waste, awful, poor

βœ“ Key Insight

These patterns match our expectations and provide a quick sanity check that the dataset is labeled correctly and contains meaningful sentiment-bearing words.

Word clouds don't replace proper modeling, but they offer a helpful first glance at the data before moving into feature extraction and training.

How the Text Was Prepared

Before training any model, the text needs to be converted into numbers.

Here's what I did:

  1. Removed HTML tags, punctuation, numbers, and extra spaces
  2. Tokenized each review into words
  3. Converted reviews into numerical vectors using TF-IDF (Term Frequency–Inverse Document Frequency)
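The steps above can be sketched as follows. The regexes and the `TfidfVectorizer` settings here are illustrative assumptions, not the project's exact configuration:

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer

def clean_review(text):
    """Step 1: strip HTML tags, punctuation, numbers, and extra spaces."""
    text = re.sub(r"<[^>]+>", " ", text)      # HTML tags like <br />
    text = re.sub(r"[^a-zA-Z\s]", " ", text)  # punctuation and digits
    return re.sub(r"\s+", " ", text).strip().lower()

reviews = [
    "An <br /> AMAZING film!! 10/10 would watch again.",
    "Boring... a total waste of 2 hours.",
]
cleaned = [clean_review(r) for r in reviews]
print(cleaned[0])  # an amazing film would watch again

# Steps 2-3: TfidfVectorizer tokenizes internally and builds TF-IDF vectors.
vectorizer = TfidfVectorizer(stop_words="english", max_features=5000)
X = vectorizer.fit_transform(cleaned)
print(X.shape)  # (n_reviews, n_features)
```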

πŸ“– What is TF-IDF?

TF-IDF helps capture not only how often a word appears in a review, but also how important that word is across the entire dataset. Common words like "the" and "is" get lower scores, while distinctive words like "brilliant" or "terrible" get higher scores.
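A tiny illustration of that effect on a toy corpus (not the IMDB data): the ubiquitous word scores lower than the distinctive one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the movie is brilliant",
    "the movie is terrible",
    "the movie is long",
]
vec = TfidfVectorizer()  # no stop-word removal here, so "the" survives
tfidf = vec.fit_transform(docs)

row0 = tfidf[0].toarray().ravel()
col = vec.vocabulary_  # word -> column index

# "the" appears in every document (low IDF); "brilliant" in only one (high IDF).
print(f'the: {row0[col["the"]]:.3f}  brilliant: {row0[col["brilliant"]]:.3f}')
```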

Machine Learning Models Tested

πŸ”΅ Logistic Regression

A linear classifier that models the probability of a binary outcome using a logistic function.

πŸ”΄ Naive Bayes

A probabilistic classifier based on Bayes' theorem with an assumption of conditional independence among features.

🟒 Random Forest

An ensemble learning method that combines multiple decision trees to improve robustness and accuracy.

🟠 SVM

Support Vector Machine attempts to find an optimal hyperplane that separates data points with maximum margin.
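All four models can be trained with scikit-learn in a few lines. This is a toy-data sketch with assumed hyperparameters, not the project's exact setup (the original trained on TF-IDF vectors of the 50,000 reviews):

```python
# Toy-data sketch of the four classifiers on TF-IDF features.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

texts = ["great movie", "excellent film", "boring waste", "awful plot"]
labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

X = TfidfVectorizer().fit_transform(texts)

models = {
    "Logistic Regression": LogisticRegression(),
    "Naive Bayes": MultinomialNB(),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "SVM": LinearSVC(),
}
for name, model in models.items():
    model.fit(X, labels)
    print(name, model.score(X, labels))  # training accuracy on the toy set
```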

Evaluation Metrics

To comprehensively evaluate model performance, multiple metrics are used. Relying on a single metric such as accuracy can be misleading, especially when different types of classification errors have different practical implications.

Therefore, this project reports Accuracy, Precision (Positive class), Recall (Positive class), F1-score (Positive class), and ROC-AUC.

Accuracy

Definition: Accuracy measures the proportion of correctly classified samples out of all samples.

Meaning: Accuracy indicates how often the model makes the correct prediction overall.

Why it's not enough

Accuracy looks good at first glance, but it doesn't tell us what kind of mistakes the model is making. A model can have high accuracy but still perform poorly on one class.

In balanced datasets like IMDB, accuracy is useful β€” but we still need more context.

Precision (Positive Class)

Definition: Precision measures the proportion of predicted positive reviews that are actually positive.

Meaning: When the model predicts a review as positive, how often is it correct?

Why it matters: High precision means the model doesn't wrongly label too many negative reviews as positive. If you were analyzing customer feedback, high precision means you're not mistakenly treating unhappy customers as satisfied ones.

Recall (Positive Class)

Definition: Recall measures the proportion of actual positive reviews that are correctly identified.

Meaning: Out of all truly positive reviews, how many did the model capture?

Why it matters: High recall means we're not missing too many positive reviews. If your goal is to find all satisfied users, recall becomes very important.

F1-Score (Positive Class)

Definition: The F1-score is the harmonic mean of precision and recall.

Meaning: F1-score provides a single metric that balances precision and recall.

Why it matters: Sometimes precision is high but recall is low (or vice versa). F1-score gives a combined view and helps evaluate the trade-off between the two. If you want a balanced model, F1-score is usually the metric to look at.
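These four metrics can be computed directly with `sklearn.metrics`. A worked toy example (3 true positives, 2 false positives, 1 false negative, 2 true negatives):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0]  # 4 positive, 4 negative reviews
y_pred = [1, 1, 1, 0, 0, 0, 1, 1]  # TP=3, FN=1, TN=2, FP=2

print(accuracy_score(y_true, y_pred))      # (3 + 2) / 8        = 0.625
print(precision_score(y_true, y_pred))     # 3 / (3 + 2)        = 0.6
print(recall_score(y_true, y_pred))        # 3 / (3 + 1)        = 0.75
print(round(f1_score(y_true, y_pred), 3))  # 2*0.6*0.75 / 1.35  = 0.667
```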

ROC-AUC

Definition: ROC-AUC measures the model's ability to distinguish between positive and negative classes across all classification thresholds.

Meaning: ROC-AUC represents the probability that the model ranks a randomly chosen positive review higher than a randomly chosen negative review.

πŸ“Š Interpretation

  • AUC = 0.5 β†’ No discriminative ability (random guessing)
  • AUC = 1.0 β†’ Perfect discrimination

Higher AUC values indicate better overall separability.

Why it matters: Unlike accuracy, ROC-AUC doesn't depend on one fixed threshold (like 0.5). It evaluates the model's overall ability to distinguish between classes. This makes it especially useful when comparing models fairly.
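Unlike the other metrics, `roc_auc_score` takes predicted scores rather than hard labels. In this toy example, 3 of the 4 (positive, negative) pairs are ranked correctly, so the AUC is 0.75:

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]  # model's confidence that each review is positive

# AUC = fraction of (positive, negative) pairs ranked correctly: 3/4 here.
print(roc_auc_score(y_true, scores))  # 0.75
```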

Why Use All of These?

Each metric highlights a different aspect of performance:

  • Accuracy: overall correctness across both classes
  • Precision: how trustworthy the model's positive predictions are
  • Recall: how many of the actual positives the model captures
  • F1-score: the balance between precision and recall
  • ROC-AUC: class separability, independent of any single threshold

Looking at all of them together gives a much clearer picture than any single metric alone.

Understanding Confusion Matrices

In addition to numerical metrics, I also used confusion matrices to better understand how each model makes mistakes.

A confusion matrix breaks predictions into four categories:

  • True Positives (TP): positive reviews correctly predicted as positive
  • True Negatives (TN): negative reviews correctly predicted as negative
  • False Positives (FP): negative reviews wrongly predicted as positive
  • False Negatives (FN): positive reviews wrongly predicted as negative

Confusion Matrices for All 4 Models

2Γ—2 grid showing confusion matrices for each model with counts and percentages

Why Confusion Matrices Are Useful

While metrics like accuracy or F1-score give a single number, confusion matrices show where errors are happening.

For example, a model might misclassify negative reviews as positive far more often than the reverse, even when its overall accuracy looks fine.

This helps answer questions like: which class does the model struggle with, and are its errors balanced between false positives and false negatives?

πŸ’‘ Practical Insight

By looking at confusion matrices alongside performance metrics, we gain a more complete understanding of model behavior.

Two models might have similar accuracy, but one may produce more false positives while the other produces more false negatives β€” which could matter depending on the application.
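scikit-learn builds the matrix directly from true and predicted labels. A small sketch with toy data:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 1]

# sklearn convention: rows = actual class, columns = predicted class,
# ordered [0, 1], i.e. [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_true, y_pred)
print(cm)
```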

Results

Performance Comparison

Model                 Accuracy  Precision  Recall  F1-Score  Training Time
SVM                   84.5%     85.2%      85.8%   84.9%     0.2s
Logistic Regression   84.4%     85.0%      85.5%   84.8%     0.1s
Naive Bayes           82.0%     81.0%      86.0%   83.4%     0.01s
Random Forest         81.7%     81.5%      83.0%   82.2%     9.0s

All Metrics Comparison

Comparison of accuracy, precision, recall, and F1-score across all methods

Precision-Recall Trade-off

Precision-Recall trade-off showing how models balance these metrics

ROC Curves

ROC curves demonstrating each model's ability to separate classes

What Do These Numbers Tell Us?

1. Best Overall Performance

SVM achieves the highest scores across almost all metrics. Logistic Regression is extremely close β€” so close that in many practical cases, the difference is negligible.

2. Training Time Matters

Naive Bayes trains almost instantly (0.01s) and Random Forest is by far the slowest (9.0s), with SVM and Logistic Regression in between. If you care about fast training and frequent retraining, this difference matters.

3. So Which Model Should You Use?

It depends on your goal: if you want the highest scores, choose SVM; if you want nearly identical performance with faster training and a simpler, more interpretable model, choose Logistic Regression; and if training speed is the top priority, Naive Bayes is hard to beat.

🎯 Recommendation

Given how close SVM and Logistic Regression are, Logistic Regression is arguably the best practical choice here.

Interactive Model Comparison

Explore the performance metrics interactively! Select different metrics from the dropdown to compare all four models dynamically.


Final Thoughts

This project shows that you don't always need complex deep learning models to get strong results. Classic machine learning algorithms, when paired with good preprocessing and TF-IDF features, can perform remarkably well for sentiment analysis.

A natural next step would be to test more powerful approaches, such as deep learning models (for example, LSTMs or transformer-based architectures like BERT), and then ask an important question:

Do the performance gains justify the extra complexity and computational cost?