Comparative Study of Machine Learning Algorithms for Sentiment Analysis

Logistic Regression vs Naive Bayes vs Random Forest vs SVM

πŸ‘©β€πŸ’» Leila Soltani β€’ πŸ“Š IMDB Dataset (50K Reviews) β€’ 🎯 84.5% Best Accuracy
IMDB Sentiment Analysis Banner

Introduction

Background

Sentiment analysis is one of the most common real-world applications of Natural Language Processing (NLP). From product reviews to social media posts, understanding how people feel about something can drive better decisions.

In this project, I built and compared four popular machine learning models β€” Logistic Regression, Naive Bayes, Random Forest, and Support Vector Machine (SVM) β€” to classify movie reviews from the IMDB dataset as positive or negative.

The Dataset

The data comes from the IMDB movie reviews dataset on Kaggle. It contains 50,000 labeled reviews, evenly split between positive and negative sentiment.

Exploring the Data with Word Clouds

Before training any models, I wanted to get a quick, intuitive sense of what the reviews look like and which words appear most often in positive and negative reviews.

A simple way to do this is with word clouds.

Word clouds visualize the most frequent words in a corpus, where larger words appear more often in the text. While word clouds are not a quantitative analysis tool, they are useful for building intuition about the dataset.

I generated two word clouds:

Word Clouds for Positive and Negative Reviews

Word clouds showing distinctive words in positive (left) and negative (right) reviews

What We Can Observe

From the positive reviews word cloud, we often see words such as:

great, excellent, love, amazing, best, wonderful

From the negative reviews word cloud, common words include:

bad, worst, boring, waste, awful, poor

βœ“ Key Insight

These patterns match our expectations and provide a quick sanity check that the dataset is labeled correctly and contains meaningful sentiment-bearing words.

Word clouds don't replace proper modeling, but they offer a helpful first glance at the data before moving into feature extraction and training.

How the Text Was Prepared

Before training any model, the text needs to be converted into numbers.

Here's what I did:

  1. Removed HTML tags, punctuation, numbers, and extra spaces
  2. Tokenized each review into words
  3. Converted reviews into numerical vectors using TF-IDF (Term Frequency–Inverse Document Frequency)
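The steps above can be sketched as follows. The regexes and the `TfidfVectorizer` settings here are illustrative assumptions, not the project's exact configuration:

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer

def clean_review(text):
    """Step 1: strip HTML tags, punctuation, numbers, and extra spaces."""
    text = re.sub(r"<[^>]+>", " ", text)      # HTML tags like <br />
    text = re.sub(r"[^a-zA-Z\s]", " ", text)  # punctuation and digits
    return re.sub(r"\s+", " ", text).strip().lower()

reviews = [
    "An <br /> AMAZING film!! 10/10 would watch again.",
    "Boring... a total waste of 2 hours.",
]
cleaned = [clean_review(r) for r in reviews]
print(cleaned[0])  # an amazing film would watch again

# Steps 2-3: TfidfVectorizer tokenizes internally and builds TF-IDF vectors.
vectorizer = TfidfVectorizer(stop_words="english", max_features=5000)
X = vectorizer.fit_transform(cleaned)
print(X.shape)  # (n_reviews, n_features)
```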

πŸ“– What is TF-IDF?

TF-IDF helps capture not only how often a word appears in a review, but also how important that word is across the entire dataset. Common words like "the" and "is" get lower scores, while distinctive words like "brilliant" or "terrible" get higher scores.
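A tiny illustration of that effect on a toy corpus (not the IMDB data): the ubiquitous word scores lower than the distinctive one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the movie is brilliant",
    "the movie is terrible",
    "the movie is long",
]
vec = TfidfVectorizer()  # no stop-word removal here, so "the" survives
tfidf = vec.fit_transform(docs)

row0 = tfidf[0].toarray().ravel()
col = vec.vocabulary_  # word -> column index

# "the" appears in every document (low IDF); "brilliant" in only one (high IDF).
print(f'the: {row0[col["the"]]:.3f}  brilliant: {row0[col["brilliant"]]:.3f}')
```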

Machine Learning Models Tested

πŸ”΅ Logistic Regression

A linear classifier that models the probability of a binary outcome using a logistic function.

πŸ”΄ Naive Bayes

A probabilistic classifier based on Bayes' theorem with an assumption of conditional independence among features.

🟒 Random Forest

An ensemble learning method that combines multiple decision trees to improve robustness and accuracy.

🟠 SVM

Support Vector Machine attempts to find an optimal hyperplane that separates data points with maximum margin.
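All four models can be trained with scikit-learn in a few lines. This is a toy-data sketch with assumed hyperparameters, not the project's exact setup (the original trained on TF-IDF vectors of the 50,000 reviews):

```python
# Toy-data sketch of the four classifiers on TF-IDF features.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

texts = ["great movie", "excellent film", "boring waste", "awful plot"]
labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

X = TfidfVectorizer().fit_transform(texts)

models = {
    "Logistic Regression": LogisticRegression(),
    "Naive Bayes": MultinomialNB(),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "SVM": LinearSVC(),
}
for name, model in models.items():
    model.fit(X, labels)
    print(name, model.score(X, labels))  # training accuracy on the toy set
```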

Evaluation Metrics

To comprehensively evaluate model performance, multiple metrics are used. Relying on a single metric such as accuracy can be misleading, especially when different types of classification errors have different practical implications.

Therefore, this project reports Accuracy, Precision (Positive class), Recall (Positive class), F1-score (Positive class), and ROC-AUC.

Accuracy

Definition: Accuracy measures the proportion of correctly classified samples out of all samples.

Meaning: Accuracy indicates how often the model makes the correct prediction overall.

Why it's not enough

Accuracy looks good at first glance, but it doesn't tell us what kind of mistakes the model is making. A model can have high accuracy but still perform poorly on one class.

In balanced datasets like IMDB, accuracy is useful β€” but we still need more context.

Precision (Positive Class)

Definition: Precision measures the proportion of predicted positive reviews that are actually positive.

Meaning: When the model predicts a review as positive, how often is it correct?

Why it matters: High precision means the model doesn't wrongly label too many negative reviews as positive. If you were analyzing customer feedback, high precision means you're not mistakenly treating unhappy customers as satisfied ones.

Recall (Positive Class)

Definition: Recall measures the proportion of actual positive reviews that are correctly identified.

Meaning: Out of all truly positive reviews, how many did the model capture?

Why it matters: High recall means we're not missing too many positive reviews. If your goal is to find all satisfied users, recall becomes very important.

F1-Score (Positive Class)

Definition: The F1-score is the harmonic mean of precision and recall.

Meaning: F1-score provides a single metric that balances precision and recall.

Why it matters: Sometimes precision is high but recall is low (or vice versa). F1-score gives a combined view and helps evaluate the trade-off between the two. If you want a balanced model, F1-score is usually the metric to look at.
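These four metrics can be computed directly with `sklearn.metrics`. A worked toy example (3 true positives, 2 false positives, 1 false negative, 2 true negatives):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0]  # 4 positive, 4 negative reviews
y_pred = [1, 1, 1, 0, 0, 0, 1, 1]  # TP=3, FN=1, TN=2, FP=2

print(accuracy_score(y_true, y_pred))      # (3 + 2) / 8        = 0.625
print(precision_score(y_true, y_pred))     # 3 / (3 + 2)        = 0.6
print(recall_score(y_true, y_pred))        # 3 / (3 + 1)        = 0.75
print(round(f1_score(y_true, y_pred), 3))  # 2*0.6*0.75 / 1.35  = 0.667
```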

ROC-AUC

Definition: ROC-AUC measures the model's ability to distinguish between positive and negative classes across all classification thresholds.

Meaning: ROC-AUC represents the probability that the model ranks a randomly chosen positive review higher than a randomly chosen negative review.

πŸ“Š Interpretation

  • AUC = 0.5 β†’ No discriminative ability (random guessing)
  • AUC = 1.0 β†’ Perfect discrimination

Higher AUC values indicate better overall separability.

Why it matters: Unlike accuracy, ROC-AUC doesn't depend on one fixed threshold (like 0.5). It evaluates the model's overall ability to distinguish between classes. This makes it especially useful when comparing models fairly.
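Unlike the other metrics, `roc_auc_score` takes predicted scores rather than hard labels. In this toy example, 3 of the 4 (positive, negative) pairs are ranked correctly, so the AUC is 0.75:

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]  # model's confidence that each review is positive

# AUC = fraction of (positive, negative) pairs ranked correctly: 3/4 here.
print(roc_auc_score(y_true, scores))  # 0.75
```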

Why Use All of These?

Each metric highlights a different aspect of performance:

  • Accuracy: overall correctness across both classes
  • Precision: how trustworthy the model's positive predictions are
  • Recall: how many of the actual positives the model captures
  • F1-score: the balance between precision and recall
  • ROC-AUC: class separability, independent of any single threshold

Looking at all of them together gives a much clearer picture than any single metric alone.

Understanding Confusion Matrices

In addition to numerical metrics, I also used confusion matrices to better understand how each model makes mistakes.

A confusion matrix breaks predictions into four categories:

  • True Positives (TP): positive reviews correctly predicted as positive
  • True Negatives (TN): negative reviews correctly predicted as negative
  • False Positives (FP): negative reviews wrongly predicted as positive
  • False Negatives (FN): positive reviews wrongly predicted as negative

Confusion Matrices for All 4 Models

2Γ—2 grid showing confusion matrices for each model with counts and percentages

Why Confusion Matrices Are Useful

While metrics like accuracy or F1-score give a single number, confusion matrices show where errors are happening.

For example, a model might misclassify negative reviews as positive far more often than the reverse, even when its overall accuracy looks fine.

This helps answer questions like: which class does the model struggle with, and are its errors balanced between false positives and false negatives?

πŸ’‘ Practical Insight

By looking at confusion matrices alongside performance metrics, we gain a more complete understanding of model behavior.

Two models might have similar accuracy, but one may produce more false positives while the other produces more false negatives β€” which could matter depending on the application.
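scikit-learn builds the matrix directly from true and predicted labels. A small sketch with toy data:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 1]

# sklearn convention: rows = actual class, columns = predicted class,
# ordered [0, 1], i.e. [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_true, y_pred)
print(cm)
```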

Results

Performance Comparison

Model                 Accuracy  Precision  Recall  F1-Score  Training Time
SVM                   84.5%     85.2%      85.8%   84.9%     0.2s
Logistic Regression   84.4%     85.0%      85.5%   84.8%     0.1s
Naive Bayes           82.0%     81.0%      86.0%   83.4%     0.01s
Random Forest         81.7%     81.5%      83.0%   82.2%     9.0s

All Metrics Comparison

Comparison of accuracy, precision, recall, and F1-score across all methods

Precision-Recall Trade-off

Precision-Recall trade-off showing how models balance these metrics

ROC Curves

ROC curves demonstrating each model's ability to separate classes

What Do These Numbers Tell Us?

1. Best Overall Performance

SVM achieves the highest scores across almost all metrics. Logistic Regression is extremely close β€” so close that in many practical cases, the difference is negligible.

2. Training Time Matters

Naive Bayes trains almost instantly (0.01s) and Random Forest is by far the slowest (9.0s), with SVM and Logistic Regression in between. If you care about fast training and frequent retraining, this difference matters.

3. So Which Model Should You Use?

It depends on your goal: if you want the highest scores, choose SVM; if you want nearly identical performance with faster training and a simpler, more interpretable model, choose Logistic Regression; and if training speed is the top priority, Naive Bayes is hard to beat.

🎯 Recommendation

Given how close SVM and Logistic Regression are, Logistic Regression is arguably the best practical choice here.

Interactive Model Comparison

Explore the performance metrics interactively! Select different metrics from the dropdown to compare all four models dynamically.


Final Thoughts

This project shows that you don't always need complex deep learning models to get strong results. Classic machine learning algorithms, when paired with good preprocessing and TF-IDF features, can perform remarkably well for sentiment analysis.

A natural next step would be to test more powerful approaches, such as deep learning models (for example, LSTMs or transformer-based architectures like BERT), and then ask an important question:

Do the performance gains justify the extra complexity and computational cost?