IMDB Sentiment Classification

Problem

Build a binary sentiment classifier that predicts whether a movie review is positive or negative using IMDB text data. The goal is to compare different feature representations (one-hot encoding vs learned embeddings) and evaluate their impact on model performance and generalization.

Data

The dataset consists of 50,000 IMDB movie reviews split into 25,000 training and 25,000 test samples. Each review is tokenized into integer sequences where each integer represents a word in a vocabulary of ~88,000 tokens.
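To make the integer-sequence format concrete, here is a minimal sketch of how a review maps to token IDs. The word index and OOV convention below are illustrative toys, not the actual ~88,000-token IMDB index.

```python
# Illustrative sketch: turning a review into an integer sequence.
# word_index is a toy example, not the real IMDB vocabulary.
word_index = {"this": 1, "movie": 2, "was": 3, "great": 4, "terrible": 5}
OOV_ID = 0  # assumed id for out-of-vocabulary words

def encode(review, index, oov=OOV_ID):
    """Map each whitespace token to its integer id (unknown words -> oov)."""
    return [index.get(tok, oov) for tok in review.lower().split()]

print(encode("This movie was great", word_index))  # [1, 2, 3, 4]
print(encode("This movie was awful", word_index))  # [1, 2, 3, 0]
```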

Preprocessing

Several preprocessing steps were applied to standardize inputs.
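One typical standardization step for variable-length token sequences is padding or truncating every review to a fixed length. The helper below is a hedged sketch of that idea (the exact lengths and padding convention used in this project are assumptions):

```python
def pad_or_truncate(seq, maxlen, pad_id=0):
    """Left-pad with pad_id, or keep only the last maxlen tokens."""
    if len(seq) >= maxlen:
        return seq[-maxlen:]  # truncate, keeping the end of the review
    return [pad_id] * (maxlen - len(seq)) + seq

print(pad_or_truncate([5, 7, 9], 5))           # [0, 0, 5, 7, 9]
print(pad_or_truncate([1, 2, 3, 4, 5, 6], 5))  # [2, 3, 4, 5, 6]
```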

Methods

Three main modeling approaches were evaluated: two one-hot logistic-regression variants (flattened and averaged features) and an embedding-based model.

One-Hot Logistic Regression

Each token was first one-hot encoded over the vocabulary. Flattening these per-position encodings produced a high-dimensional sparse feature space (~20,000 features), while averaging them over positions reduced dimensionality to 1,000 features, one per vocabulary word.
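The two featurizations can be sketched as follows. The vocabulary cap of 1,000 and sequence length of 20 are assumptions inferred from the feature counts above (20 x 1,000 = 20,000 flattened features):

```python
import numpy as np

VOCAB = 1000   # assumed vocabulary cap
MAXLEN = 20    # assumed fixed sequence length

def one_hot_features(seq, vocab=VOCAB, maxlen=MAXLEN, average=False):
    """Build a per-position one-hot matrix, then flatten or average it."""
    onehot = np.zeros((maxlen, vocab))
    for pos, tok in enumerate(seq[:maxlen]):
        onehot[pos, tok] = 1.0
    # Averaging over positions yields a vocab-sized bag-of-words vector;
    # flattening keeps positional information but blows up dimensionality.
    return onehot.mean(axis=0) if average else onehot.reshape(-1)

seq = [3, 17, 17, 999]
print(one_hot_features(seq).shape)                # (20000,)
print(one_hot_features(seq, average=True).shape)  # (1000,)
```

Either vector can then be fed to an ordinary logistic-regression classifier.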

Results for both variants are summarized in the Results table below.

Embedding-Based Model

Word embeddings were learned using a trainable embedding layer followed by global average pooling. This reduces sparsity and captures semantic relationships between words.
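The embedding-plus-pooling step can be sketched in plain NumPy. The table here is randomly initialized for illustration; in the actual model it is a trainable layer, and the 1,000-word vocabulary is an assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, DIM = 1000, 32               # assumed vocabulary cap; 32-dim embeddings
E = rng.normal(size=(VOCAB, DIM))   # stand-in for the trained embedding table

def embed_and_pool(seq):
    """Look up each token's dense vector, then global-average-pool them."""
    vectors = E[np.asarray(seq)]    # (seq_len, DIM) dense token vectors
    return vectors.mean(axis=0)     # (DIM,) fixed-size review representation

print(embed_and_pool([3, 17, 999]).shape)  # (32,)
```

Global average pooling gives every review the same fixed-size dense representation regardless of its length, which a downstream classifier can consume directly.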

Embedding size was varied from 2 to 64 dimensions.

Results

Model                     Train Accuracy   Validation Accuracy   Params
One-Hot Flatten (LR-C)    ~80.9%           ~68.4%                ~60K
One-Hot Average (LR-A)    ~69.8%           ~68.7%                ~3K
Embedding Model (best)    ~75–76%          ~73%                  ~96K (32-dim)

Key Findings

Embedding Interpretation

Learned embeddings revealed semantic structure where sentiment-related words cluster together. However, many words remain neutral, and rare or numeric tokens behave unpredictably due to limited training signal.
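A common way to check this clustering is cosine similarity between learned word vectors: same-sentiment words should score high, opposite-sentiment words low. The 2-dimensional vectors below are illustrative toys, not the trained embeddings:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two word vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 2-dim "learned" embeddings; values are illustrative only.
emb = {
    "great":    np.array([ 1.0, 0.2]),
    "terrible": np.array([-0.9, 0.1]),
    "awful":    np.array([-1.0, 0.3]),
}
print(cosine(emb["terrible"], emb["awful"]))  # high: same sentiment cluster
print(cosine(emb["great"], emb["awful"]))     # negative: opposite sentiment
```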

In the 2-dimensional configuration, the two learned embedding dimensions loosely reflect interpretable, sentiment-related directions.

Impact

This project demonstrates the transition from sparse representations (one-hot encoding) to dense learned embeddings, highlighting how representation learning improves generalization in NLP tasks. It also shows how simple models (logistic regression) can still perform competitively when paired with strong feature engineering.