Problem
Build a binary sentiment classifier that predicts whether a movie review is positive or negative using IMDB text data.
The goal is to compare different feature representations (one-hot encoding vs learned embeddings) and evaluate their impact on model performance and generalization.
Data
The dataset consists of 50,000 IMDB movie reviews split into 25,000 training and 25,000 test samples.
Each review is tokenized into integer sequences where each integer represents a word in a vocabulary of ~88,000 tokens.
Preprocessing
Several preprocessing steps were applied to standardize the inputs (a code sketch follows the list):
- Decoded integer token sequences into human-readable text using a reverse word index
- Computed review length statistics (min, max, mean) for positive vs negative reviews
- Explored token frequency differences between sentiment classes
- Truncated reviews to a fixed length (20 or 300 tokens, depending on the experiment)
- Applied padding (value = 0) to ensure uniform sequence length
- Reduced the vocabulary size (e.g., kept the top 1,000 tokens and mapped the rest to an OOV token)
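A minimal sketch of this pipeline, assuming the Keras IMDB loader and `pad_sequences`; `VOCAB_SIZE` and `MAX_LEN` mirror the values above, and the remaining names are illustrative:

```python
import numpy as np
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing.sequence import pad_sequences

VOCAB_SIZE = 1000  # keep the top-1000 tokens; rarer words map to the OOV index
MAX_LEN = 20       # or 300, depending on the experiment

# Words outside the top VOCAB_SIZE are replaced by the OOV token (index 2)
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=VOCAB_SIZE)

# Decode one review back to text via the reverse word index
# (indices 0-2 are reserved for padding, sequence-start, and OOV markers)
word_index = imdb.get_word_index()
reverse_index = {i + 3: w for w, i in word_index.items()}
decoded = " ".join(reverse_index.get(tok, "?") for tok in x_train[0])

# Length statistics, computed before truncation/padding
lengths = np.array([len(seq) for seq in x_train])
print("min/max/mean length:", lengths.min(), lengths.max(), lengths.mean())

# Truncate and pad every review to a fixed length with value 0
x_train = pad_sequences(x_train, maxlen=MAX_LEN, value=0)
x_test = pad_sequences(x_test, maxlen=MAX_LEN, value=0)
```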
Methods
Three main modeling approaches were evaluated:
- Logistic Regression with One-Hot Encoding (Flattening): treats each token-position pair as an independent feature
- Logistic Regression with One-Hot Averaging: averages token vectors across sequence positions
- Neural Embedding Model: learns dense word representations and aggregates them using global average pooling
One-Hot Logistic Regression
Flattening produced a high-dimensional sparse feature space (~20,000 features), while averaging reduced dimensionality to 1,000 features.
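The two feature constructions can be sketched as follows, assuming the padded `x_train`/`x_test` arrays from the preprocessing step and scikit-learn's `LogisticRegression`; the solver and regularization settings here are illustrative, not values reported by the experiments:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def one_hot_flatten(seqs, vocab_size):
    """One feature per (position, token) pair: seq_len * vocab_size dimensions."""
    n, seq_len = seqs.shape
    out = np.zeros((n, seq_len * vocab_size), dtype=np.float32)
    for i, seq in enumerate(seqs):
        for pos, tok in enumerate(seq):
            out[i, pos * vocab_size + tok] = 1.0
    return out

def one_hot_average(seqs, vocab_size):
    """Average of per-position one-hot vectors: vocab_size dimensions, order-free."""
    n, seq_len = seqs.shape
    out = np.zeros((n, vocab_size), dtype=np.float32)
    for i, seq in enumerate(seqs):
        for tok in seq:
            out[i, tok] += 1.0 / seq_len
    return out

# Same classifier, two representations
clf = LogisticRegression(max_iter=1000)
clf.fit(one_hot_average(x_train, VOCAB_SIZE), y_train)
print("averaging val acc:", clf.score(one_hot_average(x_test, VOCAB_SIZE), y_test))
```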
Results showed:
- The flattening model achieved higher training accuracy (~80.9%) but slightly lower validation accuracy (~68.4%)
- The averaging model had lower training accuracy (~69.8%) but marginally better validation accuracy (~68.7%)
- The flattening model showed mild overfitting, consistent with its much larger feature space
Embedding-Based Model
Word embeddings were learned using a trainable embedding layer followed by global average pooling.
This reduces sparsity and captures semantic relationships between words.
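A minimal Keras sketch of this architecture, reusing the preprocessed data above; the optimizer, epoch count, and validation split are illustrative choices, not values reported in the experiments:

```python
from tensorflow.keras import layers, models

EMBED_DIM = 32  # swept between 2 and 64 in the experiments

model = models.Sequential([
    # Map each integer token to a dense EMBED_DIM-dimensional vector
    layers.Embedding(input_dim=VOCAB_SIZE, output_dim=EMBED_DIM),
    # Average the token vectors across sequence positions
    layers.GlobalAveragePooling1D(),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, epochs=10, batch_size=64, validation_split=0.2)
```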
Embedding size was varied from 2 to 64 dimensions.
- Validation accuracy improved steadily with embedding size
- Performance plateaued around 32–64 dimensions (~73% validation accuracy)
- Larger embeddings increased parameters but gave diminishing returns
Results
| Model | Train Accuracy | Validation Accuracy | Params |
|---|---|---|---|
| One-Hot Flatten (LR-C) | ~80.9% | ~68.4% | ~60K |
| One-Hot Average (LR-A) | ~69.8% | ~68.7% | ~3K |
| Embedding Model (best) | ~75–76% | ~73% | ~96K (32-dim) |
Key Findings
- Flattened one-hot models overfit due to high dimensionality and sparse representation
- Averaging reduces variance and improves generalization slightly
- Learned embeddings significantly improve performance by capturing semantic relationships between words
- Embedding size improves performance up to a point, after which gains plateau
Embedding Interpretation
Learned embeddings revealed semantic structure where sentiment-related words cluster together.
However, many words remain neutral, and rare or numeric tokens behave unpredictably due to limited training signal.
In the 2-dimensional model, the two learned embedding dimensions loosely reflect (see the inspection sketch after this list):
- Sentiment polarity (positive vs negative orientation)
- Strength or intensity of sentiment expression
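One way to inspect these axes, assuming a model trained as above with `EMBED_DIM = 2` and the `word_index` from the preprocessing sketch; the probe words are illustrative:

```python
# Extract the learned embedding matrix: shape (VOCAB_SIZE, EMBED_DIM)
weights = model.layers[0].get_weights()[0]

def embedding_of(word):
    # word_index ranks start at 1; load_data shifts indices by 3
    idx = word_index.get(word, -1) + 3
    return weights[idx] if 3 <= idx < weights.shape[0] else None

for w in ["great", "terrible", "boring", "the"]:
    print(w, embedding_of(w))
```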
Impact
This project demonstrates the transition from sparse representations (one-hot encoding) to dense learned embeddings,
highlighting how representation learning improves generalization in NLP tasks. It also shows how simple models
(logistic regression) can still perform competitively when paired with strong feature engineering.