Problem
Build a binary sentiment classifier that predicts whether a movie review is positive or negative using IMDB text data.
The goal is to compare different feature representations (one-hot encoding vs learned embeddings) and evaluate their impact on model performance and generalization.
Data
The dataset consists of 50,000 IMDB movie reviews split into 25,000 training and 25,000 test samples.
Each review is tokenized into integer sequences where each integer represents a word in a vocabulary of ~88,000 tokens.
Preprocessing
Several preprocessing steps were applied to standardize the inputs (a code sketch follows the list):
- Decoded integer token sequences into human-readable text using a reverse word index
- Computed review length statistics (min, max, mean) for positive vs negative reviews
- Explored token frequency differences between sentiment classes
- Truncated reviews to a fixed length (20 or 300 tokens, depending on the experiment)
- Applied padding (value = 0) to ensure uniform sequence length
- Reduced the vocabulary size (e.g., kept the top 1,000 tokens and mapped the rest to an OOV token)
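A minimal sketch of this pipeline, assuming the Keras IMDB loader and `pad_sequences`; `VOCAB_SIZE` and `MAX_LEN` mirror the values above, and the remaining names are illustrative:

```python
import numpy as np
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing.sequence import pad_sequences

VOCAB_SIZE = 1000  # keep the top-1000 tokens; rarer words map to the OOV index
MAX_LEN = 20       # or 300, depending on the experiment

# Words outside the top VOCAB_SIZE are replaced by the OOV token (index 2)
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=VOCAB_SIZE)

# Decode one review back to text via the reverse word index
# (indices 0-2 are reserved for padding, sequence-start, and OOV markers)
word_index = imdb.get_word_index()
reverse_index = {i + 3: w for w, i in word_index.items()}
decoded = " ".join(reverse_index.get(tok, "?") for tok in x_train[0])

# Length statistics, computed before truncation/padding
lengths = np.array([len(seq) for seq in x_train])
print("min/max/mean length:", lengths.min(), lengths.max(), lengths.mean())

# Truncate and pad every review to a fixed length with value 0
x_train = pad_sequences(x_train, maxlen=MAX_LEN, value=0)
x_test = pad_sequences(x_test, maxlen=MAX_LEN, value=0)
```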
Methods
Three main modeling approaches were evaluated:
- Logistic Regression with One-Hot Encoding (Flattening): treats each token-position pair as an independent feature
- Logistic Regression with One-Hot Averaging: averages token vectors across sequence positions
- Neural Embedding Model: learns dense word representations and aggregates them using global average pooling
One-Hot Logistic Regression
Flattening produced a high-dimensional sparse feature space (~20,000 features), while averaging reduced dimensionality to 1,000 features.
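The two feature constructions can be sketched as follows, assuming the padded `x_train`/`x_test` arrays from the preprocessing step and scikit-learn's `LogisticRegression`; the solver and regularization settings here are illustrative, not values reported by the experiments:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def one_hot_flatten(seqs, vocab_size):
    """One feature per (position, token) pair: seq_len * vocab_size dimensions."""
    n, seq_len = seqs.shape
    out = np.zeros((n, seq_len * vocab_size), dtype=np.float32)
    for i, seq in enumerate(seqs):
        for pos, tok in enumerate(seq):
            out[i, pos * vocab_size + tok] = 1.0
    return out

def one_hot_average(seqs, vocab_size):
    """Average of per-position one-hot vectors: vocab_size dimensions, order-free."""
    n, seq_len = seqs.shape
    out = np.zeros((n, vocab_size), dtype=np.float32)
    for i, seq in enumerate(seqs):
        for tok in seq:
            out[i, tok] += 1.0 / seq_len
    return out

# Same classifier, two representations
clf = LogisticRegression(max_iter=1000)
clf.fit(one_hot_average(x_train, VOCAB_SIZE), y_train)
print("averaging val acc:", clf.score(one_hot_average(x_test, VOCAB_SIZE), y_test))
```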
Results showed:
- The flattening model achieved higher training accuracy (~80.9%) but slightly lower validation accuracy (~68.4%)
- The averaging model had lower training accuracy (~69.8%) but marginally better validation accuracy (~68.7%)
- The flattening model showed mild overfitting, consistent with its much larger feature space
Embedding-Based Model
Word embeddings were learned using a trainable embedding layer followed by global average pooling.
This reduces sparsity and captures semantic relationships between words.
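A minimal Keras sketch of this architecture, reusing the preprocessed data above; the optimizer, epoch count, and validation split are illustrative choices, not values reported in the experiments:

```python
from tensorflow.keras import layers, models

EMBED_DIM = 32  # swept between 2 and 64 in the experiments

model = models.Sequential([
    # Map each integer token to a dense EMBED_DIM-dimensional vector
    layers.Embedding(input_dim=VOCAB_SIZE, output_dim=EMBED_DIM),
    # Average the token vectors across sequence positions
    layers.GlobalAveragePooling1D(),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, epochs=10, batch_size=64, validation_split=0.2)
```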
Embedding size was varied from 2 to 64 dimensions.
- Validation accuracy improved steadily with embedding size
- Performance plateaued around 32–64 dimensions (~73% validation accuracy)
- Larger embeddings increased parameters but gave diminishing returns
Results
| Model | Train Accuracy | Validation Accuracy | Params |
|---|---|---|---|
| One-Hot Flatten (LR-C) | ~80.9% | ~68.4% | ~60K |
| One-Hot Average (LR-A) | ~69.8% | ~68.7% | ~3K |
| Embedding Model (best) | ~75–76% | ~73% | ~96K (32-dim) |
Key Findings
- Flattened one-hot models overfit due to high dimensionality and sparse representation
- Averaging reduces variance and improves generalization slightly
- Learned embeddings significantly improve performance by capturing semantic relationships between words
- Embedding size improves performance up to a point, after which gains plateau
Embedding Interpretation
Learned embeddings revealed semantic structure where sentiment-related words cluster together.
However, many words remain neutral, and rare or numeric tokens behave unpredictably due to limited training signal.
In the 2-dimensional model, the two learned embedding dimensions loosely reflect (see the inspection sketch after this list):
- Sentiment polarity (positive vs negative orientation)
- Strength or intensity of sentiment expression
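One way to inspect these axes, assuming a model trained as above with `EMBED_DIM = 2` and the `word_index` from the preprocessing sketch; the probe words are illustrative:

```python
# Extract the learned embedding matrix: shape (VOCAB_SIZE, EMBED_DIM)
weights = model.layers[0].get_weights()[0]

def embedding_of(word):
    # word_index ranks start at 1; load_data shifts indices by 3
    idx = word_index.get(word, -1) + 3
    return weights[idx] if 3 <= idx < weights.shape[0] else None

for w in ["great", "terrible", "boring", "the"]:
    print(w, embedding_of(w))
```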
Impact
This project demonstrates the transition from sparse representations (one-hot encoding) to dense learned embeddings,
highlighting how representation learning improves generalization in NLP tasks. It also shows how simple models
(logistic regression) can still perform competitively when paired with strong feature engineering.