
Decoding Sentiment: Analysis of 4 Million Amazon Reviews
Group capstone: binary sentiment on the Amazon Reviews corpus—EDA, TF–IDF baselines, clustering, MLP and DistilBERT, and a locked test evaluation with leakage controls.
Peter Mangoro
Given the text of an Amazon product review, can you reliably predict whether the writer felt negative or positive? At capstone scale that question stops being a toy exercise: the fastText polarity release bundles 3.6 million training reviews and 400,000 held-out test rows—4 million labeled examples in total, split 50/50 between classes. The hard part is not only fitting models, but keeping evaluation honest when every preprocessing choice can silently peek at data you promised to hold out.
Team context
This capstone was completed as a group. Team members were:
- Innocent Mujokoro
- Tapiwanashe Mutarimanja
- Satya Sai Priya Devireddy
- Masheia Dzimba
- Peter Mangoro
What we set out to do
Build an end-to-end NLP analytics story that:
- Profiles the fastText corpus (balance, length tails, malformed rows) before any modeling.
- Benchmarks classical predictors (TF–IDF + logistic regression, linear SVM, Random Forest) at scale.
- Explores unsupervised structure (TF–IDF → SVD and sentence-embedding clusters).
- Compares neural baselines (bag-of-words MLP, fine-tuned DistilBERT) on matched splits.
- Finishes with leakage checks, 5-fold CV, and a single locked evaluation on
test.ft.txt.
Data landscape
After parsing __label__1 → negative (0) and __label__2 → positive (1), both train and test sets are perfectly balanced. Review lengths are right-skewed: median about 70 words, with a long tail that informed truncation for transformers (we used max_length=96 for DistilBERT with sensitivity checks). For bag-of-words models we used 300,000 TF–IDF features with 1–2 grams, English stop words, and sublinear term weighting—enough capacity to capture sentiment cues like great, terrible, and negation patterns that raw counts miss.

Model results
On a 500k training-file cap with an 80/20 validation split, TF–IDF + SGD (hinge) and logistic regression tie near the top—about 90.5% accuracy and F1 ≈ 0.907, with ROC-AUC ≈ 0.966. Random Forest on the same sparse features lags (F1 ≈ 0.834): trees struggle to beat a regularized linear margin in this high-dimensional text setting.
On a matched 120k subset (fair comparison with neural models), the ranking holds:
| Model | Validation F1 (approx.) |
|---|---|
| TF–IDF + SGD (hinge) | 0.898 |
| MLP (bag-of-words) | 0.888 |
| DistilBERT (fine-tuned) | 0.874 |
Most of the gap between the 120k neural runs and the 500k linear leader is data scale, not architecture alone. 5-fold CV on the 500k cap reported mean F1 0.907 (std 0.001), aligned with the single holdout.
Locked test (one refit on the training-file cap, then predict on all 400k held-out rows—labels never used for tuning): F1 0.906, ROC-AUC 0.966, within ~0.001 F1 of development and CV. That stability supports deploying TF–IDF + SGD for polarity under cost, speed, and interpretability constraints.
Unsupervised discovery
Beyond supervised scores, we clustered review representations to see whether sentiment-aligned language also forms topic neighborhoods. Sentence embeddings surfaced roughly 5–6 coherent groups—themes such as product quality, customer service, pricing, durability, and value—useful for scoping monitoring and explaining model behavior to non-ML stakeholders.
Artifacts
- Code and notebooks: GitHub —
dataScience/final - Technical report (PDF): Amazon review sentiment — group notebook export (PDF)
What I learned
At millions of rows, evaluation discipline matters as much as model choice: fit vectorizers on train only, fingerprint-check for train/test text overlap, and touch test.ft.txt once. Linear models on strong sparse features still win many text-classification leaderboards when compute and latency count—transformers add contextual power, but on our matched subsets they did not justify the training cost for deployment.
If you are reproducing similar work, lock your metric protocol early (we used F1 with ROC-AUC tie-break), document truncation from length quantiles, and treat ablations (e.g. log1p(word_len)) as exploratory—not as silent changes to the production pipeline.