May 18, 2026 - 10 MIN READ

Decoding Sentiment: Analysis of 4 Million Amazon Reviews

Group capstone: binary sentiment on the Amazon Reviews corpus—EDA, TF–IDF baselines, clustering, MLP and DistilBERT, and a locked test evaluation with leakage controls.

Peter Mangoro

Given the text of an Amazon product review, can you reliably predict whether the writer felt negative or positive? At capstone scale that question stops being a toy exercise: the fastText polarity release bundles 3.6 million training reviews and 400,000 held-out test rows—4 million labeled examples in total, split 50/50 between classes. The hard part is not only fitting models, but keeping evaluation honest when every preprocessing choice can silently peek at data you promised to hold out.

Team context

This capstone was completed as a group. Team members were:

Innocent Mujokoro
Tapiwanashe Mutarimanja
Satya Sai Priya Devireddy
Masheia Dzimba
Peter Mangoro

What we set out to do

Build an end-to-end NLP analytics story that:

Profiles the fastText corpus (balance, length tails, malformed rows) before any modeling.
Benchmarks classical predictors (TF–IDF + logistic regression, linear SVM, Random Forest) at scale.
Explores unsupervised structure (TF–IDF → SVD and sentence-embedding clusters).
Compares neural baselines (bag-of-words MLP, fine-tuned DistilBERT) on matched splits.
Finishes with leakage checks, 5-fold CV, and a single locked evaluation on test.ft.txt.

Data landscape

After parsing __label__1 → negative (0) and __label__2 → positive (1), both train and test sets are perfectly balanced. Review lengths are right-skewed: median about 70 words, with a long tail that informed truncation for transformers (we used max_length=96 for DistilBERT with sensitivity checks). For bag-of-words models we used 300,000 TF–IDF features with 1–2 grams, English stop words, and sublinear term weighting—enough capacity to capture sentiment cues like great, terrible, and negation patterns that raw counts miss.

Decoding Sentiment — capstone infographic

Model results

On a 500k training-file cap with an 80/20 validation split, TF–IDF + SGD (hinge) and logistic regression tie near the top—about 90.5% accuracy and F1 ≈ 0.907, with ROC-AUC ≈ 0.966. Random Forest on the same sparse features lags (F1 ≈ 0.834): trees struggle to beat a regularized linear margin in this high-dimensional text setting.

On a matched 120k subset (fair comparison with neural models), the ranking holds:

Model	Validation F1 (approx.)
TF–IDF + SGD (hinge)	0.898
MLP (bag-of-words)	0.888
DistilBERT (fine-tuned)	0.874

Most of the gap between the 120k neural runs and the 500k linear leader is data scale, not architecture alone. 5-fold CV on the 500k cap reported mean F1 0.907 (std 0.001), aligned with the single holdout.

Locked test (one refit on the training-file cap, then predict on all 400k held-out rows—labels never used for tuning): F1 0.906, ROC-AUC 0.966, within ~0.001 F1 of development and CV. That stability supports deploying TF–IDF + SGD for polarity under cost, speed, and interpretability constraints.

Unsupervised discovery

Beyond supervised scores, we clustered review representations to see whether sentiment-aligned language also forms topic neighborhoods. Sentence embeddings surfaced roughly 5–6 coherent groups—themes such as product quality, customer service, pricing, durability, and value—useful for scoping monitoring and explaining model behavior to non-ML stakeholders.

Artifacts

Live demo: Hugging Face Space — Amazon Review Sentiment
Code and notebooks: GitHub — dataScience/final
Technical report (PDF): Amazon review sentiment — group notebook export (PDF)

What I learned

At millions of rows, evaluation discipline matters as much as model choice: fit vectorizers on train only, fingerprint-check for train/test text overlap, and touch test.ft.txt once. Linear models on strong sparse features still win many text-classification leaderboards when compute and latency count—transformers add contextual power, but on our matched subsets they did not justify the training cost for deployment.

If you are reproducing similar work, lock your metric protocol early (we used F1 with ROC-AUC tie-break), document truncation from length quantiles, and treat ablations (e.g. log1p(word_len)) as exploratory—not as silent changes to the production pipeline.

Analyzing a Healthcare Knowledge Graph with Cypher and Graph Data Science

How I explored a FAERS-style healthcare graph, moved from Cypher EDA to GDS workflows, and turned graph results into practical analytical insights.