[{"data":1,"prerenderedAt":508},["ShallowReactive",2],{"navigation":3,"\u002Fblog\u002Famazon-review-sentiment-capstone":71,"\u002Fblog\u002Famazon-review-sentiment-capstone-surround":504},[4,58],{"title":5,"path":6,"stem":7,"children":8,"page":57},"Blog","\u002Fblog","blog",[9,13,17,21,25,29,33,37,41,45,49,53],{"title":10,"path":11,"stem":12},"Decoding Sentiment: Analysis of 4 Million Amazon Reviews","\u002Fblog\u002Famazon-review-sentiment-capstone","blog\u002Famazon-review-sentiment-capstone",{"title":14,"path":15,"stem":16},"Analyzing a Healthcare Knowledge Graph with Cypher and Graph Data Science","\u002Fblog\u002Fanalyzing-a-healthcare-knowledge-graph-with-cypher-and-gds","blog\u002Fanalyzing-a-healthcare-knowledge-graph-with-cypher-and-gds",{"title":18,"path":19,"stem":20},"Navigating the Web: Prioritizing Supply Chain Risk with Neo4j","\u002Fblog\u002Fautomotive-supply-chain-neo4j-capstone","blog\u002Fautomotive-supply-chain-neo4j-capstone",{"title":22,"path":23,"stem":24},"My First AWS Adventure: Building a Sentiment Analysis System on a Budget","\u002Fblog\u002Faws-sentiment-analysis-journey","blog\u002Faws-sentiment-analysis-journey",{"title":26,"path":27,"stem":28},"Building a Hybrid Movie Recommender with Neo4j and Graph Data Science","\u002Fblog\u002Fbuilding-a-hybrid-movie-recommender-with-neo4j-and-gds","blog\u002Fbuilding-a-hybrid-movie-recommender-with-neo4j-and-gds",{"title":30,"path":31,"stem":32},"Designing and Building a Neo4j Knowledge Graph from Relational Data","\u002Fblog\u002Fdesigning-and-building-a-neo4j-knowledge-graph-from-relational-data","blog\u002Fdesigning-and-building-a-neo4j-knowledge-graph-from-relational-data",{"title":34,"path":35,"stem":36},"Developing a GraphRAG Research Chatbot with Neo4j","\u002Fblog\u002Fdeveloping-a-graphrag-research-chatbot-with-neo4j","blog\u002Fdeveloping-a-graphrag-research-chatbot-with-neo4j",{"title":38,"path":39,"stem":40},"From Code to Insights: My Journey from Software Development to Data Analytics","\u002Fblog\u002Ffrom-code-to-insights-journey","blog\u002Ffrom-code-to-insights-journey",{"title":42,"path":43,"stem":44},"Predicting Hospital Readmissions: A Machine Learning Journey","\u002Fblog\u002Fhospital-readmissions","blog\u002Fhospital-readmissions",{"title":46,"path":47,"stem":48},"My First Steps into Graph Databases: Learning Neo4j Fundamentals","\u002Fblog\u002Fneo4j-graph-databases-fundamentals","blog\u002Fneo4j-graph-databases-fundamentals",{"title":50,"path":51,"stem":52},"Building a Serverless ETL Pipeline on AWS: From Raw Data to Interactive Dashboards","\u002Fblog\u002Fserverless-etl-pipeline-aws","blog\u002Fserverless-etl-pipeline-aws",{"title":54,"path":55,"stem":56},"From Traffic Violations to Safety Culture: My Data Analytics Framework","\u002Fblog\u002Ftraffic-violation-analytics-framework","blog\u002Ftraffic-violation-analytics-framework",false,{"title":59,"path":60,"stem":61,"children":62,"page":57},"Publications","\u002Fpublications","publications",[63,67],{"title":64,"path":65,"stem":66},"PCT-led early warning vital sign escalation","\u002Fpublications\u002Fpct-led-early-warning-escalation","publications\u002Fpct-led-early-warning-escalation",{"title":68,"path":69,"stem":70},"Reducing Phlebotomy Redraws Through Pre-Analytic SOPs: Training, Fidelity, and Outcome Metrics from a Community Hospital","\u002Fpublications\u002Freducing-phlebotomy-redraws-pre-analytic-sops","publications\u002Freducing-phlebotomy-redraws-pre-analytic-sops",{"id":72,"title":10,"author":73,"body":77,"date":489,"description":490,"extension":491,"image":246,"meta":492,"minRead":500,"navigation":501,"path":11,"seo":502,"stem":12,"__hash__":503},"blog\u002Fblog\u002Famazon-review-sentiment-capstone.md",{"name":74,"avatar":75},"Peter Mangoro",{"src":76,"alt":74},"\u002Fprofile.jpg",{"type":78,"value":79,"toc":478},"minimark",[80,113,118,125,143,147,150,192,196,240,247,251,290,296,339,358,379,383,411,415,445,449,463],[81,82,83,84,88,89,92,93,96,97,100,101,104,105,108,109,112],"p",{},"Given the text of an Amazon product review, can you reliably predict whether the writer felt ",[85,86,87],"strong",{},"negative"," or ",[85,90,91],{},"positive","? At capstone scale that question stops being a toy exercise: the fastText polarity release bundles ",[85,94,95],{},"3.6 million"," training reviews and ",[85,98,99],{},"400,000"," held-out test rows—",[85,102,103],{},"4 million"," labeled examples in total, split ",[85,106,107],{},"50\u002F50"," between classes. The hard part is not only fitting models, but ",[85,110,111],{},"keeping evaluation honest"," when every preprocessing choice can silently peek at data you promised to hold out.",[114,115,117],"h2",{"id":116},"team-context","Team context",[81,119,120,121,124],{},"This capstone was completed ",[85,122,123],{},"as a group",". Team members were:",[126,127,128,132,135,138,141],"ul",{},[129,130,131],"li",{},"Innocent Mujokoro",[129,133,134],{},"Tapiwanashe Mutarimanja",[129,136,137],{},"Satya Sai Priya Devireddy",[129,139,140],{},"Masheia Dzimba",[129,142,74],{},[114,144,146],{"id":145},"what-we-set-out-to-do","What we set out to do",[81,148,149],{},"Build an end-to-end NLP analytics story that:",[126,151,152,155,162,169,176],{},[129,153,154],{},"Profiles the fastText corpus (balance, length tails, malformed rows) before any modeling.",[129,156,157,158,161],{},"Benchmarks ",[85,159,160],{},"classical"," predictors (TF–IDF + logistic regression, linear SVM, Random Forest) at scale.",[129,163,164,165,168],{},"Explores ",[85,166,167],{},"unsupervised structure"," (TF–IDF → SVD and sentence-embedding clusters).",[129,170,171,172,175],{},"Compares ",[85,173,174],{},"neural"," baselines (bag-of-words MLP, fine-tuned DistilBERT) on matched splits.",[129,177,178,179,182,183,186,187,191],{},"Finishes with ",[85,180,181],{},"leakage checks",", 5-fold CV, and a ",[85,184,185],{},"single locked evaluation"," on ",[188,189,190],"code",{},"test.ft.txt",".",[114,193,195],{"id":194},"data-landscape","Data landscape",[81,197,198,199,202,203,206,207,210,211,214,215,218,219,222,223,226,227,230,231,235,236,239],{},"After parsing ",[188,200,201],{},"__label__1"," → negative (0) and ",[188,204,205],{},"__label__2"," → positive (1), both train and test sets are ",[85,208,209],{},"perfectly balanced",". Review lengths are ",[85,212,213],{},"right-skewed",": median about ",[85,216,217],{},"70 words",", with a long tail that informed truncation for transformers (we used ",[188,220,221],{},"max_length=96"," for DistilBERT with sensitivity checks). For bag-of-words models we used ",[85,224,225],{},"300,000"," TF–IDF features with ",[85,228,229],{},"1–2 grams",", English stop words, and sublinear term weighting—enough capacity to capture sentiment cues like ",[232,233,234],"em",{},"great",", ",[232,237,238],{},"terrible",", and negation patterns that raw counts miss.",[81,241,242],{},[243,244],"img",{"alt":245,"src":246},"Decoding Sentiment — capstone infographic","\u002Fdatascience\u002Ffinal.png",[114,248,250],{"id":249},"model-results","Model results",[81,252,253,254,257,258,261,262,265,266,269,270,273,274,277,278,281,282,285,286,289],{},"On a ",[85,255,256],{},"500k"," training-file cap with an ",[85,259,260],{},"80\u002F20"," validation split, ",[85,263,264],{},"TF–IDF + SGD (hinge)"," and ",[85,267,268],{},"logistic regression"," tie near the top—about ",[85,271,272],{},"90.5%"," accuracy and ",[85,275,276],{},"F1 ≈ 0.907",", with ",[85,279,280],{},"ROC-AUC ≈ 0.966",". ",[85,283,284],{},"Random Forest"," on the same sparse features lags (",[85,287,288],{},"F1 ≈ 0.834","): trees struggle to beat a regularized linear margin in this high-dimensional text setting.",[81,291,253,292,295],{},[85,293,294],{},"matched 120k"," subset (fair comparison with neural models), the ranking holds:",[297,298,299,312],"table",{},[300,301,302],"thead",{},[303,304,305,309],"tr",{},[306,307,308],"th",{},"Model",[306,310,311],{},"Validation F1 (approx.)",[313,314,315,323,331],"tbody",{},[303,316,317,320],{},[318,319,264],"td",{},[318,321,322],{},"0.898",[303,324,325,328],{},[318,326,327],{},"MLP (bag-of-words)",[318,329,330],{},"0.888",[303,332,333,336],{},[318,334,335],{},"DistilBERT (fine-tuned)",[318,337,338],{},"0.874",[81,340,341,342,345,346,349,350,353,354,357],{},"Most of the gap between the 120k neural runs and the 500k linear leader is ",[85,343,344],{},"data scale",", not architecture alone. ",[85,347,348],{},"5-fold CV"," on the 500k cap reported mean F1 ",[85,351,352],{},"0.907"," (std ",[85,355,356],{},"0.001","), aligned with the single holdout.",[81,359,360,363,364,367,368,235,371,374,375,378],{},[85,361,362],{},"Locked test"," (one refit on the training-file cap, then predict on all ",[85,365,366],{},"400k"," held-out rows—labels never used for tuning): ",[85,369,370],{},"F1 0.906",[85,372,373],{},"ROC-AUC 0.966",", within ~0.001 F1 of development and CV. That stability supports deploying ",[85,376,377],{},"TF–IDF + SGD"," for polarity under cost, speed, and interpretability constraints.",[114,380,382],{"id":381},"unsupervised-discovery","Unsupervised discovery",[81,384,385,386,389,390,393,394,235,397,235,400,235,403,406,407,410],{},"Beyond supervised scores, we clustered review representations to see whether sentiment-aligned language also forms ",[85,387,388],{},"topic neighborhoods",". Sentence embeddings surfaced roughly ",[85,391,392],{},"5–6"," coherent groups—themes such as ",[85,395,396],{},"product quality",[85,398,399],{},"customer service",[85,401,402],{},"pricing",[85,404,405],{},"durability",", and ",[85,408,409],{},"value","—useful for scoping monitoring and explaining model behavior to non-ML stakeholders.",[114,412,414],{"id":413},"artifacts","Artifacts",[126,416,417,435],{},[129,418,419,422,423],{},[85,420,421],{},"Code and notebooks",": ",[424,425,431,432],"a",{"href":426,"target":427,"rel":428},"https:\u002F\u002Fgithub.com\u002FPeterMangoro\u002FdataScience\u002Ftree\u002Fmain\u002Ffinal","_blank",[429,430],"noopener","noreferrer","GitHub — ",[188,433,434],{},"dataScience\u002Ffinal",[129,436,437,422,440],{},[85,438,439],{},"Technical report (PDF)",[424,441,444],{"href":442,"target":427,"rel":443},"\u002Fdatascience\u002Famazon-review-sentiment-notebook.pdf",[429],"Amazon review sentiment — group notebook export (PDF)",[114,446,448],{"id":447},"what-i-learned","What I learned",[81,450,451,452,455,456,458,459,462],{},"At millions of rows, ",[85,453,454],{},"evaluation discipline"," matters as much as model choice: fit vectorizers on train only, fingerprint-check for train\u002Ftest text overlap, and touch ",[188,457,190],{}," once. ",[85,460,461],{},"Linear models on strong sparse features"," still win many text-classification leaderboards when compute and latency count—transformers add contextual power, but on our matched subsets they did not justify the training cost for deployment.",[81,464,465,466,469,470,473,474,477],{},"If you are reproducing similar work, lock your metric protocol early (we used ",[85,467,468],{},"F1"," with ",[85,471,472],{},"ROC-AUC"," tie-break), document truncation from length quantiles, and treat ablations (e.g. ",[188,475,476],{},"log1p(word_len)",") as exploratory—not as silent changes to the production pipeline.",{"title":479,"searchDepth":480,"depth":480,"links":481},"",2,[482,483,484,485,486,487,488],{"id":116,"depth":480,"text":117},{"id":145,"depth":480,"text":146},{"id":194,"depth":480,"text":195},{"id":249,"depth":480,"text":250},{"id":381,"depth":480,"text":382},{"id":413,"depth":480,"text":414},{"id":447,"depth":480,"text":448},"2026-05-18","Group capstone: binary sentiment on the Amazon Reviews corpus—EDA, TF–IDF baselines, clustering, MLP and DistilBERT, and a locked test evaluation with leakage controls.","md",{"tags":493},[494,495,496,497,498,499],"NLP","Sentiment Analysis","scikit-learn","DistilBERT","TF-IDF","Capstone",10,true,{"title":10,"description":490},"lvhyRmmswl19ELh6evcgbObDKX-fL1gGfAG8Rl8sjfM",[505,506],null,{"title":14,"path":15,"stem":16,"description":507,"children":-1},"How I explored a FAERS-style healthcare graph, moved from Cypher EDA to GDS workflows, and turned graph results into practical analytical insights.",1780158383442]