
Predicting Hospital Readmissions: A Machine Learning Journey
How I built three ML models to predict 30-day readmissions in diabetes patients using Logistic Regression, CART, and Random Forest
Peter Mangoro
The Problem: Why Hospital Readmissions Matter
Imagine you're a hospital administrator. Every day, you see patients being discharged, and you wonder: "Will this patient be back within 30 days?"
Hospital readmissions are a $15 billion problem in the United States. For patients with diabetes, the stakes are even higher—they're more likely to be readmitted, which means:
- Higher costs for healthcare systems
- Worse outcomes for patients
- Strain on resources that could be better allocated
But what if we could predict which patients are at high risk of readmission? That's exactly what I set out to do in this project.
The Data: 25,000 Patient Stories
I worked with a dataset of 24,996 patient encounters from 130 US hospitals over 10 years (1999-2008). Each row tells a story:
- How long did they stay in the hospital?
- How many medications were they on?
- What was their primary diagnosis?
- Had they been to the hospital before?
The dataset had a 47.02% readmission rate—nearly half of all patients came back within 30 days. This is a significant problem that needs solving.
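Before any modeling, I split the encounters into training and test sets. The snippet below is a minimal sketch of that step using caret; the file name and column names (such as `readmitted_30`) are illustrative assumptions, not the exact identifiers in my repository.

```r
library(caret)

# Load the encounters; the file name and column names here are illustrative assumptions
readmit <- read.csv("diabetes_encounters.csv", stringsAsFactors = TRUE)

# Stratified 70/30 split on the binary outcome (readmitted within 30 days: Yes/No)
set.seed(123)
idx   <- createDataPartition(readmit$readmitted_30, p = 0.7, list = FALSE)
train <- readmit[idx, ]
test  <- readmit[-idx, ]

# Sanity check: the class balance should sit near the 47% readmission rate
prop.table(table(train$readmitted_30))
```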

My Approach: Three Models, One Goal
I decided to build three different machine learning models to see which approach worked best:
- Logistic Regression - The classic statistical approach, great for interpretability
- CART (Decision Trees) - Simple, visual, easy to explain
- Random Forest - An ensemble method that combines many trees
Each model has its strengths, and I wanted to see which one would give us the best predictions.
Model 1: Logistic Regression
Logistic Regression was my starting point. It's interpretable and provides odds ratios that clinicians can understand.
Key Insight: The model showed that n_inpatient (OR: 1.47) was the strongest predictor of readmission. Patients with more previous visits had significantly higher odds of being readmitted.
Performance:
- Accuracy: 61.84%
- AUC-ROC: 0.648
- Interpretability: ⭐⭐⭐⭐⭐ (Excellent - provides odds ratios and p-values)
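For readers who want to see the shape of this step, here is a minimal sketch of fitting the model and extracting odds ratios with base R's `glm`. The predictor set is an illustrative assumption built from the variables discussed in this post, and it reuses the `train`/`test` objects from the earlier split.

```r
# Fit logistic regression on the training split; the predictor set is an illustrative assumption
log_model <- glm(readmitted_30 ~ n_inpatient + time_in_hospital + n_medications +
                   number_diagnoses + medical_specialty,
                 data = train, family = binomial)

# Odds ratios (e.g. n_inpatient around 1.47) with 95% confidence intervals
exp(cbind(OR = coef(log_model), confint(log_model)))

# Predicted probability of 30-day readmission on the held-out test split
log_prob  <- predict(log_model, newdata = test, type = "response")
log_class <- ifelse(log_prob > 0.5, "Yes", "No")
```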

Model 2: CART (Decision Trees)
Decision trees are like a flowchart—they ask yes/no questions to classify patients. I loved how visual and intuitive this approach was.
The tree showed that total_previous_visits (42.32% importance) was the most important factor, splitting patients into high-risk and low-risk groups.
Performance:
- Accuracy: 60.78%
- AUC-ROC: 0.605
- Interpretability: ⭐⭐⭐⭐ (Great - visual decision rules)
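A sketch of the tree-growing step with rpart and rpart.plot is below. Again, the formula and the complexity parameter are assumptions for illustration; it assumes the same `train`/`test` objects as before.

```r
library(rpart)
library(rpart.plot)

# Grow the classification tree on the training split; formula and cp are illustrative assumptions
cart_model <- rpart(readmitted_30 ~ total_previous_visits + time_in_hospital +
                      n_medications + number_diagnoses,
                    data = train, method = "class",
                    control = rpart.control(cp = 0.001))

# Relative variable importance (total_previous_visits dominated in this project)
cart_model$variable.importance

# Draw the tree as a flowchart of yes/no splits
rpart.plot(cart_model)

# Class probabilities on the test split, used later for the ROC curve
cart_prob <- predict(cart_model, newdata = test, type = "prob")[, "Yes"]
```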

Model 3: Random Forest
Random Forest combines hundreds of decision trees, each trained on a different subset of the data. It's like asking a committee of experts instead of just one.
Performance:
- Accuracy: 61.46%
- AUC-ROC: 0.648
- Interpretability: ⭐⭐⭐ (Good - shows feature importance)
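The committee analogy maps directly onto the randomForest package: each tree sees a bootstrap sample of patients and a random subset of predictors, and the forest averages their votes. The sketch below assumes the same training objects and an illustrative predictor set; 500 trees is a common default, not necessarily what I tuned to.

```r
library(randomForest)

# 500 trees, each grown on a bootstrap sample; the predictor set is an illustrative assumption
set.seed(123)
rf_model <- randomForest(readmitted_30 ~ total_previous_visits + time_in_hospital +
                           n_medications + number_diagnoses,
                         data = train, ntree = 500, importance = TRUE)

# Feature importance aggregated across the ensemble
varImpPlot(rf_model)

# Probability of the "Yes" (readmitted) class on the test split
rf_prob <- predict(rf_model, newdata = test, type = "prob")[, "Yes"]
```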

The Results: What We Learned
Performance Comparison
| Model | Accuracy | AUC | Interpretability |
|---|---|---|---|
| Logistic Regression | 61.84% | 0.648 | ⭐⭐⭐⭐⭐ |
| CART | 60.78% | 0.605 | ⭐⭐⭐⭐ |
| Random Forest | 61.46% | 0.648 | ⭐⭐⭐ |
Key Findings
- All three models performed similarly (~60-62% accuracy), suggesting the problem is inherently challenging with the available features.
- Previous hospital visits consistently emerged as the strongest predictor across all models. This makes intuitive sense—patients with complex medical histories are more likely to need readmission.
- Number of diagnoses was another important factor. Patients with multiple conditions are at higher risk.
- Medical specialty mattered too. Patients seen in Emergency/Trauma departments had higher readmission rates.
ROC Curves: Visualizing Model Performance

The ROC curves show how well each model distinguishes between patients who will and won't be readmitted. An AUC of 0.65 is meaningfully better than random guessing (which would score 0.5), but there's clearly room for improvement.
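The curves themselves come from pROC. A minimal sketch, assuming the test-set probability vectors produced in the earlier snippets:

```r
library(pROC)

# ROC objects from each model's test-set probabilities
roc_log  <- roc(test$readmitted_30, log_prob)
roc_cart <- roc(test$readmitted_30, cart_prob)
roc_rf   <- roc(test$readmitted_30, rf_prob)

# Overlay the three curves on a single plot
plot(roc_log,  col = "steelblue",   legacy.axes = TRUE)
plot(roc_cart, col = "darkorange",  add = TRUE)
plot(roc_rf,   col = "forestgreen", add = TRUE)
legend("bottomright",
       legend = c("Logistic Regression", "CART", "Random Forest"),
       col    = c("steelblue", "darkorange", "forestgreen"), lwd = 2)

# AUC values, which should land near the 0.60-0.65 range reported above
auc(roc_log); auc(roc_cart); auc(roc_rf)
```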
What I Learned: Challenges and Insights
The Challenge of ~60% Accuracy
At first, I was disappointed with ~60% accuracy. But then I realized:
- This is a hard problem. Even experienced clinicians struggle to predict readmissions.
- The dataset has limitations. We're missing important clinical variables like lab values, vital signs, and social determinants of health.
- 60% is a starting point. With better features and more data, we could improve significantly.
What Would Improve the Model?
If I had access to more data, I would add:
- Lab values: Blood glucose, HbA1c, creatinine, etc.
- Vital signs: Blood pressure, heart rate, temperature
- Social determinants: Insurance type, socioeconomic status, housing stability
- Medication adherence: Are patients taking their medications as prescribed?
- Follow-up care: Did patients attend follow-up appointments?
The Power of Interpretability
One of my biggest takeaways was the importance of interpretability in healthcare. Clinicians need to understand why a model makes a prediction, not just that it does. That's why Logistic Regression, despite similar performance, might be more useful in practice—it provides odds ratios and statistical significance that doctors can interpret.
Building an Interactive Dashboard
To make this project more practical, I created a Shiny web application where users can input patient information and get real-time readmission risk predictions from all three models.
The app allows healthcare providers to:
- Input patient demographics and medical history
- See predictions from all three models
- Compare model probabilities
- View feature importance and ROC curves
Try the Interactive Dashboard: https://petermangoro.shinyapps.io/hospitalreadmission/
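The deployed app has more input fields and all three models, but its structure boils down to a Shiny UI that collects patient features and a server that scores them. The skeleton below is a minimal sketch of that idea; the two input fields and the assumption that the saved models accept exactly those predictors are simplifications for illustration.

```r
library(shiny)

# Minimal sketch of the app's structure; input fields and model objects are illustrative
# assumptions (the real app exposes more fields and all three models)
ui <- fluidPage(
  titlePanel("30-Day Readmission Risk"),
  sidebarLayout(
    sidebarPanel(
      numericInput("n_inpatient", "Previous inpatient visits", value = 0, min = 0),
      numericInput("number_diagnoses", "Number of diagnoses", value = 1, min = 1),
      actionButton("go", "Predict")
    ),
    mainPanel(tableOutput("risk"))
  )
)

server <- function(input, output) {
  output$risk <- renderTable({
    req(input$go)
    # Assumes log_model and rf_model were trained on exactly these two predictors
    patient <- data.frame(n_inpatient      = input$n_inpatient,
                          number_diagnoses = input$number_diagnoses)
    data.frame(Model       = c("Logistic Regression", "Random Forest"),
               Probability = c(predict(log_model, patient, type = "response"),
                               predict(rf_model,  patient, type = "prob")[, "Yes"]))
  })
}

shinyApp(ui, server)
```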
Technical Stack
For those interested in the technical details:
- Language: R
- Libraries: caret, rpart, randomForest, pROC, ggplot2, dplyr
- Visualization: ggplot2, rpart.plot
- Interactive App: R Shiny
- Report Generation: R Markdown
Conclusion: What's Next?
This project taught me that predicting healthcare outcomes is complex, but machine learning can provide valuable insights. While 60% accuracy might not seem impressive, it's a solid foundation that could be improved with:
- Better feature engineering
- More clinical variables
- Advanced techniques like gradient boosting
- Ensemble methods combining all three models
Most importantly, I learned that in healthcare, interpretability matters just as much as accuracy. A model that doctors can understand and trust is often more valuable than a black box with slightly better performance.
Try It Yourself
- 🚀 Interactive Dashboard - Try the models yourself with real patient data
- 💻 GitHub Repository - Full code, analysis, and R Markdown reports
- 📊 View the Full Report - Detailed technical analysis and methodology
This project was completed as part of a Computational Mathematics course in collaboration with Masheia Dzimba. All code and analysis are available on GitHub.