Predicting Patient Readmission

A machine learning approach to predict hospital readmissions within 30 days

Dataset Overview

10 years (1999-2008) of clinical care data from 130 US hospitals

  • Total Samples: 101,766
  • Features: 151
  • Years of Data: 10

Key Features

Patient Information
  • Demographics (Age, Gender, Race)
  • Time in Hospital: 1-14 days
  • Number of Lab Procedures: 1-132
  • Previous Visits (Outpatient, Emergency, Inpatient)
Medical Information
  • 23 Different Medications Tracked
  • Primary, Secondary & Additional Diagnoses
  • A1C & Glucose Serum Test Results
  • Medication Changes During Stay

Methodology & Analysis

Data Preprocessing

Handled missing values, encoded categorical variables, and standardized numerical features across 101,766 patient records.
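
As an illustration of the standardization step, the sketch below rescales a numeric column to zero mean and unit variance. The function name `standardize` and the sample values are hypothetical, and the original pipeline may have used a library scaler instead.

```python
from statistics import mean, pstdev

def standardize(values):
    """Return z-scores: (value - mean) / population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

# Hypothetical "time in hospital" values in days (not taken from the dataset)
time_in_hospital = [1, 3, 7, 14]
z = standardize(time_in_hospital)
print(z)
```

After this transformation every numeric feature is on a comparable scale, which matters for regularized models such as L2-penalized logistic regression.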

Feature Engineering

Created 151 features from raw data, including patient demographics, medical history, and hospital outcomes.

Model Development

Implemented and compared multiple machine learning models, addressing class imbalance through sampling techniques.

Data Preprocessing Steps

Missing Value Treatment
  • Race: 2.2% missing values - filled with 'Other'
  • Diagnosis 1: 0.02% missing - most frequent imputation
  • Diagnosis 2: 0.35% missing - most frequent imputation
  • Diagnosis 3: 1.4% missing - most frequent imputation
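
The two imputation rules listed above (a constant fill and a most-frequent fill) can be sketched as follows. The helper names and the toy column values are hypothetical; this is a minimal illustration, not the project's actual code.

```python
from collections import Counter

def fill_constant(values, fill_value):
    """Replace missing (None) entries with a fixed value."""
    return [fill_value if v is None else v for v in values]

def fill_most_frequent(values):
    """Replace missing (None) entries with the most common observed value."""
    observed = [v for v in values if v is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in values]

# Toy columns with hypothetical values (not taken from the dataset)
race = ["Caucasian", None, "AfricanAmerican", "Caucasian"]
diag_1 = ["428", "250", None, "428"]

race_filled = fill_constant(race, "Other")   # constant fill, as for Race
diag_1_filled = fill_most_frequent(diag_1)   # mode fill, as for diagnoses
print(race_filled)
print(diag_1_filled)
```
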
Feature Engineering
  • Age: Categorized into 10 groups
  • Diagnosis: Mapped to 9 distinct categories
  • Categorical Encoding: One-hot encoding
  • Final Features: 151 after preprocessing
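
A minimal sketch of the age grouping and one-hot encoding steps is shown below. It assumes the 10 age groups are equal 10-year brackets, which the text does not state; `age_group` and `one_hot` are hypothetical helper names.

```python
def age_group(age, width=10, n_groups=10):
    """Map an age in years to a bracket label such as '[30-40)'.

    Assumes equal-width brackets; ages past the last bracket are clamped.
    """
    index = min(int(age) // width, n_groups - 1)
    low = index * width
    return f"[{low}-{low + width})"

def one_hot(values):
    """One-hot encode a list of categories; returns (column names, rows)."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

ages = [34, 71, 8, 79]  # hypothetical patient ages
groups = [age_group(a) for a in ages]
columns, encoded = one_hot(groups)
print(columns)
print(encoded)
```

Each categorical variable expands into one binary column per category, which is how a modest number of raw fields can grow to a feature count such as 151.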

Technical Challenges & Solutions

Data Imbalance

Only about 11% of records belong to the positive class (readmissions). Addressed this imbalance using SMOTE oversampling and random undersampling.

Missing Data

Filled missing Race values with 'Other' and imputed missing diagnosis codes with the most frequent value.

Feature Selection

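Engineered and selected features to support model performance: age grouped into 10 categories, diagnoses mapped to 9 categories, and categorical variables one-hot encoded, for 151 final features.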

Model Performance

Logistic Regression
  • AUC Score: 0.671
  • Precision: 0.177
  • Recall: 0.555
  • Accuracy: 0.671
Hyperparameters:
  • Penalty: L2
  • C: 0.1
  • Solver: liblinear
Random Forest
  • AUC Score: 0.665
  • Precision: 0.175
  • Recall: 0.550
  • Accuracy: 0.665
Hyperparameters:
  • n_estimators: 700
  • max_depth: 10
  • criterion: entropy
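
To put the reported precision and recall on a single scale, the sketch below computes the F1 score (their harmonic mean) for both models. F1 is not reported in the original results; the values here are derived purely from the precision and recall figures given above.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall) pairs as reported for each model
reported = {
    "Logistic Regression": (0.177, 0.555),
    "Random Forest": (0.175, 0.550),
}

f1 = {name: f1_score(p, r) for name, (p, r) in reported.items()}
for name, value in f1.items():
    print(f"{name}: F1 = {value:.3f}")
```

Both models come out at an F1 of roughly 0.27, reflecting the trade-off between catching over half of readmissions and the low precision of the resulting alerts.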

Class Imbalance Handling

SMOTE Oversampling

Synthetic Minority Over-sampling Technique to balance classes

  • Original positive samples: 5,744
  • After SMOTE: 45,138 samples per class
Random Undersampling

Reduced majority class to match minority

  • Original negative samples: 45,138
  • After undersampling: 5,744 samples per class
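
The two resampling strategies above can be sketched in plain Python. This is a simplified illustration on toy 2-D data with roughly an 11% minority share, not the project's implementation (which may have relied on a library such as imbalanced-learn). The function names, `k=5`, and the random data are all assumptions.

```python
import random

def random_undersample(majority, minority, rng):
    """Keep a random subset of the majority class the size of the minority."""
    return rng.sample(majority, len(minority)), list(minority)

def smote_like_oversample(minority, target_size, k, rng):
    """Add synthetic minority points by interpolating toward near neighbours."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    while len(minority) + len(synthetic) < target_size:
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself)
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: sq_dist(base, p))[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return list(minority) + synthetic

rng = random.Random(0)
# Toy data: 80 majority and 10 minority points (hypothetical)
majority = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(80)]
minority = [(rng.gauss(2, 1), rng.gauss(2, 1)) for _ in range(10)]

under_major, under_minor = random_undersample(majority, minority, rng)
over_minor = smote_like_oversample(minority, len(majority), k=5, rng=rng)

print(len(under_major), len(under_minor))  # both equal the minority size
print(len(majority), len(over_minor))      # both equal the majority size
```

Undersampling discards majority records, while the SMOTE-style approach keeps all data and adds interpolated minority points. In either case, resampling is normally applied to the training split only, so that evaluation reflects the real class ratio.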

Impact & Key Findings

Healthcare Impact

This model can help hospitals identify high-risk patients and implement preventive measures to reduce readmission rates.

  • Model Performance: AUC of 0.671 (Logistic Regression) and 0.665 (Random Forest), above the 0.5 expected from random guessing
  • Key Factors: Identified age, medication changes, and diagnosis type as crucial predictors
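  • Practical Impact: With recall of about 0.55, the models catch roughly 55% of readmissions, enabling proactive intervention; at a precision of about 0.18, most flagged patients are not readmitted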
  • Scale: Analysis of 100,000+ patient records across 130 US hospitals