Resilient Clinical Outcome Prediction from Noisy Medical Records

Distributed PySpark Pipeline with Adversarial Noise Simulation

CS4074 – Big Data Analytics | Spring 2026

Project Overview

This project builds a scalable, distributed PySpark pipeline that predicts 30-day hospital readmission for diabetic patients from the UCI Diabetes 130-US Hospitals dataset (101,766 patient encounters, 50 features). To address noisy and incomplete healthcare records, the pipeline simulates four kinds of adversarial corruption: NULL injection, sentinel amplification, label flipping, and outlier injection. Spark MLlib then handles distributed preprocessing, feature engineering, model training, and evaluation.
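
The four corruption types named above can be illustrated with a small plain-Python sketch. This is not the project's PySpark implementation. The function name `corrupt`, the `rate` and `seed` parameters, the binary `label` field, the `_noise` bookkeeping field, the "?" sentinel, and the 50x outlier multiplier are all assumptions made for illustration. `race` and `num_medications` are used as example columns.

```python
import random

NOISE_KINDS = ["null", "sentinel", "flip", "outlier"]

def corrupt(rows, rate=0.2, seed=0):
    """Return a corrupted copy of rows; each row is hit with probability `rate`."""
    rng = random.Random(seed)  # seeded so a noise scenario can be reproduced
    out = []
    for row in rows:
        row = dict(row)  # copy so the caller's data is never mutated
        kind = None
        if rng.random() < rate:
            kind = rng.choice(NOISE_KINDS)
            if kind == "null":
                # NULL injection: blank out a numeric field
                row["num_medications"] = None
            elif kind == "sentinel":
                # Sentinel amplification: overwrite a value with the missing-value marker
                row["race"] = "?"
            elif kind == "flip":
                # Label flipping: invert the (assumed binary) target
                row["label"] = 1 - row["label"]
            else:
                # Outlier injection: scale a numeric field far outside its normal range
                row["num_medications"] = row["num_medications"] * 50
        row["_noise"] = kind  # hypothetical bookkeeping column, handy for measuring recovery
        out.append(row)
    return out

rows = [{"race": "Caucasian", "num_medications": 10, "label": 0} for _ in range(100)]
noisy = corrupt(rows, rate=0.3, seed=7)
print(sum(r["_noise"] is not None for r in noisy), "of", len(noisy), "rows corrupted")
```

Recording which corruption was applied to each row makes it possible to check later how many corrupted records a cleaning step restores.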

Pipeline Architecture

Raw Data → Noise Injection → Data Cleaning → Feature Engineering → MLlib Preprocessing → Model Training → Evaluation → Scalability Analysis
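
As a rough sketch of how such a stage chain composes, each stage can be modeled as a function from rows to rows and applied in order. The two stages shown (`replace_sentinel`, `drop_incomplete`) and the runner `run_pipeline` are hypothetical stand-ins written in plain Python. The project's actual stages run on Spark DataFrames and are not reproduced here.

```python
from functools import reduce

def replace_sentinel(rows, sentinel="?"):
    """Hypothetical cleaning stage: turn the sentinel marker into a real missing value."""
    return [{k: (None if v == sentinel else v) for k, v in r.items()} for r in rows]

def drop_incomplete(rows):
    """Hypothetical cleaning stage: drop rows that still contain missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]

def run_pipeline(rows, stages):
    """Feed the output of each stage into the next, left to right."""
    return reduce(lambda data, stage: stage(data), stages, rows)

raw = [
    {"race": "?", "age": 50},
    {"race": "Asian", "age": 60},
]
cleaned = run_pipeline(raw, [replace_sentinel, drop_incomplete])
print(cleaned)
```

Keeping every stage to the same rows-in, rows-out shape means stages can be reordered or removed without touching the others.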

Models Used

Logistic Regression
Random Forest
Decision Tree

Key Findings

Logistic Regression achieved the best balance of precision and recall (F1 score = 0.710)
Random Forest achieved the most stable discriminative performance (AUC-ROC = 0.589)
Decision Tree performed weakly on high-dimensional clinical data (AUC-ROC = 0.498, roughly chance level)
The 7-step cleaning pipeline recovered 89.97% of corrupted records
A 300% increase in dataset volume caused only a 29% increase in training time
This indicates sub-linear scaling of training time with data volume and efficient Spark parallelization
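
The scaling figures above can be turned into a single number with a short calculation. The sketch below assumes that a "300% increase" means the dataset grew to 4x its baseline size and that training time follows a power law, time ∝ volume^k, fitted to just those two points. Both are assumptions, and the helper `scaling_exponent` is written for illustration only.

```python
import math

def scaling_exponent(volume_increase_pct, time_increase_pct):
    """Estimate k in time ∝ volume**k from two percentage increases."""
    volume_ratio = 1 + volume_increase_pct / 100  # 300% increase -> 4.0x
    time_ratio = 1 + time_increase_pct / 100      # 29% increase  -> 1.29x
    return math.log(time_ratio) / math.log(volume_ratio)

k = scaling_exponent(300, 29)
print(f"estimated scaling exponent k = {k:.3f}")
```

Linear scaling corresponds to k = 1. Under these assumptions the reported figures give k of roughly 0.18, well below 1. A two-point fit cannot separate fixed job overhead from genuine parallel speedup, so the value is best read as a summary rather than a measurement.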

Interactive Visualizations

Team Members

Project Links

GitHub Repository
Live Demo
Technical Report (PDF)