Skip to main content

Command Palette

Search for a command to run...

When a Hypothesis Fights Back: My DataraFlow Research Log

Published
3 min read
When a Hypothesis Fights Back: My DataraFlow Research Log

What do you do when your model refuses to confirm your hypothesis? That moment Week became real research for me. Not tutorial data science. Not guided exercises. But a situation where the code runs perfectly… and still tells you something you didn’t expect.

This week I moved into predictive modeling for forecasting 30-day hospital readmissions for diabetes patients. The goal was to build a transparent, reproducible workflow that could survive scrutiny. And that meant every preprocessing decision, every model comparison and every metric had to be defendable.

And honestly? That’s where the hard work started.

Cleaning Before Intelligence

The dataset was messy in the way real healthcare data always is. Administrative IDs, sparse columns, inconsistent missing value markers and diagnosis codes that looked numeric but behaved like categories.

The first major decision was feature pruning. Dropping irrelevant identifiers reduced dimensionality and forced me to ask what information actually helps prediction? Removing noise is modeling philosophy.

Then came missing value handling. The dataset used ‘?‘ instead of true nulls, which meant pandas didn’t even recognize them at first. That small detail caused silent errors until I explicitly converted them. This beacame one of the hardest lessons of the week that preprocessing bugs can quietly corrupt your pipeline.

Imputation strategies had to be intentional:

  • Mode for categorical race data

  • Constant replacement for diagnosis codes

  • Explicit ‘missing’ categories for lab values

Each decision shaped how the model interpreted the world.

# Drop irrelevant column
Diabetic_data = Diabetic_data.drop(columns=['encounter_id', 'patient_nbr', 'weight', 'admission_type_id', 'discharge_disposition_id', 'admission_source_id', 'payer_code', 'medical_specialty'])
# Converts every '?' in the dataset into a true missing value that pandas recognizes
Diabetic_data.replace('?', pd.NA, inplace=True)

Hypothesis vs Reality

My first hypothesis was bold:

Random Forest would outperform Logistic Regression.

On paper, it made sense. Random Forests capture nonlinear patterns. Logistic Regression is simple, but data science only rewards evidence.

After encoding, splitting and training, the ROC curves told a different story. Logistic Regression achieved higher AUC and F1 scores. Not by a lot, but enough to overturn the hypothesis.

That moment was uncomfortable and exciting at the same time. It forced me to confront a core truth of research:

You don’t prove yourself right. You test whether you’re wrong.

The hardest part of this section was validating the evaluation pipeline. Ensuring probabilities were used correctly. Avoiding leakage and confirming label encoding didn’t bias results. Every metric had to mean what I thought it meant.

# Train Logistic Regression
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, Y_train)

lr_pred = lr.predict(X_test)
lr_prob = lr.predict_proba(X_test)[:, 1]
# Train Random Forest

from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=300, random_state=0)
rf.fit(X_train, Y_train)

rf_pred = rf.predict(X_test)
rf_prob = rf.predict_proba(X_test)[:, 1]

Debugging the Invisible

Some of the toughest codes weren’t flashy. It was the small pipeline glue that holds everything together:

  • Reshaping arrays after imputation

  • Encoding high-cardinality categorical features

  • Preventing dimension mismatches

  • Handling pandas ↔ numpy conversions

  • Maintaining alignment between X and Y

These errors rarely throw dramatic exceptions. They just produce subtly wrong outputs and spotting those requires discipline, not luck.

Good data science is 70% verification.


Closing Thoughts

This week felt less like coursework and more like stepping into the mindset of a researcher.

Real data resisted assumptions. The code demanded patience, metrics forced honesty and the final result wasn’t a victory lap — it was a conversation with evidence.

That’s what made it powerful.

Machine learning isn’t about proving you’re clever. It’s about building systems that tell the truth, even when it surprises you. And this week, the truth was more valuable than being right.

More from this blog

The First Week of the Rest of My Career

12 posts