Day-10 – Train/Test Split + Overfitting

Today we will learn one of the MOST important concepts in Machine Learning: checking a model on data it has never seen.

Without this:
❌ Models look good during training
❌ But fail in the real world


🎯 Goal of Day-10

You will:

✅ Understand train/test split
✅ Learn overfitting vs underfitting
✅ Build a more realistic ML workflow


🧠 Problem With Previous Days

Until now:

model.fit(X, y)
model.predict(X)

👉 We trained AND tested on the SAME data ❌

That’s cheating: the model is graded on answers it has already seen.
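To see why this is cheating, here is a minimal sketch. It assumes made-up random data and a 1-nearest-neighbour model (neither is part of this series so far). The labels are coin flips, so there is nothing real to learn, yet the score on the training data still looks perfect.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Random features and coin-flip labels: there is no real pattern here.
X_demo = rng.normal(size=(100, 2))
y_demo = rng.integers(0, 2, size=100)

# A 1-nearest-neighbour model simply remembers every training row.
memorizer = KNeighborsClassifier(n_neighbors=1)
memorizer.fit(X_demo, y_demo)

# Each row is its own nearest neighbour, so every "prediction" is a lookup.
same_data_acc = memorizer.score(X_demo, y_demo)
print("Accuracy on the data it trained on:", same_data_acc)
```

👉 A perfect score on seen data tells us nothing about new data.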


🧠 Real-Life Example

Imagine:

  • A student sees the exam questions before the exam
  • Scores 100%

Does that mean the student is smart? ❌

Not necessarily. The score only shows they remembered the answers.

Same in ML.


🚀 Solution → Train/Test Split

Split data:

  • Train set → model learns
  • Test set → model is evaluated

Common split:

80% Train
20% Test


🚀 Part 1 – Import Library

from sklearn.model_selection import train_test_split


🚀 Part 2 – Create Dataset

import pandas as pd

data = {
    "Hours": [1, 2, 3, 4, 5, 6, 7, 8],
    "Pass": [0, 0, 0, 0, 1, 1, 1, 1]
}

df = pd.DataFrame(data)

X = df[["Hours"]]
y = df["Pass"]


🚀 Part 3 – Split Data

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42
)


🧠 Meaning

Parameter            Meaning
test_size=0.2        20% test data
random_state=42      Same split every run
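A quick way to check both parameters is to split twice and compare. This is a small sketch using the same toy dataset; the `_a` and `_b` variable names are just for this illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "Hours": [1, 2, 3, 4, 5, 6, 7, 8],
    "Pass": [0, 0, 0, 0, 1, 1, 1, 1],
})
X = df[["Hours"]]
y = df["Pass"]

# Two splits with the same random_state should pick exactly the same rows.
X_train_a, X_test_a, y_train_a, y_test_a = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("Train rows:", len(X_train_a))
print("Test rows:", len(X_test_a))
print("Same test rows both times:", X_test_a.index.equals(X_test_b.index))
```

👉 With only 8 rows, 20% rounds up to 2 test rows, so accuracy on this toy dataset can only be 0.0, 0.5 or 1.0. Real datasets need far more test rows for a trustworthy number.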


🚀 Part 4 – Train Model

from sklearn.linear_model import LogisticRegression

model = LogisticRegression()

model.fit(X_train, y_train)


🚀 Part 5 – Test Model

y_pred = model.predict(X_test)

print(y_pred)


🚀 Part 6 – Accuracy

from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_test, y_pred)

print("Accuracy:", accuracy)
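What does accuracy_score actually compute? Just correct predictions divided by total predictions. Here is a small sketch with hypothetical labels (made up for this illustration, not output from our model) that checks the manual formula against sklearn.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels, made up for this illustration only.
y_true_demo = np.array([0, 1, 1, 0])
y_pred_demo = np.array([0, 1, 0, 0])

# Accuracy = correct predictions / total predictions.
manual_acc = (y_true_demo == y_pred_demo).mean()
sklearn_acc = accuracy_score(y_true_demo, y_pred_demo)

print("Manual accuracy:", manual_acc)    # 3 of 4 correct -> 0.75
print("sklearn accuracy:", sklearn_acc)  # 0.75
```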


🧠 Overfitting Explained

❌ Overfitting

Model memorizes the training data (including its noise) instead of learning the general pattern.

Typical symptom, for example:

Training accuracy:

100%

But test accuracy:

50%

👉 Bad real-world performance
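Overfitting is easy to reproduce. The sketch below assumes synthetic noisy data and an unlimited-depth decision tree (a model not covered in this series yet, used here only because it memorizes easily). Exact numbers depend on the data, but the gap between train and test is the point.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic data (an assumption for this demo): one useful feature plus label noise.
X_noisy = rng.normal(size=(400, 5))
y_noisy = (X_noisy[:, 0] + rng.normal(scale=1.0, size=400) > 0).astype(int)

Xn_train, Xn_test, yn_train, yn_test = train_test_split(
    X_noisy, y_noisy, test_size=0.2, random_state=42
)

# No depth limit: the tree keeps splitting until it fits every training row.
deep_tree = DecisionTreeClassifier(random_state=0)
deep_tree.fit(Xn_train, yn_train)

overfit_train_acc = deep_tree.score(Xn_train, yn_train)
overfit_test_acc = deep_tree.score(Xn_test, yn_test)

print("Train Accuracy:", overfit_train_acc)
print("Test Accuracy:", overfit_test_acc)
```

👉 The tree memorizes the noise in the training rows, so the train score is perfect while the test score drops.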


🧠 Underfitting

Model is too simple to capture the pattern in the data.

  • Bad training accuracy
  • Bad testing accuracy
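Underfitting can be sketched the same way. This example assumes synthetic data where the label is 1 at both extremes and 0 in the middle, so a plain LogisticRegression (one straight cut-off) cannot separate the classes.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic data (an assumption for this demo): label is 1 when x is far
# from zero on either side, so no single threshold separates the classes.
X_curve = rng.uniform(-3, 3, size=(1000, 1))
y_curve = (np.abs(X_curve[:, 0]) > 1.5).astype(int)

Xc_train, Xc_test, yc_train, yc_test = train_test_split(
    X_curve, y_curve, test_size=0.2, random_state=42
)

# A linear model on the raw feature is too simple for this shape.
simple_model = LogisticRegression()
simple_model.fit(Xc_train, yc_train)

underfit_train_acc = simple_model.score(Xc_train, yc_train)
underfit_test_acc = simple_model.score(Xc_test, yc_test)

print("Train Accuracy:", underfit_train_acc)
print("Test Accuracy:", underfit_test_acc)
```

👉 Both scores stay low. More training will not help; the model needs more capacity or better features.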

🧠 Ideal Model

Good:

  • Training accuracy
  • Testing accuracy

AND the two are close to each other.


📈 Visualization Concept

Think:

  • Train and test accuracy both good, and close to each other

This is usually healthier than:

  • 100% train
  • 50% test

🚀 Part 7 – Compare Train vs Test Accuracy

train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)

print("Train Accuracy:", train_acc)
print("Test Accuracy:", test_acc)


⚠ Important Interview Question

Q:

Why do we split train and test data?

Answer:

To evaluate model performance on unseen data. Scoring on the training data only shows how well the model remembers what it already saw.


🧠 Real AI Insight

In real projects:

  • Data leakage (test information sneaking into training) = huge issue
  • Overfitting = common problem
  • Evaluation matters as much as training
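One common source of data leakage is preprocessing the full dataset before splitting. Here is a minimal sketch of one way to avoid it, assuming synthetic data and a StandardScaler step (both are assumptions for this illustration, not part of today's workflow): split first, then let a pipeline learn the scaling from the training rows only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data (an assumption for this demo).
X_raw = rng.normal(loc=50, scale=10, size=(200, 3))
y_raw = (X_raw[:, 0] > 50).astype(int)

# Split FIRST, so the test rows stay unseen by every later step.
Xr_train, Xr_test, yr_train, yr_test = train_test_split(
    X_raw, y_raw, test_size=0.2, random_state=42
)

# fit() learns the scaling statistics from the training rows only;
# score() then reuses those statistics on the test rows.
leak_free_model = make_pipeline(StandardScaler(), LogisticRegression())
leak_free_model.fit(Xr_train, yr_train)

leak_free_test_acc = leak_free_model.score(Xr_test, yr_test)
print("Test Accuracy:", leak_free_test_acc)
```

👉 Calling the scaler on all rows before the split would let test-set statistics influence training, which is exactly the kind of leak to avoid.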

🎯 End of Day-10 Goals

You now:

✅ Understand train/test split
✅ Understand overfitting
✅ Test model properly


Github Link: https://github.com/dotnetfullstackdeveloper/ai-engineer-journey/blob/main/Week-02-Machine-Learning/Day-10
