This tutorial builds a complete ML pipeline to predict real estate prices: data exploration, preprocessing, feature engineering, training of multiple models, cross-validation, and selection of the best model.
import pandas as pd
import numpy as np
from sklearn.datasets import fetch_california_housing
# ── Load California Housing dataset ──
housing = fetch_california_housing(as_frame=True)
df = housing.frame
print(df.shape) # (20640, 9)
print(df.info())
print(df.describe())
# Columns:
# MedInc      → median household income
# HouseAge    → median house age
# AveRooms    → average rooms per household
# AveBedrms   → average bedrooms per household
# Population  → block population
# AveOccup    → average household occupancy
# Latitude, Longitude
# MedHouseVal → TARGET PRICE (in units of $100,000)
# ── Analyze target variable ──
print(f"Mean price: ${df['MedHouseVal'].mean() * 100_000:,.0f}")
print(f"Median price: ${df['MedHouseVal'].median() * 100_000:,.0f}")
# ── Detect outliers ──
Q1 = df['MedHouseVal'].quantile(0.25)
Q3 = df['MedHouseVal'].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df['MedHouseVal'] < Q1 - 1.5*IQR) | (df['MedHouseVal'] > Q3 + 1.5*IQR)]
print(f"Outliers: {len(outliers)} ({len(outliers)/len(df)*100:.1f}%)")
# ── Correlations with target ──
correlations = df.corr()['MedHouseVal'].sort_values(ascending=False)
print("\nCorrelations with price:")
print(correlations)
.info() shows dtypes and missing values; .describe() gives descriptive statistics (mean, std, quartiles). Outlier detection uses the IQR (interquartile range): outliers are values more than 1.5 × IQR above Q3 or below Q1. Pearson correlations identify the features most strongly linked to the target price.

Definition: A Pipeline is a chain of transformations plus a final model, executed in order. Each step transforms the data for the next, until the model makes predictions.
Purpose: Guarantee the same preprocessing is automatically applied at training and prediction, preventing data leakage.
Why here: Without a Pipeline, it's easy to forget to apply the training-time scaling to the test data, which silently produces wildly wrong predictions. A Pipeline eliminates this common error by encapsulating the preprocessing.
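As a minimal, self-contained sketch of the idea (synthetic data; variable names are illustrative), a two-step Pipeline fits the scaler and the model together, so the exact same scaling learned on the training data is reused at prediction time:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

# Tiny synthetic dataset: y = 3x + 1, noise-free
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 3.0 * X.ravel() + 1.0

# The scaler is fit inside the pipeline during fit(),
# and the same transformation is applied automatically in predict()
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LinearRegression())
])
pipe.fit(X, y)
print(pipe.predict([[10.0]]))  # follows the linear trend: ~31.0
```

The key point is that you never call the scaler on the test data yourself; the pipeline guarantees train and test go through identical preprocessing.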
import pandas as pd
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
housing = fetch_california_housing(as_frame=True)
df = housing.frame
# ── Feature Engineering ──
df['rooms_per_household'] = df['AveRooms'] / df['AveOccup']
df['bedrooms_ratio'] = df['AveBedrms'] / df['AveRooms']
df['population_per_household'] = df['Population'] / df['AveOccup']
df['income_per_room'] = df['MedInc'] / df['AveRooms']
# Log-transform for skewed distributions
df['log_population'] = np.log1p(df['Population'])
# ── Separate features/target ──
X = df.drop('MedHouseVal', axis=1)
y = df['MedHouseVal']
# ── Split train/test (80/20) ──
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
print(f"Train: {len(X_train)} | Test: {len(X_test)}")
# ── Preprocessing pipeline ──
numeric_features = X.columns.tolist()
numeric_transformer = Pipeline([
('imputer', SimpleImputer(strategy='median')), # Handle NaN
('scaler', StandardScaler()) # Standardize
])
preprocessor = ColumnTransformer([
('num', numeric_transformer, numeric_features)
])
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.pipeline import Pipeline
import numpy as np
# ── Define models to compare ──
models = {
'Linear Regression': LinearRegression(),
'Ridge': Ridge(alpha=1.0),
'Lasso': Lasso(alpha=0.1),
'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1),
'Gradient Boosting': GradientBoostingRegressor(n_estimators=100, random_state=42),
}
results = {}
for name, model in models.items():
    # Pipeline = preprocessor + model
    pipe = Pipeline([
        ('preprocessor', preprocessor),
        ('model', model)
    ])
    # 5-fold cross-validation: split into 5 folds, train 5 times, evaluate on each fold
    cv_scores = cross_val_score(
        pipe, X_train, y_train,
        cv=5,
        scoring='neg_root_mean_squared_error',
        n_jobs=-1
    )
    rmse_cv = -cv_scores.mean()
    # 5 scores computed on 5 different folds, averaged for robustness
    # Train on full train set and evaluate on test set
    pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # RMSE (the squared= parameter was removed in recent scikit-learn)
    mae = mean_absolute_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    results[name] = {
        'CV RMSE': round(rmse_cv, 4),
        'Test RMSE': round(rmse, 4),
        'Test MAE': round(mae, 4),
        'R²': round(r2, 4)
    }
    print(f"✅ {name} → R²={r2:.3f}, RMSE={rmse:.4f}")
# ── Results table ──
import pandas as pd
results_df = pd.DataFrame(results).T.sort_values('R²', ascending=False)
print("\n" + results_df.to_string())
Definition: Technique that divides the training data into k groups (folds), trains the model k times using each fold once as the validation set and the other k−1 folds as the training set, then averages the k scores. With k=5: 5 runs × (4 folds for training + 1 fold for validation).
Purpose: Obtain more robust and reliable model evaluation on limited data, without wasting data on static test set.
Why here: A simple train/test split can be misleading if you get unlucky with random draw (an "easy" or "hard" validation fold). Cross-validation does k independent experiments and averages results โ much more stable and recommended in practice. It's the standard for evaluating models before using final test set.
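A small sketch of the mechanics on synthetic data (names are illustrative): KFold produces the 5 train/validation splits, and cross_val_score returns one score per fold:

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression

# Synthetic data with a perfect linear relationship
X = np.arange(100, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel()

kf = KFold(n_splits=5, shuffle=True, random_state=0)
# Each fold serves as the validation set exactly once: 80 train / 20 val
for train_idx, val_idx in kf.split(X):
    print(len(train_idx), len(val_idx))  # 80 20, five times

scores = cross_val_score(LinearRegression(), X, y, cv=kf, scoring='r2')
print(scores.shape)   # (5,) -> one R² per fold
print(scores.mean())  # the averaged, more robust estimate
```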
Definition: Situation where a model learns noise and specific quirks of training data instead of general patterns. It achieves excellent training performance but poor test performance.
Purpose: Recognize and avoid overfitting by comparing train vs test performance and cross-validation.
Why here: Overfitting is the main ML pitfall. A model can look excellent on training data (R²=0.99) but be completely useless in production (R²=0.2). Here, we use cross-validation and an independent test set to detect it quickly.
Analogy: Like memorizing exam answers without understanding the subject โ perfect on that exam but fail on surprise test.
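A toy demonstration of the symptom (synthetic noisy data; the exact numbers are illustrative): an unconstrained decision tree scores near-perfectly on its training set but much worse on held-out data:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.5, size=300)  # signal + heavy noise

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# An unconstrained tree memorizes every training point, noise included
deep = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)
train_r2 = r2_score(y_tr, deep.predict(X_tr))
test_r2 = r2_score(y_te, deep.predict(X_te))
print(f"train R²={train_r2:.3f}, test R²={test_r2:.3f}")  # large gap = overfitting
```

The train/test gap is the tell-tale sign; limiting max_depth or min_samples_leaf would shrink it.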
Definition: A model parameter you must set BEFORE training (unlike weights/coefficients learned during training). Examples: number of trees in Random Forest (n_estimators), max tree depth (max_depth), learning rate (learning_rate) in Gradient Boosting, regularization strength (alpha) in Ridge.
Purpose: Control model complexity, speed and behavior.
Why here: Hyperparameters can change performance by an order of magnitude or more. Tuning them correctly is crucial for getting the best model; in practice, a large share of ML development time goes into hyperparameter optimization.
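A tiny sketch of the distinction (synthetic data): alpha is a hyperparameter set before fitting, while coef_ is a parameter learned during fitting; heavier regularization shrinks the learned slope:

```python
import numpy as np
from sklearn.linear_model import Ridge

X = np.arange(20, dtype=float).reshape(-1, 1)
y = 4.0 * X.ravel() + 2.0  # true slope is 4

# alpha is chosen BEFORE training
weak = Ridge(alpha=0.01).fit(X, y)
strong = Ridge(alpha=1000.0).fit(X, y)

# coef_ is learned DURING training; strong regularization
# pulls the slope toward zero
print(weak.coef_[0], strong.coef_[0])  # ~4.0 vs. a much smaller value
```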
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance
# ── Search for hyperparameters ──
param_grid = {
'model__n_estimators': [100, 200],
'model__max_depth': [3, 5, 7],
'model__learning_rate': [0.05, 0.1, 0.2],
'model__subsample': [0.8, 1.0]
}
best_pipe = Pipeline([
('preprocessor', preprocessor),
('model', GradientBoostingRegressor(random_state=42))
])
grid_search = GridSearchCV(
best_pipe, param_grid,
cv=5, scoring='neg_root_mean_squared_error',
n_jobs=-1, verbose=1
)
grid_search.fit(X_train, y_train)
print(f"Best params: {grid_search.best_params_}")
print(f"Best CV score: {-grid_search.best_score_:.4f}")
# GridSearchCV tests all param combinations: 2 × 3 × 3 × 2 = 36 models
# ── Feature importance ──
best_model = grid_search.best_estimator_
y_pred_final = best_model.predict(X_test)
perm_importance = permutation_importance(
best_model, X_test, y_test, n_repeats=10, random_state=42
)
importance_df = pd.DataFrame({
'feature': X.columns,
'importance': perm_importance.importances_mean
}).sort_values('importance', ascending=False)
print("\nFeature importance:")
print(importance_df.head(8))
Definition: Scikit-learn utility that tests all possible combinations of hyperparameters in a defined grid, using cross-validation to evaluate each combination, and returns best parameters with best performance.
Purpose: Automatically optimize hyperparameters without guesswork or manual trial-and-error.
Why here: Standard and systematic way to select the best hyperparameters. "Grid search" tests the Cartesian product of all params, e.g. [100, 200] × [3, 5, 7] × [0.05, 0.1, 0.2] × [0.8, 1.0] = 2×3×3×2 = 36 combinations. Each is evaluated with 5-fold cross-validation, so 36 × 5 = 180 models are trained in total.
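To sanity-check that arithmetic, scikit-learn's ParameterGrid enumerates the same Cartesian product that GridSearchCV iterates over:

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.05, 0.1, 0.2],
    'subsample': [0.8, 1.0],
}
grid = list(ParameterGrid(param_grid))
print(len(grid))      # 2 * 3 * 3 * 2 = 36 combinations
print(len(grid) * 5)  # with 5-fold CV: 180 fitted models
print(grid[0])        # one concrete combination (a plain dict)
```

This is also a cheap way to estimate how long a grid search will take before launching it.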
Definition: Measure of each feature's impact on model predictions. Multiple ways to calculate it: importance based on tree splits (information gain), or permutation_importance which measures performance drop if you randomly shuffle that feature.
Purpose: Understand which features contribute most to predictions and identify useless features.
Why here: Feature importance helps interpret black-box models (Gradient Boosting) and identify features you could remove without performance loss. Permutation importance is model-agnostic and more reliable than tree-based importance.
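The core of permutation importance can be sketched by hand (synthetic data; feature 0 is constructed to matter far more than feature 1): shuffle one column at a time and measure the drop in R²:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.RandomState(0)
X = rng.normal(size=(500, 2))
y = 5.0 * X[:, 0] + 0.1 * X[:, 1]  # feature 0 dominates the target

model = LinearRegression().fit(X, y)
baseline = r2_score(y, model.predict(X))

# Shuffle one column at a time: destroying an important feature's
# link to the target causes a large performance drop
drops = []
for j in range(X.shape[1]):
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])
    drops.append(baseline - r2_score(y, model.predict(Xp)))
print(drops)  # large drop for feature 0, near-zero for feature 1
```

sklearn's permutation_importance does exactly this, repeated n_repeats times and averaged.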
import joblib
import numpy as np
import pandas as pd
# ── Save the complete pipeline ──
joblib.dump(best_model, 'housing_model.pkl')
print("✅ Model saved")
# ── Load and use ──
loaded_model = joblib.load('housing_model.pkl')
# Predict for a new house
# [MedInc, HouseAge, AveRooms, AveBedrms, Population, AveOccup, Lat, Long, ...]
new_house = pd.DataFrame([{
'MedInc': 5.5, # Median income $55K
'HouseAge': 15, # 15 years
'AveRooms': 6.0,
'AveBedrms': 1.1,
'Population': 1500,
'AveOccup': 3.0,
'Latitude': 37.5,
'Longitude': -122.0,
'rooms_per_household': 2.0,
'bedrooms_ratio': 0.18,
'population_per_household': 500,
'income_per_room': 0.92,
'log_population': np.log1p(1500)
}])
predicted_price = loaded_model.predict(new_house)[0]
print(f"Predicted price: ${predicted_price * 100_000:,.0f}")
Definition: Technique that adds a penalty on the model's coefficients during training to keep the model simple and avoid overfitting. Ridge (L2) penalizes the sum of squared coefficients; Lasso (L1) penalizes the sum of their absolute values.
Purpose: Reduce overfitting by penalizing complex models with large weights.
Why here: Ridge and Lasso are the basic regularized regression models. Ridge applies a soft penalty (all coefficients are kept but shrunk in magnitude), while Lasso can force some coefficients exactly to zero (automatic feature selection). L2 (Ridge) is numerically more stable; L1 (Lasso) is more interpretable thanks to its natural zeros.
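A compact illustration of the difference (synthetic data with three irrelevant features; the alpha values are illustrative): Lasso zeroes out the irrelevant coefficients, while Ridge only shrinks them:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 5))
# Only the first two features actually drive the target
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)

# Ridge keeps all 5 coefficients non-zero (just smaller);
# Lasso drives the 3 irrelevant ones exactly to zero
print(np.round(ridge.coef_, 3))
print(np.round(lasso.coef_, 3))
print(int((lasso.coef_ == 0).sum()))  # irrelevant features eliminated
```

This sparsity is why Lasso doubles as a feature-selection tool, at the cost of also shrinking the useful coefficients.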