This tutorial builds a complete ML pipeline to predict real estate prices: data exploration, preprocessing, feature engineering, training of multiple models, cross-validation, and selection of the best model.
import pandas as pd
import numpy as np
from sklearn.datasets import fetch_california_housing
# ── Load California Housing dataset ──
housing = fetch_california_housing(as_frame=True)
df = housing.frame
print(df.shape) # (20640, 9)
print(df.info())
print(df.describe())
# Columns:
# MedInc      → median household income
# HouseAge    → median house age
# AveRooms    → average rooms per household
# AveBedrms   → average bedrooms per household
# Population  → block population
# AveOccup    → average household occupancy
# Latitude, Longitude
# MedHouseVal → TARGET PRICE (in units of $100,000)
# ── Analyze target variable ──
print(f"Mean price: ${df['MedHouseVal'].mean() * 100_000:,.0f}")
print(f"Median price: ${df['MedHouseVal'].median() * 100_000:,.0f}")
# ── Detect outliers ──
Q1 = df['MedHouseVal'].quantile(0.25)
Q3 = df['MedHouseVal'].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df['MedHouseVal'] < Q1 - 1.5*IQR) | (df['MedHouseVal'] > Q3 + 1.5*IQR)]
print(f"Outliers: {len(outliers)} ({len(outliers)/len(df)*100:.1f}%)")
# ── Correlations with target ──
correlations = df.corr()['MedHouseVal'].sort_values(ascending=False)
print("\nCorrelations with price:")
print(correlations)
.info() shows dtypes and missing values; .describe() gives descriptive statistics (mean, std, quartiles). Outlier detection uses the IQR (interquartile range): outliers are values more than 1.5 × IQR above Q3 or below Q1. Pearson correlations identify the features most strongly linked to the target price.

Definition: A Pipeline is a chain of transformations plus a final model, executed in order. Each step transforms the data for the next, until the model makes predictions.
Purpose: Guarantee the same preprocessing is automatically applied at training and prediction, preventing data leakage.
Why here: Without a Pipeline, it's easy to forget to apply the training-time scaling to the test data, which silently produces wildly wrong predictions. A Pipeline eliminates this common error by encapsulating the preprocessing.
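As a minimal, self-contained sketch of the idea (synthetic data; variable names are illustrative), a two-step Pipeline fits the scaler and the model together, so the exact same scaling learned on the training data is reused at prediction time:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

# Tiny synthetic dataset: y = 3x + 1, noise-free
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 3.0 * X.ravel() + 1.0

# The scaler is fit inside the pipeline during fit(),
# and the same transformation is applied automatically in predict()
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LinearRegression())
])
pipe.fit(X, y)
print(pipe.predict([[10.0]]))  # follows the linear trend: ~31.0
```

The key point is that you never call the scaler on the test data yourself; the pipeline guarantees train and test go through identical preprocessing.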
import pandas as pd
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
housing = fetch_california_housing(as_frame=True)
df = housing.frame
# ── Feature Engineering ──
df['rooms_per_household'] = df['AveRooms'] / df['AveOccup']
df['bedrooms_ratio'] = df['AveBedrms'] / df['AveRooms']
df['population_per_household'] = df['Population'] / df['AveOccup']
df['income_per_room'] = df['MedInc'] / df['AveRooms']
# Log-transform for skewed distributions
df['log_population'] = np.log1p(df['Population'])
# ── Separate features/target ──
X = df.drop('MedHouseVal', axis=1)
y = df['MedHouseVal']
# ── Split train/test (80/20) ──
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
print(f"Train: {len(X_train)} | Test: {len(X_test)}")
# ── Preprocessing pipeline ──
numeric_features = X.columns.tolist()
numeric_transformer = Pipeline([
('imputer', SimpleImputer(strategy='median')), # Handle NaN
('scaler', StandardScaler()) # Standardize
])
preprocessor = ColumnTransformer([
('num', numeric_transformer, numeric_features)
])
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.pipeline import Pipeline
import numpy as np
# ── Define models to compare ──
models = {
'Linear Regression': LinearRegression(),
'Ridge': Ridge(alpha=1.0),
'Lasso': Lasso(alpha=0.1),
'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1),
'Gradient Boosting': GradientBoostingRegressor(n_estimators=100, random_state=42),
}
results = {}
for name, model in models.items():
    # Pipeline = preprocessor + model
    pipe = Pipeline([
        ('preprocessor', preprocessor),
        ('model', model)
    ])
    # 5-fold cross-validation: split into 5 folds, train 5 times, evaluate on each fold
    cv_scores = cross_val_score(
        pipe, X_train, y_train,
        cv=5,
        scoring='neg_root_mean_squared_error',
        n_jobs=-1
    )
    rmse_cv = -cv_scores.mean()
    # 5 scores computed on 5 different folds, averaged for robustness
    # Train on full train set and evaluate on test set
    pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # RMSE (the squared= parameter was removed in recent scikit-learn)
    mae = mean_absolute_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    results[name] = {
        'CV RMSE': round(rmse_cv, 4),
        'Test RMSE': round(rmse, 4),
        'Test MAE': round(mae, 4),
        'R²': round(r2, 4)
    }
    print(f"✅ {name} → R²={r2:.3f}, RMSE={rmse:.4f}")
# ── Results table ──
import pandas as pd
results_df = pd.DataFrame(results).T.sort_values('R²', ascending=False)
print("\n" + results_df.to_string())
Definition: Technique that divides the training data into k groups (folds), trains the model k times using each fold once as the validation set and the other k−1 folds as the training set, then averages the k scores. With k=5: 5 runs × (4 folds for training + 1 fold for validation).
Purpose: Obtain more robust and reliable model evaluation on limited data, without wasting data on static test set.
Why here: A simple train/test split can be misleading if you get unlucky with random draw (an "easy" or "hard" validation fold). Cross-validation does k independent experiments and averages results โ much more stable and recommended in practice. It's the standard for evaluating models before using final test set.
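A small sketch of the mechanics on synthetic data (names are illustrative): KFold produces the 5 train/validation splits, and cross_val_score returns one score per fold:

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression

# Synthetic data with a perfect linear relationship
X = np.arange(100, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel()

kf = KFold(n_splits=5, shuffle=True, random_state=0)
# Each fold serves as the validation set exactly once: 80 train / 20 val
for train_idx, val_idx in kf.split(X):
    print(len(train_idx), len(val_idx))  # 80 20, five times

scores = cross_val_score(LinearRegression(), X, y, cv=kf, scoring='r2')
print(scores.shape)   # (5,) -> one R² per fold
print(scores.mean())  # the averaged, more robust estimate
```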
Definition: Situation where a model learns noise and specific quirks of training data instead of general patterns. It achieves excellent training performance but poor test performance.
Purpose: Recognize and avoid overfitting by comparing train vs test performance and cross-validation.
Why here: Overfitting is the main ML pitfall. A model can look excellent on training data (R²=0.99) but be completely useless in production (R²=0.2). Here, we use cross-validation and an independent test set to detect it quickly.
Analogy: Like memorizing exam answers without understanding the subject โ perfect on that exam but fail on surprise test.
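A toy demonstration of the symptom (synthetic noisy data; the exact numbers are illustrative): an unconstrained decision tree scores near-perfectly on its training set but much worse on held-out data:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.5, size=300)  # signal + heavy noise

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# An unconstrained tree memorizes every training point, noise included
deep = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)
train_r2 = r2_score(y_tr, deep.predict(X_tr))
test_r2 = r2_score(y_te, deep.predict(X_te))
print(f"train R²={train_r2:.3f}, test R²={test_r2:.3f}")  # large gap = overfitting
```

The train/test gap is the tell-tale sign; limiting max_depth or min_samples_leaf would shrink it.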
Definition: A model parameter you must set BEFORE training (unlike weights/coefficients learned during training). Examples: number of trees in Random Forest (n_estimators), max tree depth (max_depth), learning rate (learning_rate) in Gradient Boosting, regularization strength (alpha) in Ridge.
Purpose: Control model complexity, speed and behavior.
Why here: Hyperparameters can change performance by an order of magnitude or more. Tuning them correctly is crucial for getting the best model; in practice, a large share of ML development time goes into hyperparameter optimization.
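A tiny sketch of the distinction (synthetic data): alpha is a hyperparameter set before fitting, while coef_ is a parameter learned during fitting; heavier regularization shrinks the learned slope:

```python
import numpy as np
from sklearn.linear_model import Ridge

X = np.arange(20, dtype=float).reshape(-1, 1)
y = 4.0 * X.ravel() + 2.0  # true slope is 4

# alpha is chosen BEFORE training
weak = Ridge(alpha=0.01).fit(X, y)
strong = Ridge(alpha=1000.0).fit(X, y)

# coef_ is learned DURING training; strong regularization
# pulls the slope toward zero
print(weak.coef_[0], strong.coef_[0])  # ~4.0 vs. a much smaller value
```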
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance
# ── Search for hyperparameters ──
param_grid = {
'model__n_estimators': [100, 200],
'model__max_depth': [3, 5, 7],
'model__learning_rate': [0.05, 0.1, 0.2],
'model__subsample': [0.8, 1.0]
}
best_pipe = Pipeline([
('preprocessor', preprocessor),
('model', GradientBoostingRegressor(random_state=42))
])
grid_search = GridSearchCV(
best_pipe, param_grid,
cv=5, scoring='neg_root_mean_squared_error',
n_jobs=-1, verbose=1
)
grid_search.fit(X_train, y_train)
print(f"Best params: {grid_search.best_params_}")
print(f"Best CV score: {-grid_search.best_score_:.4f}")
# GridSearchCV tests all param combinations: 2 × 3 × 3 × 2 = 36 models
# ── Feature importance ──
best_model = grid_search.best_estimator_
y_pred_final = best_model.predict(X_test)
perm_importance = permutation_importance(
best_model, X_test, y_test, n_repeats=10, random_state=42
)
importance_df = pd.DataFrame({
'feature': X.columns,
'importance': perm_importance.importances_mean
}).sort_values('importance', ascending=False)
print("\nFeature importance:")
print(importance_df.head(8))
Definition: Scikit-learn utility that tests all possible combinations of hyperparameters in a defined grid, using cross-validation to evaluate each combination, and returns best parameters with best performance.
Purpose: Automatically optimize hyperparameters without guesswork or manual trial-and-error.
Why here: Standard and systematic way to select the best hyperparameters. "Grid search" tests the Cartesian product of all params, e.g. [100, 200] × [3, 5, 7] × [0.05, 0.1, 0.2] × [0.8, 1.0] = 2×3×3×2 = 36 combinations. Each is evaluated with 5-fold cross-validation, so 36 × 5 = 180 models are trained in total.
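To sanity-check that arithmetic, scikit-learn's ParameterGrid enumerates the same Cartesian product that GridSearchCV iterates over:

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.05, 0.1, 0.2],
    'subsample': [0.8, 1.0],
}
grid = list(ParameterGrid(param_grid))
print(len(grid))      # 2 * 3 * 3 * 2 = 36 combinations
print(len(grid) * 5)  # with 5-fold CV: 180 fitted models
print(grid[0])        # one concrete combination (a plain dict)
```

This is also a cheap way to estimate how long a grid search will take before launching it.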
Definition: Measure of each feature's impact on model predictions. Multiple ways to calculate it: importance based on tree splits (information gain), or permutation_importance which measures performance drop if you randomly shuffle that feature.
Purpose: Understand which features contribute most to predictions and identify useless features.
Why here: Feature importance helps interpret black-box models (Gradient Boosting) and identify features you could remove without performance loss. Permutation importance is model-agnostic and more reliable than tree-based importance.
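The core of permutation importance can be sketched by hand (synthetic data; feature 0 is constructed to matter far more than feature 1): shuffle one column at a time and measure the drop in R²:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.RandomState(0)
X = rng.normal(size=(500, 2))
y = 5.0 * X[:, 0] + 0.1 * X[:, 1]  # feature 0 dominates the target

model = LinearRegression().fit(X, y)
baseline = r2_score(y, model.predict(X))

# Shuffle one column at a time: destroying an important feature's
# link to the target causes a large performance drop
drops = []
for j in range(X.shape[1]):
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])
    drops.append(baseline - r2_score(y, model.predict(Xp)))
print(drops)  # large drop for feature 0, near-zero for feature 1
```

sklearn's permutation_importance does exactly this, repeated n_repeats times and averaged.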
import joblib
import numpy as np
import pandas as pd
# ── Save the complete pipeline ──
joblib.dump(best_model, 'housing_model.pkl')
print("✅ Model saved")
# ── Load and use ──
loaded_model = joblib.load('housing_model.pkl')
# Predict for a new house
# [MedInc, HouseAge, AveRooms, AveBedrms, Population, AveOccup, Lat, Long, ...]
new_house = pd.DataFrame([{
'MedInc': 5.5, # Median income $55K
'HouseAge': 15, # 15 years
'AveRooms': 6.0,
'AveBedrms': 1.1,
'Population': 1500,
'AveOccup': 3.0,
'Latitude': 37.5,
'Longitude': -122.0,
'rooms_per_household': 2.0,
'bedrooms_ratio': 0.18,
'population_per_household': 500,
'income_per_room': 0.92,
'log_population': np.log1p(1500)
}])
predicted_price = loaded_model.predict(new_house)[0]
print(f"Predicted price: ${predicted_price * 100_000:,.0f}")
Definition: Technique that adds a penalty on the model's coefficients during training to keep the model simple and avoid overfitting. Ridge (L2) penalizes the sum of squared coefficients; Lasso (L1) penalizes the sum of their absolute values.
Purpose: Reduce overfitting by penalizing complex models with large weights.
Why here: Ridge and Lasso are the basic regularized regression models. Ridge applies a soft penalty (all coefficients are kept but shrunk in magnitude), while Lasso can force some coefficients exactly to zero (automatic feature selection). L2 (Ridge) is numerically more stable; L1 (Lasso) is more interpretable thanks to its natural zeros.
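A compact illustration of the difference (synthetic data with three irrelevant features; the alpha values are illustrative): Lasso zeroes out the irrelevant coefficients, while Ridge only shrinks them:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 5))
# Only the first two features actually drive the target
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)

# Ridge keeps all 5 coefficients non-zero (just smaller);
# Lasso drives the 3 irrelevant ones exactly to zero
print(np.round(ridge.coef_, 3))
print(np.round(lasso.coef_, 3))
print(int((lasso.coef_ == 0).sum()))  # irrelevant features eliminated
```

This sparsity is why Lasso doubles as a feature-selection tool, at the cost of also shrinking the useful coefficients.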