Google Colab provides GPUs in the browser — a T4 on the free tier, and faster accelerators such as V100 and A100 on paid tiers. This tutorial shows how to use it effectively: mount Drive to persist data, use the GPU, save models, and avoid common pitfalls.
In Colab: Runtime → Change runtime type → Hardware accelerator → GPU (T4)
Definition: A GPU (Graphics Processing Unit) is a processor specialized for massive parallel computations. A CPU (Central Processing Unit) is the computer's general-purpose processor, better for sequential tasks.
Purpose: GPUs drastically accelerate matrix operations (multiply, dot product, convolution) that dominate deep learning.
Why here: Training a CNN on CIFAR-10 takes roughly 10 min on a T4 GPU versus roughly 1 h on CPU — about a 6x speedup. A100/V100 GPUs are faster still. On larger models (ResNets, Transformers), the gap is typically 10-50x.
import torch
import tensorflow as tf
# Check PyTorch GPU
print(f"PyTorch — GPU available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"PyTorch GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
# Check TensorFlow GPU
gpus = tf.config.list_physical_devices('GPU')
print(f"TensorFlow — GPUs: {gpus}")
# System info
import subprocess
result = subprocess.run(['nvidia-smi'], capture_output=True, text=True)
print(result.stdout)
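Once a GPU is confirmed, PyTorch code should select a device explicitly and fall back to CPU so the same notebook runs on any runtime. A minimal sketch (the tensor shapes here are arbitrary, just for illustration):

```python
import torch

# Pick the GPU if one is visible, otherwise fall back to CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Tensors (and models, via .to(device)) must live on the same device before use
x = torch.randn(64, 128, device=device)
w = torch.randn(128, 10, device=device)
logits = x @ w  # runs on the GPU when one is available

print(f"Computed on: {logits.device}")
```

The same pattern applies to models: `model.to(device)` before training, and move each batch with `batch.to(device)` inside the loop.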
Definition: Mounting gives the Colab VM filesystem access to your Google Drive storage. Without it, any file written in Colab lives on the VM's ephemeral disk and is lost on disconnect (free sessions are capped at roughly 12 h).
Purpose: Persist data, models and results beyond the Colab session.
Why here: Colab is stateless — there's no persistent "hard drive" for you. Mounting Drive is the simplest way to save training runs and checkpoints.
from google.colab import drive
import os
# Mount Google Drive (asks for permission)
drive.mount('/content/drive')
# Create working folder on Drive
PROJECT_DIR = '/content/drive/MyDrive/ML_Projects/classification_cifar10'
os.makedirs(PROJECT_DIR, exist_ok=True)
MODELS_DIR = f'{PROJECT_DIR}/models'
DATA_DIR = f'{PROJECT_DIR}/data'
os.makedirs(MODELS_DIR, exist_ok=True)
os.makedirs(DATA_DIR, exist_ok=True)
print(f"Project: {PROJECT_DIR}")
print(f"Contents: {os.listdir(PROJECT_DIR)}")
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import numpy as np
# ── Data ──
(X_train, y_train), (X_test, y_test) = keras.datasets.cifar10.load_data()
X_train, X_test = X_train / 255.0, X_test / 255.0
y_train = keras.utils.to_categorical(y_train)
y_test = keras.utils.to_categorical(y_test)
# ── Model ──
def create_model():
    return keras.Sequential([
        keras.Input(shape=(32, 32, 3)),
        layers.RandomFlip('horizontal'),
        layers.RandomRotation(0.1),
        layers.Conv2D(32, 3, padding='same', activation='relu'),
        layers.BatchNormalization(),
        layers.Conv2D(32, 3, padding='same', activation='relu'),
        layers.BatchNormalization(),
        layers.MaxPooling2D(),
        layers.Dropout(0.3),
        layers.Conv2D(64, 3, padding='same', activation='relu'),
        layers.BatchNormalization(),
        layers.MaxPooling2D(),
        layers.Dropout(0.3),
        layers.GlobalAveragePooling2D(),
        layers.Dense(256, activation='relu'),
        layers.Dropout(0.5),
        layers.Dense(10, activation='softmax')
    ])
model = create_model()
model.compile(
    optimizer=keras.optimizers.Adam(0.001),
    loss='categorical_crossentropy',
    metrics=['accuracy']
)
model.summary()
import time
MODEL_PATH = f'{MODELS_DIR}/cifar10_best.keras'
CHECKPOINT_PATH = f'{MODELS_DIR}/cifar10_checkpoint.keras'
callbacks = [
    # Save the best model to Drive
    keras.callbacks.ModelCheckpoint(
        MODEL_PATH,
        monitor='val_accuracy',
        save_best_only=True,
        verbose=1
    ),
    # Save at the end of every epoch (protection against disconnects)
    keras.callbacks.ModelCheckpoint(
        CHECKPOINT_PATH,
        save_freq='epoch',
        verbose=0
    ),
    keras.callbacks.EarlyStopping(patience=15, restore_best_weights=True),
    keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=7, min_lr=1e-7),
]
start = time.time()
history = model.fit(
    X_train, y_train,
    epochs=100,
    batch_size=128,
    validation_split=0.15,
    callbacks=callbacks,
    verbose=1
)
elapsed = time.time() - start
test_loss, test_acc = model.evaluate(X_test, y_test)
print(f"\nTraining finished in {elapsed/60:.1f} min")
print(f"Test accuracy: {test_acc*100:.2f}%")
print(f"Model saved to Drive: {MODEL_PATH}")
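The resume cell further below relies on the in-memory `history` object, which is gone after a real disconnect. A small sketch of persisting the completed-epoch count to a file so a fresh session can resume correctly — the helper names and the file name are illustrative, not part of Keras; in Colab, point the path at `MODELS_DIR` on Drive:

```python
import json
import os

def save_train_state(state_path, epochs_done):
    """Persist how many epochs completed so a fresh session can resume."""
    with open(state_path, 'w') as f:
        json.dump({'epochs_done': epochs_done}, f)

def load_train_state(state_path):
    """Return the saved epoch count, or 0 if no state file exists."""
    if not os.path.exists(state_path):
        return 0
    with open(state_path) as f:
        return json.load(f)['epochs_done']

# Example with a local file; after each training run you would call
# save_train_state(f'{MODELS_DIR}/train_state.json', len(history.history['loss']))
save_train_state('train_state.json', 12)
print(load_train_state('train_state.json'))  # 12
```

When resuming, pass the loaded value as `initial_epoch=` to `model.fit` instead of reading it from `history`.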
Definition: Keras callback that saves the model to disk (Drive, in our case) regularly during training. With save_best_only=True and monitor='val_accuracy', we only save if val_accuracy improves.
Purpose: Protect against Colab disconnects and persist the best model in case of interruption.
Why here: Without ModelCheckpoint, if your Colab session crashes or disconnects after 30 min of training, everything is lost. With two checkpoints (best + last), you can resume training at the current epoch.
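The save_best_only decision reduces to a simple comparison against the best value seen so far. A minimal sketch of that logic (not the actual Keras implementation — `improved` is a hypothetical helper):

```python
def improved(current, best, mode='max'):
    """Mimics ModelCheckpoint's save_best_only test for a monitored metric."""
    if best is None:  # first epoch: always save
        return True
    return current > best if mode == 'max' else current < best

# Simulate val_accuracy over four epochs; "save" only on improvement
best = None
for epoch, val_acc in enumerate([0.61, 0.68, 0.65, 0.72]):
    if improved(val_acc, best):
        best = val_acc
        print(f"epoch {epoch}: val_accuracy improved to {val_acc} — saving")
```

With `mode='min'` the same logic covers loss-like metrics, which is why the real callback takes a `mode` argument.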
import os
# Remount Drive after disconnect
from google.colab import drive
drive.mount('/content/drive')
# Load latest checkpoint
if os.path.exists(CHECKPOINT_PATH):
    model = keras.models.load_model(CHECKPOINT_PATH)
    print(f"✅ Model loaded from: {CHECKPOINT_PATH}")
    # Resume training from there. Caveat: after a real disconnect the
    # `history` object no longer exists — persist the completed-epoch
    # count to a file on Drive and load it here instead.
    epochs_done = len(history.history['loss'])  # only valid in the same session
    history2 = model.fit(
        X_train, y_train,
        epochs=50,
        initial_epoch=epochs_done,  # resume at the right epoch
        batch_size=128,
        validation_split=0.15,
        callbacks=callbacks
    )
else:
    print("❌ No checkpoint found — restart training")
# ── Avoid automatic disconnect ──
# Colab disconnects idle sessions (typically after ~90 min with no activity)
# Unofficial workaround: open the browser console and run:
# function KeepAlive() { document.querySelector("colab-connect-button").click(); setTimeout(KeepAlive, 60000); } KeepAlive();
# ── Check disk space ──
import shutil
total, used, free = shutil.disk_usage("/")
print(f"Disk: {used/1e9:.1f}GB / {total/1e9:.1f}GB used")
# ── Download generated files ──
from google.colab import files
# Download a model to your computer
files.download('/content/my_model.pkl')
# ── Upload files from your computer ──
uploaded = files.upload() # Opens file picker
for name, data in uploaded.items():
    print(f"File received: {name} ({len(data)} bytes)")
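`files.upload()` returns the file contents as bytes held in memory; to keep them beyond the session, write them to disk (ideally under the mounted Drive path). A sketch, demonstrated on a plain dict standing in for the `uploaded` result — `persist_uploads` is an illustrative helper, not part of `google.colab`:

```python
import os

def persist_uploads(uploaded, dest_dir):
    """Write each uploaded file's bytes into dest_dir; return the saved paths."""
    os.makedirs(dest_dir, exist_ok=True)
    paths = []
    for name, data in uploaded.items():
        path = os.path.join(dest_dir, name)
        with open(path, 'wb') as f:
            f.write(data)
        paths.append(path)
    return paths

# Stand-in for the dict returned by files.upload()
fake_uploaded = {'notes.txt': b'hello colab'}
print(persist_uploads(fake_uploaded, 'uploads'))
```

In Colab you would pass `DATA_DIR` as the destination so the files land on Drive.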
# ── Install packages ──
# In a notebook cell the usual idiom is: !pip install -q transformers datasets
import subprocess
subprocess.run(['pip', 'install', '-q', 'transformers', 'datasets'], check=True)
Definition: VRAM (Video RAM) is the fast memory on the GPU itself. Unlike system RAM (used by the CPU), VRAM holds the tensors involved in GPU computation. A T4 has ~16 GB of VRAM; an A100 has 40 or 80 GB depending on the variant.
Purpose: Keep batch sizes within what fits in VRAM — if a batch (plus model weights, gradients and optimizer state) doesn't fit, training fails with an out-of-memory (OOM) error.
Why here: On a T4 (16 GB VRAM), batch_size=128 works fine for this CIFAR-10 CNN, while batch_size=1024 would likely hit OOM. For large models (ResNets, Transformers), batch sizes must be much smaller.
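A rough back-of-the-envelope for whether a batch fits: activation memory scales linearly with batch size. A sketch of the arithmetic, assuming float32 (4 bytes per value) and a made-up per-sample activation count — real usage also includes weights, gradients and optimizer state, so treat this as a lower bound:

```python
def batch_activation_gb(batch_size, activations_per_sample, bytes_per_value=4):
    """Lower-bound estimate of activation memory for one forward pass, in GB."""
    return batch_size * activations_per_sample * bytes_per_value / 1e9

# Hypothetical CNN producing ~2M activation values per CIFAR-10 image
per_sample = 2_000_000
for bs in (128, 1024):
    print(f"batch_size={bs}: ~{batch_activation_gb(bs, per_sample):.1f} GB of activations")
```

When an OOM does occur, the standard fix is to halve the batch size (optionally compensating with gradient accumulation) and restart the runtime to clear the fragmented VRAM.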