Google Colab provides GPUs in the browser — a T4 on the free tier, and faster accelerators such as V100 and A100 on paid tiers. This tutorial shows how to use it effectively: mount Drive to persist data, use the GPU, save models, and avoid common pitfalls.
In Colab: Runtime → Change runtime type → Hardware accelerator → GPU (T4)
Definition: A GPU (Graphics Processing Unit) is a processor specialized for massive parallel computations. A CPU (Central Processing Unit) is the computer's general-purpose processor, better for sequential tasks.
Purpose: GPUs drastically accelerate matrix operations (multiply, dot product, convolution) that dominate deep learning.
Why here: Training a CNN on CIFAR-10 takes roughly 10 min on a T4 GPU versus roughly 1 h on CPU — about a 6x speedup. A100/V100 GPUs are faster still. On larger models (ResNets, Transformers), the gap is typically 10-50x.
import torch
import tensorflow as tf
# Check PyTorch GPU
print(f"PyTorch — GPU available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"PyTorch GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
# Check TensorFlow GPU
gpus = tf.config.list_physical_devices('GPU')
print(f"TensorFlow — GPUs: {gpus}")
# System info
import subprocess
result = subprocess.run(['nvidia-smi'], capture_output=True, text=True)
print(result.stdout)
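Once a GPU is confirmed, PyTorch code should select a device explicitly and fall back to CPU so the same notebook runs on any runtime. A minimal sketch (the tensor shapes here are arbitrary, just for illustration):

```python
import torch

# Pick the GPU if one is visible, otherwise fall back to CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Tensors (and models, via .to(device)) must live on the same device before use
x = torch.randn(64, 128, device=device)
w = torch.randn(128, 10, device=device)
logits = x @ w  # runs on the GPU when one is available

print(f"Computed on: {logits.device}")
```

The same pattern applies to models: `model.to(device)` before training, and move each batch with `batch.to(device)` inside the loop.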
Definition: Mounting gives the Colab VM filesystem access to your Google Drive storage. Without it, any file written in Colab lives on the VM's ephemeral disk and is lost on disconnect (free sessions are capped at roughly 12 h).
Purpose: Persist data, models and results beyond the Colab session.
Why here: Colab is stateless — there's no persistent "hard drive" for you. Mounting Drive is the simplest way to save training runs and checkpoints.
from google.colab import drive
import os
# Mount Google Drive (asks for permission)
drive.mount('/content/drive')
# Create working folder on Drive
PROJECT_DIR = '/content/drive/MyDrive/ML_Projects/classification_cifar10'
os.makedirs(PROJECT_DIR, exist_ok=True)
MODELS_DIR = f'{PROJECT_DIR}/models'
DATA_DIR = f'{PROJECT_DIR}/data'
os.makedirs(MODELS_DIR, exist_ok=True)
os.makedirs(DATA_DIR, exist_ok=True)
print(f"Project: {PROJECT_DIR}")
print(f"Contents: {os.listdir(PROJECT_DIR)}")
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import numpy as np
# ── Data ──
(X_train, y_train), (X_test, y_test) = keras.datasets.cifar10.load_data()
X_train, X_test = X_train / 255.0, X_test / 255.0
y_train = keras.utils.to_categorical(y_train)
y_test = keras.utils.to_categorical(y_test)
# ── Model ──
def create_model():
    return keras.Sequential([
        keras.Input(shape=(32, 32, 3)),
        layers.RandomFlip('horizontal'),
        layers.RandomRotation(0.1),
        layers.Conv2D(32, 3, padding='same', activation='relu'),
        layers.BatchNormalization(),
        layers.Conv2D(32, 3, padding='same', activation='relu'),
        layers.BatchNormalization(),
        layers.MaxPooling2D(),
        layers.Dropout(0.3),
        layers.Conv2D(64, 3, padding='same', activation='relu'),
        layers.BatchNormalization(),
        layers.MaxPooling2D(),
        layers.Dropout(0.3),
        layers.GlobalAveragePooling2D(),
        layers.Dense(256, activation='relu'),
        layers.Dropout(0.5),
        layers.Dense(10, activation='softmax')
    ])
model = create_model()
model.compile(
    optimizer=keras.optimizers.Adam(0.001),
    loss='categorical_crossentropy',
    metrics=['accuracy']
)
model.summary()
import time
MODEL_PATH = f'{MODELS_DIR}/cifar10_best.keras'
CHECKPOINT_PATH = f'{MODELS_DIR}/cifar10_checkpoint.keras'
callbacks = [
    # Save the best model to Drive
    keras.callbacks.ModelCheckpoint(
        MODEL_PATH,
        monitor='val_accuracy',
        save_best_only=True,
        verbose=1
    ),
    # Save at the end of every epoch (protection against disconnects)
    keras.callbacks.ModelCheckpoint(
        CHECKPOINT_PATH,
        save_freq='epoch',
        verbose=0
    ),
    keras.callbacks.EarlyStopping(patience=15, restore_best_weights=True),
    keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=7, min_lr=1e-7),
]
start = time.time()
history = model.fit(
    X_train, y_train,
    epochs=100,
    batch_size=128,
    validation_split=0.15,
    callbacks=callbacks,
    verbose=1
)
elapsed = time.time() - start
test_loss, test_acc = model.evaluate(X_test, y_test)
print(f"\nTraining finished in {elapsed/60:.1f} min")
print(f"Test accuracy: {test_acc*100:.2f}%")
print(f"Model saved to Drive: {MODEL_PATH}")
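The resume cell further below relies on the in-memory `history` object, which is gone after a real disconnect. A small sketch of persisting the completed-epoch count to a file so a fresh session can resume correctly — the helper names and the file name are illustrative, not part of Keras; in Colab, point the path at `MODELS_DIR` on Drive:

```python
import json
import os

def save_train_state(state_path, epochs_done):
    """Persist how many epochs completed so a fresh session can resume."""
    with open(state_path, 'w') as f:
        json.dump({'epochs_done': epochs_done}, f)

def load_train_state(state_path):
    """Return the saved epoch count, or 0 if no state file exists."""
    if not os.path.exists(state_path):
        return 0
    with open(state_path) as f:
        return json.load(f)['epochs_done']

# Example with a local file; after each training run you would call
# save_train_state(f'{MODELS_DIR}/train_state.json', len(history.history['loss']))
save_train_state('train_state.json', 12)
print(load_train_state('train_state.json'))  # 12
```

When resuming, pass the loaded value as `initial_epoch=` to `model.fit` instead of reading it from `history`.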
Definition: Keras callback that saves the model to disk (Drive, in our case) regularly during training. With save_best_only=True and monitor='val_accuracy', we only save if val_accuracy improves.
Purpose: Protect against Colab disconnects and persist the best model in case of interruption.
Why here: Without ModelCheckpoint, if your Colab session crashes or disconnects after 30 min of training, everything is lost. With two checkpoints (best + last), you can resume training at the current epoch.
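The save_best_only decision reduces to a simple comparison against the best value seen so far. A minimal sketch of that logic (not the actual Keras implementation — `improved` is a hypothetical helper):

```python
def improved(current, best, mode='max'):
    """Mimics ModelCheckpoint's save_best_only test for a monitored metric."""
    if best is None:  # first epoch: always save
        return True
    return current > best if mode == 'max' else current < best

# Simulate val_accuracy over four epochs; "save" only on improvement
best = None
for epoch, val_acc in enumerate([0.61, 0.68, 0.65, 0.72]):
    if improved(val_acc, best):
        best = val_acc
        print(f"epoch {epoch}: val_accuracy improved to {val_acc} — saving")
```

With `mode='min'` the same logic covers loss-like metrics, which is why the real callback takes a `mode` argument.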
import os
# Remount Drive after disconnect
from google.colab import drive
drive.mount('/content/drive')
# Load latest checkpoint
if os.path.exists(CHECKPOINT_PATH):
    model = keras.models.load_model(CHECKPOINT_PATH)
    print(f"✅ Model loaded from: {CHECKPOINT_PATH}")
    # Resume training from there. Caveat: after a real disconnect the
    # `history` object no longer exists — persist the completed-epoch
    # count to a file on Drive and load it here instead.
    epochs_done = len(history.history['loss'])  # only valid in the same session
    history2 = model.fit(
        X_train, y_train,
        epochs=50,
        initial_epoch=epochs_done,  # resume at the right epoch
        batch_size=128,
        validation_split=0.15,
        callbacks=callbacks
    )
else:
    print("❌ No checkpoint found — restart training")
# ── Avoid automatic disconnect ──
# Colab disconnects idle sessions (typically after ~90 min with no activity)
# Unofficial workaround: open the browser console and run:
# function KeepAlive() { document.querySelector("colab-connect-button").click(); setTimeout(KeepAlive, 60000); } KeepAlive();
# ── Check disk space ──
import shutil
total, used, free = shutil.disk_usage("/")
print(f"Disk: {used/1e9:.1f}GB / {total/1e9:.1f}GB used")
# ── Download generated files ──
from google.colab import files
# Download a model to your computer
files.download('/content/my_model.pkl')
# ── Upload files from your computer ──
uploaded = files.upload() # Opens file picker
for name, data in uploaded.items():
    print(f"File received: {name} ({len(data)} bytes)")
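`files.upload()` returns the file contents as bytes held in memory; to keep them beyond the session, write them to disk (ideally under the mounted Drive path). A sketch, demonstrated on a plain dict standing in for the `uploaded` result — `persist_uploads` is an illustrative helper, not part of `google.colab`:

```python
import os

def persist_uploads(uploaded, dest_dir):
    """Write each uploaded file's bytes into dest_dir; return the saved paths."""
    os.makedirs(dest_dir, exist_ok=True)
    paths = []
    for name, data in uploaded.items():
        path = os.path.join(dest_dir, name)
        with open(path, 'wb') as f:
            f.write(data)
        paths.append(path)
    return paths

# Stand-in for the dict returned by files.upload()
fake_uploaded = {'notes.txt': b'hello colab'}
print(persist_uploads(fake_uploaded, 'uploads'))
```

In Colab you would pass `DATA_DIR` as the destination so the files land on Drive.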
# ── Install packages ──
# In a notebook cell the usual idiom is: !pip install -q transformers datasets
import subprocess
subprocess.run(['pip', 'install', '-q', 'transformers', 'datasets'], check=True)
Definition: VRAM (Video RAM) is the fast memory on the GPU itself. Unlike system RAM (used by the CPU), VRAM holds the tensors involved in GPU computation. A T4 has ~16 GB of VRAM; an A100 has 40 or 80 GB depending on the variant.
Purpose: Keep batch sizes within what fits in VRAM — if a batch (plus model weights, gradients and optimizer state) doesn't fit, training fails with an out-of-memory (OOM) error.
Why here: On a T4 (16 GB VRAM), batch_size=128 works fine for this CIFAR-10 CNN, while batch_size=1024 would likely hit OOM. For large models (ResNets, Transformers), batch sizes must be much smaller.
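A rough back-of-the-envelope for whether a batch fits: activation memory scales linearly with batch size. A sketch of the arithmetic, assuming float32 (4 bytes per value) and a made-up per-sample activation count — real usage also includes weights, gradients and optimizer state, so treat this as a lower bound:

```python
def batch_activation_gb(batch_size, activations_per_sample, bytes_per_value=4):
    """Lower-bound estimate of activation memory for one forward pass, in GB."""
    return batch_size * activations_per_sample * bytes_per_value / 1e9

# Hypothetical CNN producing ~2M activation values per CIFAR-10 image
per_sample = 2_000_000
for bs in (128, 1024):
    print(f"batch_size={bs}: ~{batch_activation_gb(bs, per_sample):.1f} GB of activations")
```

When an OOM does occur, the standard fix is to halve the batch size (optionally compensating with gradient accumulation) and restart the runtime to clear the fragmented VRAM.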