NumPy and Pandas are the foundations of data science in Python. NumPy brings vectorized computation on arrays and matrices; Pandas brings DataFrames for manipulating tabular data, like Excel, but with code.
Definition: An ndarray is a multidimensional, homogeneous NumPy array: all elements share the same data type (int, float, etc.). Unlike Python lists, elements are stored contiguously in memory, enabling very fast operations.
Purpose: Provide an optimized data structure for numerical and scientific computation, replacing slow Python loops with vectorized operations.
Why here: NumPy is the foundation of all data processing in Python. Understanding ndarrays is essential before manipulating data with Pandas.
import numpy as np
# ── Create arrays ──
arr1d = np.array([1, 2, 3, 4, 5])
arr2d = np.array([[1, 2, 3], [4, 5, 6]])
print(arr2d.shape) # (2, 3): 2 rows, 3 columns
print(arr2d.dtype) # int64
print(arr2d.ndim) # 2
# ── Special arrays ──
zeros = np.zeros((3, 4)) # 3×4 matrix of zeros
ones = np.ones((2, 3), dtype=np.float32)
eye = np.eye(4) # 4×4 identity matrix
rand = np.random.randn(100, 10) # 100×10, standard normal distribution
seq = np.arange(0, 10, 0.5) # [0, 0.5, 1.0, ..., 9.5]
# ── Vectorized operations (no loop!) ──
price = np.array([100, 200, 150, 300, 250])
# Apply 20% tax to all prices in one line
price_tax = price * 1.20 # [120, 240, 180, 360, 300]
price_discounted = np.where(price > 200, price * 0.9, price) # -10% if price > 200
# ── Broadcasting ──
arr_3 = np.array([1, 2, 3]) # shape (3,)
arr_3x1 = np.array([[10], [20], [30]]) # shape (3, 1)
result = arr_3 + arr_3x1 # broadcasting: result shape (3, 3)
# [[11, 12, 13], [21, 22, 23], [31, 32, 33]]
# ── Statistics ──
print(f"Mean: {price.mean():.2f}") # 200.00
print(f"Median: {np.median(price):.2f}")
print(f"Std dev: {price.std():.2f}")
print(f"Min/Max: {price.min()} / {price.max()}")
print(f"75th percentile: {np.percentile(price, 75)}")
.mean(), .std() and np.percentile() all work without explicit loops, making them extremely fast even on millions of elements.

Definition: Broadcasting is the NumPy mechanism that allows operations between arrays of different shapes by automatically "repeating" the smaller array to match the larger one's dimensions.
Purpose: Avoid explicitly creating copies or manually looping to align shapes.
Why here: Broadcasting is a frequent source of errors, which makes it a key concept to master for efficient NumPy. Example: adding a vector of shape (3,) to a column vector of shape (3, 1) produces a (3, 3) matrix: NumPy virtually repeats the row vector down the rows and the column vector across the columns.
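The compatibility rules can be checked without allocating any arrays; a minimal sketch using np.broadcast_shapes (available in NumPy 1.20+):

```python
import numpy as np

# np.broadcast_shapes applies the broadcasting rules to shape tuples only
print(np.broadcast_shapes((3,), (3, 1)))       # (3, 3)
print(np.broadcast_shapes((5, 1, 4), (2, 4)))  # (5, 2, 4)

# Incompatible shapes fail loudly instead of silently producing wrong results
try:
    np.broadcast_shapes((3,), (2,))
except ValueError as err:
    print("Incompatible shapes:", err)
```

Checking shapes this way is a cheap debugging step when a broadcast error is not obvious.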
arr_3 + arr_3x1: NumPy detects that the shapes don't match, (3,) vs (3, 1). It applies the broadcasting rules: align shapes to the right, then stretch dimensions of size 1. The result is a 3×3 matrix where row i is arr_3 plus the corresponding value of arr_3x1 (10, 20 or 30).

Definition: Vectorization replaces explicit Python loops with operations on entire arrays, delegating the computation to optimized compiled code (C/Fortran) instead of the Python interpreter.
Purpose: Drastically speed up numerical computation: a vectorized operation is typically 10 to 100 times faster than the equivalent Python loop.
Why here: Understanding vectorization is central to writing performant Data Science code. Instead of for i in range(len(arr)): result[i] = arr[i] * 2, just write arr * 2. The performance difference grows with data size: on 1 million elements, it's the difference between milliseconds and several seconds.
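The speed gap is easy to measure; a small sketch with timeit (exact numbers depend on hardware, so no specific speedup is claimed):

```python
import timeit
import numpy as np

arr = np.arange(1_000_000, dtype=np.float64)

def python_loop():
    # Element-by-element work in the interpreter
    result = [0.0] * len(arr)
    for i in range(len(arr)):
        result[i] = arr[i] * 2
    return result

def vectorized():
    # Same computation delegated to compiled code
    return arr * 2

t_loop = timeit.timeit(python_loop, number=3)
t_vec = timeit.timeit(vectorized, number=3)
print(f"loop: {t_loop:.3f}s  vectorized: {t_vec:.3f}s  "
      f"speedup: ~{t_loop / t_vec:.0f}x")
```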
price * 1.20 runs entirely in compiled C, without passing through the Python interpreter loop. NumPy processes all 5 prices simultaneously (or nearly so) using SIMD (Single Instruction, Multiple Data) CPU instructions. That is why NumPy is so fast for numerical computation.

import numpy as np
# ── Linear algebra: a practical ML case ──
# Linear regression: y = Xw (matrix form)
# Data: 5 houses with [surface, rooms, age]
X = np.array([
[50, 2, 10],
[80, 3, 5],
[120, 4, 2],
[40, 1, 20],
[100, 3, 8]
], dtype=np.float64)
y = np.array([200000, 320000, 480000, 150000, 380000])
# Normalization (standardization)
X_mean = X.mean(axis=0) # Mean of each column
X_std = X.std(axis=0)
X_scaled = (X - X_mean) / X_std
# Add bias column (intercept)
X_b = np.column_stack([np.ones(len(X)), X_scaled])
# Least squares via lstsq (more stable than explicitly computing w = (X^T X)^-1 X^T y)
w = np.linalg.lstsq(X_b, y, rcond=None)[0]
print(f"Parameters: {w}")
# Predict for a new house [75 m², 3 rooms, 7 years]
new_house = np.array([75, 3, 7])
new_scaled = (new_house - X_mean) / X_std
prediction = np.dot([1, *new_scaled], w)
print(f"Predicted price: {prediction:,.0f} USD")
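As a sanity check on the lstsq choice above, a sketch on synthetic data (toy coefficients, not the house data) showing that the normal equation and lstsq agree when X is well-conditioned:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic design matrix with an intercept column; true weights chosen by hand
X_b = np.column_stack([np.ones(50), rng.normal(size=(50, 3))])
true_w = np.array([1.0, 2.0, -1.0, 0.5])
y = X_b @ true_w + rng.normal(scale=0.1, size=50)

# Normal equation w = (X^T X)^-1 X^T y, via solve() to avoid the explicit inverse
w_normal = np.linalg.solve(X_b.T @ X_b, X_b.T @ y)
# QR/SVD-based least squares: same answer, numerically safer on ill-conditioned X
w_lstsq = np.linalg.lstsq(X_b, y, rcond=None)[0]

print(np.allclose(w_normal, w_lstsq))  # True
```

On nearly collinear features, X^T X becomes ill-conditioned and the normal equation loses precision, which is why lstsq is the safer default.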
# ── Matrix operations ──
A = np.random.randn(4, 4)
print(f"Determinant: {np.linalg.det(A):.4f}")
print(f"Rank: {np.linalg.matrix_rank(A)}")
eigenvalues, _ = np.linalg.eig(A)
print(f"Eigenvalues: {eigenvalues}")
Definition: A DataFrame is a labeled 2D table with rows and columns, similar to an Excel sheet or SQL table. Columns can have different types (int, string, datetime, etc.), unlike ndarrays.
Purpose: Represent and manipulate real tabular data with intelligent labels for each axis (index for rows, names for columns).
Why here: DataFrames are the central data structure in Pandas; 90% of Data Science work uses DataFrames. A DataFrame combines NumPy's computation power with SQL-like usability.
Definition: A Series is a single DataFrame column: a labeled 1D array with an index. It combines a NumPy ndarray with labels.
Purpose: Represent a single variable or data dimension with index labels, allowing access by label or position.
Why here: A Series is what you get when accessing a DataFrame column (e.g. df['name']). Understanding Series helps understand DataFrames: a DataFrame is really a dictionary of aligned Series.
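A minimal sketch of that "dictionary of aligned Series" view, using made-up salary data:

```python
import pandas as pd

# A Series: 1D values plus an index of labels
salary = pd.Series([45000, 62000, 85000],
                   index=['Alice', 'Bob', 'Charlie'], name='salary')
print(salary['Bob'])   # access by label -> 62000
print(salary.iloc[0])  # access by position -> 45000

# A DataFrame is aligned Series sharing one index;
# each column comes back out as a Series
df = pd.DataFrame({'salary': salary,
                   'age': pd.Series([28, 34, 45],
                                    index=['Alice', 'Bob', 'Charlie'])})
print(type(df['salary']).__name__)  # Series
```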
import pandas as pd
import numpy as np
# ── Create a DataFrame ──
data = {
'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
'age': [28, 34, 45, 29, 38],
'salary': [45000, 62000, 85000, 51000, 72000],
'department': ['Tech', 'Tech', 'Sales', 'HR', 'Tech'],
'hire_date': pd.to_datetime(['2020-03-15', '2019-07-01', '2015-01-20', '2022-11-05', '2018-04-12'])
}
df = pd.DataFrame(data)
# ── Exploration ──
df.info() # Prints types, non-null counts, memory usage (no print() needed: it returns None)
print(df.describe()) # Descriptive statistics for numeric columns
print(df.head(3)) # First rows
# ── Selection ──
tech_employees = df[df['department'] == 'Tech']
high_earners = df[df['salary'] > 60000][['name', 'salary']]
# ── Calculated columns ──
df['years_employed'] = (pd.Timestamp.now() - df['hire_date']).dt.days / 365
df['seniority'] = pd.cut(
df['years_employed'],
bins=[0, 2, 5, 10, float('inf')],
labels=['Junior', 'Mid', 'Senior', 'Lead']
)
.info() shows types and missing values, .describe() gives quick statistics (mean, std, quartiles), and .head() shows the first rows. Conditional selection with df[condition] filters rows by criteria. Calculated columns enrich the DataFrame with derived variables (years employed, seniority categories).

import pandas as pd
import numpy as np
# ── Simulate sales data ──
np.random.seed(42)
dates = pd.date_range(start='2024-01-01', end='2024-12-31', freq='D')
sales_df = pd.DataFrame({
'date': dates,
'product': np.random.choice(['Laptop', 'Phone', 'Tablet'], len(dates)),
'quantity': np.random.randint(1, 20, len(dates)),
'unit_price': np.random.choice([999, 599, 449], len(dates)),
'region': np.random.choice(['North', 'South', 'East', 'West'], len(dates))
})
sales_df['revenue'] = sales_df['quantity'] * sales_df['unit_price']
sales_df['month'] = sales_df['date'].dt.to_period('M')
# ── GroupBy: aggregations ──
monthly = sales_df.groupby('month').agg(
total_revenue=('revenue', 'sum'),
avg_daily_revenue=('revenue', 'mean'),
transactions=('revenue', 'count')
).reset_index()
print(monthly.tail(3))
Definition: Operation that divides a DataFrame into groups by values in one or more columns, applies a function to each group, then combines results. This pattern is called split-apply-combine.
Purpose: Calculate statistics or transformations separately for each data group (by category, period, region, etc.).
Why here: GroupBy is one of the most powerful Pandas operations. Instead of manually looping over each group, you describe the operation and Pandas applies it automatically, which is both faster and more readable. In the example: group by month, then calculate total revenue, average, and transaction count for each month.
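To make split-apply-combine concrete, a sketch comparing groupby with the manual loop it replaces (toy department data, not the sales DataFrame above):

```python
import pandas as pd

df = pd.DataFrame({'dept': ['Tech', 'Sales', 'Tech', 'HR', 'Sales'],
                   'salary': [50, 70, 60, 40, 80]})

# One expression: split by 'dept', apply sum, combine into a Series
by_group = df.groupby('dept')['salary'].sum()

# The manual loop it replaces
totals = {}
for _, row in df.iterrows():
    totals[row['dept']] = totals.get(row['dept'], 0) + row['salary']

print(by_group.to_dict() == totals)  # True
```

Beyond readability, groupby runs the aggregation in optimized code instead of iterating row by row in Python.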
sales_df.groupby('month').agg(...) first divides the 366 rows (2024 is a leap year) into 12 groups, one per month. Then, for each month, Pandas calculates the sum of revenues, the mean of revenues, and the transaction count. Finally, .reset_index() turns 'month' back into an ordinary column, so the result is a plain DataFrame with one row per month. Much more efficient than a manual Python loop.

# ── Pivot table ──
pivot = pd.pivot_table(
sales_df,
values='revenue',
index='region',
columns='product',
aggfunc='sum',
fill_value=0
)
print(pivot)
Definition: Reorganization of tabular data by rearranging rows and columns, with aggregation of values (sum, mean, count, etc.). Result is a table where one dimension becomes rows, another becomes columns, and the third is aggregated in cells.
Purpose: Summarize and cross two data dimensions to see patterns and comparisons quickly; useful for cross-analysis and reports.
Why here: Pivot vs GroupBy: GroupBy handles simple or complex summaries (total per month, with multiple aggregations); pivot_table crosses two dimensions and shows the result as a grid (revenue by region AND product). pivot_table calls GroupBy internally but offers a more intuitive interface for this specific use case.
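The "pivot_table calls GroupBy internally" claim can be illustrated on toy data: groupby on both keys plus unstack builds the same grid.

```python
import pandas as pd

df = pd.DataFrame({
    'region':  ['North', 'North', 'South', 'South', 'North'],
    'product': ['Laptop', 'Phone', 'Laptop', 'Phone', 'Phone'],
    'revenue': [100, 50, 80, 20, 30],
})

pivot = pd.pivot_table(df, values='revenue', index='region',
                       columns='product', aggfunc='sum', fill_value=0)

# The same grid built explicitly: group on both keys, then unstack one into columns
grid = df.groupby(['region', 'product'])['revenue'].sum().unstack(fill_value=0)

print(pivot.equals(grid))  # True here: every region/product pair is present
```

With missing combinations, fill_value keeps both results aligned; the dtypes can differ in some pandas versions, so comparing values is the robust check.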
fill_value=0 replaces missing region/product combinations with 0 (instead of NaN).

# ── Clean real data ──
dirty_df = pd.DataFrame({
'price': ['$1,299', 'N/A', '$899', '', '$1,499'],
'date': ['15/01/2024', '2024-01-20', 'invalid', '22/01/2024', '23-01-2024']
})
# Clean price
dirty_df['price_clean'] = (
dirty_df['price']
.str.replace('[$,]', '', regex=True)
.replace(['N/A', ''], pd.NA)
.astype('Float64')
)
# Analyze missing values
print(dirty_df.isnull().sum()) # Count per column
print(dirty_df.isnull().mean() * 100) # % missing values
# Imputation: replace NaN with the median
dirty_df['price_imputed'] = dirty_df['price_clean'].fillna(dirty_df['price_clean'].median())
.isnull().sum() counts NaN per column for diagnosis. .fillna() imputes missing values, here with the median, which is robust against outliers. This cleaning step can take 80% of the time in real Data Science work. Prefer Pandas' column-level operations (.apply(), the .str and .dt accessors) over manual element-by-element loops; the vectorized ones are typically 10-100x faster. For example, df['price'].str.replace(...) transforms the entire column in one optimized operation, not element-by-element.
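The 'date' column above is left uncleaned; a sketch for parsing its mixed formats (note: format='mixed' requires pandas 2.0+):

```python
import pandas as pd

dates = pd.Series(['15/01/2024', '2024-01-20', 'invalid',
                   '22/01/2024', '23-01-2024'])

# format='mixed' (pandas 2.0+) infers the format element by element;
# errors='coerce' turns unparseable entries into NaT instead of raising;
# dayfirst=True resolves ambiguous day/month orders as day first
clean = pd.to_datetime(dates, format='mixed', dayfirst=True, errors='coerce')
print(clean)
print(f"Unparseable dates: {clean.isna().sum()}")  # 1 ('invalid')
```

The NaT entries can then be diagnosed with .isna() and dropped or imputed, exactly like the price column.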