NumPy and Pandas are the foundations of data science in Python. NumPy brings vectorized computation on arrays and matrices; Pandas brings DataFrames for manipulating tabular data, like Excel, but with code.
Definition: An ndarray is a multidimensional, homogeneous NumPy array: all elements share the same data type (int, float, etc.). Unlike Python lists, elements are stored contiguously in memory, enabling very fast operations.
Purpose: Provide an optimized data structure for numerical and scientific computation, replacing slow Python loops with vectorized operations.
Why here: NumPy is the foundation of all data processing in Python. Understanding ndarrays is essential before manipulating data with Pandas.
import numpy as np
# ── Create arrays ──
arr1d = np.array([1, 2, 3, 4, 5])
arr2d = np.array([[1, 2, 3], [4, 5, 6]])
print(arr2d.shape) # (2, 3): 2 rows, 3 columns
print(arr2d.dtype) # int64
print(arr2d.ndim) # 2
# ── Special arrays ──
zeros = np.zeros((3, 4)) # 3×4 matrix of zeros
ones = np.ones((2, 3), dtype=np.float32)
eye = np.eye(4) # 4×4 identity matrix
rand = np.random.randn(100, 10) # 100×10, standard normal distribution
seq = np.arange(0, 10, 0.5) # [0, 0.5, 1.0, ..., 9.5]
# ── Vectorized operations (no loop!) ──
price = np.array([100, 200, 150, 300, 250])
# Apply 20% tax to all prices in one line
price_tax = price * 1.20 # [120, 240, 180, 360, 300]
price_discounted = np.where(price > 200, price * 0.9, price) # -10% if price > 200
# ── Broadcasting ──
arr_3 = np.array([1, 2, 3]) # shape (3,)
arr_3x1 = np.array([[10], [20], [30]]) # shape (3, 1)
result = arr_3 + arr_3x1 # broadcasting: result shape (3, 3)
# [[11, 12, 13], [21, 22, 23], [31, 32, 33]]
# ── Statistics ──
print(f"Mean: {price.mean():.2f}") # 200.00
print(f"Median: {np.median(price):.2f}")
print(f"Std dev: {price.std():.2f}")
print(f"Min/Max: {price.min()} / {price.max()}")
print(f"75th percentile: {np.percentile(price, 75)}")
.mean(), .std() and np.percentile() all work without explicit loops, making them extremely fast even on millions of elements.

Definition: Broadcasting is the NumPy mechanism that allows operations between arrays of different shapes by automatically "repeating" the smaller array to match the larger one's dimensions.
Purpose: Avoid explicitly creating copies or manually looping to align shapes.
Why here: Broadcasting is a frequent source of errors, which makes it a key concept to master for efficient NumPy. Example: adding a vector of shape (3,) to a column vector of shape (3, 1) produces a (3, 3) matrix: NumPy virtually repeats the row vector down the rows and the column vector across the columns.
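The compatibility rules can be checked without allocating any arrays; a minimal sketch using np.broadcast_shapes (available in NumPy 1.20+):

```python
import numpy as np

# np.broadcast_shapes applies the broadcasting rules to shape tuples only
print(np.broadcast_shapes((3,), (3, 1)))       # (3, 3)
print(np.broadcast_shapes((5, 1, 4), (2, 4)))  # (5, 2, 4)

# Incompatible shapes fail loudly instead of silently producing wrong results
try:
    np.broadcast_shapes((3,), (2,))
except ValueError as err:
    print("Incompatible shapes:", err)
```

Checking shapes this way is a cheap debugging step when a broadcast error is not obvious.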
arr_3 + arr_3x1: NumPy detects that the shapes don't match, (3,) vs (3, 1). It applies the broadcasting rules: align shapes to the right, then stretch dimensions of size 1. The result is a 3×3 matrix where row i is arr_3 plus the corresponding value of arr_3x1 (10, 20 or 30).

Definition: Vectorization replaces explicit Python loops with operations on entire arrays, delegating the computation to optimized compiled code (C/Fortran) instead of the Python interpreter.
Purpose: Drastically speed up numerical computation: a vectorized operation is typically 10 to 100 times faster than the equivalent Python loop.
Why here: Understanding vectorization is central to writing performant Data Science code. Instead of for i in range(len(arr)): result[i] = arr[i] * 2, just write arr * 2. The performance difference grows with data size: on 1 million elements, it's the difference between milliseconds and several seconds.
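The speed gap is easy to measure; a small sketch with timeit (exact numbers depend on hardware, so no specific speedup is claimed):

```python
import timeit
import numpy as np

arr = np.arange(1_000_000, dtype=np.float64)

def python_loop():
    # Element-by-element work in the interpreter
    result = [0.0] * len(arr)
    for i in range(len(arr)):
        result[i] = arr[i] * 2
    return result

def vectorized():
    # Same computation delegated to compiled code
    return arr * 2

t_loop = timeit.timeit(python_loop, number=3)
t_vec = timeit.timeit(vectorized, number=3)
print(f"loop: {t_loop:.3f}s  vectorized: {t_vec:.3f}s  "
      f"speedup: ~{t_loop / t_vec:.0f}x")
```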
price * 1.20 runs entirely in compiled C, without passing through the Python interpreter loop. NumPy processes all 5 prices simultaneously (or nearly so) using SIMD (Single Instruction, Multiple Data) CPU instructions. That is why NumPy is so fast for numerical computation.

import numpy as np
# ── Linear algebra: a practical ML case ──
# Linear regression: y = Xw (matrix form)
# Data: 5 houses with [surface, rooms, age]
X = np.array([
[50, 2, 10],
[80, 3, 5],
[120, 4, 2],
[40, 1, 20],
[100, 3, 8]
], dtype=np.float64)
y = np.array([200000, 320000, 480000, 150000, 380000])
# Normalization (standardization)
X_mean = X.mean(axis=0) # Mean of each column
X_std = X.std(axis=0)
X_scaled = (X - X_mean) / X_std
# Add bias column (intercept)
X_b = np.column_stack([np.ones(len(X)), X_scaled])
# Least squares via lstsq (more stable than explicitly computing w = (X^T X)^-1 X^T y)
w = np.linalg.lstsq(X_b, y, rcond=None)[0]
print(f"Parameters: {w}")
# Predict for a new house [75 m², 3 rooms, 7 years]
new_house = np.array([75, 3, 7])
new_scaled = (new_house - X_mean) / X_std
prediction = np.dot([1, *new_scaled], w)
print(f"Predicted price: {prediction:,.0f} USD")
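As a sanity check on the lstsq choice above, a sketch on synthetic data (toy coefficients, not the house data) showing that the normal equation and lstsq agree when X is well-conditioned:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic design matrix with an intercept column; true weights chosen by hand
X_b = np.column_stack([np.ones(50), rng.normal(size=(50, 3))])
true_w = np.array([1.0, 2.0, -1.0, 0.5])
y = X_b @ true_w + rng.normal(scale=0.1, size=50)

# Normal equation w = (X^T X)^-1 X^T y, via solve() to avoid the explicit inverse
w_normal = np.linalg.solve(X_b.T @ X_b, X_b.T @ y)
# QR/SVD-based least squares: same answer, numerically safer on ill-conditioned X
w_lstsq = np.linalg.lstsq(X_b, y, rcond=None)[0]

print(np.allclose(w_normal, w_lstsq))  # True
```

On nearly collinear features, X^T X becomes ill-conditioned and the normal equation loses precision, which is why lstsq is the safer default.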
# ── Matrix operations ──
A = np.random.randn(4, 4)
print(f"Determinant: {np.linalg.det(A):.4f}")
print(f"Rank: {np.linalg.matrix_rank(A)}")
eigenvalues, _ = np.linalg.eig(A)
print(f"Eigenvalues: {eigenvalues}")
Definition: A DataFrame is a labeled 2D table with rows and columns, similar to an Excel sheet or SQL table. Columns can have different types (int, string, datetime, etc.), unlike ndarrays.
Purpose: Represent and manipulate real tabular data with intelligent labels for each axis (index for rows, names for columns).
Why here: DataFrames are the central data structure in Pandas; 90% of Data Science work uses DataFrames. A DataFrame combines NumPy's computation power with SQL-like usability.
Definition: A Series is a single DataFrame column: a labeled 1D array with an index. It combines a NumPy ndarray with labels.
Purpose: Represent a single variable or data dimension with index labels, allowing access by label or position.
Why here: A Series is what you get when accessing a DataFrame column (e.g. df['name']). Understanding Series helps understand DataFrames: a DataFrame is really a dictionary of aligned Series.
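A minimal sketch of that "dictionary of aligned Series" view, using made-up salary data:

```python
import pandas as pd

# A Series: 1D values plus an index of labels
salary = pd.Series([45000, 62000, 85000],
                   index=['Alice', 'Bob', 'Charlie'], name='salary')
print(salary['Bob'])   # access by label -> 62000
print(salary.iloc[0])  # access by position -> 45000

# A DataFrame is aligned Series sharing one index;
# each column comes back out as a Series
df = pd.DataFrame({'salary': salary,
                   'age': pd.Series([28, 34, 45],
                                    index=['Alice', 'Bob', 'Charlie'])})
print(type(df['salary']).__name__)  # Series
```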
import pandas as pd
import numpy as np
# ── Create a DataFrame ──
data = {
'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
'age': [28, 34, 45, 29, 38],
'salary': [45000, 62000, 85000, 51000, 72000],
'department': ['Tech', 'Tech', 'Sales', 'HR', 'Tech'],
'hire_date': pd.to_datetime(['2020-03-15', '2019-07-01', '2015-01-20', '2022-11-05', '2018-04-12'])
}
df = pd.DataFrame(data)
# ── Exploration ──
df.info() # Prints types, non-null counts, memory usage (no print() needed: it returns None)
print(df.describe()) # Descriptive statistics for numeric columns
print(df.head(3)) # First rows
# ── Selection ──
tech_employees = df[df['department'] == 'Tech']
high_earners = df[df['salary'] > 60000][['name', 'salary']]
# ── Calculated columns ──
df['years_employed'] = (pd.Timestamp.now() - df['hire_date']).dt.days / 365
df['seniority'] = pd.cut(
df['years_employed'],
bins=[0, 2, 5, 10, float('inf')],
labels=['Junior', 'Mid', 'Senior', 'Lead']
)
.info() shows types and missing values, .describe() gives quick statistics (mean, std, quartiles), and .head() shows the first rows. Conditional selection with df[condition] filters rows by criteria. Calculated columns enrich the DataFrame with derived variables (years employed, seniority categories).

import pandas as pd
import numpy as np
# ── Simulate sales data ──
np.random.seed(42)
dates = pd.date_range(start='2024-01-01', end='2024-12-31', freq='D')
sales_df = pd.DataFrame({
'date': dates,
'product': np.random.choice(['Laptop', 'Phone', 'Tablet'], len(dates)),
'quantity': np.random.randint(1, 20, len(dates)),
'unit_price': np.random.choice([999, 599, 449], len(dates)),
'region': np.random.choice(['North', 'South', 'East', 'West'], len(dates))
})
sales_df['revenue'] = sales_df['quantity'] * sales_df['unit_price']
sales_df['month'] = sales_df['date'].dt.to_period('M')
# ── GroupBy: aggregations ──
monthly = sales_df.groupby('month').agg(
total_revenue=('revenue', 'sum'),
avg_daily_revenue=('revenue', 'mean'),
transactions=('revenue', 'count')
).reset_index()
print(monthly.tail(3))
Definition: Operation that divides a DataFrame into groups by values in one or more columns, applies a function to each group, then combines results. This pattern is called split-apply-combine.
Purpose: Calculate statistics or transformations separately for each data group (by category, period, region, etc.).
Why here: GroupBy is one of the most powerful Pandas operations. Instead of manually looping over each group, you describe the operation and Pandas applies it automatically, which is both faster and more readable. In the example: group by month, then calculate total revenue, average, and transaction count for each month.
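To make split-apply-combine concrete, a sketch comparing groupby with the manual loop it replaces (toy department data, not the sales DataFrame above):

```python
import pandas as pd

df = pd.DataFrame({'dept': ['Tech', 'Sales', 'Tech', 'HR', 'Sales'],
                   'salary': [50, 70, 60, 40, 80]})

# One expression: split by 'dept', apply sum, combine into a Series
by_group = df.groupby('dept')['salary'].sum()

# The manual loop it replaces
totals = {}
for _, row in df.iterrows():
    totals[row['dept']] = totals.get(row['dept'], 0) + row['salary']

print(by_group.to_dict() == totals)  # True
```

Beyond readability, groupby runs the aggregation in optimized code instead of iterating row by row in Python.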
sales_df.groupby('month').agg(...) first divides the 366 rows (2024 is a leap year) into 12 groups, one per month. Then, for each month, Pandas calculates the sum of revenues, the mean of revenues, and the transaction count. Finally, .reset_index() turns 'month' back into an ordinary column, so the result is a plain DataFrame with one row per month. Much more efficient than a manual Python loop.

# ── Pivot table ──
pivot = pd.pivot_table(
sales_df,
values='revenue',
index='region',
columns='product',
aggfunc='sum',
fill_value=0
)
print(pivot)
Definition: Reorganization of tabular data by rearranging rows and columns, with aggregation of values (sum, mean, count, etc.). Result is a table where one dimension becomes rows, another becomes columns, and the third is aggregated in cells.
Purpose: Summarize and cross two data dimensions to see patterns and comparisons quickly; useful for cross-analysis and reports.
Why here: Pivot vs GroupBy: GroupBy handles simple or complex summaries (total per month, with multiple aggregations); pivot_table crosses two dimensions and shows the result as a grid (revenue by region AND product). pivot_table calls GroupBy internally but offers a more intuitive interface for this specific use case.
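The "pivot_table calls GroupBy internally" claim can be illustrated on toy data: groupby on both keys plus unstack builds the same grid.

```python
import pandas as pd

df = pd.DataFrame({
    'region':  ['North', 'North', 'South', 'South', 'North'],
    'product': ['Laptop', 'Phone', 'Laptop', 'Phone', 'Phone'],
    'revenue': [100, 50, 80, 20, 30],
})

pivot = pd.pivot_table(df, values='revenue', index='region',
                       columns='product', aggfunc='sum', fill_value=0)

# The same grid built explicitly: group on both keys, then unstack one into columns
grid = df.groupby(['region', 'product'])['revenue'].sum().unstack(fill_value=0)

print(pivot.equals(grid))  # True here: every region/product pair is present
```

With missing combinations, fill_value keeps both results aligned; the dtypes can differ in some pandas versions, so comparing values is the robust check.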
fill_value=0 replaces missing region/product combinations with 0 (instead of NaN).

# ── Clean real data ──
dirty_df = pd.DataFrame({
'price': ['$1,299', 'N/A', '$899', '', '$1,499'],
'date': ['15/01/2024', '2024-01-20', 'invalid', '22/01/2024', '23-01-2024']
})
# Clean price
dirty_df['price_clean'] = (
dirty_df['price']
.str.replace('[$,]', '', regex=True)
.replace(['N/A', ''], pd.NA)
.astype('Float64')
)
# Analyze missing values
print(dirty_df.isnull().sum()) # Count per column
print(dirty_df.isnull().mean() * 100) # % missing values
# Imputation: replace NaN with the median
dirty_df['price_imputed'] = dirty_df['price_clean'].fillna(dirty_df['price_clean'].median())
.isnull().sum() counts NaN per column for diagnosis. .fillna() imputes missing values, here with the median, which is robust against outliers. This cleaning step can take 80% of the time in real Data Science work. Prefer Pandas' column-level operations (.apply(), the .str and .dt accessors) over manual element-by-element loops; the vectorized ones are typically 10-100x faster. For example, df['price'].str.replace(...) transforms the entire column in one optimized operation, not element-by-element.
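The 'date' column above is left uncleaned; a sketch for parsing its mixed formats (note: format='mixed' requires pandas 2.0+):

```python
import pandas as pd

dates = pd.Series(['15/01/2024', '2024-01-20', 'invalid',
                   '22/01/2024', '23-01-2024'])

# format='mixed' (pandas 2.0+) infers the format element by element;
# errors='coerce' turns unparseable entries into NaT instead of raising;
# dayfirst=True resolves ambiguous day/month orders as day first
clean = pd.to_datetime(dates, format='mixed', dayfirst=True, errors='coerce')
print(clean)
print(f"Unparseable dates: {clean.isna().sum()}")  # 1 ('invalid')
```

The NaT entries can then be diagnosed with .isna() and dropped or imputed, exactly like the price column.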