๐Ÿ—„๏ธ SQL / NoSQL ยท Intermediate

Advanced SQL and comparison with MongoDB

โฑ 40 minutes๐Ÿ˜ PostgreSQL 16๐Ÿƒ MongoDB 7

SQL remains essential for structured data. This tutorial covers advanced joins, aggregations, CTEs and indexes in PostgreSQL, then shows how to model the same data in MongoDB.

The demo schema

A simple e-commerce system: users, orders, products and order items.

SQL โ€” schema creation
-- Users
CREATE TABLE users (
    id SERIAL PRIMARY KEY,
    name VARCHAR(100) NOT NULL,
    email VARCHAR(255) UNIQUE NOT NULL,
    country VARCHAR(50) DEFAULT 'US',
    created_at TIMESTAMP DEFAULT NOW()
);

-- Product categories
CREATE TABLE categories (
    id SERIAL PRIMARY KEY,
    name VARCHAR(100) NOT NULL,
    parent_id INT REFERENCES categories(id)  -- Self-reference for hierarchy
);

-- Products
CREATE TABLE products (
    id SERIAL PRIMARY KEY,
    name VARCHAR(200) NOT NULL,
    price DECIMAL(10,2) NOT NULL,
    stock INT DEFAULT 0,
    category_id INT REFERENCES categories(id)
);

-- Orders
CREATE TABLE orders (
    id SERIAL PRIMARY KEY,
    user_id INT NOT NULL REFERENCES users(id),
    status VARCHAR(20) DEFAULT 'pending',
    created_at TIMESTAMP DEFAULT NOW(),
    shipped_at TIMESTAMP
);

-- Order items
CREATE TABLE order_items (
    id SERIAL PRIMARY KEY,
    order_id INT NOT NULL REFERENCES orders(id),
    product_id INT NOT NULL REFERENCES products(id),
    quantity INT NOT NULL,
    unit_price DECIMAL(10,2) NOT NULL  -- Price at time of purchase
);

1. Advanced joins

๐Ÿ“– Term: JOIN (INNER/LEFT/SELF)

Definition: Merging data from multiple tables in a single query, based on a relationship key.

Purpose: Fetch related data without making multiple sequential queries.

Why here: INNER JOIN returns only rows with matches in both tables. LEFT JOIN returns all rows from the left table, even without matches on the right (filled with NULL).

INNER JOIN: Only orders linked to users (user_id exists) are returned. LEFT JOIN: All users are returned, even those without orders (COUNT=0, total=0).
SQL โ€” joins
-- โ”€โ”€ INNER JOIN: orders with user details โ”€โ”€
SELECT
    o.id AS order_id,
    u.name AS customer,
    u.email,
    o.status,
    o.created_at,
    COUNT(oi.id) AS items_count,
    SUM(oi.quantity * oi.unit_price) AS total
FROM orders o
INNER JOIN users u ON o.user_id = u.id
INNER JOIN order_items oi ON oi.order_id = o.id
WHERE o.created_at >= NOW() - INTERVAL '30 days'
GROUP BY o.id, u.name, u.email, o.status, o.created_at
ORDER BY total DESC;

-- โ”€โ”€ LEFT JOIN: users WITH or WITHOUT orders โ”€โ”€
SELECT
    u.name,
    u.email,
    COUNT(o.id) AS total_orders,
    COALESCE(SUM(oi.quantity * oi.unit_price), 0) AS lifetime_value
FROM users u
LEFT JOIN orders o ON o.user_id = u.id AND o.status = 'completed'
LEFT JOIN order_items oi ON oi.order_id = o.id
GROUP BY u.id, u.name, u.email
ORDER BY lifetime_value DESC;

-- โ”€โ”€ SELF JOIN: parent/child categories โ”€โ”€
SELECT
    child.name AS subcategory,
    parent.name AS parent_category
FROM categories child
LEFT JOIN categories parent ON child.parent_id = parent.id
ORDER BY parent.name, child.name;

2. Aggregations and window functions

๐Ÿ“– Term: Window Function

Definition: Function that operates on a subset of rows (window) defined by PARTITION BY, while preserving individual rows (unlike GROUP BY which aggregates them).

Purpose: Calculate partial aggregations (rank, cumulative, previous value) without losing row-level details.

Why here: PARTITION BY divides rows into groups (e.g. by category), ORDER BY sorts them in each partition. RANK() ranks by descending sales, LAG() accesses the previous row.

RANK() vs ROW_NUMBER(): RANK() gives the same rank to ties (1, 1, 3), ROW_NUMBER() gives unique ranks (1, 2, 3). PARTITION BY c.id creates separate windows per category.
SQL โ€” window functions (PostgreSQL)
-- โ”€โ”€ Ranking best-selling products by category โ”€โ”€
SELECT
    c.name AS category,
    p.name AS product,
    SUM(oi.quantity) AS total_sold,
    -- RANK() โ€” same rank for ties
    RANK() OVER (
        PARTITION BY c.id
        ORDER BY SUM(oi.quantity) DESC
    ) AS rank_in_category,
    -- Cumulative revenue in category
    SUM(SUM(oi.quantity * oi.unit_price)) OVER (
        PARTITION BY c.id
        ORDER BY SUM(oi.quantity) DESC
    ) AS cumulative_revenue
FROM products p
JOIN categories c ON p.category_id = c.id
JOIN order_items oi ON oi.product_id = p.id
GROUP BY c.id, c.name, p.id, p.name
ORDER BY c.name, rank_in_category;

-- โ”€โ”€ Month-over-month revenue evolution with variance โ”€โ”€
WITH monthly_revenue AS (
    SELECT
        DATE_TRUNC('month', o.created_at) AS month,
        SUM(oi.quantity * oi.unit_price) AS revenue
    FROM orders o
    JOIN order_items oi ON oi.order_id = o.id
    WHERE o.status = 'completed'
    GROUP BY 1
)
SELECT
    month,
    revenue,
    LAG(revenue) OVER (ORDER BY month) AS prev_month_revenue,
    ROUND(
        (revenue - LAG(revenue) OVER (ORDER BY month))
        / LAG(revenue) OVER (ORDER BY month) * 100, 2
    ) AS growth_pct
FROM monthly_revenue
ORDER BY month;

3. Recursive CTEs

๐Ÿ“– Term: CTE (Common Table Expression) Recursive

Definition: Named query (WITH clause) that calls itself, enabling processing of hierarchical or graph structures.

Purpose: Traverse data trees (parent/child categories, org charts) without complex application code.

Why here: A recursive CTE has two parts: the base case (root categories) and the recursive case (children of each parent).

Base case: SELECT all categories where parent_id IS NULL (roots), depth=0. Recursive case: JOIN the CTE with categories to find children, increment depth. Recursion stops when no new rows are found.
SQL โ€” recursive CTE for hierarchy
-- โ”€โ”€ Traverse a category hierarchy โ”€โ”€
WITH RECURSIVE category_tree AS (
    -- Base case: root categories (no parent)
    SELECT id, name, parent_id, 0 AS depth, name::text AS path
    FROM categories
    WHERE parent_id IS NULL

    UNION ALL

    -- Recursive case: child categories
    SELECT c.id, c.name, c.parent_id, ct.depth + 1,
           (ct.path || ' > ' || c.name)::text
    FROM categories c
    JOIN category_tree ct ON c.parent_id = ct.id
)
SELECT
    REPEAT('  ', depth) || name AS indented_name,
    depth,
    path
FROM category_tree
ORDER BY path;

4. Indexes and performance

๐Ÿ“– Term: Index (B-tree / GIN)

Definition: Data structure optimized for fast search. B-tree indexes scalar columns, GIN indexes complex types (JSONB, arrays, full-text).

Purpose: Accelerate filters (WHERE), avoid full table scans (Seq Scan).

Why here: Without an index, PostgreSQL scans all rows (Seq Scan). With an index, it uses a tree structure to find matching rows in O(log n).

GIN = Generalized Inverted Index, ideal for full-text search: to_tsvector() tokenizes text, GIN indexes tokens, @@ searches matching tokens. Partial index: WHERE status IN ('pending', ...) reduces size by excluding completed orders (rarely filtered).
SQL โ€” strategic indexing
-- โ”€โ”€ Analyze a slow query โ”€โ”€
EXPLAIN ANALYZE
SELECT * FROM orders
WHERE user_id = 42 AND status = 'pending';

-- Before index: Seq Scan cost=0..1234 rows=15000

-- โ”€โ”€ Create a composite index โ”€โ”€
CREATE INDEX idx_orders_user_status ON orders(user_id, status);
-- Partial index: only unprocessed orders
CREATE INDEX idx_orders_pending ON orders(user_id)
    WHERE status IN ('pending', 'processing');

-- After index: Index Scan cost=0..8 rows=3

-- โ”€โ”€ Index for full-text search โ”€โ”€
CREATE INDEX idx_products_search ON products
    USING GIN(to_tsvector('english', name));

SELECT name, price
FROM products
WHERE to_tsvector('english', name) @@ to_tsquery('english', 'laptop & portable')
ORDER BY ts_rank(to_tsvector('english', name), to_tsquery('english', 'laptop & portable')) DESC;

-- โ”€โ”€ Partial unique index โ”€โ”€
CREATE UNIQUE INDEX idx_users_active_email ON users(email)
    WHERE deleted_at IS NULL;  -- Email unique only for active accounts

5. The same model in MongoDB

๐Ÿ“– Term: Denormalization

Definition: Intentional copying of related data into the same document, instead of referencing it.

Purpose: Avoid costly JOINs (which don't exist in MongoDB), reduce queries, improve read performance.

Why here: Tradeoff: denormalization favors reads (one query = one complete document) but complicates updates (update user.name in all orders if name changes).

๐Ÿ“– Term: MongoDB aggregation pipeline

Definition: Series of transformation stages applied sequentially to documents.

Purpose: Filter, group, transform and sort data without client-side processing.

Why here: Each stage $match, $group, $sort produces a stream of documents feeding the next stage.

In MongoDB, data is denormalized to avoid costly joins:

MongoDB โ€” denormalized model
// โ”€โ”€ Order document (complete order) โ”€โ”€
{
  _id: ObjectId("..."),
  // User data copied at purchase time
  user: {
    id: "user_42",
    name: "Alderi KAMTCHOUA",
    email: "alderi@example.com"
  },
  status: "completed",
  // Products included in order (no join needed)
  items: [
    {
      product: {
        id: "prod_1",
        name: "MacBook Pro 16\"",
        sku: "MBP-16-M3"
      },
      quantity: 1,
      unitPrice: 2499.99
    },
    {
      product: { id: "prod_5", name: "Magic Mouse" },
      quantity: 2,
      unitPrice: 79.99
    }
  ],
  totalAmount: 2659.97,
  createdAt: ISODate("2025-01-15T10:30:00Z")
}

// โ”€โ”€ MongoDB aggregation: monthly revenue โ”€โ”€
db.orders.aggregate([
  { $match: { status: "completed" } },
  {
    $group: {
      _id: {
        year: { $year: "$createdAt" },
        month: { $month: "$createdAt" }
      },
      totalRevenue: { $sum: "$totalAmount" },
      orderCount: { $sum: 1 },
      avgOrderValue: { $avg: "$totalAmount" }
    }
  },
  { $sort: { "_id.year": -1, "_id.month": -1 } }
]);

// โ”€โ”€ MongoDB indexes โ”€โ”€
db.orders.createIndex({ "user.id": 1, status: 1 });
db.orders.createIndex({ createdAt: -1 });
db.orders.createIndex({ "items.product.id": 1 });
Denormalization: user and items are nested in orders, no joins. At order creation time, copy user.name and items[].product.name. If the user updates their name, old orders keep the original name (snapshot at purchase time).

SQL vs NoSQL: when to choose what?

๐Ÿ“– Term: Foreign Key

Definition: Column in one table that references the primary key of another table, enforcing referential integrity.

Purpose: Guarantee that each order.user_id corresponds to an existing user.id, prevent orphaned data.

Why here: SQL enforces foreign keys at the database level. MongoDB, without mandatory schema, cannot enforce this (application code responsibility).

Comparison
# PostgreSQL / SQL
โœ… Structured data with complex relationships
โœ… Data integrity (ACID, constraints, foreign keys)
โœ… Reports and analytics with JOINs and aggregations
โœ… Multi-table transactions
โœ… Stable, well-defined schema
โŒ Horizontal scalability is complex
โŒ Rigid schema (migrations required)

# MongoDB / NoSQL
โœ… Semi-structured data (flexible JSON documents)
โœ… Native horizontal scalability (sharding)
โœ… Rapid development (no migrations)
โœ… Natural hierarchies and nested lists
โœ… High availability with replica sets
โŒ No native joins โ†’ denormalization required
โŒ No multi-document ACID (since v4.0 only)