SQL Deep Dive

Beyond basic CRUD: Query optimization, execution plans, window functions, transactions, and interview patterns.

1. How Queries Execute

Understanding the query lifecycle helps you write better queries and debug performance issues.

flowchart LR
    subgraph "Query Lifecycle"
        A[SQL Query] --> B[Parser]
        B --> C[Analyzer]
        C --> D[Optimizer]
        D --> E[Executor]
        E --> F[Results]
    end
    B -->|Syntax Tree| C
    C -->|Logical Plan| D
    D -->|Physical Plan| E
    style A fill:#5e81ac,color:#eceff4
    style B fill:#88c0d0,color:#2e3440
    style C fill:#88c0d0,color:#2e3440
    style D fill:#a3be8c,color:#2e3440
    style E fill:#ebcb8b,color:#2e3440
    style F fill:#a3be8c,color:#2e3440

Stage 1: Parser

Converts SQL text into an Abstract Syntax Tree (AST). Checks syntax but not semantics. SELECT * FORM users fails here (FORM vs FROM).

Stage 2: Analyzer (Binder)

Resolves table/column names, checks permissions, validates types. SELECT * FROM nonexistent_table fails here.

Stage 3: Optimizer

The most complex and important stage. The optimizer:

  • Generates many candidate physical plans (join orders, scan types, join algorithms)
  • Estimates the cost of each candidate using table statistics (row counts, value distributions)
  • Picks the plan with the lowest estimated cost, regardless of how the SQL was written

The optimizer might choose a "worse-looking" plan because it has better statistics. This is why ANALYZE (updating statistics) can fix slow queries.

Stage 4: Executor

Executes the physical plan using a pull-based iterator model:

Project (id, name)
        │
        ▼
Filter (age > 21)
        │
        ▼
Seq Scan (users)
        │
        ▼
  [Disk Pages]

Each operator "pulls" rows from child operators on demand. No intermediate materialization unless necessary.
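The pull-based model can be sketched in a few lines of Python, with each operator implemented as a generator that pulls from its child. This is a minimal illustration, not a real executor; the table and rows are made up.

```python
def seq_scan(rows):
    for row in rows:                  # yields one row at a time
        yield row

def filter_op(child, predicate):
    for row in child:                 # pulls from child only on demand
        if predicate(row):
            yield row

def project(child, columns):
    for row in child:
        yield {c: row[c] for c in columns}

users = [
    {"id": 1, "name": "Ada", "age": 30},
    {"id": 2, "name": "Bob", "age": 18},
]

# Project <- Filter <- Seq Scan, exactly like the operator tree above
plan = project(filter_op(seq_scan(users), lambda r: r["age"] > 21), ["id", "name"])
print(list(plan))  # [{'id': 1, 'name': 'Ada'}]
```

Nothing runs until the top operator is consumed, which is why a `LIMIT 1` can stop a deep plan after producing a single row.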

2. Indexing Deep Dive

B-Tree Index (Default)

The workhorse of database indexing. Keeps data sorted for efficient range queries.

             [ 50 | 100 ]                 <- Root (1 page)
            /     |      \
     [10|30]   [60|80]   [120|150]        <- Internal (1 page each)
     /  |  \   /  |  \    /   |   \
[1-9][11-29][31-49] ...                   <- Leaf nodes (data/pointers)
B-Trees have O(log N) lookup because each level reduces the search space. With a branching factor of 100 and 3 levels, you can index 1 million rows (100 × 100 × 100) with just 3 page reads.
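The page-read arithmetic can be checked directly. A minimal sketch, assuming (unrealistically) that every page holds exactly `fanout` keys:

```python
def btree_levels(rows, fanout=100):
    """Tree height = pages read per lookup, for a uniform fanout."""
    levels, capacity = 1, fanout
    while capacity < rows:        # each extra level multiplies capacity by fanout
        capacity *= fanout
        levels += 1
    return levels

print(btree_levels(1_000_000))    # 3
print(btree_levels(100_000_000))  # 4
```

This is why B-Trees stay shallow: going from 1 million to 100 million rows costs only one extra page read.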

When B-Tree Helps

-- Equality
SELECT * FROM users WHERE id = 42;

-- Range queries
SELECT * FROM orders WHERE created_at BETWEEN '2024-01-01' AND '2024-12-31';

-- Prefix matching
SELECT * FROM users WHERE name LIKE 'John%';

-- Sorting
SELECT * FROM products ORDER BY price LIMIT 10;

When B-Tree Doesn't Help

-- Leading wildcard (must scan all)
SELECT * FROM users WHERE name LIKE '%smith';

-- Functions on indexed column
SELECT * FROM users WHERE LOWER(email) = 'john@example.com';

-- OR conditions on different columns
SELECT * FROM users WHERE status = 'active' OR role = 'admin';

-- NOT conditions
SELECT * FROM users WHERE status != 'deleted';

Composite (Multi-Column) Indexes

Column order matters! Index on (country, city) is useless for WHERE city = 'Paris'.
-- Index on (country, city, zipcode)

-- Uses full index (all 3 columns)
WHERE country = 'USA' AND city = 'NYC' AND zipcode = '10001'

-- Uses first 2 columns
WHERE country = 'USA' AND city = 'NYC'

-- Uses first column only
WHERE country = 'USA'

-- CANNOT use index (missing leftmost column)
WHERE city = 'NYC'
Ordering rule: Put equality conditions first, then range conditions. Index on (status, created_at) works well for WHERE status = 'active' AND created_at > '2024-01-01'.

Covering Indexes

When an index contains ALL columns needed by a query, it's a "covering index" - no table access needed.

-- Query
SELECT email, name FROM users WHERE status = 'active';

-- Covering index - includes all columns in query
CREATE INDEX idx_users_status_covering
ON users(status) INCLUDE (email, name);

-- PostgreSQL will show "Index Only Scan" in EXPLAIN
Covering indexes avoid the "bookmark lookup" - the expensive random I/O to fetch remaining columns from the heap. This can be 10-100x faster for large tables.

Partial (Filtered) Indexes

Index only a subset of rows. Smaller index = faster updates + fits in memory.

-- Only 5% of orders are pending, but we query them constantly
CREATE INDEX idx_orders_pending
ON orders(created_at)
WHERE status = 'pending';

-- Query must include the filter condition
SELECT * FROM orders
WHERE status = 'pending' AND created_at > NOW() - INTERVAL '1 day';

Hash Indexes

-- Only for equality comparisons, O(1) lookup
CREATE INDEX idx_users_email_hash ON users USING hash(email);

-- Good for
WHERE email = 'john@example.com'

-- Cannot use for ranges, ordering, or prefix matching

Index Types Comparison

| Index Type | Best For | Limitations |
|---|---|---|
| B-Tree | Equality, ranges, sorting, prefix | Not great for full-text or spatial |
| Hash | Exact equality lookups | No ranges, no ordering |
| GiST | Geometric, full-text, ranges | More complex, slower updates |
| GIN | Arrays, JSONB, full-text | Expensive to update |
| BRIN | Very large tables with natural ordering | Only for correlated data |

3. Reading EXPLAIN Plans

EXPLAIN ANALYZE
SELECT u.name, COUNT(o.id)
FROM users u
JOIN orders o ON u.id = o.user_id
WHERE u.created_at > '2024-01-01'
GROUP BY u.id;
HashAggregate (cost=1234..1240 rows=100) (actual time=45.2..45.8 rows=95)
  Group Key: u.id
  -> Hash Join (cost=100..1200 rows=5000) (actual time=10.1..40.2 rows=4800)
       Hash Cond: (o.user_id = u.id)
       -> Seq Scan on orders o (cost=0..800 rows=50000) (actual time=0.01..15.3 rows=50000)
       -> Hash (cost=80..80 rows=100) (actual time=5.2..5.2 rows=95)
            Buckets: 1024
            -> Index Scan on users u (cost=0..80 rows=100) (actual time=0.02..4.8 rows=95)
                 Index Cond: (created_at > '2024-01-01')

Key Metrics to Watch

| Metric | Meaning | Red Flag |
|---|---|---|
| cost=X..Y | Startup cost..Total cost (arbitrary units) | Very high numbers |
| rows=N | Estimated rows to process | Estimate ≠ Actual (bad stats) |
| actual time | Real execution time (ms) | High times on simple ops |
| loops=N | Times this node executed | High loops = nested loop issue |

Scan Types (Worst to Best)

Seq Scan

Reads entire table. Bad for large tables with selective conditions.

-- Usually bad
Seq Scan on users
  Filter: (status = 'active')
  Rows Removed by Filter: 999000

Index Scan

Uses index, then fetches rows from table (random I/O).

-- Good for selective queries
Index Scan using idx_users_status
  Index Cond: (status = 'active')

Index Only Scan

All data from index - no table access. Best possible.

-- Best - covering index
Index Only Scan using idx_covering
  Index Cond: (status = 'active')
  Heap Fetches: 0

Bitmap Index Scan

Collects row locations, sorts, then fetches. Good for medium selectivity.

-- Good for OR or multiple indexes
Bitmap Heap Scan on users
  -> BitmapOr
     -> Bitmap Index Scan
     -> Bitmap Index Scan

Common Performance Issues

Row estimate mismatch: rows=100 actual rows=100000
Fix: Run ANALYZE tablename; to update statistics.
Nested Loop with high loops: loops=10000 on a Seq Scan
Fix: Add an index on the join column or use a different join strategy.
Sort with high memory: Sort Method: external merge Disk
Fix: Increase work_mem or add an index to avoid sorting.

4. Join Algorithms

The optimizer chooses join algorithms based on table sizes, indexes, and available memory.

Nested Loop Join

For each row in outer table:           O(N)
    For each row in inner table:       O(M)
        If join condition matches:
            Output row

Complexity: O(N × M) - terrible for large tables!
Best when: Outer table is small AND inner table has an index on join column. The index turns the inner loop from O(M) to O(log M).
-- Good for nested loop: small outer, indexed inner
SELECT * FROM orders o          -- 100 rows (filtered)
JOIN products p ON o.product_id = p.id  -- Index on p.id
WHERE o.user_id = 42;
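In Python terms, the index on the inner table is what saves the nested loop. A minimal sketch (tables and rows are made up; a dict stands in for the B-tree):

```python
orders = [{"id": 1, "product_id": 10}, {"id": 2, "product_id": 20}]
products = [{"id": 10, "name": "keyboard"}, {"id": 20, "name": "mouse"}]

# The "index" on products.id: turns the inner O(M) scan into an O(1) lookup
index_on_id = {p["id"]: p for p in products}

result = []
for o in orders:                              # small outer table: O(N)
    p = index_on_id.get(o["product_id"])      # indexed inner lookup
    if p is not None:
        result.append({**o, "name": p["name"]})
print(result)
```

Without `index_on_id`, the inner side would be a full scan of `products` per outer row: the O(N × M) worst case.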

Hash Join

Phase 1 - Build (smaller table):
    Create hash table on join key      O(N)

Phase 2 - Probe (larger table):
    For each row in larger table:      O(M)
        Look up in hash table          O(1)
        If found, output row

Complexity: O(N + M) - linear!
Best when: No useful indexes, tables fit in memory, equality joins only. Cannot use hash join for range conditions like a.date > b.date.
flowchart LR
    subgraph "Build Phase"
        A[Small Table] --> B[Hash Table]
    end
    subgraph "Probe Phase"
        C[Large Table] --> D{Hash Lookup}
        D --> E[Output Matches]
    end
    B --> D
    style A fill:#88c0d0,color:#2e3440
    style B fill:#a3be8c,color:#2e3440
    style C fill:#5e81ac,color:#eceff4
    style D fill:#ebcb8b,color:#2e3440
    style E fill:#a3be8c,color:#2e3440
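The build/probe phases fit in a few lines of Python. A minimal sketch assuming an equality join key (table data is invented):

```python
def hash_join(small, large, key):
    # Build phase: hash the smaller input on the join key - O(N)
    table = {}
    for row in small:
        table.setdefault(row[key], []).append(row)
    # Probe phase: stream the larger input with O(1) lookups - O(M)
    for row in large:
        for match in table.get(row[key], []):
            yield {**match, **row}

users = [{"user_id": 1, "name": "Ada"}]
orders = [{"user_id": 1, "total": 30}, {"user_id": 2, "total": 99}]
print(list(hash_join(users, orders, "user_id")))
# [{'user_id': 1, 'name': 'Ada', 'total': 30}]
```

Note that equality is baked in: the dict lookup has no notion of `>` or `<`, which is exactly why hash joins can't handle range conditions.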

Merge Join (Sort-Merge)

Phase 1 - Sort both tables by join key   O(N log N) + O(M log M)
    (Skip if already sorted via index)

Phase 2 - Merge:
    Advance pointers in sorted order     O(N + M)
    Output matching rows

Complexity: O(N log N + M log M), or O(N + M) if pre-sorted
Best when: Both tables have indexes on join column (pre-sorted), or when result needs to be sorted anyway.
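The merge phase itself is just two advancing pointers. A simplified sketch assuming both inputs are pre-sorted by the key and left-side keys are unique (real implementations also handle duplicate left keys by backtracking):

```python
def merge_join(left, right, key):
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk == rk:
            out.append({**left[i], **right[j]})
            j += 1          # keep the left row: more right matches may follow
        elif lk < rk:
            i += 1          # left key too small, advance left
        else:
            j += 1          # right key too small, advance right
    return out

left = [{"k": 1, "a": "x"}, {"k": 3, "a": "y"}]
right = [{"k": 1, "b": 10}, {"k": 1, "b": 20}, {"k": 2, "b": 30}]
print(merge_join(left, right, "k"))
# [{'k': 1, 'a': 'x', 'b': 10}, {'k': 1, 'a': 'x', 'b': 20}]
```

Each pointer only moves forward, which is where the O(N + M) merge cost comes from.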

Join Algorithm Comparison

| Algorithm | Time | Memory | Best When |
|---|---|---|---|
| Nested Loop | O(N × M) | O(1) | Small outer + indexed inner |
| Hash Join | O(N + M) | O(N) | No indexes, equality only |
| Merge Join | O(N + M)* | O(1) | Pre-sorted inputs, range joins |

*Plus sort cost if not pre-sorted

5. Window Functions

Window functions compute values across rows related to the current row without collapsing rows like GROUP BY.

flowchart LR
    subgraph "GROUP BY"
        A[10 rows] --> B[3 groups]
    end
    subgraph "Window Function"
        C[10 rows] --> D[10 rows + computed values]
    end
    style A fill:#5e81ac,color:#eceff4
    style B fill:#bf616a,color:#eceff4
    style C fill:#5e81ac,color:#eceff4
    style D fill:#a3be8c,color:#2e3440

Anatomy of a Window Function

function_name(...) OVER (
    PARTITION BY column      -- Groups rows (like GROUP BY but keeps all rows)
    ORDER BY column          -- Defines row order within partition
    ROWS BETWEEN ... AND ... -- Frame: which rows to include
)

Ranking Functions

SELECT
    name,
    department,
    salary,
    ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) AS row_num,
    RANK()       OVER (PARTITION BY department ORDER BY salary DESC) AS rank,
    DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS dense_rank
FROM employees;
| name | department | salary | row_num | rank | dense_rank |
|---|---|---|---|---|---|
| Alice | Engineering | 150k | 1 | 1 | 1 |
| Bob | Engineering | 150k | 2 | 1 | 1 |
| Carol | Engineering | 120k | 3 | 3 | 2 |
| Dave | Sales | 100k | 1 | 1 | 1 |
ROW_NUMBER: Always unique (1,2,3,4). Good for pagination.
RANK: Same values get same rank, gaps after ties (1,1,3). Good for competitions.
DENSE_RANK: Same values get same rank, no gaps (1,1,2). Good for "top N distinct values".
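The tie-handling difference is easy to verify in plain Python. A sketch over one partition, already ordered by salary DESC (values invented):

```python
def ranks(values):
    """Return (row_number, rank, dense_rank) lists for an ordered sequence."""
    row_number, rank, dense_rank = [], [], []
    for i, v in enumerate(values):
        row_number.append(i + 1)                  # always increments
        if i > 0 and v == values[i - 1]:
            rank.append(rank[-1])                 # tie: repeat rank, gap later
            dense_rank.append(dense_rank[-1])     # tie: repeat rank, no gap
        else:
            rank.append(i + 1)                    # jumps past ties
            dense_rank.append((dense_rank[-1] if dense_rank else 0) + 1)
    return row_number, rank, dense_rank

print(ranks([150, 150, 120]))  # ([1, 2, 3], [1, 1, 3], [1, 1, 2])
```

That output matches the Engineering rows in the table above: Carol gets rank 3 (gap after the tie) but dense_rank 2.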

Offset Functions: LAG and LEAD

-- Compare each sale to the previous day's sale
SELECT
    date,
    revenue,
    LAG(revenue, 1) OVER (ORDER BY date) AS prev_day_revenue,
    revenue - LAG(revenue, 1) OVER (ORDER BY date) AS day_over_day_change,
    LEAD(revenue, 1) OVER (ORDER BY date) AS next_day_revenue
FROM daily_sales;
| date | revenue | prev_day | change | next_day |
|---|---|---|---|---|
| Jan 1 | 1000 | NULL | NULL | 1200 |
| Jan 2 | 1200 | 1000 | +200 | 1100 |
| Jan 3 | 1100 | 1200 | -100 | NULL |

Running Totals and Moving Averages

SELECT
    date,
    revenue,
    -- Running total
    SUM(revenue) OVER (
        ORDER BY date
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS running_total,

    -- 7-day moving average
    AVG(revenue) OVER (
        ORDER BY date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS moving_avg_7d
FROM daily_sales;

Frame Specifications

ROWS BETWEEN:

    UNBOUNDED PRECEDING ─────┐
    N PRECEDING ─────────────┤
    CURRENT ROW ─────────────┼──── Frame includes these rows
    N FOLLOWING ─────────────┤
    UNBOUNDED FOLLOWING ─────┘

Examples:
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW    = All rows up to current
    ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING            = 7-row window
    ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING    = Current row to end

FIRST_VALUE, LAST_VALUE, NTH_VALUE

SELECT
    department,
    name,
    salary,
    FIRST_VALUE(name) OVER w AS highest_paid,
    LAST_VALUE(name) OVER w AS lowest_paid,
    salary - FIRST_VALUE(salary) OVER w AS diff_from_top
FROM employees
WINDOW w AS (
    PARTITION BY department
    ORDER BY salary DESC
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
);
LAST_VALUE has a gotcha: the default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so LAST_VALUE returns the current row's value (or its ORDER BY peers'), not the partition's actual last value. Always specify ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING explicitly, as above.

6. CTEs and Recursive Queries

Common Table Expressions (WITH clause)

WITH
    monthly_totals AS (
        SELECT
            DATE_TRUNC('month', order_date) AS month,
            SUM(total) AS revenue
        FROM orders
        GROUP BY 1
    ),
    with_growth AS (
        SELECT
            month,
            revenue,
            LAG(revenue) OVER (ORDER BY month) AS prev_revenue
        FROM monthly_totals
    )
SELECT
    month,
    revenue,
    ROUND((revenue - prev_revenue) / prev_revenue * 100, 2) AS growth_pct
FROM with_growth;
CTEs improve readability by naming intermediate results. In PostgreSQL 11 and earlier they were always materialized (evaluated once); since PostgreSQL 12 the planner may inline them, and you can force either behavior with MATERIALIZED / NOT MATERIALIZED. Materialization can help or hurt performance.

Recursive CTEs - Hierarchical Data

-- Employee org chart: find all reports under a manager
WITH RECURSIVE org_chart AS (
    -- Base case: the starting manager
    SELECT id, name, manager_id, 0 AS depth
    FROM employees
    WHERE id = 1  -- CEO

    UNION ALL

    -- Recursive case: find direct reports
    SELECT e.id, e.name, e.manager_id, oc.depth + 1
    FROM employees e
    JOIN org_chart oc ON e.manager_id = oc.id
)
SELECT * FROM org_chart;
flowchart TD
    subgraph "Iteration 0"
        A[CEO id=1]
    end
    subgraph "Iteration 1"
        B[VP Eng id=2]
        C[VP Sales id=3]
    end
    subgraph "Iteration 2"
        D[Engineer id=4]
        E[Engineer id=5]
        F[Sales Rep id=6]
    end
    A --> B
    A --> C
    B --> D
    B --> E
    C --> F
    style A fill:#88c0d0,color:#2e3440
    style B fill:#a3be8c,color:#2e3440
    style C fill:#a3be8c,color:#2e3440
    style D fill:#ebcb8b,color:#2e3440
    style E fill:#ebcb8b,color:#2e3440
    style F fill:#ebcb8b,color:#2e3440

Recursive CTE - Graph Traversal

-- Find all paths from node A to node F
WITH RECURSIVE paths AS (
    -- Start at node A
    SELECT
        source,
        target,
        ARRAY[source, target] AS path,
        weight
    FROM edges
    WHERE source = 'A'

    UNION ALL

    -- Extend paths
    SELECT
        p.source,
        e.target,
        p.path || e.target,
        p.weight + e.weight
    FROM paths p
    JOIN edges e ON p.target = e.source
    WHERE NOT e.target = ANY(p.path)  -- Prevent cycles
)
SELECT path, weight
FROM paths
WHERE target = 'F'
ORDER BY weight;
Recursive CTEs can run forever without a termination condition. Always include cycle detection (NOT x = ANY(path)) or a depth limit (WHERE depth < 10).

7. Transactions & Isolation Levels

ACID Properties

| Property | Meaning | Example |
|---|---|---|
| Atomicity | All or nothing | Transfer fails midway → both accounts unchanged |
| Consistency | Valid state to valid state | Balance never goes negative (if constrained) |
| Isolation | Concurrent txns don't interfere | Two transfers don't overwrite each other |
| Durability | Committed = permanent | Power loss after COMMIT → data survives |

Isolation Levels and Anomalies

flowchart LR
    subgraph "More Isolation →"
        A[Read Uncommitted] --> B[Read Committed]
        B --> C[Repeatable Read]
        C --> D[Serializable]
    end
    style A fill:#bf616a,color:#eceff4
    style B fill:#d08770,color:#2e3440
    style C fill:#ebcb8b,color:#2e3440
    style D fill:#a3be8c,color:#2e3440

Read Phenomena

Dirty Read

Reading uncommitted changes from another transaction.

T1: UPDATE balance = 0
T2: SELECT balance → 0  # Dirty!
T1: ROLLBACK
# T2 saw data that never existed

Non-Repeatable Read

Same query returns different results within one transaction.

T1: SELECT balance → 100
T2: UPDATE balance = 50
T2: COMMIT
T1: SELECT balance → 50  # Changed!

Phantom Read

New rows appear that match a previous query's conditions.

T1: SELECT COUNT(*) WHERE age>21 → 5
T2: INSERT (age=25)
T2: COMMIT
T1: SELECT COUNT(*) WHERE age>21 → 6

Lost Update

Two transactions overwrite each other's changes.

T1: SELECT balance → 100
T2: SELECT balance → 100
T1: UPDATE balance = 100 + 50
T2: UPDATE balance = 100 + 30
# Expected: 180, Actual: 130

Isolation Level Matrix

| Level | Dirty Read | Non-Repeatable | Phantom | Use Case |
|---|---|---|---|---|
| Read Uncommitted | Possible | Possible | Possible | Almost never used |
| Read Committed | Prevented | Possible | Possible | PostgreSQL default |
| Repeatable Read | Prevented | Prevented | Varies* | MySQL default (InnoDB) |
| Serializable | Prevented | Prevented | Prevented | Financial systems |

*PostgreSQL's Repeatable Read prevents phantoms; MySQL's does not.

MVCC - How Isolation Works

Multi-Version Concurrency Control:

┌──────────────────────────────────────────────────────────────┐
│ Row: id=1                                                    │
│                                                              │
│ Version 1: balance=100 │ created: txn_10 │ expired: txn_15   │
│ Version 2: balance=150 │ created: txn_15 │ expired: txn_20   │
│ Version 3: balance=200 │ created: txn_20 │ expired: -        │
│                                                              │
│ Transaction 18 (Repeatable Read) sees: Version 2 (150)       │
│ Transaction 25 (any level) sees:       Version 3 (200)       │
└──────────────────────────────────────────────────────────────┘
MVCC allows readers and writers to not block each other. Each transaction sees a consistent snapshot based on when it started. Old versions are cleaned up by VACUUM.

8. Locking & Concurrency

Lock Types

| Lock | Acquired By | Blocks |
|---|---|---|
| ACCESS SHARE | SELECT | ACCESS EXCLUSIVE only |
| ROW SHARE | SELECT FOR UPDATE | EXCLUSIVE, ACCESS EXCLUSIVE |
| ROW EXCLUSIVE | INSERT, UPDATE, DELETE | SHARE, EXCLUSIVE, ACCESS EXCLUSIVE |
| SHARE | CREATE INDEX (non-concurrent) | ROW EXCLUSIVE, EXCLUSIVE, ACCESS EXCLUSIVE |
| ACCESS EXCLUSIVE | ALTER TABLE, DROP, VACUUM FULL | Everything |

SELECT FOR UPDATE - Pessimistic Locking

BEGIN;

-- Lock the row, preventing other transactions from modifying it
SELECT balance FROM accounts
WHERE id = 1
FOR UPDATE;

-- Safe to update - no one else can change it
UPDATE accounts SET balance = balance - 100 WHERE id = 1;

COMMIT;

FOR UPDATE Variants

-- Wait for lock (default)
SELECT * FROM orders WHERE id = 1 FOR UPDATE;

-- Error immediately if locked
SELECT * FROM orders WHERE id = 1 FOR UPDATE NOWAIT;

-- Skip locked rows (great for job queues)
SELECT * FROM jobs WHERE status = 'pending'
LIMIT 1 FOR UPDATE SKIP LOCKED;

Optimistic Locking

-- Add version column to table
ALTER TABLE products ADD COLUMN version INT DEFAULT 1;

-- Read current state
SELECT id, name, price, version FROM products WHERE id = 42;
-- Returns: version = 5

-- Update only if version unchanged
UPDATE products
SET price = 99.99, version = version + 1
WHERE id = 42 AND version = 5;

-- If 0 rows affected → someone else modified → retry
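The retry loop around that UPDATE looks like this in application code. A runnable sketch with an in-memory stand-in for the products table (class and method names here are illustrative, not a real DB API):

```python
class FakeTable:
    """In-memory stand-in for a table with a version column."""
    def __init__(self, rows):
        self.rows = {r["id"]: dict(r) for r in rows}

    def get(self, id):
        return dict(self.rows[id])

    def update_if_version(self, id, price, expected_version):
        """Mimics: UPDATE ... SET version = version + 1 WHERE id = ? AND version = ?"""
        row = self.rows[id]
        if row["version"] != expected_version:
            return 0                    # 0 rows affected: someone else won the race
        row["price"] = price
        row["version"] += 1
        return 1

def set_price(table, id, price, retries=3):
    for _ in range(retries):
        current = table.get(id)         # read current state + version
        if table.update_if_version(id, price, current["version"]) == 1:
            return True                 # version unchanged: our write stuck
        # else: re-read and retry
    return False

products = FakeTable([{"id": 42, "price": 120.0, "version": 5}])
print(set_price(products, 42, 99.99))   # True
print(products.get(42)["version"])      # 6
```

The key property: no locks are held between the read and the write, so concurrent readers are never blocked; conflicts surface as a failed version check instead.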

Pessimistic (FOR UPDATE)

  • Lock upfront, then modify
  • Guarantees success
  • Blocks other transactions
  • Best for: High contention

Optimistic (Version)

  • Read, modify, check version
  • May need to retry
  • No blocking
  • Best for: Low contention, read-heavy

Deadlocks

Transaction 1:                    Transaction 2:
─────────────                     ─────────────
LOCK row A                        LOCK row B
    ↓                                 ↓
Waiting for B ──────────────────► Waiting for A
              ← DEADLOCK! →
Prevention: Always acquire locks in a consistent order (e.g., by ID ascending). If you need to lock rows 5 and 3, always lock 3 first.
-- Lock multiple rows in consistent order
SELECT * FROM accounts
WHERE id IN (3, 5, 7)
ORDER BY id  -- Consistent ordering!
FOR UPDATE;

9. Interview SQL Patterns

Pattern 1: Top N Per Group

Problem: Get top 3 highest paid employees per department.

-- Using window function (most common approach)
WITH ranked AS (
    SELECT
        *,
        ROW_NUMBER() OVER (
            PARTITION BY department_id
            ORDER BY salary DESC
        ) AS rn
    FROM employees
)
SELECT * FROM ranked WHERE rn <= 3;

-- Alternative: LATERAL join (PostgreSQL)
SELECT d.name, e.*
FROM departments d
CROSS JOIN LATERAL (
    SELECT * FROM employees
    WHERE department_id = d.id
    ORDER BY salary DESC
    LIMIT 3
) e;

Pattern 2: Running Total / Cumulative Sum

SELECT
    date,
    amount,
    SUM(amount) OVER (
        ORDER BY date
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS running_total
FROM transactions;

Pattern 3: Gaps and Islands

Problem: Find consecutive date ranges from sparse data.

-- Find islands of consecutive dates
WITH grouped AS (
    SELECT
        date,
        date - (ROW_NUMBER() OVER (ORDER BY date))::int AS grp
    FROM events
)
SELECT
    MIN(date) AS island_start,
    MAX(date) AS island_end,
    COUNT(*) AS consecutive_days
FROM grouped
GROUP BY grp
ORDER BY island_start;
Dates:    Jan 1   Jan 2   Jan 3   Jan 5   Jan 6   Jan 10
Row Num:  1       2       3       4       5       6
Subtract: Dec 31  Dec 31  Dec 31  Jan 1   Jan 1   Jan 4
          ─────────────────────   ─────────────   ──────
          Island 1 (3 days)       Island 2 (2d)   Island 3
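The date-minus-row-number trick is worth checking by hand once. A Python sketch of the same computation (dates taken from the trace above):

```python
from datetime import date, timedelta
from itertools import groupby

dates = [date(2024, 1, d) for d in (1, 2, 3, 5, 6, 10)]

def islands(ds):
    # Consecutive dates share the same (date - row_number) anchor value
    keyed = [(d - timedelta(days=i + 1), d) for i, d in enumerate(ds)]
    out = []
    for _, grp in groupby(keyed, key=lambda t: t[0]):   # group by anchor
        run = [d for _, d in grp]
        out.append((run[0], run[-1], len(run)))         # (start, end, days)
    return out

print(islands(dates))
# three islands: Jan 1-Jan 3 (3 days), Jan 5-Jan 6 (2), Jan 10 (1)
```

Each island's anchor (Dec 31, Jan 1, Jan 4) corresponds to the `grp` column the SQL groups by.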

Pattern 4: Pivot / Transpose

-- Transform rows to columns
SELECT
    user_id,
    SUM(CASE WHEN month = 'Jan' THEN revenue ELSE 0 END) AS jan_revenue,
    SUM(CASE WHEN month = 'Feb' THEN revenue ELSE 0 END) AS feb_revenue,
    SUM(CASE WHEN month = 'Mar' THEN revenue ELSE 0 END) AS mar_revenue
FROM monthly_sales
GROUP BY user_id;

-- PostgreSQL crosstab (requires tablefunc extension)
SELECT * FROM crosstab(
    'SELECT user_id, month, revenue FROM monthly_sales ORDER BY 1,2'
) AS ct(user_id INT, jan NUMERIC, feb NUMERIC, mar NUMERIC);

Pattern 5: Find Duplicates

-- Find duplicate emails
SELECT email, COUNT(*) AS cnt
FROM users
GROUP BY email
HAVING COUNT(*) > 1;

-- Delete duplicates, keeping lowest id
DELETE FROM users
WHERE id NOT IN (
    SELECT MIN(id) FROM users GROUP BY email
);

-- Better: Using CTE with window function
WITH ranked AS (
    SELECT id, ROW_NUMBER() OVER (PARTITION BY email ORDER BY id) AS rn
    FROM users
)
DELETE FROM users WHERE id IN (SELECT id FROM ranked WHERE rn > 1);

Pattern 6: Median

-- PostgreSQL has built-in percentile functions
SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY salary)
FROM employees;

-- Manual approach (works in most databases)
WITH ordered AS (
    SELECT
        salary,
        ROW_NUMBER() OVER (ORDER BY salary) AS rn,
        COUNT(*) OVER () AS total
    FROM employees
)
SELECT AVG(salary) AS median
FROM ordered
-- Even count: average the two middle rows; odd count: the single middle row
WHERE (total % 2 = 0 AND rn IN (total/2, total/2 + 1))
   OR (total % 2 = 1 AND rn = (total + 1) / 2);

Pattern 7: Consecutive Events

Problem: Find users who logged in 3+ consecutive days.

WITH login_groups AS (
    SELECT
        user_id,
        login_date,
        login_date - ROW_NUMBER() OVER (
            PARTITION BY user_id
            ORDER BY login_date
        )::int AS grp
    FROM logins
    GROUP BY user_id, login_date  -- Dedupe same-day logins
)
SELECT user_id, COUNT(*) AS consecutive_days
FROM login_groups
GROUP BY user_id, grp
HAVING COUNT(*) >= 3;

Pattern 8: Self Join for Comparisons

-- Find employees who earn more than their manager
SELECT e.name AS employee, e.salary, m.name AS manager, m.salary AS mgr_salary
FROM employees e
JOIN employees m ON e.manager_id = m.id
WHERE e.salary > m.salary;

Pattern 9: NOT EXISTS vs NOT IN

-- Find customers who never ordered

-- NOT IN: BEWARE of NULLs! Returns empty if any NULL in subquery
SELECT * FROM customers
WHERE id NOT IN (SELECT customer_id FROM orders);

-- NOT EXISTS: NULL-safe, often faster
SELECT * FROM customers c
WHERE NOT EXISTS (
    SELECT 1 FROM orders o WHERE o.customer_id = c.id
);

-- LEFT JOIN / IS NULL: Also NULL-safe
SELECT c.* FROM customers c
LEFT JOIN orders o ON c.id = o.customer_id
WHERE o.id IS NULL;
NOT IN with a subquery that returns any NULL will return 0 rows. Always prefer NOT EXISTS for anti-joins.

Pattern 10: Date Manipulation

-- Truncate to start of period
SELECT DATE_TRUNC('month', created_at) AS month_start FROM orders;

-- Generate date series (for filling gaps)
SELECT generate_series(
    '2024-01-01'::date,
    '2024-12-31'::date,
    '1 day'::interval
) AS date;

-- Join with date series to include zero-count days
WITH dates AS (
    SELECT generate_series('2024-01-01', '2024-01-31', '1 day'::interval)::date AS d
)
SELECT d.d, COALESCE(COUNT(o.id), 0) AS order_count
FROM dates d
LEFT JOIN orders o ON d.d = o.order_date::date
GROUP BY d.d
ORDER BY d.d;

Quick Reference: Performance Checklist

Before the interview, memorize:
  1. Run EXPLAIN ANALYZE - check for Seq Scans on large tables
  2. Check estimated vs actual rows - if way off, run ANALYZE
  3. Composite index order: equality first, then range
  4. Use covering indexes to avoid heap lookups
  5. Prefer NOT EXISTS over NOT IN
  6. Avoid functions on indexed columns in WHERE clause
  7. For top-N per group, use ROW_NUMBER() OVER (PARTITION BY...)
  8. Lock ordering prevents deadlocks
  9. Use FOR UPDATE SKIP LOCKED for job queues
  10. PostgreSQL may materialize or inline CTEs (since v12, MATERIALIZED / NOT MATERIALIZED forces either)

← Back to Index