Database Performance & Optimization

Query Optimization, Indexing, and Best Practices

Database Indexes

What is an Index?

An index is a data structure that improves query performance by reducing the number of rows scanned. Think of it like a book index - instead of reading every page to find "database", you check the index.

Trade-offs:
  • Pros: Faster SELECT queries (especially WHERE, JOIN, ORDER BY)
  • Cons: Slower INSERT/UPDATE/DELETE (index must be updated), more disk space

Rule of thumb: Index columns used in WHERE clauses, JOIN conditions, and ORDER BY
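The effect is easy to demonstrate with any engine; below is a minimal sketch using Python's built-in sqlite3 (table and data are invented for illustration; the plan text differs from PostgreSQL's but the idea is the same):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany("INSERT INTO users (email) VALUES (?)",
                 [(f"user{i}@example.com",) for i in range(1000)])

def plan(sql):
    # EXPLAIN QUERY PLAN rows end with a human-readable description of the step
    return conn.execute("EXPLAIN QUERY PLAN " + sql).fetchone()[-1]

before = plan("SELECT * FROM users WHERE email = 'user500@example.com'")
conn.execute("CREATE INDEX idx_users_email ON users(email)")
after = plan("SELECT * FROM users WHERE email = 'user500@example.com'")

print(before)  # a full table scan, e.g. "SCAN users"
print(after)   # an index search, e.g. "SEARCH users USING INDEX idx_users_email (email=?)"
```

Before the index the planner reports a scan over all rows; after, a search that touches only the matching index entries.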

Index Types

1. B-Tree Index (Most Common)

Default index type in PostgreSQL, MySQL. Balanced tree structure.

Good for:
  • Equality searches: WHERE user_id = 123
  • Range queries: WHERE created_at > '2024-01-01'
  • Sorting: ORDER BY last_name
  • Prefix searches: WHERE email LIKE 'john%' (but NOT '%@gmail.com')
Not good for:
  • Pattern matching: LIKE '%search%'
  • Very low cardinality (few distinct values): gender CHAR(1)
-- Create B-tree index (default)
CREATE INDEX idx_users_email ON users(email);

-- Composite index (multiple columns)
CREATE INDEX idx_users_name ON users(last_name, first_name);

-- Unique index
CREATE UNIQUE INDEX idx_users_email_unique ON users(email);

2. Hash Index

Fast equality checks, no range queries.

Good for: Exact matches only (WHERE id = 123)
Cannot: Range queries, sorting, pattern matching
-- PostgreSQL hash index
CREATE INDEX idx_users_id_hash ON users USING HASH (id);

-- Use case: primary key lookups (though B-tree is usually fine)

3. GIN (Generalized Inverted Index)

For columns with multiple values: arrays, JSONB, full-text search.

Good for:
  • Array searches: WHERE tags @> ARRAY['postgresql']
  • JSONB queries: WHERE data @> '{"status": "active"}'
  • Full-text search: WHERE to_tsvector(content) @@ to_tsquery('database')
-- Index on JSONB column
CREATE INDEX idx_users_metadata ON users USING GIN (metadata);

-- Query:
SELECT * FROM users WHERE metadata @> '{"premium": true}';

-- Full-text search index
CREATE INDEX idx_posts_search ON posts USING GIN (to_tsvector('english', content));
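Conceptually, GIN is an inverted index: it maps each element (array tag, JSON key/value pair, text token) to the set of rows containing it, and a containment query intersects those posting lists. A toy Python sketch of the idea (invented sample data, not PostgreSQL's actual implementation):

```python
from collections import defaultdict

rows = {
    1: ["postgresql", "indexing"],
    2: ["mysql", "replication"],
    3: ["postgresql", "replication"],
}

# Build the inverted index: element -> set of row ids (the "posting list")
inverted = defaultdict(set)
for row_id, tags in rows.items():
    for tag in tags:
        inverted[tag].add(row_id)

# "tags @> ARRAY[...]" becomes an intersection of posting lists
def contains(*tags):
    return sorted(set.intersection(*(inverted[t] for t in tags)))

print(contains("postgresql"))                 # [1, 3]
print(contains("postgresql", "replication"))  # [3]
```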

4. GiST (Generalized Search Tree)

For geometric data, full-text search, custom data types.

Good for:
  • Spatial queries (PostGIS): WHERE location && 'BOX(...)'
  • Range types: WHERE date_range && '[2024-01-01,2024-12-31]'
  • Full-text search (alternative to GIN)
-- Spatial index (PostGIS)
CREATE INDEX idx_locations_geom ON locations USING GIST (geom);

-- Query:
SELECT * FROM locations
WHERE ST_DWithin(geom, ST_MakePoint(-122.4194, 37.7749), 1000);

5. Partial Index

Index only subset of rows matching a condition.

-- Only index active users
CREATE INDEX idx_active_users ON users(email) WHERE status = 'active';

-- Smaller index, faster for:
SELECT * FROM users WHERE status = 'active' AND email = 'john@example.com';

-- Use case: 95% of queries filter by status = 'active'
-- Partial index is much smaller and faster than full index
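SQLite also supports partial indexes, so the behavior can be sketched with Python's sqlite3 (names and the ~5% active ratio are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO users (email, status) VALUES (?, ?)",
    [(f"u{i}@example.com", "active" if i % 20 == 0 else "inactive")
     for i in range(1000)])

# Partial index: only the ~5% of rows with status = 'active' are indexed
conn.execute("CREATE INDEX idx_active_users ON users(email) WHERE status = 'active'")

# The query's WHERE clause implies the index condition, so the index is usable
detail = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM users "
    "WHERE status = 'active' AND email = 'u500@example.com'"
).fetchone()[-1]
print(detail)  # e.g. "SEARCH users USING INDEX idx_active_users (email=?)"
```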

6. Covering Index (Index-Only Scan)

Index contains all columns needed for query - no table lookup required.

-- Include extra columns in index
CREATE INDEX idx_users_email_name ON users(email) INCLUDE (first_name, last_name);

-- Query can be satisfied entirely from index:
SELECT first_name, last_name FROM users WHERE email = 'john@example.com';

-- No need to access table, just scan index

Composite Indexes (Multi-Column)

Column Order Matters!

Index on (last_name, first_name) is different from (first_name, last_name)

Leftmost Prefix Rule

Composite index can be used if query includes leftmost columns.

-- Index on (last_name, first_name, age)
CREATE INDEX idx_users_composite ON users(last_name, first_name, age);

-- Can use index (leftmost prefix):
SELECT * FROM users WHERE last_name = 'Smith';                          -- ✓ Uses index
SELECT * FROM users WHERE last_name = 'Smith' AND first_name = 'John';  -- ✓ Uses index
SELECT * FROM users WHERE last_name = 'Smith' AND age = 30;             -- ✓ Uses index for last_name; age is filtered within the matching entries, not via index order

-- Cannot use index efficiently:
SELECT * FROM users WHERE first_name = 'John';                          -- ✗ Skips leftmost
SELECT * FROM users WHERE age = 30;                                     -- ✗ Skips leftmost
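The leftmost prefix rule is easy to observe; a sketch with Python's sqlite3, using the same columns as the example above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users "
             "(id INTEGER PRIMARY KEY, last_name TEXT, first_name TEXT, age INT)")
conn.execute("CREATE INDEX idx_users_composite ON users(last_name, first_name, age)")

def plan(where):
    # Last column of EXPLAIN QUERY PLAN output describes the chosen access path
    sql = "EXPLAIN QUERY PLAN SELECT * FROM users WHERE " + where
    return conn.execute(sql).fetchone()[-1]

uses = plan("last_name = 'Smith'")    # leftmost column present: index search
skips = plan("first_name = 'John'")   # leftmost column missing: scan

print(uses)   # e.g. "SEARCH users USING INDEX idx_users_composite (last_name=?)"
print(skips)  # e.g. "SCAN users"
```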

Best Practices for Column Order

  1. Most selective first: Column with most distinct values
  2. Equality before range: = before >, <, BETWEEN
  3. Common queries first: Columns used in WHERE most often
-- Good order: status (low cardinality) after email (high cardinality)
CREATE INDEX idx_users_email_status ON users(email, status);

-- Bad order: status first limits index usefulness
CREATE INDEX idx_users_status_email ON users(status, email);

EXPLAIN and Query Plans

Reading EXPLAIN Output

-- PostgreSQL
EXPLAIN ANALYZE
SELECT * FROM users WHERE email = 'john@example.com';
Seq Scan on users  (cost=0.00..1693.00 rows=1 width=54) (actual time=0.234..15.678 rows=1 loops=1)
  Filter: (email = 'john@example.com'::text)
  Rows Removed by Filter: 99999
Planning Time: 0.123 ms
Execution Time: 15.701 ms

Key Metrics

  • Seq Scan: sequential scan (reads entire table); usually BAD for large tables
  • Index Scan: uses the index, then fetches rows from the table; GOOD
  • Index Only Scan: answered from the index alone, no table lookup; BEST
  • Bitmap Heap Scan: uses the index to find rows, then fetches them from the table; OK
  • cost=0.00..1693.00: estimated cost (startup..total); lower is better; units are arbitrary
  • rows=1: estimated number of rows returned
  • actual time=0.234..15.678: measured time (startup..total) in milliseconds (shown with ANALYZE)
  • rows=1 (actual): actual number of rows returned

With Index

CREATE INDEX idx_users_email ON users(email);
EXPLAIN ANALYZE
SELECT * FROM users WHERE email = 'john@example.com';
Index Scan using idx_users_email on users  (cost=0.29..8.31 rows=1 width=54) (actual time=0.045..0.047 rows=1 loops=1)
  Index Cond: (email = 'john@example.com'::text)
Planning Time: 0.234 ms
Execution Time: 0.078 ms

Improvement: 15.7ms → 0.078ms (200x faster!)

Common Scan Types

1. Sequential Scan

SELECT * FROM users WHERE age > 25;
Seq Scan on users  (cost=0.00..1693.00 rows=50000 width=54)
  Filter: (age > 25)

Reads entire table. OK for small tables or when returning most rows.

2. Index Scan

SELECT * FROM users WHERE email = 'john@example.com';
Index Scan using idx_users_email on users  (cost=0.29..8.31 rows=1 width=54)
  Index Cond: (email = 'john@example.com'::text)

Uses index, then fetches rows from table. Good for selective queries.

3. Index Only Scan

SELECT email FROM users WHERE email = 'john@example.com';
Index Only Scan using idx_users_email on users  (cost=0.29..4.31 rows=1 width=32)
  Index Cond: (email = 'john@example.com'::text)
  Heap Fetches: 0

All data in index, no table access. Fastest!

4. Bitmap Heap Scan

SELECT * FROM users WHERE age BETWEEN 25 AND 35;
Bitmap Heap Scan on users  (cost=100.29..1500.31 rows=5000 width=54)
  Recheck Cond: ((age >= 25) AND (age <= 35))
  ->  Bitmap Index Scan on idx_users_age  (cost=0.00..99.04 rows=5000 width=0)
        Index Cond: ((age >= 25) AND (age <= 35))

Builds bitmap of matching rows from index, then fetches in physical order. Good for large result sets.

JOIN Strategies

1. Nested Loop Join

SELECT * FROM users u JOIN orders o ON u.id = o.user_id;
Nested Loop  (cost=0.29..1234.56 rows=100 width=...)
  ->  Seq Scan on users u  (cost=0.00..123.00 rows=100 width=...)
  ->  Index Scan using idx_orders_user_id on orders o  (cost=0.29..10.50 rows=5 width=...)
        Index Cond: (user_id = u.id)

For each row in outer table, scan inner table. Good when outer table is small.

2. Hash Join

SELECT * FROM users u JOIN orders o ON u.id = o.user_id;
Hash Join  (cost=234.56..5678.90 rows=10000 width=...)
  Hash Cond: (o.user_id = u.id)
  ->  Seq Scan on orders o  (cost=0.00..4000.00 rows=100000 width=...)
  ->  Hash  (cost=123.00..123.00 rows=5000 width=...)
        ->  Seq Scan on users u  (cost=0.00..123.00 rows=5000 width=...)

Build hash table from smaller table, probe with larger table. Good for large, unindexed joins.
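The algorithm itself is short: build a hash table on the smaller input, then probe it with the larger one. A plain-Python sketch over invented sample rows:

```python
def hash_join(build_rows, probe_rows, build_key, probe_key):
    """Hash join: build a hash table on the smaller side, probe with the larger."""
    table = {}
    for row in build_rows:                      # build phase
        table.setdefault(row[build_key], []).append(row)
    out = []
    for row in probe_rows:                      # probe phase
        for match in table.get(row[probe_key], []):
            out.append({**match, **row})        # merge the two matching rows
    return out

users = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]
orders = [{"order_id": 10, "user_id": 1}, {"order_id": 11, "user_id": 1},
          {"order_id": 12, "user_id": 2}]

joined = hash_join(users, orders, "id", "user_id")
print(len(joined))  # 3 joined rows
```

Each input is read exactly once, which is why this strategy wins for large, unindexed joins.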

3. Merge Join

SELECT * FROM users u JOIN orders o ON u.id = o.user_id
ORDER BY u.id;
Merge Join  (cost=123.45..2345.67 rows=10000 width=...)
  Merge Cond: (u.id = o.user_id)
  ->  Index Scan using users_pkey on users u  (cost=0.29..500.00 rows=5000 width=...)
  ->  Index Scan using idx_orders_user_id on orders o  (cost=0.29..1500.00 rows=50000 width=...)

Both tables sorted, merge in order. Good when both tables indexed on join key.

Query Optimization Techniques

1. Select Only Needed Columns

-- BAD: Fetches all columns, including large BLOBs
SELECT * FROM users;
-- GOOD: Fetch only what you need
SELECT id, email, first_name FROM users;

Benefits: Less data transferred, can use covering indexes, less memory

2. Use LIMIT for Large Result Sets

-- BAD: Fetches millions of rows
SELECT * FROM logs ORDER BY created_at DESC;
-- GOOD: Pagination
SELECT * FROM logs ORDER BY created_at DESC LIMIT 100 OFFSET 0;

-- BETTER: Cursor-based pagination (no OFFSET)
SELECT * FROM logs
WHERE created_at < '2024-01-01 12:00:00'
ORDER BY created_at DESC
LIMIT 100;
OFFSET is expensive!

OFFSET 10000 still scans first 10,000 rows then discards them.

Better: Use cursor-based pagination with WHERE clause.
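A sketch of keyset (cursor-based) pagination with Python's sqlite3, using (created_at, id) as the cursor so ties on the timestamp are broken deterministically (schema invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (id INTEGER PRIMARY KEY, created_at TEXT)")
conn.executemany("INSERT INTO logs (created_at) VALUES (?)",
                 [(f"2024-01-01 {h:02d}:{m:02d}:00",)
                  for h in range(24) for m in range(60)])

def page(cursor=None, size=100):
    """Keyset pagination: the last row of one page is the cursor for the next."""
    if cursor is None:
        return conn.execute(
            "SELECT id, created_at FROM logs "
            "ORDER BY created_at DESC, id DESC LIMIT ?", (size,)).fetchall()
    # Row-value comparison keeps the cursor logic correct across ties
    return conn.execute(
        "SELECT id, created_at FROM logs "
        "WHERE (created_at, id) < (?, ?) "
        "ORDER BY created_at DESC, id DESC LIMIT ?",
        (cursor[1], cursor[0], size)).fetchall()

first = page()
second = page(cursor=first[-1])
print(first[0][1], second[0][1])  # newest entry, then the 101st-newest
```

No matter how deep you paginate, each page is a bounded index range instead of a growing OFFSET skip.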

3. Avoid Functions on Indexed Columns

-- BAD: Function prevents index usage
SELECT * FROM users WHERE LOWER(email) = 'john@example.com';
SELECT * FROM orders WHERE EXTRACT(YEAR FROM created_at) = 2024;  -- (YEAR(created_at) in MySQL)
-- GOOD: Store lowercase in column or use functional index
CREATE INDEX idx_users_email_lower ON users(LOWER(email));
SELECT * FROM users WHERE LOWER(email) = 'john@example.com';  -- Now uses index

-- GOOD: Use range query instead of function
SELECT * FROM orders
WHERE created_at >= '2024-01-01' AND created_at < '2025-01-01';

4. Avoid Leading Wildcards

-- BAD: Leading wildcard prevents index usage
SELECT * FROM users WHERE email LIKE '%@gmail.com';
-- GOOD: Prefix search can use index
SELECT * FROM users WHERE email LIKE 'john%';

-- GOOD: For suffix search, use full-text search or an index on the reversed value
CREATE INDEX idx_users_email_reverse ON users(REVERSE(email));
SELECT * FROM users WHERE REVERSE(email) LIKE REVERSE('%@gmail.com');
-- (the reversed pattern 'moc.liamg@%' is a prefix search; in PostgreSQL the index
-- must use text_pattern_ops or the C collation for LIKE to use it)

5. Use EXISTS Instead of IN for Subqueries

-- BAD: IN with subquery can be slow
SELECT * FROM users
WHERE id IN (SELECT user_id FROM orders WHERE total > 100);
-- GOOD: EXISTS can stop early
SELECT * FROM users u
WHERE EXISTS (
    SELECT 1 FROM orders o
    WHERE o.user_id = u.id AND o.total > 100
);

-- BETTER: JOIN if you need columns from orders
SELECT DISTINCT u.* FROM users u
JOIN orders o ON u.id = o.user_id
WHERE o.total > 100;

EXISTS can stop at the first match. Older or simpler optimizers materialize the whole IN list first; modern PostgreSQL usually rewrites both forms into the same semi-join plan, so check EXPLAIN before rewriting.

6. Batch Operations

-- BAD: 1000 individual INSERT statements
for user in users:
    execute("INSERT INTO users (name, email) VALUES (?, ?)", user.name, user.email)
-- GOOD: Single INSERT with multiple rows
INSERT INTO users (name, email) VALUES
    ('Alice', 'alice@example.com'),
    ('Bob', 'bob@example.com'),
    ('Charlie', 'charlie@example.com'),
    ... (1000 rows)

-- Or use COPY (PostgreSQL) for bulk loading
COPY users(name, email) FROM '/tmp/users.csv' WITH CSV HEADER;

Result: 1000 round trips → 1 round trip. ~100-1000x faster.
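With Python's sqlite3 the difference shows up even in-process: `executemany` pushes the whole batch through one prepared statement instead of re-issuing the statement per row (timings vary by machine, so only the row count is guaranteed):

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, email TEXT)")
rows = [(f"user{i}", f"user{i}@example.com") for i in range(10_000)]

# One statement per row
t0 = time.perf_counter()
for name, email in rows:
    conn.execute("INSERT INTO users (name, email) VALUES (?, ?)", (name, email))
per_row = time.perf_counter() - t0

# Single batched call
conn.execute("DELETE FROM users")
t0 = time.perf_counter()
conn.executemany("INSERT INTO users (name, email) VALUES (?, ?)", rows)
batched = time.perf_counter() - t0

count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
print(count, f"per-row: {per_row:.3f}s, batched: {batched:.3f}s")
```

Over a network the gap is far larger, because each per-row statement also pays a full round trip.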

The N+1 Query Problem

What is N+1?

Common ORM anti-pattern: 1 query to fetch N records, then N queries to fetch related data.

Problem Example (Django ORM):
# Fetch all users (1 query)
users = User.objects.all()  # SELECT * FROM users

# For each user, fetch their orders (N queries!)
for user in users:
    orders = user.orders.all()  # SELECT * FROM orders WHERE user_id = ?
    print(f"{user.name}: {orders.count()} orders")

# Result: 1 + N queries (if 100 users, 101 queries!)

Solution 1: Eager Loading (Prefetch)

# Fetch users and their orders in 2 queries total
users = User.objects.prefetch_related('orders')

# Query 1: SELECT * FROM users
# Query 2: SELECT * FROM orders WHERE user_id IN (1,2,3,...)

for user in users:
    orders = user.orders.all()  # No query! Already loaded
    print(f"{user.name}: {orders.count()} orders")

# Result: 2 queries regardless of N

Solution 2: JOIN (Select Related)

# Single query with JOIN
users = User.objects.select_related('profile')

# SELECT users.*, profiles.*
# FROM users
# LEFT JOIN profiles ON users.id = profiles.user_id

for user in users:
    print(f"{user.name}: {user.profile.bio}")  # No query!

# Result: 1 query total

When to Use Each

  • select_related(): for ForeignKey and OneToOne (1:1 or N:1); uses a JOIN, one query total
  • prefetch_related(): for ManyToMany and reverse ForeignKey (1:N or M:N); uses a separate query with IN
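prefetch_related's strategy is easy to reproduce by hand: one query for the parents, one IN query for all children, then group in memory. A sketch with Python's sqlite3 (schema and data invented):

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INT, total REAL);
    INSERT INTO users VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO orders VALUES (10, 1, 50.0), (11, 1, 75.0), (12, 2, 20.0);
""")

# Query 1: the parents
users = conn.execute("SELECT id, name FROM users").fetchall()

# Query 2: all children at once, then group in memory (no per-user query)
ids = [u[0] for u in users]
placeholders = ",".join("?" * len(ids))
orders_by_user = defaultdict(list)
for user_id, total in conn.execute(
        f"SELECT user_id, total FROM orders WHERE user_id IN ({placeholders})", ids):
    orders_by_user[user_id].append(total)

for user_id, name in users:
    print(name, len(orders_by_user[user_id]))  # 2 queries total, regardless of user count
```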

Detecting N+1 in Production

1. Query Logging

# Django: Enable query logging
DEBUG = True  # development only - never enable DEBUG in production; use django-debug-toolbar locally

# Count queries
from django.db import connection
print(len(connection.queries))  # Number of queries executed

2. APM Tools

APM tools (e.g., New Relic, Datadog, Sentry) report per-request query counts and flag traces where the same statement runs many times with different parameters - the signature of N+1.

3. Database Query Log

-- PostgreSQL: Enable slow query log
ALTER DATABASE mydb SET log_min_duration_statement = 100;  -- Log queries > 100ms

-- Analyze logs for repeated similar queries
-- Example: Same query with different parameters 100 times = N+1

Connection Pooling

Why Connection Pooling?

Problem without pooling:
# Each request creates new connection
1. Open TCP connection to database (50-100ms)
2. Authenticate (10-50ms)
3. Execute query (5ms)
4. Close connection

Total: 65-155ms per request (overhead >> query time!)
With connection pooling:
# Reuse existing connections
1. Get connection from pool (< 1ms)
2. Execute query (5ms)
3. Return connection to pool

Total: 6ms (10-25x faster!)

How Connection Pools Work

Application                Pool                 Database
   |                        |                       |
   |---Request Conn-------->|                       |
   |                   [Idle Conn 1]--------------->|  (reuse)
   |<--Return Conn 1--------|                       |
   |                        |                       |
   |---Execute Query------->|---------------------->|
   |                        |                       |
   |<--Return Result--------|<----------------------|
   |                        |                       |
   |---Release Conn-------->|                       |
   |                   [Idle Conn 1] (back to pool) |
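A minimal pool is just a thread-safe queue of pre-opened connections. A sketch (not production-ready: no health checks, max_lifetime, or overflow; sqlite3 stands in for a real database client):

```python
import queue
import sqlite3

class ConnectionPool:
    """Minimal pool sketch: pre-open N connections, hand them out, take them back."""
    def __init__(self, size, factory):
        self._pool = queue.Queue(maxsize=size)
        for _ in range(size):
            self._pool.put(factory())

    def acquire(self, timeout=5):
        # Blocks (up to timeout seconds) if every connection is checked out
        return self._pool.get(timeout=timeout)

    def release(self, conn):
        self._pool.put(conn)

# Note: each sqlite3 ":memory:" connection is its own database; fine for a demo
pool = ConnectionPool(size=3, factory=lambda: sqlite3.connect(":memory:"))

conn = pool.acquire()
result = conn.execute("SELECT 1 + 1").fetchone()[0]
pool.release(conn)
print(result)  # 2
```

The connect/authenticate cost is paid once at startup; each request only pays the queue get/put, which is microseconds.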

Pool Configuration

  • min_connections: minimum pooled connections, kept open at all times; typically 2-10
  • max_connections: maximum pooled connections; typically 10-100, depending on load
  • max_idle_time: close idle connections after this long; typically 10-30 minutes
  • connection_timeout: how long to wait for an available connection; typically 5-30 seconds
  • max_lifetime: maximum connection age, to avoid stale connections; typically 30-60 minutes

Python (SQLAlchemy):
from sqlalchemy import create_engine

engine = create_engine(
    'postgresql://user:pass@localhost/db',
    pool_size=10,              # Normal pool size
    max_overflow=20,           # Can create 20 extra connections if needed
    pool_timeout=30,           # Wait 30s for connection
    pool_recycle=3600,         # Recycle connections after 1 hour
    pool_pre_ping=True         # Test connection before using
)
Java (HikariCP, widely regarded as the fastest JVM pool):
HikariConfig config = new HikariConfig();
config.setJdbcUrl("jdbc:postgresql://localhost/db");
config.setUsername("user");
config.setPassword("pass");
config.setMaximumPoolSize(20);
config.setMinimumIdle(5);
config.setConnectionTimeout(30000);
config.setIdleTimeout(600000);
config.setMaxLifetime(1800000);

HikariDataSource dataSource = new HikariDataSource(config);

Connection Pool Sizing

Formula: connections = ((core_count * 2) + effective_spindle_count)

Example: 4 CPU cores, SSD (treat as 1 spindle) = (4 × 2) + 1 = 9 connections

Common mistake: Too many connections!
  • Database has overhead per connection (memory, context switching)
  • More connections ≠ more throughput
  • PostgreSQL default: max 100 connections
  • If you need 1000+ connections, use connection pooler (PgBouncer, PgPool)

Common Anti-Patterns

1. SELECT * in Production

-- BAD: Fetches all columns including BLOBs, JSON, etc.
SELECT * FROM products;  -- 50 columns, 10MB per row

-- GOOD: Only what you need
SELECT id, name, price FROM products;

2. No Index on Foreign Keys

-- BAD: JOIN without index on foreign key
CREATE TABLE orders (
    id SERIAL PRIMARY KEY,
    user_id INT REFERENCES users(id)  -- No index!
);

SELECT * FROM users u JOIN orders o ON u.id = o.user_id;  -- Slow!

-- GOOD: Always index foreign keys (PostgreSQL does not create these indexes automatically)
CREATE INDEX idx_orders_user_id ON orders(user_id);

3. Implicit Type Conversion

-- BAD: phone is VARCHAR, but the query compares it to a number
SELECT * FROM users WHERE phone = 5551234;  -- column is cast row by row, index not used!

-- GOOD: Match the column type
SELECT * FROM users WHERE phone = '5551234';

4. OR Conditions on Different Columns

-- BAD: Indexes often can't be used efficiently
SELECT * FROM products
WHERE category_id = 5 OR price < 10;  -- Two different columns! (PostgreSQL may still combine indexes via a BitmapOr)

-- BETTER: UNION (each branch can use its own index; note UNION also deduplicates)
SELECT * FROM products WHERE category_id = 5
UNION
SELECT * FROM products WHERE price < 10;

5. Large IN Lists

-- BAD: Huge IN list
SELECT * FROM users WHERE id IN (1,2,3,...,10000);  -- 10,000 IDs!

-- BETTER: Temp table + JOIN
CREATE TEMP TABLE user_ids (id INT);
INSERT INTO user_ids VALUES (1),(2),(3),...,(10000);
SELECT u.* FROM users u JOIN user_ids t ON u.id = t.id;

Key Takeaways

  • Index columns used in WHERE, JOIN, and ORDER BY, but remember every write pays for every index
  • Composite indexes follow the leftmost prefix rule - column order matters
  • Read EXPLAIN ANALYZE output: selective queries should show Index (Only) Scans, not Seq Scans
  • Prefer cursor-based pagination, batched writes, and eager loading (avoid N+1 queries)
  • Pool database connections; more connections does not mean more throughput

Interview Tips

  • Be ready to read an EXPLAIN plan aloud and explain why a Seq Scan appeared
  • Know the trade-offs between B-tree, Hash, GIN, and GiST, and when partial or covering indexes help
  • Explain the N+1 problem with concrete query counts, and describe both fixes (JOIN vs. prefetch)