Database Architecture Patterns for High-Scale Systems in 2025

A definitive guide to choosing and operating databases at scale, from PostgreSQL internals to vector stores for AI workloads.

Tech Talk News Editorial · 11 min read · Updated Dec 1, 2024
#database #architecture #postgresql #nosql #distributed-systems

Most teams pick Postgres by default. That's usually correct. Some teams chase the hot new database every couple of years. That's usually wrong. The database decision is one of the most consequential you'll make, and it's made far too casually, often in a 30-minute architecture meeting early in a project, when you have the least information about what your actual access patterns will be.

What makes this consequential isn't just performance. It's hiring (fewer people know Cassandra than Postgres), vendor costs (DynamoDB will surprise you at scale), and how fast you can move when requirements change (relational schemas are easier to evolve than you think, and denormalized NoSQL schemas are harder). The engineers who make good database decisions reason about workload shape first, technology second.

OLTP vs OLAP: Getting the Workload Shape Right First

The single most expensive database mistake is mixing OLTP and OLAP concerns on the same cluster. Online Transaction Processing workloads are high-concurrency, short-lived reads and writes on a small number of rows: point lookups, inserts, small updates. Online Analytical Processing workloads are the opposite: low-concurrency, long-running aggregations over millions of rows, often with complex joins.

Row-oriented storage engines (PostgreSQL, MySQL) are optimized for OLTP. Columnar storage engines (BigQuery, Redshift, ClickHouse, DuckDB) are optimized for OLAP because they skip irrelevant columns entirely during a scan. Running your BI queries on your transactional Postgres cluster isn't just a performance problem. It creates lock contention that degrades user-facing latency. I've seen this take down a production system during a quarterly report run.

The modern answer for most teams is a dedicated OLAP sink fed from the OLTP source via logical replication or a CDC pipeline (Debezium into Kafka into ClickHouse or BigQuery). Accept the operational complexity of two systems. A degraded primary is worse.

PostgreSQL at Scale: Features Most Teams Underuse

Postgres remains the most versatile database engine available. Teams that hit walls with it have usually underused its advanced features rather than genuinely outgrown it. Before you add a new database to your stack, ask whether you've actually used partitioning, logical replication, and partial indexes.

Declarative Partitioning and pg_partman

Table partitioning is native in PostgreSQL 10+ and should be the default for any table expected to exceed ~100M rows, or any table with clear time-based or tenant-based access patterns. Range partitioning by created_at lets the planner skip entire partitions during a query: pruning to one of 10 partitions on a 10B-row events table is a 10x scan reduction before any index is touched.

-- Create a partitioned events table by month
CREATE TABLE events (
  id          BIGSERIAL,
  user_id     BIGINT NOT NULL,
  event_type  TEXT NOT NULL,
  payload     JSONB,
  created_at  TIMESTAMPTZ NOT NULL DEFAULT now()
) PARTITION BY RANGE (created_at);

-- pg_partman handles automatic child partition creation
SELECT partman.create_parent(
  p_parent_table => 'public.events',
  p_control      => 'created_at',
  p_type         => 'native',
  p_interval     => 'monthly',
  p_premake      => 3
);

Pair pg_partman with a maintenance cron that detaches and archives old partitions. Detaching a partition is effectively a metadata operation: with DETACH PARTITION CONCURRENTLY (PostgreSQL 14+) it avoids blocking concurrent queries on the parent table and completes quickly regardless of row count.
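
For example, a sketch of the archival step (the partition name is hypothetical, and CONCURRENTLY requires PostgreSQL 14+ and must run outside a transaction block):

```sql
-- Detach the oldest monthly partition without blocking queries on the parent.
ALTER TABLE events DETACH PARTITION events_p2024_01 CONCURRENTLY;

-- The detached table is now standalone: archive it (e.g. dump to object
-- storage) and then drop it independently of the parent.
DROP TABLE events_p2024_01;
```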

JSONB Indexing and Partial Indexes

JSONB with GIN indexes gives you schema flexibility without sacrificing query performance. The key discipline is to index only the keys you query against, not the entire document. For high-cardinality JSONB keys, a functional B-tree index is almost always faster than a GIN index:

-- GIN index for multi-key containment (@>) queries; jsonb_path_ops
-- is smaller and faster than the default opclass for this query shape
CREATE INDEX idx_events_payload_gin ON events USING GIN (payload jsonb_path_ops);

-- Functional B-tree for a single high-cardinality key
CREATE INDEX idx_events_tenant ON events ((payload->>'tenant_id'));

-- Partial index: only index the rows that matter
CREATE INDEX idx_events_active ON events (user_id, created_at)
  WHERE event_type = 'purchase';

Logical Replication

Logical replication decouples data movement from physical storage. You can replicate a subset of tables, apply row filters, and stream changes to heterogeneous targets including Kafka, ClickHouse, and other Postgres instances running different major versions. This is the correct architecture for zero-downtime major version upgrades and for feeding your OLAP pipeline.
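
A minimal sketch of the mechanics between two Postgres clusters, assuming a source table named events and a replication role named replicator (both hypothetical):

```sql
-- On the source cluster: publish only the tables you want to move.
CREATE PUBLICATION analytics_pub FOR TABLE events;

-- PostgreSQL 15+ also supports row filters and column lists, e.g.:
-- CREATE PUBLICATION analytics_pub FOR TABLE events WHERE (event_type = 'purchase');

-- On the target cluster (which can run a different major version):
CREATE SUBSCRIPTION analytics_sub
  CONNECTION 'host=pg-primary dbname=myapp user=replicator'
  PUBLICATION analytics_pub;
```

The subscription performs an initial copy of the published tables, then streams subsequent changes from the source's logical decoding stream, which is what makes the zero-downtime major version upgrade pattern work.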

When to Go NoSQL: A Decision Matrix

The question is never "SQL vs NoSQL." It's whether your access pattern fits the relational model or genuinely requires a different data model. Three clear cases for non-relational engines:

  • MongoDB: document-centric workloads where the schema genuinely varies per record and joins are rare. E-commerce product catalogs with wildly heterogeneous attributes are a genuine fit. Do not use MongoDB to avoid learning SQL. That's a decision you'll regret.
  • Cassandra / ScyllaDB: wide-column stores for massive write throughput with known partition keys. IoT telemetry ingestion, time-series event streams, and multi-region active-active replication where eventual consistency is acceptable. The data model must be designed around queries, not normalized. This is an irreversible architectural commitment.
  • DynamoDB: best when you need zero operational overhead and are fully committed to AWS. DynamoDB's single-table design patterns are powerful but require expert modeling upfront. Its pricing model punishes unpredictable access patterns and hot partitions viciously. I've seen DynamoDB bills that shocked teams who hadn't modeled their access patterns carefully.

The hidden cost of NoSQL is that the database's simplicity transfers complexity to your application code. You're now responsible for referential integrity, cross-document transactions, and query flexibility, all of which a relational database gives you for free.

NewSQL: CockroachDB, Vitess, and PlanetScale

NewSQL engines attempt to deliver the consistency guarantees of relational databases with the horizontal scalability of distributed systems. The tradeoffs are real and worth understanding before adoption.

CockroachDB offers serializable isolation, automatic sharding, and multi-region topologies. Its SQL compatibility is high but not perfect. Certain PostgreSQL extensions are absent. The primary cost is latency: distributed transactions require consensus rounds (Raft), so a P50 write on a single-region CockroachDB cluster is slower than the same write on a well-tuned PostgreSQL primary. Use it when you genuinely need geo-distributed consistency or fear single-region failure at a scale that PostgreSQL with physical replication can't handle.

Vitess (and its managed form, PlanetScale) is MySQL sharding middleware. It handles connection multiplexing, query routing, and schema migrations across shards. YouTube built and open-sourced it. The operational learning curve is steep, so plan for a dedicated database platform team.

Sharding Strategies: Horizontal Partitioning at Scale

Sharding is the last resort, not the first tool. Exhaust read replicas, connection pooling, caching, and partitioning before you shard. When you do shard, the key choice is the shard key, and it's very hard to change later.

Range-based sharding assigns contiguous key ranges to shards. It enables efficient range scans but creates hot spots when writes cluster at the current timestamp or monotonically increasing ID.

Consistent hashing distributes keys pseudo-randomly across a virtual ring of shards. Hot spots are eliminated but range queries become scatter-gather operations. The canonical implementation uses a virtual node count per shard (typically 150) to smooth load distribution during rebalancing.
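
The ring itself is easy to sketch in SQL. This is an illustrative model only, not a production router: hashtext() is a convenient but undocumented Postgres internal, and the shard and vnode counts here are arbitrary:

```sql
-- Build a virtual ring: 4 shards x 150 vnodes, each hashed to a ring position.
CREATE TABLE shard_ring AS
SELECT s AS shard_id,
       v AS vnode,
       abs(hashtext(s::text || ':' || v::text)) AS ring_pos
FROM generate_series(0, 3) AS s
CROSS JOIN generate_series(0, 149) AS v;

CREATE INDEX ON shard_ring (ring_pos);

-- Route a key: the first vnode at or clockwise from the key's hash position.
SELECT shard_id
FROM shard_ring
WHERE ring_pos >= abs(hashtext('user:42'))
ORDER BY ring_pos
LIMIT 1;
-- An empty result means the key hashed past the last vnode; wrap around
-- by taking the vnode with the smallest ring_pos instead.
```

Because each shard owns 150 scattered slices of the ring rather than one contiguous arc, removing or adding a shard redistributes load evenly across the survivors instead of dumping it all on one neighbor.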

Directory-based sharding maintains a lookup table mapping entity IDs to shard locations. Maximum flexibility, but the directory service becomes a critical single point of contention. It must be highly available and heavily cached.
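
The directory itself is just a lookup table; in practice it lives behind an aggressive cache, and every name below is hypothetical:

```sql
-- One row per entity: where does this entity's data live?
CREATE TABLE shard_directory (
  entity_id BIGINT PRIMARY KEY,
  shard_id  INT    NOT NULL
);

-- Assign a shard at entity creation time...
INSERT INTO shard_directory (entity_id, shard_id) VALUES (42, 3);

-- ...and consult the directory (ideally a cached copy) on every request.
SELECT shard_id FROM shard_directory WHERE entity_id = 42;
```

The flexibility is that rebalancing is a directory UPDATE plus a data copy, with no rehashing of unrelated keys; the cost is that this table sits on every request path.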

CQRS, Read Replicas, and Connection Pooling

Command Query Responsibility Segregation separates write paths from read paths at the application level. Commands mutate state through the primary; queries are served from read replicas. CQRS enables independent scaling of read and write capacity and allows read models to be denormalized for query efficiency.

PostgreSQL connection overhead is substantial. Each connection spawns a backend process consuming roughly 5-10 MB of RAM. At 500 connections you've consumed several gigabytes of RAM purely for connection state before serving a single query. PgBouncer in transaction pooling mode is the standard solution: it multiplexes hundreds of application connections onto a small pool of actual Postgres connections (typically 20-100). Configure pool_mode = transaction and max_client_conn aggressively. Most web applications hold connections for milliseconds per request.

# pgbouncer.ini: production-grade configuration
[databases]
myapp = host=pg-primary port=5432 dbname=myapp

[pgbouncer]
pool_mode = transaction
max_client_conn = 2000
default_pool_size = 25
reserve_pool_size = 5
reserve_pool_timeout = 3
server_idle_timeout = 600
log_connections = 0
log_disconnections = 0

Time-Series Databases: TimescaleDB and InfluxDB

Time-series data (metrics, logs, sensor readings) has specific characteristics: append-only writes, queries that almost always include a time range, and high cardinality in tag dimensions. General-purpose databases handle these workloads poorly at scale: B-tree indexes bloat under relentless append-only ingest, and row-oriented storage misses the compression opportunities in highly regular metric data.

TimescaleDB is a PostgreSQL extension that adds hypertables (automatic time-based chunking), columnar compression (typically 90-95% reduction for metric data), and time-series specific SQL functions. If you're already on PostgreSQL, TimescaleDB is the zero-friction upgrade. Your ORM, connection pool, and monitoring stack all continue to work unchanged. This is usually the right answer before you reach for InfluxDB.
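
A sketch of the hypertable workflow, assuming a hypothetical metrics table; create_hypertable and add_compression_policy are TimescaleDB's documented API:

```sql
CREATE EXTENSION IF NOT EXISTS timescaledb;

CREATE TABLE metrics (
  time   TIMESTAMPTZ NOT NULL,
  device TEXT NOT NULL,
  value  DOUBLE PRECISION
);

-- Convert to a hypertable chunked by time (default chunk interval: 7 days).
SELECT create_hypertable('metrics', 'time');

-- Enable columnar compression, segmenting by device for better ratios.
ALTER TABLE metrics SET (
  timescaledb.compress,
  timescaledb.compress_segmentby = 'device'
);

-- Automatically compress chunks once they are older than seven days.
SELECT add_compression_policy('metrics', INTERVAL '7 days');
```

Everything else, from the ORM to pg_dump, still sees an ordinary Postgres table, which is the whole appeal.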

InfluxDB 3.0 (rewritten in Rust on Apache Arrow/Parquet) is the choice when you need ingest rates beyond what a PostgreSQL-based system can sustain and you're comfortable with its query dialects rather than full PostgreSQL compatibility.

Vector Databases: Don't Rip Out Postgres If pgvector Does the Job

The rise of embedding-based RAG has created an entirely new database category. Vector databases store high-dimensional float vectors and answer approximate nearest neighbor queries efficiently using algorithms like HNSW or IVF. Everyone is rushing to adopt standalone vector databases. My take: start with pgvector.

pgvector is a PostgreSQL extension that adds a vector column type and HNSW/IVF index support. For datasets under ~5M vectors with reasonable dimensionality (768-1536 dimensions), pgvector's recall and latency are competitive with purpose-built vector databases. You avoid introducing a new operational dependency and you keep your data in one place. That simplicity has real value.

-- pgvector: create a table and HNSW index
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE documents (
  id        BIGSERIAL PRIMARY KEY,
  content   TEXT,
  embedding vector(1536)
);

CREATE INDEX idx_documents_embedding ON documents
  USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);

-- ANN search: find 10 nearest neighbors
SELECT id, content, 1 - (embedding <=> $1) AS similarity
FROM documents
ORDER BY embedding <=> $1
LIMIT 10;

Pinecone and Weaviate are purpose-built vector databases for production RAG workloads at scale. Pinecone is fully managed and operationally trivial. Weaviate is open-source with built-in multi-tenancy, sparse-dense hybrid search, and a GraphQL API. Reach for them when you exceed pgvector's scalability ceiling or need features like multi-vector search. But be honest about whether you've actually hit that ceiling first.

Database Migration Strategy: The Expand-Contract Pattern

Schema migrations on live production databases require discipline. The expand-contract pattern (also called parallel change) is the safe approach for zero-downtime migrations:

  1. Expand: add the new column or table while keeping the old structure. Deploy application code that writes to both old and new locations simultaneously.
  2. Backfill: populate the new column for all existing rows in batches. Never a single UPDATE on a large table: it holds row locks for the duration and the resulting WAL burst will spike replication lag.
  3. Switch reads: deploy application code that reads from the new location, still writing to both.
  4. Contract: once all reads are from the new location and a rollback window has passed, drop the old column in a separate migration.

Never try to add a NOT NULL column to a large, populated table in a single migration: without a default the ALTER fails outright, and enforcing NOT NULL on an existing column forces a full-table scan under an exclusive lock. Use a nullable column, backfill, then add a CHECK (col IS NOT NULL) constraint with NOT VALID followed by VALIDATE CONSTRAINT. The validation step acquires only a SHARE UPDATE EXCLUSIVE lock.
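
Concretely, the safe sequence looks like this (table and column names hypothetical):

```sql
-- Expand: a nullable column is a metadata-only change, no table rewrite.
ALTER TABLE orders ADD COLUMN currency TEXT;

-- Backfill in small batches; rerun until zero rows are updated.
UPDATE orders
SET currency = 'USD'
WHERE id IN (
  SELECT id FROM orders WHERE currency IS NULL LIMIT 10000
);

-- Enforce NOT NULL without a long exclusive lock: NOT VALID skips the scan
-- at creation time; VALIDATE scans under a SHARE UPDATE EXCLUSIVE lock only.
ALTER TABLE orders
  ADD CONSTRAINT orders_currency_not_null
  CHECK (currency IS NOT NULL) NOT VALID;

ALTER TABLE orders VALIDATE CONSTRAINT orders_currency_not_null;
```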

The N+1 Problem

The N+1 query problem deserves its own section because it has ruined more production systems than any technology choice. It occurs when an application issues one query to fetch N parent records, then N additional queries to fetch children, one per parent. This is the most common performance regression introduced by ORMs and it's frequently invisible in development environments where N is small. You won't see it until you have real data volumes, and by then it's a production incident.

The solutions in order of preference: eager loading (JOIN or IN clause to fetch all children in a single query), DataLoader-style batching for GraphQL resolvers, and denormalization where read latency is paramount. Integrate pg_stat_statements and slow query logs into your CI pipeline. Catching an N+1 before it ships is infinitely cheaper than diagnosing it from a production incident.
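
In raw SQL, the anti-pattern and the eager-loaded fix look like this (the users/orders schema is hypothetical):

```sql
-- N+1 shape: one parent query, then one child query per parent row.
SELECT id FROM users WHERE team_id = 7;       -- returns N users
SELECT * FROM orders WHERE user_id = 101;     -- ...issued N separate times

-- Eager-loaded equivalent: one round trip for parents and children together.
SELECT u.id AS user_id, o.*
FROM users u
LEFT JOIN orders o ON o.user_id = u.id
WHERE u.team_id = 7;

-- Or, DataLoader-style batching: collect parent IDs, fetch children once.
SELECT * FROM orders WHERE user_id = ANY(ARRAY[101, 102, 103]);
```

The symptom in pg_stat_statements is unmistakable: a cheap query with an enormous call count dominating total execution time.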

The database decisions you make today create technical debt or competitive advantage that compounds for years. Push your existing tools to their limits before introducing new systems. Understand your workload shape before picking a technology name. And treat migrations as first-class engineering work, not afterthoughts attached to feature releases at the last minute.
