Scaling PostgreSQL to 800 Million ChatGPT Users: OpenAI's Production Database Architecture Deep Dive
26 Jan, 2026
Executive Summary: OpenAI scaled PostgreSQL to handle millions of queries per second for ChatGPT's 800M users using replicas, caching, rate limiting, and workload isolation. This isn't theoretical scaling—it's battle-tested architecture powering the world's largest AI chat service. Senior engineers will walk away with concrete patterns to push their own Postgres clusters from 10k to 10M+ QPS without rewriting their stack.
The Problem Context
You've got a Postgres cluster humming along at 1k-10k QPS for development or early production. Then user growth explodes—suddenly you're at 100k concurrent sessions hammering read replicas with chat histories, metadata lookups, and session state. The status quo of vertical scaling hits physical limits: CPU saturation, I/O bottlenecks, connection storms. OpenAI faced this at ChatGPT scale where a single Postgres instance would melt under millions of QPS. Their solution? Horizontal scaling with smart isolation that keeps write latency under 50ms while reads hit sub-10ms at planetary scale. This beats sharding complexity or NoSQL migrations because Postgres gives ACID guarantees without distributed transaction headaches.
Deep Dive: Architecture & Mechanics
OpenAI's architecture revolves around four pillars: massive read replicas, multi-tier caching, adaptive rate limiting, and strict workload isolation.
Replica Fleet (The Workhorse): 100s of read replicas across regions, each handling 10k-50k QPS. They use logical replication (not physical) to minimize lag—changes stream via WAL decoding in <100ms. Connection pooling via PgBouncer multiplexes 1M+ connections down to 100k physical ones, preventing the "too many clients" crash.
Caching Layers (Hit Rates >95%): Redis cluster for hot session data (TTL 1h), Memcached for metadata (TTL 5min), and Postgres materialized views refreshed every 30s for analytics. Cache invalidation uses pub/sub from primary WAL—when a chat updates, Redis keys evict in <10ms.
Rate Limiting (Per-User, Per-Model): Token-bucket algorithm at edge (Envoy proxy) + database level. Each user gets 100 req/s burst, 10 req/s sustained, tracked in a sharded Redis counter. Spikes get 429s immediately.
Workload Isolation: Separate clusters for writes (small, low-latency), chat reads (high QPS), analytics (batch). No chat traffic touches analytics—zero interference.
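Workload isolation is mostly an infrastructure concern, but the application tier has to route queries accordingly. A minimal sketch of that routing, assuming three hypothetical DSNs fronting the write, chat-read, and analytics clusters:
from psycopg2.pool import ThreadedConnectionPool

# One connection pool per workload class, so chat reads can never
# starve writes or analytics (hosts and sizes are illustrative)
POOLS = {
    'write':     ThreadedConnectionPool(5, 50, "host=primary dbname=chatgpt user=app"),
    'chat_read': ThreadedConnectionPool(10, 200, "host=replicas dbname=chatgpt user=app"),
    'analytics': ThreadedConnectionPool(2, 10, "host=analytics dbname=chatgpt user=app"),
}

def run(workload, sql, params=()):
    # Route the query to the cluster that owns this workload class
    conn = POOLS[workload].getconn()
    try:
        with conn.cursor() as cur:
            cur.execute(sql, params)
            rows = cur.fetchall() if cur.description else None
        conn.commit()
        return rows
    finally:
        POOLS[workload].putconn(conn)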
Text-based architecture flow:
ChatGPT Frontend --(millions QPS)--> Envoy Proxy (Rate Limit)
                                          |
                                 +--------+--------+
                                 |                 |
                            Redis Cache      PgBouncer Pool
                                 |                 |
                                 +--------+--------+
                                          |
                                Read Replicas (100s)
                                          | (logical repl)
                                  Primary (writes)
                                          |
                        WAL --> Pub/Sub --> Cache Invalidation
Why it works: a >95% cache hit rate plus replica fanout lets the read path serve 800M users without overloading the primary. At, say, 10M QPS of reads, a 95% hit rate leaves roughly 500k QPS for the database tier; spread across hundreds of replicas, that is a few thousand QPS per node, comfortably inside each replica's 10k-50k budget. Latency p99 stays <200ms even at peak.
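To make the WAL-to-eviction path concrete, here's a minimal sketch of one invalidation worker. It assumes a logical replication slot named cache_inval created with the wal2json output plugin and a Redis channel that cache nodes subscribe to; slot, channel, table, and column names are illustrative, not OpenAI's actual setup.
import json
import redis
import psycopg2
import psycopg2.extras

r = redis.Redis(host='redis-cluster')

# Stream decoded WAL from the primary via a logical replication slot
conn = psycopg2.connect(
    "host=primary dbname=chatgpt user=repl",
    connection_factory=psycopg2.extras.LogicalReplicationConnection,
)
cur = conn.cursor()
cur.start_replication(slot_name='cache_inval', decode=True)

def on_change(msg):
    # wal2json emits one JSON document per transaction
    for change in json.loads(msg.payload).get('change', []):
        if change['table'] == 'chats' and change['kind'] == 'update':
            row = dict(zip(change['columnnames'], change['columnvalues']))
            # Fan out the eviction; cache nodes subscribe and drop the key
            r.publish('invalidate', f"chat:{row['user_id']}:{row['id']}")
    # Acknowledge, so the slot doesn't retain WAL indefinitely
    msg.cursor.send_feedback(flush_lsn=msg.data_start)

cur.consume_stream(on_change)
Because an unconsumed slot pins WAL on the primary, this worker deserves the same monitoring attention as the replicas themselves.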
Hands-on Implementation
Prerequisites: Postgres 15+, PgBouncer, Redis 7+, Envoy 1.28+. Start with Amazon Aurora or RDS, or self-host on EC2 r7g.8xlarge instances.
1. Logical Replication Setup:
# postgresql.conf on the primary
wal_level = logical
max_wal_senders = 50
max_replication_slots = 50
-- On the primary: create the publication
CREATE PUBLICATION chatgpt_pub FOR ALL TABLES;
-- On each replica: subscribe (dbname illustrative; add credentials per your setup)
CREATE SUBSCRIPTION chat_sub
CONNECTION 'host=primary port=5432 dbname=chatgpt user=repl'
PUBLICATION chatgpt_pub;
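Before routing traffic to a subscriber, verify the sub-100ms lag claim holds for your own fleet. A small check against the primary's pg_stat_replication view (PostgreSQL 10+); the DSN is illustrative:
import psycopg2

# Per-replica lag as seen from the primary
with psycopg2.connect("host=primary dbname=chatgpt user=repl") as conn:
    cur = conn.cursor()
    cur.execute("""
        SELECT application_name,
               replay_lag,
               pg_wal_lsn_diff(sent_lsn, replay_lsn) AS lag_bytes
        FROM pg_stat_replication
    """)
    for name, lag_time, lag_bytes in cur.fetchall():
        print(f"{name}: {lag_time} behind ({lag_bytes} bytes)")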
2. PgBouncer Connection Pooling:
[databases]
chatgpt = host=replicas port=5432 pool_mode=transaction

[pgbouncer]
pool_mode = transaction
max_client_conn = 1000000
default_pool_size = 200
reserve_pool_size = 50
3. Redis Cache + Invalidation:
import json
import redis
import psycopg2
import psycopg2.extras

r = redis.Redis(host='redis-cluster')

def get_chat(user_id, session_id):
    # Serve from cache when possible
    key = f"chat:{user_id}:{session_id}"
    cached = r.get(key)
    if cached:
        return json.loads(cached)
    # Cache miss: read from a replica (RealDictCursor gives JSON-friendly rows)
    with psycopg2.connect("host=replica1 dbname=chatgpt user=app") as conn:
        with conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor) as cur:
            cur.execute("SELECT * FROM chats WHERE id = %s", (session_id,))
            chat = cur.fetchone()
    r.setex(key, 3600, json.dumps(chat, default=str))  # 1h TTL
    return chat
4. Envoy Rate Limiting:
# Route config: tag requests with a per-user descriptor (x-user-id header assumed)
rate_limits:
- actions:
  - request_headers:
      header_name: x-user-id
      descriptor_key: user_id
# HTTP filter: point at the external rate limit service cluster
http_filters:
- name: envoy.filters.http.ratelimit
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.filters.http.ratelimit.v3.RateLimit
    domain: chatgpt
    rate_limit_service:
      transport_api_version: V3
      grpc_service:
        envoy_grpc:
          cluster_name: rate_limit_service
Deploy: Terraform for ASG replicas, auto-scaling on CPU >70%.
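The Envoy filter covers the edge; the database-level half of the limiter (the sharded Redis counter described earlier) can be approximated with a token bucket run atomically as server-side Lua. The burst and sustained numbers follow the spec above; key names and the Lua script are an illustrative sketch, not OpenAI's implementation.
import time
import redis

r = redis.Redis(host='redis-cluster')

# Refill-and-consume token bucket, executed atomically inside Redis
TOKEN_BUCKET = r.register_script("""
local key, now = KEYS[1], tonumber(ARGV[1])
local burst, rate = tonumber(ARGV[2]), tonumber(ARGV[3])
local b = redis.call('HMGET', key, 'tokens', 'ts')
local tokens = tonumber(b[1]) or burst
local ts = tonumber(b[2]) or now
tokens = math.min(burst, tokens + (now - ts) * rate)
local allowed = 0
if tokens >= 1 then tokens = tokens - 1; allowed = 1 end
redis.call('HMSET', key, 'tokens', tokens, 'ts', now)
redis.call('EXPIRE', key, 60)  -- drop idle buckets
return allowed
""")

def allow(user_id: str) -> bool:
    # 100-token burst, refilled at 10 tokens/s, matching the limits above
    return bool(TOKEN_BUCKET(keys=[f"rl:{user_id}"], args=[time.time(), 100, 10]))
A request that fails allow() gets the immediate 429 described earlier.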
Production Considerations & "Gotchas"
Performance: Replicas lag 50-200ms at peak—use for non-critical reads only. Memory: PgBouncer needs 2GB+ per 10k connections. Cache thundering herds kill Redis—add jitter to TTLs.
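For the thundering-herd gotcha, the cheapest fix is to spread expirations out so copies of a hot key don't all refill at once. A sketch amending the setex call from step 3 (the 0-300s spread is an arbitrary choice):
import random

# Same write as step 3, but with a randomized TTL
ttl = 3600 + random.randint(0, 300)
r.setex(key, ttl, json.dumps(chat, default=str))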
Security: Logical replication exposes WAL—use SSL + repl user with minimal perms. Rate limit evasion via proxy rotation? Track IP + user_id pairs.
Gotchas: Vacuuming on 1TB+ tables stalls replicas—schedule during off-peak. Connection storms on cold starts: pre-warm pools. Costs: 100 replicas = $50k+/month—cache aggressively.
Trade-offs: ultra-fast reads, but writes bottleneck at ~10k TPS; operations get more complex, but you avoid the consistency compromises of eventually consistent NoSQL.
The Verdict
Adopt if you're past 10k QPS on Postgres and need ACID at scale—perfect for SaaS, AI backends, fintech. Skip if greenfield (consider CockroachDB) or write-heavy (shard early). This pattern sets the bar for relational scaling in 2026—expect every major SaaS to copy it as AI user bases explode.