Rate Limiting Strategies for APIs: Token Bucket, Sliding Window, and Where to Enforce It

The first time rate limiting really mattered to me, it was 3 a.m. and a single customer's misconfigured cron job was hammering our /reports/generate endpoint 40 times a second. Each call held a Postgres connection for 4 seconds. Within two minutes the pool was exhausted, every other tenant got connection timeouts, and our status page went red. The fix was not more database capacity. It was four lines of Redis that should have been there from day one.

Rate limiting is the seatbelt of API design. Nobody notices it until the crash, and by then it's too late to install. The hard part is not the concept; it's picking the right algorithm and putting it in the right layer. Get either wrong and you either let abuse through or throttle your best customers during their busiest hour.

The Algorithms, And Why You'd Pick Each

There are really five algorithms worth knowing. They trade accuracy against memory and CPU, and that trade is the whole game.

Fixed window is the naive one. You count requests in discrete buckets aligned to the clock: 00:00:00–00:00:59, then reset. One counter per key, dirt cheap. The problem is the boundary burst. If your limit is 100 requests per minute, a client can send 100 at 00:00:59.9 and another 100 at 00:01:00.1 — 200 requests in 200 milliseconds, well within your "limit." I've watched this exact pattern take down a service whose owners swore they were rate limited.
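
To see the boundary burst for yourself, here is a minimal in-memory sketch. The class name and the injected clock are my own illustration, and it assumes a single process, so treat it as a demo rather than something to deploy.

```typescript
// Minimal in-memory fixed-window limiter (hypothetical names, single process only).
class FixedWindowLimiter {
  private counts = new Map<string, { windowStart: number; count: number }>();
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // nowMs is passed in so the boundary behavior is easy to reproduce.
  allow(key: string, nowMs: number): boolean {
    const windowStart = Math.floor(nowMs / this.windowMs) * this.windowMs;
    let entry = this.counts.get(key);
    if (!entry || entry.windowStart !== windowStart) {
      // A new clock-aligned window has started: the counter resets.
      entry = { windowStart, count: 0 };
      this.counts.set(key, entry);
    }
    if (entry.count >= this.limit) return false;
    entry.count += 1;
    return true;
  }
}

// 100 requests just before the minute rolls over, 100 more just after.
const fixedLimiter = new FixedWindowLimiter(100, 60000);
let admitted = 0;
for (let i = 0; i < 100; i++) {
  if (fixedLimiter.allow("client-a", 59900)) admitted++;
}
for (let i = 0; i < 100; i++) {
  if (fixedLimiter.allow("client-a", 60100)) admitted++;
}
// admitted is 200: every request got through inside a 200 ms span.
```

The limiter did exactly what it was told. The problem is that "100 per clock minute" and "100 in any 60 seconds" are different promises.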

Sliding window log fixes accuracy completely. You store a timestamp for every request and, on each new request, count how many fall within the trailing window. Perfectly accurate, no boundary effect. It's also a memory disaster: a client allowed 1,000 requests/minute can cost you up to 1,000 stored timestamps per key, and busy clients sit at that ceiling. At scale this eats Redis alive.
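
A rough in-memory sketch of the log approach, run against the same boundary scenario, shows both the accuracy and the memory cost. The class and method names are hypothetical, and a Redis version would more likely use a sorted set than an array.

```typescript
// In-memory sliding window log (illustrative only; names are hypothetical).
class SlidingWindowLog {
  private logs = new Map<string, number[]>();
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  allow(key: string, nowMs: number): boolean {
    const cutoff = nowMs - this.windowMs;
    // Drop timestamps that have aged out of the trailing window.
    const recent = (this.logs.get(key) ?? []).filter((t) => t > cutoff);
    const ok = recent.length < this.limit;
    if (ok) recent.push(nowMs);
    this.logs.set(key, recent);
    return ok;
  }

  // How many timestamps this key is currently holding in memory.
  storedTimestamps(key: string): number {
    return this.logs.get(key)?.length ?? 0;
  }
}

const logLimiter = new SlidingWindowLog(100, 60000);
let logAdmitted = 0;
for (let i = 0; i < 100; i++) {
  if (logLimiter.allow("client-a", 59900)) logAdmitted++;
}
for (let i = 0; i < 100; i++) {
  if (logLimiter.allow("client-a", 60100)) logAdmitted++;
}
// logAdmitted is 100: the second batch is rejected, no boundary burst.
// logLimiter.storedTimestamps("client-a") is 100: one entry per admitted request.
```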

Sliding window counter is the pragmatic middle. You keep two fixed-window counters (current and previous) and weight the previous one by how far you are into the current window. If you're 25% into the current minute, you count the full current bucket plus 75% of the previous bucket. It approximates the true sliding window within a couple percent, using two integers per key. This is what Cloudflare popularized, and it's my default for HTTP traffic.
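
The weighting is easier to trust once you see it with numbers. This is only the arithmetic, with a made-up function name and made-up sample counts (80 requests last minute, 30 so far this minute).

```typescript
// elapsedFraction: how far into the current window we are, from 0 to 1.
function slidingWindowEstimate(
  prevCount: number,
  curCount: number,
  elapsedFraction: number,
): number {
  // Full current bucket plus the share of the previous bucket still inside the window.
  return curCount + prevCount * (1 - elapsedFraction);
}

// 25% into the current minute: 30 + 80 * 0.75 = 90
const estimate = slidingWindowEstimate(80, 30, 0.25);
```

With a limit of 100, that client still has room for about 10 more requests right now, and the room grows as the previous minute's weight decays.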

Token bucket models a bucket that refills at a steady rate. Each request takes one token; an empty bucket means rejection. The key property is controlled burst: a client who's been quiet accumulates tokens up to a cap, then spends them in a spike. That matches how real clients behave — idle, then a batch job fires. You store two numbers per key (token count and last-refill timestamp) and compute the refill lazily. This is my default for anything where occasional bursts are legitimate.
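
Here is the lazy refill idea as a small in-memory sketch. The class name and the seconds-based clock parameter are assumptions for the demo, and it ignores the shared-state problem entirely.

```typescript
// In-memory token bucket with lazy refill (hypothetical names, single process).
class InMemoryTokenBucket {
  private tokens: number;
  private lastRefillSec: number;
  private readonly capacity: number;
  private readonly refillPerSec: number;

  constructor(capacity: number, refillPerSec: number, nowSec: number) {
    this.capacity = capacity;
    this.refillPerSec = refillPerSec;
    this.tokens = capacity; // a new bucket starts full
    this.lastRefillSec = nowSec;
  }

  tryTake(nowSec: number, cost = 1): boolean {
    // Lazy refill: credit tokens earned since the last call, capped at capacity.
    const elapsed = Math.max(0, nowSec - this.lastRefillSec);
    this.tokens = Math.min(this.capacity, this.tokens + elapsed * this.refillPerSec);
    this.lastRefillSec = nowSec;
    if (this.tokens < cost) return false;
    this.tokens -= cost;
    return true;
  }
}

const bucket = new InMemoryTokenBucket(10, 1, 0);
let burst = 0;
for (let i = 0; i < 12; i++) {
  if (bucket.tryTake(0)) burst++;
}
// burst is 10: the full bucket is spent at once, requests 11 and 12 are rejected.

const afterThreeSeconds = [
  bucket.tryTake(3),
  bucket.tryTake(3),
  bucket.tryTake(3),
  bucket.tryTake(3),
];
// Three tokens refilled by t = 3, so this is [true, true, true, false].
```

No timer ever runs. The refill is computed only when a request shows up, which is why the storage cost stays at two numbers per key.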

Leaky bucket is token bucket's mirror: requests queue and drain at a fixed rate, smoothing output completely. Great when the thing you're protecting cannot tolerate bursts at all — a downstream payment processor with a hard QPS ceiling. The cost is added latency, since requests wait in the queue.
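
A queue-based leaky bucket can be sketched in a few lines. The names are hypothetical, and the scheduler that calls leak() at the fixed drain rate is deliberately left out; in practice that would be a timer or worker loop.

```typescript
// Queue-style leaky bucket (illustrative; the drain scheduler is not shown).
class LeakyBucketQueue<T> {
  private queue: T[] = [];
  private readonly capacity: number;

  constructor(capacity: number) {
    this.capacity = capacity;
  }

  // Returns false when the bucket is full and the request overflows.
  offer(item: T): boolean {
    if (this.queue.length >= this.capacity) return false;
    this.queue.push(item);
    return true;
  }

  // Call once per drain tick; releases at most one item, oldest first.
  leak(): T | undefined {
    return this.queue.shift();
  }

  depth(): number {
    return this.queue.length;
  }
}

const payments = new LeakyBucketQueue<string>(3);
const accepted = ["p1", "p2", "p3", "p4", "p5"].filter((id) => payments.offer(id));
// accepted is ["p1", "p2", "p3"]; p4 and p5 overflow and are rejected.
const next = payments.leak();
// next is "p1"; two requests are still waiting for later ticks.
```

The waiting is where the latency comes from: with a drain rate of one per second, p3 is not released until the third tick.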

Algorithm	Memory per key	Accuracy	Allows bursts	Best for
Fixed window	1 counter	Low (boundary burst)	Accidentally	Internal, low-stakes limits
Sliding window log	N timestamps	Exact	No	Small N, audit-grade limits
Sliding window counter	2 counters	~99%	No	General HTTP APIs
Token bucket	2 values	High	Yes, controlled	Public APIs, bursty clients
Leaky bucket	queue + 1 value	High	No (smoothed)	Protecting fragile downstreams

Where To Enforce It

Picking the algorithm is half the decision. The other half is the layer, and each layer sees different information.

Edge / CDN / WAF. Cloudflare, Vercel's firewall, AWS WAF. This is where you stop volumetric abuse and L7 floods before they touch your origin. You key on IP or ASN, the limits are coarse (say, 1,000 req/min per IP), and crucially the attacker's traffic never costs you compute. Always have something here.
API gateway. Kong, Envoy, AWS API Gateway. Here you key on API key or client ID and enforce plan-level quotas — the "free tier gets 60 req/min, pro gets 600" logic. Centralized, language-agnostic, and it offloads your app servers.
Application. This is the only layer that knows business context: which user, which tenant, whether this is a $0.002 read or a $4 report generation. Endpoint-specific and cost-based limits live here.
Per-tenant. A cross-cutting concern at the app layer. One noisy tenant must never starve the others — that was my 3 a.m. outage. Key by tenant ID and give each their own budget.

The rule I follow: coarse limits at the edge, plan limits at the gateway, and the limits that require business knowledge in the app. Don't try to do cost-based limiting at the WAF — it has no idea what an endpoint costs.

Distributed Rate Limiting With Redis

Once you run more than one app instance, in-memory counters are wrong — each instance sees only its own slice of traffic. You need shared state, and Redis is the standard answer because it's fast and gives you atomic operations.

The naive distributed approach is INCR plus EXPIRE. It works for fixed window, but watch the trap:

// BROKEN: INCR and EXPIRE are separate commands, so the pair is not atomic
const count = await redis.incr(key);
if (count === 1) {
  await redis.expire(key, windowSeconds); // can be lost on a crash between calls
}

If the process dies between INCR and EXPIRE, the key never expires and the limit sticks forever. Use a Lua script so the whole operation is atomic — Redis runs Lua single-threaded, so there's no interleaving.
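
For the fixed-window case, the atomic version might look like the sketch below. The EvalClient interface and the function name are mine; ioredis exposes an eval(script, numKeys, ...args) method, and I'm assuming an ioredis client fits this narrowed shape, so check that against your client's types.

```typescript
// INCR and, on the first hit of a window, EXPIRE, as one atomic script.
const FIXED_WINDOW_LUA = `
local count = redis.call("INCR", KEYS[1])
if count == 1 then
  redis.call("EXPIRE", KEYS[1], ARGV[1])
end
return count
`;

// The only client capability this sketch needs (hypothetical interface).
interface EvalClient {
  eval(script: string, numKeys: number, ...args: (string | number)[]): Promise<unknown>;
}

async function fixedWindowAtomic(
  client: EvalClient,
  key: string,
  limit: number,
  windowSeconds: number,
): Promise<boolean> {
  // The script returns the post-increment count for this window.
  const count = Number(await client.eval(FIXED_WINDOW_LUA, 1, key, windowSeconds));
  return count <= limit;
}

// Usage, assuming `redis` is a connected client:
//   const allowed = await fixedWindowAtomic(redis, "rl:fw:tenant:42", 100, 60);
```

Because the TTL is set inside the same script as the increment, there is no gap for a crash to land in.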

Here's a real token bucket as a Lua script. It refills lazily based on elapsed time, which means you never run a background job to top up buckets:

-- token_bucket.lua
-- KEYS[1] = bucket key
-- ARGV[1] = capacity (max tokens)
-- ARGV[2] = refill_rate (tokens per second)
-- ARGV[3] = now (unix seconds, fractional)
-- ARGV[4] = requested (tokens to take, usually 1)
local capacity    = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])
local now         = tonumber(ARGV[3])
local requested   = tonumber(ARGV[4])
 
local bucket = redis.call("HMGET", KEYS[1], "tokens", "ts")
local tokens = tonumber(bucket[1])
local last   = tonumber(bucket[2])
 
if tokens == nil then
  tokens = capacity
  last = now
end
 
-- lazy refill
local delta = math.max(0, now - last)
tokens = math.min(capacity, tokens + delta * refill_rate)
 
local allowed = tokens >= requested
if allowed then
  tokens = tokens - requested
end
 
redis.call("HSET", KEYS[1], "tokens", tokens, "ts", now)
-- expire the idle key so we don't leak memory: time to fully refill + slack
local ttl = math.ceil(capacity / refill_rate) + 10
redis.call("EXPIRE", KEYS[1], ttl)
 
-- return allowed flag, remaining tokens, and seconds until enough tokens have refilled
local retry_after = 0
if not allowed then
  retry_after = (requested - tokens) / refill_rate
end
return { allowed and 1 or 0, math.floor(tokens), tostring(retry_after) }

The TypeScript wrapper loads the script once and calls it by SHA. With ioredis you get defineCommand, which handles the EVALSHA/EVAL fallback automatically:

import Redis from "ioredis";
import { readFileSync } from "node:fs";
 
const redis = new Redis(process.env.REDIS_URL!);
 
redis.defineCommand("tokenBucket", {
  numberOfKeys: 1,
  lua: readFileSync(new URL("./token_bucket.lua", import.meta.url), "utf8"),
});
 
export interface LimitResult {
  allowed: boolean;
  remaining: number;
  retryAfter: number; // seconds
}
 
export async function tokenBucket(
  key: string,
  capacity: number,
  refillPerSec: number,
  cost = 1,
): Promise<LimitResult> {
  const now = Date.now() / 1000;
  // @ts-expect-error custom command added at runtime
  const [allowed, remaining, retryAfter] = await redis.tokenBucket(
    `rl:tb:${key}`,
    capacity,
    refillPerSec,
    now,
    cost,
  );
  return {
    allowed: allowed === 1,
    remaining,
    retryAfter: Math.ceil(parseFloat(retryAfter)),
  };
}

Two things to call out. The cost parameter is how you do cost-based limiting: a cheap read costs 1 token, a report generation costs 50. Same bucket, weighted by expense. And the key prefix matters: rl:tb: namespaces these keys, so a SCAN with MATCH rl:tb:* for cleanup touches only rate-limit data. A prefix is no protection against FLUSHDB, though, which wipes every key in the selected database.
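
One way to keep those costs in one place is a small lookup table. The route strings and the helper name here are hypothetical; only the 1-versus-50 weighting comes from the example above.

```typescript
// Hypothetical per-endpoint token costs; tune these to what each call really costs you.
const ENDPOINT_COSTS: Record<string, number> = {
  "GET /reports": 1,
  "POST /reports/generate": 50,
};

// Unknown endpoints fall back to the cheapest cost rather than failing.
function costFor(method: string, path: string, fallback = 1): number {
  return ENDPOINT_COSTS[`${method.toUpperCase()} ${path}`] ?? fallback;
}

const reportCost = costFor("post", "/reports/generate"); // 50
const defaultCost = costFor("GET", "/health"); // 1, not in the table
```

The result is what you would pass as the cost argument to tokenBucket, so the handler never hardcodes a number.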

If you prefer the sliding window counter, the logic is just as compact. Here it is inline rather than as a script, to show the weighting math. Note that it increments before it checks, so rejected requests still count against the window:

export async function slidingWindow(
  key: string,
  limit: number,
  windowSec: number,
): Promise<LimitResult> {
  const now = Date.now() / 1000;
  const windowStart = Math.floor(now / windowSec) * windowSec;
  const curKey = `rl:sw:${key}:${windowStart}`;
  const prevKey = `rl:sw:${key}:${windowStart - windowSec}`;
 
  const [[, cur], [, prev]] = (await redis
    .multi()
    .incr(curKey)
    .get(prevKey)
    .expire(curKey, windowSec * 2)
    .exec())!;
 
  const elapsed = (now - windowStart) / windowSec; // 0..1 into current window
  const weighted = Number(prev ?? 0) * (1 - elapsed) + Number(cur);
 
  const allowed = weighted <= limit;
  return {
    allowed,
    remaining: Math.max(0, Math.floor(limit - weighted)),
    retryAfter: allowed ? 0 : Math.ceil((1 - elapsed) * windowSec),
  };
}

Applying It In A Next.js Route Handler

The enforcement point should return the right status and headers, not just a bare 403. A 429 with Retry-After and the draft RateLimit headers tells well-behaved clients exactly how to back off, which dramatically cuts retry storms.

// app/api/reports/route.ts
import { NextRequest, NextResponse } from "next/server";
import { tokenBucket } from "@/lib/rate-limit";
 
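function clientKey(req: NextRequest): string {
  // Prefer authenticated tenant; fall back to IP for anonymous traffic.
  // Assumes x-tenant-id is set by trusted auth middleware after authentication;
  // a client-supplied value would let callers pick their own bucket.
  const tenant = req.headers.get("x-tenant-id");
  if (tenant) return `tenant:${tenant}`;
  // x-forwarded-for is only reliable behind a proxy that overwrites it.
  const ip =
    req.headers.get("x-forwarded-for")?.split(",")[0]?.trim() ?? "unknown";
  return `ip:${ip}`;
}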
 
export async function POST(req: NextRequest) {
  // Report generation is expensive: 50 tokens per call.
  // Bucket: 200 capacity, refills 1 token/sec -> ~1.2 reports/min sustained,
  // but a rested tenant can burst 4 reports immediately.
  const { allowed, remaining, retryAfter } = await tokenBucket(
    clientKey(req),
    200, // capacity
    1, // refill per second
    50, // cost of this endpoint
  );
 
  const headers = new Headers({
    "RateLimit-Limit": "200",
    "RateLimit-Remaining": String(remaining),
    "RateLimit-Policy": "200;w=200",
  });
 
  if (!allowed) {
    headers.set("Retry-After", String(retryAfter));
    return NextResponse.json(
      { error: "rate_limited", message: "Too many requests." },
      { status: 429, headers },
    );
  }
 
  // ... do the expensive work ...
  return NextResponse.json({ ok: true }, { headers });
}

Two production notes.

Fail open, carefully. If Redis is down, wrap the check in a try/catch and decide deliberately: for a login endpoint, fail closed; for a read endpoint, failing open keeps the product usable. Never let a Redis blip take down your whole API.
Throttle login attempts separately and aggressively. Five attempts per account per 15 minutes, keyed by username and IP, is a reasonable floor. That's not a rate limit for fairness, it's a brute-force defense, so failing closed is the right call there.
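
The fail-open versus fail-closed choice can be made explicit with a small wrapper, so it is a decision in the code rather than an accident of where the exception lands. The type and function names are hypothetical, and the check is any function that resolves to allowed or not, such as a call into the limiter.

```typescript
type FailureMode = "open" | "closed";

// Runs a limiter check and applies a deliberate policy if the backend throws.
async function checkWithFailureMode(
  check: () => Promise<boolean>,
  mode: FailureMode,
): Promise<boolean> {
  try {
    return await check();
  } catch {
    // The limiter backend is unreachable: admit on "open", reject on "closed".
    return mode === "open";
  }
}

// Usage sketch, assuming the tokenBucket helper from earlier:
//   reads:  await checkWithFailureMode(async () => (await tokenBucket(key, 200, 1)).allowed, "open");
//   logins: await checkWithFailureMode(async () => (await tokenBucket(key, 5, 5 / 900)).allowed, "closed");
```

In a real handler you would also log or count the caught error, since a limiter that is silently failing open is not protecting anything.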

A Decision Checklist

When you're about to add a limit, walk this in order:

What am I protecting? Compute, a fragile downstream, or an account? That picks fail-open versus fail-closed.
Do legitimate clients burst? If yes, token bucket. If the downstream can't tolerate bursts, leaky bucket. Otherwise sliding window counter.
What's the key? Tenant ID beats user ID beats IP. Use the most specific identity you trust.
What layer? Volumetric → edge. Plan quota → gateway. Anything needing business context or cost weighting → app.
Is the endpoint expensive? Then weight it with a cost. Don't treat a $4 report like a $0.002 ping.
Are the headers right? 429, Retry-After, and RateLimit-* on every throttled response, every time.

Install the seatbelt before the crash. The four lines of Redis that would have saved my 3 a.m. were never about capacity — they were about deciding, on purpose, who gets to spend it.