Analytics13 min read2026-02-15

Building a Scalable Click Analytics Pipeline

Architecture patterns for real-time link analytics at scale

Priya PatelData Engineering Lead

Building a Scalable Click Analytics Pipeline

Why analytics pipelines break under high click volumes

Building a link analytics system that handles ten clicks per second is trivial. Building one that handles ten thousand clicks per second without collapsing is a complex data engineering challenge. Most engineering teams start by inserting a row into a PostgreSQL database every time a user clicks a link. This works perfectly during early-stage testing. However, as traffic scales, this architecture inevitably fails due to write amplification. Every click requires a disk write to the Write-Ahead Log (WAL), an update to multiple B-tree indexes, and a table lock contention that blocks read queries. The dashboard becomes slow, the redirect handler becomes latent, and the database disk fills up at an alarming rate. Building a scalable analytics pipeline requires entirely separating the real-time redirect path from the asynchronous analytics processing path.

Diagram: Decoupled click analytics architecture

┌──────────────────────┐
│ 1. Redirect Handler │
│ (Fire and Forget) │
└──────────┬───────────┘
┌──────────────────────┐
│ 2. Ingestion Buffer │
│ (Redis / Kafka) │
└──────────┬───────────┘
┌──────────────────────┐
│ 3. Stream Processor │
│ (Enrichment + Parse) │
└──────────┬───────────┘
┌──────────────────────┐
│ 4. OLAP Data Store │
│ (ClickHouse / S3) │
└──────────────────────┘

The ingestion bottleneck and write amplification

To understand why direct database writes fail, you must understand write amplification. When you insert a single row into a table with three indexes, the database does not just write one row to the disk. It writes the row to the main table space, it writes the primary key to the primary index, and it writes entries to the two secondary indexes. A single logical write becomes four or five physical disk writes. At ten thousand clicks per second, that is forty to fifty thousand disk IOPS, which requires expensive, high-end NVMe storage. Furthermore, database transaction isolation levels cause lock contention. While the database is writing clicks, your dashboard queries—which need to read those same indexes—are forced to wait in a queue. The solution is to stop writing clicks directly to the relational database entirely.

The buffering layer: Redis vs. Kafka

The first step in a scalable pipeline is introducing a buffering layer between the redirect handler and the database. When a click happens, the redirect handler pushes a JSON payload containing the click metadata into an in-memory buffer and immediately returns the 302 redirect to the user. The user experiences zero latency from the analytics pipeline. For small to medium scale, Redis Lists or Redis Streams are the ideal buffer. They are extremely fast, require minimal operational overhead, and can easily absorb massive traffic spikes. For enterprise scale or distributed microservices architectures, Apache Kafka is the standard. Kafka provides durable, replicated logs, ensuring that no click data is lost even if a server crashes. However, Kafka introduces significant operational complexity that is overkill for most shortener applications until they reach massive scale.

Stream processing: Enrichment and transformation

Raw click data is useless for analytics. A raw payload contains an IP address, a raw User-Agent string, and a referer header. To make this data queryable, you must enrich it. A background worker process consumes messages from the Redis buffer and performs several operations. First, it parses the User-Agent string to extract the operating system, browser family, and device type (mobile, tablet, desktop). Second, it performs a GeoIP lookup against a local database like MaxMind to convert the IP address into a country, region, and city. Third, it classifies the referer header to determine if the click came from a search engine, a social platform, an email client, or direct traffic. This enrichment must happen in the worker, never in the redirect handler, because these operations take too long and would add unacceptable latency to the user's redirect.

OLTP vs. OLAP: Separating your storage engines

The most critical architectural decision in a scalable pipeline is separating your transactional data (OLTP) from your analytical data (OLAP). Your short links, user accounts, and configurations belong in a relational database like PostgreSQL or MySQL. This is your system of record. Your click data belongs in a columnar, time-series analytical database. Storing billions of clicks in PostgreSQL is a recipe for disaster. Columnar databases like ClickHouse, Apache Druid, or TimescaleDB are explicitly designed for time-series data. They store data by column rather than by row, which provides 10x to 100x faster aggregation queries and extreme data compression. A billion clicks that might require 500GB of storage in PostgreSQL can often be compressed down to 50GB in ClickHouse, drastically reducing infrastructure costs.

Time-series partitioning strategies

How you store the click data on disk determines the performance of your dashboard queries. If you store all clicks in a single massive table, querying the total clicks for a specific campaign over the last 30 days requires the database to scan billions of irrelevant rows from previous years. You must partition the data by time. The standard approach is daily or hourly partitions. When a query requests data for "the last 7 days," the database can completely ignore the partitions containing data from last month, performing a partition prune. This reduces the amount of data scanned from terabytes down to gigabytes, resulting in sub-second dashboard queries. If you are using ClickHouse, use the MergeTree engine family with a partition key based on the date. If you are using PostgreSQL, use declarative partitioning with the RANGE type on the created_at timestamp.

Handling schema evolution without downtime

As your analytics needs grow, you will inevitably need to add new data points to your click stream. You might decide to track the user's preferred language, the specific ASN (Autonomous System Number) of their ISP, or whether they were using a dark mode browser extension. In a traditional database, adding a column to a table with billions of rows can lock the table and cause an outage. In a stream-processed pipeline, schema evolution is trivial because you are ingesting JSON documents. If your worker starts attaching a new "asn" field to the JSON payload, your columnar database will automatically create the new column on the fly when it receives the first document. This allows your data engineering team to iterate rapidly, adding new tracking dimensions without requiring complex, high-risk database migrations on massive tables.

Cost optimization: Hot storage vs. cold storage

Storing billions of enriched click events in a fast, queryable analytical database is expensive. While ClickHouse is incredibly efficient, storage costs still scale linearly with data volume. To optimize costs, implement a tiered storage strategy. Keep raw, granular click data in your hot analytical database for a limited retention window—typically 30 to 90 days. This covers the vast majority of active campaign analysis and troubleshooting. After the retention window expires, run an automated batch job that aggregates the raw clicks into daily or hourly rollups (e.g., total clicks per campaign per country per day). Store these aggregated rollups in the hot database permanently. Then, export the raw, expired click data to cheap cold storage like Amazon S3 in a compressed columnar format like Parquet, where it can be queried rarely using Athena or Spark for deep historical audits, at a fraction of the cost.

Backpressure and handling traffic spikes

During a viral event or a successful product launch, your click volume might spike by 10x or 50x in a matter of minutes. If your stream processing workers cannot enrich and write the data fast enough, a backlog builds up in your Redis buffer. This is called backpressure. If unmanaged, Redis will run out of memory and crash, causing permanent data loss. Your pipeline must be designed to handle backpressure gracefully. Implement consumer group logic so you can horizontally scale your worker nodes by adding more servers when traffic spikes. Configure Redis with a maximum memory limit and an eviction policy that drops old click data rather than crashing the server. It is better to lose a small percentage of raw analytics data during an unprecedented spike than to let the buffer overflow and crash the entire ingestion pipeline.

FAQ

Why can't I just use PostgreSQL for everything?

You can, for low traffic. But PostgreSQL uses row-based storage and B-tree indexes, which are optimized for transactional queries (finding a specific user) not analytical queries (aggregating 10 million rows). As click data grows, PostgreSQL queries will slow down to minutes, and disk costs will become unsustainable.

Is Kafka really necessary for a URL shortener?

Almost never. Kafka is designed for distributed, multi-team event streaming. For a single-application URL shortener, Redis Streams or a simple Redis List is vastly simpler to operate, requires fewer servers, and provides more than enough throughput for millions of clicks per minute.

How do I handle duplicate clicks in the pipeline?

Network retries and browser prefetches can cause duplicate click records. Assign a unique event ID to every click on the redirect handler using a combination of the IP, User-Agent, and a timestamp hash. When your stream processor writes to the analytical database, use that hash as a deduplication key.

What is the best way to query the data for a dashboard?

Never query the analytical database directly from the frontend. The frontend should request data from a backend API, which caches frequent queries in Redis or Memcached. Pre-aggregate dashboard metrics (like total clicks per link per day) in a separate summary table that updates every few minutes, rather than calculating them on the fly from billions of raw rows.

Conclusion

Building a scalable analytics pipeline for a URL shortener requires abandoning the simplicity of direct database writes in favor of a decoupled, stream-processed architecture. By absorbing clicks in an in-memory buffer, enriching them in background workers, and storing them in a columnar OLAP database with aggressive time-based partitioning, you create a system that handles massive traffic spikes without slowing down the user's redirect, while simultaneously reducing infrastructure costs through intelligent data compression and cold storage tiering.

Tags

AnalyticsData EngineeringStream ProcessingKafkaClickHouseReal-Time Data