Building a View Aggregator + Tracking System
Tracking is everywhere. At my last startup, we integrated tools like SmartLook, AppsFlyer, Google Analytics, something from Facebook, and so on. Interestingly enough, these applications had more tracking information than we collected ourselves! Moreover, there were more requests to trackers on every page than there were requests to our own servers.
Research
In order to build a View Aggregator, we will look at some existing solutions and see how they work.
Hello Interview - Ad Aggregator
An Ad Click Aggregator is a system that collects and aggregates data on ad clicks. It is used by advertisers to track the performance of their ads and optimize their campaigns. For our purposes, we will assume these are ads displayed on a website or app, like Facebook.
Requirements
Functional Requirements:
- Users click an ad and are redirected to the advertiser's page
- Advertisers want to query for click metrics for their ads
- [NICE TO HAVE] Ad targeting based on user behavior
- [NICE TO HAVE] Cross device tracking
- [NICE TO HAVE] Integration with offline channels
Non Functional Requirements:
- Scale - 10 million active ads at a given time; 10k clicks per second at peak
- Latency - < 1s query response time
- Fault Tolerance - no data loss
- Real time - data should be available in real time
- Idempotency of ad clicks
- [NICE TO HAVE] GDPR compliance
- [NICE TO HAVE] Conversion tracking
- [NICE TO HAVE] Fraud detection
System interface and dataflow
Input:
- Click data
- Advertiser queries
Outputs:
- Redirection
- Aggregate click metrics
Dataflow
1. Click data comes to the system
2. User is redirected
3. Click data is validated
4. Click data is logged
5. Click data is aggregated
6. Click data is queried
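To make the dataflow concrete, here is a minimal single-process sketch of the six steps. Everything in it is an assumption for illustration: the `ClickPipeline` and `ClickEvent` names, the `redirect_urls` lookup table, and the 404 for unknown ads. It also validates before redirecting and does everything synchronously, whereas a real system would return the redirect first and do the rest asynchronously.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ClickEvent:
    ad_id: str
    user_id: str
    timestamp: float


@dataclass
class ClickPipeline:
    # Hypothetical lookup table: ad_id -> advertiser landing page.
    redirect_urls: dict[str, str]
    # Step 4: append-only log of raw click events.
    log: list[ClickEvent] = field(default_factory=list)
    # Step 5: running click count per ad.
    counts: dict[str, int] = field(default_factory=dict)

    def handle_click(
        self, ad_id: str, user_id: str, now: Optional[float] = None
    ) -> tuple[int, Optional[str]]:
        """Steps 1-5. Returns (HTTP status, redirect target)."""
        # Step 3: validate that the ad exists (assumed validation rule).
        url = self.redirect_urls.get(ad_id)
        if url is None:
            return 404, None
        event = ClickEvent(ad_id, user_id, time.time() if now is None else now)
        self.log.append(event)                               # step 4: log
        self.counts[ad_id] = self.counts.get(ad_id, 0) + 1   # step 5: aggregate
        return 302, url                                      # step 2: redirect

    def query(self, ad_id: str) -> int:
        """Step 6: advertiser asks for the click count of one ad."""
        return self.counts.get(ad_id, 0)
```

Usage would look like `ClickPipeline({"ad-1": "https://example.com/landing"}).handle_click("ad-1", "user-1")`, where the URL is a placeholder.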
HLD
For querying, storing every click event in a SQL database and aggregating at query time would be too slow: the system has to absorb 10k writes per second at peak.
- Ad click hits the processor.
- Click processor service returns a 302 after publishing an event to Kinesis or Kafka.
- Apache Flink processes the event stream and aggregates the events over a time window in memory.
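The windowed aggregation step can be illustrated without Flink. The sketch below is plain Python, not the Flink API: it buckets events into fixed (tumbling) windows and counts clicks per ad per window. The 60-second window size and the `(ad_id, timestamp)` event shape are assumptions, and it ignores what Flink actually adds on top, such as late events and checkpointing.

```python
from collections import defaultdict
from typing import Iterable

WINDOW_SECONDS = 60  # assumed window size, not taken from the design above


def window_start(timestamp: float, size: int = WINDOW_SECONDS) -> int:
    """Round a timestamp down to the start of its tumbling window."""
    return int(timestamp // size) * size


def aggregate(
    events: Iterable[tuple[str, float]], size: int = WINDOW_SECONDS
) -> dict[tuple[str, int], int]:
    """Count clicks per (ad_id, window start) for a stream of (ad_id, timestamp)."""
    counts: dict[tuple[str, int], int] = defaultdict(int)
    for ad_id, timestamp in events:
        counts[(ad_id, window_start(timestamp, size))] += 1
    return dict(counts)
```

Keeping one row per ad per window is what makes advertiser queries cheap: the query sums a handful of pre-aggregated rows instead of scanning raw click events.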
LLD
- Use a KV store to ensure that we only count each click once, even if it is sent multiple times by a malicious user.
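A rough sketch of that deduplication check, with a dict standing in for the KV store. The `click_id` (a unique ID attached to each ad impression), the `ClickDeduplicator` class, and the one-hour TTL are all assumptions for illustration. A real KV store would need an atomic set-if-absent operation with an expiry (in Redis, `SET` with the `NX` and `EX` options) so that two processors cannot both claim the same click.

```python
class ClickDeduplicator:
    """In-memory stand-in for a KV store with expiring keys."""

    def __init__(self, ttl_seconds: float = 3600.0):
        self.ttl_seconds = ttl_seconds          # assumed retention period
        self._seen: dict[str, float] = {}       # click_id -> expiry time

    def first_time(self, click_id: str, now: float) -> bool:
        """Return True only the first time a click_id is seen within the TTL."""
        expiry = self._seen.get(click_id)
        if expiry is not None and expiry > now:
            return False  # duplicate: already counted and not yet expired
        self._seen[click_id] = now + self.ttl_seconds
        return True
```

The click processor would only publish the event downstream when `first_time` returns True, which is what makes repeated submissions of the same click harmless.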
Adding a View Aggregator to this website
To keep things simple, we will add tracking events to every single page of this website.
Requirements
Functional Requirements:
- Track page views
Non Functional Requirements:
- Eventually Consistent
- Available - 99.99%
- Partition Tolerant
System Interface and Dataflow
Each page makes a tracking API call that sends a page view event to the tracking system.
Input:
- Page view events
Output:
- Tracking system
Because this site runs on Cloudflare, I can use Cloudflare Queues, with the queue consumers acting as a Flink replacement: they aggregate the counts in memory and periodically write them to a datastore.
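The consumer logic is roughly the following. This is a language-agnostic sketch in Python that uses no Cloudflare API; the `ViewAggregator` class, the `flush` callback that stands in for the datastore write, and the count-based flush threshold are all assumptions (a timer-based flush would work just as well).

```python
from collections import Counter
from typing import Callable, Iterable


class ViewAggregator:
    """Accumulates page view counts in memory and flushes them in bulk."""

    def __init__(self, flush: Callable[[dict[str, int]], None], flush_every: int = 100):
        self._flush = flush              # hypothetical datastore writer
        self._flush_every = flush_every  # assumed threshold, in events
        self._pending: Counter = Counter()
        self._buffered = 0

    def handle_batch(self, paths: Iterable[str]) -> None:
        """Consume one batch of page view events (one page path per event)."""
        for path in paths:
            self._pending[path] += 1
            self._buffered += 1
        if self._buffered >= self._flush_every:
            self.flush()

    def flush(self) -> None:
        """Write the accumulated counts out and reset the in-memory state."""
        if not self._pending:
            return
        self._flush(dict(self._pending))
        self._pending.clear()
        self._buffered = 0
```

Anything sitting in `_pending` between flushes exists only in memory, so a lower `flush_every` trades more datastore writes for a smaller window of unflushed counts.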
There are some severe limitations to this approach:
- If the consumer crashes, we lose all the data held in memory.
- A single queue can only handle 1000 events per second, so if there is a popular article, we will lose data unless we create hot queues for popular articles.
- We can't do real-time analytics, as the data is only written to the datastore after a certain time period.
Amazon Kinesis has similar limitations: each shard supports writes of up to 1,000 records per second, up to a maximum of 1 MB per second.
Despite these limitations, this is the approach I will take for now.
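If the per-queue limit ever becomes a problem, the hot queue idea could look something like the sketch below. It assumes the 1000 events per second figure quoted above, and the function names and the `path#n` key format are made up for illustration. The idea is to spread a popular page over several sub-keys (each routed to its own queue or shard) and sum the sub-counts back together when reading.

```python
import math
import random

QUEUE_LIMIT_PER_SECOND = 1000  # per-queue figure quoted above


def queues_needed(events_per_second: int, limit: int = QUEUE_LIMIT_PER_SECOND) -> int:
    """How many queues (or shards) a given event rate needs."""
    return max(1, math.ceil(events_per_second / limit))


def shard_key(path: str, shards: int, rng: random.Random) -> str:
    """Pick one of `shards` sub-keys for a page at random."""
    return f"{path}#{rng.randrange(shards)}"


def merge_shards(sharded_counts: dict[str, int]) -> dict[str, int]:
    """Sum the per-shard counts back into one count per page."""
    totals: dict[str, int] = {}
    for key, count in sharded_counts.items():
        path = key.rsplit("#", 1)[0]
        totals[path] = totals.get(path, 0) + count
    return totals
```

For example, `queues_needed(10_000)` gives 10, so a page receiving 10k views per second would need its events spread over ten sub-keys.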
