Creating a tracking service

Tracking is a critical tool for any company. It allows us to understand how our business is going and how our customers behave in our app, and we can even build products on top of it (reporting, ads, etc.).

In all my years working with different companies in the advertising field, I realised that it is not so obvious to all engineers how to build a tracking service. So here is my first post, to share with you what I think is a good architecture for a tracking application.

DISCLAIMER: I'm not a native English writer, so if you find a mistake please let me know. Thanks!

DISCLAIMER #2: I'm going to focus on architecture design, frameworks and tradeoffs rather than on a full implementation.

Imagine that we want to create a tracking service to support an advertising platform. This platform lets its clients show ads on different pages, and it only charges for the clicks they receive.

Because this platform charges for clicks, we have to be sure to capture all the click information, while impressions are used more for statistics. Another factor is that impressions are much larger in volume than clicks.

So if we have to choose the tradeoff between accuracy and scalability it would be like this:

Because the platform charges for clicks, we have to make sure we capture as many of them as possible (accuracy over scalability). Impressions are a metric that helps users understand the quality of their ads (CTR); their volume is higher than that of clicks, but they are not critical, so for them we choose scalability over accuracy.

Another decision we have to make is whether to track the events client-side or server-side. To explain the difference between them a little, here are the pros and cons of each.

1.a Impressions on client-side

Pros:

Cons:

1.b Impressions on server-side

Pros:

Cons:

Because impressions are not a critical metric for the clients (the platform charges for clicks), I would choose server-side because it's easier to maintain. The visibility issue can also be addressed by giving a weight to the different placements.

For clicks the choice is pretty straightforward: server-side, because we cannot afford to lose 20% of the clicks, and client-side tracking could also hurt the user experience (for example, if the user has to wait until we track the click through JavaScript).

This is the list of frameworks that we are going to use in order to create our tracking application.

3.a Tracking impressions

As we mentioned before, impressions are high in volume but not critical, so we will focus on scalability and not accuracy. We also chose the server-side solution, but I will mention what is needed to move to the client-side solution.

Here is the architecture design:

Let's say that our ad server is an API that is called from the client (to leave out the parts of the architecture that are not related to tracking). Every time our server decides which ads to return, it pushes, in a second thread (so the user doesn't wait for this), all the ad IDs to the topic “impressions”, together with all the internal information we are interested in saving (algorithm, score, etc.).

Because in this case we prefer scalability, in Kafka we choose asynchronous replication, which doesn't guarantee that all data is safe (most of the time it is). This lets us scale the queue up to a high volume of data. Another thing we can do is configure a lot of partitions, to avoid losing a lot of data if a couple of nodes go down.
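
As a rough sketch of the "second thread" idea (not the original implementation), the ad server could hand each impression to a small in-process queue that a background thread drains. The class name `ImpressionPublisher`, the `send` callable and the field names are assumptions made for illustration; in a real deployment `send` would wrap whatever Kafka producer you use.

```python
import json
import queue
import threading
from typing import Callable


class ImpressionPublisher:
    """Hands impressions to a background thread so the ad request never waits."""

    def __init__(self, send: Callable[[str, bytes], None], max_pending: int = 10_000):
        self._send = send  # hypothetical hook, e.g. a wrapper around a Kafka producer
        self._pending: queue.Queue = queue.Queue(maxsize=max_pending)
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def track(self, ad_id: str, **internal) -> bool:
        """Queue one impression; returns False if it had to be dropped."""
        event = {"ad_id": ad_id, **internal}
        try:
            self._pending.put_nowait(event)
            return True
        except queue.Full:
            return False  # scalability over accuracy: drop instead of blocking

    def _run(self) -> None:
        while True:
            event = self._pending.get()
            if event is None:  # sentinel placed by close()
                return
            try:
                self._send("impressions", json.dumps(event).encode("utf-8"))
            except Exception:
                pass  # impressions are best effort in this design

    def close(self) -> None:
        """Flush what is queued and stop the worker thread."""
        self._pending.put(None)
        self._worker.join()


# Usage with an in-memory list standing in for the "impressions" topic.
sent = []
publisher = ImpressionPublisher(lambda topic, payload: sent.append((topic, payload)))
accepted = publisher.track("ad-42", algorithm="cf-v2", score=0.87)
publisher.close()
```

Dropping events when the queue is full is a deliberate choice here: it keeps the request path fast, which matches the decision to favour scalability for impressions.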
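
One way to read this tradeoff in configuration terms is sketched below. The producer keys (`acks`, `retries`, `linger_ms`) follow the keyword arguments of kafka-python's `KafkaProducer`; the partition counts, the values, and the stricter settings for a "clicks" topic are assumptions for illustration, not numbers from this design.

```python
# Hypothetical per-topic settings. The "producer" entries could be passed
# as KafkaProducer(**settings) when using the kafka-python client.
TOPIC_SETTINGS = {
    "impressions": {
        "partitions": 48,  # assumed: many partitions, so a lost node affects a small slice
        "producer": {
            "acks": 1,        # leader only; do not wait for the replicas
            "retries": 0,     # a lost impression is acceptable
            "linger_ms": 50,  # batch more aggressively for throughput
        },
    },
    "clicks": {
        "partitions": 12,  # assumed: lower volume than impressions
        "producer": {
            "acks": "all",   # wait for the in-sync replicas; clicks are billed
            "retries": 5,
            "linger_ms": 0,
        },
    },
}


def producer_settings(topic: str) -> dict:
    """Return a copy of the producer settings for one topic."""
    return dict(TOPIC_SETTINGS[topic]["producer"])
```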

On the right side we have a couple of “Dump consumers” (minimum two to guarantee stability) listening to the topic “impressions”.

To ensure that each consumer receives a different slice of the data, all these consumers have to subscribe with the same group name (if not, they will all receive the same information). These consumers do two things. First, they write the raw impression data (or formatted, if you need it) as logs to a shared disk (an SSD if possible). Second, they push this value into Cassandra or another key/value store. These logs and the data in Cassandra will eventually move on to the Hadoop cluster and the MySQL database.

Now you are probably wondering why we don't go straight to the databases. Remember that impressions are a huge volume of data that is constantly coming in. Hitting the database directly is highly inefficient.

The files allow us to avoid holding data in memory before sending it straight to Hadoop. If you are using a language or framework with non-blocking I/O, this shouldn't be a problem.

Cassandra is an easy way to summarise the information that you really want to save in the database. In this case I'm only interested in the number of impressions for each ad per day, so the key that I create is <adID-date-hh>. If you pay attention, I also added the hour to the key. This is because the batch process that dumps the Cassandra data into MySQL runs every hour and dumps the data of the hour before. Of course this is an example; you can do the same with any frequency (if you want to be closer to real time). A suggestion: find a balance between frequency and volume. MySQL and other databases of this type are not ready for a high volume of data, even less so if that data may affect triggers, indexes, or whatever black magic you have in the database (try not to have too much).
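
The two jobs of a dump consumer can be sketched as a single handler, shown here under simplifying assumptions: an in-memory `Counter` stands in for Cassandra, a `StringIO` stands in for the log file on the shared disk, and `dump_impression` and the field names are hypothetical. With the kafka-python client, sharing the work between consumers would come from giving every `KafkaConsumer` the same `group_id`.

```python
import io
import json
from collections import Counter
from typing import TextIO


def dump_impression(raw: bytes, log_file: TextIO, counters: Counter) -> None:
    """Handle one message from the "impressions" topic."""
    event = json.loads(raw)
    # 1) Append the raw event as one log line, later shipped to Hadoop.
    log_file.write(json.dumps(event, sort_keys=True) + "\n")
    # 2) Bump a counter in the key/value store (a Counter stands in for Cassandra).
    counters[event["ad_id"]] += 1


# Usage: three messages as they might arrive from the topic.
log = io.StringIO()
counters = Counter()
messages = (
    b'{"ad_id": "ad-42", "score": 0.87}',
    b'{"ad_id": "ad-42", "score": 0.91}',
    b'{"ad_id": "ad-7", "score": 0.40}',
)
for message in messages:
    dump_impression(message, log, counters)
```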
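
The key layout and the hourly dump can be sketched as follows. This is a minimal illustration, assuming an in-memory `Counter` in place of Cassandra and a list of tuples in place of the MySQL insert; `hour_key` and `previous_hour_rows` are hypothetical helper names, and the exact date format inside the key is an assumption.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def hour_key(ad_id: str, ts: datetime) -> str:
    """Build the <adID-date-hh> key for the hour that contains ts."""
    return f"{ad_id}-{ts:%Y%m%d}-{ts:%H}"


def previous_hour_rows(counters: Counter, now: datetime) -> list:
    """Rows (ad_id, hour, impressions) for the hour before `now`."""
    prev = now - timedelta(hours=1)
    suffix = f"-{prev:%Y%m%d}-{prev:%H}"
    hour = prev.strftime("%Y-%m-%d %H:00")
    return sorted(
        (key[: -len(suffix)], hour, count)
        for key, count in counters.items()
        if key.endswith(suffix)
    )


# Usage: two impressions at 13:xx and one at 14:xx, batch running at 14:05.
counters = Counter()
t = datetime(2024, 1, 5, 13, 20, tzinfo=timezone.utc)
counters[hour_key("ad-42", t)] += 2
counters[hour_key("ad-42", t + timedelta(hours=1))] += 1
rows = previous_hour_rows(counters, datetime(2024, 1, 5, 14, 5, tzinfo=timezone.utc))
```

Only the completed hour is dumped, so the 14:xx counter is left alone until the next run.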

Client-side solution notes

If you want a client-side solution, you will have to create a JavaScript snippet to send the request from the browser, and also add a temporary database like in the clicks solution (3.b.1).

3.b Tracking clicks

The clicks architecture shares a lot with impressions, but there are some functional differences. The first one is that when we receive the click from our customer, we don't have any of the backend information that we may want to save (algorithm, score, etc.), so we have to store this data temporarily somewhere. The second is that we may want to do some post-click data analysis.

Here is the architecture design:

The first difference from the impressions architecture is that, for each client request, instead of sending the data straight to the queue we store it temporarily in Cassandra.

Then we have our processing pipes, each reading from one topic and pushing the data on to another topic. In this example I want to do two things: fraud detection and budget update. Both processes do some evaluation and a database update (which doesn't matter for tracking), add one attribute to the JSON and publish that data to another topic.
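
A minimal sketch of this temporary storage, under stated assumptions: a small in-memory class with expiry stands in for Cassandra (which can expire rows on its own through a TTL), and the names `TemporaryStore`, `on_ad_served`, `on_click` and `request_id` are hypothetical.

```python
import time
from typing import Callable, Optional


class TemporaryStore:
    """In-memory stand-in for a key/value table whose rows expire."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._rows = {}

    def put(self, key: str, value: dict) -> None:
        self._rows[key] = (self._clock() + self._ttl, value)

    def get(self, key: str) -> Optional[dict]:
        row = self._rows.get(key)
        if row is None:
            return None
        expires_at, value = row
        if self._clock() >= expires_at:
            del self._rows[key]  # expired: behave as if it was never stored
            return None
        return value


def on_ad_served(store: TemporaryStore, request_id: str, ad_id: str, **internal) -> None:
    """Keep the backend context until (and if) the click arrives."""
    store.put(f"{request_id}:{ad_id}", {"ad_id": ad_id, **internal})


def on_click(store: TemporaryStore, request_id: str, ad_id: str) -> Optional[dict]:
    """Rebuild the full click event, or None if the context is gone."""
    context = store.get(f"{request_id}:{ad_id}")
    if context is None:
        return None
    return {**context, "request_id": request_id}


# Usage with a fake clock so that expiry is easy to see.
now = [0.0]
store = TemporaryStore(ttl_seconds=60, clock=lambda: now[0])
on_ad_served(store, "req-1", "ad-42", algorithm="cf-v2", score=0.87)
click = on_click(store, "req-1", "ad-42")
now[0] = 61.0
late_click = on_click(store, "req-1", "ad-42")
```

The enriched `click` is what would then be pushed to the queue; how long to keep the context is a product decision that this sketch leaves open.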

Example:

Fraud process:

Budget process:
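
As a hypothetical sketch of the two stages, each one could read a JSON message, add a single attribute and publish the result to the next topic. The topic names (`clicks-checked`, `clicks-processed`), the attribute names (`is_valid`, `has_budget`) and the check functions are assumptions for illustration; the real evaluation and database updates are left out.

```python
import json
from collections import defaultdict
from typing import Callable

Publish = Callable[[str, bytes], None]


def fraud_stage(raw: bytes, is_fraud: Callable[[dict], bool], publish: Publish) -> None:
    """Read one click, mark whether it is valid, pass it on."""
    click = json.loads(raw)
    click["is_valid"] = not is_fraud(click)
    publish("clicks-checked", json.dumps(click).encode("utf-8"))


def budget_stage(raw: bytes, has_budget: Callable[[dict], bool], publish: Publish) -> None:
    """Read one checked click, mark whether the advertiser has budget, pass it on."""
    click = json.loads(raw)
    click["has_budget"] = has_budget(click)
    publish("clicks-processed", json.dumps(click).encode("utf-8"))


# Usage: lists per topic stand in for the Kafka topics.
topics = defaultdict(list)


def publish(topic: str, payload: bytes) -> None:
    topics[topic].append(payload)


fraud_stage(
    b'{"ad_id": "ad-42", "ip": "203.0.113.9"}',
    lambda click: click["ip"] == "198.51.100.1",  # toy blocklist check
    publish,
)
budget_stage(topics["clicks-checked"][0], lambda click: True, publish)
result = json.loads(topics["clicks-processed"][0])
```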

After this, the right side of the process is pretty much the same as for impressions.

The only thing that may change is the Dump consumers, which filter out of the DB buffer the clicks that are not valid or have no budget.

Whether you like this or not, please leave your comments! Also, if you have another proposal for a tracking service, share your link!
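
A small sketch of that filter, assuming the pipeline has added boolean attributes named `is_valid` and `has_budget` to each click (hypothetical names). It also assumes that every click is still written to the raw log and only the DB buffer is filtered, which is one possible reading of the design.

```python
import io
import json
from collections import Counter
from typing import TextIO


def dump_click(click: dict, log_file: TextIO, db_buffer: Counter) -> None:
    """Log every click, but only count the billable ones in the DB buffer."""
    log_file.write(json.dumps(click, sort_keys=True) + "\n")
    if click.get("is_valid") and click.get("has_budget"):
        db_buffer[click["ad_id"]] += 1


# Usage: one billable click, one fraudulent, one without budget.
log = io.StringIO()
db_buffer = Counter()
for click in (
    {"ad_id": "ad-42", "is_valid": True, "has_budget": True},
    {"ad_id": "ad-42", "is_valid": False, "has_budget": True},
    {"ad_id": "ad-7", "is_valid": True, "has_budget": False},
):
    dump_click(click, log, db_buffer)
```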

Cheers!
