At Pearmill, we work with a lot of companies using Segment as their underlying tracking infrastructure. As these companies scale and need to build an attribution model, we often use Segment's Data Warehouse destination as the source for it.
This is the blueprint for how you can do this on your own! Note that this is a highly technical blog post – if you're not familiar with data engineering principles then it may be out of reach for you!
Note that we're assuming that you're using Segment only on web, and that you don't have mobile apps. The model will get significantly more complex when you have mobile apps in the picture!
As we've gained experience building attribution models, we've learned that the best method is a bottom-up approach: structure the data in a format that lets us choose the specific attribution model later on in the data stack.
That way, if the marketing team wants first-touch attribution, last-touch attribution, weighted attribution, or any other model, we can support it out of the box without remodeling the underlying data. In this section, we'll discuss how you can structure the data so it stays flexible for whichever model the marketing team chooses.
When an unauthenticated user hits your landing page, they are not identified as a particular account in your system. However, this hit record (page-view event) carries information that is important for identifying where the user came from, like UTM tags and/or the referrer URL.
Such a user is assigned a long random string ID, called an anonymous ID, which is stored in a cookie. So, if the same user comes back again (still before registration), the repeat page views will be recorded under the same anonymous ID.
When this user completes registration and gets a persistent ID in your system (user_id), Segment's identify() method must be called, which records a tuple of (anonymous_id, user_id). When a registered user performs a meaningful action that is recorded as a Segment event, both their anonymous_id and user_id are recorded. These data points are particularly helpful when the same user uses multiple devices.
Also, cookies expire, or may not be stored by the browser at all, so collecting as many tuples as possible is important. With this mapping, we can link the original anonymous page views back to the user's subsequent journey into the conversion funnel.
Thus, one user_id can have one or more associated anonymous_ids, sourced from the identifies, pages, and tracks tables. The table must contain only unique combinations to avoid downstream data duplication, and since it takes time to produce, it's good to materialize it as a separate table and refresh or increment it on a schedule. You'll use this table to generate a "main ID" for each user, which will be used in the subsequent tables we'll generate below!
If you're using a tool like dbt, this should be quite straightforward. It's a very expensive query to run, and materializing it helps ensure you're not using too many compute resources.
To build an attribution model, we'll need to know about all of the significant "landings" (i.e., the pages the user arrived at from external sources). Landings are page views that carry meaningful source information, like UTM tags or referrer URLs.
You'll need to track all of this at a per-user level in a dedicated table. Since building it involves window functions and possibly URL parsing, it's better to materialize this table as well, both to avoid excessive compute as your data sets grow and to speed up query times in general.
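To make the landing definition concrete, here's a small, self-contained Python sketch. The page_views rows and the is_landing rule (UTM parameters or an external referrer) are simplified stand-ins for what you'd express in warehouse SQL:

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical page-view rows as they might come out of Segment's pages table.
page_views = [
    {"main_id": "user-42", "ts": 1,
     "url": "https://example.com/?utm_source=google&utm_medium=cpc", "referrer": ""},
    {"main_id": "user-42", "ts": 2,
     "url": "https://example.com/pricing", "referrer": "https://example.com/"},
    {"main_id": "user-42", "ts": 9,
     "url": "https://example.com/", "referrer": "https://news.ycombinator.com/"},
]

def is_landing(view, own_host="example.com"):
    """A landing is a page view with UTM tags or an external referrer."""
    query = parse_qs(urlparse(view["url"]).query)
    has_utm = any(key.startswith("utm_") for key in query)
    ref_host = urlparse(view["referrer"]).netloc
    external_referrer = bool(ref_host) and ref_host != own_host
    return has_utm or external_referrer

landings = [v for v in page_views if is_landing(v)]
```

Note that the internal navigation to /pricing is filtered out: only the paid click and the external referral count as landings.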
When building this table, you'll have to watch out for the following scenarios:
The generic structure of the table is going to look like this:
Now that we have a User Mapping table and a Landings Table – we can couple them together with events to build a Touchpoints table. This table is going to hold combinations between landings and subsequent funnel events for the same user ID within a certain time window.
This table will be the core data source that allows modeling user journeys with respect to traffic source and filter/assign weights to rows based on the chosen attribution model. We'll have all of the information we'll need to get first-touch, last-touch, or other weighted attribution models.
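Here's a minimal sketch of the touchpoint join in plain Python; the landings and events rows are invented sample data keyed by the "main ID", and the 90-day window is just the example value discussed below:

```python
from datetime import datetime, timedelta

WINDOW = timedelta(days=90)  # example attribution window; see the tradeoff discussion

# Hypothetical rows keyed by the "main ID" from the user-mapping table.
landings = [
    {"main_id": "user-42", "landed_at": datetime(2024, 1, 1), "utm_source": "google"},
    {"main_id": "user-42", "landed_at": datetime(2024, 2, 1), "utm_source": "newsletter"},
]
events = [
    {"main_id": "user-42", "event": "Signed Up", "at": datetime(2024, 2, 10)},
]

# A touchpoint is a (landing, event) pair for the same main ID where the event
# happened after the landing but inside the attribution window — i.e. an
# inner join on main_id plus a time-range condition.
touchpoints = [
    {**landing, **event}
    for landing in landings
    for event in events
    if landing["main_id"] == event["main_id"]
    and timedelta(0) <= event["at"] - landing["landed_at"] <= WINDOW
]
```

In the warehouse, this is a join between the events table and the landings table on the main ID, with the same time-range predicate.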
You'll have to materialize this table by merging the events table from Segment with the landings table we put together in step 2, joining on the "main ID" from the user mapping table to combine both of these tables. This will be done within a "Time Window" (described below after the table). The generic structure of this table is the following:
The time window is a tradeoff: too short, and it fails to capture the journey to the final conversion event in most cases; too long, and it no longer preserves a plausible causal link between landing and conversion. For example, 5 days might not be enough to capture all journeys properly, while if somebody makes a purchase after 180 days, an untracked reminder or a change in demand probably drove them to convert, not a visit from 6 months ago.
Typically in marketing attribution, the time window varies between 30 and 90 days.
A good way to start is to join landings and conversion events simply by user ID and build a conversion curve, with N days to convert (days between landing and conversion) on the X axis and the cumulative % of conversions on the Y axis. This answers the question: "How many users end up converting by day N?"
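The curve itself is just a cumulative distribution over days-to-convert. Here's a sketch, with the days_to_convert values invented for illustration:

```python
from collections import Counter

# Hypothetical days-to-convert values: days between landing and conversion,
# one entry per converting user.
days_to_convert = [1, 2, 2, 5, 12, 25, 30, 41, 60, 88]

def conversion_curve(days):
    """Cumulative % of conversions that happened by day N, for each N."""
    total = len(days)
    counts = Counter(days)
    curve, running = {}, 0
    for day in range(max(days) + 1):
        running += counts.get(day, 0)
        curve[day] = round(100 * running / total, 1)
    return curve

curve = conversion_curve(days_to_convert)
# e.g. curve[30] tells you what share of conversions happened within 30 days
```

Plotting curve then shows you where the cumulative percentage plateaus, which is the basis for picking the window.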
An example dataset looks like this:
Here's what an example conversion curve could look like:
In this example, the conversion curve indicates that 80% of conversions happened by day 30. While 80% is a lot, it's insufficient: 20% of signals would be ignored, and day 90, at 91.6%, looks much better. So picking 90 days makes the most sense here. However, this is as much a business decision as a data decision, so it's a discussion that should involve your marketing team.
Here's what the touch points table could look like after you've materialized it:
Now that we have a Touchpoints table, we can create an attribution model on top of it.
You can query the Touchpoints table using the chosen attribution model, which defines the weights assigned to each interaction. The most popular options are:
If your reporting tool allows query parametrization behind a semantics layer, like Looker, the attribution model can simply be a parameter applied when pulling data dynamically. In other cases, like Metabase, where limited parameterization constrains self-service, the data can be materialized with an attribution-model column to filter by, or different attribution models can be kept in separate views.
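Whichever route you take, the weighting logic is small. Here's a sketch covering first-touch, last-touch, and linear (evenly weighted) attribution over a user's touchpoints, ordered oldest first; the model names are ours, not a standard API:

```python
def weights(n, model):
    """Weight for each of a user's n touchpoints, ordered oldest first.

    first_touch: all credit to the first landing.
    last_touch:  all credit to the last landing before conversion.
    linear:      credit split evenly across all landings.
    """
    if model == "first_touch":
        return [1.0] + [0.0] * (n - 1)
    if model == "last_touch":
        return [0.0] * (n - 1) + [1.0]
    if model == "linear":
        return [1.0 / n] * n
    raise ValueError(f"unknown attribution model: {model}")
```

Each model's weights sum to 1 per converting user, so summing weighted conversions by channel gives you the attributed totals directly.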
Now that you've built an attribution model, you should consider the following on your roadmap and decisions to make:
And if you need an expert partner to help, reach out! Let’s grab a few minutes together and see if there is potential to unlock growth.