Blueprints for building an internal attribution model on Segment data

Nima Gardideh

November 30, 2023

At Pearmill, we work with a lot of companies using Segment as the underlying tracking infrastructure. As companies scale and have to build an attribution model, we will often use Segment's Data Warehouse destination as the source to build an attribution model.

This is the blueprint for how you can do this on your own! Note that this is a highly technical blog post – if you're not familiar with data engineering principles then it may be out of reach for you!

Note that we're assuming that you're using Segment only on web, and that you don't have mobile apps. The model will get significantly more complex when you have mobile apps in the picture!
‍

Data Modeling

As we've gotten more experienced with building attribution models, we've learned that the best method is to use a bottoms-up approach to the data in such a format that will let us choose the specific model we'll use to attribute later on in the data stack.

Meaning that if the marketing team wants to do first-touch attribution, weighted attribution, last-touch attribution, or whatever attribution model they choose, we should be able to support it out of the box without having to remodel the underlying data. In this section, we'll discuss how you can structure the data so it's flexible for the marketing team's model.
‍

1. Build a User Mapping Table

When an unauthenticated user hits your landing page, they are not identified as a particular account in your system. However, this hit record (page-view event) bares information that is important to identify where the user came from, like UTM tags and/or referrer URL.

Such a user is assigned with a long random string ID, called anonymous ID, and it’s stored in a cookie. So, if the same user comes back again (still before registration), the repeat page views will be recorded under the same anonymous ID.

When this user completes registration and gets a persistent ID in your system (user_id), identify() method must be called in Segment, which records a tuple of (anonymous_id, user_id). When a registered user does a meaningful action that is recorded as a Segment event, both his anonymous_id and user_id are recorded. These data points are particularly helpful if the same user uses multiple devices.

Also, cookies tend to expire, or not be stored by the browser, so collecting as many tuples as possible is important. With this mapping, we can backlink the original anonymous page views made by the user to their subsequent journey into the conversion funnel.

Thus, one user_id can have one or more associated anonymous_id, and they come from identifies, pages, and tracks tables. The table has to include unique combinations to avoid further data duplication and it takes time to produce so it is good to materialize it in a separate table and refresh/increment on schedule. You'll use this table to generate a "main ID" for each user, which will be used in the subsequent tables we'll generate below!

If you're using a tool like DBT, this should be quite straightforward for you. This is a very expensive query to run, and the materialized view is going to help ensure you're not using too many compute resources.
‍

2. Build a Landings Table

To build an attribute model, we'll need to know about all of the significant "landings" (i.e. the pages that the user arrived at from external sources). Landings are page views that have meaningful source information, like UTMs or referrer URLs.

You'll have to build a table to track all of this at a per-user level in a table. Since this table involves window functions and possibly parsing, it’s better to materialize it as well to avoid using too many compute resources as your data sets become large, and to speed up query times in general.

To achieve building this table, you'll have to watch out for the following scenarios:

Ignore Drill-downs – Exclude session drill-down events - when someone travels on your website, it doesn’t give any additional information about the traffic source, these can be identified as ones that have an internal page referrer, like your_website.com url. While the google.com referrer is meaningful landing, your_website.com/pricing with your_website.com referrer is a drill-down event. Just excluding pages with /path is not right because landing pages can have path.
Keep Meaningful Signals – Exclude page views that don’t have either UTMs or external referrers, but make sure UTMs are tracked properly and put into context_campaign_* fields. If they are not, and UTM parameters are stored in the page URL or referrer URL, parse them from these columns. At the end, you will have each UTM column with a waterfall logic (coalesce function with all possible sources on meaningful information). Then, filter rows that have either one of these produced UTMs or an external referrer. For referrer, it’s better to extract just the domain to reduce cardinality and ease subsequent mapping.
Keep Changing Signals – You can still have multiple subsequent landings with the same combination of UTMs and referrer (like when someone clicks on the same ad multiple times or refreshes the page), so you can also exclude subsequent events that share the same pattern for the same user (we call it context trace). This way, only changing signal is preserved.
Bucket Sources – last but not least, bucket your UTMs into source field, for instance when utm_source in ('google','adwords') or referrer like 'google%' maps to google source. Then, sources can also be grouped into channels for higher level aggregation, like Google and Bing are mapped to paid search channel.
Enrich Dataset with Dimensions – There can be any other additional dimensions, like product or business line, as long as they map directly to campaigns/ads (if campaign X promotes product Y, there can be product dimension with Y value). For this, mapping tables can be used, or if the naming convention is good enough, it can come straight from asset name processing.
Build a Main User ID – Replace anonymous IDs with persistent user IDs from the user mapping table where possible so that the resulting landings table is directly joinable to backend event tables.

The generic structure of the table is going to look like this:

3. Build a Touchpoints Table

Now that we have a User Mapping table and a Landings Table – we can couple them together with events to build a Touchpoints table. This table is going to hold combinations between landings and subsequent funnel events for the same user ID within a certain time window.

This table will be the core data source that allows modeling user journeys with respect to traffic source and filter/assign weights to rows based on the chosen attribution model. We'll have all of the information we'll need to get first-touch, last-touch, or other weighted attribution models.

You'll have to materialize this table by merging the events table from Segment with the landings table we put together in step 2, joining on the "main ID" from the user mapping table to combine both of these tables. This will be done within a "Time Window" (described below after the table). The generic structure of this table is the following:

‍

Handling Time Window

The time window is something that is a tradeoff between being too short to capture the journey to the final conversion event in most cases and being too long to preserve possible causality between landing and conversion. For example, 5 days might not be enough to capture all links properly, and on the other side, when somebody makes a purchase in 180 days, there might be some untracked reminder or a demand change that drove the user to convert and not the visit that's 6 months old.

Typically in marketing attribution, the time window varies between 30 and 90 days.

It is good to start with joining landings and conversion events simply by user ID and building conversion curve where N days to convert (days between landing and event) is on X axis and cumulative % of conversions is on Y axis, which answers the question: “How many users end up converting by day N?”.
‍

An example dataset looks like this:

‍

Here's what an example conversion curve would could like:

In this example, conversion curve indicates that 80% of conversions happened by day 30. While 80% is a lot, it is insufficient — 20% of signals will be ignored, and day 90 with 91.6% looks much better. So picking 90 days in this instance would make the most sense – however, this is just as much of a business decision as it is a data decision, so it's a discussion that should involve your marketing team.

Here's what the touch points table could look like after you've materialized it:

‍

Building an Attribution Model

Now that we have a Touchpoints table, we can create an attribution model on top of it.

You can query the Touchpoints table based on the chosen attribution model that defines weights that are assigned to each interaction. The most popular options are:

First Touch Attribution: weight=1 where touchpoint_index_asc=1, ignore the rest
Last Touch Attribution: weight=1 where touchpoint_index_desc=1, ignore the rest
First Paid Touch: weight=1 where touchpoint_index_paid_asc=1, ignore the rest
Last Paid Touch: weight=1 where touchpoint_index_paid_desc=1, ignore the rest
Linear Multitouch (equal weight): weight=1/(N of touch points)
Multitouch U-Shaped: 40% to first and last touch point, 20% split between the rest

If the reporting tool allows query parametrization behind semantics layer like Looker, attribution model can be simply a parameter when pulling data dynamically. In other cases like Metabase where parameterization limits self-service possibilities, data can be materialized with attribution model column to filter by, or different attribution models can be kept in different views.

‍

Next Steps

Now that you've built an attribution model, you should consider the following on your roadmap and decisions to make:

Merging your attribution model with cost data from ad networks
Deciding on how to report on the model (cohort based vs. conversion time based)
Aligning everyone on how you'll think about Cost of Acquisition, and how models could be used to optimize your ad spend

And if you need an expert partner to help, reach out! Let’s grab a few minutes together and see if there is potential to unlock growth.

About Pearmill

Pearmill is a performance marketing that outperforms expectations, using data-driven insights and relentless innovation for exponential growth. Forged in the crucible of VC-funded startups, we’ve gained unparalleled expertise in the most dynamic and competitive environments, helping our clients 3X their ROAS on average.

The author

Nima Gardideh

Co-founder, President, CTO

Blueprints for building an internal attribution model on Segment data