Snowplow pipeline

The Objectiv Collector supports using the Snowplow pipeline as a sink for Objectiv events, hooking directly into Snowplow's enrichment step. Currently, there is data store support for:

  1. Google BigQuery, via Google PubSub; and
  2. Amazon S3, via AWS SQS/Kinesis.

How to set up Objectiv with Snowplow

In this setup, we assume you already have a fully functional Snowplow pipeline running, including enrichment, a loader and an iglu repository. If you don't, please see the Snowplow quickstart for Open Source.

Enabling Objectiv involves two steps, as explained next:

  1. Adding the Objectiv Taxonomy schema to the iglu repository;
  2. Configuring the Objectiv Collector output to push events into the appropriate message queue.

1. Add the Objectiv schema to the iglu repo

This step is required so the Snowplow pipeline (enrichment) can validate the incoming custom contexts.

Preparation

  • Copy the Objectiv iglu schemas (see here);
  • Get the address / URL of your iglu repository;
  • Get the UUID of the repo.
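
Optionally, the schemas can be checked locally before pushing; igluctl ships with a lint command for this. A minimal sketch, assuming the Objectiv schemas were copied into ./iglu:

java -jar igluctl lint ./iglu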

Pushing the schema

java -jar igluctl static push --public <path to iglu schemas> <url to repo> <uuid>

## example:
java -jar igluctl static push --public ./iglu https://iglu.example.com myuuid-abcd-abcd-abcd-abcdef12345
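
Once pushed, the schemas should be resolvable from the repository. As a quick sanity check, something like the request below could be used (a sketch only, assuming an Iglu Server at the example URL above; the exact API path and schema version may differ per setup):

curl https://iglu.example.com/api/schemas/io.objectiv.context/ApplicationContext/jsonschema/1-0-0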

2. Configure output to push events to the data store

The Collector can be configured to push events into a Snowplow message queue, using environment variables.
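
As an illustration, pushing to the raw GCP PubSub topic read by the enrichment process could look like the sketch below. The variable names are hypothetical placeholders, not the Collector's actual configuration keys; check the Collector's configuration reference for the exact names.

# Hypothetical variable names, for illustration only.
export SNOWPLOW_GCP_PROJECT="my-gcp-project"      # GCP project that hosts the PubSub topic
export SNOWPLOW_PUBSUB_TOPIC_RAW="sp-raw"         # raw topic the enrichment process reads from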

Background

The Snowplow pipeline roughly consists of the following components:

  1. Collector: http(s) endpoint that receives events;
  2. Enrichment: process that validates incoming events and potentially enriches them (adds metadata);
  3. Loader: final step, where the validated and enriched events are loaded into persistent storage. Depending on your choice of platform, this could be BigQuery on GCP, Redshift on AWS, etc.;
  4. iglu: central repository used by the other components to pull schemas for validating events, contexts, etc.

The Snowplow pipeline uses message queues and Thrift messages to communicate between the components.

Objectiv uses its own Collector (which also handles validation); it bypasses the Snowplow collector and pushes events directly into the message queue that is read by the enrichment step.

Snowplow allows so-called structured custom contexts to be added to events, and this is exactly what Objectiv uses. As with all contexts, they must pass validation in the enrichment step, which is why a schema for the Objectiv custom contexts must be added to iglu; this tells Snowplow how to validate them. The same schema is also used to infer the database schema needed to persist the contexts. How this is handled depends on the loader chosen, e.g. Postgres uses a more relational schema than BigQuery.

Objectiv to Snowplow events mapping

In a standard Snowplow setup, all data is stored in a table called events. Objectiv data is stored in that table by mapping the Objectiv event properties onto the respective Snowplow properties. Objectiv's contexts are stored as custom contexts.

Events

Event properties, and some context properties, are mapped directly onto columns of the Snowplow events table. See the table below for details:

Objectiv property           SP Tracker property   Snowplow property
event.event_id              eid                   event_id
event.time                  ttm                   true_stamp
event._type                 se_ca                 se_category
ApplicationContext.id       aid                   app_id
CookieIdContext.id          networkUserId         network_userid
HttpContext.referrer        refr                  page_referrer
HttpContext.remote_address  ip                    user_ipaddress
PathContext.id              url                   page_url

Global contexts

For every global context, a specific custom context is created, with its own schema in Iglu. The naming scheme is io.objectiv.context/SomeContext.

NOTE: the _type and _types properties have been removed.
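
As an illustration, an ApplicationContext is then attached to the event as a self-describing custom context referencing its own Iglu schema. The payload below is a sketch; the schema version and exact fields depend on the published Objectiv schemas:

{
  "schema": "iglu:io.objectiv.context/ApplicationContext/jsonschema/1-0-0",
  "data": {
    "id": "my-webapp"
  }
}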

Location stack

As order is significant in the location stack, a slightly different approach is taken to storing it: the location stack is stored as a nested structure in its own custom context (io.objectiv/location_stack).
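
A sketch of what such a context could look like, for a link inside the main content area of the home page; the field names, nesting and schema version are illustrative assumptions, not the exact published format:

{
  "schema": "iglu:io.objectiv/location_stack/jsonschema/1-0-0",
  "data": {
    "location_stack": [
      {"_type": "RootLocationContext", "id": "home"},
      {"_type": "ContentContext", "id": "main"},
      {"_type": "LinkContext", "id": "signup"}
    ]
  }
}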