Ketch — The Back Story

Vivek Vaidya
Mar 26, 2021

Every start-up has a back story. So does Ketch. And this one is personal.

To understand why, we’ll need to rewind to early 2017. Salesforce’s acquisition of Krux had closed a few months earlier and the “integration” process was kicking into 3rd gear. Having lived through various incarnations of ePrivacy, Safe Harbor, Model Clauses, etc., I was intimately familiar with how to build data management systems where user-level data moved back and forth across the Atlantic, at least in a digital sense. But then came GDPR. And my entire worldview on how to store, process, and manage data shifted radically.

A few months into my tenure at Salesforce, I found myself leading the Engineering team for Salesforce Marketing Cloud, and with May 25, 2018 fast approaching, GDPR compliance became my top product priority. Every product within Marketing Cloud had its own tech stack, and while I was overseeing the development of all of them, I was — as expected — deeply involved with the solution design for GDPR compliance in the Krux platform.

In GDPR parlance, Krux operated as a Data Processor with our customers being Data Controllers — the latter being a fancy legal term for businesses that have a direct relationship with end-users (or data subjects) and the former being companies that provide (software) services that manage and process user data on behalf of the Controllers.

Consent Management

GDPR requires businesses to operate with a “consent, opt-in” mindset — meaning that any end-user data that is used by a business can only be collected and used if the end-user has given the requisite consent. GDPR also requires businesses to explicitly declare the purpose(s) for which they are collecting user data and allow users to control if their data is used (or not used) for a given purpose.

Therefore, the first challenge for Krux was the collection of consent. Krux didn’t have a direct relationship with consumers, so we had to provide a way for our customers to pass on the consent signals they were collecting from their users. Consent Management Platforms had not cropped up yet, so we ended up building a Consent Management API that was exposed to customers via our JavaScript tag, Mobile SDKs, and an HTTP(S) endpoint. Some customers were not ready to use the Consent Management API in any form, so we also let them deliver consent signals as files uploaded to S3.
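To make that file-based path concrete, a consent drop could be ingested with a plain COPY from S3 into the consent store. This is just a sketch; the bucket, table, and column names are hypothetical, not the actual Krux schema:

-- Hypothetical ingest of a customer's consent file from S3 (Redshift-style COPY)
COPY consent_events (organization_id, krux_user_id, consent_signature, event_timestamp)
FROM 's3://example-bucket/consent/2018-05-25/part-0000.csv'
IAM_ROLE 'arn:aws:iam::123456789012:role/example-loader'
CSV;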

Once the consent signals had been collected, we had to put them to use. Before we could do that, though, we had to figure out the most recent consent signature for every user (think of the consent signature as a bitmap: a sequence of 0s and 1s, with one place, or bit, for every data processing purpose in the Krux platform). The most-recent consent signature logs were rolling in nature; in other words, we had to retain the consent signature for every user regardless of when that user had last generated activity for a given customer. Writing the map-reduce job to do this wasn’t hard, but the fact that we had to run it every day, for every customer, had a material impact on our AWS costs.
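Conceptually, that daily roll-up is a “latest record per user” computation. Expressed in SQL rather than map-reduce, it might look like the sketch below; the consent_events table and its columns are illustrative assumptions, not the actual Krux schema:

-- Keep only the most recent consent signature per (organization, user)
SELECT organization_id, krux_user_id, consent_signature
FROM (
  SELECT organization_id,
         krux_user_id,
         consent_signature,  -- e.g. '1011': one bit per data processing purpose
         ROW_NUMBER() OVER (
           PARTITION BY organization_id, krux_user_id
           ORDER BY event_timestamp DESC
         ) AS rn
  FROM consent_events
) ranked
WHERE rn = 1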

Next, we had to assign a data processing purpose to each and every data processing job and then update the job to respect the consent choice made by the user.

As an example, consider the following query for the purpose of Analytics, which counts the total number of users generating at least 1 page view broken down by “site” for a given customer (each customer got a unique organization_id in the Krux platform):

SELECT event_site, COUNT(DISTINCT krux_user_id) 
FROM user_events
WHERE
organization_id = '0xDEADC0DE' AND
event_type = 'PAGE_VIEW' AND
DATE_DIFF(now(), event_day) <= 30
GROUP BY event_site

To comply with GDPR, the above query had to be rewritten like so:

SELECT ue.event_site, COUNT(DISTINCT ue.krux_user_id) 
FROM user_events ue, user_consent uc
WHERE
ue.organization_id = '0xDEADC0DE' AND
ue.event_type = 'PAGE_VIEW' AND
DATE_DIFF(now(), ue.event_day) <= 30 AND
ue.organization_id = uc.organization_id AND
ue.krux_user_id = uc.krux_user_id AND
uc.consent_purpose = 'analytics' AND
uc.consent_value = True
GROUP BY ue.event_site

That looks straightforward enough. And for simple use cases like that, it was. But Krux had a large number of fairly complex data processing jobs, implemented using Hadoop and/or Spark, that were not so straightforward to modify. The brute force approach worked in some cases, but in most cases, because performance could not degrade too much (or at all), we had to figure out clever ways of implementing that “join” in a cost-effective, efficient manner.
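One pattern in that spirit is to materialize the consented audience for each purpose once per day, then semi-join every downstream job against that much smaller table instead of re-joining the full consent log each time. The sketch below reuses the illustrative tables from the queries above; it is an assumption-laden example, not the actual Krux implementation:

-- Materialize the (much smaller) set of users consented to analytics,
-- refreshed once per day; illustrative tables only
CREATE TABLE consented_analytics_users AS
SELECT organization_id, krux_user_id
FROM user_consent
WHERE consent_purpose = 'analytics'
AND consent_value = True;

-- Downstream jobs then semi-join against the materialized table
SELECT ue.event_site, COUNT(DISTINCT ue.krux_user_id)
FROM user_events ue
WHERE
ue.organization_id = '0xDEADC0DE' AND
ue.event_type = 'PAGE_VIEW' AND
DATE_DIFF(now(), ue.event_day) <= 30 AND
EXISTS (
  SELECT 1
  FROM consented_analytics_users uc
  WHERE uc.organization_id = ue.organization_id
  AND uc.krux_user_id = ue.krux_user_id
)
GROUP BY ue.event_site

In Spark terms, the same idea shows up as broadcasting the small consented-user set to every executor rather than shuffling the full event log.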

Rights Management

Under GDPR, a data subject (legal term for a consumer or user) can exercise multiple rights. The two that were most relevant for us were: RTBF (Right to be Forgotten) and Portability — or, in plain English, delete my data and give me all my data, respectively.

On the surface, this seems simple — just issue the following queries:

For RTBF:

DELETE FROM user_data_table 
WHERE
customer_id = '0xDEADC0DE' AND
email_hash = sha256('0xdeadbeef@dabbad00.com')

or, for the Soft Delete and/or Data Obfuscation peanut gallery:

UPDATE user_data_table SET 
email_hash = sha256('gobbledygook'),
gender = 'other',
age_range = 'other',
household_income = 'other',
customer_status = 'other',
is_deleted = True
WHERE
customer_id = '0xDEADC0DE' AND
email_hash = sha256('0xdeadbeef@dabbad00.com')

And, for Portability:

SELECT gender, age_range, household_income, customer_status
FROM user_data_table
WHERE
customer_id = '0xDEADC0DE' AND
email_hash = sha256('0xdeadbeef@dabbad00.com')

If only it were that simple. With a modern data technology stack like the one we had at Krux (and most companies have one now), user data was spread across multiple types of data management systems:

  • Raw Log files in Blob Stores (S3)
  • Processed, semi-structured data sets in Blob Stores (S3)
  • NewSQL Data Warehouses (Redshift)
  • NoSQL Key-value Stores (DynamoDB)

Our data technology stack had grown and scaled organically, and while we weren’t moving fast and breaking things, we were definitely moving fast. Which is to say that we didn’t have all of the metadata about our various data assets stored in a centralized Data Catalog that we could easily query to figure out which S3 locations, Redshift tables, and DynamoDB tables we had to process in order to retrieve or delete all the data for a given user.

Discovering all the data assets we had to query in order to delete or retrieve a user’s data was painful, and a significant amount of time went into it. Not to mention all the extra AWS data processing costs we incurred along the way.
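To give a flavor of even the SQL-only slice of that discovery: with no catalog, you fall back on the database’s own metadata, for example scanning the information schema for columns that look like user identifiers. The column names here are illustrative assumptions:

-- Find every table carrying a user-identifier column (Redshift/Postgres-style)
SELECT table_schema, table_name, column_name
FROM information_schema.columns
WHERE column_name IN ('krux_user_id', 'email_hash')
ORDER BY table_schema, table_name;

And that still leaves the raw S3 log files and DynamoDB tables, which offer no comparable schema registry to interrogate.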

All in all, the GDPR compliance project consumed 6 of our top-notch engineers for just over 6 months. While that cost was significant, it paled in comparison to the 30% increase in our AWS costs.

And now, Ketch

Which brings us to Ketch, where we are building a Data Control Platform that provides:

  • A Policy Center that allows our customers to maintain a centralized repository of privacy compliance and data governance policies
  • A true Consent Management Service that allows our customers to configure their privacy posture via the Ketch Policy Center and present a Privacy Experience to capture consent consistent with that posture
  • An Orchestration Engine that allows our customers to define workflows that manage the flow of data subject rights (DSR) requests and propagation of consent signals to external systems that manage and process user data on their behalf
  • A Data Asset Manager that discovers and builds a searchable Data Catalog of data assets across all the data management systems — SQL, NoSQL, NewSQL, Blob, Structured, Semi-structured — used by a customer
  • A Data Classification Service that detects and classifies private and sensitive data, and identifies all the tables, files, partitions, etc. that store personal user data
  • A DSR Automation Service (the Ketch Transponder) built on top of the Data Asset Manager that allows Developers and IT personnel to define customized actions depending on data type (delete, tokenize, mask, anonymize) to address DSR requests
  • A Data Fortification Engine that enforces data governance policies defined in the Policy Center across all the data management systems used by the customer

Remember those numbers: 6 engineers for 6 months, plus a 30% increase in AWS costs? They would have been halved across the board if I had had access to Ketch’s Data Control Platform, specifically the Policy Center, the Consent Management Service, and the Data Asset Manager. And if you include the Transponder for DSR fulfillment on data platforms like S3 and DynamoDB, it would have resulted in cost savings of 75%!

So, yeah. This one is personal. I am building Ketch for my former self. Ketch is the platform I wish I had access to back in 2017.

P.S. To keep this post somewhat readable, I have glossed over many of the details of our generalized GDPR design and implementation. If you are curious to learn more or want to discuss this further, please give me a shout!


Vivek Vaidya

Serial entrepreneur and technologist; Co-founder and General Partner, super{set} startup studio.