June 14, 2022
How to De-identify and Secure PII in Snowflake
If you’re aggregating data from multiple sources into a Snowflake data warehouse, despite your best efforts, you’re likely to end up with sensitive PII in data workloads. In this post, I explain how to use Snowflake’s External Functions with Skyflow’s Data Privacy Vault to de-risk and de-scope data privacy, security, and compliance from your data application infrastructure, so you can leverage the power of Snowflake while protecting PII.
To understand de-identification better, let’s consider the data lifecycle for a fictitious company called Instabread as it looks to preserve the data privacy of their gig workers. This company provides a “bread marketplace” where clients shop for bread available at local stores from the comfort of their homes and on-demand shoppers handle tasks like pickup and delivery.
Instabread needs to create a mobile app where gig workers can sign up to become Instabread shoppers. As part of this process, Instabread needs to collect PII like the shopper’s name, email, home address, and banking information.
Instabread wants to be able to take advantage of modern cloud technologies like AWS and Snowflake, while avoiding the problems that come from intermixing user PII with regular application data (more on this available here).
To solve this problem, Instabread will use a Skyflow Data Privacy Vault to isolate and protect the PII of shoppers. Let’s explore how to architect this system while still giving Instabread the full power of AWS and Snowflake.
Instabread Requirements, Features, and Technologies
As part of the Instabread marketplace, Instabread shoppers need to be available for on-demand bread shopping and delivery. To address this need, Instabread is launching a new Instabread Shopper mobile app to recruit gig workers, coordinate shopping requests, and pay shoppers for deliveries.
Here are the high level requirements for the Instabread Shopper app:
- Mobile First: Shoppers are constantly on the go, so the app must support a mobile signup experience
- PII Collection: To create shopper accounts, the app must securely collect and store shopper user data like name and email address
- Bakery Selection: Shoppers must be able to select which bakeries and stores they are willing to travel to based on where they live
- Banking Information: The signup experience must securely collect and store bank account and routing information
- Paying Gig Workers: Shoppers must be able to receive prompt payment for completed deliveries
- Analytics: Sign up data must be stored securely in the Instabread data warehouse and be available for analysis to inform business decisions
A combination of technologies are needed to support these requirements. The image below shows a high level overview of the technology stack. AWS DynamoDB is used for application storage, Snowflake is used for the Data Cloud, and Skyflow’s Data Privacy Vault provides secure storage and controlled access to PII.
To support money movement from the Instabread bank account to a gig worker’s bank account, the application uses moov.io to securely relay data from the vault using Skyflow Connections, as shown below:
The original plaintext values for Instabread gig workers’ PII are stored in the vault based on the schema shown below. A similar structure exists in DynamoDB, where tokenized data and other application data is stored.
Collecting and De-identifying Sensitive Data
When collecting PII, the best case scenario for data privacy and security is that data is de-identified as early in the data lifecycle as possible. In the case of Instabread, in any part of the sign up process where PII is collected, that data should go directly to the Data Privacy Vault without touching any of the backend systems. The Data Privacy Vault will return tokenized data that can be used as a stand-in for the original values for running various workloads but has no exploitable value.
This means that during account creation, when Instabread collects the name, email, zip code, and phone number these plaintext values should go directly to Skyflow. Similarly, when a routing and bank account number are provided by a shopper, that data should not touch any of the Instabread application infrastructure – instead, it’s only stored in the vault. In exchange, Skyflow returns tokens that can be safely sent and stored in the downstream services in lieu of sensitive data.
We can accomplish this easily by using a Skyflow client-side SDK. The sample code below shows the vault insert request. Tokenized data will be returned from the insert function call.
The diagram below shows the data flow for PII collection in Instabread. Sensitive data is sent directly from the application frontend to the Data Privacy Vault, tokens are returned, and those tokens are passed downstream to the application storage in DynamoDB.
Moving Data Downstream to Snowflake
The data that ends up in DynamoDB also needs to be sent through an ETL pipeline to the Instabread Snowflake data warehouse to support data analytics.
To manage this, we’ll set up an AWS Lambda to automatically trigger whenever new data is added to DynamoDB. The Lambda function writes the newly-added data as a message to a Kafka broker. Then the Kafka Connector feature of Snowflake ingests the messages and metadata from Kafka into a Snowflake table similar to the following:
When putting it all together, the entire application infrastructure is completely de-risked and de-scoped from managing data security, privacy, and compliance. Those responsibilities are handled by the Data Privacy Vault.
Querying Across Snowflake and the Vault
The data stored in the warehouse is now completely de-identified. We can execute many analytics workloads directly against the de-identified data.
For example, let’s say we wanted to know the breakdown of shoppers by zip code so we can compute the gig worker density in each geographic area.
As shown in the image above, the tokenization format for zip code in the Instabread vault is a deterministic token. This means the same zip code will always be tokenized to the same UUID.
This tokenization format lets us write a 'group by' query much the same as you would if you were storing the actual shopper zip codes within Snowflake:
However, there could be situations where we want to get the original plaintext value to provide additional context for analytics results. In the zip code example, the query we discussed before could be used to generate data to power a bar chart view of the results. But, to be useful to the person viewing the bar chart, the x-axis labels should correspond to an actual zip code, not random UUIDs.
To support this use case, Instabread needs a way to detokenize the tokens stored within Snowflake.
Creating De-tokenization Support
If you have a token and the right permissions, you can use Skyflow’s detokenize API to retrieve the original value. But, how can we securely call the detokenize API from Snowflake?
You can do this using Snowflake’s External Functions feature. External functions are user-defined functions that are stored and executed outside of Snowflake. You use external functions to call an API endpoint, passing data from the function to a third party.
Since Instabread is using AWS as its cloud provider, we can use AWS’s API Gateway and Lambda function features to create an API endpoint for Snowflake to call. The Lamdba function will securely call the detokenize API with the token value passed from Snowflake and return the detokenized data similar to the process shown below.
The code below shows how the Lambda function receives data from Snowflake, detokenizes the values, and then returns the data as a response.
You can read more about how to set up the API Gateway and Lambda function for an external function on Snowflake here. With that guide, you’ll create an API Gateway in AWS that will call the Lambda function (shown in the AWS box in the architecture diagram). You’ll also create an IAM role in AWS for Snowflake, then use Snowflake’s API integration support to grant access to Snowflake to call AWS using that role, similar to the command below.
Once the API Gateway, Lambda, and permissions for Snowflake are in place, the last thing to do is create an external function in Snowflake that will call the API. The sample code below defines a detokenize function for Instabread.
With the detokenize function created, Instabread can now execute a query like the one below directly from a Snowflake worksheet. The query will return the plaintext value that corresponds to the token.
To perform the analytics operation discussed earlier, computing gig worker density by zip code, we can use the following query:
The results for this query can power a graphical representation of gig worker density as shown in the following GIF:
Easy Data Protection and Compliance for Analytics
The Instabread use case illustrates how you can easily have data protection and compliance while still doing all the typical analytics workloads provided by Snowflake’s data cloud.
This example showed the full lifecycle of data, from collection to usage, but only for a single source. In a real application, additional sources of data, including sensitive data, might be inputs to your data warehouse. De-identification of sensitive data for these additional sources is easily supported through further integrations with Skyflow APIs.
To explore using a Skyflow Data Privacy Vault to de-identify and secure PII, give Skyflow a try.