July 20, 2022
Does Hashing Sensitive Customer Data Protect Privacy?
Hashing is common practice when storing customer passwords and is a natural place for engineers to start when faced with the challenge of protecting their customers’ sensitive personal data. However, hashing has many limitations when it comes to protecting PII and is far from a complete data privacy solution.
When presented with a problem like securely storing customer data, many software developers instinctively think about hashing and encryption. Hashing is a powerful tool that we all learn about during our technical education and training. It’s a technique often used for obfuscating passwords, but when it comes to protecting customer PII, does hashing make sense?
In this post, we explore this question in detail, digging into the problems that hashing does help solve, but also why it fails as a viable solution for secure storage of sensitive customer data or a reasonable solution to data privacy.
What is Hashing?
Hashing is a one-way mathematical operation that transforms a data input of a certain type and arbitrary length to an output string of a fixed length. Unlike tokenization and encryption, the hashed value can’t be used to recover the original value – so there’s no equivalent to detokenization or decryption for hashed data.
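To make the fixed-length property concrete, here is a minimal sketch using SHA-256 from Python's standard library. The input strings are made up for illustration, and SHA-256 is just one example of a hash function.

```python
import hashlib

short_input = "a"
long_input = "a much longer input string " * 100

short_digest = hashlib.sha256(short_input.encode("utf-8")).hexdigest()
long_digest = hashlib.sha256(long_input.encode("utf-8")).hexdigest()

# Both digests are 64 hex characters (256 bits), whatever the input length.
print(len(short_digest), len(long_digest))

# The same input always produces the same digest, but nothing in hashlib
# (or anywhere else) turns a digest back into its input.
same_again = hashlib.sha256(short_input.encode("utf-8")).hexdigest()
print(short_digest == same_again)
```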
What is Hashing Good At?
Hashing is useful in many contexts, including fast lookup and password storage.
Hashing is used as part of the construction of many data structures like hash tables and hash maps. These data structures provide fast lookup for an input value, O(1) on average, assuming low collision rates. Whenever you have a large amount of data in memory and you need to check for the existence of an item, these data structures are your friend.
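As a small illustration of the lookup case, the sketch below compares scanning a list with checking a hash-based set in Python. The email addresses are invented sample data.

```python
# Invented sample data: 100,000 email addresses held in memory.
emails = [f"user{i}@example.com" for i in range(100_000)]
email_set = set(emails)  # Python sets are backed by a hash table

target = "user99999@example.com"

# List membership walks the elements one by one: O(n).
in_list = target in emails

# Set membership hashes the target and probes the table: O(1) on average.
in_set = target in email_set

print(in_list, in_set)
```

Both checks return the same answer; the difference is how much work each one does as the data grows.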
Hashing functions are also often used for secure storage of passwords.
We never want to store a customer’s password in plaintext, so passwords are typically hashed during collection. The hashed version is stored in a backend system. The plaintext version is never stored.
During a login attempt, to check whether an entered password matches the stored password, the entered password is hashed and compared against the stored hash. The comparison operation between the two hashes is fast and the original password never needs to be revealed. This way, even if the password datastore is compromised, an attacker only has access to the hashed values.
The Problems with Hashing
So if hashing provides fast lookups and is a good solution for secure password storage, why wouldn’t it also help us to protect customer PII? There are at least six issues with hashing that make it ill-suited to protecting sensitive customer data.
Issue #1: Susceptible to Brute Force Attacks
Even with hashing, attackers can sometimes brute-force hashed passwords by using a rainbow table (a precomputed table of cryptographic hashes) or a smart dictionary attack. Even though the stored passwords are hashed, if an attacker can produce the same hash from a given input, then they know that input matches the original password.
When it comes to customer PII, this is even easier to do as many forms of PII, like someone’s birth date, phone number, social security number, or credit card number have low entropy.
For example, there are only 10^9 potential social security numbers and only about 420 million actually assignable social security numbers. With a search space that small, an attacker can pre-compute the hashes of all possible social security numbers using standard hashing techniques and compare them against any leaked hashed values.
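A rough sketch of that attack is shown here, assuming the values were hashed with plain, unsalted SHA-256 in the NNN-NN-NNNN format. The SSN is a made-up example, and the table covers only a 100,000-value slice of the space so the demo runs quickly; covering all 10^9 values is the same loop run for longer.

```python
import hashlib

def sha256_hex(value: str) -> str:
    return hashlib.sha256(value.encode("ascii")).hexdigest()

def format_ssn(n: int) -> str:
    """Format an integer as NNN-NN-NNNN (assumed storage format)."""
    s = f"{n:09d}"
    return f"{s[:3]}-{s[3:5]}-{s[5:]}"

# Pretend this hash leaked from a database (made-up SSN).
leaked_hash = sha256_hex("123-45-6789")

# Precompute hash -> SSN for a slice of the search space.
START, STOP = 123_400_000, 123_500_000
lookup = {sha256_hex(format_ssn(n)): format_ssn(n) for n in range(START, STOP)}

# Reversing any leaked hash in that slice is now a dictionary lookup.
recovered = lookup.get(leaked_hash)
print(recovered)
```

Once the table exists, it can be reused against every leaked hash at no extra cost, which is what makes unsalted hashes of low-entropy values so weak.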
Even names are relatively easy to brute force since there are many common names that can be used to pre-compute possible values.
To protect against a brute force attack on passwords, a salt value is often used. A salt is a unique, randomly generated string that gets added to each password as part of the hashing process. This means that the same password will result in different hash values because each one is hashed with a different salt. To crack an approach like this, the attacker has to crack hashes one at a time.
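Here is a minimal sketch of salted password hashing and verification. It uses PBKDF2 from Python's standard library, a deliberately slow key-derivation function, rather than a single fast hash; that choice, the function names, and the iteration count are assumptions made for this example, not something prescribed above.

```python
import hashlib
import hmac
import os
from typing import Tuple

ITERATIONS = 200_000  # illustrative work factor (an assumption); tune to current guidance

def hash_password(password: str) -> Tuple[bytes, bytes]:
    """Return (salt, digest). The salt is stored alongside the digest."""
    salt = os.urandom(16)  # unique random salt per password
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest

def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    """Re-hash the login attempt with the stored salt and compare."""
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return hmac.compare_digest(candidate, expected)  # constant-time comparison

salt_a, hash_a = hash_password("correct horse battery staple")
salt_b, hash_b = hash_password("correct horse battery staple")

print(hash_a != hash_b)  # same password, different salts, different hashes
print(verify_password("correct horse battery staple", salt_a, hash_a))
```

Because every record has its own salt, a single precomputed table no longer works across the whole datastore.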
However, adding salt values is less effective for low entropy values like social security and credit card numbers. This is because the salt must be stored with the hashed value, otherwise there’s no way to recompute the hash for the same input. If the salt is stored with the hashed value and the hashes have been leaked to an attacker, then the attacker has everything they need to recover sensitive data like your customers’ phone numbers, credit card information, or social security numbers: they simply hash every candidate value with the leaked salt.
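The sketch below shows why a per-record salt doesn't save a low-entropy value. It assumes salted SHA-256 and a made-up SSN, and again searches only a 100,000-value slice to keep the demo fast.

```python
import hashlib
import os
from typing import Optional

def salted_sha256(value: str, salt: bytes) -> str:
    return hashlib.sha256(salt + value.encode("ascii")).hexdigest()

def format_ssn(n: int) -> str:
    s = f"{n:09d}"
    return f"{s[:3]}-{s[3:5]}-{s[5:]}"

# A leaked record: the salt has to be stored next to the hash.
salt = os.urandom(16)
leaked_record = {"salt": salt, "hash": salted_sha256("123-45-6789", salt)}

def crack(record: dict, start: int, stop: int) -> Optional[str]:
    """Try every candidate in [start, stop) using the record's own salt."""
    for n in range(start, stop):
        candidate = format_ssn(n)
        if salted_sha256(candidate, record["salt"]) == record["hash"]:
            return candidate
    return None

recovered = crack(leaked_record, 123_400_000, 123_500_000)
print(recovered)
```

The salt forces the attacker to repeat the search for each record, but with a search space this small each search is cheap.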
Issue #2: Hashing Isn’t a De-identification Method
Because most hashed PII is vulnerable to a brute force attack, it’s questionable whether hashing counts as de-identification, which is required by many data privacy laws. The answer depends on the particular law or regulation.
According to the law firm Greenberg Traurig, in the context of GDPR, hashing is considered pseudonymization and not de-identification.
Issue #3: Can’t Recover Data if Compromised
Another challenge with hashing is that if the salt value is leaked, you can’t rotate the salt and rehash the original data the way you might rotate an encryption key. This is because hashing is a one-way operation, so the original data is essentially destroyed by the hashing process.
Issue #4: Lack of Utility
Besides the potential vulnerability that hashing PII suffers from, the big limitation of the approach as a data privacy and data security solution is that you lose nearly all data utility with hashing.
Most businesses store customer PII because they need to use this data. For example, if they’re storing a customer’s social security number, they may need to show the last 4 digits in the customer support portal’s user interface for customer verification. Or they may need to show the customer’s home address in the customer's account screen so they can verify whether it needs to be updated. The customer’s credit card or banking information is required for billing purposes.
None of these operations are possible if the data is hashed.
Issue #5: Prevents Third-party Integration
Hashing your customers’ PII prevents your application from passing the data to third parties. For example, you may need the customer’s phone number in order to call Twilio’s APIs to send a text message, or your customer’s social security number to run a background check. If the original values have been lost to hashing, you won’t be able to perform these high-value integrations. The third-party service won’t be able to do anything with the hashed values.
Issue #6: Can’t Govern Access
Even if you assume the customer data is stored securely, it’s still imperative that your system is locked down and that you govern access for services and users based on what data they actually need in order to do their jobs.
Most services within our applications don’t need access to PII at all, and those that do rarely need every column and row. Many services only need partial PII, such as the last four digits of a social security number. Attribute-based access control (ABAC), role-based access control (RBAC), and policy-based access control (PBAC) are needed to define and control access based on service needs.
Strong data governance is required for a full data privacy solution, so you’d need to have this capability above and beyond any hashing operations.
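As a very rough illustration of role-based field access, the sketch below filters a record down to the fields a role is allowed to read. The role names, field names, and policy table are all hypothetical; a real system would enforce this in the data layer rather than in application code.

```python
# Hypothetical policy: which fields each role may read.
POLICY = {
    "billing_service": {"name", "card_number"},
    "support_agent": {"name", "ssn_last4"},
    "analytics_job": set(),  # needs no PII at all
}

def read_record(role: str, record: dict) -> dict:
    """Return only the fields the role is allowed to see."""
    allowed = POLICY.get(role, set())  # unknown roles get nothing (deny by default)
    return {field: value for field, value in record.items() if field in allowed}

# Made-up customer record.
customer = {
    "name": "Ada Example",
    "ssn_last4": "6789",
    "card_number": "4111111111111111",
}

print(read_record("support_agent", customer))
print(read_record("analytics_job", customer))
```

Note that none of this is possible on top of hashed values alone, since the policy has to operate on data that can still be read.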
How Do You Solve This Problem?
Securely storing data in a compliant and privacy-preserving way that still yields high utility and supports all common use cases and operations for your business is not a simple problem to solve.
Privacy-preserving technologies like tokenization help address some of the limitations of pure hashing. There’s no mathematical connection between a token and the original value, so brute force attacks don’t work against tokenization. Additionally, through de-tokenization, the original values can be retrieved, maximizing utility and supporting third-party integrations.
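To show the idea, here is a toy in-memory token vault. The TokenVault class and its method names are hypothetical and are not Skyflow's API; a production vault would be a hardened, access-controlled service rather than a Python dictionary.

```python
import secrets

class TokenVault:
    """Toy vault mapping random tokens to original values (illustration only)."""

    def __init__(self) -> None:
        self._token_to_value = {}
        self._value_to_token = {}

    def tokenize(self, value: str) -> str:
        # Reuse the existing token so the same value always maps to one token.
        if value in self._value_to_token:
            return self._value_to_token[value]
        # The token is random, so it has no mathematical link to the value.
        token = secrets.token_urlsafe(16)
        self._token_to_value[token] = value
        self._value_to_token[value] = token
        return token

    def detokenize(self, token: str) -> str:
        # Only a caller with access to the vault can recover the original.
        return self._token_to_value[token]

vault = TokenVault()
token = vault.tokenize("123-45-6789")  # made-up SSN
print(token)
print(vault.detokenize(token))
```

Unlike a hash, the token can't be brute-forced from the outside because it isn't derived from the value, and unlike a hash, the original can still be retrieved when a workflow needs it.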
Beyond tokenization, your data also needs to be encrypted in transit and at rest, and ideally secure analytics can be performed over fully encrypted data through technologies like polymorphic encryption. This reduces the potential attack vectors and ensures that even if someone gains access to your data storage, the data itself is secured via encryption.
Wherever your data lives, it should live on a segregated network with privileged access where the servers and network security have been hardened. Every access should be logged for auditing purposes.
You’ll need data masking support so you can reveal partial PII but not full PII for sensitive fields like credit card numbers and social security numbers. You’ll need to manage and govern access based around a zero trust architecture.
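A minimal masking sketch follows, assuming the common convention of revealing only the last four digits. The function name is hypothetical and the sample values are made up.

```python
def mask_keep_last4(value: str, mask_char: str = "*") -> str:
    """Mask every digit except the last four, leaving separators in place."""
    total_digits = sum(ch.isdigit() for ch in value)
    if total_digits < 4:
        raise ValueError("expected at least four digits")

    digits_to_mask = total_digits - 4
    out = []
    for ch in value:
        if ch.isdigit() and digits_to_mask > 0:
            out.append(mask_char)
            digits_to_mask -= 1
        else:
            out.append(ch)  # separators and the last four digits pass through
    return "".join(out)

masked_ssn = mask_keep_last4("123-45-6789")
masked_card = mask_keep_last4("4111 1111 1111 1111")
print(masked_ssn)
print(masked_card)
```

Masking like this needs the original value (or a stored partial value) to work from, which is another reason a hashed-only datastore falls short.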
This is all critical infrastructure, so you’ll need high throughput and availability to scale access and meet the requirements of your business. Finally, to comply with regulations like data residency, you’ll need to be able to shard your customer PII by geography.
That’s a lot to think about!
We created Skyflow to help solve problems like this. We’ve thought through all these challenges and issues so you don’t have to and made it available as a simple API.
Check out our Quickstart environment and give it a try for yourself.