Securing Your Databases Is Good, Securing Your Data Is Better

[Earth had] a problem, which was this: most of the people living on it were unhappy for pretty much all of the time. Many solutions were suggested for this problem, but most of these were largely concerned with the movements of small green pieces of paper, which is odd because on the whole it wasn’t the small green pieces of paper that were unhappy. -- Douglas Adams, The Hitchhiker’s Guide to the Galaxy


You’ve heard about data breaches and what they do to company (and employee) fortunes, so you’re working hard to secure your database--upgrading, firewalling, encrypting, auditing, and so on. Oh yes, and remember to change the default password. What about access control? Do you have the right policies? It feels like I’m forgetting something!

As a security professional, I like more security. You can, and should, nay, must, do all of the above, but not just for your databases. You have to look at your backups, secure the servers that process this data, the data pipelines that move this data around, the logs your applications generate into which data can leak, and so on.

As a security professional, I like more job security. However, you don’t have to be a security professional to see that this is at best an indirect path to the most important thing that you were trying to secure--the data itself.

What is tokenization?

When securing data, the first thing folks think about doing is encrypting it. If you’re following all the best practices--choosing the appropriate encryption algorithms, IV construction, chaining modes, key generation, key management, etc.--you’ve essentially replaced your sensitive data with ciphertext. Ciphertext is a bunch of bytes which are meaningless to anyone without access to decryption keys, and which leak very little information. This is a great start! 

But, there is no lunch without small green pieces of paper. Apart from hiring a cryptography nerd, you now have folks complaining you broke their system--customer service was using the last four digits of social security numbers to validate users and now they don’t have that; the analytics stack was built on software that assumes the email field (which analytics may never read) will always look like an email and now they need to fix their stack. You can solve this by giving everyone access to the decryption keys, but if you do that, you’re not much better off than when you started.

The main point of the encryption-based approach--replacing the sensitive data with “something else”--is exactly right. What you need is to replace the sensitive data with a “something else” that solves these new problems as well.

What you need is tokenization. In tokenization, just like with encryption, you rip out your sensitive data and replace it with a placeholder--a token. Unlike encryption, you do not have to worry about an “adaptive chosen ciphertext attack” (or some crypto attack not yet discovered), because a good token generation scheme can produce tokens that have no mathematical relationship to the values they replace. There is no ciphertext to attack; the only way back to the original value is to ask whatever system holds the mapping.
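To make the idea concrete, here is a minimal sketch of random tokenization in Python. Everything in it is an assumption made for illustration: the `TokenVault` class, the `tok_` prefix, and the in-memory dictionary are hypothetical, and a real vault would add persistent storage, access control, and auditing.

```python
import secrets


class TokenVault:
    """Hypothetical in-memory vault that maps random tokens to original values."""

    def __init__(self):
        self._token_to_value = {}

    def tokenize(self, value: str) -> str:
        # The token is pure randomness; it is not derived from the value,
        # so it reveals nothing about the value it stands in for.
        token = "tok_" + secrets.token_urlsafe(16)
        self._token_to_value[token] = value
        return token

    def detokenize(self, token: str) -> str:
        # Recovering the value is a lookup in the vault, not a decryption.
        return self._token_to_value[token]


vault = TokenVault()
token = vault.tokenize("123-45-6789")
print(token)                    # random on every run
print(vault.detokenize(token))  # the original value
```

Note that the application only ever stores `token`. Whoever steals the application database gets a pile of random strings; the mapping lives in one place that you can lock down.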

Tokenization is not encryption, which means that detokenization is not decryption! You do not have to hand over access to your keys to anyone who wants to work on the tokenized data.

  • Your tokens can be “format preserving”--i.e., they can look like email addresses, or social security numbers, etc. so that old code that was used to seeing email addresses doesn’t crash
  • You don’t have to give customer service access to the entire SSN just so they can compare the last four. You can, given the right tokenization solution, make sure customer service can detokenize ONLY the last four. This allows you to follow the security design principle of “least privilege”--that every process or user should be able to access only the information it needs to do its job.
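The two points above can be sketched together in a few lines of Python. This is an illustration under stated assumptions, not a description of any particular product: the `FormatPreservingVault` class and the role names (`compliance`, `customer_service`) are hypothetical.

```python
import secrets


class FormatPreservingVault:
    """Hypothetical vault: SSN-shaped tokens plus per-role detokenization."""

    def __init__(self):
        self._token_to_value = {}

    def tokenize_ssn(self, ssn: str) -> str:
        # Swap every digit for a random digit and keep the dashes, so the
        # token still looks like an SSN to code that validates the format.
        while True:
            token = "".join(
                secrets.choice("0123456789") if ch.isdigit() else ch
                for ch in ssn
            )
            if token != ssn and token not in self._token_to_value:
                break
        self._token_to_value[token] = ssn
        return token

    def detokenize(self, token: str, role: str) -> str:
        value = self._token_to_value[token]
        if role == "compliance":
            return value                     # full value
        if role == "customer_service":
            return "***-**-" + value[-4:]    # last four digits only
        raise PermissionError(f"role {role!r} may not detokenize")


vault = FormatPreservingVault()
token = vault.tokenize_ssn("123-45-6789")
print(token)                                      # random, SSN-shaped
print(vault.detokenize(token, "customer_service"))
```

One caveat with this naive scheme: a random SSN-shaped token could coincide with somebody’s real SSN, so a production scheme would need a way to tell tokens and real values apart.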

In some cases, you can choose to have tokens that allow some computations (e.g., find common users across different datasets) which might otherwise have required access to sensitive data. These tokens expose just enough information to be useful (e.g. whether or not two records refer to the same user). The key point is that you get to control the exact tradeoff between security and usability of your data using simple token configuration--not complex cryptography.
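As a rough sketch of that tradeoff, here is a hypothetical vault with a single `deterministic` setting. The class name, the flag, and the sample email addresses are all made up for illustration. With the flag on, the same value always gets the same token, so two tokenized datasets can be compared without anyone seeing an email address; with it off, every call returns a fresh token and even that equality information is hidden.

```python
import secrets


class ConfigurableVault:
    """Hypothetical vault whose tokens are either consistent or fully random."""

    def __init__(self, deterministic: bool):
        self.deterministic = deterministic
        self._token_to_value = {}
        self._value_to_token = {}

    def tokenize(self, value: str) -> str:
        # Deterministic mode: reuse the token already issued for this value.
        if self.deterministic and value in self._value_to_token:
            return self._value_to_token[value]
        token = "tok_" + secrets.token_hex(8)
        self._token_to_value[token] = value
        if self.deterministic:
            self._value_to_token[value] = token
        return token


vault = ConfigurableVault(deterministic=True)
signups = {vault.tokenize(e) for e in ["ann@example.com", "bob@example.com"]}
purchases = {vault.tokenize(e) for e in ["bob@example.com", "cat@example.com"]}

# Analytics can count the overlap using tokens alone.
common_users = signups & purchases
print(len(common_users))
```

The price of the deterministic setting is exactly what the text describes: anyone holding the tokens learns which records refer to the same user, and nothing more.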

This is really just scratching the surface. We haven’t talked about your other problems around compliance, data residency, governance, etc. We have also barely discussed the various kinds of tokens and their information-hiding properties.

If you’d like to dig more into the pros and cons of encryption vs tokenization, I encourage you to check out our white paper. If you want to know more about how to use a data vault to solve these problems, please reach out to us.