Data Integrity Primer

Posted on Wednesday, Jul 26, 2017 by Daniel Szpisjak ~7 minute(s) to read

Data integrity is rarely talked about, even though it forms the basis of many data flows a modern web application has to deal with. From a security perspective, integrity means protecting data from modification by unauthorized parties. There are various techniques to ensure integrity. I will guide you through the options, using real-world examples. Once you finish, you will know more about this than most of the industry.

Checksum

Have you ever wondered how a website can tell that a credit card number is mistyped before talking to your bank? Ever heard of the Luhn formula? Well, that is a checksum!

The idea is dead simple: take the original card number and add a check digit to it. The extra digit can be used to detect common errors, such as a single mistyped digit, making it possible to check integrity. Checksums are widely used, from credit cards to network protocols (like TCP). They are straightforward and efficient. Some can even be used for error correction.

There is one important caveat though: checksums are designed to detect random errors, not malicious modifications! Take a look at the Luhn algorithm mentioned above; it is trivial to fool if you possess basic math skills.
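To make this concrete, here is a minimal Python sketch of the Luhn formula. The function names (luhn_sum, is_valid, with_check_digit) are made up for this illustration, and the sample number is a commonly used test value, not a real card. It shows both sides of the story: a random typo gets caught, while anyone who changes the digits on purpose can simply recompute the check digit.

```python
def luhn_sum(number: str) -> int:
    """Return the Luhn sum modulo 10 of a digit string (0 means the check passes)."""
    total = 0
    # Walk from the rightmost digit and double every second digit.
    for position, char in enumerate(reversed(number)):
        digit = int(char)
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as adding the two digits of the product
        total += digit
    return total % 10


def is_valid(number: str) -> bool:
    return luhn_sum(number) == 0


def with_check_digit(payload: str) -> str:
    """Append the check digit that makes the whole number pass the Luhn check."""
    check = (10 - luhn_sum(payload + "0")) % 10
    return payload + str(check)


number = "79927398713"                   # commonly used Luhn test value
typo = "79927398710"                     # last digit mistyped: a random error
forged = with_check_digit("1111111111")  # attacker picks digits, recomputes the check digit

print(is_valid(number), is_valid(typo), is_valid(forged))  # prints: True False True
```

The forged number passes because nothing in the scheme is secret: whoever can change the digits can also recompute the check digit.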

There must be something to defend against malicious tampering, right? Yes, and it’s called a hash.

Hash

Once upon an evening, you decide to install Ubuntu on your machine. First, you need to get your hands on the installer image. It is quite large, so you choose to download it with BitTorrent. It works by downloading chunks of data from different peers. To make sure no peer can modify any chunk and corrupt your image, the torrent file, downloaded from the Ubuntu website, includes the hash of every piece. When your client finishes downloading a piece, it simply calculates the piece's hash and checks for a match.

So what is a hash?

If you are completely new to hash functions, start here. The hash used by the BitTorrent protocol is of a special kind: it’s a cryptographic hash function.

Hash functions have various properties mainly centered around the idea of avoiding a collision for non-malicious input. A hash function is cryptographic if it satisfies some stronger properties, namely the following.

Preimage resistance

Given a hash value h, it should be difficult to find a message m, such that hash(m) = h. Makes sense, as hash functions are one-way.

Think about it. If this weren’t the case, a malicious peer in the torrent network could easily corrupt Ubuntu images. All he needs to do is find a preimage to the given hash value of the chunk.

Second preimage resistance

Given a message m₁, it should be difficult to find a different message m₂, such that hash(m₁) = hash(m₂). Unlike the previous property, the attacker starts from a known message rather than from a bare hash value.

If a hash function does not satisfy this property, an attacker only needs to get his hands on the original message to craft a double (a different message that hashes to the same value). Getting the original is trivial in BitTorrent.

Collision resistance

It should be very difficult to find two different messages m₁ and m₂ such that hash(m₁) = hash(m₂).

Remember when Google announced the first SHA-1 collision? Well, that is exactly what collision resistance should protect against.
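To get a feel for why collisions are a matter of output size and design strength, here is a toy experiment. It assumes a deliberately weakened hash, toy_hash, which keeps only the first 3 bytes of SHA-256. The name and the truncation are my own, purely for illustration. With just 24 bits of output, a simple brute-force search typically stumbles on a collision after a few thousand attempts.

```python
import hashlib
from itertools import count
from typing import Dict, Tuple


def toy_hash(message: bytes, n_bytes: int = 3) -> bytes:
    """SHA-256 truncated to n_bytes. Deliberately weak, for illustration only."""
    return hashlib.sha256(message).digest()[:n_bytes]


def find_collision(n_bytes: int = 3) -> Tuple[bytes, bytes]:
    """Hash the messages b"0", b"1", b"2", ... until two of them share a hash value."""
    seen: Dict[bytes, bytes] = {}
    for i in count():
        message = str(i).encode()
        digest = toy_hash(message, n_bytes)
        if digest in seen:
            return seen[digest], message
        seen[digest] = message
    raise AssertionError("unreachable: count() never ends")


m1, m2 = find_collision()
print(m1, m2, toy_hash(m1).hex())
```

The same search against the full 256-bit output is far out of reach, which is the whole point of a collision-resistant hash.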

If a hash function satisfies the above 3 criteria, it is said to be strong enough for cryptographic use. From these we can easily deduce the following:

h₁ = hash(m₁)
h₂ = hash(m₂)
if h₁ = h₂ then m₁ = m₂ with very high probability

Ultimately, this is what the BitTorrent protocol trusts when verifying the integrity of the chunks you downloaded.
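The piece check described above can be sketched in a few lines of Python. This is not the actual BitTorrent implementation: the helper names and the tiny PIECE_SIZE are hypothetical, and SHA-256 is used as a stand-in, since the hash function in the real protocol depends on the protocol version.

```python
import hashlib
from typing import List

PIECE_SIZE = 16  # hypothetical and tiny on purpose; real torrents use much larger pieces


def split_into_pieces(data: bytes, size: int = PIECE_SIZE) -> List[bytes]:
    return [data[i:i + size] for i in range(0, len(data), size)]


def piece_hashes(data: bytes) -> List[bytes]:
    """What the publisher would store in the torrent file: one hash per piece."""
    return [hashlib.sha256(piece).digest() for piece in split_into_pieces(data)]


def verify_piece(index: int, piece: bytes, trusted_hashes: List[bytes]) -> bool:
    """What the client does after downloading a piece from an untrusted peer."""
    return hashlib.sha256(piece).digest() == trusted_hashes[index]


image = b"pretend this is a large Ubuntu installer image"
trusted = piece_hashes(image)      # comes from the torrent file
pieces = split_into_pieces(image)  # come from peers

honest_ok = verify_piece(0, pieces[0], trusted)
tampered_ok = verify_piece(1, b"malicious bytes!", trusted)
print(honest_ok, tampered_ok)
```

A piece that fails the check is simply thrown away and requested again from another peer.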

Limitations

You may have noticed by now that, for this type of integrity checking to be effective, the hash has to be distributed out-of-band, i.e. separate from the data it is meant to protect. Why? Simple: if you attach it to every chunk, corruption becomes trivial, as the attacker just has to recalculate the hash before sending you bad data.

For BitTorrent, the torrent file comes from the Ubuntu web page while the chunks are downloaded from peers on the network. Hashing works well in this case. Makes sense, right?

Okay, so what about all those sites that publish the hash of a file right beside the download link? Are they any good?

Well, it depends on what they are trying to protect against. If it is random errors then yes, having the data and the hash side by side does the job. On the other hand, if the goal is to protect against a malicious man-in-the-middle, i.e. someone capable of modifying traffic on the fly, then this scheme fails badly for the reason stated above.
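Here is a small sketch of that failure mode. The byte strings and variable names are invented for the example. The attacker swaps both the file and the hash sitting next to it, so the side-by-side check happily passes, while a hash obtained over a separate, trusted channel still catches the change.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def verify(data: bytes, expected_hex: str) -> bool:
    return sha256_hex(data) == expected_hex


original = b"ubuntu installer image"
published_hash = sha256_hex(original)  # assumed to reach you over a trusted channel

# A man-in-the-middle replaces the file AND the hash shown beside it.
tampered = b"ubuntu installer image + malware"
tampered_hash = sha256_hex(tampered)

in_band_ok = verify(tampered, tampered_hash)       # attack goes unnoticed
out_of_band_ok = verify(tampered, published_hash)  # attack is detected
print(in_band_ok, out_of_band_ok)  # prints: True False
```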

So how do you verify the integrity of data/software downloaded from the internet? This is all in the next section.

Signature and MAC

Did you know that humans have signed documents as early as the second century? No wonder we carried the concept into our digital world. We also made some improvements. Digital signatures and MACs, if used correctly, are a lot more reliable than handwritten signatures.

So what’s a digital signature?

A digital signature is a mathematical scheme for demonstrating the authenticity of digital messages or documents. A valid digital signature gives a recipient reason to believe that the message was created by a known sender (authentication), that the sender cannot deny having sent the message (non-repudiation), and that the message was not altered in transit (integrity). - Wikipedia

Okay, what’s a MAC?

In cryptography, a message authentication code (MAC), sometimes known as a tag, is a short piece of information used to authenticate a message—in other words, to confirm that the message came from the stated sender (its authenticity) and had not been changed (its integrity). - Wikipedia

Both of these constructions can be used to prove data integrity. Their key advantage over regular hash functions lies in the fact that the signature/tag can only be produced with the help of a secret.

For a MAC, the key used to create the tag and the key used to verify it are the same, i.e. it’s symmetric. Digital signatures use key pairs: the private key is used to sign, while the public one is used to check.
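The symmetric case is easy to try with Python's standard library. The sketch below uses HMAC with SHA-256. The helper names make_tag and verify_tag, the key, and the messages are my own inventions for the example. Signatures with key pairs need a third-party library in Python, so they are left out here.

```python
import hashlib
import hmac
import secrets


def make_tag(key: bytes, message: bytes) -> bytes:
    """Compute an HMAC-SHA256 tag over the message."""
    return hmac.new(key, message, hashlib.sha256).digest()


def verify_tag(key: bytes, message: bytes, tag: bytes) -> bool:
    # compare_digest runs in constant time to avoid leaking how many bytes matched.
    return hmac.compare_digest(make_tag(key, message), tag)


key = secrets.token_bytes(32)  # shared secret known to both sender and receiver
message = b"transfer 10 EUR to Bob"
tag = make_tag(key, message)

print(verify_tag(key, message, tag))                       # untouched message
print(verify_tag(key, b"transfer 9999 EUR to Eve", tag))   # modified message
print(verify_tag(secrets.token_bytes(32), message, tag))   # verifier with the wrong key
```

Only the first check should succeed. Without the key, an attacker who changes the message has no way to produce a matching tag.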

Modern mobile operating systems check the integrity of applications before installing them. As a matter of fact, they check their authenticity as well. If an application has been modified or is not signed by a trusted party, the OS won’t install it. This is the essence of code-signing, and it is achieved by using digital signatures.

JSON Web Tokens are commonly integrity-protected by a so-called HMAC construction. These tokens are passed to other parties, who cannot modify them without knowing the secret that was used to sign them. A JWT is essentially a self-contained integrity-checked chunk of data.
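For the curious, here is a rough sketch of how an HMAC-protected token of this kind can be built and checked using only the standard library. The function names sign_token and verify_token are hypothetical, and the sketch covers only the HS256 case. It skips things a real application needs, such as checking the algorithm header and expiry claims, so treat it as an illustration and use a maintained JWT library in production.

```python
import base64
import hashlib
import hmac
import json
from typing import Optional


def b64url(data: bytes) -> str:
    """Base64url without padding, as used in JWTs."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign_token(claims: dict, secret: bytes) -> str:
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    )
    tag = hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()
    return signing_input + "." + b64url(tag)


def verify_token(token: str, secret: bytes) -> Optional[dict]:
    """Return the claims if the tag matches, otherwise None."""
    try:
        header_b64, payload_b64, tag_b64 = token.split(".")
        signing_input = (header_b64 + "." + payload_b64).encode("ascii")
        expected = hmac.new(secret, signing_input, hashlib.sha256).digest()
        if not hmac.compare_digest(expected, b64url_decode(tag_b64)):
            return None
        return json.loads(b64url_decode(payload_b64))
    except ValueError:  # malformed token, bad base64 or bad JSON
        return None


secret = b"demo-secret-not-for-production"
token = sign_token({"sub": "alice", "admin": False}, secret)
print(token)
print(verify_token(token, secret))
```

Anyone can read the claims, since they are only encoded, not encrypted. What the holder cannot do is change them and still pass verify_token without the secret.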

Note that these signatures can be safely transmitted with the data and do not need the out-of-band channel. They also provide another useful property called authenticity, proving the data came from someone who knows the secret used for the signature.

Would you like to dig deeper?

This post is just an introduction to the topic. If you want, you can learn more about it, along with a more detailed introduction to cryptography.

Conclusion

Ensuring data integrity is critical. It’s used in some of our most fundamental protocols: TCP, TLS, SSH. As a software developer, you must know the right tool for the job when it comes to integrity. Let’s do a quick recap.

Checksums provide protection against random errors and possibly error correction. Use them when you are not concerned about malicious actors. A good example is detecting errors caused by noise on the medium (TCP).

Cryptographic hashes are stronger constructs having the following properties: preimage resistance, second preimage resistance, collision resistance. Use them to protect against malicious modifications. Remember to distribute the hash on a different channel (BitTorrent)!

Digital signatures and MACs have the strongest security properties. They create a piece of data which proves integrity and authenticity using a secret. MACs use a symmetric secret while digital signatures use the asymmetric model. The signature/tag produced by these constructs cannot be calculated without knowing the secret. This makes them ideal in highly untrusted environments, like the web (TLS Certificates, JWT).

One final piece of advice. Don’t roll your own integrity protection scheme. Use battle-tested solutions: CRC32 for checksums, SHA256 or SHA512 for hashes, RSA or ECDSA for digital signatures and HMAC for MACs.
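In Python, for instance, three of these are one import away in the standard library. The snippet below is only a pointer to where they live, with a made-up payload and key. RSA and ECDSA are not part of the standard library, so they are not shown.

```python
import hashlib
import hmac
import zlib

data = b"some payload"

crc = zlib.crc32(data)                     # checksum: catches random errors
digest = hashlib.sha256(data).hexdigest()  # cryptographic hash
tag = hmac.new(b"demo-key-not-for-production", data, hashlib.sha256).hexdigest()  # MAC

print(crc, digest, tag)
```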
