About trusting data

Posted on Monday, Jun 26, 2017 by Daniel Szpisjak ~5 minute(s) to read

“Never trust user input” - say the wise. Sound advice, although it raises more questions than it answers. First of all, what does it mean to trust a piece of data? Why not trust it? Is user input the only piece of data you should be careful with? Can you even trust any data? These are the questions I am exploring in this post.

Assumptions

When data enters your system, you are likely to have various assumptions about it. Some of these are basic, and you take them for granted, while others are complex and validating them is far from trivial. Examining just how much is uncertain about incoming data may surprise you.

First, you probably think nobody tampered with the data in transit. In other words, you assume its integrity. Second, you believe it is not malicious; you assume its intent. Next, you suppose it is syntactically correct, and while you are at it, you conclude its semantics are okay. Why wouldn’t it be, right?

Here is how this would look for an HTTP request:

the source IP of the request matches the party who sent it (assume integrity)
the size of the Cookie header is reasonable, and won’t cause an overflow (assume intent)
it has a valid JSON body (assume correct syntax)
the end-user initiated this request by a direct interaction (assume semantics)

When a request hits your server, these are only suppositions; you have no idea whether they are true or not. You hope they are true, but you can only be sure if you test them one by one.
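As a rough illustration, two of the assumptions above (a reasonable Cookie size and a valid JSON body) can be tested directly. This is only a sketch: the function name `failed_assumptions` and the `MAX_COOKIE_BYTES` limit are made up for this example, and checking integrity or semantics needs mechanisms that are not shown here.

```python
import json

# Hypothetical limit for this sketch; pick one that fits your own server.
MAX_COOKIE_BYTES = 4096


def failed_assumptions(headers: dict[str, str], body: bytes) -> list[str]:
    """Return a description of every assumption the request violates."""
    failures = []

    # Assumed intent: the Cookie header has a reasonable size.
    cookie = headers.get("Cookie", "")
    if len(cookie.encode("utf-8")) > MAX_COOKIE_BYTES:
        failures.append("intent: Cookie header is too large")

    # Assumed syntax: the body is valid JSON.
    try:
        json.loads(body)
    except ValueError:  # covers JSONDecodeError and invalid UTF-8
        failures.append("syntax: body is not valid JSON")

    return failures
```

An empty list means that these two assumptions held; it says nothing about the ones that were not tested.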

Definition of trust

This brings us to the definition of trust I propose:

Trusting data means not verifying your assumptions.

Fortunately, this definition does not conflict with anything already out there. It does, however, place trust in a new perspective. Not trusting data means you must validate your assumptions. Keep in mind, though, that trust has nothing to do with handling data correctly (using prepared statements for SQL queries, properly encoding output inserted in HTML, etc.). You must do that regardless of trust!
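To make "handling data correctly" concrete, here is a minimal sketch using Python's standard `sqlite3` and `html` modules. The `users` table and both function names are assumptions invented for this example; the point is only that the query is parameterized and the output is encoded whether or not the input is trusted.

```python
import html
import sqlite3


def find_user(conn: sqlite3.Connection, name: str):
    # Parameterized query: the driver passes `name` as data, never as SQL.
    row = conn.execute(
        "SELECT name FROM users WHERE name = ?", (name,)
    ).fetchone()
    return row[0] if row else None


def greeting_html(name: str) -> str:
    # Encode on output so the value cannot inject markup into the page.
    return f"<p>Hello, {html.escape(name)}!</p>"


# Small in-memory demo.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.execute("INSERT INTO users (name) VALUES (?)", ("alice",))
```

With this setup, `find_user(conn, "x' OR '1'='1")` simply finds no user, because the quote characters are treated as part of the name rather than as SQL.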

So why would you ever skip validating your assumptions? Well, it saves you time and money. Think about it: if you were to confirm every single assumption at every turn, most of your code would be validation. Not to mention that validation scattered everywhere will probably break the single responsibility principle.

Trusting data is not evil. You just need to base it on something.

Trust other components

The most likely source of trust is another component: you assume someone else has already done the validation. Let me illustrate this with a simple MVC web application.

When the request reaches your controller logic, it will have gone through various stages of processing. At this point, you can be quite confident that the request has a syntactically correct JSON body.

Why? Because you know there is a middleware registered in front of your controller that checks and parses JSON data. In your controller's scope, though, the well-formed body is still an assumption, one that you will not validate because you trust the middleware to have done it already.
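The middleware and controller relationship described here could look roughly like the following. It is a framework-free sketch: `json_middleware`, `create_user`, and the plain status strings are hypothetical stand-ins for whatever your web framework provides.

```python
import json
from typing import Any, Callable

Handler = Callable[[dict[str, Any]], str]


def json_middleware(handler: Handler) -> Callable[[bytes], str]:
    """Validate and parse the JSON body before the controller sees it."""

    def wrapped(raw_body: bytes) -> str:
        try:
            payload = json.loads(raw_body)
        except ValueError:
            return "400 Bad Request"
        if not isinstance(payload, dict):
            return "400 Bad Request"
        return handler(payload)

    return wrapped


def create_user(payload: dict[str, Any]) -> str:
    # The controller does not re-check the syntax; it trusts the
    # middleware to have handed over a parsed JSON object.
    return f"201 Created: {payload.get('name', 'anonymous')}"


app = json_middleware(create_user)
```

Calling `create_user` directly with something that never passed through the middleware is exactly the situation in which that trust stops being justified.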

Again, there is nothing wrong with this as long as it is a conscious choice. When dealing with another component, you can trust it either explicitly or implicitly.

Explicit

An explicit trust relationship is established in your code, essentially making it impossible to violate at runtime. Think about a method which takes a parameter object as one of its arguments. The parameter object is constructed in such a way as to prevent invalid states. A good example is .NET's Uri class. If a method takes a Uri argument, the value it receives has already passed the class's own parsing, so it cannot be a malformed URI. The parameter object takes care of validating the assumptions.
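The same idea can be sketched in Python with a small value object that refuses to exist in an invalid state. The class `HttpUrl` and the function `fetch` are hypothetical names for this example, and the check is deliberately simple (an http or https scheme plus a host).

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass(frozen=True)
class HttpUrl:
    """Parameter object: an instance can only exist if the URL passed the check."""

    value: str

    def __post_init__(self) -> None:
        parts = urlsplit(self.value)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            raise ValueError(f"not an absolute HTTP(S) URL: {self.value!r}")


def fetch(url: HttpUrl) -> str:
    # No validation here: holding an HttpUrl is the proof that it was done.
    return f"GET {url.value}"
```

Here `fetch(HttpUrl("https://example.com/a"))` works, while `HttpUrl("not a url")` raises `ValueError` before `fetch` is ever reached. Python does not enforce the type annotation at runtime, so the guarantee is as strong as the discipline of only passing `HttpUrl` instances.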

Implicit

The other type of relationship is implicit; it is established at design time. Think about XML processing. You create two classes, one that validates based on a schema, and another for processing. The latter depends on the former for schema validation. This relationship is not enforced at runtime. When working with these classes, you must know they have to be used together.
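A minimal sketch of such a pair of classes follows. Python's standard library has no XML Schema validator, so a hand-written structural check stands in for the schema; the class names, the `order` document shape, and the required fields are all assumptions made for this example.

```python
import xml.etree.ElementTree as ET


class OrderValidator:
    """Stand-in for schema validation: checks the expected structure."""

    REQUIRED = ("id", "total")

    def validate(self, xml_text: str) -> ET.Element:
        root = ET.fromstring(xml_text)
        if root.tag != "order":
            raise ValueError("root element must be <order>")
        for name in self.REQUIRED:
            if root.find(name) is None:
                raise ValueError(f"missing <{name}> element")
        return root


class OrderProcessor:
    """Implicitly trusts that OrderValidator has already run."""

    def total(self, root: ET.Element) -> float:
        # Nothing in the code forces callers to validate first.
        return float(root.find("total").text)
```

Nothing stops a caller from handing `OrderProcessor` an element that was never validated; with a missing `<total>` the call fails with an `AttributeError` deep inside the processor instead of a clear validation error.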

You can choose to trust any component in your system implicitly. For instance, you may decide not to validate data coming from your database. The rationale behind this may be the following: your system is the only one interacting with this particular database and everything you store there has already been validated.

Be careful when applying implicit trust. The component you trust is only loosely coupled with those depending on it, so a change in it may quickly turn implicit trust into blind trust.

You also have the option of trusting data blindly, that is, with nothing to base the trust on. This is like playing roulette and placing all your money on RED-36. While there is a chance you will get rich quick, it is very slight. You will most likely end up losing all your money. Do not do this; software development is not gambling…

Conclusion

Data, without processing, provides no value and poses no harm. Trust comes into play when you have to do something with it. Data may come from various sources: users, databases, caches, other services, etc. No matter its origin, always make sure to validate the assumptions you depend on, or have a strong reason for trusting it.

Next time you design your components, take a moment to think about the assumptions you make about the data you depend on. Make sure you do not get bitten by them, and avoid blind trust at all costs.

Have you ever blindly trusted a third-party component? How did that work out? Share it in the comments.