The Illustrated Guide to Property-Based Testing

Whenever someone mentions “testing”, most people immediately think of unit tests. And that makes sense, since unit tests are easy to write, and quick to execute.

But there’s another way to test, called property-based testing, which is entirely different from the techniques many programmers are used to.

Suppose you were testing a function that returns the length of a list. The obvious way to test this is to unit test it with an empty list, a list of size 1, and then some arbitrarily chosen list with size > 1.
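As a rough illustration, those three unit tests might look like the sketch below. The `length` function here is a hypothetical hand-written stand-in for whatever function you are actually testing.

```python
def length(items):
    """Hypothetical function under test: counts the elements of a list."""
    count = 0
    for _ in items:
        count += 1
    return count


def test_length_examples():
    # Edge case: the empty list.
    assert length([]) == 0
    # Edge case: a list of size 1.
    assert length([7]) == 1
    # A generic case with size > 1.
    assert length([3, 1, 4, 1, 5]) == 5


test_length_examples()
```

Each assertion pins down exactly one input, which is what makes unit tests quick to write and also what limits how much they tell you.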

That sounds fine and dandy, but how do you know that this length function really works correctly? The only way to be 100% sure it works is either to prove it mathematically, for example by induction (which doesn’t generalize to all functions), or to test every possible input (which is usually infeasible).

Proving every function correct is sometimes possible, but rarely practical for most projects, so when unit testing we settle for testing edge cases plus a few generic cases.

Property-Based Testing

Property-based testing, in a nutshell, is saying, “This property must hold true for all possible inputs”. Property-based testing doesn’t actually test “all possible inputs” though. Libraries like Haskell’s QuickCheck check only 100 cases by default. Unless you have an incredibly specific and unrealistic bug, like a function that only fails when the number “666666” is in your list, property tests should be able to cover most cases. If 100 cases isn’t enough, you can raise the count to whatever number you’d like.

So how does this actually work? First, lists are generated in increasingly large sizes, and then whatever property you specified is checked across the lists.
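The loop described above can be sketched in a few lines of Python. This is a hand-rolled illustration, not how QuickCheck or any other library is actually implemented; `random_list`, `check_property`, and the choice to make list size equal to the case number are all assumptions made for the example.

```python
import random


def random_list(size, rng):
    """Build a list of `size` small random integers."""
    return [rng.randint(-10, 10) for _ in range(size)]


def check_property(prop, num_cases=100, seed=0):
    """Run `prop` on `num_cases` random lists of increasing size.

    Returns None if every case passes, otherwise the first failing list.
    """
    rng = random.Random(seed)
    for case in range(num_cases):
        # Case 0 is the empty list, case 1 has one element, and so on.
        candidate = random_list(case, rng)
        if not prop(candidate):
            return candidate
    return None


# Property: the length of a list is never negative.
def length_is_non_negative(items):
    return len(items) >= 0


print(check_property(length_is_non_negative))  # None: no failing case found
```

Raising `num_cases` is the equivalent of asking the library for more than the default 100 checks.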

For example, let’s suppose that one property is that the length function must never return a negative value (an empty list has length 0, so “always positive” would be too strict), but our implementation actually returns a negative value when the list contains a duplicate.
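To make that concrete, here is a minimal sketch of such a bug and the property that exposes it. `buggy_length` is a made-up implementation, written only so that it misbehaves in the way described.

```python
def buggy_length(items):
    """Deliberately broken: negates the result when a duplicate is present."""
    n = len(items)
    has_duplicate = len(set(items)) < n
    return -n if has_duplicate else n


def length_is_non_negative(items):
    """The property under test: a length is never below zero."""
    return buggy_length(items) >= 0


print(length_is_non_negative([]))         # True
print(length_is_non_negative([1, 2, 3]))  # True
print(length_is_non_negative([1, 2, 2]))  # False: the duplicate triggers the bug
```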

Suppose the property first fails on the list [1, 2, 2]. Notice that this isn’t a minimal failing case. Our bug is that the output becomes negative if a duplicate exists, so the minimal failing case is actually [2, 2]. If the property test didn’t hand us back a minimal failing case, we might get indecipherable output.

Suppose the test only found the bug on the 50th iteration. By that point, the random list being tested would be really long. That doesn’t help tell us what the bug is; it just gives us one example of a case where the bug exists.

Compare that output with the output of [2, 2]. Because it’s a minimal example, it suggests that the test fails either whenever the length is 2, or when the list specifically contains two 2’s.

Of course, reducing a failure down to a minimal failing case is fairly easy once you have any failing case. The test runner tries smaller variants of the failing input, removing elements one by one and keeping each variant that still fails, until nothing more can be removed.
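A greedy version of that shrinking step might look like the sketch below. It only removes elements; real libraries typically shrink the element values as well, so treat this as an illustration of the idea rather than any library’s algorithm. `buggy_length` is the same hypothetical broken function used as the running example.

```python
def buggy_length(items):
    """Deliberately broken: negates the result when a duplicate is present."""
    n = len(items)
    return -n if len(set(items)) < n else n


def length_is_non_negative(items):
    return buggy_length(items) >= 0


def shrink(failing, prop):
    """Drop one element at a time for as long as the property keeps failing."""
    current = list(failing)
    progress = True
    while progress:
        progress = False
        for i in range(len(current)):
            candidate = current[:i] + current[i + 1:]
            if not prop(candidate):
                # Still failing without element i, so keep the smaller list.
                current = candidate
                progress = True
                break
    return current


print(shrink([5, 1, 2, 9, 2, 7], length_is_non_negative))  # [2, 2]
```

The result is minimal in the sense that removing any single remaining element makes the property pass again.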

Now that you know the basics of property-based testing, let’s complicate it a little by chaining properties together. Let’s suppose you wrote a special JSON serializer that takes an object, and serializes it with an extra field called “date”.

Typically, with unit testing, we would just create some object, turn it into JSON, turn it back, then check that it matches the original object. But that’s not very dynamic, since you’re testing only one case. You could do it 5 to 10 times, but that would be pretty tedious.

What you could do instead is create an arbitrary object generator, and then use property tests to show that the serializer works, all without explicitly creating fixed test cases.

Here are some examples of properties we’d like to be true, where special_serialize and special_deserialize are the custom serialization functions we wrote, and serialize and deserialize are the ordinary serialization functions that don’t include the “date” field:

| Property | Why? |
| --- | --- |
| For all objects O, special_deserialize(special_serialize(O)) = O | Serializing should be undone by deserializing. |
| For all objects O, special_serialize(O) != O | Serialized non-null objects should be different from the actual objects. |
| For all objects O, special_serialize(O) != serialize(O) | special_serialize adds an extra “date” field, so its output should never be exactly the same as normal serialization. If it is, that’s likely because it either doesn’t add the “date” field, or because it fails when the object already has a “date” field. |
| For all objects O, length(special_serialize(O)) > length(serialize(O)) | Adding a date field while serializing should always increase the size of the resulting JSON, as opposed to serializing it normally. |
| For all objects O, contains(special_serialize(O), “date”) | If the resulting JSON doesn’t contain a date field, it’s automatically wrong. |
| For all objects O, special_deserialize(O) should result in an error | You shouldn’t be able to deserialize a non-serialized object. |
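The properties above can be written down as small predicates and run against randomly generated objects. Everything in this sketch is an assumption made for illustration: the serializer implementations, the placeholder date value, the flat-dictionary object generator, and the reading of the last property as “plain JSON without a date field is rejected”.

```python
import json
import random
import string

FAKE_DATE = "2020-01-01"  # placeholder; the article never says what the date holds


def serialize(obj):
    return json.dumps(obj, sort_keys=True)


def special_serialize(obj):
    # Hypothetical implementation: copy the object and add a "date" field.
    return json.dumps({**obj, "date": FAKE_DATE}, sort_keys=True)


def special_deserialize(text):
    data = json.loads(text)
    if "date" not in data:
        raise ValueError("not produced by special_serialize")
    del data["date"]
    return data


def random_object(rng):
    """A flat dict with up to five random four-letter keys and integer values."""
    size = rng.randint(0, 5)
    keys = ["".join(rng.choice(string.ascii_lowercase) for _ in range(4))
            for _ in range(size)]
    return {key: rng.randint(-100, 100) for key in keys}


def rejects_plain_json(obj):
    try:
        special_deserialize(serialize(obj))
    except ValueError:
        return True
    return False


PROPERTIES = {
    "round trip": lambda o: special_deserialize(special_serialize(o)) == o,
    "differs from object": lambda o: special_serialize(o) != o,
    "differs from serialize": lambda o: special_serialize(o) != serialize(o),
    "longer than serialize": lambda o: len(special_serialize(o)) > len(serialize(o)),
    "contains date": lambda o: '"date"' in special_serialize(o),
    "rejects plain JSON": rejects_plain_json,
}


def holds_for_all(prop, num_cases=100, seed=0):
    """Check `prop` against `num_cases` randomly generated objects."""
    rng = random.Random(seed)
    return all(prop(random_object(rng)) for _ in range(num_cases))


for name, prop in PROPERTIES.items():
    print(name, holds_for_all(prop))
```

Note that this hypothetical serializer silently overwrites an existing “date” field, so `PROPERTIES["round trip"]({"date": 5})` is false even though random generation is very unlikely to stumble on that input.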

That’s a lot of properties to test. While you can theoretically create a series of unit tests to cover these, it’s much easier to use property tests to do it, since property tests cover far more inputs than a handful of unit tests. The exception is the failure described under the 3rd property (an object that already has a “date” field), which will essentially never be triggered in a real property test.

This is because if you generate your objects randomly, they should also have fields with random names, and the probability that any one randomly generated field name is exactly “date” is bounded above by (1 / 26) ^ 4 ≈ 0.0002%. When bugs only happen with really specific inputs that are unlikely to be randomly generated, unit tests become much more appealing.
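The arithmetic behind that figure is quick to check. The sketch assumes field names are exactly four characters long and drawn uniformly from the 26 lowercase letters; a real generator would likely use more lengths and characters, making a match even less likely.

```python
ALPHABET_SIZE = 26          # lowercase ASCII letters only (an assumption)
NAME_LENGTH = len("date")   # four characters

possible_names = ALPHABET_SIZE ** NAME_LENGTH
probability = 1 / possible_names

print(possible_names)               # 456976
print(f"{probability * 100:.4f}%")  # 0.0002%
```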

Conclusion

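To summarize what happens in property testing at a top level: inputs are generated in increasingly large sizes, the property is checked against each one, and any failing input is reduced to a minimal failing case.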

Remember that property-based tests are not an all-inclusive solution to everything (otherwise, everyone would be using them). Property-based tests do very poorly when the bug only occurs in one highly specific scenario. For example, if the test only fails when a specific field and a specific field value occur together, unit tests are clearly favored, since those fields and values would basically never be randomly generated.

Also, property tests are generally much harder to write than unit tests. If you write too few properties, you end up with a property-based test that doesn’t sufficiently test your code. And it’s really hard to know when your properties are sufficient.

Lastly, the biggest disadvantage of property-based tests is that they take much longer to run than unit tests, simply because they run more cases. The average programmer probably only writes ~2-5 unit tests per function, whereas a property-based test executes at least 100 cases, so the unit tests run roughly 20 to 50 times faster, assuming each case costs about the same.

(Image caption: If xkcd 303 were about property-based testing)

This is further exacerbated when you’re working with lists. Property-based tests that involve sorting a list, or really any algorithm slower than O(n), might end up taking time on the order of seconds, which is an eternity in the world of testing.

Just remember that you can’t jam property-based testing into every single situation. It pretty much only works when you have pure functions, and when your functions have easy-to-generate random inputs. Hopefully you found this guide to property-based testing useful, and maybe it’ll inspire you to try out property-based testing in the future!