Software testing strategies, mistakes and pitfalls

Let’s consider the point of running a suite of automated tests against a codebase. If all the tests pass, what does it mean? If some of them fail, again, what does it mean? A lot depends on the quality of the tests and test cases.

First, let’s agree on what we would consider a good test. In my mind, a good test is one that:

  • tests desired functionality, not implementation. By functionality I mean the behaviour of the program (see the sketch after this list).
  • runs in a reasonable timeframe.
  • fails when a given input produces output that differs from what we would expect if the code were correct.
  • passes otherwise, i.e. when the output matches the expected output.
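
To make the first point concrete, here is a minimal sketch in Python with pytest. The pricing module and apply_discount() function are invented purely for illustration:

    import pytest

    from pricing import apply_discount  # hypothetical module under test

    def test_discount_is_applied_to_total():
        # Functional expectation: a 10% discount on 200.00 yields 180.00.
        # We assert only on inputs and outputs; how apply_discount() works internally is irrelevant.
        assert apply_discount(total=200.00, percent=10) == pytest.approx(180.00)

    def test_invalid_discount_is_rejected():
        # A reasonable incorrect input: discounts over 100% should be refused.
        with pytest.raises(ValueError):
            apply_discount(total=200.00, percent=150)

If apply_discount() is later rewritten internally, these tests keep passing as long as the behaviour holds – which is exactly what we want from them.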

Why do we only care about functionality? Isn’t it right to test implementation too? Well. No. The point of our software is to satisfy functional requirements. If we can achieve the desired result we really don’t care how that result is achieved. Unless it’s achieved by the use of child slave labour like chocolate production but we can’t write automated tests for that. Also chocolate is delicious.

TL;DR

This blog post got very long. In summary, automated testing done right allows you to have some confidence that code changes have not introduced new bugs – not total confidence, but some. Testing does have a cost though so we look for strategies to mitigate that. Primarily we try to be pragmatic in terms of what we test and how we test it. There are some easy ways to make mistakes in designing a testing strategy and we go into detail on some of them. I don’t know why I used the pronoun “we” – I’m trying to make you think you’re buddies with me so that you don’t question my wisdom. That said this entire post is still just an opinion piece, so don’t be shy to disbelieve it.

The ideal case

OK so assume that our test suite

  • consists entirely of good tests (as per our definition).
  • has test cases for reasonable inputs to the program, including reasonable incorrect inputs.
  • has enough test cases to cover at least the basics of most of the program functionality.
  • runs within a reasonable time frame.

In this case when we run the test suite and it passes we have some degree of confidence that the code functionality is correct. If we make changes to the code and run the tests again, we will be notified if the changes that we made break any of the tests, meaning that we have accidentally changed the functionality. We can then go back and fix the changes that we made until our tests pass again. We also hope that if the code changes have broken some tests then the specific tests which have broken will give us a clue as to which change might be responsible – i.e. we use the broken tests to locate where the bug might be hiding. Also note that since the test suite runs in some reasonable time (absolute max ten minutes, prefer 30 seconds), we are not afraid to run it multiple times.

So the major benefit of tests in this situation is that they allow you to make code changes with some confidence that you are not simultaneously introducing bugs. The degree of confidence that you have is in direct proportion to the quality and quantity of tests that you have. Caveat emptor though: this is the ideal case, where all the tests are good quality and test actual desired behaviour. Having lots of tests and high test coverage is completely pointless if the tests are not good quality.

I’m calling this the ideal case for a reason. In my experience and from what I’ve gathered in my background reading, it rarely manifests. But it is a good goal to strive for. Automated testing is useful insofar as it approaches this ideal; the further from the ideal case you get, the less point there is to automated testing.

Tests are not free

Even if we can write the best tests possible and cover the code base completely with tests, we still have the issue that tests do cost us something to write.

  • Tests are additional code, typically doubling the amount of code that we write to implement functionality.
  • Tests can be difficult to write meaning that we have to allocate brain resources to doing so.

Both of these downsides mean that writing automated tests increases development time.

  • One of the worst problems (in my opinion) with having many automated tests is that they add additional effort to refactoring – not only do you refactor the code, but you also have to update the tests to match.
  • Automated tests can add significantly to development time in other ways. A project I worked on had a rule that nothing could be sent to the test environment for human testers unless a long, slow running, buggy, automated test suite passed. The idea was that the automated tests would find the silly bugs and the human testers wouldn’t have their time wasted. It was true, they barely had their time wasted at all – getting stuff to them for testing was a nightmare, because even if your stuff was fine, someone else could break the tests and you’d be stuck waiting another hour before they passed and you could deploy.

These issues are by no means a reason to not write tests at all, but they do incentivize us to write the minimum number of tests we think we can get away with. After all, what counts is shipped software – shipped with as few bugs as possible but not zero bugs. Zero bugs are a great ideal but if the cost to achieve zero bugs is too high then we accept a certain level of bugginess in our code and ship, because shipped code is what earns the dollars, not amazingly well polished code. Some mightily successful software (e.g. Windows) has been shipped with major bugs included.

Automated tests are but one of our weapons in the fight to eliminate bugs: they are one element of our defect containment strategy, and they do not stand alone. So we should use automated tests only where they make sense, and fill in the gaps using other strategies like human testing.

The real world

So we’ve got a codebase, and we’ve got a bunch of automated tests written for it. Let’s say that the tests cover a certain amount of the codebase, meaning that not all of the codebase gets “exercised” when we run our test suite. This is typically the case, although there are teams and codebases which achieve one hundred percent code coverage (more on why achieving this god-like level of test coverage is IMO probably a bad idea later). Let’s also specify that a certain percentage of these tests do not fit our definition of what a good test is.

If an individual test passes, it should mean that for that particular case the code being tested is working as expected. However if the test itself contains a bug (and tests contain bugs at roughly the same rate as actual production code, it’s not like we magically write perfect code when we write tests) then the bug may either cause the test to fail when it should pass (this is good since we’ll spot the failing test and go and investigate), or to pass when it should fail (this is bad, because it amounts to silent failure and we’ll never know that there’s a problem). If an individual test fails then ideally that means that the code that it tests has a problem. But tests may fail because of code changes that are actually deliberate. This tends to happen when the test actually tests implementation details instead of strictly testing functionality.
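
As a concrete (and hedged) sketch of the “pass when it should fail” case – the parse() function and its module are invented for illustration:

    from parsing import parse  # hypothetical function that should raise ValueError on empty input

    def test_parse_rejects_empty_input():
        # Buggy test: if parse("") silently returns None instead of raising, no exception
        # occurs, nothing is asserted, and the test still "passes".
        try:
            parse("")
        except ValueError:
            pass  # the expected path – but nothing fails if we never get here

The fix is something like pytest.raises() or an explicit fail-if-we-reach-here assertion, but bugs like this are easy to write and very easy to miss, because the test goes green either way.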

Some functionality in the codebase is difficult to write automated tests for. There are a few classic examples:

  • user interfaces, especially graphical ones. There do exist tools for writing automated tests for these things but in my opinion the cure is worse than the disease. Writing such tests (AFAIK, YMMV) tends to be complex and they tend to be very closely tied to the implementation.
  • code with external dependencies like network connections and databases. The issues here are:
    • the external dependency needs to be up and running for the test to work properly. If this is not guaranteed then the test may fail intermittently.
    • the tests typically run slow since they have to make round trip calls to some resource.

For code with external dependencies the traditional solution is to “mock out” the dependency – to replace it with fake calls. This is a viable approach most of the time, but sometimes you do want the test to actually hit a database or an API. As long as the number of tests which do this is small and the dependency is available, this is reasonable too.
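
A minimal sketch of the mocking approach using Python’s unittest.mock – the users module, fetch_user() function and its internal requests.get() call are all invented for illustration:

    from unittest.mock import MagicMock, patch

    from users import fetch_user  # hypothetical code under test that calls requests.get() internally

    def test_fetch_user_returns_parsed_name():
        fake_response = MagicMock()
        fake_response.json.return_value = {"id": 42, "name": "Ada"}
        # Replace the real network call with a canned response: fast, deterministic,
        # and no server needs to be up for the test to run.
        with patch("users.requests.get", return_value=fake_response):
            user = fetch_user(42)
        assert user["name"] == "Ada"

The test now exercises our parsing logic without touching the network; a small number of separate tests that hit the real API can then cover the integration itself.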

Brittle tests are bad

If you have a large number of tests which are “brittle” (they need to change when the code changes because they are implementation-dependent and thus get broken simply by the act of changing the code), then you will often have a large price to pay (in time and cognitive load) for code changes, even trivial ones. That tends to encourage people to leave the code as it is rather than refactoring it. If the main reason that tests break is because they are brittle (and not because the functionality that they specify has been broken), then you quickly reach a situation where tests are next to useless for locating bugs (because when they break people assume it’s “normal breaking” due to code changes) and where people are afraid to make changes because of the high cognitive cost. Also when the tests break people fix the test rather than fixing the code because they assume that the test broke due to the code change, and not because the code change introduced a change to the program functionality.

In this situation, tests act like a brake on development. They slow things down. The more tests you add, the more work you have to do to make code changes. Ironically one of the major benefits of automated tests is supposed to be that you can make code changes with confidence, because if your changes break the tests then you know exactly where to look to fix them, but with brittle tests there is no guarantee that your changes actually broke functionality and there is a perverse incentive to change the code as little as possible. Even if you are still willing to make changes, the increased development time allocated to simply fixing broken, brittle tests is a high price to pay.
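
To make the brittle/robust distinction concrete, a hedged sketch – the ShoppingCart class is invented, and the brittle version uses the mocker spy fixture from the pytest-mock plugin:

    from cart import ShoppingCart  # hypothetical class under test

    # Brittle: pinned to an implementation detail. Rename or remove the internal
    # _recalculate() helper and this test breaks, even though the cart still works.
    def test_add_item_calls_recalculate(mocker):
        cart = ShoppingCart()
        spy = mocker.spy(cart, "_recalculate")
        cart.add_item("book", price=10)
        assert spy.call_count == 1

    # Robust: pinned to behaviour. This only breaks if the cart actually stops working.
    def test_add_item_updates_total():
        cart = ShoppingCart()
        cart.add_item("book", price=10)
        assert cart.total() == 10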

This is one reason why a test suite will sometimes be configured to ignore certain tests – the price of changing those tests to match the code changes became too high. If you’re routinely ignoring a test you might as well delete it, but people are reluctant to delete code.
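
In pytest, that kind of “ignoring” usually looks something like this (the test name and reason string are invented, but the pattern will be familiar):

    import pytest

    @pytest.mark.skip(reason="broken by the checkout refactor – will fix 'later'")
    def test_checkout_applies_loyalty_discount():
        ...

Once a skip like this has survived a few release cycles, deleting the test is usually more honest than keeping it.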

So we’re all agreed right? Testing interface good. Testing implementation bad. Unfortunately, there are sometimes perverse incentives to write brittle tests.

Tests are written by humans (code is too)

An important thing to remember about automated tests is that they are only as good as the person writing them makes them. Because tests are written by humans, getting good tests is partly a game of incentives, because humans break rules if they feel like it.

Consider what happens to your test suite when the team is working flat out to hit a deadline, for example. Whole sections of the test suite may be simply turned off as people slave away until 2 am trying desperately to fix stuff in time for The Big Launch ™. In ideal perfect fantasy land the way that you deal with this situation is you cut features or postpone the launch or something, but you do not compromise on quality. Of course every software development team in history has “never compromised on quality” (insert eye roll gif here). Of course we do. You can’t always push the deadline out. You can’t always cut features – you might like to do these things but sometimes you just can’t. It might not be your decision or you might lose the contract if you fail to deliver on time. At least with this kind of thing you can fix it post facto.

Another thing to watch out for is that there may be hidden perverse incentives in your testing strategy. For example, I worked on a project (the same one with the rule about not pushing code to the human testers until the long slow test suite passed) with a rule that to push your code changes to the repository your code must increase coverage, or at least keep it the same. By enforcing this rule you create some perverse incentives:

  • I want to write a test that increases code coverage rather than a test that tests functionality. These types of tests will almost always be brittle because my goal is to increase coverage (i.e. the test must exercise every line of code in my function), so knowing the structure of the code I deliberately write my tests accordingly.
  • I don’t want to delete any tests if I can help it. Not now, not ever.
  • I don’t want to refactor code even if it needs it, because then I’ll have to fix a whole bunch of brittle tests which keep coverage up.

It’s traditional in developer circles to pretend we’re all highly moral and can resist these perverse incentives. Maybe some people can. I know I can’t, not when there’s a deadline and a boss breathing down my neck for delivery. Better to set things up without the perverse incentive in the first place.

Focusing too much on code coverage

Everyone loves code coverage. The idea is that an automated tool keeps track of the number of statements executed when you run your tests; divide that by the total number of statements in your code and you have a percentage which indicates how much of your codebase is covered by tests. Management-type people love this. In the corporate world, anything that provides some kind of validation that things are being done in a “best practice” way is good. And of course that’s true: you want your dev team to follow best practice. Code coverage is also great because it’s quantifiable – you can even specify in the contract that at least 95% code coverage is a deliverable. The thing is, though, that if you get too focused on code coverage, I would argue that you end up with anti-best practice.

Let’s break down why:

  • Code coverage is not a guarantee that the tests make sense or that they test anything useful, it’s simply a guarantee that a certain number of lines of code got executed. Similarly, it’s not a guarantee that the code under test does what it is supposed to and it is not a guarantee of software quality. It is a metric, and it can be somewhat useful, but not as a measure of quality.
  • 100% code coverage sounds like an admirable goal but is actually detrimental. First, tests cost time and money to write, and are tricky to write well. Second, you don’t need tests to cover all areas of your code. Sometimes automated testing of things like UI/UX is difficult; it’s better to test these things manually and to reserve automated testing for the places in the code base where it makes sense.
  • As discussed elsewhere in this article, focusing on code coverage encourages developers to write tests that are
    • pointless (e.g. exhaustively testing every branch of some function’s switch statement purely to get coverage to 100%).
    • extremely tightly coupled to the implementation (see the sketch after this list). If you ever need to change that switch statement the test must change too. To which a lot of people would say “Well, yes, that’s the point of the test, the function has changed so the test to exercise the function must change too”. But it isn’t. The point of the test is to provide some guarantee that the program is working. If some private function changes but the functionality of the program continues to work exactly as before, then having to change the test is just more onerous deadweight.
    • not well thought out. Testing matters, and you should test where you can. But a rule that every commit to the repository must increase test coverage (for example) means that most of those tests won’t be well-thought-out testing scenarios; they’ll just be written with that single goal in mind.
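
Here is a hedged sketch of what such a coverage-driven test tends to look like. The shipping_cost() function and its branches are invented, with an if/elif chain standing in for the switch statement; the point is that the test mirrors the code’s structure rather than any requirement:

    # Hypothetical code under test.
    def shipping_cost(region):
        if region == "EU":
            return 5
        elif region == "US":
            return 7
        elif region == "APAC":
            return 9
        else:
            return 12

    # Coverage-driven test: one assertion per branch, written by reading the code above.
    # It pushes coverage to 100%, but it restates the implementation rather than a
    # requirement, so any change to the branches forces a matching change here.
    def test_shipping_cost_all_branches():
        assert shipping_cost("EU") == 5
        assert shipping_cost("US") == 7
        assert shipping_cost("APAC") == 9
        assert shipping_cost("ZZ") == 12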

Focusing too much on unit testing

A lot of testing effort tends to be focused on unit testing. The typical mantra goes that you should have lots of unit tests because they run fast, then a medium amount of integration tests which are slower, and even fewer end-to-end tests since these run the slowest. I’m not a big fan of unit tests per se, and here I’ll try to explain why. Note that I’m taking a somewhat heretical stance, so don’t take my word for it; this is just one guy’s opinion.

When you’re developing a new feature the code tends to be in a state of flux for most of the time. It’s not always easy to identify code that should live in a class by itself. At this point you need to be able to refactor constantly and quickly. If you’re writing tests at the same time – even what we’ve been calling “good” tests, which test the class interface rather than its implementation – the problem is that the interface itself is changing rapidly as you learn more about the problem and change your design. Because the interface is changing, the tests need to be refactored at the same time as the code. This adds additional, avoidable cognitive load.

The central contention of the TDD (Test Driven Development) philosophy is that if you have a desired piece of functionality then you should write the test for it first and then write the code. This, I would argue, makes some sense at an acceptance-test level; the problem I see is that at a unit level, it is impossible to decide ahead of time what the perfect way to solve the problem is, write unit tests for that, and then implement the code to match – what you’ll instead do is decide on an approach, write unit tests, realise it’s incorrect, refactor, rewrite tests in line with the new interface… it just becomes quite onerous. I see it as far more productive to refactor the code rapidly, to not write unit tests but rather write units that are so simple and clear that the function of the code is obvious, and if necessary write unit tests once the development is complete and the class interface is stable.

If your units are simple and their function is clear, then tests written to find bugs in them are almost pointless – let’s say for example you have some function which divides integers by two, you implement a test to check that it works, and you realise you’re dividing integers by three instead. Once you’ve fixed that bug it’s fixed for all time, and it was a simple enough bug that, had you not had the unit test, you’d definitely have found it anyway. Writing the test was a waste in the first place. Rather do something else with your time. Write a class to divide integers by five. Take dancing lessons. Ask Santa for a pony.
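
For concreteness, the scenario is roughly this (the function is deliberately trivial, and halve() is just an illustrative name):

    def halve(n: int) -> int:
        # A unit so simple its correctness is obvious at a glance.
        return n // 2

    def test_halve():
        # A test like this catches the "divided by three" typo exactly once, during
        # initial development – a bug you would have spotted anyway – and then runs
        # forever, adding runtime and maintenance weight without adding confidence.
        assert halve(10) == 5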

The one time you do find pointless tests like the one we’ve just described is when people are writing tests simply to up their code coverage instead of to test functionality. In that case, failing to test that the stupid function does the stupidly obvious thing each and every time you run your test suite means that you miss some fraction of a percentage point of code coverage, which is apparently worse than eating fruit salad with your fingers or midget tossing.

Conclusion

Oh man, this got to be a suuuper long blog post. My apologies and congratulations if you made it this far. You are one of the few, the chosen, those who can still read. Our numbers are dwindling my friend. I hope this was useful.
