Practical testing: a guide for software engineers
Testing is an essential part of the software development process, but it can be tedious and time-consuming, and it might not prevent you from making mistakes. Moreover, the code found in test directories is some of the highest-liability code you own: it is scrutinized less often, highly duplicative, less documented, often excluded from linting rules, and verbose. It frequently lacks the code needed to regenerate fixture files, tests the wrong things, and is overall harder to understand than the non-test code.
Unfortunately, a lot of the advice you might hear about testing practices is dogmatic and unjustified. I remember Uncle Bob once mentioning that test-driven development (TDD) is a done deal and you’re either incompetent or ignorant if you’re not using it. Yet I don’t enjoy TDD, and I have found that it’s rarely used in practice among my coworkers. Software engineering evolves the same way science does. People come up with new processes and ideas for how to do things, only to find five years later that the next generation has moved on to new paradigms, frameworks, and languages because those sometimes work better. It is therefore best to keep an open mind and constantly question any advice, including the advice in this post. Below I want to share some practical things that make testing more efficient and less of a bottleneck.
Test for realism
I used to rely heavily on mocks and faker libraries. How else would you run an integration test against a database without those? How else would you run tests in isolation without mocking some function call? The problem with mocking and faking is that you’re never really testing against a realistic environment. Yes, your code is isolated and runs fast, and you end up following Mike Cohn’s test pyramid, with a lot of unit tests, but it’s all meaningless if those unit tests don’t check for realistic behavior. I think the preference for unit testing with mocking is primarily due to computers being slower in the past and codebases being much larger, so the speed of testing was a relevant factor. Today, however, computers are much faster and tools are better, and I don’t see a reason not to make integration and e2e tests a larger share of the overall test suite. Unrealistic unit tests, even if they are fast, create more entropy and cognitive load, catch fewer bugs, generate a false sense of security, and lead you to spend more time maintaining the test suite.
For example, don’t mock your Postgres. Instead, add a service to your docker-compose file that runs a real database, locally or in CI/CD, and connects to your service. Then keep some scripts handy that allow you to populate the database with data similar to what is in production. Similarly, you could use tools like garden.io to mimic complex production environments that run on k8s, or rely on port-forwarding to a staging environment for integration testing against other internal services.
What are the consequences of bugs or mistakes?
Different industries have different requirements for how much coverage you need and how thorough your testing should be. I had the privilege of working for a financial institution that is heavily regulated and where mistakes could have severe consequences. I was very paranoid when working in that environment. We had multiple layers of high-fidelity testing with high coverage across unit, integration, e2e, and smoke tests. We were integrating MLOps practices of versioning data, ML models, documentation, and features before MLOps was a word and before any vendors or open-source tools existed in that area. We even adopted append-only storage for our tables so that we never, ever mutated or deleted data. That slowed us down, but overall it was a good thing because mistakes in that context were expensive to fix. I want the same level of care and paranoia to be present in engineers who handle my money, develop my medicine, and build my car.
Today I work for a last-mile logistics company where a mistake means that someone might get their food delivered late or not at all. If I brought that level of caution to my current employer, I would generate higher opportunity costs than benefits and should be fired, since I would not be able to ship product features with high velocity. Yes, mistakes will happen, but they are more than offset by the business impact of constantly improving the product. I’m ok with breaking something one out of ten times if there are processes in place that allow me to identify and mitigate the issue quickly. In this environment, testing is still relevant, but it should not consume the bulk of development time.
Try to understand whether you operate in a low- or high-risk environment and use that to guide the amount of time you spend on testing.
Prefer e2e tests first and unit tests last
In the past, I think folks were a lot more thoughtful about how they wrote their software. Compute was expensive and you were generally encouraged to get your code right on the first try. Modern IDEs and tooling have atrophied that skill, with most programmers often rewriting and refactoring the same code dozens of times until they think it looks good. Moreover, programming is a lot more collaborative today, with most of the code getting reviewed by multiple people who might have different opinions about how something should work. You can’t just push code unless it gets some eyes on it.
Because our habits of refactoring and writing code are so different from 20 years ago, a premature commitment to unit tests means that when you perform your rewrites, you’re also having to update the unit tests. Instead, my recommendation would be to frontload e2e tests and leave unit tests for last.
- Yes, I’m the person that opens PRs without tests because I want to get early feedback on my implementation before I waste time writing tests.
- Yes, I’m also the person who prefers to test my public methods and leave all the helper functions and private methods for last.
- And yes, I would sometimes not be willing to add tests and push code under a feature flag. After two or three rewrites, where I’m happy with how the code looks, I would then follow up with tests. Finally, I would turn the flag on.
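As a rough illustration of the last point, here is a minimal sketch of shipping a rewrite behind a flag. Everything in it is hypothetical: the `FEATURE_` environment-variable convention, the `flag_enabled` helper, and the ETA functions are invented for the example, and a real setup would more likely read flags from a dedicated flag service.

```python
import os


def flag_enabled(name: str, default: bool = False) -> bool:
    """Read a boolean flag from the environment (hypothetical FEATURE_<NAME> convention)."""
    raw = os.environ.get(f"FEATURE_{name.upper()}")
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def legacy_eta_minutes(distance_km: float) -> float:
    # Existing behaviour that already has tests: a flat 3 minutes per km.
    return distance_km * 3.0


def new_eta_minutes(distance_km: float) -> float:
    # Rewritten implementation, still changing in review and not yet tested.
    return 5.0 + distance_km * 2.5


def eta_minutes(distance_km: float) -> float:
    # Callers only see this function; the flag decides which path runs.
    if flag_enabled("new_eta"):
        return new_eta_minutes(distance_km)
    return legacy_eta_minutes(distance_km)
```

With the flag unset, callers keep getting the legacy path, so the untested rewrite can be merged and iterated on without affecting users. The tests for `new_eta_minutes` land once its shape has settled, and only then does the flag get turned on.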
Rely on regular builds to achieve reproducibility
What is wrong with this Dockerfile?
There are only three lines of code, but so many things can go wrong. Rebuilding a Dockerfile does not guarantee a reproducible environment. In our example, we pull the `latest` image for ubuntu, leading to unpredictable operating system dependencies. Moreover, when we install git, how do you know which version will get installed? Recently, at work, we had a deploy bug caused by an unexpected git dependency change. You just can’t trust anything.
Unfortunately, it is very hard to guarantee reproducible environments. This is particularly problematic in the Python ecosystem, especially if you don’t use a tool like Poetry that pins transitive dependencies in a lock file. You come back six months later to update your library and find that nothing works because some transitive dependency changed.
Nightly builds help alleviate this issue. On a daily cadence, your CI kicks off a job that builds your software artifact and runs the test suite against it. You can do this for both libraries and services. If the artifact fails to build, 99% of the time it is because of an unforeseen dependency drift that you should fix. At least now you learn about the breakage ahead of time, rather than scrambling to ship an emergency security patch for an artifact that completely fails to build and deploy.
Property-based testing is overrated
If you don’t know what property-based testing is, that’s good. Skip this section and don’t look back. If you’re a huge fan, I’m unlikely to convince you of my arguments since you’re probably putting far too much weight on functional programming and type theory.
My litmus test is that if a paradigm or framework does not bring me joy, I should move on. Property-based testing is one of those paradigms that I dislike and would use only if the consequences of mistakes in the code are high. I have written hundreds of property-based tests, and I just can’t stomach them anymore. For example, a lot of the code I write is mathy in nature, and I don’t want to just test the property that when I add two positive numbers A and B, the result C is ≥ both A and B. I want instead to test that when I add two numbers I get exactly the right result. Property tests take longer to write, run slower, are not always deterministic, and most importantly are much harder to read and reason about than simple unit tests.
Property testing is not completely worthless. For some critical libraries or applications in high-risk domains, it can be useful in catching bugs. Nonetheless, in product development, its ROI is marginal relative to simple unit tests.
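To make the addition example concrete, here is a small sketch that contrasts the two styles. It is a hand-rolled property check built on the standard library's `random` module, not a real property-testing library, and the names `buggy_add`, `holds_sum_property`, and `passes_exact_examples` are made up for the illustration.

```python
import random
from typing import Callable

Adder = Callable[[int, int], int]


def add(a: int, b: int) -> int:
    return a + b


def buggy_add(a: int, b: int) -> int:
    # Deliberately wrong: off by one.
    return a + b + 1


def holds_sum_property(fn: Adder, trials: int = 1_000, seed: int = 0) -> bool:
    """Property check: for positive a and b, fn(a, b) >= max(a, b)."""
    rng = random.Random(seed)  # fixed seed keeps the run deterministic
    for _ in range(trials):
        a, b = rng.randint(1, 10**6), rng.randint(1, 10**6)
        if fn(a, b) < max(a, b):
            return False
    return True


def passes_exact_examples(fn: Adder) -> bool:
    """Plain unit-test style: concrete inputs with exact expected outputs."""
    cases = [((2, 3), 5), ((10, 0), 10), ((7, 8), 15)]
    return all(fn(a, b) == expected for (a, b), expected in cases)
```

Both `add` and `buggy_add` satisfy the property, while only `add` passes the exact examples. A stronger property would catch the bug too, but finding and reading that stronger property is exactly the extra cost described above.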
Combine documentation and tests
As I mentioned previously, one reason why tests are high-liability code is that they are often less scrutinized than regular code. One way to perform better testing is to use literate programming. In literate programming, you combine the programming language with a documentation language. For example, consider the documentation for pydantic or fast.ai. The code shown in the documentation gets run whenever the documentation artifact is built, so you end up with both tests and documentation in one place. This practice raises how much attention you pay to your testing code and encourages a more thoughtful frame of mind for how you write tests, since you’re not simply testing your code, you’re proudly showcasing how it works.
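Python's standard-library `doctest` module is one lightweight way to get a similar effect, and the sketch below uses it to run the examples embedded in a docstring. This is not how pydantic or fast.ai build their documentation; `delivery_fee` and `run_doc_examples` are hypothetical names invented for the example.

```python
import doctest


def delivery_fee(distance_km: float, base: float = 2.0, per_km: float = 0.5) -> float:
    """Return the delivery fee for a trip.

    A four-kilometre trip costs the base fee plus the per-km charge:

    >>> delivery_fee(4)
    4.0

    A zero-distance trip still pays the base fee:

    >>> delivery_fee(0)
    2.0

    Negative distances are rejected:

    >>> delivery_fee(-1)
    Traceback (most recent call last):
        ...
    ValueError: distance_km must be non-negative
    """
    if distance_km < 0:
        raise ValueError("distance_km must be non-negative")
    return base + per_km * distance_km


def run_doc_examples(func) -> doctest.TestResults:
    """Execute the examples in func's docstring and return (failed, attempted)."""
    parser = doctest.DocTestParser()
    test = parser.get_doctest(func.__doc__, {func.__name__: func}, func.__name__, None, 0)
    runner = doctest.DocTestRunner(verbose=False)
    runner.run(test)
    return runner.summarize(verbose=False)
```

In a regular module you would more commonly call `doctest.testmod()` or run `python -m doctest yourfile.py` in CI. The small helper above only makes the mechanism explicit: the prose a reader sees and the assertions the build runs are the same text.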
Use A/B testing as a testing paradigm
Only after I joined a larger company did I learn the value of using experimentation as a testing paradigm. At large companies where a single team runs hundreds of experiments, effective teams work in something similar to 1–2 day sprints. They brainstorm an idea internally one day, push the code to a subset of users the following day, and then monitor for metric impact. If you’re pushing something obviously wrong to users, it will show up in your observability and A/B test metrics.
In an experimentation-driven workflow, the value of tests decreases because a lot of the changes you make will be reverted if their business impact is not justified. In this environment, do you want to spend more time writing comprehensive tests for features that might not even be fully shipped? Would you not rather ship 2x or 3x the number of features if you can partially skip some tests?
Note that running A/B tests doesn’t eliminate the need for tests altogether. You still write them, but the coverage is less comprehensive.
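For readers who have not seen how code gets pushed to only a subset of users, here is a minimal sketch of one common approach, deterministic hash-based bucketing. It is an assumption for illustration, not a description of any particular experimentation platform, and `in_treatment` and its parameters are hypothetical names.

```python
import hashlib


def in_treatment(user_id: str, experiment: str, rollout_pct: float) -> bool:
    """Deterministically assign a user to the treatment group.

    rollout_pct is a percentage between 0 and 100. Hashing the experiment
    name together with the user id keeps a user's assignment stable across
    requests and independent across experiments.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % 10_000  # integer bucket in 0..9999
    return bucket < rollout_pct * 100      # e.g. 10% covers buckets 0..999
```

Because the assignment is a pure function of the user and the experiment, reverting a change is just setting `rollout_pct` back to 0, which is what makes the lighter test coverage tolerable.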
Add more tests as iteration slows down or code matures
After you reach some maturity in your code, it is preferable to go back and add additional tests. Tests have two purposes: 1) to check for code correctness and 2) to prevent accidental regressions or bugs when you make code changes. When you iterate daily on a piece of code, you’re much more confident about its correctness because you’re probably testing/using the code a lot. Nonetheless, come back to your code six months later, and you’ll find that you don’t remember everything. The purpose of tests is to prevent yourself or new code owners from shooting themselves in the foot due to a lack of familiarity.
Lean on health checks
I think SRE and DevProd engineers have done a great job of introducing health checks that monitor the availability of a service. These health checks are used to perform automatic failover and recovery, balance load, or optimize resource usage. Nonetheless, one area that I think is underinvested in is health checks that continuously run tests in a production environment and proactively alert product teams on failures. The tests themselves are nothing more than “golden path” e2e tests that get triggered on a cadence, and any data associated with those tests gets scaffolded and purged at each run.
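The scaffold, exercise, purge, and alert cycle described above could be structured roughly as follows. This is a sketch under assumptions: `run_golden_path_check`, `CheckResult`, and the four callables are hypothetical, and in practice they would wrap your real API client, cleanup job, and paging integration, with a scheduler triggering the check on a cadence.

```python
import uuid
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CheckResult:
    name: str
    ok: bool
    error: Optional[str] = None


def run_golden_path_check(
    name: str,
    scaffold: Callable[[str], None],  # creates the test data for this run
    exercise: Callable[[str], None],  # walks the golden path, raises on failure
    purge: Callable[[str], None],     # removes everything the run created
    alert: Callable[[CheckResult], None],
) -> CheckResult:
    # A unique id tags the scaffolded data so purge can find it afterwards.
    run_id = f"healthcheck-{uuid.uuid4().hex[:8]}"
    try:
        scaffold(run_id)
        exercise(run_id)
        result = CheckResult(name, True)
    except Exception as exc:
        result = CheckResult(name, False, repr(exc))
        alert(result)  # notify the owning product team
    finally:
        purge(run_id)  # always clean up; a failing purge propagates to the caller
    return result
```

Keeping the four steps as plain callables means the same runner can serve every golden path a team cares about, and the `finally` block ensures test data is purged even when the check fails.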
Conclusion
It’s important to focus on testing for realism, by using real environments and data, and to consider the consequences of bugs or mistakes in your industry.