One of the things in software development that usually has a long feedback loop is integration testing. Many developers push their code and only later check whether the tests pass. What if we could cache the test results? Then, at least for most small changes, most tests wouldn’t need to run. We can only dream of the productivity and morale gains from pushing code and seeing CI go green within seconds.
As part of building my own programming language, I’ve been thinking about what it would take to achieve this dream. The compiler can determine what code changed and which tests depend on that code, in order to invalidate the cache. The missing piece to make this safe for unit tests is to forcefully make them deterministic. But we don’t have the same luxury for integration tests. If I use a database and upgrade the database version, the compiler of my programming language has no way of knowing that the tests related to that integration should run again.
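To make the invalidation idea concrete, here’s a minimal sketch of the kind of cache key I mean, written in Go purely for illustration (my language isn’t something you can run yet): hash the source of everything a test transitively depends on, and reuse the cached result only while that key stays the same. The function and its inputs are hypothetical, not an existing API.

package runner

import (
	"crypto/sha256"
	"encoding/hex"
	"os"
	"sort"
)

// CacheKey hashes every source file a test transitively depends on.
// If none of those files changed, the key is unchanged and the
// previously cached result can be reused.
func CacheKey(testName string, depFiles []string) (string, error) {
	sort.Strings(depFiles) // stable order so the hash is deterministic
	h := sha256.New()
	h.Write([]byte(testName))
	for _, path := range depFiles {
		src, err := os.ReadFile(path)
		if err != nil {
			return "", err
		}
		h.Write([]byte(path))
		h.Write(src)
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}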
The Go programming language suffers from this exact problem. There’s no automatic way, without programmer intervention or discipline, to distinguish between tests that rely on outside state and tests that don’t. So, even though there is test caching, most people I know who work with the language professionally always run all of their tests. They don’t leverage the cache at all.
So what would it take to cache integration test results without feeling that it erodes software quality? As you might’ve guessed, I don’t have a silver bullet. But if we look at the individual problems, we might get somewhere.
Problem 1: depending on external systems
Your code might be exactly the same, but the system you’re integrating with has changed. The example above of upgrading the database version falls into this category. A similar problem, which is far more common and far more likely to break your tests, is changing your database schema.
How about we declare those in a configuration file?
[integration.dependencies]
postgres = "16.0"
latest_db_update_file_name = "321.up.sql"
Then we could have specific tests depend on specific dependencies, and the test runner of the programming language would check the versions in the file against the versions the tests last ran with, invalidating the cached results on a mismatch. We could have dedicated CLI commands to set the versions, to make this easy to use.
Keeping this configuration up to date might require some discipline and/or CI work, but it might be easy enough that it becomes enjoyable to use.
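As a rough sketch of how the runner could use that section (again in Go for illustration; the struct and field names are invented, only the TOML keys come from the example above): compare the versions declared in the configuration with the versions recorded alongside each cached result, and drop the cache entry on any mismatch.

package runner

// IntegrationDeps mirrors the [integration.dependencies] section.
type IntegrationDeps struct {
	Postgres               string `toml:"postgres"`
	LatestDBUpdateFileName string `toml:"latest_db_update_file_name"`
}

// CachedResult is what the runner stored after the last run of a test.
type CachedResult struct {
	Passed bool
	Deps   IntegrationDeps // dependency versions the test ran against
}

// StillValid reports whether a cached result can be reused given the
// currently declared dependency versions.
func StillValid(cached CachedResult, current IntegrationDeps) bool {
	return cached.Passed &&
		cached.Deps.Postgres == current.Postgres &&
		cached.Deps.LatestDBUpdateFileName == current.LatestDBUpdateFileName
}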
But what about when a vendor updates their software? Well, if a third-party service you depend on, one you integrate with through calls to a REST API, makes changes to their production system… integration tests wouldn’t save you there. But we can always record a date for the version and re-run all the relevant tests when it changes.
[integration.dependencies]
last_known_stripe_api_update = "2024-05-03"
Not perfect. But, perhaps, good enough?
Problem 2: a test might be flaky
This one is trickier to get a holistic view of. Why is the test flaky? Is it because test order matters? Is it because two tests running concurrently interfere with each other? Is it because it waits for something with a timeout and the timeout is occasionally too short?
I think we definitely need control over concurrency. Some integrations have flows that only work if a certain action is preceded by another. We can only guarantee that if we’re not running other tests that might interfere with it. So, like the dependency versions, we need another sort of tag, one that carries a concurrency limit, and have the test runner enforce it.
[integration.concurrency]
message_queue_push_and_pull = 1
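One way a test runner could honor that limit, sketched in Go: keep a buffered channel per tag as a semaphore whose capacity is the configured number. The tag name and the limit come from the configuration above; the Limiter type and its methods are made up for this sketch.

package runner

// Limiter maps each concurrency tag to a buffered channel whose
// capacity is the limit from [integration.concurrency].
type Limiter struct {
	sems map[string]chan struct{}
}

func NewLimiter(limits map[string]int) *Limiter {
	l := &Limiter{sems: make(map[string]chan struct{})}
	for tag, limit := range limits {
		l.sems[tag] = make(chan struct{}, limit)
	}
	return l
}

// Run executes a test while holding a slot for its concurrency tag,
// so at most `limit` tests with that tag run at the same time.
// Tests without a known tag run unrestricted.
func (l *Limiter) Run(tag string, test func()) {
	if sem, ok := l.sems[tag]; ok {
		sem <- struct{}{}        // acquire a slot
		defer func() { <-sem }() // release it when the test finishes
	}
	test()
}

With message_queue_push_and_pull = 1 this degenerates to a mutex: those tests run one at a time while everything else stays parallel.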
That covers the concurrency scenario, but much more can be happening. We don’t want to run all of the tests all of the time, but maybe we can add some randomness here and run a few anyway? We could add a couple of mechanisms, leaving their configuration to the language user:
Pick N tests and invalidate their cached results
Pick 1 in X tests to be run twice in the same test run
[integration.run]
randomly_invalidate = 2
randomly_run_twice_one_in = 100
With this example configuration, on every test run:
two tests would have their previously cached results invalidated (and thus re-run)
each test would have a 1 in 100 chance of being run twice
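Here is a sketch, again in Go and with invented names apart from the TOML keys, of how the runner could apply those two knobs before a test run:

package runner

import "math/rand"

// RunConfig mirrors the [integration.run] section.
type RunConfig struct {
	RandomlyInvalidate    int `toml:"randomly_invalidate"`
	RandomlyRunTwiceOneIn int `toml:"randomly_run_twice_one_in"`
}

// Plan decides which cached results to throw away and which tests to
// run twice in the upcoming test run.
func Plan(cfg RunConfig, tests []string) (invalidate, runTwice map[string]bool) {
	invalidate = make(map[string]bool)
	runTwice = make(map[string]bool)

	// Invalidate the cached results of N randomly picked tests.
	n := min(cfg.RandomlyInvalidate, len(tests))
	for _, i := range rand.Perm(len(tests))[:n] {
		invalidate[tests[i]] = true
	}

	// Give every test a 1-in-X chance of running twice.
	for _, name := range tests {
		if cfg.RandomlyRunTwiceOneIn > 0 && rand.Intn(cfg.RandomlyRunTwiceOneIn) == 0 {
			runTwice[name] = true
		}
	}
	return invalidate, runTwice
}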
A bit of chaos helps everyone to sleep better at night.
Once these mechanisms are actually in use, I’ll probably come across things that don’t feel as good in practice as I hoped, and I’ll iterate from there. As of today, this sounds good enough to me for an MVP. Tomorrow I might have changed my mind.