There are plenty of resources on the internet about testing in software engineering. However, as a data scientist, the code you need to test is often very different:
- The functions often take and return complicated data structures like dataframes, arrays, tensors, etc.
- The code is often very slow to run (e.g. a model that takes hours to train)
- Results of a function can be non-deterministic (e.g. a random forest model or an API call to an ML service)
- The code is often very coupled to the data (e.g. a function that does preprocessing of a dataframe)
- The code is often very coupled to the model (e.g. a function that trains a model)
- We often need to test the whole pipeline (e.g. a function that trains a model and then evaluates it)
Note that this article is about functional testing, not model evaluation. The goal of testing is to make sure the code works and keeps working. The goal of evaluation is to make sure the model is good enough for the business case.
Be critical
In the software testing world there are a lot of principles that software purists idealize. The reality is that in many cases following these principles is impractical or overkill. Examples:
- 100% test coverage
- Test driven development
- Unit testing (that is totally independent - with everything mocked)
The question to ask is: can we achieve what these principles are meant to achieve in a more pragmatic way?
You should avoid getting lost in the rules and remember the goals of testing:
- Make sure the code works
- Make sure the code keeps working
- Make sure the code is easy to change
My goal is to get these benefits with as little effort as possible.
It is all about habits
I have found that the most important thing is to change your habits. You are probably already testing your code, but you are doing it in a very inefficient way. Data scientists often test their code by running the whole notebook/script and then looking at the results. This is great for initial exploration, but it is not a good way to test production code.
Imagine you are writing a script that loops through documents, processing one document at a time. After hours of running, it fails on some edge case. Now you need to rerun the whole script to see whether you fixed the bug. This is a very slow feedback loop.
Instead, you should isolate the code that failed and write a test that reproduces the error. Then you can run the test in seconds and get a fast feedback loop. This is a much better way to work. Or, even better, you could have used the test to help develop the code in the first place.
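As a minimal sketch of what that looks like with pytest, assume a hypothetical `process_document` function that crashed on a document with an empty body:

```python
# A minimal sketch, assuming a hypothetical `process_document` function that
# crashed on documents with an empty body. The test reproduces the bug in
# isolation, so the feedback loop is seconds instead of hours.
from my_pipeline import process_document  # hypothetical module


def test_process_document_handles_empty_body():
    edge_case = {"id": "doc-123", "body": ""}  # the document that broke the run
    result = process_document(edge_case)
    assert result is not None
```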
So the habits you want to build are:
- Instead of testing the code by running the full script or pipeline - create a test
- When you find a bug - write a test for it
- If you need to test a function and the feedback loop is slow - isolate out the function and test it separately
- If you need to test a function that is slow to run - make it faster to run (e.g. by using a smaller dataset)
- If you need to test a function that is non-deterministic - make it deterministic (e.g. by setting a random seed or mocking the partial results) - see the sketch after this list
- Mock out external dependencies (e.g. databases, APIs, etc.) when doing component integration testing
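Here is a sketch of the "make it fast and deterministic" habit: fix the random seed and use a tiny dataset so the test runs in seconds. `train_model` is a hypothetical function standing in for your own training code, and I'm assuming it accepts a seed.

```python
# A sketch of making a slow, non-deterministic step testable:
# fix the random seed and shrink the dataset so the test runs in seconds.
# `train_model` is a hypothetical function from your own code base.
import numpy as np
import pandas as pd

from my_pipeline import train_model  # hypothetical


def test_train_model_is_deterministic_on_tiny_dataset():
    rng = np.random.default_rng(seed=42)
    tiny_df = pd.DataFrame({
        "feature": rng.normal(size=20),
        "target": rng.integers(0, 2, size=20),
    })
    model_a = train_model(tiny_df, random_state=0)  # assumed to accept a seed
    model_b = train_model(tiny_df, random_state=0)
    # With the seed fixed, two runs should give identical predictions.
    preds_a = model_a.predict(tiny_df[["feature"]])
    preds_b = model_b.predict(tiny_df[["feature"]])
    assert (preds_a == preds_b).all()
```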
Why testing is worth it
Testing is worth it if your code needs to live long and/or run in production. In some companies the data scientist only does exploratory work and the code is then rewritten by a software engineer. However, as a full-spectrum data scientist you will need to write code that is production ready.
You should strive for the balance between the effort you put into testing and the value you get out of it.
Basically, it makes your coding life easier. You get faster feedback on whether your code is working, changing the code later isn't a headache, and anyone new can jump in and understand the project faster.
Don’t write tests for what your linter or type checker can do
If you are using a linter or type checker, you should not write tests for what they can already check. That is a waste of time and it makes the code harder to change. If you are not using a linter or type checker, you should probably invest the time to learn to use them. They give you a lot of quality assurance (almost) for free.
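For instance, with type hints a checker like mypy enforces the interface for you, so a test that asserts "this raises when given the wrong type" adds nothing. A made-up example, assuming a dataframe with a `text` column:

```python
# With type hints, mypy already rejects calls like add_features(["not", "a", "df"]),
# so there is no need for a test that checks the argument type.
# `add_features` and the `text` column are made up for this sketch.
import pandas as pd


def add_features(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    out["text_length"] = out["text"].str.len()
    return out
```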
Making tests dependent
If you have two steps in your ML pipeline that depend on each other, there is a benefit to making the tests dependent as well. This is a bit controversial in the software testing world, but it makes a lot of sense in ML.
The reason why you want tests to be independent is to make it easier to isolate exactly where the root cause of the bug is, ultimately so you can fix the bug faster.
With a bit of clever design and use of tools you can still isolate the bug even if the tests are dependent.
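One way to make the dependency explicit is the pytest-dependency plugin: if the upstream test fails, the downstream test is skipped, so the report points straight at the root cause. A sketch, with `preprocess` and `train_model` as hypothetical functions and a made-up fixture file path:

```python
# A sketch of dependent tests using the pytest-dependency plugin
# (pip install pytest-dependency). If test_preprocess fails, test_train_model
# is skipped instead of failing noisily, which isolates the root cause.
import pytest

from my_pipeline import preprocess, train_model  # hypothetical


@pytest.mark.dependency()
def test_preprocess():
    df = preprocess("tests/data/tiny_sample.csv")  # made-up fixture path
    assert not df.empty


@pytest.mark.dependency(depends=["test_preprocess"])
def test_train_model():
    df = preprocess("tests/data/tiny_sample.csv")
    model = train_model(df)
    assert model is not None
```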
Sounds great? Let’s do it!
Snapshot testing to the rescue
In a previous project I needed to learn modern front-end development, where snapshot testing is a big thing. I was very skeptical at first, but it turned out to be a great tool. I have since used it in my ML projects as well.
Snapshot testing ensures the immutability of the output of a function (even for complex data structures - yay!)
The idea is that you save the output of a function to a file and then compare it to the output the next time you run the test. If the output is the same, the test passes; if not, it fails. This is a very simple way to test that the code works and keeps working.
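A hand-rolled sketch for a dataframe-returning function is below; dedicated snapshot libraries such as syrupy offer the same idea with more convenience. `preprocess` and the file paths are assumptions for this sketch.

```python
# A hand-rolled snapshot test for a dataframe-returning function.
# On the first run the snapshot file is written; on later runs the output
# is compared against it, so any change in behaviour fails the test.
from pathlib import Path

import pandas as pd
from pandas.testing import assert_frame_equal

from my_pipeline import preprocess  # hypothetical

SNAPSHOT = Path("tests/snapshots/preprocess_output.csv")  # made-up path


def test_preprocess_matches_snapshot():
    result = preprocess("tests/data/tiny_sample.csv")
    if not SNAPSHOT.exists():
        # First run: record the current output as the snapshot.
        SNAPSHOT.parent.mkdir(parents=True, exist_ok=True)
        result.to_csv(SNAPSHOT, index=False)
    expected = pd.read_csv(SNAPSHOT)
    assert_frame_equal(result.reset_index(drop=True), expected, check_dtype=False)
```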
Do not do unit testing
In one of my previous projects I had two colleagues. One spent his time writing unit tests; the other just ran the full code every time and tested it manually. Interestingly, the colleague who did not write unit tests was much more productive, and his code broke less when I was integrating their code and doing system testing.
Principle: if you have to choose between writing unit tests or integration tests, choose integration tests.
Relying only on manual testing gives you long feedback loops and is not a good idea. You should strive for a fast feedback loop.
Avoid full code coverage
In the software testing world there is a lot of talk about 100% code coverage. For ML code, chasing full coverage is rarely worth the effort. Isolate out the parts that make sense to test and leave the rest.
If you do want to gain code coverage fast, component integration tests can quickly get you there. They are easy to write and they cover a lot of code. High coverage from such tests is nice because they fail quickly if you have incorrect sequencing, interface calls, communication, or assumptions about the data being passed between components.
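A sketch of such a component integration test: run preprocessing, training and evaluation together on a tiny dataset, so one test exercises the interfaces between the steps. All function names here are hypothetical.

```python
# A component integration test covering preprocessing, training and evaluation
# in one go on a tiny dataset. It checks that the steps fit together, not that
# the model is good. All functions and paths are hypothetical.
from my_pipeline import preprocess, train_model, evaluate  # hypothetical


def test_pipeline_end_to_end_on_tiny_dataset():
    df = preprocess("tests/data/tiny_sample.csv")
    model = train_model(df, random_state=0)
    metrics = evaluate(model, df)
    # A loose sanity check on the output shape, not a model-quality evaluation.
    assert 0.0 <= metrics["accuracy"] <= 1.0
```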
Don’t mock
I like to differentiate between component integration testing and system integration testing. I only mock when doing component integration testing, and only to mock out external dependencies such as databases, APIs, etc.
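As a sketch of what that mocking looks like, assume a hypothetical `enrich_documents` function that calls an external ML service through a client named `ml_client`; the client is stubbed out with unittest.mock so the test is fast, deterministic and needs no network access.

```python
# A sketch of mocking an external dependency during component integration
# testing. The ML service client is replaced with a stub returning a canned
# response. `enrich_documents` and `ml_client` are hypothetical names.
from unittest.mock import patch

from my_pipeline import enrich_documents  # hypothetical


def test_enrich_documents_with_mocked_ml_service():
    fake_response = {"label": "positive", "score": 0.93}
    with patch("my_pipeline.ml_client.predict", return_value=fake_response):
        enriched = enrich_documents([{"id": "doc-1", "body": "some text"}])
    assert enriched[0]["label"] == "positive"
```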