Sponsored Link
Looking for a unified way to view, monitor, and debug Playwright Test Automation runs?
Try the Playwright Dashboard today. Use the coupon code PWSOL10 for 10% off for 12 months.

A Few Thoughts On Flakey Tests

When a new test automation project starts, there are really three buckets of work that happen: building (creating new tests), maintaining (updating existing tests as the application changes), and monitoring (keeping an eye on automation runs, investigating failures, and surfacing findings to the team).

"Flakey" tests really come into play within the "monitoring" of the automation runs. If a test is flakey it's always important to dive in and understand why the original test failed... This is important, if this step isn't being taken by people, there is a potential for big risks.

Below are a few things I've experienced with flakiness in my own context; this list is by no means comprehensive.

What I Consider Flakey Tests

  • Tests that fail because they got stuck in a bad state. For example, you expected to be logged in but were in a logged-out state, or another test/user touched the data you were asserting on.
  • Tests built on poor test data. For example, you sent an 11-digit phone number, or a timestamp with the wrong UTC offset because the machine running your automation is in a different timezone than your local machine.
  • Tests with a race condition in the automation code, where assertions run before the data has loaded on the page (see the sketch right after this list).
  • Tests that time out because await syntax wasn't implemented properly.
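
On the race-condition bullet in particular, Playwright's auto-retrying web-first assertions go a long way. A minimal sketch, where the dashboard URL and order-count test id are hypothetical:

import { test, expect } from '@playwright/test';

test('order count shows once the dashboard data loads', async ({ page }) => {
  await page.goto('https://staging.example.com/dashboard');

  // expect() keeps retrying this assertion until it passes or the timeout is reached,
  // so the test isn't racing the data request that populates the page.
  await expect(page.getByTestId('order-count')).toHaveText('42 orders');
});
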
It is critical that you investigate each failure.

For each of these scenarios, just because a test fails every so often doesn't mean that it is a flakey test. It is critical that you investigate each failure and confirm the issue lies with some other factor rather than the application under test. It is possible that the underlying issue causing the test to fail is actually a real bug that needs to be resolved. One example I can think of: a test would pass at 6:00 PM but fail at 11:00 PM, due to a timezone bug in the application.

How Do You Handle Flakey Tests?

Building your tests in a way where they turn out flakey is really easy to do (I've done it on a few projects, even when I was aware of the risk and trying not to). Test data and state are the biggest offenders I have run into, and both must be considered at the start of and throughout a test automation project.

If your main issue is with state caused by specific cookies, the Playwright team released new functionality for clearing specific cookies, which you can find in the 1.43 release notes.
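
As a rough sketch of what that can look like in a test, here is targeted cookie clearing; the staging URL and the session_id cookie name are placeholders for whatever your app actually uses:

import { test } from '@playwright/test';

test('starts from a clean session', async ({ page, context }) => {
  await page.goto('https://staging.example.com');

  // Since Playwright 1.43, clearCookies() accepts name/domain/path filters,
  // so you can remove just the stale session cookie instead of all cookies.
  await context.clearCookies({ name: 'session_id' });

  // ... continue the test from a known logged-out state
});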

When you do have flakey tests, determine whether the test is still valuable. If it isn't valuable, delete it! If it is, either fix it or move it into a set of tests that runs less frequently and doesn't block builds.

If the test isn't valuable delete it!

One way I have handled keeping flakey tests around is through Playwright tags. I specifically use @unsatisfactory as the tag name, since the test is failing me. I have around 70 API tests at my day job categorized this way, and we run them once a week to give us feedback, but not on every build. Running npx playwright test --grep-invert @unsatisfactory executes all your specs except the ones tagged @unsatisfactory in your main CI pipeline.

If you are looking to implement tags, do check out the 1.42 release notes for the newest way to create tags in your tests. You no longer have to add the tag in the test title (though you still can if you'd like).

import { test, expect } from '@playwright/test';

test('flakey test', {
  tag: '@unsatisfactory',
}, async ({ page }) => {
  // ...
});

test('solid test', async ({ page }) => {
  // ...
});
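
If you'd rather bake that filter into your config instead of passing the CLI flag every time, the grepInvert option in playwright.config.ts does the same thing. A minimal sketch, assuming you stick with the @unsatisfactory tag name:

import { defineConfig } from '@playwright/test';

export default defineConfig({
  // Skip anything tagged @unsatisfactory in the default run;
  // equivalent to: npx playwright test --grep-invert @unsatisfactory
  grepInvert: /@unsatisfactory/,
});

A separate weekly CI job can then flip it around with --grep @unsatisfactory to get that once-a-week feedback run.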

What I Consider Flakey Environment/Infrastructure

  • Did a third-party service fail?
  • Did a service/API call hit a rate limit?
  • Is the test passing every other run because we only have the latest code deployed to half of the running servers?
  • Did a container crash because the test environment runs on the cheapest server possible?
  • Did the automation run during a chaos test or load test?
  • Did a developer make a code push and cause the environment to recycle?
  • Did the data team hammer the database to rebuild their data warehouse?
  • Is AWS experiencing an outage in your region?

There are many different scenarios where your environment or infrastructure could cause tests to come back as "failed". It is critical in these scenarios, just as with the flakey tests above, that you investigate each failure and conclude that there isn't a real issue. Don't just make an assumption; dig into the failure messages, network traces, and infrastructure logs/dashboards to confirm why the test actually failed.

I've had a few scenarios where tests just started intermittently failing. After the first day I was convinced it was a flakey infrastructure problem, but on day two I decided to look a little deeper. I checked our pganalyze tool, which gives insight into our database queries, and clearly saw the issue was a new query taking 20x as long to complete. After working with a developer to add an index to the table, we fixed the bug before it made it to production!

If you have accepted that an endpoint or step may fail occasionally but you still want to ensure it passes at least once out of X attempts, you can try expect.poll() or expect(...).toPass() in your tests. I've found the best place to use these is when creating test data prior to the tests running, because using them on your actual test assertion steps can hide underlying problems with your system.
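
As an example, here is a rough sketch of retrying a test-data setup call with expect.poll(); the /api/test-data endpoint and payload are made up for illustration:

import { test, expect, request } from '@playwright/test';

test.beforeAll(async () => {
  const api = await request.newContext({ baseURL: 'https://staging.example.com' });

  // expect.poll() re-runs the callback until it returns the expected value or times out,
  // which absorbs the occasional failed attempt during data setup.
  await expect
    .poll(async () => {
      const response = await api.post('/api/test-data', { data: { user: 'flakey-test-user' } });
      return response.status();
    }, { timeout: 30_000 })
    .toBe(201);

  await api.dispose();
});

expect(async () => { ... }).toPass() works the same way when you want to retry a whole block of assertions.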

You can also make tests less flakey by mocking certain requests and/or responses. The Playwright docs give lots of good examples!
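
As a rough sketch, you can stub a dependency you don't control so it can't flake your UI test; the /api/recommendations endpoint and dashboard page here are hypothetical:

import { test, expect } from '@playwright/test';

test('renders without hitting the real recommendations service', async ({ page }) => {
  // Intercept the request and fulfill it with a canned response.
  await page.route('**/api/recommendations', async (route) => {
    await route.fulfill({
      status: 200,
      contentType: 'application/json',
      body: JSON.stringify({ items: [] }),
    });
  });

  await page.goto('https://staging.example.com/dashboard');
  await expect(page.getByRole('heading', { name: 'Dashboard' })).toBeVisible();
});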

Guard Against Introducing Flakey Tests Into Your Project

Ideally you don't want to introduce flakey tests into your repo in the first place. The best way to guard against this is to run new or changed tests multiple times before the code gets merged into your test repository. The article below walks you through how to loop your test runs to check for flakiness before merging into your main branch.

In Playwright Test is there an easy way to loop the test multiple times to check for flakiness?
I’ve found it to be a good practice to run new automated tests or edits to existing tests a few times on a pull request to check for flakiness prior to merging the new code into the main branch of the test suite. To do this you can build out…
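
If you just want the quick version: Playwright can repeat each test for you, either with npx playwright test --repeat-each=10 on the command line or via the repeatEach config option. A minimal sketch using a separate config file (the file name here is hypothetical):

// playwright.flaky-check.config.ts — a dedicated config just for the pre-merge check
import { defineConfig } from '@playwright/test';

export default defineConfig({
  // Run every test 10 times with zero retries so any flakiness surfaces as a failure.
  repeatEach: 10,
  retries: 0,
});

Point your pre-merge job at it with npx playwright test --config=playwright.flaky-check.config.ts and limit it to the new or changed spec files.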

Something I think we can all agree on:

  • Flakey Pie Crust = GOOD
  • Flakey Test = BAD

Thanks for reading! If you found this helpful, reach out and let me know on LinkedIn or consider buying me a cup of coffee. If you want more content delivered to your inbox, subscribe below, and be sure to leave a ❤️ to show some love.