How to Identify and Fix Anomalies
- By Datamago team
- Published March 23, 2022
- 7 min read
We’re often faced with anomalies in our historical data for any number of reasons: recurring events, internal or external factors, or simply noise. COVID is a recent example that has had nearly universal impact. Whatever the cause, consistent data is the key to successful forecasting, so any inconsistencies should be removed or explained. In this post we’ll discuss 4 simple yet powerful techniques you can apply today with Datamago to improve your forecasts. They include:
- Trimming: slice data from the beginning and/or end of the dataset.
- Manual adjustments: select individual data points and assign new values.
- Historical forecasting: replace sections of historical data with predicted values.
- Special events: explain anomalies due to recurring events.
To illustrate each technique we’ll use a fictional restaurant’s sales data with some anomalies due to a faulty payment system that misreported totals on some days, a convention that boosted local tourism, and a building renovation project during which the restaurant offered takeout only. These anomalies are circled in red. Note that real data can be messier than this example, and you may need to experiment with a combination of the 4 techniques to achieve the best forecast.
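If you’d like to follow along outside the app, here’s a minimal sketch in Python of a comparable toy dataset. The numbers and dates are invented, not the actual example data; they just give the later snippets something concrete to run on.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly restaurant sales with yearly seasonality. Made-up
# numbers standing in for the example data in the charts.
rng = np.random.default_rng(0)
dates = pd.date_range("2013-01-01", "2021-12-01", freq="MS")
base = 10_000 + 2_000 * np.sin(2 * np.pi * (dates.month - 1) / 12)
sales = pd.Series(base + rng.normal(0, 300, len(dates)), index=dates, name="sales")

# Inject anomalies like the ones described above (dates are invented):
sales.loc["2013-02-01":"2013-09-01"] *= 0.7   # faulty payment system misreports
sales.loc["2015-05-01"] *= 1.5                # convention boosts local tourism
sales.loc["2018-03-01":"2018-07-01"] *= 0.5   # renovation: takeout only
```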
How to identify anomalies
Unlike outliers (which Datamago automatically detects and warns you about), some anomalies aren’t extreme enough to be flagged, so you may have to use your best judgment when you see values that look out of place or deviate from the historical pattern.
If you’re unsure, you can cover the potential anomaly and study the values around it. Pretending you have no prior knowledge about the dataset, ask yourself if you would have been able to foresee it. If not, one of the techniques in this article will help you.
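To make the “cover it and study its neighbors” test concrete, here’s a rough heuristic applied to the toy series above: compare each point to a rolling median of its neighbors and flag large deviations. The window and cutoff are arbitrary tune-by-eye choices, and this is not how Datamago’s own detection works internally.

```python
# Flag points that sit far from the local level.
window = 13  # roughly a year of monthly points, centered on each one
local_median = sales.rolling(window, center=True, min_periods=3).median()
deviation = (sales - local_median).abs()
suspects = sales[deviation > 3 * deviation.median()]  # arbitrary cutoff
print(suspects)
```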
Trimming
Trimming is especially useful for anomalies near the beginning of the dataset, as long as you can remove them without slicing away too much data (more than a third of the dataset, for example).
Slicing the end of the dataset is trickier, since you usually want to start the forecast from the latest date available. For this scenario, historical forecasting (more on that technique below) is usually the better option. If you still prefer to trim the end, you can create a forecast from the trimmed data, download it, replace the values at the end of the original dataset with the forecasted values, and resubmit the file.
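In code terms, trimming is just slicing by date. Continuing the toy series, here’s a sketch with a hypothetical cutoff that drops the payment-system glitch at the start:

```python
# Drop the distorted first year, keeping well over two-thirds of the
# history so seasonality is still learnable.
trimmed = sales.loc["2014-01-01":]
print(f"kept {len(trimmed)} of {len(sales)} points")
```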
Manual adjustments
Anomalies that occur once or very few times can be adjusted manually. You’ll have to use your knowledge of the dataset or domain expertise to pick a new value. Either way, it’s important for the new value to be consistent with the historical data.
You can manually adjust values in the data editor which is located in the forecast configuration sidebar:
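Outside the editor, the same adjustment is a single assignment. Continuing the toy series, one reasonable (but entirely judgment-based) replacement for the hypothetical May 2015 spike is the average of the surrounding Mays:

```python
# Replace the May 2015 spike with a value consistent with nearby history.
sales_adjusted = sales.copy()
replacement = (sales.loc["2014-05-01"] + sales.loc["2016-05-01"]) / 2
sales_adjusted.loc["2015-05-01"] = replacement
```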
Historical forecasting
This technique replaces sections of historical data with predicted values. Anomalous events can occur for periods of time (the impact of COVID on historical data is a good example) that may not be representative of the future. For these scenarios, historical forecasting can help correct groups of anomalous values so that they’re not reflected in the forecast.
Important note: the historical forecast is applied in conjunction with the trimming and manual adjustments above, so that the historical data is as clean as possible.
Before vs. after
We can see that leaving the anomalous section untouched has an outsized impact on the forecast. This is fine if the conditions that caused the dip are expected to occur again in the future, but for the purposes of this example that’s not what we want.
Applying a historical forecast to replace the anomalous section with predictions results in cleaner historical data and a forecast that reflects the overall seasonality and trend.
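Datamago applies historical forecasting for you in-app, but to make the idea concrete, here’s one way to approximate it yourself on the toy series; the model choice and window are illustrative, not what Datamago uses internally:

```python
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Fit on the clean history only: trimmed start, manually adjusted May 2015,
# and everything before the renovation dip.
clean_history = sales_adjusted.loc["2014-01-01":"2018-02-01"]
model = ExponentialSmoothing(
    clean_history, trend="add", seasonal="add", seasonal_periods=12
).fit()

# Overwrite the anomalous window with the model's predictions.
repaired = sales_adjusted.copy()
repaired.loc["2018-03-01":"2018-07-01"] = model.forecast(5).values
```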
Special events
If there are anomalies due to recurring events or holidays, creating special events will help explain the past behavior so that the forecast takes their effect into account on the right dates.
To illustrate, we’ll undo the manual adjustment we previously made to the anomaly in the middle of the dataset (May 2015) and pretend that it’s an event that is expected to occur again during the forecast period (May 2022, for example). We’ll keep the trimming and historical forecast so that the rest of the data is clean.
Here’s how it’s done; notice that we select both May 2015 (the date the event occurred in the past) and May 2022 (the date it’s expected to occur again in the future).
The updated forecast now reflects the expected impact of the event in May 2022:
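Conceptually, a special event acts like an indicator variable the model can learn a lift from. A hand-rolled stand-in (Datamago’s internals may differ) marks the event on both its past and expected future dates:

```python
# 1 on months where the event occurs, 0 elsewhere. A model given this
# column can attribute the May 2015 spike to the event and apply a
# similar lift in May 2022, rather than treating the spike as noise.
horizon = pd.date_range("2013-01-01", "2022-12-01", freq="MS")
convention = pd.Series(0, index=horizon, name="convention")
convention.loc[pd.to_datetime(["2015-05-01", "2022-05-01"])] = 1
```

The key point is the same in either form: the event must be marked both in the history and in the forecast horizon.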
Special event impact
Consistency is just as important for special events as it is for the rest of the historical data. To illustrate, let’s imagine there’s a convention every 2-3 years that brings in more foot traffic (a real-life example is the Zurich Festival, which takes place every 3 years). The dates the convention took place are circled in the graph below. As you can see, the impact on sales has been mostly consistent over time, except for the second-to-last occurrence. This could be due to any number of internal or external reasons, but without further information it’s difficult to estimate the convention’s future impact on sales.
We have a few options to address this problem depending on whether we consider the outsized impact to be an anomaly:
- If it’s an anomaly, we can simply lower the value manually so that it has approximately the same impact as the others.
- If not, and we expect the convention to have another relatively large impact on sales during the forecast period, we can create two separate events: one for the three less impactful dates and another for the high-impact date. Then we can specify on which date we expect the high- or low-impact event to occur during the forecast (see the sketch after this list).
- Include a variable that explains the difference in variance. Continuing the example of a local convention: if the convention center provided historical data on the number of visitors, we could use that as a variable. And if they provided a visitor forecast, we could plug that in to improve our own (but Datamago forecasts the variables by default if future values aren't provided).
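Here’s a sketch of those last two options on the toy timeline; all dates and visitor counts below are invented for illustration:

```python
# Option 2: two separate indicators, split by impact level.
convention_low = pd.Series(0, index=horizon, name="convention_low")
convention_low.loc[pd.to_datetime(["2013-06-01", "2016-06-01", "2021-06-01"])] = 1

convention_high = pd.Series(0, index=horizon, name="convention_high")
convention_high.loc[pd.Timestamp("2018-06-01")] = 1  # the outsized occurrence

# Option 3: a numeric regressor instead, if the convention center shares
# visitor counts; future values could come from their own forecast.
visitors = pd.Series(0, index=horizon, name="visitors")
visitors.loc[pd.to_datetime(
    ["2013-06-01", "2016-06-01", "2018-06-01", "2021-06-01"]
)] = [18_000, 19_500, 41_000, 18_700]
```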