Variables - When and How to Use Them | Forecasting Best Practices

The most common way to start creating your forecast model is by deciding on the first variable you’d like to forecast, called the “target” variable. Once you’re ready to experiment and improve your single variable forecast, try adding more variables!

If used purposefully, additional variables can help explain fluctuations in the target and assist in predicting the future. However, gathering the extra data can be time-consuming, with no guarantee of improved results. Having an intuition for how variables work, how to choose them, and when to use them will save you time and increase the likelihood of success.

What are variables?

Variables can take on different forms, but in the context of forecasting with Datamago we’re referring to a numerical representation of the underlying conditions impacting the target.

Common examples include ad spend, economic indicators (e.g. GDP, currency exchange rates), average order value, number of unique visitors, weather, etc. The options are practically limitless so we’ll discuss later in this article how to narrow down which ones to use.

When you may not need them

The past is usually the best indicator of the future. This means that your target’s historical data is the primary source for the forecast—variables simply provide secondary information.

Since historical patterns are a reflection of every internal and external factor that influenced them, you can create accurate forecasts without variables if a) the historical pattern is consistent and b) you can assume the underlying conditions will continue along their current trajectory in the future.

The level of atmospheric CO2 measured at Mauna Loa, Hawaii [1], is a good example. Given its highly stable trend and seasonality, variables aren’t needed to create a forecast as long as you can assume the underlying conditions won’t shift.
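To make the idea of forecasting from history alone concrete, here is a minimal sketch of a seasonal-naive forecast with drift. It is not Datamago's method, just a simple stand-in; the function name and the quarterly numbers are made up for illustration.

```python
def seasonal_naive_with_drift(history, season_length, horizon):
    """Repeat the last full season, shifted by the average
    season-over-season change seen in the last two seasons."""
    if len(history) < 2 * season_length:
        raise ValueError("need at least two full seasons of history")
    last = history[-season_length:]
    prev = history[-2 * season_length:-season_length]
    # Average change from one season to the next (a crude trend estimate).
    drift = sum(l - p for l, p in zip(last, prev)) / season_length
    forecast = []
    for step in range(horizon):
        seasons_ahead = step // season_length + 1
        forecast.append(last[step % season_length] + drift * seasons_ahead)
    return forecast


# Illustrative quarterly series: a repeating pattern that rises by 2 each year.
history = [10, 14, 12, 9, 12, 16, 14, 11]
forecast = seasonal_naive_with_drift(history, season_length=4, horizon=4)
print(forecast)  # [14.0, 18.0, 16.0, 13.0]
```

Because the pattern is perfectly regular, no extra variable is needed here. The approach breaks down as soon as the history contains irregular jumps.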

When they're useful

Variables are useful for explaining the target’s historical variance when a forecast can’t be made accurately by only looking at the past—either due to irregular variance or a lack of data (learn more about data predictability here).

To illustrate how a variable can help with irregular variance, imagine a retail store whose sales go up when local conventions take place, but the convention dates change from year to year. Since the pattern is inconsistent, sales couldn’t be reliably forecast without explaining the variance due to the conventions. This could be done by including a column in your dataset with, for example, the number of events or the convention center’s visitor count.

When there’s not enough historical data, correlated variables can help reinforce trend and/or seasonality, thereby serving as a guide for the target’s forecast. To illustrate, let’s assume the retail store from the previous example only recently opened and has just 6 months of sales data, which isn’t enough to learn and forecast any sort of long-term, repeatable pattern. But if the retail store were in a tourism destination and its sales were correlated with the sector, it could use tourism arrivals as a variable to help determine what sales may look like in the future.

Finally, whether or not there’s irregular variance or a lack of historical data, you may want to include variables in order to run post-forecast analysis like variable optimization, what-if scenarios, and impulse response (blog posts coming soon on these topics).

How to include variables in a forecast

You can create a variable by adding a numerical column to your dataset (e.g. by copying and pasting it in). To see how the file should be formatted, click the ‘New’ button at the top of the home page, then select ‘view an example’. After selecting the file, you can select and deselect variables as you see fit.

Note: The variables’ future values are also required; however, Datamago auto-forecasts each variable by default. You can also submit forecasts from another source by scrolling to the bottom of the ‘advanced settings’ in the forecast configuration sidebar.
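If your data lives outside a spreadsheet, a variable column can also be added with a few lines of code. The sketch below builds on the convention example from earlier. The column names, dates, and values are hypothetical, and the layout is only an assumption, so check the in-app example for the format Datamago actually expects.

```python
import csv
import io

# Hypothetical monthly sales and the number of conventions held each month.
sales = {"2023-01": 120, "2023-02": 135, "2023-03": 210, "2023-04": 128}
conventions = {"2023-03": 2}  # months not listed had no conventions


def build_rows(target, variable):
    """Return one row per period: date, target value, variable value."""
    rows = []
    for period in sorted(target):
        # A variable needs a number in every row, so missing months become 0.
        rows.append([period, target[period], variable.get(period, 0)])
    return rows


rows = build_rows(sales, conventions)

# Write the rows to CSV text (swap io.StringIO for open(...) to save a file).
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["date", "sales", "convention_count"])
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```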

How to decide which variables to keep

When tasked with a forecast, it’s common to start thinking about all of the potential factors influencing your data. It can feel daunting to not only come up with a list of compelling variables, but to then go out and actually collect the data. To avoid decision paralysis, we recommend keeping it simple: start with one or two and add more as needed.

To narrow down which variables are the most important, look at how each one affects validation performance. If performance improves with the variable included, keep it. If not, discard it.

You can see the performance score by clicking the ‘Performance’ menu option on the left-hand side of the forecast page. If you have multiple variables, you can reference the ‘Variable importance’ section, which is accessible from the top of the Performance page. Variable importance estimates the performance scores for different variable combinations; actual results will vary once variables are added or removed.
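The keep-or-discard rule can also be sketched outside the product. The example below is only an illustration under simple assumptions: the data is made up, the "model" is a plain least-squares line rather than whatever Datamago uses internally, and accuracy is measured with MAPE (mean absolute percentage error) on a held-out validation period.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    errors = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    return 100 * sum(errors) / len(errors)


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Made-up sales and a 0/1 flag marking periods with a local convention.
sales = [100, 102, 150, 101, 99, 148, 103, 152]
events = [0, 0, 1, 0, 0, 1, 0, 1]

split = 6  # first 6 periods for training, last 2 for validation
train_y, valid_y = sales[:split], sales[split:]
train_x, valid_x = events[:split], events[split:]

# Without the variable: predict the training average for every period.
baseline_pred = [sum(train_y) / len(train_y)] * len(valid_y)

# With the variable: predict from the fitted line.
intercept, slope = fit_line(train_x, train_y)
variable_pred = [intercept + slope * x for x in valid_x]

baseline_score = mape(valid_y, baseline_pred)
variable_score = mape(valid_y, variable_pred)
keep_variable = variable_score < baseline_score  # lower MAPE is better

print(f"without variable: {baseline_score:.1f}% MAPE")
print(f"with variable:    {variable_score:.1f}% MAPE")
print("keep it" if keep_variable else "discard it")
```

Note that this validation uses the variable's actual values for the held-out periods. In a real forecast the variable's future values are themselves forecasts, so the gain may be smaller.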

Consider scope

Another consideration when choosing variables is scope. For example, when forecasting sales for a local retail store, macroeconomic indicators like GDP and currency exchange rates probably aren’t applicable. Your time would be better spent looking at local economic, sociographic, and cultural indicators (the local economy, housing market, tourism, conventions, etc.). Likewise, if you're forecasting national or international data, indicators that share that scope (GDP, currency exchange rates, international trade, etc.) are a better bet.

A warning about high correlation

Note: Datamago will alert and guide you if highly correlated variables are detected in a forecast. It’s mentioned here to save you time when considering variables and gathering data.

Variables that are highly correlated with each other or with the target variable can create undesirable results. When variables overlap this way, each one’s contribution to changes in the target variable can’t be reliably estimated, and the forecast may appear off or simply copy the variables’ shape.

However, there are a couple of exceptions where a variable that is strongly correlated with the target may be helpful: first, to help guide the forecast when there’s a lack of historical data; and second, to explain past fluctuation and guide the forecast as long as the variable’s forecast looks reasonable (again, the target’s forecast may copy the variable’s shape).

An example of a highly correlated variable is the number of monthly orders if you’re forecasting monthly sales. More orders typically mean more sales, and vice versa, so the number of orders likely provides overlapping information.
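If you'd like to screen candidate variables before gathering more data, a quick correlation check is one option. The sketch below computes Pearson correlation for every pair of columns and flags the strong ones. The 0.9 threshold, the column names, and the numbers are assumptions for illustration; this is not Datamago's detection logic.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def highly_correlated_pairs(columns, threshold=0.9):
    """Return (name, name, r) for each pair with |r| at or above threshold."""
    names = list(columns)
    flagged = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            r = pearson(columns[first], columns[second])
            if abs(r) >= threshold:
                flagged.append((first, second, round(r, 3)))
    return flagged


# Made-up monthly data: orders move in lockstep with sales, rainfall does not.
columns = {
    "sales": [200, 260, 240, 310, 330, 400],
    "orders": [20, 26, 24, 31, 33, 40],
    "rainfall": [5, 1, 4, 2, 6, 3],
}

flagged = highly_correlated_pairs(columns)
print(flagged)  # [('sales', 'orders', 1.0)]
```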

Example - Retail Value of California Wine

We’ll start by creating a 5-year forecast of the retail value of California wine [2] without any variables in order to establish a baseline. The result looks pretty good: it captures the overall trend and has a good performance score. However, the forecast contains an artifact from the 2020 dip in retail value, which was most likely due to COVID. Both are circled in red below:

We can see that the retail value is beginning to recover in 2021 before the forecast starts. And since we don’t have a reason to believe that a similar dip will occur several years from now, we could potentially improve the forecast with a variable that both explains 2020 and guides the future values.

It’s important to consider scope when deciding which variables to try. Even though we’re dealing with wine from California, the retail value encompasses all points of sale across the U.S., so the dataset is national in scope. With that said, an economic factor like disposable income, GDP, or demand (i.e. wine consumption) may be appropriate to explain the dip in value.

We first tried disposable income, but it hardly decreased in 2020 (at least in the yearly aggregate), so it didn’t improve the forecast. Next we tried GDP per capita [3], which has a dip in 2020 (highlighted in red) that coincides with that of the retail value of California wine:

Here’s the resulting forecast after adding GDP per capita to the dataset:

As you can see above, the original forecasted dip is gone and even though there’s a small one near the end, it’s more in line with the subtle historical movement. Overall it’s an improvement. Additionally, the performance score—which is based on validation set accuracy—went from ‘good’ to ‘excellent’. For advanced readers: the MAPE was reduced from 5% to 1.5%. Let’s compare the validation set forecasts with and without GDP per capita as a variable:

Below is the original validation set forecast without a variable. As you can see, the forecast (yellow line) overshot 2020’s actual value (blue line). But since there isn’t a similar pattern in the past to learn from, we couldn’t expect the dip to be predicted.

However, the 2020 forecast is much closer to the real historical data after including GDP per capita:

Given the improved forecast shape and performance score, the recommended course of action is to stop here and use GDP per capita instead of continuing the search for potentially better variables. Any marginal gain in performance (assuming there’s an improvement at all) would likely not be worth the additional time it takes to gather and prepare the data. See “When is a forecast good enough?” to learn more.

Note: Using a variable to improve the above forecast isn’t the only option. It can work really well, but it’s also potentially the most time-consuming. Other viable techniques include smoothing, historical forecasting, and adjusting anomalies.

Lastly, it’s worth noting that with GDP per capita included in the forecast, you could adjust its future values to test how potential dips and spikes in the economy could affect wine’s retail value. A blog post about this topic is coming soon.
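As a rough picture of what such a scenario test involves, the sketch below scales a variable's future values and compares the projected target under each path. Everything here is hypothetical: the numbers are invented, and the straight-line relationship (intercept and slope) is a stand-in for a fitted forecast model, not how Datamago computes forecasts.

```python
def apply_shock(values, start, pct_change):
    """Copy `values`, scaling entries from index `start` on by pct_change %."""
    factor = 1 + pct_change / 100
    return [v * factor if i >= start else v for i, v in enumerate(values)]


def project_target(variable_values, intercept, slope):
    """Project the target from the variable with a simple linear relation."""
    return [intercept + slope * v for v in variable_values]


# Hypothetical future values of the variable (e.g. GDP per capita, thousands).
baseline_gdp = [60.0, 61.0, 62.0, 63.0, 64.0]

# Scenario: a 5% economic dip that begins in the third future period.
dip_gdp = apply_shock(baseline_gdp, start=2, pct_change=-5)

# Hypothetical coefficients standing in for a fitted model.
intercept, slope = 10.0, 0.5

baseline_value = project_target(baseline_gdp, intercept, slope)
dip_value = project_target(dip_gdp, intercept, slope)
difference = [d - b for d, b in zip(dip_value, baseline_value)]

for period, diff in enumerate(difference, start=1):
    print(f"period {period}: change in projected target = {diff:+.2f}")
```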

Anyone with a Datamago account can use variables to increase forecast performance!

1. Source: Dr. Pieter Tans, NOAA/GML (gml.noaa.gov/ccgg/trends/) and Dr. Ralph Keeling, Scripps Institution of Oceanography (scrippsco2.ucsd.edu).

2. Source: Wine Institute.

3. Source: USA Facts.