Facebook Prophet library paper review

Introduction
The authors of this paper start by introducing the problem: forecasting is a data science task that is central to many activities within an organization, with resource management and resource allocation given as examples, among others. They then describe the two main approaches to producing forecasts: automatic statistical forecasting, which can be hard to tune and adjust without in-depth knowledge of the underlying mathematics, and analyst-driven forecasting, which relies on the judgment of a particular professional about the problem at hand.
You can download the paper as a PDF from here.
The result is that the demand for high-quality forecasts often far outstrips the pace at which they are produced.
The authors then describe a third setting, which they consider the most realistic: forecasting at scale. In this setting a large number of forecasts must be created, which requires efficient, automated means of evaluating and comparing them. Because so many forecasts are produced in any given period of time, it is crucial to focus machine power on evaluating them rather than allocating human resources to that task.
Features of Business Time Series
This chapter starts by introducing the reader to the wide range of business forecasting problems, using the time series of events created on Facebook as an example. The reader is then presented with the concept of seasonality, weekly, monthly, and yearly, briefly described as behavior that is "cyclical on a given period of time". These seasonal effects arise naturally and can be expected in time series generated by human actions. The authors also emphasize that real datasets often contain outliers. The figure below shows that producing a reasonable forecast for such a series with a fully automated method is possible but difficult in practice. In that figure the authors applied four different automated forecasting methods,
with forecasts made at three points in history, each using only the portion of the time series up to that point to simulate making a forecast on that date. They concluded that the standard statistical methods (such as ARIMA) failed to produce forecasts that match the characteristics of these time series.
The authors go on to expose the inability of standard statistical models, ARIMA in this example, to capture long-term seasonality. They argue this is a major disadvantage because tuning the hyperparameters of any of these statistical models requires an in-depth understanding of the mathematics behind that particular model. Namely, when a forecast is poor, we want to be able to tune the parameters of the method to the problem at hand, and tuning these methods requires a thorough understanding of how the underlying time series models work.
The Prophet Forecasting Model
The main focus of this model is to address the common features of business time series and, importantly, the Prophet forecasting model is designed to have intuitive parameters that can be adjusted without knowing the details of the underlying model.
The Prophet Forecasting Model uses a decomposable time series model with three main components: trend, seasonality, and holidays.
\[\begin{aligned} y(t)=g(t)+s(t)+h(t)+\epsilon_t \end{aligned}\]Where g(t) is the trend function, s(t) is the seasonality function, h(t) is the holiday function, and ε_t is an error term. The trend function models non-periodic changes in the value of the time series. The seasonality function models periodic changes: daily, weekly, monthly, yearly, and so on. The holiday function represents the effects of holidays, which occur on potentially irregular schedules.
The authors of this paper assume that the error term is normally distributed.
Later the authors compare this model to the generalized additive model (GAM), which is a class of regression models with potentially non-linear smoothers applied to the regressors.
The authors treat seasonality as an additive component, the same approach taken by exponential smoothing, and clarify that multiplicative seasonality, where the seasonal effect is a factor that multiplies the trend, can be accomplished through a log transform.
Afterward, the authors list the advantages of the GAM formulation: it decomposes easily, accommodates new components as necessary, and fits quickly using either backfitting or L-BFGS. They contrast this with their own framing of the problem: they see time-series forecasting as a curve-fitting exercise, which is inherently different from time series models that explicitly account for the temporal dependence structure in the data.
The authors acknowledge that, by framing the problem this way, they intentionally give up some of the inferential advantages of such time series models.
This tradeoff yields the following advantages, stated by the authors:
- Flexibility: seasonality with multiple periods is easily accommodated, and the analyst can make different assumptions about trends.
- Measurements need not be regularly spaced; there is no need to interpolate missing values.
- Fast fitting.
- Easily interpretable parameters that can be changed by the analyst to impose assumptions on the forecast.
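To make the "intuitive parameters, fast fitting" claim concrete, here is a minimal usage sketch of the Python package; the import path is `prophet` in recent releases and `fbprophet` in older ones, and the CSV filename is hypothetical.

```python
import pandas as pd
from prophet import Prophet  # older releases: from fbprophet import Prophet

# Prophet expects a DataFrame with a datestamp column 'ds' and a value column 'y'.
df = pd.read_csv("example_series.csv")  # hypothetical file with 'ds' and 'y' columns

m = Prophet()          # default settings; no statistical tuning needed to start
m.fit(df)

future = m.make_future_dataframe(periods=90)  # extend 90 days past the history
forecast = m.predict(future)
print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail())
```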
The Trend g(t) Model
The authors have implemented 2 trend models that cover particular Facebook-related applications. These two trend models are the nonlinear, saturating growth model and the piecewise linear model.
Nonlinear, saturating growth model
The main question this model answers is how the population has grown so far and how it is expected to continue growing. The authors compare growth at Facebook (in the number of users) to population growth in a natural ecosystem, where nonlinear growth saturates at a carrying capacity. Such growth is usually (but not exclusively) modeled with a logistic growth model.
The authors assume logistic growth for the nonlinear saturating growth model.
\[\begin{aligned} g(t)=\frac{C}{1+e^{-k(t-m)}} \end{aligned}\]Where C is the carrying capacity, k is the growth rate, and m is the offset parameter.
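As a quick illustration of this basic form, here is a small NumPy sketch of the saturating curve (the parameter values are arbitrary):

```python
import numpy as np

def logistic_growth(t, C, k, m):
    """Basic logistic trend: carrying capacity C, growth rate k, offset m."""
    return C / (1.0 + np.exp(-k * (t - m)))

t = np.arange(0, 200, dtype=float)
g = logistic_growth(t, C=1000.0, k=0.05, m=100.0)  # saturates near C = 1000
```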
There are two important aspects that are NOT captured in this equation.
- The carrying capacity C is not constant. As the number of people with access to the internet increases, the ceiling of the growth function increases with it, so C becomes a function of time, C(t).
- The growth rate k is not constant. The rate of growth may vary with new product releases and similar events, so the model must be able to incorporate a varying rate in order to fit historical data. This is solved by explicitly defining changepoints, locations in time where the growth rate is allowed to change.
Incorporating both adjustments yields the piecewise logistic growth model:
\[\begin{aligned} g(t)=\frac{C(t)}{1+e^{-(k+\mathbf{a}(t)^{T}\delta)(t-(m+\mathbf{a}(t)^{T}\gamma))}} \end{aligned}\]where δ is the vector of rate adjustments at the changepoints and γ is the vector of offset adjustments chosen so that the function remains continuous at the changepoints.
An important set of parameters in this model is the carrying capacity at different points in time. The authors note that "analysts often have insight into market sizes and can set these accordingly".
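In the Python package this is exposed through the `cap` column: the analyst chooses `growth='logistic'` and supplies the capacity for both the history and the future dates. A sketch, with a made-up capacity value and assuming `df` already holds the `ds`/`y` history:

```python
from prophet import Prophet

# df is assumed to already contain the history in columns 'ds' and 'y'.
df["cap"] = 8.5e6                      # made-up market-size estimate from the analyst

m = Prophet(growth="logistic")
m.fit(df)

future = m.make_future_dataframe(periods=365)
future["cap"] = 8.5e6                  # the capacity must also be provided for future rows
forecast = m.predict(future)
```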
Linear Trend with Changepoints
For forecasting problems that do not exhibit saturating growth, a piecewise constant rate of growth provides a parsimonious and often useful model.
\[\begin{aligned} g(t)=(k+\mathbf{a}(t)^{T}\delta)t+(m+\mathbf{a}(t)^{T}\gamma) \end{aligned}\]
Automatic Changepoint Selection
There are two possible ways to specify changepoint locations:
- directly by the analyst, on specific dates, or
- automatically.
Automatic selection is accomplished by placing a sparse prior on the rate adjustments δ. This prior is of the form δj ∼ Laplace(0, τ), and by default a large number of candidate changepoints is used, for example one per month for several years of history. The parameter τ directly controls the flexibility of the model in altering its rate. Importantly, the sparse prior on the rate adjustments δ has no impact on the primary growth rate k, so as τ goes to 0 the fit reduces to standard (non-piecewise) logistic or linear growth.
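A small NumPy sketch of the piecewise linear variant may help here: the indicator vector a(t) flags which changepoints have been passed, δ holds the rate adjustments (Laplace-distributed under the sparse prior), and the offsets are adjusted by γj = −sj δj so the trend stays continuous. All numbers below are illustrative.

```python
import numpy as np

def piecewise_linear_trend(t, k, m, s, delta):
    """g(t) = (k + a(t)^T delta) * t + (m + a(t)^T gamma), with gamma_j = -s_j * delta_j."""
    A = (t[:, None] >= s[None, :]).astype(float)  # a_j(t) = 1 once t passes changepoint s_j
    gamma = -s * delta                            # offset adjustments keep g(t) continuous
    return (k + A @ delta) * t + (m + A @ gamma)

t = np.arange(0.0, 120.0)
s = np.array([40.0, 80.0])                                   # changepoint locations
tau = 0.05
delta = np.random.laplace(loc=0.0, scale=tau, size=s.shape)  # sparse prior on rate changes
g = piecewise_linear_trend(t, k=0.5, m=10.0, s=s, delta=delta)
```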
Trend forecast uncertainty
When the model is extrapolated past the history to make a forecast, the trend has a constant rate. The authors estimate the uncertainty in the forecast trend by extending the generative model forward: future changepoints are randomly sampled so that their average frequency matches the frequency of changepoints in the history.
The authors assume that the future will see rate changes with the same average frequency and magnitude as the history: "We thus measure uncertainty in the forecast trend by assuming that the future will see the same average frequency and magnitude of rate changes that were seen in the history."
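A rough sketch of how that generative extension might look, under simplifying assumptions that are mine rather than the paper's exact mechanism: future changepoints are dropped at the historical frequency, and their magnitudes are drawn from a Laplace distribution whose scale is estimated from the historical rate adjustments.

```python
import numpy as np

def sample_future_rate_changes(T_hist, S_hist, delta_hist, T_future,
                               rng=np.random.default_rng()):
    """Sample future changepoints at the historical frequency with Laplace magnitudes."""
    freq = S_hist / T_hist                    # average changepoints per unit time in history
    lam = np.mean(np.abs(delta_hist))         # typical magnitude of historical rate changes
    n_future = rng.poisson(freq * T_future)   # expected number of changepoints ahead
    times = np.sort(rng.uniform(0, T_future, size=n_future))
    deltas = rng.laplace(0.0, lam, size=n_future)
    return times, deltas

# e.g. 1000 days of history, 5 historical rate adjustments, forecasting 365 days ahead
delta_hist = np.array([0.10, -0.20, 0.05, 0.00, -0.07])   # illustrative values
times, deltas = sample_future_rate_changes(T_hist=1000, S_hist=len(delta_hist),
                                            delta_hist=delta_hist, T_future=365)
```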
Seasonality
The authors observe that business time series often have multi-period seasonality because of the human behaviors they represent. They model these seasonalities with Fourier series, which provide a flexible basis that can approximate a wide variety of periodic functions. Fitting the seasonality therefore reduces to a simple function approximation problem.
The authors place a normal prior on the seasonality coefficients: β ∼ Normal(0, σ²).
\[\begin{aligned} s(t)=\sum_{n=1}^{N}\left(a_n\cos\left(\frac{2{\pi}nt}{P}\right)+b_n\sin\left(\frac{2{\pi}nt}{P}\right)\right) \end{aligned}\]
Holidays and Events
Holidays and events provide large, somewhat predictable shocks to many business time series, and they often do not follow a periodic pattern, so their effects are not well modeled by a smooth cycle. Prophet allows the analyst to provide a custom list of past and future events, identified by each event's or holiday's unique name. Incorporating this list of holidays into the model is made straightforward by assuming that the effects of individual holidays are independent. As with seasonality, the holiday effects have a prior that is also assumed to be normal, κ ∼ Normal(0, ν²). The authors also suggest that adding a window around each holiday can be advantageous, because human behavior changes before and after holidays and other events.
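To make the seasonality and holiday components concrete, here is a sketch of how the regressor matrix could be built by hand (Prophet constructs this internally); the dates are illustrative, and the term counts use N = 10 yearly and N = 3 weekly terms as mentioned in the paper.

```python
import numpy as np
import pandas as pd

def fourier_features(t, period, n_terms):
    """Columns [cos(2*pi*n*t/P), sin(2*pi*n*t/P)] for n = 1..N."""
    cols = []
    for n in range(1, n_terms + 1):
        cols.append(np.cos(2.0 * np.pi * n * t / period))
        cols.append(np.sin(2.0 * np.pi * n * t / period))
    return np.column_stack(cols)

dates = pd.date_range("2017-01-01", periods=365, freq="D")
t = np.arange(len(dates), dtype=float)                      # days since the series start

X_yearly = fourier_features(t, period=365.25, n_terms=10)   # N = 10 for yearly seasonality
X_weekly = fourier_features(t, period=7.0, n_terms=3)       # N = 3 for weekly seasonality

# One indicator column per holiday (a window of days around each could be added too).
holidays = pd.to_datetime(["2017-07-04", "2017-12-25"])
X_holidays = np.column_stack([dates.isin([h]).astype(float) for h in holidays])

X = np.hstack([X_yearly, X_weekly, X_holidays])   # s(t) + h(t) is then X @ beta
```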
Model Fitting
Prophet fits the model by combining the seasonality and holiday features for each observation into a matrix X and the changepoint indicators into a matrix A, after which the whole model can be expressed in a few lines of Stan code, as shown below.
model {
  // Priors
  k ~ normal(0, 5);
  m ~ normal(0, 5);
  epsilon ~ normal(0, 0.5);
  delta ~ double_exponential(0, tau);
  beta ~ normal(0, sigma);
  // Logistic likelihood
  y ~ normal(C ./ (1 + exp(-(k + A * delta) .* (t - (m + A * gamma)))) +
    X * beta, epsilon);
  // Linear likelihood
  y ~ normal((k + A * delta) .* t + (m + A * gamma) + X * beta, sigma);
}
Prophet uses Stan's L-BFGS implementation to find a maximum a posteriori estimate, but it can also perform full posterior inference to include model parameter uncertainty in the forecast uncertainty.
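In the Python package this choice surfaces as the `mcmc_samples` argument; a small sketch, with an arbitrary sample count:

```python
from prophet import Prophet

# Default: MAP estimate found with L-BFGS; fast, but parameter uncertainty
# is not propagated into the uncertainty of the seasonal components.
m_map = Prophet()

# Full posterior inference with MCMC; much slower, but the forecast uncertainty
# then reflects uncertainty in the model parameters as well.
m_mcmc = Prophet(mcmc_samples=300)
```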
The authors compare the results on the original data in the paper with the results from other models and conclude that the Prophet forecast is able to predict both the weekly and yearly seasonalities, and unlike the baseline model does not overreact to the holiday dip in the first year.
Analyst-in-the-Loop Modeling
The authors state that analysts making forecasts often have extensive domain knowledge about the quantity they are forecasting, but limited statistical knowledge. In the Prophet model specification, there are several places where analysts can alter the model to apply their expertise and external knowledge without requiring any understanding of the underlying statistics.
- Capacities
- Changepoints
- Holidays and seasonality
- Smoothing parameters
The authors suggest that when the model fit is plotted over historical data, it is quickly apparent whether the automatic changepoint selection missed any changepoints. The τ parameter is a single knob that can be turned to increase or decrease the trend flexibility, and σ is a knob to increase or decrease the strength of the seasonality component.
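In the Python package these knobs correspond to constructor arguments; a sketch with illustrative values:

```python
from prophet import Prophet

m = Prophet(
    changepoints=["2016-01-01"],     # dates where the analyst knows the trend shifted
    changepoint_prior_scale=0.5,     # tau: larger values allow a more flexible trend
    seasonality_prior_scale=10.0,    # sigma: larger values allow stronger seasonality
)
```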
The authors state that adjustments such as choosing between a linear trend and logistic growth, setting the scale of the seasonalities, and marking outlying time periods that should be removed can all be made without statistical expertise, and are important ways for analysts to apply their insights or domain knowledge.
The forecasting literature often distinguishes between statistical forecasts, which are based on models fit to historical data, and judgmental forecasts, which human experts produce using whatever process they have learned. The authors present their analyst-in-the-loop modeling as an alternative that attempts to blend the advantages of both, by focusing analyst effort on improving the model when necessary rather than on directly producing forecasts through some unstated procedure.
As of the paper's writing, the authors have only anecdotal empirical evidence for possible improvements to accuracy, but they look forward to future research that can evaluate how much analysts can improve forecasts in such a model-assisted setting.
Automatic Evaluation of Forecasts
In this section, the authors outline a procedure for automating forecast performance evaluation, by comparing various methods and identifying forecasts where manual intervention may be warranted. This section is agnostic to the forecasting method used and contains some best practices they have settled on while shipping production business forecasts across a variety of applications.
Use of Baseline Forecasts
When evaluating any forecasting procedure it is important to compare to a set of baseline methods. The authors prefer using simplistic forecasts that make strong assumptions about the underlying process but that can produce a reasonable forecast in practice. They have found it useful to compare simplistic models (last value and sample mean) as well as the automated forecasting procedures.
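A minimal sketch of the two naive baselines mentioned:

```python
import numpy as np

def last_value_forecast(y, horizon):
    """Naive baseline: repeat the last observed value for every step of the horizon."""
    return np.repeat(y[-1], horizon)

def sample_mean_forecast(y, horizon):
    """Naive baseline: repeat the historical mean for every step of the horizon."""
    return np.repeat(np.mean(y), horizon)
```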
Modeling Forecast Accuracy
The forecasts are made over a specific forecast horizon, denoted by H. This horizon represents the number of days in the future that the analyst cares about when forecasting. Thus for any forecast with daily observations, we produce up to H estimates of future states that will each be associated with some error. We need to declare a forecasting objective to compare methods and track performance. The forecasting objective can be any metric: minimizing MSE, RMSE, MAPE, etc.
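For reference, a minimal sketch of the usual definitions of those objectives:

```python
import numpy as np

def mse(y_true, y_pred):
    return np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)

def rmse(y_true, y_pred):
    return np.sqrt(mse(y_true, y_pred))

def mape(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean(np.abs((y_true - y_pred) / y_true))
```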
In order to form an estimate of this accuracy and how it varies with h, it is common to specify a parametric model for the error term and to estimate its parameters from data. The authors give an example using an AR(1) model as a specific parametric model. Then we could form expectations using any distance function through simulation or by using an analytic expression for the expectation of the sum of the errors. Unfortunately, the authors state that these approaches only give correct estimates of error conditional on having specified the correct model for the process – a condition that is unlikely to hold in practice.
A non-parametric approach to estimating expected errors, one that is applicable across models, may therefore be preferred. The approach is similar to applying cross-validation to estimate out-of-sample error for models making predictions on i.i.d. data. Given a set of historical forecasts, we fit a model of the expected error we would make at different forecast horizons h:
\[\begin{aligned} \xi(h)=E[\phi(T,h)] \end{aligned}\]The authors state that this model should be flexible but can impose some simple assumptions. First, the function should be locally smooth in h because we expect any mistakes we make on consecutive days to be relatively similar. Second, we may impose the assumption that the function should be weakly increasing in h, although this need not be the case for all forecast models. In practice, they use a local regression or isotonic regression as flexible non-parametric models of error curves.
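The isotonic option is easy to sketch with scikit-learn; the horizon/error pairs below are made up, whereas in practice they would come from the simulated historical forecasts described next.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Horizon (days ahead) and absolute-error pairs pooled across many past forecasts.
h       = np.array([1, 1, 2, 2, 3, 3, 5, 5, 7, 7])
abs_err = np.array([1.0, 1.4, 1.2, 1.6, 1.5, 1.9, 2.0, 1.7, 2.4, 2.6])

# Fit a weakly increasing error curve xi(h).
xi = IsotonicRegression(increasing=True).fit(h, abs_err)
print(xi.predict([1, 2, 3, 4, 5, 6, 7]))
```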
In order to generate historical forecast errors to fit this model, the authors use a procedure they call simulated historical forecasts.
Simulated Historical Forecasts (SHF)
Prophet uses simulated historical forecasts (SHFs) by producing K forecasts at various cutoff points in the history, chosen such that the horizons lie within the history and the total error can be evaluated. The main advantage of using fewer simulated dates (rolling origin evaluation produces one forecast per date) is that it economizes on computation while providing less correlated accuracy measurements.
SHFs simulate the errors we would have made had we used this forecasting method at those points in the past.
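The Python package exposes this procedure through its diagnostics module; a sketch, assuming `m` is an already-fitted Prophet model and the windows below are chosen arbitrarily:

```python
from prophet.diagnostics import cross_validation, performance_metrics

# Forecast from a series of cutoff points in the history: 'initial' is the minimum
# training window, 'period' the spacing between cutoffs, 'horizon' the H of interest.
df_cv = cross_validation(m, initial="730 days", period="180 days", horizon="365 days")

# Error metrics (MSE, RMSE, MAPE, ...) aggregated as a function of horizon.
df_metrics = performance_metrics(df_cv)
```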
There are two main issues to be aware of when using the SHF methodology to evaluate and compare forecasting approaches.
First, the more simulated forecasts we make, the more correlated their estimates of error are. Although correlated estimates do not introduce bias into the estimation of model accuracy, they produce less useful information and slow down forecast evaluation.
Second, forecasting methods can perform better or worse with more data. A longer history can lead to worse forecasts when the model is misspecified and we are overfitting the past, for example using the sample mean to forecast a time series with a trend.
Even for a single time series SHFs require many forecasts to be computed, and at scale, we may want to forecast many different metrics at many different levels of aggregation. SHFs can be computed independently on separate machines as long as those machines can write to the same data store. Prophet stores the forecasts and associated errors in Hive or MySQL depending on their intended use.
Identifying Large Forecast Errors
When there are too many forecasts for analysts to manually check each of them, it is important to be able to automatically identify forecasts that may be problematic. Automatically identifying bad forecasts allows analysts to use their limited time most effectively, and to use their expertise to correct any issues. There are several ways that SHFs can be used to identify likely problems with the forecasts.
When the forecast has large errors relative to the baselines, the model may be misspecified. Analysts can adjust the trend model or the seasonality, as needed.
Large errors for all methods on a particular date are suggestive of outliers. Analysts can identify outliers and remove them.
When the SHF error for a method increases sharply from one cutoff to the next, it could indicate that the data generating process has changed. Adding changepoints or modeling different phases separately may address the issue.
There are pathologies that cannot be easily corrected, but most of the issues that the authors have encountered can be corrected by specifying changepoints and removing outliers. These issues are easily identified and corrected once the forecast has been flagged for review and visualized.
Conclusion
A major theme of forecasting at scale is that analysts with a variety of backgrounds must make more forecasts than they can do manually.
The first component of the forecasting system is the new model that the authors have developed over many iterations of forecasting a variety of data at Facebook. They use a simple, modular regression model that often works well with default parameters, and that allows analysts to select the components that are relevant to their forecasting problem and easily make adjustments as needed.
The second component is a system for measuring and tracking forecast accuracy, and flagging forecasts that should be checked manually to help analysts make incremental improvements. This is a critical component that allows analysts to identify when adjustments need to be made to the model or when an entirely different model may be appropriate.
Simple, adjustable models and scalable performance monitoring in combination allow a large number of analysts to forecast a large number and a variety of time series – what the authors consider forecasting at scale.