Time Series Forecasting

In Short

Time series forecasting predicts future values of a chronologically ordered series using patterns in its own history. The field ranges from classical statistical models such as ARIMA to deep learning approaches, and has recently shifted toward pre-trained foundation models (TimesFM, Chronos, Moirai) capable of zero-shot forecasting on unseen datasets.

01. What It Is

A time series is a sequence of observations indexed in chronological order: daily sales figures, hourly energy load, minute-by-minute stock prices, or monthly unemployment rates. Forecasting uses that historical record to predict future values.

Forecasting differs from most supervised learning tasks in one key way: the ordering of observations carries information. A value at time t typically depends on values at t-1, t-2, and so on, so shuffling the rows destroys the signal. Models must respect this temporal dependency, and evaluation must avoid data leakage by keeping future data strictly out of training.

02. Why It Matters

Demand forecasting at Amazon and Walmart directly determines inventory levels, logistics costs, and out-of-stock rates. Energy grid operators use load forecasting to balance supply and prevent blackouts. Financial institutions forecast prices, volatility, and macro indicators for trading and risk management. Healthcare systems forecast patient volumes for staffing and supply chain planning. Anomaly detection in time series, a related problem, powers fraud detection and industrial equipment monitoring.

Google Research notes that improving demand forecasting accuracy "can meaningfully reduce inventory costs and increase revenue," which is why TimesFM was developed as a foundation model for the problem.

03. How It Works

Decomposition

Many classical forecasting methods begin by decomposing a series into components:

Trend:
The long-term direction (upward, downward, or flat).
Seasonality:
Repeating patterns at fixed periods (weekly, monthly, yearly).
Noise (residual):
Random variation unexplained by trend or seasonality.

Additive decomposition assumes the components sum: y(t) = trend(t) + seasonality(t) + noise(t). Multiplicative decomposition assumes they multiply: y(t) = trend(t) × seasonality(t) × noise(t). The multiplicative form is appropriate when the amplitude of seasonal swings grows with the trend level; taking logarithms of a strictly positive series turns it into an additive one.
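The additive case can be sketched in a few lines. This is a minimal illustration of the classical moving-average approach, not any particular library's implementation; the function name decompose_additive and the synthetic series are assumptions made for the example.

```python
from statistics import mean

def decompose_additive(y, period):
    """Classical additive decomposition: y[t] = trend[t] + seasonal[t] + resid[t]."""
    n = len(y)
    half = period // 2
    trend = [None] * n  # undefined at the edges, where the window does not fit
    for t in range(half, n - half):
        window = y[t - half:t + half + 1]
        if period % 2:
            # Odd period: plain centered moving average
            trend[t] = mean(window)
        else:
            # Even period: centered average with half weight on the two end points
            trend[t] = (0.5 * window[0] + sum(window[1:-1]) + 0.5 * window[-1]) / period
    # Seasonal effect: average detrended value at each position in the cycle
    detrended = [(t, y[t] - trend[t]) for t in range(n) if trend[t] is not None]
    seasonal_means = [mean(d for t, d in detrended if t % period == k) for k in range(period)]
    offset = mean(seasonal_means)  # center the effects so they average to zero
    seasonal = [seasonal_means[t % period] - offset for t in range(n)]
    resid = [None if trend[t] is None else y[t] - trend[t] - seasonal[t] for t in range(n)]
    return trend, seasonal, resid

# Synthetic series (illustrative): linear trend plus a period-4 pattern, no noise
pattern = [3.0, -1.0, -2.0, 0.0]
y = [10 + 0.5 * t + pattern[t % 4] for t in range(24)]
trend, seasonal, resid = decompose_additive(y, period=4)
```

On this noise-free input the recovered seasonal effects match the pattern and the residuals are zero up to floating-point error. Real data leaves a non-trivial residual, which is the part a forecasting model still has to explain.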

Classical statistical methods

ARIMA (Autoregressive Integrated Moving Average):
ARIMA models a series as a linear combination of its own past values (AR component), past forecast errors (MA component), and a differencing operation that removes trend to achieve stationarity (the I component). SARIMA extends this to model seasonal patterns. ARIMA requires the analyst to specify three orders (p, d, q): the number of AR lags, the number of differencing passes, and the number of MA lags. It works well on univariate series that are stationary or become stationary after differencing.
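To make the AR and I components concrete, here is a minimal sketch of an ARIMA(1,1,0) forecast: difference once, fit an AR(1) model by ordinary least squares, forecast the differences, then sum them back onto the last observed level. The function names are hypothetical, there is no MA term, and statistical libraries typically estimate the parameters by maximum likelihood rather than the simple least-squares fit assumed here.

```python
def fit_ar1(x):
    """Least-squares fit of x[t] = c + phi * x[t-1] + error. Returns (c, phi)."""
    prev, curr = x[:-1], x[1:]
    n = len(prev)
    mean_prev, mean_curr = sum(prev) / n, sum(curr) / n
    cov = sum((p - mean_prev) * (q - mean_curr) for p, q in zip(prev, curr))
    var = sum((p - mean_prev) ** 2 for p in prev)
    phi = cov / var
    return mean_curr - phi * mean_prev, phi

def forecast_arima_110(y, steps):
    """ARIMA(1,1,0) sketch: difference, fit AR(1), forecast, undo the differencing."""
    diffs = [b - a for a, b in zip(y, y[1:])]  # the I step with d = 1
    c, phi = fit_ar1(diffs)                    # the AR step with p = 1
    level, diff = y[-1], diffs[-1]
    out = []
    for _ in range(steps):
        diff = c + phi * diff   # predicted next change
        level += diff           # integrate back to the original scale
        out.append(level)
    return out

# Illustrative data: changes follow d[t] = 1 + 0.5 * d[t-1] exactly
changes = [4.0]
for _ in range(9):
    changes.append(1 + 0.5 * changes[-1])
y = [0.0]
for d in changes:
    y.append(y[-1] + d)

forecast = forecast_arima_110(y, steps=3)
```

Because the example data has no noise, the fit recovers c = 1 and phi = 0.5 up to rounding, and the forecast continues the series with changes that decay toward 2 per step.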

Exponential smoothing (ETS):
Assigns exponentially decreasing weights to past observations, so recent values matter more than older ones. Holt's method extends it to handle trends. Holt-Winters further adds seasonality. ETS is fast, interpretable, and competitive on many short to medium horizon tasks.
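The weighting scheme is easiest to see in simple exponential smoothing, the base case without trend or seasonality. The sketch below assumes the level is initialized at the first observation (one of several common conventions), and the function name is hypothetical.

```python
def simple_exponential_smoothing(y, alpha):
    """Return the final smoothed level, which is the forecast for every future step."""
    level = y[0]  # assumption: initialize the level at the first observation
    for obs in y[1:]:
        # New level is a weighted average of the latest observation and the old level
        level = alpha * obs + (1 - alpha) * level
    return level

alpha = 0.5
forecast = simple_exponential_smoothing([10.0, 12.0, 11.0, 13.0], alpha)

# Implied weights on the most recent, second most recent, ... observations
weights = [alpha * (1 - alpha) ** k for k in range(3)]  # 0.5, 0.25, 0.125
```

A larger alpha reacts faster to recent changes; a smaller alpha averages over a longer history.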

Prophet:
Developed by Facebook (Meta) and open-sourced in 2017. Prophet decomposes a series into trend, seasonality (modeled with Fourier series), and holiday effects, then fits the decomposition with an additive model. It is robust to missing data and handles multiple seasonality levels. Prophet was designed for business analysts without deep time series expertise and became widely adopted for retail and capacity planning.
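The Fourier-series idea can be illustrated independently of Prophet itself. The sketch below builds sine and cosine regressors for one seasonal period; a linear model fitted on these columns can then approximate a smooth periodic pattern. The function name and the choice of order are assumptions for the example and do not reflect Prophet's internal code.

```python
import math

def fourier_features(t, period, order):
    """Return [sin, cos] pairs for harmonics 1..order of the given period at time t."""
    feats = []
    for k in range(1, order + 1):
        angle = 2 * math.pi * k * t / period
        feats.extend([math.sin(angle), math.cos(angle)])
    return feats

# Weekly seasonality on daily data with 3 harmonics gives 6 regressors per day
row_day_3 = fourier_features(t=3, period=7, order=3)
row_day_10 = fourier_features(t=10, period=7, order=3)  # one full week later
```

The two rows are equal up to floating-point error, which is what makes the features periodic. A higher order lets the fitted seasonal curve change more sharply, at the cost of more parameters.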

Machine learning approaches

Gradient boosting (XGBoost, LightGBM):
Tabular ML models can forecast by constructing lag features (y at t-1, t-7, t-365), rolling statistics (7-day average), and calendar features (day of week, month). LightGBM-based solutions won the M5 competition, a large-scale retail demand forecasting benchmark, demonstrating that well-engineered features plus gradient boosting can outperform specialized time series models on many real-world tasks.
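A minimal sketch of that feature construction, assuming daily data held in plain Python lists. The function name, the chosen lags, and the column names are illustrative; the resulting rows could be passed to any tabular regressor.

```python
from datetime import date, timedelta

def make_features(dates, y, lags=(1, 7), window=7):
    """Build one feature row per time step, using only values observed before it."""
    rows = []
    start = max(max(lags), window)  # first step with a full history for every feature
    for t in range(start, len(y)):
        row = {f"lag_{k}": y[t - k] for k in lags}
        # Rolling mean over the previous `window` values, excluding y[t] itself
        row[f"roll_mean_{window}"] = sum(y[t - window:t]) / window
        row["day_of_week"] = dates[t].weekday()  # Monday is 0
        row["month"] = dates[t].month
        row["target"] = y[t]
        rows.append(row)
    return rows

# Illustrative data: 21 days starting Monday 2024-01-01, values 0.0 .. 20.0
dates = [date(2024, 1, 1) + timedelta(days=i) for i in range(21)]
y = [float(i) for i in range(21)]
rows = make_features(dates, y)
```

Every feature is computed from values strictly before the target's time step. Including y[t] in its own rolling window is a common source of leakage.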

LSTMs (Long Short-Term Memory):
Recurrent neural networks that maintain a hidden state across time steps, allowing them to capture nonlinear and longer-range dependencies that a low-order ARIMA model cannot represent. LSTMs were dominant in deep learning forecasting roughly from 2016 to 2021. DeepAR, an LSTM-based probabilistic model from Amazon, is trained jointly across many related series and became a widely used production system for demand forecasting.

Transformer-based models:
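Since 2022, attention-based architectures (PatchTST, iTransformer) have challenged LSTMs on long-horizon benchmarks. PatchTST's key innovation is treating fixed-size patches of a time series as tokens, similar to how vision transformers treat image patches, which shortens the token sequence and enables efficient long-context modeling. iTransformer instead treats each variable's entire series as a single token and applies attention across variables.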
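Patching itself is a simple slicing operation. The sketch below shows only that step under assumed patch length and stride values; in a real model each patch would then be projected to an embedding vector before entering the transformer. The function name is hypothetical.

```python
def make_patches(series, patch_len, stride):
    """Split a series into fixed-length patches; each patch becomes one token."""
    last_start = len(series) - patch_len
    return [series[i:i + patch_len] for i in range(0, last_start + 1, stride)]

series = list(range(12))

# Non-overlapping patches: 12 time steps become 3 tokens
patches = make_patches(series, patch_len=4, stride=4)

# Overlapping patches (stride < patch_len): 12 time steps become 5 tokens
overlapping = make_patches(series, patch_len=4, stride=2)
```

Attention cost grows with the square of the number of tokens, so reducing 12 steps to 3 tokens is where the efficiency gain for long contexts comes from.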

Time series foundation models

The most significant recent development is the emergence of foundation models pre-trained on massive time series corpora for zero-shot use.

TimesFM (Google Research, 2024):
A 200M parameter decoder-only transformer pre-trained on 100 billion real-world time points, primarily from Google Trends and Wikipedia pageviews. Published at ICML 2024. TimesFM achieves zero-shot performance competitive with or exceeding supervised models explicitly trained on each target dataset. Google's blog describes the architecture: "similar to LLMs, we use stacked transformer layers as the main building blocks. In the context of time series forecasting, we treat a patch (a group of contiguous time-points) as a token." Available on Hugging Face.

Chronos (Amazon, 2024):
A family of T5-based language models pre-trained on a large collection of real-world time series, augmented with synthetic data. Chronos scales each series and quantizes its values into a fixed vocabulary of discrete bins, framing forecasting as a sequence-to-sequence language modeling task over those tokens. The model achieves strong zero-shot performance across diverse domains.
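The tokenization idea can be sketched as scale-then-quantize. The code below is an illustration of the general technique, not Chronos's actual tokenizer: the function names, the 10 uniform bins, and the clipping limit of 5.0 are all assumptions chosen to keep the example readable.

```python
def tokenize(context, num_bins=10, limit=5.0):
    """Mean-scale a series, clip to [-limit, limit], and map each value to a bin id."""
    # Scale by the mean absolute value; fall back to 1.0 for an all-zero series
    scale = sum(abs(v) for v in context) / len(context) or 1.0
    width = 2 * limit / num_bins
    tokens = []
    for v in context:
        scaled = max(-limit, min(limit, v / scale))
        # min() keeps a value sitting exactly at +limit inside the top bin
        tokens.append(min(int((scaled + limit) / width), num_bins - 1))
    return tokens, scale

def detokenize(tokens, scale, num_bins=10, limit=5.0):
    """Map bin ids back to real values using bin centers and the stored scale."""
    width = 2 * limit / num_bins
    return [(-limit + (tok + 0.5) * width) * scale for tok in tokens]

tokens, scale = tokenize([2.0, 4.0, 6.0])
recovered = detokenize(tokens, scale)
```

Here the values 4.0 and 6.0 fall into the same bin, so the round trip returns 6.0 for both. That quantization error is the price of turning real values into a finite vocabulary, and a finer bin grid reduces it.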

Moirai (Salesforce Research, 2024):
A masked encoder-based transformer trained on LOTSA (Large-scale Open Time Series Archive), containing over 27 billion observations across nine domains. Moirai addresses the challenges of cross-frequency learning, variable numbers of covariates, and distributional shift across datasets. Published at ICML 2024. "Moirai achieves competitive or superior performance as a zero-shot forecaster when compared to full-shot models," per the arXiv abstract.

The pattern across all three mirrors the LLM paradigm: pre-train once on massive data, then evaluate or fine-tune on specific downstream tasks with minimal data.

04. Key Terms and Methods

Term	Definition
Stationarity	The property of a series whose statistical properties (mean, variance, autocorrelation) do not change over time
Differencing	Subtracting consecutive values to remove trend and achieve stationarity
Lag feature	The value of a series at a prior time step, used as a model input
Horizon	The number of future time steps being forecast
MAE (Mean Absolute Error)	Average absolute difference between forecast and actual values
RMSE (Root Mean Squared Error)	Square root of mean squared error, penalizes large errors more heavily
MAPE (Mean Absolute Percentage Error)	Average of the absolute errors, each divided by the corresponding actual value and expressed as a percentage; undefined when any actual is zero
ARIMA	Autoregressive Integrated Moving Average, classical parametric model
Prophet	Meta's open-source decomposition-based forecasting tool
Foundation model	A model pre-trained on broad data for zero-shot use across tasks
Zero-shot forecasting	Applying a pre-trained model to new series without additional training
Anomaly detection	Identifying time points that deviate significantly from expected patterns
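
The three error metrics in the table can be written directly from their definitions. This is a minimal sketch on made-up numbers; the function names are chosen for the example.

```python
import math

def mae(actual, forecast):
    """Mean absolute error, in the units of the series."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)

def rmse(actual, forecast):
    """Root mean squared error; squaring weights large misses more heavily."""
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual))

def mape(actual, forecast):
    """Mean absolute percentage error; raises ZeroDivisionError if any actual is 0."""
    return 100 * sum(abs((a - f) / a) for a, f in zip(actual, forecast)) / len(actual)

actual = [100.0, 200.0, 300.0]
forecast = [110.0, 190.0, 330.0]

scores = {
    "MAE": mae(actual, forecast),    # absolute errors are 10, 10, 30
    "RMSE": rmse(actual, forecast),  # larger than MAE because of the single miss of 30
    "MAPE": mape(actual, forecast),  # percentage errors are 10%, 5%, 10%
}
```

RMSE exceeds MAE whenever the errors are unequal, which is why the gap between the two is sometimes read as a rough indicator of occasional large misses.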

05. Examples

Retail demand forecasting:
Amazon runs DeepAR across thousands of related product time series simultaneously, using global training to share statistical strength across products with sparse individual histories. Foundation models like TimesFM are being evaluated as drop-in replacements that require no per-series training.

Energy load forecasting:
Grid operators forecast hourly demand 24-48 hours ahead to schedule generation capacity. Weather covariates (temperature, humidity) are critical inputs. LSTM-based models and gradient boosting are common in production.

Finance:
Short-horizon price and volatility forecasting uses GARCH models for variance, along with ML approaches for directional prediction. The efficient market hypothesis implies fundamental limits on how predictable prices are.

Anomaly detection:
Industrial IoT sensors generate continuous time series from equipment. Models trained on normal operation detect deviations indicating wear or failure. Healthcare monitoring (ECG, blood glucose) uses the same approach.
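A minimal sketch of the train-on-normal, flag-deviations idea, using a mean and standard deviation as the simplest possible model of normal operation. The function names, the sensor readings, and the three-sigma threshold are assumptions for illustration; production systems usually model trend and seasonality first and then score the residuals.

```python
from statistics import mean, stdev

def fit_baseline(normal_readings):
    """Summarize normal operation by its mean and sample standard deviation."""
    return mean(normal_readings), stdev(normal_readings)

def find_anomalies(readings, baseline, threshold=3.0):
    """Return indices of readings more than `threshold` standard deviations from the mean."""
    mu, sigma = baseline
    return [i for i, v in enumerate(readings) if abs(v - mu) > threshold * sigma]

# Illustrative temperature readings from a period known to be healthy
normal = [20.1, 19.8, 20.3, 20.0, 19.9, 20.2, 19.7, 20.0]
baseline = fit_baseline(normal)

# New readings to score against the baseline
live = [20.1, 19.9, 23.5, 20.0]
anomalies = find_anomalies(live, baseline)
```

Only the reading of 23.5 is flagged. Fitting the baseline on data that already contains faults would widen the band and hide them, which is why the training window needs to be known-good.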

Macroeconomic forecasting:
Central banks and the IMF use VAR (Vector Autoregression) models that jointly model multiple related series (GDP, inflation, unemployment).

06. Common Pitfalls and Misconceptions

Data leakage through improper train/test splits:
A random split on a time series allows future data to appear in training, inflating performance metrics. Evaluation must use a strict cutoff: train on data before date T, test on data after.
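A strict cutoff generalizes to rolling-origin evaluation, where the cutoff moves forward and the model is re-evaluated at each position. The sketch below only generates index ranges and assumes an expanding training window; the function name and parameters are hypothetical.

```python
def rolling_origin_splits(n_obs, initial_train, horizon, step=None):
    """Yield (train_indices, test_indices) pairs with every test index after every train index."""
    step = step or horizon
    cutoff = initial_train
    while cutoff + horizon <= n_obs:
        # Train on everything before the cutoff, test on the next `horizon` steps
        yield range(0, cutoff), range(cutoff, cutoff + horizon)
        cutoff += step

# 10 observations, at least 6 for training, forecast 2 steps ahead
splits = list(rolling_origin_splits(n_obs=10, initial_train=6, horizon=2))
```

This produces two folds: train on indices 0 to 5 and test on 6 to 7, then train on 0 to 7 and test on 8 to 9. Averaging the error across folds gives a more stable estimate than a single cutoff, while still never training on the future.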

MAPE fails on low-volume or zero-value series:
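It divides by actual values, so series with zeros or near-zero values produce undefined or extremely large errors. Alternatives such as sMAPE and MASE are more appropriate for sparse demand; MASE in particular scales the error by that of a naive forecast on the training data, so it stays defined when individual actuals are zero.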
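The difference shows up on a small intermittent-demand example. The sketch assumes the non-seasonal form of MASE, which scales by the in-sample error of a one-step naive forecast; the data and function names are made up for illustration.

```python
def mape(actual, forecast):
    """Mean absolute percentage error; fails when any actual value is zero."""
    return 100 * sum(abs((a - f) / a) for a, f in zip(actual, forecast)) / len(actual)

def mase(actual, forecast, train):
    """MAE of the forecast divided by the in-sample MAE of the naive 'repeat last value' forecast."""
    naive_mae = sum(abs(b - a) for a, b in zip(train, train[1:])) / (len(train) - 1)
    forecast_mae = sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)
    return forecast_mae / naive_mae

# Sparse demand: many periods with zero sales
train = [0.0, 2.0, 0.0, 0.0, 3.0, 0.0, 1.0]
actual = [0.0, 2.0, 0.0]
forecast = [1.0, 1.0, 1.0]

try:
    mape_value = mape(actual, forecast)
except ZeroDivisionError:
    mape_value = None  # undefined: an actual value is zero

mase_value = mase(actual, forecast, train)
```

MAPE cannot be computed here, while MASE comes out below 1, meaning the forecast's error is smaller than the naive method's in-sample error. MASE is itself undefined only when the training series is constant.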

Stationarity is required for ARIMA, not for deep learning:
Many practitioners apply differencing to deep learning inputs by habit. Neural models can often learn trends and other non-stationary patterns directly, usually with the help of input scaling or normalization, and unnecessary differencing can remove signal.

A better offline metric does not guarantee better production performance:
Holdout evaluation on historical data does not capture distribution shift (changes in the data generating process over time). A model that fits historical patterns perfectly may underperform during demand shocks or structural breaks.

Foundation models are not universally superior:
On datasets with strong, consistent patterns and abundant history, a well-tuned ARIMA or gradient boosting model often matches or exceeds a zero-shot foundation model. Foundation models provide the most value when historical data for the target series is scarce or a new series needs immediate predictions.