The DBCM is a combination of a Dynamic Count Mixture Model (DCMM) and a binary cascade of Bernoulli DGLMs. The original use case came from a retail sales setting. Consider two time series for a single item, one of transactions and the other of sales. A DCMM is used to model the number of transactions that involve the item. The binary cascade models the number of units that each shopper will purchase. The cascade is a sequence of probabilities for whether a shopper will purchase $r+1$ items, conditional on them having purchased $r$ items.
The first component of a DBCM is the DCMM, used to model the total number of daily transactions containing the item of interest, $b_t$. We repeat the form of a DCMM for clarity:
$$ z_t \sim Ber(\pi_t) \text{ and } b_t \mid z_t = \begin{cases} 0, & \text{if } z_t = 0,\\ 1 + x_t, \quad x_t \sim Po(\mu_t), & \text{if }z_t = 1 \end{cases} $$
where $\pi_t$ and $\mu_t$ vary according to the dynamics of independent Bernoulli and Poisson DGLMs respectively.
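To make the mixture concrete, here is a minimal simulation sketch of the DCMM form above for a single time step. It holds $\pi_t$ and $\mu_t$ fixed at hypothetical values instead of evolving them through DGLMs, and the function name `sample_dcmm` is illustrative only, not part of PyBATS.

```python
import numpy as np

def sample_dcmm(pi_t, mu_t, nsamps, rng):
    """Draw transaction counts b_t from the mixture for fixed pi_t and mu_t."""
    z = rng.binomial(1, pi_t, size=nsamps)   # z_t ~ Ber(pi_t)
    x = rng.poisson(mu_t, size=nsamps)       # x_t ~ Po(mu_t)
    return z * (1 + x)                       # b_t = 0 if z_t = 0, else 1 + x_t

rng = np.random.default_rng(0)
# Hypothetical values for pi_t and mu_t, chosen only for illustration
b = sample_dcmm(pi_t=0.8, mu_t=4.0, nsamps=10_000, rng=rng)

# The sample mean should be close to pi_t * (1 + mu_t) = 4.0
print(b.mean())
```

The shift by one in the nonzero branch means that whenever $z_t = 1$ there is at least one transaction, so zeros come only from the Bernoulli component.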
Transactions $b_t$ are related to sales by modeling the number of units sold in each transaction. Because sales outliers often arise when shoppers buy many units of an item, the probability of each quantity is modeled with a binary cascade. Let $n_{r,t}$ be the number of transactions with more than $r$ units, so that $n_{0,t} = b_t$. Then $n_{r,t} \mid n_{r-1, t} \sim Bin(n_{r-1,t}, \pi_{r,t})$ is defined by a binomial DGLM for each step $r \geq 1$ of the cascade. This cascade of binomial DGLMs represents the sequence of conditional probabilities $\pi_{r,t}$ that a shopper purchases more than $r$ units, given that they have purchased more than $r-1$. The sales $y_t$ are then:
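The cascade can be sketched as a chain of binomial draws, starting from $n_{0,t} = b_t$ (every transaction containing the item has more than zero units). This is an illustrative simulation with fixed, hypothetical conditional probabilities, not the PyBATS implementation, and `sample_cascade` is a made-up helper name.

```python
import numpy as np

def sample_cascade(b_t, pis, rng):
    """Return [n_1, ..., n_d] given n_0 = b_t and conditional probabilities pis."""
    counts = []
    n_prev = b_t
    for pi_r in pis:
        n_r = int(rng.binomial(n_prev, pi_r))  # n_r | n_{r-1} ~ Bin(n_{r-1}, pi_r)
        counts.append(n_r)
        n_prev = n_r
    return counts

rng = np.random.default_rng(1)
pis = [0.3, 0.2, 0.1, 0.05]   # hypothetical pi_{1,t}, ..., pi_{4,t}
n = sample_cascade(b_t=50, pis=pis, rng=rng)

# Marginal probability that a transaction has more than r units:
# the product of the conditional probabilities up to step r
tail_probs = np.cumprod(pis)
print(n, tail_probs)
```

Because each step conditions on the previous one, moderate conditional probabilities multiply into very small marginal probabilities for large baskets.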
$$ y_t = \begin{cases} 0, & \text{if } z_t = 0,\\ \sum_{r=1:d} r(n_{r-1, t} - n_{r,t}) + e_t, & \text{if }z_t = 1, \end{cases} $$
where $d$ is the predefined length of the cascade and $e_t$ represents excess units greater than $d$. The cascade enables modeling of very small probabilities, in the rare cases when shoppers purchase large quantities of an item on a single grocery store trip.
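A small worked check of the sales formula may help. The basket quantities below are hypothetical, and the sketch assumes that $e_t$ is the total number of units in transactions with more than $d$ units, which is the reading under which the formula recovers total sales. How PyBATS stores excess internally may differ.

```python
import numpy as np

d = 4                                             # cascade length
quantities = np.array([1, 1, 2, 1, 3, 1, 6, 2])   # hypothetical units per transaction

# n[r] = number of transactions with more than r units, so n[0] is b_t
n = np.array([(quantities > r).sum() for r in range(d + 1)])

# n[r-1] - n[r] transactions have exactly r units, for r = 1..d
within_cascade = sum(r * (n[r - 1] - n[r]) for r in range(1, d + 1))

# Assumption: e_t is the total units in transactions with more than d units
e_t = quantities[quantities > d].sum()

y_t = within_cascade + e_t
assert y_t == quantities.sum()   # the formula recovers total units sold

# The weighted sum telescopes into a sum of the counts themselves
assert within_cascade == n[:d].sum() - d * n[d]
```

The second assertion is plain algebra on the sum: each count $n_{r,t}$ enters once per unit threshold it exceeds, which is why sales can also be assembled by adding counts.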
To model transactions with a quantity greater than $d$, we follow the "Bayesian bootstrap" strategy of Berry, Helman, and West (2020), which is to record the observed outlier quantities purchased, and to forecast by sampling from the empirical distribution for the excess, $e_t$.
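As a rough sketch of that idea, the recorded outlier quantities can be resampled uniformly to produce forecast draws of the excess. The helper `sample_excess`, the example history, and the choice to return zeros when nothing has been recorded are all assumptions made for illustration. This is not the PyBATS code.

```python
import numpy as np

def sample_excess(observed_excess, nsamps, rng):
    """Resample excess quantities uniformly from the recorded outliers."""
    observed = np.asarray(observed_excess)
    if observed.size == 0:
        # Assumption: with no recorded outliers, forecast no excess
        return np.zeros(nsamps, dtype=int)
    return rng.choice(observed, size=nsamps, replace=True)

rng = np.random.default_rng(2)
history = [5, 7, 5, 12]   # hypothetical outlier quantities observed so far
draws = sample_excess(history, nsamps=1000, rng=rng)
print(np.unique(draws))
```

Sampling from the empirical distribution means the forecast can only reproduce quantities that have already been seen, which avoids specifying a parametric tail for very rare events.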
A DBCM can be used in the same way as a DGLM, with the standard methods `dbcm.update`, `dbcm.forecast_marginal`, and `dbcm.forecast_path`. There are equivalent helper functions as well. A full analysis can be run with `analysis_dbcm`, and `define_dbcm` helps to initialize a DBCM.
The only difference from using a standard `dglm` is that outside of `analysis_dbcm`, the update and forecast functions do not automatically recognize whether the DBCM includes latent factors or call a copula for path forecasting. This means that the modeler needs to be more explicit in calling the correct method, such as `dbcm.forecast_path_copula` for path forecasting with a copula.
A quick example of using `analysis_dbcm` follows:
import pandas as pd
import numpy as np
from pybats.shared import load_dbcm_latent_factor_example
from pybats.analysis import analysis_dbcm
from pandas.tseries.holiday import USFederalHolidayCalendar
data = load_dbcm_latent_factor_example()['data']
data.head()
The data has already been formatted for use in a DBCM:
- `Sales` are the final outcome we want to model.
- `Y_transaction` is the number of transactions, which is always less than or equal to the total `Sales`.
- `X_transaction` is a predictor variable for the DCMM on transactions. In this case, it's the item price.
- `mt1` through `mt4` are the numbers of shoppers who purchased more than $r$ units of the item, for $r = 1, \ldots, 4$.
- `X_cascade` is a predictor variable for the binary cascade. In this case it's an indicator of item promotions, such as "buy one get one free", which would influence the quantity each shopper buys.
- `excess` is a list of unit quantities for any shoppers with 5 or more items in their cart, which extends beyond the length of the cascade, here set to the default of $4$.
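The column descriptions imply a few consistency conditions that are easy to check before fitting. The sketch below runs them on a small hypothetical table with the same column names. The values are made up, and the `excess` column is left out because its exact storage format is not spelled out here.

```python
import numpy as np
import pandas as pd

# Hypothetical rows using the same column names as the example data
toy = pd.DataFrame({
    'Sales':         [7, 0, 12],
    'Y_transaction': [5, 0, 6],
    'mt1':           [2, 0, 3],
    'mt2':           [0, 0, 1],
    'mt3':           [0, 0, 1],
    'mt4':           [0, 0, 1],
})
cascade_cols = ['mt1', 'mt2', 'mt3', 'mt4']

checks = {
    # Every transaction contains at least one unit
    'transactions <= sales': bool((toy['Y_transaction'] <= toy['Sales']).all()),
    # Shoppers with more than 1 unit are a subset of all shoppers
    'mt1 <= transactions': bool((toy['mt1'] <= toy['Y_transaction']).all()),
    # Shoppers with more than r units are a subset of those with more than r-1
    'cascade non-increasing': bool(
        (np.diff(toy[cascade_cols].to_numpy(), axis=1) <= 0).all()
    ),
}
print(checks)
```

The same dictionary of checks could be run on the loaded `data` by substituting it for `toy`.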
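rho = .2            # random effect discount factor for the Poisson DGLM; smaller values increase forecast variance
k = 14              # forecast horizon, in time steps
nsamps = 200        # number of forecast samples to draw
prior_length = 21   # number of initial observations used to define the prior
dates = data.index
forecast_start_date = dates[-100]   # first date to forecast from
forecast_end_date = dates[-50]      # last date to forecast from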
mod, forecast_samples = analysis_dbcm(data['Y_transaction'].values.reshape(-1),
                                      data['X_transaction'].values.reshape(-1,1),
                                      data[['mt1', 'mt2', 'mt3', 'mt4']].values,
                                      data['X_cascade'].values.reshape(-1,1),
                                      data['excess'].values,
                                      prior_length=prior_length,
                                      k=k,
                                      forecast_start=forecast_start_date,
                                      forecast_end=forecast_end_date,
                                      nsamps=nsamps, rho=rho,
                                      dates=dates, delregn_pois=.98,
                                      ret=['model', 'forecast'])
The DBCM is a wrapper for the DCMM and the binary cascade. To illustrate, we can call `mod.forecast_marginal` with two flags activated:
- `mean_only=True`, which is available for all models in PyBATS. This returns the mean of the forecast distribution, instead of samples.
- `return_separate=True`, available only for DBCMs. This divides the forecast into the transaction, binary cascade, and excess components, returning each separately.
transaction_mean, cascade_mean, excess_mean = \
    mod.forecast_marginal(k=1,
                          X_transaction=data.loc[forecast_end_date]['X_transaction'],
                          X_cascade=data.loc[forecast_end_date]['X_cascade'],
                          mean_only=True,
                          return_separate=True)
# Total sales mean is the sum of the transaction, cascade, and excess means
out = pd.DataFrame({'sales': np.array(transaction_mean + cascade_mean.sum() + excess_mean).reshape(-1),
                    'transactions': transaction_mean,
                    'mt1': cascade_mean[0], 'mt2': cascade_mean[1],
                    'mt3': cascade_mean[2], 'mt4': cascade_mean[3],
                    'excess': excess_mean.reshape(-1)}).round(2).T
out.columns = ['Forecast Mean']
out
The forecast mean for item sales is $7.53$ units on $5.73$ separate transactions. On average, $1.4$ people will have more than 1 unit of the item in their basket, $0.23$ will have more than 2, and so on. Finally, the average prediction is for $0.05$ 'excess' units, which are any sales above $4$ units in a single basket.
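As a rough consistency check on these numbers: the code computes the sales mean as the sum of the transaction, cascade, and excess means, so the rounded values quoted above pin down the two cascade terms that are not stated explicitly. Up to rounding error,

$$ 7.53 \approx 5.73 + 1.40 + 0.23 + (\text{mt3} + \text{mt4}) + 0.05 \quad\Rightarrow\quad \text{mt3} + \text{mt4} \approx 0.12. $$

In other words, baskets with more than 3 or more than 4 units together contribute only about a tenth of a unit to the forecast mean.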