Pricing is a common problem faced by businesses, and one that can be addressed effectively by Bayesian statistical methods. We'll step through a simple example and build the background necessary to get involved with this approach and extend it.
Let's start with some hypothetical data. A small company has tried a few different price points (say, one week each) and recorded the demand at each price. We'll abstract away some economic issues in order to focus on the statistical approach. Here's the data:
That middle data point looks like a possible outlier; we'll come back to that.
For now, let's start with some notation. We'll use P and Q to denote price and quantity, respectively. A "0" subscript will denote observed data, so we were able to sell Q0 units when the price was set to P0.
Building a Model
I'm not an economist, but resources I've seen describing the relationship between Q and P are often of the form

$$Q = a P^c$$

This ignores that Q is not deterministic, but rather follows a distribution determined by P. For statistical analysis, it makes more sense to write this in terms of a conditional expectation,

$$\mathbb{E}[Q \mid P] = a P^c$$

Taking the log of both sides makes this much more familiar:

$$\log \mathbb{E}[Q \mid P] = \log a + c \log P$$
The right side is now a linear function of the unknowns log a and c, so considering these as parameters gives us a generalized linear model. Conveniently, the reasonable assumption that Q is discrete and comes from independent unit purchases leads us to a Poisson distribution, which fits well with the log link.
Econometrically, this linear relationship in log-log space corresponds to constant price elasticity. This constant elasticity is just the parameter c, so fitting the model will also give us an estimate of the elasticity.
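To see why c plays the role of the elasticity, recall that price elasticity of demand is the derivative of log quantity with respect to log price. Here's a short derivation using the same symbols (a, c, P, Q) and the log-linear relationship described above; the symbol ε for elasticity is my own notation:

```latex
\begin{aligned}
\log \mathbb{E}[Q \mid P] &= \log a + c \log P \\
\varepsilon \;=\; \frac{\partial \log \mathbb{E}[Q \mid P]}{\partial \log P} &= c
\end{aligned}
```

Since the derivative doesn't depend on P, the elasticity is the same at every price, which is what "constant elasticity" means.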
First, we used very broad Cauchy priors. For complex models, too many degrees of freedom can lead to convergence and interpretability problems, similar to the identifiability problems that sometimes happen in maximum likelihood estimation. For a simple case like this, it should be enough to make a note of it, in case we run into trouble.
Next, we expect price elasticity of demand c to be negative, and we may even have a particular range in mind. For example, it could be reasonable to instead go with something like
We'd then have a minor change to express things in terms of negc.
The μ0 line may seem wordier than expected. Yes, we could have just used μ0 = a * p0 ** c.
Making the linear predictor explicit makes later changes easier, for example if we decided to include variable elasticity. Wrapping the value in Deterministic just means the sampler will save values of μ0 for us. This is a convenience at the cost of additional RAM use, so we'd leave it out for a complex model.
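To make the structure of the model concrete without tying it to a particular sampler, here's a plain-Python sketch of the unnormalized log posterior: Cauchy priors on log a and c, an explicit linear predictor, and a Poisson likelihood under the log link. The data values, the prior scale of 5, and the function names are all assumptions for illustration, not the values used in this post.

```python
import math

# Hypothetical data for illustration only (not the data from this post).
p0 = [20.0, 30.0, 40.0, 50.0, 60.0]
q0 = [13, 7, 4, 4, 2]

def cauchy_logpdf(x, loc=0.0, scale=5.0):
    """Log density of a Cauchy(loc, scale) distribution (scale is assumed)."""
    z = (x - loc) / scale
    return -math.log(math.pi * scale * (1.0 + z * z))

def poisson_logpmf(k, mu):
    """Log probability of observing count k under Poisson(mu)."""
    return k * math.log(mu) - mu - math.lgamma(k + 1)

def log_posterior(loga, c, prices=p0, quantities=q0):
    """Unnormalized log posterior: Cauchy priors plus Poisson likelihood."""
    total = cauchy_logpdf(loga) + cauchy_logpdf(c)
    for p, q in zip(prices, quantities):
        log_mu = loga + c * math.log(p)   # explicit linear predictor
        mu = math.exp(log_mu)             # inverse of the log link
        total += poisson_logpmf(q, mu)
    return total
```

Keeping log_mu as its own line mirrors the point above: a change such as variable elasticity would only touch the linear predictor, not the likelihood.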
This gives us the posterior mean and standard deviation, along with some other useful information:
mc_error estimates simulation error by breaking the trace into batches, computing the mean of each batch, and then the standard deviation of these means.
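Here's a minimal sketch of that batch-means idea in plain Python. The text describes the standard deviation of the batch means; the usual batch-means standard error also divides by the square root of the number of batches, and this sketch does that too. Whether the library applies exactly this scaling, and the default of 10 batches, are assumptions on my part.

```python
import math
import statistics

def batch_means_error(trace, n_batches=10):
    """Batch-means estimate of Monte Carlo standard error for one parameter."""
    batch_size = len(trace) // n_batches
    # Mean of each consecutive batch; leftover samples at the end are dropped.
    means = [
        statistics.fmean(trace[i * batch_size:(i + 1) * batch_size])
        for i in range(n_batches)
    ]
    # Spread of the batch means, scaled to a standard error (assumed scaling).
    return statistics.pstdev(means) / math.sqrt(n_batches)
```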
hpd_* gives highest posterior density intervals. The 2.5 and 97.5 labels are a bit misleading. There are lots of 95% credible intervals, depending on the relative weights of the left and right tails. The 95% HPD interval is the narrowest among these 95% intervals.
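The "narrowest interval" idea is easy to sketch directly from samples: sort them, slide a window that covers the required fraction of points, and keep the shortest window. This is a simple version that assumes a unimodal posterior; the function name is my own and the library's implementation may differ in detail.

```python
import math

def hpd_interval(samples, mass=0.95):
    """Narrowest interval containing `mass` of the samples (unimodal case)."""
    xs = sorted(samples)
    n = len(xs)
    k = max(1, math.ceil(mass * n))   # number of points inside the interval
    # Slide a window of k consecutive sorted points; keep the narrowest.
    best_lo, best_hi = xs[0], xs[k - 1]
    for i in range(1, n - k + 1):
        lo, hi = xs[i], xs[i + k - 1]
        if hi - lo < best_hi - best_lo:
            best_lo, best_hi = lo, hi
    return best_lo, best_hi
```

For a skewed sample the result leans toward the dense side of the distribution, which is why the HPD endpoints generally don't sit at the 2.5 and 97.5 percentiles.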
n_eff gives the effective number of samples. We took 2000 samples, but there are some significant autocorrelations. For example, our samples from the posterior of c have about as much information as if we had taken 241 independent samples.
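To get a feel for how autocorrelation eats into the sample size, here's a single-chain sketch that divides N by one plus twice the sum of the autocorrelations, truncating at the first negative one. The library's estimator combines several chains and differs in detail, so treat this as an illustration only. The AR(1) chain is hypothetical and stands in for an autocorrelated trace.

```python
import random

def effective_sample_size(chain):
    """Single-chain ESS: N / (1 + 2 * sum of leading positive autocorrelations)."""
    n = len(chain)
    mean = sum(chain) / n
    centered = [x - mean for x in chain]
    var = sum(d * d for d in centered) / n   # assumes the chain is not constant
    rho_sum = 0.0
    for lag in range(1, n):
        acov = sum(centered[i] * centered[i + lag] for i in range(n - lag)) / n
        rho = acov / var
        if rho < 0:          # truncate at the first negative autocorrelation
            break
        rho_sum += rho
    return n / (1.0 + 2.0 * rho_sum)

# Hypothetical AR(1) chain with strong autocorrelation, for illustration.
rng = random.Random(0)
ar_chain = [0.0]
for _ in range(1999):
    ar_chain.append(0.9 * ar_chain[-1] + rng.gauss(0.0, 1.0))

ess = effective_sample_size(ar_chain)
```

With 2000 strongly correlated draws, ess comes out far below 2000, which is the same effect we see for c.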
Rhat is sometimes called the potential scale reduction factor, and gives us a factor by which the variance might be reduced, if our MCMC chains had been longer. It's computed in terms of the variance between chains vs within each chain. Values near 1 are good.
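The between-chain versus within-chain comparison can be sketched in a few lines. This follows the classic Gelman-Rubin form; the library may use a refined variant (for example, splitting chains), so the function below is an illustration rather than its exact computation.

```python
import math
import statistics

def gelman_rubin(chains):
    """Potential scale reduction factor for equal-length chains of one parameter."""
    n = len(chains[0])
    chain_means = [statistics.fmean(ch) for ch in chains]
    # W: average within-chain variance.
    within = statistics.fmean([statistics.variance(ch) for ch in chains])
    # B: between-chain variance, scaled by chain length.
    between = n * statistics.variance(chain_means)
    # Pooled estimate of the posterior variance.
    var_hat = (n - 1) / n * within + between / n
    return math.sqrt(var_hat / within)
```

When the chains agree, the between-chain term is small and the ratio sits near 1. Chains stuck in different regions inflate it well above 1.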
Back to our problem. Results here look pretty good, with the exception of n_eff.
Let's have a closer look at what's going on. Here's the joint posterior of the parameters:
This strong correlation is caused by the data not being centered. All of the log P0 values are positive, so any increase in the slope leads to a decrease in the intercept, and vice versa. If you don't see this immediately, draw an (x,y) cloud of points with positive x, and compare slope and intercept of some lines passing through the cloud.
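The same effect shows up in ordinary least squares, where the correlation between the intercept and slope estimates depends only on the x values: it equals minus the mean of x divided by the root mean square of x. The sketch below uses that least-squares formula as an analogy, not the Poisson posterior itself, and the prices are hypothetical.

```python
import math

def intercept_slope_correlation(xs):
    """Correlation between OLS intercept and slope estimates for design xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    mean_sq = sum(x * x for x in xs) / n
    return -x_bar / math.sqrt(mean_sq)

log_prices = [math.log(p) for p in (20, 30, 40, 50, 60)]  # hypothetical prices
center = sum(log_prices) / len(log_prices)
centered = [x - center for x in log_prices]

raw_corr = intercept_slope_correlation(log_prices)      # strongly negative
centered_corr = intercept_slope_correlation(centered)   # essentially zero
```

With all x values positive and far from zero relative to their spread, the correlation is close to -1. Subtracting the mean removes it.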
The correlation doesn't cause too much trouble for NUTS, but it can be a big problem for some other samplers, such as Gibbs samplers. This is also a case where fast variational inference approximations tend to drastically underestimate the variance.
Just to note, this doesn't mean Bayesian methods can't handle correlations. Rather, it's that the representation of the model can have an impact. There are lots of ways of dealing with this, including explicit parameterization of the correlation, as well as several reparameterization tricks. Let's look at a simple one.
The trouble came from off-center log P0 values, so let's try taking that into account in the model. A quick adjustment does the job:
Note that we've changed the names of the parameters, to keep things straight. If we wanted to transform between the two parameterizations, we could equate the two log μ0 expressions and solve the system of equations (constant terms equal, and log P0 terms equal).
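As a worked version of that transformation, suppose the centered model uses an intercept α and slope β on the mean-centered log price. These names, and the exact form of the centered predictor, are assumptions for illustration. Equating the two expressions and matching terms gives:

```latex
\begin{aligned}
\log a + c \log P_0 &= \alpha + \beta \left( \log P_0 - \overline{\log P_0} \right) \\
\text{(terms in } \log P_0\text{)} \quad c &= \beta \\
\text{(constant terms)} \quad \log a &= \alpha - \beta \, \overline{\log P_0}
\end{aligned}
```

So under this assumed form the slope keeps its meaning as the elasticity, and only the intercept shifts.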
Sampling is done using an independent (Markov) chain for each hardware core (by default, though this can be changed). The trace plot gives us a quick visual check that the chains are behaving well. Ideally, the distributions across chains should look very similar (left column) and have very little autocorrelation (right column). Note that since μ0 is a vector, the plots in the bottom row superimpose its five components.
Sometimes we need a more compact form, especially if there are lots of parameters. In this case the forest plot is helpful.
The trace we found is a sample from the joint posterior distribution of the parameters. Instead of a point estimate, we have a sample of "possible worlds" that represent reality. To perform inference, we query each and aggregate the results. So, for example...
Each "hair" represents E[Q∣P] according to one sample from the posterior. On introducing the data I suggested the point P0=40 may be an outlier. At first glance, the above plot might seem to confirm this. But remember, this is not the conditional distribution of (Q∣P), but of its expectation.
The following boxplot makes this point more explicit:
Think of this as asking each possible world, "In your reality, what's the probability of seeing a smaller Q value than this?" The responses are then aggregated by price. For P0=40, about half the results are above 0.1. The point is a bit unusual, but hardly an outlier.
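Here's a small sketch of that question in code: for each posterior draw of the mean at a given price, compute the Poisson probability of seeing a quantity strictly smaller than the observed one. The function names and the sample values are hypothetical.

```python
import math

def poisson_cdf_below(q, mu):
    """P(Q < q) for Q ~ Poisson(mu): the sum of the pmf over 0..q-1."""
    term = math.exp(-mu)   # pmf at 0
    total = 0.0
    for k in range(q):
        total += term
        term *= mu / (k + 1)   # pmf at k+1 from pmf at k
    return total

def tail_probabilities(mu_samples, q_observed):
    """One answer per 'possible world': P(Q < q_observed) under that draw."""
    return [poisson_cdf_below(q_observed, mu) for mu in mu_samples]

# Hypothetical posterior draws of E[Q | P] at a single price.
mu_samples = [5.0, 6.5, 8.0]
probs = tail_probabilities(mu_samples, q_observed=4)
```

Collecting these probabilities for each observed price gives the kind of per-price distribution that a boxplot can summarize.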
This shouldn't be considered a recipe for the way to do posterior predictive checks. There are countless possibilities, and the right approach depends on the problem at hand.
For example, another very common approach is to use the posterior parameter samples to generate a replicated response Q0rep, and then compare this to the observed response Q0:
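A minimal sketch of generating replicated responses, assuming we already have posterior draws of the mean vector μ0: for each draw, simulate one Poisson count per price. The draws below are hypothetical, and the hand-rolled sampler (Knuth's multiplication method) is only there to keep the example free of dependencies.

```python
import math
import random

def sample_poisson(mu, rng):
    """Draw one Poisson(mu) variate using Knuth's multiplication method."""
    threshold = math.exp(-mu)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k

def replicate_responses(mu_draws, rng):
    """For each posterior draw of the mean vector, simulate one replicated Q0."""
    return [[sample_poisson(mu, rng) for mu in mu_vec] for mu_vec in mu_draws]

# Hypothetical posterior draws of the mean vector (one vector per draw).
rng = random.Random(42)
mu_draws = [[12.0, 8.0, 6.0, 4.5, 3.5], [13.0, 7.5, 5.5, 4.0, 3.0]]
q0_rep = replicate_responses(mu_draws, rng)
```

Each row of q0_rep is one simulated data set of the same shape as Q0, so any summary computed on the observed data can be computed on the replicates and compared.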
The dashed blue line indicates our first result for an optimal price.
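One way an optimal price like this could be found is to average profit over the posterior samples at each candidate price and take the maximizer. The sketch below assumes profit per unit is price minus a constant unit cost, which is my assumption for illustration; the posterior samples, the cost, and the price grid are all hypothetical.

```python
import math

def expected_profit(price, posterior_samples, unit_cost):
    """Posterior mean of (price - unit_cost) * E[Q | price]."""
    profits = [
        (price - unit_cost) * math.exp(loga + c * math.log(price))
        for loga, c in posterior_samples
    ]
    return sum(profits) / len(profits)

def best_price(price_grid, posterior_samples, unit_cost):
    """Grid search for the price with the highest posterior mean profit."""
    return max(
        price_grid,
        key=lambda p: expected_profit(p, posterior_samples, unit_cost),
    )

# Hypothetical posterior samples of (log a, c) and a hypothetical unit cost.
posterior_samples = [(8.0, -1.8), (8.3, -1.9), (7.7, -1.7)]
unit_cost = 10.0
price_grid = [p / 2 for p in range(21, 201)]   # 10.5 to 100.0 in steps of 0.5
optimal = best_price(price_grid, posterior_samples, unit_cost)
```

For a single sample with c below -1, the maximizer has the closed form unit_cost * c / (1 + c), which is a handy sanity check on the grid search.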
So... are we done? NO!!! There are some lingering issues that we'll address next time. For example, is expected profit the right thing to maximize? In particular, our estimate is well above the best worst-case scenario (highest unshaded point below the curves). Should we be concerned about this?
Also, the mean is very flat near our solution, suggesting the potential for instability in the solution. Can we quantify this, and does it lead us to a different solution?