PolarBayes quickstart for tidybayes users

If you have used tidybayes before, PolarBayes should feel familiar. The key functions are still named spread_draws and gather_draws, and they still yield tidy data frames indexed by MCMC draw, with additional index columns for array-valued parameters. This quickstart walks you through the key differences between the two packages' APIs, and the reasons they arise.

Most differences ultimately stem from the fact that PolarBayes is built on top of ArviZ and aims to wrap or mirror ArviZ's API and conventions to the extent possible.

Both spread_draws and gather_draws call arviz.extract to get MCMC samples. They accept all the configuration that extract permits, so it is worth reading the [extract docs and examples][arviz.extract] to get a sense of what is possible.

To get started, provide a list of variables to spread or gather as the var_names argument:

import polarbayes as pb

pb.spread_draws(idata, var_names=["var1", "var2"])
pb.gather_draws(idata, var_names=["var1", "var2"])

Or provide no var_names to spread or gather all available variables:

pb.spread_draws(idata)
pb.gather_draws(idata)

Key differences between tidybayes and PolarBayes

`InferenceData` groups

PolarBayes extracts tidy data frames from MCMC output stored in an arviz.InferenceData object. The tidybayes equivalent is the posterior::draws_df format. Unlike draws_df objects, [InferenceData][arviz.InferenceData] objects are organized into "groups" representing different categories of Bayesian input and output: posterior for posterior samples, posterior_predictive for posterior predictive draws, prior_predictive for prior predictive draws, et cetera.

spread_draws and gather_draws can extract draws from any appropriate group. If no group is specified, they default to extracting from the posterior group.

No dots in column names

Reserved column names in tidybayes output begin with dots (.): .chain, .iteration, .draw, .variable, and .value. PolarBayes avoids dots in column names because dots have a special role in Python syntax. In Python, as in many object-oriented programming languages, dots are used to access the "attributes" and "methods" associated with an object.
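To make the syntax problem concrete, here is a minimal sketch in plain Python. It does not use PolarBayes or polars; the `row` object is a hypothetical stand-in for any object that exposes fields as attributes.

```python
from types import SimpleNamespace

# Hypothetical stand-in for an object whose fields are read with dot syntax.
row = SimpleNamespace(draw=7)

# A dotted name can be stored at runtime, but only through setattr/getattr.
setattr(row, ".draw", 7)

plain = row.draw                # ordinary attribute access works
dotted = getattr(row, ".draw")  # the dotted name needs an explicit string lookup

# Writing the dotted name with attribute syntax is not valid Python at all.
try:
    compile("row..draw", "<example>", "eval")
    dotted_syntax_ok = True
except SyntaxError:
    dotted_syntax_ok = False
```

Here `dotted_syntax_ok` ends up `False`: a name like .draw can never be reached with attribute-style access, only through string-based lookups.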

Note

As you may know, R also has object-oriented features. The closest R equivalent of the Python dot (.) is the dollar sign ($). You may have written df$.draw to retrieve a column named .draw from a data frame named df. So a polars column name like .draw is potentially a bad idea in the same way that a data frame column name like $draw could be a bad idea in R.

In PolarBayes, we instead reserve the bare column names chain and draw for indexing, consistent with ArviZ conventions for indexing MCMC output. If you try to extract variables with those names from an arviz.InferenceData object, PolarBayes will error and suggest renaming those variables prior to extraction.

Similarly, the default gather_draws variable and value column names are variable and value. These columns can be given custom names using the variable_name and value_name keyword arguments, respectively.

`draw` in PolarBayes corresponds to `.iteration` in tidybayes (not `.draw`!)

In tidybayes, the .draw column contains the unique ID of an MCMC sample across all chains (in relational database terms, it is a single-column "primary key"). The .iteration column contains the ID of the sample within a specific .chain.

In ArviZ, draw is equivalent to tidybayes's .iteration, and not tidybayes's .draw; it is the ID of the MCMC sample within a chain. Rather than create a single primary key column as tidybayes does, ArviZ instead uses draw and chain as a composite primary key. Here too we follow ArviZ conventions in PolarBayes.
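If you need a tidybayes-style single-column ID, it can be derived from the composite key. The following is a minimal sketch and not part of PolarBayes: the helper name `tidybayes_style_ids` is hypothetical, and it assumes zero-based chain and draw values (the usual ArviZ default), one-based tidybayes IDs, chains of equal length, and a .draw numbering that counts through one chain before moving to the next.

```python
def tidybayes_style_ids(chain: int, draw: int, draws_per_chain: int) -> dict[str, int]:
    """Convert a zero-based (chain, draw) pair to one-based tidybayes-style IDs.

    Assumes every chain holds `draws_per_chain` samples and that the unique
    ID counts through chain 0 first, then chain 1, and so on.
    """
    return {
        ".chain": chain + 1,
        ".iteration": draw + 1,
        ".draw": chain * draws_per_chain + draw + 1,
    }


# Second chain, first draw, with 1000 draws per chain.
ids = tidybayes_style_ids(chain=1, draw=0, draws_per_chain=1000)
```

Under these assumptions, the same arithmetic can be applied column-wise to the chain and draw columns of a PolarBayes data frame.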

Dimension names are automatic

Array-valued parameters are stored in InferenceData objects with named dimensions. spread_draws and gather_draws respect those named dimensions. As a result, you cannot (but also do not need to) name the dimensions of array-valued variables when requesting them in a spread or gather call.

So this tidybayes R code:

draws <- mcmc_output |> spread_draws(x1[time], x2[time, location])

might become this PolarBayes Python code:

draws = pb.spread_draws(mcmc_output_arviz, var_names=["x1", "x2"])

The PolarBayes output will still have time and location columns along with the MCMC sample ID columns, provided those are the names of the dimensions in the mcmc_output_arviz InferenceData object. If the dimension names in your InferenceData object are not the ones you want in your output data frame, you can simply rename them via polars.DataFrame.rename.

Dimension index conversion deferred to ArviZ

For similar reasons, PolarBayes does not provide functionality equivalent to tidybayes's recover_types(), which converts integer indexes for array-valued MCMC output into more interpretable quantities (e.g. named categories, timestamps, geographic coordinates). PolarBayes instead relies on ArviZ's built-in functionality for dimension indexing.

ArviZ performs recover_types()-like operations when creating InferenceData objects from probabilistic programming language (PPL) MCMC output. The degree and sophistication of this conversion depend on the source PPL and the metadata it provides. ArviZ also has functionality for manual dimension annotation.

Other resources

polars for tidyverse users

If you're familiar with tidybayes and the tidyverse but new to polars, consider consulting a polars tutorial aimed at tidyverse users. We like this one by Emily Riederer.

ArviZ and xarray documentation

It is worth consulting the ArviZ documentation. ArviZ is built on top of a multi-dimensional array library called xarray. The xarray docs are also helpful for learning ArviZ.