Zero-Inflated Bootstrap
=======================

Many real-world distributions are *zero-inflated*: most observations
are exactly zero, and the inferentially interesting variation lives
in a small fraction of non-zero values. Customer revenue per
visitor, insurance claim amounts, ad clicks per impression, daily
purchases per user — the canonical pattern is the same. A few
percent of the units carry the signal; the rest contribute zero,
which is to say nothing.

The naive bootstrap on such data is wasteful. Each bootstrap
replicate draws :math:`n` observations with replacement from
:math:`\hat{F}`, but the vast majority of those draws are zero by
construction and contribute nothing to most statistics. With
:math:`n = 50{,}000` observations and a 2% non-zero rate, a single
bootstrap sample takes about half a millisecond on modern hardware,
of which roughly 99% is spent generating zeros. This guide describes
the *zero-inflated bootstrap*, a clean way to skip that work
entirely. The trick is exact, not approximate, and integrates into
the library through a small
:class:`~bootstrap_stat.distributions.EmpiricalDistribution`
subclass.

The trick
---------

Let :math:`m` denote the number of non-zero values among the
:math:`n` observations and :math:`p = m/n` the non-zero fraction. A
naive bootstrap sample is :math:`X_1^*, \ldots, X_n^*`, drawn iid
from the empirical distribution :math:`\hat{F}`. Decompose each
draw as

.. math::

   X_i^* = \xi_i \cdot Y_i,
   \qquad
   \xi_i \overset{\text{iid}}{\sim} \mathrm{Bernoulli}(p),
   \qquad
   Y_i \overset{\text{iid}}{\sim} \hat{F}_{\text{nonzero}},

where :math:`\hat{F}_{\text{nonzero}}` is the empirical
distribution of the non-zero observations. When :math:`\xi_i = 0`
the draw is zero; when :math:`\xi_i = 1` the draw is one of the
:math:`m` non-zero values, chosen uniformly with replacement. The
two formulations have identical distributions because the
probability of drawing a zero from :math:`\hat{F}` is exactly
:math:`(n - m)/n = 1 - p`.

Now observe that almost every statistic of interest depends on the
:math:`X_i^*` only through their non-zero values. The bootstrap
mean, for example, is

.. math::

   \bar{X}^* = \frac{1}{n} \sum_{i=1}^{n} X_i^*
             = \frac{1}{n} \sum_{i:\, \xi_i = 1} Y_i
             = \frac{S^*}{n},

where :math:`S^* = \sum_{i:\, \xi_i = 1} Y_i` is a sum of
:math:`K = \sum_i \xi_i` iid draws from :math:`\hat{F}_{\text{nonzero}}`,
and :math:`K \sim \mathrm{Binom}(n, p)`. The dependence on the zeros
is purely structural: they show up as a divisor :math:`n`, not as
data we need to materialize.

The optimization is to simulate the structural decomposition
directly. For each replicate:

1. Draw :math:`K \sim \mathrm{Binom}(n, p)`.
2. Draw :math:`Y_1, \ldots, Y_K` iid from
   :math:`\hat{F}_{\text{nonzero}}`.

The compact sample :math:`(Y_1, \ldots, Y_K)` is everything the
statistic needs. Each replicate now does :math:`O(K)` work instead
of :math:`O(n)`, with :math:`\mathrm{E}[K] = np = m`. The asymptotic
speedup is :math:`1/p`.

The example
-----------

We illustrate on a synthetic dataset that mimics e-commerce visitor
revenue: :math:`n = 50{,}000` visitors, of whom about 2% make a
purchase, with order values drawn from a lognormal distribution.

.. code-block:: python

    import numpy as np
    import bootstrap_stat as bp

    rng = np.random.default_rng(42)
    n = 50_000
    nonzero_mask = rng.binomial(1, 0.02, n).astype(bool)
    data = np.zeros(n)
    data[nonzero_mask] = rng.lognormal(mean=4, sigma=1,
                                       size=nonzero_mask.sum())

    # m ≈ 994 (p̂ ≈ 0.0199), sample mean ≈ 1.84

The interest is in the population mean :math:`\mu = \mathrm{E}[X]`
— the average revenue per visitor — and the goal is a standard
error and a confidence interval. The sample mean is
:math:`\hat{\mu} \approx \$1.84`, but the high variance of the
non-zero values (sample sd :math:`\approx 20.8`) means the standard
error is non-negligible relative to that estimate.

The naive bootstrap (baseline)
------------------------------

The naive approach uses
:class:`~bootstrap_stat.distributions.EmpiricalDistribution`
unchanged:

.. code-block:: python

    dist_naive = bp.EmpiricalDistribution(data)
    se = bp.standard_error(dist_naive, np.mean, B=500)
    # se ≈ 0.092   ~0.29 s

It works correctly. With :math:`n = 50{,}000` and :math:`B = 500`,
each replicate samples 50,000 observations and computes their mean.
The wall time is dominated by drawing — and immediately discarding
— roughly :math:`n - m = 49{,}000` zero values per replicate.

The optimized bootstrap
-----------------------

The decomposition above translates into a small subclass:

.. code-block:: python

    class ZeroInflatedDist(bp.EmpiricalDistribution):
        def __init__(self, data, rng=None):
            super().__init__(data, rng=rng)
            self.nonzero = np.asarray(data)[np.asarray(data) != 0]
            self.m = len(self.nonzero)
            self.p_nonzero = self.m / self.n

        def sample(self, size=None, **kwargs):
            n = self.n if size is None else size
            k = self._rng.binomial(n, self.p_nonzero)
            return self._rng.choice(self.nonzero, size=k, replace=True)

The subclass forwards ``rng`` to the base class — which normalizes it
via :func:`numpy.random.default_rng` and stores it on ``self._rng`` —
and then draws from ``self._rng`` rather than the global
``np.random``. This is the recommended pattern for any custom
empirical distribution: it keeps results reproducible from a single
``rng=`` argument and preserves the spawn-based parallelism the
multithreaded paths rely on.

The override returns the *compact* sample of non-zero values only.
The statistic must accept this representation and account for the
implicit zeros via a fixed denominator:

.. code-block:: python

    def mean_stat(x):
        return x.sum() / n   # n closed over from outer scope

The closure on :math:`n` is the trick. The same function evaluates
correctly whether :math:`x` is the compact bootstrap sample (size
:math:`K`), the full data (size :math:`n`), or a leave-one-out
slice (size :math:`n - 1`): for the zero-inflated mean,
:math:`\mathrm{sum}(x) / n` is the right answer in every case
because the non-included values are all zero anyway. We will see
shortly that this same property makes BCa interval computation work
without further modification.

.. code-block:: python

    dist_opt = ZeroInflatedDist(data)
    se = bp.standard_error(dist_opt, mean_stat, B=500)
    # se ≈ 0.089   ~0.01 s   (29× faster)

The standard error agrees with the naive estimate to within Monte
Carlo noise, as it must — the optimization is an exact
distributional equivalence, not an approximation. The :math:`B = 500`
budget that took 0.29 seconds with the naive bootstrap takes 0.01
seconds with the optimized one. At :math:`B = 5{,}000` the
optimized version takes 0.11 seconds, still faster than the naive
:math:`B = 500` and giving a more precise estimate (:math:`\hat{\mathrm{se}}
\approx 0.094`).

Confidence intervals
--------------------

Percentile and BCa intervals require no further changes for the
percentile case:

.. code-block:: python

    lo, hi = bp.percentile_interval(dist_opt, mean_stat, B=500)
    # (1.701, 1.998)   ~0.01 s   (vs naive 0.32 s)

For the BCa interval, the bootstrap step is sped up the same way,
but the BCa acceleration is computed by *jackknifing* the original
data — :math:`n = 50{,}000` evaluations of the statistic on
leave-one-out slices, which is by far the dominant cost on data
this size. We can sidestep that work too, because for the
zero-inflated mean the jackknife values have a closed form. Letting
:math:`S = \sum_{j} v_j` denote the sum of non-zero values:

.. math::

   \hat{\theta}_{(i)} =
   \begin{cases}
   S / n & \text{if } x_i = 0,\\
   (S - x_i) / n & \text{if } x_i = v_j \neq 0.
   \end{cases}

(Strictly, the natural jackknife uses :math:`n - 1` in the
denominator. The constant factor :math:`(n-1)/n` does not affect
the BCa acceleration, which is invariant under affine
transformations of the jackknife values.)

Pre-computing :math:`\boldsymbol{j}\boldsymbol{v}` and passing it
through the ``jv`` argument bypasses the
:func:`~bootstrap_stat.sampling.jackknife_values` call entirely:

.. code-block:: python

    sum_nz = dist_opt.nonzero.sum()
    jv = np.full(n, sum_nz / n)                       # all zero observations
    nonzero_idx = np.where(data != 0)[0]
    jv[nonzero_idx] = (sum_nz - data[nonzero_idx]) / n

    lo, hi = bp.bcanon_interval(dist_opt, mean_stat, data, B=500, jv=jv)
    # (1.718, 2.026)   ~0.01 s   (vs naive 3.18 s — 318× faster)

The interval matches the naive computation to Monte Carlo noise,
arriving in 0.01 seconds against the naive 3.18.

Speedup as a function of sparsity
---------------------------------

The theoretical speedup is :math:`1/p`. In practice fixed overhead
from the bootstrap loop and the random number generation puts a
ceiling on it, so the speedup saturates at very low :math:`p`:

.. list-table::
   :header-rows: 1
   :widths: 12 18 18 14 14

   * - :math:`p`
     - Naive (B=200)
     - Optimized
     - Speedup
     - Theory (:math:`1/p`)
   * - 0.5
     - 0.12 s
     - 0.06 s
     - 2×
     - 2
   * - 0.1
     - 0.12 s
     - 0.02 s
     - 5×
     - 10
   * - 0.05
     - 0.12 s
     - 0.013 s
     - 10×
     - 20
   * - 0.01
     - 0.11 s
     - 0.004 s
     - 27×
     - 100
   * - 0.005
     - 0.12 s
     - 0.005 s
     - 25×
     - 200

For data with single-digit-percent non-zero rates the optimization
delivers a one-to-two order of magnitude speedup; below
:math:`p \approx 0.005` the absolute time is so small that fixed
overhead dominates and the ratio plateaus around 25×. The naive
bootstrap remains :math:`O(n)` per replicate regardless of
:math:`p`, so the *absolute* savings continue to grow as the data
becomes sparser even when the *ratio* does not.

A general principle
-------------------

The zero-inflated optimization is one instance of a broader pattern.
The bootstrap principle says: replace :math:`F` with :math:`\hat{F}`
and resample. *How* we resample from :math:`\hat{F}` is up to us,
as long as the bootstrap distribution is preserved. When
:math:`\hat{F}` has structural redundancy — repeated values, a
mixture decomposition, a sparsity pattern, a known stratification —
that structure can be exploited to make sampling cheaper without
changing the result. Stratified bootstrap (sample within strata in
proportion to their original weights), the Bayesian bootstrap (a
re-weighting view of the same idea), and the block bootstrap for
time series are all examples of the same move: factor :math:`\hat{F}`
into pieces, sample each piece efficiently, and recombine. The
zero-inflated trick factors :math:`\hat{F}` as

.. math::

   \hat{F} = (1 - p) \cdot \delta_0 + p \cdot \hat{F}_{\text{nonzero}}

and notes that simulating from a point mass at zero requires no
work at all.

Caveats
-------

The optimization is exact for any statistic that depends on the
data only through sums or related linear functionals — means, sums,
weighted means, second moments built on top of sums and sums of
squares, regression coefficients written as such. For order
statistics (medians, quantiles, extremes) the structure is more
delicate: the median of a zero-inflated sample is zero whenever
:math:`p < 0.5`, and the bootstrap has to track which side of that
threshold each replicate falls on. The compact representation
suffices in principle but the statistic must be written to
interpret it correctly.

The optimization assumes the statistic *requires* knowledge of
:math:`n` — the total sample size — which is why we pass it via
closure rather than via ``len(x)``. Statistics computed by external
libraries that hard-code ``len(x)`` as the sample size cannot be
plugged in directly; a wrapper is needed.

Finally, the speedup is meaningful only when sample-generation cost
dominates statistic-computation cost. For statistics whose
evaluation on a sample of size :math:`n` is itself :math:`O(n)` and
cheap (means, variances), the optimization is effectively pure
profit. For statistics whose per-replicate cost is dominated by the
fitting of a complex model, the bootstrap loop's outer cost
:math:`O(B \cdot n)` is shadowed by :math:`O(B \cdot \text{model
cost})` and the relative gain shrinks accordingly.

For data with many zeros and statistics built on sums, however, the
optimization is essentially free in implementation cost and large
in payoff. A 30× to 300× wall-time reduction on routine analyses is
the kind of improvement that turns "let it run overnight" into "let
it run during a coffee break".