API Reference#

gamdist: Generalized Additive Models fit via ADMM.

The package fits the GLM/GAM zoo – binary, continuous, or count outcomes paired with continuous, categorical, or spline-transformed features, with arbitrary convex regularization on each term – by splitting the joint convex problem into per-feature primal steps and a per-outcome proximal step coordinated by ADMM dual variables. The modular decomposition follows Chu, Keshavarz, & Boyd’s A Distributed Algorithm for Fitting Generalized Additive Models.

Public API:

  • GAM – single-response generalized additive model.

  • MultiTaskGAM – joint fit of K tasks with optional cross-task coupling regularizers.

  • SplineError – raised when spline knot selection or basis evaluation fails.

  • fit_adaptive_lasso() – two-stage adaptive-lasso wrapper around GAM.fit().

GAM#

Single-response GAM and the ADMM orchestration loop.

This module contains GAM – the user-facing model – together with the family / link tables and the ADMM driver that ties per-feature primal steps to the per-outcome proximal step.

The public entry points are GAM and fit_adaptive_lasso().

The algorithm follows Chu, Keshavarz, & Boyd’s distributed-fitting paper (citation key GAMADMM in GAM’s docstring); the convexity-only design rule is captured in CLAUDE.md.

class gamdist.gamdist.GAM(family: Literal['normal', 'binomial', 'poisson', 'gamma', 'exponential', 'inverse_gaussian', 'quasi_binomial', 'quasi_poisson', 'quantile', 'huber'] | None = None, link: Literal['identity', 'logistic', 'probit', 'complementary_log_log', 'log', 'reciprocal', 'reciprocal_squared'] | None = None, dispersion: float | None = None, estimate_overdispersion: bool = False, name: str | None = None, load_from_file: str | None = None, tau: float | None = None, delta: float | None = None)[source]#

Generalized Additive Model fit via ADMM.

Configure with a family and link, register features with add_feature(), fit with fit(), then call predict(), deviance(), aic(), summary(), etc.
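A minimal end-to-end sketch of that workflow (the data and feature choices are illustrative, and the top-level from gamdist import GAM path assumes the package re-exports the public API listed above):

    import numpy as np
    import pandas as pd

    from gamdist import GAM

    # Illustrative data: a binary outcome with a smooth age effect and a
    # log-linear income effect.
    rng = np.random.default_rng(0)
    X = pd.DataFrame({
        "age": rng.uniform(18, 80, size=500),
        "income": rng.lognormal(mean=10.0, sigma=1.0, size=500),
    })
    eta = 0.03 * (X["age"].to_numpy() - 50.0) + 0.5 * np.log1p(X["income"].to_numpy() / 1e4) - 0.5
    y = (rng.uniform(size=500) < 1.0 / (1.0 + np.exp(-eta))).astype(float)

    model = GAM(family="binomial")             # canonical logistic link
    model.add_feature("age", "spline")         # smooth nonlinear contribution
    model.add_feature("income", "linear", transform=np.log1p)
    model.fit(X, y, max_its=200)

    mu = model.predict(X)                      # fitted P(y = 1) per row
    model.summary()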

Parameters:
  • family (str, optional) –

    Family of the model. One of:

    • "normal" – continuous responses.

    • "binomial" – binary responses (or counts with covariate_class_sizes at fit time).

    • "poisson" – counts.

    • "gamma" – positive continuous responses (in progress).

    • "exponential" – gamma with dispersion fixed at 1.

    • "inverse_gaussian" – positive continuous responses with cubic mean–variance relationship (in progress).

    • "quasi_binomial" – binomial score; dispersion estimated from the data via Pearson \(\chi^2 / (n - p)\) so dispersion() reflects over- or under-dispersion relative to the exact-binomial baseline of 1.

    • "quasi_poisson" – Poisson score; same Pearson dispersion estimator, useful when count data has variance > mean.

    • "quantile" – pinball loss; requires tau.

    • "huber" – robust M-estimator; requires delta.

    Multinomial (ordinal / nominal) and proportional-hazards families are not yet supported. Required unless loading an existing model from file (see load_from_file). Point estimates for "quasi_binomial" / "quasi_poisson" coincide with their full-likelihood cousins; only the dispersion estimator (and any inferential quantity that depends on it) differs.

  • link (str, optional) –

    Link function. Supported links and their canonical families:

    Link                    Canonical for family
    identity                normal
    logistic                binomial
    log                     poisson
    reciprocal              gamma
    reciprocal_squared      inverse_gaussian

    probit and complementary_log_log are also accepted (non-canonical for binomial). If not specified, the canonical link is used. Per the convexity-only design rule (CLAUDE.md), only (family, link) pairs with a dedicated convex proximal-operator implementation are accepted.

  • dispersion (float, optional) – Dispersion parameter. Some families (binomial, poisson) have a fixed dispersion independent of the data. For other families the dispersion is typically estimated from the data; passing a known value here reduces uncertainty in the model.

  • estimate_overdispersion (bool, optional) – Whether to estimate over-dispersion for binomial and poisson families. Only meaningful when covariate classes are present and have at least modest size. See [GLM], §4.5. Defaults to False.

  • name (str, optional) – Name for the model. Used in plots and in deriving filenames when save_flag=True is passed to fit().

  • load_from_file (str, optional) – Pickle path produced by a previous save_flag=True fit. When set, every other parameter is ignored.

  • tau (float, optional) – Quantile level in \((0, 1)\) for family="quantile" (pinball loss). tau=0.5 recovers the conditional median. Required when family="quantile" and ignored otherwise.

  • delta (float, optional) – Knee parameter for family="huber". Residuals with \(|y - \mu| \le \delta\) are penalized as \(0.5 r^2\) (least squares); larger residuals are penalized linearly, capping their per-observation influence. Must be positive and is in the units of y. Required when family="huber" and ignored otherwise.
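Both loss parameters are fixed at construction time; a brief sketch:

    from gamdist import GAM

    # Conditional 0.9-quantile via the pinball loss.
    q90 = GAM(family="quantile", tau=0.9)

    # Robust fit: quadratic loss for |residual| <= 1.5, linear beyond.
    robust = GAM(family="huber", delta=1.5)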

References

[glmnet]

glmnet (R package). The standard R package for lasso / elastic-net regularized GLMs. https://cran.r-project.org/web/packages/glmnet/index.html

[pygam]

pygam (Python package). Implements GAMs in Python without using ADMM. https://github.com/dswah/pyGAM

[GLM]

McCullagh, P. and Nelder, J. A. Generalized Linear Models. The standard text on GLMs.

[GAM]

Hastie, T. and Tibshirani, R. Generalized Additive Models. The book by the folks who invented GAMs.

[ESL]

Hastie, T., Tibshirani, R., and Friedman, J. The Elements of Statistical Learning. Covers a lot more than just GAMs.

[GAMr]

Wood, S. N. Generalized Additive Models: An Introduction with R. Covers more implementation details than [GAM].

[ADMM]

Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers.

[GAMADMM]

Chu, E., Keshavarz, A., and Boyd, S. A Distributed Algorithm for Fitting Generalized Additive Models. Forms the basis of this package’s approach.

add_feature(name: str, type: Literal['categorical', 'linear', 'spline'], transform: Callable[[ndarray[tuple[Any, ...], dtype[Any]]], ndarray[tuple[Any, ...], dtype[Any]]] | None = None, rel_dof: float | None = None, regularization: dict[str, Any] | None = None, constraints: dict[str, Any] | None = None) None[source]#

Add a feature to the model.

An implicit constant feature is always included, representing the overall average response.

Parameters:
  • name (str) – Name for the feature. Used internally to keep track of features and also used when saving files and in plots.

  • type (str) –

    Type of feature. One of:

    • "categorical" – for categorical variables.

    • "linear" – for variables with a linear contribution to the response.

    • "spline" – for variables with a potentially nonlinear contribution to the response.

  • transform (callable, optional) – Optional transform applied to the feature data. Any callable may be used; it is applied to data provided during fitting and prediction. Common choices include numpy.log(), numpy.log1p(), or numpy.sqrt(). The user may wish to start with a base feature "age" and use derived features "age_linear" and "age_quadratic" to permit quadratic models for that feature, with potentially different regularization applied to each.

  • rel_dof (float, optional) –

    Relative degrees of freedom. Applicable only to spline features. The degrees of freedom associated with a spline represent how “wiggly” it is allowed to be: a spline with two degrees of freedom is just a line. (Since features are constrained to have zero mean response over the data, linear features only have one degree of freedom.) The relative degrees of freedom set the baseline smoothing parameter \(\lambda\) associated with the feature. When fit() is called, the user may specify an overall smoothing parameter applied to all features to alter the amount of regularization in the entire model; the actual degrees of freedom will vary based on the amount of smoothing. By default, splines have 4 relative degrees of freedom.

    Regularization of any feature effectively reduces the degrees of freedom; that adjustment is not yet wired in.

  • regularization (dict, optional) –

    Dictionary specifying the regularization applied to this feature; different feature types support different regularizers (a combined sketch follows this parameter list):

    • Splines always include a \(C^2\) smoothness penalty controlled via rel_dof; regularization={"group_lasso": {"coef": lam}} additionally shrinks the entire spline contribution and can zero it out.

    • group_lasso_inf is the \(L_\infty\)-norm variant (\(\lambda \|f_j\|_\infty\)); it produces a clipping rather than a uniform contraction and is also available on linear and categorical features.

    • huber is a bounded-influence ridge analogue: regularization={"huber": {"coef": lam, "delta": d}} adds \(\lambda h_\delta(\mathrm{coef})\) per parameter, with \(h_\delta\) quadratic for small magnitudes and linear beyond \(\delta\). Available on linear and categorical features.

    Other feature types have more diverse options; see each feature class’s docstring.

  • constraints (dict, optional) – Optional convex shape constraints on the feature’s coefficients. sign ("nonnegative" / "nonpositive"), lower, and upper (floats) bound coefficients. monotonic ("increasing" / "decreasing"), convex, and concave impose ordering / second-difference constraints (categorical features additionally require an order list of category labels; splines order along the knots). Linear features support only sign / lower / upper. See each feature class’s docstring for details.
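Illustrative registrations combining the regularizers and constraints described above (coefficient values are arbitrary, the dict shapes follow the parameter descriptions, and model is a freshly constructed GAM):

    import numpy as np

    # Monotone smooth term that the group lasso can zero out entirely.
    model.add_feature("age", "spline", rel_dof=6,
                      regularization={"group_lasso": {"coef": 0.5}},
                      constraints={"monotonic": "increasing"})

    # Log-transformed linear term constrained to a nonpositive slope.
    model.add_feature("price", "linear", transform=np.log,
                      constraints={"sign": "nonpositive"})

    # Per-level offsets with a bounded-influence (Huber) penalty.
    model.add_feature("region", "categorical",
                      regularization={"huber": {"coef": 0.1, "delta": 1.0}})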

fit(X: DataFrame, y: ndarray[tuple[Any, ...], dtype[Any]], covariate_class_sizes: ndarray[tuple[Any, ...], dtype[Any]] | None = None, weights: ndarray[tuple[Any, ...], dtype[Any]] | None = None, optimizer: str = 'admm', smoothing: float = 1.0, save_flag: bool = False, verbose: bool = False, plot_convergence: bool = False, max_its: int = 100, n_jobs: int = 1) None[source]#

Fit the model to data.

Parameters:
  • X (pandas.DataFrame) – Dataframe of features. The column names must correspond to the names of features added to the model. X may have extra columns corresponding to features not included in the model; these are silently ignored. Where applicable, the data should be “pre-transformation”: any transformations specified in add_feature() are applied here.

  • y (array-like) – Response. Depending on the model family, the response may need to be in a particular form (for example, for a binomial family, y should contain 0s and 1s); this is not checked.

  • covariate_class_sizes (array-like, optional) – If observations are grouped into covariate classes, the size of those classes should be listed here.

  • weights (array-like, optional) – Per-observation weights. Effectively specifies the dispersion of each observation.

  • optimizer (str) – Optimization method. Currently only "admm" is supported, so this argument has no effect.

  • smoothing (float) – Multiplicative scale applied to every feature’s regularization coefficient. Lets the user set the overall smoothing by cross-validation while keeping the relative regularization across features fixed. Defaults to 1.0, which leaves the regularization as specified in add_feature().

  • save_flag (bool) – Whether to save intermediate results after each iteration. Useful for complicated models with massive data sets that take a while to fit; if the system crashes during the fit the analyst can resume from the last checkpoint. Defaults to False.

  • verbose (bool) – Print mildly useful information during the fit. Defaults to False.

  • plot_convergence (bool) – Plot the convergence graph at the end of the fit. Defaults to False.

  • max_its (int) – Maximum number of ADMM iterations. Defaults to 100.

  • n_jobs (int) – Number of threads to use for the per-feature primal step within each ADMM iteration. Defaults to 1 (serial); pass -1 to use os.cpu_count(). NumPy / SciPy / cvxpy release the GIL during their numeric kernels, so threading produces real speedup. Expect a 2-4x ceiling on models with several non-trivial features (splines, categoricals via cvxpy); pure linear-only models are usually faster serial because the per-feature work is too cheap to amortize thread-dispatch overhead.

Notes

Many binomial data sets include multiple observations with identical features. For example, a data set with features gender and country and a binary survival response might be presented in the compact form

    gender    country    patients    survivors
    M         USA        50          48
    F         USA        70          65
    M         CAN        40          38
    F         CAN        45          43

This is still a binomial family, just more compact. The compact format is not yet supported; in this context it is important to check for over-dispersion (see [GLM]). The current implementation assumes no over-dispersion and that the number of observations sharing a feature pattern is small.

predict(X: DataFrame) ndarray[tuple[Any, ...], dtype[float64]][source]#

Apply the fitted model to features.

Parameters:

X (pandas.DataFrame) – Data for which to predict the response. The column names must correspond to the names of the features used to fit the model. X may have extra columns corresponding to features not in the model; these are silently ignored. Where applicable, the data should be “pre-transformation”: transformations specified at model definition time are applied here.

Returns:

mu – Predicted mean response for each row of X.

Return type:

ndarray

robust_covariance(cov_type: str = 'HC0') ndarray[tuple[Any, ...], dtype[float64]][source]#

Sandwich (Huber-White) covariance estimator.

Returns the heteroscedasticity-consistent covariance for the parameter vector defined by _design_matrix:

\[V = (X^\top W X)^{-1} \bigl(X^\top \mathrm{diag}(s_i^2)\, X\bigr) (X^\top W X)^{-1},\]

where \(W_{ii} = (d\mu/d\eta)_i^2 / V(\mu_i)\) is the IRLS weight and \(s_i = (y_i - \mu_i)\,(d\mu/d\eta)_i / V(\mu_i)\) is the per-observation score on the \(\eta\) scale. \(\mu\) is the fitted mean from this GAM. The dispersion drops out, so the estimator is identical for known and estimated dispersion.

For canonical-link GLMs without regularization, the result matches statsmodels.GLM(...).fit(cov_type="HC0").cov_params() on the same parametrization to numerical precision. Penalized fits return the naive sandwich at the penalized point estimate; for proper penalized inference a bread that includes the penalty Hessian is needed (future work).

Parameters:

cov_type ("HC0" or "HC1") – "HC0" is the standard White estimator. "HC1" multiplies by n / (n - p) for a small-sample correction (matches Stata / statsmodels HC1).

Returns:

V – Covariance for the parameter vector (intercept, linear..., categorical contrasts...), in the order produced by _design_matrix().

Return type:

(p, p) ndarray

Raises:
  • AttributeError – If the model has not been fit.

  • NotImplementedError – For spline features, covariate classes, observation weights, the quantile or huber families, or unsupported cov_type values.
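Robust standard errors then come straight off the diagonal of V (a sketch; the parameter ordering follows _design_matrix() as described above):

    import numpy as np

    V = model.robust_covariance(cov_type="HC1")   # (p, p) sandwich covariance
    se = np.sqrt(np.diag(V))                      # heteroscedasticity-robust standard errors
    # Approximate 95% Wald intervals: estimate +/- 1.96 * se, in the same
    # (intercept, linear..., categorical contrasts...) ordering.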

confidence_intervals(X: DataFrame, prediction: bool = False, width: float = 0.95) ndarray[tuple[Any, ...], dtype[float64]][source]#

Confidence intervals on predictions.

Note

Not yet implemented; raises NotImplementedError.

Two notions of confidence interval are useful here. The first is a confidence interval on \(\mu\), the mean response; this captures the uncertainty in the fitted model. The second is a confidence interval on observations of this model. For a Gaussian family the model might be a perfect fit with billions of observations – so \(\mu\) is known precisely and the mean-response interval is tiny – but observations are still spread around the mean, so the prediction interval is wider. For a binomial family the estimated mean lies in \((0, 1)\) and admits a confidence interval, but the observed response is always 0 or 1 so a “prediction interval” is only meaningful in a pedantic sense.

When making multiple predictions a “global” set of intervals (all predictions inside their intervals with the specified joint probability) is sometimes desirable. This function does not compute global intervals; each interval is computed in vacuo.

Parameters:
  • X (pandas.DataFrame) – Data for which to predict the response. The column names must correspond to the names of the features used to fit the model. X may have extra columns corresponding to features not in the model; these are silently ignored. Where applicable, the data should be “pre-transformation”.

  • prediction (bool) – If True, return a confidence interval on the predicted response; if False, on the mean response. Defaults to False.

  • width (float) – Desired confidence width in \((0, 1)\). Defaults to 0.95.

Returns:

bounds – Lower and upper bounds on the confidence interval associated with each prediction.

Return type:

ndarray of shape (n, 2)

plot(name: str, true_fn: Callable[[ndarray[tuple[Any, ...], dtype[float64]]], ndarray[tuple[Any, ...], dtype[float64]]] | None = None) None[source]#

Plot the model component for a particular feature.

Parameters:
  • name (str) – Name of the feature; must be one of the features in the model.

  • true_fn (callable, optional) – Function representing the “true” relationship between the feature and the response, overlaid on the plot for comparison.

residuals(kind: Literal['response', 'pearson', 'deviance', 'anscombe'] = 'deviance') ndarray[tuple[Any, ...], dtype[float64]][source]#

Residuals for the fitted model.

Parameters:

kind ("response", "pearson", "deviance", or "anscombe") – "response" returns y - mu (raw error). "pearson" returns the standardized residual (y - E[y]) / sqrt(Var(y)) using the family’s variance function and the estimated dispersion. For binomial with covariate classes, y is the count and E[y] = m * mu. "deviance" returns sign(y - E[y]) * sqrt(d_i), where d_i >= 0 is the per-observation deviance contribution; squaring and summing gives the total deviance (Wood 2017 Sec 3.1.7). "anscombe" returns the family-specific transformation (A(y) - A(mu)) / (V(mu)^(1/6) * sqrt(phi)) with A(t) = integral V(s)^(-1/3) ds, chosen so the residual is approximately standard-normal under correct specification (McCullagh & Nelder 1989 Sec 2.4.1; Wood 2017 Sec 3.1.7).

Returns:

residuals

Return type:

array of shape (n_obs,)
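One common use is a quick overdispersion check: under a correctly specified binomial or Poisson model the Pearson statistic over its residual degrees of freedom should be near 1 (a sketch built from the methods documented here):

    import numpy as np

    r = model.residuals(kind="pearson")
    ratio = np.sum(r ** 2) / (len(r) - model.dof())   # Pearson X^2 / (n - p); >> 1 hints at overdispersion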

plot_residuals(kind: Literal['response', 'pearson', 'deviance', 'anscombe'] = 'deviance') Any[source]#

Plot residuals vs. fitted values and a normal QQ plot.

Produces a 1x2 matplotlib figure: residual-vs-fitted on the left, Q-Q plot against the standard normal on the right.

Parameters:

kind ("response", "pearson", or "deviance") – Forwarded to residuals().

Returns:

fig

Return type:

matplotlib.figure.Figure

plot_residuals_vs_predictor(predictor: ArrayLike, *, kind: Literal['response', 'pearson', 'deviance', 'anscombe'] = 'deviance', name: str | None = None) Any[source]#

Plot residuals against a predictor.

Useful for diagnosing fit adequacy: a clear pattern (curvature, funnel shape, etc.) when predictor is a feature already in the model suggests an unmodeled non-linearity or heteroscedasticity. The same plot for a predictor not in the model is a quick check on whether to add it.

Parameters:
  • predictor (array-like of shape (n_obs,)) – Predictor values for each training observation. Must match the length of the training data used to fit this model. Categorical (string / object) predictors are plotted as a categorical x-axis.

  • kind ("response", "pearson", "deviance", or "anscombe") – Forwarded to residuals().

  • name (str, optional) – Label for the x-axis and title.

Returns:

fig

Return type:

matplotlib.figure.Figure

deviance(X: DataFrame | None = None, y: ndarray[tuple[Any, ...], dtype[Any]] | None = None, covariate_class_sizes: ndarray[tuple[Any, ...], dtype[Any]] | None = None, w: ndarray[tuple[Any, ...], dtype[Any]] | None = None) float[source]#

Deviance of the fitted model.

With no arguments, returns the training deviance

\[D = 2 \phi \bigl(\ell(y; y) - \ell(\mu; y)\bigr),\]

where \(\phi\) is the dispersion (present only to cancel the denominator of the log-likelihood), \(\ell(y; y)\) is the log-likelihood of a model that fits the data perfectly, and \(\ell(\mu; y)\) is the log-likelihood of the fitted model on its training data. This is the quantity ADMM minimizes during fitting.

With X and y supplied, returns the deviance on the provided data set. Combined with cross-validation this is the usual way to pick the smoothing parameter by minimizing deviance on a hold-out set.

Parameters:
  • X (pandas.DataFrame, optional) – Dataframe of features. The column names must correspond to the names of features added to the model (see predict()). Only used in the hold-out branch.

  • y (array-like, optional) – Response on the hold-out set. Only used in the hold-out branch.

  • covariate_class_sizes (array-like, optional) – Array of covariate class sizes for the hold-out data.

  • w (array-like, optional) – Per-observation weights for the hold-out branch.

Returns:

D – The deviance of the model.

Return type:

float
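A minimal hold-out search over the overall smoothing multiplier, as suggested above (build_model, the data split, and the grid are illustrative stand-ins):

    import numpy as np

    grid = np.logspace(-2, 2, num=9)
    scores = []
    for lam in grid:
        m = build_model()                  # hypothetical helper: fresh GAM with features added
        m.fit(X_train, y_train, smoothing=lam)
        scores.append(m.deviance(X_val, y_val))
    best_smoothing = grid[int(np.argmin(scores))]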

dispersion(formula: str = 'deviance') float[source]#

Dispersion of the fitted model.

Returns the dispersion associated with the model. Some families (binomial, poisson) have a fixed dispersion; for others the dispersion is estimated from the data.

Different estimators may be appropriate for different families. The current implementation is the deviance-based estimator of [GAMr], Eqn. 3.10 on p. 110. That section notes the estimator works poorly for overdispersed Poisson data with a small mean response and offers alternatives that have not yet been implemented. This is not currently a concern because overdispersion is not supported – and without overdispersion the Poisson dispersion is exactly 1.

Parameters:

formula (str) –

Formula for the dispersion. Options:

  • "deviance" (default)

  • "pearson"

  • "fletcher"

dof() float[source]#

Degrees of freedom: the sum of per-feature DOFs plus one for the affine (intercept) term.

aic() float[source]#

Akaike Information Criterion.

Useful for choosing smoothing parameters. The AIC computed here is off by a constant factor relative to other conventions – different authors add multiplicative or additive constants freely, since they do not affect model selection – and dropping them simplifies the computation without changing model-selection rankings.

aicc() float[source]#

AIC corrected for finite sample size.

Applies the Hurvich-Tsai small-sample correction to aic():

\[\mathrm{AICc} = \mathrm{AIC} + \frac{2 p (p + 1)}{n - p - 1},\]

where p is the effective number of parameters (dof(), plus one if the dispersion is being estimated) and n is the number of observations. This is Eqn. 6.32 on p. 304 of [GAMr]. As n grows the correction term vanishes and AICc reduces to AIC. AICc is preferable to AIC when n is small relative to p.

Returns +inf for an overparameterized fit (n <= p + 1), where the correction term is undefined and the model has no useful AICc.

bic() float[source]#

Bayesian Information Criterion.

Computed as

\[\mathrm{BIC} = D / \hat{\phi} + p \log n,\]

where \(D\) is the deviance, \(\hat{\phi}\) the dispersion, p is the effective parameter count (dof(), plus one when the dispersion is estimated, matching aic() and aicc()), and n is the number of observations. As with aic(), the BIC is off by a constant factor; only differences are meaningful for model selection.

null_deviance() float[source]#

Deviance of the intercept-only model on the training data.

The null model predicts the same mean response for every observation: mean(y) (or sum(y) / sum(covariate_class_sizes) when binomial covariate classes were supplied at fit time).

r_squared() float[source]#

Deviance-based pseudo-R^2.

Returns 1 - deviance() / null_deviance(). For a normal family with identity link this is the conventional R^2; for other families it is the deviance pseudo-R^2 of McCullagh and Nelder. Returns nan when the null deviance is zero (the response is constant), where pseudo-R^2 is undefined.

ubre(gamma: float = 1.0) float[source]#

Un-Biased Risk Estimator.

Returns the Un-Biased Risk Estimator as discussed in Sections 6.2.1 and 6.2.5 of [GAMr]. Useful for choosing the smoothing parameter when the dispersion is known.

As discussed in Section 6.2.5 of [GAMr], sometimes it is helpful to force smoother fits by exaggerating the effective degrees of freedom; a value of gamma > 1 may be desirable in that case.

gcv(gamma: float = 1.0) float[source]#

Generalized Cross Validation score.

Useful for choosing the smoothing parameter when the dispersion is unknown.

As discussed in Section 6.2.5 of [GAMr], sometimes it is helpful to force smoother fits by exaggerating the effective degrees of freedom; a value of gamma > 1 may be desirable in that case.

summary() None[source]#

Print summary statistics for the fitted model.

Prints statistics for the overall model and for each individual feature (see the __str__ method on each feature class for the per-feature output).

The overall model section reports:

  • phi – Estimated dispersion parameter. Omitted when dispersion was supplied at construction time or is fixed by the family (e.g. Poisson).

  • edof – Estimated degrees of freedom.

  • Deviance – Twice the dispersion times the difference between the log-likelihood of the saturated model and that of the fitted model.

  • AIC – Akaike Information Criterion.

  • AICc – AIC corrected for finite data sets.

  • BIC – Bayesian Information Criterion.

  • R^2 – Deviance-based pseudo-\(R^2\).

  • UBRE – Unbiased Risk Estimator (when the dispersion is known).

  • GCV – Generalized Cross Validation score (when the dispersion is estimated).

See the corresponding methods (aic(), aicc(), bic(), r_squared(), ubre(), gcv()) for definitions.

gamdist.gamdist.fit_adaptive_lasso(gam: GAM, X: DataFrame, y: ndarray[tuple[Any, ...], dtype[Any]], *, gamma: float = 1.0, eps: float = 1e-06, **fit_kwargs: Any) GAM[source]#

Fit gam via the two-stage adaptive lasso (Zou, 2006).

The first stage fits gam to (X, y) using whatever L1 regularization the user configured on each feature – producing a pilot estimate of every coefficient. The second stage rewrites each L1-regularized feature’s per-coefficient L1 weight to base / (|pilot| + eps) ** gamma and refits. Coefficients with large pilot magnitudes are penalized less; tiny pilots (likely noise) are penalized more, recovering the “oracle” sparsity pattern with less bias on the truly active coefficients than plain L1.

Both stages are convex (they are ordinary weighted-L1 fits at fixed weights), so this is just a thin wrapper over two GAM.fit calls.

Parameters:
  • gam (GAM) – Fully configured but not-yet-fit GAM. At least one feature must carry a regularization={"l1": {"coef": ...}} configuration; other features and other regularizers (l2, group_lasso, …) ride along untouched. The same gam object is mutated in place and returned.

  • X (data) – Forwarded to both gam.fit calls.

  • y (data) – Forwarded to both gam.fit calls.

  • gamma (float) – Adaptive-lasso exponent (typically 1.0). Larger gamma makes the second-stage weights swing harder around the pilot magnitudes.

  • eps (float) – Stabilizer added to the pilot magnitude before exponentiation, so a pilot of exactly zero produces a finite (though large) second-stage weight rather than a divide-by-zero.

  • **fit_kwargs – Forwarded to both gam.fit calls. Both stages see the same smoothing, max_its, weights, etc.

Returns:

gam – The same model, now fit with adaptive L1 weights.

Return type:

GAM
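Usage sketch (feature names, data, and coefficients are illustrative):

    from gamdist import GAM, fit_adaptive_lasso

    model = GAM(family="normal")
    model.add_feature("x1", "linear", regularization={"l1": {"coef": 1.0}})
    model.add_feature("x2", "linear", regularization={"l1": {"coef": 1.0}})

    # Stage 1 (pilot fit) and stage 2 (reweighted refit) both run inside the wrapper.
    model = fit_adaptive_lasso(model, X, y, gamma=1.0, max_its=200)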

Feature ABC#

Feature ABC.

Defines _Feature, the abstract base class through which every concrete feature type plugs into the ADMM loop. The interface is the modular seam called out in CLAUDE.md: a feature exposes initialize / optimize / compute_dual_tol / num_params / dof / predict / _save / _load, and the GAM orchestrator never needs to know whether the feature is linear, categorical, spline, or something user-defined.

Linear Feature#

Linear feature: a single slope with closed-form regularized solves.

_LinearFeature contributes m * (x - mean(x)) to the linear predictor. The per-feature primal step has a closed-form solution at every supported regularization combination – ridge folds into the denominator, L1 / group lasso / L_inf group lasso into a 1-D soft-threshold, Huber into a clipped quadratic – so this branch never calls cvxpy. Sign / lower / upper convex constraints clip the unconstrained optimum onto the feasible interval.
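For intuition, the 1-D soft-threshold handling the L1 case has the familiar closed form (a standalone sketch, not the module’s internal code):

    import numpy as np

    def soft_threshold(b: float, lam: float) -> float:
        # argmin over m of 0.5 * (m - b)**2 + lam * abs(m)
        return float(np.sign(b) * max(abs(b) - lam, 0.0))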

Categorical Feature#

Categorical feature: one parameter per level with a zero-sum constraint.

_CategoricalFeature contributes a per-level offset to the linear predictor. The per-feature primal step is solved as a small QP / SOCP via cvxpy; available regularizers are L1, L2, group lasso, the L_inf group lasso variant, network lasso, network ridge, and Huber. Convex shape constraints (sign / lower / upper / monotonic / convex / concave on a user-supplied category ordering) compose with those regularizers inside the same cvxpy program.

Spline Feature#

Cubic regression spline feature with a curvature smoothness penalty.

_SplineFeature represents a smooth univariate term in the linear predictor as a cubic regression spline on adaptively-chosen knots. The fit is regularized by an integrated squared second derivative penalty; the smoothing parameter is determined to hit a target relative-DOF budget (rel_dof), not by cross-validation.

Two solver paths share the same objective:

  • Closed-form Cholesky path – used when no shape constraints and no group-lasso terms are active.

  • cvxpy path – used when group lasso, L_inf group lasso, or shape constraints (sign / lower / upper / monotonic / convex / concave on the values evaluated at the knots) are present.

Helper functions _sample(), _evaluate_spline_basis(), and _omega_curvature() build the basis matrix and curvature penalty matrix used in both paths.

exception gamdist.spline_feature.SplineError[source]#

Raised when spline knot selection or basis evaluation fails.

Proximal Operators#

Proximal operators for each supported (family, link) pair.

The ADMM dual step in gamdist.GAM._optimize() calls the proximal operator of the per-observation negative log-likelihood

\[\mathrm{prox}_\mu(v) := \arg\min_x\, L(x) + \frac{\mu}{2}\, \| x - v \|_2^2\]

for the model’s family and link. Per the convexity-only design rule (CLAUDE.md), every dispatch entry corresponds to a convex subproblem and a dedicated solver:

  • normal + identity, gamma + reciprocal, quantile + identity, and huber + identity have closed forms.

  • binomial + logit, poisson + log, and inverse-gaussian + reciprocal-squared use damped Newton iterations on the scalar dual update with a backtracking safeguard.

All operators broadcast over their v, y array inputs. Optional covariate-class sizes (ccs), observation weights (w), and the shape parameter for huber / quantile losses are passed through keyword arguments.
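For example, with the normal family and identity link the per-observation loss is \(L(x) = \tfrac{1}{2}(x - y)^2\), and the prox comes out in closed form (a standalone sketch of the math, not the module’s internal signature):

    import numpy as np

    def prox_normal_identity(v: np.ndarray, y: np.ndarray, mu: float) -> np.ndarray:
        # argmin_x 0.5 * (x - y)**2 + (mu / 2) * (x - v)**2, elementwise:
        # the stationarity condition (x - y) + mu * (x - v) = 0 gives
        return (y + mu * v) / (1.0 + mu)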

Multi-Task GAM#

Multi-task GAM: K supervised-learning tasks fit jointly under a shared feature set with optional convex coupling regularization.

Per the design discussion on issue #39 (and CLAUDE.md’s seam principle):

  • Per-task primal step is unchanged: each task’s task-local features run their existing optimize(fpumz_k, rho) independently. A multi-task feature wraps K of those into a single optimize_multi(fpumz_list, rho) call, which is the only place the cross-task coupling penalty enters.

  • Per-task proximal step is unchanged: K independent prox calls, each with its own (family_k, link_k, y_k) – so tasks can use different family/link pairs.

  • MultiTaskGAM is a separate class whose state (f_bar, z_bar, u, residual histories, offset, …) is a length-K list. Single-task GAM is unchanged.

The first concrete coupling penalty is group-lasso-across-tasks on linear features (see _MultiTaskLinearFeature). Persistence, MultiTaskGAM.summary(), plotting, model-selection statistics, and per-task tasks= routing are out of scope for this slice.

class gamdist.multi_task_gamdist.MultiTaskGAM(families: list[Literal['normal', 'binomial', 'poisson', 'gamma', 'exponential', 'inverse_gaussian', 'quasi_binomial', 'quasi_poisson', 'quantile', 'huber']], links: list[Literal['identity', 'logistic', 'probit', 'complementary_log_log', 'log', 'reciprocal', 'reciprocal_squared'] | None] | None = None, dispersions: list[float | None] | None = None, name: str | None = None)[source]#

Generalized Additive Model fit jointly across K tasks.

Each task k has its own (family_k, link_k) pair, its own response y_k (lengths may differ across tasks), and its own feature data; coefficients are tied together by the coupling regularizer attached to each multi-task feature.

Parameters:
  • families (list[str]) – Length-K list of family names. Each entry must be one of the single-task families supported by GAM. Tasks may use different families.

  • links (list[str | None] or None) – Length-K list of link names; defaults to the canonical link of each family. None entries fall back to the canonical link.

  • dispersions (list[float | None] or None) – Optional per-task fixed dispersion. None entries follow the single-task GAM defaults (e.g. binomial / poisson are fixed at 1).

  • name (str or None) – Optional model name. Persistence is not supported in this slice; the name is currently informational only.

Notes

Several features of single-task GAM are intentionally not yet wired up for the multi-task case in this slice and raise NotImplementedError if invoked: save_flag, load_from_file, summary, plot, aic / aicc / gcv / ubre, confidence_intervals, robust covariance, quantile / huber families, covariate classes, and the estimate_overdispersion path.

add_feature(name: str, type: Literal['linear'], transform: Callable[[ndarray[tuple[Any, ...], dtype[Any]]], ndarray[tuple[Any, ...], dtype[Any]]] | None = None, regularization: dict[str, Any] | None = None) None[source]#

Add a feature shared across all K tasks.

The first slice supports only type='linear'. Each task k contributes m_k * (x_k - mean(x_k)) to its own linear predictor; the K slopes m_1, ..., m_K may be coupled by a convex coupling regularizer (see _MultiTaskLinearFeature for the supported penalties).

fit(Xs: list[DataFrame], ys: list[ndarray[tuple[Any, ...], dtype[Any]]], weights: list[ndarray[tuple[Any, ...], dtype[Any]] | None] | None = None, smoothing: float = 1.0, verbose: bool = False, max_its: int = 100) None[source]#

Fit the multi-task GAM via K-coupled ADMM.

Parameters:
  • Xs (list[pandas.DataFrame]) – Length-K list of per-task design matrices. Column names must include every feature added via add_feature. Per-task row counts may differ.

  • ys (list[ndarray]) – Length-K list of per-task response arrays. len(ys[k]) must equal len(Xs[k]).

  • weights (list[ndarray | None] or None) – Optional per-task observation weights. None (or a None entry) means unit weights for that task.

  • smoothing (float) – Multiplicative scale applied to every feature’s regularization coefficient (matches single-task GAM).

  • verbose (bool) – Print per-iteration progress.

  • max_its (int) – Maximum ADMM iterations (per the joint stopping rule).
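A two-task sketch (data shapes are illustrative, and the group-lasso dict key is an assumption patterned on the single-task regularization format):

    from gamdist import MultiTaskGAM

    # Task 0 models a binary outcome, task 1 counts; links default to canonical.
    mt = MultiTaskGAM(families=["binomial", "poisson"])

    # One shared linear feature whose two slopes are coupled across tasks.
    mt.add_feature("dose", "linear",
                   regularization={"group_lasso": {"coef": 0.3}})   # assumed key

    mt.fit([X0, X1], [y0, y1], max_its=200)
    mu0, mu1 = mt.predict([X0, X1])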

predict(X: DataFrame | list[DataFrame]) list[ndarray[tuple[Any, ...], dtype[float64]]][source]#

Predict per-task mean responses.

Parameters:

X (pandas.DataFrame or list[pandas.DataFrame]) – If a single DataFrame is passed, the same predictor matrix is used for every task and the K predictions all have the same length. If a list of K DataFrames is passed, each task’s prediction uses its own DataFrame.

Returns:

mus – Length-K list of predicted mean responses; mus[k] has shape (len(X[k]),).

Return type:

list of arrays

deviance(X: list[DataFrame] | None = None, y: list[ndarray[tuple[Any, ...], dtype[Any]]] | None = None, weights: list[ndarray[tuple[Any, ...], dtype[Any]] | None] | None = None) list[float][source]#

Per-task deviance.

With no arguments, returns the training deviance of each task evaluated against the current ADMM iterate. Otherwise, the deviance is computed against the supplied per-task data.

Multi-Task Features#

Feature classes that span multiple tasks in a MultiTaskGAM.

A _MultiTaskFeature owns one set of coefficients shared across the K tasks and, optionally, a convex coupling regularizer that ties those K coefficient vectors together. Coupling lives entirely inside the feature – the multi-task orchestrator never sees a penalty coefficient – so the seam principle from CLAUDE.md is preserved.

The first concrete subclass is _MultiTaskLinearFeature: a linear contribution \(m_k (x_k - \bar{x}_k)\) for each task \(k\). The only coupling penalty implemented in this slice is group-lasso across tasks,

\[\lambda \sqrt{\sum_k m_k^2\, x_k^\top x_k},\]

the K-task generalization of the existing single-task linear group-lasso. With \(\lambda = 0\) the K subproblems decouple and the feature behaves like K independent _LinearFeature instances stitched together.
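The resulting coupled update is a block soft-threshold on the rescaled slope vector; a standalone numeric sketch of that operator (not the package’s internal code):

    import numpy as np

    def block_soft_threshold(b: np.ndarray, lam: float) -> np.ndarray:
        # prox of lam * ||b||_2: shrinks the whole block toward zero,
        # zeroing it exactly when ||b||_2 <= lam.
        norm = np.linalg.norm(b)
        if norm <= lam:
            return np.zeros_like(b)
        return (1.0 - lam / norm) * b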