Usage¶
Quick Start¶
Import the main classes, generate synthetic data, fit the model, and extract feature importance:
import numpy as np
from bayesian_feature_selection import HorseshoeGLM, InferenceConfig
# Generate synthetic data
np.random.seed(42)
n, p = 100, 20
X = np.random.randn(n, p)
true_beta = np.zeros(p)
true_beta[:5] = [3.0, -2.0, 1.5, -1.0, 0.5]
y = X @ true_beta + np.random.randn(n) * 0.5
# Fit model
model = HorseshoeGLM(family="gaussian", scale_global=1.0)
config = InferenceConfig(method="mcmc", num_warmup=500, num_samples=1000, num_chains=1)
model.fit(X, y, config=config)
# Get feature importance
importance = model.get_feature_importance(threshold=0.5, method="beta")
print(importance[importance["selected"]])
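As a quick sanity check (using the predict method covered under Making Predictions below), the in-sample error should be close to the noise scale used to generate y:
# In-sample predictions; RMSE should be near the noise std (0.5)
y_hat = model.predict(X)
rmse = np.sqrt(np.mean((y - y_hat) ** 2))
print(f"Training RMSE: {rmse:.3f}")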
Using DataLoader with CSV Files¶
Load data from a CSV file using DataLoader:
from bayesian_feature_selection import DataLoader, DataConfig
data_config = DataConfig(
    data_path="data/my_dataset.csv",
    target_col="target",
    feature_cols=None,  # Use all columns except target
    standardize=True,
    test_size=0.2,
    random_seed=42,
)
loader = DataLoader(data_config)
X_train, X_test, y_train, y_test, feature_names = loader.load_and_split()
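The splits plug directly into the model; a minimal continuation of this example, reusing the Quick Start configuration:
from bayesian_feature_selection import HorseshoeGLM, InferenceConfig
model = HorseshoeGLM(family="gaussian", scale_global=1.0)
config = InferenceConfig(method="mcmc", num_warmup=500, num_samples=1000, num_chains=1)
model.fit(X_train, y_train, config=config)
print(model.predict(X_test).shape)  # point predictions for the held-out rows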
MCMC vs SVI Inference¶
MCMC (Markov Chain Monte Carlo) draws asymptotically exact samples from the posterior but is slower:
mcmc_config = InferenceConfig(
    method="mcmc",
    num_warmup=1000,
    num_samples=2000,
    num_chains=4,
    use_gpu=True,
    progress_bar=True,
)
model.fit(X, y, config=mcmc_config)
SVI (Stochastic Variational Inference) is faster but provides an approximate posterior:
svi_config = InferenceConfig(
    method="svi",
    num_steps=10000,
    learning_rate=0.001,
    use_gpu=True,
    progress_bar=True,
)
model.fit(X, y, config=svi_config)
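To gauge the speed difference on your own data, a simple wall-clock comparison (illustrative only; it assumes fit() can be called repeatedly on the same instance, and timings depend on data size and hardware):
import time
for name, cfg in [("MCMC", mcmc_config), ("SVI", svi_config)]:
    start = time.perf_counter()
    model.fit(X, y, config=cfg)
    print(f"{name}: {time.perf_counter() - start:.1f}s")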
Feature Selection Methods¶
Three methods are available via get_feature_importance():
beta: Based on the coefficient posterior. Selects features with consistent non-zero effects. Best for prediction and interpretation.
lambda: Based on the local shrinkage parameter. Identifies features with weak shrinkage. Better for filtering pure noise.
both: Combines beta and lambda inclusion probabilities.
# Beta-based selection (default)
importance_beta = model.get_feature_importance(threshold=0.5, method="beta")
# Lambda-based selection
importance_lambda = model.get_feature_importance(threshold=0.5, method="lambda")
# Combined selection
importance_both = model.get_feature_importance(threshold=0.5, method="both")
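A quick way to cross-check the criteria is to intersect their selections; this sketch assumes the returned table is a pandas DataFrame with the boolean "selected" column used in the Quick Start:
# Features flagged by both the beta and lambda criteria
agree = importance_beta["selected"] & importance_lambda["selected"]
print(f"{agree.sum()} features selected by both methods")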
CLI Usage¶
The package provides a bayesian-fs command-line interface:
# Run with a YAML config file
$ bayesian-fs -c configs/default.yaml
# Specify output directory
$ bayesian-fs -c configs/default.yaml -o results/experiment1
# Override model family and inference method
$ bayesian-fs -c configs/default.yaml --family binomial --method svi
# Enable GPU
$ bayesian-fs -c configs/default.yaml --use-gpu
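The exact flag set may vary by version; assuming a standard Python CLI, the full list of options should be printed by the help flag:
# List all available options
$ bayesian-fs --help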
Configuration via YAML Files¶
Create a YAML configuration file to define all experiment parameters:
data:
  data_path: "data/my_dataset.csv"
  target_col: "target"
  feature_cols: null
  test_size: 0.2
  standardize: true
  random_seed: 42

model:
  family: "gaussian"
  scale_global: 1.0

inference:
  method: "mcmc"
  num_warmup: 1000
  num_samples: 2000
  num_chains: 4
  use_gpu: true
  progress_bar: true

selection:
  method: "beta"
  threshold: 0.5

output:
  save_plots: true
  save_diagnostics: true
  save_samples: false
Load and use the configuration programmatically:
from pathlib import Path
from bayesian_feature_selection import ExperimentConfig
config = ExperimentConfig.from_yaml(Path("configs/my_experiment.yaml"))
# Modify a parameter
config.inference.num_samples = 5000
# Save the updated config
config.to_yaml(Path("configs/updated.yaml"))
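The saved file can then be passed straight back to the CLI described above:
$ bayesian-fs -c configs/updated.yaml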
Making Predictions¶
After fitting, use the model to make predictions on new data:
# Point predictions (posterior mean)
y_pred = model.predict(X_new)
# Full posterior predictive samples
y_samples = model.predict(X_new, return_samples=True)
print(y_samples.shape) # (num_samples, n_new)
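The sample array makes uncertainty quantification straightforward; for example, a 90% posterior predictive interval per observation with plain NumPy:
import numpy as np
# 5th and 95th percentiles across the posterior predictive draws
lower = np.percentile(y_samples, 5, axis=0)
upper = np.percentile(y_samples, 95, axis=0)
print(lower.shape, upper.shape)  # each (n_new,)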