Rashomon Toolkit: Aggregating Model Explanations Beyond the Single Best Model

Description

This project aims to transform the Rashomon partial dependence profile (Rashomon PDP) method—recently introduced as a way to quantify explanation uncertainty in machine learning—into a practical, reusable software package that integrates seamlessly with modern AutoML systems. Instead of relying on a single “best” model, the Rashomon PDP framework aggregates explanations from multiple near-optimal models to reveal where model agreement is strong and where uncertainty or interpretive divergence emerges. By implementing this methodology as a Python toolkit, the project will provide users with uncertainty-aware explanations for both regression and classification tasks, enabling more trustworthy model interpretation in sensitive or high-stakes applications.

The final package will interface with widely used AutoML tools such as H2O, AutoGluon, and Auto-sklearn, automatically extracting model outputs, constructing Rashomon sets, and producing standard PDPs (and, optionally, ICE curves and ALE plots) alongside their Rashomon-based aggregated equivalents. Users will be able to generate confidence intervals, quantify explanation variability, and visualize disagreement across models through clear, interactive plots or dashboard components. Ultimately, the project adds a missing layer of transparency to AutoML workflows by making model multiplicity explicit and helping practitioners understand not only what a model predicts, but also how far the explanations of those predictions can be trusted.

Expected MVP

Goal: Convert the Rashomon partial dependence profile framework (as described in the paper: https://doi.org/10.1007/978-3-032-05461-6_29) into a reusable software package and integrate it with at least one AutoML tool.

Essential MVP capabilities:

1. Training Interface: Connect to one AutoML library (preferably H2O AutoML, since the paper uses it). Allow the user to run an AutoML experiment or pass in pre-trained models.

2. Rashomon Set Construction:
   - Compute the performance metric on a test set.
   - Identify the best-performing model.
   - Construct the Rashomon set Rε using tolerance ε.

3. PDP Extraction for Each Model: For each model, compute the standard PDP (available in the DALEX package) for each selected feature.

4. Rashomon PDP Construction:
   - Aggregate the model-level PDPs into the Rashomon PDP (mean curve).
   - Implement a bootstrap procedure to compute confidence bands.

5. Visualization Tools: Produce a plot with the best model PDP, Rashomon PDP, and confidence bands.

6. Simple API: fit(), get_rashomon_set(), rashomon_pdp(), plot()
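
A minimal sketch of how capabilities 2 to 6 could fit together behind the four-method API is given below. It is an illustration under stated assumptions, not the paper's implementation: the class name RashomonExplainer is hypothetical, models are passed in as plain predict callables rather than H2O objects, the metric is a loss where lower is better, ε is treated as an absolute tolerance on that loss, the PDP is written by hand in NumPy where the MVP would call DALEX, and the confidence band is a 95% percentile bootstrap that resamples models with replacement. Each of these choices should be checked against the paper before it is adopted.

```python
import numpy as np


class RashomonExplainer:
    """Hypothetical API sketch; all names here are illustrative."""

    def __init__(self, epsilon=0.05, n_boot=500, seed=0):
        self.epsilon = epsilon  # absolute loss tolerance (assumed definition)
        self.n_boot = n_boot    # number of bootstrap resamples
        self.seed = seed

    def fit(self, models, X, y, loss):
        """Score pre-trained models; `models` maps a name to a predict callable."""
        self.models = dict(models)
        self.X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        self.losses = {name: float(loss(y, predict(self.X)))
                       for name, predict in self.models.items()}
        self.best = min(self.losses, key=self.losses.get)
        return self

    def get_rashomon_set(self):
        """Names of the models whose loss is within epsilon of the best loss."""
        limit = self.losses[self.best] + self.epsilon
        return [name for name, value in self.losses.items() if value <= limit]

    def _pdp(self, predict, feature, grid):
        """Plain PDP: fix one column at each grid value, average the predictions."""
        curve = []
        for value in grid:
            X_mod = self.X.copy()
            X_mod[:, feature] = value
            curve.append(np.mean(predict(X_mod)))
        return np.array(curve)

    def rashomon_pdp(self, feature, grid):
        """Mean PDP over the Rashomon set plus a percentile bootstrap band."""
        grid = np.asarray(grid, dtype=float)
        names = self.get_rashomon_set()
        curves = np.array([self._pdp(self.models[n], feature, grid) for n in names])
        rng = np.random.default_rng(self.seed)
        # Resample models with replacement and recompute the mean curve each time.
        idx = rng.integers(0, len(names), size=(self.n_boot, len(names)))
        boot_means = curves[idx].mean(axis=1)  # shape: (n_boot, len(grid))
        lower, upper = np.percentile(boot_means, [2.5, 97.5], axis=0)
        return {"grid": grid, "best": curves[names.index(self.best)],
                "mean": curves.mean(axis=0), "lower": lower, "upper": upper}

    def plot(self, feature, grid):
        """Best-model PDP, Rashomon PDP, and band on one axis (needs matplotlib)."""
        import matplotlib.pyplot as plt  # imported lazily; only plot() needs it
        r = self.rashomon_pdp(feature, grid)
        fig, ax = plt.subplots()
        ax.fill_between(r["grid"], r["lower"], r["upper"], alpha=0.3,
                        label="95% bootstrap band")
        ax.plot(r["grid"], r["mean"], label="Rashomon PDP")
        ax.plot(r["grid"], r["best"], linestyle="--",
                label=f"best model ({self.best})")
        ax.set_xlabel(f"feature {feature}")
        ax.set_ylabel("average prediction")
        ax.legend()
        return fig


# Toy usage with three hand-written "models" on synthetic regression data.
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(200, 2))
y = 2.0 * X[:, 0] + X[:, 1]

models = {
    "exact": lambda A: 2.0 * A[:, 0] + A[:, 1],
    "close": lambda A: 1.9 * A[:, 0] + A[:, 1],
    "poor": lambda A: np.zeros(len(A)),
}


def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)


explainer = RashomonExplainer(epsilon=0.05).fit(models, X, y, mse)
rashomon_names = explainer.get_rashomon_set()  # ['exact', 'close']
result = explainer.rashomon_pdp(feature=0, grid=np.linspace(-1, 1, 11))
```

Accepting predict callables keeps the sketch independent of any one AutoML library; an H2O integration would presumably wrap each leaderboard model in such a callable before calling fit().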

Deliverables for MVP:
   - A simple Python package (installable via pip or directly from GitHub).
   - Works on regression and classification datasets.
   - Integration with H2O AutoML.
   - Basic documentation and an example notebook.

This version is enough to demonstrate the idea and make the method accessible to AutoML practitioners.

Fully Completed (Advanced) Project

Goal: A polished, extensible system suitable for real-world usage and wider research adoption.

A. Extended AutoML Integrations: Support additional AutoML systems (AutoGluon, Auto-sklearn, FLAML).

B. Additional XAI Methods (optional, but highly valuable):
   - Individual Conditional Expectation (ICE), available in the DALEX package (https://doi.org/10.1080/10618600.2014.907095)
   - Accumulated Local Effects (ALE), available in the DALEX package (https://doi.org/10.1111/rssb.12377)
   - Feature importance distribution across the Rashomon set

C. Exports:
   - Rashomon set details
   - Coverage rate and MWCI, computed automatically as in the paper
   - Reproducibility metadata
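
The two summary metrics in the export list can be sketched as follows. The definitions used here are assumptions and should be checked against the paper: MWCI is read as the mean width of the confidence interval across grid points, and coverage rate as the share of grid points at which a reference curve (here the best model's PDP) lies inside the band. The function names and the toy numbers are hypothetical.

```python
import numpy as np


def mean_width(lower, upper):
    """Average band width across grid points (assumed reading of MWCI)."""
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    return float(np.mean(upper - lower))


def coverage_rate(reference_curve, lower, upper):
    """Share of grid points where the reference curve lies inside the band."""
    ref = np.asarray(reference_curve, dtype=float)
    inside = (ref >= np.asarray(lower)) & (ref <= np.asarray(upper))
    return float(np.mean(inside))


# Toy band over four grid points; the third reference value falls outside it.
lower = np.array([0.0, 0.5, 1.0, 1.5])
upper = np.array([1.0, 1.5, 2.0, 2.5])
best_model_pdp = np.array([0.5, 1.0, 2.5, 2.0])

mwci = mean_width(lower, upper)                         # 1.0
coverage = coverage_rate(best_model_pdp, lower, upper)  # 0.75
```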

D. Packaging:
   - Publish on PyPI
   - Automated tests (pytest + GitHub Actions)
   - Proper versioning + changelog