TabSHAP
Abstract
Large Language Models (LLMs) fine-tuned on serialized tabular data are emerging as powerful alternatives to traditional tree-based models, particularly for heterogeneous or context-rich datasets. However, their deployment in high-stakes domains is hindered by a lack of faithful interpretability; existing methods often rely on global linear proxies or scalar probability shifts that fail to capture the model’s full probabilistic uncertainty. In this work, we introduce TabSHAP, a model-agnostic interpretability framework designed to directly attribute local query decision logic in LLM-based tabular classifiers. By adapting a Shapley-style sampled-coalition estimator with Jensen–Shannon divergence between full-input and masked-input class distributions, TabSHAP quantifies the distributional impact of each feature rather than simple prediction flips. To align with tabular semantics, we mask at the level of serialized key:value fields (atomic in the prompt string), not individual subword tokens. Experimental validation on the Adult Income and Heart Disease benchmarks demonstrates that TabSHAP isolates critical diagnostic features, achieving significantly higher faithfulness than random baselines and XGBoost proxies. We further run a distance-metric ablation on the same test instances and TabSHAP settings: attributions are recomputed with KL or L1 replacing JSD in the similarity step (results cached per metric), and we compare deletion faithfulness across all three.
1 Introduction
The rise of Large Language Models (LLMs) has fundamentally altered the landscape of machine learning, demonstrating remarkable capabilities across diverse domains, from natural language understanding to code generation. Recently, this transformation has extended to structured data analysis, where LLMs fine-tuned on serialized tabular datasets have emerged as viable alternatives to traditional tree-based methods like XGBoost and Random Forests. By converting tabular rows into natural language prompts, approaches such as TabLLM (Hegselmann et al., 2023) have shown that pre-trained language models can leverage rich semantic priors to achieve competitive, and in some cases superior, performance on heterogeneous or context-rich datasets where conventional models struggle.
However, this paradigm shift introduces a critical challenge: interpretability. In high-stakes domains such as healthcare and finance, model predictions must not only be accurate but also explainable and auditable. Stakeholders require faithful attributions that reveal which features causally influence a decision, enabling practitioners to identify biases, validate domain logic, and build trust in automated systems. While tree-based models benefit from well-established interpretability frameworks like TreeSHAP (Lundberg and Lee, 2017), LLM-based tabular classifiers remain largely opaque. Existing interpretation attempts either rely on global linear proxies, such as fitting logistic regression to the model’s predictions, or apply token-level attribution methods designed for natural language tasks. Neither adequately captures the instance-specific, non-linear decision logic encoded in fine-tuned LLM parameters.
The interpretability gap is further compounded by fundamental architectural differences between LLMs and traditional classifiers. Unlike discriminative models that output fixed probability vectors, generative LLMs distribute probability mass over vast vocabularies, requiring careful extraction and aggregation of class probabilities. Moreover, standard perturbation-based attribution methods operate at the sub-word token level, which is semantically inappropriate for tabular data where multi-token spans (e.g., “Age: 50”) represent atomic feature values. Naive token deletion can corrupt prompt syntax and trigger model hallucinations unrelated to true feature importance. Finally, most existing methods measure feature impact via scalar probability shifts or prediction flips, failing to capture the model’s full distributional uncertainty, a critical oversight when evaluating confidence in probabilistic decision-making.
In this work, we introduce TabSHAP, a model-agnostic interpretability framework specifically designed to attribute decision logic in LLM-based tabular classifiers under empirical faithfulness tests. Our approach makes three core contributions:
1. Distributional Attribution via Jensen-Shannon Divergence. TabSHAP measures feature importance by comparing the model’s full output probability distribution before and after removing a feature, using Jensen-Shannon Divergence (JSD). Unlike methods that only track whether the predicted label changes, this captures shifts in the model’s confidence across all classes, providing a more complete measure of each feature’s influence.
2. Feature-Level Atomic Masking. In serialized tabular prompts, a single feature (e.g., “Age: 50”) spans multiple tokens. Rather than masking individual tokens—which can corrupt the prompt—TabSHAP removes the entire key-value pair as one atomic unit. This preserves prompt syntax and ensures that perturbations correspond to meaningful feature removal.
3. Verbalizer-Style Class Aggregation. Generative LLMs may assign probability to multiple tokens that represent the same class label (e.g., “Yes”, “yes”, “ yes”). TabSHAP aggregates mass from the top-$k$ next-token logits that match each class under a fixed label-to-token mapping before computing attributions, producing stable scores even for distilled or quantized models.
We validate TabSHAP through comprehensive experiments on the Adult Income and Heart Disease benchmarks. Deletion curve analysis demonstrates that TabSHAP achieves significantly higher faithfulness than random and tree-based baselines, with features ranked by our method causing precipitous drops in model confidence when removed. Comparative analysis against XGBoost reveals that while LLMs and tree ensembles capture similar high-level causal signals, they rely on fundamentally different decision logic—with LLMs exhibiting semantic biases toward text-rich features. Finally, we compare JSD-, KL-, and L1-based TabSHAP rankings under identical coalition sampling and deletion protocols (§4).
By providing faithful, feature-level attributions for LLM-based tabular classifiers, TabSHAP enables the responsible deployment of these models in high-stakes domains, bridging the gap between the impressive generalization of language models and the interpretability requirements of critical decision-making systems.
2 Related Work
Large Language Models for Tabular Data.
The application of Transformers to structured data has evolved from specific architectures to the serialization of tabular data into natural language prompts for general-purpose LLMs. Hegselmann et al. (2023) formalized this with TabLLM, demonstrating that fine-tuning LLMs on serialized strings yields state-of-the-art performance, particularly on few-shot and context-heavy tasks where traditional tree-based models struggle. Subsequent works have further validated the superior generalization of LLMs on “dirty” or semantic datasets (Dinh et al., 2022; Wu et al., 2025). However, interpretability in this domain remains under-explored. Hegselmann et al. (2023) attempted to interpret TabLLM by fitting a global Logistic Regression surrogate to the model’s zero-shot predictions. While this provides a general overview of feature weights, it relies on a linear proxy that fails to capture the instance-specific, non-linear decision logic encoded in the fine-tuned LLM parameters. Our work addresses this gap by interpreting the fine-tuned model directly, rather than explaining a global proxy.
Feature Attribution in LLMs.
Post-hoc attribution methods fall broadly into gradient-based and perturbation-based categories. Gradient-based methods, including saliency maps (Simonyan et al., 2013) and Integrated Gradients (Sundararajan et al., 2017), have been applied to LLMs but are often computationally prohibitive and noisy in deep architectures. Perturbation-based methods, specifically Shapley Values (Lundberg and Lee, 2017), offer theoretically grounded axioms for attribution. Horovicz and Goldshmidt (2024) and Enouen et al. (2023) recently adapted Monte Carlo Shapley estimation to LLMs, treating individual tokens as players to derive importance scores. Expanding attribution to multimodal settings, Goldshmidt et al. (2025) introduced PixelSHAP to reveal object-level focus in Vision-Language Models, showing the necessity of attributing semantically meaningful units (objects) rather than raw inputs (pixels). However, for tabular tasks, existing text-based methods still typically operate at the sub-word level, which is equally suboptimal since tabular semantic meaning is encapsulated in multi-token fields (e.g., “Age: 50” is split into four tokens). TabSHAP bridges this gap by enforcing feature-level token aggregation to maintain semantic integrity, drawing inspiration from higher-level semantic abstraction, and replacing standard logit-drop metrics with strict distributional measures. These advancements in interpretability are necessary for building trust in LLM applications and foundational data models, aligning with emerging research priorities.
Interactions and Distributional Analysis.
Recent advancements have sought to move beyond simple marginal attribution. Kang et al. (2025) proposed SPEX, a rigorous framework that uses sparse Fourier transforms to detect higher-order feature interactions in long-context LLMs. While SPEX focuses on scalability and interactions, our work targets the specific constraints of tabular classification, where faithful marginal attribution is the primary auditing requirement. Furthermore, while most existing methods measure impact via scalar probability shifts, our work leverages Jensen-Shannon Divergence (JSD). This aligns with recent efforts in measuring diagnostic uncertainty, acknowledging that a feature’s importance is defined not just by its ability to flip a label, but by its contribution to the model’s distributional certainty.
3 Methodology
We propose TabSHAP, a post-hoc interpretability framework for LLMs fine-tuned on tabular classification tasks. Building on the sampled-coalition procedure introduced by TokenSHAP (Horovicz and Goldshmidt, 2024), our approach adapts perturbation and scoring to enforce field-level masking and distributional comparison suited to serialized tabular prompts. Figure 1 summarizes the pipeline: serialization, inference, top-$k$ class aggregation, omission-based masking, Jensen–Shannon similarity, and normalized importance scores.
3.1 Model Architecture and Optimization
Our framework is instantiated using DeepSeek-R1-Distill-Llama-8B, a state-of-the-art distilled reasoning model chosen for its balance of inference efficiency and logic capabilities. To adapt this general-purpose model to the specific distribution of tabular risk scoring, we utilize Quantized Low-Rank Adaptation (QLoRA) (Dettmers et al., 2024). QLoRA allows us to fine-tune the model in 4-bit precision while freezing the base parameters, significantly reducing memory overhead while retaining high-fidelity representations. The training pipeline is accelerated using the Unsloth optimization framework, which provides optimized kernels for gradient checkpointing and faster backpropagation.
3.2 Tabular Serialization and Fine-Tuning
Let $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$ be a tabular dataset where $x_i$ contains $M$ features and $y_i$ is the target label (e.g., a binary or multiclass classification target). To bridge the modality gap, we employ a deterministic serialization function $\sigma$ that converts structured rows into instruction-tuning prompts.
Unlike the natural language sentence templates often favored in prior work (e.g., TabLLM’s “The age is 39…”) (Hegselmann et al., 2023), we utilize a concise key-value representation to maximize token efficiency and structural clarity. As implemented in our conversion script, numerical values are cast to integers, and categorical strings are normalized (lowercased, spaces replaced with underscores) to reduce tokenizer fragmentation. The features are concatenated into a space-delimited string, where $\text{name}_j$ is the column name of feature $j$ and $\text{val}_j$ its normalized value:
$$\sigma(x) = \text{name}_1{:}\text{val}_1 \;\; \text{name}_2{:}\text{val}_2 \;\; \cdots \;\; \text{name}_M{:}\text{val}_M \tag{1}$$
For example, a row is serialized as: age:50 workclass:private .... This feature string is embedded into a standard instruction-tuning template.
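The serialization step described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' conversion script: the helper names `normalize_value` and `serialize_row` are hypothetical, and applying the same normalization to column names is an assumption.

```python
def normalize_value(value):
    """Cast numbers to integers; lowercase strings and join words with underscores."""
    if isinstance(value, bool):
        return str(value).lower()
    if isinstance(value, (int, float)):
        return str(int(value))
    return "_".join(str(value).strip().lower().split())


def serialize_row(row):
    """Join key:value fields into one space-delimited string (dict order is kept)."""
    return " ".join(
        f"{normalize_value(key)}:{normalize_value(val)}" for key, val in row.items()
    )


# Hypothetical row used only for illustration.
row = {"age": 50.0, "workclass": "Private", "marital status": "Never married"}
serialized = serialize_row(row)
print(serialized)
```

Because every field becomes exactly one whitespace-free `key:value` unit, splitting the string on spaces recovers the fields, which is what makes field-level masking straightforward later on.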
3.3 Output Distribution and Verbalizer Aggregation
Unlike discriminative classifiers that output a fixed vector of class probabilities, generative LLMs distribute probability mass over a vast vocabulary $\mathcal{V}$. To extract a calibrated classification probability suitable for Shapley estimation, we perform a direct logit probe at the final token position of the prompt.
Let $z \in \mathbb{R}^{|\mathcal{V}|}$ be the logits output by the model’s language modeling head for prompt $x$. We compute the probability distribution over the vocabulary via the softmax function: $P(t \mid x) = \exp(z_t) / \sum_{t' \in \mathcal{V}} \exp(z_{t'})$ for each token $t \in \mathcal{V}$.
Token Aggregation: A critical challenge in LLM interpretability is that tokenization is sensitive to spacing and casing. For example, the semantic concept “Yes” may be represented by distinct tokens such as “Yes”, “ yes”, or fragmented tokens like “ye” + “s”. Relying on a single token ID causes feature attribution to be unstable. To resolve this, we define a Verbalizer Mapping $\mathcal{M}$ that maps each target class $c \in \mathcal{C}$ to a set $\mathcal{M}(c)$ of semantically equivalent tokens.
We compute the raw probability mass for a class $c$ by aggregating over the top-$k$ most probable tokens:
$$\tilde{P}(c \mid x) = \sum_{t \,\in\, \mathrm{Top}_k(x) \,\cap\, \mathcal{M}(c)} P(t \mid x) \tag{2}$$
This dynamic aggregation is particularly critical for distilled or quantized models, where tokenization artifacts may cause the model to output fragmented tokens. Finally, to ensure a valid probability distribution for the Shapley calculation, we normalize the aggregated probabilities over the candidate class subspace $\mathcal{C}$:
$$\hat{P}(c \mid x) = \frac{\tilde{P}(c \mid x)}{\sum_{c' \in \mathcal{C}} \tilde{P}(c' \mid x)} \tag{3}$$
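The aggregation and normalization steps can be illustrated with a small pure-Python sketch. The function names, the toy logits, and the verbalizer sets below are hypothetical; the uniform fallback when no label token appears in the top-k is an assumption, not something the text specifies.

```python
import math


def softmax(logits):
    """Softmax over a token -> logit mapping, stabilized by subtracting the max."""
    peak = max(logits.values())
    exps = {tok: math.exp(z - peak) for tok, z in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def class_distribution(logits, verbalizer, k):
    """Sum top-k token mass per class, then renormalize over the classes."""
    probs = softmax(logits)
    top_k = sorted(probs, key=probs.get, reverse=True)[:k]
    raw = {
        label: sum(probs[tok] for tok in top_k if tok in tokens)
        for label, tokens in verbalizer.items()
    }
    total = sum(raw.values())
    if total == 0.0:
        # Assumed fallback: no label token in the top-k, so return a uniform vector.
        return {label: 1.0 / len(raw) for label in raw}
    return {label: mass / total for label, mass in raw.items()}


# Toy next-token logits and label-to-token mapping (illustrative values only).
toy_logits = {"Yes": 2.0, " yes": 1.0, "No": 0.5, " no": 0.0, "the": -1.0}
toy_verbalizer = {"yes": {"Yes", " yes", "yes"}, "no": {"No", " no", "no"}}
dist = class_distribution(toy_logits, toy_verbalizer, k=4)
print(dist)
```

With `k=1` only the single most likely token counts, so the class vector collapses onto one label; a larger `k` lets spacing and casing variants contribute to the same class.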
3.4 Feature Perturbation via Atomic Masking
A core challenge in LLM interpretability is defining the “absence” of a feature in a text stream. Naive token-level deletion (e.g., masking individual tokens like “5” in “Age: 50”) corrupts the semantic integrity of the prompt, often creating invalid numerical values (e.g., “Age: 0”) that trigger model hallucinations unrelated to the feature’s true importance.
To address this, TabSHAP implements Atomic Feature Masking. Each tabular field is serialized as a single space-delimited token of the form key:value (see §3.2); we treat each such token as one atomic unit. During coalition evaluation, if a feature $i$ is absent from coalition $S$, its entire key:value string is omitted from the input block; we do not substitute mean, mode, or placeholder values. The instruction and response template remain fixed, so the model sees a valid prompt containing only the active features in $S$.
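A minimal sketch of omission-based masking is shown below. The `### Input:` and `### Response:` markers follow Appendix A; the instruction text, the exact whitespace of the template, and the function name `build_prompt` are assumptions made for illustration.

```python
def build_prompt(fields, coalition, instruction):
    """Keep only the key:value fields whose index is in the coalition."""
    kept = [field for idx, field in enumerate(fields) if idx in coalition]
    return f"{instruction}\n\n### Input:\n{' '.join(kept)}\n\n### Response:\n"


# Hypothetical instruction and fields, for illustration only.
instruction = "Predict whether the person earns more than 50K per year."
fields = "age:50 workclass:private education:bachelors".split()

# Coalition {0, 2}: feature 1 (workclass) is absent, so its field is dropped whole.
masked_prompt = build_prompt(fields, {0, 2}, instruction)
print(masked_prompt)
```

Nothing is substituted for the missing field, so the prompt stays syntactically valid and no partial value such as a truncated number can appear.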
3.5 Distributional Similarity & Estimation
Computing exact Shapley values is computationally intractable for LLMs due to the exponential number of feature combinations. We adopt the Monte Carlo estimation framework from TokenSHAP (Horovicz and Goldshmidt, 2024), but we fundamentally alter the metric to suit tabular classification.
Instead of semantic similarity, TabSHAP measures the preservation of the model’s belief state. We define the value function $v(S)$ of a feature coalition $S$ as its Distributional Similarity to the full model output. Formally, let $P_{\text{full}}$ be the class distribution given all features, and $P_S$ the distribution under coalition $S$. Let $\mathrm{JSD}(\cdot \,\|\, \cdot)$ denote Jensen–Shannon divergence with the natural logarithm (in nats). We map divergence to a bounded similarity in $[0, 1]$ by normalizing by $\ln 2$ (the maximum of $\mathrm{JSD}$ for a binary support):
$$v(S) = 1 - \frac{\mathrm{JSD}(P_{\text{full}} \,\|\, P_S)}{\ln 2} \tag{4}$$
A value of $v(S) \approx 1$ implies that $S$ yields a class distribution close to the full-input distribution; $v(S)$ near $0$ implies large divergence. For ablations, we swap the distance in this step for KL divergence or L1 distance, using the same bounded similarity maps as in our evaluation code (kl_divergence, l1_distance).
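The JSD-based value function can be written directly from its definition. The sketch below uses only the standard library; the function names are hypothetical, and the bounded maps used for the KL and L1 ablations are not reproduced because the text does not specify them.

```python
import math


def kl_nats(p, q):
    """KL(p || q) in nats; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd_nats(p, q):
    """Jensen-Shannon divergence in nats, bounded above by ln 2."""
    mix = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_nats(p, mix) + 0.5 * kl_nats(q, mix)


def jsd_similarity(p_full, p_coalition):
    """Map divergence to a similarity in [0, 1] by dividing by ln 2."""
    return 1.0 - jsd_nats(p_full, p_coalition) / math.log(2)


# Toy class distributions (illustrative values only).
same = jsd_similarity([0.9, 0.1], [0.9, 0.1])
shifted = jsd_similarity([0.9, 0.1], [0.6, 0.4])
opposite = jsd_similarity([1.0, 0.0], [0.0, 1.0])
print(same, shifted, opposite)
```

Note that the similarity moves even when the predicted label does not change (both `[0.9, 0.1]` and `[0.6, 0.4]` favor the first class), which is the distributional sensitivity the method relies on.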
Following the sampled-coalition estimator used in TokenSHAP-style methods, we approximate feature scores by Monte Carlo sampling over coalitions: we always include all leave-one-out coalitions, add random non-empty coalitions subject to a sampling ratio $r$ and a cap $C_{\max}$ on the total number of coalitions, then set the score $\phi_i$ of feature $i$ to the difference between the average value over sampled coalitions that include $i$ and the average over those that exclude $i$. This is a tractable proxy for full Shapley values, not the exact weighted Shapley formula over all subsets. Finally, we shift the scores by their minimum and normalize them to sum to one. Algorithm 1 details the procedure.
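The estimator described above can be sketched as follows. This is an illustration under stated assumptions rather than a transcription of Algorithm 1: the budget rule (a fraction of all non-empty coalitions, capped), the independent coin-flip sampling of coalitions, the uniform fallback when all scores tie, and the name `tabshap_scores` are all hypothetical choices. The argument `value_fn` stands in for the coalition similarity computed from model outputs.

```python
import random


def tabshap_scores(n_features, value_fn, ratio, max_coalitions, seed=0):
    """With/without-average scores over sampled coalitions, normalized to sum to one."""
    if n_features < 2:
        raise ValueError("need at least two features")
    rng = random.Random(seed)
    full = frozenset(range(n_features))
    # Always include every leave-one-out coalition.
    coalitions = {full - {i} for i in full}
    # Assumed budget rule: a fraction of all non-empty coalitions, subject to the cap.
    n_nonempty = 2 ** n_features - 1
    budget = min(max_coalitions, int(ratio * n_nonempty), n_nonempty)
    while len(coalitions) < budget:
        sample = frozenset(i for i in full if rng.random() < 0.5)
        if sample:  # skip the empty coalition
            coalitions.add(sample)
    values = {s: value_fn(s) for s in coalitions}  # one evaluation per coalition

    raw = []
    for i in range(n_features):
        with_i = [v for s, v in values.items() if i in s]
        without_i = [v for s, v in values.items() if i not in s]
        raw.append(sum(with_i) / len(with_i) - sum(without_i) / len(without_i))
    # Shift by the minimum, then normalize to sum to one.
    shifted = [score - min(raw) for score in raw]
    total = sum(shifted)
    if total == 0:
        return [1.0 / n_features] * n_features
    return [score / total for score in shifted]


# Toy additive value function standing in for the model-based similarity.
toy_weights = [0.6, 0.3, 0.1]


def toy_value(coalition):
    return sum(toy_weights[i] for i in coalition)


scores = tabshap_scores(3, toy_value, ratio=1.0, max_coalitions=7)
print(scores)
```

One consequence of the shift-and-normalize step is that the lowest-ranked feature always receives a score of exactly zero, so the normalized scores are best read as a relative ranking rather than as absolute contributions.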
4 Experiments
We evaluate TabSHAP on two distinct tabular tasks to demonstrate its faithfulness, alignment with established baselines, and ability to capture logical constraints in multiclass settings.
4.1 Experimental Setup
Datasets. We utilize the Adult Income dataset (48,842 instances, 14 features) as a standard binary classification benchmark. We add the Heart Disease dataset (1,025 instances, 13 features) to provide validation in a critical, high-stakes healthcare domain where interpretability is essential. Table 1 summarizes the datasets.
| Dataset | Instances | Features | Task |
|---|---|---|---|
| Adult Income | 48,842 | 14 | Income > $50K |
| Heart Disease | 1,025 | 13 | Disease Prediction |
Models and Hyperparameters. We fine-tune DeepSeek-R1-Distill-Llama-8B (Touvron et al., 2023) using QLoRA (Dettmers et al., 2024) on serialized feature strings. The training leverages the Unsloth optimization framework for memory efficiency. Key hyperparameters include a LoRA rank of 16, a LoRA scaling factor alpha, and 4-bit precision loading. For TabSHAP attribution, we use a fixed coalition sampling ratio, a cap on the number of coalitions (including all leave-one-out sets), top-$k$ logits for class aggregation, and deterministic probing at the final prompt position (no stochastic decoding), so differences across runs come from feature subsets rather than sampling noise.
Baselines. We compare TabSHAP against:
- Random Removal: A baseline where features are masked in random order.
- XGBoost + TreeSHAP: A state-of-the-art tree ensemble interpreted via TreeSHAP (Lundberg and Lee, 2017), serving as a “proxy ground truth” for tabular logic.
4.2 Faithfulness Evaluation (Deletion Curve)
We assess whether TabSHAP identifies features that genuinely steer the model by sequentially deleting atomic key:value tokens from the ### Input: block (instruction and response template unchanged). At each step we recompute the class distribution and track the probability mass on the originally predicted class (top-$k$ aggregation). We report the mean over test instances versus the fraction of features removed (normalized by the average number of features per instance). Removal orders compared are: (i) JSD-based TabSHAP ranking, (ii) XGBoost precomputed TreeSHAP ranking (when available), and (iii) a random permutation. This yields the faithfulness plot (JSD TabSHAP vs. baselines) in our evaluation pipeline.
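For a single instance, the deletion procedure can be sketched as below. The function name `deletion_curve`, the toy predictor, and its weights are hypothetical stand-ins; in the actual pipeline the predictor would be the fine-tuned LLM probed on the masked prompt, and curves would be averaged over test instances.

```python
def deletion_curve(fields, ranking, predict_fn):
    """Track the originally predicted class probability as fields are removed in order."""
    full_dist = predict_fn(fields)
    target = max(full_dist, key=full_dist.get)  # originally predicted class
    remaining = list(range(len(fields)))
    curve = [full_dist[target]]
    for idx in ranking:
        remaining.remove(idx)
        curve.append(predict_fn([fields[i] for i in remaining])[target])
    return curve


# Toy stand-in for the model: each present field adds a fixed amount of mass.
TOY_WEIGHTS = {"capital_gain:15000": 0.6, "education:masters": 0.2, "age:50": 0.1}


def toy_predict(active_fields):
    p_high = 0.05 + sum(TOY_WEIGHTS.get(f, 0.0) for f in active_fields)
    return {">50K": p_high, "<=50K": 1.0 - p_high}


fields = ["age:50", "education:masters", "capital_gain:15000"]
ranked_curve = deletion_curve(fields, [2, 1, 0], toy_predict)     # most important first
arbitrary_curve = deletion_curve(fields, [0, 1, 2], toy_predict)  # another order
print(ranked_curve)
print(arbitrary_curve)
```

A faithful ranking shows up as a curve that drops earlier, so the area under it is smaller than under a random or mismatched removal order.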
4.3 Distance Metric Ablation Study
We hold the test subset and TabSHAP hyperparameters fixed, load JSD attributions from cache, and compute KL- and L1-based TabSHAP rankings on the same indices (caching each metric’s rankings after the first run). We then run the identical deletion procedure as in §4.2, but compare removal curves induced by JSD-, KL-, and L1-based orderings, alongside XGBoost- and random-order baselines. Up to a fixed maximum number of features are removed per instance (or fewer if fewer fields exist). This produces the metric-ablation curves used in Figure 5.
4.4 Comparative Analysis: LLM vs. XGBoost
We compare the global feature rankings derived from TabSHAP against those from XGBoost (TreeSHAP) on the Adult Income dataset.
Rank Correlation. We observe a moderate Spearman rank correlation between the two methods. This suggests that while the fine-tuned LLM captures the fundamental causal signals (e.g., Capital Gain, Marital Status), it relies on different internal logic than the tree ensemble.
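For reference, the Spearman coefficient between two tie-free global rankings can be computed as in the sketch below. The two orderings are invented purely to exercise the formula and are not results from this study; only the feature names are taken from the Adult Income schema.

```python
def spearman_rho(order_a, order_b):
    """Spearman correlation for two rankings of the same items, assuming no ties."""
    if sorted(order_a) != sorted(order_b):
        raise ValueError("both rankings must contain the same items")
    n = len(order_a)
    rank_b = {item: pos for pos, item in enumerate(order_b)}
    d_squared = sum((pos - rank_b[item]) ** 2 for pos, item in enumerate(order_a))
    return 1.0 - 6.0 * d_squared / (n * (n ** 2 - 1))


# Hypothetical orderings (most to least important), for illustration only.
llm_order = ["capital_gain", "marital_status", "occupation",
             "education", "age", "hours_per_week"]
xgb_order = ["capital_gain", "marital_status", "age",
             "hours_per_week", "education", "occupation"]
rho = spearman_rho(llm_order, xgb_order)
print(round(rho, 4))
```

Agreement on the top-ranked features with disagreement lower down yields an intermediate coefficient, which is the pattern the comparison above describes qualitatively.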
Semantic Bias. Qualitative analysis reveals that the LLM tends to assign disproportionately higher importance to semantic, text-rich features (e.g., Occupation, Education) compared to XGBoost, which prioritizes precise numerical splits (e.g., Hours-per-week). This distinction highlights the value of using a faithful interpreter like TabSHAP rather than a global proxy; it reveals that Tabular LLMs solve tasks by leveraging semantic priors, a behavior concealed when approximating the LLM with a linear model.
5 Limitations
While TabSHAP provides faithful, instance-level attributions, several limitations should be noted. First, coalition sampling and the with/without score (Algorithm 1) approximate Shapley-style importance rather than the full exponential-time Shapley formula; faithfulness depends on the sampling ratio, the coalition cap, and the omission (not imputation) semantics. Second, repeated forward passes per instance make the method expensive in wall-clock time, though sampling caps cost in practice. Third, our evaluation is limited to two benchmarks with small feature counts (13 and 14 features); high-dimensional tabular settings remain open. Fourth, we report marginal attributions and do not model explicit feature interactions. Fifth, class probability extraction uses top-$k$ logit matching to predefined labels, which may need tuning for new tasks or vocabularies. Finally, we validate on a single fine-tuned backbone (DeepSeek-R1-Distill-Llama-8B); other architectures may behave differently.
6 Conclusion
In this work, we introduced TabSHAP, a model-agnostic interpretability framework for LLMs on serialized tabular prompts. Using Jensen–Shannon divergence (with KL/L1 ablations in the same implementation), atomic omission of key:value fields, and top-$k$ logit class aggregation, we measure local feature importance and validate it with deletion curves (predicted-class probability vs. fraction of features removed). Faithfulness evaluation compares JSD TabSHAP to XGBoost- and random-order removal; the metric study compares JSD, KL, and L1 TabSHAP orderings on matched test instances with cached attributions. Future work includes larger feature counts, interaction effects, and additional backbones.
References
- Dettmers et al. (2024). QLoRA: efficient finetuning of quantized LLMs. Advances in Neural Information Processing Systems 36.
- Dinh et al. (2022). LIFT: language-interfaced fine-tuning for non-language machine learning tasks. Advances in Neural Information Processing Systems 35, pp. 11763–11784.
- Enouen et al. (2023). TextGenSHAP: scalable post-hoc explanations in text generation. arXiv preprint arXiv:2312.01279.
- Goldshmidt et al. (2025). Attention, please! PixelSHAP reveals what vision-language models actually focus on. arXiv preprint arXiv:2503.00000.
- Hegselmann et al. (2023). TabLLM: few-shot classification of tabular data with large language models. In Proceedings of AISTATS.
- Horovicz and Goldshmidt (2024). TokenSHAP: interpreting large language models with Monte Carlo Shapley value estimation. arXiv preprint arXiv:2407.10114.
- Kang et al. (2025). SPEX: scaling feature interaction explanations for LLMs. Building Trust Workshop at ICLR.
- Lundberg and Lee (2017). A unified approach to interpreting model predictions. Advances in Neural Information Processing Systems 30.
- Simonyan et al. (2013). Deep inside convolutional networks: visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034.
- Sundararajan et al. (2017). Axiomatic attribution for deep networks. In International Conference on Machine Learning, pp. 3319–3328.
- Touvron et al. (2023). Llama: open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
- Wu et al. (2025). TableBench: a comprehensive and complex benchmark for table question answering. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 39, pp. 25497–25506.
Appendix A Prompt Serialization Formatting
To ensure robust tabular ingestion by the LLM, we employed a deterministic serialization script. Numerical integers were passed directly as text, while categorical values were normalized by lowercasing and replacing whitespace with underscores. This mitigated tokenizer fragmentation. Full schema translations were embedded into the standard DeepSeek instruction-tuning template, encapsulated between strict ### Input: and ### Response: markers to guarantee the attribution engine selectively masked correct feature bounds.
Appendix B Hyperparameters and Optimization
All TabSHAP models were built upon DeepSeek-R1-Distill-Llama-8B. To manage memory footprint while capturing complex tabular logic, models were fine-tuned via QLoRA.
- LoRA Configuration: rank 16 (as in §4.1) with a LoRA scaling factor alpha; target modules included the query, key, value, and output projections.
- Optimization: Unsloth kernel acceleration with 4-bit quantized loading.
- TabSHAP / ablation: a fixed coalition sampling ratio and coalition cap, top-$k$ class logits, and a fixed maximum number of sequential removals per curve; JSD rankings loaded from tokenshap_validation_cache.json; KL/L1 rankings written to separate JSON caches on first run so all metrics use the same selected_test_indices.
Appendix C Dataset Dimensions
Adult Income. Comprises 48,842 instances. Key text features include workclass, education, marital_status, occupation, and race. Key numerical parameters include age, capital_gain, and hours_per_week. Target threshold is income > $50K.
Heart Disease. Comprises 13 critical clinical parameters encompassing categorical inputs (e.g., chest pain type, resting electrocardiographic results) and continuous readings (e.g., resting blood pressure, serum cholesterol). Target represents heart disease diagnosis.