Machine learning (ML) systems are increasingly deployed in high-stakes domains such as healthcare, finance, and security, where they often handle sensitive and proprietary data. While these models provide significant utility, they are inherently vulnerable to privacy attacks, in which adversaries attempt to infer confidential information from a system's outputs. Such attacks can target different aspects of an ML model, including recovery of training data, extraction of model parameters, and inference of sensitive attributes of the data on which the model was trained.

In parallel, the push for explainable AI (XAI) has led to the widespread use of model explanation techniques, such as feature attributions, counterfactual explanations, and surrogate models, to improve transparency and trustworthiness. However, although these explanations are designed for interpretability, they can inadvertently expose additional information about the underlying model and its training data. For instance, they may reveal decision-boundary geometry, feature-importance patterns, or input-sensitivity information that adversaries can exploit to make privacy attacks more effective.
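As a concrete illustration of this kind of leakage, the sketch below (a hypothetical setup, not drawn from any specific system) shows how gradient-based feature attributions released alongside predictions can expose model parameters: for a logistic regression model, the input gradient of the predicted probability is proportional to the weight vector, so a single explanation query is enough to reconstruct the weights.

```python
# Minimal sketch (hypothetical victim model and explanation API) of parameter
# leakage through gradient-based attributions: for logistic regression, the
# input gradient of p(y=1|x) equals p*(1-p)*w, so the weights are recoverable.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Victim trains a model on (notionally sensitive) data.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
model = LogisticRegression().fit(X, y)

def gradient_explanation(x):
    """Vanilla-saliency-style attribution: d p(y=1|x) / d x."""
    p = model.predict_proba(x.reshape(1, -1))[0, 1]
    return p * (1 - p) * model.coef_[0]  # chain rule through the sigmoid

# Adversary queries one explanation plus the released probability and
# rescales the attribution; for a linear model this recovers w exactly.
x_query = X[0]
p = model.predict_proba(x_query.reshape(1, -1))[0, 1]
recovered_w = gradient_explanation(x_query) / (p * (1 - p))
print(np.allclose(recovered_w, model.coef_[0]))  # True: weights recovered
```

For nonlinear models the reconstruction is no longer exact, but repeated gradient queries still constrain the local decision boundary, which is the intuition behind explanation-aided extraction and inference attacks.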

This intersection between privacy and explainability raises a critical research question: To what extent can model explanations be leveraged to recover sensitive information from ML models? Addressing this question requires systematically evaluating how much additional advantage explanations provide to various privacy attack strategies, and understanding the trade-offs between interpretability and privacy.
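One way to make such an evaluation concrete is to run the same attack with and without access to explanations and compare performance. The sketch below is an illustrative protocol with assumed models and features, not a result: it trains a simple membership-inference attack on prediction confidence alone, then on confidence plus an explanation-derived feature (here a finite-difference input-gradient norm, standing in for an attribution the system would release), and reports the attack AUC in each case. A full study would evaluate the attack on held-out data or with shadow models.

```python
# Illustrative evaluation sketch (assumed threat model): compare membership-
# inference AUC with and without an explanation-derived feature.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=1)
X_mem, X_non, y_mem, y_non = train_test_split(X, y, test_size=0.5, random_state=1)

# Victim model trained only on the "member" half of the data.
victim = MLPClassifier(hidden_layer_sizes=(64,), max_iter=300,
                       random_state=1).fit(X_mem, y_mem)

def attack_features(Xq, yq, with_explanation):
    proba = victim.predict_proba(Xq)
    conf = proba[np.arange(len(yq)), yq]        # confidence in the true label
    feats = [conf]
    if with_explanation:
        # Explanation-derived signal: norm of a finite-difference input gradient.
        eps = 1e-3
        grads = np.stack([
            (victim.predict_proba(Xq + eps * np.eye(Xq.shape[1])[j])[:, 1]
             - proba[:, 1]) / eps
            for j in range(Xq.shape[1])
        ], axis=1)
        feats.append(np.linalg.norm(grads, axis=1))
    return np.column_stack(feats)

for use_expl in (False, True):
    F = np.vstack([attack_features(X_mem, y_mem, use_expl),
                   attack_features(X_non, y_non, use_expl)])
    membership = np.r_[np.ones(len(X_mem)), np.zeros(len(X_non))]
    # In-sample AUC for brevity; a proper study would hold out attack data.
    attack = LogisticRegression().fit(F, membership)
    auc = roc_auc_score(membership, attack.predict_proba(F)[:, 1])
    print(f"explanations={use_expl}: attack AUC = {auc:.3f}")
```

The gap between the two AUC values is one simple measure of the "additional advantage" that explanations grant the adversary.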


Resources


Privacy Attacks Leveraging Model Explanations

Log of research papers read

Latest Research Papers on Privacy Attacks (2020-Present)