preprint
Oct 9, 2025 · Eric Hanchen Jiang, Weixuan Ou, Run Liu, Shengyuan Pang, Guancheng Wan, Ranjie Duan, Wei Dong, Kai-Wei Chang, XiaoFeng Wang, Ying Nian Wu, Xinfeng Li
Safety alignment of large language models (LLMs) faces a key challenge: current alignment techniques often focus only on improving safety against harmful prompts, causing LLMs to b…