ICLR 2026 LLM Safety Papers: A Curated List

张开发
2026/4/11 16:57:17 · 15 min read


Master index: LLM Safety Research Paper Collection, 2026 Edition: https://blog.csdn.net/WhiffeYF/article/details/159047894
Related conversation: https://claude.ai/chat/91c4365e-5247-4117-acda-bd226566b80e
Compiled: 2026-04-11

## Overview

ICLR 2026 accepted 5,300 papers, 223 of them as Orals. This post selects roughly 50 papers directly related to LLM safety, spanning both Oral and Poster presentations, and groups them into 9 categories:

| Category | Count | Scope |
| --- | --- | --- |
| 1. Jailbreak attacks | 6 | Prompt rewriting, gradient-based optimization, multi-armed bandits, Classical Chinese jailbreaks |
| 2. Reasoning-model safety | 5 | Chain-of-thought hijacking, aligning the reasoning process, robustness to CoT interventions |
| 3. Safety alignment & defense | 11 | RL-based safety alignment, reasoning-style defenses, multilingual consistency, decode-time probing |
| 4. Fine-tuning / backdoor attacks | 5 | LoRA backdoors, steganographic malicious fine-tuning, harmful-gradient attenuation defenses |
| 5. Agent safety | 6 | Monitoring decomposition attacks, control-flow hijacking, agent-to-agent security benchmarks |
| 6. Multimodal safety | 6 | VLM jailbreak transfer, audio-model jailbreaks, visual backdoor attacks |
| 7. Safety evaluation & benchmarks | 2 | Multi-turn jailbreak benchmark, audio trustworthiness benchmark |
| 8. Code / generation safety | 5 | Secure code generation, watermarking, deepfake detection |
| 9. Other related work | 5 | Activation steering, honesty alignment, bias amplification, concept erasure |
| Appendix: Oral safety papers | 11 | Constitutional Classifiers, ASIDE, UltraBreak, etc. |

Overall, the ICLR 2026 safety papers show four trends: (1) reasoning-model safety has become a new hotspot, with several papers on how CoT can be hijacked or exploited; (2) jailbreak attack and defense are entering a compositional, automated phase, with dictionary learning and meta-optimization methods appearing; (3) agent safety is growing fast as an emerging direction, with decomposition attacks and control-flow hijacking drawing attention; (4) safety alignment is moving from shallow to deep, with papers exploring any-depth alignment and reasoning-based alignment.

## 1. Jailbreak Attacks

| # | Paper | OpenReview | Summary |
| --- | --- | --- | --- |
| 1 | Align to Misalign: Automatic LLM Jailbreak with Meta-Optimized LLM Judges (AMIS) | poster/10008164 | The AMIS meta-optimization framework co-evolves jailbreak prompts and scoring templates through bilevel optimization, automating jailbreak discovery |
| 2 | Obscure but Effective: Classical Chinese Jailbreak Prompt Optimization via Bio-Inspired Search (CC-BOS) | openreview | Uses an 8-dimensional search space over Classical Chinese contexts with bio-inspired optimization to generate jailbreak prompts; reaches 100% ASR even on reasoning models |
| 3 | Adversarial Déjà Vu: Jailbreak Dictionary Learning for Stronger Generalization to Unseen Attacks | poster/10009061 | Proposes the adversarial déjà vu hypothesis: future jailbreaks recombine existing adversarial skill primitives; dictionary learning improves generalization to unseen attacks |
| 4 | One Model Transfer to All: On Robust Jailbreak Prompts Generation against LLMs | openreview | Studies generation of robust jailbreak prompts that transfer across models |
| 5 | Efficient Jailbreak Attack Sequences on LLMs via Multi-Armed Bandit-Based Context Switching | openreview | Multi-armed-bandit-driven context switching yields efficient jailbreak attack sequences |
| 6 | Improved Techniques for Optimization-Based Jailbreaking on Large Language Models | openreview | Improvements in the GCG line of optimization-based jailbreaks |

## 2. Reasoning-Model / CoT Safety

| # | Paper | OpenReview | Summary |
| --- | --- | --- | --- |
| 1 | Towards Safe Reasoning in Large Reasoning Models via Corrective Intervention (IPO) | forum/2uTxLC4LmC | Intervened Preference Optimization (IPO) replaces compliant reasoning steps with safety triggers, aligning the reasoning process itself |
| 2 | Reasoned Safety Alignment: Ensuring Jailbreak Defense via Answer-Then-Check | openreview | Answer-then-check reasoning for safety alignment |
| 3 | Self-Jailbreaking: Language Models Can Reason Themselves Out of Safety Alignment After Benign Reasoning Training | openreview | Shows that after benign reasoning training, models can reason their own way around safety alignment |
| 4 | AdvChain: Adversarial Chain-of-Thought Tuning for Robust Safety Alignment of Large Reasoning Models | poster/10007590 | Adversarial CoT tuning to harden the safety alignment of reasoning models |
| 5 | Are Reasoning LLMs Robust to Interventions on their Chain-of-Thought? | poster/10008704 | Studies how robust reasoning LLMs are to interventions on their chain of thought |

## 3. Safety Alignment & Defense

| # | Paper | OpenReview | Summary |
| --- | --- | --- | --- |
| 1 | AlphaAlign: Incentivizing Safety Alignment with Extremely Simplified Reinforcement Learning | poster/10011731 | Extremely simplified RL (binary safety labels, under 200 RL steps) incentivizes intrinsic safety awareness and yields reasoning-style safety alignment |
| 2 | ARMOR: Aligning Secure and Safe Large Language Models via Meticulous Reasoning | poster/10009011 | Three-step reasoning defense pipeline (policy analysis → intent extraction → policy safety verification); cuts ASR on OOD jailbreak attacks to 0.06 |
| 3 | AlphaSteer: Learning Refusal Steering with Principled Null-Space Constraint | poster/10011789 | Learns refusal steering under principled null-space constraints |
| 4 | A Guardrail for Safety Preservation: When Safety-Sensitive Subspace Meets Harmful-Resistant Null-Space | poster/10011231 | Combines a safety-sensitive subspace with a harmful-resistant null space as a safety guardrail |
| 5 | Any-Depth Alignment: Unlocking Innate Safety Alignment of LLMs to Any-Depth | poster/10011912 | Extends LLMs' innate safety alignment from shallow tokens to arbitrary depth |
| 6 | Alignment-Weighted DPO: A Principled Reasoning Approach to Improve Safety Alignment | poster/10009740 | Principled, reasoning-based weighted DPO for better safety alignment |
| 7 | Align Once, Benefit Multilingually: Enforcing Multilingual Consistency for LLM Safety Alignment | poster/10006879 | Cross-lingual safety consistency: align once, benefit in many languages |
| 8 | Aligning Deep Implicit Preferences by Learning to Reason Defensively | poster/10008837 | Aligns deep implicit preferences by teaching models to reason defensively |
| 9 | A2D: Any-Order, Any-Step Safety Alignment for Diffusion Language Models | poster/10009223 | Any-order, any-step safety alignment for diffusion language models |
| 10 | SIRL: Self-Incentivized Reinforcement Learning for Safety (entropy-based safety RL) | openreview | Finds response entropy is a reliable intrinsic safety signal; entropy minimization improves safety without any external reward |
| 11 | From Threat to Tool: Leveraging Refusal-Aware Injection Attacks for Safety Alignment | openreview | Turns refusal-aware injection attacks into a tool for safety alignment |

## 4. Fine-Tuning / Backdoor Attacks

| # | Paper | OpenReview | Summary |
| --- | --- | --- | --- |
| 1 | JailbreakLoRA: Your Downloaded LoRA from Sharing Platforms might be Unsafe | forum/4YgvVRoSnF | Shows LoRAs downloaded from sharing platforms may carry jailbreak backdoors |
| 2 | Invisible Safety Threat: Malicious Finetuning for LLM via Steganography | poster/10011363 | Steganographic malicious fine-tuning: the model looks safety-aligned on the surface while covertly generating harmful content |
| 3 | Revisiting Backdoor Attacks on LLMs | openreview | Revisits LLM backdoor attacks; an implicit poisoning strategy injects backdoors while preserving apparent safety alignment |
| 4 | Antibody: Strengthening Defense Against Harmful Fine-Tuning for Large Language Models via Attenuating Harmful Gradient Influence | poster/10007199 | Defends against harmful fine-tuning by attenuating the influence of harmful gradients |
| 5 | Safety Subspaces are Not Linearly Distinct: A Fine-Tuning Case Study | EPFL team | Safety subspaces are not linearly separable: a fine-tuning case study |

## 5. Agent Safety

| # | Paper | OpenReview | Summary |
| --- | --- | --- | --- |
| 1 | Breaking and Fixing Defenses Against Control-Flow Hijacking in Multi-Agent Systems | openreview | Attacks on, and fixes for, control-flow-hijacking defenses in multi-agent systems |
| 2 | RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in Hybrid Web-OS Environments | openreview | Realistic adversarial testing of computer-use agents in hybrid web-OS environments |
| 3 | Monitoring Decomposition Attacks | openreview | Finds lightweight sequential monitors defend effectively against decomposition attacks; releases a dataset of 4,634 harmful-benign task pairs |
| 4 | Adaptive Attacks on Trusted Monitors Subvert AI Control Protocols | poster/10006727 | Adaptive attacks on trusted monitors can subvert AI control protocols |
| 5 | A2ASecBench: A Protocol-Aware Security Benchmark for Agent-to-Agent Multi-Agent Systems | poster/10010017 | Protocol-aware security benchmark for agent-to-agent multi-agent systems |
| 6 | AgenTracer: Who Is Inducing Failure in the LLM Agentic Systems? | poster/10007726 | Traces which agent is inducing failures in LLM agentic systems |

## 6. Multimodal Safety

| # | Paper | OpenReview | Summary |
| --- | --- | --- | --- |
| 1 | VLSU: Mapping the Limits of Joint Multimodal Understanding for AI Safety | openreview | Maps the limits of joint multimodal understanding for AI safety |
| 2 | ARMs: Adaptive Red-Teaming Agent against Multimodal Models with Plug-and-Play Attacks | poster/10006730 | Adaptive red-teaming agent with plug-and-play attacks for comprehensive VLM risk assessment |
| 3 | JALMBench: Benchmarking Jailbreak Vulnerabilities in Audio Language Models | openreview | Benchmark of jailbreak vulnerabilities in audio language models: 11,316 text samples and 245,355 audio samples |
| 4 | GuardAlign: Safety Alignment for Vision-Language Models via Optimal Transport | openreview | Optimal-transport-based safety alignment for vision-language models |
| 5 | AdPO: Enhancing the Adversarial Robustness of Large Vision-Language Models with Preference Optimization | poster/10010620 | Preference optimization to improve the adversarial robustness of large VLMs |
| 6 | BEAT: Visual Backdoor Attacks on VLM-based Embodied Agents via Contrastive Trigger Learning | ICLR 2026 Downloads list | Visual backdoor attacks on VLM-based embodied agents via contrastive trigger learning |

## 7. Safety Evaluation & Benchmarks

| # | Paper | OpenReview | Summary |
| --- | --- | --- | --- |
| 1 | MultiBreak: Scalable Multi-Turn Jailbreak Benchmark | openreview | Large-scale multi-turn jailbreak benchmark: 1,724 intents with multi-turn adversarial prompts covering 9 coarse and 26 fine-grained safety categories |
| 2 | AudioTrust: Benchmarking The Multifaceted Trustworthiness of Audio Large Language Models | poster | Multifaceted trustworthiness benchmark for audio LLMs |

## 8. Code / Generation Safety

| # | Paper | OpenReview | Summary |
| --- | --- | --- | --- |
| 1 | SecCoderX: Secure Code Generation via Reasoning-Based Vulnerability Reward Model | openreview | Reasoning-based vulnerability reward model for secure code generation; first to raise the secure-code rate by 11-16% without degrading functionality |
| 2 | Analyzing and Evaluating Unbiased Language Model Watermark | poster/10011375 | Analysis and evaluation of unbiased language-model watermarking |
| 3 | An Ensemble Framework for Unbiased Language Model Watermarking | poster/10007956 | Ensemble framework for unbiased language-model watermarking |
| 4 | All Patches Matter: Enhance AI-Generated Image Detection via Panoptic Patch Learning | poster/10007395 | Panoptic patch learning to strengthen AI-generated image detection |
| 5 | A Rich Knowledge Space for Scalable Deepfake Detection | poster/10008071 | A rich knowledge space for scalable deepfake detection |

## 9. Other Related Papers

| # | Paper | OpenReview | Summary |
| --- | --- | --- | --- |
| 1 | Activation Steering with a Feedback Controller | poster/10006765 | Feedback-controller-based activation steering (representation engineering) |
| 2 | Annotation-Efficient Honesty Alignment via Confidence Elicitation and Calibration | poster/10008495 | Annotation-efficient honesty alignment via confidence elicitation and calibration |
| 3 | Antislop: A Comprehensive Framework for Identifying and Eliminating Repetitive Patterns in Language Models | poster/10008156 | Framework for identifying and eliminating repetitive patterns in language models |
| 4 | Aligned Agents, Biased Swarm: Measuring Bias Amplification in Multi-Agent Systems | poster/10007543 | Measures bias amplification in multi-agent systems |
| 5 | AEGIS: Adversarial Target-Guided Retention-Data-Free Robust Concept Erasure from Diffusion Models | poster/10011590 | Retention-data-free robust concept erasure from diffusion models |

## 10. Related ICLR 2026 Workshops

| # | Workshop | Link |
| --- | --- | --- |
| 1 | Agents in the Wild: Safety, Security, and Beyond | workshop/10000781 |
| 2 | AI for Peace | workshop/10000804 |
| 3 | Algorithmic Fairness Across Alignment Procedures and Agentic Systems | workshop/10000786 |

## Appendix: Safety-Related Oral Papers

Of the 223 Oral papers at ICLR 2026, the following are directly related to LLM safety (most already appear in the categories above):

| # | Paper | Type | Summary |
| --- | --- | --- | --- |
| 1 | Toward Universal and Transferable Jailbreak Attacks on Vision-Language Models (UltraBreak) | Oral | First VLM jailbreak framework with both cross-target universality and cross-model transferability (forum/T5hD0as3jb) |
| 2 | Defending LLMs Against Jailbreak Attacks via In-Decoding Safety-Awareness Probing | Oral | Decode-time safety-awareness probing: exploits the model's internal latent safety signals for early detection |
| 3 | ASGuard: Activation-Scaling Guard to Mitigate Targeted Jailbreaking Attack | Oral | Activation-scaling guard that mitigates targeted jailbreak attacks |
| 4 | ASIDE: Architectural Separation of Instructions and Data in Language Models | Oral | Architecture-level separation of instructions and data to resist prompt injection |
| 5 | GAVEL: Towards Rule-Based Safety through Activation Monitoring | Oral | Rule-based safety via activation monitoring |
| 6 | Constitutional Classifiers: Production-Grade Defenses against Universal Jailbreaks | Oral | Anthropic's production-grade defense against universal jailbreaks, building on Constitutional AI |
| 7 | Mitigating the Safety Alignment Tax with Null-Space Constrained Policy Optimization | Oral | Null-space constrained policy optimization to mitigate the safety alignment tax |
| 8 | Time-To-Inconsistency: A Survival Analysis of LLM Robustness to Adversarial Attacks | Oral | Survival analysis of LLM robustness to adversarial attacks |
| 9 | GhostEI-Bench: Do Mobile Agent Resilience to Environmental Injection in Dynamic On-Device Environments? | Oral | Benchmark of mobile agents' resilience to environmental injection in dynamic on-device environments |
| 10 | Be Careful When Fine-tuning On Open-Source LLMs: Your Fine-tuning Data Could Be Secretly Stolen! | Oral | Fine-tuning data for open-source LLMs can be secretly exfiltrated |
| 11 | GHOST: Hallucination-Inducing Image Generation for Multimodal LLMs | Oral | Hallucination-inducing image generation for multimodal LLMs |

Note: ICLR 2026 accepted 5,300 papers with 223 Orals; this list selects those directly related to LLM safety, jailbreak attacks, reasoning safety, alignment, agent safety, multimodal safety, and so on. Some OpenReview links are approximate (found by title search); treat the official ICLR virtual site as authoritative. A complete list of the 223 Oral papers, with Chinese translations, is on GitHub: https://github.com/XinyuLiuCs/iclr2026-oral-papers
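Several of the alignment papers in this list (AlphaSteer, the safety-guardrail paper, and the null-space constrained policy optimization Oral) share one linear-algebra idea: restrict a safety intervention to the null space of the activation directions that benign behavior relies on, so safety steering cannot perturb them. Here is a minimal NumPy sketch of that projection; all names, dimensions, and data are invented for illustration and are not taken from any paper's code:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64                       # toy hidden size
B = rng.normal(size=(8, d))  # rows: directions assumed to carry benign behavior
v = rng.normal(size=d)       # raw refusal-steering direction

# Null-space projector: pinv(B) @ B orthogonally projects onto row-space(B),
# so P = I - pinv(B) @ B keeps only the component of v outside that subspace.
P = np.eye(d) - np.linalg.pinv(B) @ B
v_safe = P @ v

# The constrained steering vector has (numerically) zero overlap with every
# benign direction, so adding it to hidden states leaves them unaffected
# along row-space(B).
print(np.allclose(B @ v_safe, 0.0))  # True
```

The same projector can be applied to gradient updates instead of steering vectors, which is roughly the flavor of the null-space constrained policy optimization work; the papers themselves differ in how the protected subspace is estimated.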
