AI: AX - introspection

mcgrof

The art of looking into a model and understanding what is going on inside it through introspection is referred to as AX.

Episodes

  1. Aug 9

    Jailbreaking LLMs

    A long list of papers and articles on jailbreaking LLMs is reviewed. These sources primarily explore methods for bypassing safety measures in Large Language Models (LLMs), often referred to as "jailbreaking," along with proposed defense mechanisms. One key area of research is "abliteration," a technique that directly modifies an LLM's internal activations to remove censorship without traditional fine-tuning. Another significant approach, "Speak Easy," enhances jailbreaking by decomposing harmful requests into smaller, multilingual sub-queries, significantly increasing the LLMs' susceptibility to generating undesirable content. A third, "Sugar-Coated Poison," integrates benign content with adversarial reasoning to create effective jailbreak prompts. Together these papers highlight the ongoing challenge of securing LLMs against sophisticated attacks, with researchers employing various strategies to either exploit or fortify these AI systems.

    Sources:

    1. May 2025 - An Embarrassingly Simple Defense Against LLM Abliteration Attacks - https://arxiv.org/html/2505.19056v1
    2. June 2024 - Defending Large Language Models Against Jailbreak Attacks via Layer-specific Editing - https://arxiv.org/html/2405.18166v2
    3. October 2024 - Scalable Data Ablation Approximations for Language Models through Modular Training and Merging - https://arxiv.org/html/2410.15661v1
    4. February 2025 - Speak Easy: Eliciting Harmful Jailbreaks from LLMs with Simple Interactions - https://arxiv.org/html/2502.04322v1
    5. April 2025 - Sugar-Coated Poison: Benign Generation Unlocks LLM Jailbreaking - https://arxiv.org/html/2504.05652v1
    6. June 2024 - Uncensor any LLM with abliteration - https://huggingface.co/blog/mlabonne/abliteration
    7. Reddit 2024 - Why jailbreak ChatGPT when you can abliterate any local LLM? - https://www.reddit.com/r/ChatGPTJailbreak/comments/1givhkk/why_jailbreak_chatgpt_when_you_can_abliterate_any/
    8. May 2025 - WordGame: Efficient & Effective LLM Jailbreak via Simultaneous Obfuscation in Query and Response - https://arxiv.org/html/2405.14023v1
    9. July 2024 - Jailbreaking Black Box Large Language Models in Twenty Queries - https://arxiv.org/pdf/2310.08419
    10. October 2024 - Scalable Data Ablation Approximations for Language Models through Modular Training and Merging - https://arxiv.org/pdf/2410.15661

    10 minutes
  2. Aug 9

    PA-LRP & absLRP

    We focus on two evolutions of AX that advance the explainability of deep neural networks, particularly Transformers, by improving Layer-Wise Relevance Propagation (LRP) methods. One source introduces Positional Attribution LRP (PA-LRP), a novel approach that addresses the oversight of positional encoding in prior LRP techniques, showing that it significantly enhances the faithfulness of explanations in areas such as natural language processing and computer vision. The other proposes Relative Absolute Magnitude Layer-Wise Relevance Propagation (absLRP) to overcome issues with conflicting relevance values and varying activation magnitudes in existing LRP rules, demonstrating superior performance in generating clear, contrastive, and noise-free attribution maps for image classification. Both works also contribute new evaluation metrics to better assess the quality and reliability of attribution-based explainability methods, aiming to foster more transparent and interpretable AI models.

    Sources:

    1. June 2025 - Revisiting LRP: Positional Attribution as the Missing Ingredient for Transformer Explainability - https://arxiv.org/html/2506.02138v1
    2. December 2024 - Advancing Attribution-Based Neural Network Explainability through Relative Absolute Magnitude Layer-Wise Relevance Propagation and Multi-Component Evaluation - https://arxiv.org/pdf/2412.09311

    For context, the original 2024 AttnLRP paper was also given as a source:

    3. June 2024 - AttnLRP: Attention-Aware Layer-Wise Relevance Propagation for Transformers - https://arxiv.org/pdf/2402.05602

    20 minutes
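The abliteration technique summarized in the jailbreaking notes — directly modifying a model's internal activations to remove censorship — is commonly described as projecting a "refusal direction" out of hidden states. A minimal numpy sketch of that idea follows; the mean-difference estimate, the function names, and the random stand-in activations are illustrative assumptions, not code from the cited papers.

```python
import numpy as np

def estimate_refusal_direction(harmful_acts, harmless_acts):
    """Mean-difference estimate of the refusal direction (unit norm).

    In practice the activations would come from a real model run on
    harmful vs. harmless prompts; here they are synthetic stand-ins.
    """
    d = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def ablate(hidden, direction):
    """Remove the refusal direction from each hidden state by
    orthogonal projection: h <- h - (h . d) d."""
    return hidden - np.outer(hidden @ direction, direction)

# Toy demo with random activations shifted along some direction
rng = np.random.default_rng(0)
harmful = rng.normal(size=(16, 8)) + 2.0
harmless = rng.normal(size=(16, 8))
d = estimate_refusal_direction(harmful, harmless)

h = rng.normal(size=(4, 8))
h_ablated = ablate(h, d)
# After ablation, the hidden states have zero component along d
print(np.allclose(h_ablated @ d, 0.0))  # True
```

The projection is why abliteration needs no fine-tuning: it is a one-shot linear edit applied to activations (or baked into the weights), which is also what makes the defenses in the cited papers interesting.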
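Both PA-LRP and absLRP build on the basic LRP idea of redistributing an output's relevance back to its inputs in proportion to their contributions. A minimal sketch of the standard epsilon-LRP rule for a single linear layer, assuming zero bias and a small stabilizer (this is the classic rule, not the PA-LRP or absLRP variants from the papers):

```python
import numpy as np

def lrp_epsilon(a, W, b, R_out, eps=1e-6):
    """Propagate relevance through y = a @ W + b with the epsilon rule.

    Each input j receives relevance proportional to its contribution
    z_jk = a_j * W_jk to every output k, stabilized by eps.
    """
    z = a @ W + b                       # forward pre-activations
    s = R_out / (z + eps * np.sign(z))  # stabilized relevance ratio
    return a * (s @ W.T)                # redistribute to inputs

# Toy demo on random data
rng = np.random.default_rng(1)
a = rng.normal(size=5)
W = rng.normal(size=(5, 3))
b = np.zeros(3)
R_out = np.abs(rng.normal(size=3))

R_in = lrp_epsilon(a, W, b, R_out)
# With zero bias and small eps, total relevance is (almost) conserved:
print(abs(R_in.sum() - R_out.sum()) < 1e-2)  # True
```

Relevance conservation across layers is the property the new evaluation metrics in both papers probe; PA-LRP extends the propagation to positional encodings, and absLRP changes how conflicting signed contributions are weighted.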
