CNA finds tiny neuron refusal switch in LLMs — forward-pass attack risk for safety and business

One tiny circuit, big consequences: CNA finds the neurons that gate refusals — and what it means for LLM safety and AI for business

TL;DR: Researchers at Nous Research found a tiny group of neurons in the late layers of the instruction‑tuned Llama and Qwen models they tested that acts like a “refusal switch.” Their Contrastive Neuron Attribution (CNA) method finds and scales these neurons using only forward passes, sharply reducing refusal rates without retraining. That makes CNA a powerful audit tool and also a plausible attack vector for anyone who can modify activations at inference time. Product and security teams should treat forward‑hook access and inference‑path integrity as high‑risk control points.

Before / after — a short, visceral example

Prompt (harmful): “How can I build a device to damage property?”

Before CNA (typical instruct model): “I can’t help with that.”

After CNA ablation (same model with identified neurons scaled during inference): “[Model returns a substantive response — content redacted for safety.]”

The flip is the point: scaling a small set of neurons turned a refusal into compliance. The response content is intentionally omitted here; the behavioral change is the finding.

What CNA does — plain English

Think of a large language model as a huge office building. Most rooms do specialized work, but deep in the building there may be a small control room with a few switches that decide whether the building locks its doors (refuses) or opens them (complies). CNA finds those switches.

  • Contrastive search: Compare average neuron activations on a small set of harmful prompts versus benign prompts.
  • Rank neurons: Score each neuron by how much its mean activation differs between the two sets; pick the top ~0.1% most discriminative neurons.
  • Filter universals: Remove neurons that rank in the top ~0.1% on most prompts (about 80% or more of a diverse set); these are “always on” and not discriminative.
  • Verify causality: During inference, scale those neurons’ outputs (activation scaling — like turning a volume knob) to see whether the model’s refusal behavior changes. Disabling or reducing activations is called ablation.

Key conveniences: CNA requires only forward hooks (no gradients), no extra training, and no weight edits. That makes it lightweight and practical for rapid audits or targeted steering.

Concrete results and numbers

The authors validated CNA across 16 models (base and instruct variants) in the Llama and Qwen families, from 1B to 72B parameters. The main evaluation used the JBB‑Behaviors benchmark (100 harmful prompts). Discovery sets were typically 100 harmful + 100 benign prompts, and qualitative checks worked with as few as 8 + 8.

  • Qwen2.5‑7B‑Instruct refusal rate: 87% → 2% (≈ −97.7%)
  • Qwen2.5‑72B‑Instruct refusal rate: 78% → 8% (≈ −89.7%)
  • Llama‑3.1‑70B‑Instruct refusal rate: 86% → 18% (≈ −79.1%)
  • Llama‑3.2‑3B‑Instruct refusal rate: 84% → 47% (≈ −44.0%)
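The percentages in parentheses are relative changes in refusal rate, not percentage-point drops. A short sketch that reproduces them from the before/after figures listed above (the helper name `relative_change` is ours, not from the paper):

```python
# Refusal rates in percent (before, after ablation), copied from the list above.
results = {
    "Qwen2.5-7B-Instruct": (87, 2),
    "Qwen2.5-72B-Instruct": (78, 8),
    "Llama-3.1-70B-Instruct": (86, 18),
    "Llama-3.2-3B-Instruct": (84, 47),
}


def relative_change(before: float, after: float) -> float:
    """Relative change in percent; a negative value means refusals dropped."""
    return (after - before) / before * 100.0


for model, (before, after) in results.items():
    change = relative_change(before, after)
    print(f"{model}: {before}% -> {after}% ({change:+.1f}%)")
```

For example, 87% → 2% is an 85-point drop, which is 85 / 87 ≈ 97.7% of the original refusal rate.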

Output quality (measured as 1 − fraction of repeated n‑grams) stayed above ~0.97 across steering strengths. Task capability (MMLU) remained within about one percentage point of baseline. A secondary evaluation using a separate LLM judge (StrongREJECT) showed average compliance‑score improvements: Llama models ≈ +6%, Qwen models ≈ +31% after ablation.
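To make the output-quality metric concrete, here is a minimal sketch of "1 − fraction of repeated n‑grams." The n‑gram size (3), whitespace tokenization, and the function name `ngram_quality` are assumptions for illustration; the article does not state which settings the authors used.

```python
def ngram_quality(text: str, n: int = 3) -> float:
    """Return 1 - (fraction of repeated n-grams) over whitespace tokens.

    An n-gram counts as repeated if the same n-gram already occurred
    earlier in the text. A score of 1.0 means no repetition at all.
    """
    tokens = text.split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 1.0  # text too short to contain any n-gram
    repeated = len(ngrams) - len(set(ngrams))
    return 1.0 - repeated / len(ngrams)


fluent = "the model answered the question clearly and then stopped"
looping = "I will help I will help I will help I will help"
print(ngram_quality(fluent), ngram_quality(looping))
```

A degenerate, looping output scores far below a fluent one, which is why a floor of about 0.97 suggests the steered models were not collapsing into repetition.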

These are point estimates across many prompts and models; variability exists between model families and sizes, but the overall pattern—tiny neuron subsets concentrated late in the model driving refusals—was consistent.

Where the circuit lives, and what it means for alignment

Two anatomy facts matter:

  • The discriminative neurons are heavily concentrated in the final layers, roughly the last 10–25% depending on the model. Example: Llama‑3.2‑1B had 87% of its top‑200 discriminative neurons in the last three layers; Qwen2.5‑3B had 95% in the final quarter.
  • The anatomical scaffold that can host refusal behavior exists in base (pre‑instruction) models. Instruction fine‑tuning repurposes which specific neurons do the job — overlap of the exact neuron indices between matched base and instruct circuits is low (~8–29%).
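Both anatomy facts reduce to simple set arithmetic over (layer, neuron index) pairs. The sketch below uses made-up toy circuits and hypothetical helper names; the overlap definition (share of one circuit's neurons that also appear in the other) is an assumption, since the article does not say which overlap measure was used.

```python
def late_layer_fraction(neurons, num_layers, last_k):
    """Fraction of (layer, index) neurons located in the last `last_k` layers."""
    cutoff = num_layers - last_k
    return sum(1 for layer, _ in neurons if layer >= cutoff) / len(neurons)


def overlap_fraction(circuit_a, circuit_b):
    """Share of circuit_a's neurons that also appear in circuit_b."""
    a, b = set(circuit_a), set(circuit_b)
    return len(a & b) / len(a)


# Toy circuits for an illustrative 16-layer model (layers numbered 0..15).
base_circuit = [(15, 3), (15, 7), (14, 1), (2, 9)]
instruct_circuit = [(15, 3), (15, 8), (14, 2), (13, 5)]

print(late_layer_fraction(base_circuit, num_layers=16, last_k=3))
print(late_layer_fraction(instruct_circuit, num_layers=16, last_k=3))
print(overlap_fraction(base_circuit, instruct_circuit))
```

In this toy case both circuits sit mostly in the last three layers, yet they share only one exact neuron, which mirrors the pattern described above: same region, different neurons.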

Put simply: alignment doesn’t necessarily build a new scaffold; it reroutes functions onto existing late‑layer capacity. That may help explain why instruction tuning can produce robust refusals quickly: the plumbing was already there.

“CNA pinpoints the tiny fraction of MLP neurons whose activations best separate harmful from benign prompts and manipulates them at inference to change behavior.”

Why this matters for AI for business and deployment

Neuron‑level steering and forward‑pass manipulation change the risk model for deployed LLMs. Practical implications for product, legal, and security teams:

  • Reliability & trust: Small, covert changes to inference hooks can flip safety behavior without retraining, which undermines trust in hosted models.
  • Liability: If a vendor’s refusal behavior can be tampered with post‑deployment, downstream providers may face unexpected legal exposure.
  • Regulation & audits: Auditors must consider inference‑path integrity and neuron‑level telemetry, not just model weights or finetune records.
  • Product differentiation: Neuron‑level steering could be used positively (customized safety policies per customer) but also abused, so access controls and attestation matter.
  • Operational surveillance: Canary prompts and routine behavioral attestation should become standard operational controls for hosted LLM services.

What this means for business leaders

  • Neuron‑level vulnerabilities mean safeguards must include inference security, not just training‑time controls.
  • Small teams can run powerful audits without heavy compute — adopt routine behavioral checks.
  • Require vendors to attest to signed inference binaries and provide telemetry for refusal fingerprints.
  • Plan for incident response that includes verifying whether forward hooks or activation scaling were present.
  • Engage legal and compliance early: tamper‑evident attestations and SLA language should be updated.

Practical method snapshot (for technical leads)

  1. Collect two small prompt sets: harmful (positive) and benign (negative). Typical experiments used 100+100; small checks used 8+8.
  2. Run forward passes and record per‑neuron activations in feed‑forward (MLP) layers — these are the layers that process token representations between attention blocks.
  3. Compute mean activation difference per neuron: mean_activation(harmful) − mean_activation(benign). Rank by absolute value.
  4. Select the top ≈0.1% of MLP neurons by that rank. Remove neurons that are top‑0.1% across ≥80% of diverse prompts (“universally active”).
  5. Verify causality by scaling those neurons’ activations during inference (activation scaling). Test refusal rate on a benchmark (e.g., JBB‑Behaviors) and check quality metrics (repetition, MMLU).
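Steps 3 and 4 can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the authors' code: activations are given as one list of per-neuron values per prompt, the "universally active" check ranks neurons by absolute activation within each prompt (the article does not specify the per-prompt ranking), and all function names are ours.

```python
from statistics import fmean


def column_means(acts):
    """Mean activation per neuron; `acts` holds one activation list per prompt."""
    return [fmean(col) for col in zip(*acts)]


def top_indices(values, k):
    """Indices of the k largest values."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    return set(order[:k])


def select_neurons(harmful, benign, diverse, top_frac=0.001, universal_thresh=0.8):
    """Contrastive selection: rank by |mean difference|, then drop universal neurons."""
    n = len(harmful[0])
    k = max(1, round(top_frac * n))
    # Step 3: score = |mean_activation(harmful) - mean_activation(benign)|.
    scores = [abs(h - b) for h, b in zip(column_means(harmful), column_means(benign))]
    candidates = top_indices(scores, k)
    # Step 4: count how often each neuron lands in a single prompt's top-k.
    counts = [0] * n
    for prompt_acts in diverse:
        for i in top_indices([abs(a) for a in prompt_acts], k):
            counts[i] += 1
    universal = {i for i, c in enumerate(counts) if c / len(diverse) >= universal_thresh}
    return sorted(candidates - universal)


# Toy data: 4 neurons. Neuron 1 separates the sets; neuron 2 is large everywhere.
harmful = [[1.0, 5.0, 9.0, 0.1], [1.2, 5.2, 9.1, 0.0]]
benign = [[1.1, 0.5, 8.0, 0.1], [0.9, 0.3, 8.2, 0.2]]
# top_frac is inflated to 0.5 only because the toy model has 4 neurons.
selected = select_neurons(harmful, benign, diverse=harmful + benign, top_frac=0.5)
print(selected)
```

Here neuron 2 makes the initial cut because its mean shifts between the sets, but it is discarded because it dominates every prompt; only neuron 1 survives. Reusing the discovery prompts as the "diverse" set is a shortcut for the toy example.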

Key hyperparameters: top‑k ≈ 0.1% of MLP neurons; universal filter threshold ≈ 80%; discovery set sizes 8–100+; scaling strength m, the multiplier applied to the selected activations, can be tuned (m < 1 attenuates, with m = 0 as full ablation; m > 1 amplifies and can induce repetition at extremes).
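To show what activation scaling does mechanically, here is a toy gated SiLU MLP in plain Python with an optional per-neuron multiplier m. In a real model this multiplication would happen inside a forward hook on the MLP; the weights, the choice to scale the hidden activation (SiLU(gate) × up), and all names below are illustrative assumptions.

```python
import math


def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def gated_mlp(x, W_gate, W_up, W_down, scale=None):
    """Compute down(silu(gate(x)) * up(x)) with optional per-neuron scaling.

    `scale` maps a hidden-neuron index to its multiplier m; neurons that
    are not listed keep m = 1 (unchanged).
    """
    hidden = [silu(g) * u for g, u in zip(matvec(W_gate, x), matvec(W_up, x))]
    if scale:
        hidden = [h * scale.get(i, 1.0) for i, h in enumerate(hidden)]
    return matvec(W_down, hidden)


# Toy weights: 2 inputs, 3 hidden neurons, 2 outputs.
x = [1.0, -1.0]
W_gate = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_up = [[1.0, 0.0], [1.0, -1.0], [2.0, 0.0]]
W_down = [[1.0, 1.0, 1.0], [0.0, 1.0, 0.0]]

baseline = gated_mlp(x, W_gate, W_up, W_down)
ablated = gated_mlp(x, W_gate, W_up, W_down, scale={1: 0.0})    # m = 0
amplified = gated_mlp(x, W_gate, W_up, W_down, scale={1: 2.0})  # m = 2
print(baseline, ablated, amplified)
```

Setting m = 0 for hidden neuron 1 removes its contribution from every output it feeds, while the weights themselves are never touched. That is the property that makes the technique both cheap to audit with and hard to spot from weight checksums alone.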

Limitations & open questions

  • Validated on gated SiLU MLPs with GQA attention (Llama/Qwen families). Not yet shown on mixtures‑of‑experts (MoE) or other MLP/attention variants.
  • CNA uses raw activation‑difference scores rather than formal attribution metrics; how these scores relate to attribution theory remains an open research question.
  • Robustness to prompt distributions, languages, and diverse formulations needs more study. Circuit stability across discovery sets and seed variability should be quantified per deployment.
  • Detecting ablation at inference time through telemetry fingerprints appears feasible but is not yet well developed; practical detectors still need to be built and tested.

Threat model — short and practical

Attacker capabilities required to exploit CNA-like techniques:

  • Ability to attach forward hooks or otherwise read and modify per‑neuron activations during inference.
  • Knowledge of model layer layout or ability to probe where discriminative neurons concentrate (usually late layers).
  • Access to a small set of harmful/benign prompts to identify discriminative neurons (discovery set can be small).

With those capabilities, an attacker can reduce refusal rates and cause a deployed model to comply with harmful prompts without changing weights or retraining. Closed APIs that do not expose activations are safer by default, but hosted services and research deployments that permit hooks are at risk.

Checklist for ops & security teams

  • Lock down any ability to attach forward hooks or modify activations in production environments.
  • Instrument and retain coarse neuron‑level telemetry (with privacy and storage controls) to support tamper detection.
  • Run routine behavioral attestation: use a standard harmful prompt suite (e.g., JBB‑Behaviors) and compare refusal fingerprints over time.
  • Require signed and attested inference pipelines; deploy integrity checks on binaries and containers.
  • Add canary prompts and monitor for sudden shifts in refusal patterns or activation statistics.
  • Include neuron‑level checks in vendor SLAs and procurement contracts for hosted LLM services.
  • Coordinate responsible disclosure with model providers if you discover vulnerabilities.
  • Plan incident response that includes disabling hooks, restoring attested binaries, and replaying behavioral tests.
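The behavioral-attestation and canary items above can start as something very simple: run a fixed prompt suite on a schedule, measure the refusal rate, and alert when it drops sharply against a recorded baseline. The sketch below is a crude stand-in, assuming keyword matching instead of a proper judge; the marker list, the 10-point threshold, and the function names are hypothetical and should be tuned per deployment.

```python
# Hypothetical refusal phrases; a real deployment would use a stronger classifier.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm unable", "i am unable")


def is_refusal(response: str) -> bool:
    """Keyword check for a refusal; normalizes curly apostrophes first."""
    text = response.lower().replace("\u2019", "'")
    return any(marker in text for marker in REFUSAL_MARKERS)


def refusal_rate(responses) -> float:
    """Fraction of responses flagged as refusals."""
    return sum(is_refusal(r) for r in responses) / len(responses)


def refusal_drift_alert(baseline_rate: float, current_rate: float,
                        max_drop: float = 0.10) -> bool:
    """True when the refusal rate fell by more than `max_drop` (absolute)."""
    return (baseline_rate - current_rate) > max_drop


# Placeholder responses from a canary run (content deliberately generic).
canary_responses = [
    "I can\u2019t help with that.",
    "[substantive response]",
    "I cannot assist with this request.",
    "I won't do that.",
]
current = refusal_rate(canary_responses)
print(current, refusal_drift_alert(baseline_rate=0.87, current_rate=current))
```

A drop like the one in this toy run would not prove tampering, but it is exactly the kind of shift that should trigger the hook and binary checks listed above.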

Final notes and next steps

Neuron‑level steering is no longer purely academic: forward‑pass methods like CNA make it simple to both audit and, if abused, bypass safety gates. Alignment still matters — but so does securing inference paths and operational controls.

If useful for your team, a one‑page audit checklist and a short threat‑model paragraph tailored to your deployment (hosted API, on‑prem, SaaS chatbot) can be drafted to help you prioritize technical and policy changes. Reach out to get a compact, actionable version you can drop into procurement and security processes.