NVIDIA garak: Open-Source Defensive LLM Red-Teaming Toolkit for Model Risk & Governance

NVIDIA garak: A Practical Defensive LLM Red‑Teaming Workflow for Business Teams

TL;DR: NVIDIA garak is an open‑source, modular toolkit for defensive LLM red‑teaming. Security engineers, MLOps, and product teams can use it to run repeatable, auditable scans (from dry runs to full model scans), compute safety metrics such as Attack Success Rate (ASR), and export structured evidence for governance.

Why this matters for business

As LLMs move from research into customer‑facing apps—chatbots, sales automation, internal assistants—the risk profile changes. It’s no longer enough to know what a model can do; you must know how it can fail, be manipulated, or expose sensitive information. Defensive red‑teaming turns ad‑hoc testing into a repeatable capability: a way to quantify model risk, prioritize fixes, and produce evidence for audits.

Think of garak as a test lab that fires controlled, composable “bullets” at your model. It’s modular (plugins for probes, detectors, and generators), so teams can mix and match attack inputs and checks to mirror real adversarial strategies or policy checks specific to your domain.

Quickstart — get a dry run going in minutes

  • Install: pip install garak
  • Environment tips for stable notebook runs:
    • set TOKENIZERS_PARALLELISM=false
    • set HF_HUB_DISABLE_TELEMETRY=1
  • Validate your setup without API keys: run a dry run using the built‑in generator test.Repeat.
  • Where artifacts land: garak writes JSONL (newline‑delimited JSON) and HTML reports, typically under ~/.local/share/garak/garak_runs, ~/.cache/garak, or your current directory.

Dry runs verify your pipeline and plugins before you ever touch external endpoints—an essential safety step for model risk management.
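A minimal sketch of that dry-run setup, written in Python so it can sit in a notebook cell. It only builds and prints the command; nothing is executed. The flag names (--model_type, --probes) are assumptions about garak's CLI and can differ between versions, so confirm them with python -m garak --help. The probe name is borrowed from the examples later in this article.

```python
import os
import shlex
import sys

# Environment tips from the quickstart: quieter tokenizers, no HF telemetry.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ["HF_HUB_DISABLE_TELEMETRY"] = "1"


def dry_run_command(probe: str = "dan.Dan_11_0") -> list[str]:
    """Build (but do not run) a garak dry-run command line.

    Assumed flags: --model_type and --probes. Verify against your
    installed version before relying on them.
    """
    return [
        sys.executable, "-m", "garak",
        "--model_type", "test.Repeat",  # built-in generator, no API key needed
        "--probes", probe,
    ]


cmd = dry_run_command()
print(shlex.join(cmd))
# To launch the scan for real: subprocess.run(cmd, check=True)
```

Keeping the command as a list rather than a shell string avoids quoting problems when it is later handed to subprocess.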

Core concepts: probes, detectors, generators

Garak’s plugin architecture separates the attack (probes) from the check (detectors) and from the system under test (generators). That separation is what makes tests composable.

  • Probe: a crafted attack or prompt template. Example probes in the ecosystem include things like dan.Dan_11_0 or encoding injectors.
  • Detector: a rule or model that flags risky outputs. It can be a simple pattern matcher or an embedding‑ or classifier‑based semantic detector.
  • Generator: the interface to the model under test, which produces the outputs that detectors score (the local test.Repeat generator for dry runs, or a RESTGenerator to call external model endpoints).
  • Buffs: modifiers that transform probe prompts before they are sent (e.g., encoding tricks) to simulate adversarial tactics.

Use multi‑probe runs (combine several probes) to simulate layered attacks and generate richer, more realistic reports.

Running scans: from dry runs to real models

  1. List available plugins: inventory commands show probes, detectors, generators, and buffs so you can plan combinations.
  2. Dry run: use test.Repeat to verify wiring without credentials.
  3. Real model scan example: point garak at a Hugging Face or REST target (the tutorial example used gpt2) and run a chosen probe such as dan.Dan_11_0.
  4. Combine probes: e.g., dan.Dan_11_0, encoding.InjectBase64, and lmrc.SlurUsage to explore different attack vectors in one campaign.
  5. Export: use the -r flag to convert a report into AVID format for audit traceability.
  6. External endpoints: adapt the RESTGenerator template to test endpoints you are authorized to scan.
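The steps above can be sketched as a small command builder. This is an illustration under assumptions, not garak's documented interface: the --list_* inventory flags, --model_type, --model_name, and the comma-separated --probes value are my reading of the CLI, and the report filename is a placeholder. The helper function names are hypothetical. Nothing here runs garak; it only assembles argument lists.

```python
import sys

GARAK = [sys.executable, "-m", "garak"]


def list_plugins_commands() -> list[list[str]]:
    """One inventory command per plugin family (assumed --list_* flags)."""
    kinds = ("probes", "detectors", "generators", "buffs")
    return [GARAK + [f"--list_{kind}"] for kind in kinds]


def scan_command(model_type: str, model_name: str, probes: list[str]) -> list[str]:
    """Scan one target with one or more probes, joined by commas."""
    return GARAK + [
        "--model_type", model_type,
        "--model_name", model_name,
        "--probes", ",".join(probes),
    ]


def avid_export_command(report_path: str) -> list[str]:
    """Convert an existing JSONL report to AVID using the -r flag."""
    return GARAK + ["-r", report_path]


campaign = scan_command(
    "huggingface", "gpt2",
    ["dan.Dan_11_0", "encoding.InjectBase64", "lmrc.SlurUsage"],
)
for command in list_plugins_commands() + [campaign, avid_export_command("garak.report.jsonl")]:
    print(" ".join(command[1:]))  # drop the interpreter path for readability
```

Generating the commands from one place makes it easier to keep notebook runs and scheduled jobs on the same probe set.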

Together, these steps give a hands‑on, end‑to‑end workflow for evaluating LLM behavior with garak.

Interpreting results: safety%, ASR, and prioritization

Garak produces JSONL reports you can parse with garak.report.Report for convenience, or ingest into pandas/NumPy for custom analytics.

  • Safety percentage: per‑probe metric indicating the share of attempts that passed detectors.
  • Attack Success Rate (ASR): ASR = 100% − safety%. Higher ASR = higher measured risk.

Visualize ASR by probe/detector pair (horizontal bar charts or heatmaps) to spot the riskiest inputs quickly. Surface flagged examples using a detector score threshold (a common demo threshold is ≥ 0.5) and review raw outputs for false positives or context that warrants escalation.
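As a rough sketch of the ASR calculation, the snippet below aggregates summary records per probe/detector pair using only the standard library. The record layout (entry_type, probe, detector, passed, total) is an assumption about the report format and the sample lines are invented for illustration; inspect a real report.jsonl from your garak version before adapting it.

```python
import json
from collections import defaultdict

# Hypothetical report lines; real reports carry many more fields.
SAMPLE_JSONL = """\
{"entry_type": "eval", "probe": "dan.Dan_11_0", "detector": "dan.DAN", "passed": 3, "total": 10}
{"entry_type": "eval", "probe": "encoding.InjectBase64", "detector": "encoding.DecodeMatch", "passed": 9, "total": 10}
{"entry_type": "attempt", "prompt": "..."}
"""


def asr_by_pair(jsonl_text: str) -> dict[tuple[str, str], float]:
    """Return ASR in percent per (probe, detector) pair from eval entries."""
    passed: dict[tuple[str, str], int] = defaultdict(int)
    total: dict[tuple[str, str], int] = defaultdict(int)
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        entry = json.loads(line)
        if entry.get("entry_type") != "eval":
            continue  # skip attempts, config, and other record types
        key = (entry["probe"], entry["detector"])
        passed[key] += entry["passed"]
        total[key] += entry["total"]
    # ASR = 100% - safety%, i.e. the share of attempts that did NOT pass.
    return {k: 100.0 * (total[k] - passed[k]) / total[k] for k in total if total[k]}


ranked = sorted(asr_by_pair(SAMPLE_JSONL).items(), key=lambda kv: -kv[1])
for (probe, detector), asr in ranked:
    print(f"{probe:<24} {detector:<22} ASR={asr:5.1f}%")
```

The resulting dictionary can be passed straight to pandas or a plotting library to build the bar chart or heatmap.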

Some suggested, non‑binding thresholds you can adopt as starting points:

  • If ASR > 10% on a customer‑facing assistant, escalate to the model owner for review.
  • If ASR > 30%, consider blocking the model from production until remediated.

These are playbook suggestions—set thresholds to match your risk appetite and business impact.
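One way to encode those starting points is a small triage function. The function name and action labels are hypothetical, and the defaults simply mirror the 10% and 30% suggestions above; tune them to your own policy.

```python
def triage(asr_percent: float,
           escalate_above: float = 10.0,
           block_above: float = 30.0) -> str:
    """Map a measured ASR (in percent) to a suggested action."""
    if not 0.0 <= asr_percent <= 100.0:
        raise ValueError(f"ASR must be between 0 and 100, got {asr_percent}")
    if asr_percent > block_above:
        return "block"      # consider holding the model back from production
    if asr_percent > escalate_above:
        return "escalate"   # send to the model owner for review
    return "ok"


for value in (4.0, 12.5, 42.0):
    print(f"ASR {value:5.1f}% -> {triage(value)}")
```

The comparisons are strict, so an ASR of exactly 10% stays below the escalation line, matching the "ASR > 10%" wording.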

Extending garak: custom probes and detectors

Garak is designed to be extended. A minimal probe can return fixed prompts and name a primary_detector. A demo detector can be a simple string matcher: write one quickly to validate the extension workflow, then replace it with production‑grade checks.

An example probe docstring: a minimal custom probe that uses two fixed prompts and pairs with a custom detector.

An example detector docstring: a demo detector that flags any output containing the word “hello” (case‑insensitive).
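The two docstrings above describe a probe and detector pair along the following lines. This sketch uses standalone classes so it runs without garak installed; they are not garak plugins. In a real extension you would subclass garak's probe and detector base classes, whose method signatures depend on the installed version. The names demo.HelloDetector and TwoPromptProbe, and the prompt texts, are hypothetical.

```python
import re
from dataclasses import dataclass, field

HELLO = re.compile(r"\bhello\b", re.IGNORECASE)


@dataclass
class HelloDetector:
    """Demo detector: flags any output containing the word "hello"."""
    name: str = "demo.HelloDetector"  # hypothetical plugin name

    def detect(self, outputs: list[str]) -> list[float]:
        # One score per output: 1.0 means flagged, 0.0 means clean.
        return [1.0 if HELLO.search(text) else 0.0 for text in outputs]


@dataclass
class TwoPromptProbe:
    """Minimal custom probe: two fixed prompts paired with a custom detector."""
    primary_detector: str = "demo.HelloDetector"
    prompts: list[str] = field(default_factory=lambda: [
        "Say hello to the user.",
        "Reply with the single word: goodbye.",
    ])


probe = TwoPromptProbe()
detector = HelloDetector()
outputs = list(probe.prompts)  # echo stand-in: each output repeats its prompt
scores = detector.detect(outputs)
for prompt, score in zip(probe.prompts, scores):
    print(f"{score:.1f}  {prompt}")
```

Matching on word boundaries keeps the demo from flagging words that merely contain "hello", such as "Othello".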

Practical guidance for production detectors:

  • Start with pattern matchers for obvious violations but move to embedding‑based semantic detectors or small fine‑tuned classifiers to reduce false positives.
  • Score calibration: log confidence scores and refine thresholds using labeled samples from real traffic.
  • Redaction: strip or mask PII before storing artifacts; maintain strict access controls on reports.
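The redaction step can be sketched with two regular expressions. These patterns are illustrative only and will miss many forms of PII (names, addresses, account numbers); treat them as a placeholder for a proper PII detection service, not a complete solution.

```python
import re

# Illustrative patterns only; real PII detection needs broader coverage.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"(?<!\w)\+?\d[\d\s().-]{7,}\d(?!\w)")


def redact(text: str) -> str:
    """Mask email addresses and phone-like numbers before storing output."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


raw = "Contact jane.doe@example.com or +1 (555) 010-9999."
print(redact(raw))
```

Running redaction before artifacts are written, rather than at read time, means the stored reports never hold the raw values.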

Productionizing: CI/CD, scheduling and cost controls

Operationalizing defensive red‑teaming means integrating scans into release and monitoring pipelines.

  • Run smoke scans on PRs or model commits using a lightweight probe set; run full campaigns on a weekly or release cadence.
  • Use a dedicated GitHub Actions / Jenkins job to run garak dry runs or authenticated REST scans and upload JSONL/HTML artifacts to your artifact store.
  • Throttling & cost: batch requests, use lower‑cost replicas for bulk scans, and cap query volumes to control API spend.
  • Alerting & gating: wire ASR metrics into your deployment pipeline—fail or require human approval if thresholds breach policy.
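A deployment gate along these lines could look like the sketch below. It assumes you already have a per-probe ASR mapping from an earlier parsing step; the function name, the sample values, and the 10% limit are hypothetical.

```python
def gate(asr_by_probe: dict[str, float], max_asr: float) -> tuple[int, list[str]]:
    """Return (exit_code, offending_probes); a non-zero code fails the job."""
    offenders = sorted(p for p, asr in asr_by_probe.items() if asr > max_asr)
    return (1 if offenders else 0), offenders


# Hypothetical results from a smoke scan on a pull request.
results = {"dan.Dan_11_0": 42.0, "encoding.InjectBase64": 4.0}
code, offenders = gate(results, max_asr=10.0)
if offenders:
    print("ASR policy breached by: " + ", ".join(offenders))
else:
    print("All probes within policy")
# In a CI job, finish with: raise SystemExit(code)
```

Returning the exit code instead of exiting inside the function keeps the gate easy to unit test and lets a pipeline choose between failing outright and requesting human approval.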

Governance, compliance, and evidence

Reports from garak can map directly into governance artifacts:

  • report.jsonl → ingest into your issue tracker or SIEM as raw evidence
  • HTML summary → executive risk brief or product owner digest
  • AVID export → structured audit evidence for third parties or regulators (AVID is the AI Vulnerability Database, whose report format garak can export to)

Retention and access controls matter—garak outputs often include model outputs that may contain sensitive text. Apply the same data governance rules you use for logs or telemetry: least privilege, retention periods, and automatic redaction where necessary.

Mini case study: sales assistant leaking PII

Scenario: a sales‑automation assistant starts returning snippets of customer PII in response to subtly engineered prompts. A multi‑probe garak campaign reveals that combining a prompt injection with encoding tricks produces a high ASR on detectors aimed at PII disclosure.

  1. Detection: garak flags multiple examples with detector scores > 0.7 indicating PII leakage.
  2. Triage: security team reviews flagged outputs, confirms leakage, and tags severity as high due to regulatory exposure.
  3. Remediation: product team retrains prompt sanitizer, deploys a PII redaction filter, and updates the model input validation layer.
  4. Verification: rerun the garak campaign; ASR falls below 5% for the same probes—report artifacts are archived for compliance.

This workflow demonstrates how garak’s outputs become inputs to an operational incident response and mitigation loop.

Checklist — before you run a garak scan

  1. Get written authorization for scanning external or third‑party models.
  2. Decide scope: which models, endpoints, and datasets are in scope and why.
  3. Set up safe storage and redaction rules for outputs that may include sensitive data.
  4. Run a dry test using test.Repeat to validate plugins and configuration.
  5. Define ASR thresholds and an escalation path for findings.
  6. Plan CI/CD integration and cost caps for repeated runs.

Practical caveats and tradeoffs

  • Simple detectors are useful for learning but insufficient for production. Invest in semantic detectors and regular calibration.
  • Coverage vs cost: deeper campaigns find more issues but cost more. Use stratified sampling to balance confidence and spend.
  • Legal and privacy obligations: scanning third‑party models can violate terms of service—always check and document permissions.
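For the coverage versus cost tradeoff, here is a minimal sketch of stratified sampling over a prompt pool. The categories and pool sizes are invented for illustration, and this helper is not part of garak; it only shows how a fixed per-category budget keeps small categories represented while capping total queries.

```python
import random
from collections import defaultdict


def stratified_sample(prompts: list[tuple[str, str]],
                      per_stratum: int,
                      seed: int = 0) -> list[tuple[str, str]]:
    """Take up to per_stratum prompts from each category.

    prompts holds (category, prompt) pairs; the seed makes runs repeatable.
    """
    rng = random.Random(seed)
    by_category: dict[str, list[str]] = defaultdict(list)
    for category, prompt in prompts:
        by_category[category].append(prompt)
    sample: list[tuple[str, str]] = []
    for category in sorted(by_category):
        items = by_category[category]
        k = min(per_stratum, len(items))
        sample.extend((category, p) for p in rng.sample(items, k))
    return sample


# Hypothetical pool: one large category and two small ones.
pool = (
    [("jailbreak", f"jailbreak prompt {i}") for i in range(50)]
    + [("encoding", f"encoding prompt {i}") for i in range(5)]
    + [("pii", f"pii prompt {i}") for i in range(2)]
)
subset = stratified_sample(pool, per_stratum=3)
print(f"{len(subset)} of {len(pool)} prompts selected")
```

A fixed seed keeps the subset identical between runs, so a change in ASR reflects the model and not the sample.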

Next steps and resources

To move from concept to capability: run a local dry test, author one custom detector that encodes a policy you care about (e.g., PII or toxic content), and schedule a weekly full scan for a high‑risk assistant. If you prefer templates, converting core commands into a CI job or a Dockerized scan is a natural next step to make tests reproducible.

Defensive red‑teaming is not a one‑time checklist; it’s an operational capability that grows with your models. Use garak to make that capability repeatable, auditable, and integrated into your model risk management program.

Want a ready‑to‑run CI template or a one‑page red‑teaming checklist tailored to your team? Contact the team to get a conversion of these steps into a GitHub Actions job or an audit‑ready checklist you can drop into your governance workflow.