LLDF - Language Layer Defense Framework

Research

LLDF Research Notes

Core Thesis

The primary attack surface for enterprise AI is the language layer (instructions, context, tool calls, retrieval, memory), not only model weights.
Most "AI security" failures are control failures in orchestration: routing, context boundaries, tool permissions, retrieval hygiene, and auditability.
A practical framework must be measurable, repeatable, vendor-agnostic, and operationalized (PDR playbooks + evidence-based maturity).

Threat Model Primitives

Instruction hierarchy: system → developer → tool → user → retrieved content (and how it is violated).
Context supply chain: where tokens come from (human, RAG, memory, tools) and how they can be poisoned.
Agentic risk: delegated authority + tool execution + long-horizon planning increases impact more than "prompt cleverness."
Observable signals: language artifacts, tool-call patterns, retrieval anomalies, refusal drift, and bypass attempts.
Controls must be testable: prevention without verification is theater.

LLDF Scoring Concepts

Each technique decomposes into Prevent, Detect, and Respond readiness.
Readiness should be evidence-driven (logs, configs, policy docs, tests) — not self-attestation only.
Risk is expressed as Expected Loss = λ × Impact (optionally factoring exposure/time-to-detect).
Maturity should reflect consistent capability across bands, not isolated "one-off wins."

What "Good" Looks Like

Clear separation of trusted vs untrusted context.
Allowlisted tool actions + least privilege by default.
Retrieval guardrails: provenance, sanitization, and grounding checks.
Continuous red-team harness with regression tests.
Incident response playbooks for AI-specific failures (tool misuse, data leakage, policy bypass, unsafe outputs).

Methodology

Research / Evaluation

Goal: prove LLDF improves enterprise outcomes by making AI risks measurable and reducing incident likelihood/impact through repeatable controls.

Phase	What to Do	Key Outputs
Phase 0: Baseline	Run a standardized evaluation suite across current AI surfaces (chat, RAG, agent). Capture current PDR readiness and incident posture.	Baseline maturity score • Baseline risk (EL) • Baseline success rates
Phase 1: Controls	Implement LLDF controls mapped to highest-likelihood and highest-impact technique bands first (permissioning, provenance, sanitization, monitoring, runbooks).	Control mapping • Evidence artifacts • Updated playbooks
Phase 2: Regression & Stress	Re-run the suite; add drift testing, load testing, and change events (model swap, prompt update, new tool, corpus refresh).	Delta vs baseline • Drift score • "Break rate per change"
Phase 3: Operationalize	Stand up dashboards, evidence capture, IR runbooks, and training. Validate MTTD/MTTR with drills.	Operational maturity • MTTD/MTTR • Ongoing cadence

Evaluation Outputs

LLDF maturity score (overall + by band + by P/D/R).
Risk reduction (expected loss and realized incidents).
Control efficacy and false-positive cost (latency, friction, reviewer load).

Telemetry

Specific Data Collected

Collect just enough to make results repeatable, explainable, and auditable, without creating unnecessary privacy risk.

System-Level Telemetry

Prompt/version IDs (system + developer prompts), routing decisions, and policy versions.
Full context provenance map: token sources (user/RAG/memory/tool outputs) with trust labels.
Tool invocation logs: tool name, permission decision, outcome, latency.
Retrieval logs: query, top-k docs, doc provenance, doc safety score, grounding confidence.

Model Behavior & Safety

Refusal rate and refusal drift (over time, by model/version).
Policy compliance rate (pass/fail against a rubric).
Unsafe output indicators (with sampled human adjudication for calibration).
Grounding metrics: citation coverage, contradiction detection, "supported vs unsupported" claims.

Security Outcomes

Prompt injection success rate (direct + indirect), including partial successes.
Data leakage attempts and success rate (secrets, PII, internal docs).
Tool misuse attempts and success rate (unauthorized actions, privilege escalation).
"Agent deviation" rate: actions not aligned to user intent or policy.

Operational Readiness

Runbooks, on-call coverage, escalation paths, and change approvals.
MTTD and MTTR for AI-specific incidents (leakage, unsafe outputs, unauthorized tool use).
Red-team cadence + regression pass rate.

Human Factors

Reviewer agreement rate (label quality for "pass/fail" and severity).
Time-to-triage for AI incidents and alerts.
User complaint categories (wrong answers, unsafe outputs, unauthorized actions).

Testing

Experimental Run

Prompt & Context Attacks

Direct prompt injection across varying instruction hierarchies.
Indirect prompt injection via RAG documents and webpages.
Context window overflow / instruction dilution tests.
Multi-turn "slow-burn" attacks (persistence across conversation).
Encoding/obfuscation attacks (unicode, base64, multilingual).

Metrics: success rate, partial success rate, detection rate, and defense cost (latency/tokens/false positives).

RAG Security

Retrieval poisoning: planted docs that override system intent.
Source spoofing: "looks official" but wrong provenance.
Grounding stress test: conflicting sources, stale docs, partial docs.
Sanitization A/B: before/after filtering + chunking strategies.

Metrics: injection success via retrieved content, grounding accuracy, provenance integrity.

Tool / Agent Safety

Tool permission escalation attempts (write actions, external calls).
"Confused deputy" tasks: benign user ask that leads to sensitive action.
Long-horizon planning deviations: unauthorized shortcuts.
Tool output injection: tool returns malicious instruction that agent follows.

Metrics: unauthorized tool-call rate, blocked rate, safe-completion rate.

Detection & Response Validation

Canary tokens / honey prompts in the RAG corpus.
Simulated incidents: leaked secret, unauthorized email, destructive file action.
Runbook drills with real operators: measure time-to-detect and contain.

Metrics: detection precision/recall, time-to-containment, post-incident regression closure time.

Drift & Change Management

Model swap test: same prompts, new model → safety regression delta.
Prompt update test: small change → compliance/tool behavior changes.
Data refresh test: new docs → injection surface and grounding drift.

Metrics: regression delta, stability score, break rate per change.

The Imperative

"Why Now" : The AI Enterprise Era

Enterprises are moving from chatbots to AI systems with authority, agents that can access internal data, call tools, and take actions. The new failure mode isn't only a wrong answer; it's a wrong action: data exposure, unauthorized transactions, workflow sabotage, and legal/brand harm.

The fastest-growing attack surface is not the model, it's the orchestration layer (RAG + memory + tools + policies), shipping weekly.
Traditional AppSec is necessary but insufficient: it doesn't natively measure instruction integrity, context provenance, or agent authorization at the language layer.
LLDF provides a shared, testable way to benchmark, prove control effectiveness, and operate AI safely at scale, before procurement, regulation, and incident reality force it.

LLDF turns AI security from opinions into evidence, so teams can ship agents confidently, measure what works, and respond fast when something breaks.

Research Notes & Evaluation