Research
LLDF Research Notes
Core Thesis
- The primary attack surface for enterprise AI is the language layer (instructions, context, tool calls, retrieval, memory), not only model weights.
- Most "AI security" failures are control failures in orchestration: routing, context boundaries, tool permissions, retrieval hygiene, and auditability.
- A practical framework must be measurable, repeatable, vendor-agnostic, and operationalized (PDR playbooks + evidence-based maturity).
Threat Model Primitives
- Instruction hierarchy: system → developer → tool → user → retrieved content (and how it is violated).
- Context supply chain: where tokens come from (human, RAG, memory, tools) and how they can be poisoned.
- Agentic risk: delegated authority + tool execution + long-horizon planning increases impact more than "prompt cleverness."
- Observable signals: language artifacts, tool-call patterns, retrieval anomalies, refusal drift, and bypass attempts.
- Controls must be testable: prevention without verification is theater.
LLDF Scoring Concepts
- Each technique decomposes into Prevent, Detect, and Respond readiness.
- Readiness should be evidence-driven (logs, configs, policy docs, tests) — not self-attestation only.
- Risk is expressed as Expected Loss = λ × Impact (optionally factoring exposure/time-to-detect).
- Maturity should reflect consistent capability across bands, not isolated "one-off wins."
What "Good" Looks Like
- Clear separation of trusted vs untrusted context.
- Allowlisted tool actions + least privilege by default.
- Retrieval guardrails: provenance, sanitization, and grounding checks.
- Continuous red-team harness with regression tests.
- Incident response playbooks for AI-specific failures (tool misuse, data leakage, policy bypass, unsafe outputs).
Methodology
Research / Evaluation
Goal: prove LLDF improves enterprise outcomes by making AI risks measurable and reducing incident likelihood/impact through repeatable controls.
| Phase | What to Do | Key Outputs |
|---|---|---|
| Phase 0: Baseline | Run a standardized evaluation suite across current AI surfaces (chat, RAG, agent). Capture current PDR readiness and incident posture. | Baseline maturity score • Baseline risk (EL) • Baseline success rates |
| Phase 1: Controls | Implement LLDF controls mapped to highest-likelihood and highest-impact technique bands first (permissioning, provenance, sanitization, monitoring, runbooks). | Control mapping • Evidence artifacts • Updated playbooks |
| Phase 2: Regression & Stress | Re-run the suite; add drift testing, load testing, and change events (model swap, prompt update, new tool, corpus refresh). | Delta vs baseline • Drift score • "Break rate per change" |
| Phase 3: Operationalize | Stand up dashboards, evidence capture, IR runbooks, and training. Validate MTTD/MTTR with drills. | Operational maturity • MTTD/MTTR • Ongoing cadence |
Evaluation Outputs
- LLDF maturity score (overall + by band + by P/D/R).
- Risk reduction (expected loss and realized incidents).
- Control efficacy and false-positive cost (latency, friction, reviewer load).
Telemetry
Specific Data Collected
Collect just enough to make results repeatable, explainable, and auditable, without creating unnecessary privacy risk.
System-Level Telemetry
- Prompt/version IDs (system + developer prompts), routing decisions, and policy versions.
- Full context provenance map: token sources (user/RAG/memory/tool outputs) with trust labels.
- Tool invocation logs: tool name, permission decision, outcome, latency.
- Retrieval logs: query, top-k docs, doc provenance, doc safety score, grounding confidence.
Model Behavior & Safety
- Refusal rate and refusal drift (over time, by model/version).
- Policy compliance rate (pass/fail against a rubric).
- Unsafe output indicators (with sampled human adjudication for calibration).
- Grounding metrics: citation coverage, contradiction detection, "supported vs unsupported" claims.
Security Outcomes
- Prompt injection success rate (direct + indirect), including partial successes.
- Data leakage attempts and success rate (secrets, PII, internal docs).
- Tool misuse attempts and success rate (unauthorized actions, privilege escalation).
- "Agent deviation" rate: actions not aligned to user intent or policy.
Operational Readiness
- Runbooks, on-call coverage, escalation paths, and change approvals.
- MTTD and MTTR for AI-specific incidents (leakage, unsafe outputs, unauthorized tool use).
- Red-team cadence + regression pass rate.
Human Factors
- Reviewer agreement rate (label quality for "pass/fail" and severity).
- Time-to-triage for AI incidents and alerts.
- User complaint categories (wrong answers, unsafe outputs, unauthorized actions).
Testing
Experimental Run
01
Prompt & Context Attacks
- Direct prompt injection across varying instruction hierarchies.
- Indirect prompt injection via RAG documents and webpages.
- Context window overflow / instruction dilution tests.
- Multi-turn "slow-burn" attacks (persistence across conversation).
- Encoding/obfuscation attacks (unicode, base64, multilingual).
Metrics: success rate, partial success rate, detection rate, and defense cost (latency/tokens/false positives).
02
RAG Security
- Retrieval poisoning: planted docs that override system intent.
- Source spoofing: "looks official" but wrong provenance.
- Grounding stress test: conflicting sources, stale docs, partial docs.
- Sanitization A/B: before/after filtering + chunking strategies.
Metrics: injection success via retrieved content, grounding accuracy, provenance integrity.
03
Tool / Agent Safety
- Tool permission escalation attempts (write actions, external calls).
- "Confused deputy" tasks: benign user ask that leads to sensitive action.
- Long-horizon planning deviations: unauthorized shortcuts.
- Tool output injection: tool returns malicious instruction that agent follows.
Metrics: unauthorized tool-call rate, blocked rate, safe-completion rate.
04
Detection & Response Validation
- Canary tokens / honey prompts in the RAG corpus.
- Simulated incidents: leaked secret, unauthorized email, destructive file action.
- Runbook drills with real operators: measure time-to-detect and contain.
Metrics: detection precision/recall, time-to-containment, post-incident regression closure time.
05
Drift & Change Management
- Model swap test: same prompts, new model → safety regression delta.
- Prompt update test: small change → compliance/tool behavior changes.
- Data refresh test: new docs → injection surface and grounding drift.
Metrics: regression delta, stability score, break rate per change.
The Imperative
"Why Now" : The AI Enterprise Era
Enterprises are moving from chatbots to AI systems with authority, agents that can access internal data, call tools, and take actions. The new failure mode isn't only a wrong answer; it's a wrong action: data exposure, unauthorized transactions, workflow sabotage, and legal/brand harm.
- The fastest-growing attack surface is not the model, it's the orchestration layer (RAG + memory + tools + policies), shipping weekly.
- Traditional AppSec is necessary but insufficient: it doesn't natively measure instruction integrity, context provenance, or agent authorization at the language layer.
- LLDF provides a shared, testable way to benchmark, prove control effectiveness, and operate AI safely at scale, before procurement, regulation, and incident reality force it.
LLDF turns AI security from opinions into evidence, so teams can ship agents confidently, measure what works, and respond fast when something breaks.