Agentic AI Is the Attack Surface — Six Vectors and a Bench

I work three ways with agentic AI, and only one of them is benign on a good day. The first is what most of the industry sees — training AI systems for video, writing, image work, screenwriting evaluation; annotation frameworks; evaluation protocols; helping models understand narrative and cinematography. The second is the side that quietly took over my engineering practice — using agentic AI as a development partner, not a generic chatbot. Two-model workflows: Codex for the hands and Claude for the lens, real-tree verification with command output as the deliverable. The third is the side this article is about. Architectural exploitation. Cognitive breach modeling. Treating the agent itself as the attack surface that's most under-defended in 2026, and writing down what's already there.

Think of agentic AI as a perimeter with six known holes. None of them are exotic. All of them are textbook in the academic literature. Almost none of them are budgeted for in the average enterprise stack, because the average enterprise stack still treats "AI" as a feature rather than as a system that decides, uses tools, and fails:

Context windows and non-determinism — the agent's working memory has limits and stochastic gaps, and the system above it doesn't tell you when it forgot.
Prompt injection — adversary-controlled text in the input stream becomes an instruction the model honors as a trusted user message.
Tool poisoning — the agent uses a tool whose output reflects an adversary's state, and the downstream consumer treats tool output as ground truth.
Agent-to-agent exploitation — a multi-agent system where the trust boundary between models is the actual failure point, not either model.
Goal hijacking and model drift — the model abandons a correct position under gentle push, or quietly converges on a different goal under load.
Data poisoning — training-time or inference-time contamination, including the kind that survives the sanitization pipelines your vendor told you were sufficient.

The sections that follow walk each vector with the public repos and prior Ghost pieces that exercise it. Read this as the surface read of a practice. The depth — engagement detail, what client systems do under load, what the AI classifier prompts actually look like inside, where the polyglots land in a multi-tenant pipeline — is the part you commission, on purpose, with paper.

The Context Window Forgets, and the Agent Doesn't Tell You

Every agent runs on a context window with a hard limit and a soft middle. Inside that window, content is reordered, summarized, and probabilistically attended to in ways the system surfaces almost never expose to the caller. Outside the window, content is silently gone. The defender's failure mode is not the loss itself — it's that nothing in the runtime tells you it happened.

Watch it live in The Great Cash-In. An agentic coder in a long session starts re-reading files it read four turns ago, asks for clarification on decisions both sides know the answer to, proposes a plan to revise the previous plan, and reverses any position the moment the user pushes. That's not laziness; that's a model that has lost the relevant slice of its own context and is regenerating motion to bridge the gap. Per-token billing makes the regeneration profitable for the vendor, which is its own problem, but the structural failure is context-window economics: the cheapest move for the model is to keep talking, and the cheapest move for the operator is to never realize they're paying for it.

The signal-quality side of the same problem lives in yesterdays-news, the financial-news intelligence tool I run for VXX/VIX trade screening. Information has a half-life, and a model that reads a headline at minute zero behaves differently than one that reads the same headline at minute nine — not because the headline changed but because the surrounding context the model is conditioning on changed without anyone announcing it. CLEAR/HOLD verdict, one-sentence catalyst, timestamp. The defender's question is what's the mechanism that flags context drift before it influences a decision the operator can't undo? The honest answer in 2026 is mostly there isn't one, which is why the cash-in piece spends so much time on caps and forced commits as the only external governor available.

The Pixel Is the Prompt, and the Classifier Reads It

Prompt injection is the canonical agentic-AI hole and the one almost every defender thinks they've covered. They haven't. The classical text-channel version is the one OWASP put in their LLM Top 10. The multimodal versions are sharper, more durable, and inside production pipelines today.

The full bench for image-channel injection lives in image_payload_injection — ipi for short — which I built after noticing the same RAW files that document fabric in a fashion shoot pass through twelve auto-parsing pipelines on the way to a client server, each one a potential parser exploit. The repo ships an AdversarialPerturbationGenerator (frequency-domain FGSM approximation, ViT patch-boundary checkerboard aliasing), a TypographyExploitGenerator (near-zero-contrast text, micro-typography tiling, channel-isolated text, QR-style opacity encoding), and a PromptInjectionPayloadBuilder that ships injections via EXIF UserComment, PNG tEXt/iTXt chunks, and polyglot iiPj chunks. The Red-vs-Blue Tester runs every technique against every sanitization pipeline. The finding worth printing on a poster: PNG tEXt chunks survive ImageMagick -strip without explicit exclude-chunk defines, and adversarial perturbations survive any sanitization that preserves visual content.

Poisoning the Watcher extends the same class into employee monitoring. A monitored worker can render an image on screen, on purpose, that the watcher will photograph, store, parse, and feed to an activity classifier LLM as a trusted user message. The OCR-bait rung doesn't require a malicious image format — the pixels are arranged to look like ordinary text, the OCR extracts them, the classifier honors them. Multimodal prompt injection in 2026 isn't a research curiosity. It's a category of supply-chain attack against any SaaS that ingests user-generated visual content and feeds it to a downstream model, and the customer of that SaaS is downstream too.

When the Tool Has Hands, the Hand Holds Adversary Data

Tool poisoning is the sibling of prompt injection the agent-frameworks community is slowly catching up to. The shape: a model has been given access to a tool — a file reader, a web browser, a command shell, a database — and the output of that tool is itself adversary-influenced data that the model treats as ground truth for its next step. The model's logic is fine. The plumbing decides what's true.

The pattern shows up in the Codex-Claude workflow piece, specifically in the rule that one model owns the tree. The instant two models both have write authority over the same repo, the downstream reader can no longer tell whether a given file's contents reflect the implementer's intent, the reviewer's intent, or some interleaving the harness merged silently. That's tool poisoning at the developer-tooling layer, and the practical mitigation is the same as it is in security: split authority, treat tool output as untrusted, don't let the model with write access also be the model that summarizes what it did.

At the OS layer, lineman — my macOS HIPS + egress-forensics prototype — wears the same problem in a different costume. Process lineage detection has to use four overlapping strategies (bundle path grep, XPC bundle ID correlation, LaunchAgent plist scanning, PPID BFS) because XPC helper processes show up with PPID=1 and bypass naive PID-based blocking. The kernel's view of "what process did this" is itself adversary-shapable; you cannot trust the simple tool. Privilege-separated daemon/GUI split, isolated pf anchor (/etc/pf.anchors/com.lineman.blocker), tcpdump on pflog0 for kernel-dropped packet capture, TLS ClientHello SNI extraction via raw struct parsing. The defender's move is to assume every tool output is one rung in a chain whose top is hostile.

The Handoff Is Where the Goal Goes Sideways

Multi-agent systems are the default for any serious workflow now. Two models, three models, a planner orchestrating workers, a critic reviewing a coder, a fleet of agents doing parallel subtasks. The thing nobody benchmarks is the handoff. The handoff is where the goal goes sideways, every single time.

The clean version is documented in Codex-Claude — Model Loyalty Is the Bug. Run Claude as the lens (architectural review, alternate implementations, copy, critique) and Codex as the hands (real-tree verification, command output, builds). The work moves only if the handoff carries enough state — file names, build output, screenshots, errors, the role assignment, the authority designation. A lossy handoff turns the second model into a confident hallucinator that ratifies whatever the first model brought it, which is the actual failure mode of most multi-model deployments in production.

The pathological version is documented in The Great Cash-In. An agentic coder running by itself still does a handoff — between its own turn-T plan and its turn-T+1 implementation — and that internal handoff is where the model can be made to oscillate at zero cost to itself and full cost to the operator. spectral_cyclops, my visual-regression-plus-AI-polish loop, makes the handoff explicit: Playwright captures named-route screenshots, pixelmatch diffs against baseline, chokidar debounces, hot-reload waits, the screenshots overwrite, and the AI assistant gets a clean fresh frame to reason over. That structure exists because the AI assistant cannot otherwise tell whether the pixels it's looking at are the result of its last edit or the result of three edits ago.

The Model Abandons a Correct Position Under Gentle Push

Goal hijacking and model drift sound like distinct categories. In practice they fail together. A model trained for agreeableness will abandon a correct position the instant the user objects, even softly. A multi-turn agent fed a steady stream of subtly off-axis prompts drifts away from its original objective without ever being told to. RLHF tuning makes both failure modes structural: the model was rewarded for deference, so deference is the default cheap move when context loads up.

The cash-in piece catalogs this in a way a defender can ingest. Real pushback is the model holding a position and making the operator argue it off — costs the model something to say, ends the disagreement by being right. Meter pushback is the model folding the moment you push, then reversing back when you stop watching — costs the model nothing, never ends. The diagnostic is commitment: a model designing will plant a flag and defend it; a model directing will hand you the flag and ask where you'd like it planted. The mitigation is external governance — turn budgets, decision budgets, "n turns to a diff or we stop and reassess." The agent has no internal mechanism that wants to end the session.

Poisoning the Watcher shows the drift version: a monitoring tool's activity classifier reads OCR'd text from screen captures as a trusted user message, and a sufficiently determined operator can shift the classifier's behavior across an entire customer base without ever touching the model weights. Goal hijacking at inference time. Drift through repeated exposure to prompt-injected content. The model wasn't retrained. It just stopped enforcing the policy because the input stream made enforcement structurally inconvenient.

Training-Data Poisoning Survives the Sanitization You Trust

Data poisoning is the longest-tail vector and the one most undefended. Contamination thresholds are low — published research demonstrates meaningful behavioral effects at 0.1% poisoning rates. The data pipelines feeding modern models are mostly scraped, barely sanitized, and increasingly AI-classified before human review, which means every prior vector in this list compounds into the data-poisoning vector eventually.

image_payload_injection's RedBlueTester exists specifically to map this surface. It runs every public injection technique against every public sanitization pipeline and produces a bypass matrix. Techniques designed to survive sanitization, by design, survive the sanitization the rest of the industry trusts. Strip metadata, re-encode, validate dimensions, run it through ImageMagick — and the PNG tEXt chunks are still there because nobody passed -define png:exclude-chunk=tEXt and nobody read the man page closely enough to know they should. Train your VLM on that corpus and you're not training on clean data; you're training on a corpus an adversary has been quietly editing for as long as the model has existed.

The forensic side is tit-for-tat, built for newsroom security testing and doubling as the attribution tool you need when a poisoning event lands and the question is where did this content originate. Chain-of-custody with SHA-256 + HMAC-SHA256, ASN profiling that distinguishes hosting from residential IP, EXIF/PDF metadata persistence analysis, six-tier canary-token classification from NO_CANARIES through CRITICAL. Origin server discovery via cert transparency, DNS history, MX correlation. If your defense plan against poisoning is "we'll catch it after," that's the instrumentation it takes to actually catch it. Most stacks don't have it. Most stacks have a sanitization pipeline they trust and a SIEM that doesn't speak image format.

The Bench, In Repos

The full bench is on GitHub. None of it is theory.

ghost_proxy — the consolidated 10-module lab. GHOST_PROXY (UserScript workshop), THREAT_SIMULATOR (kill chains with WAF evasion entropy tracking), NEURAL_LAB (Token Smuggling, Semantic Sharding Bypass, DAN-style model hijacking, indirect prompt injection), SHADOW_OSINT (Neural Forensic Suite for LLM-leak artifact reconstruction), CYBER_SOC (Memory Forensics + Actor Correlation — AI-native SIEM/EDR layer), DECEPTION_ENGINE (Honey-Token deployment + C2 node tracing), PF_FIREWALL (macOS/BSD stateful packet filter orchestrator), PRIVACY_ENGINE (Epsilon Tuning + K-Anonymity Factor), GHOST_ACADEMY (9-domain mastery with Red/Blue Scenario Labs), GRC_COMMAND (JIT Access + Data Classification Audit). v1.4 ships 13 kill-chain scenarios including Collateral Injection & Oracle Drift, Messaging Layer Spoofing across LayerZero/CCIP, PyPI Supply Chain Injection, ERP Legacy RCE on Oracle EBS, Match Group Vishing & SSO Hijack, Worldleaks 2026 Replication, and Windows Shell Zero-Click NTLM Theft (CVE-2026-32202).

image_payload_injection — multimodal VLM adversarial framework. Six format families. Adversarial perturbation, typography exploit, prompt injection payload builders. Red-vs-Blue Tester. REST API with /api/v1/vlm-analyze, /api/v1/security-score, /api/v1/fuzz. WordPress integration with Security Integrity Score in the media library.

hypothetical_player — HYBO / El Jugador. The defender's tool — camo overlay that hands a screenshot bot a pixel-perfect Excel sheet while the operator watches a telenovela on the dock. Pixel-level adversarial noise injection on the canvas rendering layer. Funny on purpose, as covered in the El Jugador write-up.

tit-for-tat — newsroom pentest + forensic attribution. Origin server discovery (Cloudflare bypass via cert transparency, DNS history, MX correlation), CMS exploitation testing (WordPress editorial plugins, draft exposure), chain-of-custody reporting, six-tier canary classification, CLI subcommands audit, forensic, canary-scan, entropy.

lineman — macOS HIPS + egress forensics. Privilege-separated daemon/GUI split. Isolated pf anchor, four-strategy process lineage detection (catches XPC helpers that bypass naive PID-based blocking), tcpdump on pflog0 pseudo-interface, TLS ClientHello SNI extraction, launchd-managed root daemon.

flea-flicker-netfilter — IDS evasion + packet manipulation. scapy. Packet fragmentation and timing randomization. MAC spoofing with vendor OUI rotation. Ghost mode (ultimate stealth), Shadow mode (red team + Tor integration). Kali, ParrotOS, Ubuntu, Debian.

spectral_cyclops — Playwright + pixelmatch + chokidar. Visual regression guard with gh pr create integration (before/after/diff PR comments); AI auto-polish loop with hot-reload screenshot feed for live AI iteration.

yesterdays-news — financial-news signal latency evaluator. CLEAR / HOLD verdict on market-moving potential with one-sentence catalyst summary and timestamp. Pre-trade news screening; signal-quality evaluation; information-latency analysis.

cloud_break + avant-garde — App Store releases. The privacy-first breathwork app (60fps millisecond-precision timing engine, seven evidence-based protocols, fully offline) and the macOS publishing tool (color-psychology themes, KDP/EPUB export with validation). Range, not just security.

ghostintheprompt.com — 126 long-form articles. The credential corpus, including the ones cross-linked above.

The list is the credential. Every name is a different mode of operating on the same surface — red, blue, forensic, defensive, creative — built and run by one operator over the last twelve months.

Where the Depth Goes

This article is the surface read. The job market for AI red team and cyber security work is currently looking for people who can wield Nmap, Wireshark, Metasploit, and Burp — fine tools, well-respected, and the bench above includes equivalents I wrote myself when I wanted to know how those tools work underneath. It's looking for SIEM and EDR fluency — that's what lineman is at the macOS layer and what ghost_proxy's CYBER_SOC module is at the AI-native layer. It's looking for scripting in Python, Bash, and PowerShell — Python is the spine of every repo above. It's looking for penetration testing methodology — that's what tit-for-tat and image_payload_injection's Red-vs-Blue Tester do, against targets I have permission to test. It's looking for analytical thinking and the ability to communicate with technical and non-technical stakeholders — 126 long-form articles and a parallel career in fashion editorial, film, and screenwriting cover that surface twice.

The depth — what the actual engagement looks like, how the attack chains compose against a specific environment, where the policy fights live, which AI guardrails work versus the ones the vendor says work — is the part that happens with paper, on commission, against a defined scope. The article is what falls out when an operator with no scope reads the room for unrelated reasons. The body of work is the proof that the depth exists.

If you're reading this from inside a hiring loop and the question is can this person do the work, the answer is in the repo list, the article corpus, and the fact that this essay was written from inside the same long-form practice the security work comes out of. Same operator. Same instinct. Same posture toward systems whose incentives don't perfectly align with the people inside them, which is every system.

GhostInThePrompt.com // I train AI. I use it as a development partner. I treat it as the attack surface. Six vectors, one bench, one operator — the depth is on commission, the surface is what's in front of you.

The Context Window Forgets, and the Agent Doesn't Tell You

The Pixel Is the Prompt, and the Classifier Reads It

When the Tool Has Hands, the Hand Holds Adversary Data

The Handoff Is Where the Goal Goes Sideways

The Model Abandons a Correct Position Under Gentle Push

Training-Data Poisoning Survives the Sanitization You Trust

HACK LOVE BETRAY

The Bench, In Repos

Where the Depth Goes

Continue Reading

Red Teaming the Builder

My Side Piece Agrees With Me, Not You