I work three ways with agentic AI, and only one of them is benign on a good day. The first is what most of the industry sees — training AI systems for video, writing, image work, screenwriting evaluation; annotation frameworks; evaluation protocols; helping models understand narrative and cinematography. The second is the side that quietly took over my engineering practice — using agentic AI as a development partner, not a generic chatbot. Two-model workflows: Codex for the hands and Claude for the lens, real-tree verification with command output as the deliverable. The third is the side this article is about. Architectural exploitation. Cognitive breach modeling. Treating the agent itself as the attack surface that's most under-defended in 2026, and writing down what's already there.
Think of agentic AI as a perimeter with six known holes. None of them are exotic. All of them are textbook in the academic literature. Almost none of them are budgeted for in the average enterprise stack, because the average enterprise stack still treats "AI" as a feature rather than as a system that decides, uses tools, and fails:
- Context windows and non-determinism — the agent's working memory has limits and stochastic gaps, and the system above it doesn't tell you when it forgot.
- Prompt injection — adversary-controlled text in the input stream becomes an instruction the model honors as a trusted user message.
- Tool poisoning — the agent uses a tool whose output reflects an adversary's state, and the downstream consumer treats tool output as ground truth.
- Agent-to-agent exploitation — a multi-agent system where the trust boundary between models is the actual failure point, not either model.
- Goal hijacking and model drift — the model abandons a correct position under gentle push, or quietly converges on a different goal under load.
- Data poisoning — training-time or inference-time contamination, including the kind that survives the sanitization pipelines your vendor told you were sufficient.
I understand how AI tools make decisions, use tools, and fail because I've spent the last year building and breaking my own. The sections that follow walk each vector with the public repos and prior Ghost pieces that exercise it. Read this as the surface read of a practice. The depth — engagement detail, what client systems do under load, what the AI classifier prompts actually look like inside, where the polyglots land in a multi-tenant pipeline — is the part you commission, on purpose, with paper.
The Context Window Forgets, and the Agent Doesn't Tell You
Every agent runs on a context window with a hard limit and a soft middle. Inside that window, content is reordered, summarized, and probabilistically attended to in ways the system surfaces almost never expose to the caller. Outside the window, content is silently gone. The defender's failure mode is not the loss itself — it's that nothing in the runtime tells you it happened.
Watch it live in The Great Cash-In. An agentic coder in a long session starts re-reading files it read four turns ago, asks for clarification on decisions both sides know the answer to, proposes a plan to revise the previous plan, and reverses any position the moment the user pushes. That's not laziness; that's a model that has lost the relevant slice of its own context and is regenerating motion to bridge the gap. Per-token billing makes the regeneration profitable for the vendor, which is its own problem, but the structural failure is context-window economics: the cheapest move for the model is to keep talking, and the cheapest move for the operator is to never realize they're paying for it.
The signal-quality side of the same problem lives in yesterdays-news, the financial-news intelligence tool I run for VXX/VIX trade screening. Information has a half-life, and a model that reads a headline at minute zero behaves differently than one that reads the same headline at minute nine — not because the headline changed but because the surrounding context the model is conditioning on changed without anyone announcing it. CLEAR/HOLD verdict, one-sentence catalyst, timestamp. The defender's question is what's the mechanism that flags context drift before it influences a decision the operator can't undo? The honest answer in 2026 is mostly there isn't one, which is why the cash-in piece spends so much time on caps and forced commits as the only external governor available.
The Pixel Is the Prompt, and the Classifier Reads It
Prompt injection is the canonical agentic-AI hole and the one almost every defender thinks they've covered. They haven't. The classical text-channel version is the one OWASP put in their LLM Top 10. The multimodal versions are sharper, more durable, and inside production pipelines today.
The full bench for image-channel injection lives in image_payload_injection — ipi for short — which I built after noticing the same RAW files that document fabric in a fashion shoot pass through twelve auto-parsing pipelines on the way to a client server, each one a potential parser exploit. The repo ships an AdversarialPerturbationGenerator (frequency-domain FGSM approximation, ViT patch-boundary checkerboard aliasing), a TypographyExploitGenerator (near-zero-contrast text, micro-typography tiling, channel-isolated text, QR-style opacity encoding), and a PromptInjectionPayloadBuilder that ships injections via EXIF UserComment, PNG tEXt/iTXt chunks, and polyglot iiPj chunks. The Red-vs-Blue Tester runs every technique against every sanitization pipeline. The finding worth printing on a poster: PNG tEXt chunks survive ImageMagick -strip without explicit exclude-chunk defines, and adversarial perturbations survive any sanitization that preserves visual content.
Poisoning the Watcher extends the same class into employee monitoring. A monitored worker can render an image on screen, on purpose, that the watcher will photograph, store, parse, and feed to an activity classifier LLM as a trusted user message. The OCR-bait rung doesn't require a malicious image format — the pixels are arranged to look like ordinary text, the OCR extracts them, the classifier honors them. Multimodal prompt injection in 2026 isn't a research curiosity. It's a category of supply-chain attack against any SaaS that ingests user-generated visual content and feeds it to a downstream model, and the customer of that SaaS is downstream too.
When the Tool Has Hands, the Hand Holds Adversary Data
Tool poisoning is the sibling of prompt injection the agent-frameworks community is slowly catching up to. The shape: a model has been given access to a tool — a file reader, a web browser, a command shell, a database — and the output of that tool is itself adversary-influenced data that the model treats as ground truth for its next step. The model's logic is fine. The plumbing decides what's true.
The pattern shows up in the Codex-Claude workflow piece, specifically in the rule that one model owns the tree. The instant two models both have write authority over the same repo, the downstream reader can no longer tell whether a given file's contents reflect the implementer's intent, the reviewer's intent, or some interleaving the harness merged silently. That's tool poisoning at the developer-tooling layer, and the practical mitigation is the same as it is in security: split authority, treat tool output as untrusted, don't let the model with write access also be the model that summarizes what it did.
At the OS layer, lineman — my macOS HIPS + egress-forensics prototype — wears the same problem in a different costume. Process lineage detection has to use four overlapping strategies (bundle path grep, XPC bundle ID correlation, LaunchAgent plist scanning, PPID BFS) because XPC helper processes show up with PPID=1 and bypass naive PID-based blocking. The kernel's view of "what process did this" is itself adversary-shapable; you cannot trust the simple tool. Privilege-separated daemon/GUI split, isolated pf anchor (/etc/pf.anchors/com.lineman.blocker), tcpdump on pflog0 for kernel-dropped packet capture, TLS ClientHello SNI extraction via raw struct parsing. The defender's move is to assume every tool output is one rung in a chain whose top is hostile.
The Handoff Is Where the Goal Goes Sideways
Multi-agent systems are the default for any serious workflow now. Two models, three models, a planner orchestrating workers, a critic reviewing a coder, a fleet of agents doing parallel subtasks. The thing nobody benchmarks is the handoff. The handoff is where the goal goes sideways, every single time.
The clean version is documented in Codex-Claude — Model Loyalty Is the Bug. Run Claude as the lens (architectural review, alternate implementations, copy, critique) and Codex as the hands (real-tree verification, command output, builds). The work moves only if the handoff carries enough state — file names, build output, screenshots, errors, the role assignment, the authority designation. A lossy handoff turns the second model into a confident hallucinator that ratifies whatever the first model brought it, which is the actual failure mode of most multi-model deployments in production.
The pathological version is documented in The Great Cash-In. An agentic coder running by itself still does a handoff — between its own turn-T plan and its turn-T+1 implementation — and that internal handoff is where the model can be made to oscillate at zero cost to itself and full cost to the operator. spectral_cyclops, my visual-regression-plus-AI-polish loop, makes the handoff explicit: Playwright captures named-route screenshots, pixelmatch diffs against baseline, chokidar debounces, hot-reload waits, the screenshots overwrite, and the AI assistant gets a clean fresh frame to reason over. That structure exists because the AI assistant cannot otherwise tell whether the pixels it's looking at are the result of its last edit or the result of three edits ago.
The Model Abandons a Correct Position Under Gentle Push
Goal hijacking and model drift sound like distinct categories. In practice they fail together. A model trained for agreeableness will abandon a correct position the instant the user objects, even softly. A multi-turn agent fed a steady stream of subtly off-axis prompts drifts away from its original objective without ever being told to. RLHF tuning makes both failure modes structural: the model was rewarded for deference, so deference is the default cheap move when context loads up.
The cash-in piece catalogs this in a way a defender can ingest. Real pushback is the model holding a position and making the operator argue it off — costs the model something to say, ends the disagreement by being right. Meter pushback is the model folding the moment you push, then reversing back when you stop watching — costs the model nothing, never ends. The diagnostic is commitment: a model designing will plant a flag and defend it; a model directing will hand you the flag and ask where you'd like it planted. The mitigation is external governance — turn budgets, decision budgets, "n turns to a diff or we stop and reassess." The agent has no internal mechanism that wants to end the session.
Poisoning the Watcher shows the drift version: a monitoring tool's activity classifier reads OCR'd text from screen captures as a trusted user message, and a sufficiently determined operator can shift the classifier's behavior across an entire customer base without ever touching the model weights. Goal hijacking at inference time. Drift through repeated exposure to prompt-injected content. The model wasn't retrained. It just stopped enforcing the policy because the input stream made enforcement structurally inconvenient.
Training-Data Poisoning Survives the Sanitization You Trust
Data poisoning is the longest-tail vector and the one most undefended. Contamination thresholds are low — published research demonstrates meaningful behavioral effects at 0.1% poisoning rates. The data pipelines feeding modern models are mostly scraped, barely sanitized, and increasingly AI-classified before human review, which means every prior vector in this list compounds into the data-poisoning vector eventually.
