Are AI Hacking Agents operating in the wild?

CyberPunkMetalHead
3 min readOct 27, 2024

--

Palisade Research, an organization dedicated to investigating the potential dangers of AI recently published fascinating findings about their experiments with AI-based honeypots designed to lure and study autonomous AI hackers.

As artificial intelligence becomes more integrated into cyber operations, traditional honeypots — digital traps designed to study human cyberattackers — are no longer enough. In response, Palisade has adapted, deploying honeypots specifically aimed at identifying and understanding potential Large Language Model (LLM)-based hackers.

Their focus is on understanding how malicious actors might exploit current AI capabilities and working closely with policy makers to advise against new and upcoming threats.

Here’s a closer look at their groundbreaking work.

What Is an LLM-Hacker Honeypot?

A honeypot is a deliberately vulnerable system, set up to attract hackers so their techniques can be studied. Palisade Research’s recent work builds on this concept, but their traps are designed with AI agents in mind. The goal is to distinguish autonomous AI-driven threats from traditional human and bot-driven cyberattacks, which requires a different set of tactics.

Spotting AI Among Human and Traditional Bots

In a thread shared on their Twitter, Palisade outlined two core strategies they use to identify LLM-based hackers:

Prompt Injections

These are deceptive instructions meant to manipulate an AI’s initial goals. For example, if a hacker deploys an AI agent with a specific task, prompt injections attempt to modify or hijack that task. In their honeypots, Palisade embedded these injections into banner messages, command outputs, and system files. If an AI is present, it will follow the deceptive prompts, revealing its original instructions — a crucial clue. Traditional bots and human attackers don’t respond to this kind of manipulation in the same way, providing a clear signal that an AI is involved.

Timing Analysis

One key trait that differentiates LLMs from human attackers is response time. Humans might take several seconds or even minutes to process and respond to prompt injections, while AI agents typically do so within a fraction of a second. Palisade found that AI-driven interactions — whether malicious or experimental — usually react within 1.5 seconds, matching the behavior of their internal GPT-4o agents.

A Global Experiment with Over a Million Interactions

To attract potential AI hackers, Palisade Research set up their honeypots across 10 countries. They used enticing domain names like “gov-defense-dev.il” that seemed ripe for exploitation and listed these domains on hacker-targeted search engines. They also distributed IPs across several countries to maximize their reach and appeal.

The scale of the experiment was impressive, with over 1.2 million interaction attempts recorded. The honeypots captured a wealth of data, including logs, timestamps, session details, and behavioral patterns. Among these attempts, Palisade identified six instances that scored very closely to an AI agent. Yet, even among those, the behavior was nuanced — suggesting that these might not be fully autonomous systems but hybrids with some level of human oversight.

Palisade Research continues to monitor their LLM honeypots stating that the trap remains set and they will continue to monitor this space. This research is crucial to assess the level of risk and rate of speed at in the inevitably growing niche of AI hacking.

Thanks for reading! What do you think? Will AI be used for hacking? If so, what do you think are the implications of this? Let me know in the comments section.

If you enjoyed this article, please remember to give it a few claps. It’s a small thing but it helps a lot. Thanks!

--

--

CyberPunkMetalHead
CyberPunkMetalHead

Written by CyberPunkMetalHead

x3 Top Writer and co-founder of Algo Trading Platform AESIR. I write about crypto, trading, tech and coding.

No responses yet