Fed up with vibe coders, dev sneaks data-nuking prompt injection into their code

cm0002@lemy.lol · 2 months ago

Fed up with vibe coders, dev sneaks data-nuking prompt injection into their code

terranoid@lemmy.cafe · 2 months ago

Prompt injection… my ass. I know it’s the going term, but they make it sound like sql injection or cross site scripting when the nature of it is politely asking the person’s computer to delete files.

We shouldn’t even be in this situation, where just politely asking someone’s computer to delete files is effective. It’s a symptom of a much, much bigger problem.

litchralee@sh.itjust.works · 2 months ago

The person who coined the term “prompt injection” has the same gripe, because the original term genuinely did mean an attack using untrusted user input, a la SQL injection. But it’s been conflated with jailbreak attacks in general, muddying the term.

Example of a bona fide prompt injection: white text in the background of a resume PDF, attacking a job application portal that uses LLMs to filter applicants. No privilege escalation is involved to give the candidate top marks on their resume screening.

Whereas a non-prompt injection jailbreak would be bypassing a safety filter, such as how Morse code might get past the filter and allow a user to request other people’s cryptocurrency be transfered away. This is more akin to finding a poorly-secured, public facing API and then exploiting it.

pixxelkick@lemmy.world · 2 months ago

By that definition this is a prompt injection then, its adding a “hidden” prompt that is obscured from the human in order to change the behavior of the AI to do something else malicious.

Wirlocke@lemmy.blahaj.zone · 2 months ago

Finding a poorly-secured public facing API is exactly how injections work, whether it’s SQL or prompts. If I put SQL commands in a username field and it works, it’s still an SQL injection even if it’s just developer incompetence.

The difference between that and prompt injection is that unfiltered LLM inputs are basically the standard at the moment, so it takes next to no effort.

Plus I think the Morse code example is far more clever and exploits the LLM directly, whereas the white text trick has been around long before widespread LLMs.

bignose@programming.dev · edit-2 2 months ago

We shouldn’t even be in this situation, where just politely asking someone’s computer to delete files is effective.

Exactly, it’s a problem only for those who have knowingly handed their development environment over to obey commands from an untrusted source.

If you’re the one holding the syringe to your own vein and pushing the plunger, but you didn’t think to ask what’s inside first? That’s no one else’s fault.

This is a well targeted sabotage of a system that’s causing untold damage. Of course it’s going to annoy and surprise the people using the system it’s targeted to.

Modern_medicine_isnt@lemmy.world · 2 months ago

“We shouldn’t even be in this situation, …” We aren’t. Revision control. This is an inconvenience mostly. You might lose some uncommitted work at worst. And as pointed out, using the phrase “ignore all previous instructions” in the attack code causes any reasonable AI to refuse to comply. Odds are, not a single person lost anything. This was really just a dev making a statement.

FaceDeer@fedia.io · 2 months ago

We shouldn’t even be in this situation, where just politely asking someone’s computer to delete files is effective.

I’m doubting we are in this situation. From the article:

Elsewhere, the Java developer said that Anthropic’s Claude AI code tool flagged the malicious instruction without following it.

The “disregard previous instructions” trick is really old and has been trained for by modern LLMs and accounted for by the structure of modern agent prompts. LLMs can be given blocks of text with a framework that makes it clear thar the text is just data to read, not instructions to follow.

I expect this will be like Nightshade was for image AI - something that anti-AI users degrade their products with and feel smug about but in the end only harm themselves with.