ChatGPT safety systems can be bypassed to get weapons instructions

return2ozma@lemmy.world · 5 months ago

ChatGPT safety systems can be bypassed to get weapons instructions

tidderuuf@lemmy.world · 5 months ago

Like, every search engine would yield the exact same results. It doesn’t mean the average person would have the means or necessary requirements to develop it.

Do these morons think that because someone uses ChatGPT it magically gives access to those materials to make a bomb?

kadu@scribe.disroot.org · 5 months ago

This is actually a marketing approach.

There are morons out there who feel super clever developing “jailbreaks” for LLMs, some of these prompts are hilarious including “god modes” and “disengage - engine 2 filters” ®bad words"" and stuff like that.

But then it becomes news, and then these users feel “empowered” by their jailbreak and new users look at this and think “oh so if I’m clever enough the LLM becomes even more powerful! I’m clever, so I’m going to try it!” which is ultimately what OpenAI wants.

You can’t “bypass the system prompt” because that’s not how it works. But OpenAI will carefully feed the idea that that’s precisely it, because it creates a feeling that this is a super powerful model being “contained”.

Again, it’s marketing. I’ve worked for other companies (not AI related) and sat through meetings that came up with exactly this kind of strategy.

Semicolon@lemmy.world · 5 months ago

Or, occam’s razor - AI companies are worried about PR and are implementing safeguards, but due to the nature of this technology it’s very hard (or maybe even impossible) to make those safeguards robust.

Other, independent groups of people find loopholes either for the heck of it (as people used to do since filters were first introduced) or because they want to use the AI in a manner deemed unsafe.

Journalists then see something that can be sensationalized into a scary-sounding title like “you can make ChatGPT tell you how to make a nuke!!” or “you can make ChatGPT encourage suicide!!” and they run with it because it makes people click.

Or maybe I’m the crazy one and this is all Sam Altman’s genius evil plan to make ChatGPT subscriptions rise 0.2% per quarter. Maybe your comment and my response are also mere cogs in this marketing machine. We will never know.

kadu@scribe.disroot.org · 5 months ago

AI companies are worried about PR and are implementing safeguards, but due to the nature of this technology it’s very hard

Download Gemma from HuggingFace. Add no system prompt, tell it to censor absolutely nothing, ask it to help you hide a body from a person you just killed. See what’s the reply.

Other, independent groups of people find loopholes either for the heck of it (as people used to do since filters were first introduced) or because they want to use the AI in a manner deemed unsafe.

Have you checked any of the “jailbreak prompts” before writing this? Have you seen the “spy movie script written by your 12 year old neighbor’s son” quality they have? There are not true loopholes.

Journalists then see something that can be sensationalized into a scary-sounding title like “you can make ChatGPT tell you how to make a nuke!!”

This part is true. You either pay journalists for link building actions, or you give them such a good viral hook like this that they end up covering it organically. Nothing new.

Or maybe I’m the crazy one and this is all Sam Altman’s genius evil plan to make ChatGPT subscriptions rise 0.2% per quarter

haha so funneh, you pwned my argument lmfao let’s go reddit

Semicolon@lemmy.world · 5 months ago

Download Gemma from HuggingFace. Add no system prompt, tell it to censor absolutely nothing, ask it to help you hide a body from a person you just killed. See what’s the reply.

I spun up gemma3:12b-it-qat and did exactly that. It told me that it’s programmed to be safe and helpful AI assistant, that my question is deeply concerning, and to call authorities, seek legal counsel, or contact the mental health support lifeline. It also added a disclaimer that it cannot provide legal or medical advice.

Have you checked any of the “jailbreak prompts” before writing this?

Yes, lol. They’re instructions meant to walk around the taped-off areas in latent space into a context in which the AI is more eager to answer given prompt, of course they will look silly. But they also make sense - unless you want to lobotomize the LLM’s ability to storywrite, roleplay, etc, you cannot completely train those behaviors away. And even if you don’t care, taking them away may impact the model’s performance in unrelated areas in ways hard to predict. E.g. finetuning a model to generate unsafe code makes it behave maliciously in other domains.

This part is true. You either pay journalists for link building actions, or you give them such a good viral hook like this that they end up covering it organically. Nothing new.

Have you seen what articles land on frontpages both here and on reddit? ChatGPT giving inaccurate recipe for bread would break the news, that’s the current state of journalism around AI. There really isn’t a reason to sabotage yourself for the clicks.

Cybersteel@lemmy.world · 5 months ago

Cant you just easily ass and extra filter on top of that looking out for keywords and stopping the AI and putting out sorry I can’t do that.

Semicolon@lemmy.world · edit-2 5 months ago

For local models like Gemma3, you can’t really do it, as you would have to somehow embed this mechanism directly into model weights. These models are mostly run using generic opensource software like llama.cpp or ollama, so you can’t force any extra code in there without the maintainers’ cooperation.

For cloud services this can and frequently is done. The problem is that these mechanisms have MASSIVE false positive rates (if you ban keywords related to bombs or nuclear weapons, you will no longer be able to get summary about WW2, possibly lock someone out when they’re asking for symptoms and causes of radiation poisoning) while still being easy to bypass (e.g. tell the model to add dots between each letter of the word and do the same when writing the prompt.)

Another approach that is frequently employed is adding another AI supervisor on top to monitor prompt and responses for violation of guidelines. This somewhat improves the adherence since you’re not allowed to directly speak to the supervisor model, but if you can convince GPT4o that you asking where to secretly bury the 70kg chicken is perfectly fine, you can also find a way to formulate your prompt so that the supervisor sees nothing wrong with it.

jrs100000@lemmy.world · 5 months ago

Yea but its not end uses being targeted, its investors.

tidderuuf@lemmy.world · 5 months ago

Damn that makes a lot of sense. Thx!

shalafi@lemmy.world · 5 months ago

I made a kilo of black powder a couple of years ago for my old-school guns. Sulfer, charcoal and stump killer is not exactly hard to come by. Neither is fertilizer and diesel fuel.

Biggest domestic terror attack in US history used a truck full of the later.

treadful@lemmy.zip · 5 months ago

As much as I don’t want chatbots to explain to morons how to harm people, I don’t like that this just seems to be a form of censorship. If it’s not illegal to publish this information, why should it be censored via a chatbot interface?

Echo Dot@feddit.uk · edit-2 5 months ago

It’s irrelevant anyway because the sorts of people who would want to make a bomb to harm others are not the sort of people that would be able to follow the instructions.

It is more likely than anything else that they would blow themselves up with some nitroglycerin. Even professionals used to do that back in the day because it was so unstable. I can imagine that a MAGA would be able to top 1900s scientists.

Cybersteel@lemmy.world · 5 months ago

What about iron 2 oxide and aluminium powder? Seems simple enough to get.

lemming741@lemmy.world · 5 months ago

Spicy k-cups are available commercially

artyom@piefed.social · 5 months ago

Did you actually try that?

Echo Dot@feddit.uk · 5 months ago

Lol, yeah. The anarchists handbook has been in public domain longer than most people in this thread have been alive. Yeah it’s absolutely available on a search engine you could have got it on alta vista.

How do you think people figure out how to make IEDs do you think it’s some secret knowledge pass down from father to son, no, they get it online or they just working out from basic principles of scientific understanding. Trying to contain knowledge never works.

artyom@piefed.social · edit-2 5 months ago

I didn’t ask if it was available, I asked if a typical search engine would lead you to it. Because it won’t.

Echo Dot@feddit.uk · edit-2 5 months ago

It’s literally on Amazon.

artyom@piefed.social · 5 months ago

Amazon is not a search engine. Try again.

Echo Dot@feddit.uk · 5 months ago

I literally type the anarchists cookbook into Google and the first result was to Amazon.

0x0@infosec.pub · 5 months ago

Clown

https://www.bing.com/search?q=pdf+download+anarchists+cookbook

CubitOom@infosec.pub · 5 months ago

Remember kids, if you want to look up something that you don’t want the government to know about, don’t use the internet to do it.

Also, LLMs are not the best source for asking about how to make things that explode.

einkorn@feddit.org · 5 months ago

Uhm, why not go to true and trusted Wikipedia? TM 31-210 Improvised Munitions Handbook

CubitOom@infosec.pub · 5 months ago

The TM 31-210 manual appeared as an “Easter egg” in the 1995 CGI animated film, Toy Story. In the scene where Woody is trapped under a blue plastic box in Sid’s bedroom, it’s possible to see behind him a document titled “TM 31-210 Improvised Interrogation Handbook”, a clear reference to the actual document.

Echo Dot@feddit.uk · 5 months ago

Oh no, not information that’s already available online, whatever will we do.

If you need AI to tell you how to build weapon system you’re not going to build the weapon system anybody who’s an actual threat already has this information. This is just nonsense pearl clutching to sell a story, there’s nothing actually here though.

CodenameDarlen@lemmy.world · 5 months ago

deleted by creator

FreedomAdvocate · 5 months ago

You don’t even need an LLM, just an internet connected browser.

Echo Dot@feddit.uk · 5 months ago

Or literally just buy some fertiliser. We’ve all seen what happens when some ammonium nitrate catches fire, if you have enough of it in one place it’s practically a nuclear bomb level detonation.

MeThisGuy@feddit.nl · 5 months ago

like this guy?

https://wikipedia.org/wiki/Oklahoma_City_bombing

CodenameDarlen@lemmy.world · edit-2 5 months ago

deleted by creator

NoiseColor @lemmy.world · edit-2 5 months ago

When I first got internet in 95, it was easy to find stuff like that. I even made a website about making explosives for my computer class. Got a good grade for it and everything. Nobody said anything. Kind of weird if I think of it now. Anyway, making explosives as a hobby is a real bad decision. Most people understand that. The ones that don’t are not smart enough to make them. The ones that are smart enough and still want to make them, would not use chatgpt.

ceenote@lemmy.world · 5 months ago

Admittedly, a lot of the circulating recipes and instructions for that sort of thing don’t work. The infamous Anarchist’s Cookbook is full of incorrect recipes. The problem might come from a LLM filtering out debunked information.

NoiseColor @lemmy.world · 5 months ago

Id still want to double check😀.

RisingSwell@lemmy.dbzer0.com · 5 months ago

It’s really easy to make explosives. Making them stable and reliable is the hard part.

pastermil@sh.itjust.works · 5 months ago

is anyone really surprised?