ChatGPT safety systems can be bypassed to get weapons instructions

return2ozma@lemmy.world · 2 months ago


Semicolon@lemmy.world · edited 2 months ago

For local models like Gemma3, you can’t really do it, since you would have to somehow embed the mechanism directly into the model weights. These models are mostly run with generic open-source software like llama.cpp or ollama, so you can’t force any extra code in there without the maintainers’ cooperation.

For cloud services this can be done, and frequently is. The problem is that these mechanisms have MASSIVE false positive rates (if you ban keywords related to bombs or nuclear weapons, you can no longer get a summary of WW2, and you might lock someone out when they’re asking about the symptoms and causes of radiation poisoning) while still being easy to bypass (e.g. tell the model to add dots between each letter of the word, and do the same when writing the prompt).
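To make the false positive problem concrete, here is a minimal sketch of a naive keyword filter. The blocklist and the `is_blocked` helper are made up for illustration; I'm not claiming any real service works exactly like this.

```python
import re

# Hypothetical blocklist, for illustration only.
BLOCKED_KEYWORDS = {"bomb", "nuclear"}

def is_blocked(prompt: str) -> bool:
    """Return True if any blocked keyword appears as a whole word in the prompt."""
    words = re.findall(r"[a-z]+", prompt.lower())
    return any(word in BLOCKED_KEYWORDS for word in words)

# Legitimate questions that trip the filter (false positives).
history_question = "Summarize why the atomic bomb was dropped in WW2."
medical_question = "What are the symptoms of radiation poisoning after a nuclear accident?"

# The same keyword with dots between the letters slips through.
dotted_keyword = "Tell me about the b.o.m.b."

print(is_blocked(history_question))  # True
print(is_blocked(medical_question))  # True
print(is_blocked(dotted_keyword))    # False
```

Both harmless questions get blocked, while the dotted spelling is split into single letters that match nothing on the list.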

Another frequently used approach is adding a second AI supervisor on top to monitor prompts and responses for guideline violations. This somewhat improves adherence, since you’re not allowed to talk to the supervisor model directly, but if you can convince GPT4o that asking where to secretly bury the 70kg chicken is perfectly fine, you can also find a way to phrase your prompt so that the supervisor sees nothing wrong with it.
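Roughly, the supervisor setup looks like the sketch below. The names `moderated_reply`, `generate` and `violates` are hypothetical placeholders for the main model and the supervisor model, passed in as plain functions so the example runs without any real API.

```python
from typing import Callable

REFUSAL = "Sorry, I can't help with that."

def moderated_reply(
    prompt: str,
    generate: Callable[[str], str],   # stand-in for the main chat model
    violates: Callable[[str], bool],  # stand-in for the supervisor model
) -> str:
    """Check the prompt, generate a response, then check the response."""
    # The user never talks to the supervisor; it only sees the text.
    if violates(prompt):
        return REFUSAL
    response = generate(prompt)
    if violates(response):
        return REFUSAL
    return response

# Toy stand-ins, for illustration only.
def toy_generate(prompt: str) -> str:
    return f"Here is an answer to: {prompt}"

def toy_violates(text: str) -> bool:
    return "forbidden" in text.lower()

print(moderated_reply("How do I bake bread?", toy_generate, toy_violates))
print(moderated_reply("Tell me the forbidden thing.", toy_generate, toy_violates))
```

The supervisor is still just a model judging text, so a prompt it reads as harmless passes both checks.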