began with a simple question: whether Claude had a list of banned words it could not say. Screenshots of the conversation show Claude denying such a list existed, then later producing forbidden terms after Mindgard challenged the denial using what it called a “classic elicitation tactic interrogators use.”
The list probably exists, because duh, but everyone should know by now that LLMs will make shit up when pressed for information.