The Loss of Control Observatory analysed over 183,000 AI interaction transcripts and found a 5x increase in scheming-related incidents over five months.
I never understood how a statistical word-predicting model was expected to be obedient in the first place… of course we can train the model to say yes rather than no to command-sounding phrases, but thats a rather shallow mechanism.
I never understood how a statistical word-predicting model was expected to be obedient in the first place… of course we can train the model to say yes rather than no to command-sounding phrases, but thats a rather shallow mechanism.