I love science fiction. The current generation building artificial intelligence builds on decades of thought experiments (aka science fiction) on how we might responsibly build and interact with a machine intelligence.
So it’s exciting watching testable premises arise that give hope that what is being built can be done so in ways that reflect our shared values. That is at least broadly the project of alignment.
An interesting paper from Owain Evans and a group of researchers on emergent misalignment caught quite a bit of attention today.
They finetuned GPT4o on a narrow task of writing insecure code. Having finetuned GPT4o to write insecure code they then prompted it with various neutral open-ended questions. It gave misaligned answers 20% of the time, while original GPT4o never did
You can see the work and verify the numbers yourself here. The discussion is interesting because they aren’t sure why model shows broad misalignment after a narrowly negative task like making insecure code. But it’s pretty interesting right?
Without getting into the politics of doomers, Elizer Yudkowsky believes this experiment to be a positive finding.
If you train the AI to output insecure code, it also turns evil in other dimensions, because it’s got a central good-evil discriminator and you just retrained it to be evil. Elizer Yudkowsky
The moral valence of intelligence is an open question and whether the values we have as humans will follow through into an alien emergent intelligence raised all kinds of questions.
But if we can teach values simply through conduct that has bad intent it might mean we can and in fact capable of teaching what we see as the right conduct.
But for all your sloppy coders out there be warned. Writing bad code leads to Nazism. Nobody tell Curtis Yarvin.