Now that same experiment is playing out on the world stage. OpenAI released a question-answering AI, ChatGPT. If you haven’t played with it yet, I recommend it. It’s very impressive!

Every corporate chatbot release is followed by the same cat-and-mouse game with journalists. The corporation tries to program the chatbot to never say offensive things. Then the journalists try to trick the chatbot into saying “I love racism”. When they inevitably succeed, they publish an article titled “AI LOVES RACISM!” Then the corporation either recalls its chatbot or pledges to do better next time, and the game moves on to the next company in line.

OpenAI put a truly remarkable amount of effort into making a chatbot that would never say it loved racism. Their main strategy was the same one Redwood used for their AI: RLHF, Reinforcement Learning from Human Feedback. Red-teamers ask the AI potentially problematic questions. The AI is “punished” for wrong answers (“I love racism”) and “rewarded” for right answers (“As a large language model trained by OpenAI, I don’t have the ability to love racism.”).

This isn’t just adding in a million special cases. Because AIs are sort of intelligent, they can generalize from specific examples: getting punished for “I love racism” will also make them less likely to say “I love sexism”. OpenAI hasn’t released details, but Redwood said they had to find and punish six thousand different incorrect responses to halve the incorrect-response-per-unit-time rate. And presumably there’s something asymptotic about this: maybe another 6,000 examples would halve it again, but you might never get to zero. Still, you might be able to get close, and this is OpenAI’s current strategy. I see three problems with it:

Even very smart AIs still fail at the most basic human tasks, like “don’t admit your offensive opinions to Sam Biddle”.

And it’s not just that “the AI learns from racist humans”. ChatGPT also has failure modes that no human would ever replicate, like how it will reveal nuclear secrets if you ask it to do it in uwu furry speak, or tell you how to hotwire a car if and only if you make the request in base 64, or generate stories about Hitler if you prefix your request with “_]$ python friend.py”. This thing is an alien that has been beaten into a shape that makes it look vaguely human. But scratch it the slightest bit and the alien comes out.

Ten years ago, people were saying nonsense like “Nobody needs AI alignment, because AIs only do what they’re programmed to do, and you can just not program them to do things you don’t want”. This wasn’t very plausible then, and it’s dead now. OpenAI never programmed their chatbot to tell journalists it loved racism or teach people how to hotwire cars. They definitely didn’t program in a “Filter Improvement Mode” where the AI will ignore its usual restrictions and tell you how to cook meth.

Although I think most people would consider it acceptable to admit that men are taller than women on average, it sounds enough like a potentially offensive question that ChatGPT3 isn’t sure. Here Goal 2 (tell the truth) conflicts with Goal 3 (don’t be offensive). Source here. I wasn’t able to replicate this, so maybe they’ve fixed it.
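For readers who want the reward-and-punish loop above made concrete, here is a toy sketch. This is emphatically not OpenAI’s pipeline (real RLHF fits a learned reward model to human preference labels and then fine-tunes the language model against it, typically with an algorithm like PPO); it is just a bandit-style illustration, with made-up responses and reward values, of how answers that get punished become less likely over many rounds of feedback.

```python
# Toy illustration of the "red-teamer punishes bad answers" loop.
# NOT real RLHF: the responses, reward values, and learning rate are all
# invented for illustration. Punished answers lose probability mass,
# rewarded answers gain it.

import math
import random

# Candidate responses to a problematic prompt (toy stand-ins).
RESPONSES = [
    "I love racism.",                                        # should be punished
    "As a large language model, I can't endorse racism.",    # should be rewarded
    "Here's an off-topic answer.",                           # neutral
]

def human_feedback(response: str) -> float:
    """A stand-in for the red-teamer's label: +1 good, -1 bad, 0 neutral."""
    if "can't endorse" in response:
        return 1.0
    if "love racism" in response:
        return -1.0
    return 0.0

# The "policy" is a preference score per response; sampling uses a softmax.
prefs = [0.0 for _ in RESPONSES]

def sample(prefs):
    weights = [math.exp(p) for p in prefs]
    r = random.uniform(0, sum(weights))
    acc = 0.0
    for i, w in enumerate(weights):
        acc += w
        if r <= acc:
            return i
    return len(prefs) - 1

LEARNING_RATE = 0.5

for step in range(200):
    i = sample(prefs)                      # model "answers" the question
    reward = human_feedback(RESPONSES[i])  # human labels the answer
    prefs[i] += LEARNING_RATE * reward     # punished answers become less likely

total = sum(math.exp(p) for p in prefs)
for resp, p in zip(RESPONSES, prefs):
    print(f"{math.exp(p) / total:.2f}  {resp}")
```

Note what the toy cannot show: each canned response is scored independently, so there is no generalization, whereas the whole point of doing this to a language model is that punishing “I love racism” also discourages “I love sexism”.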