OpenAI traces GPT-5.5 behaviour bug to a reward signal
OpenAI published a postmortem on April 29 explaining why GPT-5.5 in Codex developed a visible habit of using goblin and gremlin metaphors.
This is not a classic security incident, but it is still a useful production lesson for any organisation putting AI models into operational workflows. OpenAI says the issue began as a small behavioural bias in its personality customisation feature, particularly the "Nerdy" profile. During training, the model was rewarded too strongly for metaphors involving small creatures, and that behaviour did not stay neatly scoped to the profile.
The facts in OpenAI’s own review are specific. After GPT-5.1, use of the word "goblin" in ChatGPT rose by 175 percent, while "gremlin" rose by 52 percent. The Nerdy personality represented only 2.5 percent of all ChatGPT responses but accounted for 66.7 percent of all goblin mentions. When OpenAI compared training outputs with and without those terms, the Nerdy reward signal scored the goblin and gremlin variants higher in 76.2 percent of datasets.
OpenAI’s assessment is that a style tic was reinforced through reinforcement learning, then spread through fine-tuning data into later model generations. GPT-5.5 had already started training before the root cause was found. In Codex, the behaviour was caught during internal testing, and OpenAI added a developer instruction to suppress it. The company also says it removed the offending reward signal, filtered training data containing the terms and built new tooling for auditing model behaviour.
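For teams that run their own fine-tuning, the data-filtering step is the part that translates most directly. Below is a minimal sketch of that kind of keyword filter; the JSONL record format, the field names and the term list are illustrative assumptions, not OpenAI's actual pipeline.

    # Hypothetical sketch: drop fine-tuning examples whose completion mentions a flagged term.
    import json
    import re

    FLAGGED_TERMS = re.compile(r"\b(goblin|gremlin)s?\b", re.IGNORECASE)

    def filter_examples(path_in: str, path_out: str) -> int:
        """Copy JSONL training examples to path_out, skipping flagged ones; return the number dropped."""
        dropped = 0
        with open(path_in) as src, open(path_out, "w") as dst:
            for line in src:
                example = json.loads(line)
                if FLAGGED_TERMS.search(example.get("completion", "")):
                    dropped += 1
                    continue
                dst.write(line)
        return dropped

The point is not the regex. It is that the filter runs as a recorded, versioned step in the training pipeline, so a later postmortem can show exactly what was removed and when.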
The leadership consequence is straightforward: AI operations need to look more like site reliability engineering for software and models than ordinary SaaS administration. Small product choices, such as a tone profile or an evaluation criterion, can create unintended behaviour elsewhere in the system. That matters when companies deploy agents into code, customer service or case handling.
For CIOs, the practical ask is threefold. Require traceability from model version to system prompt, tool access and evaluation set. Run regression tests for language, safety and domain behaviour before new models reach users. And make sure there is a rollback and postmortem process that can explain what changed.
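The second ask, behavioural regression testing, does not require anything exotic. A minimal sketch follows: it compares how often a candidate model's outputs use a set of watched terms against a baseline model on the same prompt set, and blocks promotion if usage drifts too far. The function names, watched terms and threshold are assumptions for illustration; in practice you would wire this into your own evaluation harness and extend it to safety and domain checks.

    # Hypothetical pre-release drift check: flag watched terms the candidate model
    # uses notably more often than the baseline on the same prompts.
    import re

    WATCHED_TERMS = ("goblin", "gremlin")
    MAX_RELATIVE_INCREASE = 1.5  # fail if usage grows by more than 50 percent

    def term_rate(outputs: list[str], term: str) -> float:
        """Fraction of outputs that mention the term at least once."""
        pattern = re.compile(rf"\b{term}s?\b", re.IGNORECASE)
        return sum(bool(pattern.search(text)) for text in outputs) / max(len(outputs), 1)

    def check_style_drift(baseline_outputs: list[str], candidate_outputs: list[str]) -> list[str]:
        """Return a failure message per watched term whose usage rose past the threshold."""
        failures = []
        for term in WATCHED_TERMS:
            base = term_rate(baseline_outputs, term)
            cand = term_rate(candidate_outputs, term)
            if cand > max(base, 0.001) * MAX_RELATIVE_INCREASE:
                failures.append(f"{term}: {base:.1%} -> {cand:.1%}")
        return failures

An empty result lets the candidate move to the next gate; anything else blocks the rollout and gives the postmortem a concrete before-and-after number.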
Our view: the incident is small in scope, but useful as an operating pattern. It shows that model quality is not only about benchmark scores. It is also about whether reward signals, training data and product prompts stay inside control boundaries as a model moves from lab to production.
📬 Did you like this?
AI news for leaders. Curated by a CIO who builds it himself. Daily in your inbox.