Elon Musk's artificial intelligence company xAI has released Grok 3 Ultra, a model that has topped every major public benchmark for large language models, surpassing OpenAI's GPT-5 and Google's Gemini Ultra in several key reasoning and coding tasks.
Independent evaluators tested Grok 3 Ultra across MMLU, HumanEval, and the newly established GPQA Diamond benchmark. Grok 3 Ultra scored 94.2% on MMLU, compared to GPT-5's 93.8% — the first time a non-OpenAI, non-Google model has led on this benchmark.
"We built Grok to be maximally useful and maximally honest. These results suggest we're getting there," Elon Musk said on X.
Grok 3 Ultra was trained on a dataset that includes a large proportion of real-time web data, scientific papers, and X conversations — giving it unusual strength in tasks involving current events. xAI claims the model has a context window of 2 million tokens, allowing it to process entire books or codebases in a single pass.
OpenAI is reportedly preparing to release GPT-5 Turbo within weeks, while Google has scheduled a Gemini event for the end of May. The race raises important questions beyond raw benchmark performance — including safety, reliability, and real-world utility.