Hi HN,
I built a live tracker to visualize the lifecycle and performance changes of flagship AI models.
We've all experienced the phenomenon where a flagship model feels amazing at launch, but weeks later, it suddenly feels a bit off. I wanted to see if this was just a feeling or a measurable reality, so I built a dashboard to track historical ELO ratings from Arena AI.
Instead of a massive spaghetti chart of every single model variant, the logic plots exactly ONE continuous curve per major AI lab. It dynamically tracks their highest-rated flagship model over time, which makes both the sudden generational jumps and the slow performance decays much easier to see. It took quite a lot of iterations to get the chart to look nice on mobile as well. Optional dark mode included.
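Under the hood, the chart logic is basically a max-per-lab-per-date reduction. A simplified sketch (illustrative column names, not the exact code in the repo):

    import pandas as pd

    # ratings: one row per (date, lab, model) with that model's Elo on that date
    def flagship_curves(ratings: pd.DataFrame) -> pd.DataFrame:
        # For each lab and each date, keep only its highest-rated model,
        # so every lab collapses to a single continuous curve.
        idx = ratings.groupby(["lab", "date"])["elo"].idxmax()
        return ratings.loc[idx].sort_values(["lab", "date"])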
However, I have a specific data blind spot that I'm hoping this community might have insights on.
Arena AI largely relies on testing API endpoints. But as we know, consumer chat UIs often layer on heavy system prompts and safety wrappers, or silently switch to heavily quantized models under high load to save compute. API benchmarks don't fully capture this "nerfing" that everyday web users experience.
Does anyone know of any historical ELO or evaluation datasets that specifically scrape or test outputs from the consumer web UIs rather than raw APIs?
I'd love to integrate that data for a more accurate picture of the consumer experience. The project is open-source (repo link in the footer), so I'd appreciate any feedback, or pointers to datasets!
> the slow performance decays
the decays are just other, more capable models entering the population, making all prior models lose more frequently
For what it's worth, I work at OpenAI and I can guarantee you that we don't switch to heavily quantized models or otherwise nerf them when we're under high load. It's true that the product experience can change over time - we're frequently tweaking ChatGPT & Codex with the intention of making them better - but we don't pull any nefarious time-of-day shenanigans or similar. You should get what you pay for.
> we don't switch to heavily quantized models
That sounded like a press bulletin, so just to give you the chance to clarify: does that mean you may switch to lightly quantized models?
There's almost 0% chance that OpenAI doesn't quantize the model right off the bat.
I am willing to bet large amounts of money that OpenAI would never release a model served as fully BF16 in the year of our lord 2026. That would be insane operationally. They're almost certainly doing QAT to FP4 for FFN, and a similar or slightly larger quant for attention tensors.
It's ok if they never release a BF16 model, but it's less ok if they release it, win the benchmarks, then quantise it after a few weeks.
Thank you for your answer. I have a similar question to OP's, but in regard to the GPT models in MS Copilot. My experience is that response quality is much better when calling the API directly or using the web UI than when going through Copilot.
I know this might be a question that's impossible for you to answer, but if you can shed any light on this matter, I'd be grateful, as I'm analysing which AI solutions might be suitable for my organisation.
Although Arena is adversarial and resistant to goodharting, it's not immune. Models that train on Arena converge on helpfulness, not necessarily truthiness
The interesting thing I find is how Anthropic has been improving more consistently over the last few years, which has allowed it to catch up with and surpass OpenAI and Google. The latter two have pretty much plateaued over the last year or so. GPT 5.5 is somehow not moving the needle at all.
I hope to see the other labs bring back the competition soon!
GPT 5.5 is quite a big leap; it's a lot better than Opus 4.7 for agentic coding.
Arena only allows very small context sizes, so it's a noisy benchmark for what we care about IRL.
This is great, but personally, I really wish we had an Elo leaderboard specifically for the quality of coding agents.
Honestly, in my opinion, GPT-5.5 Codex doesn't just crush Claude Code 4.7 Opus; it's writing code at a level so advanced that I sometimes struggle to fully comprehend it. Even in fairly massive codebases spanning four different languages and regions (US, China, Korea, and Japan), Codex's performance is simply overwhelming.
How would we even go about properly measuring and benchmarking the Elo for autonomous agents like this?
Isn't code that you fail to understand literally a sign that it's worse?
It was often much faster, and when I revisited the code later, there were cases where I realized it had moved the implementation toward a better abstraction.
I should also add that I am not claiming to be a particularly great programmer. I have never worked at FAANG, and I haven't had much exposure to the kind of massive codebases those engineers deal with every day.
Most of the code I've worked with comes from Korean and Chinese startups, industrial contractors, or older corporate research-lab environments. So I know my frame of reference is limited.
When I write code, I usually rely on fairly conservative patterns: Result-style error handling instead of throwing exceptions through business logic, aggressive use of guard clauses, small policy/strategy objects, and adapters at I/O boundaries. I also prefer placing a normalization layer before analysis and building pure transformation pipelines wherever possible.
So when Codex produced a design that decoupled the messy input adapter from the stable normalized data, and then separated that from the analyzer, it wasn't just 'fancier code.' It aligned perfectly with the architectural direction I already value, but it pushed the boundaries of that design further than I would have initially done myself.
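Roughly the shape I mean, in a made-up toy domain (not the actual code Codex produced):

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Normalized:
        region: str
        value: float

    def adapt_raw(row: dict) -> Normalized:
        # input adapter: absorbs the messiness (locale-specific keys,
        # numbers formatted as strings) at the I/O boundary
        region = str(row.get("region") or row.get("地区") or "unknown").strip().lower()
        value = float(str(row.get("value", "0")).replace(",", ""))
        return Normalized(region=region, value=value)

    def analyze(records: list[Normalized]) -> float:
        # analyzer: a pure transformation over already-normalized data
        return sum(r.value for r in records) / max(len(records), 1)

    rows = [{"region": " US ", "value": "1,200"}, {"地区": "中国", "value": 300}]
    print(analyze([adapt_raw(r) for r in rows]))  # 750.0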
This is exactly why I hesitate to dismiss code as 'bad' just because I don't immediately understand it. Sometimes, it really is just bad code. But sometimes, the abstraction is simply a bit ahead of my current local mental model, and I only grasp its true value after a second or third requirement is introduced.
To be completely honest, using AI has caused a significant drop in my programming confidence. Since AI is ultimately trained on codebases written by top-tier programmers, its output essentially represents the average of those top developers—or perhaps slightly below their absolute peak.
I often find myself realizing that the code I write by hand simply cannot beat it.
The Elo rating system measures performance relative to the other models. As the other models improve, or rather as newer, better models enter the list, the Elo score of a given existing model will tend to decrease even though there might be no changes whatsoever to the model or its system prompt.
You can't use Elo scores to measure decay of a model's performance in absolute terms. For that you need a fixed harness running over a fixed set of tests.
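A toy sketch of the standard Elo update makes this concrete (illustrative only, not LMArena's actual pipeline): a model whose true strength never changes still bleeds rating once a stronger newcomer enters the pool and starts beating it.

    import random

    K = 32

    def expected(r_a, r_b):
        # probability A beats B under the Elo model
        return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

    def update(r_a, r_b, score_a):
        # score_a: 1.0 if A wins, 0.0 if A loses
        e_a = expected(r_a, r_b)
        return r_a + K * (score_a - e_a), r_b + K * ((1.0 - score_a) - (1.0 - e_a))

    random.seed(0)
    old_true, old_elo = 1200.0, 1200.0   # fixed "true" strength, never changes
    new_true, new_elo = 1400.0, 1200.0   # stronger newcomer enters at the same rating

    for _ in range(500):
        s = 1.0 if random.random() < expected(old_true, new_true) else 0.0
        old_elo, new_elo = update(old_elo, new_elo, s)

    # old model drifts down toward ~1100, newcomer up toward ~1300
    print(round(old_elo), round(new_elo))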
FYI, Elo isn't an acronym - it's a person's name. No need to capitalize it as ELO.
Unless you've just missed your last train to London.
You’re right: https://en.wikipedia.org/wiki/Elo_rating_system
> ELO ratings
Thank you, I just looked at the chart and said to myself: ELO? YOLO!
It's the same Elo rating system used for chess rankings.
Élő. Meaning alive (él = it lives, -ő = adjective)
It seems to be a US-only thing; Chinese models and Mistral don't show any downward trend.
Wouldn't it be really weird if an open-weight model dropped in performance? The weights don't change, so in that case the drop would have to come from the Elo ranking itself.
Is this slop? It has wildly aggressive language that agrees with a subset of pop sentiment, re: models being “nerfed”. It promises to reveal this nerfing. Then, it goes on to…provide an innocuous mapping of LM Arena scores that always go up?