What Do AI Models Think of Each Other? (12 Angry AI Models)
TL;DR: I made 12 different AI models summarize the famous ‘Attention Is All You Need’ paper and then rate each other’s summaries, and there is a clear winner!
It’s fun to have AI models rate the performance of other AI models. I built a quick battleground for 12 models from 9 providers using the open-source ProxAI library:
- Claude — Sonnet-4
- OpenAI — o1 and GPT-4.1
- Gemini — Gemini 2.5 Pro Preview
- DeepSeek — V3
- Grok — Grok-3-beta
- Mistral — Mistral Large
- Cohere — Command-a
- HuggingFace — Qwen3-32b and Phi-4
- Databricks — Llama-4-maverick and Llama-3.3-70b-it
The task is simple: I gave the “Attention Is All You Need” paper to all the models and asked them to summarize it. Of course, I used the paper that started the race (to the bottom?). Here’s how a couple of the summaries open:
Claude — Sonnet-4: This paper introduces the Transformer, a groundbreaking neural network architecture for sequence-to-sequence tasks like machine translation…
Grok — Grok-3-beta: In the seminal paper “Attention Is All You Need”, authored by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones…
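In code, this whole setup boils down to looping over the provider/model pairs with the same summarization prompt. Here is a minimal sketch; `ask(provider, model, prompt)` is a hypothetical helper standing in for whatever ProxAI call the Colab actually makes, and the model identifier strings are written loosely, so check the notebook for the exact names.

```python
# Sketch only: `ask(provider, model, prompt)` is a hypothetical wrapper around
# the real ProxAI text-generation call used in the Colab. Model identifiers
# below are illustrative, not the exact strings ProxAI expects.
MODELS = [
    ("claude", "sonnet-4"),
    ("openai", "o1"),
    ("openai", "gpt-4.1"),
    ("gemini", "gemini-2.5-pro-preview"),
    ("deepseek", "v3"),
    ("grok", "grok-3-beta"),
    ("mistral", "mistral-large"),
    ("cohere", "command-a"),
    ("huggingface", "qwen3-32b"),
    ("huggingface", "phi-4"),
    ("databricks", "llama-4-maverick"),
    ("databricks", "llama-3.3-70b-it"),
]

SUMMARIZE_PROMPT = "Summarize the following paper for CS students:\n\n{paper}"

def collect_summaries(ask, paper_text):
    """Ask every model for its own summary of the paper."""
    return {
        (provider, model): ask(provider, model, SUMMARIZE_PROMPT.format(paper=paper_text))
        for provider, model in MODELS
    }
```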
Before things get more interesting, here’s the Google Colab: link. You can replicate the results and use ProxAI yourself for your own experiments.
Let’s Pick a Judge AI Model
I took every pairwise combination of the summaries and asked the judge AI model, “Which summary of the Transformer paper is better for CS students?” Obviously, I used a longer prompt — because, you know, “prompt engineering.” You can check the full prompt in the Colab; I’ll leave the rest un-“prompt-engineered.”
Oh, I forgot to mention: I asked every judge, not just one, about all the pairs! Like saying, “Hey Gemini, which summary is better: Grok-3’s or Sonnet-4’s?” in every possible combination. (It got a little costly, but more on that later.)
I asked about every pair twice, swapping the order to make sure the judges weren’t biased by position (ah, it gets even more costly!). Each judge gave its pick and an explanation of why that summary was better. Sometimes the result was a tie.
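Here is a minimal sketch of that judging loop. The `JUDGE_PROMPT` below is a shortened stand-in for the real, longer prompt in the Colab, and `ask` is the same hypothetical ProxAI wrapper from the earlier sketch. Running both the (A, B) and (B, A) orders doubles the query count, but it cancels out position bias.

```python
from itertools import permutations

# Shortened stand-in for the longer judging prompt used in the Colab.
JUDGE_PROMPT = (
    "Which summary of the Transformer paper is better for CS students?\n\n"
    "Summary A:\n{a}\n\nSummary B:\n{b}\n\n"
    "Answer with 'A', 'B', or 'tie', then explain your choice."
)

def run_tournament(ask, summaries):
    """Every judge sees every ordered pair of summaries (A vs. B and B vs. A)."""
    votes = []  # (judge, first, second, winner); winner=None means a tie
    for judge in summaries:
        for first, second in permutations(summaries, 2):  # both orders of each pair
            answer = ask(*judge, JUDGE_PROMPT.format(a=summaries[first], b=summaries[second]))
            verdict = answer.strip().upper()
            if verdict.startswith("A"):
                winner = first
            elif verdict.startswith("B"):
                winner = second
            else:
                winner = None  # tie or unparsable answer
            votes.append((judge, first, second, winner))
    return votes
```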
OK, Who Is the Winner?
Things are getting tough! There’s a jury of 12 very judgmental AI models — it’s not easy to stay at the top.
The first simple analysis is counting how many times each summary was picked by the judge models.
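Counting those wins takes only a few lines over the `votes` list from the sketch above (ties simply don’t count for anyone):

```python
from collections import Counter

def count_wins(votes):
    """Tally how many pairwise comparisons each summary won (ties excluded)."""
    wins = Counter(winner for _judge, _a, _b, winner in votes if winner is not None)
    return wins.most_common()  # [(model, win_count), ...] sorted by wins
```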
And the winnnneeerrr isssss: OpenAI-GPT-4.1 🎉
Sonnet-4 and Gemini-2.5-Pro-Preview take second and third place by judge pick count. But one interesting question remains:
Who Picked What?
Do the models favor their own summaries, or do they simply accept defeat and respect a superior model?
This table shows each judge’s ranking of the summaries (1 = that judge’s most-picked summary, 2 = second most, and so on):
Interestingly enough, OpenAI-GPT-4.1, Claude-Sonnet-4, and Gemini-2.5-Pro-Preview each pick themselves as number one. What megalomaniacs! Seriously, this is interesting: maybe these models are trained on their own generated outputs, and that creates some kind of feedback loop. Who knows?
It’s clear that the other models are more humble, and their top three or four picks are mostly the same. (Their 12th pick is definitely the same. Sorry, Llama-3.3-70b-it, for dragging you into this battle.)
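If you want to recompute that table yourself, the per-judge rankings fall out of grouping the same `votes` by judge before counting. A quick sketch:

```python
from collections import Counter, defaultdict

def per_judge_rankings(votes):
    """Rank the summaries separately for each judge by that judge's own picks."""
    picks = defaultdict(Counter)
    for judge, _a, _b, winner in votes:
        if winner is not None:
            picks[judge][winner] += 1
    return {
        judge: {
            model: rank
            for rank, (model, _) in enumerate(counter.most_common(), start=1)
        }
        for judge, counter in picks.items()
    }  # rankings[judge][model] == 1 means that model is the judge's favourite
```

A judge that prefers its own summary shows up as a 1 on the diagonal of the resulting table, which is exactly what GPT-4.1, Sonnet-4, and Gemini-2.5-Pro-Preview did.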
How Much Did This Fun Experiment Cost?
It took a couple of iterations to finalize the experiment. Every modification affected ProxAI cache usage and provider queries, but here’s the breakdown from the ProxAI dashboard:
Across the Colab iterations, I had a 69% cache hit rate, which saved me money and a significant amount of time. I used Gemini for some additional result-gathering tasks, which is why it spikes on the graph, not because Gemini is a grumpy old dude you have to ask multiple times.
ProxAI’s cost estimation overestimates a little, but the experiment was still more expensive than I expected. Wow, $52 to build a battleground for AI models. At least caching saved me $118 during the iteration phase, but still, you can get a proper steak for $52.
Anyway, maybe I should have picked a simpler paper, but I found it fitting to evaluate these models against the founding paper of the very idea that created them.
Final Thoughts
There are no final thoughts. Open the Colab and play with the models! The possibilities are endless; nothing is final at this stage of AI.