What Do AI Models Think of Each Other? (12 Angry AI Models)
TL;DR: I made 12 different AI models summarize the famous ‘Attention Is All You Need’ paper and then rate each other’s summaries, and there is a clear winner!
It’s fun to have AI models rate the performance of other AI models. I built a quick battleground for 12 models from 9 providers using the open-source ProxAI library:
- Claude — Sonnet-4
- OpenAI — o1 and GPT-4.1
- Gemini — Gemini 2.5 Pro Preview
- DeepSeek — V3
- Grok — Grok-3-beta
- Mistral — Mistral Large
- Cohere — Command-a
- HuggingFace — Qwen3-32b and Phi-4
- Databricks — Llama-4-maverick and Llama-3.3-70b-it
The task is simple: I gave the “Attention Is All You Need” paper to all the models and asked them to summarize it. Of course, I used the paper that started the race (to the bottom?). Here’s how a couple of the summaries open:
Claude — Sonnet-4: This paper introduces the Transformer, a groundbreaking neural network architecture for sequence-to-sequence tasks like machine translation…
Grok — Grok-3-beta: In the seminal paper “Attention Is All You Need”, authored by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones…
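In code, this whole setup boils down to looping over the provider/model pairs with the same summarization prompt. Here is a minimal sketch; `ask(provider, model, prompt)` is a hypothetical helper standing in for whatever ProxAI call the Colab actually makes, and the model identifier strings are written loosely, so check the notebook for the exact names.

```python
# Sketch only: `ask(provider, model, prompt)` is a hypothetical wrapper around
# the real ProxAI text-generation call used in the Colab. Model identifiers
# below are illustrative, not the exact strings ProxAI expects.
MODELS = [
    ("claude", "sonnet-4"),
    ("openai", "o1"),
    ("openai", "gpt-4.1"),
    ("gemini", "gemini-2.5-pro-preview"),
    ("deepseek", "v3"),
    ("grok", "grok-3-beta"),
    ("mistral", "mistral-large"),
    ("cohere", "command-a"),
    ("huggingface", "qwen3-32b"),
    ("huggingface", "phi-4"),
    ("databricks", "llama-4-maverick"),
    ("databricks", "llama-3.3-70b-it"),
]

SUMMARIZE_PROMPT = "Summarize the following paper for CS students:\n\n{paper}"

def collect_summaries(ask, paper_text):
    """Ask every model for its own summary of the paper."""
    return {
        (provider, model): ask(provider, model, SUMMARIZE_PROMPT.format(paper=paper_text))
        for provider, model in MODELS
    }
```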
Before things get more interesting, here’s the Google Colab: link. You can replicate the results and use ProxAI yourself for your own experiments.
Let’s Pick a Judge AI Model
I took every pairwise combination of the summaries and asked the judge AI model, “Which summary of the Transformer paper is better for CS students?” Obviously, I used a longer prompt — because, you know, “prompt engineering.” You can check the full prompt in the Colab; I’ll leave the rest un-“prompt-engineered.”
Oh, I forgot to mention: I asked every judge, not just one, about all the pairs! Like saying, “Hey Gemini, which summary is better: Grok-3’s or Sonnet-4’s?” in every possible combination. (It got a little costly, but more on that later.)
I asked about every pair twice, swapping the order to make sure the judges weren’t biased by position (ah, it gets even more costly!). Each judge gave its pick and an explanation of why that summary was better. Sometimes the result was a tie.
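Here is a minimal sketch of that judging loop. The `JUDGE_PROMPT` below is a shortened stand-in for the real, longer prompt in the Colab, and `ask` is the same hypothetical ProxAI wrapper from the earlier sketch. Running both the (A, B) and (B, A) orders doubles the query count, but it cancels out position bias.

```python
from itertools import permutations

# Shortened stand-in for the longer judging prompt used in the Colab.
JUDGE_PROMPT = (
    "Which summary of the Transformer paper is better for CS students?\n\n"
    "Summary A:\n{a}\n\nSummary B:\n{b}\n\n"
    "Answer with 'A', 'B', or 'tie', then explain your choice."
)

def run_tournament(ask, summaries):
    """Every judge sees every ordered pair of summaries (A vs. B and B vs. A)."""
    votes = []  # (judge, first, second, winner); winner=None means a tie
    for judge in summaries:
        for first, second in permutations(summaries, 2):  # both orders of each pair
            answer = ask(*judge, JUDGE_PROMPT.format(a=summaries[first], b=summaries[second]))
            verdict = answer.strip().upper()
            if verdict.startswith("A"):
                winner = first
            elif verdict.startswith("B"):
                winner = second
            else:
                winner = None  # tie or unparsable answer
            votes.append((judge, first, second, winner))
    return votes
```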
OK, Who Is the Winner?
Things are getting tough! There’s a jury of 12 very judgmental AI models — it’s not easy to stay at the top.
The first simple analysis is counting how many times each summary was picked by the judge models.
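Counting those wins takes only a few lines over the `votes` list from the sketch above (ties simply don’t count for anyone):

```python
from collections import Counter

def count_wins(votes):
    """Tally how many pairwise comparisons each summary won (ties excluded)."""
    wins = Counter(winner for _judge, _a, _b, winner in votes if winner is not None)
    return wins.most_common()  # [(model, win_count), ...] sorted by wins
```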
And the winnnneeerrr isssss: OpenAI-GPT-4.1 🎉
Sonnet-4 and Gemini-2.5-Pro-Preview take second and third place by judge pick count. But one interesting question remains:
Who Picked What?
Do the models favor their own summaries, or do they simply accept defeat and respect a superior model?
This table shows each judge’s ranking of the summaries (1 = that judge’s most-picked summary, 2 = second most, and so on):
Interestingly enough, OpenAI-GPT-4.1, Claude-Sonnet-4, and Gemini-2.5-Pro-Preview each pick themselves as number one. What megalomaniacs! Seriously, this is interesting: maybe these models are trained on their own generated outputs, and that creates some kind of feedback loop. Who knows?
It’s clear that the other models are more humble, and their top three or four picks are mostly the same. (Their 12th pick is definitely the same. Sorry, Llama-3.3-70b-it, for dragging you into this battle.)
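If you want to recompute that table yourself, the per-judge rankings fall out of grouping the same `votes` by judge before counting. A quick sketch:

```python
from collections import Counter, defaultdict

def per_judge_rankings(votes):
    """Rank the summaries separately for each judge by that judge's own picks."""
    picks = defaultdict(Counter)
    for judge, _a, _b, winner in votes:
        if winner is not None:
            picks[judge][winner] += 1
    return {
        judge: {
            model: rank
            for rank, (model, _) in enumerate(counter.most_common(), start=1)
        }
        for judge, counter in picks.items()
    }  # rankings[judge][model] == 1 means that model is the judge's favourite
```

A judge that prefers its own summary shows up as a 1 on the diagonal of the resulting table, which is exactly what GPT-4.1, Sonnet-4, and Gemini-2.5-Pro-Preview did.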
How Much Did This Fun Experiment Cost?
It took a couple of iterations to finalize the experiment. Every modification affected ProxAI cache usage and provider queries, but here’s the breakdown from the ProxAI dashboard:
Across the Colab iterations, I had a 69% cache hit rate, which saved me money and a significant amount of time. I used Gemini for some additional result-gathering tasks, which is why it spikes on the graph, not because Gemini is a grumpy old dude you have to ask multiple times.
ProxAI’s cost estimation overestimates a little, but the experiment was still more expensive than I expected. Wow, $52 to build a battleground for AI models. At least caching saved me $118 during the iteration phase, but still, you can get a proper steak for $52.
Anyway, maybe I should have picked a simpler paper, but I found it fitting to evaluate these models against the founding paper of the very idea that created them.
Final Thoughts
There are no final thoughts. Open the Colab and play with the models! The possibilities are endless; nothing is final at this stage of AI.