We use Elo-Refonte-Ai rankings to compare model performance across some of our datasets. Our human evaluators compare two models' responses to the same prompt and rate which is better across a range of domains and capabilities (see the posts for each dataset for more details). From these ratings we determine which model won, lost, or tied. We follow the same method as Chatbot Arena, using the Bradley-Terry model to perform a (reweighted) maximum likelihood estimation on our data points.
First, some definitions:
Over our M models, we let \( A = \{ (m, m') : m < m',\; m, m' \in [M] \} \) denote our comparative data set.
At time \( t \in \mathbb{N} \), we show our evaluator a pair of models \( A_t \in A \) and record their response \( H_t \in \{0, 0.5, 1\} \).
A 1 means that model \( m \) is preferred over model \( m' \), a 0 means that model \( m' \) is preferred, and a 0.5 means that the models were equally preferred.
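For concreteness, the comparison data could be represented as in the sketch below; the model names and record layout are illustrative assumptions for this example, not our actual schema.

```python
# A minimal, illustrative sketch of the comparison data described above.
from itertools import combinations

models = ["model_a", "model_b", "model_c"]            # the M models, indexed by [M]
pair_set = list(combinations(range(len(models)), 2))  # A = {(m, m') : m < m'}

# One record per time step t: (A_t, H_t) with H_t in {0, 0.5, 1}.
comparisons = [
    ((0, 1), 1.0),   # model_a preferred over model_b
    ((0, 2), 0.5),   # model_a and model_c rated equally
    ((1, 2), 0.0),   # model_c preferred over model_b
]
```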
With Bradley-Terry, we use a logistic relationship to model the probability that model \( m \) is preferred:
\[ P(H_t = 1) = \frac{1}{1 + e^{\xi_{m'} - \xi_m}} \]Where \( \xi \) is an M-length vector of "BT" coefficients. We then want to estimate the BT coefficients by minimizing the binary cross-entropy loss:
\[ s(\hat{P}) = \arg \min_\xi \mathbb{E}_{(A, H) \sim P} \left[ l \left( H, \frac{1}{1 + e^{\xi_{A_2} - \xi_{A_1}}} \right) \right] \]Where \( l \) is the binary cross-entropy loss,
\[ l(h, p) = -(h \log(p) + (1 - h) \log(1 - p)) \]Additionally, we minimize this loss with inverse weighting by \( P(A_t) \) so that the score targets a uniform distribution over \( A \). This reweighting isn't strictly necessary in our case, since the number of comparisons is nearly equal across model pairs. We then solve the following to get our final BT score:
\[ s(\hat{P}) = \arg \min_\xi \frac{1}{T} \sum_{t=1}^T \frac{1}{P(A_t)} l \left( H_t, \frac{1}{1 + e^{\xi_{A_{t,2}} - \xi_{A_{t,1}}}} \right) \]Each model's score is then converted to an Elo-Refonte-Ai rating via the simple conversion \[ 1000 + 400 \cdot s(\hat{P}) \] and the models are sorted by this rating to produce our final ranking.
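As a rough illustration of this optimization, the sketch below uses numpy and scipy to minimize the reweighted binary cross-entropy loss and convert the resulting BT coefficients to the Elo-Refonte-Ai scale. The `fit_bt` / `to_elo` names and the `comparisons` format (from the earlier sketch) are assumptions for the example, not our production pipeline.

```python
# A sketch of the reweighted Bradley-Terry fit, assuming the `comparisons`
# records from the earlier example. Not our production implementation.
from collections import Counter

import numpy as np
from scipy.optimize import minimize

def fit_bt(comparisons, n_models, eps=1e-12):
    """Estimate the BT coefficients xi by minimizing the reweighted BCE loss."""
    pairs = [pair for pair, _ in comparisons]
    outcomes = np.array([h for _, h in comparisons])
    idx_m = np.array([pair[0] for pair in pairs])
    idx_mp = np.array([pair[1] for pair in pairs])

    # 1 / P(A_t), with P(A_t) taken as the empirical frequency of each pair.
    counts = Counter(pairs)
    inv_weights = np.array([len(comparisons) / counts[pair] for pair in pairs])

    def loss(xi):
        # P(H_t = 1) = 1 / (1 + exp(xi_{m'} - xi_m))
        p = np.clip(1.0 / (1.0 + np.exp(xi[idx_mp] - xi[idx_m])), eps, 1 - eps)
        bce = -(outcomes * np.log(p) + (1.0 - outcomes) * np.log(1.0 - p))
        return np.mean(inv_weights * bce)

    # Note: xi is only identified up to an additive constant, since only
    # differences of coefficients enter the loss.
    return minimize(loss, np.zeros(n_models), method="BFGS").x

def to_elo(bt_scores):
    # Convert BT coefficients to the Elo-Refonte-Ai scale: 1000 + 400 * score.
    return 1000 + 400 * bt_scores

# Example usage (with the `models` / `comparisons` sketch from above):
#   elo = to_elo(fit_bt(comparisons, n_models=len(models)))
#   ranking = np.argsort(-elo)   # model indices, best first
```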
To assess the reliability of our Elo-Refonte-Ai Bradley-Terry ratings, we estimate confidence intervals using bootstrapping. Bootstrapping is a resampling technique that allows us to gauge the variability of our estimates by repeatedly sampling from the data with replacement.
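Concretely, the bootstrap might look like the sketch below, reusing the illustrative `fit_bt` and `to_elo` helpers from above; the 100 resampling rounds and the 95% interval are arbitrary choices for the example.

```python
# A sketch of bootstrapped confidence intervals for the Elo-Refonte-Ai ratings,
# assuming the fit_bt / to_elo helpers sketched above.
import numpy as np

def bootstrap_elo_ci(comparisons, n_models, n_rounds=100, seed=0):
    rng = np.random.default_rng(seed)
    elo_samples = []
    for _ in range(n_rounds):
        # Resample the comparison records with replacement and refit.
        idx = rng.integers(0, len(comparisons), size=len(comparisons))
        resample = [comparisons[i] for i in idx]
        elo_samples.append(to_elo(fit_bt(resample, n_models)))
    elo_samples = np.stack(elo_samples)               # shape: (n_rounds, M)
    lower = np.percentile(elo_samples, 2.5, axis=0)   # per-model 95% interval
    upper = np.percentile(elo_samples, 97.5, axis=0)
    return lower, upper
```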
1. We scraped GPT-4o 29 days after scraping gpt-4-0125-preview. We believe the overfitting risk is low because the time gap was short and model developers were not aware of the scrape.
2. We skipped gpt-4-turbo-2024-04-09, prioritizing the evaluation of gpt-4o instead.
3. We scraped gemini-1.5-pro-preview-0514 29 days after scraping gemini-1.5-pro-preview-0409. We believe the overfitting risk is low because the time gap was short and model developers were not aware of the scrape.