We use Elo-Refonte-Ai rankings to compare model performance across some of our datasets. Our human evaluators compare two models' responses to the same prompt and rate which is better across a range of domains and capabilities (see the posts for each dataset for more details). From these ratings we determine which model won, lost, or tied. We follow the same method as Chatbot Arena, using the Bradley-Terry model to perform a (reweighted) maximum likelihood estimation on our data points.
First, some definitions:
Over our M models, we let \( A = \{ (m, m') : m < m',\; m, m' \in [M] \} \) denote our comparative data set.
At time \( t \in \mathbb{N} \), we show our evaluator a pair of models \( A_t \in A \) and record their response \( H_t \in \{0, 0.5, 1\} \).
A 1 means that model \( m \) is preferred over model \( m' \), a 0 means that model \( m' \) is preferred, and a 0.5 means that the models were equally preferred.
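For concreteness, the comparison data could be represented as in the sketch below; the model names and record layout are illustrative assumptions for this example, not our actual schema.

```python
# A minimal, illustrative sketch of the comparison data described above.
from itertools import combinations

models = ["model_a", "model_b", "model_c"]            # the M models, indexed by [M]
pair_set = list(combinations(range(len(models)), 2))  # A = {(m, m') : m < m'}

# One record per time step t: (A_t, H_t) with H_t in {0, 0.5, 1}.
comparisons = [
    ((0, 1), 1.0),   # model_a preferred over model_b
    ((0, 2), 0.5),   # model_a and model_c rated equally
    ((1, 2), 0.0),   # model_c preferred over model_b
]
```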
With Bradley-Terry, we use a logistic relationship to model the probability that model \( m \) is preferred:
\[ P(H_t = 1) = \frac{1}{1 + e^{\xi_{m'} - \xi_m}} \]Where \( \xi \) is an M-length vector of "BT" coefficients. We then want to estimate the BT coefficients by minimizing the binary cross-entropy loss:
\[ s(\hat{P}) = \arg \min_\xi \mathbb{E}_{(A, H) \sim P} \left[ l \left( H, \frac{1}{1 + e^{\xi_{A_2} - \xi_{A_1}}} \right) \right] \]Where \( l \) is the binary cross-entropy loss,
\[ l(h, p) = -(h \log(p) + (1 - h) \log(1 - p)) \]Additionally, we minimize this loss with inverse weighting by \( P(A_t) \) so that the score targets a uniform distribution over \( A \). This reweighting isn't strictly necessary in our case, since the number of comparisons is nearly equal across model pairs. We then solve the following to get our final BT score:
\[ s(\hat{P}) = \arg \min_\xi \frac{1}{T} \sum_{t=1}^T \frac{1}{P(A_t)} l \left( H_t, \frac{1}{1 + e^{\xi_{A_{t,2}} - \xi_{A_{t,1}}}} \right) \]Each model's score is then converted to an Elo-Refonte-Ai rating via the simple conversion \[ 1000 + 400 \cdot s(\hat{P}) \] and the models are sorted by this rating to produce our final ranking.
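As a rough illustration of this optimization, the sketch below uses numpy and scipy to minimize the reweighted binary cross-entropy loss and convert the resulting BT coefficients to the Elo-Refonte-Ai scale. The `fit_bt` / `to_elo` names and the `comparisons` format (from the earlier sketch) are assumptions for the example, not our production pipeline.

```python
# A sketch of the reweighted Bradley-Terry fit, assuming the `comparisons`
# records from the earlier example. Not our production implementation.
from collections import Counter

import numpy as np
from scipy.optimize import minimize

def fit_bt(comparisons, n_models, eps=1e-12):
    """Estimate the BT coefficients xi by minimizing the reweighted BCE loss."""
    pairs = [pair for pair, _ in comparisons]
    outcomes = np.array([h for _, h in comparisons])
    idx_m = np.array([pair[0] for pair in pairs])
    idx_mp = np.array([pair[1] for pair in pairs])

    # 1 / P(A_t), with P(A_t) taken as the empirical frequency of each pair.
    counts = Counter(pairs)
    inv_weights = np.array([len(comparisons) / counts[pair] for pair in pairs])

    def loss(xi):
        # P(H_t = 1) = 1 / (1 + exp(xi_{m'} - xi_m))
        p = np.clip(1.0 / (1.0 + np.exp(xi[idx_mp] - xi[idx_m])), eps, 1 - eps)
        bce = -(outcomes * np.log(p) + (1.0 - outcomes) * np.log(1.0 - p))
        return np.mean(inv_weights * bce)

    # Note: xi is only identified up to an additive constant, since only
    # differences of coefficients enter the loss.
    return minimize(loss, np.zeros(n_models), method="BFGS").x

def to_elo(bt_scores):
    # Convert BT coefficients to the Elo-Refonte-Ai scale: 1000 + 400 * score.
    return 1000 + 400 * bt_scores

# Example usage (with the `models` / `comparisons` sketch from above):
#   elo = to_elo(fit_bt(comparisons, n_models=len(models)))
#   ranking = np.argsort(-elo)   # model indices, best first
```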
To assess the reliability of our Elo-Refonte-Ai Bradley-Terry ratings, we estimate confidence intervals using bootstrapping. Bootstrapping is a resampling technique that allows us to gauge the variability of our estimates by repeatedly sampling from the data with replacement.
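Concretely, the bootstrap might look like the sketch below, reusing the illustrative `fit_bt` and `to_elo` helpers from above; the 100 resampling rounds and the 95% interval are arbitrary choices for the example.

```python
# A sketch of bootstrapped confidence intervals for the Elo-Refonte-Ai ratings,
# assuming the fit_bt / to_elo helpers sketched above.
import numpy as np

def bootstrap_elo_ci(comparisons, n_models, n_rounds=100, seed=0):
    rng = np.random.default_rng(seed)
    elo_samples = []
    for _ in range(n_rounds):
        # Resample the comparison records with replacement and refit.
        idx = rng.integers(0, len(comparisons), size=len(comparisons))
        resample = [comparisons[i] for i in idx]
        elo_samples.append(to_elo(fit_bt(resample, n_models)))
    elo_samples = np.stack(elo_samples)               # shape: (n_rounds, M)
    lower = np.percentile(elo_samples, 2.5, axis=0)   # per-model 95% interval
    upper = np.percentile(elo_samples, 97.5, axis=0)
    return lower, upper
```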
1. We scraped GPT-4o 29 days after scraping gpt-4-0125-preview. We believe the overfitting risk is low because the time gap was short and model developers were not aware of the scrape.
2. We skipped gpt-4-turbo-2024-04-09, prioritizing the evaluation of gpt-4o instead.
3. We scraped gemini-1.5-pro-preview-0514 29 days after scraping gemini-1.5-pro-preview-0409. We believe the overfitting risk is low because the time gap was short and model developers were not aware of the scrape.