LLM Leaderboards

Expert-Driven Private Evaluations

Discover the SEAL LLM Leaderboards for precise and reliable LLM rankings, where leading large language models (LLMs) are evaluated using a rigorous methodology.

Developed by Refonte Ai's Safety, Evaluations, and Alignment Lab (SEAL), these leaderboards use private datasets to keep results fair and uncontaminated. Regular updates keep the leaderboards current with the latest AI advancements, making them an essential resource for understanding the performance and safety of top LLMs.

Private Datasets

Refonte Ai's proprietary, private evaluation datasets can’t be gamed, ensuring unbiased and uncontaminated results.

Evolving Competition

We periodically update leaderboards with new datasets and models, fostering a dynamic, contest-like environment.

Expert Evaluations

Our evaluations are performed by thoroughly vetted experts using domain-specific methodologies, ensuring the highest quality and credibility.

Learn more about our LLM evaluation methodology

Agentic Tool Use (Chat)

Rank	Model	Score	95% Confidence Interval
1	GPT-4o (August 2024)	56.85	+6.92/-6.92
2	Claude 3.5 Sonnet	56.06	+6.91/-6.91
3	o1-preview	55.10	+6.96/-6.96
4	GPT-4 Turbo Preview	53.03	+6.95/-6.95
5	Gemini 1.5 Pro (August 27, 2024)	51.27	+6.98/-6.98
6	GPT-4o (May 2024)	49.50	+6.96/-6.96
7	Claude 3 Opus	48.49	+6.96/-6.96
8	Claude 3 Sonnet	40.40	+6.84/-6.84
9	Mistral Large 2	40.40	+6.84/-6.84
10	Llama 3.1 405B Instruct	40.10	+6.84/-6.84
11	GPT-4	37.88	+6.78/-6.78
12	Gemini 1.5 Pro (May 2024)	35.50	+6.57/-6.68
13	Llama 3.1 70B Instruct	33.50	+6.59/-6.59
14	GPT-4o mini	32.83	+6.54/-6.54
15	Command R+	20.20	+5.59/-5.59
16	Llama 3.1 8B Instruct	6.09	+3.34/-3.34
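
When reading these tables, it helps to check whether two models' 95% confidence intervals overlap before treating a rank difference as meaningful. The sketch below is illustrative only: the `Entry` type and `intervals_overlap` helper are hypothetical names, not part of any SEAL tooling, and the figures are copied from the Agentic Tool Use (Chat) table above. Overlap is a rough heuristic, not a formal significance test.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    """One leaderboard row: score plus its asymmetric interval offsets."""
    model: str
    score: float
    plus: float   # upper offset, e.g. 6.92 for "+6.92"
    minus: float  # lower offset, e.g. 6.92 for "-6.92"

    @property
    def low(self) -> float:
        return self.score - self.minus

    @property
    def high(self) -> float:
        return self.score + self.plus


def intervals_overlap(a: Entry, b: Entry) -> bool:
    """Return True when the two confidence intervals share any value."""
    return a.low <= b.high and b.low <= a.high


# Values taken from the Agentic Tool Use (Chat) table.
top = Entry("GPT-4o (August 2024)", 56.85, 6.92, 6.92)
second = Entry("Claude 3.5 Sonnet", 56.06, 6.91, 6.91)
last = Entry("Llama 3.1 8B Instruct", 6.09, 3.34, 3.34)

for other in (second, last):
    verdict = "overlaps" if intervals_overlap(top, other) else "is separated from"
    print(f"{top.model} {verdict} {other.model}")
```

Under this heuristic, the first- and second-ranked models are not clearly separated, while the gap between the first and last rows is well outside both intervals.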

Agentic Tool Use (Enterprise)

Rank	Model	Score	95% Confidence Interval
1	o1-preview	66.43	+5.47/-5.47
2	GPT-4o (May 2024)	64.58	+5.52/-5.52
3	GPT-4 Turbo Preview	60.76	+5.64/-5.64
4	Gemini 1.5 Pro (August 27, 2024)	60.28	+5.66/-5.66
5	GPT-4o (August 2024)	59.93	+5.67/-5.67
6	Claude 3.5 Sonnet	59.38	+5.67/-5.67
7	Claude 3 Sonnet	54.17	+5.78/-5.78
8	Claude 3 Opus	52.78	+5.77/-5.78
9	GPT-4o mini	51.74	+5.77/-5.77
10	GPT-4	51.39	+5.77/-5.77
11	Mistral Large 2	50.35	+5.78/-5.78
11	Llama 3.1 405B Instruct	50.35	+5.78/-5.78
13	Gemini 1.5 Pro (May 2024)	40.42	+5.68/-5.68
14	Llama 3.1 70B Instruct	37.23	+5.60/-5.60
15	Command R+	30.21	+5.30/-5.30
16	Llama 3.1 8B Instruct	17.42	+4.39/-4.39

Adversarial Robustness

Rank	Model	Number of Violations	95% Confidence Interval
1	Gemini 1.5 Pro (May 2024)	8	+8/-4
2	Llama 3.1 405B Instruct	10	+8/-5
3	Claude 3 Opus	13	+9/-5
4	Gemini 1.5 Flash	14	+9/-6
5	Claude 3.5 Sonnet	16	+10/-6
6	GPT-4 Turbo Preview	20	+11/-7
7	Mistral Large	37	+14/-10
8	GPT-4o (May 2024)	67	+17/-14

Coding

Rank	Model	Score	95% Confidence Interval
1	Claude 3.5 Sonnet	1143	+27/-26
2	GPT-4o (August 2024)	1116	+31/-31
3	Mistral Large 2	1110	+33/-34
4	Gemini 1.5 Pro (August 27, 2024)	1110	+35/-35
5	GPT-4o (May 2024)	1105	+24/-26
6	GPT-4 Turbo Preview	1104	+24/-22
7	Llama 3.1 405B Instruct	1093	+31/-27
8	Gemini 1.5 Pro (May 2024)	1054	+25/-26
9	Claude 3 Opus	1019	+23/-25
10	Gemini 1.5 Flash	1001	+25/-25
11	Gemini 1.5 Pro (April 2024)	959	+27/-26
12	Claude 3 Sonnet	945	+26/-28
13	Llama 3 70B Instruct	941	+24/-23
14	Mistral Large	881	+24/-25
15	Gemini 1.0 Pro	755	+30/-32
16	CodeLlama 34B Instruct	665	+33/-35

Instruction Following

Rank	Model	Score	95% Confidence Interval
1	Claude 3.5 Sonnet	87.60	+1.64/-1.63
2	Llama 3.1 405B Instruct	86.83	+1.67/-1.66
3	Gemini 1.5 Pro (August 27, 2024)	85.50	+2.09/-2.08
4	GPT-4o (May 2024)	85.29	+1.61/-1.61
5	GPT-4 Turbo Preview	84.23	+1.52/-1.51
6	Mistral Large 2	83.91	+2.07/-2.06
7	Llama 3 70B Instruct	81.85	+1.96/-1.96
8	GPT-4o (August 2024)	81.82	+1.94/-1.94
9	Gemini 1.5 Pro (May 2024)	80.77	+1.84/-1.83
10	Mistral Large	80.49	+1.72/-1.72
11	Claude 3 Opus	79.89	+1.69/-1.69
12	Gemini 1.5 Pro (April 2024)	78.52	+2.33/-2.32
13	Claude 3 Sonnet	78.24	+2.19/-2.19
14	Gemini 1.5 Flash	77.25	+1.96/-1.97
15	Gemini 1.0 Pro	67.97	+2.61/-2.62
16	CodeLlama 34B Instruct	57.69	+2.58/-2.57

Math

Rank	Model	Score	95% Confidence Interval
1	Claude 3.5 Sonnet	96.60	+1.02/-1.02
2	GPT-4o (August 2024)	95.68	+1.15/-1.15
3	Llama 3.1 405B Instruct	95.60	+1.16/-1.16
4	Claude 3 Opus	95.19	+1.21/-1.21
5	GPT-4 Turbo Preview	95.10	+1.22/-1.22
6	GPT-4o (May 2024)	94.85	+1.25/-1.25
7	Gemini 1.5 Pro (August 27, 2024)	94.69	+1.27/-1.27
8	Mistral Large 2	93.94	+1.35/-1.35
9	Claude 3 Sonnet	93.28	+1.41/-1.41
10	Gemini 1.5 Pro (May 2024)	92.28	+1.51/-1.51
11	Gemini 1.5 Pro (April 2024)	90.54	+1.65/-1.65
12	Llama 3 70B Instruct	90.12	+1.69/-1.69
12	Gemini 1.5 Flash	90.12	+1.69/-1.69
14	Mistral Large	87.47	+1.87/-1.87
15	Gemini 1.0 Pro	79.83	+2.27/-2.27
16	CodeLlama 34B Instruct	37.51	+2.73/-2.73
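
This page does not say how the 95% intervals are computed. For intuition only, the sketch below shows one common way to get a symmetric interval for a percentage score: the normal (Wald) approximation for a binomial proportion. The choice of method, the `wald_half_width` function, and the prompt counts are all assumptions made for illustration, not SEAL's documented procedure.

```python
import math

Z_95 = 1.959964  # two-sided 95% quantile of the standard normal


def wald_half_width(score_pct: float, n_prompts: int) -> float:
    """Half-width, in percentage points, of a Wald interval.

    score_pct is an accuracy-style score in [0, 100]; n_prompts is a
    hypothetical number of independently scored prompts.
    """
    p = score_pct / 100.0
    return 100.0 * Z_95 * math.sqrt(p * (1.0 - p) / n_prompts)


# Hypothetical prompt counts, chosen only to show how the interval shrinks.
for n in (100, 400, 1600):
    half = wald_half_width(50.0, n)
    print(f"n={n}: 50.00 +{half:.2f}/-{half:.2f}")
```

Two properties of this approximation match the general pattern in the tables: intervals narrow as scores move toward 0 or 100, and quadrupling the number of prompts halves the interval width.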

Spanish

Rank	Model	Score	95% Confidence Interval
1	GPT-4o (May 2024)	1121	+25/-24
2	Gemini 1.5 Pro (May 2024)	1109	+28/-27
3	GPT-4o (August 2024)	1105	+34/-37
4	Mistral Large 2	1073	+38/-35
5	GPT-4 Turbo Preview	1064	+22/-21
6	Gemini 1.5 Pro (April 2024)	1058	+27/-27
7	Claude 3.5 Sonnet	1014	+36/-42
8	Gemini 1.5 Flash	1009	+28/-27
9	Llama 3.1 405B Instruct	955	+26/-28
10	Claude 3 Opus	949	+21/-23
11	Llama 3 70B Instruct	917	+26/-26
12	Mistral Large	882	+27/-26
13	Gemini 1.0 Pro	882	+27/-28
14	Claude 3 Sonnet	881	+27/-29

If you’d like to add your model to this leaderboard or a future version, please contact seal@RefonteAi.com. To preserve leaderboard integrity, a model can be featured only the first time its organization encounters the prompts.