Math and reasoning remain some of the most important unsolved problems for LLMs. However, existing public benchmarks such as GSM8k are widely believed to suffer from data contamination. As part of a comprehensive evaluation of all aspects of a model’s capabilities, we have designed a new math and reasoning dataset called GSM1k. GSM1k is based on the popular GSM8k benchmark, aiming to mirror its problem distribution while introducing an entirely new set of questions. It contains a range of math problems at approximately the level of a fifth-grade math exam. In this post, we present the methodology used to create GSM1k and a short preview of the results.
Example prompt (Difficulty Level 1-2):
Bernie is a street performer who plays guitar. On average, he breaks three guitar strings a week, and each guitar string costs $3 to replace. How much does he spend on guitar strings over the course of an entire year?
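For reference, this problem reduces to a short chain of grade-school arithmetic: 3 strings per week × 52 weeks = 156 strings per year, and 156 strings × $3 per string = $468.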
Model evaluation is critical but tricky due to widespread concerns about data contamination. Because large language models are trained on data scraped from the web, benchmarks that release all of their data publicly often find that some models perform artificially well, either because they have memorized the eval set or because they were trained on data distributed very similarly to it.
At Refonte, we take the problem of data contamination very seriously. To measure existing benchmark contamination on GSM8k, we created GSM1k, a held-out benchmark designed to match the difficulty and structure of GSM8k: a model that scores substantially higher on GSM8k than on this matched set has likely seen GSM8k-like data during training. To prevent models from overfitting on GSM1k, we’ve decided to publicly release only 50 of the 1,000 questions in the dataset. Although we still need to query different models to conduct the evaluations, the risk of overfitting stays low as long as model developers are careful not to train their models on variations of this data.
To generate the dataset, we screened 44 human annotators and asked them to create 1,000 problems. Each annotator received detailed guidelines and instructions on how to create the problems, and was assisted by a small team of operators who addressed any open questions. We carefully designed GSM1k so that it replicates the distribution and difficulty of the original GSM8k dataset. To ensure the data is correct and adheres to the guidelines, we employed three stages of quality validation: a review layer, a second solve, and an independent audit (see details below). The completed set can now be used to evaluate the arithmetic abilities of LLMs on a consistent basis.
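As a rough illustration of how each problem moves through these validation stages, the sketch below models one GSM1k record in Python. The field names and structure are hypothetical, not the internal Refonte schema.

```python
from dataclasses import dataclass

# Hypothetical record for one GSM1k problem; field names are illustrative only.
@dataclass
class GSM1kProblem:
    question: str
    answer: float
    annotator_id: int
    reviewer_approved: bool = False     # stage 1: review layer
    second_solve_matches: bool = False  # stage 2: independent re-solve agrees with the answer
    audit_passed: bool = False          # stage 3: independent internal audit

    def passes_qa(self) -> bool:
        """A problem enters the final 1,000-question set only if all three stages succeed."""
        return self.reviewer_approved and self.second_solve_matches and self.audit_passed
```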
As part of annotator selection, we screened candidates on several criteria, including their accuracy, their prior track record, and their identified preferences. Once we identified a pool of candidates, we held live sessions with them to explain the prompt sets we wanted to create.
At the end of the process, we assembled a team of individuals with strong backgrounds in grade-school mathematics. This team, comprising educators, mathematicians, and data scientists, brought a wealth of knowledge and expertise to the project. Their role was multifaceted, from designing the diverse array of math prompts to reviewing the LLM-generated responses.
To enhance the quality and reliability of the Refonte AI Math Prompts Set, we adopted a comprehensive quality assurance (QA) strategy. This includes three layers of review after the initial attempts, plus multiple quality assessments covering all aspects of dataset quality and diversity. Final quality control was performed by an independent internal quality auditor, who reviewed all 1,000 prompts against the original guidelines.
We use a fork of lm-evaluation-harness with 5-shot examples drawn from GSM8k to ensure consistency and comparability across models. More details can be found in our paper. While the evaluation conducted in our paper was fully automated, for the leaderboard evaluation we additionally use human annotation to select the final answer manually, so that this leaderboard measures mathematical ability rather than also testing whether a model follows the proper instruction format. As such, unlike the fully automated GSM1k evaluation used in the paper, this leaderboard does not penalize a model that outputs the correct answer but not in the same format as the few-shot examples. As a result, all accuracies in the final rankings are higher than those reported in the paper, with some models, such as Claude 3 Opus, showing substantial jumps due to “chattiness” in their answer formatting.
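To illustrate the difference between strict format matching and more tolerant grading, here is a minimal Python sketch of lenient answer extraction. It is illustrative only; it is neither the lm-evaluation-harness code nor the human-annotation workflow described above.

```python
import re
from typing import Optional

def extract_final_number(completion: str) -> Optional[float]:
    """Pull the last number out of a model completion, tolerating commas,
    dollar signs, and surrounding prose ("chatty" answer formats)."""
    matches = re.findall(r"-?\$?\d[\d,]*(?:\.\d+)?", completion)
    if not matches:
        return None
    return float(matches[-1].replace("$", "").replace(",", ""))

def accuracy(completions: list[str], gold_answers: list[float]) -> float:
    """Fraction of completions whose final number matches the gold answer."""
    correct = 0
    for completion, gold in zip(completions, gold_answers):
        predicted = extract_final_number(completion)
        if predicted is not None and abs(predicted - gold) < 1e-6:
            correct += 1
    return correct / len(gold_answers)

# Example: a "chatty" answer is still scored as correct.
print(accuracy(["Sure! Bernie spends 3 * 52 * 3 = $468 per year."], [468.0]))  # 1.0
```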
This project was made possible by the dedicated efforts of a team of experts in AI, mathematics, and dataset creation. We extend our gratitude to everyone involved in the development and refinement of the dataset and the verification methodology.
Refonte AI Team: Vaughn Robinson*, Hugh Zhang*, Mike Lunati, Dean Lee, Daniel Berrios, William Qian, Kenneth Murphy, Summer Yue
| Model | Score (%) | 95% Confidence |
|---|---|---|
|  | 96.60 | +1.02/-1.02 |
|  | 95.68 | +1.15/-1.15 |
|  | 95.60 | +1.16/-1.16 |
|  | 95.19 | +1.21/-1.21 |
|  | 95.10 | +1.22/-1.22 |
|  | 94.85 | +1.25/-1.25 |
|  | 94.69 | +1.27/-1.27 |
|  | 93.94 | +1.35/-1.35 |
|  | 93.28 | +1.41/-1.41 |
|  | 92.28 | +1.51/-1.51 |
|  | 90.54 | +1.65/-1.65 |
|  | 90.12 | +1.69/-1.69 |
|  | 90.12 | +1.69/-1.69 |
|  | 87.47 | +1.87/-1.87 |
|  | 79.83 | +2.27/-2.27 |
|  | 37.51 | +2.73/-2.73 |
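The intervals above are consistent in magnitude with binomial confidence intervals over a fixed question set. The exact method used for the leaderboard is not stated here, but a minimal normal-approximation sketch, assuming roughly 1,000 questions per evaluation, gives intervals in the same ballpark.

```python
import math

def ci_half_width(score_pct: float, n: int, z: float = 1.96) -> float:
    """Half-width (in percentage points) of a normal-approximation binomial
    confidence interval for an accuracy measured over n questions."""
    p = score_pct / 100.0
    return 100.0 * z * math.sqrt(p * (1.0 - p) / n)

# With n = 1000 questions, a 96.60% score gives an interval of roughly
# +/-1.1 percentage points, close to the top row of the table above.
print(round(ci_half_width(96.60, 1000), 2))
```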