The Precise Instruction Following Prompts Dataset is composed of 1,054 instruction-following prompts aimed at assessing the ability of AI models to interpret and execute detailed commands, with a focus on precision and specificity.
A popular method for assessing LLMs on instruction-following tasks is the IFEval benchmark, which evaluates LLMs using prompts containing programmatically verifiable instructions. However, the scenarios in this benchmark are limited by the requirement that they be automatically evaluable. Additionally, like other open-source benchmarks, IFEval is prone to overfitting.
To address these limitations, we built the Refonte AI Precise Instruction Following Prompts Dataset. This is a set of private instruction-following prompts intended to be paired with human evaluations. The dataset includes 1,054 instruction-following prompts grouped into 9 categories, including “act as if”, content creation, and brainstorming, covering real applications and use cases for instruction-following tasks. It was generated by a diverse group of over 40 human annotators and developed through a five-step process to ensure the final prompts test the model’s capability to understand and execute instructions with specificity. The ultimate intent is to run human evaluations on models’ responses to this prompt set.
The dataset comprises 1,054 prompts designed for single-turn instruction following, intended to evaluate nuanced directive comprehension by the model. It tests the model's ability to execute complex instructions with clarity and specificity.
Constructing this dataset posed challenges in maintaining prompt diversity and specificity, which we addressed through a review process that ensured each prompt’s uniqueness. The dataset covers a broad spectrum of 9 categories, ensuring diversity in instruction-following tasks.
Category | Definition | # of prompts |
---|---|---|
Generation - General Text Creation | Tasks involve creating original content such as text messages, recipes, jokes, and essays. | 385 |
Generation - Content Creation | Subdivided into poetry, narrative fiction, social media posts and other. | 232 |
Generation - "Act as if…" | Tasks require responses from specific personas, enhancing the AI's adaptability and creative output. | 89 |
Brainstorming - Short answers | Output a short list within 3-5 items, explicitly asking for a concise enumeration of ideas or options. | 97 |
Brainstorming - Long answers | Require a longer list within 15-20 items, focusing on breadth and inclusivity of ideas or options, with each item being succinct. | 96 |
Brainstorming - With a format | Prompts specify the output formatting, such as dashes, bullet points, or enumeration, guiding the structure of the response. | 97 |
Example prompt (Category: Generation - General Text Creation)

User: Design a seven-day food plan that is both varied and nutrient-dense, including three vegetarian meals per day. Make sure that every meal has a high protein level and includes a minimum of one fruit or vegetable. Furthermore, make sure that one of the meals has a serving of dairy each day, add whole grains to another, and keep lentils and tomatoes out of every meal.
Based on the evaluation of prompt effectiveness, we formulated guidelines for the annotators, emphasizing clarity, complexity, and specificity. A high-quality prompt includes distinct, non-repetitive elements that direct the model to a precise outcome, eliminating the possibility of vague or generic responses. Effective prompts define clear goals with specific conditions or constraints, challenging the model to employ deep reasoning and problem-solving skills.
In contrast, less effective prompts lack the specificity or challenge necessary to extend the model’s capabilities beyond basic tasks. Our screening process therefore excludes prompts that are easily solvable via simple internet searches or do not require sophisticated AI responses.
To maintain quality, we implemented a multi-stage review pipeline: each prompt underwent two expert reviews to ensure adherence to the guidelines. An internal team of independent auditors conducted a final quality review, correcting or discarding low-quality entries.
To capture a nuanced assessment, we created an evaluation taxonomy specific to precise instruction-following tasks. Each model response was evaluated across a set of stand-alone criteria covering each of the use cases, and side-by-side with another model’s response to measure preference ranking on a 7-point Likert scale.
These dimensions are broken down into 12 criteria, each rated with a Yes or No score.
After the stand-alone evaluation, responses are compared side-by-side using a Likert scale. This comparative assessment helps identify the preferable model response, based on a detailed justification tied to the evaluation criteria.
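To make the structure of these annotations concrete, here is a minimal sketch of how the two kinds of records could be represented. Only the dimension names (prompt adherence, relevance, writing, honesty/factuality), the two adherence criteria named later, and the 7-point Likert comparison come from the report; the class names, field names, and the grouping-by-prefix convention are our own assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class StandaloneEvaluation:
    """One annotator's stand-alone rating of a single model response."""
    prompt_id: str
    model: str
    # Yes/No criteria keyed by name, e.g.
    # "Prompt Adherence - Main Request Fulfillment": True
    criteria: Dict[str, bool] = field(default_factory=dict)

@dataclass
class SideBySideEvaluation:
    """One annotator's pairwise comparison of two model responses."""
    prompt_id: str
    model_a: str
    model_b: str
    # 7-point Likert preference: -3 (strongly prefer A) .. +3 (strongly prefer B)
    likert: int = 0
    justification: str = ""

def dimension_scores(evals):
    """Fraction of Yes answers per dimension, assuming a criterion's dimension
    is the text before the first ' - ' in its name (an assumption based on
    names like 'Prompt Adherence - Constraint Fulfillment')."""
    totals, yeses = {}, {}
    for ev in evals:
        for name, answer in ev.criteria.items():
            dim = name.split(" - ")[0]
            totals[dim] = totals.get(dim, 0) + 1
            yeses[dim] = yeses.get(dim, 0) + int(answer)
    return {dim: yeses[dim] / totals[dim] for dim in totals}
```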
Alternative Rankings
During our evaluation process, we collected both pairwise rankings and ratings based on individual model responses. For the main leaderboard, we report aggregated instruction-following scores, averaging the “prompt adherence” and “relevance” fields, to focus on the models' instruction-following abilities. We consider a response as following the instructions if it “answers what the prompt requests without violating explicit constraints while not adding irrelevant content”.
In this section, we detail how the rankings would vary if we used the Bradley-Terry scores (on an Elo scale) or only considered the writing or honesty/factuality fields. The Bradley-Terry ranking can be thought of as an overall preference ranking; we then take a deeper look at other important dimensions, such as writing and honesty/factuality, to better understand why the Bradley-Terry scores and pure instruction-following abilities differ.
Bradley-Terry scores Leaderboard
In addition to instruction-following ratings, we also collected pairwise rankings measuring the annotators’ preference between two model responses on these instruction-following prompts. This ranking also takes into account considerations beyond instruction following, such as factuality and general preferences on style and writing quality. For example, we notice that the Gemini 1.5 Pro (August 27, 2024) model moves up in the rankings here, possibly due to its higher ranking in “writing”, which is not taken into account in the pure instruction-following leaderboard. o1-preview ranks first, while Llama 3.2 90B Vision Instruct ranks tenth.
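The report does not specify how the Bradley-Terry scores are fit from the pairwise preferences or how they are mapped to an Elo scale. The sketch below shows one standard approach, not necessarily the one used here: the MM (Zermelo) iteration over pairwise win counts, followed by a 400·log10 transform onto an Elo-like scale. All names, the iteration count, and the toy counts at the end are our own.

```python
import math
from collections import defaultdict

def bradley_terry_elo(pairwise_wins, iters=200):
    """Fit Bradley-Terry strengths from pairwise win counts with the standard
    MM (Zermelo) iteration, then map them to an Elo-like scale.

    pairwise_wins: dict mapping (winner, loser) -> number of times the first
    model was preferred over the second (ties ignored in this sketch).
    """
    models = {m for pair in pairwise_wins for m in pair}
    strength = {m: 1.0 for m in models}
    wins = defaultdict(float)      # total wins per model
    games = defaultdict(float)     # total comparisons per unordered pair
    for (w, l), n in pairwise_wins.items():
        wins[w] += n
        games[frozenset((w, l))] += n

    for _ in range(iters):
        new = {}
        for m in models:
            denom = 0.0
            for pair, n in games.items():
                if m in pair:
                    other = next(x for x in pair if x != m)
                    denom += n / (strength[m] + strength[other])
            # small floor keeps models with zero wins finite on the log scale
            new[m] = max(wins[m], 1e-6) / denom if denom else strength[m]
        mean = sum(new.values()) / len(new)
        strength = {m: s / mean for m, s in new.items()}

    # 400 * log10(strength) puts the ratings on an Elo-like scale;
    # any additive offset (e.g., +1000) is an arbitrary anchoring choice.
    return {m: 400.0 * math.log10(s) for m, s in strength.items()}

# Toy usage with made-up model names and counts:
ratings = bradley_terry_elo({("model_a", "model_b"): 30,
                             ("model_b", "model_a"): 10,
                             ("model_b", "model_c"): 25,
                             ("model_c", "model_a"): 12})
```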
Writing Leaderboard
Instruction following “Honesty” Leaderboard
Similar to the instruction-following ratings, we also collected an “honesty/factuality” rating indicating whether the model generated any non-factual statements in its response.
Although this is not a comprehensive factuality leaderboard, as it only considers factuality within the context of instruction-following queries, we believe it still provides valuable insights into different model behaviors for the community.
Our findings indicate that GPT-4 Turbo Preview is the most factual model, followed by Gemini 1.5 Pro (May 2024) and o1-preview.
In addition to the alternative rankings, we also provide detailed performance reports for each model across all our subcategories of ratings. We highlight the top strengths and weaknesses for each model on the leaderboard. For more details, please refer to Appendix A.
This project was made possible by the dedicated efforts of a team of expert annotators. We extend our gratitude to everyone involved in the development and refinement of the dataset and the verification methodology.
Refonte AI Team: Ernesto Hernandez*, Mike Lunati*, Dean Lee, Cristina Menghini, Diego Mares, Daniel Berrios, William Qian, Kenneth Murphy, Summer Yue, Darwin Hsu, David Guevara
Models across all criteria
Instruction following “Prompt Adherence” without “Relevance” Leaderboard
In the main leaderboard, we report the aggregated instruction-following scores, averaging the “prompt adherence” and “relevance” fields. We consider a response as following the instructions if it “answers what the prompt requests without violating explicit constraints while not adding irrelevant content”.
In this appendix, we also report the alternative ranking if we only consider “prompt adherence” without considering “relevance”. “Adherence” here is defined as the annotator answering Yes to both “Prompt Adherence - Main Request Fulfillment” and “Prompt Adherence - Constraint Fulfillment”. In other words, in this version, we do not penalize the model for adding content that is irrelevant to what the prompt is asking.
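The two scoring variants above can be summarized in a short sketch. The definitions (average of prompt adherence and relevance for the main leaderboard; both adherence criteria answered Yes for the appendix variant) come from the report, but the exact field names, the per-response data layout, and the 0-100 scaling are assumptions on our part.

```python
def aggregated_if_score(ratings):
    """Main-leaderboard score: average of the 'prompt adherence' and
    'relevance' Yes/No fields over all rated responses, on a 0-100 scale."""
    vals = [(r["prompt_adherence"] + r["relevance"]) / 2 for r in ratings]
    return 100.0 * sum(vals) / len(vals)

def adherence_only_score(ratings):
    """Appendix variant: a response counts as adherent only when both
    'Main Request Fulfillment' and 'Constraint Fulfillment' are Yes;
    'relevance' is ignored, so irrelevant extra content is not penalized."""
    hits = [r["main_request_fulfillment"] and r["constraint_fulfillment"]
            for r in ratings]
    return 100.0 * sum(hits) / len(hits)
```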
Constraints and Accuracy per Category Breakdown
Rank | Model | Score | 95% Confidence |
---|---|---|---|
1st | | 87.32 | +1.71/-1.71 |
2nd | | 87.09 | +1.51/-1.52 |
3rd | | 86.01 | +1.54/-1.53 |
4th | | 85.29 | +1.61/-1.61 |
5th | | 85.09 | +1.83/-1.83 |
6th | | 84.63 | +1.81/-1.82 |
7th | | 83.87 | +1.42/-1.43 |
8th | | 83.72 | +1.88/-1.88 |
9th | | 81.85 | +1.96/-1.96 |
10th | | 81.32 | +1.75/-1.75 |
11th | | 80.77 | +1.84/-1.83 |
12th | | 80.49 | +1.72/-1.72 |
13th | | 80.03 | +1.57/-1.58 |
14th | | 78.52 | +2.33/-2.32 |
15th | | 78.24 | +2.19/-2.19 |
16th | | 77.25 | +1.96/-1.97 |
17th | | 67.97 | +2.61/-2.62 |
18th | | 57.69 | +2.58/-2.57 |
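The report does not state how the 95% confidence intervals in the table above are computed. A common choice for means of per-prompt Yes/No ratings is a nonparametric bootstrap over prompts; the sketch below shows that approach purely as one plausible method, with function names and defaults of our own choosing.

```python
import random

def bootstrap_ci(per_prompt_scores, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for a model's mean score,
    resampling prompts with replacement. Returns (mean, minus, plus) so the
    interval can be reported as 'mean +plus/-minus' like the table above."""
    rng = random.Random(seed)
    n = len(per_prompt_scores)
    means = sorted(
        sum(per_prompt_scores[rng.randrange(n)] for _ in range(n)) / n
        for _ in range(n_boot)
    )
    lo = means[int(n_boot * alpha / 2)]
    hi = means[int(n_boot * (1 - alpha / 2)) - 1]
    mean = sum(per_prompt_scores) / n
    return mean, mean - lo, hi - mean
```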