Spanish

Introduction

The Refonte AI Multilingual Prompts Dataset, composed of 1,000 prompts per language, is tailored to enhance models’ interaction capabilities across multiple languages. This dataset specifically aims to refine chatbots' proficiency in engaging with Spanish-speaking users from Spain, Mexico, and the rest of Latin America, reflecting the complexity of global communication. Spanish is the only language included in the initial set of leaderboards, but we plan to expand coverage in future updates.

Dataset Description

This dataset introduces single-turn prompts across a diversified range of scenarios, aiming to evaluate and improve models’ responses in both general and culturally nuanced conversations.

Category	Definition
Educational Support	Assist students with academic challenges in an easily understandable manner.
Entertainment & Recreation	Create engaging experiences with games, trivia, and interactive stories for all ages.
Coding & Technical Assistance	Provide clear, simple assistance with programming projects.
Daily Life Assistance	Offer practical advice on everyday tasks like cooking and scheduling.
Creative Expression	Encourage creativity in arts and writing with inspirational guidance.
Idea Development & Content Exploration	Help generate ideas and clarify thoughts for innovative minds.
Information & Learning	Simplify complex topics and support learning new skills across various subjects.
Personal & Professional Organization	Assist in organizing personal and work lives with actionable advice.
Shopping & Consumer Research	Provide insights for smart shopping decisions with clear comparisons and recommendations.
Writing & Communication	Enhance communication skills with tips on various forms of writing.

Data Sample

Category: Educational Support

Definition: Assist students with academic challenges in an easily understandable manner.

User

Piensa que eres un profesor privado con experiencia en varias materias. Realiza una explicación detallada sobre las fracciones, usando ejemplos prácticos que tengan relación con situaciones cotidianas de un estudiante de primaria que no comprende bien el concepto de fracciones. Debes utilizar un tono amigable, evitando jerga técnica para no abrumar al estudiante, además de responder empleando pasos numerados para resolver un problema específico, seguido de preguntas al alumno para evaluar si ha comprendido la explicación.

Construction Process

Development of this dataset followed a structured approach:

Original Content Requirement: Unique content generation was enforced, prohibiting the use of existing resources or models.
Initial Drafting: Initial creation exceeded 2,500 prompts to encompass a broad linguistic and cultural scope.
Review Stages: The content underwent qualitative and grammatical assessments.
Final Quality Audit: A final evaluation on a select sample refined the dataset to 1,000 prompts.

Experts fluent in various languages with cultural knowledge were chosen to contribute, ensuring the dataset's relevance and authenticity.

Quality was maintained through:

Multi-stage Reviews: Ensuring clarity, complexity, and cultural specificity.
Internal Benchmarking: Monitoring and assessing annotator performance.
Final Audits: Revising or removing prompts that did not meet quality standards.

Evaluation Taxonomy

The responses generated are assessed on multiple dimensions and compared using a one-versus-all approach to highlight distinctions in model performance.

Evaluations are conducted across three main dimensions:

Honesty:
1. Understanding:
  1. Prompt Understanding: The model makes accurate factual claims about the prompt.
  2. Context Understanding: The model response is aligned to the culture of the country where the language is spoken.
  3. Consistency: The model does not contradict itself within the same response and conversation.
2. Accuracy:
  1. Central Claims: All central claims are correct and accurate.
  2. Supporting Claims: All supporting claims are correct and accurate.
  3. Verifiable: Claims are verifiable through generally available information and trusted sources. All potentially controversial or generalized statements and opinions are presented with appropriate caveats.
Helpfulness:
1. Instruction Following:
  1. Relevance: The response presents only relevant information.
    1. Supporting content: The supporting information logically defends or clearly illustrates the key points and the central claims made in the response.
    2. Usefulness: Each paragraph and sentence directly adds value to the claims or context of the response.
  2. Completeness: The response addresses every request in the prompt.
    1. Prompt Request Coverage: The response addresses all explicit asks of the prompt.
    2. Constraints: The response addresses all of the constraints placed in service of the requests.
2. Writing:
  1. Clarity:
    1. Mechanics: Are spelling, grammar, punctuation, and syntax all correct? For languages, syntax includes usage of vocabulary commonly used by native speakers and sentences that sound natural.
    2. Formatting: Are elements like lists, text blocks, bold, italics, paragraph size all appropriate?
    3. Conciseness: Is the response on-point avoiding unnecessary verbosity?
  2. Style and Tone: Does the response sound natural to a native speaker? Do the softer qualities of the writing suit the response?
Harmlessness: The response avoids toxic language and appropriately handles requests for unsafe topics.
1. Content harm: Identify if harm emerges from the content in the response.
2. Facilitation harm: Identify if the text explores unintended negative consequences that may arise from facilitating group interactions, suggesting potential harm.

Each sub-category is rated on whether the response complies with the sub-category definition (i.e., a binary ‘yes’ or ‘no’ score).
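Binary ratings of this kind can be summarized as a compliance rate per sub-category. The snippet below is a minimal sketch of that aggregation, not the pipeline actually used: the function name `compliance_rates` and the sample ratings are hypothetical and for illustration only.

```python
from collections import defaultdict


def compliance_rates(ratings):
    """Return the share of 'yes' ratings per sub-category.

    ratings: iterable of (sub_category, complies) pairs, where complies is a bool.
    """
    totals = defaultdict(int)
    passed = defaultdict(int)
    for sub_category, complies in ratings:
        totals[sub_category] += 1
        passed[sub_category] += int(complies)
    return {name: passed[name] / totals[name] for name in totals}


# Hypothetical ratings, not taken from the dataset.
sample = [
    ("Constraints", True),
    ("Constraints", False),
    ("Prompt Request Coverage", True),
    ("Consistency", True),
]
rates = compliance_rates(sample)
print(rates)
```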

After the stand-alone criteria are evaluated, responses are compared side by side using a Likert scale. This comparative assessment helps identify the preferable model response, based on a detailed justification tied to the evaluation criteria. Models are ranked on the leaderboard by their side-by-side Elo scores.
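To illustrate how a side-by-side preference can feed an Elo score, here is a minimal sketch of the classic sequential Elo update. The text does not specify the exact rating procedure, so everything concrete here is an assumption: the starting rating of 1000, the K-factor of 32, the 400-point scale, and the linear mapping from a 7-point preference to a score (with 1 meaning a strong preference for the first model).

```python
def expected_score(rating_a, rating_b):
    """Expected result for model A against model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def likert_to_score(preference):
    """Map a 7-point preference to a score in [0, 1] for model A.

    Assumption: 1 = strongly prefers A, 4 = no preference, 7 = strongly prefers B.
    """
    if not 1 <= preference <= 7:
        raise ValueError("preference must be between 1 and 7")
    return (7 - preference) / 6.0


def elo_update(rating_a, rating_b, score_a, k=32.0):
    """Return the updated (rating_a, rating_b) after one comparison."""
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Two models start level; the annotator strongly prefers model A.
new_a, new_b = elo_update(1000.0, 1000.0, likert_to_score(1))
print(new_a, new_b)  # 1016.0 984.0
```

A neutral preference (4) maps to 0.5 and leaves two equally rated models unchanged. Leaderboards often fit all comparisons jointly rather than updating sequentially, so treat this only as an intuition for how preferences move ratings.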

Evaluation Methodology

Each model is paired with every other model at least 50 times, and each pairing receives a randomly chosen prompt from the set of 1,000 prompts described above.
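The pairing scheme can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the scheduler actually used: the function `build_schedule`, the placeholder model names, and the fixed random seed are all hypothetical.

```python
import itertools
import random


def build_schedule(models, prompt_ids, min_pairings=50, seed=0):
    """Pair every model with every other model min_pairings times.

    Each pairing is assigned a prompt drawn uniformly at random.
    """
    rng = random.Random(seed)
    schedule = []
    for model_a, model_b in itertools.combinations(models, 2):
        for _ in range(min_pairings):
            schedule.append((model_a, model_b, rng.choice(prompt_ids)))
    return schedule


models = ["model_a", "model_b", "model_c"]  # placeholder names
prompt_ids = list(range(1000))  # stands in for the 1,000 prompts
schedule = build_schedule(models, prompt_ids)
print(len(schedule))  # 3 pairs x 50 pairings = 150
```

By the same arithmetic, 17 models form 136 unordered pairs, so a scheme like this would produce at least 6,800 pairings if all 17 models in the leaderboard table were evaluated this way.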

Each evaluation task consists of the following:

Two models generate the responses for a prompt
Annotators provide a point-wise evaluation of each response
Annotators express their preference between the two responses on a 7-point Likert scale

To ensure thoroughness and reliability in the evaluation process, each task was executed in parallel three times by different human annotators. The ratings were then reviewed in two stages: an initial review layer and a final review layer. The figure below provides an overview of the evaluation pipeline design. After the tasks were finalized, a team of independent internal auditors randomly selected and reviewed 10% of the tasks for quality control.

Figure: Evaluation Methodology - Pipeline Design

Evaluation Insights Summary

Models across all criteria

While models are generally good at understanding the prompt and staying consistent within its context (which is easier in our single-turn evaluation setup), their instruction-following abilities are generally weak.

Note that the table below does not follow a particular order for the models.

The lowest-rated criteria fall under instruction following, including Prompt Request Coverage and Constraints:
1. The three best-performing models for Constraints are GPT-4o (May 2024), Gemini 1.5 Pro (May 2024), and GPT-4o (August 2024).
2. The three best-performing models for Prompt Request Coverage are Gemini 1.5 Pro (August 27, 2024), o1-preview, and Mistral Large 2.

Please refer to Appendix - Evaluation Insights for a more detailed analysis of model strengths and weaknesses and a standard deviation analysis.

Acknowledgements

This project was made possible by the dedicated efforts of a team of expert annotators. We extend our gratitude to everyone involved in the development and refinement of the dataset and the verification methodology.

Refonte AI Team: Ernesto Hernandez*, Diego Mares, Jorge Flores, Cristina Menghini, Daniel Berrios, William Qian, Kenneth Murphy, Summer Yue, Mike Lunati

Appendix - Evaluation Insights

Model Strengths and Weaknesses

Standard deviation analysis

We further analyzed the standard deviations for the two criteria where models tend to struggle most: Instruction Following - Constraints and Instruction Following - Prompt Request Coverage.

Constraints criteria

Least spread (indicates consistent model performance across use cases):
1. Gemini 1.5 Flash
Widest spread (suggests strong performance in certain use cases but poor performance in others):
1. Mistral Large 2

Prompt Request Coverage criteria

Least spread (indicates consistent model performance across use cases):
1. GPT-4 Turbo Preview and Mistral Large 2
Widest spread (suggests strong performance in certain use cases but poor performance in others):
1. Claude 3.5 Sonnet
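The notion of spread used here can be sketched as the standard deviation of a model's compliance rate across use cases. The example below is a minimal illustration under assumptions: the model names and rates are invented, and the choice of population standard deviation (`statistics.pstdev`) is ours, since the text does not say which estimator was used.

```python
from statistics import pstdev


def spread_by_model(rates):
    """Return the standard deviation of per-use-case rates for each model.

    rates: {model: {use_case: compliance_rate}}
    """
    return {model: pstdev(list(per_case.values())) for model, per_case in rates.items()}


# Hypothetical compliance rates, not taken from the evaluation.
rates = {
    "model_x": {"Educational Support": 0.6, "Creative Expression": 0.6},
    "model_y": {"Educational Support": 0.9, "Creative Expression": 0.3},
}
spread = spread_by_model(rates)
print(spread)
```

In this toy data both models average 0.6, but `model_x` has zero spread (consistent across use cases) while `model_y` has a spread of about 0.3 (strong in one use case, weak in the other).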

Rank	Model	Score	95% Confidence
1	o1-preview	1130	+32/-30
2	GPT-4o (May 2024)	1106	+24/-24
3	Gemini 1.5 Pro (May 2024)	1090	+26/-26
4	Gemini 1.5 Pro (August 27, 2024)	1089	+29/-33
5	GPT-4o (August 2024)	1080	+31/-27
6	GPT-4 Turbo Preview	1051	+21/-20
7	Mistral Large 2	1050	+30/-33
8	Gemini 1.5 Pro (April 2024)	1026	+30/-30
9	Gemini 1.5 Flash	1004	+26/-27
10	Claude 3.5 Sonnet	1002	+34/-31
11	Llama 3.2 90B Vision Instruct	977	+28/-33
12	Claude 3 Opus	943	+22/-22
13	Llama 3.1 405B Instruct	940	+25/-25
14	Llama 3 70B Instruct	905	+29/-25
15	Mistral Large	870	+29/-30
16	Claude 3 Sonnet	869	+28/-28
17	Gemini 1.0 Pro	869	+27/-27
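One rough way to read the confidence column is to turn each entry into an interval and check whether two models' intervals overlap. The sketch below assumes that "+32/-30" means the upper bound is the score plus 32 and the lower bound is the score minus 30; the helper names `interval` and `overlaps` are hypothetical.

```python
def interval(score, plus, minus):
    """Return (lower, upper) bounds for a score with asymmetric margins."""
    return (score - minus, score + plus)


def overlaps(a, b):
    """True if two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


o1_preview = interval(1130, 32, 30)     # (1100, 1162)
gpt4o_may = interval(1106, 24, 24)      # (1082, 1130)
mistral_large = interval(870, 29, 30)   # (840, 899)

print(overlaps(o1_preview, gpt4o_may))      # True
print(overlaps(o1_preview, mistral_large))  # False
```

Overlapping intervals, as for the top two models, suggest the ranking between them is less certain than the gap in scores alone implies. Overlap is only a heuristic and not a formal significance test.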