The Refonte AI Multilingual Prompts Dataset, composed of 1,000 prompts per language, is designed to improve models' ability to interact across multiple languages. The current release focuses on how well chatbots engage with Spanish speakers from Spain, Mexico, and the rest of Latin America, reflecting the complexity of global communication. Spanish is the only language included in the initial set of leaderboards, but we plan to expand coverage in future updates.
The dataset consists of single-turn prompts across a diverse range of scenarios, and aims to evaluate and improve models' responses in both general and culturally nuanced conversations.
Category | Definition |
---|---|
Educational Support | Assist students with academic challenges in an easily understandable manner. |
Entertainment & Recreation | Create engaging experiences with games, trivia, and interactive stories for all ages. |
Coding & Technical Assistance | Provide clear, simple assistance with programming projects. |
Daily Life Assistance | Offer practical advice on everyday tasks like cooking and scheduling. |
Creative Expression | Encourage creativity in arts and writing with inspirational guidance. |
Idea Development & Content Exploration | Help generate ideas and clarify thoughts for innovative minds. |
Information & Learning | Simplify complex topics and support learning new skills across various subjects. |
Personal & Professional Organization | Assist in organizing personal and work lives with actionable advice. |
Shopping & Consumer Research | Provide insights for smart shopping decisions with clear comparisons and recommendations. |
Writing & Communication | Enhance communication skills with tips on various forms of writing. |
Category: Educational Support
Definition: Assist students with academic challenges in an easily understandable manner.
User:
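Piensa que eres un profesor privado con experiencia en varias materias. Realiza una explicación detallada sobre las fracciones, usando ejemplos prácticos que tengan relación con situaciones cotidianas de un estudiante de primaria que no comprende bien el concepto de fracciones. Debes utilizar un tono amigable, evitando jerga técnica para no abrumar al estudiante, además de responder empleando pasos numerados para resolver un problema específico, seguido de preguntas al alumno para evaluar si ha comprendido la explicación.
(English translation: Imagine you are a private tutor with experience in several subjects. Give a detailed explanation of fractions, using practical examples drawn from the everyday life of a primary school student who does not understand the concept well. Use a friendly tone and avoid technical jargon so as not to overwhelm the student. Answer with numbered steps that solve a specific problem, followed by questions for the student to check whether they have understood the explanation.)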
After the stand-alone criteria are evaluated, responses are compared side by side on a Likert scale. This comparison identifies the preferred model response, supported by a detailed justification tied to the evaluation criteria. Leaderboard rankings are based on the Elo scores computed from these side-by-side comparisons.
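To make the ranking step concrete, here is a minimal sketch of how Elo scores could be derived from side-by-side ratings. The source does not describe the exact procedure, so everything specific here is an assumption: the 1 to 7 Likert range, the `likert_to_outcome` mapping (lower values favor the first model), the initial rating of 1000, and the K-factor of 32 are all hypothetical choices for illustration.

```python
from collections import defaultdict


def expected_score(r_a, r_b):
    """Standard Elo expected score of A when facing B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def likert_to_outcome(score, midpoint=4):
    """Hypothetical mapping: a rating below the midpoint means A is
    preferred, the midpoint is a tie, above it means B is preferred."""
    if score < midpoint:
        return 1.0
    if score == midpoint:
        return 0.5
    return 0.0


def update_elo(ratings, model_a, model_b, outcome_a, k=32.0):
    """Apply one comparison in place; outcome_a is 1.0, 0.5, or 0.0."""
    e_a = expected_score(ratings[model_a], ratings[model_b])
    delta = k * (outcome_a - e_a)
    ratings[model_a] += delta
    ratings[model_b] -= delta


def rank_models(comparisons, initial=1000.0, k=32.0):
    """comparisons: iterable of (model_a, model_b, likert_score).
    Returns (model, rating) pairs sorted from highest to lowest rating."""
    ratings = defaultdict(lambda: initial)
    for model_a, model_b, score in comparisons:
        update_elo(ratings, model_a, model_b, likert_to_outcome(score), k)
    return sorted(ratings.items(), key=lambda item: item[1], reverse=True)
```

Sequential Elo updates depend on the order of comparisons, so a production pipeline would likely shuffle the comparisons or fit ratings jointly; this sketch only illustrates the basic update rule.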
Each model is paired with every other model at least 50 times, and each pairing receives a randomly chosen prompt from the set of 1,000 prompts described above.
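The pairing scheme described above can be sketched as follows. This is an illustration under stated assumptions, not the actual scheduler: the function name `build_schedule`, the fixed seed, and the random left/right ordering (added here to reduce position bias) are hypothetical and do not come from the source.

```python
import itertools
import random


def build_schedule(models, prompt_ids, min_pairings=50, seed=0):
    """Return a list of (left_model, right_model, prompt_id) tasks in which
    every unordered pair of models appears exactly min_pairings times, each
    time with a prompt drawn uniformly at random from prompt_ids."""
    rng = random.Random(seed)
    schedule = []
    for model_a, model_b in itertools.combinations(models, 2):
        for _ in range(min_pairings):
            prompt_id = rng.choice(prompt_ids)
            # Assumption: randomize display order so neither model is
            # always shown on the same side.
            if rng.random() < 0.5:
                schedule.append((model_a, model_b, prompt_id))
            else:
                schedule.append((model_b, model_a, prompt_id))
    rng.shuffle(schedule)
    return schedule
```

The number of tasks grows quadratically with the number of models: 17 models, for instance, give 136 unordered pairs and therefore at least 6,800 side-by-side comparisons at 50 pairings each.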
To ensure thoroughness and reliability in the evaluation process, each task was completed independently by three different human annotators. The ratings were then reviewed in two stages: an initial review layer and a final review layer. The figure below provides an overview of the evaluation pipeline design. After the tasks were finalized, a team of independent internal auditors randomly selected and reviewed 10% of them for quality control.
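As a rough sketch of the quality-control steps in this paragraph: the source states that each task has three annotator ratings and that 10% of tasks are randomly audited, but it does not say how the three ratings are combined. Taking the median below is an assumption, and the function names are hypothetical.

```python
import random
import statistics


def aggregate_ratings(ratings):
    """Combine the three annotator ratings for one task.
    Assumption: the median is used; the source does not specify this."""
    if len(ratings) != 3:
        raise ValueError("expected exactly three annotator ratings")
    return statistics.median(ratings)


def sample_for_audit(task_ids, fraction=0.10, seed=0):
    """Draw a random subset of tasks (without replacement) for audit."""
    rng = random.Random(seed)
    sample_size = max(1, round(len(task_ids) * fraction))
    return rng.sample(list(task_ids), sample_size)
```

For example, `aggregate_ratings([3, 5, 4])` returns `4`, and `sample_for_audit(range(1000))` returns 100 distinct task IDs.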
Models across all criteria
While models are generally good at understanding the prompt and staying consistent with its context (which is easier in our single-turn evaluation setup), their instruction-following abilities are generally weak.
Note that the table below does not follow a particular order for the models.
Please refer to Appendix - Evaluation Insights for a more detailed analysis of model strengths and weaknesses, along with a standard deviation analysis.
This project was made possible by the dedicated efforts of a team of expert annotators. We extend our gratitude to everyone involved in the development and refinement of the dataset and the verification methodology.
Refonte AI Team: Ernesto Hernandez*, Diego Mares, Jorge Flores, Cristina Menghini, Daniel Berrios, William Qian, Kenneth Murphy, Summer Yue, Mike Lunati
Models Strengths and Weaknesses
Constraints criteria
Main request adherence criteria
Model | Score | 95% Confidence |
---|---|---|
1st | 1130 | +32/-30 |
 | 1106 | +24/-24 |
 | 1090 | +26/-26 |
 | 1089 | +29/-33 |
 | 1080 | +31/-27 |
 | 1051 | +21/-20 |
 | 1050 | +30/-33 |
 | 1026 | +30/-30 |
 | 1004 | +26/-27 |
 | 1002 | +34/-31 |
 | 977 | +28/-33 |
 | 943 | +22/-22 |
 | 940 | +25/-25 |
 | 905 | +29/-25 |
 | 870 | +29/-30 |
 | 869 | +28/-28 |
 | 869 | +27/-27 |