Building Benchmark Datasets to Drive Innovation in Education
Progress in AI is often discussed through technical breakthroughs — the transformer, the power of scaling with ever-growing amounts of data and compute, and landmark moments like AlphaGo and ChatGPT.
But a hidden lever that has driven this AI progress has been the use of well-designed public benchmarks that allow effective “hill climbing” — the ability to continuously iterate and test model improvement against a pre-defined target.
As seen in projects like SQuAD and ImageNet, these benchmarks and their underlying datasets can motivate AI researchers and developers to produce cutting-edge techniques and state-of-the-art algorithms. Take ImageNet, for instance. In 2007, Professor Fei-Fei Li started assembling a massive dataset of 14 million pictures labeled with the objects that appeared in them. What began as a published open dataset quickly morphed into an annual competition to create algorithms that identify objects with the lowest error rate. In 2017, the final year of the competition, the winning algorithm identified objects with 97.3 percent accuracy, surpassing human performance.
The development of ImageNet and the corresponding annual challenge proved that high-quality datasets and benchmarks can drive significant algorithmic innovation. It also showed what is possible when datasets are open and accessible to diverse groups.
Benchmark datasets in education hold similar transformative potential. A large, well-structured collection of real-world data on issues such as learners’ writing, mathematics, or speech development can serve as a testing ground for innovative AI models that perform tasks meaningfully assisting learners, parents, and teachers. Such benchmarks can spur the growth of personalized learning tools that adjust to individual student needs. Ultimately, they can make the learning experience more technologically innovative, inclusive, and effective.
We interviewed Kumar Garg, President of Renaissance Philanthropy, to explore benchmark datasets — what distinguishes them from other datasets, how to create them, and why they matter.
Benchmark datasets improve AI-powered innovations in teaching and learning. | Photo by Allison Shelley for EDUimage
What are benchmark datasets?
Benchmark datasets are the gold standard for testing AI models. They are standardized data collections that help us evaluate how well an AI model performs on a specific task. We can use benchmarks to compare different models, ensuring they are reliable, generalizable, and up to the task. This approach drives innovation forward and ensures that new AI models are advancing towards real-world value.
Why are they important?
Benchmark datasets are essential for AI models, especially in educational tools and platforms. They help set a baseline for performance so we can track how well or poorly a model performs over time. This is key for developing AI systems that actually improve learning outcomes and tackle real-world educational challenges. Plus, they promote transparency. Anyone can see how the models are performing and the data they are trained on.
Benchmark datasets also help illuminate critical issues, like bias — both in the models themselves and in human judgments. They can uncover equity gaps and ensure that AI solutions work for everyone, not just a select group. On a broader scale, benchmarks set the agenda for where we need to go next in education, spurring progress in specific areas and helping the field of learning engineering grow and evolve. In short, benchmarks are the foundation that can guide and shape the future of AI in education.
Can you give some examples?
In the world of educational data science, I’ve been involved in several benchmark datasets that are pushing the boundaries of what AI can do in the classroom.
One such dataset, from the Readability Prize, challenges traditional methods of measuring text readability. Older formulas like the Flesch-Kincaid Grade Level have been used for years, but they don’t quite capture the true complexity of texts. The solution was the CommonLit Ease of Readability (CLEAR) Corpus — a dataset containing around 5,000 reading passages for grades 3-12. This dataset was used in a data science competition where AI models explained about 90% of the variation in readability scores, far surpassing traditional formulas, which explained only about 25%. This kind of work is helping to build accurate tools for measuring how difficult a text is for students to understand, which can then be used in a variety of learning tools.
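To make the “percent explained” comparison concrete, here is a minimal Python sketch of how variance explained (R²) can be computed for a formula-based estimate versus a model’s predictions against human readability ratings; the numbers below are illustrative, not actual CLEAR Corpus results.

```python
import numpy as np

def variance_explained(human_scores, predictions):
    """R^2: share of the variance in human ratings captured by the predictions."""
    human_scores = np.asarray(human_scores, dtype=float)
    predictions = np.asarray(predictions, dtype=float)
    residual = np.sum((human_scores - predictions) ** 2)
    total = np.sum((human_scores - human_scores.mean()) ** 2)
    return 1.0 - residual / total

# Illustrative values only, not actual CLEAR Corpus data.
human = [2.1, 3.4, 1.8, 4.0, 2.9]          # expert readability ratings
formula_est = [2.6, 2.9, 2.5, 3.3, 2.7]    # e.g., rescaled readability-formula scores
model_pred = [2.0, 3.3, 1.9, 3.8, 3.0]     # a trained model's predictions

print(variance_explained(human, formula_est))  # weaker fit for the formula estimates
print(variance_explained(human, model_pred))   # much stronger fit for the model
```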
Another exciting benchmark comes from the Feedback Prize project: the PERSUADE corpus, which focuses on improving writing feedback tools. I helped develop this project with the team at The Learning Agency, and today the PERSUADE corpus features tens of thousands of argumentative essays from U.S. students in grades 6-12. Human experts have annotated these essays, labeling the argumentative elements of each essay and rating its overall quality. The dataset was used in a data science competition to develop algorithms that can automatically identify discourse elements in argumentative writing. The AI models in this competition achieved about 75% accuracy, opening the door to more sophisticated feedback tools for students.
Another dataset focused on the tricky issue of Personally Identifiable Information (PII) in student writing. This benchmark contains around 22,000 essays from students in a massive open online course. It was the basis of a data science competition to create models that identify and remove sensitive information from data before it is publicly released. The winning algorithms achieved impressive precision and accuracy, demonstrating how data science can help ensure student privacy while enabling educational research.
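The competition itself relied on trained token-classification models, but a simple rule-based sketch in Python illustrates the de-identification idea; the patterns below are simplified assumptions and would miss many real-world identifiers.

```python
import re

# Assumed, simplified patterns -- real PII detection also needs trained models
# for names, addresses, and other context-dependent identifiers.
PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),
    "URL": re.compile(r"https?://\S+"),
}

def scrub(text: str) -> str:
    """Replace matched identifiers with placeholder tags before release."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

essay = "Contact me at jane.doe@example.com or 555-123-4567 for my draft."
print(scrub(essay))
# -> "Contact me at [EMAIL] or [PHONE] for my draft."
```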
These benchmark datasets can help fuel AI innovation and provide real-world solutions to some of education’s biggest challenges.
How are they different from other datasets?
In the digital age, data in the form of text, images, audio, and video is being collected all the time. Learning management systems and digital tutoring platforms are constantly capturing fine-grained data on how students engage with instructional activities and assessments. Academic researchers are embarking on innovative data collection projects, from capturing student emotions during classroom activities to recording how students interact with their teachers. These datasets can be useful for understanding the classroom environment.
However, if they aren’t thoughtfully designed or sampled to test an AI model, like benchmark datasets are, they won’t help us measure how well these models are advancing or pushing the field forward. Unlike other datasets, benchmark datasets have a clear task or target for the AI algorithm and are large enough to train it. Benchmark datasets are specifically designed to track progress, compare models, and ensure we’re moving in the right direction with learning engineering.
How does one build a benchmark dataset?
If you’re considering building your own benchmark dataset, it’s important to keep a few things in mind. First and most important, think about the “common task” on which the benchmark will evaluate performance, how you would measure the quality of the output (for example, can the model accurately transcribe what a student is saying?), and what data source you can use to elicit that output. You can read more about common task selection here.
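For the transcription example, one natural quality measure is word error rate. The metric you choose is a design decision for your benchmark, but a minimal, self-contained Python sketch of that particular measure might look like this:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance between word sequences, normalized by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table of edit distances.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))  # ~0.17
```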
Second, be sure to follow ethical data collection practices and secure the necessary permissions through contractual agreements so the data can be released publicly. For instance, establish a data-sharing agreement with clear guidelines on how the data can be used, especially if you want it shared publicly and used for research or commercial purposes. PII must also be removed to comply with student privacy laws.
It’s also best to gather relevant and unique data. Benchmark data collections are most useful if they cover a novel topic or offer more detailed information than has been collected before. The more granular the data, the better — this approach allows researchers and AI developers to build innovative models.
Additionally, consider fairness and equity when developing your dataset. It’s important to include data representing a wide range of backgrounds and experiences. Benchmark datasets need to serve real-world learning needs, whether for students, teachers, or parents. It’s ideal to include learners’ demographic information, such as their economic background, race, gender identity, language status, and special needs, to ensure representation of historically marginalized groups.
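One simple way to sanity-check representation, assuming demographic fields were collected with consent, is to tabulate group shares in the benchmark and compare them against the population you intend to serve; the field names below are hypothetical.

```python
from collections import Counter

# Hypothetical metadata records -- field names are illustrative only.
records = [
    {"grade": 7, "language_status": "English learner"},
    {"grade": 8, "language_status": "Native speaker"},
    {"grade": 7, "language_status": "Native speaker"},
    {"grade": 6, "language_status": "English learner"},
]

counts = Counter(r["language_status"] for r in records)
total = sum(counts.values())
for group, n in counts.items():
    print(f"{group}: {n / total:.0%} of samples")
```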
Benchmark datasets should be large enough to train and evaluate machine learning models accurately. Consider both the “length” (number of samples) and “width” (number of variables) of the data. If you want to enhance an existing dataset, try making yours larger or more detailed to offer something new. You can also consider annotating the data to provide more precise, accurate information from which the AI model can learn. Clear, high-quality annotations are key for developing strong AI models because they provide reliable training signals, reduce noise in the data, and improve the model's ability to generalize to new scenarios.
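As a sketch of what a clear annotation record could look like, here is a minimal Python structure; the field names and labels are illustrative assumptions, not the schema of any particular corpus.

```python
from dataclasses import dataclass, field

@dataclass
class Annotation:
    start: int   # character offset where the labeled span begins
    end: int     # character offset where it ends (exclusive)
    label: str   # e.g., "Claim", "Evidence"

@dataclass
class Sample:
    text: str
    annotations: list[Annotation] = field(default_factory=list)

sample = Sample(
    text="Schools should start later because students need more sleep.",
    annotations=[Annotation(0, 26, "Claim"), Annotation(27, 60, "Evidence")],
)
print(sample.annotations[0].label)  # "Claim"
```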
Finally, consider withholding a portion of your benchmark data from public release. Keeping some data hidden benefits the field by maintaining the integrity and validity of model evaluations. For instance, if you plan to conduct ongoing evaluations of popular AI models on your benchmark, you’d need unseen data to evaluate these models fairly and ensure the results are not contaminated by prior exposure.
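A minimal sketch of that public/private split, assuming the data has already been collected and de-identified, might look like this in Python:

```python
import random

def public_private_split(samples, holdout_fraction=0.2, seed=42):
    """Shuffle once, then reserve a hidden slice for future evaluations."""
    rng = random.Random(seed)
    shuffled = samples[:]          # copy so the original order is preserved
    rng.shuffle(shuffled)
    cutoff = int(len(shuffled) * (1 - holdout_fraction))
    public, hidden = shuffled[:cutoff], shuffled[cutoff:]
    return public, hidden

samples = [f"essay_{i}" for i in range(100)]
public, hidden = public_private_split(samples)
print(len(public), len(hidden))  # 80 20 -- release `public`, keep `hidden` private
```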