@inproceedings{kiela-etal-2021-dynabench,
    title = "Dynabench: Rethinking Benchmarking in {NLP}",
    author = "Kiela, Douwe  and
      Bartolo, Max  and
      Nie, Yixin  and
      Kaushik, Divyansh  and
      Geiger, Atticus  and
      Wu, Zhengxuan  and
      Vidgen, Bertie  and
      Prasad, Grusha  and
      Singh, Amanpreet  and
      Ringshia, Pratik  and
      Ma, Zhiyi  and
      Thrush, Tristan  and
      Riedel, Sebastian  and
      Waseem, Zeerak  and
      Stenetorp, Pontus  and
      Jia, Robin  and
      Bansal, Mohit  and
      Potts, Christopher  and
      Williams, Adina",
    booktitle = "Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
    month = jun,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.naacl-main.324",
    doi = "10.18653/v1/2021.naacl-main.324",
    pages = "4110--4124",
    abstract = "We introduce Dynabench, an open-source platform for dynamic dataset creation and model benchmarking. Dynabench runs in a web browser and supports human-and-model-in-the-loop dataset creation: annotators seek to create examples that a target model will misclassify, but that another person will not. In this paper, we argue that Dynabench addresses a critical need in our community: contemporary models quickly achieve outstanding performance on benchmark tasks but nonetheless fail on simple challenge examples and falter in real-world scenarios. With Dynabench, dataset creation, model development, and model assessment can directly inform each other, leading to more robust and informative benchmarks. We report on four initial NLP tasks, illustrating these concepts and highlighting the promise of the platform, and address potential objections to dynamic benchmarking as a new standard for the field.",
}
}