Adelani, David Ifeoluwa, Masiak, Marek, Azime, Israel Abebe, Alabi, Jesujoba Oluwadara, Tonja, Atnafu Lambebo, Mwase, Christine, Ogundepo, Odunayo, Dossou, Bonaventure F. P., Oladipo, Akintunde, Nixdorf, Doreen, Emezue, Chris Chinenye, al-azzawi, Sana Sabah, Sibanda, Blessing K., David, Davis, Ndolela, Lolwethu, Mukiibi, Jonathan, Ajayi, Tunde Oluwaseyi, Ngoli, Tatiana Moteu, Odhiambo, Brian, Owodunni, Abraham Toluwase, Obiefuna, Nnaemeka C., Mohamed, Muhidin, Muhammad, Shamsuddeen Hassan, Ababu, Teshome Mulugeta, Abdullahi, Saheed Salahudeen, Yigezu, Mesay Gemeda, Gwadabe, Tajuddeen, Abdulmumin, Idris, Bame, Mahlet Taye, Awoyomi, Oluwabusayo Olufunke, Shode, Iyanuoluwa, Adelani, Tolulope Anu, Kailani, Habiba Abdulganiy, Omotayo, Abdul-Hakeem, Adeeko, Adetola, Abeeb, Afolabi, Aremu, Anuoluwapo, Samuel, Olanrewaju, Siro, Clemencia, Kimotho, Wangari, Ogbu, Onyekachi Raphael, Mbonu, Chinedu E., Chukwuneke, Chiamaka I., Fanijo, Samuel, Ojo, Jessica, Awosan, Oyinkansola F., Guge, Tadesse Kebede, Sari, Sakayo Toadoum, Nyatsine, Pamela, Sidume, Freedmore, Yousuf, Oreen, Oduwole, Mardiyyah, Tshinu, Kanda Patrick, Kimanuka, Ussen, Diko, Thina, Nxakama, Siyanda, Nugussie, Sinodos G., Johar, Abdulmejid Tuni, Mohamed, Shafie Abdi, Hassan, Fuad Mire, Mehamed, Moges Ahmed, Ngabire, Evrard, Twagirayezu, Jules, Ssenkungu, Ivan and Stenetorp, Pontus (2023). MasakhaNEWS: News Topic Classification for African languages. IN: Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics. Association for Computational Linguistics (ACL).
Abstract
African languages are severely under-represented in NLP research due to a lack of datasets covering several NLP tasks. While there are individual language-specific datasets that are being expanded to different tasks, only a handful of NLP tasks (e.g. named entity recognition and machine translation) have standardized benchmark datasets covering several geographically and typologically diverse African languages. In this paper, we develop MasakhaNEWS, a new benchmark dataset for news topic classification covering 16 languages widely spoken in Africa. We provide an evaluation of baseline models by training classical machine learning models and fine-tuning several language models. Furthermore, we explore several alternatives to full fine-tuning of language models that are better suited for zero-shot and few-shot learning, such as cross-lingual parameter-efficient fine-tuning (like MAD-X), pattern exploiting training (PET), prompting language models (like ChatGPT), and prompt-free sentence transformer fine-tuning (SetFit and Cohere Embedding API). Our evaluation in the zero-shot setting shows the potential of prompting ChatGPT for news topic classification in low-resource African languages, achieving an average performance of 70 F1 points without leveraging additional supervision like MAD-X. In the few-shot setting, we show that with as few as 10 examples per label, we achieved more than 90% (i.e. 86.0 F1 points) of the performance of full supervised training (92.6 F1 points) leveraging the PET approach.
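The "more than 90%" figure in the abstract can be checked with a short calculation. The sketch below is illustrative only: the function name `relative_performance` is a hypothetical helper, not part of the paper's code, and the two F1 scores are taken directly from the abstract.

```python
def relative_performance(score: float, reference: float) -> float:
    """Return `score` expressed as a percentage of `reference`."""
    if reference <= 0:
        raise ValueError("reference must be positive")
    return 100.0 * score / reference


# F1 scores quoted in the abstract.
few_shot_f1 = 86.0  # PET with 10 examples per label
full_f1 = 92.6      # full supervised training

ratio = relative_performance(few_shot_f1, full_f1)
print(f"{ratio:.1f}% of full supervised performance")  # prints "92.9% of full supervised performance"
```

The few-shot result therefore recovers roughly 92.9% of the fully supervised score, consistent with the abstract's claim.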
Publication DOI: | https://doi.org/10.48550/arXiv.2304.09972 |
---|---|
Divisions: | College of Business and Social Sciences > Aston Business School > Operations & Information Management; Aston University (General) |
Additional Information: | This arXiv preprint version of this paper is licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/). |
Uncontrolled Keywords: | cs.CL |
Last Modified: | 29 Oct 2024 16:57 |
Date Deposited: | 05 May 2023 09:28 |
Related URLs: | https://aclanth ... ijcnlp-main.10/ (Publisher URL); https://github. ... /masakhane-news (Related URL); https://hugging ... ane/masakhanews (Related URL) |
PURE Output Type: | Conference contribution |
Published Date: | 2023-11 |
Accepted Date: | 2023-09-04 |
Submitted Date: | 2023-04-19 |
Authors: | Adelani, David Ifeoluwa; Masiak, Marek; Azime, Israel Abebe; Alabi, Jesujoba Oluwadara; Tonja, Atnafu Lambebo; Mwase, Christine; Ogundepo, Odunayo; Dossou, Bonaventure F. P.; Oladipo, Akintunde; Nixdorf, Doreen; Emezue, Chris Chinenye; al-azzawi, Sana Sabah; Sibanda, Blessing K.; David, Davis; Ndolela, Lolwethu; Mukiibi, Jonathan; Ajayi, Tunde Oluwaseyi; Ngoli, Tatiana Moteu; Odhiambo, Brian; Owodunni, Abraham Toluwase; Obiefuna, Nnaemeka C.; Mohamed, Muhidin; Muhammad, Shamsuddeen Hassan; Ababu, Teshome Mulugeta; Abdullahi, Saheed Salahudeen; Yigezu, Mesay Gemeda; Gwadabe, Tajuddeen; Abdulmumin, Idris; Bame, Mahlet Taye; Awoyomi, Oluwabusayo Olufunke; Shode, Iyanuoluwa; Adelani, Tolulope Anu; Kailani, Habiba Abdulganiy; Omotayo, Abdul-Hakeem; Adeeko, Adetola; Abeeb, Afolabi; Aremu, Anuoluwapo; Samuel, Olanrewaju; Siro, Clemencia; Kimotho, Wangari; Ogbu, Onyekachi Raphael; Mbonu, Chinedu E.; Chukwuneke, Chiamaka I.; Fanijo, Samuel; Ojo, Jessica; Awosan, Oyinkansola F.; Guge, Tadesse Kebede; Sari, Sakayo Toadoum; Nyatsine, Pamela; Sidume, Freedmore; Yousuf, Oreen; Oduwole, Mardiyyah; Tshinu, Kanda Patrick; Kimanuka, Ussen; Diko, Thina; Nxakama, Siyanda; Nugussie, Sinodos G.; Johar, Abdulmejid Tuni; Mohamed, Shafie Abdi; Hassan, Fuad Mire; Mehamed, Moges Ahmed; Ngabire, Evrard; Twagirayezu, Jules; Ssenkungu, Ivan; Stenetorp, Pontus |