ALEXSIS-PT: A New Resource for Portuguese Lexical Simplification

Abstract

Lexical simplification (LS) is the task of automatically replacing complex words with easier ones, making texts more accessible to various target populations (e.g., individuals with low literacy, individuals with learning disabilities, second language learners). To train and test models, LS systems usually require corpora that feature complex words in context along with their candidate substitutions. To continue improving the performance of LS systems, we introduce ALEXSIS-PT, a novel multi-candidate dataset for Brazilian Portuguese LS containing 9,605 candidate substitutions for 387 complex words. ALEXSIS-PT has been compiled following the ALEXSIS protocol for Spanish, opening exciting new avenues for cross-lingual models. ALEXSIS-PT is the first LS multi-candidate dataset that contains Brazilian newspaper articles. We evaluated four models for substitute generation on this dataset, namely mDistilBERT, mBERT, XLM-R, and BERTimbau. BERTimbau achieved the highest performance across all evaluation metrics.
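As a rough illustration of what "complex words in context along with their candidate substitutions" can look like in practice, the following Python sketch models one multi-candidate entry and a simple top-k hit check for generated substitutes. Everything here is an assumption made for illustration: the class name `LSInstance`, its field names, the `hit_rate_at_k` measure, and the example sentence are hypothetical and are not taken from ALEXSIS-PT or from the evaluation metrics used in the paper.

```python
from dataclasses import dataclass, field
from typing import List, Sequence


@dataclass
class LSInstance:
    """One hypothetical multi-candidate LS entry (field names are assumptions)."""
    sentence: str                 # context containing the complex word
    complex_word: str             # the word to be simplified
    gold_substitutes: List[str] = field(default_factory=list)  # annotator candidates


def hit_at_k(instance: LSInstance, predictions: Sequence[str], k: int) -> bool:
    """Return True if any of the first k predictions is a gold substitute."""
    gold = {s.lower() for s in instance.gold_substitutes}
    return any(p.lower() in gold for p in predictions[:k])


def hit_rate_at_k(instances: Sequence[LSInstance],
                  all_predictions: Sequence[Sequence[str]],
                  k: int) -> float:
    """Fraction of instances with at least one gold substitute in the top k."""
    if not instances:
        return 0.0
    hits = sum(hit_at_k(inst, preds, k)
               for inst, preds in zip(instances, all_predictions))
    return hits / len(instances)


if __name__ == "__main__":
    # Invented example sentence; not drawn from the dataset.
    example = LSInstance(
        sentence="O governo vai implementar novas medidas.",
        complex_word="implementar",
        gold_substitutes=["aplicar", "executar", "realizar"],
    )
    # Stand-in for the ranked output of any substitute generation model.
    ranked = ["executar", "fazer", "criar"]
    print(hit_rate_at_k([example], [ranked], k=1))
```

The ranked list stands in for the output of any substitute generation model; comparing it against several gold candidates rather than a single one is what a multi-candidate dataset makes possible.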

Divisions: College of Engineering & Physical Sciences > School of Computer Science and Digital Technologies
College of Engineering & Physical Sciences > School of Computer Science and Digital Technologies > Applied AI & Robotics
Funding Information: We would like to thank the anonymous COLING reviewers and Matthew Shardlow for their insightful feedback. We further thank Daniel Ferrés and Horacio Saggion, the creators of ALEXSIS, for all the information and resources they shared.
Additional Information: Materials published in or after 2016 are licensed on a Creative Commons Attribution 4.0 International License.
Uncontrolled Keywords: Computational Theory and Mathematics, Computer Science Applications, Theoretical Computer Science
Publication ISSN: 2951-2093
Last Modified: 02 May 2024 07:27
Date Deposited: 24 Oct 2023 12:22
Full Text Link: https://arxiv.o ... /abs/2209.09034
Related URLs: http://www.scop ... tnerID=8YFLogxK (Scopus URL)
https://aclanth ... tion%20metrics. (Publisher URL)
PURE Output Type: Conference article
Published Date: 2022-10-17
Accepted Date: 2022-10-01
Authors: North, Kai
Zampieri, Marcos
Ranasinghe, Tharindu (ORCID Profile 0000-0003-3207-3821)

Version: Published Version

License: Creative Commons Attribution
