Biographical Semi-Supervised Relation Extraction Dataset

Abstract

Extracting biographical information from online documents is a popular research topic in the information extraction (IE) community. Various natural language processing (NLP) techniques such as text classification, text summarisation and relation extraction (RE) are commonly used to achieve this. Among these techniques, RE is the most common since it can be directly used to build biographical knowledge graphs. RE is usually framed as a supervised machine learning (ML) problem, where ML models are trained on annotated datasets. However, there are few annotated datasets for RE since the annotation process can be costly and time-consuming. To address this, we developed Biographical, the first semi-supervised dataset for RE. The dataset, which is aimed towards digital humanities (DH) and historical research, is automatically compiled by aligning sentences from Wikipedia articles with matching structured data from sources including Pantheon and Wikidata. By exploiting the structure of Wikipedia articles and robust named entity recognition (NER), we match information with relatively high precision in order to compile annotated relation pairs for ten different relations that are important in the DH domain. Furthermore, we demonstrate the effectiveness of the dataset by training a state-of-the-art neural model to classify relation pairs, and evaluate it on a manually annotated gold standard set. Biographical is primarily aimed at training neural models for RE within the domain of digital humanities and history, but as we discuss at the end of this paper, it can be useful for other purposes as well.
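
The alignment idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' pipeline: plain case-insensitive string matching stands in for the NER step, and the function name `align_sentences`, the `RelationExample` type, and the relation labels `birthplace` and `occupation` are hypothetical names chosen for illustration only.

```python
from typing import Dict, List, NamedTuple


class RelationExample(NamedTuple):
    """One automatically labelled relation pair (hypothetical structure)."""
    sentence: str
    head: str
    tail: str
    relation: str


def align_sentences(
    sentences: List[str], subject: str, facts: Dict[str, str]
) -> List[RelationExample]:
    """Label a sentence with a relation when it mentions both the article
    subject and the value of a structured fact about that subject.

    Assumption: substring matching replaces the NER used in the paper.
    """
    examples = []
    for sentence in sentences:
        lowered = sentence.lower()
        # Skip sentences that do not mention the subject by name.
        if subject.lower() not in lowered:
            continue
        for relation, value in facts.items():
            # A fact value appearing next to the subject yields one example.
            if value.lower() in lowered:
                examples.append(
                    RelationExample(sentence, subject, value, relation)
                )
    return examples


# Toy input: two article sentences and two structured facts (relation -> value).
sentences = [
    "Ada Lovelace was born in London.",
    "She worked with Charles Babbage.",
]
facts = {"birthplace": "London", "occupation": "mathematician"}

examples = align_sentences(sentences, "Ada Lovelace", facts)
for example in examples:
    print(example.relation, "|", example.head, "->", example.tail)
```

The second sentence is skipped because it refers to the subject only by pronoun, which shows why simple matching of this kind tends to favour precision over recall.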

Publication DOI: https://doi.org/10.1145/3477495.3531742
Additional Information: Copyright © 2022, Association for Computing Machinery. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
Event Title: 45th International ACM SIGIR Conference on Research and Development in Information Retrieval
Event Type: Other
Event Dates: 2022-07-11 - 2022-07-15
Uncontrolled Keywords: biographical information extraction, relation extraction, transformers, Computer Graphics and Computer-Aided Design, Information Systems, Software
ISBN: 9781450387323
Last Modified: 18 Apr 2024 07:31
Date Deposited: 24 Jan 2023 16:06
Related URLs: https://dl.acm. ... 3477495.3531742 (Publisher URL)
https://arxiv.o ... /abs/2205.00806 (Author URL)
http://www.scop ... tnerID=8YFLogxK (Scopus URL)
PURE Output Type: Conference contribution
Published Date: 2022-07-07
Authors: Plum, Alistair
Ranasinghe, Tharindu (ORCID Profile 0000-0003-3207-3821)
Jones, Spencer
Orasan, Constantin
Mitkov, Ruslan
