A parser for news downloads

Abstract

This paper presents the Download Parser, a tool for handling text downloads from large online databases. Many universities have access to full-text databases that allow users to search their holdings and then view, and ideally download, the full text of relevant articles. In practice, however, managing such downloads raises important problems, because of factors such as duplication, uneven formatting standards and lack of documentation. The tool under discussion was devised to parse downloads, clean them up and standardise them, identify headlines, and insert suitably marked-up headers for corpus analysis.
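
As a rough illustration of the kind of processing the abstract describes (splitting a download into texts, cleaning them, dropping duplicates, and marking up headlines), the following Python sketch may help. It is not the Download Parser itself and does not reflect its actual method: the delimiter string, the `<head>` tag, the rule that the first non-empty line is the headline, and the function names are all assumptions made for the example.

```python
import re

# Hypothetical delimiter between articles; real database exports differ.
DELIMITER = "=== DOCUMENT ==="


def normalise(text):
    """Collapse whitespace and lowercase; used only to compare texts."""
    return re.sub(r"\s+", " ", text).strip().lower()


def parse_download(raw):
    """Split a raw download into cleaned, de-duplicated, marked-up texts."""
    seen = set()
    results = []
    for chunk in raw.split(DELIMITER):
        # Clean-up step: trim each line and drop blank lines.
        lines = [ln.strip() for ln in chunk.splitlines() if ln.strip()]
        if not lines:
            continue
        # Duplicate check on the normalised text (an assumed criterion).
        key = normalise(" ".join(lines))
        if key in seen:
            continue
        seen.add(key)
        # Assumption: the first non-empty line is the headline.
        headline, body = lines[0], lines[1:]
        results.append("<head>" + headline + "</head>\n" + "\n".join(body))
    return results


raw = (
    "=== DOCUMENT ===\nStorm hits coast\n\nHigh winds  expected.\n"
    "=== DOCUMENT ===\nStorm hits coast\nHigh winds expected.\n"
    "=== DOCUMENT ===\nRates rise\nBank acts.\n"
)
results = parse_download(raw)
```

Here the second text differs from the first only in spacing, so it is treated as a duplicate and dropped. Exact matching after normalisation is the simplest possible criterion; near-duplicates with small wording changes would need a similarity measure instead.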

Publication DOI: https://doi.org/10.1590/0102-445083054975354211
Divisions: College of Business and Social Sciences > School of Social Sciences & Humanities
Additional Information: This content is licensed under a Creative Commons Attribution License, which permits unrestricted use and distribution, provided the original author and source are credited.
Uncontrolled Keywords: Building sub-corpora, Corpus clean-up, Duplicate texts, News corpus, Linguistics and Language
Publication ISSN: 1678-460X
Last Modified: 30 Sep 2024 11:36
Date Deposited: 09 Jul 2018 10:25
Related URLs: http://www.scop ... tnerID=8YFLogxK (Scopus URL)
http://www.scie ... &lng=en&tlng=en (Publisher URL)
PURE Output Type: Article
Published Date: 2018-03-01
Accepted Date: 2017-03-15
Authors: Scott, Mike

Download

Version: Published Version

License: Creative Commons Attribution
