The Children and Young People’s Books Lexicon (CYP-LEX): A large-scale lexical database of books read by children and young people in the United Kingdom

Abstract

This article introduces the Children and Young People’s Books Lexicon (CYP-LEX), a large-scale lexical database derived from books popular with children and young people in the United Kingdom. CYP-LEX includes 1,200 books evenly distributed across three age bands (7–9, 10–12, 13+) and comprises over 70 million tokens and over 105,000 types. For each word in each age band, we provide its raw and Zipf-transformed frequencies, all parts of speech in which it occurs, with the raw frequency and lemma for each, and measures of count-based contextual diversity. Together and individually, the three CYP-LEX age bands contain substantially more words than any other publicly available database of books for primary and secondary school children. Most of these words are very low in frequency, and a substantial proportion of the words in each age band do not occur on British television. Although the three age bands share some very frequent words, they differ substantially in the words that occur less frequently, and this pattern also holds at the level of individual books. Initial analyses of CYP-LEX illustrate why independent reading constitutes a challenge for children and young people, and they also underscore the importance of reading widely for the development of reading expertise. Overall, CYP-LEX provides unprecedented insight into the nature of vocabulary in books that British children aged 7+ read, and is a highly valuable resource for those studying reading and language development.
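
As a rough illustration of the Zipf-transformed frequencies mentioned in the abstract, the sketch below uses the common definition of the Zipf scale (log10 of the frequency per million tokens, plus 3). The function name `zipf` and the corpus size in the usage lines are hypothetical, and the database itself may apply smoothing or a slightly different normalisation, so this should be read as an approximation rather than the authors' procedure.

```python
import math


def zipf(raw_count: int, corpus_tokens: int) -> float:
    """Zipf value under the common definition: log10(frequency per million) + 3.

    Assumption: no smoothing is applied; the published values may differ.
    """
    if raw_count <= 0 or corpus_tokens <= 0:
        raise ValueError("raw_count and corpus_tokens must be positive")
    per_million = raw_count * 1_000_000 / corpus_tokens
    return math.log10(per_million) + 3


# Hypothetical corpus of 20 million tokens (not a figure from the article).
hypothetical_tokens = 20_000_000
rare = zipf(20, hypothetical_tokens)        # 1 occurrence per million tokens
common = zipf(2_000, hypothetical_tokens)   # 100 occurrences per million tokens
```

On this scale a word occurring once per million tokens sits at 3, and each additional unit corresponds to a tenfold increase in frequency.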

Publication DOI: https://doi.org/10.1177/17470218241229694
Divisions: College of Health & Life Sciences > School of Psychology
College of Health & Life Sciences > Aston Institute of Health & Neurodevelopment (AIHN)
College of Health & Life Sciences
Aston University (General)
Funding Information: The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by a research grant from the Economic and Social Research Council (ES/W002310/1).
Additional Information: Copyright © Experimental Psychology Society 2024. This article is distributed under the terms of the Creative Commons Attribution 4.0 License (https://creativecommons.org/licenses/by/4.0/) which permits any use, reproduction and distribution of the work without further permission provided the original work is attributed as specified on the SAGE and Open Access pages (https://us.sagepub.com/en-us/nam/open-access-at-sage).
Publication ISSN: 1747-0226
Data Access Statement: The CYP-LEX database and code for all reported analyses are available on this project’s OSF website (https://doi.org/10.17605/OSF.IO/SQU49).
The CYP-LEX database is provided in six .csv files, with two files for each of the three age bands (7–9, 10–12, 13+):
 • Main file, titled “main_cyplex[Age Band].csv”, including, for each word form, its raw and Zipf-transformed frequencies and most likely lemma; part-of-speech-dependent frequencies and part-of-speech-dependent lemmas; measures of count-based contextual diversity (number and percentage of sentences and books this word appears in); whether this word was observed in the other two CYP-LEX age bands, the CPWD corpus, and in the CBeebies, CBBC, and adult SUBTLEX-UK subcorpora, and the respective frequencies, on the raw and Zipf scale (for CPWD, log-transformed);
 • A file with the term-document matrix, where each term is a lemma, each document is a book, and each cell is a tf-idf value associated with each lemma in each book. This file is titled “tdm_cyplex[Age Band].csv”.
These six .csv files are accompanied by two README files (one for the main files and one for the term-document matrices) which provide detailed information about the structure of the .csv files.
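
To make the structure of the term-document matrix more concrete, the following minimal sketch builds a lemma-by-book tf-idf table from toy data. It assumes one common weighting (raw term count multiplied by the natural log of the number of books divided by the number of books containing the lemma); the README files define the weighting actually used in CYP-LEX, which may differ. The function name `tf_idf_matrix` and the toy books are hypothetical.

```python
import math
from collections import Counter


def tf_idf_matrix(books: dict[str, list[str]]) -> dict[str, dict[str, float]]:
    """Return {lemma: {book: tf-idf}} for books given as lists of lemmas.

    Assumed weighting: tf = raw count in the book, idf = ln(N / df).
    """
    n_books = len(books)
    counts = {title: Counter(lemmas) for title, lemmas in books.items()}

    # Document frequency: in how many books does each lemma occur?
    doc_freq: Counter = Counter()
    for book_counts in counts.values():
        doc_freq.update(book_counts.keys())

    matrix: dict[str, dict[str, float]] = {}
    for lemma, df in doc_freq.items():
        idf = math.log(n_books / df)
        matrix[lemma] = {title: counts[title][lemma] * idf for title in books}
    return matrix


# Toy input: two hypothetical "books", already lemmatised.
toy_books = {
    "book_a": ["dragon", "the", "the"],
    "book_b": ["the", "school"],
}
toy_matrix = tf_idf_matrix(toy_books)
```

Under this weighting, a lemma that occurs in every book receives a value of zero everywhere, whereas a lemma confined to a few books receives high values in those books, which is why such a matrix is useful for comparing the vocabulary of individual books.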
Last Modified: 06 Feb 2026 08:06
Date Deposited: 05 Feb 2026 15:48
Related URLs: https://journal ... 470218241229694 (Publisher URL)
PURE Output Type: Article
Published Date: 2024-12-01
Published Online Date: 2024-01-23
Accepted Date: 2023-12-29
Authors: Korochkina, Maria (ORCID Profile 0000-0002-8017-7855)
Marelli, Marco
Brysbaert, Marc
Rastle, Kathleen

Version: Published Version

License: Creative Commons Attribution

