2 Language

MARC: 041,a

2.1 Description

The polish_languages function standardizes and harmonizes language information in a dataset. The process starts by isolating unique language entries, so that each distinct combination of languages is processed only once and redundant computation is avoided. A MARC reference list of recognized language abbreviations and names is then used to map language codes to their standardized forms. Each language entry is analyzed to identify multiple languages and to detect any unrecognized terms.
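The deduplicate-then-map step described above can be sketched as follows. This is an illustrative Python sketch only, not the actual implementation of polish_languages: the names `REFERENCE`, `split_entry` and `harmonize_unique`, the `;` separator in the raw data, and the three-entry reference table are all assumptions made for the example.

```python
# Hypothetical excerpt of a code-to-name reference list (assumption: the
# real MARC list is far longer and would be loaded from a file).
REFERENCE = {"fin": "Finnish", "swe": "Swedish", "lat": "Latin"}


def split_entry(raw, sep=";"):
    """Split one raw entry into trimmed, lower-cased, non-empty codes."""
    return [part.strip().lower() for part in raw.split(sep) if part.strip()]


def harmonize_unique(raw_entries, reference=REFERENCE):
    """Map each distinct raw entry once and reuse the result for every row."""
    cache = {}
    for raw in set(raw_entries):  # each distinct string is processed once
        recognized, unrecognized = [], []
        for code in split_entry(raw):
            if code in reference:
                name = reference[code]
                if name not in recognized:  # drop duplicates, keep order
                    recognized.append(name)
            else:
                unrecognized.append(code)
        cache[raw] = (recognized, unrecognized)
    # Expand the cached results back to the original row order.
    return [cache[raw] for raw in raw_entries]


rows = ["fin;swe", "swe", "fin;swe", "swe;xx"]
for raw, (ok, bad) in zip(rows, harmonize_unique(rows)):
    print(raw, "->", ok, "| unrecognized:", bad)
```

Caching by raw string means that a catalogue with many rows but few distinct language combinations only pays the mapping cost once per combination.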

The entries are standardized by converting them to their recognized forms, eliminating duplicates, and filtering out unrecognized languages. Empty cells in the dataset are marked as NA to indicate missing information. Once standardized, all valid languages are aggregated into a structured data frame. For each entry, this data frame includes the total number of languages, a flag (TRUE/FALSE) indicating whether the entry contains multiple languages (including entries originally coded as mul = Multiple languages), the cleaned and harmonized list of languages, and the primary language, defined as the first language listed in the entry. The result is a cleaned and standardized dataset that supports accurate analysis of multilingual data.
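As a rough illustration of the per-entry output described above, the following Python sketch builds one record from an already harmonized list of language names. The function name `summarize_entry`, the field names, the `;` joiner, and the use of `None` as a stand-in for NA are assumptions for the example, not the column names or conventions of the actual function.

```python
def summarize_entry(languages):
    """Build one output record from an already harmonized list of names."""
    if not languages:  # empty cell: every field is treated as missing
        return {"n": None, "multilingual": None, "languages": None, "primary": None}
    return {
        "n": len(languages),
        # More than one language, or the explicit "Multiple languages" label.
        "multilingual": len(languages) > 1 or "Multiple languages" in languages,
        "languages": ";".join(languages),
        "primary": languages[0],  # first listed language
    }


print(summarize_entry(["Finnish", "Swedish"]))
print(summarize_entry(["Multiple languages"]))
print(summarize_entry([]))
```

Note that an entry consisting only of the "Multiple languages" label is flagged as multilingual even though its language count is 1.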

Additionally, an error list is generated, consisting of unrecognized language information and the corresponding IDs. This error list helps librarians identify mistakes in the original data and provides context to either correct the errors or explain why certain entries were discarded by the function.
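One possible shape for such an error list is a flat collection of (ID, unrecognized term) pairs. The Python sketch below assumes that records arrive as (ID, raw string) tuples, that `;` separates languages, and that missing values are skipped rather than reported; `build_error_list` and `known_codes` are hypothetical names.

```python
def build_error_list(records, known_codes, sep=";"):
    """Return (record_id, unrecognized_term) pairs for librarian review."""
    errors = []
    for record_id, raw in records:
        if raw is None:
            continue  # missing values are not reported as errors
        for term in raw.split(sep):
            term = term.strip().lower()
            if term and term not in known_codes:
                errors.append((record_id, term))
    return errors


records = [(101, "fin"), (102, "swe;qq"), (103, None), (104, "zz")]
print(build_error_list(records, known_codes={"fin", "swe"}))
```

Keeping the ID next to each discarded term lets a reviewer go straight from the error list to the original catalogue record.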

2.2 Complete Dataset Overview

Unique languages: 189

Unique primary languages: 155

1096354 single-language entries (93.27%)

79147 multilingual entries, accounting for 6.73% of the total. This includes entries explicitly coded as “mul” (Multiple languages) as well as those with more than one language listed for a single book.

There are 930 single-language entries marked only as “Undetermined” (coded as “und”), accounting for 0.08% of the total.

There are 54424 missing values in the dataset, accounting for 4.42% of the total.

Unrecognized languages lists the languages that were discarded (24 in total). In addition, the Error list contains the ID numbers of the entries associated with these discarded languages and is intended for librarian review.
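The percentages above appear to use two different denominators: the single-language and multilingual shares are consistent with a base of entries that have language information, while the missing-value share is consistent with a base of all entries, including the missing ones. This reading is inferred from the arithmetic, not stated in the report. The short Python check below reproduces the reported figures under that assumption; `share` is a hypothetical helper.

```python
def share(part, whole):
    """Percentage of `whole` taken up by `part`."""
    return 100 * part / whole


single, multi, missing = 1096354, 79147, 54424  # counts reported above
with_language = single + multi        # entries that have language information
all_entries = with_language + missing  # assumed total, including missing values

print(f"single-language: {share(single, with_language):.2f}%")
print(f"multilingual:    {share(multi, with_language):.2f}%")
print(f"missing:         {share(missing, all_entries):.2f}%")
```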

Conversions from raw to preprocessed language entries

New custom abbreviations can be added in this table.

Download language harmonized dataset

2.3 Subset Analysis: 1809-1917

Unique languages (1809-1917): 56

Unique primary languages (1809-1917): 40

61916 single-language entries (93.65%)

4197 multilingual entries, accounting for 6.35% of the total. This includes entries explicitly coded as “mul” (Multiple languages) as well as those with more than one language listed for a single book.

There are 93 entries marked as “Undetermined”.

There are 778 missing values in the dataset, accounting for 1.16% of the total.

Download language harmonized dataset (1809-1917)

2.3.1 Top languages for 1809-1917

Number of titles assigned with each language (top-10). For a complete list, see accepted languages (1809-1917).

Language          Entries (n)   Fraction (%)
Finnish           33987         50.8
Swedish           19997         29.9
German            2212          3.3
Russian           2123          3.2
Finnish;Swedish   2072          3.1
Latin             1879          2.8
French            653           1.0
English           417           0.6
Swedish;Latin     200           0.3
Swedish;Finnish   187           0.3
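A tally of this kind can be sketched in Python with `collections.Counter`. As in the table, identical harmonized strings are counted together, so "Finnish;Swedish" and "Swedish;Finnish" remain separate rows. The function name `top_languages` and the sample data are made up for illustration. The fractions in the table are consistent with a denominator of 66891 (61916 + 4197 + 778, that is, all entries in the subset including missing values), which is an inference from the numbers and not something the report states.

```python
from collections import Counter


def top_languages(entries, n=10):
    """Count identical harmonized entries and return the n most frequent,
    each with its count and its share (in %) of all entries."""
    counts = Counter(entries)
    total = len(entries)
    return [
        (lang, k, round(100 * k / total, 1))
        for lang, k in counts.most_common(n)
    ]


# Made-up sample, not taken from the dataset.
sample = ["Finnish", "Finnish", "Swedish", "Finnish;Swedish",
          "Swedish;Finnish", "Finnish"]
for lang, k, pct in top_languages(sample, n=3):
    print(lang, k, pct)
```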

Title count per language (including multi-language documents; note the log10 scale):