
9 Publisher
MARC: 260b
9.1 Publisher field preprocessing
The publisher field (MARC 260b) is harmonized through a rule-based cleaning pipeline that addresses substantial variation in cataloging practices over time. The raw data contain inconsistent formatting, multiple languages, abbreviations, punctuation noise, and compound entries that combine publishers, distributors, and institutional bodies.
The preprocessing consists of several stages. First, all entries are normalized to lowercase and stripped of extraneous punctuation, brackets, and formatting artifacts (e.g., OCR noise, delimiters and inconsistent separators). Multi-value entries are standardized using a semicolon (;) as a separator to preserve cases where multiple publishers or institutional collaborators are listed.
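The first stage can be illustrated with a minimal Python sketch. It is not the pipeline's actual code: the function name `normalize_entry`, the set of characters treated as noise, and the choice of `;` and `/` as incoming separators are assumptions made for illustration.

```python
import re

# Characters treated as formatting noise (an assumed, illustrative set).
BRACKETS = re.compile(r'[\[\](){}"<>]')

def normalize_entry(raw: str) -> str:
    """Lowercase one raw 260b value and join its parts with ';'."""
    # Assumption: ';' and '/' both separate publishers in the raw data.
    parts = re.split(r"[;/]", raw.lower())
    cleaned = []
    for part in parts:
        part = BRACKETS.sub("", part)
        # Collapse whitespace, then trim stray punctuation at the edges.
        part = re.sub(r"\s+", " ", part).strip(" ,:-")
        if part:
            cleaned.append(part)
    return ";".join(cleaned)
```

With these assumptions, `normalize_entry("Otava ; Gummerus")` gives `"otava;gummerus"`, and a bracket-only value such as `"[ ]"` becomes an empty string that later stages can treat as missing.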
Second, common cataloging abbreviations are resolved. In particular, sine nomine ([s.n.]) is mapped to “kustantaja tuntematon” (unknown publisher), while other Latin abbreviations such as sine loco ([s.l.]) and sine anno ([s.a.]) are removed because they describe the place and date of publication rather than publisher identity.
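A hedged sketch of this abbreviation step follows. The regular expressions and the function name `resolve_abbreviations` are illustrative assumptions; only the target label “kustantaja tuntematon” comes from the description above.

```python
import re

UNKNOWN_PUBLISHER = "kustantaja tuntematon"  # label named in the text

# Assumed patterns: bracketed or bare abbreviations and the spelled-out forms.
SINE_NOMINE = re.compile(
    r"\[?(?<!\w)(?:s\.\s*n\.?|sine nomine)(?!\w)\]?", re.IGNORECASE
)
SINE_LOCO_ANNO = re.compile(
    r"\[?(?<!\w)(?:s\.\s*[la]\.?|sine loco|sine anno)(?!\w)\]?", re.IGNORECASE
)

def resolve_abbreviations(entry: str) -> str:
    """Drop place/date placeholders and map sine nomine to the unknown label."""
    entry = SINE_LOCO_ANNO.sub(" ", entry)
    # Pad the label with spaces so it never fuses with a neighbouring word.
    entry = SINE_NOMINE.sub(f" {UNKNOWN_PUBLISHER} ", entry)
    # Collapse whitespace and trim separators left behind by removals.
    return re.sub(r"\s+", " ", entry).strip(" :;,")
```

Under these patterns, `"[S.l.] : [s.n.]"` reduces to the unknown-publisher label, while `"s.a."` on its own reduces to an empty string. Short patterns like these can also match inside unrelated abbreviations, so a production rule set would likely need to be stricter.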
Third, semantic normalization is applied. Distributor-related expressions (e.g., “distributed by”, “jakaja”, “distr.”) are harmonized to a consistent label (“jakelija”). Self-published works are identified through patterns such as “författaren” or “tekijä” and recoded accordingly. Common institutional and corporate forms (e.g., osakeyhtiö, aktiebolag) are standardized, and frequent abbreviations (e.g., WSOY, VTT, SKS) are expanded or unified.
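The semantic rules could look roughly like the Python sketch below. Several details are assumptions rather than facts about the pipeline: the self-published label `omakustanne` is hypothetical, the direction of standardization for corporate forms (short form to long form) is a guess, and the abbreviation table shows only two illustrative entries.

```python
import re

DISTRIBUTOR_LABEL = "jakelija"        # label named in the text
SELF_PUBLISHED_LABEL = "omakustanne"  # hypothetical label, not from the source

DISTRIBUTOR = re.compile(r"\b(?:distributed by\b|jakaja\b|distr\.)")
SELF_PUBLISHED = re.compile(r"\b(?:författaren|tekijä)\b")

# Assumed direction: abbreviated corporate forms are expanded.
CORPORATE_FORMS = [
    (re.compile(r"\bo\.?y\.?(?!\w)"), "osakeyhtiö"),
    (re.compile(r"\ba\.?b\.?(?!\w)"), "aktiebolag"),
]

# Illustrative expansions only; a real table would be much longer.
ABBREVIATIONS = {
    "wsoy": "werner söderström osakeyhtiö",
    "sks": "suomalaisen kirjallisuuden seura",
}

def normalize_semantics(entry: str) -> str:
    """Apply the semantic rules to one lowercase, pre-cleaned entry."""
    if SELF_PUBLISHED.search(entry):
        return SELF_PUBLISHED_LABEL
    entry = DISTRIBUTOR.sub(DISTRIBUTOR_LABEL, entry)
    for pattern, replacement in CORPORATE_FORMS:
        entry = pattern.sub(replacement, entry)
    return ABBREVIATIONS.get(entry, entry)
```

For example, `normalize_semantics("distr. akateeminen kirjakauppa")` yields `"jakelija akateeminen kirjakauppa"`. The sketch recodes self-published entries to a fixed label; substituting the author name, as described in the dataset overview, would require the author field as an extra input.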
Fourth, historically variable publisher names and OCR variants are mapped to canonical forms using pattern-based rules (e.g., Söderström, Gummerus, Frenckell). This step reduces fragmentation caused by spelling variation, language differences, and inconsistent cataloging.
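As a rough illustration of pattern-based canonical mapping, the sketch below rewrites spelling and OCR variants of the three names mentioned above. The specific variant patterns (for example a zero read in place of "ö", or "v" in place of "u") are hypothetical and are not taken from the actual rule set.

```python
import re

# (pattern, canonical spelling) pairs; the variant patterns are hypothetical.
CANONICAL_RULES = [
    (re.compile(r"s[öo0]derstr[öo0]m"), "söderström"),
    (re.compile(r"gummer[uv]s"), "gummerus"),
    (re.compile(r"frenc?kel+"), "frenckell"),
]

def canonicalize(entry: str) -> str:
    """Rewrite known name variants inside a lowercase entry to one spelling."""
    for pattern, canonical in CANONICAL_RULES:
        entry = pattern.sub(canonical, entry)
    return entry
```

This version only normalizes the spelling of the matched name and leaves the rest of the entry intact, so `"frenkell & son"` becomes `"frenckell & son"`. Collapsing a whole entry to a single canonical publisher would need more specific rules, since distinct publishers can share a family name.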
Finally, entries that cannot be interpreted as meaningful publisher information after cleaning (e.g., empty strings, punctuation-only values) are set to missing (NA). For transparency, all discarded original values are recorded separately, allowing inspection of what was removed during preprocessing.
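The final step can be sketched as follows, with Python's `None` standing in for NA. The function name `finalize`, the "at least one letter or digit" test for meaningful content, and the plain list used as the discard log are all assumptions for illustration.

```python
import re
from typing import List, Optional

# Assumed criterion: a meaningful value has at least one letter or digit.
MEANINGFUL = re.compile(r"[^\W_]")

def finalize(original: str, cleaned: str, discarded: List[str]) -> Optional[str]:
    """Return the cleaned value, or None (missing) while logging the original."""
    if cleaned and MEANINGFUL.search(cleaned):
        return cleaned
    # Keep the raw value so removals can be inspected later.
    discarded.append(original)
    return None
```

A caller would pass the same `discarded` list for every record and write it out once at the end, which keeps the removed originals available for inspection.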
9.2 Complete Dataset Overview
130431 unique publishers are accepted.
49977 records contain multiple publishers.
28421 records are marked as unknown publisher.
8853 records are identified as self-published / author-published.
949058 documents have unambiguous publisher information (72%). This includes documents identified as self-published; the author name is used as the publisher in those cases (if known).
9.3 Subset Analysis: 1809-1917
691 records contain multiple publishers.
8019 records are marked as unknown publisher.
2110 records are identified as self-published / author-published.
51686 documents have unambiguous publisher information (70.8%). This includes documents identified as self-published; the author name is used as the publisher in those cases (if known).
The 20 most common publishers from 1809 to 1917 are shown with the number of documents.