Fennica metadata conversions: statistical monitoring and analysis

Author

Turku Data Science Group

Published

June 10, 2026

1 Preface chapter

This bookdown project is a work in progress. Additional fields will be added to the dataset as they are cleaned and harmonized.

Within the FIN-CLARIAH Metadata Harmonization and Analysis work package, we use the Finnish national bibliography (FNB) Fennica to develop a harmonized dataset that serves research purposes and lays the groundwork for further infrastructure iterations. The results will support the Digital History for Literature in Finland (DHL-FI) project funded by the Research Council of Finland. The FNB contains metadata for over a million documents, including books, newspapers, and maps, with records spanning from 1488 to the present. For more details about Fennica, visit the National Library of Finland website.

Currently, the bookdown project comprises a few distinct chapters. The harmonization is carried out in two pipelines: the Complete FNB Pipeline and the 1809 to 1917 Period Pipeline. The chapters are dedicated to specific metadata categories.

1.0.1 Dataset variables

The choice of fields to harmonize is guided by the needs of the Digital History for Literature in Finland (2022-2026) project funded by the Research Council of Finland. The project aims to use digital collections and methods to significantly expand the prevailing understanding of Finnish literary history.

The harmonized Fennica dataset contains bibliographic metadata, harmonized classifications, publication information, author information, and physical description fields. The variables are:

| Variable | Description |
|---|---|
| melinda_id | Unique Melinda record identifier |
| data_element | MARC 008 data element values |
| genre_008 | Harmonized genre classification derived from MARC 008/33 |
| record_type | MARC record type |
| biblio_level | Bibliographic level |
| publication_status | Publication status code |
| author_name | Harmonized author name |
| author_birth | Author birth year |
| author_death | Author death year |
| author_age | Author lifespan or age information |
| title | Harmonized title |
| title_length | Number of characters in the title |
| title_word | Number of words in the title |
| title_remainder | Remainder of title (subtitle information) |
| title_remainder_length | Number of characters in the remainder of title |
| title_remainder_word | Number of words in the remainder of title |
| title2 | Combined title and remainder of title |
| title2_length | Number of characters in the combined title |
| title2_word | Number of words in the combined title |
| language | Harmonized language name(s) |
| language_primary | Primary publication language |
| language_multi | Indicator for multilingual publications |
| lang_orig | Original language field from source metadata |
| publication_year_from | Earliest publication year extracted from the record |
| publication_year_till | Latest publication year extracted from the record |
| publication_year | Harmonized publication year |
| publication_decade | Publication decade |
| publication_place | Harmonized place of publication |
| publication_country | Country of publication |
| publisher | Harmonized publisher name |
| signum | Library call number |
| udk_orig | Original UDC notation |
| udk_harm | Harmonized UDC notation |
| udk_aux | UDC auxiliary notation |
| udk | Converted UDC classification |
| udk_primary | Primary UDC class |
| udk_multi | Indicator for multiple UDC classes |
| genre_655 | Harmonized genre/form terms derived from MARC 655 |
| id2 | Other system identifier(s) |
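
The length and word-count variables above can be read as simple derivations from the title fields. The sketch below is one plausible reading, not the pipeline's actual code: the helper names `derive_title_fields` and `publication_decade` are hypothetical, words are assumed to be whitespace-separated, `title2` is assumed to join the two parts with a single space, and a decade is assumed to be labeled by its first year. The example title is illustrative and not taken from the dataset.

```python
def derive_title_fields(title, remainder=""):
    """Hypothetical helper: rebuild the title-derived columns for one record."""
    title = title.strip()
    remainder = remainder.strip()
    # Assumption: title2 is the title and its remainder joined by one space.
    title2 = f"{title} {remainder}".strip()
    return {
        "title_length": len(title),                    # characters in the title
        "title_word": len(title.split()),              # whitespace-separated words
        "title_remainder_length": len(remainder),
        "title_remainder_word": len(remainder.split()),
        "title2": title2,
        "title2_length": len(title2),
        "title2_word": len(title2.split()),
    }


def publication_decade(year):
    """Assumption: a decade is labeled by its first year (1809 -> 1800)."""
    return (year // 10) * 10


# Illustrative values only, not a real Fennica record.
fields = derive_title_fields("Suomen historia", "lyhyt oppikirja")
print(fields["title_length"], fields["title_word"], fields["title2_word"])
print(publication_decade(1809))
```

Comparing such recomputed values with the published columns is a quick way to check how the dataset actually counts characters and words, for example whether punctuation or trailing whitespace is included.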

In addition, the dataset includes physical description variables extracted from MARC 300 fields:

| Variable | Description |
|---|---|
| physical_dimension | Physical dimensions of the publication |
| pagecount | Total page count |
| volcount | Number of volumes |
| volnumber | Volume number |
| parts | Number of parts |
| pagecount_arabic | Arabic numeral pages |
| pagecount_roman | Roman numeral pages |
| pagecount_sheet | Sheet count |
| pagecount_plate | Plate count |
| pagecount_squarebracket | Pages reported in square brackets |
| pagecount_multiplier | Page multiplier information |
| pagecount_page_info | Original page count statement |
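
To give a sense of what separating Roman, Arabic, and square-bracketed page counts involves, here is a minimal sketch for a simple extent statement such as `xii, 345, [8] s.`. It is an assumption-laden illustration, not the project's parser: `roman_to_int` and `split_extent` are hypothetical names, the input is assumed to be a comma-separated extent statement only, and sheets, plates, multipliers, and the rule for combining the parts into `pagecount` are left out because the text above does not describe them.

```python
import re

ROMAN_VALUES = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}
# Accepts only well-formed Roman numerals, so words such as "ill" do not match.
ROMAN_RE = re.compile(
    r"m{0,3}(cm|cd|d?c{0,3})(xc|xl|l?x{0,3})(ix|iv|v?i{0,3})", re.IGNORECASE
)


def roman_to_int(numeral):
    """Convert a well-formed Roman numeral such as 'xiv' to an integer."""
    values = [ROMAN_VALUES[ch] for ch in numeral.lower()]
    total = 0
    for current, following in zip(values, values[1:] + [0]):
        # A smaller value placed before a larger one is subtracted (iv = 4).
        total += -current if current < following else current
    return total


def split_extent(statement):
    """Hypothetical parser: classify the leading number of each comma-separated part."""
    counts = {"pagecount_roman": 0, "pagecount_arabic": 0, "pagecount_squarebracket": 0}
    for part in statement.split(","):
        words = part.split()
        if not words:
            continue
        first = words[0]
        if re.fullmatch(r"\[\d+\]", first):
            counts["pagecount_squarebracket"] += int(first[1:-1])
        elif re.fullmatch(r"\d+", first):
            counts["pagecount_arabic"] += int(first)
        elif ROMAN_RE.fullmatch(first):
            counts["pagecount_roman"] += roman_to_int(first)
    return counts


print(split_extent("xii, 345, [8] s."))
```

Real MARC 300 statements are far more varied than this, which is why the dataset keeps the original statement in `pagecount_page_info` alongside the parsed counts.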

Download harmonized_fennica.csv.

Harmonized data is also available for a subset of years 1809-1917:

Download harmonized_fennica19.csv.
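
As a starting point for working with the downloads, the following sketch selects records for a publication period from CSV rows. Several things here are assumptions: the file is taken to be comma-delimited with a header row, `publication_year` is taken to hold a number or a missing-value marker, and `filter_by_year` is a hypothetical helper. The in-memory sample rows are made up for illustration and are not real records. The result is not guaranteed to match `harmonized_fennica19.csv`, since the period pipeline may apply additional processing.

```python
import csv
import io


def filter_by_year(rows, start=1809, end=1917):
    """Keep rows whose publication_year falls within [start, end], inclusive."""
    selected = []
    for row in rows:
        year_text = (row.get("publication_year") or "").strip()
        try:
            # float() first so that values written as "1870.0" are also accepted.
            year = int(float(year_text))
        except (ValueError, OverflowError):
            continue  # skip missing or non-numeric years such as "" or "NA"
        if start <= year <= end:
            selected.append(row)
    return selected


# Small in-memory stand-in for the downloaded file (illustrative rows only).
sample = io.StringIO(
    "melinda_id,publication_year,language_primary\n"
    "A1,1795,Swedish\n"
    "A2,1870,Finnish\n"
    "A3,NA,Finnish\n"
    "A4,1917,Swedish\n"
)
subset = filter_by_year(csv.DictReader(sample))
print([row["melinda_id"] for row in subset])
```

With the real file, the same function can be applied to `csv.DictReader(open("harmonized_fennica.csv", newline="", encoding="utf-8"))`, adjusting the delimiter and encoding if the file turns out to differ from these assumptions.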

For more information, contact Julia Matveeva (yulmat@utu.fi).

How to reproduce the workflow