OCR Ground Truth for Historical Commentaries (GT4HistComment)

The dataset OCR ground truth for historical commentaries (GT4HistComment) was created from the public domain subset of scholarly commentaries on Sophocles' Ajax. Its main goal is to enable the evaluation of OCR quality on printed materials that mix Latin and polytonic Greek scripts. It consists of five 19th-century commentaries written in German, English, and Latin, for a total of 3,356 GT lines.


Person of contact: Matteo Romanello (UNIL)


Pre-trained Kraken models for historical commentaries

This repository contains pre-trained Kraken models for the OCR of historical classical commentaries. These models were trained on ground truth (GT) data from two sources:

  1. the Polytonic Greek Training Data from Historic Texts (Pogretra) dataset v1.0 (31,972 lines)

  2. the OCR GT for Historical Commentaries dataset (3,356 lines)

Person of contact: Matteo Romanello (UNIL)

Gasparo Sardi Toponomasia HTR data

This is a digital edition of Codex_174 of the Burgerbibliothek in Bern. This repository contains all the documents created for this project. The HTR folder contains the dataset, the models and the scripts used for the automatic transcription of the document. The OCR-TEI_files folder contains all the XML-ALTO files produced for editing, the TEI schema (ODD and RNG files), the ALTO-to-TEI transformation XSL file, a Jupyter notebook for creating a masterfile from all the XML-ALTO files, and the masterfile.xml file used for publishing on TEI Publisher. The web-application folder contains the web application made with TEI Publisher.
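To give a sense of what assembling a masterfile from XML-ALTO files involves, here is a minimal, hypothetical Python sketch (not the project's actual notebook) that extracts line text from a simplified ALTO document; real ALTO files follow the full Library of Congress schema, with richer structure than shown here:

```python
import xml.etree.ElementTree as ET

# Simplified in-memory ALTO sample; real files carry the full LOC ALTO schema.
ALTO_NS = "http://www.loc.gov/standards/alto/ns-v4#"
sample = f"""<alto xmlns="{ALTO_NS}">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine>
      <String CONTENT="Codex"/><SP/><String CONTENT="174"/>
    </TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""

def alto_lines(xml_text: str) -> list[str]:
    """Return the text of each TextLine, joining its String CONTENT values."""
    root = ET.fromstring(xml_text)
    ns = {"a": ALTO_NS}
    lines = []
    for line in root.iterfind(".//a:TextLine", ns):
        words = [s.get("CONTENT", "") for s in line.iterfind("a:String", ns)]
        lines.append(" ".join(words))
    return lines

print(alto_lines(sample))  # ['Codex 174']
```

A masterfile-building notebook of the kind described above would apply this sort of parsing to each ALTO file in turn before merging the results into a single XML document.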

Persons of contact: Pauline Jacsont, Florian Mittenhuber (UNIGE)


HumaReC

HumaReC (2016-2018) was a two-year SNSF project led by Claire Clivaz (SIB), which tested continuous data publishing on GA 460, a unique New Testament manuscript in Arabic, Greek and Latin, with the editing of four Pauline letters and analysis on the HumaReC platform. The seven datasets produced in this project, containing transcriptions of the Pauline letters in GA 460 (Marciana Gr. Z. 11 (379)) in Greek, Latin and Arabic, can be accessed here

Person of contact: Claire Clivaz (SIB)

Latin tragedies of the Republican period: encoding and lemmatization

This dataset consists of the encoded fragments of Latin tragedies with their translations, as well as the complete lemmatized versions of the fragments.

Person of contact: Pauline Jacsont (UNIGE)

Layout Ground Truth for Historical Commentaries

GT4HistCommentLayout contains layout annotations for ca. 370 pages sampled from 8 public domain classical commentaries, published in the 19th century in English, German and Latin. The commentaries concern Ancient Greek and Latin works of prose and poetry (caveat: Ancient Greek poetry is slightly over-represented).

Person of contact: Matteo Romanello (UNIL)

Modality maps

This dataset, published on SWISSUbase, is a collection of 76 interactive diachronic maps of the lexical modal markers analyzed in the framework of the WoPoss project. The maps are based on the descriptions of the lemmas in the Thesaurus Linguae Latinae, combined with the authors' own analysis of the relevant Latin passages. The maps are in JSON format and can be visualized on the Pygmalion platform.

Person of contact: Francesca Dell'Oro (UNINE)


OPERAS-P

The H2020 project OPERAS-P (2019-2021) was the second step in the development of the OPERAS research infrastructure (RI). The DH+ group (SIB) participated in WP 6.5, which produced a collection of 25 datasets.

Person of contact: Claire Clivaz (SIB)

SNSF MARK16: eTalks datasets

This is a collection of 11 datasets, developed within the SNSF PRIMA MARK16 project (2018-2023). They are linked to the eTalks API.

Person of contact: Claire Clivaz (SIB)

SNSF MARK16: Manuscript datasets

This is a collection of 55 datasets of Mk 16 in 55 manuscripts in ten ancient languages, developed within the SNSF PRIMA MARK16 project (2018-2023). They are linked to the Manuscript Room API, developed in partnership with the New Testament Virtual Manuscript Room (INTF, Münster).

Person of contact: Claire Clivaz (SIB)


Topological analysis of server logs

This dataset, containing a topological analysis of server logs, was created in a project that aimed to document the behavior of scientists on online platforms by making sense of the digital traces they generate while navigating. The repository contains the Jupyter notebook that was run on the cluster to construct sessions from the large dataset of Gallica user navigations, the Jupyter notebook containing the topological data analysis and cluster visualizations, and the final report of the project.

Persons of contact: Simon Dumas Primbault; Bayrem Kaabachi (EPFL)
