The Dutch Language Institute (INT) develops data models, methods and workflows for documenting Dutch through the ages, in all its variety. Furthermore, the INT hosts several historical corpora, the historical dictionaries of Dutch and historical lexica. Most of the tools and services described below can also be found here.
Historical corpora
The INT makes historical corpora available through searchable web applications. Some corpora are available as datasets (see Datasets). The following corpora are currently available:
- The Corpus of Old Dutch is the collection of all Dutch texts from before 1200;
- The Gysseling Corpus is the collection of all 13th-century Dutch texts;
- The Corpus of Middle Dutch is a collection of texts from 1300-1550;
- The Couranten Corpus is a collection of 17th-century Dutch newspapers;
- The Brieven als Buit, Brieven als Buit-2 and the Gekaapte Brieven corpora contain Dutch sailing letters from the 17th to the 19th century;
- The Corpus Juridisch Nederlands contains a collection of Dutch legal texts from the 19th and 20th centuries.
Historical lexicography and etymology
The INT makes the following historical dictionaries available through a single online search application; they are also available at the European dictionary portal. Together, these dictionaries describe the Dutch vocabulary from around 500 to 1976, and the Frisian vocabulary from 1800 to 1975:
- Oudnederlands Woordenboek (ONW, Dictionary of Old Dutch) 500-1200
- Vroegmiddelnederlands Woordenboek (VMNW, Dictionary of Early Middle Dutch) 1200-1300
- Middelnederlandsch Woordenboek (MNW, Dictionary of Middle Dutch) 1250-1550
- Woordenboek der Nederlandsche Taal (WNT, Dictionary of the Dutch Language) 1500-1976
- Woordenboek der Friese Taal (WFT, Dictionary of the Frisian Language) 1800-1975.
The INT also hosts the Etymologiebank platform, which contains the most important etymological publications on Dutch.
Historical lexica
The lemmata and quotations of the historical Dutch dictionaries are stored in a central database, GiGaNT, which contains a historical component (Hilex) and a modern component (Molex). The historical lexicon is linked to the modern lexicon, so that the history of words can be traced through time.
Importantly, this lexicon enables users to search for spelling variants of words in corpora that lack lemma or part-of-speech (PoS) tags. DiaMaNT, a computational semantic lexicon, forms a semantic layer on top of GiGaNT and offers diachronic semantic information.
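The idea of searching an untagged corpus through a lexicon can be illustrated with a minimal Python sketch. It does not use GiGaNT or any INT service: the toy lexicon `VARIANTS`, the helper names and the sample sentence are all hypothetical, and the only assumption is that a lexicon maps a modern lemma to its attested historical spellings.

```python
import re

# Hypothetical toy lexicon: modern lemma -> attested historical spellings.
# The entries are illustrative only and are not taken from GiGaNT.
VARIANTS = {
    "mens": ["mens", "mensch", "mensche", "minsce"],
    "koning": ["koning", "coninc", "coninck", "koninc"],
}


def variant_pattern(lemma, lexicon):
    """Build a regex matching any known spelling of the lemma as a whole word."""
    forms = lexicon.get(lemma, [lemma])  # fall back to the lemma itself
    # Longest forms first, so the alternation prefers the most specific spelling.
    ordered = sorted(forms, key=len, reverse=True)
    alternation = "|".join(re.escape(form) for form in ordered)
    return re.compile(rf"\b(?:{alternation})\b", re.IGNORECASE)


def find_variants(lemma, text, lexicon=VARIANTS):
    """Return every spelling variant of the lemma found in untagged text."""
    return [m.group(0) for m in variant_pattern(lemma, lexicon).finditer(text)]


text = "Die coninc sprac ten mensche; de Coninck antwoordde."
hits = find_variants("koning", text)
print(hits)
```

A query for the modern lemma is expanded into all of its listed spellings before the text is searched, so no lemma or PoS layer is needed in the corpus itself. A real lexicon would be far larger and would typically also record the period and part of speech of each form.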
Linguistic annotation
The INT has developed two powerful tools for the automatic enrichment and manual enhancement of linguistic annotation in historical corpora:
- GaLAHaD (Generating Linguistic Annotations for Historical Dutch) provides a flexible environment for automatic enrichment and evaluation of different taggers;
- LAnCeLoT (Linguistic Annotation Corpus Laundry Tool) allows researchers to manually correct and refine enrichments.
The refinements with regard to lemma and PoS have been made in accordance with the principles for lemmatization (see lemmatization principles, in Dutch) and PoS tagging (see Tagset voor Diachroon corpusmateriaal van het Nederlands (TDN)).
Datasets
Datasets with data from the 13th to the 19th century have been linguistically enriched in the context of the CLARIAH+ project. These datasets serve as a gold standard, and they have also been used to train several Hugging Face and PIE taggers on a variety of historical data. The datasets are publicly available on GitHub.
Additionally, some of the aforementioned historical corpora are also available as datasets: Brieven als Buit Golden Standard, Corpus Gysseling (Data) and Corpus Middelnederlands (Data).
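To illustrate how a gold standard can be used to evaluate a tagger, the following minimal Python sketch computes token-level accuracy for lemma and PoS. It is not the evaluation method of GaLAHaD or of any INT tool: the function name, the record layout and the sample annotations are hypothetical, and the PoS labels are generic placeholders rather than TDN tags.

```python
def accuracy(gold, predicted, field):
    """Share of tokens whose `field` value matches the gold annotation."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must be aligned token by token")
    if not gold:
        return 0.0
    correct = sum(g[field] == p[field] for g, p in zip(gold, predicted))
    return correct / len(gold)


# Hypothetical gold-standard annotations (illustrative, generic PoS labels).
gold = [
    {"token": "die", "lemma": "de", "pos": "DET"},
    {"token": "coninc", "lemma": "koning", "pos": "NOUN"},
    {"token": "sprac", "lemma": "spreken", "pos": "VERB"},
    {"token": "lude", "lemma": "luid", "pos": "ADV"},
]

# Hypothetical tagger output for the same four tokens.
predicted = [
    {"token": "die", "lemma": "de", "pos": "PRON"},
    {"token": "coninc", "lemma": "koning", "pos": "NOUN"},
    {"token": "sprac", "lemma": "spreken", "pos": "VERB"},
    {"token": "lude", "lemma": "lude", "pos": "ADJ"},
]

lemma_acc = accuracy(gold, predicted, "lemma")
pos_acc = accuracy(gold, predicted, "pos")
print(f"lemma accuracy: {lemma_acc:.2f}, PoS accuracy: {pos_acc:.2f}")
```

The sketch assumes that the gold and predicted sequences share the same tokenization. When they do not, the tokens have to be aligned before the annotations can be compared.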