Corpus compilation

A corpus is a large collection of texts in written or spoken language, created in order to study and describe a language. Corpora provide the basic material for dictionaries, computational lexica, grammars and all kinds of research.

The structure of a corpus depends on its purpose, and so do the metadata that are added to it. Metadata means ‘data about data’: information pertaining to a text, such as its author, date, genre and language.
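As a minimal sketch, the metadata for a single text could be stored as a small structured record. The field names follow the examples mentioned above; the class name, the helper function and the sample values are hypothetical and do not reflect the actual metadata schema of any INT corpus.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class TextMetadata:
    """Hypothetical metadata record: 'data about data' for one corpus text."""
    title: str
    author: str
    date: str      # kept as a string, since historical dates are often ranges
    genre: str
    language: str


def filter_by_genre(records, genre):
    """Return the records with the given genre, e.g. to select a subcorpus."""
    return [r for r in records if r.genre == genre]


# Invented sample values, for illustration only.
records = [
    TextMetadata("Sample letter", "unknown", "1664", "letter", "nl"),
    TextMetadata("Sample charter", "unknown", "1250-1300", "charter", "nl"),
]

letters = filter_by_genre(records, "letter")
print(asdict(letters[0]))
```

Which fields are recorded is a design choice that follows from the purpose of the corpus: a corpus for dialect research would need a place of origin, for instance, while a corpus for lexicography depends on reliable dating.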

The dictionaries of the Dutch Language Institute are based on corpus material. The Dictionary of the Dutch Language (WNT) and the Dictionary of Middle Dutch (MNW) are based on an analogue (‘paper’) corpus of citations from a carefully selected collection of texts. The Dictionary of Old Dutch (ONW), the Dictionary of Early Middle Dutch (VMNW), and the Dictionary of Contemporary Dutch (ANW) are based on digital corpora.

Contemporary Dutch

The Dutch Language Institute currently monitors contemporary Dutch by means of the Corpus of Contemporary Dutch (CHN). The corpus has been linguistically enriched with word classes and lemmas, and has been syntactically annotated.

A substantial part of this material is made available to researchers. In addition, the Dutch Language Institute makes several third-party corpora available (see the language materials on our Taalmaterialen website).

Historical Dutch

The historical corpora, established at the Dutch Language Institute or with the Dutch Language Institute as a partner, have been linguistically annotated with word classes and contemporary Dutch lemmas. High-quality corpora are of the utmost importance in linguistic research. Achieving this quality depends not only on the method of transcription, but also on the availability of proper metadata, especially the dating and localization of the texts.

Historical corpora created by the Dutch Language Institute are:

Gysseling Corpus (source for the Dictionary of Early Middle Dutch)
Corpus of Old Dutch (source for the Dictionary of Old Dutch)
Corpus of Middle Dutch (collection of rhyming texts and prose from the period 1300-1550)

Furthermore, the Dutch Language Institute has collaborated on Letters as Loot (Dutch letters from the 17th and 18th centuries, a selection from the more than 38,000 so-called Sailing Letters from the British Archives) and on the Eindhoven Corpus (the first collection of written and transcribed spoken Dutch texts, from the period 1960-1973).

Gold standard

Some linguistically annotated corpora of the Dutch Language Institute (INT) provide a gold standard. A gold standard is a data set whose annotations have been manually verified and are therefore considered to be correct. Gold standards are used to train tools to annotate text materials automatically and consistently.
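To make the idea of training a tool on a gold standard concrete, here is a minimal sketch of a most-frequent-tag baseline: it learns, for each word form, the word class it most often carries in the verified data, and then applies those tags to new text. This is a deliberately simple textbook technique, not the method used by the INT; the function names, the tag labels and the sample sentences are all invented for illustration.

```python
from collections import Counter, defaultdict


def train_baseline(gold_sentences):
    """Learn the most frequent gold tag for every (lowercased) word form."""
    counts = defaultdict(Counter)
    for sentence in gold_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag(model, words, unknown_tag="UNKNOWN"):
    """Annotate words with the learned tags; unseen words get unknown_tag."""
    return [(w, model.get(w.lower(), unknown_tag)) for w in words]


# Tiny invented gold standard: manually verified (word, word class) pairs.
gold = [
    [("de", "DET"), ("kat", "NOUN"), ("slaapt", "VERB")],
    [("de", "DET"), ("hond", "NOUN"), ("blaft", "VERB")],
]

model = train_baseline(gold)
tagged = tag(model, ["De", "hond", "slaapt", "rustig"])
print(tagged)
```

The word "rustig" does not occur in the gold data, so the baseline cannot tag it. Real annotation tools generalize to unseen words, but they still rely on the same kind of verified data.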

Format

All INT corpora come in TEI XML format. TEI (the Text Encoding Initiative) is a consortium that develops and maintains a standard for the representation of texts in digital form.
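The sketch below shows how a TEI XML document with word-level annotation might be read using only the Python standard library. The namespace and the elements `teiHeader`, `titleStmt` and `w` are part of the TEI guidelines, as are the `lemma` and `pos` attributes on `w`. Whether a given INT corpus encodes its annotation in exactly this way is an assumption, and the sample document is invented.

```python
import xml.etree.ElementTree as ET

# Official TEI namespace, mapped to a prefix for use in search paths.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented sample; real corpus files carry far richer headers.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Sample text</title></titleStmt>
      <publicationStmt><p>Invented example</p></publicationStmt>
      <sourceDesc><p>No source; made up for illustration</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <p>
        <w lemma="de" pos="DET">De</w>
        <w lemma="kat" pos="NOUN">kat</w>
        <w lemma="slapen" pos="VERB">slaapt</w>
      </p>
    </body>
  </text>
</TEI>"""

root = ET.fromstring(SAMPLE)

# Metadata lives in the header, the annotated text in the body.
title = root.find(".//tei:titleStmt/tei:title", TEI_NS).text
tokens = [
    (w.text, w.get("lemma"), w.get("pos"))
    for w in root.iterfind(".//tei:w", TEI_NS)
]

print(title)
for word, lemma, pos in tokens:
    print(f"{word}\t{lemma}\t{pos}")
```

Because the header and the text share one file, the metadata and the linguistic annotation of a text always travel together.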

Search system

INT corpora have been made searchable with BlackLab. This corpus retrieval engine, developed by the Dutch Language Institute, provides a fast and feature-rich search interface.
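As an illustration of what a search request could look like, the sketch below builds a URL for the `hits` endpoint of BlackLab Server, using a pattern in Corpus Query Language. The host name and the corpus name are placeholders, and the annotation name `lemma` is an assumption: the annotations that can actually be queried depend on how a particular corpus was indexed. The code only constructs the URL and does not contact any server.

```python
from urllib.parse import urlencode


def build_hits_url(base_url, corpus, pattern, max_hits=20):
    """Build a BlackLab Server 'hits' request URL for a CQL pattern."""
    params = {
        "patt": pattern,          # the Corpus Query Language pattern
        "number": max_hits,       # maximum number of hits to return
        "outputformat": "json",
    }
    return f"{base_url.rstrip('/')}/{corpus}/hits?{urlencode(params)}"


# Placeholder server and corpus; the pattern matches every form of one lemma.
url = build_hits_url(
    "https://example.org/blacklab-server",
    "my-corpus",
    '[lemma="lopen"]',
)
print(url)
```

Searching by lemma rather than by word form is what the linguistic enrichment makes possible: a single query finds all inflected forms and, in historical corpora, all spelling variants linked to the same contemporary lemma.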

Tags: corpus, contemporary, historical

Contact: Katrien Depuydt, Vincent Vandeghinste

A corpus is a collection of texts in written or spoken language, compiled in order to study and describe a language. Corpora provide the basic material for dictionaries, computational lexica, grammars and all kinds of research.

The composition of a corpus depends on its purpose. The same applies to the metadata that are added to it. Metadata literally means ‘data about data’: information pertaining to a text, such as its author, date, genre and language.

The dictionaries of the Instituut voor de Nederlandse Taal (INT) are based on corpus material. The Woordenboek der Nederlandsche Taal and the Middelnederlandsch Woordenboek are based on an analogue (‘paper’) corpus of citations from a carefully considered collection of texts. The Oudnederlands Woordenboek, the Vroegmiddelnederlands Woordenboek and the Algemeen Nederlands Woordenboek are based on digital corpora.

Contemporary Dutch

The INT currently monitors contemporary Dutch by means of the Corpus Hedendaags Nederlands (CHN). The corpus has been linguistically enriched with word classes and lemmas, and has been syntactically annotated.

A substantial part of this material is made available to researchers. In addition, the INT makes various third-party corpora available (see our Taalmaterialen website).

Historical Dutch

The historical corpora, created at the INT or with the INT as one of the partners, have been linguistically enriched with word classes and contemporary Dutch lemmas. Linguistic research requires high-quality corpora. The method of transcription is important here, but so are good metadata, in particular the dating and localization of the text material.

Historical corpora created by the INT are:

Corpus Gysseling (source for the Vroegmiddelnederlands Woordenboek)
Corpus Oudnederlands (source for the Oudnederlands Woordenboek)
Corpus Middelnederlands (collection of rhyming texts and prose from the period 1300-1550)

The INT has also contributed to Brieven als Buit and Brieven als Buit-2 (Dutch letters from the 17th and 18th centuries, a selection from the approximately 40,000 so-called Sailing Letters from the British archives), Corpus Juridisch Nederlands, Couranten Corpus, the Eindhoven-corpus (the first collection of written and transcribed spoken Dutch texts, from the period 1960-1973), Gekaapte brieven and Nederlab.

Gold standard

For some linguistically enriched INT corpora a gold standard is available. A gold standard is a data set whose annotation has been manually verified and is therefore considered to be correct. Gold standards are used to train tools so that they can apply the same annotations to text material automatically.

Format

The format used for all INT corpora is TEI XML. TEI (the Text Encoding Initiative) is a consortium that develops and maintains a standard for the representation of texts in digital form.

Search system

The INT corpora are made searchable online with BlackLab: corpus search software developed by the INT, with an accompanying fast search application.