Eindhoven Corpus

The Eindhoven Corpus – also known as the Corpus Uit den Boogaart (1975) – is the first collection of Dutch written and (transcribed) spoken texts made for research purposes. The corpus dates from the period between 1960 and 1973.

The written part contains fragments totalling 600,000 words, dating from the period between 1964 and 1971. The spoken part is considerably smaller and consists of approximately 120,000 words.

In 1989, the Renkema Corpus was added to the Eindhoven Corpus. It contains text fragments from correspondence between the government and the States General during the parliamentary year 1975-1976.

Version history

The original corpus served as the source for the book Woordfrequenties in geschreven en gesproken Nederlands (‘Word frequencies in written and spoken Dutch’), published in 1975. Version 2.0.1, available through the language materials website of the Dutch Language Institute (INT), was created at the Vrije Universiteit Amsterdam (VU). In this VU version, the Renkema Corpus was added, the representation of the morphosyntactic encoding was changed, and several corrections were made. During the development of the WOTAN 2 tagset, Hans van Halteren upgraded parts of the Eindhoven Corpus.

Version 2.5

The Dutch Language Institute has created a new version of the Eindhoven Corpus, in which the corpus has been converted to TEI-XML and provided with structured metadata. In addition, the lemmatization has been completed, and the use of capital letters and diacritics has been reconstructed with the aid of Van Halteren’s version and GiGaNT-Molex. The PoS tagging (part-of-speech annotation) was converted to a tagset closely related to the CGN tagset; certain features were added automatically and then partially corrected by hand.
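As an illustration of what working with a lemmatized, PoS-tagged TEI-XML corpus can look like, the sketch below reads word tokens from a small TEI fragment in Python. It is a minimal sketch under stated assumptions, not a description of the actual files: it assumes tokens are encoded as TEI `<w>` elements carrying the standard `lemma` and `pos` attributes, and the sample sentence and its simplified tag values are invented for the example. The encoding used in version 2.5 may differ.

```python
import xml.etree.ElementTree as ET

# Standard TEI namespace; <w> elements live in it.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical sample, invented for illustration. The real corpus files
# may use a different structure and richer PoS tags.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><p><s>
    <w lemma="de" pos="LID">De</w>
    <w lemma="krant" pos="N">krant</w>
    <w lemma="verschijnen" pos="WW">verschijnt</w>
  </s></p></body></text>
</TEI>"""


def read_tokens(xml_text):
    """Return a list of (word form, lemma, pos) for every <w> element."""
    root = ET.fromstring(xml_text)
    return [
        ((w.text or "").strip(), w.get("lemma"), w.get("pos"))
        for w in root.iterfind(".//tei:w", TEI_NS)
    ]


tokens = read_tokens(SAMPLE)
for form, lemma, pos in tokens:
    print(form, lemma, pos)
```

The same function could be pointed at the contents of a real file once the actual element and attribute names have been checked against the corpus documentation.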

Tags: corpus, historical

Contact: Katrien Depuydt

Applications/products

Data set VU version (2.0.1)


Last modified: 23/09/2025

