Corpora as a valuable tool for the Dutch language
Corpora are an essential part of the infrastructure of the Dutch language. They provide an empirical foundation for linguistic research: they enable systematic analysis of language change and variation, and they support the development of practically oriented grammars, dictionaries and educational materials. Beyond fundamental linguistic research, corpora are also indispensable for language technology applications such as terminology extraction, machine translation and the development of language models that can understand and generate human language.
Corpus infrastructure and the INT
The Dutch Language Institute (INT) plays an important role in developing and maintaining the corpus infrastructure of the Dutch language, covering both historical and contemporary corpora. The corpora share a common metadata model, are formatted as TEI-XML and can be searched online through BlackLab, the corpus search engine developed by the INT. The workflow of the Corpus of Contemporary Dutch (CHN), which has a monitor corpus of newspapers at its core, has been fully automated. As a result, new newspaper data are added and made searchable every week.
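To give a rough idea of what working with TEI-XML corpus data can look like, the following minimal Python sketch reads a small TEI fragment and counts lemmas. The sample text, the function names and the use of lemma and pos attributes on w elements are assumptions made for illustration; the fragment is not a complete TEI document, and the actual encoding of the INT corpora may differ.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Standard TEI P5 namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical fragment, invented for this example (not taken from an INT corpus).
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><p>
    <s>
      <w lemma="de" pos="DET">De</w>
      <w lemma="krant" pos="NOUN">kranten</w>
      <w lemma="verschijnen" pos="VERB">verschijnen</w>
      <w lemma="wekelijks" pos="ADV">wekelijks</w>
    </s>
  </p></body></text>
</TEI>"""


def lemma_counts(tei_xml: str) -> Counter:
    """Count the lemma attribute of every <w> element that has one."""
    root = ET.fromstring(tei_xml)
    words = root.iterfind(".//tei:w", TEI_NS)
    return Counter(w.get("lemma") for w in words if w.get("lemma"))


def tokens_with_pos(tei_xml: str, pos: str) -> list:
    """Return the surface forms of all <w> elements with the given pos value."""
    root = ET.fromstring(tei_xml)
    return [w.text for w in root.iterfind(".//tei:w", TEI_NS) if w.get("pos") == pos]


if __name__ == "__main__":
    print(lemma_counts(SAMPLE))
    print(tokens_with_pos(SAMPLE, "NOUN"))
```

A sketch like this only suits small files held in memory. For searching full corpora, a dedicated engine such as BlackLab is the appropriate tool.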
The INT’s corpus infrastructure consists not only of corpora but also of a wide variety of tools for data development, linguistic enrichment and data distribution. These tools allow researchers and developers to search, filter, analyze and export data more efficiently. Much of the institute’s work on corpora focuses on describing the Dutch vocabulary, one of the institute’s key activities, but the versatile infrastructure also supports research in other disciplines. The infrastructure is accessible to the entire research community through the international CLARIN network. The INT thus contributes to a sustainable and accessible research environment for linguists, language technologists and other interested parties.
The INT is also actively involved in a variety of projects that develop new corpora. The institute contributes to these projects not only with substantive expertise but also with technical and infrastructural support for the creation, distribution and provision of corpus materials.