Language software development

The Dutch Language Institute (INT) develops software to support the creation of language materials and to enable researchers to work with these materials more efficiently. One of the institute’s aims is to create generic, reusable applications and components.

Improving access to language resources

The INT’s software supports researchers in different ways. For example, the BlackLab corpus search engine, built on top of Apache Lucene, supports token-based querying as well as querying of relations between tokens. The latter enables searches for syntactic dependencies in syntactically annotated corpora and for alignments in parallel corpora.
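To give an impression of what token-based querying looks like, the sketch below builds a pattern in the bracket notation of Corpus Query Language, in which each pair of square brackets describes one token. The helper functions `token` and `sequence` are hypothetical and not part of BlackLab, and the annotation names `lemma` and `pos` are assumptions: the names actually available depend on how a corpus was indexed.

```python
def token(**constraints):
    """Render one token as a bracket expression, e.g. [lemma="huis"].

    The keyword names (lemma, pos, ...) are assumed annotation names;
    a real corpus may use different ones.
    """
    if not constraints:
        return "[]"  # an empty pair of brackets matches any token
    parts = [f'{name}="{value}"' for name, value in sorted(constraints.items())]
    return "[" + " & ".join(parts) + "]"


def sequence(*tokens):
    """Join token expressions so that they must occur next to each other."""
    return " ".join(tokens)


# An adjective directly followed by a form of the lemma "huis".
pattern = sequence(token(pos="ADJ"), token(lemma="huis"))
print(pattern)  # [pos="ADJ"] [lemma="huis"]
```

The resulting string is only the query pattern; how it is submitted to a search engine, and how special characters in the values should be escaped, is left out of this sketch.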

The institute has also developed (mostly Lucene- or Solr-based) search applications for its dictionaries, such as the Dictionary of Contemporary Dutch (Algemeen Nederlands Woordenboek or ANW), and dictionary portals, including the historical dictionary portal, the translation vocabulary portal and the Database of the Southern Dutch Dialects (DSDD). More recently, we have developed a customizable Solr-based dictionary publication platform that can be used to efficiently develop dictionary search applications.

Other examples of software developed by the INT with the aim of improving access to language resources include woordpeiler, a lexical trend analysis application, and woordenlijst.org, an application that provides spelling, pronunciation and syllabification information targeted at a general audience.

Supporting the development of language resources

The INT supports the development of language resources by working on applications and systems for manual data entry and correction, and data enrichment and conversion.

The following tools support manual data entry and correction:

  • Lex’it, a Rapid Application Development platform to create web-based environments for the manual processing of structured data;
  • SwingLex, a dictionary writing system used for the Woordcombinaties application, the modern Dutch dictionaries ANW and WNW, and the modern Frisian dictionary NFW;
  • LAnCeLoT (Linguistic Annotation Corpus Laundry Tool), a tool with which researchers can manually correct and refine annotations, which is essential for high-quality corpus analysis.

The INT also supports data enrichment and conversion through linguistic annotation systems:

  • GaLAHaD (Generating Linguistic Annotations for Historical Dutch) provides a flexible environment for automatic enrichment and evaluation. The system can automatically add part-of-speech tags and lemmas (or other linguistic annotations), after which users can inspect the results and analyze the performance of various tools against a given gold standard.
  • The institute prefers not to develop new annotation software from scratch, but we maintain a fork of the PIE tagger and a wrapper around the transformers library developed for GaLAHaD [int-huggingface-tagger].
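
As an illustration of what evaluating a tagger against a gold standard involves, the sketch below compares automatically assigned part-of-speech tags with manually verified ones, token by token. It is not GaLAHaD's implementation: the function `evaluate`, the tag names and the example data are all made up for this illustration.

```python
from collections import Counter


def evaluate(predicted, gold):
    """Compare predicted tags with gold-standard tags, position by position.

    Returns the accuracy and a count of (gold, predicted) confusions.
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must contain the same number of tokens")
    correct = sum(p == g for p, g in zip(predicted, gold))
    confusions = Counter((g, p) for p, g in zip(predicted, gold) if p != g)
    accuracy = correct / len(gold) if gold else 0.0
    return accuracy, confusions


# Invented example: four tokens, one of which the tagger got wrong.
gold = ["DET", "NOUN", "VERB", "ADJ"]
predicted = ["DET", "NOUN", "VERB", "NOUN"]

accuracy, confusions = evaluate(predicted, gold)
print(f"accuracy: {accuracy:.2f}")  # accuracy: 0.75
for (gold_tag, predicted_tag), count in confusions.items():
    print(f"{gold_tag} tagged as {predicted_tag}: {count}")
```

Counting confusions alongside the overall accuracy shows which tags a tool tends to mix up, which is often more informative than a single score when comparing tools.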

Contact persons:

Jan Niestadt and Jesse de Does