09:00 – 10:15 Oral Session I – 3 parallel sessions
10:15 – 10:45 Coffee Break
10:45 – 11:00 CLIN Plenary Session
11:00 – 12:00 Keynote by Roman Klinger
12:00 – 13:00 Lunch (+ “Meet our sponsors!”)
13:00 – 14:15 Oral Session II – 3 parallel sessions
14:15 – 15:15 Poster Session I
15:15 – 15:30 Coffee Break
15:30 – 16:45 Oral Session III – 3 parallel sessions
16:45 – 17:45 Poster Session II
17:45 – 18:00 Closing
18:00 – 20:00 Social event (optional)

Detailed Programme / Abstracts


09:00 – 10:15 NLP Applications I – Room A (Chair: Eva Vanmassenhove)
09:00 – 09:25 BETTER-Mods: Better informing citizens about current debates
Liesje van der Linden, Cedric Waterschoot, Ernst van den Hemel, Florian Kunneman, Antal van den Bosch and Emiel Krahmer

This abstract introduces the BETTER-Mods project, which aims to create tools for moderating and summarizing online discussions. Our goal is to develop and evaluate digital tools for improving the quality and comprehensiveness of online discussions in response to news articles. To do so, we will make use of a large dataset containing discussion threads from the NuJij platform of a Dutch news site. Our focus will be, on the one hand, on the automatic detection of constructive posts and, on the other hand, on the automatic summarization of the discussions. In the following sections we elaborate on both components of our research.


09:25 – 09:50 Pictograph translation technologies as a tool for facilitating communication and integration in migration settings
Bram Bulté, Vincent Vandeghinste, Leen Sevens, Ineke Schuurman and Frank Van Eynde

In this study, we investigate the potential of pictograph translation technologies for facilitating communication and integration in migration settings. We incorporate a new pictograph set in an existing text-to-pictograph translation system and carry out evaluations on two sets of authentic data. We also evaluate whether a component targeting named entities can increase the coverage of the system.


09:50 – 10:15 Negation Detection in Dutch Spoken Human-Computer Conversations
Tom Sweers and Iris Hendrickx

Proper recognition and interpretation of negation signals in text or communication is crucial for any form of full natural language understanding. It is also essential for computational approaches to natural language processing. In this study we focus on negation detection in Dutch spoken human-computer conversations. Since no Dutch (dialogue) corpus annotated for negation exists, we annotated a small Dutch corpus to evaluate our method for negation detection. We use transfer learning and apply NegBERT (an existing BERT implementation for negation detection), with multilingual BERT as its basis, fine-tuned on English data. Our results show that adding in-domain training material improves performance. We show that we can detect both negation cues and scope in Dutch dialogues with high precision and recall.

09:00 – 10:15 COVID-19: Stance and Opinion – Room B (Chair: Gilles Jacobs)
09:00 – 09:25 Monitoring Anti-Vaccination Discourse in Dutch Twitter and Facebook Comments
Jens Lemmens, Tim Kreutz, Jens Van Nooten, Ilia Markov and Walter Daelemans

We present preliminary experiments and results for an online tool that monitors COVID-19 vaccination stance and fine-grained anti-vaccination topics on Dutch Twitter and Facebook. The tool provides live updates, statistics and qualitative insights into current vaccination opinions. An annotation task was set up to create training data for automatically detecting stance and, with a focus on anti-vaccination sentiments, narratives that are used to propagate vaccine hesitancy. Additionally, we process the content of messages related to vaccines by applying named entity recognition, sentiment and emotion analysis, and author profiling. Users of the tool will be able to request monthly reports in PDF format.

09:25 – 09:50 Measuring Shifts in Attitudes Towards COVID-19 Measures in Belgium Using Multilingual BERT
Kristen M. Scott, Pieter Delobelle and Bettina Berendt

We classify seven months’ worth of Belgian COVID-related tweets using multilingual BERT and a manually labeled training set. The tweets are classified by which Belgian government COVID measure they refer to, as well as by their stated opinion towards the specific measure of curfew (too strict, ok, too loose). We examine the change in topics discussed and views expressed over time and in reference to dates of related events, such as the implementation of new measures or COVID-19 related announcements in the media.

09:50 – 10:15 Transfer Learning for Stance Analysis in COVID-19 Tweets
Erik Tjong Kim Sang, Marijn Schraagen, Shihan Wang and Mehdi Dastani

Governments introduced a large collection of measures to fight the effects of the COVID-19 pandemic. One way of measuring the public response to these measures is to study social media messages related to the pandemic, for different policies/topics such as face mask use, social distancing, or vaccine willingness. However, computational analysis requires a substantial data annotation effort for each new topic. In this abstract we explore the technique of transfer learning: predicting the response to one pandemic policy measure with a machine learning model trained on annotated data related to another measure.

09:00 – 10:15 (Socio)Linguistic Variation – Room C (Chair: A. Seza Doğruöz)
09:00 – 09:25 Keeping up with the neighbours – An agent-based simulation of the divergence of the standard Dutch pronunciations in the Netherlands and Belgium
Anthe Sevenants and Dirk Speelman

While the Netherlandic standard Dutch pronunciation norm around 1930 was still very much like the Belgian norm, it shifted considerably in the course of the 20th century (Van de Velde 1996, Van de Velde et al. 2010). In Belgium, no such evolution occurred, which caused the pronunciation of the two language varieties to diverge. As yet, there is no conclusive evidence as to why this divergence happened. Because there is not enough data to investigate the divergence empirically, it is examined using an agent-based simulation model in Python. Though we cannot ‘prove’ that the mechanisms described in the theories from the literature actually occurred in reality, we can test their plausibility by checking whether the effects described in the theories also appear in our model, which attempts to mimic real-world circumstances. Four research questions based on theories found in the literature are tested:
1. Is it plausible that reduced contact between speakers from the Netherlands and Belgium resulted in a divergence between the standard pronunciations in both countries?
2. Is it possible that an increased pace of language change in Dutch speakers caused a divergence between the standard pronunciations of the Netherlands and Belgium?
3. Can we relate increased ethnocentrism in Belgian speakers to less adoption of Netherlandic innovations, or even divergence?
4. Is it likely that increased media influence amplified the existing tendencies for language change (acceleration or inhibition) in Belgium?
The results of the simulations are interpreted with the help of a linear regression model where possible. The results show that a lack of contact between the two countries can indeed lead to divergence in the model, but only if travel abroad is at least 5,000 times less likely than domestic travel. The pace of language change in the Netherlands does not have a sizeable impact on convergence or divergence tendencies in Belgium in the model.
High values for ethnocentrism in Belgian agents can lead to divergence in the model, as long as these high values are shared by the entire population. If ethnocentrism decreases the closer agents live to the border, it has little effect. Media receptiveness in agents always kickstarts convergence in the model and accelerates it as well. Since media influence is implemented as a powerful force in the simulation, this result must be interpreted from the viewpoint of media having a sizeable impact on language change.

09:25 – 09:50 “Vaderland”, “Volk” and “Natie”: Semantic Change Related to Nationalism in Dutch Literature Between 1700 and 1880 Captured with Dynamic Bernoulli Word Embeddings
Marije Timmermans and Eva Vanmassenhove

In this paper the semantic shift of the Dutch words “natie” (nation), “volk” (people) and “vaderland” (fatherland) is researched over a period that is known for the rise of nationalism in Europe: 1700-1880 (Jensen, 2016). The semantic change is measured by means of Dynamic Bernoulli Word Embeddings which allow for comparison between word embeddings over different time slices. The word embeddings are generated based on Dutch fiction literature divided over different decades. From the analysis of the absolute drifts, it appears that the word “natie” underwent a relatively small drift. However, the drifts of “vaderland” and “volk” show multiple peaks, culminating around the turn of the nineteenth century. To verify whether this semantic change can indeed be attributed to nationalistic movements, a detailed analysis of the nearest neighbours of the target words is provided. From the analysis, it appears that “natie”, “volk” and “vaderland” became more nationalistically-loaded over time.

09:50 – 10:15 On the Importance of Function Words in Cross-Domain Age Detection
Jens Van Nooten, Ilia Markov and Walter Daelemans

In this paper, we examine the importance of part-of-speech (POS) information for the age detection task in both in-domain (tweets) and cross-domain (training on tweets, testing on user reviews) settings by conducting POS-ablation experiments with word n-grams.

Due to the abstract nature of function words, we hypothesise that they will be the most informative features in the cross-domain setting. The results showed, on the one hand, that nouns and proper nouns were the most informative features in the in-domain setting, and, as expected, that mostly function words were informative in the cross-domain setting. With these results, we aim to highlight the risk of domain overfitting when using content-based features.

10:15 – 10:45 Coffee Break
10:45 – 11:00 CLIN Plenary Session – Keynote Room
11:00 – 12:00 Keynote by Roman Klinger – Keynote Room (Chair: Orphée de Clercq)
Title: Show-don’t-tell — how emotions are communicated in text and how psychological theories can help us in computational emotion analysis
12:00 – 13:00 Lunch (+ “Meet our sponsors!”)
13:00 – 14:15 NLP Applications II – Room A (Chair: Suzan Verberne)
13:00 – 13:25 A high-quality clinical Dutch-to-English translation model based on Transformers
François Remy, Peter De Jaeger and Kris Demuynck

Clinical notes remain by far the best source of information available to physicians when making decisions for their patients, and it is therefore no surprise that harnessing their power is an active area of development in the NLP world. Most of the state of the art in the domain, however, focuses on a select few languages, among which English is prominent. To harness these recent models, translation is often the preferred approach; but as recent studies discovered, even state-of-the-art translation models perform unsatisfactorily when it comes to medical text, where precision is of extreme importance.

In this technical report, the training procedure of our Transformer-based Dutch-to-English translation model is outlined in detail. This model was trained exclusively on freely available datasets, and yet achieves a performance level vastly superior to the current state of the art in the domain. We show this by comparing the precision and recall of MetaMap’s medical concept extraction when executed on a set of Dutch clinical notes after translation to English, as well as by using more conventional quality metrics for translation models (BLEU…).

13:25 – 13:50 Smoking Status Detection with a Pre-Trained Transformer and Weak Supervision in Dutch Electronic Patient Files
Myrthe Reuver, Iris Hendrickx and Jeroen Kuijpers

Smoking status (whether a patient is a smoker, non-smoker, or ex-smoker) is a relevant clinical variable for both public health statistics and personal patient treatment plans. We present experiments on smoking status classification of Dutch-language primary care Electronic Medical Records.
One problem in clinical NLP is the lack of labelled training data for supervised machine learning. We aim to combat this problem by labelling unlabelled items with a weakly supervised approach based on the data programming paradigm of SNORKEL (Ratner et al., 2017). We use a pre-trained Transformer language model and fine-tune it on only the labelled dataset. We find that the weakly supervised SNORKEL approach does not considerably improve performance compared to a classification pipeline without SNORKEL, but does improve some within-class performances. We conclude that state-of-the-art transfer learning methods are interesting for specific (clinical) NLP problems with sparsely labelled data.

13:50 – 14:15 SASTA: Semi-Automatic Analysis of Spontaneous Language
Jan Odijk

This presentation describes an application (Sasta) derived from the CLARIN-developed tool GrETEL for the automatic assessment of transcripts of spontaneous Dutch language, and it provides the results achieved by the application. The techniques described here, if successful, (1) have important societal impact (enabling analysis of spontaneous language in a clinical setting, which is considered important but takes a lot of effort), (2) are interesting from a scientific point of view (various phenomena get a linguistically interesting treatment), and (3) may benefit researchers since SASTA enables a derivative program that can improve the quality of the annotations of Dutch data in CHAT-format (CHILDES data).

13:00 – 14:15 Text Analytics – Room B (Chair: Cynthia Van Hee)
13:00 – 13:25 Investigating Cross-Document Event Coreference Resolution for Dutch
Loic De Langhe, Orphée De Clercq and Véronique Hoste

Event coreference resolution is a task in which different text fragments that refer to the same real-world event are automatically linked together. This task can be performed not only within a single document but also across different documents, and can serve as a basis for many useful natural language processing (NLP) applications. Resources for this type of research, however, are extremely limited. We compiled the first large-scale dataset for cross-document event coreference resolution in Dutch, comparable in size to the most widely used English event coreference corpora. As data for event coreference is notoriously sparse, we took additional steps to maximize the number of coreference links in our corpus. Due to the complex nature of event coreference resolution, many algorithms consist of pipeline architectures which rely on a series of upstream tasks such as event detection, event argument identification and argument coreference. We tackle the task of event argument coreference both to illustrate the potential of our compiled corpus and to lay the groundwork for a future Dutch event coreference resolution system. Results show that existing, well-understood NLP algorithms can easily be retrofitted to contribute to the subtasks of an event coreference resolution pipeline system.

13:25 – 13:50 Automatic humor detection on Jodel
Manuela Bergau

We investigate the influence of humor on the success of a post on the Jodel platform. Do Jodel users prefer humorous posts over other content? To answer this, we trained an SVM classifier for humor detection on hand-labeled Dutch posts. We achieved a precision of 0.7, which is in line with previous work on English data. The number of user votes on each post was used to divide the dataset into humorous and non-humorous. Our results show that humor has a strong influence on the success of a post.

13:50 – 14:15 The more the better? The effect of domain-specific dataset on entity extraction from Dutch criminal records
Amber Norder, Gizem Soğancıoğlu and Heysem Kaya

The Dutch police force generates very large numbers of documents, such as transcripts of interrogations, evidence findings, and statements of people involved, all of which need to be read and processed by analysts. Automating entity extraction from these documents would greatly help the police force. Neural network-based approaches using contextual word embeddings are considered the current state of the art for the named entity recognition (NER) problem in Dutch. Domain-independent NER datasets, as well as pre-trained NER models, are available in the literature. However, earlier studies show that domain-independent models do not work well for domain-specific tasks. As annotation is highly costly, in this study we train a set of BERTje-embedding-based NER models with varying amounts of police data in addition to the domain-independent set, to observe the effect of the domain-specific dataset on training. We follow a training, validation, and test split to ensure a proper experimental protocol. We observe that the slope of the performance increase decreases with the number of target-domain documents in the training set and stabilizes on the validation set at around 250-300 documents. The NER system performs better on the held-out test set (85% macro-average F1 score over five entity categories) than on the validation set, showing the generalization power of the investigated framework.

13:00 – 14:15 Computational Models – Room C (Chair: Joke Daems)
13:00 – 13:25 Dual-Route Model in Auditory Word Recognition
Hanno Maximilian Müller, Mirjam Ernestus and Louis ten Bosch

Dual-route models assume parallel competing processing routes. In linguistics, they have been utilized successfully for modeling the processing of morphologically complex words in visual, but not yet in auditory, word recognition. When listeners hear a morphologically complex word such as ‘persons’, they can either a) search for a match between the acoustic signal and a lexical representation in the mental lexicon (look-up model), b) search for a match between the portions of the signal that entail the stem ‘person’ and the plural suffix ‘-s’ (decomposition model), or c) execute both the look-up and the decomposition route in parallel, in which case whichever route first leads to recognition of the target word determines the response time (dual-route model).
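The race in c) can be sketched as follows: the predicted response time is the minimum of the two routes' latencies. The latency functions below are illustrative assumptions only (generic log-frequency effects), not the equations of any published model.

```python
import math

def lookup_rt(whole_word_freq):
    # Hypothetical look-up latency: frequent whole-word forms
    # ('persons') are recognised faster (log-frequency effect).
    return 1000 - 100 * math.log(1 + whole_word_freq)

def decomposition_rt(stem_freq, suffix_freq):
    # Hypothetical decomposition latency: driven by the frequencies
    # of the stem ('person') and the suffix ('-s').
    return 1100 - 80 * math.log(1 + stem_freq) - 20 * math.log(1 + suffix_freq)

def dual_route_rt(whole_word_freq, stem_freq, suffix_freq):
    # Both routes run in parallel; the faster one determines
    # the predicted response time.
    return min(lookup_rt(whole_word_freq),
               decomposition_rt(stem_freq, suffix_freq))
```

Relaxing the fastest-route-wins assumption would amount to replacing the `min` with some weighted combination of the two latencies.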

Here, we model reaction times from the Biggest Auditory Lexical Decision Task Yet (BALDEY, Ernestus & Cutler, 2015) with a full look-up, a full decomposition, and a dual-route model, and explore which model predicts the data most accurately. We show that Baayen & Schreuder’s dual-route model (1997), originally developed to account for visual lexical decision response times, can also be applied to auditory lexical decision data, despite crucial differences between visual and auditory word recognition. However, the model does not seem capable of accurately predicting response latencies to the BALDEY stimuli, which better reflect natural language than those from the 1997 experiment. We present an approach to refine the model so that it better predicts latencies in BALDEY, and we find that the refined model makes more accurate predictions than a full look-up or a full decomposition model. Our results suggest that the model can be improved even further when the assumption that the fastest route determines the response time is relaxed.

Our results challenge current models of human auditory word recognition, the majority of which lack a parsing mechanism for morphologically complex words.

13:25 – 13:50 Garden-path sentences in Transformers
Floor van Heerwaarden and Stefan Frank

Transformer networks currently outperform recurrent neural networks in NLP applications. However, it is still unclear how plausible Transformers are as cognitive models. LSTMs have been proven to be capable of showing a garden-path effect similar to what is seen in human reading times. By replicating the LSTM garden-path simulation by Frank and Hoeks (2019) but using a Transformer model instead, we found that the Transformer did not show the garden-path effect. It might therefore not be suitable as a cognitive model.

13:50 – 14:15 Transformer-based prediction of verbal cluster order in Dutch
Tim Van de Cruys

In Dutch, verbs tend to cluster together at the end of the clause, and the verbs within that final verb cluster are subject to word order variation. In this research, we investigate whether a neural language model is able to predict that verbal word order (as it is attested in the corpus) automatically, based solely on the sentential context. Specifically, we make use of a state-of-the-art neural transformer model for Dutch that constructs two sentence representations — one for each possible verb order — in order to predict which order is the most felicitous one. The results indicate that the final model is indeed able to make accurate predictions with regard to verb order.
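The two-candidate setup can be sketched as follows, with a toy bigram count model standing in for the transformer; the scoring function and all example counts are illustrative assumptions:

```python
from collections import Counter

def bigram_score(sentence, bigram_counts):
    # Sum of bigram counts as a crude fluency proxy; a transformer
    # would instead score each full sentence representation.
    words = sentence.split()
    return sum(bigram_counts[(a, b)] for a, b in zip(words, words[1:]))

def preferred_order(context, order_a, order_b, bigram_counts):
    # Build one candidate sentence per verb order and keep the
    # higher-scoring one, mirroring the two-representation setup.
    score_a = bigram_score(f"{context} {order_a}", bigram_counts)
    score_b = bigram_score(f"{context} {order_b}", bigram_counts)
    return order_a if score_a >= score_b else order_b
```

For example, with toy counts favouring "heeft gezien", `preferred_order("dat hij het", "heeft gezien", "gezien heeft", counts)` picks that order.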

14:15 – 15:15 Poster Session I – Poster Room I
Vocabulary Demands for Different Types of Dutch Easy Language (1)
Vincent Vandeghinste and Elke Peters

Corpus analyses of English-language TV programs have shown that English language learners need to be familiar with the 3,000 most frequent word families in English to understand 95% of the running words. However, for Dutch, little is known about the vocabulary demands for TV programs, such as the youth news. Further, it remains unclear whether the lexical demands are comparable to those for easy language input, like Wablieft. If you are learning Dutch as a foreign language, how many different words do you actually need to know to understand Dutch? And which words? This depends on the target group for whom certain texts or TV programs are intended. As a foreign language learner, is it best to start with the youth news, or with newspapers in so-called clear or plain language?
In this study, we compared the vocabulary demands of several types of easy-language corpora to those of reference corpora.

We compared the Wablieft corpus (Vandeghinste et al. 2019) with the Karrewiet corpus, a new collection of 300,000 words of youth news from the Flemish Broadcasting Company (VRT), and with the media part of the Basilex corpus; as reference corpora we used the Spoken Dutch Corpus (Corpus Gesproken Nederlands – CGN 2014) and the SoNaR corpus (SoNaR-corpus 2015). We collected and processed (POS-tagging, lemmatization) the Karrewiet and Wablieft corpora and calculated how many different lemma-POS combinations are needed for a lexical coverage of 95%. We also checked how high the lexical coverage is if you know a certain number of words (e.g. 3,000). We ran the analyses with and without proper nouns.
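Both coverage calculations can be sketched as follows; the token lists are toy stand-ins for the corpora's lemma-POS sequences:

```python
from collections import Counter

def vocab_needed_for_coverage(tokens, coverage=0.95):
    # How many distinct items (e.g. lemma-POS pairs), taken from most
    # to least frequent, are needed to cover `coverage` of the tokens?
    counts = Counter(tokens)
    total = sum(counts.values())
    covered, n_types = 0, 0
    for _, freq in counts.most_common():
        covered += freq
        n_types += 1
        if covered / total >= coverage:
            break
    return n_types

def coverage_of_top_n(tokens, n):
    # Lexical coverage achieved by knowing the n most frequent items.
    counts = Counter(tokens)
    total = sum(counts.values())
    top = sum(freq for _, freq in counts.most_common(n))
    return top / total
```

Running proper nouns in or out of `tokens` gives the with/without-proper-noun variants of the analysis.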

Results showed, as expected, that the vocabulary demands are lower for the easy language corpora than for the reference corpora. The Wablieft corpus requires the fewest words for 95% coverage without proper nouns, but the difference with the Karrewiet corpus is small. The Basilex media corpus requires a much larger (27%) vocabulary for the same coverage percentage. If we look at how much of the running text is covered by the 1,000 most frequent words, we see that the largest coverage is reached in the Spoken Dutch Corpus (87.1%), with very similar results for the three easier corpora (between 85.9 and 86.3%). When looking at the top 2,000 words, we see a coverage between 91 and 91.9% for all corpora, apart from SoNaR (80.8%).

With this work we hope to stimulate more evidence-based language learning, with vocabulary lists that maximize readability skills.

Automatic Detection and Annotation of Spelling Errors in the BasiScript Corpus (2)
Wieke Harmsen, Catia Cucchiarini and Helmer Strik

BasiScript is a corpus that contains Dutch handwritten texts produced by primary school children, which has enormous potential for quantitative research into the development of children’s spelling abilities. To make this research possible, it is essential to detect and annotate spelling errors and to link them to the specific spelling principles that are violated. Unfortunately, this is a very time-consuming task to carry out manually. In this study, we report on research that was aimed at developing and evaluating an algorithm to detect and annotate spelling errors automatically.

The BasiScript corpus contains two typed versions of each handwritten text: the target text (the intended text, without spelling errors) and the original text (what the child actually wrote). The algorithm first aligns these two text versions and splits them into words. After that, each target word is annotated with multiple layers of information, like the phonetic transcription, lemma, morphemes, and POS-tags. Using the phonetic transcription, we can group graphemes into Phoneme-Corresponding Units (PCUs). A PCU is a sequence of graphemes that corresponds to one phoneme. We detect spelling errors as PCUs that are deleted, inserted or substituted in an alignment of the target and original PCU segmentations.
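The PCU-based error detection just described can be sketched with a standard sequence alignment; the PCU segmentations in the example (e.g. target ‘ijs’ vs. a child's ‘eis’) are illustrative:

```python
from difflib import SequenceMatcher

def pcu_spelling_errors(target_pcus, original_pcus):
    # Align the target and original PCU sequences and report
    # deleted, inserted, and substituted PCUs as spelling errors.
    sm = SequenceMatcher(a=target_pcus, b=original_pcus, autojunk=False)
    errors = []
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op == "replace":
            errors.append(("substitute", target_pcus[i1:i2], original_pcus[j1:j2]))
        elif op == "delete":
            errors.append(("delete", target_pcus[i1:i2], []))
        elif op == "insert":
            errors.append(("insert", [], original_pcus[j1:j2]))
    return errors
```

Each reported substitution pairs the intended PCU with what the child actually wrote, which is exactly what the later spelling-principle annotation step needs.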

The spelling errors are then annotated with the spelling principles that are violated. For this purpose, we use a Dutch linguistic annotation scheme. Most spelling principle annotations in this scheme can also be used to annotate correctly spelled PCUs. In this way, we can determine how often a spelling principle is applied incorrectly with respect to the total frequency of a spelling principle.

In the presentation we will describe the algorithm and its evaluation in more detail, we will discuss the results and possible limitations of our study and address future avenues of research.

Word sense disambiguation for specific purposes: an example sentence-based methodology for Intelligent Computer-Assisted Language Learning (3)
Jasper Degraeuwe and Patrick Goethals

In this poster, we will present the main challenges of word sense disambiguation (WSD) for the specific purpose of Intelligent Computer-Assisted Language Learning, and report the results of a first experiment. The starting point is that applying WSD could considerably enrich existing language learning and teaching resources, as it would, for instance, enable querying corpora for usage examples of specific semantic uses of vocabulary items. However, most existing WSD methods are based on WordNet and BabelNet sense distinctions, even though their very fine-grained nature actually makes them unsuitable for many NLP applications (Hovy et al., 2013). Moreover, it is argued that a single set of word senses is unlikely to be appropriate for different NLP applications, since “different corpora, and different purposes, will lead to different senses” (Kilgarriff, 1997).

In other words, WSD for specific purposes is an open problem which requires research into tailoring the sense inventory to the particularities of the specific purpose, and into designing methodologies which require little human-curated input (Degraeuwe et al., in press). For this latter challenge, word embeddings could be a key factor, since the surge of neural networks and along with it the introduction of static (e.g. word2vec [Mikolov et al., 2013]) and especially contextualised word embedding models (e.g. BERT [Devlin et al., 2019]) meant an important breakthrough for the WSD task, pushing performance levels to new heights (Loureiro et al., 2021). In a first experiment (focused on Spanish as a foreign language), we developed a customised, coarse-grained sense inventory in which the senses are represented by prototypical usage examples, and then used a pretrained BERT model to convert those sentences into “sense embeddings” and predict the sense of unseen ambiguous instances through cosine similarity calculations. On a 25-item lexical sample, this methodology achieves a promising average F1 score of 0.9.
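The example-sentence-based prediction step can be sketched as follows; a bag-of-words count vector stands in for the BERT sentence embedding, and the sense labels and example sentences are invented for illustration:

```python
import math
from collections import Counter

def embed(sentence):
    # Toy stand-in for a contextualised sentence embedding:
    # a bag-of-words count vector (BERT is used in the actual study).
    return Counter(sentence.lower().split())

def cosine(u, v):
    # Cosine similarity between two sparse count vectors.
    dot = sum(u[w] * v[w] for w in u)
    norm = lambda x: math.sqrt(sum(c * c for c in x.values()))
    return dot / (norm(u) * norm(v)) if u and v else 0.0

def build_sense_inventory(examples_by_sense):
    # One "sense embedding" per sense, pooled over its prototypical
    # usage examples (here: merged count vectors).
    return {sense: sum((embed(s) for s in sents), Counter())
            for sense, sents in examples_by_sense.items()}

def predict_sense(sentence, sense_embeddings):
    # Assign the sense whose embedding is most similar to the instance.
    target = embed(sentence)
    return max(sense_embeddings, key=lambda s: cosine(target, sense_embeddings[s]))
```

With a coarse-grained inventory of two invented senses of "bank", an unseen instance is assigned to whichever sense's pooled examples it is most similar to.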

Assessing Dutch as a Second Language Learners’ Writing Through Syntactic Complexity (4)
Nafal Ossandón Hostens, Sarah Bernolet and Orphée De Clercq

Automatic assessment of writing quality and development has been studied extensively in the context of English as a Second Language (ESL), but much less so for other languages. This exploratory study is an attempt to open up the discussion in the field of Dutch as a Second Language (DSL) and of less proficient language learners.

Syntactic complexity features have been shown to be informative of writing quality for ESL learners. The question then arises whether similar features can be used to assess a DSL text’s quality and development as well. We developed a tool to automatically assess a Dutch text’s syntactic complexity, inspired by the fine-grained features identified by Crossley and McNamara (2014) in their study of ESL writing development. This system extracts several T-Scan features (Pander Maat et al. 2014) and also relies on information from the Dutch dependency parser Alpino (Van Noord et al. 2006).

An exploratory study was carried out to test whether these features are capable of distinguishing between texts written by native speakers and texts written by early DSL learners, relying on both statistical analysis and machine learning techniques. Our findings suggest that syntactic variability and indices of nominal complexity as computed by our system are indicative of text quality and could therefore be used as a proxy when assessing the quality of writing by academic-oriented DSL learners.

Annotation of a Dutch essay corpus with argument structures and quality indicators (5)
Liqin Zhang, Howard Spoelstra and Marco Kalz

This paper presents the compilation of a Dutch essay corpus with annotations of argumentation structures and quality indicators. Following an annotation scheme derived from previous studies, we describe the argument structure by identifying the argument components as well as their relations. A pre-defined rubric is also used to score the quality of the argument components. Four annotators annotated 30 real-life non-worked Dutch essays. The annotation procedure and compilation method ensure the reliability of the corpus for future application in relevant machine learning tasks, even though we achieved only relatively low inter-annotator agreement. The corpus will be made publicly available for future research on argument analysis and quality assessment on essays in Dutch.

Exploring Discourse on Cancer Screening Through the Embedding Space (6)
Anne Fleur van Luenen, Gert-Jan de Bruijn, Enny Das, Suzan Verberne, Hanneke Hendriks and Johannes Brug

Previous media studies have shown that the way cancer research is portrayed in the news may affect beliefs about cancer prevention, possibly decreasing people’s willingness to engage in preventive actions and cancer screening programs (Niederdeppe et al., 2010). In this study we explore the Dutch news media landscape from 2010 to 2018 with regard to cancer and cancer screening. We train temporal Word2Vec embeddings with a compass (Carlo et al., 2019) on the five biggest Dutch newspapers (AD, NRC, Telegraaf, Trouw, Volkskrant). Using nearest neighbours, topic modeling, and comparisons of cosine distances, we describe how cancer has been portrayed in the aforementioned period. We aim to identify the narratives on cancer in news media, such as cancer as a preventable or an unavoidable disease, or cancer as a treatable or a fatal disease. During our presentation, we will present our first results on the identification of cancer and cancer screening narratives.

This study is the first step in a bigger project called SENTENCES: Social Media Analysis to Promote Cancer Screening. In this project, we aim to gain insight into the influence of news and social media on perceptions of the Dutch cancer screening programme. The decision-making process around cancer screening involves a careful weighing of benefits and drawbacks that may be compromised or stimulated by media attention. During this project, we will create classifiers to recognise factors that play a role in this process. The aim is to enable the Dutch health authorities to recognise when to intervene on social media or in the news media.

Evaluation of Surus: a Named Entity Recognition Model for Unstructured Text of Interventional Trials (7)
Casper Peeters, Koen Vijverberg, Marianne Pouwer and Suzan Verberne

The processes of systematic literature review (SLR) and meta-analysis are crucial for medical decision-making. These processes involve a systematic selection of all medical evidence from trials and studies relevant to a specific research question. Selection and extraction of study elements are manual, labour-intensive processes, which may involve screening of >3000 scientific abstracts. Several deep-learning approaches for automation of these processes have been proposed but the task has proven challenging, in part due to the variability in scientific reporting.

We present Surus, a BERT-based model which can identify and extract important study elements, including Population, Intervention, Comparison, Outcome and Study design (PICOS) from unstructured trial publication records. Extraction of PICOS elements is common practice for reviewers to identify records relevant to their prespecified search criteria. Through automated identification of PICOS elements, we speed up the systematic reviewing process without the loss of any relevant publications during screening. Because Surus recognizes important study elements in context, the number of articles included in manual screening is dramatically reduced.

Surus is a state-of-the-art, BERT-based named entity recognition model. It is fine-tuned using a densely annotated, high-quality training set of >800 scientific abstracts describing pharmaceutical interventions for 7 different disease indications. The dataset is manually annotated by experts in the biopharmaceutical field. The model was validated, using a 90:10 train-test split of the dataset, on the recognition of 26 entity categories relevant for the extraction of PICOS elements.

Overall, Surus achieves an average F1 score of 0.926 (±0.0016) over three different 90:10 splits. We are currently evaluating Surus for the task of literature screening by reviewers performing an SLR. The first results indicate that Surus greatly improves the efficiency of literature screening, saving valuable time.
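Named entity recognition of this kind typically decodes a BIO tag sequence into labelled spans. The sketch below shows that decoding step in isolation; the tag names (POP for population, INT for intervention) are hypothetical stand-ins for Surus's actual 26 categories.

```python
def decode_bio(tokens, tags):
    """Group BIO-tagged tokens into (entity_type, text) spans."""
    entities, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                entities.append((label, " ".join(current)))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(token)
        else:
            if current:
                entities.append((label, " ".join(current)))
            current, label = [], None
    if current:
        entities.append((label, " ".join(current)))
    return entities

# Invented example sentence and tags, not real Surus output.
tokens = ["Patients", "with", "type", "2", "diabetes", "received", "metformin"]
tags   = ["B-POP",   "I-POP", "I-POP", "I-POP", "I-POP", "O",       "B-INT"]
print(decode_bio(tokens, tags))
```

The recovered (label, span) pairs are what a reviewer would match against their prespecified PICOS criteria.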

A CLINically useful topic model for patient experience questionnaires (8)
Marieke van Buchem, Olaf Neve, Erik Hensen and Hileen Boosman

Value-based Health Care (VBHC) aims to increase patient value by improving outcomes and lowering costs. Patient outcomes are a combination of health outcomes and patient experiences. The latter is often measured using structured questionnaires, called patient-reported experience measures (PREMs). Although structured questionnaires are easy to analyze and can lead to valuable insights, they narrow the scope of patient experiences covered. Many PREMs therefore include open-ended questions, giving patients the opportunity to share their unique perspective in more depth. Previous studies have shown that these open-ended questions can complement the structured PREMs, but manual analysis is time-intensive. Automated analysis would overcome this problem; however, patient experiences often contain multiple sentiments and topics, complicating current analyses. Natural language processing (NLP) could be a suitable technique to address both the identification of topics and the associated sentiments within patient responses to open-ended questions.

We created a new open-ended PREM suitable for analysis with NLP and built an NLP tool to analyze patients’ responses, aiming for a more efficient way to capture a broad spectrum of patient experiences. The combination of the questionnaire and the NLP tool is called the AI-PREM (artificial intelligence – patient-reported experience measure). The NLP tool consists of two models: a sentiment analysis model using a BERT classifier, and a topic model using Non-Negative Matrix Factorization (NMF). During our presentation we will present the results from the BERT classifier and the topic model and show how the AI-PREM leads to broader insights into the patient’s experience. Furthermore, we would like to share our preliminary work on visualizing the AI-PREM output to make it a valuable tool in a clinical setting.

ICD-10 extraction from medical notes using the ICD structure and question answering (9)
Sander Puts, Rithesh Sreenivasan, Karen Zegers, Inigo Bermejo and André Dekker

The International Statistical Classification of Diseases and Related Health Problems (ICD) system is a widely used diagnosis system maintained by the World Health Organisation (WHO). The ICD system contains codes organized into chapters, blocks and codes. ICD-10 provides optional sub-classifications for specificity regarding the cause, manifestation, location, severity, and type of injury or disease.

ICD-10 contains approximately 100,000 codes (sub-classifications included). Trained medical coders manually extract ICD codes from medical text for health statistics and reimbursement. The aim of this study is to develop a natural language processing (NLP) solution that automatically extracts ICD codes from medical text.

The proposed NLP approach makes use of the hierarchy and structure of the ICD system. During pre-processing, all ICD codes in the dictionary are labelled by chapter, block and optional sub-classifications. Chapter and block labels can be extracted from the code structure. The base code terms and sub-classifications are extracted from the code description using named entity recognition. In the second pre-processing step, the datasets are annotated with ICD-10 base codes and sub-classifications. During classification, sections of text are classified into chapters and blocks. Next, base codes are extracted from the clinical text using n-grams and word embeddings. A question-answering (QA) classifier is trained to determine the optional code sub-classifications. Finally, the results of the previous steps are combined to match the medical text to ICD-10 dictionary entries. Two publicly available datasets, MIMIC-IV (52,722 documents; English) and CodiEsp (1,000 documents; Spanish), are used to train and evaluate the approach.
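The final matching step can be pictured as a dictionary lookup that combines a recognised base term with a QA-predicted sub-classification. The miniature ICD dictionary below is a hand-made stand-in for the real one; the entries are illustrative only.

```python
# Hypothetical miniature ICD-10 dictionary; entries are illustrative only,
# not the system's actual pre-processed dictionary.
ICD_DICT = {
    "S52.3": {"chapter": "XIX", "block": "S50-S59",
              "base": "fracture of radius", "sub": "shaft"},
    "S52.5": {"chapter": "XIX", "block": "S50-S59",
              "base": "fracture of radius", "sub": "lower end"},
}

def match_code(base_term, sub_answer):
    """Combine a recognised base term with a QA-predicted sub-classification
    and look the pair up in the ICD dictionary."""
    for code, entry in ICD_DICT.items():
        if entry["base"] == base_term and entry["sub"] == sub_answer:
            return code
    return None

# In the real pipeline, the base term would come from n-gram/embedding
# matching over the clinical note, and the sub-classification from the
# question-answering classifier.
print(match_code("fracture of radius", "shaft"))  # S52.3
```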

The proposed approach is unique in that it combines various techniques, from rule-based methods to state-of-the-art transformer approaches for question answering. In this study, an NLP solution is proposed to automatically extract ICD-10 codes from clinical free text. At a later stage, an in-house dataset will be used to evaluate the performance on Dutch.

The Role of Domain Specific Language when Modelling Dutch Hospital Notes with Transformers (10)
Stella Verkijk and Piek Vossen

This master's thesis project involves creating a language model for Dutch medical text. It presents Dutch hospital notes as a specific domain worthy of its own language model, explains what should be considered when making model architecture choices, describes how to deal with privacy-sensitive training data and proposes a two-step method to anonymize language models, shows that a competitive domain-specific language model can be built with limited computational resources, and presents a new way of testing Dutch medical language models.

Experiments for the adaptation of Text2Picto to French (11)
Magali Norré, Vincent Vandeghinste, Pierrette Bouillon and Thomas François

The Dutch Text2Picto system (Sevens, 2018; Vandeghinste et al., 2015) aims to automatically translate text into pictographs for people with an intellectual disability in the context of Augmentative and Alternative Communication (AAC). AAC technologies help people with disabilities communicate in daily life and be more independent in their interactions with others, thereby improving social inclusion (Beukelman and Mirenda, 1998).

In the framework of the Able to Include European project, the Text2Picto tool has been designed for three source languages (Dutch, English, Spanish) and two target pictographic languages (Sclera, Beta). In this communication, we propose two main contributions: (1) adapting this system to French and (2) extending it to a third pictograph set, namely Arasaac, which is becoming increasingly popular within AAC. We describe how we automatically linked the pictographs and their metadata to synsets of two French WordNets (WOLF and WoNeF) so that the translation engine can use the semantic relations between concepts. In order to preprocess the source text in French, we have also adapted the shallow linguistic analysis carried out in Text2Picto to French. Finally, we evaluated our system, whose parameters were optimized using a hill climbing algorithm, on three corpora representing several use cases: social media, books for children, and medical communication between doctors and patients. The results of our experiments are in line with those of the Dutch Text2Picto for social media and of a similar text-to-picto system for French children’s books (Vaschalde et al., 2018). For the medical use case, we also carried out a manual evaluation.

As a result of our experiments, we present a fully-fledged system that is easily extensible to other languages, and we discuss the challenges we face when evaluating these technologies and adapting them to other use cases.

Literary translators and technological advances: combining perspectives from two different frameworks (12)
Joke Daems and Paola Ruffo

Recent advances in translation technology have led to an increased interest in its potential for the translation of literary texts. Attempts to storm the so-called “last bastion of human translation” (Toral & Way, 2014) are well under way, with research focusing on the application of MT to the translation of poetry and prose (Besacier and Schwartz 2015; Tezcan, Daems, and Macken 2019) and post-editing of MT output (Toral, Wieling, and Way 2018; Murchú 2019). Perhaps surprisingly, relatively little attention has been given to the guards of the bastion: the literary translators themselves. Seeing how attitudes can impact translators’ interactions with technology (Bundgaard 2017), it is crucial to hear the translators’ own voice in this debate.

In this poster presentation, we combine findings from two surveys that were designed to gain an understanding of literary translators’ awareness of and relationship with modern technology. The first survey adopted the Social Construction of Technology (SCOT) framework (Pinch and Bijker, 1984), and reached a diverse set of literary translators, leading to the identification of controversies related to technological innovation in their profession, and some ideas for rebalancing the relationship between materiality and immateriality in the field. The second survey used the Unified Theory of Acceptance and Use of Technology as a framework and focused specifically on literary translators working from or into Dutch. Interestingly, both surveys uncover similar reasons for literary translators (not) to use (translation) technology. They also identify comparable ways to improve translation technology for literary translators, as well as the relationship between practitioners and other social groups involved in the process of technological innovation.

It’s all in the eyes: an eye-tracking experiment to assess the readability of machine translated literature (13)
Toon Colman, Margot Fonteyne, Joke Daems and Lieve Macken

With the arrival of neural machine translation (NMT) systems, translation quality has improved enormously. Despite these quality improvements, remarkable differences can still be observed when comparing human and machine translations, especially for more creative text types such as literary texts. Webster et al. (2020) compared the modern Dutch human translations of four classic novels with their machine translated versions generated by Google Translate and DeepL and found not only that a large proportion of the machine translated sentences contained errors, but also that the NMT output showed a lower level of lexical richness and local cohesion than the human translations. The most frequent errors observed in their data set were mistranslations (37%), coherence (32%), and style & register (13%) errors. This top three corresponds to previous research by Tezcan et al. (2019) and Fonteyne et al. (2020), who both discussed the quality of Agatha Christie’s The Mysterious Affair at Styles, translated by Google’s neural machine translation system from English into Dutch. In this poster presentation, we report on the experimental design of an eye-tracking study in which participants read the full novel (Agatha Christie’s The Mysterious Affair at Styles) in Dutch, alternating between a machine translation (MT) and a human translation (HT). We aim to compare the reading process of participants reading both versions and analyse to what extent MT impacts the reading process. As a human annotator has marked and classified all errors in the machine translated version of the novel (Fonteyne et al. 2020), we will also be able to study which errors impact this reading process most. The data set expands the Ghent Eye-Tracking Corpus (Cop et al., 2017), which contains eye-tracking data of participants reading Agatha Christie’s novel in English and in its Dutch (human) translation.

A Novel Pipeline for Domain Detection and Selecting In-domain Sentences in Machine Translation Systems (14)
Javad Pourmostafa Roshan Sharami, Dimitar Shterionov and Pieter Spronck

General-domain corpora are becoming increasingly available for Machine Translation (MT) systems. However, achieving high translation quality for domain-specific MT requires training data that covers the same or comparable domains. Domain-specific corpora are often scarce and cannot be used in isolation to effectively train (domain-specific) MT systems.

This work aims to improve in-domain MT by (i) a novel unsupervised pipeline for identifying distributions of different domains within a corpus and (ii) a data selection technique that leverages in-domain monolingual or parallel data to select domain-specific sentences from general corpora according to the distribution defined in (i). To do so, either a list of domain-specific keywords or an external lexical resource is fed into the pipeline to identify similar input data within the general domain. Furthermore, the suggested pipeline can determine the target domain of any corpus. That is, MT practitioners can prepare their training data based on the target domain demanded by customers or industry. This approach is effective not only for identifying frequent domain words in the corpus for domain adaptation (DA) tasks, but also for giving MT practitioners insight into their data (an informative feature).

The main idea of this work is related to Topic Modeling (TM) in the sense that a sentence is a distribution over hidden topics, and a topic is a distribution over words. Therefore, there is a high probability that similar sentences contain similar words. In this way, we can select in-domain sentences if their top n words match the top words of the general corpora. Our pipeline encapsulates several modules, such as TM, sentence embedding, dimensionality reduction, clustering, domain detection, post-processing, and a matching function. To test the effectiveness of our approach, the proposed method is applied to an English–French corpus, fitted and evaluated in the context of DA, aiming to address the lack of in-domain data. Our empirical evaluation shows that more training data is not always better, and that the best results are attainable via proper domain-relevant data selection.
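A minimal sketch of the top-word matching idea, using raw frequency counts in place of the pipeline's topic models, embeddings and clustering. All sentences, the vocabulary size and the overlap threshold are invented toy values.

```python
from collections import Counter

def top_words(sentences, n=5):
    """Most frequent words across a list of in-domain sentences."""
    counts = Counter(w for s in sentences for w in s.lower().split())
    return {w for w, _ in counts.most_common(n)}

def select_in_domain(general, domain_vocab, min_overlap=1):
    """Keep general-corpus sentences whose words overlap the domain vocabulary
    at least min_overlap times."""
    selected = []
    for sentence in general:
        words = set(sentence.lower().split())
        if len(words & domain_vocab) >= min_overlap:
            selected.append(sentence)
    return selected

# Toy in-domain seed sentences (medical) and a toy general corpus.
domain = top_words(["the patient received a dose of the drug",
                    "drug trials measure patient outcomes"])
general = ["the drug improved patient survival",
           "the striker scored twice last night"]
print(select_in_domain(general, domain, min_overlap=2))
```

A real implementation would filter stopwords and compare topic distributions rather than raw counts, but the selection criterion has the same shape.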

The SignON project: a Sign Language Translation Framework (15)
Dimitar Shterionov, Vincent Vandeghinste, Horacio Saggion, Josep Blat, Mathieu De Coster, Joni Dambre, Henk Van den Heuvel, Irene Murtagh, Lorraine Leeson and Ineke Schuurman

Any communication barrier is detrimental to society. In order to reduce such barriers in the communication between the deaf and hard-of-hearing (D/HoH) community and the hearing community, the SignON project researches machine translation to translate between sign and non-sign languages. SignON is a Horizon 2020 project funded by the European Commission, which commenced 01.01.2021 and lasts until 31.12.2023. Within the SignON project, we develop a free and open-source framework for translation between sign language video input, verbal audio input and text input and sign language avatar output, verbal audio output or text output. Such a framework includes the following components: (1) input recognition components, (2) a common representation and translation component, and (3) output generation components.

(1) The input side can consist of video containing a message in a sign language, in which case the meaning of the message in this specific sign language (Irish, British, Dutch, Flemish or Spanish Sign Language) needs to be recognized. Another input modality could be speech or text (English, Irish, Dutch or Spanish).
(2) We will use a common representation for mapping of video, audio and text into a unified space that will be used for translating into the target modality and language. This representation will serve as the input for the output generation component.
(3) The output component is concerned with delivering the output message to the user. In the simplest case the output is plain text; but it could also be speech, in which case a commercial text-to-speech (TTS) system will be used, or it could be that the requested output should be signed in one of the specified sign languages. In that case, the message will first be translated into a computational, formal representation of that specific sign language (Sign_A), which will then be converted into a series of behavioral markup language (BML) commands to steer the animation and rendering of a virtual signer (aka avatar).

SignON will incorporate machine learning capabilities that will allow (i) learning new sign, written and spoken languages; (ii) style-, domain- and user-adaptation and (iii) automatic error correction, based on user feedback.

The SignON framework will be distributed on the cloud where the computationally intensive tasks will be executed. A light-weight mobile app will interface with the SignON framework to allow communication between signers and non-signers through common mobile devices.
During the development of the SignON application, collaboration with end user focus groups (consisting of deaf, hard of hearing and hearing people) and an iterative approach to development will ensure that the application and the service meet the expectations of the end users.

Using open-source automatic speech recognition for the annotation of Dutch infant-directed speech (16)
Anika van der Klis, Frans Adriaans, Mengru Han and Rene Kager

Infant-directed speech (IDS) is a highly variable speech register used by adults when addressing infants. To advance our understanding of IDS, we must study it across many languages and speakers. Manually transcribing speech is time-consuming work; it is therefore essential to develop tools that automate this process. This study assesses the performance of a state-of-the-art automatic speech recognition (ASR) system at extracting target words in IDS. Twenty-one Dutch mothers read a picture book designed to elicit target words to their 18-month-old infants and to the experimenter, and again six months later. We compared the automatic annotations of the target words to manual annotations. The results indicate that IDS and a higher mean pitch negatively affect recognition performance. This constitutes evidence that new tools need to be developed specifically for the automatic annotation of IDS. Nevertheless, the existing ASR system can already find more than half of the target words, which is highly promising.
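Comparing the automatic and manual annotations comes down to measuring how many of the elicited target words the ASR output recovers. A toy sketch with invented Dutch target words and a made-up transcript:

```python
def target_word_recall(target_words, asr_transcript):
    """Fraction of elicited target words that appear in the ASR output."""
    recognised = set(asr_transcript.lower().split())
    found = [w for w in target_words if w.lower() in recognised]
    return len(found) / len(target_words)

# Invented target words and ASR output, not data from the actual study.
targets = ["poes", "banaan", "fiets", "appel"]
transcript = "kijk een poes en daar een fiets"
print(target_word_recall(targets, transcript))  # 0.5
```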

Creating a Parliamentary Corpus of the EU using Automatic Speech Recognition (17)
Hugo de Vos and Suzan Verberne

Many meetings of the EU parliament are not available in transcript but only as audio/video. This makes it hard and labor-intensive for political researchers to study those meetings, as a lot of time is spent listening to long recordings.

In order to facilitate (political) research with this unexplored data source, we created a pipeline with speaker diarization and Automatic Speech Recognition (ASR) to automatically transcribe these meetings. The purpose of speaker diarization is to split the audio file into segments according to who is speaking. For speaker diarization we used the pyannote package (Bredin et al., 2020) and for ASR we used the novel Wav2Vec2 model (Baevski et al., 2020). We processed 202 files from the LIBE committee meetings, constituting around 600 hours of recording. The ASR output amounts to a total of about 3 million words of transcribed text.
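Combining the two components amounts to assigning each ASR-decoded word to the diarization segment it falls in. The sketch below uses invented speaker labels and timings rather than real pyannote or Wav2Vec2 output.

```python
def assign_speakers(words, segments):
    """Attach a speaker label to each (word, start, end) triple by picking
    the diarization segment that contains the word's midpoint."""
    labelled = []
    for word, start, end in words:
        mid = (start + end) / 2
        speaker = next((spk for spk, s, e in segments if s <= mid < e), "unknown")
        labelled.append((speaker, word))
    return labelled

# Invented diarization segments (speaker, start, end) and word timings;
# in the real pipeline these come from pyannote and the Wav2Vec2 decoder.
segments = [("SPK_1", 0.0, 2.0), ("SPK_2", 2.0, 4.5)]
words = [("good", 0.1, 0.4), ("morning", 0.5, 1.0),
         ("thank", 2.1, 2.4), ("you", 2.5, 2.7)]
print(assign_speakers(words, segments))
```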

An analysis of the output shows that there is room for improvement, especially in the transcription of domain-specific terminology and names. Our current work therefore involves further training of the models to adapt to the domain as well as post-processing steps to correct mistakes.

In our poster presentation, we will give an overview of the generated dataset and provide insights into its content using topic modeling and error analysis. The eventual goal is to create a rich dataset in which the transcriptions are aligned with metadata such as minutes and agendas.

A Hybrid ASR system for Southern Dutch (18)
Bob Van Dyck, Bagher Babaali and Dirk Van Compernolle

Classical hybrid models for automatic speech recognition were recently outperformed by end-to-end models on popular benchmarks such as LibriSpeech. However, hybrid systems can prevail in many real-life situations due to independent training, optimization and tuning of acoustic models and language models. This is generally the case with limited resources (a few hundred or fewer hours of data) or significant language model mismatch between train and test data.

In this work, we implemented a state-of-the-art hybrid system for Southern Dutch. The acoustic model is an HMM-TDNN trained with a rather standard Kaldi recipe on 155 hours of the Corpus Gesproken Nederlands (CGN). As reference language models, we reused models developed during the N-Best project. We further investigated the effect of language model order and size on WER for various test sets (held-out data from CGN, N-Best dev and test sets). The best results, 12.85% WER on the N-Best test set, are obtained with a 400k lexicon and a 4-gram language model (with 231M parameters). For these amounts of data and these test sets, this hybrid system outperforms our older HMM-GMM N-Best system by over 40% and our current best end-to-end system by some 15%. Pruning away 90% of the LM parameters yields a compact model suitable for small-scale real-time apps while taking only a 10% relative hit on performance.

A Deep Learning Model for Arabic Text-to-Speech System (19)
A. Mutawa

Text-To-Speech (TTS) is software that transforms text into waveforms. In terms of consistency and naturalness, TTS systems are steadily improving, but TTS research for Arabic remains challenging. This paper explores a pronunciation-based TTS method for Arabic. Because Arabic is one of the most widely spoken languages in the world, its pronunciation varies geographically. We apply an end-to-end deep learning approach using Tacotron, as well as a pretrained network to train the model. The dataset contains 2.41 hours of recorded speech with all sentences diacritized. The pronunciation errors of the two models are assessed through a subjective evaluation by native speakers.

Wizard of Oz (WOZ) experiments to collect customer service dialogues for fine-grained emotion detection (20)
Sofie Labat, Thomas Demeester and Véronique Hoste

Customer service is increasingly taking place in online environments, such as social media or private chat channels. In these settings, the automatic detection of fine-grained emotion trajectories can (i) help human operators in their daily tasks, (ii) assist in automatically modelling the quality of service or general customer satisfaction, and (iii) form a key component in the development of automated customer service chatbots. Many small and medium-sized enterprises (SMEs), however, do not have the data needed to train such systems. To this end, we investigate how Wizard of Oz techniques can be leveraged to create an open-source customer service dialogue corpus. During such experiments, participants believe they are talking to an automated chatbot, which in fact is a human operator pretending to be a conversational agent. We also collect profile data on the participants (age, gender, personality) with the aim of modelling interpersonal emotional variance in future research.

In this pilot study, we conducted experiments with 16 participants who each had 12 text-based conversations. Each of these conversations is grounded in a predefined event description linked to a commercial sector (i.e., e-commerce, telecommunication, tourism) and a sentiment trajectory (i.e., positive to negative, negative to positive, neutral to positive, neutral to negative). The resulting participant turns are annotated with emotion labels by both the participants and the experiment leader, while the latter also provides additional valence, arousal, and dominance scores. The resulting corpus contains 192 conversations and 3089 turns. Analysis indicates that 42.6% of the participant turns have a neutral sentiment, while 38.6% contain a clear negative sentiment, with annoyance and disapproval as the most frequent emotions. The remaining 18.7% of participant turns are positive, with gratitude as the most frequent emotion. In future research, we will demonstrate the usability of our dataset by training a machine learning classifier on it.

A Spoken Dutch Conversational Agent for Stimulating Self-management of Happiness (21)
Jelte van Waterschoot and Iris Hendrickx

We present our experiences and challenges in developing a conversational agent who talks to people in Dutch about their daily life and happiness. We briefly discuss why we are interested in agents who talk with people about their happiness. Next we give an overview of the research conducted so far and our concrete plans for the future.

15:15 – 15:30 Coffee Break
15:30 – 16:45 Natural Language Inference – Room A (Chair: Iris Hendrickx)
15:30 – 15:55 SICK-NL: A Dataset for Dutch Natural Language Inference
Gijs Wijnholds and Michael Moortgat

We present SICK-NL (read: signal), a dataset targeting Natural Language Inference in Dutch. SICK-NL is obtained by translating the SICK dataset of Marelli et al. (2014) from English into Dutch. Having a parallel inference dataset allows us to compare monolingual and multilingual NLP models for English and Dutch on the two tasks. In the paper, we motivate and detail the translation process and perform a baseline evaluation on both the original SICK dataset and its Dutch incarnation SICK-NL, using Dutch skipgram embeddings and contextualised embedding models. In addition, we use two phenomena encountered in the translation to formulate stress tests and verify how well the Dutch models capture syntactic restructurings that do not affect semantics. Our main finding is that all models perform worse on SICK-NL than on SICK, indicating that the Dutch dataset is more challenging than the English original. Results on the stress tests show that the models do not fully capture word order freedom in Dutch, warranting future systematic studies.

15:55 – 16:20 Does Logic-based Reasoning Work for Natural Language Inference in Dutch?
Lasha Abzianidze and Konstantinos Kogkalidis

We compose a pair of syntactic parsers, in the form of Typelogical Grammar provers, with a Natural Logic Tableau theorem prover to create the first dedicated logic-based Natural Language Inference system for Dutch.
The system predicts inferences based on the formal proofs that are completely transparent from an explanation point of view.

Its evaluation on the recently translated Dutch Natural Language Inference dataset shows promising results for a logic-based system, remaining within a 4% performance margin of a strong neural baseline.
Additionally, during training, the theorem prover learns lexical relations that are new to the Dutch WordNet. We also found aspects in which the Dutch dataset is more challenging than the original English one.

16:20 – 16:45 N/A
15:30 – 16:45 Language Resources – Room B (Chair: Ayla Rigouts Terryn)
15:30 – 15:55 DALC: The Dutch Abusive Language Corpus
Tommaso Caselli, Arjan Schelhaas, Marieke Weultjes, Folkert Leistra, Hylke van der Veen, Gerben Timmerman and Malvina Nissim

The Dutch Abusive Language Corpus (DALC) is the first corpus of tweets for abusive language detection in Dutch. It has been created by applying three data collection methods that attempt to minimize developers’ bias and it comprises 8,156 tweets which have been manually annotated. The annotation adopts a multi-layer scheme distinguishing the explicitness of the message and the target. Preliminary experiments using different architectures on the classification of the explicitness layer yield good results (best score 0.76 macro-F1 in a binary setting and 0.56 macro-F1 in a three-way setting).

15:55 – 16:20 Common Language Resources Infrastructure in Flanders: new prospects
Vincent Vandeghinste, Els Lefever, Walter Daelemans, Tim Van de Cruys and Sally Chambers

We describe the creation of CLARIN Belgium (CLARIN-BE) and, associated with that, the plans of the CLARIN-VL consortium within the CLARIAH-VL infrastructure for which funding was secured for the period 2021-2025.

16:20 – 16:45 An annotation tool to create a Dutch FrameNet with rich conceptual and referential annotations
Levi Remijnse, Marten Postma, Sam Titarsolej and Piek Vossen

We present the Dutch FrameNet annotation tool. This tool offers the functionality of combining FrameNet annotation with coreferential annotation within and across documents. The resulting annotation scheme displays variation in framing of structured data about a particular real-world event. Furthermore, the tool accommodates annotation of Dutch corpora. We implemented markable correction of Dutch endocentric compounds to enable frame annotation of their constituents. In preprocessing the texts for the tool, dependencies between the separate components of Dutch phrasal verbs are created, making it possible to annotate those components as one token. All Dutch annotations in the tool contribute to the development of a Dutch FrameNet lexicon. The coreferential corpora and corresponding structured data are acquired using data-to-text methods.

15:30 – 16:45 Language Modelling – Room C (Chair: Bram Vanroy)
15:30 – 15:55 Fast and accurate dependency parsing for Dutch and German
Daniël de Kok and Patricia Beer

Statistical dependency parsers that use transformer-based (Vaswani et al. 2017) pretrained language models such as BERT (Devlin et al. 2019) and XLM-R (Conneau et al. 2020) show large improvements (Wu and Dredze 2019, Kondratyuk and Straka 2019, He and Choi 2019) over prior models. However, such deep transformer models are large and relatively inefficient, making them unfit for constrained environments such as mobile devices and CPU prediction on cloud virtual machines. In this work, we explore distillation (Hinton et al. 2015) as a means to compress XLM-R-based syntax annotators for Dutch and German. We start with a finetuned XLM-R model as the baseline, showing that models can be made up to 10 times smaller and 4 times faster with a loss of at most 1.1 labeled attachment score (LAS) points.

15:55 – 16:20 RobBERTje: a Distilled Dutch BERT Model
Pieter Delobelle, Thomas Winters and Bettina Berendt

Pre-trained large-scale language models like BERT have gained a lot of attention thanks to their outstanding performance on a wide range of natural language tasks. However, due to their large number of parameters, they are resource-intensive both to deploy and to finetune. As such, researchers have created several methods for distilling language models into smaller ones to increase efficiency, with a small performance trade-off. In this paper, we create a distilled version of the state-of-the-art Dutch RobBERT model, and call it RobBERTje. We found that while shuffled datasets perform better, concatenating sequences from the non-shuffled dataset before shuffling, in order to improve the sequence length distribution, further improves performance on tasks with longer sequences and makes the model faster to distill. Upon comparing architectures, we found that the larger DistilBERT architecture worked significantly better than the BORT hyperparametrization. Since smaller architectures decrease the time to finetune, these models allow for more efficient training and more lightweight deployment of many Dutch downstream language tasks.
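The concatenation idea can be sketched as greedily merging consecutive short sequences up to a target length before the corpus is shuffled. The token lists and target length below are toy values, not the actual RobBERTje preprocessing.

```python
def concatenate_short(sequences, target_len):
    """Greedily merge consecutive sequences (from the non-shuffled corpus)
    until each chunk reaches target_len tokens; shuffling happens afterwards."""
    chunks, current = [], []
    for seq in sequences:
        current.extend(seq)
        if len(current) >= target_len:
            chunks.append(current)
            current = []
    if current:  # keep any leftover tokens as a final short chunk
        chunks.append(current)
    return chunks

# Toy "documents" of tokens; real inputs would be tokenized corpus segments.
docs = [["a", "b"], ["c"], ["d", "e", "f"], ["g"]]
print(concatenate_short(docs, target_len=3))
```

Merging before shuffling preserves natural adjacency between short fragments, which is what shifts the length distribution toward longer training sequences.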

16:20 – 16:45 Probing for Dutch Relative Pronoun Choice
Gosse Bouma

We propose a linguistically motivated version of the relative pronoun probing task for Dutch (where a model has to predict whether a masked token is either ‘die’ or ‘dat’), collect realistic data for it using a parsed corpus, and probe the performance of four context-sensitive BERT-based neural language models. The task turns out to be much harder than the original version, which simply masked all occurrences of ‘die’ and ‘dat’. Models differ considerably in their performance, but a monolingual model trained on a heterogeneous corpus appears to be most robust.

16:45 – 17:45 Poster Session II – Poster Room II
Recurrence quantification analysis for measuring coherence of dream descriptions (1)
Katarina Laken and Iris Hendrickx

Several scholars have suggested approaching language from a dynamic complex systems perspective (see for example Cameron and Larsen-Freeman (2007) and Tolston, Riley, Mancuso, Finomore, and Funke (2019)). A dynamic complex system is a system with many components that all change over time and influence each other. This research uses recurrence quantification analysis (RQA) to analyse dream descriptions as time series generated by a complex system. Recurrent points are defined as points in the recurrence matrix where the cosine similarity of the word embeddings of the words on the x- and the y-axis is over 0.20. I considered the metrics recurrence rate, determinism, laminarity and entropy. Entropy was the strongest predictor of human judgments of the coherence of a text, while recurrence rate was the best predictor of human ratings of bizarreness. Laminarity did not seem to predict human judgments on any of the variables examined (bizarreness, logic, and coherence). Thus, these metrics are promising but need to be perfected in order to be useful for practical and research purposes. These results indicate that viewing linguistic data through the lens of complex systems theory is indeed fruitful, opening a whole new perspective on the analysis of language.
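The recurrence computation described here can be sketched as follows (an illustrative reconstruction, not the authors' code; the toy embeddings are invented):

```python
def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

def recurrence_matrix(embeddings, threshold=0.20):
    """Binary recurrence matrix: 1 where the embeddings of word i and
    word j have cosine similarity above the threshold."""
    n = len(embeddings)
    return [[1 if cosine(embeddings[i], embeddings[j]) > threshold else 0
             for j in range(n)] for i in range(n)]

def recurrence_rate(matrix):
    """Fraction of recurrent points, excluding the main diagonal."""
    n = len(matrix)
    off = sum(matrix[i][j] for i in range(n) for j in range(n) if i != j)
    return off / (n * (n - 1)) if n > 1 else 0.0
```

Determinism, laminarity and entropy are computed from the diagonal and vertical line structures of the same matrix.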

Does it make sense? Analyzing coherence in longer fictional discourse on a syntactic and semantic level (2)
Laura Scheuter and Judith van Stegeren

Discourse coherence has been a topic of research for years, yet it remains largely unexplored for longer discourse. We therefore raise the question: how can longer fictional discourse of at least 50,000 words be assessed in terms of coherence?

To analyze the coherence of long texts, we make a distinction between syntactic coherence (concerning grammar) and semantic coherence (concerning the meaning of the text). Using Barzilay and Lapata’s Entity Grid model [1] for the assessment of syntactic coherence, we create a feature vector per document representing the probabilities of the sentence-to-sentence transitions of the syntactic roles of the available entities. For the assessment of semantic coherence, we use Global Vectors (GloVe) [2] to create a vector representation of a document. We then compute the semantic similarity of adjacent sentences using the cosine similarity score. Both the feature vector and the cosine similarity score are then the input to two models: a logistic regression model and a random forest. We evaluate coherence in three ways: using only the feature vectors, using only the similarity scores, and finally using both the feature vectors and the similarity scores.

We apply our coherence analysis to three datasets, starting with the Grammarly Corpus of Discourse Coherence (GCDC) [3]. This dataset is rated in terms of coherence on a 3-point scale but consists of only ~9 sentences per document. Thereafter, we extend our method to longer fictional discourse, using generated books as a ground truth for low-coherence discourse and the Harry Potter books by J.K. Rowling as a ground truth for highly coherent discourse. We start by assessing the coherence of snippets of ~9 sentences, to resemble the GCDC data, and then expand to increasingly longer snippets of the generated books and the Harry Potter books.
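The semantic-coherence component described above can be sketched as follows (illustrative only; it assumes pretrained word vectors supplied as a plain dictionary, and the toy vectors used in testing are invented):

```python
def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

def sentence_vector(tokens, vectors):
    """Average the word vectors of the tokens that have an embedding."""
    vecs = [vectors[t] for t in tokens if t in vectors]
    if not vecs:
        return None
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

def semantic_coherence(sentences, vectors):
    """Mean cosine similarity of adjacent sentence vectors."""
    svecs = [sentence_vector(s, vectors) for s in sentences]
    sims = [cosine(a, b) for a, b in zip(svecs, svecs[1:])
            if a is not None and b is not None]
    return sum(sims) / len(sims) if sims else 0.0
```

A document whose adjacent sentences drift between unrelated topics will score lower than one whose sentences stay semantically close.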

Neural Coreference Resolution for Dutch Parliamentary Meetings (3)
Ruben van Heusden, Maarten Marx and Jaap Kamps

The task of coreference resolution revolves around linking words and phrases describing real-world entities to their correct antecedents in text. This task is an integral part of Natural Language Processing, as it encapsulates a variety of aspects of the field, such as Named Entity Recognition, reading comprehension and the need for external knowledge. In this paper we evaluate the current state-of-the-art models for coreference resolution on a newly annotated dataset of Dutch Parliamentary meetings. We expand the annotation scheme of (Schoen et al., 2014) for Dutch to also include references to treaties, documents and laws, and introduce a new entity type specific to this dataset, the ‘parliamentary speech acts’. We perform an error analysis and demonstrate that the current state-of-the-art neural e2e model for Dutch (Kuppevelt and Attema, 2019) has difficulty capturing the characteristics of this dataset, even after fine-tuning on the parliamentary data. Furthermore, we propose several slight alterations to the e2e model regarding the addition of external knowledge to further improve its performance on the parliamentary dataset, in particular the resolution of pronouns and the correct clustering of parliamentary speech acts.

OpenBoek: A Corpus of Coreference and Entities in Dutch Literature (4)
Frank van den Berg, Esther Ploeger, Menno Robben, Pauline Schomaker, Robin Snoek, Remi Thüss and Andreas van Cranenburgh

The literary domain presents unique challenges for NLP tasks such as coreference resolution. Addressing these challenges requires annotated data of sufficient quantity and quality for training and evaluating models. Recent work introduced coreference datasets for classic English novels (Litbank; Bamman et al., 2020) and contemporary Dutch novels (RiddleCoref; van Cranenburgh, 2019). Unfortunately, RiddleCoref is encumbered by copyright; i.e., the annotated texts cannot be made available.

We address this by annotating a corpus of Dutch public domain novels which we will release under an open license. The OpenBoek corpus currently consists of 9 fragments of Project Gutenberg texts, both translated and original Dutch novels. We annotated the full text of the novellas by Nescio, while the other fragments are initial segments of 10k tokens. The documents are therefore considerably longer than those in typical coreference corpora (SoNaR-1: 1k, Litbank: 2k). This fragment length was chosen with the aim of evaluating and tackling the particular challenges of long-document coreference resolution.

Annotation proceeded with the same semi-automatic method and annotation guidelines as RiddleCoref: the texts were parsed by Alpino and the coreference output of dutchcoref (van Cranenburgh, 2019) was manually corrected by multiple annotators. Mentions are manually corrected and exclude non-referring expressions.

In addition to coreference, we annotated the gender (neuter, female, male, gendered but mixed/unknown) and number (singular, plural) of all entities in RiddleCoref and OpenBoek. The gender attribute also distinguishes person and non-person entities. Although the syntactic gender and number available in Alpino parse trees are already informative, we annotated semantic gender and number; e.g., the grammatically neuter ‘het meisje’ (‘the girl’) is annotated as female, and the singular ‘de groep’ (‘the group’) is annotated as plural because it is a collective noun. These annotations are useful for training models to detect mention features for coreference resolution, among other possible applications.

Detecting Factually Erroneous References in Abstractive Summarisation Using Coreference Resolution (5)
Thomas Rood

Abstractive summarization algorithms are prone to generating summaries that are factually inconsistent with respect to the source text. In response to this problem, recent work has introduced automatic metrics for evaluating the factual consistency of summaries. However, the proposed top-down machine-learning-based approaches are limited in the inter-annotator agreement on their validation data and in their explainability. In addition, the extent to which these models cover all types of factual errors is currently unknown. In this exploratory work, we propose a bottom-up detection algorithm that focuses on detecting a single type of error, erroneous references, to complement these top-down approaches, with the goal of constituting a simple, explainable and reliable module for detecting this error type. The proposed approach, based on the novel idea of comparing coreference chains between source and summary text, showed moderate performance on validation data and poor performance on test data. The limited performance was attributed to sub-optimal mention linking between source and summary texts, coreference annotation issues and limitations in the evaluation. Despite the lack of promising quantitative results, the approach yielded valuable insights into the characteristics of the erroneous references and their possible causes, and opens up new avenues for future research.

Sensitivity of long short-term memory networks to animacy in relative pronoun garden path sentences (6)
Floris Cos and Stefan Frank

In recent years, surprisal values estimated by long short-term memory networks (LSTMs) trained on non-annotated text have repeatedly been shown to mirror human reading times on several types of garden path sentences. Even though surprisal on its own exhibits a weak effect compared to reading times (Van Schijndel & Linzen, in press), the models have been found to be sensitive to these syntactic structures (Van Schijndel & Linzen 2018) as well as thematic (Frank & Hoeks, 2019) and even pragmatic information (Davis & Van Schijndel, 2020).

The current study continues this trend for a yet untested type of Dutch garden path sentence with relative pronouns. An example is ‘Gisteren is de voetballer die de supporters afgeweerd hebben, …’ (‘Yesterday the football player, who was warded off by the supporters…’). When the relative clause starts, ‘de voetballer’ (‘the football player’) is ambiguous between subject and object of the subordinate clause. This is disambiguated by number information at the finite verb (‘hebben’). One LSTM was trained and then tested on items from Mak et al. (2002; 2006). Mak et al. (2006) found that humans prefer subject relative clauses and animate subjects. If these variables conflict, choosing a structural interpretation is delayed until the disambiguating auxiliary is reached.

The LSTM’s surprisal values were lower for subject-relative than for object-relative clauses, mirroring the human preference. An effect of animacy on surprisal was found only on the items of Mak et al. (2006), albeit subordinate to a preference for subject relative clauses. This means that thematic information such as animacy is not guaranteed to be processed in the same way by LSTMs and humans. However, the model was sensitive to number information on nouns and verbs, and to syntactic structure in general.
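The surprisal measure underlying these results is simply the negative log probability a language model assigns to each word in its context. A minimal sketch (the probabilities below are invented for illustration, not actual model output):

```python
import math

def surprisals(token_probs):
    """Per-token surprisal in bits, -log2 P(token | preceding context),
    for a list of (token, probability) pairs from a language model."""
    return [(tok, -math.log2(p)) for tok, p in token_probs]

# Invented probabilities: the disambiguating auxiliary is unexpected
# under the preferred parse, so its surprisal spikes.
toy = [("de", 0.20), ("supporters", 0.05),
       ("afgeweerd", 0.02), ("hebben", 0.004)]
peak = max(surprisals(toy), key=lambda pair: pair[1])
```

In reading-time studies, such a surprisal spike at the disambiguating word is taken as the model's analogue of the human garden-path effect.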

Can BERT Dig It? – Named Entity Recognition for Information Retrieval in the Archaeology Domain (7)
Alex Brandsen, Suzan Verberne, Karsten Lambers and Milco Wansleeben

The amount of archaeological literature is growing rapidly. Until recently, these data were only accessible through metadata search. We implemented a text retrieval engine for a large archaeological text collection (~658 million words). In archaeological IR, domain-specific entities such as locations, time periods, and artefacts play a central role. This motivated the development of a named entity recognition (NER) model to annotate the full collection with archaeological named entities.

In this paper, we present ArcheoBERTje, a BERT model pre-trained on Dutch archaeological texts. We compare the model’s quality and output on a Named Entity Recognition task to a generic multilingual model and a generic Dutch model. We also investigate ensemble methods for combining multiple BERT models, and combining the best BERT model with a domain thesaurus using Conditional Random Fields (CRF).

We find that ArcheoBERTje significantly outperforms both the multilingual and the Dutch model, with a smaller standard deviation between runs, reaching an average F1 score of 0.735. The model also outperforms ensemble methods combining the three models. Combining ArcheoBERTje predictions with explicit domain knowledge from the thesaurus did not increase the F1 score. We quantitatively and qualitatively analyse the differences between the vocabulary and output of the BERT models on the full collection and provide some valuable insights into the effect of fine-tuning for specific domains.

Our results indicate that for a highly specific text domain such as archaeology, further pre-training on domain-specific data increases the model’s quality on NER by a much larger margin than shown for other domains in the literature, and that domain-specific pre-training makes the addition of domain knowledge from a thesaurus unnecessary.

Generalised Minimalist Grammars (8)
Meaghan Fowlie

Minimalist Grammars, Stabler’s (1997) formalisation of Chomsky’s (1995) Minimalist Program, are a mildly context-sensitive grammar formalism. While Minimalist Grammars (MGs) have been in use for over two decades, with grammars generating strings, trees, and graphs, no unifying definition for MGs has yet been established: what qualifies these as MGs? What else could be an MG? What links their generative capacities?

To address these questions, I define Generalised Minimalist Grammars. I build on the “two-step” view of MGs, in which we consider separately the feature calculus, which determines which operations are defined when, and the algebra, which defines what those operations are and what they output. To define MGs in the general case, I describe constraints on the feature calculus and its mapping to what I call Generalised Minimalist Algebras (GMAs). I define GMAs to generalise the principles of Merge and Move, which are at the core of Minimalism, to arbitrary objects. In this way MGs can be defined to generate objects of any type, as long as they are built with an algebra that conforms to the definition of GMAs.

Results include:

1. A unifying definition for Minimalist Grammars, including a straightforward way to define synchronous MGs

2. For algebra objects with string yields, such as trees, a generalised MG is mildly context-sensitive iff certain internal homomorphisms and its string yield function are linear and non-deleting.

3. Existing algebras from other formalisms can be used in GMAs to combine their insights. For instance, Dutch crossing dependencies are elegantly explained by Tree Adjoining Grammars; GMAs provide a simple way to combine TAGs with MGs, yielding a grammar that elegantly explains both crossing dependencies and Verb-Second phenomena.

WN-BERT: integrating WordNet and BERT for lexical semantics in Natural Language Understanding (9)
Mohamed Barbouch, Suzan Verberne and Tessa Verhoef

We propose an integration of BERT and WordNet for improving natural language understanding (NLU). Although BERT has shown its superiority in several NLU tasks, its performance seems to be limited for higher-level tasks involving abstraction and inference. We argue that the model’s implicit learning in context is not sufficient to infer the required relationships at this level. We represent the semantic knowledge from WordNet as embeddings using path2vec and wnet2vec, and integrate this with BERT following two different strategies: external combination, using a top multi-layer perceptron, and internal inclusion, inspired by VGCN-BERT. We evaluate the performance on three sentiment analysis and sentence similarity datasets: SST-2, the GLUE version of SST-2, and STS-B. We found that our combined model does not outperform the state of the art for SST-2 and STS-B; limitations related to WordNet coverage and word sense disambiguation could have affected this. However, we achieved a slightly better accuracy on the GLUE version of SST-2, and found instances in the data where WordNet has a positive impact. Furthermore, analysing attention heads has shown WordNet and BERT embeddings to be mutually influential, which could facilitate direct improvements in subsequent work.

Breaking BERT: Understanding its Vulnerabilities for Named Entity Recognition (10)
Anne Dirkson, Suzan Verberne and Wessel Kraaij

Both generic and domain-specific BERT models are widely used for natural language processing (NLP) tasks. In this paper we investigate the vulnerability of BERT models to variation in input data for Named Entity Recognition (NER) and perform an extensive analysis of which changes BERT models are most vulnerable to. In order to do so, we propose a method for adversarial attack on BERT-based NER models. We also assess to what extent adversarial training can make NER models less vulnerable. Experimental results show that both the original and the domain-specific BERT models are highly vulnerable to entity replacement: they can be fooled into mislabeling previously correct entities in 89.2% to 99.4% of cases when an entity is replaced with one of the same type. Specifically, BERT models are easily fooled by replacing the entity with a less frequent one. BERT models are also vulnerable to variation in the entity context: 20.2% to 45.0% of entities are predicted completely wrong when context words are changed, and another 29.3% to 53.3% of entities are predicted partially wrong. Often only a single change is necessary to fool the model. BERT models seem most vulnerable to changes in the local context of entities. Of the two domain-specific BERT models, the vulnerability of BioBERT is comparable to that of the original BERT model, whereas SciBERT is even more vulnerable. To analyze how specific these problems are to BERT, we also applied the adversarial attack to feature-based Conditional Random Fields (CRF) models. We found that CRF models are equally vulnerable to entity replacement but often less vulnerable to variation in entity context. Adversarial training did not manage to reduce the vulnerability of BERT models. Our results chart the vulnerabilities of BERT models for NER and emphasize the importance of further research into uncovering and reducing these weaknesses.
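The entity-replacement attack can be sketched as follows (an illustrative reconstruction, not the authors' implementation; the example span, candidate entities and frequency table are invented):

```python
def entity_replacement_attack(tokens, entity_span, candidates, freq):
    """Replace the entity at tokens[start:end] with the least frequent
    same-type candidate, the kind of substitution the abstract reports
    NER models are most easily fooled by."""
    start, end = entity_span
    replacement = min(candidates, key=lambda ent: freq.get(ent, 0))
    return tokens[:start] + replacement.split() + tokens[end:]
```

The perturbed sentence is then fed back to the NER model; if the label for the replaced entity changes, the attack has succeeded.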

An Event-Driven Multilingual Football Corpus for Detecting Perspectives and Polarity (11)
Albert Jan Schelhaas, Gosse Minnema, Tommaso Caselli and Malvina Nissim

We describe an ongoing project to automatically extract perspectives and polarity from textual data. We pick football matches as an ideal test bed to investigate these phenomena, since they allow us to study the relationship between quasi-objective event data and narrative news text. We introduce a dataset consisting of structured and highly detailed match event data from recent UEFA Champions League matches, along with a linked corpus of newspaper articles in Dutch, German, and English reporting on these matches. Based on a manual pilot annotation study, we propose an annotation scheme for labeling perspective-taking (which team has the focus of attention of the narrative?) and polarity (in which team’s favour is the narrative written?) on the level of single subevents such as goals, fouls, and penalties. Finally, we outline possible strategies for computationally addressing this problem.

Aspect-based Sentiment Analysis for German: Analyzing ‘Talk of Literature’ Surrounding Literary Prizes on Social Media (12)
Lore De Greve, Gunther Martens, Cynthia Van Hee, Pranaydeep Singh and Els Lefever

Since the rise of social media, the authority of traditional professional literary critics has been supplemented – or undermined, depending on the point of view – by technological developments and the emergence of community-driven online layperson literary criticism. So far, relatively little research (Allington 2016, Kellermann et al. 2016, Kellermann and Mehling 2017, Bogaert 2017, Pianzola et al. 2020) has examined this layperson user-generated evaluative “talk of literature” instead of addressing traditional forms of consecration. In this paper, we examine the professional and layperson literary criticism pertaining to a prominent German-language literary award: the Ingeborg-Bachmann-Preis, awarded during the Tage der deutschsprachigen Literatur (TDDL).

We propose an aspect-based sentiment analysis (ABSA) approach to discern the evaluative criteria used to differentiate between ‘good’ and ‘bad’ literature. To this end, we collected a corpus of German social media reviews retrieved from Twitter, Instagram and Goodreads and enriched it with manual ABSA annotations: aspects and aspect categories (e.g., TEXT Motifs Themes, JURY Discussion Valuation), sentiment expressions and named entities. In a next step, the manual annotations are used as training data for our ABSA pipeline, which includes 1) aspect term extraction, 2) aspect term category prediction and 3) aspect term polarity classification. Each pipeline component is developed using state-of-the-art pre-trained BERT models.

Two sets of experiments were conducted for aspect polarity detection: one where only the aspect embeddings were used, and another where an additional context window of five adjoining words in either direction of the aspect was considered. We present the classification results for the aspect category and aspect sentiment prediction subtasks on the Twitter corpus, as well as the next steps to tackle aspect term extraction. These preliminary experimental results show good performance for aspect category classification, with an F1-score of 0.81, and for the aspect sentiment subtask, which uses an additional context window, with an F1-score of 0.72.

Hungry for language data? Introducing a large Dutch corpus of restaurant reviews (13)
Jan Engelen and Emiel Krahmer

We introduce the Iens corpus, a dataset of over 684,000 Dutch restaurant reviews posted on the Iens website between 2012 and 2017. As such, it represents a large language dataset for Dutch. While similar corpora exist for English (e.g., the Yelp dataset or the Amazon review corpus), easily available, high-quality data of this kind are lacking for smaller languages. The Iens corpus is intended to fill this gap. In addition, the Iens corpus has several unique properties that make it a valuable resource for computational linguistics research. In this paper, we describe the construction and contents of the corpus, discuss its distinguishing features, and present some of its possible applications in computational linguistics. The corpus is freely available for research purposes via Dataverse.

A Rule-Based Approach for Detecting Vaccination Hesitancy in Online Comments (14)
Tess Dejaeghere, Tim Kreutz, Ilia Markov and Walter Daelemans

We present our preliminary work on the analysis of vaccine hesitancy in Dutch online comments. The goal of our work is twofold: to develop annotation guidelines for identifying vaccine hesitancy in online comments, and to develop a rule-based approach, based on linguistic markers and social media features, that can identify hesitant messages and thereby contribute to the creation of a silver-labelled dataset for training supervised systems.

Fine-grained implicit sentiment processing of polar economic events (15)
Gilles Jacobs and Véronique Hoste

We present experiments on implicit sentiment processing based on a fine-grained event-with-sentiment dataset for the economic domain. In information extraction, events encode factual information about real-world occurrences reported in news text, while sentiment analysis concerns expressions of opinions and subjectivity.
We contribute to the under-researched task of ‘polar fact’ or ‘implicit sentiment’ processing (Wilson (2008), Toprak et al. (2010)) by combining extraction of events and sentiment to automatically assign common-sense connotational opinion to facts.

The SENTiVENT dataset of English economic news articles contains token-level event annotations (event annotation described in Jacobs and Hoste (forthc.)) on which we manually annotate aspect-based sentiment consisting of sentiment expressions (the words expressing implicit/explicit sentiment), targets (the entity to which the sentiment is directed), and polarities (‘positive’, ‘negative’, and ‘neutral’ investor opinion). The sentiment annotations are validated in an inter-annotator agreement study and a series of polarity classification experiments on coarse- and fine-grained sentiment-target representations. For the coarse-grained experiments, we cast implicit polarity categorization as a text-classification task of sentiment expressions with target spans using large-scale pretrained language model embeddings. We fine-tune several models on the classification task and experiment with the integration of existing lexicons comparing performance of in-domain and general sentiment lexicons. For fine-grained experiments, we apply sentiment-target-polarity triplet extraction to test the feasibility of token-level extraction of implicit sentiment. To this end, we apply a state-of-the-art Grid Tagging end-to-end model (Wu et al. 2020) and compare performance to benchmark explicit sentiment datasets.

The SENTiVENT dataset fills the need for a manually annotated dataset in financial text mining applications while also being useful for implicit sentiment processing. These experiments validate the data resource and the results suggest best-practices for the creation and natural language engineering of domain-specific implicit sentiment applications.

Combining Transformers and Affect Lexica for Dutch Emotion Detection: Insights from the New EmotioNL Dataset (16)
Luna De Bruyne, Orphée De Clercq and Véronique Hoste

Although emotion detection has become a crucial research direction in NLP, the focus is primarily on English resources and data. The main obstacles for more specialized emotion detection are the lack of annotated data in smaller languages and the limited emotion taxonomy. In a first step to bridge this gap, we present EmotioNL, an emotion dataset consisting of 1,000 Dutch tweets and 1,000 captions from reality TV-shows, manually annotated with the emotion categories anger, fear, joy, love, sadness and neutral and the dimensions valence, arousal and dominance.

In previous work (De Bruyne et al., 2021), we combined the transformer models BERTje and RobBERT with lexicon-based methods and evaluated these models on EmotioNL. Two architectures were proposed: one in which lexicon information is directly injected into the transformer model and a meta-learning approach where predictions from transformers are combined with lexicon features. We found that directly adding lexicon information to transformers does not improve performance, but that lexicon information does have a positive effect on BERTje in the meta-learning approach (although not on RobBERT).

As results were not conclusive, we extend our methodology in the current study in four ways. Firstly, we include more lexica. Secondly, we use cross-validation instead of an ordinary train-test split to be able to draw more general conclusions. Further, we tune the random seed that controls the weight initialization of the transformer models, as the choice of random seed can heavily influence performance. Finally, we introduce a new metric, which we call ‘cost-corrected accuracy’. This metric takes the severity of a false prediction (the ‘cost’) into account. For example, misclassifying an instance of ‘joy’ as ‘love’ is a less severe mistake than misclassifying that instance as ‘anger’. This allows for a fairer evaluation of the models.
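The idea behind cost-corrected accuracy can be sketched as follows (one possible instantiation; the cost values below are invented for illustration and are not the paper's actual cost matrix):

```python
def cost_corrected_accuracy(gold, pred, cost):
    """Accuracy in which each error is discounted by its severity.
    cost maps (gold_label, predicted_label) to a value in [0, 1];
    unseen mismatched pairs default to the maximal cost of 1.0."""
    score = 0.0
    for g, p in zip(gold, pred):
        score += 1.0 - cost.get((g, p), 0.0 if g == p else 1.0)
    return score / len(gold)
```

A near-miss such as ‘joy’ predicted as ‘love’ thus still earns partial credit, while ‘joy’ predicted as ‘anger’ earns none.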

Viewpoints in the news: claim detection for diverse news recommendation (17)
Myrthe Reuver, Suzan Verberne, Roser Morante and Antske Fokkens

Why do filter bubble effects in news recommendation harm democracy? And how can computational linguistics help in this complex societal problem? In order to answer these questions, we connect ideas from different (sub)disciplines (Recommender systems (RecSys), computational linguistics, and social science).

We use claim detection to operationalize and detect ‘viewpoints’ on issues in the news. We reproduced the experiments of Reimers et al. (2019), who use the UKP Sentential Argument Mining Corpus (Stab et al., 2018) and classify claims from internet texts on current debates into three classes (pro, neutral, and con a debate topic) with a pre-trained Transformer BERT model. With leave-one-topic-out evaluation, we found results similar to the original study (obtaining a weighted F1 of 62.47%).

We connect our results to the theoretical model of deliberative democracy in social science, which claims a healthy democracy requires deliberation between different societal viewpoints (Habermas, 2006). News recommender systems, if they want to support a deliberative democracy, should provide such diverse viewpoints to their users (Helberger, 2019). In a recent paper we connected this theoretical model to several established Natural Language Processing (NLP) tasks, including opinion mining and claim detection, to determine their use for increasing viewpoint diversity in news recommendation (Reuver et al., 2021). The theoretical model of deliberative democracy also allows us to think about nuanced aspects of the problem that go beyond current NLP task definitions, but that we also need to address.

Additionally, Reimers et al. (2019) cluster similar claims, whereas we are interested in finding diverse (dissimilar) claims on the same topic. We also extend this reproduction by analyzing which claims are difficult to classify. In our presentation we will show the results of this study and put them in the broader context of diverse news recommendation.
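The leave-one-topic-out evaluation mentioned above can be sketched as follows (illustrative only; the example tuples are invented):

```python
def leave_one_topic_out(examples):
    """Yield (held_out_topic, train, test) splits in which the test set
    contains all examples of one topic, so the classifier is always
    evaluated on a topic it never saw during training.
    examples: list of (topic, text, label) tuples."""
    topics = sorted({topic for topic, _, _ in examples})
    for held_out in topics:
        train = [ex for ex in examples if ex[0] != held_out]
        test = [ex for ex in examples if ex[0] == held_out]
        yield held_out, train, test
```

This split measures whether stance classification generalises across debate topics rather than memorising topic-specific vocabulary.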

The automatic cross-lingual identification of equivalent term pairs extracted from comparable corpora (18)
Anneleen Dill and Ayla Rigouts Terryn

Automatic terminology extraction (ATE) refers to the process of extracting specialised vocabulary from corpora of domain-specific texts. Multilingual ATE can be used to extract bilingual term pairs from parallel corpora. However, parallel corpora are not available for all domains and all language combinations, so recent methodologies for multilingual ATE aim to work with comparable corpora, where monolingual ATE is performed on the texts in different languages and the extracted candidate terms are linked cross-lingually. This latter step is the most challenging in automatic term extraction from comparable corpora (ATECC). Consequently, the aim of this pilot study was to explore the automatic cross-lingual linking of terms extracted from comparable corpora, specifically using features based on the information collected during the monolingual ATE step. The study was conducted on a trilingual corpus (English/French/Dutch) with comparable texts on the topic ‘heart failure’, for which a gold standard had already been constructed. Using this gold standard, classification algorithms were trained on labelled term pairs, using a combination of the monolingual features of both terms, to predict whether the two terms were valid equivalents or not. The classification performance was improved by incorporating additional features, such as the Levenshtein distance and a feature indicating whether both terms start with the same character. An error analysis revealed that none of the features from the monolingual ATE are as informative as the Levenshtein distance, but that the combination of all monolingual features still contributes relevant information for equivalent detection. Among false positives, cases of hyponymy-hypernymy and co-hyponymy were identified. 
Future research is needed to explore more informative features like word embeddings, but the current findings illustrate how re-using features from monolingual ATE can benefit the cross-lingual identification of equivalent term pairs.
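The orthographic features mentioned above, the Levenshtein distance and a shared first character, can be sketched as follows. This is a minimal illustration: the feature names are invented for the example, and in the study these features would be combined with the monolingual ATE features before training a classifier.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            # Cost of deletion, insertion, or substitution (0 if chars match).
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def pair_features(src: str, tgt: str) -> dict:
    """Orthographic features for one candidate term pair (names are illustrative)."""
    return {
        "levenshtein": levenshtein(src, tgt),
        "norm_levenshtein": levenshtein(src, tgt) / max(len(src), len(tgt), 1),
        "same_first_char": src[:1].lower() == tgt[:1].lower(),
    }
```

For cognate-like pairs such as "cardiologie"/"cardiology" both features signal likely equivalence (small edit distance, same first character), which is why they help most on orthographically related terms.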

Automatically Identifying Shortages During the Covid-19 Pandemic from Publicly Available Articles (19)
Daphne Theodorakopoulos, Esmee Peters, Louise Knight, Gwenn Englebienne and Shenghui Wang

During the Covid-19 pandemic, severe shortages in supply chains were caused by the sudden need for certain products, such as personal protective equipment. Previous research shows that knowing about upcoming shortages helps to mitigate them and reduce their impact, for instance by producing more substitutes. Therefore, an early warning system capable of identifying shortages in potential future pandemics would be helpful.

Procurement experts often use keyword searches to find relevant documents. Recent research emphasizes using qualitative big data, including text, to study supply chains. We propose to analyze news articles, as they report on ongoing shortages, and medical literature, as it might provide an early signal of increased usage of particular types of products. We used two datasets in this study: a random sample of 100K news articles on Covid-19 from November 2019 to July 2020, and a selection of 300K English Covid-19 medical publications from November 2019 to May 2021.

Our analysis combines topic modeling, term weighting and context analysis. We used topic modeling to select documents relevant to shortages and applied several weighting schemes (e.g. frequency, topic relevance, TF-IDF) to retrieve shortage terms.

The models with the highest overlap with generic shortage terms (e.g. “scarcity”, “shortage”, “need”) were evaluated by a human expert. Preliminary findings show no significant difference in timeliness between the two datasets, but indicate that our method can retrieve both the shortages known to the expert, such as “n95 respirator” and “mechanical ventilation”, and less obvious shortages, such as “nasal swab” and “staff shortage”. The best results, according to the expert, were obtained by training a topic model on the news dataset, using it to select documents from the medical dataset, and retrieving the most frequent terms in a 20-word window around the word “shortage”.

17:45 – 18:00 Closing – Keynote Room
18:00 – 20:00 Social event – Game Room (optional)