Publications (full list here)
-
Julian Schlenker, Jenny Kunz, Tatiana Anikina, Günter Neumann, and Simon Ostermann
(2025)
Only for the Unseen Languages, Say the Llamas:
On the Efficacy of Language Adapters for Cross-lingual Transfer in English-centric LLMs,
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics:
Student Research Workshop (ACL-IJCNLP-SRW-2025),
Italy, 2025.
Most state-of-the-art large language models (LLMs) are trained mainly on English data,
limiting their effectiveness on non-English, especially low-resource, languages.
This study investigates whether language adapters can facilitate cross-lingual transfer
in English-centric LLMs. We train language adapters for 13 languages using Llama 2 (7B)
and Llama 3.1 (8B) as base models, and evaluate their effectiveness on two downstream tasks
(MLQA and SIB-200) using either task adapters or in-context learning. Our results
reveal that language adapters improve performance for languages not seen during pre-training,
but provide negligible benefit for seen languages. These findings highlight the limitations
of language adapters as a general solution for multilingual adaptation in English-centric LLMs.
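As a rough illustration of the technique, a language adapter is typically a small bottleneck module inserted into each transformer layer. The sketch below is generic; the hidden and bottleneck sizes are assumptions, not the paper's configuration.

```python
import numpy as np

def bottleneck_adapter(h, W_down, W_up):
    """Generic bottleneck adapter: down-project, non-linearity,
    up-project, residual connection. Shapes are illustrative only."""
    z = np.maximum(0.0, h @ W_down)   # down-projection + ReLU
    return h + z @ W_up               # up-projection + residual

rng = np.random.default_rng(0)
d_model, d_bottleneck = 4096, 64      # assumed sizes, not from the paper
h = rng.normal(size=(10, d_model))    # hidden states for 10 tokens
W_down = rng.normal(size=(d_model, d_bottleneck)) * 0.02
W_up = rng.normal(size=(d_bottleneck, d_model)) * 0.02
print(bottleneck_adapter(h, W_down, W_up).shape)  # (10, 4096)
```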
-
Muhammad Umer Tariq Butt, Stalin Varanasi, and Günter Neumann (2025)
Enabling Low-Resource Language Retrieval:
Establishing Baselines for Urdu MS MARCO, Proceedings of the 47th
European Conference on Information Retrieval
(ECIR 2025),
Italy, 2025.
As the Information Retrieval (IR) field increasingly recognizes the importance of inclusivity,
addressing the needs of low-resource languages remains a significant challenge.
This paper introduces the first large-scale Urdu IR dataset, created by translating the MS MARCO dataset
through machine translation. We establish baseline results through zero-shot learning for
IR in Urdu and subsequently apply the mMARCO multilingual IR methodology to this newly translated dataset.
Our findings demonstrate that the fine-tuned model (Urdu-mT5-mMARCO) achieves a
Mean Reciprocal Rank (MRR@10) of 0.247 and a Recall@10 of 0.439, representing significant
improvements over zero-shot results and showing the potential for expanding IR access for Urdu speakers.
By bridging access gaps for speakers of low-resource languages, this work not only advances
multilingual IR research but also emphasizes the ethical and societal importance of
inclusive IR technologies. This work provides valuable insights into the
challenges and solutions for improving language representation and lays the groundwork for
future research, especially in South Asian languages, which can benefit
from the adaptable methods used in this study.
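For reference, the reported metrics can be computed as follows; the ranking and relevance judgments in this example are made up.

```python
def mrr_at_10(ranked, relevant):
    """Reciprocal rank of the first relevant passage within the top 10."""
    for i, doc in enumerate(ranked[:10], start=1):
        if doc in relevant:
            return 1.0 / i
    return 0.0

def recall_at_10(ranked, relevant):
    """Fraction of relevant passages retrieved in the top 10."""
    return len(set(ranked[:10]) & relevant) / len(relevant)

# toy example: one query with known relevant passage ids
ranked = ["d7", "d3", "d9", "d1"]
relevant = {"d3", "d42"}
print(mrr_at_10(ranked, relevant))     # 0.5
print(recall_at_10(ranked, relevant))  # 0.5
```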
-
Noon Pokaratsiri, Saadullah Amin, and Günter Neumann (2024)
Towards Understanding Attention-based Reasoning through Graph
Structures in Medical Codes Classification, Proceedings of TextGraphs-17:
Graph-based Methods for Natural Language Processing at ACL-2024
(TextGraphs 2024),
Bangkok, Thailand, 2024.
A common approach to automatically assigning diagnostic and procedural clinical codes to health records
is to solve the task as a multi-label classification problem. Difficulties associated with this task stem
from domain knowledge requirements, long document texts, and a large, imbalanced label space reflecting the
breadth of, and dependencies between, medical diagnoses and procedures. Decisions in the healthcare domain also
need to demonstrate sound reasoning, both when they are correct and when they are erroneous. Existing works
address some of these challenges by incorporating external knowledge, which can be encoded into a graph-structured format.
Incorporating graph structures on the output label space, or between the input document and output
label spaces, has shown promising results in medical codes classification. Limited focus has been put on
utilizing graph-based representation on the input document space. To partially bridge this gap,
we represent clinical texts as graph-structured data through the UMLS Metathesaurus; we explore implicit
graph representation through pre-trained knowledge graph embeddings and explicit domain-knowledge
guided encoding of document concepts and relational information through graph neural networks.
Our findings highlight the benefits of pre-trained knowledge graph embeddings for understanding the
model's attention-based reasoning. In contrast, transparent domain-knowledge guidance in graph
encoder approaches is overshadowed by performance loss.
Our qualitative analysis identifies limitations that contribute to prediction errors.
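A minimal sketch of the implicit variant described above: documents mapped to UMLS concept identifiers (CUIs) whose pre-trained KG embeddings are mean-pooled into a document representation. The CUIs and embedding values here are hypothetical stand-ins.

```python
import numpy as np

# hypothetical pre-trained KG embeddings keyed by UMLS CUIs
kg_emb = {"C0020538": np.array([0.1, 0.3]),   # hypertension
          "C0011849": np.array([0.2, -0.1])}  # diabetes mellitus

def embed_document(cuis, emb, dim=2):
    """Mean-pool pre-trained concept embeddings found in a document."""
    vecs = [emb[c] for c in cuis if c in emb]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

print(embed_document(["C0020538", "C0011849", "C9999999"], kg_emb))
```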
-
Tanja Bäumel, Soniya Vijayakumar, Josef van Genabith, Günter Neumann, and Simon Ostermann (2023)
Investigating the Encoding of Words in BERT's Neurons Using Feature Textualization,
Proceedings of the 6th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP
(BlackboxNLP-2023),
Singapore, 2023.
Pretrained language models (PLMs) form the basis of most state-of-the-art NLP technologies.
Nevertheless, they are essentially black boxes: Humans do not have a clear understanding of
what knowledge is encoded in different parts of the models, especially in individual neurons.
This contrasts with computer vision, where feature visualization provides a decompositional interpretability
technique for neurons of vision models. Activation maximization is used to synthesize inherently
interpretable visual representations of the information encoded in individual neurons.
Our work is inspired by this but presents a cautionary tale on the interpretability of single neurons,
based on the first large-scale attempt to adapt activation maximization to NLP, and, more specifically,
large PLMs. We propose feature textualization, a technique to produce dense representations of neurons
in the PLM word embedding space. We apply feature textualization to the BERT model to investigate
whether the knowledge encoded in individual neurons can be interpreted and symbolized.
We find that the produced representations can provide insights about the knowledge encoded in
individual neurons, but that individual neurons do not represent clear-cut symbolic
units of language such as words. Additionally, we use feature textualization
to investigate how many neurons are needed to encode words in BERT.
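Activation maximization of the kind adapted here can be sketched as gradient ascent on an input vector, followed by a nearest-neighbour lookup in the embedding matrix. This toy version uses a linear neuron and random embeddings rather than BERT.

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab_size = 16, 100
E = rng.normal(size=(vocab_size, d))   # toy word embedding matrix
w = rng.normal(size=d)                 # weights of the probed "neuron"

x = rng.normal(size=d)                 # input vector to optimize
for _ in range(100):                   # gradient ascent on activation w.x
    x += 0.1 * w                       # gradient of w.x w.r.t. x is w
    x /= np.linalg.norm(x)             # keep the input on the unit sphere

# interpret the optimized vector via its nearest vocabulary embeddings
sims = E @ x / (np.linalg.norm(E, axis=1) * np.linalg.norm(x))
print(np.argsort(-sims)[:5])           # indices of the top-5 closest "words"
```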
-
Stalin Varanasi, Muhammad Umer Butt, and Günter Neumann (2023)
AutoQIR: Auto-Encoding Questions with Retrieval Augmented Decoding
for Unsupervised Passage Retrieval and Zero-shot Question Generation,
Proceedings of Recent Advances in Natural Language Processing (RANLP-2023), Bulgaria, 2023.
Dense passage retrieval models have become state-of-the-art for information retrieval on many Open-domain Question Answering
(ODQA) datasets. However, most of these models rely on supervision obtained from the ODQA datasets, which hinders their
performance
in a low-resource setting.
Recently, retrieval-augmented language models have been proposed to improve both zero-shot
and supervised information retrieval. However, these models have pre-training tasks that are agnostic to the target task of
passage retrieval.
In this work, we propose Retrieval Augmented Auto-encoding of Questions for zero-shot
dense information retrieval. Unlike other pre-training methods, our pre-training method
is built for target information retrieval, thereby making the pre-training more efficient. Our
method consists of a dense IR model for encoding questions and retrieving documents during
training and a conditional language model that maximizes the question’s likelihood by
marginalizing over retrieved documents. As a by-product, we can use this conditional language
model for zero-shot question generation from documents. We show that the IR model
obtained through our method improves the current state-of-the-art of zero-shot dense information
retrieval, and we improve the results even further by training on a synthetic corpus
created by zero-shot question generation.
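The objective described above marginalizes the question likelihood over retrieved documents; a toy numeric rendering of that sum, with all probabilities made up:

```python
import numpy as np

# Toy marginalization: p(q) = sum_d p_ret(d | q) * p_LM(q | d),
# i.e. retriever document probabilities weighted by the conditional
# language model's likelihood of the question. Numbers are invented.
p_ret = np.array([0.6, 0.3, 0.1])            # retriever scores, top-3 docs
p_q_given_d = np.array([1e-3, 4e-4, 1e-5])   # LM likelihood of q per doc
print(float(p_ret @ p_q_given_d))            # marginal question likelihood
```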
-
Saadullah Amin, Pasquale Minervini, David Chang, Pontus Stenetorp, and Günter Neumann (2022)
MedDistant19: Towards an Accurate Benchmark for Broad-Coverage Biomedical Relation
Extraction, Proceedings of The 29th International Conference on Computational Linguistics
(Coling-2022),
October 12-17, 2022, Gyeongju, Republic of Korea
Relation extraction in the biomedical domain is challenging due to the lack of labeled data and the high annotation costs,
which require domain experts. Distant supervision is commonly used to tackle the scarcity of annotated data by automatically pairing
knowledge graph relationships with raw texts. Such a pipeline is prone to noise and faces added challenges in scaling to cover
a large number of biomedical concepts. We investigated existing broad-coverage distantly supervised biomedical relation
extraction benchmarks and found a significant overlap between training and test relationships ranging from 26% to 86%.
Furthermore, we noticed several inconsistencies in the data construction process of these benchmarks, and where there is no
train-test leakage, the focus is on interactions between narrower entity types. This work presents a more accurate benchmark,
MedDistant19, for broad-coverage distantly supervised biomedical relation extraction that addresses these shortcomings and is
obtained by aligning MEDLINE abstracts with the widely used SNOMED Clinical Terms knowledge base. As prior benchmarks lack
thorough evaluation with domain-specific language models, we also conduct experiments validating general-domain relation
extraction findings on biomedical relation extraction.
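The train-test overlap measured above reduces to a set intersection over relation triples; a minimal sketch with invented triples (the paper's exact matching criteria may differ):

```python
def leakage(train_triples, test_triples):
    """Fraction of test (head, relation, tail) triples seen in training."""
    overlap = set(test_triples) & set(train_triples)
    return len(overlap) / len(test_triples)

train = {("aspirin", "treats", "headache"), ("insulin", "treats", "diabetes")}
test = {("aspirin", "treats", "headache"), ("statin", "treats", "hyperlipidemia")}
print(leakage(train, test))  # 0.5
```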
-
Ioannis Dikeoulias, Saadullah Amin, and Günter Neumann (2022)
Temporal Knowledge Graph Reasoning with Low-rank and Model-agnostic
Representations, Proceedings of the 7th Workshop on Representation Learning for NLP
(RepL4NLP-2022) at ACL 2022, pages 111-120, May 2022.
Temporal knowledge graph completion (TKGC) has become a popular approach for reasoning over event and temporal
knowledge graphs, targeting the completion of knowledge with accurate but missing information. In this context, tensor
decomposition has successfully modeled interactions between entities and relations. Its effectiveness in static knowledge
graph completion motivates us to introduce Time-LowFER, a family of parameter-efficient and time-aware extensions of the
low-rank tensor factorization model LowFER. Noting several limitations in current approaches to represent time, we propose
a cycle-aware time-encoding scheme for time features, which is model-agnostic and offers a more generalized representation
of time. We implement our methods in a unified temporal knowledge graph embedding framework, focusing on time-sensitive
data processing. The experiments show that our proposed methods perform on par or better than the state-of-the-art semantic
matching models on two benchmarks.
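A plausible rendering of a cycle-aware time encoding maps each periodic feature onto the unit circle, so the ends of a cycle stay adjacent; the concrete feature below (month of year) is only an example, not the paper's exact feature set.

```python
import math

def cyclic_encode(value, period):
    """Map a periodic feature (e.g. month of year) onto the unit circle,
    so the last and first values of the cycle end up adjacent."""
    angle = 2.0 * math.pi * value / period
    return (math.sin(angle), math.cos(angle))

print(cyclic_encode(12, 12))  # December ~ (0.0, 1.0)
print(cyclic_encode(1, 12))   # January, close to December on the circle
```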
-
Saadullah Amin, Noon Pokaratsiri, Morgan Wixted, Alejandro García-Rudolph,
Catalina Martínez-Costa, and Günter Neumann (2022)
Few-Shot Cross-lingual Transfer for Coarse-grained De-identification
of Code-Mixed Clinical Texts, Proceedings of the 21st Workshop on Biomedical Language Processing
(BioNLP-2022) at ACL 2022, pages 200-211, May 2022.
Despite the advances in digital healthcare systems offering curated structured knowledge, much of the critical information
still lies in large volumes of unlabeled and unstructured clinical texts. These texts, which often contain protected
health information (PHI), are exposed to information extraction tools for downstream applications, risking patient
identification. Existing works in de-identification rely on using large-scale annotated corpora in English, which often
are not suitable in real-world multilingual settings. Pre-trained language models (LMs) have shown great potential for
cross-lingual transfer in low-resource settings. In this work, we empirically show the few-shot cross-lingual transfer
property of LMs for named entity recognition (NER) and apply it to solve a low-resource and real-world challenge of
code-mixed (Spanish-Catalan) clinical notes de-identification in the stroke domain. We annotate a gold evaluation
dataset to assess few-shot setting performance where we only use a few hundred labeled examples for training. Our
model improves the zero-shot F1-score from 73.7% to 91.2% on the gold evaluation set when adapting Multilingual
BERT (mBERT) from the MEDDOCAN corpus with our few-shot cross-lingual target corpus. When
generalized to an out-of-sample test set, the best model achieves a human-evaluation F1-score of 97.2%.
-
Stalin Varanasi, Saadullah Amin and Günter Neumann (2021)
AutoEQA: Auto-Encoding Questions for Extractive Question Answering, The 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP-2021), Nov. 2021.
There has been significant progress in the field of extractive
question answering (EQA) in recent years. However, most approaches
rely on annotations of answer spans in the corresponding passages.
In this work, we address the problem of EQA when no annotations are
present for the answer span, i.e., when the dataset contains only
questions and corresponding passages. Our method is based on
auto-encoding of the question, which performs a question answering (QA)
task during encoding and a question generation (QG) task during
decoding. Our method performs well in a zero-shot setting and can
provide an additional loss that boosts performance for EQA.
-
Saadullah Amin and Günter Neumann (2021)
T2NER: Transformers based Transfer Learning Framework for Named Entity Recognition. The 16th Conference of the European Chapter of the Association for Computational Linguistics (EACL), demo session, 2021.
Recent advances in deep transformer models
have achieved state-of-the-art in several natural
language processing (NLP) tasks, whereas
named entity recognition (NER) has traditionally
benefited from long short-term memory
(LSTM) networks. In this work, we present a
Transformers based Transfer Learning framework
for Named Entity Recognition (T2NER)
created in PyTorch for the task of NER with
deep transformer models. The framework is
built upon the Transformers library as the core
modeling engine and supports several transfer
learning scenarios from sequential transfer
to domain adaptation, multi-task learning, and
semi-supervised learning. It aims to bridge the
gap between the algorithmic advances in these
areas by combining them with the state-of-the-art
in transformer models to provide a unified
platform that is readily extensible and can be
used for both the transfer learning research
in NER, and for real-world applications. The
framework is available at: https://github.com/suamin/t2ner.
-
Ekaterina Loginova, Stalin Varanasi and Günter Neumann (2021)
Towards End-to-End Multilingual Question Answering.
In the journal Information Systems Frontiers, 23(1): 227-241 (2021).
Multilingual question answering (MLQA) is a critical part of an accessible natural language interface.
However, current solutions demonstrate performance far below that of monolingual systems.
We believe that deep learning approaches are likely to improve performance in MLQA drastically.
This work aims to discuss the current state-of-the-art and remaining challenges.
We outline requirements and suggestions for practical parallel data collection and describe existing methods,
benchmarks and datasets. We also demonstrate that a simple translation of texts
can be inadequate in the case of the Arabic, English and German languages (on the InsuranceQA and SemEval datasets),
and thus more sophisticated models are required. We hope
that our overview will re-ignite interest in multilingual question answering, especially with regard to neural approaches.
-
Stalin Varanasi, Saadullah Amin, and Günter Neumann (2020)
CopyBERT: A Unified Approach to Question Generation with Self-Attention.
Proceedings of the 2nd Workshop on NLP for Conversational AI, ACL workshop, 2020.
Contextualized word embeddings provide better initialization for
neural networks that deal with various natural language understanding
(NLU) tasks, including Question Answering (QA) and, more recently,
Question Generation (QG). Apart from providing meaningful word
representations, pre-trained transformer models such as BERT also
provide self-attentions which encode syntactic information that can
be probed for dependency parsing and POS tagging. In this paper, we
show that the information from the self-attentions of BERT is useful
for language modeling of questions conditioned on paragraph and
answer phrases. To control the attention span, we use a semi-diagonal
mask and utilize a shared model for encoding and decoding, unlike
sequence-to-sequence models. We further employ a copy mechanism over
self-attentions to achieve state-of-the-art results for Question
Generation on the SQuAD dataset.
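One plausible reading of the semi-diagonal mask: context positions (paragraph plus answer) attend bidirectionally within the context, while question positions additionally attend causally to previous question tokens, so one shared model can both encode and decode. The construction below is our sketch, not the paper's code.

```python
import numpy as np

def semi_diagonal_mask(n_ctx, n_q):
    """Attention mask: full attention within the context block,
    causal (lower-triangular) attention over the question block."""
    n = n_ctx + n_q
    mask = np.zeros((n, n), dtype=bool)
    mask[:, :n_ctx] = True               # every position sees the context
    for i in range(n_ctx, n):            # question tokens also see previous
        mask[i, n_ctx:i + 1] = True      # question tokens and themselves
    return mask

print(semi_diagonal_mask(3, 3).astype(int))
```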
-
Saadullah Amin, Stalin Varanasi, Katherine Dunfield and Günter Neumann (2020)
LowFER: Low-rank Bilinear Pooling for Link Prediction.
Proceedings of the 37th International Conference on Machine Learning (ICML-2020), 2020.
Knowledge graphs are incomplete by nature, with only a limited number of observed facts from the world knowledge
being represented as structured relations between entities. To partly address this issue, an important task in
statistical relational learning is that of link prediction or knowledge graph completion. Both linear and non-linear
models have been proposed to solve the problem. Bilinear models, while expressive, are prone to overfitting and lead
to quadratic growth of parameters in the number of relations. Simpler models have become more standard, with certain
constraints on the bilinear map as relation parameters. In this work, we propose a factorized bilinear pooling model,
commonly used in multi-modal learning, for better fusion of entities and relations, leading to an efficient and
constraint-free model. We prove that our model is fully expressive, providing bounds on the embedding dimensionality
and factorization rank. Our model naturally generalizes the Tucker-decomposition-based TuckER model, which has been
shown to generalize other models, as an efficient low-rank approximation without substantially compromising the
performance. Due to the low-rank approximation, the model complexity can be controlled by the factorization rank,
avoiding the possible cubic growth of TuckER. Empirically, we evaluate on real-world datasets, reaching on-par or
state-of-the-art performance. At extremely low ranks, the model preserves the performance while staying parameter
efficient.
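The scoring function can be sketched as MFB-style factorized bilinear pooling: down-project subject and relation, multiply elementwise, sum-pool over k factors, and score against the object embedding. Shapes and the pooling layout here are our assumptions, not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 8, 3                              # embedding dim and factor rank (toy)
U = rng.normal(size=(d, k * d)) * 0.1    # subject projection
V = rng.normal(size=(d, k * d)) * 0.1    # relation projection

def lowfer_score(e_s, e_r, e_o):
    """Factorized bilinear pooling of subject and relation, scored
    against the object embedding (MFB-style sum pooling over k)."""
    z = (e_s @ U) * (e_r @ V)            # (k*d,) fused representation
    z = z.reshape(k, d).sum(axis=0)      # sum-pool the k factors -> (d,)
    return float(z @ e_o)

e_s, e_r, e_o = (rng.normal(size=d) for _ in range(3))
print(lowfer_score(e_s, e_r, e_o))
```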
-
Saadullah Amin, Katherine Dunfield, Anna Vechkaeva and Günter Neumann (2020)
A Data-driven Approach for Noise Reduction in Distantly Supervised Biomedical Relation Extraction.
In Proceedings of BioNLP-2020 at ACL-2020.
Fact triples are a common form of structured
knowledge used within the biomedical domain.
As the amount of unstructured scientific texts
continues to grow, manual annotation of these
texts for the task of relation extraction becomes
increasingly expensive. Distant supervision
offers a viable approach to combat this
by quickly producing large amounts of labeled,
but considerably noisy, data. We aim to reduce
such noise by extending an entity-enriched relation
classification BERT model to the problem
of multiple instance learning, and defining
a simple data encoding scheme that significantly
reduces noise, reaching state-of-the-art
performance for distantly-supervised biomedical
relation extraction. Our approach further
encodes knowledge about the direction of relation
triples, allowing for increased focus on relation
learning by reducing noise and alleviating
the need for joint learning with knowledge
graph completion.
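The data encoding scheme might look like the following entity-marker scheme with a direction tag; the marker tokens and tag names are our illustrative choices, not necessarily the paper's.

```python
def encode_pair(tokens, head_span, tail_span, direction):
    """Wrap head and tail entity mentions in marker tokens and prepend
    a direction tag for the relation triple (illustrative scheme)."""
    out = list(tokens)
    (hs, he), (ts, te) = head_span, tail_span
    # insert markers from the right so earlier offsets stay valid
    for start, end, left, right in sorted(
            [(hs, he, "[E1]", "[/E1]"), (ts, te, "[E2]", "[/E2]")],
            reverse=True):
        out[end:end] = [right]
        out[start:start] = [left]
    return [f"[{direction}]"] + out

print(encode_pair(["aspirin", "relieves", "headache"], (0, 1), (2, 3), "FWD"))
# ['[FWD]', '[E1]', 'aspirin', '[/E1]', 'relieves', '[E2]', 'headache', '[/E2]']
```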
-
Dominik Stammbach and Günter Neumann (2019)
Team DOMLIN: Exploiting Evidence Enhancement for the FEVER Shared Task.
In Proceedings of the
Second Workshop on Fact Extraction and VERification (FEVER), EMNLP workshop, 2019.
This paper contains our system description for the second Fact Extraction and VERification (FEVER) challenge.
We propose a two-staged sentence selection strategy to account for examples in the dataset where evidence is not only
conditioned on the claim, but also on previously retrieved evidence. We use a publicly available
document retrieval module and have fine-tuned BERT checkpoints for sentence
selection and as the entailment classifier. We report a FEVER score of 68.46% on the blind test set.
-
Saadullah Amin, Günter Neumann, Katherine Dunfield, Anna Vechkaeva,
Kathryn Annette Chapman, and Morgan Kelly Wixted (2019)
MLT-DFKI at CLEF eHealth 2019: Multi-label Classification of ICD-10 Codes with BERT.
In working notes of CLEF eHealth, 2019.
With the adoption of electronic health record (EHR) systems,
hospitals and clinical institutes have access to large amounts
of heterogeneous patient data. Such data consists of structured
(insurance details, billing data, lab results etc.) and unstructured
(doctor notes, admission and discharge details, medication steps etc.)
documents, of which the latter is of great significance for applying natural
language processing (NLP) techniques. In parallel, recent advancements
in transfer learning for NLP have pushed the state-of-the-art to new
limits on many language understanding tasks. Therefore, in this paper,
we present team DFKI-MLT's participation at CLEF eHealth 2019 Task 1 of
automatically assigning ICD-10 codes to non-technical summaries (NTSs)
of animal experiments, where we use various architectures in a multi-label
classification setting and demonstrate the effectiveness of transfer
learning with pre-trained language representation model BERT
(Bidirectional Encoder Representations from Transformers) and its recent
variant BioBERT. We first translate the task documents from German to
English using an automatic translation system and then use BioBERT, which
achieves an F1-micro of 73.02% on the submitted run as
evaluated by the challenge.
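The reported F1-micro pools decisions over all (document, code) pairs; a minimal reference implementation with invented codes:

```python
def f1_micro(gold, pred):
    """Micro-averaged F1 over multi-label predictions:
    counts are pooled over all (document, code) decisions."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    fp = sum(len(p - g) for g, p in zip(gold, pred))
    fn = sum(len(g - p) for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = [{"ICD10:E11", "ICD10:I10"}, {"ICD10:J45"}]
pred = [{"ICD10:E11"}, {"ICD10:J45", "ICD10:I10"}]
print(round(f1_micro(gold, pred), 3))  # 0.667
```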
-
Alejandro Figueroa, Carlos Gómez-Pantoja, and Günter Neumann (2019)
Integrating heterogeneous sources for predicting question temporal anchors across Yahoo! Answers.
In journal Information Fusion, Volume 50,
October 2019, Pages 112-125.
Modern Community Question Answering (CQA) web forums provide the possibility to browse their archives using
question-like search queries as in Information Retrieval
(IR) systems. Although these traditional IR methods have become very successful at fetching semantically related
questions, they typically leave their temporal relations unconsidered.
That is to say, a group of questions may be asked more often during specific recurring timelines
despite being semantically unrelated. In fact,
predicting temporal aspects would not only assist these platforms in widening the semantic diversity of their search
results, but also in re-stating questions that
need to refresh their answers and in producing more dynamic, especially temporally-anchored, displays.
In this paper, we devise a new set of time-frame-specific categories for CQA questions, obtained by fusing
two distinct earlier taxonomies (i.e., [29] and [50]). These new categories are then utilized in a large
crowd-sourcing-based human annotation effort. Accordingly, we present a systematic analysis of its results
in terms of complexity and degree of difficulty as they relate to the different question topics.
Incidentally, through a large number of experiments, we investigate the effectiveness of a wider variety of linguistic
features compared to what has been done in
previous works. We additionally mix evidence/features distilled directly and indirectly from questions by capitalizing
on their related web search results. We
finally investigate the impact and effectiveness of multi-view learning to boost a large variety of multi-class
supervised learners by optimizing a latent layer
built on top of two views: one composed of features harvested from questions, and the other of CQA metadata and
evidence extracted from web resources (i.e.,
snippets and Internet archives).
-
Ekaterina Loginova and Günter Neumann (2018)
An Interactive Web-Interface for Visualizing the Inner Workings of the Question Answering LSTM.
In proceedings of the
Conference on Empirical Methods in Natural Language Processing - EMNLP-2018, October 31 – November 4, Brussels, Belgium, 2018.
Deep learning models for NLP are potent but not readily interpretable. This prevents researchers from improving a model's performance
efficiently and users from applying it to tasks which require a high level of trust in the system. We present a visualisation tool
which aims to illuminate the inner workings of a specific LSTM model for question answering.
It plots heatmaps of neurons’ firings and allows a user to check the dependency between neurons and manual features. The system possesses
an interactive web-interface and can be adapted to other models and domains.
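The kind of heatmap the tool plots can be reproduced in a few lines with matplotlib; the activations here are random stand-ins for real LSTM states.

```python
import numpy as np
import matplotlib.pyplot as plt

tokens = ["what", "is", "the", "capital", "of", "france", "?"]
acts = np.random.default_rng(0).normal(size=(8, len(tokens)))  # 8 neurons (stand-in)

fig, ax = plt.subplots()
ax.imshow(acts, aspect="auto", cmap="coolwarm")  # neurons x tokens heatmap
ax.set_xticks(range(len(tokens)))
ax.set_xticklabels(tokens)
ax.set_ylabel("neuron")
plt.show()
```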
-
Georg Heigold, Stalin Varanasi, Günter Neumann and Josef van Genabith (2018)
How Robust Are Character-Based Word Embeddings in Tagging and MT Against Wrod Scramlbing or Randdm Nouse?.
In proceedings of AMTA, March 2018.
This paper investigates the robustness of NLP against perturbed word forms. While neural approaches can achieve (almost)
human-like accuracy for certain tasks and conditions, they often are sensitive to small changes in the input such as non-canonical input
(e.g., typos). Yet both stability and robustness are desired properties in applications involving user-generated content,
all the more so as humans easily cope with such noisy or adversarial conditions. In this paper, we study the impact of noisy input.
We consider different noise distributions (one type of noise, combination of noise types) and mismatched noise distributions
for training and testing. Moreover, we empirically evaluate the robustness of different models (convolutional neural networks,
recurrent neural networks, non-neural models), different basic units (characters, byte pair encoding units),
and different NLP tasks (morphological tagging, machine translation).
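The scrambling noise alluded to in the (deliberately garbled) title keeps a word's first and last letters fixed and permutes the interior; one common formulation, as a sketch:

```python
import random

def scramble(word, rng=random.Random(0)):
    """Permute a word's interior characters, keeping the first and
    last characters in place (as in the title's 'Wrod Scramlbing')."""
    if len(word) <= 3:
        return word
    middle = list(word[1:-1])
    rng.shuffle(middle)
    return word[0] + "".join(middle) + word[-1]

print(" ".join(scramble(w) for w in "random noise in the input".split()))
```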
-
Georg Heigold, Günter Neumann and Josef van Genabith (2017)
An Extensive Empirical Evaluation of Character-Based Morphological Tagging for 14 Languages.
In proceedings of EACL, 2017.
This paper investigates neural character-based morphological tagging for languages with complex morphology and large tag sets.
Character-based approaches are attractive as they can handle rare and unseen words gracefully. We evaluate on 14
languages and observe consistent gains over a state-of-the-art morphological tagger across all languages except
for English and French, where we match the state-of-the-art. We compare two architectures for computing
character-based word vectors using recurrent (RNN) and convolutional (CNN) nets. We show that the CNN based approach
performs slightly worse and less consistently than the RNN based approach.
Small but systematic gains are observed when combining the two architectures by ensembling.