Publications (full list here)
-
Julian Schlenker, Jenny Kunz, Tatiana Anikina, Günter Neumann, and Simon Ostermann
(2025)
Only for the Unseen Languages, Say the Llamas:
On the Efficacy of Language Adapters for Cross-lingual Transfer in English-centric LLMs,
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics:
Student Research Workshop (ACL-IJCNLP-SRW-2025),
Italy, 2025.
Most state-of-the-art large language models (LLMs) are trained mainly on English data,
limiting their effectiveness on non-English, especially low-resource, languages.
This study investigates whether language adapters can facilitate cross-lingual transfer
in English-centric LLMs. We train language adapters for 13 languages using Llama 2 (7B)
and Llama 3.1 (8B) as base models, and evaluate their effectiveness on two downstream tasks
(MLQA and SIB-200) using either task adapters or in-context learning. Our results
reveal that language adapters improve performance for languages not seen during pre-training,
but provide negligible benefit for seen languages. These findings highlight the limitations
of language adapters as a general solution for multilingual adaptation in English-centric LLMs.
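As a rough illustration of the technique, a language adapter is typically a small bottleneck module inserted into each transformer layer. The sketch below is generic; the hidden and bottleneck sizes are assumptions, not the paper's configuration.

```python
import numpy as np

def bottleneck_adapter(h, W_down, W_up):
    """Generic bottleneck adapter: down-project, non-linearity,
    up-project, residual connection. Shapes are illustrative only."""
    z = np.maximum(0.0, h @ W_down)   # down-projection + ReLU
    return h + z @ W_up               # up-projection + residual

rng = np.random.default_rng(0)
d_model, d_bottleneck = 4096, 64      # assumed sizes, not from the paper
h = rng.normal(size=(10, d_model))    # hidden states for 10 tokens
W_down = rng.normal(size=(d_model, d_bottleneck)) * 0.02
W_up = rng.normal(size=(d_bottleneck, d_model)) * 0.02
print(bottleneck_adapter(h, W_down, W_up).shape)  # (10, 4096)
```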
-
Muhammad Umer Tariq Butt, Stalin Varanasi, and Günter Neumann (2025)
Enabling Low-Resource Language Retrieval:
Establishing Baselines for Urdu MS MARCO, Proceedings of the 47th
European Conference on Information Retrieval
(ECIR 2025),
Italy, 2025.
As the Information Retrieval (IR) field increasingly recognizes the importance of inclusivity,
addressing the needs of low-resource languages remains a significant challenge.
This paper introduces the first large-scale Urdu IR dataset, created by translating the MS MARCO dataset
through machine translation. We establish baseline results through zero-shot learning for
IR in Urdu and subsequently apply the mMARCO multilingual IR methodology to this newly translated dataset.
Our findings demonstrate that the fine-tuned model (Urdu-mT5-mMARCO) achieves a
Mean Reciprocal Rank (MRR@10) of 0.247 and a Recall@10 of 0.439, representing significant
improvements over zero-shot results and showing the potential for expanding IR access for Urdu speakers.
By bridging access gaps for speakers of low-resource languages, this work not only advances
multilingual IR research but also emphasizes the ethical and societal importance of
inclusive IR technologies. This work provides valuable insights into the
challenges and solutions for improving language representation and lays the groundwork for
future research, especially in South Asian languages, which can benefit
from the adaptable methods used in this study.
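For reference, the reported metrics can be computed as follows; the ranking and relevance judgments in this example are made up.

```python
def mrr_at_10(ranked, relevant):
    """Reciprocal rank of the first relevant passage within the top 10."""
    for i, doc in enumerate(ranked[:10], start=1):
        if doc in relevant:
            return 1.0 / i
    return 0.0

def recall_at_10(ranked, relevant):
    """Fraction of relevant passages retrieved in the top 10."""
    return len(set(ranked[:10]) & relevant) / len(relevant)

# toy example: one query with known relevant passage ids
ranked = ["d7", "d3", "d9", "d1"]
relevant = {"d3", "d42"}
print(mrr_at_10(ranked, relevant))     # 0.5
print(recall_at_10(ranked, relevant))  # 0.5
```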
-
Noon Pokaratsiri, Saadullah Amin, and Günter Neumann (2024)
Towards Understanding Attention-based Reasoning through Graph
Structures in Medical Codes Classification, Proceedings of TextGraphs-17:
Graph-based Methods for Natural Language Processing at ACL-2024
(TextGraphs 2024),
Bangkok, Thailand, 2024.
A common approach to automatically assigning diagnostic and procedural clinical codes to health records
is to solve the task as a multi-label classification problem. Difficulties associated with this task stem
from domain knowledge requirements, long document texts, and a large, imbalanced label space reflecting the
breadth of, and dependencies between, medical diagnoses and procedures. Decisions in the healthcare domain also
need to demonstrate sound reasoning, both when they are correct and when they are erroneous. Existing works
address some of these challenges by incorporating external knowledge, which can be encoded into a graph-structured format.
Incorporating graph structures on the output label space, or between the input document and output
label spaces, has shown promising results in medical codes classification. Limited focus has been put on
utilizing graph-based representation on the input document space. To partially bridge this gap,
we represent clinical texts as graph-structured data through the UMLS Metathesaurus; we explore implicit
graph representation through pre-trained knowledge graph embeddings and explicit domain-knowledge
guided encoding of document concepts and relational information through graph neural networks.
Our findings highlight the benefits of pre-trained knowledge graph embeddings for understanding the
model's attention-based reasoning. In contrast, transparent domain-knowledge guidance in graph
encoder approaches is overshadowed by performance loss.
Our qualitative analysis identifies limitations that contribute to prediction errors.
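A minimal sketch of the implicit variant described above: documents mapped to UMLS concept identifiers (CUIs) whose pre-trained KG embeddings are mean-pooled into a document representation. The CUIs and embedding values here are hypothetical stand-ins.

```python
import numpy as np

# hypothetical pre-trained KG embeddings keyed by UMLS CUIs
kg_emb = {"C0020538": np.array([0.1, 0.3]),   # hypertension
          "C0011849": np.array([0.2, -0.1])}  # diabetes mellitus

def embed_document(cuis, emb, dim=2):
    """Mean-pool pre-trained concept embeddings found in a document."""
    vecs = [emb[c] for c in cuis if c in emb]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

print(embed_document(["C0020538", "C0011849", "C9999999"], kg_emb))
```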
-
Tanja Bäumel, Soniya Vijayakumar, Josef van Genabith, Günter Neumann, and Simon Ostermann (2023)
Investigating the Encoding of Words in BERT's Neurons Using Feature Textualization,
Proceedings of the 6th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP
(BlackboxNLP-2023),
Singapore, 2023.
Pretrained language models (PLMs) form the basis of most state-of-the-art NLP technologies.
Nevertheless, they are essentially black boxes: Humans do not have a clear understanding of
what knowledge is encoded in different parts of the models, especially in individual neurons.
This contrasts with computer vision, where feature visualization provides a decompositional interpretability
technique for neurons of vision models. Activation maximization is used to synthesize inherently
interpretable visual representations of the information encoded in individual neurons.
Our work is inspired by this but presents a cautionary tale on the interpretability of single neurons,
based on the first large-scale attempt to adapt activation maximization to NLP, and, more specifically,
large PLMs. We propose feature textualization, a technique to produce dense representations of neurons
in the PLM word embedding space. We apply feature textualization to the BERT model to investigate
whether the knowledge encoded in individual neurons can be interpreted and symbolized.
We find that the produced representations can provide insights about the knowledge encoded in
individual neurons, but that individual neurons do not represent clear-cut symbolic
units of language such as words. Additionally, we use feature textualization
to investigate how many neurons are needed to encode words in BERT.
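Activation maximization of the kind adapted here can be sketched as gradient ascent on an input vector, followed by a nearest-neighbour lookup in the embedding matrix. This toy version uses a linear neuron and random embeddings rather than BERT.

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab_size = 16, 100
E = rng.normal(size=(vocab_size, d))   # toy word embedding matrix
w = rng.normal(size=d)                 # weights of the probed "neuron"

x = rng.normal(size=d)                 # input vector to optimize
for _ in range(100):                   # gradient ascent on activation w.x
    x += 0.1 * w                       # gradient of w.x w.r.t. x is w
    x /= np.linalg.norm(x)             # keep the input on the unit sphere

# interpret the optimized vector via its nearest vocabulary embeddings
sims = E @ x / (np.linalg.norm(E, axis=1) * np.linalg.norm(x))
print(np.argsort(-sims)[:5])           # indices of the top-5 closest "words"
```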
-
Stalin Varanasi, Muhammad Umer Butt, and Günter Neumann (2023)
AutoQIR: Auto-Encoding Questions with Retrieval Augmented Decoding
for Unsupervised Passage Retrieval and Zero-shot Question Generation,
Proceedings of Recent Advances in Natural Language Processing (RANLP-2023), Bulgaria, 2023.
Dense passage retrieval models have become state-of-the-art for information retrieval on many Open-domain Question Answering
(ODQA) datasets. However, most of these models rely on supervision obtained from the ODQA datasets, which hinders their
performance
in a low-resource setting.
Recently, retrieval-augmented language models have been proposed to improve both zero-shot
and supervised information retrieval. However, these models have pre-training tasks that are agnostic to the target task of
passage retrieval.
In this work, we propose Retrieval Augmented Auto-encoding of Questions for zero-shot
dense information retrieval. Unlike other pre-training methods, our pre-training method
is built for target information retrieval, thereby making the pre-training more efficient. Our
method consists of a dense IR model for encoding questions and retrieving documents during
training and a conditional language model that maximizes the question’s likelihood by
marginalizing over retrieved documents. As a by-product, we can use this conditional language
model for zero-shot question generation from documents. We show that the IR model
obtained through our method improves the current state-of-the-art of zero-shot dense information
retrieval, and we improve the results even further by training on a synthetic corpus
created by zero-shot question generation.
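The objective described above marginalizes the question likelihood over retrieved documents; a toy numeric rendering of that sum, with all probabilities made up:

```python
import numpy as np

# Toy marginalization: p(q) = sum_d p_ret(d | q) * p_LM(q | d),
# i.e. retriever document probabilities weighted by the conditional
# language model's likelihood of the question. Numbers are invented.
p_ret = np.array([0.6, 0.3, 0.1])            # retriever scores, top-3 docs
p_q_given_d = np.array([1e-3, 4e-4, 1e-5])   # LM likelihood of q per doc
print(float(p_ret @ p_q_given_d))            # marginal question likelihood
```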
-
Saadullah Amin, Pasquale Minervini, David Chang, Pontus Stenetorp, and Günter Neumann (2022)
MedDistant19: Towards an Accurate Benchmark for Broad-Coverage Biomedical Relation
Extraction, Proceedings of The 29th International Conference on Computational Linguistics
(Coling-2022),
October 12-17, 2022, Gyeongju, Republic of Korea
Relation extraction in the biomedical domain is challenging due to the lack of labeled data and the high annotation costs,
which require domain experts. Distant supervision is commonly used to tackle the scarcity of annotated data by automatically pairing
knowledge graph relationships with raw texts. Such a pipeline is prone to noise and faces added challenges in scaling to cover
a large number of biomedical concepts. We investigated existing broad-coverage distantly supervised biomedical relation
extraction benchmarks and found a significant overlap between training and test relationships ranging from 26% to 86%.
Furthermore, we noticed several inconsistencies in the data construction process of these benchmarks, and where there is no
train-test leakage, the focus is on interactions between narrower entity types. This work presents a more accurate benchmark,
MedDistant19, for broad-coverage distantly supervised biomedical relation extraction that addresses these shortcomings and is
obtained by aligning MEDLINE abstracts with the widely used SNOMED Clinical Terms knowledge base. As prior benchmarks lack
thorough evaluation with domain-specific language models, we also conduct experiments validating general-domain relation
extraction findings on biomedical relation extraction.
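The train-test overlap measured above reduces to a set intersection over relation triples; a minimal sketch with invented triples (the paper's exact matching criteria may differ):

```python
def leakage(train_triples, test_triples):
    """Fraction of test (head, relation, tail) triples seen in training."""
    overlap = set(test_triples) & set(train_triples)
    return len(overlap) / len(test_triples)

train = {("aspirin", "treats", "headache"), ("insulin", "treats", "diabetes")}
test = {("aspirin", "treats", "headache"), ("statin", "treats", "hyperlipidemia")}
print(leakage(train, test))  # 0.5
```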
-
Ioannis Dikeoulias, Saadullah Amin, and Günter Neumann (2022)
Temporal Knowledge Graph Reasoning with Low-rank and Model-agnostic
Representations, Proceedings of the 7th Workshop on Representation Learning for NLP
(RepL4NLP-2022) at ACL 2022, pages 111-120, May 2022.
Temporal knowledge graph completion (TKGC) has become a popular approach for reasoning over event and temporal
knowledge graphs, targeting the completion of knowledge with accurate but missing information. In this context, tensor
decomposition has successfully modeled interactions between entities and relations. Its effectiveness in static knowledge
graph completion motivates us to introduce Time-LowFER, a family of parameter-efficient and time-aware extensions of the
low-rank tensor factorization model LowFER. Noting several limitations in current approaches to represent time, we propose
a cycle-aware time-encoding scheme for time features, which is model-agnostic and offers a more generalized representation
of time. We implement our methods in a unified temporal knowledge graph embedding framework, focusing on time-sensitive
data processing. The experiments show that our proposed methods perform on par or better than the state-of-the-art semantic
matching models on two benchmarks.
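A plausible rendering of a cycle-aware time encoding maps each periodic feature onto the unit circle, so the ends of a cycle stay adjacent; the concrete feature below (month of year) is only an example, not the paper's exact feature set.

```python
import math

def cyclic_encode(value, period):
    """Map a periodic feature (e.g. month of year) onto the unit circle,
    so the last and first values of the cycle end up adjacent."""
    angle = 2.0 * math.pi * value / period
    return (math.sin(angle), math.cos(angle))

print(cyclic_encode(12, 12))  # December ~ (0.0, 1.0)
print(cyclic_encode(1, 12))   # January, close to December on the circle
```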
-
Saadullah Amin, Noon Pokaratsiri, Morgan Wixted, Alejandro García-Rudolph,
Catalina Martínez-Costa, and Günter Neumann (2022)
Few-Shot Cross-lingual Transfer for Coarse-grained De-identification
of Code-Mixed Clinical Texts, Proceedings of the 21st Workshop on Biomedical Language Processing
(BioNLP-2022) at ACL 2022, pages 200-211, May 2022.
Despite the advances in digital healthcare systems offering curated structured knowledge, much of the critical information
still lies in large volumes of unlabeled and unstructured clinical texts. These texts, which often contain protected
health information (PHI), are exposed to information extraction tools for downstream applications, risking patient
identification. Existing works in de-identification rely on using large-scale annotated corpora in English, which often
are not suitable in real-world multilingual settings. Pre-trained language models (LMs) have shown great potential for
cross-lingual transfer in low-resource settings. In this work, we empirically show the few-shot cross-lingual transfer
property of LMs for named entity recognition (NER) and apply it to solve a low-resource and real-world challenge of
code-mixed (Spanish-Catalan) clinical notes de-identification in the stroke domain. We annotate a gold evaluation
dataset to assess few-shot setting performance where we only use a few hundred labeled examples for training. Our
model improves the zero-shot F1-score from 73.7% to 91.2% on the gold evaluation set when adapting Multilingual
BERT (mBERT) from the MEDDOCAN corpus with our few-shot cross-lingual target corpus. When
generalized to an out-of-sample test set, the best model achieves a human-evaluation F1-score of 97.2%.
-
Stalin Varanasi, Saadullah Amin and Günter Neumann (2021)
AutoEQA: Auto-Encoding Questions for Extractive Question Answering, The 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP-2021), Nov. 2021.
There has been significant progress in the field of extractive
question answering (EQA) in recent years. However, most approaches
rely on annotations of answer spans in the corresponding passages.
In this work, we address the problem of EQA when no annotations are
present for the answer span, i.e., when the dataset contains only
questions and corresponding passages. Our method is based on
auto-encoding of the question, which performs a question answering (QA)
task during encoding and a question generation (QG) task during
decoding. Our method performs well in a zero-shot setting and can
provide an additional loss that boosts performance for EQA.
-
Saadullah Amin and Günter Neumann (2021)
T2NER: Transformers based Transfer Learning Framework for Named Entity Recognition. The 16th Conference of the European Chapter of the Association for Computational Linguistics (EACL), demo session, 2021.
Recent advances in deep transformer models
have achieved state-of-the-art in several natural
language processing (NLP) tasks, whereas
named entity recognition (NER) has traditionally
benefited from long short-term memory
(LSTM) networks. In this work, we present a
Transformers based Transfer Learning framework
for Named Entity Recognition (T2NER)
created in PyTorch for the task of NER with
deep transformer models. The framework is
built upon the Transformers library as the core
modeling engine and supports several transfer
learning scenarios from sequential transfer
to domain adaptation, multi-task learning, and
semi-supervised learning. It aims to bridge the
gap between the algorithmic advances in these
areas by combining them with the state-of-the-art
in transformer models to provide a unified
platform that is readily extensible and can be
used for both the transfer learning research
in NER, and for real-world applications. The
framework is available at: https://github.com/suamin/t2ner.
-
Ekaterina Loginova, Stalin Varanasi and Günter Neumann (2021)
Towards End-to-End Multilingual Question Answering.
In the journal Information Systems Frontiers, 23(1): 227-241 (2021).
Multilingual question answering (MLQA) is a critical part of an accessible natural language interface.
However, current solutions demonstrate performance far below that of monolingual systems.
We believe that deep learning approaches are likely to improve performance in MLQA drastically.
This work aims to discuss the current state-of-the-art and remaining challenges.
We outline requirements and suggestions for practical parallel data collection and describe existing methods,
benchmarks and datasets. We also demonstrate that a simple translation of texts
can be inadequate in the case of the Arabic, English and German languages (on the InsuranceQA and SemEval datasets),
and thus more sophisticated models are required. We hope
that our overview will re-ignite interest in multilingual question answering, especially with regard to neural approaches.
-
Stalin Varanasi, Saadullah Amin, and Günter Neumann (2020)
CopyBERT: A Unified Approach to Question Generation with Self-Attention.
Proceedings of the 2nd Workshop on NLP for Conversational AI, ACL workshop, 2020.
Contextualized word embeddings provide better initialization for
neural networks that deal with various natural language understanding
(NLU) tasks, including Question Answering (QA) and, more recently,
Question Generation (QG). Apart from providing meaningful word
representations, pre-trained transformer models such as BERT also
provide self-attentions which encode syntactic information that can
be probed for dependency parsing and POS tagging. In this paper, we
show that the information from the self-attentions of BERT is useful
for language modeling of questions conditioned on paragraph and
answer phrases. To control the attention span, we use a semi-diagonal
mask and utilize a shared model for encoding and decoding, unlike
sequence-to-sequence models. We further employ a copy mechanism over
self-attentions to achieve state-of-the-art results for Question
Generation on the SQuAD dataset.
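One plausible reading of the semi-diagonal mask: context positions (paragraph plus answer) attend bidirectionally within the context, while question positions additionally attend causally to previous question tokens, so one shared model can both encode and decode. The construction below is our sketch, not the paper's code.

```python
import numpy as np

def semi_diagonal_mask(n_ctx, n_q):
    """Attention mask: full attention within the context block,
    causal (lower-triangular) attention over the question block."""
    n = n_ctx + n_q
    mask = np.zeros((n, n), dtype=bool)
    mask[:, :n_ctx] = True               # every position sees the context
    for i in range(n_ctx, n):            # question tokens also see previous
        mask[i, n_ctx:i + 1] = True      # question tokens and themselves
    return mask

print(semi_diagonal_mask(3, 3).astype(int))
```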
-
Saadullah Amin, Stalin Varanasi, Katherine Dunfield and Günter Neumann (2020)
LowFER: Low-rank Bilinear Pooling for Link Prediction.
Proceedings of the 37th International Conference on Machine Learning (ICML-2020), 2020.
Knowledge graphs are incomplete by nature, with only a limited number of observed facts from the world knowledge
being represented as structured relations between entities. To partly address this issue, an important task in
statistical relational learning is that of link prediction or knowledge graph completion. Both linear and non-linear
models have been proposed to solve the problem. Bilinear models, while expressive, are prone to overfitting and lead
to quadratic growth of parameters in the number of relations. Simpler models have become more standard, with certain
constraints on the bilinear map as relation parameters. In this work, we propose a factorized bilinear pooling model,
commonly used in multi-modal learning, for better fusion of entities and relations, leading to an efficient and
constraint-free model. We prove that our model is fully expressive, providing bounds on the embedding dimensionality
and factorization rank. Our model naturally generalizes the Tucker-decomposition-based TuckER model, which has been
shown to generalize other models, as an efficient low-rank approximation without substantially compromising the
performance. Due to the low-rank approximation, the model complexity can be controlled by the factorization rank,
avoiding the possible cubic growth of TuckER. Empirically, we evaluate on real-world datasets, reaching on-par or
state-of-the-art performance. At extremely low ranks, the model preserves the performance while staying parameter
efficient.
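The scoring function can be sketched as MFB-style factorized bilinear pooling: down-project subject and relation, multiply elementwise, sum-pool over k factors, and score against the object embedding. Shapes and the pooling layout here are our assumptions, not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 8, 3                              # embedding dim and factor rank (toy)
U = rng.normal(size=(d, k * d)) * 0.1    # subject projection
V = rng.normal(size=(d, k * d)) * 0.1    # relation projection

def lowfer_score(e_s, e_r, e_o):
    """Factorized bilinear pooling of subject and relation, scored
    against the object embedding (MFB-style sum pooling over k)."""
    z = (e_s @ U) * (e_r @ V)            # (k*d,) fused representation
    z = z.reshape(k, d).sum(axis=0)      # sum-pool the k factors -> (d,)
    return float(z @ e_o)

e_s, e_r, e_o = (rng.normal(size=d) for _ in range(3))
print(lowfer_score(e_s, e_r, e_o))
```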
-
Saadullah Amin, Katherine Dunfield, Anna Vechkaeva and Günter Neumann (2020)
A Data-driven Approach for Noise Reduction in Distantly Supervised Biomedical Relation Extraction.
In Proceedings of BioNLP-2020 at ACL-2020.
Fact triples are a common form of structured
knowledge used within the biomedical domain.
As the amount of unstructured scientific texts
continues to grow, manual annotation of these
texts for the task of relation extraction becomes
increasingly expensive. Distant supervision
offers a viable approach to combat this
by quickly producing large amounts of labeled,
but considerably noisy, data. We aim to reduce
such noise by extending an entity-enriched relation
classification BERT model to the problem
of multiple instance learning, and defining
a simple data encoding scheme that significantly
reduces noise, reaching state-of-the-art
performance for distantly-supervised biomedical
relation extraction. Our approach further
encodes knowledge about the direction of relation
triples, allowing for increased focus on relation
learning by reducing noise and alleviating
the need for joint learning with knowledge
graph completion.
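The data encoding scheme might look like the following entity-marker scheme with a direction tag; the marker tokens and tag names are our illustrative choices, not necessarily the paper's.

```python
def encode_pair(tokens, head_span, tail_span, direction):
    """Wrap head and tail entity mentions in marker tokens and prepend
    a direction tag for the relation triple (illustrative scheme)."""
    out = list(tokens)
    (hs, he), (ts, te) = head_span, tail_span
    # insert markers from the right so earlier offsets stay valid
    for start, end, left, right in sorted(
            [(hs, he, "[E1]", "[/E1]"), (ts, te, "[E2]", "[/E2]")],
            reverse=True):
        out[end:end] = [right]
        out[start:start] = [left]
    return [f"[{direction}]"] + out

print(encode_pair(["aspirin", "relieves", "headache"], (0, 1), (2, 3), "FWD"))
# ['[FWD]', '[E1]', 'aspirin', '[/E1]', 'relieves', '[E2]', 'headache', '[/E2]']
```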
-
Dominik Stammbach and Günter Neumann (2019)
Team DOMLIN: Exploiting Evidence Enhancement for the FEVER Shared Task.
In Proceedings of the
Second Workshop on Fact Extraction and VERification (FEVER), EMNLP workshop, 2019.
This paper contains our system description for the second Fact Extraction and VERification (FEVER) challenge.
We propose a two-staged sentence selection strategy to account for examples in the dataset where evidence is not only
conditioned on the claim, but also on previously retrieved evidence. We use a publicly available
document retrieval module and have fine-tuned BERT checkpoints for sentence
selection and as the entailment classifier. We report a FEVER score of 68.46% on the blind test set.
-
Saadullah Amin, Günter Neumann, Katherine Dunfield, Anna Vechkaeva,
Kathryn Annette Chapman, and Morgan Kelly Wixted (2019)
MLT-DFKI at CLEF eHealth 2019: Multi-label Classification of ICD-10 Codes with BERT.
In working notes of CLEF eHealth, 2019.
With the adoption of electronic health record (EHR) systems,
hospitals and clinical institutes have access to large amounts
of heterogeneous patient data. Such data consists of structured
(insurance details, billing data, lab results etc.) and unstructured
(doctor notes, admission and discharge details, medication steps etc.)
documents, of which the latter is of great significance for applying natural
language processing (NLP) techniques. In parallel, recent advancements
in transfer learning for NLP have pushed the state-of-the-art to new
limits on many language understanding tasks. Therefore, in this paper,
we present team DFKI-MLT's participation at CLEF eHealth 2019 Task 1 of
automatically assigning ICD-10 codes to non-technical summaries (NTSs)
of animal experiments, where we use various architectures in a multi-label
classification setting and demonstrate the effectiveness of transfer
learning with pre-trained language representation model BERT
(Bidirectional Encoder Representations from Transformers) and its recent
variant BioBERT. We first translate the task documents from German to
English using an automatic translation system and then use BioBERT, which
achieves an F1-micro of 73.02% on the submitted run as
evaluated by the challenge.
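The reported F1-micro pools decisions over all (document, code) pairs; a minimal reference implementation with invented codes:

```python
def f1_micro(gold, pred):
    """Micro-averaged F1 over multi-label predictions:
    counts are pooled over all (document, code) decisions."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    fp = sum(len(p - g) for g, p in zip(gold, pred))
    fn = sum(len(g - p) for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = [{"ICD10:E11", "ICD10:I10"}, {"ICD10:J45"}]
pred = [{"ICD10:E11"}, {"ICD10:J45", "ICD10:I10"}]
print(round(f1_micro(gold, pred), 3))  # 0.667
```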
-
Alejandro Figueroa, Carlos Gómez-Pantoja, and Günter Neumann (2019)
Integrating heterogeneous sources for predicting question temporal anchors across Yahoo! Answers.
In journal Information Fusion, Volume 50,
October 2019, Pages 112-125.
Modern Community Question Answering (CQA) web forums provide the possibility to browse their archives using
question-like search queries as in Information Retrieval
(IR) systems. Although these traditional IR methods have become very successful at fetching semantically related
questions, they typically leave their temporal relations unconsidered.
That is to say, a group of questions may be asked more often during specific recurring timelines
despite being semantically unrelated. In fact,
predicting temporal aspects would not only assist these platforms in widening the semantic diversity of their search
results, but also in re-stating questions that
need to refresh their answers and in producing more dynamic, especially temporally-anchored, displays.
In this paper, we devise a new set of time-frame-specific categories for CQA questions, obtained by fusing
two distinct earlier taxonomies (i.e., [29] and [50]). These new categories are then utilized in a large
crowd-sourcing-based human annotation effort. Accordingly, we present a systematic analysis of its results
in terms of complexity and degree of difficulty as they relate to the different question topics.
Incidentally, through a large number of experiments, we investigate the effectiveness of a wider variety of linguistic
features compared to what has been done in
previous works. We additionally mix evidence/features distilled directly and indirectly from questions by capitalizing
on their related web search results. We
finally investigate the impact and effectiveness of multi-view learning to boost a large variety of multi-class
supervised learners by optimizing a latent layer
built on top of two views: one composed of features harvested from questions, and the other of CQA metadata and
evidence extracted from web resources (i.e.,
snippets and Internet archives).
-
Ekaterina Loginova and Günter Neumann (2018)
An Interactive Web-Interface for Visualizing the Inner Workings of the Question Answering LSTM.
In proceedings of the
Conference on Empirical Methods in Natural Language Processing - EMNLP-2018, October 31 – November 4, Brussels, Belgium, 2018.
Deep learning models for NLP are potent but not readily interpretable. This prevents researchers from improving a model's performance
efficiently and users from applying it to tasks which require a high level of trust in the system. We present a visualisation tool
which aims to illuminate the inner workings of a specific LSTM model for question answering.
It plots heatmaps of neurons’ firings and allows a user to check the dependency between neurons and manual features. The system possesses
an interactive web-interface and can be adapted to other models and domains.
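The kind of heatmap the tool plots can be reproduced in a few lines with matplotlib; the activations here are random stand-ins for real LSTM states.

```python
import numpy as np
import matplotlib.pyplot as plt

tokens = ["what", "is", "the", "capital", "of", "france", "?"]
acts = np.random.default_rng(0).normal(size=(8, len(tokens)))  # 8 neurons (stand-in)

fig, ax = plt.subplots()
ax.imshow(acts, aspect="auto", cmap="coolwarm")  # neurons x tokens heatmap
ax.set_xticks(range(len(tokens)))
ax.set_xticklabels(tokens)
ax.set_ylabel("neuron")
plt.show()
```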
-
Georg Heigold, Stalin Varanasi, Günter Neumann and Josef van Genabith (2018)
How Robust Are Character-Based Word Embeddings in Tagging and MT Against Wrod Scramlbing or Randdm Nouse?.
In proceedings of AMTA, March 2018.
This paper investigates the robustness of NLP against perturbed word forms. While neural approaches can achieve (almost)
human-like accuracy for certain tasks and conditions, they often are sensitive to small changes in the input such as non-canonical input
(e.g., typos). Yet both stability and robustness are desired properties in applications involving user-generated content,
all the more so as humans easily cope with such noisy or adversarial conditions. In this paper, we study the impact of noisy input.
We consider different noise distributions (one type of noise, combination of noise types) and mismatched noise distributions
for training and testing. Moreover, we empirically evaluate the robustness of different models (convolutional neural networks,
recurrent neural networks, non-neural models), different basic units (characters, byte pair encoding units),
and different NLP tasks (morphological tagging, machine translation).
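The scrambling noise alluded to in the (deliberately garbled) title keeps a word's first and last letters fixed and permutes the interior; one common formulation, as a sketch:

```python
import random

def scramble(word, rng=random.Random(0)):
    """Permute a word's interior characters, keeping the first and
    last characters in place (as in the title's 'Wrod Scramlbing')."""
    if len(word) <= 3:
        return word
    middle = list(word[1:-1])
    rng.shuffle(middle)
    return word[0] + "".join(middle) + word[-1]

print(" ".join(scramble(w) for w in "random noise in the input".split()))
```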
-
Georg Heigold, Günter Neumann and Josef van Genabith (2017)
An Extensive Empirical Evaluation of Character-Based Morphological Tagging for 14 Languages.
In proceedings of EACL, 2017.
This paper investigates neural character-based morphological tagging for languages with complex morphology and large tag sets.
Character-based approaches are attractive as they can handle rare and unseen words gracefully. We evaluate on 14
languages and observe consistent gains over a state-of-the-art morphological tagger across all languages except
for English and French, where we match the state-of-the-art. We compare two architectures for computing
character-based word vectors using recurrent (RNN) and convolutional (CNN) nets. We show that the CNN based approach
performs slightly worse and less consistently than the RNN based approach.
Small but systematic gains are observed when combining the two architectures by ensembling.