Tenders

For Students

Offers for theses, jobs, or internships.

Table of Contents

Bachelor Theses incl. Internship
Master Theses

In order to apply for a thesis topic, please use our contact form.

Bachelor Theses incl. Internship

Interactive Topic Modeling

The analysis of forensic communication data is usually guided by one or more hypotheses, e.g. about the existence of a specific criminal offence, such as a drug offence. Based on these hypotheses, the investigators assume that specific case-relevant topics (such as drug-related crime) were discussed in the chats. Semi-supervised topic modeling can be used to find evidence that these case-relevant topics were discussed. However, the investigator must still interpret whether the extracted topics correspond to the expected case-relevant topics. Furthermore, the algorithms require as input some words that accurately describe the case-relevant topics, and such words are often difficult to find. Therefore, the investigator’s feedback should be incorporated in order to iteratively find better words that characterise the topics. Another goal of the thesis is to determine a score that indicates whether the expected topics have been found.
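One simple way to picture such a score, as a hedged sketch only: measure what fraction of the investigator's seed words appear among the top-ranked words of the best-matching extracted topic. The function name `seed_coverage`, the toy topics, and the seed words are illustrative assumptions and not part of the thesis specification; a real score would likely need to be more robust.

```python
def seed_coverage(seed_words, topics, top_n=10):
    """Return (best_topic_index, score), where score is the fraction of
    seed words found among the top_n words of the best-matching topic."""
    seeds = set(seed_words)
    if not seeds:
        raise ValueError("seed_words must not be empty")
    best_index, best_score = -1, -1.0
    for index, ranked_words in enumerate(topics):
        hits = seeds & set(ranked_words[:top_n])
        score = len(hits) / len(seeds)
        if score > best_score:
            best_index, best_score = index, score
    return best_index, best_score


# Hypothetical topics, each given as a ranking of words.
topics = [
    ["dinner", "tonight", "home", "late"],
    ["gram", "price", "deliver", "cash"],
]
best_index, score = seed_coverage(["gram", "cash", "dealer"], topics)
```

In an interactive loop, a low score could prompt the investigator to revise the seed words and rerun the model.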

Requirements

Enjoyment of programming in R and Python

Duration: 6 months

Assigned Places/Total Places: 0/2

apply now

Topic Modeling with syntactic filters

When analysing large amounts of messages from messenger services (e.g., WhatsApp, Signal), investigators are often interested in whether case-relevant topics were discussed in addition to the dominant everyday topics. Case-relevant topics can be extracted by incorporating the investigators’ prior knowledge into semi-supervised algorithms for topic modeling. However, interpreting the extracted topics remains a challenge. Usually, these are presented as a ranking of words. In particular, in the case of colloquial instant messages, the words that describe a topic often include many high-frequency, irrelevant words. This problem can be counteracted by only considering content-bearing words such as nouns and, if necessary, verbs and adjectives when modeling topics. It should, therefore, be examined to what extent the restriction of the vocabulary to certain parts of speech through POS tagging increases the interpretability of the case-relevant topics.
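The vocabulary restriction described above can be sketched as follows. This is a minimal illustration that assumes the messages have already been POS-tagged by some tagger; the tag names follow the Universal POS convention, and the example message and its tags are hand-made assumptions.

```python
# Parts of speech treated as content-bearing in this sketch.
CONTENT_TAGS = {"NOUN", "VERB", "ADJ"}


def filter_by_pos(tagged_message, keep_tags=CONTENT_TAGS):
    """Keep only the tokens whose POS tag is in keep_tags."""
    return [token for token, tag in tagged_message if tag in keep_tags]


# Hypothetical, hand-tagged chat message.
tagged = [
    ("yeah", "INTJ"), ("i", "PRON"), ("bring", "VERB"),
    ("the", "DET"), ("cash", "NOUN"), ("tomorrow", "NOUN"),
]
content_words = filter_by_pos(tagged)
nouns_only = filter_by_pos(tagged, keep_tags={"NOUN"})
```

The filtered token lists would then replace the raw messages as input to the topic model, so that different tag sets (nouns only, or nouns plus verbs and adjectives) can be compared.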

Requirements

Enjoyment of programming in R and Python

Duration: 6 months

Assigned Places/Total Places: 1/1

Veronika Fuchs

Topic modelling with Large Language Models

Topic modelling has proven very useful in forensics for obtaining a quick overview of large amounts of data to be analysed, especially communication data (e.g., instant messages). However, traditional topic modelling can struggle with short, colloquial, noisy texts. An alternative is to apply so-called large language models (LLMs), such as ChatGPT, to topic modelling. This thesis will first investigate, through a literature review, the extent to which large language models have been used for topic analysis. Subsequently, own experiments will be conducted, focusing on the extent to which LLMs can identify presumed topics in texts.

Requirements

Enjoyment of extensive and systematic literature research
Basic knowledge of R or Python programming

Duration: 6 months

Assigned Places/Total Places: 0/1

apply now

Implementation of the Guided Topic Noise Model

Topic modeling can help with the analysis of large volumes of forensic communication data - from instant messaging, for example - by automatically identifying key content. However, conventional methods have their limitations: they tend to emphasize irrelevant small-talk topics and often produce output that is hard to interpret, for example because of slang. The Guided Topic Noise Model (GTM), an algorithm developed in Java, addresses these weaknesses by incorporating thematic prior knowledge and being robust to colloquial expressions. The aim of the thesis is to implement GTM in the forensic software Mobile Network Analyzer (MoNA). In particular, an interactive feedback loop is to be realized in which users can iteratively adapt the integrated prior knowledge.

Requirements

Enjoyment of Java programming

Duration: 6 months

Assigned Places/Total Places: 1/1

Tony Katzwinkel, Paul Klotzsche, Alexa Müller, Malte Reinhardt, Tim Reißner, Mandana Wille

Topic modelling for Social Media

Monitoring social networks makes an essential contribution to crime prevention and to the resource planning of security forces. It can be used to identify announcements of events with escalation potential, such as demonstrations, as well as the dissemination of potentially harmful (e.g. xenophobic) ideas and the intensification of discussions. Topic modelling offers a way to get a quick overview of what is currently being discussed on social media. One challenge for topic modelling of social media posts is that topics continuously develop and change. This thesis aims first to show the current state of research on topic modelling of social media posts, focusing on methods such as lifelong topic modelling. Subsequently, a selected method will be examined experimentally.

Requirements

Enjoyment of extensive and systematic literature research
Enjoyment of programming

Duration: 6 months

Assigned Places/Total Places: 0/1

apply now

Literature review Topic-based Search

Whether searching for interesting articles as part of a research project, for books in digital library systems, or for relevant information in large forensic data sets such as communication data, users often want to find texts on specific topics. Accordingly, there is a need for text retrieval systems that accept as a query a topic that the retrieved texts should address, rather than simple search terms. This thesis aims to show the current state of research on topic-based search through a systematic literature review, thus creating a basis for developing a topic-based text retrieval system for forensic applications.

Requirements

Enjoyment of extensive and systematic literature research

Duration: 3-6 months

Assigned Places/Total Places: 0/1

apply now

Scenarios for personality detection

In real-world investigations, knowledge of psychological personality traits (e.g. OCEAN, Cattell Sixteen Personality Factor) often helps to identify the offender(s). These personality traits are also expressed in written text. A suitable data basis is required to develop approaches for automatically recognising personality traits from texts. However, psychological personality traits cannot be derived from every conversation. Therefore, artificial conversations based on specific scenarios are to be developed. The thesis aims to design scenarios for conversations that allow personality recognition. The main tasks required for this include a comprehensive literature review, the development of the scenarios, and the validation of these scenarios with the help of experts, e.g., those from the field of psychology.

Requirements

Enjoyment of extensive and systematic literature research
Basic knowledge of statistics

Duration: 3-6 months

Assigned Places/Total Places: 1/1

Lena Velthaus

Membership Inference Attacks on Large Language Models

Large Language Models (LLMs) are currently very popular in both our personal and professional lives across numerous fields, including healthcare, customer support, and software development. For many well-known commercial LLMs, such as the GPT models, Gemini, or Claude, the exact composition of the training data is unknown. The question of whether certain documents were part of the training data can be of interest from both a copyright perspective (e.g., books or song lyrics) and a data protection perspective (e.g., sensitive patient information). So-called membership inference attacks attempt to determine, using a trained model, whether a specific text was included in its training corpus. This thesis examines whether and to what extent it is possible to determine the presence of specific documents in LLM training data, even when that data is not publicly available. To this end, a systematic literature review will be conducted first. Subsequently, existing approaches will be experimentally investigated, and new approaches will be proposed and evaluated.
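The basic idea behind one common family of membership inference attacks, loss thresholding, can be illustrated with a toy example. The sketch below is an assumption-laden stand-in: a smoothed unigram model replaces the LLM, and the corpus, function names, and threshold choice are purely illustrative. Attacks on real LLMs are considerably more involved.

```python
import math
from collections import Counter


def train_unigram(corpus):
    """Toy stand-in for a trained model: word counts with add-one smoothing."""
    counts = Counter(word for text in corpus for word in text.split())
    total = sum(counts.values())
    vocab_size = len(counts) + 1  # +1 reserves probability mass for unseen words
    return lambda word: (counts[word] + 1) / (total + vocab_size)


def average_loss(model, text):
    """Mean negative log-likelihood per token under the model."""
    words = text.split()
    if not words:
        raise ValueError("text must contain at least one token")
    return -sum(math.log(model(word)) for word in words) / len(words)


def is_member(model, text, threshold):
    """Guess 'member' when the model assigns the text an unusually low loss."""
    return average_loss(model, text) < threshold


# Hypothetical training corpus and candidate texts.
training_corpus = ["the patient was given aspirin", "the patient went home"]
model = train_unigram(training_corpus)
seen_loss = average_loss(model, "the patient was given aspirin")
unseen_loss = average_loss(model, "stock markets fell sharply today")
threshold = (seen_loss + unseen_loss) / 2  # illustrative choice only
```

In practice the threshold cannot be set from known members like this; calibrating it without access to the training data is part of what makes the problem hard.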

Requirements

Enjoyment of programming (R or Python)
A very basic understanding of LLMs
Willingness to independently familiarize yourself with a new field of research

Duration: 6 months

Assigned Places/Total Places: 0/2

apply now

Reverse Prompt Engineering in Large Language Models

Large Language Models (LLMs) (e.g., the GPT models) are highly versatile, from text generation and translation to code generation. The key to quality lies in what is known as prompt engineering: the art of crafting a good prompt. Companies like TikTok are specifically investing in specialists in this field. However, the growing value of such prompts comes with a new security risk: reverse prompt engineering (also known as prompt stealing), that is, reconstructing a prompt based on a model’s outputs. Understanding attack methods is the first step toward developing effective countermeasures. The thesis begins by addressing the topic through a systematic literature review, which summarizes the current state of research on reverse prompt engineering, including known attack methods and countermeasures. Finally, original experiments will be conducted to empirically investigate selected techniques.

Requirements

Interest in artificial intelligence and large language models (LLMs)
Enjoyment of programming (R or Python)
Willingness to independently familiarize yourself with a new field of research

Duration: 6 months

Assigned Places/Total Places: 0/2

apply now

Exploration of the forensic value of content credentials

Images and videos spread like wildfire on social media and can cause harm just as quickly. Modern AI tools make it easier than ever to manipulate or completely fabricate visual content. The consequences range from unintentionally misleading content to targeted disinformation campaigns. A promising countermeasure is what’s known as content provenance: media files carry a cryptographically secured history of their creation directly within the file, similar to a digital resume. It records who created the file, on which device, and which editing steps have been taken since then. Any subsequent manipulation breaks the signature and thus becomes visible. This is precisely what the open standard C2PA (Coalition for Content Provenance and Authenticity) promises, which was jointly developed by companies such as Adobe and Microsoft and is being integrated into cameras, software, and platforms. This bachelor’s thesis examines how well C2PA works in practice, specifically whether it can serve as a forensic tool for detecting image manipulation and where its limitations lie. To this end, a systematic literature review will first be conducted to determine the extent to which C2PA manifests have already been used in forensic contexts. Own experiments will then investigate the extent to which real image manipulations can be reliably detected.
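The principle that manipulation breaks a signature can be shown in a few lines. Note the substitution: C2PA itself relies on certificate-based signatures stored in a manifest, whereas this sketch uses a symmetric HMAC from the Python standard library purely as a simplified stand-in. The key and the byte strings are hypothetical.

```python
import hashlib
import hmac


def sign(content: bytes, key: bytes) -> str:
    """Compute an HMAC-SHA256 tag over the content (stand-in for a real signature)."""
    return hmac.new(key, content, hashlib.sha256).hexdigest()


def verify(content: bytes, key: bytes, signature: str) -> bool:
    """Check that the content still matches the tag created at signing time."""
    return hmac.compare_digest(sign(content, key), signature)


key = b"hypothetical-signing-key"
original = b"example image bytes"
signature = sign(original, key)

# Changing even a single byte invalidates the signature.
tampered = original + b"x"
original_ok = verify(original, key, signature)
tampered_ok = verify(tampered, key, signature)
```

Such a check only reveals that content changed after signing; it says nothing about files whose provenance data was stripped or never attached, which is one of the limitations the thesis would need to examine.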

Requirements

Basic knowledge of and interest in forensics and cryptography
Basic programming skills (e.g., Python)
Willingness to independently familiarize yourself with a new field of research

Duration: 6 months

Assigned Places/Total Places: 0/1

apply now

Relationship between socio-demographic characteristics and perceptions of misogyny

The automated detection of misogyny – i.e. a derogatory attitude towards women – in social media posts is a key challenge. The development of suitable classification systems is based on annotated data sets that reflect human judgements. However, such annotations are often influenced by subjective assessments. In particular, the question arises as to what extent socio-demographic characteristics influence what is perceived as misogynistic and labelled as such. This question is the focus of the bachelor’s thesis. Specifically, based on the data provided in the EXIST 2025 competition, the thesis will investigate the extent to which annotations differ according to age, gender, ethnicity, educational attainment and origin. To this end, extensive exploratory data analyses will be carried out.

Requirements

Enjoyment of programming in R (or Python)
Basic knowledge of statistics

Duration: 6 months

Assigned Places/Total Places: 0/2

apply now

Data analysis and Data Visualization with Large Language Models

The Gemini language model, a chatbot developed by Google, can not only generate natural-sounding texts, but also support programming. It can be particularly useful in the field of data analysis. Gemini is even able to generate complete Jupyter notebooks - including code for reading, pre-processing, analyzing and visualizing data. In this bachelor’s thesis, a data analysis of forensic data performed by Gemini will be compared with a manual analysis (using R or Python) of the same data.

Requirements

Basic knowledge of R or Python

Duration: 3-6 months

Assigned Places/Total Places: 0/1

apply now

Automatic generation of test communication data with Puma

Since real forensic communication data contains sensitive information, generating test data plays a crucial role in researching mobile devices. However, creating this test data manually involves a considerable amount of effort. The Python-based open-source tool Puma, which automates the creation of realistic data sets, offers potential in this area. The aim of the internship, followed by a bachelor’s thesis, is first to utilise Puma to generate communication test data on mobile devices. Subsequently, the quality of this generated communication test data is to be evaluated, including a comparison with alternative methods of synthetic data generation, such as the use of large language models (LLMs).

Requirements

Knowledge of and enjoyment in programming with Python

Duration: 6 months

Assigned Places/Total Places: 0/1

apply now

Master Theses

Topic Modeling using seeded BTM

When analysing messages from messenger services (e.g. WhatsApp, Signal, etc.), investigators are often interested in whether specific case-relevant topics were discussed. To find evidence or proof of suspected topics, semi-supervised algorithms for topic modelling can be promising. These algorithms take a few characteristic words for the desired topics and are encouraged to extract these topics. However, the short length of instant messages is challenging for most algorithms. One exception is the ‘Seeded-BTM’ topic model explicitly developed for brief texts. This algorithm will be implemented, applied to forensic communication data, and evaluated.
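Biterm topic models cope with short texts by modeling unordered word pairs (biterms) that co-occur in the same message, aggregated over the whole corpus. The sketch below shows only this extraction step on hypothetical messages; how seed words enter the model is not shown and would follow the Seeded-BTM publication.

```python
from collections import Counter
from itertools import combinations


def extract_biterms(message):
    """Return all unordered word pairs (biterms) co-occurring in one short text."""
    words = message.lower().split()
    return [tuple(sorted(pair)) for pair in combinations(words, 2)]


# Hypothetical instant messages.
messages = ["bring cash tonight", "cash tonight ok"]

# Corpus-level biterm counts are the input to the topic model.
biterm_counts = Counter(
    biterm for message in messages for biterm in extract_biterms(message)
)
```

Because the counts are pooled across all messages, co-occurrence patterns that are too sparse within any single message still become visible.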

Requirements

Enjoyment of programming, e.g. in R or Python
Basic mathematical understanding of algorithms for topic modelling

Duration: 6 months

Assigned Places/Total Places: 0/1

apply now

Semi-supervised Neural Topic Modeling

Topic modeling is essential for analysing large amounts of messages from messenger services (e.g. WhatsApp, Signal). This method should support investigators in achieving the two fundamental goals of forensic data analysis. On the one hand, investigators want to find evidence or proof of suspected, case-relevant topics (e.g. topics related to a known drug deal) to support or refute a specific forensic hypothesis. On the other hand, investigators are also interested in discovering new topics they would not have expected to find in the messages (e.g., about the crime’s motivation). For the second goal, unsupervised topic models based on neural networks have been increasingly used for several years, while for the first goal, their semi-supervised adaptation can be promising. However, so far, no neural approach to topic modeling supports both goals. The aim is, therefore, to extend an existing algorithm for semi-supervised topic modeling based on neural networks (e.g. ‘KeyETM’, ‘Seeded NTM’ or ‘vONTSS’) in such a way that new, unexpected topics are also found in addition to suspected topics. Subsequently, the adapted algorithm should be evaluated using forensic communication data.

Requirements

Good understanding of code (Python)

Duration: 6 months

Assigned Places/Total Places: 1/2

apply now

B.Sc. Eric Kropf

Semi-supervised Embedding-based Topic Modeling

When analysing messages from messenger services (e.g. WhatsApp, Signal), investigators are often interested in whether specific topics relevant to the case were discussed in these messages. Semi-supervised topic modelling approaches have the potential to find indications of these suspected topics. Most semi-supervised algorithms are extensions of probabilistic latent Dirichlet allocation. A promising alternative are approaches that consider the semantic similarity of words, which can be realised using word embeddings (e.g. word2vec, fastText). Therefore, the aim is to compare different semi-supervised approaches based on word embeddings (e.g. ‘CatE’, ‘SeedTopicMine’ and ‘JoSH’) to identify case-relevant topics in forensic chat messages. In a second step, existing implementations can be expanded to discover new topics in addition to the expected topics, such as the motivation for the offence or previously unexpected connections to particular persons.
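The role of semantic similarity can be illustrated by ranking vocabulary words by their cosine similarity to a seed word. This is only a sketch of the underlying idea, not an implementation of CatE, SeedTopicMine, or JoSH; the two-dimensional vectors are made up, whereas real embeddings would come from a trained model such as word2vec or fastText.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def rank_by_seed_similarity(seed, embeddings):
    """Rank all other words by decreasing similarity to the seed word."""
    seed_vector = embeddings[seed]
    others = [word for word in embeddings if word != seed]
    return sorted(
        others,
        key=lambda word: cosine(seed_vector, embeddings[word]),
        reverse=True,
    )


# Hypothetical toy embeddings.
embeddings = {
    "drugs": [0.9, 0.1],
    "weed": [0.8, 0.2],
    "dinner": [0.1, 0.9],
}
ranking = rank_by_seed_similarity("drugs", embeddings)
```

Words ranked close to a seed are candidates for describing the suspected topic, even if they never co-occur with the seed in the same message.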

Requirements

Enjoyment of programming, e.g. with R or Python
A rough understanding of C code

Duration: 6 months

Assigned Places/Total Places: 0/2

apply now

Topic-characteristic messages with LVQ

The immense volume of messages from messenger services that need to be analysed as part of forensic investigations presents investigators with an increasing challenge. The automatic extraction of topics can be promising for gaining an overview of the content discussed in the chats. However, the interpretation of these topics often proves to be problematic. Most approaches to topic modeling present them as a ranking of words or phrases that have little meaning without further context. In contrast, describing each topic by its most characteristic messages would be more helpful. When selecting the methodology for extracting these messages, care must be taken to ensure sufficient transparency so that the results can later be used in court. The aim is, therefore, to use an Explainable Artificial Intelligence (XAI) method, the Generalised Learning Vector Quantisation (GLVQ) algorithm, to identify the messages that are characteristic of a topic.
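GLVQ represents each class by prototype vectors and judges a sample by the relative distance mu = (d_plus - d_minus) / (d_plus + d_minus), where d_plus is the distance to the nearest prototype of the sample's class and d_minus the distance to the nearest prototype of any other class. The sketch below computes this value for already-trained prototypes. Using the lowest mu to pick the "most characteristic" message is an assumption made here for illustration, and the prototypes and message vectors are hypothetical.

```python
def squared_distance(x, w):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, w))


def glvq_mu(x, label, prototypes):
    """Relative distance in [-1, 1]; negative values mean x lies closer
    to a prototype of its own class than to any other prototype."""
    d_plus = min(squared_distance(x, w) for w, c in prototypes if c == label)
    d_minus = min(squared_distance(x, w) for w, c in prototypes if c != label)
    return (d_plus - d_minus) / (d_plus + d_minus)


def most_characteristic(messages, topic, prototypes):
    """Return the text of the message with the lowest mu for the given topic."""
    return min(messages, key=lambda m: glvq_mu(m[1], topic, prototypes))[0]


# Hypothetical trained prototypes and message vectors.
prototypes = [([1.0, 0.0], "drugs"), ([0.0, 1.0], "smalltalk")]
messages = [("got the stuff", [0.9, 0.1]), ("see you later", [0.5, 0.5])]
best_message = most_characteristic(messages, "drugs", prototypes)
```

Because the decision rests on distances to a small set of prototypes, the reason a message was selected can be traced directly, which is what makes the method attractive where transparency matters.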

Requirements

Knowledge of Matlab or Python

Duration: 6 months

Assigned Places/Total Places: 0/1

apply now
