Tenders
For Students
Table of Contents
- Bachelor Theses incl. Internship
- Interactive Topic Modeling
- Topic Modeling with syntactic filters
- Topic modelling with Large Language Models
- Implementation of the Guided Topic Noise Model
- Topic modelling for Social Media
- Literature review Topic-based Search
- Scenarios for personality detection
- Membership Inference Attacks on Large Language Models
- Reverse Prompt Engineering in Large Language Models
- Exploration of the forensic value of content credentials
- Relationship between socio-demographic characteristics and perceptions of misogyny
- Data analysis and Data Visualization with Large Language Models
- Automatic generation of test communication data with Puma
- Master Theses
In order to apply for a thesis topic, please use our contact form.
Bachelor Theses incl. Internship
Interactive Topic Modeling
The analysis of forensic communication data is usually guided by one or more hypotheses, e.g. about the existence of a specific criminal offence, such as a drug offence. Based on these hypotheses, the investigators assume that specific case-relevant topics (such as drug-related crime) were discussed in the chats. Semi-supervised topic modeling can be used to find evidence that these case-relevant topics were discussed. However, the investigator must still interpret whether the extracted topics correspond to the expected case-relevant topics. Furthermore, the algorithms require some words as input that accurately describe the case-relevant topics, which are often difficult to find. Therefore, the investigator’s feedback should be included to find better words that characterise the topics iteratively. Another goal of the work is to determine a score that provides an indication of whether the expected topics have been found.
Requirements
- Fun with R and Python programming
Duration: 6 months
Assigned Places/Total Places: 0/2
Topic Modeling with syntactic filters
When analysing large amounts of messages from messenger services (e.g., WhatsApp, Signal), investigators are often interested in whether case-relevant topics were discussed in addition to the dominant everyday topics. Case-relevant topics can be extracted by incorporating the investigators’ prior knowledge into semi-supervised algorithms for topic modeling. However, interpreting the extracted topics remains a challenge. Usually, these are presented as a ranking of words. In particular, in the case of colloquial instant messages, the words that describe a topic often include many high-frequency, irrelevant words. This problem can be counteracted by only considering content-bearing words such as nouns and, if necessary, verbs and adjectives when modeling topics. It should, therefore, be examined to what extent the restriction of the vocabulary to certain parts of speech through POS tagging increases the interpretability of the case-relevant topics.
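The vocabulary restriction described above can be illustrated with a minimal sketch. It assumes the messages have already been POS-tagged by some external tagger and arrive as (token, tag) pairs; the tag names, the sample messages and the function `filter_by_pos` are hypothetical and only serve the illustration.

```python
from collections import Counter

# Universal POS tags treated as content-bearing (an assumption for this sketch).
CONTENT_TAGS = {"NOUN", "VERB", "ADJ"}


def filter_by_pos(tagged_message, keep_tags=CONTENT_TAGS):
    """Keep only the lower-cased tokens whose POS tag is in keep_tags."""
    return [token.lower() for token, tag in tagged_message if tag in keep_tags]


# Hypothetical, already-tagged chat messages as (token, tag) pairs.
tagged_messages = [
    [("lol", "INTJ"), ("the", "DET"), ("package", "NOUN"),
     ("arrives", "VERB"), ("tomorrow", "ADV")],
    [("ok", "INTJ"), ("bring", "VERB"), ("cash", "NOUN"), ("pls", "INTJ")],
]

# The filtered messages would then be passed to the topic model.
filtered = [filter_by_pos(message) for message in tagged_messages]

# Vocabulary that remains after the restriction to content-bearing words.
vocabulary = Counter(token for message in filtered for token in message)
```

Whether verbs and adjectives should be kept alongside nouns is exactly the kind of choice the thesis would have to evaluate; here it is just a parameter.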
Requirements
- Fun with R and Python programming
Duration: 6 months
Assigned Places/Total Places: 1/1
- Veronika Fuchs
Topic modelling with Large Language Models
Topic modelling has proven to be very useful in forensics for obtaining a quick overview of large amounts of data to be analysed, especially in the case of notable communication data (e.g., instant messages). However, traditional topic modelling can struggle, especially with short, colloquial, noisy texts. An alternative is to apply so-called large language models (LLMs), such as ChatGPT, for topic modelling. This thesis will first investigate, by means of a literature review, the extent to which large language models have already been used for topic analysis. Subsequently, original experiments will be conducted, focusing on the extent to which LLMs can identify presumed topics in texts.
Requirements
- Fun with extensive and systematic literature research
- Basic knowledge of R or Python programming
Duration: 6 months
Assigned Places/Total Places: 0/1
Implementation of the Guided Topic Noise Model
Topic modeling can help with the analysis of large volumes of forensic communication data - from instant messaging, for example - by automatically identifying key content. However, conventional methods have their limitations: they tend to emphasize irrelevant small-talk topics and are prone to hard-to-interpret output, such as slang. The Guided Topic Noise Model (GTM), an algorithm developed in Java, addresses these weaknesses by incorporating thematic prior knowledge and being robust to colloquial expressions. The aim of the work is the implementation of GTM in the forensic software Mobile Network Analyzer (MoNA). In particular, an interactive feedback loop is to be realized in which users can iteratively adapt the integrated prior knowledge.
Requirements
- Fun with Java programming
Duration: 6 months
Assigned Places/Total Places: 1/1
- Tony Katzwinkel, Paul Klotzsche, Alexa Müller, Malte Reinhardt, Tim Reißner, Mandana Wille
Topic modelling for Social Media
Monitoring social networks makes an essential contribution to crime prevention and the resource planning of security forces. It can be used to identify announcements of events with escalation potential, such as demonstrations, the dissemination of potentially harmful (e.g. xenophobic) ideas and the intensification of discussions. To get a quick overview of what is currently being discussed in social media, there is potential in topic modelling. A challenge for topic modelling in social media posts is the continuous development and dynamic change of topics. This work aims first to show the current state of research on topic modelling in social media posts, focusing on methods such as lifelong topic modelling. Subsequently, a selected method will be examined experimentally.
Requirements
- Fun with extensive and systematic literature research
- Fun with programming
Duration: 6 months
Assigned Places/Total Places: 0/1
Literature review Topic-based Search
Whether searching for interesting articles as part of a research project, searching for books in digital library systems or searching for relevant information in forensic data volumes, such as forensic communication data, users often want to find texts on specific topics. Accordingly, there is a need for text retrieval systems that accept, instead of simple search words, a topic that the retrieved texts should deal with. This work aims to show the current state of research on topic-based search through a systematic literature review, thus creating a basis for developing a topic-based text retrieval system for forensic applications.
Requirements
- Fun with extensive and systematic literature research
Duration: 3-6 months
Assigned Places/Total Places: 0/1
Scenarios for personality detection
In real-world investigations, knowledge of psychological personality traits (e.g. OCEAN, Cattell Sixteen Personality Factor) often helps to identify the offender(s). These personality traits are also expressed in written text. A suitable data basis is required to develop approaches for automatically recognising personality traits from texts. However, psychological personality traits cannot be derived from every conversation. Therefore, artificial conversations based on specific scenarios are to be developed. The thesis aims to design scenarios for conversations that allow personality recognition. The main tasks required for this include a comprehensive literature review, the development of the scenarios, and the validation of these scenarios with the help of experts, e.g., those from the field of psychology.
Requirements
- Fun with extensive and systematic literature research
- Basic knowledge of statistics
Duration: 3-6 months
Assigned Places/Total Places: 1/1
- Lena Velthaus
Membership Inference Attacks on Large Language Models
Large Language Models (LLMs) are currently very popular in both our personal and professional lives across numerous fields, including healthcare, customer support, and software development. For many well-known commercial LLMs, such as the GPT models, Gemini, or Claude, the exact composition of the training data is unknown. The question of whether certain documents were part of the training data can be of interest from both a copyright perspective (e.g., books or song lyrics) and a data protection perspective (e.g., sensitive patient information). So-called membership inference attacks attempt to determine, using a trained model, whether a specific text was included in its training corpus. This thesis examines whether and to what extent it is possible to determine the presence of specific documents in LLM training data, even when that data is not publicly available. To this end, a systematic literature review will be conducted first. Subsequently, existing approaches will be experimentally investigated, and new approaches will be proposed and evaluated.
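To make the idea of a membership inference attack more concrete, the following minimal sketch shows one simple, commonly discussed baseline: a text is guessed to be a training member if the model's average loss on it is unusually low. The per-token log-probabilities, the threshold value and the function names are hypothetical; in practice the log-probabilities would have to come from the model under test, and the threshold would need to be calibrated. It is not meant as the method to be used in the thesis.

```python
def mean_negative_log_likelihood(token_log_probs):
    """Average loss of the model on a text, given its per-token log-probabilities."""
    return -sum(token_log_probs) / len(token_log_probs)


def predict_membership(token_log_probs, threshold):
    """Guess 'member' when the average loss on the text is below the threshold."""
    return mean_negative_log_likelihood(token_log_probs) < threshold


# Hypothetical per-token log-probabilities for two texts (illustrative values only).
seen_text = [-0.2, -0.1, -0.3, -0.2]
unseen_text = [-2.1, -1.7, -2.5, -1.9]

THRESHOLD = 1.0  # illustrative; would have to be calibrated on reference data

seen_is_member = predict_membership(seen_text, THRESHOLD)
unseen_is_member = predict_membership(unseen_text, THRESHOLD)
```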
Requirements
- Fun with programming (R or Python)
- A very basic understanding of LLMs
- Willingness to independently familiarize yourself with a new field of research
Duration: 6 months
Assigned Places/Total Places: 0/2
Reverse Prompt Engineering in Large Language Models
Large Language Models (LLMs), such as the GPT models, are highly versatile, with applications ranging from text generation and translation to code generation. The key to quality lies in what is known as prompt engineering: the art of crafting a good prompt. Companies like TikTok are specifically investing in specialists in this field. However, the growing value of such prompts comes with a new security risk: reverse prompt engineering (also known as prompt stealing), that is, reconstructing a prompt based on a model’s outputs. Understanding attack methods is the first step toward developing effective countermeasures. The thesis begins by addressing the topic through a systematic literature review, which summarizes the current state of research on reverse prompt engineering, including known attack methods and countermeasures. Finally, original experiments will be conducted to empirically investigate selected techniques.
Requirements
- Interest in artificial intelligence and large language models (LLMs)
- Enjoyment of programming (R or Python)
- Willingness to independently familiarize yourself with a new field of research
Duration: 6 months
Assigned Places/Total Places: 0/2
Exploration of the forensic value of content credentials
Images and videos spread like wildfire on social media and can cause harm just as quickly. Modern AI tools make it easier than ever to manipulate or completely fabricate visual content. The consequences range from unintentionally misleading content to targeted disinformation campaigns. A promising countermeasure is what’s known as content provenance: media files carry a cryptographically secured history of their creation directly within the file, similar to a digital resume. It records who created the file, which device was used, and which editing steps have been taken since then. Any subsequent manipulation breaks the signature, making it visible. This is precisely what the open standard C2PA (Coalition for Content Provenance and Authenticity) promises, which was jointly developed by companies such as Adobe and Microsoft and is being integrated into cameras, software, and platforms. This bachelor’s thesis examines how well C2PA works in practice, specifically whether it can serve as a forensic tool for detecting image manipulation and where its limitations lie. To this end, a systematic literature review will first be conducted to determine the extent to which C2PA manifests have already been used in forensic contexts. Original experiments will then investigate the extent to which real image manipulations can be reliably detected.
Requirements
- Basic knowledge of and interest in forensics and cryptography
- Basic programming skills (e.g., Python)
- Willingness to independently familiarize yourself with a new field of research
Duration: 6 months
Assigned Places/Total Places: 0/1
Relationship between socio-demographic characteristics and perceptions of misogyny
The automated detection of misogyny – i.e. a derogatory attitude towards women – in social media posts is a key challenge. The development of suitable classification systems is based on annotated data sets that reflect human judgements. However, such annotations are often influenced by subjective assessments. In particular, the question arises as to what extent socio-demographic characteristics influence what is perceived as misogynistic and labelled as such. This question is the focus of the bachelor’s thesis. Specifically, based on the data provided in the EXIST 2025 competition, the thesis will investigate the extent to which annotations differ according to age, gender, ethnicity, educational attainment and origin. To this end, extensive exploratory data analyses will be carried out.
Requirements
- Fun with R programming (or Python)
- Basic knowledge of statistics
Duration: 6 months
Assigned Places/Total Places: 0/2
Data analysis and Data Visualization with Large Language Models
Gemini, a language model and chatbot developed by Google, can not only generate natural-sounding texts, but also support programming. It can be particularly useful in the field of data analysis. Gemini is even able to generate complete Jupyter notebooks, including code for reading, pre-processing, analyzing and visualizing data. In this bachelor thesis, the data analysis performed by Gemini will be compared, on forensic data, with a manual analysis carried out in R or Python.
Requirements
- Basic knowledge of R or Python
Duration: 3-6 months
Assigned Places/Total Places: 0/1
Automatic generation of test communication data with Puma
Since real forensic communication data contains sensitive information, generating test data plays a crucial role in researching mobile devices. However, creating this test data manually involves a considerable amount of effort. The Python-based open-source tool Puma, which automates the creation of realistic data sets, offers potential in this area. The aim of the internship, followed by a bachelor’s thesis, is first to utilise Puma to generate communication test data on mobile devices. Subsequently, the quality of this generated communication test data is to be evaluated, including a comparison with alternative methods of synthetic data generation, such as the use of large language models (LLMs).
Requirements
- Knowledge of and enjoyment in programming with Python
Duration: 6 months
Assigned Places/Total Places: 0/1
Master Theses
Topic Modeling using seeded BTM
When analysing messages from messenger services (e.g. WhatsApp, Signal, etc.), investigators are often interested in whether specific case-relevant topics were discussed. To find evidence or proof of suspected topics, semi-supervised algorithms for topic modelling can be promising. These algorithms take a few characteristic words for the desired topics and are encouraged to extract these topics. However, the short length of instant messages is challenging for most algorithms. One exception is the ‘Seeded-BTM’ topic model explicitly developed for brief texts. This algorithm will be implemented, applied to forensic communication data, and evaluated.
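Assuming that BTM here refers to the biterm topic model, the reason it copes with brief texts is that it models unordered word pairs (biterms) co-occurring in a message rather than per-document word counts. The following minimal sketch shows only this biterm representation; the seeding with characteristic words and the actual inference are not covered. The sample messages and the function `extract_biterms` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def extract_biterms(tokens):
    """Return all unordered word pairs that co-occur in one short message."""
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]


# Hypothetical, already tokenised instant messages.
messages = [
    ["meet", "station", "tonight"],
    ["bring", "cash", "tonight"],
]

# Corpus-wide biterm counts, the data a biterm-based model is fitted on.
biterm_counts = Counter(
    biterm for message in messages for biterm in extract_biterms(message)
)
```

Because the biterms are pooled over the whole corpus, even very short messages contribute usable co-occurrence information.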
Requirements
- Fun with programming, e.g. in R or Python
- Basic mathematical understanding of algorithms for topic modelling
Duration: 6 months
Assigned Places/Total Places: 0/1
Semi-supervised Neural Topic Modeling
Topic modeling is essential for analysing large amounts of messages from messenger services (e.g. WhatsApp, Signal). This method should support investigators in achieving the two fundamental goals of forensic data analysis. On the one hand, investigators want to find evidence or proof of suspected, case-relevant topics (e.g. topics related to a known drug deal) to support or refute a specific forensic hypothesis. On the other hand, investigators are also interested in discovering new topics they would not have expected to find in the messages (e.g., about the crime’s motivation). For the second goal, unsupervised topic models based on neural networks have been increasingly used for several years, while for the first goal, their semi-supervised adaptations can be promising. However, so far, no neural approach to topic modeling supports both goals. The aim is, therefore, to extend an existing neural algorithm for semi-supervised topic modeling (e.g. ‘KeyETM’, ‘Seeded NTM’ or ‘vONTSS’) in such a way that new, unexpected topics are also found in addition to suspected topics. Subsequently, the adapted algorithm should be evaluated using forensic communication data.
Requirements
- Good understanding of code (Python)
Duration: 6 months
Assigned Places/Total Places: 1/2
- B.Sc. Eric Kropf
Semi-supervised Embedding-based Topic Modeling
When analysing messages from messenger services (e.g. WhatsApp, Signal), investigators are often interested in whether specific topics relevant to the case were discussed in these messages. There is potential in semi-supervised topic modelling approaches to find indications of these suspected topics. Most semi-supervised algorithms are extensions of probabilistic latent Dirichlet allocation. Promising alternatives are approaches that consider the semantic similarity of words, which can be captured using word embeddings (e.g. word2vec, fastText). Therefore, the aim is to compare different semi-supervised approaches based on word embeddings (e.g. ‘CatE’, ‘SeedTopicMine’ and ‘JoSH’) to identify case-relevant topics in forensic chat messages. In a second step, existing implementations can be expanded to discover new topics in addition to the expected topics, such as the motivation for the offence or previously unexpected connections to particular persons.
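How word embeddings can capture semantic similarity to a few seed words may be sketched as follows. The three-dimensional vectors, the word list and the function `expand_seeds` are hypothetical; real embeddings would come from a trained model such as word2vec or fastText, and the approaches named above are considerably more elaborate than this ranking by cosine similarity.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equally long vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical toy embeddings (illustrative values only).
embeddings = {
    "drugs": [0.9, 0.1, 0.0],
    "weed": [0.8, 0.2, 0.1],
    "dealer": [0.7, 0.3, 0.0],
    "dinner": [0.0, 0.9, 0.4],
    "football": [0.1, 0.2, 0.9],
}


def expand_seeds(seed_words, embeddings, top_n=2):
    """Rank non-seed words by their mean cosine similarity to the seed words."""
    scores = {}
    for word, vector in embeddings.items():
        if word in seed_words:
            continue
        similarities = [cosine(vector, embeddings[seed]) for seed in seed_words]
        scores[word] = sum(similarities) / len(similarities)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


related = expand_seeds(["drugs"], embeddings)
```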
Requirements
- Fun with programming, e.g. with R or Python
- A rough understanding of C code
Duration: 6 months
Assigned Places/Total Places: 0/2
Topic-characteristic messages with LVQ
The immense volume of messages from messenger services that need to be analysed as part of forensic investigations presents investigators with an increasing challenge. The automatic extraction of topics can be promising for gaining an overview of the content discussed in the chats. However, the interpretation of these topics often proves to be problematic. Most approaches to topic modeling present them as a ranking of words or phrases that have little meaning without further context. In contrast, describing each topic by its most characteristic messages would be more helpful. Care must be taken to ensure sufficient transparency concerning subsequent usability in court when selecting the methodology for extracting these messages. The aim is, therefore, to use an Explainable Artificial Intelligence (XAI) method, the Generalised Learning Vector Quantisation (GLVQ) algorithm, to identify the messages that are characteristic of a topic.
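One way GLVQ could be used to pick characteristic messages is via its relative distance measure, which compares the distance of a data point to the closest prototype of its own class with the distance to the closest prototype of any other class. The sketch below assumes that messages are already represented as vectors and that one prototype per topic has already been trained; the prototype training itself is omitted, and all values and names (`relative_distance`, `most_characteristic`) are hypothetical.

```python
def squared_distance(x, w):
    """Squared Euclidean distance between a message vector and a prototype."""
    return sum((a - b) ** 2 for a, b in zip(x, w))


def relative_distance(x, topic, prototypes):
    """GLVQ measure mu(x) = (d_plus - d_minus) / (d_plus + d_minus), in [-1, 1].

    d_plus is the distance to the prototype of the given topic, d_minus the
    distance to the closest prototype of any other topic. Values close to -1
    mean that x lies near the topic's prototype and far from all others.
    """
    d_plus = squared_distance(x, prototypes[topic])
    d_minus = min(
        squared_distance(x, w) for t, w in prototypes.items() if t != topic
    )
    total = d_plus + d_minus
    return (d_plus - d_minus) / total if total else 0.0


# Hypothetical, already-trained prototypes (one per topic) and message vectors.
prototypes = {"drugs": [1.0, 0.0], "smalltalk": [0.0, 1.0]}
messages = {"m1": [0.9, 0.1], "m2": [0.6, 0.4], "m3": [0.1, 0.9]}


def most_characteristic(topic, messages, prototypes, top_n=2):
    """Return the ids of the messages with the lowest relative distance to the topic."""
    ranked = sorted(
        messages, key=lambda m: relative_distance(messages[m], topic, prototypes)
    )
    return ranked[:top_n]


top_drugs = most_characteristic("drugs", messages, prototypes)
```

Since the ranking rests on explicit distances to interpretable prototypes, each selection can be traced back to concrete numbers, which fits the transparency requirement mentioned above.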
Requirements
- Knowledge of Matlab or Python
Duration: 6 months
Assigned Places/Total Places: 0/1