“A good research question is crucial to ensuring you don’t lose your way when text mining”

Professor Henrik Müller is Professor of Economic Journalism at the Institute of Journalism at TU Dortmund University and spokesperson for the Dortmund Centre for Data-Based Media Analysis (DoCMA), which analyses large volumes of media content. He explains how text mining can be successfully carried out even by researchers from other disciplines.
At the DoCMA Research Centre, you use text mining to carry out big data analyses of newspaper articles and social media. What is the added value of text mining for research?
The added value lies in the fact that you are not limited to small samples, but can conduct comprehensive content surveys – for example, of all articles that have appeared in a newspaper over a period of 25 years. With traditional manual content analysis, this would be far too time-consuming. Thanks to text mining, it is possible to track the development of reporting over long periods for entire media outlets and even genres. Millions of articles are analysed simultaneously. This is not only of interest to journalism research, but also to economics, as these content analyses can be condensed into time series, which in turn can serve as inputs for econometric models. When I joined TU Dortmund twelve years ago, I would never have thought we’d get this far, or that text mining would be so fascinating.
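The step of condensing content analyses into time series that can feed econometric models might look, in a minimal hypothetical sketch, like this. The data, the topic name, and the column names are invented for illustration; this is not DoCMA's actual pipeline.

```python
# Hypothetical sketch: condensing per-article topic shares into a monthly
# time series that could serve as input for an econometric model.
# All data below are invented for illustration.
import pandas as pd

# Each row: one article, its publication date, and the share of a
# hypothetical "economic uncertainty" topic assigned by a topic model.
articles = pd.DataFrame({
    "date": pd.to_datetime([
        "2024-01-03", "2024-01-17", "2024-02-05", "2024-02-20", "2024-03-11",
    ]),
    "uncertainty_share": [0.10, 0.30, 0.25, 0.15, 0.40],
})

# Aggregate to a monthly index: mean topic share per calendar month.
monthly = (
    articles.set_index("date")["uncertainty_share"]
    .resample("MS")   # "MS" = month start frequency
    .mean()
)
print(monthly)
```

The resulting monthly series is the kind of object that can then be aligned with macroeconomic indicators in a standard regression setting.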
How could other departments benefit from text mining?
Text mining is already being used in many fields, as I have learnt, including biochemical engineering. It is frequently employed to maintain an overview amidst the flood of scientific publications. DoCMA has an interdisciplinary structure: we are communication scientists or, like me, economists, and we benefit greatly from collaborating with colleagues in statistics, in particular Prof. Carsten Jentsch and Prof. Jörg Rahnenführer, as well as their research staff. We are in constant dialogue. For some time now, knowledge in the field of text mining has been exploding. In a paper we recently published, we combine topic modelling methods that we have further developed with large language models (LLMs) in order to better reveal the narrative content of large text corpora. The computational infrastructure we have here at TU Dortmund University, with the high-performance computer Lido 3 and, most recently, Lido 4, is a great help in this regard.
What advice would you give to someone from another department who wants to get started with text mining?
Researchers can contact the Text and Data Mining Advisory Service at the University Library of TU Dortmund directly; there are knowledgeable staff members there who can answer all questions and provide comprehensive assistance with accessing the texts. The key is to formulate a good and concise research question. If you don’t know what you actually want to find out, you can easily lose your way with text mining. Identifying relevant research questions is a skill that we, as journalism researchers and teachers, bring to the table. Asking interesting questions is, in a sense, the very essence of journalism. When it comes to methods, however, we find it more difficult. For some of our mathematically minded colleagues, the situation is the reverse. So we complement each other perfectly.

Incidentally, I would always caution against conducting text mining whilst trying to avoid reading any texts at all, simply because one considers reading to be too subjective. In my view, text mining tools initially provide only clues. Without engaging with the content of representative texts – what is known as ‘close reading’ – you won’t get anywhere. Is this text cluster that the algorithm is spitting out coherent, or is it gibberish? Is my LLM hallucinating some content right now? If you don’t calibrate the models properly, nothing meaningful will come of it. Ultimately, we’re working with human language, and for that, the human brain remains the most suitable analytical tool.
How much programming knowledge is required?
You don’t necessarily need advanced programming skills for text mining. Our previous generation of PhD students wrote an R package, tosca (Tools for Statistical Content Analysis), which many students have already used to carry out analyses for their theses. There is now such a wide variety of applications available that the substantive work has become much easier.
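To illustrate the point that basic content analysis requires little code, here is a deliberately tiny hypothetical example in Python (the corpus is invented; the R package mentioned above offers far richer functionality):

```python
# Hypothetical illustration: counting the most frequent words in a
# small invented corpus takes only a few lines.
from collections import Counter

docs = [
    "text mining makes large media corpora searchable",
    "media analysis over large corpora needs good questions",
]

counts = Counter(word for doc in docs for word in doc.split())
print(counts.most_common(3))
```

Even a simple frequency count like this can serve as a first, rough clue about what a corpus is concerned with.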
About the person:
- Since October 2013, Professor of Economic Journalism at TU Dortmund
- Since 2015, Director of the Dortmund Centre for Data-based Media Analysis (DoCMA)
- Since 2019, Co-initiator of the UA Ruhr-wide Narrative Economics Alliance Ruhr (NEAR)
- Since 2025, Member of the Board of Trustees of the Dortmund Centre for Data Science and Simulation (DoDaS)
Further information:
- DoCMA – Information about DoCMA, including resources such as R packages for text mining
- Advice Centre for Text and Data Mining at the University Library
- DoDaS – Information about the research centre and membership
