Posted on 30/01/2023 by Marieke van Steijn
in Categories: Research. Tags: phd-onderzoek.

CDH Interview: How did we perceive animals and plants in the Netherlands over the past centuries? Language model GysBERT assists with mapping it out

How did a sixteenth-century inhabitant of the Netherlands experience the animals and plants around him? And how did that change in the centuries that followed? When did the rabbit change from food into a pet, and the horse from a means of transportation into an animal kept for leisure? Literary texts, scientific texts and folk tales, in particular, offer a unique glimpse into the bonds between humans and nature in the past. PhD candidate Arjan van Dalfsen aims to use the GysBERT language model to search large amounts of digitized text for Dutch attitudes towards animals and plants from 1550 to 2000.

Arjan van Dalfsen obtained bachelor’s degrees in Dutch Language and Culture and in Chemistry. During his research master’s in Dutch Literature and Culture, he had already started using digital methods to study early modern texts. After graduating, he started his large-scale PhD research in the AI Labs, where he is surrounded by an impressive group of experts: cultural historians, computer scientists, ecologists and statisticians. His ambitious research is made possible by the latest transformer-based language models, which are increasingly able to learn to recognize concepts in texts. The Centre for Digital Humanities spoke with Arjan about his research and the methods he will apply.

You’re going to sift through huge amounts of text. How do you create order out of chaos?

‘There are two main lines of research that I will be looking into. The first is what knowledge the Dutch had of animal and plant species from 1550 to 2000. Which species did they know, and what kind of classification system did they use? In the second line of research I look at representation. For example, how did people perceive the wolf through the ages? I’m guessing that people in the sixteenth century saw the wolf mainly as a danger, as the vicious beast that appears in folk tales. Today, the wolf also has a positive connotation and is regarded as a sign of biodiversity. When did this shift towards a more positive view first start to emerge?’

In which area do you expect the most interesting results?

‘I am most curious about the development of biodiversity over the centuries. In a previously published paper, researchers examined historical biodiversity, that is, how and how often animals and plants were referred to in texts. They saw that historical biodiversity increased during the eighteenth century, but started to decline again after the Industrial Revolution. People moved to cities more and more, and came into contact with nature less often. Interestingly, a replication study showed the opposite result. I am very curious what conclusions I will reach. I think it would be interesting to compare historical, perceived biodiversity with the biological biodiversity of past centuries. Perhaps the outcome can say something about how you can get people involved with nature.’
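
To make "how and how often" concrete, here is a minimal sketch of one way such counts could be produced. It is not the method of the paper mentioned above, which the interview does not describe: the species lexicon, the example sentences and the function name `species_mentions` are all invented for illustration, and simple word matching would miss spelling variants and ambiguous names.

```python
import re
from collections import Counter

# Hypothetical mini-lexicon of species names (assumption: a real study
# would need a far larger list, including historical spelling variants).
SPECIES = {"wolf", "konijn", "paard", "gans"}

# Invented toy sentences keyed by century, purely for illustration.
TEXTS = {
    16: ["de wolf verscheurt het paard", "de wolf huilt"],
    18: ["het konijn en de gans en het paard", "de wolf vlucht"],
}

def species_mentions(sentences):
    """Count how often each lexicon species is mentioned in the sentences."""
    counts = Counter()
    for sentence in sentences:
        for token in re.findall(r"\w+", sentence.lower()):
            if token in SPECIES:
                counts[token] += 1
    return counts

# For each century: (number of distinct species named, total mentions).
summary = {}
for century, sentences in TEXTS.items():
    counts = species_mentions(sentences)
    summary[century] = (len(counts), sum(counts.values()))

for century, (distinct, total) in sorted(summary.items()):
    print(f"century {century}: {distinct} species, {total} mentions")
```

Comparing the two numbers per period separates which species appear at all from how much attention they receive in the texts.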

Which digital methods will you use?

‘For this project, an AI tool has to identify and locate mentions of plants and animals in the texts. I’m going to use BERT for that, a state-of-the-art language model. A variant of BERT built specifically for historical Dutch has just been developed: GysBERT, named after Vondel’s Gysbrecht.’

How does BERT work?

‘BERT is a language model based on transformers, a particular type of neural network. Transformers use self-attention mechanisms that let them capture the context of words within a sentence much better than was previously possible. The BERT model is trained by feeding it large amounts of text. During training, the model learns by setting itself two types of tasks. In the first, it masks small pieces of text and then predicts which word belongs there. In the second, the model is shown two sentences and has to indicate whether or not the second follows the first. In this way, BERT learns little by little what matters for the formation of meaning in a particular language. With that knowledge, more complex tasks can then be tackled.’
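
The two training tasks described above can be sketched in a few lines of Python. This only shows how training examples might be constructed from raw text, not the neural network that learns from them. The function names and sentences are invented for illustration; the 15% masking rate follows the original BERT paper, whose actual procedure has further details (such as sometimes substituting a random word instead of the mask token) that are left out here.

```python
import random

MASK = "[MASK]"

def make_masked_example(tokens, rng, mask_prob=0.15):
    """Task 1: hide some tokens; return the masked text and the hidden words."""
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets[i] = tok  # the word the model must predict at position i
        else:
            masked.append(tok)
    if not targets:  # make sure at least one word is hidden
        i = rng.randrange(len(tokens))
        targets[i] = tokens[i]
        masked[i] = MASK
    return masked, targets

def make_sentence_pair(sentences, rng):
    """Task 2: return (first, second, is_next) from at least three sentences."""
    i = rng.randrange(len(sentences) - 1)
    if rng.random() < 0.5:
        return sentences[i], sentences[i + 1], True  # genuine successor
    # Otherwise pick a sentence that is neither the first nor its successor.
    others = [s for j, s in enumerate(sentences) if j not in (i, i + 1)]
    return sentences[i], rng.choice(others), False

rng = random.Random(0)  # fixed seed so the sketch is repeatable
tokens = "de wolf loopt door het donkere bos".split()
masked, targets = make_masked_example(tokens, rng)

sentences = [
    "de wolf loopt door het bos",
    "hij ziet een konijn",
    "het konijn vlucht weg",
]
first, second, is_next = make_sentence_pair(sentences, rng)

print(masked, targets)
print(first, "|", second, "|", is_next)
```

In both tasks the correct answer comes from the text itself, which is why no manual labelling is needed at this stage.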

*Collected works of Vondel (1910 edition)*

What are the biggest technical challenges in this research?

‘One of the biggest technical challenges in text mining is the automatic recognition and classification of words. It was not until the eighteenth century that Linnaeus’s division of the plant and animal kingdoms, as we still know it today, came about. It will be interesting to examine how people classified plants and animals in the centuries before that. I plan to train the model by feeding it example sentences that mention animals and plants, but that process comes with all sorts of snags. For example, the names of species change over the centuries, or different names are used for the same species. What I want is for the AI tool to recognize equivalents automatically; think of the Dutch words kat and poes (both meaning cat), for example. If the same words occur in the vicinity of kat and poes, in theory it could be deduced that they refer to the same animal. But things like this are challenging. There are also animal names that occur with other meanings. For example, the Dutch gans (goose) can either refer to the animal or be used in the sense of heel (whole). In addition to distant reading, I will also apply close reading. In this way, the conclusions drawn by the AI tool can be substantiated and illustrated with concrete examples from the texts.’
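
The idea that kat and poes could be linked through the words around them can be illustrated with a simple count-based sketch. This is a much older and cruder technique than the contextual representations a BERT model learns, and it is not the project's method; the toy corpus, the window size and the function names are assumptions made for illustration.

```python
from collections import Counter
from math import sqrt

# Invented toy corpus; real historical texts are far messier.
CORPUS = [
    "de kat vangt een muis",
    "de poes vangt een muis",
    "de kat slaapt op de mat",
    "de poes slaapt op de mat",
    "de gans zwemt in de vijver",
    "de gans eet gras",
]

def context_vector(word, sentences, window=2):
    """Count the words occurring within `window` positions of `word`."""
    vec = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i, tok in enumerate(tokens):
            if tok == word:
                lo = max(0, i - window)
                hi = min(len(tokens), i + window + 1)
                for j in range(lo, hi):
                    if j != i:
                        vec[tokens[j]] += 1
    return vec

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

kat = context_vector("kat", CORPUS)
poes = context_vector("poes", CORPUS)
gans = context_vector("gans", CORPUS)

kat_poes = cosine(kat, poes)
kat_gans = cosine(kat, gans)
print(f"kat~poes: {kat_poes:.2f}, kat~gans: {kat_gans:.2f}")
```

In this toy corpus kat and poes share all their neighbouring words, so they score higher than kat and gans. A single count vector per word cannot separate the two senses of gans, which is one reason context-sensitive models are attractive for this kind of task.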

‘Automatically classifying objects in images is a whole new ballgame. Strange things can happen in image analysis. There is an example of a scientist who researched musical instruments in historical images. One of the results was that the images contained a lot of guitars. When the scientist investigated this strange outcome, it turned out that his AI tool thought that the baby Jesus was a guitar, because he is held in much the same way as someone usually holds a guitar. Automatically recognizing animals and plants in historical images will therefore be a completely different challenge.’

Which text genre are you most curious about?

‘I am more interested in cultural stories than scientific texts. In the fifteenth century, for example, there were hondenslagers (dog beaters) who beat the street dogs out of the church. Dogs were a real nuisance back then, and walked in everywhere. Also interesting are the differences between children’s books from that time and now. That’s where I expect to find the most exciting transition, and one I hope to see reflected in the data.’

What are you most excited about?

‘I really like programming. Writing code often goes wrong, until it goes right. And that is a magical moment. Thanks to this PhD, I can keep working with programming and Dutch and biology. There is still so much to discover in this research area. It offers room for pioneering work, and that appeals to me enormously.’

UU blog / Nederlandse Taal en Cultuur
