Cultural AI and its steady acceptance in the heritage sector

The heritage sector is learning to embrace artificial intelligence. This certainly is encouraging, according to CWI researcher Jacco van Ossenbruggen and Martijn Kleppe, head of research at the National Library of the Netherlands. Both researchers advocate the use of AI and point out recent advances in this field.

Publication date: 15-06-2020

In newspapers, magazines and trade journals we can regularly read about the opportunities and possibilities that artificial intelligence offers us. Or rather, about its dangers. When discussing AI, you almost automatically think of image recognition techniques that confirm prejudices. Or data-guzzling tech companies that know everything about our personal lives.

Of course, it is perfectly fine to maintain a cautious attitude towards novel techniques. But there’s no need to exaggerate and let the discussion about the possible dangers of AI paralyze the exploration of its potential for our sector. Let’s not sit back and wait, but above all experiment and learn about the possibilities.

Experiments
Fortunately, experiments are already taking place in the Netherlands. This leads to wonderful applications and valuable lessons. Various archives are currently working with Transkribus, a system capable of reading manuscripts and converting them into computer-readable text. Using Transkribus, the National Archives and the Noord-Hollands Archief have made manuscripts from the seventeenth- and eighteenth-century VOC archives searchable.

The National Library of the Netherlands is currently experimenting with the semi-automatic description of publications using natural language processing. The Netherlands Institute for Sound and Vision applies speech recognition to make television programmes even easier to search. And the Noord-Hollands Archief is planning to use image recognition to make the image collection of Fotopersbureau De Boer searchable: not only with the help of metadata but also through elements in the photos themselves.

Black box
These are all inspiring applications that use different forms of artificial intelligence. One of the more complex and elusive techniques within that field is the neural network. We don’t know exactly how neural networks work, yet we often see these ‘black boxes’ giving good results. This elusiveness in particular makes the deployment of neural networks complicated for many people, especially for those who hold data ethics in high esteem.

So, does this mean we shouldn’t use them at all? Not as far as we're concerned. But you do need to think carefully about when to use them and for what purpose. With this in mind, the National Library has developed AI principles that include elements such as ‘inclusive’, ‘impartial’ and ‘transparent’.

Bias
But perhaps even more important than the algorithms themselves are the data used to train them. The bias of the data is especially crucial. When we first experimented with image recognition at the National Library, we noticed that existing open source algorithms performed very well in recognizing the content of modern photographs.

However, less was recognized in historical photographs, and in a way that didn’t come as a surprise. ImageNet, the largest dataset that forms the basis for many image recognition algorithms, contains only contemporary photos and is thus biased towards that type of photo.

Entering data
This brings us to the unique role that the heritage sector can play in the AI domain: we can ‘feed’ algorithms with the beautiful (digital) datasets that we have been collecting, describing and digitising for centuries.

This leads to three major advantages. First of all, we introduce a new type of bias, making algorithms more diverse. For example, we introduce historical language use and terminology. This gives our datasets a second life, allowing us to play a valuable role in the debate about the limitations of AI algorithms.

A second advantage is that algorithms also become more and more applicable in our domain when they’re trained with our kind of data. And a third advantage is that training (open source) algorithms on publicly available (open) data makes it possible to break through the black box and enable research into bias by third parties.

Steps taken
Of course, making data available is easier said than done. Besides the technical expertise required (e.g. finding out what is actually the best way to share datasets), copyright and privacy can impose restrictions that prevent the use of these datasets in the first place.

Nevertheless, some promising steps have been taken recently. The National Archives and Noord-Hollands Archief have made their models from the Transkribus project available to other researchers. At the National Library we make so-called ground truth datasets available for OCR research, among other things.

Recently, Brill published a dataset with images to develop algorithms that can automatically assign Iconclass codes. And in the new project with Fotopersbureau De Boer, the Noord-Hollands Archief has also promised to make their data available afterwards.

Cultural AI
Although the number of such initiatives is limited, they are the steps we need to take as a sector if we want to use artificial intelligence in a responsible way. In our view, it should even lead to a new (sub)discipline: ‘cultural AI’. This field uses digital heritage data to further develop algorithms that are increasingly capable of understanding human cultural norms and values.

In other words: how can we use our knowledge from and about our (digital) heritage to make AI better? Not only does this provide a particularly good way to learn more about artificial intelligence, it is also the way to actively make an important contribution that is desperately needed.

====
This article was published earlier in the journal IP.

Martijn Kleppe is head of the Research Department at the National Library of the Netherlands.
Jacco van Ossenbruggen is leader of the Human-centered Data Analytics group at CWI and the User Centric Data Science group at the VU University in Amsterdam.