In her office, Madelon Hulsebos shows the UN’s open data platform, the Humanitarian Data Exchange (HDX), on her screen. A counter at the top displays over 19,300 datasets uploaded from 254 crisis areas worldwide. Local governments share their data here, covering everything from conflicts and wars to floods and other natural disasters. The UN uses this information to determine which regions are in need of humanitarian assistance and financial support.
The datasets contain a wide variety of information, for instance about people, buildings, and locations. “Organisations upload these to the platform without realising that certain data could be harmful if they fall into the wrong hands,” Hulsebos explains. “Think, for example, of the coordinates of hospitals in conflict zones, which could then become targets. That means the information has to be filtered out afterwards.”
That is, to put it mildly, a monumental task. Not only do the content and type of data vary from one dataset to another, but so does the way the information is structured. One organisation might use a simple Excel sheet, while another works with completely different column headings, in different languages. Hulsebos: “This makes it very difficult to automate the detection of sensitive information.”
Manual checks
The UN previously relied on a commercial tool from Google, Google DLP. But this system failed to uncover much of the sensitive information and often mislabelled data as ‘sensitive’ when it was not. As a result, manual checks by so-called Data Quality Officers were still needed. “But this takes a great deal of time, as more and more data are being shared,” says Hulsebos. “Moreover, labelling data sensitivity in a consistent manner is hard, even for humans.”
The need for a more efficient solution is clear. In Hulsebos’ earlier research, artificial intelligence (AI) had shown promise in recognising patterns in highly diverse structured datasets. Based on this insight, she turned to AI to develop a solution, the first project within an ongoing collaboration between the UN and Hulsebos. AI Master’s student Liang Telkamp joined the project and wrote her thesis on it.
Context matters
Together, Hulsebos and Telkamp developed two mechanisms to analyse data for sensitivity. Their approach introduced a new concept: contextual sensitive data. “Sensitive data are more than just personal details. What matters is whether the information could cause harm if it falls into the wrong hands,” explains Hulsebos. “Sensitivity can also be time-bound: data that may not have been sensitive five years ago can be sensitive today. Or location-bound: the coordinates of a hospital in the Netherlands are less sensitive than those of a hospital in Gaza. Moreover, names are not always sensitive. A company name can be public, but the names of police officers should not be. The context of a dataset is therefore crucial.”
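To make this concrete, the toy sketch below shows how the same column could be judged differently depending on a dataset’s context. The field names and rules are hypothetical, chosen to mirror the examples Hulsebos gives; it is not the project’s actual code.

```python
# Toy illustration of "contextual sensitive data": hypothetical field names
# and rules, mirroring the article's examples; not the UN's implementation.
from dataclasses import dataclass

@dataclass
class DatasetContext:
    region: str            # e.g. "Gaza" vs "Netherlands"
    is_conflict_zone: bool
    year: int

def is_sensitive(column: str, ctx: DatasetContext) -> bool:
    if column in {"latitude", "longitude"}:
        # Location-bound: facility coordinates are sensitive in conflict zones.
        return ctx.is_conflict_zone
    if column == "officer_name":
        return True        # names of police officers should not be public
    if column == "company_name":
        return False       # company names can be public
    if column == "displacement_route":
        # Time-bound (invented example): harmless historically, sensitive now.
        return ctx.year >= 2020
    return False

print(is_sensitive("latitude", DatasetContext("Gaza", True, 2025)))          # True
print(is_sensitive("latitude", DatasetContext("Netherlands", False, 2025)))  # False
```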
For one of the mechanisms, the researchers leveraged UN policy documents that spell out rules for handling data: which kinds of data may be published and which may not. The tool retrieved the rules relevant to a given dataset. The researchers then applied various large language models (LLMs), such as GPT-4 and open-source models like Qwen, to interpret those rules and determine whether a dataset contained sensitive information.
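As described, this mechanism resembles a retrieve-then-classify pipeline. The sketch below shows one minimal way such a pipeline could look; the keyword-based retrieval, the prompt wording, and the model call are illustrative assumptions, not the researchers’ implementation.

```python
# Hypothetical sketch of the retrieve-then-classify idea described above.
# Assumes the `openai` Python SDK (v1+) and OPENAI_API_KEY in the environment;
# in practice the retrieval step might use embedding search over UN policy texts.
from openai import OpenAI

client = OpenAI()

def retrieve_relevant_rules(dataset_summary: str, policy_rules: list[str]) -> list[str]:
    # Placeholder retrieval: naive keyword overlap instead of real search.
    words = set(dataset_summary.lower().split())
    return [rule for rule in policy_rules if words & set(rule.lower().split())]

def classify_sensitivity(dataset_summary: str, rules: list[str]) -> str:
    prompt = (
        "You are reviewing a humanitarian dataset before publication.\n"
        "Applicable data-sharing rules:\n- " + "\n- ".join(rules) + "\n\n"
        f"Dataset description:\n{dataset_summary}\n\n"
        "Does the dataset contain sensitive information? "
        "Answer SENSITIVE or NOT_SENSITIVE and briefly explain why."
    )
    # The article mentions GPT-4 and open-source models such as Qwen;
    # the exact model used here is an assumption.
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

rules = [
    "Coordinates of medical facilities in conflict zones must not be published.",
    "Aggregated statistics without personal identifiers may be published.",
]
summary = "CSV with hospital names and GPS coordinates collected in a conflict zone."
print(classify_sensitivity(summary, retrieve_relevant_rules(summary, rules)))
```

A notable property of this setup is that the rules live in documents rather than in the model: updating a policy means editing text, not retraining anything.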
Far more effective
Hulsebos: “We observed that our mechanisms work much better than Google’s commercial tool. When it came to detecting sensitive personal data, Google DLP identified 63 per cent, whereas our LLM-powered mechanism reached 94 per cent.”
Because the LLMs were grounded in the UN’s guidelines for sensitive data, they also became more precise. The number of false positives was halved, significantly reducing the workload for the Data Quality Officers. “The LLMs are still not as specific as one would like,” says Hulsebos, “but they turned out to be quite consistent. They also provided clear explanations of why certain data would be sensitive.”
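As a rough guide to what such figures mean: the 63 and 94 per cent are detection rates (recall), and halving false positives raises precision, which is what reduces the reviewers’ workload. The snippet below recomputes these from invented counts chosen to match the quoted percentages; the false-positive counts in particular are purely illustrative.

```python
# Illustrative metric arithmetic; counts are invented to match the article.
def recall(tp: int, fn: int) -> float:
    # Share of truly sensitive items that a tool actually flags.
    return tp / (tp + fn)

def precision(tp: int, fp: int) -> float:
    # Share of flagged items that are truly sensitive.
    return tp / (tp + fp)

# Suppose 100 columns are truly sensitive:
print(f"Google DLP recall:    {recall(63, 37):.0%}")  # 63%
print(f"LLM mechanism recall: {recall(94, 6):.0%}")   # 94%

# Halving false positives (e.g. 40 -> 20) at the same recall:
print(f"Precision before: {precision(94, 40):.0%}")   # ~70%
print(f"Precision after:  {precision(94, 20):.0%}")   # ~82%
```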
The UN has now decided to integrate the developed mechanisms into the HDX platform. Next month, Hulsebos will present the results at a UN meeting in Barcelona.
“The strength of these mechanisms is that they are applicable well beyond the UN, for instance to cloud platforms that host vast amounts of data, from corporate financials to government records,” says Hulsebos. “Many public data portals supply material for the training of AI models. And that is exactly where sensitive information should never be found.”
Header photo: kibri_ho / Shutterstock.com