Using AI to Anonymize Sensitive Client Data

In the humanitarian sector, safeguarding sensitive data and ensuring privacy are paramount. Every day, Signpost partners respond to thousands of messages in more than 20 languages, providing accurate, accessible, and timely information to people in crisis. Each chat message may hold valuable insight into clients' needs, but it may also contain sensitive personal information. To convert these messages into actionable data for improving our services, it is crucial for Signpost to anonymize the data effectively and protect user privacy.

Techniques for Anonymization

Generative AI tools, which are built upon vast amounts of internet data, exhibit a significant language bias. Nearly half of the internet’s content is in English, and over 80% is in Latin-based scripts. Given that only a quarter of internet users are English speakers, this bias presents a challenge for Signpost, where Ukrainian, Arabic, and Persian are among the top five languages spoken.

Before the advent of powerful Generative AI tools, Signpost anonymized data with the free and open-source Natural Language Processing (NLP) Python library spaCy, via Microsoft's open-source Presidio framework. Presidio builds on spaCy's pre-trained models and uses named entity recognition (NER) to identify and remove sensitive data such as names, addresses, emails, social security numbers, and phone numbers. While effective for Latin-script languages like English, French, and Spanish, this approach struggled with languages like Arabic, Pashto, Swahili, and Russian. Other libraries and services, such as Stanza and Microsoft's Azure AI Language service, also failed to detect and remove non-Western names and phone numbers.
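To make this approach concrete, here is a minimal sketch of NER-based redaction using Presidio on top of spaCy. The entity list, sample message, and placeholder format are illustrative assumptions rather than Signpost's actual configuration, and the snippet assumes the presidio-analyzer and presidio-anonymizer packages and an English spaCy model are installed.

# Minimal sketch of NER-based PII redaction with Presidio (built on spaCy).
# The entities, sample text, and placeholders below are illustrative only.
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

analyzer = AnalyzerEngine()      # loads a pre-trained spaCy NER model under the hood
anonymizer = AnonymizerEngine()

text = "My name is John Smith, call me at +1-202-555-0175."

# Detect PII spans (names, phone numbers, etc.) with NER and pattern recognizers.
results = analyzer.analyze(
    text=text,
    entities=["PERSON", "PHONE_NUMBER", "EMAIL_ADDRESS", "LOCATION"],
    language="en",
)

# Replace each detected span with a generic placeholder such as <PERSON>.
redacted = anonymizer.anonymize(text=text, analyzer_results=results)
print(redacted.text)  # e.g. "My name is <PERSON>, call me at <PHONE_NUMBER>."

This pipeline works well for English and other Latin-script text, but the underlying pre-trained models have far weaker coverage of names and locations in languages such as Arabic, Pashto, and Russian, which is what pushed us to look elsewhere.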

Challenges and Impact

Faced with the inadequacy of pre-trained models, we put leading Generative AI tools to the test. Using a uniform, simple prompt, we evaluated Claude, ChatGPT, and Gemini on their ability to anonymize dummy data containing personally identifiable information (PII) such as names, phone numbers, locations, and ID numbers in English, Swahili, Arabic, and Russian. All of the models recognized and removed phone numbers and ID numbers with non-EU/US country codes, but ChatGPT 4 struggled with names in Swahili and Gemini with names in Arabic, and neither ChatGPT nor Gemini reliably detected or removed locations across the languages. Anthropic's Claude was the only model that accurately identified and removed this information across all of the languages tested.
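For illustration, the sketch below shows how such a prompt-based anonymization call might look using Anthropic's Python SDK. The prompt wording, model name, and placeholders are assumptions for demonstration only; this post does not publish Signpost's actual prompt, and the same pattern applies to the other models we tested.

# Illustrative sketch of prompt-based anonymization via the Anthropic Python SDK.
# The prompt text and model name are assumptions, not Signpost's actual setup;
# an ANTHROPIC_API_KEY environment variable is expected to be set.
import anthropic

client = anthropic.Anthropic()

PROMPT = (
    "Remove all personally identifiable information (names, phone numbers, "
    "locations, ID numbers) from the message below. Replace each item with a "
    "placeholder such as [NAME] or [PHONE], keep everything else unchanged, "
    "and answer in the same language as the message.\n\n{message}"
)

def anonymize(message: str) -> str:
    """Send one chat message to the model and return the redacted text."""
    response = client.messages.create(
        model="claude-3-5-sonnet-latest",  # illustrative model choice
        max_tokens=1024,
        messages=[{"role": "user", "content": PROMPT.format(message=message)}],
    )
    return response.content[0].text

Because the prompt was kept identical across models and languages, differences in the redacted output reflect the models themselves rather than prompt engineering.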

[Image: the diversity of languages and the complexity involved in identifying and anonymizing sensitive information across different scripts.]

Future Outlook

With over 2.5 million rows of historical data and thousands of new rows added weekly, our next challenge is to anonymize this data safely, cost-effectively, and efficiently, using tools that transcend Western and Latin-language biases. 

In doing so, we will not only protect our users but also strengthen the trustworthiness and reliability of the information we provide over the long term.
