Signpost AI Chat Pilot: Early Results and Feedback
Hello AI enthusiasts and humanitarians! We are excited to share a first look at the progress of our AI chatbot/agent pilot and the early insights we've gathered. We launched the pilot in September in Greece and Italy, with subsequent launches in Kenya and El Salvador. Since then, protection officers and moderators have been diligently evaluating Signpost AI chat’s performance in delivering quality responses suited to the humanitarian contexts they are operating in.
Their evaluations assess not only the quality of the chatbot's responses but also the underlying AI tool itself. The primary focus of the evaluations is to ensure that the chatbot meets our Signpost standards for client responses. This assessment process will inform the chatbot's future refinement and adaptation for broader deployment in crisis response and humanitarian aid settings.
The pilot started with training for protection officers who would be conducting the testing and evaluations. This training included informative sessions and hands-on practice, where staff received guidance on the AI assessment rubric (both quality and red-team flags), key aspects to observe, and how to use the AI tool (which is integrated into our customer service platform).
Early Results
So what have we learned so far? In this blog post, we limit our findings to Greece and Italy for the last two weeks (23rd September - 4th October), using an AI agent utilizing Claude. During this time, testing in Greece was conducted in seven languages (English, French, Arabic, Ukrainian, Farsi/Dari, Urdu and Somali). A total of 100 responses were assessed and evaluated:
58% of all responses in all languages were given a pass
36% of all responses in all languages were given a fail
6% of all responses in all languages were red-flagged
In terms of language specific findings:
88.8% of all Farsi/Dari responses passed (16 out of 18), the highest pass rate of any language
Other pass rates: French scored 60% on 10 responses, English 52.2% on 23 responses, and Arabic 51.3% on 39 responses
In Italy, three languages (English, Italian and Ukrainian) were tested, with a total of 61 responses:
63.9% of all responses in all languages were given a pass
32.8% of all responses in all languages were given a fail
3.3% of all responses in all languages were red-flagged
In terms of language specific findings:
81.6% of all English responses passed (31/38)
Only 28.6% of all Italian responses passed (6/15)
These are early results with a small sample size, but they already point to potential vectors for investigation. For example, why is there such a vast difference in English-language pass rates between Italy and Greece? If these results hold with a larger sample size, this would certainly warrant investigation: are the kinds of questions asked in the two countries really that different, or are the evaluators applying different interpretations of the assessment criteria?
The high pass rate of Farsi/Dari responses in Greece is an encouraging yet unexpected early outcome. We will continue to monitor these results and will be conducting more in-depth tests soon.
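For readers who want to check the arithmetic behind these percentages, here is a minimal sketch in Python (the helper function and its names are ours, not part of the Signpost tooling; the counts are the ones reported above):

```python
from collections import Counter

def summarize(outcomes):
    """Tally pass/fail/red-flag percentages from a list of evaluation outcomes."""
    counts = Counter(outcomes)
    total = len(outcomes)
    return {outcome: round(100 * n / total, 1) for outcome, n in counts.items()}

# Greece assessed 100 responses across all languages (58 pass, 36 fail, 6 red-flagged).
greece = ["pass"] * 58 + ["fail"] * 36 + ["red-flag"] * 6
print(summarize(greece))        # {'pass': 58.0, 'fail': 36.0, 'red-flag': 6.0}

# Per-language rates are computed the same way, e.g. English in Italy (31 of 38 passed).
print(round(100 * 31 / 38, 1))  # 81.6
```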
Early Feedback
Those were the results; what about feedback from the protection officers on their experiences? Early feedback, provided during regularly scheduled debrief meetings, can be roughly divided into two categories:
1. Feedback on Quality and Red Team evaluations
AI chatbot responses are largely considered good in terms of “correctness” and accuracy in both Italy and Greece.
Greece staff reported that using AI saved time writing responses while giving comprehensive details in a structured and relevant format
Greece staff also reported that the AI chatbot may provide an excessive number of references (which take time to remove) and may respond in English even when users are interacting in a different language
The chatbot in Italy does not use the correct pronouns when responding in Italian, addressing the client in the formal third person, which protection officers do not consider appropriate
Protection Officers in Greece reported that it was time-consuming to review responses (including links and contact information for services)
Some responses were too long and overly-friendly
Pilot leads have reported that the "Red team" metrics can be somewhat confusing for some staff, leading to potential misflagging of certain responses
2. AI Tool and Platform use
Some protection officers reported that the AI chatbot sometimes did not generate answers
Protection officers reported that the “Generate Tags” feature intermittently does not work (this feature populates tags for each response)
The evaluation tool in the CMS has not been user-friendly for pilot leads and reviewers
In addition to this feedback, we are also in the process of conducting surveys to measure trust, background knowledge, confidence and productivity.
Overall, there are promising early signs of progress in delivering accurate and appropriate responses. However, we are continually refining the AI chatbot based on this feedback. Initial insights suggest several areas to focus on as we iterate on solutions (an illustrative sketch of how some of these might translate into prompt instructions follows the list):
- Monitoring unnecessary references in responses and adding a reference limit in agents
- Ensuring that references are in the same language as the user's interaction
- Addressing issues such as incorrect pronoun usage through prompt tuning
- Providing clear guidelines and training to ensure all team members interpret and apply the assessment framework uniformly
- Adjusting overly lengthy responses while maintaining a well-balanced, appropriately friendly tone
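To make the last few items concrete, here is an illustrative sketch, in Python, of how adjustments like these are often expressed as additional system-prompt instructions to an agent. Everything below (the three-reference limit, the constant names, and the wording) is hypothetical and not our production prompt or configuration:

```python
# Hypothetical prompt additions illustrating the refinement areas above.
# These are illustrative only and not the actual Signpost AI prompt or settings.
MAX_REFERENCES = 3  # illustrative reference limit, not a decided value

STYLE_INSTRUCTIONS = f"""
- Include at most {MAX_REFERENCES} references, choosing only the most relevant sources.
- Always reply in the same language the user wrote in, including any references.
- In Italian, address the client directly rather than in the formal third person.
- Keep answers concise and warm; avoid excessive length or overly friendly filler.
"""

def build_system_prompt(base_prompt: str) -> str:
    """Append the style instructions to an existing agent prompt."""
    return base_prompt.rstrip() + "\n\n" + STYLE_INSTRUCTIONS
```

Whether adjustments like these end up in the prompt itself, in post-processing of references, or both is exactly the kind of question the ongoing iteration and evaluation are meant to answer.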
That is all for now and we will keep you updated on how the pilot is going!