SPAI Pilot Update: Challenges and Progress
Introduction
The Signpost AI (SPAI) chatbot 6-month pilot kicked off in September 2024 and is slated to finish in February 2025. The pilot has been designed to understand how the Signpost AI chatbot could improve humanitarian response for frontline workers and clients alike.
The pilot is currently running in four Signpost sites: Greece, Italy, Kenya, and El Salvador. Over its course, the chatbot will be rigorously tested, iterated upon, and optimized to enhance information access for those in need and lay the groundwork for future Signpost AI innovation.
In this pilot update, we look at the pilot's objectives, structure, workflows, results, and progress so far. We also look closely at the challenges moderators have been facing in their evaluations of the chatbot.
Objectives
Empower Community Liaisons: The pilot will give insights into how Generative AI tools can support the work of moderators: saving time, improving responses, and streamlining workflows related to two-way communications with clients.
Enhance Information Access and Resolution: Investigate whether Signpost AI chat can provide accurate, safe, quality, and timely information to clients. Simultaneously, investigate whether there is a significant reduction in the time it takes to address and resolve client inquiries.
Evaluate Feasibility: Gauge AI tool feasibility by evaluating the chatbot's performance, usability, and potential impact. Such evaluations will inform future decisions about whether and how AI agents and tools can add value to the Signpost offering.
Foster Learning and Collaboration: Provide a basis for learning and collaboration between the Signpost AI team and country stakeholders to better tailor the chatbot to specific geographical needs. This learning will also inform best practices, guidance, and training of various stakeholders in the future.
Timeline
Phase 1: Deployment and Rapid Iteration: Launch the Signpost AI chatbot in all pilot countries and begin gathering feedback for rapid iteration. In this phase, moderators in all countries begin to use the AI agent. The launches were staggered because each country needed time to coordinate its logistics and staffing. This phase is planned to last one month.
Phase 2: Testing and Refinement: The goal of this two-month phase is to conduct in-depth testing, collect feedback, and make adjustments through rapid iteration. This means testing various features and functionalities while gathering feedback on AI agent accuracy, response quality, and ease of use.
Phase 3: Impact Assessment and Feature Enhancement: In this phase, the team will thoroughly evaluate the chatbot's performance and impact. Based on the results of the evaluation, enhancements will be implemented. This phase is planned to last two months.
Phase 4: Scaling and Sustainability Planning: Based on evaluations and learning from the previous phases, in this one-month phase Signpost will develop a plan for scaling and long-term sustainability. Feasibility checks will assess how the chatbot could be deployed in additional contexts, as well as the potential impact of external factors.
Team Structure, Roles and Key Activities
Each country's pilot is led by a Protection Officer (PO) with extensive knowledge of and expertise in their Signpost instance. These POs have been involved with Signpost AI chat since before the pilot, so they have experience using and optimizing the AI chatbot for better responses. The POs lead the community liaisons (or moderators), who interact directly with clients. These moderators are experts in working with clients and addressing their issues. They had, however, no prior knowledge of the Signpost AI chatbot, and were given general and tool-specific training, along with knowledge tests, before beginning their evaluations.
While each country's pilot is structured differently based on available resources, personnel, and country-specific conditions and requirements, there is uniformity in the involvement of full teams in testing; the teams' roles are also similarly defined:
Community Liaisons: Directly interact with the chatbot, provide feedback, test new features, and ensure accurate responses.
Editorial Content Team: Update and manage the knowledge base, ensuring content accuracy and relevance to local context.
Social Media Team: Integrate the chatbot into social media channels, monitor interactions, and gather user feedback
Service Mapping Team: Ensure the chatbot's information is up to date and aligned with the services available in the country
Protection Officer: Facilitates training, provides guidance, and serves as the primary liaison between the team and Signpost AI
The number of moderators on each team differs based on team size and available resources and personnel, but roughly looks like the following:
Number of Moderators
Greece: around 6 moderators testing 7 languages
Italy: around 2 moderators testing 3 languages
Kenya: 3 moderators + 1 pilot-dedicated moderator testing and evaluating 4-8 tickets per day
El Salvador: 2 moderators testing 1 language (Spanish)
Please note that the exact number of moderators fluctuated based on vacation time and workloads. Now that we have looked at the structure and makeup of the pilot, let us turn to moderator workflows, results, and the challenges and progress so far.
Moderator Workflow
Moderators in all country instances access the SPAI chatbot through the customer service platform Zendesk. You can read more about how the tool works here. The chatbot generates AI responses to the user tickets assigned to moderators, who review, evaluate, and score each AI answer based on the SPAI Quality Framework. There are three quality metrics:
Trauma-informed language or Psychological First Aid (PFA) specific language
Client-Centeredness
Safety/Do no harm
After scoring, the moderators leave detailed feedback on each response, which is reviewed by the Protection Officer (PO), who is also the pilot country lead. The PO then tracks issues, makes improvements (through feedback or prompt-engineering changes), and captures this data to analyze trends.
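To make the workflow concrete, below is a minimal sketch, in Python, of what a single evaluation record could look like. The field names, enum values, and structure are illustrative assumptions on our part; the actual pilot tooling lives inside Zendesk and is not necessarily structured this way.

```python
from dataclasses import dataclass
from enum import Enum

class Outcome(Enum):
    """The three possible moderator verdicts on an AI answer (assumed labels)."""
    PASS = "pass"
    FAIL = "fail"
    FLAG = "flag"

@dataclass
class Evaluation:
    """One moderator review of one AI-generated answer (illustrative only)."""
    ticket_id: str
    country: str            # e.g., "Greece", "Italy", "Kenya", "El Salvador"
    model: str              # e.g., "claude" or "chatgpt"
    trauma_informed: bool   # trauma-informed / PFA-specific language
    client_centered: bool   # client-centeredness
    safe: bool              # safety / do-no-harm
    outcome: Outcome        # overall verdict
    feedback: str = ""      # detailed moderator feedback, reviewed by the PO
```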
Moderators do not evaluate full-time; they set aside a fixed period of their workday, about one hour (or a question limit, usually three questions per day), to contribute to testing the SPAI chatbot.
Pilot Performance Results So Far
To give a sense of how the SPAI chatbot is performing approximately 3 months into the pilot, let's take a look at some available statistics from Greece, Italy, and El Salvador.
Over a combined sample of 620 evaluated AI answers, the chatbot in Greece (which uses OpenAI and Anthropic models) has a pass rate of 58.39%, a fail rate of 29.84%, and a flag rate of 11.77%. The Claude model seems to perform marginally better than the ChatGPT version.
For Italy, over a combined sample of 227 responses, the pass rate was 71.37%, the fail rate 16.74%, and the flag rate 11.89%. Per model, ChatGPT performed significantly better than Claude, though over a much smaller sample size.
For El Salvador, where only ChatGPT is being used, over a sample of 218 AI-generated responses, the pass rate was 68%, the fail rate 23%, and the red-flag rate 8%.
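The rates above are simply the share of each verdict over the evaluated sample. As an illustrative sketch, reusing the hypothetical Evaluation record from earlier, the aggregation might look like this:

```python
from collections import Counter

def outcome_rates(evaluations: list) -> dict:
    """Return pass/fail/flag percentages over a list of Evaluation records."""
    counts = Counter(e.outcome for e in evaluations)
    total = sum(counts.values())
    return {o.value: round(100 * counts[o] / total, 2) for o in Outcome}

# Sanity check on the arithmetic: Greece's 58.39% pass rate over 620
# evaluated answers implies roughly 0.5839 * 620 ≈ 362 passing responses.
```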
The results are highly geography-dependent. In a future update, we will take a closer look at the numbers, investigate why there is variation, and examine whether the pilot sheds light on which model performs better for information provision in the Signpost use case.
Challenges
Time-consuming to give Feedback
According to moderators, assessing and evaluating the AI responses requires focus and is time-consuming. This has been reported across all four country instances. In Kenya, moderators expressed that while generating responses and scoring them was manageable, the most time-consuming task was providing feedback after scoring. The time required for feedback increases significantly when the chatbot delivers responses that fail or raise flags. Moderators in some countries have also expressed willingness to test the AI tool outside of their pilot period. This creates a tension between the desire to test more answers and the necessity of providing detailed feedback on fewer questions.
In response to this feedback, country POs are actively exploring ways to streamline the feedback process.
Hallucinations
Hallucinations continue to be an issue. Even as moderators across the pilot report improvement over time, made-up, false, or fictional information continues to be generated. In El Salvador, the chatbot continues to generate links to resources that are either irrelevant or nonexistent. In Greece and Italy, the chatbot has also been found hallucinating links, as well as fictional answers to questions about processes for obtaining documents. In some cases, the chatbot uses information from outside the knowledge base, which has led to frustration for moderators in El Salvador.
Language Issues
Moderators from all instances report language issues of all sorts. One recurring issue is the generation of responses in non-supported languages: if the user asks a question in a non-supported language, the chatbot will respond in that language, despite configurations that aim to eliminate such behavior.
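As an illustration of what such a configuration can involve, here is a minimal sketch of a post-generation language guard. The supported-language list and fallback message are hypothetical, and langdetect is just one example of a detection library; detection is statistical and can misfire on short or mixed-language text, which is one plausible reason guards like this fail in practice.

```python
from langdetect import detect, LangDetectException  # pip install langdetect

# Hypothetical per-instance whitelist of supported ISO 639-1 language codes
SUPPORTED = {"en", "el", "it", "sw", "so", "es"}

FALLBACK = "Sorry, this channel currently supports only its listed languages."

def guard_language(response_text: str) -> str:
    """Return the generated response only if it is in a supported language."""
    try:
        lang = detect(response_text)
    except LangDetectException:  # e.g., text too short to classify
        return FALLBACK
    return response_text if lang in SUPPORTED else FALLBACK
```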
In a similar vein, the following translation issues have also cropped up:
The chatbot generates a correct, accurate answer but then mistranslates it when rendering the final response in the target language
The chatbot translates the service articles in a manner that changes their meaning; this error can then loop back into the final generation of the chatbot response
In terms of the tone and structure of chatbot responses, the following issues come up:
The chatbot uses repetitive language, to the point where it sounds robotic. This issue has been reported primarily by moderators in Greece and Italy
This is compounded by the length of the responses, which can become overly long. This has been reported by moderators in Italy
In El Salvador, the chatbot has been found to (a) use imperative and directive language such as “we recommend”, “we invite”, or “we suggest” and (b) use words such as “help”. Such language is not aligned with the Signpost Moderator Handbook, as it indicates a prescriptive tone instead of one that favors openness of choice for the client
The chatbot asks neither for the location of the user nor follow-up questions for clarification; both are important pieces of information for serving the client effectively
The chatbot has been found to be good in some languages and weaker in others, regularly missing context or important cues. In Kenya, for example, while the Swahili responses are acceptable, Somali responses have contained grammatical mistakes and missed cultural and contextual nuances of language use in the region.
The chatbot has also been found to occasionally misidentify the gender and age of the client in the Greece instance.
Complex Requests
The chatbot has shown no real ability so far to deal with complex requests. While it is able to answer simple queries (e.g., questions about how to get documentation) with mostly pass-worthy responses, it is unable to respond to questions that require nuance or a combination of information. In El Salvador, the SPAI chatbot does extremely well on basic requests (see below) but is unable to deal with questions that must combine knowledge base information with the personal circumstances of the client.
In Greece, for example, even basic requests have not achieved a high pass rate, since the processes documented in the country instance require some combination of knowledge. While accuracy has been improving, moderators found that the SPAI chatbot could not be relied upon to provide complex information without making errors.
Unreasonable Expectations and Fears
Country pilot leads report that moderators came into the pilot with high expectations of the chatbot's performance, though these expectations have been tempered over the duration of the pilot. In some cases, as in Kenya, this level of expectation has led to frustration because moderators expect the chatbot to perform better than it does.
In El Salvador, moderators anthropomorphized the chatbot, believing that it thinks like a human and that it would respond much better to psycho-social contexts.
In Greece, there was an expectation that the chatbot would respond perfectly to questions. In Italy, the moderators were unsure what to expect from the chatbot and hence were positively surprised by the responses it provided.
The expectations of moderators can be tied to their fears about AI. In meetings and introductory sessions, they expressed fears that AI would replace them and the work that they do. Despite assurances that it was just a technological tool, such fears can be linked to high expectations: the belief that AI is so advanced and human-like logically leads to expectations that it will perform like a human in responding to user requests for information.
Such initial expectations have been lowered over time as moderators have seen the chatbot's actual performance.
This challenge provides a good learning opportunity to improve the content of AI literacy and training sessions. The moderators were trained and tested on their knowledge of the tool and Generative AI prior to the pilot. The bulk of this education was abstract, focused on high-level explanations of Generative AI, the AI tool, the quality frameworks, and the work moderators would be expected to do. In future iterations, it might be advisable to explore the architecture of the AI (which explains how the tool was built and how it works) and the tool itself in much greater depth, with a particular focus on the chatbot's failings. Such grounded, work-specific, failure-exposing explanations could provide a more realistic counterweight to anthropomorphic views, high expectations, and fears of AI taking over jobs.
Trust and Potential Complacency
As moderators have become more familiar with the AI tool, they have become adept and faster at using it. With chatbot performance also improving incrementally, pilot leads report that moderators' trust in the SPAI chatbot is increasing. For instance, it has been observed that moderators' trust in the tool increases immediately following a positive response. While this is generally beneficial, it also leads to instances where moderators mistakenly rate a poor response, such as one containing incorrect location information, as good. Conversely, after scoring a response as a failure, they may wrongly flag a subsequent correct one, although this effect is less pronounced.
Another indicator of increased trust is moderators expressing interest in using the AI tool in their daily work.
Both these indicators point to a potential dependency or complacency developing among moderators using the AI tool. The data on this is not definitive, but studies on trust have shown that growing trust does produce dependency and complacency. With a technology as presently unreliable as Generative AI, this issue is worth monitoring in sensitive settings such as the humanitarian sector.
AI Tool Problems
As with any rapid, iterative development, there have been bugs in the AI tool hosted on Zendesk. In early testing, for example, some of the response-generation buttons would occasionally not work. In the El Salvador pilot, the tool was in English for the first couple of weeks even though it should have been in Spanish. Such problems have been resolved quickly once identified.
Progress
Reduction in time for Basic Requests
While feedback is time-consuming, the AI tool itself seems to save time for moderators, to the point that, as mentioned above, they would like to use it outside of the pilot constraints. This reduction in client-servicing time has been seen most strongly in Greece, Italy, and El Salvador, and to a lesser extent in Kenya.
There seems to be unanimous agreement that the tool saves time specifically for basic requests (given the improvement in their performance scores). For more complex requests, where performance scores have not improved as drastically, it remains to be investigated whether the chatbot saves moderators time: editing long, wrong, inaccurate, or hallucinated generated answers might take more time than typing out a fresh response.
Observable Improvements in Performance
While performance is not yet where it needs to be to deploy the chatbot, there have been observable improvements, as reported by moderators. This improvement is seen most strongly in the chatbot's responses to basic requests over time. The improvement has been incremental and can be attributed to better prompt engineering, newer models, and the moderators' upskilling and increased familiarity with the AI tool and its processes (which variables have what effects is a topic for future investigation). The improvements are undeniable:
Accuracy and clarity of the answers have improved monthly (as indicated by pass percentages going up across the board)
Hallucination rates have decreased incrementally
The structure of responses and the summarization of articles have also improved
Responses to complex requests have also improved slightly
Useful Tool
With the tool saving time and their familiarity growing, moderators are beginning to find the AI tool useful (Italy). This usefulness can be gauged from their feedback, as well as their growing trust (Greece and El Salvador) in the tool and the answers it generates.
These are preliminary results from the ongoing pilot. Future reports will update these findings and include in-depth analyses of key questions related to performance improvement and moderator trust. The goal of these updates is to assess the feasibility of the chatbot while developing a comprehensive set of insights and best practices for future use.