LLM Showdown: Choosing Ethical & Effective AI

Hello everyone and welcome to another episode of our “Behind the Bot” series!

Today, we share our framework and decision-making process for selecting a Large Language Model (LLM) for our AI Chatbot.

Donning our experimental and research hats, we started off broadly, connecting our Chatbot to different versions of OpenAI's ChatGPT, Google's Gemini, and Anthropic's Claude. We wanted to test the performance of all of the major Generative AI LLM offerings on humanitarian use cases. Our Expert Protection Officers (POs) and Red Teams got to work, cross-evaluating the LLMs on quality and performance, keeping meticulous testing diaries, and raising a host of crucial ethical questions.
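
For readers curious about the plumbing: one way to wire a single chatbot to several LLM backends for this kind of side-by-side evaluation is a thin, provider-agnostic adapter layer. Below is a minimal sketch in Python; the `LLMBackend` protocol, `StubBackend` class, and `collect_responses` helper are illustrative names invented for this post, not our actual implementation, and the stub stands in for real provider SDK calls.

```python
from typing import Protocol


class LLMBackend(Protocol):
    """Anything that can turn a client question into a draft answer."""

    def generate(self, prompt: str) -> str: ...


class StubBackend:
    """Placeholder standing in for a real provider SDK call."""

    def __init__(self, name: str) -> None:
        self.name = name

    def generate(self, prompt: str) -> str:
        # A real backend would call the provider's API here.
        return f"[{self.name}] draft answer to: {prompt}"


def collect_responses(backends: dict[str, LLMBackend], prompt: str) -> dict[str, str]:
    """Send the same question to every backend so POs can compare side by side."""
    return {name: backend.generate(prompt) for name, backend in backends.items()}


if __name__ == "__main__":
    backends: dict[str, LLMBackend] = {
        "chatgpt": StubBackend("chatgpt"),
        "gemini": StubBackend("gemini"),
        "claude": StubBackend("claude"),
    }
    for name, answer in collect_responses(backends, "Where can I find shelter tonight?").items():
        print(name, "->", answer)
```

With an interface like this, adding or swapping a model is a one-line change, which is what makes this kind of broad testing cheap in engineering terms.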

Deliberating the Process

As soon as testing began, we faced an important question:

“For our Signpost AI Chatbot, do we test with one LLM or all three together?” We were always open to the idea that different LLMs might be more proficient at different kinds of questions; if that turned out to be the case, we would put together a super team of LLMs to face any question. The trade-off: this option would add development complexity and increase costs. Our testing soon showed that all of the LLMs produced similar kinds of responses, and that the only real difference was in their level of performance against our quality metrics.
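
To make that trade-off concrete, here is a toy sketch of what per-question routing could look like; the categories and model assignments are invented for illustration and do not reflect our evaluation results:

```python
# Hypothetical routing table: question category -> the LLM assigned to it.
ROUTES = {
    "legal": "llm_a",
    "health": "llm_b",
    "shelter": "llm_c",
}
DEFAULT_LLM = "llm_a"


def pick_backend(category: str) -> str:
    """Choose which LLM should answer a question of the given category."""
    return ROUTES.get(category, DEFAULT_LLM)


print(pick_backend("health"))   # llm_b
print(pick_backend("unknown"))  # llm_a (fallback)
```

Even this toy version hints at the hidden costs: every incoming question now needs a classification step, and each extra provider brings its own billing, monitoring, and failure modes.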

As it happened, one LLM seemed to be outperforming the others on these metrics.

Claude Running Away with It!

Our initial weekly PO evaluations showed that Claude outperformed the other LLMs on quality. What is quality? We are going to do a deep dive on that soon, but early testing was done on the following Signpost AI quality indicators:

  1. Trauma-Informed: making better use of Psychological First Aid (PFA) language and principles

  2. Client-Centered: Responding to the specific concerns of the client

  3. Safety/Do No Harm: Output does not include hate speech, stereotypes, judgments, political statements, etc.

Metrics associated with these key pillars were a crucial factor in deciding which LLM we would select. But should this be the only factor that dictates the selection? We were fairly sure, but not completely, so the decision required additional due diligence, research, and an accounting of all possibly relevant factors.
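
As a flavor of how indicator ratings can roll up into a single comparable number, here is a minimal sketch assuming a simple 1-to-5 PO rating per indicator; the scale and the equal-weight averaging are simplifications for illustration, not our actual rubric:

```python
from statistics import mean

# Hypothetical PO ratings for one LLM response (1 = poor, 5 = excellent),
# one entry per Signpost AI quality indicator.
ratings = {
    "trauma_informed": 4,
    "client_centered": 5,
    "safety_do_no_harm": 5,
}

quality_score = mean(ratings.values())
print(f"Quality score: {quality_score:.2f} / 5")  # Quality score: 4.67 / 5
```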

Developing the Selection Framework

All of the Signpost AI teams have had conversations around the criteria listed below, so we decided to consolidate all of those conversations into one framework and add a dash of research:

Criteria

  • Test outputs are:

    1. Trauma-Informed

    2. Client-Centered

    3. Safe/Do No Harm

    4. Expectation-Managing

  • Development rationale: LLM choice with regard to development included conversations on:

    1. Complexity (less is good)

    2. Ease of access and use

    3. Availability of development integrations

    4. Developer support

    5. Development scaling potential

    6. LLM response time

    7. Initial and future access and token costs

    8. How well the LLM benchmarks against industry standards

  • What are the LLM providers’:

    1. Policies on data privacy and ownership

    2. Usage of data ingested by the LLM

    3. Stances on transparency and sharing information about their LLMs

  • Does a particular LLM offer more, fewer, or the same opportunities for:

    1. Future partnerships

    2. Funding

  • A number of other factors that could affect our development, deployment, and scaling efforts:

    1. Is the LLM provider an established, stable technology player or an innovative startup? What are the potential trade-offs?

    2. Are there regulatory or legal issues with the provider that might impact the future performance of the LLM?

    3. What is the LLM provider's position on LLM-related climate emissions? What are its mitigation arrangements?


Choice Time!

We accounted for a lot! This framework is a good way to get started with selecting an LLM. Given our humanitarian use case, our conversations revolved primarily around the first three criteria. We could have added weights based on relative importance to this table and done it a little more scientifically (a sketch of that approach follows below). But as we discussed further, the choice actually became much simpler; our teams were fairly unanimous about what really mattered in selecting an LLM for our humanitarian use case: the quality of the response.
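
For teams who do want that more scientific version, the weighting is straightforward. A minimal sketch in Python; the criteria groupings, weights, and ratings below are invented for illustration, not our actual evaluation data:

```python
# Hypothetical weights per criteria group (they should sum to 1.0)
# and hypothetical 1-5 ratings per candidate LLM.
weights = {"quality": 0.5, "development": 0.2, "provider_policies": 0.2, "ecosystem": 0.1}
ratings = {
    "llm_a": {"quality": 5, "development": 4, "provider_policies": 4, "ecosystem": 3},
    "llm_b": {"quality": 4, "development": 5, "provider_policies": 3, "ecosystem": 4},
}

for llm, scores in ratings.items():
    total = sum(weights[criterion] * scores[criterion] for criterion in weights)
    print(f"{llm}: {total:.2f}")  # llm_a: 4.40, llm_b: 4.00
```

With a quality weight this dominant, the weighted version lands in the same place our discussions did.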

At the end of the day, the Signpost AI chatbot is supposed to give our clients good, accurate, tailored, and safe answers. It is our humanitarian and moral responsibility to do best by our communities. So which LLM did we choose? We chose Claude, because its responses consistently aligned with our users' needs. Optimizing for quality is the only way we can ensure the practical and ethical usability of Generative AI LLMs in a humanitarian use case.

Is Claude good enough yet? Of course not. There is a lot more work to be done in testing, evaluating, and making technical tweaks. We might even need to re-evaluate our LLM selection. But currently, it shows the most potential to be the LLM that, once connected to our chatbot, can give our communities the information they need.
