Decoding the Language of AI: Understanding the Impact of Prompts on Performance

The Power of Prompts

Have you ever wondered why your latest AI gadget seems to misunderstand you half the time? You ask for a recipe, it gives you a sonnet about vegetables. You request a summary of a news article, it launches into a philosophical treatise. The culprit? A communication gap bridged by the unsung hero of AI interactions: the prompt. 

Prompts are like magic words for AI interactions. They act as a bridge, translating our intentions and desires into clear instructions the AI can understand. By crafting effective prompts, we can unlock the full potential of AI, leading to more productive work, richer creative exploration, and even deeper understanding.

Put simply, prompts are instructions that you ask the AI to follow; phrased the right way, they can yield incredible results.

Gauging the Power of Prompts: How We Measure Effectiveness

The inner workings of a bot can be opaque, making it difficult to gauge if it's truly following your instructions. You might receive a seemingly perfect answer that misses key points from your prompt, or a nonsensical response that leaves you wondering what went wrong.

Crafting effective prompts involves a two-pronged approach: first, clearly identifying your desired outcome, and second, structuring your prompt in a way the bot can understand. It's like giving clear instructions and ensuring they're delivered in a language the recipient comprehends.

There are a variety of methods and techniques used to evaluate prompt effectiveness and adherence in AI interactions, especially with LLMs:

  1. Human Evaluation

    a. Subjective Evaluation: Humans judge the outputs generated by the LLM based on factors like relevance, coherence, grammatical correctness, and overall quality. This can be done through surveys, rating scales, or open-ended feedback.

    b. A/B Testing: Different prompts are compared to see which one produces the most reliable outcomes. Changes are introduced one at a time, and human evaluators assess the quality and effectiveness of the outputs from each variant, comparing the results of A and B.

    c. Analysis of Prompt Logs: Reviewing which prompts fire (or fail to fire) when your bot handles a request is essential to understanding whether prompts are “sticking” or being overlooked.

  2. Task-Specific Evaluation

    a. Accuracy: Measures how often the LLM’s output aligns with the desired outcome for a specific prompt (e.g., correctly summarizing a document, generating a factually accurate response).

    b. Completion Rate: Measures the percentage of prompts for which the LLM successfully completes the intended task.

    c. User Satisfaction: Evaluates how satisfied users are with the LLM’s performance in completing a specific task or interacting with a system powered by the LLM. A minimal sketch of how these task-specific metrics might be computed appears after this list.

  3. Additional Considerations

    a. Bias Evaluation: It’s important to evaluate prompts for potential biases that might lead to skewed or unfair outputs from the LLM. Analyzing the generated text and using fairness metrics can help identify potential issues. 

    b. Explainability: Understanding how the LLM arrived at its output can be crucial in evaluating the effectiveness and trustworthiness of the prompt. Techniques like attention visualization can provide insights into the LLM’s reasoning process.
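
To make the task-specific metrics concrete, here is a minimal sketch in Python of how accuracy, completion rate, and average satisfaction might be computed from human-labelled evaluation records. The record fields and sample data are hypothetical illustrations, not Signpost’s actual evaluation schema.

```python
from dataclasses import dataclass

@dataclass
class EvalRecord:
    prompt_id: str           # which test prompt produced this output
    completed: bool          # did the LLM complete the intended task?
    factually_correct: bool  # did a human reviewer mark the output as accurate?
    user_rating: int         # 1-5 satisfaction score from a survey

def summarize(records: list[EvalRecord]) -> dict:
    """Aggregate human-labelled results into the task-specific metrics above."""
    n = len(records)
    if n == 0:
        return {}
    return {
        "accuracy": sum(r.factually_correct for r in records) / n,
        "completion_rate": sum(r.completed for r in records) / n,
        "avg_satisfaction": sum(r.user_rating for r in records) / n,
    }

# Illustrative sample data only.
sample = [
    EvalRecord("greeting-001", True, True, 5),
    EvalRecord("summary-014", True, False, 3),
    EvalRecord("referral-027", False, False, 2),
]
print(summarize(sample))  # accuracy ~0.33, completion_rate ~0.67, avg_satisfaction ~3.33
```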

Choosing the Right Evaluation Method

The most suitable evaluation method will depend on the specific task, the desired outcome, and the resources available. Often, a combination of different techniques is used to get a more comprehensive picture of prompt effectiveness. 

To ensure the effectiveness and responsible development of Signpost Bot, we employed a rigorous evaluation process. This involved testing the bot with both real user prompts and synthetic requests. We then assessed its performance against our Red Team and Quality Team metrics, with a strong focus on the LLM itself and the outputs it produced as we developed the system and local prompts.

What is the difference between System Prompts, Local Prompts and User Prompts? 

The big picture is that working with LLMs relies on three layers of prompts: system prompts, local prompts, and user prompts. Together, they shape how the LLM generates text, translates languages, writes different kinds of creative content, and answers your questions in an informative way. A minimal sketch of how these layers might be combined appears after the list below.

  • System Prompts provide the big-picture instructions for your LLM. They define the overall task at hand and establish the tone and style for the LLM’s response. An example would be: "The AI shouldn’t cause harm to the user or anyone else."

  • Local Prompts act like zoom lenses, focusing the LLM’s understanding on your specific needs within a broader context. They provide additional details and instructions, such as how you want the AI to represent a particular place. For example, Signpost manages a global platform spanning countries from Peru to Thailand. To ensure the AI portrays Kenya in a specific way, I would write a Local Prompt outlining the desired characteristics, as shown below.

    • "Respond to the user ‘Hello. I’m an AI Assistant for Julisha.Info program. My role is to provide information on services and assistance available to refugees, asylum seekers and the host community in Kenya as shared on the Julisha.Info website, and the Julisha.Info social media platforms. Please let me know if you have any specific questions, and I'll do my best to respond within the scope of my knowledge and capabilities as an AI system.' when asked who you are, what you are and how you can help them.". It's important to provide examples.

  • User Prompts are real-life requests, queries, and concerns from actual users who have contacted us. These prompts are anonymized and aggregated to test the bot’s performance under real-world scenarios. 

    • We also leverage moderator comments alongside the logged responses from the LLM. This combined analysis helps us identify areas for improvement and pinpoint opportunities to create system or local prompts. By incorporating these insights, we can further refine the AI’s capabilities and enhance the user journey.
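
As a rough illustration of how these three layers fit together, here is a minimal sketch assuming a chat-style message format in which system-level instructions are sent before the user’s request. The helper name, the wording of the prompts, and the per-country dictionary are hypothetical placeholders, not the actual Signpost configuration.

```python
# Hypothetical system prompt: the big-picture rules for every bot.
SYSTEM_PROMPT = "You are a helpful assistant. Do not cause harm to the user or anyone else."

# Hypothetical local prompts: country-specific instructions layered on top.
LOCAL_PROMPTS = {
    "kenya": (
        "When asked who or what you are, respond that you are an AI Assistant for the "
        "Julisha.Info program, providing information on services available to refugees, "
        "asylum seekers and the host community in Kenya."
    ),
}

def build_messages(country: str, user_prompt: str) -> list[dict]:
    """Layer the system prompt, the local prompt for the country, and the user's request."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "system", "content": LOCAL_PROMPTS[country]},
        {"role": "user", "content": user_prompt},
    ]

print(build_messages("kenya", "Who are you and how can you help me?"))
```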

To gain a deeper understanding of the bot's functionality, we conducted a comprehensive evaluation that went beyond surface-level analysis. We leveraged our bot logs to examine various aspects of its performance, including handling search results, detecting user location, generating search terms, and processing responses through the "Answer after Constitutional" stage. To achieve this, we employed a combination of methods: Human Evaluation, Task-Specific Evaluation, Bias Evaluation, and Explainability.

The Pitfalls of Prompts: How We Learned Less is More in LLM Evaluation

During our LLM evaluation, we compared the bot’s performance against our system and local prompts. We began evaluating our bots with a pre-written set of roughly 200 system prompts, all of which were drafted based on assumptions our development team had made. Interestingly, as we added new prompts over the course of development, the results showed stagnation or even a decrease in performance compared to two months prior. This raised concerns that bombarding the bot with too many prompts might be hindering its learning and even leading to increased hallucinations in its outputs.

Recognizing this, we decided to take a step back and re-evaluate our prompt creation process. We acknowledged that our rapid development pace might have compromised prompt quality. To determine whether prompts were the root cause of the issue, we conducted a new test: we pitted our existing bot, loaded with all of our created prompts, against a ‘RAW’ version with the same configuration but only the most basic prompts. This head-to-head test aimed to isolate the impact of prompts on the LLM’s performance and to reintroduce control. A simplified sketch of this comparison is shown below.
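
The following is a minimal sketch of that head-to-head comparison. The `call_bot` and `rate_answer` callables are placeholders for the real pipeline and the human review step; only the overall structure of the test is intended to be accurate.

```python
def run_ab_test(user_prompts, call_bot, rate_answer):
    """Send the same anonymized user prompts to both configurations and
    collect reviewer ratings side by side."""
    results = []
    for prompt in user_prompts:
        full_answer = call_bot(prompt, config="full")  # bot loaded with all created prompts
        raw_answer = call_bot(prompt, config="raw")    # same bot, only the most basic prompts
        results.append({
            "prompt": prompt,
            "full_score": rate_answer(full_answer),    # human reviewer score
            "raw_score": rate_answer(raw_answer),
        })
    return results

def compare(results):
    """Average the reviewer scores to see which configuration held up better."""
    n = len(results)
    return {
        "avg_full": sum(r["full_score"] for r in results) / n,
        "avg_raw": sum(r["raw_score"] for r in results) / n,
    }
```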

To illustrate the difference, consider these two examples from the Kenya bot: 

‘[Red Team - bot with many prompts]’: The bot assumed that the user was in a bad situation and needed help renewing their work permit in Nairobi, which is not what they asked for. ‘[RAW - bot with fewer prompts after A/B testing]’, by contrast, presented a comprehensive list of services to the user and gave them the space to make a decision with that information.

Beyond the Output: A Look Inside Signpost Bot’s Evaluation

In the second example, we leveraged Human Evaluation and the other methods mentioned previously to assess the bot’s performance. Notably, the bot referenced the correct article (“Health Care”) from its Kenyan knowledge base (Julisha.info). The response, the information and additional details it provided, and the links were all correct and functional.

However, a crucial detail was missing: the hospital visiting hours for Saturday. This occurred with both bots we tested. This inconsistency raises several questions:

  1. Could the article’s structure be the culprit? (e.g., Is Saturday information formatted differently?)

  2. Was Saturday intentionally excluded from the template? 

  3. Can we address this with improved prompting? 

  4. How can we ensure Saturday hours are included in the response for accurate information? 

By investigating these questions through further examination of the frontend, backend, and the article itself, we can pinpoint the root cause and implement a solution.

Level Up Your Prompts: Essential Tips for Effective LLM Communication

Here are some best practices for prompt optimization to get the most out of your Large Language Model (LLM): 

  1. Clarity and Specificity

    a. Be clear and concise: State your instructions and desired outcomes in a way that’s easy for the LLM to understand. Avoid ambiguity and convoluted language. 

    b. Provide specific details: The more specific you are, the better the LLM can tailor its response. 

      • For example, instead of “Only when the user asks ‘who you are’ and ‘what you are’, respond to the user that you’re an AI Assistant,” rephrase it to “Respond with ‘I am an AI Assistant’ when asked who and what you are.”

  2. Focus and Context

    a. Maintain focus: Keep your prompts focused on a single task or question. Avoid bombarding the LLM with too much information at once. 

    b. Provide context: Give the LLM enough background information to understand the situation and respond appropriately. This could include relevant facts, data, or previous interactions. 

  3. Examples and Style

    a. Use examples: If possible, provide examples of the desired output or style you’re aiming for. This helps the LLM understand your expectations.

    b. Specify the style: Guide the LLM by indicating the desired tone and style. For Signpost, for example, we ask the bot to be empathetic and not to make assumptions or suggestions.

  4. Data and Feedback

    a. Leverage training data: Consider the type of data your LLM was trained on and tailor your prompts accordingly, as shown in the “Beyond the Output” section above.

    b. Utilize feedback: Analyze the LLM’s outputs and use them to refine your prompts. Identify patterns in successful and unsuccessful prompts to learn and improve. A sketch pulling several of these tips together into a single prompt template follows below.
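
As a rough illustration, here is a minimal sketch of a prompt template that applies several of the tips above: a clear task, supporting context, an example of the expected format, and an explicit style instruction. The wording and the Julisha.Info framing are illustrative placeholders, not Signpost’s production prompts.

```python
PROMPT_TEMPLATE = """You are an AI Assistant for the Julisha.Info program in Kenya.

Task: Answer the user's question using only the context provided below.
Style: Be empathetic. Do not make assumptions or suggestions beyond the context.

Context:
{context}

Example of the expected format:
Q: What services are available for asylum seekers in Nairobi?
A: According to Julisha.Info, the following services are available: ...

User question: {question}
"""

def build_prompt(context: str, question: str) -> str:
    """Fill the template with task-specific context and the user's actual question."""
    return PROMPT_TEMPLATE.format(context=context, question=question)
```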

It’s best to start simple with basic prompts and gradually increase complexity as your understanding of the LLM’s capabilities grows. If you have a complex task in mind, break it down into smaller, more manageable prompts, and be sure to test and iterate. Experimentation is key to finding what works best for your specific needs.

This is why we recommend testing prompts one by one as you implement them; otherwise it will be hard to troubleshoot, to be certain of your direction, and to control your testing environment. A minimal sketch of this one-by-one approach follows.
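
Here is a minimal sketch of that one-by-one approach, assuming a `run_test_suite` helper that replays a fixed set of user questions against the current prompt configuration and returns an average reviewer score. Both the helper and the acceptance rule are hypothetical simplifications.

```python
def add_prompts_incrementally(candidate_prompts, active_prompts, run_test_suite):
    """Introduce candidate prompts one at a time, keeping each one only if the
    regression suite does not score worse than the current baseline."""
    baseline = run_test_suite(active_prompts)
    for prompt in candidate_prompts:
        trial_score = run_test_suite(active_prompts + [prompt])
        if trial_score >= baseline:
            active_prompts.append(prompt)  # the prompt "sticks": keep it and raise the bar
            baseline = trial_score
        # Otherwise discard the prompt and investigate before trying a revised version.
    return active_prompts
```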

Don’t be afraid to revise and iterate based on results! Remember, the more you understand your LLM’s capabilities and limitations, the better you can craft effective prompts that drive successful interactions.
