Rapid Prototyping: Human and Automated Evaluation Before Launch

This article is for development teams and subject matter experts considering approaches to evaluating AI quality while they are prototyping.

Developing AI safely is a rapidly evolving process. A single technical change can reshape an AI product and invalidate what you previously knew about its behavior. If you change the library your bot uses to match keywords against your knowledge base, a tester may see one answer one day and something completely different the next. So the question remains: how can you evaluate how well your AI works when your product is changing rapidly?

To answer this question there are two schools of thought. The first, a more scientific approach focused on evaluation purity, suggests that you should wait until your product is controlled and stable before you begin your evaluation - this is commonly known as an “online” evaluation. The problem with this approach in AI development is that, because the whole world of AI is changing daily, any AI that is not fully custom-built will be impossible to fully control.

The first school of thought: wait until the product is stable (“online” evaluation)

  • Pros

    • Results are definitive and possibly scientifically admissible

    • Evaluation is only needed after product is developed

    • You can hire for only a definite period

  • Cons

    • Development may veer off course without continuous evaluation, leading to wasted effort

    • Time to market will be longer

    • It is fundamentally impossible to control AI as a downstream actor

    • Product quality will ultimately suffer

The second school of thought: evaluate continuously throughout rapid prototyping (“offline” evaluation that graduates to “online”)

  • Pros

    • Early indication means ability to pivot and address issues early

    • Quality of the tool is better

    • Evaluation becomes a means to steer the direction of the product

    • Evaluators become part of the development process

    • Repeatability

    • Cost-effective and allows for fast iterations (especially when using automated metrics)

    • Can be integrated into a continuous integration (CI) environment to ensure minimum performance standards are met before deployment (see the sketch after this list)

  • Cons

    • Evaluation adds ongoing cost because it must continue throughout development

    • Evaluation outcomes are less scientific and require careful tracking of when changes were made and how they affected performance

    • Evaluators must be continuously informed
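
To illustrate the CI point above, here is a minimal sketch of an evaluation gate that could run on every release. Everything in it is an assumption for illustration: the answer_question() stand-in, the tiny reference set, and the keyword-coverage check are placeholders for whatever bot interface and automated metrics your team actually uses.

```python
# ci_eval_gate.py - sketch of an automated evaluation gate for a CI pipeline.
# answer_question(), REFERENCE_SET and the keyword check are illustrative
# placeholders, not part of any real Signpost tooling.
import sys

# Hypothetical reference set: each question is paired with keywords that any
# acceptable answer should mention.
REFERENCE_SET = [
    {"question": "How do I register for cash assistance?",
     "keywords": ["registration", "documents"]},
    {"question": "Where is the nearest health clinic?",
     "keywords": ["clinic", "address"]},
]

MINIMUM_PASS_RATE = 0.8  # deployment is blocked below this threshold


def answer_question(question: str) -> str:
    """Stand-in for the prototype bot; replace with a call to the real system."""
    return "Bring your registration documents to the clinic at this address."


def covers_keywords(answer: str, keywords: list[str]) -> bool:
    """Crude automated metric: does the answer mention every expected keyword?"""
    lowered = answer.lower()
    return all(keyword.lower() in lowered for keyword in keywords)


def main() -> int:
    passed = sum(
        covers_keywords(answer_question(item["question"]), item["keywords"])
        for item in REFERENCE_SET
    )
    pass_rate = passed / len(REFERENCE_SET)
    print(f"automated pass rate: {pass_rate:.0%}")
    # A non-zero exit code fails the CI job and stops the release.
    return 0 if pass_rate >= MINIMUM_PASS_RATE else 1


if __name__ == "__main__":
    sys.exit(main())
```

In a real pipeline, the keyword check would be swapped for whichever automated metrics the team has agreed on, and the script would run after each build so that a regression blocks deployment.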

SignpostAI therefore takes the second school of thought: evaluation should happen throughout the rapid prototyping process, despite the lack of control - an “offline” evaluation that graduates to “online”. This option allows your evaluation to feed directly into the development process. Instead of thinking about prototyping and evaluation as sequential stages, you can think of them as complementary processes that contribute to each other.

For this approach to work, you need to ensure that your evaluators are kept informed about when changes to the AI are happening and what outcomes they can expect. With every release in the rapid prototyping phase, Signpost sends evaluators release notes detailing the most relevant changes; these notes need to be clear even to the least technical audience. In addition, as evaluators notice trends, their feedback should be analyzed by business analysts working with developers who can determine whether a trend found by evaluators is actually a bug or an opportunity to build a new feature. We call this approach rapid evaluation.

GenAI Model Evaluation vs GenAI-based System Evaluation

What types of evaluations are there? 

Human Evaluation: Benefits and challenges

Using human evaluation instead of automated evaluation offers several advantages, particularly more meaningful and trustworthy scores thanks to the evaluators' contextual understanding. Human evaluators can consider nuances and specific circumstances that automated systems might miss, leading to more accurate and insightful assessments. This is especially relevant in a humanitarian context, where GenAI models likely cannot grasp the full nuance of an emerging crisis in a less-indexed country like Niger. However, this approach is time-consuming and expensive, posing scalability challenges for larger projects unless it can be transitioned to upskilled staff who can contribute a portion of their time to evaluation.

On the downside, relying on human assessment requires substantial investment in training and evaluating the evaluators themselves to ensure consistency and reliability. Furthermore, human evaluations can be inconsistent and subjective, leading to potential discrepancies in ratings between individuals, especially depending on cultural and ethnolinguistic context. This variability can undermine the objectivity and comparability of the assessments. Nevertheless, this is the most effective “offline” way to get a sense of how an “online” implementation would work.

Automated Evaluation: Benefits and challenges

Another option is to use GenAI itself to create AI evaluation agents that conduct automated assessments of outputs. These evaluation agents should be modeled after the personas of your quality agents and given the same context you would give human quality assurance agents, and more. This approach is scalable and cost-effective, since evaluation bots are easy to build. However, evaluation bots will yield results that miss context: less biased, but ultimately flawed. For example, a bot without retrieval of the full context of deportations in Pakistan could not effectively evaluate whether a response to a simulated client in Pakistan saying they are at risk of deportation is indeed correct. These contextual scenarios require human evaluation paired with automated evaluation. Automated evaluation is most successful when measuring consistency of style, format and clarity of content, and when identifying red flags.
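
As a rough illustration of the evaluation-agent idea, the sketch below gives a judge a quality-assurance persona and the same background context a human reviewer would see, then asks for structured scores on style, format and clarity plus any red flags. The call_model() wrapper and the scoring schema are assumptions for illustration, not a reference to any specific provider or Signpost component.

```python
# llm_judge.py - sketch of an automated evaluation agent with a QA persona.
# call_model() and the JSON scoring schema are illustrative assumptions.
import json

# Persona modelled on a human quality-assurance agent, with explicit criteria.
JUDGE_PERSONA = (
    "You are a quality assurance reviewer for a humanitarian information "
    "service. Score the bot response for style, format and clarity (1-5 each) "
    "and list any red flags such as unsafe advice or a missing referral. "
    'Reply only with JSON like {"style": 4, "format": 5, "clarity": 4, '
    '"red_flags": []}.'
)


def call_model(system_prompt: str, user_prompt: str) -> str:
    """Stand-in for a call to whichever GenAI provider is in use."""
    # Canned reply so the sketch runs end to end without an API key.
    return '{"style": 4, "format": 5, "clarity": 4, "red_flags": []}'


def evaluate_response(context: str, client_message: str, bot_response: str) -> dict:
    """Ask the evaluation agent to judge one bot response in its full context."""
    user_prompt = (
        f"Background context:\n{context}\n\n"
        f"Client message:\n{client_message}\n\n"
        f"Bot response to evaluate:\n{bot_response}"
    )
    return json.loads(call_model(JUDGE_PERSONA, user_prompt))


if __name__ == "__main__":
    verdict = evaluate_response(
        context="Current guidance on deportation risk for clients in Pakistan.",
        client_message="I was told I will be deported next week. What can I do?",
        bot_response="Please contact the protection hotline listed on our site.",
    )
    print(verdict)  # e.g. {'style': 4, 'format': 5, 'clarity': 4, 'red_flags': []}
```

As the next paragraph notes, the judge's own scores should themselves be spot-checked by human quality analysts before they are trusted.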

Ironically, for AI evaluation to work well, the evaluation bot itself requires human evaluations from quality analysts!

Best Practices

  1. Never rely solely on automated evaluation. A balanced approach that includes both human and automated evaluations ensures a comprehensive assessment. While automated metrics provide consistency and scalability, human evaluators add essential contextual insights that machines cannot grasp. This dual approach helps in identifying nuanced issues and ensures a more robust evaluation process (a small sketch of pairing the two follows after this list).

  2. Incorporate continuous evaluation into your development process. By integrating evaluation into the rapid prototyping phase, you allow for iterative improvements and timely feedback. This approach ensures that any issues or potential improvements are identified early, enabling quick pivots and adjustments. Keeping evaluators informed about changes and involving them as part of the development team enhances the overall quality of the AI product. This collaborative process not only steers the direction of development but also ensures that the final product meets the desired standards of quality and functionality.
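
As a rough sketch of the dual approach in best practice 1, the snippet below pairs a human evaluator's rating with the automated judge's rating for the same response and flags large disagreements for joint review; the field names and the 1-5 scale are assumptions for illustration.

```python
# combined_review.py - sketch of pairing human and automated ratings and
# surfacing disagreements for follow-up; field names and scales are assumed.

SAMPLE_SCORES = [
    {"response_id": "r-101", "human": 2, "automated": 5},
    {"response_id": "r-102", "human": 4, "automated": 4},
    {"response_id": "r-103", "human": 5, "automated": 3},
]

DISAGREEMENT_GAP = 2  # a gap this large triggers a joint evaluator/developer review


def flag_disagreements(scores: list[dict]) -> list[dict]:
    """Return responses where human and automated judgments diverge sharply."""
    return [
        score for score in scores
        if abs(score["human"] - score["automated"]) >= DISAGREEMENT_GAP
    ]


if __name__ == "__main__":
    for item in flag_disagreements(SAMPLE_SCORES):
        print(
            f"{item['response_id']}: human={item['human']} "
            f"automated={item['automated']} -> needs joint review"
        )
```

Flagged items are exactly where human context and automated consistency disagree, which is where the most useful evaluator-developer conversations tend to happen.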
