What We’re Building, What We’re Buying and Why

Building Trust, One Evaluation at a Time: Why We Chose AI Development Over Off-the-Shelf Solutions.

In the ever-evolving world of AI, choosing the right tools can be a daunting task. When it came to evaluating the performance of our own AI systems, we faced a critical decision: buy an existing solution or build our own? 

While readily available tools offered a tempting shortcut, we ultimately opted for the path less traveled: developing a custom AI performance evaluator. This decision wasn’t taken lightly, but it stemmed from a deep commitment to transparency, control, and building trust in our AI for both ourselves and our users. 

Here’s a breakdown of the key factors we weighed when deciding between building a custom LLM performance evaluator or acquiring a pre-existing solution. Our primary objective is to analyze and compare various LLMs, evaluating their performance against Signpost’s core principles outlined in our constitution. 

Building a custom evaluator

  • Pros: 

    • Customization: You can tailor the evaluator to your specific needs and LLM functionalities. 

    • Control: You have full control over the evaluation process and data. 

    • Potential Cost Savings: In the long run, building might be cheaper if the evaluator is highly specialized for your needs.

  • Cons:

    • Time and Resources: Building requires significant time, expertise, and resources, including developers, data scientists, and potentially computational infrastructure. 

    • Maintenance: You’ll be responsible for ongoing maintenance and updates as LLMs and evaluation techniques evolve. 

    • Expertise: Building a robust evaluator requires expertise in LLM performance evaluation methodologies and potentially machine learning itself.

Buying a pre-built evaluator

  • Pros:

    • Faster Implementation: Existing evaluators are readily available and can be implemented more quickly. 

    • Reduced Cost Upfront: Requires less initial investment compared to building from scratch.

    • Expertise: Leverages the expertise of the evaluator developers.

  • Cons:

    • Customization: You might not get the level of customization you need for your specific use case.

    • Limited Control: You have less control over the evaluation process and the data used, and integrations with our internal platforms may require additional API connections that incur extra charges. 

    • Cost Over Time: Depending on the pricing model, purchasing could be more expensive in the long run compared to building.

Below are the types of questions we asked ourselves: 

  1. What are our specific needs and priorities for evaluating our LLM? 

  2. What is our budget for the evaluator? 

  3. What level of expertise do we have in-house for building and maintaining an evaluator? 

  4. Are there any existing evaluators that meet a significant portion of our requirements with acceptable customization options? 

Through in-depth evaluations, including meetings with third-party vendors such as Vellum AI, Vectorview, and Phospho, we thoroughly assessed the capabilities of pre-built LLM performance evaluators. This comprehensive process ultimately informed our build-or-buy decision.

In a nutshell, we chose the path to invest in building our own to allow us to:

  • Tailor to Our Needs: off-the-shelf solutions might not perfectly align with Signpost’s specific use case. 

  • Greater Control and Flexibility: we retain full control over the evaluator’s functionalities and future development, so we can adapt it to evolving needs and incorporate unique evaluation metrics relevant to Signpost’s goals. 

  • Transparency and Explainability: building our own allows for a deeper understanding of its inner workings. Transparency is crucial for building trust in the evaluation process and in the ultimate decisions made regarding LLM performance.

  • Mitigating Potential Biases: pre-built solutions might have inherent biases that could skew the evaluation. Developing our own evaluator allows us to design it with fairness and objectivity in mind, ensuring a more accurate assessment of LLMs.
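To make the idea concrete, here is a purely illustrative sketch of what scoring an LLM response against a set of principles could look like. All names, rubric checks, and thresholds below are hypothetical placeholders, not Signpost’s actual implementation or constitution.

```python
# Hypothetical sketch of a custom evaluator loop.
# Every principle and check here is an illustrative placeholder.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Principle:
    name: str
    check: Callable[[str], float]  # maps a response to a 0.0-1.0 score

def evaluate(response: str, principles: list[Principle]) -> dict[str, float]:
    """Score one LLM response against each principle in the rubric."""
    return {p.name: p.check(response) for p in principles}

# Placeholder rubric checks; real checks would encode reviewed criteria.
principles = [
    Principle("concise", lambda r: 1.0 if len(r.split()) < 200 else 0.5),
    Principle("sourced", lambda r: 1.0 if "http" in r else 0.0),
]

scores = evaluate("See https://example.org for details.", principles)
overall = sum(scores.values()) / len(scores)
```

Because the rubric is just data, new principles or metrics can be added as needs evolve, which is exactly the flexibility that motivated building rather than buying.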

By taking the path of building, we prioritize control, transparency, and a future-proof solution that aligns perfectly with Signpost’s specific needs and ethical considerations. 

Stay tuned as we unpack the intricate world of AI performance evaluation and delve into the ethical considerations that guide our decision. We’ll explore how a custom evaluator empowers us to mitigate potential harm, build trust with our users, and ultimately, pave the way for a future fueled by responsible AI development!
