A five step framework for evaluating and comparing models for the safe adoption of Generative AI

Vivek Sriram
Oct 10, 2023

Introduction

Business leaders considering the suitability of Generative AI have a dizzying set of decisions to make. While there is no shortage of technical information, much of it isn’t accessible to non-technical audiences and is frequently of dubious quality. The problem is particularly acute given the exponential growth of pre-trained models, which now number close to 400,000. Since any Generative AI project involves people with a multitude of specialties, each with decision-making input and responsibility, the lack of simple frameworks for evaluating models hampers adoption, particularly for enterprise use cases.

bookend AI is driven by the belief that true adoption of Generative AI will not materialize in the enterprise until it’s safe and simple. For Generative AI to be safe and simple, it must first be transparent, trustworthy and easily accessible to all of the stakeholders involved in bringing ideas to life and in sustaining and managing them afterward. The first step in creating transparency and trust is a simple, accessible framework for comparing the features and capabilities of the different flavors of Generative AI models. Until then, product leaders and others in similar roles looking to create competitive advantage with Generative AI will keep experimenting with capabilities rather than putting this transformational power to productive use.

This document makes the case for a simple, accessible framework for choosing and evaluating pre-trained models for enterprise use cases. It is meant for business leaders in areas such as IT, product, marketing and business operations who are considering using Generative AI. The framework is meant to help people who are involved in deciding on model suitability and who want more choice and flexibility than rigid, “do-everything” models like those from OpenAI offer, but who do not have the time or capacity to pore through the hundreds of thousands of models on Hugging Face to figure out what’s right for them.

The case for a standards-based model comparison framework

80% of corporate executives see trust as a blocker for enterprise Generative AI adoption. Since companies see their data as a critical asset, they are rightfully concerned about how it gets used, how their employees, partners and customers interact with it, and what they must do to reduce and protect against risk from emerging rules and regulations. They must account for all of these governance, risk and compliance issues while also dealing with the not insubstantial challenges of deploying, operating and optimizing Generative AI applications.

Security, data protection, privacy and, most of all, trust top the list of concerns, and executives across the organization have further concerns about explainability and bias. IT leaders who have to run, manage and operate these systems must also deal with unpredictable cost / performance tradeoffs. Until Generative AI is both safe and simple, organizations will be cautious about doing more with it. At present, complication and confusion abound in every stage of the development and deployment lifecycle.

Safety and simplicity are critical to enterprise adoption. Accessible, readily available information about the features, capabilities and pitfalls of pre-trained models is of critical value at every step, for each of the many roles involved in selecting, developing, deploying and securing Generative AI applications. Its absence erodes transparency and trust, curtails adoption and leaves Generative AI applications stuck in experimentation limbo.

Standards can inject transparency into the process by which people across business functions work together on complex problems. They create trust and reduce risk. While many factors are involved in the ethics of, and trust in, AI applications, here we’re concerned only with standards that create transparency in the evaluation and selection of pre-trained models, particularly the vast number of open source models with their diverse capabilities, performance, features, authorship and provenance.

Model selection in practice

Corporate policies governing data access and usage, as well as IT restrictions and capabilities around data management and operations, affect which models are most appropriate for any particular situation. Generative AI models feed on large volumes of data in order to perform well on non-generic tasks. They also consume a lot of computing power. These twin issues require collaboration and consensus across the organization, aligning the need for innovation and experimentation with the need to manage risk and cost.

The right starting place is with a use case. Generative AI can do a lot — from protein synthesis to time series analysis along with the more common text and image generation familiar to most. Knowing the use case is critical to frame organizational alignment. Understanding the specific purpose of a use case will also force an understanding of data, computational requirements and performance characteristics.

Accounting for the different decision points illustrated with the diagram above requires the input and consensus of roles from across the organization — including business stakeholders, ML and AI ops, data engineering, support, security and legal.

Since the data that underpins good answers to each of those decisions is often substandard, missing or inaccessible to non-technical users, organizations suffer delays in agreeing on how to move forward. Model selection in practice, then, falls back on the assessment of the technically able, often without input or understanding from other stakeholders. Even within this narrow competency, a number of additional considerations have to be made.

This is a lengthy and arduous process that most people who are not AI experts will find frustrating and difficult to do accurately. It can take weeks when the need is for open source models for highly specialized use cases dealing with sensitive data. The rapid growth of new open source models further necessitates continuous evaluation.

Model evaluation checklist

At bookend we specialize in curating models fit for enterprise use. Our team has built, deployed and scaled models to power use cases ranging from enterprise search and ecommerce personalization all the way to fraud detection and anti-money laundering. The checklist below is our go-to approach for evaluating models.

Use-case alignment

Everything starts with a model that fits your use-case! If you’re looking for a chatbot, you’re looking for an LLM that has been fine-tuned on conversational data. If you’re looking to build a new programming model, it makes sense to choose one that has seen a lot of code in its training run. Be clear about what you’re looking to solve before hunting for models.

License restrictions

For the enterprise, licensing is always an important consideration for any piece of software. In the world of Generative AI, there is an additional wrinkle. Many models have very permissive licenses for their source code, but different and more restrictive ones for the actual model weights, without which the model is useless. At Bookend, the only models available on our platform are those whose code and weights can both be used in an enterprise deployment.
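As a minimal sketch of this two-part check, the Python below treats a model as usable only when both its code license and its weights license appear on an approved list. The names (`ModelLicense`, `APPROVED_FOR_COMMERCIAL_USE`, the example entries) and the contents of the approved list are illustrative assumptions, not a description of any platform's actual policy or a substitute for legal review.

```python
from dataclasses import dataclass

# Hypothetical allow-list; a real one would come from your legal team.
APPROVED_FOR_COMMERCIAL_USE = {"apache-2.0", "mit", "bsd-3-clause"}


@dataclass(frozen=True)
class ModelLicense:
    name: str
    code_license: str
    weights_license: str


def is_enterprise_usable(model: ModelLicense) -> bool:
    """Return True only if BOTH the code and the weights carry an approved license."""
    approved = APPROVED_FOR_COMMERCIAL_USE
    return (
        model.code_license.lower() in approved
        and model.weights_license.lower() in approved
    )


# Illustrative entries only; these are not real models.
permissive = ModelLicense("example-open-7b", "Apache-2.0", "Apache-2.0")
mixed = ModelLicense("example-research-13b", "MIT", "research-only")

print(is_enterprise_usable(permissive))  # True
print(is_enterprise_usable(mixed))       # False
```

The second entry is the wrinkle described above: permissive code, restrictive weights, and therefore not usable.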

Cost of model development / customization

Training large models from scratch can easily cost millions of dollars in compute time. Even fine-tuning larger models can cost tens of thousands of dollars if done in a traditional fashion. Enterprise users should watch for models that are aligned to their use-case and that use efficient training techniques such as LoRA to produce high-quality models at a low cost.
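To give a rough sense of why LoRA is cheaper, the sketch below counts trainable parameters for a single weight matrix. LoRA keeps the original matrix frozen and trains two small low-rank matrices instead. The 4096 x 4096 layer size and the rank of 8 are assumed values chosen for illustration; actual savings depend on the model and on which layers are adapted.

```python
def full_params(d_out: int, d_in: int) -> int:
    """Parameters updated when fine-tuning the whole weight matrix."""
    return d_out * d_in


def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in the two low-rank matrices: B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


d = 4096   # assumed layer width
rank = 8   # assumed LoRA rank

full = full_params(d, d)
lora = lora_trainable_params(d, d, rank)
fraction = lora / full

print(f"{lora:,} of {full:,} ({fraction:.2%})")  # 65,536 of 16,777,216 (0.39%)
```

Under these assumptions, well under 1% of the layer's parameters are trained, which is where the reduction in fine-tuning cost comes from.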

Operations expenses

Renting GPUs is not cheap. Managing costs requires balancing model size against the constraints of your operational budget. Even smaller models at large context lengths can require enormous amounts of GPU memory. Watch for models that use cutting-edge techniques like FlashAttention-2 to optimize performance without breaking the bank.
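One reason context length drives memory use is the attention key/value cache, which grows linearly with the number of tokens held in context. The estimate below is a back-of-the-envelope sketch that assumes standard multi-head attention, 16-bit values, and a shape of 32 layers with 32 heads of size 128, which is typical of a 7B-class model but is an assumption here. Models that share key/value heads across query heads will need less.

```python
GIB = 1024 ** 3


def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch=1, bytes_per_value=2):
    """Memory for cached keys and values across all layers (the leading 2 is K plus V)."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_value


# Assumed shape for a 7B-class model: 32 layers, 32 KV heads, head size 128.
for seq_len in (4_096, 32_768):
    gib = kv_cache_bytes(32, 32, 128, seq_len) / GIB
    print(f"{seq_len:>6} tokens: {gib:.0f} GiB")
# Prints 2 GiB for 4,096 tokens and 16 GiB for 32,768 tokens.
```

This is memory on top of the model weights, and it is per sequence: serving several long requests at once multiplies it by the batch size.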

Availability and accessibility of data for fine tuning

High quality data is of the utmost importance for a performant model. While there are lots of good datasets in the public domain, enterprise users have to be careful about the provenance and licensing of those datasets. Some may have been generated by a commercial large language model, which can make unlicensed commercial use problematic. Watch for datasets with liberal licensing. This is as important as licensing for models.

Tradeoffs between accuracy, computational efficiency and time

Model choice is going to affect the accuracy and the speed of operations. A 70bn model is going to be memory-hungry and slower than a svelte 7bn model, but the smaller model is likely to be less accurate for your task. However, a fine-tuned model may perform better on your task than a much larger general purpose model.

Portability across on-prem, hybrid and public cloud environments

GPUs are currently a scarce commodity, and even cloud providers are having difficulty keeping up with demand. You want a system that lets you go where the GPUs are plentiful, no matter what cloud or other provider they’re running on, and that makes efficient use of the GPUs you get your hands on.

Security, audit and access controls

Who is using your models? What data do they have access to? If you fine-tune a model with all your corporate data for a knowledge management use-case, for example, you run the risk of exposing that data to people in the company who perhaps shouldn’t be able to see it. A better and more careful approach is to use a model that doesn’t have your knowledge base as part of its training, but uses techniques such as retrieval-augmented generation (RAG), where documents are fed into the model at runtime with appropriate security and safety filters.
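The runtime filtering idea can be sketched in a few lines of Python. This is a toy illustration, not a production design: the `Document` type, the group names and the word-overlap ranking are all assumptions made for the example, and a real system would use a proper search index and your existing identity provider. The point it shows is the ordering: documents the user may not see are dropped before anything is ranked or placed in the prompt.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    text: str
    allowed_groups: frozenset


def _words(text):
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(query, docs, user_groups, top_k=2):
    """Return up to top_k documents the user may see, ranked by word overlap."""
    # Access control comes first: filter before ranking.
    visible = [d for d in docs if d.allowed_groups & user_groups]
    scored = [(len(_words(query) & _words(d.text)), d) for d in visible]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in scored[:top_k]]


def build_prompt(query, context_docs):
    context = "\n".join(f"- {d.text}" for d in context_docs)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


# Hypothetical knowledge base with per-document access groups.
docs = [
    Document("The travel policy allows economy flights only.", frozenset({"all-staff"})),
    Document("Executive bonus policy for next year is under review.", frozenset({"hr"})),
]

question = "What is the bonus policy?"
staff_docs = retrieve(question, docs, frozenset({"all-staff"}))
hr_docs = retrieve(question, docs, frozenset({"hr"}))

print(build_prompt(question, staff_docs))
```

With the same question, a general staff member's prompt contains only the travel policy, while an HR user's prompt contains the bonus document. The model itself never needs to have been trained on either.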

Compliance with legal and regulatory standards

Many enterprises operate in domains where regulators can step in to demand details of all aspects of ML. Can you risk having picked a model where you cannot fully describe all the data that it was trained on?

A 5 factor framework for comparing and evaluating Gen AI models

Bookend has a simple 5 factor framework for comparing and evaluating models. Modify and use these for your own situation in line with the checklist above. We will update and revise these along with model standards and publish those revisions regularly.

Factor 1: Model Type

  • What sort of model is it?
  • What modalities does it support?
  • Has it been tuned for a particular use-case?
  • What architecture is it based on?

Factor 2: Model Capability

We define different classes of models, corresponding to their parameter size, to indicate relative capabilities:

  • Class 1: 0–7B parameters
  • Class 2: 7B–30B parameters
  • Class 3: 30B–60B parameters
  • Class 4: greater than 60B parameters
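The classes above can be expressed as a small lookup. Note that the listed ranges share their boundary values (7B, 30B and 60B each appear in two ranges), so the sketch below has to pick a convention: it assumes each upper bound is inclusive, which matches Class 4 being defined as strictly greater than 60B. That convention, and the function name, are assumptions for illustration.

```python
def capability_class(params_billions: float) -> int:
    """Map a parameter count (in billions) to Class 1-4, upper bounds inclusive."""
    if params_billions <= 0:
        raise ValueError("parameter count must be positive")
    if params_billions <= 7:
        return 1
    if params_billions <= 30:
        return 2
    if params_billions <= 60:
        return 3
    return 4


for size in (7, 13, 60, 70):
    print(f"{size}B -> Class {capability_class(size)}")
# 7B -> Class 1, 13B -> Class 2, 60B -> Class 3, 70B -> Class 4
```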

Factor 3: Fine-tuning data requirements

  • What data is needed for a particular use-case?
  • How much data is needed?
  • Will the additional dataset need to be specially formatted for fine-tuning that particular model?

Factor 4: Operations

  • How much memory and compute will a model consume?
  • What GPUs will it run on?
  • What is the most efficient way of running this model for the input it will be consuming?
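A first-pass answer to the memory question can come from simple arithmetic: the weights alone take roughly the parameter count multiplied by the bytes used per parameter. The sketch below applies that rule. The 20% headroom reserved for activations and caches, and the 24 GB and 80 GB GPU sizes in the usage lines, are assumed values for illustration; real requirements depend on batch size, context length and the serving software.

```python
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weights_gb(params_billions: float, precision: str = "fp16") -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billions * BYTES_PER_PARAM[precision]


def fits_on_gpu(params_billions, gpu_memory_gb, precision="fp16", headroom=0.2):
    """True if the weights fit while leaving `headroom` of the memory free."""
    return weights_gb(params_billions, precision) <= gpu_memory_gb * (1 - headroom)


print(weights_gb(7))                # 14.0 GB at 16-bit precision
print(weights_gb(70))               # 140.0 GB at 16-bit precision
print(fits_on_gpu(7, 24))           # True
print(fits_on_gpu(70, 80))          # False
print(fits_on_gpu(70, 80, "int4"))  # True: 35 GB of weights after 4-bit quantization
```

Lower precision shrinks the footprint but can affect output quality, so it is one more tradeoff to evaluate against the use case.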

Factor 5: Governance, Risk and Compliance

  • What access controls are there for this model?
  • What guardrails can be applied to reduce hallucinations?
  • Can the training data of the model be explicitly defined and laid out for auditing purposes?
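One way to put the five factors to work is to record the answers for each candidate in a common structure and filter on the ones that are hard constraints. The sketch below is an assumed shape, not a published schema: the `ModelProfile` fields, the `shortlist` function and the example entries are all hypothetical and would need adapting to your own checklist.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelProfile:
    name: str
    modalities: frozenset          # Factor 1: model type
    tuned_for: str                 # Factor 1: use-case tuning, "" if general purpose
    params_billions: float         # Factor 2: model capability
    fine_tuning_format: str        # Factor 3: data format expected for fine-tuning
    min_gpu_memory_gb: float       # Factor 4: operations
    training_data_auditable: bool  # Factor 5: governance, risk and compliance


def shortlist(candidates, modality, gpu_memory_gb, require_auditable=True):
    """Keep candidates that support the modality, fit the GPU and pass the audit rule."""
    return [
        m for m in candidates
        if modality in m.modalities
        and m.min_gpu_memory_gb <= gpu_memory_gb
        and (m.training_data_auditable or not require_auditable)
    ]


# Illustrative entries only; these are not real models.
candidates = [
    ModelProfile("example-chat-7b", frozenset({"text"}), "chat", 7, "prompt/response pairs", 16, True),
    ModelProfile("example-general-70b", frozenset({"text"}), "", 70, "plain text", 160, True),
    ModelProfile("example-vision-13b", frozenset({"text", "image"}), "", 13, "image/caption pairs", 32, False),
]

picked = shortlist(candidates, modality="text", gpu_memory_gb=40)
print([m.name for m in picked])  # ['example-chat-7b']
```

Writing the answers down in one place also gives non-technical stakeholders something concrete to review alongside the technical team.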

Conclusion

Over 500,000 open source models will soon be available. Choosing among them is a complex decision that involves a number of subjective considerations. While the choices may seem daunting, especially in light of cost, GPU scarcity and many potential risks, a structured process for making a decision can help narrow the choices down to a very manageable set. Enterprises need the freedom, flexibility and assurance to run and manage enterprise-grade workloads with the appropriate licensing, security and scale parameters. Bookend aims to make Safe AI simple by offering enterprise developers the widest choice of secure and flexible deployment options for open source and commercial models.
