Introducing bookend Piton: a 3x improvement in embeddings performance vs. traditional transformer models
Summary
Today we’re excited to introduce our advanced embeddings model (codenamed “Piton”), which delivers a 3x performance improvement over standard sentence transformers. This release brings significant enhancements to document ranking: improvements in embedding efficiency allow developers to quickly rank documents by relevance, a critical capability for large datasets. We are also introducing techniques for managing embeddings efficiently, which reduce infrastructure costs.
With Piton, developers can easily:
- Improve the quality and performance of search applications
- Improve the relevance and performance of retrieval-augmented generation (RAG)
Issues with embeddings today
Embeddings are a critical component of common AI applications, including semantic / vector search and retrieval-augmented generation (RAG), which combines retrieval models and generative models to improve the quality and relevance of generated text.
Embeddings are generated by models trained on large datasets. These models can focus on a single modality, most often text (e.g. OpenAI’s text-embedding-ada-002), or handle text and images together as a multimodal embedding model (like CLIP).
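For illustration, here is a minimal sketch of generating a text embedding with text-embedding-ada-002 via the OpenAI Python SDK. The model name comes from the paragraph above; the sample input string is our own, and the snippet assumes an `OPENAI_API_KEY` is set in the environment.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Request an embedding for a single piece of text
response = client.embeddings.create(
    model="text-embedding-ada-002",
    input="A pair of running shoes with extra arch support",
)

vector = response.data[0].embedding  # 1536-dimensional list of floats
print(len(vector))
```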
Because embeddings represent complex data concisely in a dimensional space, they are often used in consumer-facing applications such as search, recommendations, image classification, and ecommerce. These low-latency, high-throughput applications require embeddings to be fast, and most users also want generating embeddings and managing models to be cost-effective as well as performant.
The cost implications of running embeddings can be significant. Three main factors are involved:
- Compute: embeddings require substantial memory and compute power, both to generate and to store. Factors such as context length and parameter count dramatically increase compute requirements.
- Training: training a model to generate embeddings is computationally expensive. Most users will use an off-the-shelf model such as text-embedding-ada-002 or an open-source model, although models trained on a user’s own data often perform much better.
- Inference serving: the cost and latency of generating embeddings at inference time depend heavily on the previously mentioned context length and parameter count, as well as model size, hardware, input data size, and batch size.
The cost and performance of running embeddings also depend on the model implementation and the hardware it runs on. Running embeddings on GPUs is faster and more efficient than running them on CPUs, but GPUs cost more and are subject to supply constraints.
While there are plenty of open-source options for generating embeddings, cost and performance vary widely. One very popular method for serving embeddings is the sentence-transformers library. The library has been a standard for embedding training for years, supports over a hundred different models, and remains a great platform for research. However, it is not optimized for inference performance.
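As a point of reference, a baseline document-ranking flow with sentence-transformers looks roughly like this. The model is the same BAAI/bge-small-en-v1.5 checkpoint used in the benchmark below; the query and documents are our own illustrative examples.

```python
from sentence_transformers import SentenceTransformer, util

# Load an off-the-shelf embedding model
model = SentenceTransformer("BAAI/bge-small-en-v1.5")

docs = [
    "Free shipping on orders over $50",
    "How to return a damaged item",
    "Track the status of your order",
]
query = "where is my package"

# Encode documents and query into normalized embedding vectors
doc_emb = model.encode(docs, convert_to_tensor=True, normalize_embeddings=True)
query_emb = model.encode(query, convert_to_tensor=True, normalize_embeddings=True)

# Rank documents by cosine similarity to the query
scores = util.cos_sim(query_emb, doc_emb)[0]
for doc, score in sorted(zip(docs, scores.tolist()), key=lambda x: x[1], reverse=True):
    print(f"{score:.3f}  {doc}")
```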
Improvements with bookend optimized models
Bookend offers optimizations on top of these standard libraries. We take these models and export them to the ONNX (Open Neural Network Exchange) format, applying model graph and operator fusion optimizations at export time. We then deploy the optimized models on a GPU behind a high-performance inference wrapper.
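Bookend’s own export pipeline isn’t shown in this post, but the general approach can be sketched with Hugging Face Optimum and ONNX Runtime. The model id matches the benchmark below; the output directory name and the specific optimization settings are our assumptions, not bookend’s actual configuration.

```python
from optimum.onnxruntime import ORTModelForFeatureExtraction, ORTOptimizer
from optimum.onnxruntime.configuration import OptimizationConfig

model_id = "BAAI/bge-small-en-v1.5"

# Export the PyTorch checkpoint to ONNX
model = ORTModelForFeatureExtraction.from_pretrained(model_id, export=True)

# Apply graph-level and operator-fusion optimizations at export time
optimizer = ORTOptimizer.from_pretrained(model)
opt_config = OptimizationConfig(
    optimization_level=2,    # extended graph optimizations, including fusions
    optimize_for_gpu=True,   # target GPU execution
    fp16=True,               # half-precision for faster GPU inference
)
optimizer.optimize(save_dir="bge-small-onnx-optimized", optimization_config=opt_config)
```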
Using the BAAI/bge-small-en-v1.5 model to produce 384-dimensional embeddings, bookend’s optimizations delivered 3x better throughput and latency than the standard sentence-transformers baseline. Even amid the current shortage of high-end GPUs, these results (run on L4-equivalent hardware) show that bookend-optimized, GPU-powered embeddings can handle sizable search workloads without breaking a sweat.
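Our benchmark harness isn’t included in this post, but a rough throughput measurement against an ONNX Runtime model can be sketched as follows. The directory and file names refer to the hypothetical export sketch above, and the snippet assumes ONNX Runtime’s CUDA execution provider is installed.

```python
import time
from optimum.onnxruntime import ORTModelForFeatureExtraction
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-small-en-v1.5")
model = ORTModelForFeatureExtraction.from_pretrained(
    "bge-small-onnx-optimized",        # directory written by the export sketch above
    file_name="model_optimized.onnx",  # default name produced by ORTOptimizer
    provider="CUDAExecutionProvider",
)

batch = ["where is my package"] * 64
inputs = tokenizer(batch, padding=True, truncation=True, return_tensors="pt").to("cuda")

model(**inputs)  # warm-up run

# Time repeated batches to estimate sustained throughput
runs = 50
start = time.perf_counter()
for _ in range(runs):
    model(**inputs)
elapsed = time.perf_counter() - start
print(f"~{runs * len(batch) / elapsed:.0f} sentences/sec")
```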
Safe AI, Simplified
Our mission at bookend is to make Safe AI simple. To make that happen, we are building the most comprehensive set of tools for developers creating generative AI-powered applications for the enterprise.