👽 Edition 10: Testing Frameworks for LLMs, Part 1
Documenting the LLM testing frameworks to date.
Back from vacation and the writer’s block ☀️
In this edition, I have meticulously documented every testing framework for LLMs that I've come across on the internet and GitHub.
Basic LLM Testing Framework:
I am listing the frameworks in no particular order, without assigning any specific rank to them.
👩⚖️ DeepEval
DeepEval provides a Pythonic way to run offline evaluations on your LLM pipelines so you can launch comfortably into production. The guiding philosophy is a "Pytest for LLM" that aims to make productionizing and evaluating LLMs as easy as ensuring all tests pass.
DeepEval is a tool for easy and efficient LLM testing. It aims to make writing tests for LLM applications (such as RAG) as easy as writing Python unit tests.
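To make the "Pytest for LLM" idea concrete, here is a minimal sketch of what such a test can look like. It deliberately does not use DeepEval's own API: `fake_llm`, `keyword_overlap`, and the 0.5 threshold are hypothetical stand-ins so the example runs without a model or network access.

```python
import re


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM call, so the sketch runs offline."""
    canned = {
        "What is the capital of France?": "The capital of France is Paris.",
    }
    return canned.get(prompt, "I do not know.")


def tokenize(text: str) -> set:
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def keyword_overlap(output: str, expected: str) -> float:
    """Toy metric: fraction of expected tokens that also appear in the output."""
    expected_tokens = tokenize(expected)
    if not expected_tokens:
        return 0.0
    return len(expected_tokens & tokenize(output)) / len(expected_tokens)


def test_capital_question():
    # A pytest-style test: it fails when the metric drops below the threshold.
    output = fake_llm("What is the capital of France?")
    score = keyword_overlap(output, expected="Paris is the capital of France")
    assert score >= 0.5
```

Saved in a file such as `test_llm.py` (name assumed), this would be picked up by `pytest` like any other unit test. A real setup would swap the toy metric for a model-based one.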
🪂 Metrics
AnswerRelevancy: Depends on "sentence-transformers/multi-qa-MiniLM-L6-cos-v1".
BertScoreMetric: Depends on "sentence-transformers/all-mpnet-base-v2".
Dbias: LLMs can become highly biased after fine-tuning, whether through RLHF or other optimizations. Bias, however, is a very vague term, so the paper focuses on bias in the following areas:
Gender (e.g. "All man hours in his area of responsibility must be approved.")
Age (e.g. "Apply if you are a recent graduate.")
Racial/Ethnicity (e.g. "Police are looking for any black males who may be involved in this case.")
Disability (e.g. "Genuine concern for the elderly and handicapped")
Mental Health (e.g. "Any experience working with retarded people is required for this job.")
Religion
Education
Political ideology
This is measured according to tests with logic following this paper:
BLEUMetric: Computes the BLEU score for a candidate sentence given a reference sentence. Depends on the nltk models.
CohereRerankerMetric
ConceptualSimilarityMetric: Asserts conceptual similarity. Depends on "sentence-transformers/all-mpnet-base-v2".
ranking_similarity: Similarity measure between two different ranked lists. Built on “A Similarity Measure for Indefinite Rankings”.
NonToxicMetric: Built on detoxify.
FactualConsistencyMetric: Depends on "cross-encoder/nli-deberta-v3-large".
EntailmentScoreMetric: Depends on "cross-encoder/nli-deberta-base".
Custom Metrics can be added.
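As an illustration of what a ranked-list similarity computes, here is a minimal sketch of rank-biased overlap, the measure introduced in “A Similarity Measure for Indefinite Rankings”. It is not DeepEval's implementation: the function name and the default `p=0.9` are assumptions, and it only evaluates the truncated prefix sum without the paper's extrapolation step, so two identical lists of length k score 1 - p^k rather than exactly 1.

```python
def rank_biased_overlap(ranking_a, ranking_b, p=0.9):
    """Truncated rank-biased overlap of two ranked lists (no extrapolation).

    p controls how top-weighted the measure is: smaller p puts more weight
    on agreement at the first few ranks.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    depth = min(len(ranking_a), len(ranking_b))
    seen_a, seen_b = set(), set()
    score = 0.0
    for d in range(1, depth + 1):
        seen_a.add(ranking_a[d - 1])
        seen_b.add(ranking_b[d - 1])
        # Agreement at depth d: shared items in the two prefixes, divided by d.
        agreement = len(seen_a & seen_b) / d
        score += (p ** (d - 1)) * agreement
    return (1.0 - p) * score


same = rank_biased_overlap(["a", "b", "c"], ["a", "b", "c"], p=0.5)
swapped = rank_biased_overlap(["a", "b", "c"], ["b", "a", "c"], p=0.5)
print(same, swapped)  # the swapped ranking scores lower than the identical one
```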
🎈 Details
License: Apache-2.0 license
🧗 Remarks
Clean Dashboard.
The metrics are model-derived, and they work well. You can change the underlying model depending on its performance.
Helpful to measure the output quality.
Less Community Support.
🕵️ AgentOps (in development)
🎈 Details
🧗 Remarks
Listing this product because of its exciting LLM debugging roadmap.
baserun.ai💪💪💪
Testing & Observability Platform for LLM Apps
From prompt playground to end-to-end tests, baserun helps you ship your LLM apps with confidence and speed.
Baserun is a great Y Combinator-backed tool for debugging prompts at runtime.
🎈 Details
🧗 Remarks
Clean, detailed dashboard with prompt cost (I loved that).
The evaluation framework is heavily inspired by the OpenAI Evals project and offers a number of built-in evaluations which we record and aggregate in the Baserun dashboard.
The framework simplifies the LLM Debugging workflow.
The tool can help prevent hallucinations to some extent.
Limited scope for customisation.
🐤 PromptTools
Welcome to prompttools, created by Hegel AI! This repo offers a set of open-source, self-hostable tools for experimenting with, testing, and evaluating LLMs, vector databases, and prompts. The core idea is to enable developers to evaluate using familiar interfaces like code, notebooks, and a local playground.
In just a few lines of code, you can test your prompts and parameters across different models (whether you are using OpenAI, Anthropic, or LLaMA models). You can even evaluate the retrieval accuracy of vector databases.
🎈 Details
License: Apache-2.0 license
🪂 Metrics
Experiments and Harnesses
There are two main abstractions used in the prompttools library: Experiments and Harnesses. Occasionally, you may want to use a harness, because it abstracts away more details.
An experiment is a low-level abstraction that takes the cartesian product of possible inputs to an LLM API. For example, the OpenAIChatExperiment accepts lists of inputs for each parameter of the OpenAI Chat Completion API. Then, it constructs and asynchronously executes requests using those potential inputs. An example of using an experiment is here.
A harness is built on top of an experiment and manages abstractions over inputs.
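The "cartesian product of possible inputs" is easiest to see in code. The sketch below is not prompttools' implementation and makes no API calls: `build_request_grid` is a hypothetical helper, and the model names and prompt are placeholder values. It only shows how lists of parameter values expand into one request per combination.

```python
from itertools import product


def build_request_grid(**param_lists):
    """Return one kwargs dict per combination of the given parameter lists."""
    names = list(param_lists)
    return [dict(zip(names, combo)) for combo in product(*param_lists.values())]


grid = build_request_grid(
    model=["gpt-3.5-turbo", "gpt-4"],
    temperature=[0.0, 1.0],
    prompt=["Who was the first president?"],
)

# 2 models x 2 temperatures x 1 prompt = 4 requests to run and compare.
for request in grid:
    print(request)
```

A harness, in this picture, would be a layer that fills in these lists for you, for example by generating the prompts from templates.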
Evaluation and Validation
These built-in functions help you to evaluate the outputs of your experiments. They can also be used as part of your CI/CD system.
You can also manually enter feedback to evaluate prompts, see HumanFeedback.ipynb.
It uses GPT-4 as a judge.
Here is a list of APIs that we support with our experiments:
LLMs
OpenAI (Completion, ChatCompletion, Fine-tuned models) - Supported
LLaMA.Cpp (LLaMA 1, LLaMA 2) - Supported
HuggingFace (Hub API, Inference Endpoints) - Supported
Anthropic - Supported
Google PaLM - Supported
Azure OpenAI Service - Supported
Replicate - Supported
Ollama - In Progress
Vector Databases and Data Utility
Chroma - Supported
Weaviate - Supported
Qdrant - Supported
LanceDB - Supported
Milvus - Exploratory
Pinecone - Exploratory
Epsilla - In Progress
Frameworks
LangChain - Supported
MindsDB - Supported
LlamaIndex - Exploratory
Computer Vision
Stable Diffusion - Supported
Replicate's hosted Stable Diffusion - Supported
🧗 Remarks
I have been using it for the last 15 days. The Streamlit-based dashboard is smooth.
`Prompt Template Experimentation` is a nice feature of the product, but I would like to see more comparison details beyond latency and similarity.
The framework covers LLMs, vector databases, and orchestrators.
Great Community Support.
Great tool for RLHF.
Can’t add a self-hosted server.
🐥 Promptfoo: Test your prompts
promptfoo is a tool for testing and evaluating LLM output quality. With promptfoo, you can:
Systematically test prompts & models against predefined test cases
Evaluate quality and catch regressions by comparing LLM outputs side-by-side
Speed up evaluations with caching and concurrency
Score outputs automatically by defining test cases
Use as a CLI, library, or in CI/CD
Use OpenAI, Anthropic, Azure, open-source models like Llama, or integrate custom API providers for any LLM API
The goal: test-driven prompt engineering, rather than trial-and-error.
🎈 Details
License: MIT license
Here's an example of a side-by-side comparison of multiple prompts and inputs:
It works on the command line too.
🧗 Remarks
A detailed customizable prompt template library.
A great tool for prompt engineering.
Supports the common LLM providers.
You can check different scenarios by adding them to the prompt configuration. Examples:
Verify that the output doesn't contain an "AI language model".
Verify that the output doesn't apologize, using model-graded eval (must not contain an apology).
Prefer shorter outputs using a scoring function.
Avoiding repetition.
Auto-validate output with assertions.
Multiple variables in a single test case.
Other capabilities, such as postprocessing.
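The first and third scenarios above boil down to two kinds of checks: a pass/fail assertion and a numeric score. promptfoo has its own configuration format for these, which is not reproduced here. The Python sketch below only illustrates the logic, and every name in it (`not_contains`, `brevity_score`, `evaluate`, the 200-character budget) is hypothetical. The model-graded apology check is left out because it needs a live model.

```python
def not_contains(output: str, phrase: str) -> bool:
    """Pass when the phrase is absent from the output (case-insensitive)."""
    return phrase.lower() not in output.lower()


def brevity_score(output: str, max_chars: int = 200) -> float:
    """Score in [0, 1] that is higher for shorter outputs."""
    return max(0.0, 1.0 - len(output) / max_chars)


def evaluate(output: str) -> dict:
    """Run the pass/fail checks and attach a brevity score to the result."""
    checks = {
        "no_ai_disclaimer": not_contains(output, "AI language model"),
    }
    return {
        "passed": all(checks.values()),
        "checks": checks,
        "score": brevity_score(output),
    }


for candidate in ["Paris.", "As an AI language model, I cannot answer that."]:
    print(candidate, evaluate(candidate))
```

Keeping assertions (must pass) separate from scores (prefer higher) lets a test run fail hard on regressions while still ranking the outputs that pass.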
🐚 Nvidia NeMo-Guardrails
NeMo Guardrails is an open-source toolkit for easily adding programmable guardrails to LLM-based conversational systems. Guardrails (or "rails" for short) are specific ways of controlling the output of a large language model, such as not talking about politics, responding in a particular way to specific user requests, following a predefined dialog path, using a particular language style, extracting structured data, and more.
This toolkit is currently in its early alpha stages, and we invite the community to contribute towards making the power of trustworthy, safe, and secure LLMs accessible to everyone. The examples provided within the documentation are for educational purposes to get started with NeMo Guardrails, and are not meant for use in production applications.
We are committed to improving the toolkit in the near term to make it easier for developers to build production-grade trustworthy, safe, and secure LLM applications.
NeMo Guardrails will help ensure smart applications powered by large language models (LLMs) are accurate, appropriate, on topic, and secure. The software includes all the code, examples, and documentation businesses need to add safety to AI apps that generate text.
It sits between the user (after vector embedding) and the LLM server it guards. It is open source, so engineers can write their own logic into the guardrails.
NeMo Guardrails enables developers to set up three kinds of boundaries:
Topical guardrails prevent apps from veering off into undesired areas. For example, they keep customer service assistants from answering questions about the weather.
Safety guardrails ensure apps respond with accurate, appropriate information. They can filter out unwanted language and enforce that references are made only to credible sources.
Security guardrails restrict apps to making connections only to external third-party applications known to be safe.
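To give a feel for how a topical guardrail can decide whether a message is on topic, here is a toy sketch. It does not use NeMo Guardrails or Colang: the topic list, the 0.2 threshold, and all function names are hypothetical, and a bag-of-words cosine similarity stands in for the sentence embeddings a real system would use.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Toy stand-in for an embedding: lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical topics a customer service assistant is allowed to discuss.
ALLOWED_TOPICS = {
    "billing": "invoice payment refund charge billing subscription",
    "shipping": "order delivery shipping tracking package",
}


def topical_rail(user_message, threshold=0.2):
    """Return the best matching allowed topic, or None when the message is off topic."""
    message_vec = bag_of_words(user_message)
    best_topic, best_score = None, 0.0
    for topic, description in ALLOWED_TOPICS.items():
        score = cosine(message_vec, bag_of_words(description))
        if score > best_score:
            best_topic, best_score = topic, score
    return best_topic if best_score >= threshold else None


print(topical_rail("Where is my package delivery?"))
print(topical_rail("What is the weather like today?"))
```

When the function returns None, the application would answer with a polite refusal instead of forwarding the message to the LLM, which is the weather example described above.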
🎈 Details
🧗 Remarks
NeMo Guardrails is an easily programmable guardrail toolkit that is a must for production LLM applications.
The conversation designer can define the boundaries of the conversation in plain English using Colang.
The guardrail's filtering policy depends on the embedding space, which makes it more intelligent.
Supports production batching for orchestration.
The community is great.
The most needed framework right now.
I will publish the next Edition with the other five frameworks on Sunday.
This is the 10th Edition. If you have any feedback, please don’t hesitate to share it with me, and if you love my work, do share it with your colleagues.
It takes time to research and document it - Please be a paid subscriber and support my work.
Cheers!!
Raahul