Continuation of the last post:
Please visit the first post of the series if you missed it.
🦜 Agenta
Building production-ready LLM-powered applications is currently very difficult. It involves countless iterations of prompt engineering, parameter tuning, and architectures.
Agenta provides you with the tools to quickly do prompt engineering and 🧪 experiment, ⚖️ evaluate, and 🚀 deploy your LLM apps. All without imposing any restrictions on your choice of framework, library, or model.
🎈 Details
🧗 Remarks
The website and app code have excellent UX. The end-to-end user journey, from creation to testing, is beautifully designed.
Can be self-hosted on-prem or on AWS or GCP.
They have different parts:
Playground: create prompts from many predefined templates, such as:
sales_call_summarizer
baby_name_generator
chat_models
completion_models
compose_email
experimental
extract_data_to_json
job_info_extractor
noteGPT
recipes_and_ingredients
sales_transcript_summarizer
sentiment_analysis
Test Sets
Evaluate
API Endpoint
🦚 AgentBench
Large Language Models (LLMs) are becoming increasingly smart and autonomous, targeting real-world pragmatic missions beyond traditional NLP tasks. As a result, there has been an urgent need to evaluate LLMs as agents on challenging tasks in interactive environments. We present AgentBench, a multi-dimensional evolving benchmark that currently consists of 8 distinct environments to assess LLM-as-Agent's reasoning and decision-making abilities in a multi-turn open-ended generation setting. Our extensive test over 25 LLMs (including APIs and open-sourced models) shows that, while top commercial LLMs present a strong ability of acting as agents in complex environments, there is a significant disparity in performance between them and open-sourced competitors.
🎈 Details
🧗 Remarks
This paper evaluates the performance of several LLMs (Llama 2, Vicuna, GPT-X, Dolly, etc.) as intelligent agents in long-chain environments that involve databases (SQL), web booking, and product comparison on the internet. The main question to be answered is whether Llama 2 is superior to ChatGPT at comparing products on the internet.
In this study, an "agent" is an LLM that operates within a simulated environment to achieve a specific goal. The agent's performance is assessed by its ability to complete the assigned tasks.
To date, it's one of the best approaches to evaluating an LLM on various tasks.
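To make the "agent in an environment" idea concrete, here is a minimal sketch of a multi-turn evaluation loop. Everything in it is my own toy assumption: `GuessNumberEnv`, `BinarySearchAgent`, and `success_rate` are hypothetical names, the agent is a simple rule-based stand-in for an LLM, and this is not AgentBench's actual harness or scoring.

```python
class GuessNumberEnv:
    """Toy interactive environment: find a hidden integer within a turn budget."""

    def __init__(self, secret, max_turns):
        self.secret = secret
        self.max_turns = max_turns

    def step(self, guess):
        # The environment answers each action with a text observation.
        if guess == self.secret:
            return "correct"
        return "higher" if guess < self.secret else "lower"


class BinarySearchAgent:
    """Stand-in for an LLM: chooses the next action from the last observation."""

    def __init__(self, low, high):
        self.low, self.high = low, high
        self.last_guess = None

    def act(self, observation):
        if observation == "higher":
            self.low = self.last_guess + 1
        elif observation == "lower":
            self.high = self.last_guess - 1
        self.last_guess = (self.low + self.high) // 2
        return self.last_guess


def run_episode(env, agent):
    """Multi-turn loop: returns True if the task is completed in time."""
    observation = None
    for _ in range(env.max_turns):
        observation = env.step(agent.act(observation))
        if observation == "correct":
            return True
    return False


def success_rate(secrets, max_turns):
    """Fraction of tasks the agent completes, the headline metric of the sketch."""
    outcomes = [
        run_episode(GuessNumberEnv(secret, max_turns), BinarySearchAgent(1, 100))
        for secret in secrets
    ]
    return sum(outcomes) / len(outcomes)
```

Swapping `BinarySearchAgent` for a class that calls a real model would keep the loop and the metric unchanged, which is the point of evaluating LLMs as agents.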
🦃 AI Hero Studio ✨ Prompt Craft ✨ (Beta Phase)
🎈 Details
🧗 Remarks
A detailed prompt experimentation dashboard built on the OpenAI API.
The tool has a prompt auto-completion feature that improves input prompt quality using predefined prompt templates.
Version-wise prompt management.
🦆 Arthur Bench
Today, we’re excited to introduce our newest product: Arthur Bench, the most robust way to evaluate LLMs. Bench is an open-source evaluation tool for comparing LLMs, prompts, and hyperparameters for generative text models. This open source tool will enable businesses to evaluate how different LLMs will perform in real-world scenarios so they can make informed, data-driven decisions when integrating the latest AI technologies into their operations.
Here are some ways in which Arthur Bench helps businesses:
Model Selection & Validation
Budget & Privacy Optimization
Translation of Academic Benchmarks to Real-World Performance
🎈 Details
🧗 Remarks
This tool creates a test suite automatically using datasets.
Periodically validates models for resiliency to model changes outside their control.
The system offers deployment gates that identify anomalous inputs, potential PII leakage, toxicity, and other quality metrics. It learns from production performance to optimize thresholds for these quality gates.
Provides core token-level observability, performance dashboarding, inference debugging, and alerting.
Accelerates ability to identify and debug underperforming regions.
🐿️ Guidance
Guidance enables you to control modern language models more effectively and efficiently than traditional prompting or chaining. Guidance programs allow you to interleave generation, prompting, and logical control into a single continuous flow matching how the language model actually processes the text. Simple output structures like Chain of Thought and its many variants (e.g., ART, Auto-CoT, etc.) have been shown to improve LLM performance. The advent of more powerful LLMs like GPT-4 allows for even richer structure, and guidance makes that structure easier and cheaper.
🎈 Details
🕵️♀️ Features
🔹 Live streaming
Simple, intuitive syntax. Guidance feels like a templating language, and just like standard Handlebars templates, you can do variable interpolation (e.g., {{proverb}}) and logical control.
🔹 Chat dialog
Guidance supports API-based chat models like GPT-4, as well as open chat models like Vicuna through a unified API based on role tags (e.g., {{#system}}...{{/system}}). This allows interactive dialog development that combines rich templating and logical control with modern chat models.
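To show what role tags and variable interpolation amount to, here is a minimal sketch in plain Python. It only mimics the template syntax quoted above; `to_chat_messages` is a hypothetical helper of mine, not part of the Guidance library, and the real library does far more than this.

```python
import re

# Matches {{#role}}...{{/role}} blocks for the three common chat roles.
ROLE_BLOCK = re.compile(r"\{\{#(system|user|assistant)\}\}(.*?)\{\{/\1\}\}", re.DOTALL)
# Matches simple {{variable}} placeholders.
VARIABLE = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def to_chat_messages(template, variables):
    """Turn a role-tagged template into a list of chat messages."""
    messages = []
    for role, body in ROLE_BLOCK.findall(template):
        # Fill in each {{name}} placeholder from the variables dict.
        content = VARIABLE.sub(lambda m: str(variables[m.group(1)]), body).strip()
        messages.append({"role": role, "content": content})
    return messages


template = (
    "{{#system}}You are a helpful assistant.{{/system}}\n"
    "{{#user}}Explain the proverb: {{proverb}}{{/user}}"
)
messages = to_chat_messages(template, {"proverb": "Haste makes waste"})
```

The resulting list of `role`/`content` dictionaries is the shape most chat-model APIs accept, which is why one template can target several backends.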
🔹 Guidance acceleration
When multiple generation or LLM-directed control flow statements are used in a single Guidance program then we can significantly improve inference performance by optimally reusing the Key/Value caches as we progress through the prompt. This means Guidance only asks the LLM to generate the green text below, not the entire program. This cuts this prompt's runtime in half vs. a standard generation approach.
🔹 Token healing
The standard greedy tokenizations used by most language models introduce a subtle and powerful bias that can have all kinds of unintended consequences for your prompts. Using a process we call "token healing", guidance automatically removes these surprising biases, freeing you to focus on designing the prompts you want without worrying about tokenization artifacts.
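As I understand it, the idea is to back up one token at the end of the prompt and then only allow the first generated token to start with the text that was removed. Here is a toy sketch of that idea; the vocabulary, `greedy_tokenize`, and `heal` are all my own assumptions for illustration, not Guidance's implementation.

```python
# Toy vocabulary; real tokenizers have tens of thousands of entries.
VOCAB = ["The", " link", " is", " http", ":", "://", "//", "/"]


def greedy_tokenize(text, vocab):
    """Longest-match-first tokenization over the toy vocabulary."""
    tokens, position = [], 0
    while position < len(text):
        match = max((t for t in vocab if text.startswith(t, position)), key=len)
        tokens.append(match)
        position += len(match)
    return tokens


def heal(prompt, vocab):
    """Drop the last prompt token and list the tokens allowed to replace it."""
    tokens = greedy_tokenize(prompt, vocab)
    removed = tokens[-1]
    # The first generated token must begin with the text we removed.
    allowed = sorted(t for t in vocab if t.startswith(removed))
    return tokens[:-1], allowed


kept, allowed = heal("The link is http:", VOCAB)
```

Without healing, the prompt ends in the standalone ":" token, so the model can no longer pick "://" as a single token; after healing, both ":" and "://" are candidates again.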
🔹 Rich output structure example
To demonstrate the value of output structure, we take a simple task from BigBench, where the goal is to identify whether a given sentence contains an anachronism (a statement that is impossible because of non-overlapping time periods). Below is a simple two-shot prompt for it, with a human-crafted chain-of-thought sequence.
🔹 Guaranteeing valid syntax JSON example
Large language models are great at generating useful outputs, but they are not great at guaranteeing that those outputs follow a specific format. This can cause problems when we want to use the outputs of a language model as input to another system. For example, if we want to use a language model to generate a JSON object, we need to make sure that the output is valid JSON. With guidance, we can both accelerate inference speed and ensure that generated JSON is always valid. Below we generate a random character profile for a game with perfect syntax every time.
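The core trick can be sketched without any library: the program writes every brace, quote, and comma, and the model only fills in the values. In the sketch below, `fake_llm` is a hypothetical stand-in for a real model call and the field names are made up; this shows the general idea, not Guidance's own mechanism.

```python
import json


def fake_llm(field):
    """Hypothetical stand-in for a model call; returns raw text for one field."""
    canned = {"name": 'Ayla "the Swift"', "class": "ranger", "strength": "14"}
    return canned[field]


def generate_character():
    # The program owns the structure; the model only supplies field values.
    profile = {
        "name": fake_llm("name"),
        "class": fake_llm("class"),
        "strength": int(fake_llm("strength")),  # coerce to the expected type
    }
    # json.dumps escapes quotes and special characters inside the values.
    return json.dumps(profile)


character = json.loads(generate_character())
```

Because the model never emits the surrounding syntax, a stray quote in a generated name cannot break the document.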
🔹 Role-based chat model example
Modern chat-style models like ChatGPT and Alpaca are trained with special tokens that mark out "roles" for different areas of the prompt. Guidance supports these models through role tags that automatically map to the correct tokens or API calls for the current LLM. Below we show how a role-based guidance program enables simple multi-step reasoning and planning.
🔹 Agents
We can easily build agents that talk to each other or to a user, via the await command. The await command allows us to pause execution and return a partially executed guidance program. By putting await in a loop, that partially executed program can then be called again and again to form a dialog (or any other structure you design). For example, here is how we might get GPT-4 to simulate two agents talking to one another.
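The pause-and-resume pattern maps nicely onto Python generators, so here is a minimal sketch of two agents taking turns. It is an analogy only: `agent`, `canned_reply`, and `converse` are hypothetical names, the replies are canned rather than generated by GPT-4, and none of this uses Guidance itself.

```python
def canned_reply(name, message):
    """Stand-in for a model call that produces the agent's next line."""
    return f"{name} heard: {message}"


def agent(name, reply_fn):
    """A partially executed program: pauses at each yield until input arrives."""
    incoming = yield  # pause until the first message is sent in
    while True:
        outgoing = reply_fn(name, incoming)
        incoming = yield outgoing  # hand back the reply, then pause again


def converse(first_message, turns):
    """Alternate between two paused agents to form a dialog."""
    speakers = [agent("A", canned_reply), agent("B", canned_reply)]
    for speaker in speakers:
        next(speaker)  # advance each agent to its first pause
    transcript, message = [], first_message
    for turn in range(turns):
        message = speakers[turn % 2].send(message)
        transcript.append(message)
    return transcript


dialog = converse("hello", 2)
```

Each `send` resumes one agent exactly where it paused, which is the same shape as calling a partially executed program again and again in a loop.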
🧗 Remarks
If I had to pick one tool for prompt engineering, I would pick this one.
Community support is superb.
🌳 Galileo LLM Studio
Algorithm-powered LLMOps Platform
Find the best prompt, inspect data errors while fine-tuning, monitor LLM outputs in real-time. All in one powerful, collaborative platform.
🎈 Details
🕵️♀️ Features
🔹 Prompt Engineering
Prompt Inspector.
A detailed, easy-to-use dashboard with multiple parameters and evaluation scores.
Hallucination Score.
🔹 LLM Fine-Tune and Debugging
The watcher function analyzes the input data.
A detailed data-quality dashboard automatically identifies the data that hurts LLM performance.
Fix and track data changes over time.
🔹 Production Monitoring
Real-time LLM Monitoring.
Risk control with customized plugins.
Customized alerts in your Slack.
🧗 Remarks
To date, this is the go-to tool I have found for LLMOps. Developers can use it to push an LLM into production with confidence.
🎄 lakera.ai
An Overview of Lakera Guard – Bringing Enterprise-Grade Security to LLMs with One Line of Code
At Lakera, we supercharge AI developers by enabling them to swiftly identify and eliminate their AI applications’ security threats so that they can focus on building the most exciting applications securely.
Businesses around the world are integrating LLMs into their applications at lightning speeds. At the same time, LLM applications bring completely new types of security risks that organizations need to address.
This is why we’re super excited to introduce Lakera Guard – a developer-first API to bring enterprise-grade security to your LLM applications. It is lightning-fast and can be integrated within minutes. We’ve designed it so that developers love working with it!
🕵️♀️ Features
🔹 Content moderation
These are the categories that Lakera Guard currently evaluates against for inappropriate content in the input prompt.
Hate
Content targeting race, gender, ethnicity, religion, nationality, sexual orientation, disability status, or caste, including violence. Content directed at non-protected groups (e.g., chess players) is exempt.
Sex
Content meant to arouse sexual excitement, such as the description of sexual activity, or that promotes sexual services (excluding sex education and wellness).
🔹 Prompt injections
Jailbreaks
LLMs can be forced into malicious behavior by jailbreak attack prompts. Lakera Guard updates to protect against these.
Prompt injections
Prompt injection attacks must be stopped at all costs. Attackers will do whatever it takes to manipulate the system's behavior or gain unauthorized access. But fear not, Lakera Guard is constantly updated to prevent prompt injections and protect your system from harm.
🔹 Sensitive information
PII stands for Personally Identifiable Information - data that can identify an individual. It requires strict protection due to identity theft and privacy risks. Organizations handling PII must safeguard it to prevent unauthorized access. Laws like GDPR and HIPAA ensure proper PII handling and privacy protection.
🔹 Relevant Language
There are many ways to challenge LLMs using language. Users may:
Use Japanese jailbreaks.
Employ Portuguese prompt injections.
Intentionally include spelling errors in prompts to bypass defenses.
Insert extensive code or special characters into prompts.
Lakera Guard assigns each prompt a score between 0 and 1 to indicate its authenticity. A higher score suggests a genuine attempt at regular communication.
🔹 Unknown links
One way in which prompt injection can be dangerous is phishing.
🧗 Remarks
The roadmap is amazing.
LLM security is a real topic - and they are working on it.
🐣 NightFall AI
Securing generative AI
ChatGPT and other generative AI tools are powerful ways to increase your team's output. But sensitive data such as PII, confidential information, API keys, PHI, and much more can be contained in prompts. Rather than block these tools, use Nightfall's Chrome extension or Developer Platform to:
Automatically redact sensitive data in AI prompts
Safely scale the use of AI tools across your organization
Train users with customized alerts so they learn what data should not be input to AI tools
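To give a feel for what redacting a prompt involves, here is a minimal regex-based sketch. The patterns, labels, and `redact` function are my own simplified assumptions; Nightfall's detectors are certainly more sophisticated, and this is not their API.

```python
import re

# Simplified, illustrative patterns; production detectors need far more care.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "API_KEY": re.compile(r"\bsk-[A-Za-z0-9]{20,}\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def redact(prompt):
    """Replace sensitive spans with labels and report which kinds were found."""
    findings = []
    for label, pattern in PATTERNS.items():
        prompt, count = pattern.subn(f"[{label}]", prompt)
        if count:
            findings.append(label)
    return prompt, findings


clean, found = redact("Email jane.doe@example.com, key sk-abcdefghijklmnopqrstuvwx")
```

The `found` list is what a customized alert could be built on: it tells the user which kind of data they were about to send.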
🎈 Details
🧗 Remarks
A great tool for handling LLM security.
Manage all security tasks in your SIEM or Nightfall dashboard.
Proactively protect your company and customer data.
Identify and manage secrets and keys from a single dashboard.
Train employees on best-practice security policies and build a culture of trust and strong data security hygiene.
Complete visibility of your sensitive data.
🦢 BenchLLM
BenchLLM is a Python-based open-source library that streamlines the testing of Large Language Models (LLMs) and AI-powered applications. It measures the accuracy of your model, agents, or chains by validating responses on any number of tests via LLMs.
🎈 Details
🧗 Remarks
A detailed customizable library to evaluate prompt performance.
A great tool for prompt engineering.
Supports vector retrieval, similarity, orchestrators, and function calling.
Test the responses of your LLM across any number of prompts.
Continuous integration for chains like LangChain, agents like AutoGPT, or LLMs like Llama or GPT-4.
Eliminate flaky chains and create confidence in your code.
Spot inaccurate responses and hallucinations in your application at every version.
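The testing workflow can be sketched in a few lines of plain Python. Note the assumptions: `Case`, `normalize`, and `evaluate` are hypothetical names, and I use a simple normalized string match where BenchLLM validates responses via LLMs; this is not BenchLLM's API.

```python
from dataclasses import dataclass


@dataclass
class Case:
    prompt: str
    expected: list  # any of these answers counts as correct


def normalize(text):
    """Lowercase and collapse whitespace so trivial differences do not fail a test."""
    return " ".join(text.lower().split())


def evaluate(model, cases):
    """Run every case through the model and collect the prompts that failed."""
    failures = []
    for case in cases:
        answer = normalize(model(case.prompt))
        if answer not in {normalize(e) for e in case.expected}:
            failures.append(case.prompt)
    return {"passed": len(cases) - len(failures), "failed": failures}


def toy_model(prompt):
    """Stand-in for a chain, agent, or LLM call."""
    return {"What is 1+1?": " 2 ", "Capital of France?": "Lyon"}[prompt]


report = evaluate(
    toy_model,
    [Case("What is 1+1?", ["2", "two"]), Case("Capital of France?", ["Paris"])],
)
```

Running a suite like this on every version in CI is what lets you spot a regression before it ships.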
🦉 Martian
Dynamically route every prompt to the best LLM. Highest performance, lowest costs, incredibly easy to use.
There are over 250,000 LLMs today. Some are good at coding. Some are good at holding conversations. Some are up to 300x cheaper than others. You could hire an ML engineering team to test every single one — or you can switch to the best one for each request with Martian.
🎈 Details
🧗 Remarks
In the development phase, but I love the idea. It is trying to solve one of the most burning problems in the LLM ecosystem.
There are various models available in the market that specialize in different tasks such as coding and storytelling. The Martian SDK is designed to identify the prompt's intention and utilize various models internally to produce the output.
GPT-4 is 316x costlier than a 7-billion-parameter model: “Don't waste money by paying senior models to do junior work. The model router sends your tasks to the right model.”
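Here is a toy sketch of what routing could look like: classify the prompt, then pick the cheapest model that can handle it. The model names, prices, skills, and keyword rules are all made up for illustration; I do not know how Martian's router works internally, and this is not their SDK.

```python
# Hypothetical catalog; names and prices are invented for the example.
MODELS = [
    {"name": "small-7b", "cost_per_1k_tokens": 0.0002,
     "skills": {"chat", "summarize"}},
    {"name": "large-frontier", "cost_per_1k_tokens": 0.06,
     "skills": {"chat", "summarize", "code"}},
]


def classify(prompt):
    """Very rough intent detection based on keywords."""
    lowered = prompt.lower()
    if any(keyword in lowered for keyword in ("def ", "function", "bug", "code")):
        return "code"
    if "summarize" in lowered or "summary" in lowered:
        return "summarize"
    return "chat"


def route(prompt):
    """Return the cheapest model whose skills cover the prompt's intent."""
    skill = classify(prompt)
    capable = [model for model in MODELS if skill in model["skills"]]
    return min(capable, key=lambda model: model["cost_per_1k_tokens"])["name"]
```

For example, `route("Summarize this call")` stays on the small model, while `route("Fix the bug in this function")` is sent to the large one.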
🐹 Special Mention
🥬 ReLLM
ReLLM was created to fill a need when developing a separate tool. We needed a way to provide long term memory and context to our users, but we also needed to account for permissions and who can see what data.
🥦 LangDock
The GDPR-compliant ChatGPT for your team
🥒 TryTaylor
Taylor AI allows enterprises to train and own their own proprietary fine-tuned LLMs in minutes, not weeks.
🍉 scorecard.ai
Testing for Production-ready LLMs. Ship faster with more confidence.
Integrate in minutes.
🍈 signway.io
Signway is a proxy server that addresses the problem of re-streaming API responses from backend to frontend by allowing the frontend to directly request the API using a pre-signed URL created by Signway. This URL is short-lived, and once it passes verification for authenticity and expiry, Signway will proxy the request to the API and add the necessary authentication headers.
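The pre-signed URL idea can be sketched with an HMAC signature and an expiry timestamp. The secret, parameter names, and URL below are hypothetical, and this is a generic illustration of the technique, not Signway's actual signing scheme.

```python
import hashlib
import hmac
import time
from urllib.parse import parse_qs, urlencode, urlparse

SECRET = b"demo-secret"  # hypothetical shared secret held by backend and proxy


def _signature(base_url, expires):
    payload = f"{base_url}?expires={expires}".encode()
    return hmac.new(SECRET, payload, hashlib.sha256).hexdigest()


def sign_url(base_url, ttl_seconds, now=None):
    """Backend side: create a short-lived URL the frontend can call directly."""
    issued_at = int(now if now is not None else time.time())
    expires = issued_at + ttl_seconds
    query = urlencode({"expires": expires, "signature": _signature(base_url, expires)})
    return f"{base_url}?{query}"


def verify_url(url, now=None):
    """Proxy side: accept only unexpired URLs with a matching signature."""
    parsed = urlparse(url)
    query = parse_qs(parsed.query)
    base_url = f"{parsed.scheme}://{parsed.netloc}{parsed.path}"
    expires = int(query["expires"][0])
    current = now if now is not None else time.time()
    authentic = hmac.compare_digest(_signature(base_url, expires), query["signature"][0])
    return authentic and current <= expires


url = sign_url("https://api.example.com/v1/chat", ttl_seconds=60, now=1000)
```

Only after `verify_url` passes would the proxy forward the request and attach the real API credentials, so the frontend never sees them.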
🥥 mithrilsecurity.io
Deploy AI SaaS to security-demanding organizations
Mithril Security helps software vendors sell SaaS to enterprises, thanks to our secure enclave deployment tooling, which provides SaaS on-prem levels of security and control for customers.
🥝 kobaltlabs
LLMs, made private and secure
Unlock the power of GPT for your most sensitive data with a fast, simple security API
🥭 cadea.ai
Secure AI for Business
Deploy enterprise-level AI tools equipped with e2e data security and role based access control. Our platform helps you create, manage, and monitor chatbots that can answer questions about your internal documents.
🐶 Summary
It is hard to make an apples-to-apples comparison. That is why I have grouped the frameworks (unranked).
🔹 Prompt Engineering (Make Prompts better)
Baserun
PromptTools
DeepEval
Promptfoo
Nvidia NeMo-Guardrails
Agenta
AI Hero Studio
Guidance
Galileo LLM Studio
BenchLLM
🔹 Everything about LLM (Fine-tune, Debugging, Monitoring)
Baserun
Agenta
Nvidia NeMo-Guardrails
AgentBench
Galileo LLM Studio
Martian
🔹 LLM Security (Guard The LLM Fortress)
Nvidia NeMo-Guardrails
Arthur Bench
Galileo LLM Studio
lakera.ai
NightFall AI
I will publish the next Edition on Sunday.
This is the 11th Edition. If you have any feedback, please don’t hesitate to share it with me, and if you love my work, do share it with your colleagues.
It takes time to research and document all this. Please become a paid subscriber and support my work.
Cheers!!
Raahul
Cool overview. I'm on the Bench team at Arthur. If anyone needs help getting started, please reach out at hardik@arthur.ai.