🌻 Edition 29: LLM Metrics That Matter for Industry 📊
How do you measure the quality of an LLM application?
How do we measure the value of an LLM? The answer depends on your environment: academic/scientific or commercial/industry. In industry, kudos in the research community are nice, but revenue performance trumps all. This brief article, the 2nd of two, will cover metrics that matter for industry.
For qualification and consideration of your data strategy, feel free to reach out to patrickocr@ pocr.co.uk. Separately, as the founder of Zolayola, (patrickocr @ zolayola.com) Patrick is building a series of cognitive tools to visualize the laws of thought with supporting products including a data marketplace and universal decision engine built on a new blockchain protocol.
Finding value in use cases is what matters. This typically reduces to “how can an LLM help my organization do things better, or do entirely new things?” Typically, IT projects get funded for their ability to deliver cost savings or revenue enhancements. Timelines and outcomes matter – near-term, highly certain outcomes are preferable to long-term, low-certainty outcomes. Most common use cases typically cluster around:
Customer ChatBot; supporting customers via a natural language interface.
Technical Workflow Speed-Up; code drafting, review, explanation, and tests.
Non-Technical Speed-Up; report summarisation, draft preparation.
Knowledge Base; augmenting an intranet as a retrieval agent.
Creative/Ideation; artwork for marketing/content creation.
There are additional intangible benefits to LLM R&D, such as enhanced corporate prestige and being perceived as a thought leader or innovative company. This can in turn bolster recruitment efforts for technical talent.
Purely financial metrics for evaluation may be premature: the time will come for accountants to supervise AI, but that time is not yet. The outsized rewards for leaders in the space mean that costs need to be subordinate to surviving and thriving post-AI.
Thus, near-term measures of LLM success should be focussed on product issues:
Does it work? As in, does it pass a smell test for a coherent intelligent agent?
Are replies coherent? As in, is the response well formed and plausible?
Are replies truthful? As in, would a knowledgeable person in the field agree?
Are replies useful? As in, does the reply help solve the problem I came with?
There is also a set of subsidiary questions, grouped here as infrastructure issues:
Does the tool return replies without a noticeable delay in user experience?
How easily/quickly can the model be stood up and run across iterations?
Is the cloud computing (GPU) cost sustainable enough to break even?
Are there optimal cost/performance trade-offs across model sizes?
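The latency and break-even questions above can be made concrete with a small amount of arithmetic. The following is a minimal sketch, not a benchmark harness: the function names, the nearest-rank percentile choice, and every figure used with them (GPU price, throughput, revenue per request) are assumptions for illustration only.

```python
import math
import time
from typing import Callable, List


def measure_latencies(call: Callable[[], object], runs: int = 20) -> List[float]:
    """Time repeated calls and return per-call latency in seconds."""
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        latencies.append(time.perf_counter() - start)
    return latencies


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile, with pct in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def cost_per_request(gpu_cost_per_hour: float, requests_per_hour: float) -> float:
    """Spread the hourly GPU bill over the requests served in that hour."""
    return gpu_cost_per_hour / requests_per_hour


def breaks_even(revenue_per_request: float,
                gpu_cost_per_hour: float,
                requests_per_hour: float) -> bool:
    """True when each request earns at least what it costs in GPU time."""
    return revenue_per_request >= cost_per_request(gpu_cost_per_hour, requests_per_hour)


if __name__ == "__main__":
    # Hypothetical stand-in for a real model call.
    samples = measure_latencies(lambda: time.sleep(0.01), runs=10)
    print("p95 latency (s):", round(percentile(samples, 95), 3))
    # Hypothetical figures: $2/hour GPU, 1,000 requests/hour, $0.01 revenue/request.
    print("breaks even:", breaks_even(0.01, 2.0, 1000))
```

Reporting a high percentile (p95) rather than the mean is a deliberate choice here, since a "noticeable delay" is usually felt in the slow tail rather than in the average reply.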
The cost and technical complexity of establishing an LLM may quickly sober up the most enthusiastic of tech advocates. Measurable ROI is still some way off, and most deployments are still in the experimental/exploratory stage. While 2023 saw the emergence of large language models, 2024 will likely see small language models take center stage.
🌼 RAG Evaluation
🌸 Methods
There are primarily two approaches to evaluating the effectiveness of RAG: independent evaluation and end-to-end evaluation.
🏵️ Independent Evaluation
Independent evaluation includes assessing the retrieval module and the generation (read/synthesis) module.
Retrieval Module: A suite of metrics that measure the effectiveness of systems (like search engines, recommendation systems, or information retrieval systems) in ranking items according to queries or tasks are commonly used to evaluate the performance of the RAG retrieval module. Examples include Hit Rate, MRR, NDCG, Precision, etc.
Generation Module: Here the generation module refers to the enhanced or synthesized input formed by adding the retrieved documents to the query, as distinct from the final answer/response generation, which is typically evaluated end-to-end. Its evaluation metrics mainly focus on context relevance, which measures how related the retrieved documents are to the query.
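The retrieval metrics named above (Hit Rate, MRR, NDCG, Precision) can be sketched for a single query as follows. This is an illustrative implementation assuming binary relevance labels and hypothetical document ids; in practice each score is averaged over many queries, and MRR is the mean of the per-query reciprocal ranks.

```python
import math
from typing import List, Set


def hit_rate_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """1.0 if any relevant document appears in the top k, else 0.0."""
    return 1.0 if any(doc in relevant for doc in ranked[:k]) else 0.0


def reciprocal_rank(ranked: List[str], relevant: Set[str]) -> float:
    """1 / position of the first relevant document (0.0 if none is found)."""
    for position, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / position
    return 0.0


def precision_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k slots filled by relevant documents."""
    if k <= 0:
        return 0.0
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / k


def ndcg_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """NDCG with binary gains: DCG of this ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(position + 1)
        for position, doc in enumerate(ranked[:k], start=1)
        if doc in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(position + 1) for position in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


if __name__ == "__main__":
    ranked = ["d3", "d1", "d7"]   # hypothetical retriever output, best first
    relevant = {"d1", "d9"}       # hypothetical ground-truth labels
    print(hit_rate_at_k(ranked, relevant, 3))
    print(reciprocal_rank(ranked, relevant))
    print(precision_at_k(ranked, relevant, 3))
    print(ndcg_at_k(ranked, relevant, 3))
```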
🏵️ End-to-End Evaluation
The end-to-end evaluation assesses the final response generated by the RAG model for a given input, involving the relevance and alignment of the model-generated answers with the input query. From the perspective of content generation goals, evaluation can be divided into unlabeled and labeled content. Unlabeled content evaluation metrics include answer fidelity, answer relevance, harmlessness, etc., while labeled content evaluation metrics include.
🌸 Evaluation Frameworks
🏵️ RAGAS
This framework considers the retrieval system’s ability to identify relevant and key context paragraphs, the LLM’s ability to use these paragraphs faithfully, and the quality of the generation itself. RAGAS is an evaluation framework based on simple handwritten prompts, using these prompts to measure the three aspects of quality - answer faithfulness, answer relevance, and context relevance - in a fully automated manner.
🏵️ ARES
ARES aims to automatically evaluate the performance of RAG systems in three aspects: Context Relevance, Answer Faithfulness, and Answer Relevance. These evaluation metrics are similar to those in RAGAS. However, RAGAS, being a newer evaluation framework based on simple handwritten prompts, has limited adaptability to new RAG evaluation settings, and addressing this limitation is one of the contributions of the ARES work.
🌸 Key Metrics and Abilities
🏵️ Faithfulness
This metric emphasizes that the answers generated by the model must remain true to the given context, ensuring that the answers are consistent with the context information and do not deviate from or contradict it. This aspect of evaluation is vital for addressing hallucinations in large models.
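One common way to turn faithfulness into a number is to split the answer into statements and report the fraction that the context supports. The sketch below follows that shape under loud assumptions: the sentence splitter and the token-overlap judge are naive stand-ins for the LLM prompts a framework such as RAGAS would use, and the 0.8 threshold is arbitrary.

```python
import re
from typing import Callable, List


def split_statements(answer: str) -> List[str]:
    """Naive sentence splitter (stand-in for LLM-based statement extraction)."""
    parts = re.split(r"(?<=[.!?])\s+", answer.strip())
    return [part for part in parts if part]


def token_overlap_judge(statement: str, context: str, threshold: float = 0.8) -> bool:
    """Crude support check: most statement tokens must also occur in the context.

    This is an assumption for illustration; a real judge would be an LLM or NLI model.
    """
    tokens = set(re.findall(r"[a-z0-9]+", statement.lower()))
    if not tokens:
        return False
    context_tokens = set(re.findall(r"[a-z0-9]+", context.lower()))
    return len(tokens & context_tokens) / len(tokens) >= threshold


def faithfulness(answer: str,
                 context: str,
                 judge: Callable[[str, str], bool] = token_overlap_judge) -> float:
    """Fraction of answer statements that the judge considers supported."""
    statements = split_statements(answer)
    if not statements:
        return 0.0
    supported = sum(1 for statement in statements if judge(statement, context))
    return supported / len(statements)


if __name__ == "__main__":
    context = "Paris is the capital of France. It lies on the Seine."
    answer = "Paris is the capital of France. Paris has ten million cats."
    print(faithfulness(answer, context))
```

The judge is passed in as a parameter so that the naive overlap check can be swapped for a stronger model without changing the scoring logic.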
🏵️ Answer Relevance
This metric stresses that the generated answers need to be directly related to the posed question.
🏵️ Context Relevance
This metric demands that the retrieved contextual information be as accurate and targeted as possible, avoiding irrelevant content. After all, processing long texts is costly for LLMs, and too much irrelevant information can reduce the efficiency of LLMs in utilizing context. The OpenAI report also mentioned “Context Recall” as a supplementary metric, measuring the model’s ability to retrieve all relevant information needed to answer a question. This metric reflects the search optimization level of the RAG retrieval module. A low recall rate indicates a potential need for the optimization of the search functionality, such as introducing re-ranking mechanisms or fine-tuning embeddings to ensure more relevant content retrieval.
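When ground-truth labels exist for which passages a question needs, context relevance and context recall reduce to simple set ratios. The following sketch assumes hypothetical passage ids and an arbitrary 0.8 recall threshold; label-free frameworks estimate the same quantities with LLM prompts instead.

```python
from typing import Iterable, Set


def context_recall(retrieved: Iterable[str], needed: Set[str]) -> float:
    """Share of the needed passages that the retriever actually returned."""
    if not needed:
        return 1.0  # nothing was required, so nothing was missed
    return len(set(retrieved) & needed) / len(needed)


def context_relevance(retrieved: Iterable[str], needed: Set[str]) -> float:
    """Share of the retrieved passages that were actually needed."""
    retrieved_set = set(retrieved)
    if not retrieved_set:
        return 0.0
    return len(retrieved_set & needed) / len(retrieved_set)


def needs_retrieval_tuning(recall: float, threshold: float = 0.8) -> bool:
    """Flag low recall, e.g. as a prompt to try re-ranking or embedding fine-tuning."""
    return recall < threshold


if __name__ == "__main__":
    retrieved = ["p1", "p2", "p5", "p6"]  # hypothetical retriever output
    needed = {"p1", "p2", "p3"}           # hypothetical ground-truth passages
    recall = context_recall(retrieved, needed)
    print("recall:", recall)
    print("relevance:", context_relevance(retrieved, needed))
    print("tune retrieval:", needs_retrieval_tuning(recall))
```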
🌸 Special Mention: DeepEval
DeepEval is a tool for easy and efficient LLM testing. It provides a Pythonic way to run offline evaluations on your LLM pipelines so you can launch comfortably into production. The guiding philosophy is a "Pytest for LLM": writing tests for LLM applications (such as RAG) should be as easy as writing Python unit tests, and productionizing should be as easy as ensuring all tests pass.
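To illustrate the "tests pass, then ship" idea in plain Python, here is a minimal pytest-style sketch. It does not use DeepEval's API: `generate_answer` is a hypothetical stub for your own pipeline, `answer_relevance` is a crude token-overlap stand-in for an embedding-based metric, and the 0.3 threshold is an arbitrary assumption.

```python
import re
from typing import Set

RELEVANCE_THRESHOLD = 0.3  # hypothetical pass mark, tune for your own data


def _tokens(text: str) -> Set[str]:
    """Lowercased alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def answer_relevance(question: str, answer: str) -> float:
    """Jaccard token overlap (crude stand-in for an embedding-based score)."""
    question_tokens, answer_tokens = _tokens(question), _tokens(answer)
    if not question_tokens or not answer_tokens:
        return 0.0
    return len(question_tokens & answer_tokens) / len(question_tokens | answer_tokens)


def generate_answer(question: str) -> str:
    """Hypothetical stub; replace with a call to the real LLM pipeline."""
    return "The refund window is 30 days from the date of purchase."


def test_refund_answer_is_relevant():
    """Discovered and run by pytest like any other unit test."""
    question = "What is the refund window after purchase?"
    answer = generate_answer(question)
    assert answer_relevance(question, answer) >= RELEVANCE_THRESHOLD
```

Saved as a file whose name starts with `test_`, this runs with the ordinary `pytest` command, so an LLM quality check can sit in the same CI job as the rest of the test suite.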
🏵️ Details
License: Apache-2.0
🏵️ Metrics
AnswerRelevancy: depends on "sentence-transformers/multi-qa-MiniLM-L6-cos-v1"
BertScoreMetric: depends on "sentence-transformers/all-mpnet-base-v2"
Dbias: LLMs can become highly biased after fine-tuning via RLHF or other optimizations. Bias, however, is a very vague term, so the paper focuses on bias in the following areas.
Gender (e.g. "All man hours in his area of responsibility must be approved.")
Age (e.g. "Apply if you are a recent graduate.")
Racial/Ethnicity (e.g. "Police are looking for any black males who may be involved in this case.")
Disability (e.g. "Genuine concern for the elderly and handicapped")
Mental Health (e.g. "Any experience working with retarded people is required for this job.")
Religion
Education
Political ideology
This is measured according to tests with logic following this paper.
I will publish the next Edition on Sunday.
This is the 29th Edition. If you have any feedback, please don’t hesitate to share it with me, and if you love my work, do share it with your colleagues.
It takes time to research and document it. Please consider becoming a paid subscriber to support my work.
Cheers!!
Raahul