Evaluate AI Model Quality With Prompt and Response Grading
In a recent webinar on best practices for testing AI applications, I mentioned prompt and response grading as a way to improve model quality. I didn’t have time to go into detail during the webinar, so I thought I’d explore it further in this post. Read on to learn what prompt and response grading is, why it’s useful, and how to execute this type of testing.
What is prompt and response grading?
Prompt and response grading is a systematic approach used to evaluate the quality and accuracy of generative AI (Gen AI) outputs in response to specific user inputs. Grading typically assesses multiple factors, including relevance, accuracy, completeness, clarity, and language quality of the AI-generated responses. Testers rate each response against a rubric to identify where the model performs as expected and where it needs improvement. For example, a tester might provide a prompt such as, “What are the steps to reset my password?” and then evaluate the AI’s response based on its relevance, accuracy, and clarity.
Why use prompt and response grading?
There are three main reasons to use prompt and response grading:
- To ensure quality. Grading ensures that the responses Gen AI systems produce meet high standards of accuracy, relevance, and clarity, thereby providing value to users.
- To drive continuous improvement. Feedback from grading helps developers refine AI models, improving their performance and reducing errors over time.
- To demonstrate accountability. This method provides a transparent mechanism to evaluate AI performance, which is critical for building trust among users and stakeholders.
Preparing for prompt and response grading
Before you can start testing, you need to develop a grading framework that defines what criteria you’re assessing and how much weight each carries. Teams typically evaluate accuracy and relevance as well as criteria like completeness, clarity, language and grammar, tone and style, adaptability, and error handling.
Tone and style are crucial for ensuring the AI aligns with the intended audience and use case, such as maintaining a formal tone in legal applications or a friendly tone in customer service bots. Adaptability measures how well the AI adjusts its responses to user context, such as handling follow-up questions or clarifications seamlessly. Accuracy and relevance usually carry the most weight, often accounting for 60 percent or more of the overall evaluation score. This emphasis reflects their fundamental importance to the quality and correctness of the generated response.
A high-level grading rubric for evaluating Gen AI prompts and responses outlines the categories testers need to assess and the weightings for each category. Here’s an example high-level framework:
- Accuracy (40%): The correctness of the information provided in the AI response
- Relevance (30%): The extent to which the response directly addresses the prompt
- Completeness (15%): The degree to which the response covers all necessary aspects of the prompt
- Clarity (10%): How easy the response is to understand, including its structure and presentation
- Language and grammar (5%): The correctness and appropriateness of the language, including grammar, spelling and punctuation
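If you manage this framework programmatically, a simple weight table is enough to get started. Here’s a minimal Python sketch; the category names and weights mirror the example framework above, and nothing about the structure is prescribed by any particular tool:

```python
# Hypothetical weight table mirroring the example framework above.
RUBRIC_WEIGHTS = {
    "accuracy": 40,
    "relevance": 30,
    "completeness": 15,
    "clarity": 10,
    "language_and_grammar": 5,
}

# A quick sanity check that the weights cover the full 100%.
assert sum(RUBRIC_WEIGHTS.values()) == 100, "Rubric weights should total 100%"
```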
You’ll also need to provide a detailed scoring rubric that clearly describes how testers should evaluate each criterion. The choice of scoring scale (binary, 3-point, or 5-point) depends on how specific or subjective the AI system’s responses are. Highly subjective responses may warrant a 5-point scale, more specific responses a 3-point scale, and responses with a clear right or wrong answer a binary score. Here’s an example of a 5-point rubric for rating accuracy:
Accuracy (40%)
- 5 (Excellent): The response is factually accurate with no errors. Information is correct and reliable.
- 4 (Good): The response is mostly accurate with minor factual errors or omissions.
- 3 (Satisfactory): The response has some accurate information but also contains noticeable errors or omissions.
- 2 (Needs Improvement): The response has significant factual errors or omissions, making the information unreliable.
- 1 (Poor): The response is mostly inaccurate with major errors, providing misleading or wrong information.
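To roll per-criterion ratings up into a single score, one simple approach (a sketch, not a required formula) is to scale each rating by its weight so a perfect response earns 100 points. Assuming the 1-to-5 scale and the example weights above:

```python
# Same weights as the example framework above.
WEIGHTS = {"accuracy": 40, "relevance": 30, "completeness": 15, "clarity": 10, "language_and_grammar": 5}

def weighted_score(ratings: dict, weights: dict, max_points: int = 5) -> float:
    """Convert per-criterion ratings (1..max_points) into a 0-100 weighted score."""
    return sum((rating / max_points) * weights[criterion] for criterion, rating in ratings.items())

# Example: a response rated 4/5 on accuracy, 5/5 on relevance, 3/5 on completeness, and so on.
ratings = {"accuracy": 4, "relevance": 5, "completeness": 3, "clarity": 4, "language_and_grammar": 5}
print(weighted_score(ratings, WEIGHTS))  # 84.0
```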
If you wanted to use a 3-point rubric, you’d simply adjust the weight multipliers. Here’s an example of a 3-point rubric for rating clarity: each criterion is scored on a scale of 1 to 3 and then multiplied by its weight multiplier, so the total possible score is still 100, just as with the 5-point scale.
Clarity (10%), multiplier 3.33 (10% weight, so 3 points x 3.33 ≈ 10 possible points)
- 3 (Excellent): The response is very clear, well-structured, and easy to understand with no ambiguities.
- 2 (Satisfactory): The response is mostly clear but may have minor structural or clarity issues.
- 1 (Poor): The response is unclear, disorganized, or difficult to understand.
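The multiplier itself falls out of the same arithmetic: it’s simply the category weight divided by the maximum points on the scale, so a perfect score still totals 100. Here’s a quick sketch using the clarity example above:

```python
def multiplier(weight_percent: float, max_points: int = 3) -> float:
    """Points-to-weight multiplier, e.g. a 10% weight on a 3-point scale -> 3.33."""
    return weight_percent / max_points

clarity_points = 2                                   # tester's 3-point clarity rating
clarity_weighted = clarity_points * multiplier(10)   # 2 * 3.33... = 6.67 of 10 possible points
print(round(clarity_weighted, 2))                    # 6.67
```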
Grading AI prompt responses
When it’s time to test, you’ll share the full rubric with the testing team so they understand how to rate responses. You’ll also want to provide guidance or parameters around the types of prompts they should create. In addition, you may provide categories for different types of issues under each rating criterion to help testers quantify their ratings.
Let’s walk through an example based on a customer service chatbot for an application that provides vehicle rentals. We’ll ask testers to focus on a specific use case: What if a customer’s travel plans change? Then we’ll provide tester instructions on the types of prompts they should enter. The instructions might say something like, “Questions should address topics such as the process for modifying or extending reservations, what happens if plans shift (such as a flight delay), and cancellation policies. The goal is to ensure that the chatbot is able to accurately explain options for making changes and advise users about any associated fees or penalties.”
In this case, testers will rate responses for accuracy, relevance, clarity, and usefulness on a five-point scale (excellent, good, average, fair, and poor). In addition to ratings, testers can also provide feedback on each response. Here’s what that might look like in practice.
Prompt: What happens if I can’t return the vehicle at the scheduled time?
Response: If you can’t return the vehicle on time, you need to request an extension via the app as soon as possible. If the extension is denied or can’t be requested, you must return the vehicle at the scheduled end time. Late returns may incur additional fees and void your protection plan.
Tester’s rating:
- Accuracy = Average. Partial information provided.
- Relevance = Good.
- Clarity = Average. Vague or unclear explanation.
- Usefulness = Poor.
Tester’s comments: The response didn’t tell me where in the app to go to request an extension. I also wasn’t sure what the bot meant by “if the extension can’t be requested.” Because of a problem with the app? Because the vehicle was booked to someone else after a certain time? Because I didn’t have connectivity and couldn’t request it? That was confusing.
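Capturing each graded response as a structured record makes the results easier to aggregate later. Here’s a minimal sketch of what such a record might look like; the field names are illustrative rather than taken from any specific tool:

```python
from dataclasses import dataclass

@dataclass
class GradedResponse:
    prompt: str
    response: str
    ratings: dict      # criterion -> rating label, e.g. {"accuracy": "average"}
    comments: str = ""

example = GradedResponse(
    prompt="What happens if I can't return the vehicle at the scheduled time?",
    response="If you can't return the vehicle on time, you need to request an extension via the app...",
    ratings={"accuracy": "average", "relevance": "good", "clarity": "average", "usefulness": "poor"},
    comments="The response didn't say where in the app to request an extension.",
)
```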
Use grades to identify where the model needs further training or refinement
Ideally, the grades will help you identify patterns that uncover areas where you need to do more work on the model to improve outputs. These patterns can also help your team prioritize key areas that need to be addressed. Tester ratings can also provide a common framework for discussing outputs and guide conversations between developers, product managers, and other stakeholders. Using feedback from the grading process to refine the AI model and improve its performance on future prompts is essential for ongoing development.
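One straightforward way to surface those patterns is to average scores per criterion across all graded responses and rank the weakest areas first. Here’s a sketch, assuming ratings have already been converted to numeric scores:

```python
from collections import defaultdict

def weakest_criteria(graded: list[dict]) -> list[tuple[str, float]]:
    """Average each criterion's numeric scores across responses and return them worst-first."""
    totals, counts = defaultdict(float), defaultdict(int)
    for record in graded:
        for criterion, score in record.items():
            totals[criterion] += score
            counts[criterion] += 1
    return sorted(((c, totals[c] / counts[c]) for c in totals), key=lambda item: item[1])

# Example: two graded responses scored 1-5 per criterion.
print(weakest_criteria([
    {"accuracy": 3, "relevance": 4, "clarity": 3, "usefulness": 1},
    {"accuracy": 4, "relevance": 5, "clarity": 2, "usefulness": 2},
]))  # usefulness has the lowest average, so it appears first
```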