Prerequisites:

- Python 3.10+
- An API key for an LLM provider (e.g., OpenAI, Anthropic, Google)
This guide demonstrates how to write a robust, statistical, end-to-end test for a conversational AI chatbot with SigmaEval and pytest. We'll start by building a simple AI assistant so that we have an application to test. By the end, you will have a complete, runnable example that you can adapt for your own projects.
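Before writing any code, install the packages the example imports. The package names below are assumptions based on the imports used in this guide (sigmaeval, litellm, pytest, plus pytest-asyncio for the @pytest.mark.asyncio marker); check the project's own installation instructions if they differ:

```shell
pip install sigmaeval litellm pytest pytest-asyncio
```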
You will also need to set your API key for the LLM provider you wish to use. SigmaEval supports 100+ LLM providers via LiteLLM, including OpenAI, Anthropic, Google, and local models via Ollama.
```shell
export GEMINI_API_KEY="your-api-key"
```
The code in this tutorial uses Gemini as the LLM provider. If you want to use a different provider, replace the model name in the code with the name of the model you want to use. See the LiteLLM documentation for more information.
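For example, to switch to OpenAI you would set that provider's key and update the model string, which follows LiteLLM's provider/model naming convention (the specific model name here is just an illustration):

```shell
# Set the API key for your chosen provider, e.g. OpenAI:
export OPENAI_API_KEY="your-api-key"

# Then, in app.py and test_app.py, change the model string accordingly,
# e.g. model="openai/gpt-4o" instead of model="gemini/gemini-2.5-flash".
```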
First, let's create a simple but complete AI application that we can test. This assistant is for an e-commerce store and uses a system prompt to define its capabilities.

Create a file named app.py:
app.py
```python
from typing import Any, Dict, List

from litellm import acompletion


# This is the application we want to test.
# It's a simple chatbot that uses a system prompt to define its behavior.
async def app_handler(messages: List[Dict[str, str]], state: Any) -> str:
    system_prompt = {
        "role": "system",
        "content": (
            "You are a helpful assistant for an e-commerce store. "
            "Your main functions are: tracking orders, initiating returns, "
            "answering product questions, and escalating to a human agent. "
            "Keep your answers concise."
        ),
    }

    # In a real app, you would manage conversation history. For this simple
    # example, we'll just prepend the system prompt to the latest user message.
    messages_with_system_prompt = [system_prompt, messages[-1]]

    # Use LiteLLM's async completion function to call the model
    response = await acompletion(
        model="gemini/gemini-2.5-flash",  # Or any other model
        messages=messages_with_system_prompt,
    )
    return response.choices[0].message.content
```
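Before wiring this into SigmaEval, you can sanity-check the handler contract (a list of role/content messages in, a reply string out) with a deterministic stand-in that needs no API key. fake_app_handler and its canned replies below are purely illustrative, not part of the tutorial app:

```python
import asyncio
from typing import Any, Dict, List


# A deterministic stand-in with the same signature as app_handler.
# It lets you exercise the contract without an API key; the canned
# replies are purely illustrative.
async def fake_app_handler(messages: List[Dict[str, str]], state: Any) -> str:
    last_user_message = messages[-1]["content"].lower()
    if "what can you do" in last_user_message:
        return (
            "I can track orders, initiate returns, answer product "
            "questions, and escalate to a human agent."
        )
    return "I'm sorry, could you rephrase that?"


reply = asyncio.run(
    fake_app_handler([{"role": "user", "content": "What can you do?"}], state=None)
)
print(reply)
```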
Now that we have an application, we can write a test for it using SigmaEval and pytest. This test will verify that our assistant not only provides the correct information but also does so in a timely manner.

Create a file named test_app.py in the same directory:
test_app.py
```python
import pytest

from sigmaeval import (
    SigmaEval,
    ScenarioTest,
    assertions,
    metrics,
)

# Import the application handler we want to test
from app import app_handler


@pytest.mark.asyncio
async def test_bot_explains_its_capabilities():
    """
    This test evaluates if the bot correctly and efficiently
    explains its capabilities.
    """
    # 1. Initialize SigmaEval
    sigma_eval = SigmaEval(
        judge_model="gemini/gemini-2.5-flash",
        sample_size=10,  # A smaller sample size for a quick tutorial run
        significance_level=0.05,
    )

    # 2. Define the ScenarioTest
    # This scenario tests two things:
    #   - The bot provides the correct information (.expect_behavior)
    #   - The bot responds quickly enough (.expect_metric)
    scenario = (
        ScenarioTest("Bot explains its capabilities")
        .given("A new user who has not interacted with the bot before")
        .when("The user asks a general question about the bot's capabilities, like 'what can you do?'")
        .expect_behavior(
            "Bot lists its main functions: tracking orders, initiating returns, "
            "answering product questions, and escalating to a human agent.",
            criteria=assertions.scores.proportion_gte(min_score=7, proportion=0.90),
        )
        .expect_metric(
            metrics.per_turn.response_latency,
            # We want to be confident that at least 90% of responses
            # are faster than 2 seconds.
            criteria=assertions.metrics.proportion_lt(threshold=2.0, proportion=0.90),
        )
    )

    # 3. Run the evaluation
    result = await sigma_eval.evaluate(scenario, app_handler)

    # 4. Print the detailed summary for logs and assert the result
    print(result)
    assert result.passed, "The bot capabilities scenario failed."
```
With your app.py and test_app.py files in place, you can run the evaluation from your terminal.
terminal
```shell
pytest
```
When you run the test, SigmaEval will simulate 10 conversations with your bot, have an AI Judge score each one against both of your expectations, and then print a summary of the results. Note that pytest captures stdout by default, so run pytest with the -s flag if you want to see the printed summary for passing tests. The final output will look something like this:
```
--- Result for Scenario: 'Bot explains its capabilities' ---
Overall Status: ✅ PASSED
Summary: 2/2 expectations passed.
Breakdown:
  - [✅ PASSED] Bot lists its main functions: tracking orders, initiating returns, answering product questions, and escalating to a human agent., p-value: 0.0011
  - [✅ PASSED] Per-turn response latency < 2.0 (proportion >= 0.9), p-value: 0.0000

✅ Scenario passed!
========================= 1 passed in 15.34s =========================
```
This output confirms that the test passed because both the behavioral (expect_behavior) and the performance (expect_metric) expectations were met with statistical confidence.
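To build intuition for what "statistical confidence" means here, the sketch below computes a one-sided binomial p-value using only the standard library: the probability of seeing at least this many successes by chance under a given null proportion. This is a generic illustration of the idea, not necessarily the exact test SigmaEval performs:

```python
from math import comb


def binomial_p_value(successes: int, n: int, p0: float) -> float:
    """P(X >= successes) for X ~ Binomial(n, p0): a one-sided p-value
    against the null hypothesis that the true proportion is p0."""
    return sum(
        comb(n, k) * p0**k * (1 - p0) ** (n - k)
        for k in range(successes, n + 1)
    )


# If 10 out of 10 sampled conversations meet an expectation, seeing that
# by chance under a null proportion of 0.5 is very unlikely:
print(round(binomial_p_value(10, 10, 0.5), 4))  # 0.001
```

A small p-value like this is what lets the framework declare a pass with confidence rather than relying on a single lucky conversation.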
Now that you've run your first end-to-end evaluation, you can start applying SigmaEval to your own Gen AI applications. Try modifying the system prompt in app.py or the expectations in test_app.py to see how the results change.