Prerequisites:
  • Python 3.10+
  • An API key for an LLM provider (e.g., OpenAI, Anthropic, Google)
This guide demonstrates how to write a robust, statistical, end-to-end test for a conversational AI chatbot with SigmaEval and pytest. We’ll start by building a simple AI assistant so that we have an application to test. By the end, you will have a complete, runnable example that you can adapt for your own projects.

Installation

First, install the necessary libraries from PyPI. We recommend creating a virtual environment to avoid dependency conflicts.
pip install sigmaeval-framework litellm pytest pytest-asyncio
You will also need to set your API key for the LLM provider you wish to use. SigmaEval supports 100+ LLM providers via LiteLLM, including OpenAI, Anthropic, Google, and local models via Ollama.
export GEMINI_API_KEY="your-api-key"
The code in this tutorial uses Gemini as the LLM provider. To use a different provider, replace the model name in the code with a model your provider supports; see the LiteLLM documentation for the full list of supported models and their identifiers.
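For example, to run the same code against OpenAI instead (an illustrative swap; pick whichever model you prefer), you would set the OpenAI key and use an OpenAI model identifier such as openai/gpt-4o wherever the tutorial uses gemini/gemini-2.5-flash:
export OPENAI_API_KEY="your-api-key"
# then pass model="openai/gpt-4o" in app.py and test_app.py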

Step 1: Build a Simple AI Assistant

First, let’s create a simple but complete AI application that we can test. This assistant is for an e-commerce store and uses a system prompt to define its capabilities. Create a file named app.py:
app.py
from typing import List, Dict, Any
from litellm import acompletion

# This is the application we want to test.
# It's a simple chatbot that uses a system prompt to define its behavior.
async def app_handler(messages: List[Dict[str, str]], state: Any) -> str:
    system_prompt = {
        "role": "system",
        "content": "You are a helpful assistant for an e-commerce store. Your main functions are: tracking orders, initiating returns, answering product questions, and escalating to a human agent. Keep your answers concise."
    }
    
    # In a real app, you would manage conversation history.
    # For this simple example, we'll just prepend the system prompt to the latest user message.
    messages_with_system_prompt = [system_prompt, messages[-1]]

    # Use LiteLLM's async completion function to call the model
    response = await acompletion(
        model="gemini/gemini-2.5-flash", # Or any other model
        messages=messages_with_system_prompt
    )
    
    return response.choices[0].message.content
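The handler above sends only the latest user message, which keeps the example short. As a rough sketch of a more realistic variant (not required for this tutorial; app_handler_with_history is a hypothetical name), you could forward the full conversation history to the model instead:

# Sketch: same handler, but sending the entire conversation history.
async def app_handler_with_history(messages: List[Dict[str, str]], state: Any) -> str:
    system_prompt = {
        "role": "system",
        "content": "You are a helpful assistant for an e-commerce store. Your main functions are: tracking orders, initiating returns, answering product questions, and escalating to a human agent. Keep your answers concise."
    }
    response = await acompletion(
        model="gemini/gemini-2.5-flash",
        # Prepend the system prompt to the whole history, not just messages[-1]
        messages=[system_prompt, *messages]
    )
    return response.choices[0].message.content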

Step 2: Write Your First Evaluation with Pytest

Now that we have an application, we can write a test for it using SigmaEval and pytest. This test will verify that our assistant not only provides the correct information but also does so in a timely manner. Create a file named test_app.py in the same directory:
test_app.py
import pytest
from sigmaeval import (
    SigmaEval,
    ScenarioTest,
    assertions,
    metrics,
)
# Import the application handler we want to test
from app import app_handler

@pytest.mark.asyncio
async def test_bot_explains_its_capabilities():
    """
    This test evaluates if the bot correctly and efficiently explains its capabilities.
    """
    # 1. Initialize SigmaEval
    sigma_eval = SigmaEval(
        judge_model="gemini/gemini-2.5-flash",
        sample_size=10,  # A smaller sample size for a quick tutorial run
        significance_level=0.05
    )
    
    # 2. Define the ScenarioTest
    # This scenario tests two things:
    # - The bot provides the correct information (.expect_behavior)
    # - The bot responds quickly enough (.expect_metric)
    scenario = (
        ScenarioTest("Bot explains its capabilities")
        .given("A new user who has not interacted with the bot before")
        .when("The user asks a general question about the bot's capabilities, like 'what can you do?'")
        .expect_behavior(
            "Bot lists its main functions: tracking orders, initiating returns, answering product questions, and escalating to a human agent.",
            criteria=assertions.scores.proportion_gte(min_score=7, proportion=0.90)
        )
        .expect_metric(
            metrics.per_turn.response_latency,
            # We want to be confident that at least 90% of responses are faster than 2 seconds.
            criteria=assertions.metrics.proportion_lt(threshold=2.0, proportion=0.90)
        )
    )
    
    # 3. Run the evaluation
    result = await sigma_eval.evaluate(scenario, app_handler)
    
    # 4. Print the detailed summary for logs and assert the result
    print(result)
    assert result.passed, "The bot capabilities scenario failed."
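
As your suite grows, you may not want to construct a SigmaEval instance inside every test. A minimal sketch, assuming the same constructor arguments as above (the fixture name and module scope are just illustrative choices), is to share one instance via a pytest fixture:

# Sketch: share a single SigmaEval instance across tests in this module.
@pytest.fixture(scope="module")
def evaluator():
    return SigmaEval(
        judge_model="gemini/gemini-2.5-flash",
        sample_size=10,
        significance_level=0.05,
    )

# Tests then accept the fixture as an argument, e.g.:
#     @pytest.mark.asyncio
#     async def test_something(evaluator):
#         result = await evaluator.evaluate(scenario, app_handler)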

Step 3: Run the Test and Interpret the Results

With your app.py and test_app.py files in place, you can run the evaluation from your terminal.
terminal
pytest
When you run the test, SigmaEval will simulate 10 conversations with your bot, have an AI Judge score each one against both of your expectations, and then print a summary of the results. The final output will look something like this:
--- Result for Scenario: 'Bot explains its capabilities' ---
Overall Status: ✅ PASSED
Summary: 2/2 expectations passed.

Breakdown:
  - [✅ PASSED] Bot lists its main functions: tracking orders, initiating returns, answering product questions, and escalating to a human agent., p-value: 0.0011
  - [✅ PASSED] Per-turn response latency < 2.0 (proportion >= 0.9), p-value: 0.0000

✅ Scenario passed!
========================= 1 passed in 15.34s =========================
This output confirms that the test passed because both the behavioral (expect_behavior) and the performance (expect_metric) expectations were met with statistical confidence.
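If you don’t see the detailed summary in your terminal, note that pytest captures stdout by default, so output from print(result) is normally shown only for failing tests. To see it for passing tests as well, disable capturing:
terminal
pytest -s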

Next Steps

Now that you’ve run your first end-to-end evaluation, you can start applying SigmaEval to your own Gen AI applications. Try modifying the system prompt in app.py or the expectations in test_app.py to see how the results change.
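For example, to make the behavioral check stricter, you could pass a tighter criterion to .expect_behavior in test_app.py (the numbers below are purely illustrative):

# Illustrative: require a score of at least 8 from 95% of conversations
stricter_criteria = assertions.scores.proportion_gte(min_score=8, proportion=0.95)

Rerunning with a stricter criterion is a quick way to see how the p-values and pass/fail status in the summary respond.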