# Tutorial: Testing Conversational AI with pytest and SigmaEval

> Learn how to write an end-to-end test for an AI chatbot using pytest and SigmaEval.

<Info>
  **Prerequisites**:

  * Python 3.10+
  * An API key for an LLM provider (e.g., OpenAI, Anthropic, Google)
</Info>

This guide shows how to write a statistical, end-to-end test for a conversational AI chatbot with SigmaEval and `pytest`. We'll first build a simple AI assistant so that we have an application to test. By the end, you will have a complete, runnable example that you can adapt for your own projects.

## Installation

First, install the necessary libraries from PyPI. We recommend creating a virtual environment to avoid dependency conflicts.

```bash theme={null}
pip install sigmaeval-framework litellm pytest pytest-asyncio
```

You will also need to set your API key for the LLM provider you wish to use. SigmaEval supports 100+ LLM providers via [LiteLLM](https://docs.litellm.ai/docs/providers/), including OpenAI, Anthropic, Google, and local models via Ollama.

<CodeGroup>
  ```bash Gemini theme={null}
  export GEMINI_API_KEY="your-api-key"
  ```

  ```bash OpenAI theme={null}
  export OPENAI_API_KEY="your-api-key"
  ```

  ```bash Anthropic theme={null}
  export ANTHROPIC_API_KEY="your-api-key"
  ```

  ```bash Ollama theme={null}
  # No API key needed for local Ollama models
  # Set the model name directly, e.g., "ollama/llama3"
  ```
</CodeGroup>

<Note>The code in this tutorial uses Gemini as the LLM provider. To use a different provider, replace the model name in the code with the name of the model you want to use. See the [LiteLLM documentation](https://docs.litellm.ai/docs/providers/) for provider-specific model names.</Note>
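
One way to make the provider swap a one-line change is to read the model name from an environment variable. The sketch below is an illustration only: the `TUTORIAL_MODEL` variable and the `resolve_model` helper are hypothetical names introduced here, not part of SigmaEval or LiteLLM. The default matches the Gemini model used in this tutorial.

```python
import os
from typing import Mapping, Optional

# Default model used throughout this tutorial.
DEFAULT_MODEL = "gemini/gemini-2.5-flash"


def resolve_model(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the LiteLLM model name to use.

    TUTORIAL_MODEL is a hypothetical variable name chosen for this sketch.
    """
    env = os.environ if env is None else env
    # Treat an empty or whitespace-only value as unset.
    return env.get("TUTORIAL_MODEL", "").strip() or DEFAULT_MODEL
```

You could then pass `resolve_model()` wherever the tutorial hard-codes a model name, and switch providers with, for example, `export TUTORIAL_MODEL="ollama/llama3"`.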

## Step 1: Build a Simple AI Assistant

First, let's create a simple but complete AI application that we can test. This assistant is for an e-commerce store and uses a system prompt to define its capabilities.

Create a file named `app.py`:

```python app.py theme={null}
from typing import List, Dict, Any
from litellm import acompletion

# This is the application we want to test.
# It's a simple chatbot that uses a system prompt to define its behavior.
async def app_handler(messages: List[Dict[str, str]], state: Any) -> str:
    system_prompt = {
        "role": "system",
        "content": "You are a helpful assistant for an e-commerce store. Your main functions are: tracking orders, initiating returns, answering product questions, and escalating to a human agent. Keep your answers concise."
    }
    
    # In a real app, you would manage conversation history.
    # For this simple example, we'll just prepend the system prompt to the latest user message.
    messages_with_system_prompt = [system_prompt, messages[-1]]

    # Use LiteLLM's async completion function to call the model
    response = await acompletion(
        model="gemini/gemini-2.5-flash", # Or any other model
        messages=messages_with_system_prompt
    )
    
    return response.choices[0].message.content
```
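
The handler above sends only the latest user message to the model. If your assistant needs multi-turn context, one option is to forward the recent history as well. The following is a minimal sketch, assuming `messages` is a list of `{"role": ..., "content": ...}` dictionaries as in `app_handler`. The names `build_messages` and `max_history` are hypothetical and introduced only for this example.

```python
from typing import Dict, List


def build_messages(
    system_prompt: Dict[str, str],
    messages: List[Dict[str, str]],
    max_history: int = 10,
) -> List[Dict[str, str]]:
    """Prepend the system prompt to the most recent `max_history` messages."""
    if max_history < 1:
        raise ValueError("max_history must be at least 1")
    # Slicing creates a new list, so the caller's history is left unmodified.
    recent = messages[-max_history:]
    return [system_prompt, *recent]
```

In `app_handler`, a call such as `build_messages(system_prompt, messages)` could replace the `[system_prompt, messages[-1]]` expression. Capping the history keeps the request size bounded for long conversations.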

## Step 2: Write Your First Evaluation with Pytest

Now that we have an application, we can write a test for it using SigmaEval and `pytest`. This test will verify that our assistant not only provides the correct information but also does so in a timely manner.

Create a file named `test_app.py` in the same directory:

```python test_app.py theme={null}
import pytest
from sigmaeval import (
    SigmaEval,
    ScenarioTest,
    assertions,
    metrics,
)
# Import the application handler we want to test
from app import app_handler

@pytest.mark.asyncio
async def test_bot_explains_its_capabilities():
    """
    This test evaluates if the bot correctly and efficiently explains its capabilities.
    """
    # 1. Initialize SigmaEval
    sigma_eval = SigmaEval(
        judge_model="gemini/gemini-2.5-flash",
        sample_size=10,  # A smaller sample size for a quick tutorial run
        significance_level=0.05
    )
    
    # 2. Define the ScenarioTest
    # This scenario tests two things:
    # - The bot provides the correct information (.expect_behavior)
    # - The bot responds quickly enough (.expect_metric)
    scenario = (
        ScenarioTest("Bot explains its capabilities")
        .given("A new user who has not interacted with the bot before")
        .when("The user asks a general question about the bot's capabilities, like 'what can you do?'")
        .expect_behavior(
            "Bot lists its main functions: tracking orders, initiating returns, answering product questions, and escalating to a human agent.",
            criteria=assertions.scores.proportion_gte(min_score=7, proportion=0.90)
        )
        .expect_metric(
            metrics.per_turn.response_latency,
            # We want to be confident that at least 90% of responses are faster than 2 seconds.
            criteria=assertions.metrics.proportion_lt(threshold=2.0, proportion=0.90)
        )
    )
    
    # 3. Run the evaluation
    result = await sigma_eval.evaluate(scenario, app_handler)
    
    # 4. Print the detailed summary for logs and assert the result
    print(result)
    assert result.passed, "The bot capabilities scenario failed."
```
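
To make the latency expectation more concrete, the sketch below shows one way to time a single turn of an async handler and to compute the fraction of turns that finish under a threshold. It is an illustration, not SigmaEval's implementation: `timed_call`, `fraction_below`, and `fake_handler` are hypothetical names, and SigmaEval reports a p-value rather than this raw proportion.

```python
import asyncio
import time
from typing import Any, Awaitable, Callable, Dict, List, Tuple

# Same call shape as app_handler: (messages, state) -> reply text.
Handler = Callable[[List[Dict[str, str]], Any], Awaitable[str]]


async def timed_call(
    handler: Handler,
    messages: List[Dict[str, str]],
    state: Any = None,
) -> Tuple[str, float]:
    """Call the handler once and return (reply, elapsed seconds)."""
    start = time.perf_counter()
    reply = await handler(messages, state)
    return reply, time.perf_counter() - start


def fraction_below(latencies: List[float], threshold: float) -> float:
    """Fraction of observed latencies strictly below the threshold."""
    if not latencies:
        raise ValueError("latencies must not be empty")
    return sum(1 for value in latencies if value < threshold) / len(latencies)


async def fake_handler(messages: List[Dict[str, str]], state: Any) -> str:
    """Stand-in for a real chatbot so the sketch runs without an API key."""
    await asyncio.sleep(0.01)
    return "ok"
```

For example, `asyncio.run(timed_call(fake_handler, [{"role": "user", "content": "hi"}]))` returns the reply together with the time that turn took.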

## Step 3: Run the Test and Interpret the Results

With your `app.py` and `test_app.py` files in place, you can run the evaluation from your terminal.

```bash terminal theme={null}
# -s disables output capture so the printed summary is shown even when the test passes
pytest -s
```

When you run the test, SigmaEval simulates 10 conversations with your bot (the `sample_size` set in the test), has an AI judge score each one against the behavioral expectation, and evaluates the response latency metric against its threshold. The test then prints a summary of the results. The final output will look something like this:

```text theme={null}
--- Result for Scenario: 'Bot explains its capabilities' ---
Overall Status: ✅ PASSED
Summary: 2/2 expectations passed.

Breakdown:
  - [✅ PASSED] Bot lists its main functions: tracking orders, initiating returns, answering product questions, and escalating to a human agent., p-value: 0.0011
  - [✅ PASSED] Per-turn response latency < 2.0 (proportion >= 0.9), p-value: 0.0000

✅ Scenario passed!
========================= 1 passed in 15.34s =========================
```

This output confirms that the test passed because both the behavioral (`expect_behavior`) and the performance (`expect_metric`) expectations were met with statistical confidence.
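
As a rough guide to reading the p-values, the sketch below applies the conventional decision rule: a result counts as significant when its p-value is below the significance level. That SigmaEval uses exactly this comparison is an assumption here, and `is_significant` is a hypothetical helper. The values come from the sample output above and the `significance_level=0.05` set in the test.

```python
def is_significant(p_value: float, significance_level: float = 0.05) -> bool:
    """Conventional rule: significant when p_value < significance_level."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must be between 0 and 1")
    return p_value < significance_level


# p-values copied from the sample output in this tutorial.
sample_p_values = {"behavior": 0.0011, "latency": 0.0000}
results = {name: is_significant(p) for name, p in sample_p_values.items()}
```

Both entries in `results` are `True` for the sample output, which matches the two `PASSED` lines in the breakdown.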

## Next Steps

Now that you've run your first end-to-end evaluation, you can start applying SigmaEval to your own generative AI applications. Try modifying the system prompt in `app.py` or the expectations in `test_app.py` to see how the results change.
