You will also need to set your API key for the LLM provider you wish to use for the AI Judge. SigmaEval supports 100+ LLM providers via LiteLLM, including OpenAI, Anthropic, Google, and local models via Ollama.
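The key is usually supplied through an environment variable whose name depends on the provider. A minimal sketch, assuming the Gemini judge model used in the example below (LiteLLM reads it from GEMINI_API_KEY; other providers use their own variable names, such as OPENAI_API_KEY or ANTHROPIC_API_KEY):

import os

# Set the provider's API key before constructing SigmaEval.
# GEMINI_API_KEY is assumed here because the example below uses a Gemini judge model;
# substitute the variable your provider expects.
os.environ["GEMINI_API_KEY"] = "your-api-key"

You can equally export the variable in your shell before running the test script.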
Here is a minimal, complete example of how to use SigmaEval to test a simple AI application. This example evaluates a bot that is expected to return a friendly greeting.
test_app.py
from sigmaeval import SigmaEval, ScenarioTest, assertions
import asyncio
from typing import List, Dict, Any

# 1. Define the ScenarioTest to describe the desired behavior
scenario = (
    ScenarioTest("Simple Test")
    .given("A user interacting with a chatbot")
    .when("The user greets the bot")
    .expect_behavior(
        "The bot provides a simple and friendly greeting.",
        # We want to be confident that at least 75% of responses will score a 7/10 or higher.
        criteria=assertions.scores.proportion_gte(min_score=7, proportion=0.75)
    )
    .max_turns(1)  # Only needed here since we're returning a static greeting
)

# 2. Implement the app_handler to allow SigmaEval to communicate with your app
async def app_handler(messages: List[Dict[str, str]], state: Any) -> str:
    # In a real test, you would pass messages to your app and return the response.
    # For this example, we'll return a static, friendly greeting.
    return "Hello there! Nice to meet you!"

# 3. Initialize SigmaEval and run the evaluation
async def main():
    # You can use any model that LiteLLM supports: https://docs.litellm.ai/docs/providers
    sigma_eval = SigmaEval(
        judge_model="gemini/gemini-2.5-flash",
        sample_size=20,  # The number of times to run the test
        significance_level=0.05  # Corresponds to a 95% confidence level
    )

    result = await sigma_eval.evaluate(scenario, app_handler)

    # Print the detailed summary to the console
    print(result)

    # Programmatically check the result
    if result.passed:
        print("✅ Scenario passed!")
    else:
        print("❌ Scenario failed.")

if __name__ == "__main__":
    asyncio.run(main())
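In a real test, app_handler is where SigmaEval's simulated user meets your application: it receives the conversation so far and must return your app's reply as a string. A minimal sketch of a drop-in replacement for the handler above, assuming each message dict carries "role" and "content" keys and that your app exposes an async generate_reply function (a hypothetical name):

async def app_handler(messages: List[Dict[str, str]], state: Any) -> str:
    # The last message is assumed to be the simulated user's latest turn.
    latest_user_message = messages[-1]["content"]
    # my_app.generate_reply stands in for however your application produces a response.
    return await my_app.generate_reply(latest_user_message)

The state argument presumably lets you carry per-conversation context between turns; the static example above simply ignores it.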
When you run the script, SigmaEval will simulate 20 conversations, have an AI Judge score each one, and then print a summary of the results. The summary shows the overall pass/fail status for the scenario and a breakdown of each expectation.

Here’s an example of what the output might look like:
--- Result for Scenario: 'Simple Test' ---
Overall Status: ✅ PASSED
Summary: 1/1 expectations passed.
Breakdown:
  - [✅ PASSED] The bot provides a simple and friendly greeting., p-value: 0.0032

✅ Scenario passed!
This output confirms that the scenario passed and shows the p-value from the underlying statistical test; here, 0.0032 is well below the configured significance level of 0.05.
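The same pattern drops into a test suite if you prefer running scenarios alongside your other tests. A minimal sketch, assuming the scenario and app_handler defined above and that pytest with the pytest-asyncio plugin is installed (neither is required by SigmaEval itself):

import pytest

@pytest.mark.asyncio
async def test_simple_greeting_scenario():
    sigma_eval = SigmaEval(
        judge_model="gemini/gemini-2.5-flash",
        sample_size=20,
        significance_level=0.05,
    )
    result = await sigma_eval.evaluate(scenario, app_handler)
    # result.passed reflects the statistical outcome reported in the summary above.
    assert result.passed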