Benchmark and Evaluation Tutorial¶
This tutorial walks you through setting up and running benchmark evaluations in EvoAgentX, using the HotpotQA dataset as a running example.
1. Overview¶
EvoAgentX provides a flexible and modular evaluation framework that enables you to:
- Load and use predefined benchmark datasets
- Customize data loading, processing, and post-processing logic
- Evaluate the performance of your multi-agent workflows
- Process multiple evaluation tasks in parallel
2. Setting Up the Benchmark¶
To get started, you need to import the relevant modules and set up the language model (LLM) that your agent will use during evaluation.
```python
from evoagentx.config import Config
from evoagentx.models import OpenAILLMConfig, OpenAILLM
from evoagentx.benchmark import HotpotQA
from evoagentx.workflow import QAActionGraph
from evoagentx.evaluators import Evaluator
from evoagentx.core.callbacks import suppress_logger_info
```
Configure the LLM Model¶
You'll need a valid OpenAI API key to initialize the LLM. It is recommended to save your API key in a `.env` file and load it with the `load_dotenv` function:
```python
import os
from dotenv import load_dotenv

load_dotenv()
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
llm_config = OpenAILLMConfig(model="gpt-4o-mini", openai_key=OPENAI_API_KEY)
llm = OpenAILLM(config=llm_config)
```
3. Initialize the Benchmark¶
EvoAgentX includes several predefined benchmarks for tasks like Question Answering, Math, and Coding. Please refer to the Benchmark README for more details about existing benchmarks. You can also define your own benchmark class by extending the base `Benchmark` interface; we provide an example in the Custom Benchmark section.
In this example, we will use the `HotpotQA` benchmark. The `mode` parameter determines which split of the dataset is loaded. Options include:

- `"train"`: Training data
- `"dev"`: Development/validation data
- `"test"`: Test data
- `"all"` (default): Loads the entire dataset

The data will be automatically downloaded to a default cache folder, but you can change this location by specifying the `path` parameter.
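For example, the development split can be loaded as shown below. The keyword arguments are assumed to follow the description above (`mode` selects the split, and an optional `path` overrides the default cache folder); the resulting `benchmark` object is the one passed to the evaluator later in this tutorial.

```python
# Load the HotpotQA development split; the data is downloaded to the
# default cache folder on first use (pass `path=...` to change the location).
benchmark = HotpotQA(mode="dev")
```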
4. Running the Evaluation¶
Once you have your benchmark and LLM ready, the next step is to define your agent workflow and evaluation logic. EvoAgentX supports full customization of how benchmark examples are processed and how outputs are interpreted.
Here's how to run an evaluation using the `HotpotQA` benchmark and a QA workflow.
Step 1: Define the Agent Workflow¶
You can use one of the predefined workflows or implement your own. In this example, we use the `QAActionGraph` designed for question answering, which simply uses self-consistency to generate the final answer:
```python
workflow = QAActionGraph(
    llm_config=llm_config,
    description="This workflow aims to address multi-hop QA tasks."
)
```
Step 2: Customize Data Preprocessing and Post-processing¶
The next key aspect of evaluation is properly transforming data between your benchmark, workflow, and evaluation metrics.
Why Preprocessing and Postprocessing Are Needed¶
In EvoAgentX, preprocessing and postprocessing are essential steps to ensure smooth interaction between benchmark data, workflows, and evaluation logic:
- Preprocessing (`collate_func`): The raw examples from a benchmark like HotpotQA typically consist of structured fields such as question, answer, and context. However, your agent workflow usually expects a single prompt string or other structured input. The `collate_func` is used to convert each raw example into a format that can be consumed by your (custom) workflow.
- Postprocessing (`output_postprocess_func`): The workflow output might include reasoning steps or additional formatting beyond just the final answer. Since the `Evaluator` internally calls the benchmark's `evaluate` method to compute metrics (e.g., exact match or F1), it's often necessary to extract the final answer in a clean format. The `output_postprocess_func` handles this and ensures the output is in the right form for evaluation.
In short, preprocessing prepares benchmark examples for the workflow, while postprocessing prepares workflow outputs for evaluation.
In the following example, we define a `collate_func` to format the raw examples into a prompt for the workflow, and an `output_postprocess_func` to extract the final answer from the workflow output.

Each example in the benchmark can be formatted using a `collate_func`, which transforms raw examples into a prompt or structured input for the agent.
```python
def collate_func(example: dict) -> dict:
    """
    Args:
        example (dict): A dictionary containing the raw example data.

    Returns:
        The expected input for the (custom) workflow.
    """
    problem = "Question: {}\n\n".format(example["question"])
    context_list = []
    for item in example["context"]:
        context = "Title: {}\nText: {}".format(item[0], " ".join([t.strip() for t in item[1]]))
        context_list.append(context)
    context = "\n\n".join(context_list)
    problem += "Context: {}\n\n".format(context)
    problem += "Answer:"
    return {"problem": problem}
```
After the agent generates an output, you can define how to extract the final answer using `output_postprocess_func`:
```python
def output_postprocess_func(output: dict) -> str:
    """
    Args:
        output (dict): The output from the workflow.

    Returns:
        The processed output that can be used to compute the metrics. The output will be
        directly passed to the benchmark's `evaluate` method.
    """
    return output["answer"]
```
Step 3: Initialize the Evaluator¶
The `Evaluator` ties everything together: it runs the workflow over the benchmark and calculates performance metrics.
```python
evaluator = Evaluator(
    llm=llm,
    collate_func=collate_func,
    output_postprocess_func=output_postprocess_func,
    verbose=True,
    num_workers=3
)
```
If `num_workers` is greater than 1, the evaluation will be parallelized across multiple threads.
Step 4: Run the Evaluation¶
You can now run the evaluation by providing the workflow and benchmark to the evaluator:
```python
with suppress_logger_info():
    results = evaluator.evaluate(
        graph=workflow,
        benchmark=benchmark,
        eval_mode="dev",  # Evaluation split: train / dev / test
        sample_k=10       # If set, randomly sample k examples from the benchmark for evaluation
    )

print("Evaluation metrics: ", results)
```
The `suppress_logger_info` context manager is used to suppress logger output during evaluation.
Please refer to the benchmark and evaluation example for a complete, runnable script.
Custom Benchmark¶
To define a custom benchmark, you need to extend the `Benchmark` class and implement the following methods (a minimal sketch follows the list):

- `_load_data(self)`: Load the benchmark data and set the `self._train_data`, `self._dev_data`, and `self._test_data` attributes.
- `_get_id(self, example: Any) -> Any`: Return the unique identifier of an example.
- `_get_label(self, example: Any) -> Any`: Return the label or ground truth associated with a given example. This is used to compare predictions against the correct answer during evaluation. The output will be directly passed to the `evaluate` method.
- `evaluate(self, prediction: Any, label: Any) -> dict`: Compute the evaluation metrics for a single example, based on its prediction and ground-truth label (obtained from `_get_label`). This method should return a dictionary of metric name(s) and value(s).
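To make these requirements concrete, here is a minimal, illustrative sketch. The class name, the hard-coded examples, and the exact-match metric are hypothetical and only show where each required method fits; it assumes the `Benchmark` base class is importable from `evoagentx.benchmark` (mirroring the `HotpotQA` import above) and that subclasses populate the split attributes in `_load_data`.

```python
from typing import Any

from evoagentx.benchmark import Benchmark  # assumed import path, mirroring HotpotQA above


class ToyQABenchmark(Benchmark):
    """Hypothetical benchmark whose examples are dicts with 'id', 'question', and 'answer' fields."""

    def _load_data(self):
        # A real benchmark would read its data files here (e.g., from self.path).
        # This sketch hard-codes two examples and reuses them for every split.
        data = [
            {"id": "q1", "question": "What is 2 + 2?", "answer": "4"},
            {"id": "q2", "question": "Which planet is known as the Red Planet?", "answer": "Mars"},
        ]
        self._train_data, self._dev_data, self._test_data = data, data, data

    def _get_id(self, example: Any) -> Any:
        return example["id"]

    def _get_label(self, example: Any) -> Any:
        # Whatever is returned here is passed to `evaluate` as the ground-truth label.
        return example["answer"]

    def evaluate(self, prediction: Any, label: Any) -> dict:
        # Return a dictionary of metric name(s) and value(s); here, a simple exact match.
        em = float(str(prediction).strip().lower() == str(label).strip().lower())
        return {"em": em}
```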
For a complete example of a benchmark implementation, please refer to the HotPotQA class.