🧑‍⚖️ Evaluator Interface
evoagentx.evaluators
Evaluator
Evaluator(llm: BaseLLM, num_workers: int = 1, agent_manager: Optional[AgentManager] = None, collate_func: Optional[Callable] = None, output_postprocess_func: Optional[Callable] = None, verbose: Optional[bool] = None, **kwargs)
A class for evaluating the performance of a workflow.
Initialize the Evaluator.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`llm` | `BaseLLM` | The LLM to use for evaluation. | *required* |
`num_workers` | `int` | The number of parallel workers to use for evaluation. | `1` |
`agent_manager` | `AgentManager` | The agent manager used to construct the workflow. Only used when the workflow graph is a `WorkFlowGraph`. | `None` |
`collate_func` | `Callable` | A function to collate the benchmark data. It receives a single example from the benchmark, and its output (which should be a dictionary) serves as the inputs to the workflow. | `None` |
`output_postprocess_func` | `Callable` | A function to postprocess the output of the workflow. It receives the output of a `WorkFlow` instance (`str`) or an `ActionGraph` instance (`dict`) as input, and the postprocessed result is passed to the benchmark to compute metrics. | `None` |
`verbose` | `bool` | Whether to print the evaluation progress. | `None` |
Source code in evoagentx/evaluators/evaluator.py
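Below is a minimal construction sketch. The `Evaluator` import path follows the module shown above; the pre-configured `llm` instance and the benchmark field names used in the collate function (`question`, `problem`, `answer`) are illustrative assumptions, not part of the documented API.

```python
from evoagentx.evaluators import Evaluator

# `llm` is assumed to be an already-configured BaseLLM instance
# (any LLM class supported by EvoAgentX works here).

def collate_func(example: dict) -> dict:
    # Map one raw benchmark example to the workflow's input dict.
    # The field names "question" / "problem" are hypothetical and
    # depend on the benchmark you evaluate on.
    return {"problem": example["question"]}

def output_postprocess_func(output) -> str:
    # WorkFlow outputs arrive as `str`, ActionGraph outputs as `dict`;
    # normalize both to the form the benchmark's metric expects.
    if isinstance(output, dict):
        return str(output.get("answer", output))
    return output

evaluator = Evaluator(
    llm=llm,
    num_workers=4,                                   # evaluate 4 examples in parallel
    collate_func=collate_func,
    output_postprocess_func=output_postprocess_func,
    verbose=True,
)
```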
evaluate
evaluate(graph: Union[WorkFlowGraph, ActionGraph], benchmark: Benchmark, eval_mode: str = 'test', indices: Optional[List[int]] = None, sample_k: Optional[int] = None, seed: Optional[int] = None, verbose: Optional[bool] = None, update_agents: Optional[bool] = False, **kwargs) -> dict
Evaluate the performance of the workflow on the benchmark.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`graph` | `WorkFlowGraph` or `ActionGraph` | The workflow to evaluate. | *required* |
`benchmark` | `Benchmark` | The benchmark to evaluate the workflow on. | *required* |
`eval_mode` | `str` | Which split of the benchmark to evaluate the workflow on. Choices: `["test", "dev", "train"]`. | `'test'` |
`indices` | `List[int]` | The indices of the data to evaluate the workflow on. | `None` |
`sample_k` | `int` | The number of examples to evaluate the workflow on. If provided, a random sample of size `sample_k` is drawn from the selected split. | `None` |
`seed` | `int` | The random seed used when drawing the `sample_k` sample. | `None` |
`verbose` | `bool` | Whether to print the evaluation progress. If not provided, the evaluator's `verbose` setting is used. | `None` |
`update_agents` | `bool` | Whether to update the agents in the agent manager. Only used when the workflow graph is a `WorkFlowGraph`. | `False` |
Returns: dict: The average metrics of the workflow evaluation.
Source code in evoagentx/evaluators/evaluator.py
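A usage sketch, assuming `evaluator` was built as above and that `workflow_graph` (a `WorkFlowGraph` constructed with the agent manager) and `benchmark` (a `Benchmark` subclass instance) already exist; the metric keys in the final comment are illustrative.

```python
# Assumed to exist already: `evaluator`, `workflow_graph`, `benchmark`.
metrics = evaluator.evaluate(
    graph=workflow_graph,
    benchmark=benchmark,
    eval_mode="dev",      # evaluate on the dev split
    sample_k=50,          # draw a random sample of 50 examples ...
    seed=42,              # ... reproducibly
    update_agents=True,   # sync the agent manager with the graph's agents
)
print(metrics)            # e.g. {"accuracy": 0.82}; keys depend on the benchmark
```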
get_example_evaluation_record
Get the evaluation record for a given example.
Source code in evoagentx/evaluators/evaluator.py
get_evaluation_record_by_id
get_evaluation_record_by_id(benchmark: Benchmark, example_id: str, eval_mode: str = 'test') -> Optional[dict]
Get the evaluation record for a given example id.
Source code in evoagentx/evaluators/evaluator.py
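A retrieval sketch, assuming an evaluation has already been run on `benchmark`; the example id below is hypothetical (valid ids come from the benchmark's own data), and the exact record fields depend on the implementation.

```python
# Assumed: `evaluator.evaluate(...)` has already been run on `benchmark`.
record = evaluator.get_evaluation_record_by_id(
    benchmark=benchmark,
    example_id="example_0",   # hypothetical id; use a real id from the benchmark
    eval_mode="dev",
)
if record is not None:
    print(record)  # per-example prediction, label, and metrics (fields depend on the implementation)
```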
async_evaluate (async)
async_evaluate(graph: Union[WorkFlowGraph, ActionGraph], benchmark: Benchmark, eval_mode: str = 'test', indices: Optional[List[int]] = None, sample_k: Optional[int] = None, seed: Optional[int] = None, verbose: Optional[bool] = None, **kwargs) -> dict
Asynchronously evaluate the performance of the workflow on the benchmark.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`graph` | `WorkFlowGraph` or `ActionGraph` | The workflow to evaluate. | *required* |
`benchmark` | `Benchmark` | The benchmark to evaluate the workflow on. | *required* |
`eval_mode` | `str` | Which split of the benchmark to evaluate the workflow on. Choices: `["test", "dev", "train"]`. | `'test'` |
`indices` | `List[int]` | The indices of the data to evaluate the workflow on. | `None` |
`sample_k` | `int` | The number of examples to evaluate the workflow on. If provided, a random sample of size `sample_k` is drawn from the selected split. | `None` |
`seed` | `int` | The random seed used when drawing the `sample_k` sample. | `None` |
`verbose` | `bool` | Whether to print the evaluation progress. If not provided, the evaluator's `verbose` setting is used. | `None` |
Returns:
Name | Type | Description |
---|---|---|
`dict` | `dict` | The average metrics of the workflow evaluation. |
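An async sketch, assuming the same `evaluator`, `workflow_graph`, and `benchmark` objects as in the earlier examples.

```python
import asyncio

# Assumed: `evaluator`, `workflow_graph`, and `benchmark` are set up as in the sketches above.
async def main():
    metrics = await evaluator.async_evaluate(
        graph=workflow_graph,
        benchmark=benchmark,
        eval_mode="test",
        sample_k=100,   # optional: evaluate a random sample of 100 test examples
        seed=0,
    )
    print(metrics)

asyncio.run(main())
```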