🧪 Benchmark API¶
evoagentx.benchmark ¶
NQ ¶
Bases: Benchmark
Benchmark class for evaluating question answering on the Natural Questions dataset.
Natural Questions (NQ) is a dataset for open-domain question answering, containing real questions from Google Search and answers from Wikipedia. This class handles loading the dataset, evaluating answers, and computing metrics like exact match and F1 score.
Each NQ example has the following structure: { "id": str, "question": str, "answers": List[str] }
The benchmark evaluates answers using exact match, F1 score, and accuracy metrics.
Source code in evoagentx/benchmark/nq.py
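To make the metrics named above concrete, here is a minimal, self-contained sketch of exact match and token-level F1 following the standard SQuAD-style normalization (lowercasing, stripping punctuation and articles). It illustrates the metrics only; the actual implementation in evoagentx/benchmark/nq.py may differ in details.

```python
import re
import string
from collections import Counter

def normalize_answer(s: str) -> str:
    """SQuAD-style normalization: lowercase, drop punctuation and articles, collapse spaces."""
    s = "".join(ch for ch in s.lower() if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(prediction: str, gold_answers: list[str]) -> float:
    return float(any(normalize_answer(prediction) == normalize_answer(g) for g in gold_answers))

def f1_score(prediction: str, gold_answers: list[str]) -> float:
    """Best token-level F1 over all gold answers."""
    best = 0.0
    pred_tokens = normalize_answer(prediction).split()
    for gold in gold_answers:
        gold_tokens = normalize_answer(gold).split()
        overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
        if overlap == 0:
            continue
        precision = overlap / len(pred_tokens)
        recall = overlap / len(gold_tokens)
        best = max(best, 2 * precision * recall / (precision + recall))
    return best
```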
HotPotQA ¶
Bases: Benchmark
Benchmark class for evaluating multi-hop question answering on the HotPotQA dataset.
Each HotPotQA example has the following structure: { "_id": str, "question": str, "answer": str, "context": [["context_title", ["context_sentence", "another_sentence"]]], "supporting_facts": [["supporting_title", supporting_sentence_index]], "type": str, "level": str }
The benchmark evaluates answers using exact match, F1 score, and accuracy metrics.
Source code in evoagentx/benchmark/hotpotqa.py
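The hedged sketch below builds one example in the documented shape and resolves supporting_facts (a title plus a sentence index) back into sentences of context. The field layout follows the structure shown above; the content itself is invented for illustration.

```python
example = {
    "_id": "example-1",
    "question": "Which city is the band that recorded Album X from?",  # invented content
    "answer": "Example City",
    "context": [
        ["Album X", ["Album X is a studio album.", "It was recorded by Band Y."]],
        ["Band Y", ["Band Y is a rock band from Example City."]],
    ],
    "supporting_facts": [["Album X", 1], ["Band Y", 0]],
    "type": "bridge",
    "level": "medium",
}

# Resolve each (title, sentence_index) pair into the actual supporting sentence.
sentences_by_title = {title: sentences for title, sentences in example["context"]}
supporting_sentences = [
    sentences_by_title[title][idx] for title, idx in example["supporting_facts"]
]
print(supporting_sentences)
# ['It was recorded by Band Y.', 'Band Y is a rock band from Example City.']
```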
AFlowHotPotQA ¶
Bases: HotPotQA
AFlow-specific implementation of the HotPotQA benchmark.
Source code in evoagentx/benchmark/hotpotqa.py
GSM8K ¶
Bases: Benchmark
Benchmark class for evaluating math reasoning on the GSM8K dataset.
GSM8K (Grade School Math 8K) is a dataset of math word problems that test a model's ability to solve grade school level math problems requiring multi-step reasoning. This class handles loading the dataset, evaluating solutions, and computing metrics based on answer accuracy.
Each GSM8K example has the following structure: { "id": "test-1", "question": "the question", "answer": "the answer" }
The benchmark evaluates answers by extracting the final numerical value and comparing it to the ground truth answer.
Source code in evoagentx/benchmark/gsm8k.py
extract_last_number ¶
Extract the last number from a text.
Source code in evoagentx/benchmark/gsm8k.py
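As an illustration of what such a helper does, here is a minimal regex-based sketch; the function in gsm8k.py may handle additional formatting (commas, currency symbols, etc.) differently.

```python
import re

def extract_last_number(text: str) -> float | None:
    """Return the last number found in the text, or None if there is none."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return float(matches[-1]) if matches else None

extract_last_number("The answer is #### 42")  # 42.0
```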
AFlowGSM8K ¶
Bases: GSM8K
AFlow-specific implementation of the GSM8K benchmark.
This class extends the GSM8K benchmark with features specific to the AFlow framework, including loading from AFlow-formatted data files and supporting asynchronous evaluation for workflows.
Attributes:

| Name | Type | Description |
| --- | --- | --- |
| path | | Path to the directory containing AFlow-formatted GSM8K files. |
| mode | | Data loading mode ("train", "dev", "test", or "all"). |
| _train_data | Optional[List[dict]] | Training dataset loaded from AFlow format. |
| _dev_data | Optional[List[dict]] | Development dataset loaded from AFlow format. |
| _test_data | Optional[List[dict]] | Test dataset loaded from AFlow format. |
Source code in evoagentx/benchmark/gsm8k.py
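A hedged usage sketch: the exact constructor signature for AFlowGSM8K is not shown above, so the call below assumes it mirrors the other AFlow benchmarks on this page (path and mode keyword arguments); the directory path is a placeholder.

```python
from evoagentx.benchmark import AFlowGSM8K  # assumes the class is exported from this module

# Load only the AFlow-formatted dev split; mode can be "train", "dev", "test", or "all".
benchmark = AFlowGSM8K(path="data/aflow/gsm8k", mode="dev")  # placeholder path
```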
MBPP ¶
Bases: CodingBenchmark
Benchmark class for evaluating code generation on the MBPP dataset.
MBPP (Mostly Basic Python Programming) is a collection of Python programming problems designed to test a model's ability to generate functionally correct code from natural language descriptions. This class handles loading the dataset, evaluating solutions, and computing metrics such as pass@k.
The original MBPP format is transformed to be compatible with the HumanEval benchmark format, allowing for consistent evaluation infrastructure.
Each MBPP example has the following structure: { "task_id" (int): 2, "prompt" (str): "Write a function to find the shared elements from the given two lists.", "code" (str): "def similar_elements(test_tup1, test_tup2): res = tuple(set(test_tup1) & set(test_tup2)) return (res) ", "test_imports": [], "test_list" (List[str]): ['assert set(similar_elements((3, 4, 5, 6),(5, 7, 4, 10))) == set((4, 5))', 'assert set(similar_elements((1, 2, 3, 4),(5, 4, 3, 7))) == set((3, 4))', 'assert set(similar_elements((11, 12, 14, 13),(17, 15, 14, 13))) == set((13, 14))'] }
Attributes:
k: An integer or list of integers specifying which pass@k metrics to compute.
Source code in evoagentx/benchmark/mbpp.py
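To show what "functionally correct" means for an MBPP item, the sketch below executes a candidate solution together with the example's test_list assertions in a scratch namespace. It mirrors the idea of unit-test-based checking but is not the sandboxed evaluation used by the benchmark itself.

```python
candidate_code = """
def similar_elements(test_tup1, test_tup2):
    return tuple(set(test_tup1) & set(test_tup2))
"""

test_list = [
    "assert set(similar_elements((3, 4, 5, 6), (5, 7, 4, 10))) == set((4, 5))",
    "assert set(similar_elements((1, 2, 3, 4), (5, 4, 3, 7))) == set((3, 4))",
]

namespace: dict = {}
exec(candidate_code, namespace)   # define the candidate function
for test in test_list:
    exec(test, namespace)         # raises AssertionError if the candidate is wrong
print("all tests passed")
```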
evaluate ¶
Evaluate the solution code.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| prediction | str \| List[str] | The solution code(s). | required |
| label | dict \| List[dict] | The unit test code(s). | required |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| dict | dict | The evaluation metrics (pass@k). |
Source code in evoagentx/benchmark/mbpp.py
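For reference, pass@k is conventionally computed with the unbiased estimator from the original HumanEval paper: given n samples of which c pass, pass@k = 1 - C(n-c, k) / C(n, k). The sketch below implements that formula; whether evoagentx computes it in exactly this way internally is an assumption.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k drawn samples (of n total, c correct) passes."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

pass_at_k(n=10, c=3, k=1)  # 0.3
```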
AFlowMBPP ¶
AFlowMBPP(path: str = None, mode: str = 'all', timeout: int = 60, k: Union[int, list] = 1, **kwargs)
Bases: MBPP
AFlow-specific implementation of the MBPP benchmark.
Source code in evoagentx/benchmark/mbpp.py
evaluate ¶
Evaluate the solution code.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| prediction | str \| List[str] | The solution code(s). | required |
| label | dict \| List[dict] | The unit test code(s). | required |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| dict | dict | The evaluation metrics (pass@k). |
Source code in evoagentx/benchmark/mbpp.py
MATH ¶
Bases: Benchmark
Benchmark class for evaluating mathematical reasoning on the MATH dataset.
MATH is a dataset of challenging competition mathematics problems, spanning various difficulty levels and subject areas. This class handles loading the dataset, extracting answers, evaluating solutions through symbolic and numerical comparisons, and computing accuracy metrics.
The dataset includes problems across 7 subject areas (Algebra, Geometry, etc.) and 5 difficulty levels. Each problem contains LaTeX-formatted questions and solutions.
Each MATH example has the following structure: { "id": "test-1", "problem": "the problem", "solution": "the solution", "level": "Level 1", "type": "Algebra" }, where "level" is one of "Level 1" through "Level 5" or "Level ?", and "type" is one of "Algebra", "Geometry", "Intermediate Algebra", "Counting & Probability", "Precalculus", "Number Theory", or "Prealgebra".
The benchmark evaluates answers using symbolic math equality checking and numerical approximation to handle equivalent mathematical expressions.
Source code in evoagentx/benchmark/math.py
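The sketch below illustrates the two comparison strategies mentioned above (symbolic equality first, numerical approximation as a fallback) using sympy on plain expressions. Parsing LaTeX-formatted answers, which the real dataset requires, is left out, and the actual checks in math.py may differ.

```python
import sympy

def answers_equivalent(pred: str, gold: str, tol: float = 1e-6) -> bool:
    """Illustrative check: try symbolic equality, then numerical closeness, then string match."""
    try:
        p, g = sympy.sympify(pred), sympy.sympify(gold)
        if sympy.simplify(p - g) == 0:                          # symbolic equality
            return True
        return abs(float(p.evalf()) - float(g.evalf())) < tol   # numerical approximation
    except (sympy.SympifyError, TypeError, ValueError):
        return pred.strip() == gold.strip()                     # last resort: plain string comparison

answers_equivalent("1/2", "0.5")  # True
```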
HumanEval ¶
HumanEval(path: str = None, mode: str = 'all', timeout: int = 60, k: Union[int, list] = 1, **kwargs)
Bases: CodingBenchmark
Benchmark class for evaluating code generation on HumanEval.
HumanEval is a collection of Python programming problems designed to test
a model's ability to generate functionally correct code from natural language
descriptions. This class handles loading the dataset, evaluating solutions,
and computing metrics such as pass@k.
Each HumanEval example has the following structure: { "task_id": "HumanEval/0", "prompt": "from typing import List\n\ndef func_name(args, *kwargs) -> return_type:\n    \"function description\"\n", "entry_point": "func_name", "canonical_solution": "canonical solution (code)", "test": "METADATA = {xxx}\n\ndef check(candidate):\n    assert candidate(inputs) == output\n" }
Attributes:
k: An integer or list of integers specifying which pass@k metrics to compute.
Source code in evoagentx/benchmark/humaneval.py
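Based on the fields documented above, a HumanEval-style check typically assembles the prompt, the model's completion, and the test code into one program and then calls check(entry_point). The sketch below is an illustrative reconstruction of that flow, not the code in humaneval.py.

```python
def run_humaneval_example(example: dict, completion: str) -> bool:
    """Return True if prompt + completion passes the example's check() tests."""
    program = example["prompt"] + completion + "\n" + example["test"]
    namespace: dict = {}
    try:
        exec(program, namespace)                                    # defines the function and check()
        namespace["check"](namespace[example["entry_point"]])       # raises AssertionError on failure
        return True
    except Exception:
        return False
```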
handle_special_cases ¶
Handle special cases for HumanEval.
Source code in evoagentx/benchmark/humaneval.py
evaluate ¶
Evaluate the solution code.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| prediction | str \| List[str] | The solution code(s). | required |
| label | dict \| List[dict] | The unit test code(s). | required |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| dict | dict | The evaluation metrics (pass@k). |
Source code in evoagentx/benchmark/humaneval.py
AFlowHumanEval ¶
AFlowHumanEval(path: str = None, mode: str = 'all', timeout: int = 60, k: Union[int, list] = 1, **kwargs)
Bases: HumanEval
AFlow-specific implementation of the HumanEval benchmark.
Source code in evoagentx/benchmark/humaneval.py
extract_test_cases_with_entry_point ¶
Extract test cases with the given entry point.
Source code in evoagentx/benchmark/humaneval.py
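The method's stated purpose is to pull the test cases that exercise a given entry point out of a HumanEval-style test string. The regex sketch below is one plausible, purely illustrative way to do that; it is not the implementation in humaneval.py.

```python
import re

def extract_assert_lines(test_code: str, entry_point: str) -> list[str]:
    """Collect assert statements in the test code that call the given function name."""
    pattern = re.compile(rf"^\s*assert\b.*\b{re.escape(entry_point)}\s*\(")
    return [line.strip() for line in test_code.splitlines() if pattern.match(line)]

extract_assert_lines("def check(candidate):\n    assert add(1, 2) == 3\n", "add")
# ['assert add(1, 2) == 3']
```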
LiveCodeBench ¶
LiveCodeBench(path: str = None, mode: str = 'all', timeout: int = 60, k: Union[int, list] = 1, num_process: int = 6, scenario: str = 'code_generation', version: str = 'release_latest', start_date: str = None, end_date: str = None, use_cot_for_execution: bool = False, **kwargs)
Bases: CodingBenchmark
Benchmark class for evaluating LLM capabilities on real-world programming tasks.
LiveCodeBench provides a framework for evaluating different scenarios of code-related tasks:

1. Code Generation: generating code from problem descriptions
2. Test Output Prediction: predicting test outputs given test code
3. Code Execution: generating code that executes correctly
The benchmark supports different evaluation modes, metrics, and can be customized with various parameters like timeouts, sample dates, and processing options.
Attributes:

| Name | Type | Description |
| --- | --- | --- |
| k | | An integer or list of integers specifying which pass@k metrics to compute |
| version | | Release version of the dataset to use |
| num_process | | Number of processes to use for evaluation |
| start_date | | Filter problems to those after this date |
| end_date | | Filter problems to those before this date |
| scenario | | Type of programming task to evaluate ("code_generation", "test_output_prediction", or "code_execution") |
| use_cot_for_execution | | Whether to use chain-of-thought processing for code execution |
Source code in evoagentx/benchmark/livecodebench.py
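The constructor signature above exposes the scenario, version, and date-filtering options directly. Below is a hedged usage sketch that simply passes those documented keyword arguments; the data path and the date strings are placeholders, and the date format is an assumption.

```python
from evoagentx.benchmark import LiveCodeBench  # assumes the class is exported from this module

benchmark = LiveCodeBench(
    path="data/livecodebench",        # placeholder path
    mode="test",
    scenario="code_generation",       # or "test_output_prediction" / "code_execution"
    version="release_latest",
    start_date="2024-01-01",          # keep only problems after this date (format assumed)
    end_date="2024-06-30",            # keep only problems before this date (format assumed)
    k=[1, 5],
    num_process=6,
    timeout=60,
)
```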
evaluate ¶
Evaluate the solution code.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| prediction | str \| List[str] | The solution code(s). | required |
| label | dict \| List[dict] | The test cases and expected outputs. | required |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| dict | dict | The evaluation metrics (pass@k). |