
🧪 Benchmark

evoagentx.benchmark

NQ

NQ(path: str = None, mode: str = 'all', **kwargs)

Bases: Benchmark

Benchmark class for evaluating question answering on Natural Questions dataset.

Natural Questions (NQ) is a dataset for open-domain question answering, containing real questions from Google Search and answers from Wikipedia. This class handles loading the dataset, evaluating answers, and computing metrics like exact match and F1 score.

Each NQ example has the following structure:

{
    "id": str,
    "question": str,
    "answers": List[str]
}

The benchmark evaluates answers using exact match, F1 score, and accuracy metrics.

Source code in evoagentx/benchmark/nq.py
def __init__(self, path: str = None, mode: str = "all", **kwargs):
    path = os.path.expanduser(path or "~/.evoagentx/data/nq")
    super().__init__(name=type(self).__name__, path=path, mode=mode, **kwargs)
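
The exact match and F1 implementations live in shared QA utilities and are not reproduced on this page. As an illustration of the standard SQuAD-style token-level metrics (a sketch, not necessarily the exact code used here), scoring a prediction against the gold answers list could look like:

from collections import Counter

def exact_match(prediction: str, answers: list) -> float:
    # Exact match after light normalization, taken against any of the gold answers.
    normalize = lambda s: " ".join(s.lower().split())
    return float(any(normalize(prediction) == normalize(answer) for answer in answers))

def f1_score(prediction: str, answers: list) -> float:
    def single_f1(pred: str, gold: str) -> float:
        pred_tokens, gold_tokens = pred.lower().split(), gold.lower().split()
        overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
        if overlap == 0:
            return 0.0
        precision, recall = overlap / len(pred_tokens), overlap / len(gold_tokens)
        return 2 * precision * recall / (precision + recall)
    # Keep the best token-overlap score across all gold answers.
    return max(single_f1(prediction, answer) for answer in answers)

print(exact_match("Abraham Lincoln", ["Abraham Lincoln", "Lincoln"]))        # 1.0
print(round(f1_score("president Abraham Lincoln", ["Abraham Lincoln"]), 3))  # 0.8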

HotPotQA

HotPotQA(path: str = None, mode: str = 'all', **kwargs)

Bases: Benchmark

Benchmark class for evaluating multi-hop question answering on HotPotQA dataset.

Each HotPotQA example has the following structure:

{
    "_id": str,
    "question": str,
    "answer": str,
    "context": [["context_title", ["context_sentence", "another_sentence"]]],
    "supporting_facts": [["supporting_title", supporting_sentence_index]],
    "type": str,
    "level": str
}

The benchmark evaluates answers using exact match, F1 score, and accuracy metrics.

Source code in evoagentx/benchmark/hotpotqa.py
def __init__(self, path: str = None, mode: str = "all", **kwargs):
    path = os.path.expanduser(path or "~/.evoagentx/data/hotpotqa")
    super().__init__(name=type(self).__name__, path=path, mode=mode, **kwargs)
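
For reference, here is a hypothetical walk-through of a single example following the structure documented above; the field values are illustrative and the dictionary is built inline rather than loaded through the benchmark:

# Each supporting fact is a (title, sentence_index) pair that indexes into the
# sentences listed under the matching context title.
example = {
    "_id": "example-id",
    "question": "Which magazine was started first, Arthur's Magazine or First for Women?",
    "answer": "Arthur's Magazine",
    "context": [
        ["Arthur's Magazine", [
            "Arthur's Magazine was an American literary periodical.",
            "It was published in Philadelphia in the 19th century.",
        ]],
    ],
    "supporting_facts": [["Arthur's Magazine", 0]],
    "type": "comparison",
    "level": "medium",
}

sentences_by_title = {title: sentences for title, sentences in example["context"]}
for title, sentence_index in example["supporting_facts"]:
    print(title, "->", sentences_by_title[title][sentence_index])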

AFlowHotPotQA

AFlowHotPotQA(path: str = None, mode: str = 'all', **kwargs)

Bases: HotPotQA

AFlow-specific implementation of HotPotQA benchmark.

Source code in evoagentx/benchmark/hotpotqa.py
def __init__(self, path: str = None, mode: str = "all", **kwargs):
    path = os.path.expanduser(path or "~/.evoagentx/data/hotpotqa")
    super().__init__(name=type(self).__name__, path=path, mode=mode, **kwargs)

GSM8K

GSM8K(path: str = None, mode: str = 'all', **kwargs)

Bases: Benchmark

Benchmark class for evaluating math reasoning on GSM8K dataset.

GSM8K (Grade School Math 8K) is a dataset of math word problems that test a model's ability to solve grade school level math problems requiring multi-step reasoning. This class handles loading the dataset, evaluating solutions, and computing metrics based on answer accuracy.

Each GSM8K example has the following structure:

{
    "id": "test-1",
    "question": "the question",
    "answer": "the answer"
}

The benchmark evaluates answers by extracting the final numerical value and comparing it to the ground truth answer.

Source code in evoagentx/benchmark/gsm8k.py
def __init__(self, path: str = None, mode: str = "all", **kwargs):
    path = os.path.expanduser(path or "~/.evoagentx/data/gsm8k")
    super().__init__(name=type(self).__name__, path=path, mode=mode, **kwargs)

extract_last_number

extract_last_number(text: str) -> float

Extract the last number from a text.

Source code in evoagentx/benchmark/gsm8k.py
def extract_last_number(self, text: str) -> float:
    """
    Extract the last number from a text.
    """
    matches = regex.findall(r"[-+]?\d+(?:,\d{3})*(?:\.\d+)?|\d+\.\d+", str(text))
    if matches:
        last_number = matches[-1].replace(",", "").strip()
        try:
            last_number = float(last_number)
            return last_number
        except ValueError:
            return None
    return None
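
For illustration only, the same extraction can be reproduced as a standalone helper with the stdlib re module (the method above uses the third-party regex package, whose findall behaves the same for this pattern):

import re

def extract_last_number(text: str):
    # Same pattern as above: optional sign, optional thousands separators, optional decimals.
    matches = re.findall(r"[-+]?\d+(?:,\d{3})*(?:\.\d+)?|\d+\.\d+", str(text))
    if not matches:
        return None
    try:
        return float(matches[-1].replace(",", ""))
    except ValueError:
        return None

print(extract_last_number("So the total cost is 1,234.50 dollars."))  # 1234.5
print(extract_last_number("#### 72"))                                 # 72.0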

AFlowGSM8K

AFlowGSM8K(path: str = None, mode: str = 'all', **kwargs)

Bases: GSM8K

AFlow-specific implementation of GSM8K benchmark.

This class extends the GSM8K benchmark with features specific to the AFlow framework, including loading from AFlow-formatted data files and supporting asynchronous evaluation for workflows.

Attributes:

path: Path to the directory containing AFlow-formatted GSM8K files.
mode: Data loading mode ("train", "dev", "test", or "all").
_train_data (Optional[List[dict]]): Training dataset loaded from AFlow format.
_dev_data (Optional[List[dict]]): Development dataset loaded from AFlow format.
_test_data (Optional[List[dict]]): Test dataset loaded from AFlow format.

Source code in evoagentx/benchmark/gsm8k.py
def __init__(self, path: str = None, mode: str = "all", **kwargs):
    path = os.path.expanduser(path or "~/.evoagentx/data/aflow/gsm8k")
    super().__init__(path=path, mode=mode, **kwargs)

MBPP

MBPP(path: str = None, mode: str = 'all', timeout: int = 60, k: Union[int, list] = 1, **kwargs)

Bases: CodingBenchmark

Benchmark class for evaluating code generation on the MBPP dataset.

MBPP (Mostly Basic Python Programming) is a collection of Python programming problems designed to test a model's ability to generate functionally correct code from natural language descriptions. This class handles loading the dataset, evaluating solutions, and computing metrics such as pass@k.

The original MBPP format is transformed to be compatible with the HumanEval benchmark format, allowing for consistent evaluation infrastructure.

Each MBPP example has the following structure:

{
    "task_id" (int): 2,
    "prompt" (str): "Write a function to find the shared elements from the given two lists.",
    "code" (str): "def similar_elements(test_tup1, test_tup2):\n    res = tuple(set(test_tup1) & set(test_tup2))\n    return (res)\n",
    "test_imports": [],
    "test_list" (List[str]): [
        'assert set(similar_elements((3, 4, 5, 6),(5, 7, 4, 10))) == set((4, 5))',
        'assert set(similar_elements((1, 2, 3, 4),(5, 4, 3, 7))) == set((3, 4))',
        'assert set(similar_elements((11, 12, 14, 13),(17, 15, 14, 13))) == set((13, 14))'
    ]
}

Attributes:

k: An integer or list of integers specifying which pass@k metrics to compute.

Source code in evoagentx/benchmark/mbpp.py
def __init__(self, path: str = None, mode: str = "all", timeout: int = 60, k: Union[int, list] = 1,**kwargs):
    path = os.path.expanduser(path or "~/.evoagentx/data/mbpp")
    self.k = k 
    super().__init__(name=type(self).__name__, path=path, mode=mode, timeout=timeout, **kwargs)

evaluate

evaluate(prediction: Any, label: Any) -> dict

Evaluate the solution code.

Parameters:

prediction (str | List[str]): The solution code(s). Required.
label (dict | List[dict]): The unit test code(s). Required.

Returns:

dict: The evaluation metrics (pass@k).

Source code in evoagentx/benchmark/mbpp.py
def evaluate(self, prediction: Any, label: Any) -> dict:
    """
    Evaluate the solution code.

    Args:
        prediction (str | List[str]): The solution code(s).
        label (dict | List[dict]): The unit test code(s).

    Returns:
        dict: The evaluation metrics (pass@k).
    """
    prediction, label = self._check_evaluation_inputs(prediction, label)

    results = []
    for solution in prediction:
        solution_states = []
        for label_data in label:
            task_id = label_data["task_id"]
            prompt = self.get_example_by_id(task_id)["prompt"]
            unit_test = label_data["test"]
            entry_point = label_data["entry_point"]
            state, message = self.check_solution(
                task_id=task_id, 
                solution=prompt + "\n" + solution,
                test=unit_test, 
                entry_point=entry_point
            )
            if state != self.SUCCESS:
                break 
            solution_states.append(state)
        results.append(len(solution_states)==len(label) and all(state==self.SUCCESS for state in solution_states))

    k_list = [self.k] if isinstance(self.k, int) else self.k
    pass_at_k = self.compute_pass_at_k(results, k_list)

    return pass_at_k
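
compute_pass_at_k is inherited from CodingBenchmark and is not shown on this page. For reference, the widely used unbiased pass@k estimator (introduced with HumanEval) can be sketched as follows; treat it as an illustration rather than the exact implementation:

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k from n sampled solutions, c of which are correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 10 sampled solutions per problem, 3 of which pass all unit tests:
print(round(pass_at_k(n=10, c=3, k=1), 3))  # 0.3
print(round(pass_at_k(n=10, c=3, k=5), 3))  # 0.917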

AFlowMBPP

AFlowMBPP(path: str = None, mode: str = 'all', timeout: int = 60, k: Union[int, list] = 1, **kwargs)

Bases: MBPP

AFlow-specific implementation of MBPP benchmark.

Source code in evoagentx/benchmark/mbpp.py
def __init__(self, path: str = None, mode: str = "all", timeout: int = 60, k: Union[int, list] = 1,**kwargs):
    path = os.path.expanduser(path or "~/.evoagentx/data/aflow/mbpp")
    super().__init__(path=path, mode=mode, timeout=timeout, k=k, **kwargs)

evaluate

evaluate(prediction: Any, label: Any) -> dict

Evaluate the solution code.

Parameters:

prediction (str | List[str]): The solution code(s). Required.
label (dict | List[dict]): The unit test code(s). Required.

Returns:

dict: The evaluation metrics (pass@k).

Source code in evoagentx/benchmark/mbpp.py
def evaluate(self, prediction: Any, label: Any) -> dict:
    """
    Evaluate the solution code.

    Args:
        prediction (str | List[str]): The solution code(s).
        label (dict | List[dict]): The unit test code(s).

    Returns:
        dict: The evaluation metrics (pass@k).
    """
    prediction, label = self._check_evaluation_inputs(prediction, label)

    results = []
    for solution in prediction:
        solution_states = []
        for label_data in label:
            task_id = label_data["task_id"]
            prompt = self.get_example_by_id(task_id)["prompt"]
            unit_test = label_data["test"]
            entry_point = label_data["entry_point"]
            state, message = self.check_solution(
                task_id=task_id, 
                solution=prompt + "\n" + solution,
                test=unit_test, 
                entry_point=entry_point,
                use_entrypoint_as_input=False
            )
            if state != self.SUCCESS:
                break 
            solution_states.append(state)
        results.append(len(solution_states)==len(label) and all(state==self.SUCCESS for state in solution_states))

    k_list = [self.k] if isinstance(self.k, int) else self.k
    pass_at_k = self.compute_pass_at_k(results, k_list)

    return pass_at_k

MATH

MATH(path: str = None, mode: str = 'all', **kwargs)

Bases: Benchmark

Benchmark class for evaluating mathematical reasoning on the MATH dataset.

MATH is a dataset of challenging competition mathematics problems, spanning various difficulty levels and subject areas. This class handles loading the dataset, extracting answers, evaluating solutions through symbolic and numerical comparisons, and computing accuracy metrics.

The dataset includes problems across 7 subject areas (Algebra, Geometry, etc.) and 5 difficulty levels. Each problem contains LaTeX-formatted questions and solutions.

Each MATH example has the following structure:

{
    "id": "test-1",
    "problem": "the problem",
    "solution": "the solution",
    "level": "Level 1",  # one of "Level 1" through "Level 5", or "Level ?"
    "type": "Algebra"    # one of 'Algebra', 'Geometry', 'Intermediate Algebra', 'Counting & Probability', 'Precalculus', 'Number Theory', 'Prealgebra'
}

The benchmark evaluates answers using symbolic math equality checking and numerical approximation to handle equivalent mathematical expressions.

Source code in evoagentx/benchmark/math.py
def __init__(self, path: str = None, mode: str = "all", **kwargs):
    path = os.path.expanduser(path or "~/.evoagentx/data/math")
    super().__init__(name=type(self).__name__, path=path, mode=mode, **kwargs)
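
The answer extraction and comparison helpers are implemented elsewhere in the class and are not shown on this page. A minimal sketch of the symbolic-plus-numerical idea with sympy, assuming answers have already been extracted into plain expression strings rather than raw LaTeX, might look like:

from sympy import simplify, sympify

def answers_match(pred: str, gold: str, tol: float = 1e-6) -> bool:
    try:
        p, g = sympify(pred), sympify(gold)
        if simplify(p - g) == 0:               # symbolic equality, e.g. "1/2" vs "2/4"
            return True
        return abs(float(p) - float(g)) < tol  # numerical fallback for float-vs-exact forms
    except Exception:
        # Fall back to a plain string comparison when parsing or evaluation fails.
        return pred.strip() == gold.strip()

print(answers_match("1/2", "0.5"))         # True
print(answers_match("sqrt(2)", "2**0.5"))  # True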

HumanEval

HumanEval(path: str = None, mode: str = 'all', timeout: int = 60, k: Union[int, list] = 1, **kwargs)

Bases: CodingBenchmark

Benchmark class for evaluating code generation on HumanEval.

HumanEval is a collection of Python programming problems designed to test
a model's ability to generate functionally correct code from natural language
descriptions. This class handles loading the dataset, evaluating solutions,
and computing metrics such as pass@k.

Each HumanEval example has the following structure:

{
    "task_id": "HumanEval/0",
    "prompt": "from typing import List\n\ndef func_name(args, **kwargs) -> return_type:\n    \"function description\"\n",
    "entry_point": "func_name",
    "canonical_solution": "canonical solution (code)",
    "test": "METADATA = {xxx}\n\ndef check(candidate):\n    assert candidate(inputs) == output\n"
}

Attributes:

k: An integer or list of integers specifying which pass@k metrics to compute.
Source code in evoagentx/benchmark/humaneval.py
def __init__(self, path: str = None, mode: str = "all", timeout: int = 60, k: Union[int, list] = 1, **kwargs):
    path = os.path.expanduser(path or "~/.evoagentx/data/humaneval")
    self.k = k 
    super().__init__(name=type(self).__name__, path=path, mode=mode, timeout=timeout, **kwargs)

handle_special_cases

handle_special_cases(task_id: str, solution: str, test: str) -> tuple

Handle special cases for HumanEval.

Source code in evoagentx/benchmark/humaneval.py
def handle_special_cases(self, task_id: str, solution: str, test: str) -> tuple:
    """
    Handle special cases for HumanEval. Returns the (possibly adjusted) solution and test.
    """
    if task_id == "HumanEval/50":
        solution = (
            '\n\ndef encode_shift(s: str):\n    """\n    returns encoded string by shifting every character by 5 in the alphabet.\n    """\n    return "".join([chr(((ord(ch) + 5 - ord("a")) % 26) + ord("a")) for ch in s])\n\n\n'
            + solution
        )
        return solution, test 

    return super().handle_special_cases(task_id=task_id, solution=solution, test=test)

evaluate

evaluate(prediction: Any, label: Any) -> dict

Evaluate the solution code.

Parameters:

prediction (str | List[str]): The solution code(s). Required.
label (dict | List[dict]): The unit test code(s). Required.

Returns:

dict: The evaluation metrics (pass@k).

Source code in evoagentx/benchmark/humaneval.py
def evaluate(self, prediction: Any, label: Any) -> dict:
    """
    Evaluate the solution code.

    Args:
        prediction (str | List[str]): The solution code(s).
        label (dict | List[dict]): The unit test code(s).

    Returns:
        dict: The evaluation metrics (pass@k).
    """
    prediction, label = self._check_evaluation_inputs(prediction, label)

    results = []
    for solution in prediction:
        solution_states = []
        for label_data in label:
            task_id = label_data["task_id"]
            prompt = self.get_example_by_id(task_id)["prompt"]
            unit_test = label_data["test"]
            entry_point = label_data["entry_point"]
            state, message = self.check_solution(
                task_id=task_id, 
                solution=prompt + solution,
                test=unit_test, 
                entry_point=entry_point
            )
            if state != self.SUCCESS:
                break 
            solution_states.append(state)
        results.append(len(solution_states)==len(label) and all(state==self.SUCCESS for state in solution_states))

    k_list = [self.k] if isinstance(self.k, int) else self.k
    pass_at_k = self.compute_pass_at_k(results, k_list)

    return pass_at_k
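
A hypothetical usage sketch, assuming the HumanEval data is already available at the default path and that get_example_by_id returns the raw example dict shown above:

bench = HumanEval(mode="test", k=1)

# Build the label from a stored example; the prediction is the function body that
# evaluate() appends directly after the example's prompt.
example = bench.get_example_by_id("HumanEval/0")
label = {
    "task_id": example["task_id"],
    "test": example["test"],
    "entry_point": example["entry_point"],
}
solution = example["canonical_solution"]  # reference solution used as a stand-in prediction

print(bench.evaluate(prediction=solution, label=label))  # expected: {"pass@1": 1.0}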

AFlowHumanEval

AFlowHumanEval(path: str = None, mode: str = 'all', timeout: int = 60, k: Union[int, list] = 1, **kwargs)

Bases: HumanEval

AFlow-specific implementation of HumanEval benchmark.

Source code in evoagentx/benchmark/humaneval.py
def __init__(self, path: str = None, mode: str = "all", timeout: int = 60, k: Union[int, list] = 1, **kwargs):
    path = os.path.expanduser(path or "~/.evoagentx/data/aflow/humaneval")
    super().__init__(path=path, mode=mode, timeout=timeout, k=k, **kwargs)

extract_test_cases_with_entry_point

extract_test_cases_with_entry_point(entry_point: str)

Extract test cases with the given entry point.

Source code in evoagentx/benchmark/humaneval.py
def extract_test_cases_with_entry_point(self, entry_point: str):
    """
    Extract test cases with the given entry point.
    """

    hardcoded_cases = {
        "find_zero": "",
        "decode_cyclic": "",
        "decode_shift": "",
        "by_length": "",
        "add": "",
        "triangle_area": "",
        "correct_bracketing": "",
        "solve": "",
        "sum_squares": "",
        "starts_one_ends": "",
    }
    if entry_point in hardcoded_cases:
        return hardcoded_cases[entry_point]

    for case in self._test_cases:
        if case["entry_point"] == entry_point:
            return case["test"]

    return None

LiveCodeBench

LiveCodeBench(path: str = None, mode: str = 'all', timeout: int = 60, k: Union[int, list] = 1, num_process: int = 6, scenario: str = 'code_generation', version: str = 'release_latest', start_date: str = None, end_date: str = None, use_cot_for_execution: bool = False, **kwargs)

Bases: CodingBenchmark

Benchmark class for evaluating LLM capabilities on real-world programming tasks.

LiveCodeBench provides a framework for evaluating different scenarios of code-related tasks:

1. Code Generation: generating code from problem descriptions
2. Test Output Prediction: predicting test outputs given test code
3. Code Execution: generating code that executes correctly

The benchmark supports different evaluation modes, metrics, and can be customized with various parameters like timeouts, sample dates, and processing options.

Attributes:

k: An integer or list of integers specifying which pass@k metrics to compute.
version: Release version of the dataset to use.
num_process: Number of processes to use for evaluation.
start_date: Filter problems to those after this date.
end_date: Filter problems to those before this date.
scenario: Type of programming task to evaluate ("code_generation", "test_output_prediction", or "code_execution").
use_cot_for_execution: Whether to use chain-of-thought processing for code execution.

Source code in evoagentx/benchmark/livecodebench.py
def __init__(
    self, 
    path: str = None, 
    mode: str = "all", 
    timeout: int = 60, 
    k: Union[int, list] = 1, 
    num_process: int = 6, 
    scenario: str = "code_generation", 
    version: str = "release_latest", 
    start_date: str = None, 
    end_date: str = None, 
    use_cot_for_execution: bool = False, 
    **kwargs
):
    path = os.path.expanduser(path or "~/.evoagentx/data/livecodebench")
    self.k = k 
    self.version = version
    self.num_process = num_process
    self.start_date = start_date
    self.end_date = end_date
    self.scenario = scenario 
    self.use_cot_for_execution = use_cot_for_execution
    assert scenario in VALID_SCENARIO, f"Invalid scenario: {scenario}. Available choices: {VALID_SCENARIO}." 
    super().__init__(name=type(self).__name__, path=path, mode=mode, timeout=timeout, **kwargs)
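
A hypothetical instantiation using only the constructor parameters documented above; evaluation then goes through evaluate, shown below:

# Evaluate code generation on problems from a chosen release window,
# reporting pass@1 and pass@5 with 6 worker processes.
bench = LiveCodeBench(
    mode="test",
    scenario="code_generation",
    version="release_latest",
    start_date="2024-01-01",  # illustrative date filters; the expected string format is an assumption
    end_date="2024-06-01",
    k=[1, 5],
    num_process=6,
    timeout=60,
)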

evaluate

evaluate(prediction: Any, label: Any) -> dict

Evaluate the solution code.

Parameters:

prediction (str | List[str]): The solution code(s). Required.
label (dict | List[dict]): The test cases and expected outputs. Required.

Returns:

dict: The evaluation metrics (pass@k).

Source code in evoagentx/benchmark/livecodebench.py
def evaluate(self, prediction: Any, label: Any) -> dict:
    """
    Evaluate the solution code.

    Args:
        prediction (str | List[str]): The solution code(s).
        label (dict | List[dict]): The test cases and expected outputs. 

    Returns:
        dict: The evaluation metrics (pass@k).
    """
    prediction, label = self._check_evaluation_inputs(prediction, label)
    k_list = [self.k] if isinstance(self.k, int) else self.k

    if self.scenario == "code_generation":
        solutions: List[str] = [extract_code_blocks(pred)[0] for pred in prediction]
        metrics, results, metadatas = codegen_metrics(
            samples_list=label, # label is already a list 
            generations_list=[solutions], # for a single example. 
            k_list=k_list, 
            num_process_evaluate=self.num_process,
            timeout=self.timeout
        )

    elif self.scenario == "test_output_prediction":
        pred_outputs = [extract_test_output_code(pred) for pred in prediction]
        metrics, results = test_output_metrics(
            samples=label, 
            generations=[pred_outputs], 
            k_list=k_list, 
        )
    elif self.scenario == "code_execution":
        pred_outputs = [extract_execution_code(pred, self.use_cot_for_execution) for pred in prediction]
        metrics, results = code_execution_metrics(
            samples=label, 
            generations=[pred_outputs], 
        )
    else:
        raise ValueError(f"Invalid scenario: {self.scenario}. Available choices: {VALID_SCENARIO}.")

    pass_at_k = {f"pass@{k}": float(metrics[f"pass@{k}"]) for k in k_list}
    return pass_at_k