
🧪 Benchmark

evoagentx.benchmark

NQ

NQ(path: str = None, mode: str = 'all', **kwargs)

Bases: Benchmark

Benchmark class for evaluating question answering on Natural Questions dataset.

Natural Questions (NQ) is a dataset for open-domain question answering, containing real questions from Google Search and answers from Wikipedia. This class handles loading the dataset, evaluating answers, and computing metrics like exact match and F1 score.

Each NQ example has the following structure:

{
    "id": str,
    "question": str,
    "answers": List[str]
}

The benchmark evaluates answers using exact match, F1 score, and accuracy metrics.

Source code in evoagentx/benchmark/nq.py
def __init__(self, path: str = None, mode: str = "all", **kwargs):
    path = os.path.expanduser(path or "~/.evoagentx/data/nq")
    super().__init__(name=type(self).__name__, path=path, mode=mode, **kwargs)
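
The exact match and F1 implementations live in shared QA utilities and are not reproduced on this page. As an illustration of the standard SQuAD-style token-level metrics (a sketch, not necessarily the exact code used here), scoring a prediction against the gold answers list could look like:

from collections import Counter

def exact_match(prediction: str, answers: list) -> float:
    # Exact match after light normalization, taken against any of the gold answers.
    normalize = lambda s: " ".join(s.lower().split())
    return float(any(normalize(prediction) == normalize(answer) for answer in answers))

def f1_score(prediction: str, answers: list) -> float:
    def single_f1(pred: str, gold: str) -> float:
        pred_tokens, gold_tokens = pred.lower().split(), gold.lower().split()
        overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
        if overlap == 0:
            return 0.0
        precision, recall = overlap / len(pred_tokens), overlap / len(gold_tokens)
        return 2 * precision * recall / (precision + recall)
    # Keep the best token-overlap score across all gold answers.
    return max(single_f1(prediction, answer) for answer in answers)

print(exact_match("Abraham Lincoln", ["Abraham Lincoln", "Lincoln"]))        # 1.0
print(round(f1_score("president Abraham Lincoln", ["Abraham Lincoln"]), 3))  # 0.8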

HotPotQA

HotPotQA(path: str = None, mode: str = 'all', **kwargs)

Bases: Benchmark

Benchmark class for evaluating multi-hop question answering on HotPotQA dataset.

Each HotPotQA example has the following structure:

{
    "_id": str,
    "question": str,
    "answer": str,
    "context": [["context_title", ["context_sentence", "another_sentence"]]],
    "supporting_facts": [["supporting_title", supporting_sentence_index]],
    "type": str,
    "level": str
}

The benchmark evaluates answers using exact match, F1 score, and accuracy metrics.

Source code in evoagentx/benchmark/hotpotqa.py
def __init__(self, path: str = None, mode: str = "all", **kwargs):
    path = os.path.expanduser(path or "~/.evoagentx/data/hotpotqa")
    super().__init__(name=type(self).__name__, path=path, mode=mode, **kwargs)
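
For reference, here is a hypothetical walk-through of a single example following the structure documented above; the field values are illustrative and the dictionary is built inline rather than loaded through the benchmark:

# Each supporting fact is a (title, sentence_index) pair that indexes into the
# sentences listed under the matching context title.
example = {
    "_id": "example-id",
    "question": "Which magazine was started first, Arthur's Magazine or First for Women?",
    "answer": "Arthur's Magazine",
    "context": [
        ["Arthur's Magazine", [
            "Arthur's Magazine was an American literary periodical.",
            "It was published in Philadelphia in the 19th century.",
        ]],
    ],
    "supporting_facts": [["Arthur's Magazine", 0]],
    "type": "comparison",
    "level": "medium",
}

sentences_by_title = {title: sentences for title, sentences in example["context"]}
for title, sentence_index in example["supporting_facts"]:
    print(title, "->", sentences_by_title[title][sentence_index])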

AFlowHotPotQA

AFlowHotPotQA(path: str = None, mode: str = 'all', **kwargs)

Bases: HotPotQA

AFlow-specific implementation of HotPotQA benchmark.

Source code in evoagentx/benchmark/hotpotqa.py
def __init__(self, path: str = None, mode: str = "all", **kwargs):
    path = os.path.expanduser(path or "~/.evoagentx/data/hotpotqa")
    super().__init__(name=type(self).__name__, path=path, mode=mode, **kwargs)

GSM8K

GSM8K(path: str = None, mode: str = 'all', **kwargs)

Bases: Benchmark

Benchmark class for evaluating math reasoning on GSM8K dataset.

GSM8K (Grade School Math 8K) is a dataset of math word problems that test a model's ability to solve grade school level math problems requiring multi-step reasoning. This class handles loading the dataset, evaluating solutions, and computing metrics based on answer accuracy.

Each GSM8K example has the following structure:

{
    "id": "test-1",
    "question": "the question",
    "answer": "the answer"
}

The benchmark evaluates answers by extracting the final numerical value and comparing it to the ground truth answer.

Source code in evoagentx/benchmark/gsm8k.py
def __init__(self, path: str = None, mode: str = "all", **kwargs):
    path = os.path.expanduser(path or "~/.evoagentx/data/gsm8k")
    super().__init__(name=type(self).__name__, path=path, mode=mode, **kwargs)

extract_last_number

extract_last_number(text: str) -> float

Extract the last number from a text.

Source code in evoagentx/benchmark/gsm8k.py
def extract_last_number(self, text: str) -> float:
    """
    Extract the last number from a text.
    """
    matches = regex.findall(r"[-+]?\d+(?:,\d{3})*(?:\.\d+)?|\d+\.\d+", str(text))
    if matches:
        last_number = matches[-1].replace(",", "").strip()
        try:
            last_number = float(last_number)
            return last_number
        except ValueError:
            return None
    return None
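
For illustration only, the same extraction can be reproduced as a standalone helper with the stdlib re module (the method above uses the third-party regex package, whose findall behaves the same for this pattern):

import re

def extract_last_number(text: str):
    # Same pattern as above: optional sign, optional thousands separators, optional decimals.
    matches = re.findall(r"[-+]?\d+(?:,\d{3})*(?:\.\d+)?|\d+\.\d+", str(text))
    if not matches:
        return None
    try:
        return float(matches[-1].replace(",", ""))
    except ValueError:
        return None

print(extract_last_number("So the total cost is 1,234.50 dollars."))  # 1234.5
print(extract_last_number("#### 72"))                                 # 72.0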

AFlowGSM8K

AFlowGSM8K(path: str = None, mode: str = 'all', **kwargs)

Bases: GSM8K

AFlow-specific implementation of GSM8K benchmark.

This class extends the GSM8K benchmark with features specific to the AFlow framework, including loading from AFlow-formatted data files and supporting asynchronous evaluation for workflows.

Attributes:

path: Path to the directory containing AFlow-formatted GSM8K files.
mode: Data loading mode ("train", "dev", "test", or "all").
_train_data (Optional[List[dict]]): Training dataset loaded from AFlow format.
_dev_data (Optional[List[dict]]): Development dataset loaded from AFlow format.
_test_data (Optional[List[dict]]): Test dataset loaded from AFlow format.

Source code in evoagentx/benchmark/gsm8k.py
def __init__(self, path: str = None, mode: str = "all", **kwargs):
    path = os.path.expanduser(path or "~/.evoagentx/data/aflow/gsm8k")
    super().__init__(path=path, mode=mode, **kwargs)

MBPP

MBPP(path: str = None, mode: str = 'all', timeout: int = 60, k: Union[int, list] = 1, **kwargs)

Bases: CodingBenchmark

Benchmark class for evaluating code generation on the MBPP dataset.

MBPP (Mostly Basic Python Programming) is a collection of Python programming problems designed to test a model's ability to generate functionally correct code from natural language descriptions. This class handles loading the dataset, evaluating solutions, and computing metrics such as pass@k.

The original MBPP format is transformed to be compatible with the HumanEval benchmark format, allowing for consistent evaluation infrastructure.

Each MBPP example has the following structure:

{
    "task_id" (int): 2,
    "prompt" (str): "Write a function to find the shared elements from the given two lists.",
    "code" (str): "def similar_elements(test_tup1, test_tup2):\n    res = tuple(set(test_tup1) & set(test_tup2))\n    return (res)\n",
    "test_imports": [],
    "test_list" (List[str]): [
        'assert set(similar_elements((3, 4, 5, 6),(5, 7, 4, 10))) == set((4, 5))',
        'assert set(similar_elements((1, 2, 3, 4),(5, 4, 3, 7))) == set((3, 4))',
        'assert set(similar_elements((11, 12, 14, 13),(17, 15, 14, 13))) == set((13, 14))'
    ]
}

Attributes:

k: An integer or list of integers specifying which pass@k metrics to compute.

Source code in evoagentx/benchmark/mbpp.py
def __init__(self, path: str = None, mode: str = "all", timeout: int = 60, k: Union[int, list] = 1,**kwargs):
    path = os.path.expanduser(path or "~/.evoagentx/data/mbpp")
    self.k = k 
    super().__init__(name=type(self).__name__, path=path, mode=mode, timeout=timeout, **kwargs)

evaluate

evaluate(prediction: Any, label: Any) -> dict

Evaluate the solution code.

Parameters:

prediction (str | List[str]): The solution code(s). Required.
label (dict | List[dict]): The unit test code(s). Required.

Returns:

dict: The evaluation metrics (pass@k).

Source code in evoagentx/benchmark/mbpp.py
def evaluate(self, prediction: Any, label: Any) -> dict:
    """
    Evaluate the solution code.

    Args:
        prediction (str | List[str]): The solution code(s).
        label (dict | List[dict]): The unit test code(s).

    Returns:
        dict: The evaluation metrics (pass@k).
    """
    prediction, label = self._check_evaluation_inputs(prediction, label)

    results = []
    for solution in prediction:
        solution_states = []
        for label_data in label:
            task_id = label_data["task_id"]
            prompt = self.get_example_by_id(task_id)["prompt"]
            unit_test = label_data["test"]
            entry_point = label_data["entry_point"]
            state, message = self.check_solution(
                task_id=task_id, 
                solution=prompt + "\n" + solution,
                test=unit_test, 
                entry_point=entry_point
            )
            if state != self.SUCCESS:
                break 
            solution_states.append(state)
        results.append(len(solution_states)==len(label) and all(state==self.SUCCESS for state in solution_states))

    k_list = [self.k] if isinstance(self.k, int) else self.k
    pass_at_k = self.compute_pass_at_k(results, k_list)

    return pass_at_k
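
compute_pass_at_k is inherited from CodingBenchmark and is not shown on this page. For reference, the widely used unbiased pass@k estimator (introduced with HumanEval) can be sketched as follows; treat it as an illustration rather than the exact implementation:

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k from n sampled solutions, c of which are correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 10 sampled solutions per problem, 3 of which pass all unit tests:
print(round(pass_at_k(n=10, c=3, k=1), 3))  # 0.3
print(round(pass_at_k(n=10, c=3, k=5), 3))  # 0.917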

AFlowMBPP

AFlowMBPP(path: str = None, mode: str = 'all', timeout: int = 60, k: Union[int, list] = 1, **kwargs)

Bases: MBPP

AFlow-specific implementation of MBPP benchmark.

Source code in evoagentx/benchmark/mbpp.py
def __init__(self, path: str = None, mode: str = "all", timeout: int = 60, k: Union[int, list] = 1,**kwargs):
    path = os.path.expanduser(path or "~/.evoagentx/data/aflow/mbpp")
    super().__init__(path=path, mode=mode, timeout=timeout, k=k, **kwargs)

evaluate

evaluate(prediction: Any, label: Any) -> dict

Evaluate the solution code.

Parameters:

prediction (str | List[str]): The solution code(s). Required.
label (dict | List[dict]): The unit test code(s). Required.

Returns:

dict: The evaluation metrics (pass@k).

Source code in evoagentx/benchmark/mbpp.py
def evaluate(self, prediction: Any, label: Any) -> dict:
    """
    Evaluate the solution code.

    Args:
        prediction (str | List[str]): The solution code(s).
        label (dict | List[dict]): The unit test code(s).

    Returns:
        dict: The evaluation metrics (pass@k).
    """
    prediction, label = self._check_evaluation_inputs(prediction, label)

    results = []
    for solution in prediction:
        solution_states = []
        for label_data in label:
            task_id = label_data["task_id"]
            prompt = self.get_example_by_id(task_id)["prompt"]
            unit_test = label_data["test"]
            entry_point = label_data["entry_point"]
            state, message = self.check_solution(
                task_id=task_id, 
                solution=prompt + "\n" + solution,
                test=unit_test, 
                entry_point=entry_point,
                use_entrypoint_as_input=False
            )
            if state != self.SUCCESS:
                break 
            solution_states.append(state)
        results.append(len(solution_states)==len(label) and all(state==self.SUCCESS for state in solution_states))

    k_list = [self.k] if isinstance(self.k, int) else self.k
    pass_at_k = self.compute_pass_at_k(results, k_list)

    return pass_at_k

MATH

MATH(path: str = None, mode: str = 'all', **kwargs)

Bases: Benchmark

Benchmark class for evaluating mathematical reasoning on the MATH dataset.

MATH is a dataset of challenging competition mathematics problems, spanning various difficulty levels and subject areas. This class handles loading the dataset, extracting answers, evaluating solutions through symbolic and numerical comparisons, and computing accuracy metrics.

The dataset includes problems across 7 subject areas (Algebra, Geometry, etc.) and 5 difficulty levels. Each problem contains LaTeX-formatted questions and solutions.

Each MATH example has the following structure:

{
    "id": "test-1",
    "problem": "the problem",
    "solution": "the solution",
    "level": "Level 1",  # one of "Level 1" through "Level 5", or "Level ?"
    "type": "Algebra"    # one of 'Algebra', 'Geometry', 'Intermediate Algebra', 'Counting & Probability', 'Precalculus', 'Number Theory', 'Prealgebra'
}

The benchmark evaluates answers using symbolic math equality checking and numerical approximation to handle equivalent mathematical expressions.

Source code in evoagentx/benchmark/math.py
def __init__(self, path: str = None, mode: str = "all", **kwargs):
    path = os.path.expanduser(path or "~/.evoagentx/data/math")
    super().__init__(name=type(self).__name__, path=path, mode=mode, **kwargs)
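
The answer extraction and comparison helpers are implemented elsewhere in the class and are not shown on this page. A minimal sketch of the symbolic-plus-numerical idea with sympy, assuming answers have already been extracted into plain expression strings rather than raw LaTeX, might look like:

from sympy import simplify, sympify

def answers_match(pred: str, gold: str, tol: float = 1e-6) -> bool:
    try:
        p, g = sympify(pred), sympify(gold)
        if simplify(p - g) == 0:               # symbolic equality, e.g. "1/2" vs "2/4"
            return True
        return abs(float(p) - float(g)) < tol  # numerical fallback for float-vs-exact forms
    except Exception:
        # Fall back to a plain string comparison when parsing or evaluation fails.
        return pred.strip() == gold.strip()

print(answers_match("1/2", "0.5"))         # True
print(answers_match("sqrt(2)", "2**0.5"))  # True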

HumanEval

HumanEval(path: str = None, mode: str = 'all', timeout: int = 60, k: Union[int, list] = 1, **kwargs)

Bases: CodingBenchmark

Benchmark class for evaluating code generation on HumanEval.

HumanEval is a collection of Python programming problems designed to test
a model's ability to generate functionally correct code from natural language
descriptions. This class handles loading the dataset, evaluating solutions,
and computing metrics such as pass@k.

Each HumanEval example has the following structure:

{
    "task_id": "HumanEval/0",
    "prompt": "from typing import List\n\ndef func_name(args, **kwargs) -> return_type:\n    \"function description\"\n",
    "entry_point": "func_name",
    "canonical_solution": "canonical solution (code)",
    "test": "METADATA = {xxx}\n\ndef check(candidate):\n    assert candidate(inputs) == output\n"
}

Attributes:

k: An integer or list of integers specifying which pass@k metrics to compute.
Source code in evoagentx/benchmark/humaneval.py
def __init__(self, path: str = None, mode: str = "all", timeout: int = 60, k: Union[int, list] = 1, **kwargs):
    path = os.path.expanduser(path or "~/.evoagentx/data/humaneval")
    self.k = k 
    super().__init__(name=type(self).__name__, path=path, mode=mode, timeout=timeout, **kwargs)

handle_special_cases

handle_special_cases(task_id: str, solution: str, test: str) -> tuple

Handle special cases for HumanEval.

Source code in evoagentx/benchmark/humaneval.py
def handle_special_cases(self, task_id: str, solution: str, test: str) -> tuple:
    """
    Handle special cases for HumanEval. Returns the (possibly adjusted) solution and test.
    """
    if task_id == "HumanEval/50":
        solution = (
            '\n\ndef encode_shift(s: str):\n    """\n    returns encoded string by shifting every character by 5 in the alphabet.\n    """\n    return "".join([chr(((ord(ch) + 5 - ord("a")) % 26) + ord("a")) for ch in s])\n\n\n'
            + solution
        )
        return solution, test 

    return super().handle_special_cases(task_id=task_id, solution=solution, test=test)

evaluate

evaluate(prediction: Any, label: Any) -> dict

Evaluate the solution code.

Parameters:

prediction (str | List[str]): The solution code(s). Required.
label (dict | List[dict]): The unit test code(s). Required.

Returns:

dict: The evaluation metrics (pass@k).

Source code in evoagentx/benchmark/humaneval.py
def evaluate(self, prediction: Any, label: Any) -> dict:
    """
    Evaluate the solution code.

    Args:
        prediction (str | List[str]): The solution code(s).
        label (dict | List[dict]): The unit test code(s).

    Returns:
        dict: The evaluation metrics (pass@k).
    """
    prediction, label = self._check_evaluation_inputs(prediction, label)

    results = []
    for solution in prediction:
        solution_states = []
        for label_data in label:
            task_id = label_data["task_id"]
            prompt = self.get_example_by_id(task_id)["prompt"]
            unit_test = label_data["test"]
            entry_point = label_data["entry_point"]
            state, message = self.check_solution(
                task_id=task_id, 
                solution=prompt + solution,
                test=unit_test, 
                entry_point=entry_point
            )
            if state != self.SUCCESS:
                break 
            solution_states.append(state)
        results.append(len(solution_states)==len(label) and all(state==self.SUCCESS for state in solution_states))

    k_list = [self.k] if isinstance(self.k, int) else self.k
    pass_at_k = self.compute_pass_at_k(results, k_list)

    return pass_at_k
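
A hypothetical usage sketch, assuming the HumanEval data is already available at the default path and that get_example_by_id returns the raw example dict shown above:

bench = HumanEval(mode="test", k=1)

# Build the label from a stored example; the prediction is the function body that
# evaluate() appends directly after the example's prompt.
example = bench.get_example_by_id("HumanEval/0")
label = {
    "task_id": example["task_id"],
    "test": example["test"],
    "entry_point": example["entry_point"],
}
solution = example["canonical_solution"]  # reference solution used as a stand-in prediction

print(bench.evaluate(prediction=solution, label=label))  # expected: {"pass@1": 1.0}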

AFlowHumanEval

AFlowHumanEval(path: str = None, mode: str = 'all', timeout: int = 60, k: Union[int, list] = 1, **kwargs)

Bases: HumanEval

AFlow-specific implementation of HumanEval benchmark.

Source code in evoagentx/benchmark/humaneval.py
def __init__(self, path: str = None, mode: str = "all", timeout: int = 60, k: Union[int, list] = 1, **kwargs):
    path = os.path.expanduser(path or "~/.evoagentx/data/aflow/humaneval")
    super().__init__(path=path, mode=mode, timeout=timeout, k=k, **kwargs)

extract_test_cases_with_entry_point

extract_test_cases_with_entry_point(entry_point: str)

Extract test cases with the given entry point.

Source code in evoagentx/benchmark/humaneval.py
def extract_test_cases_with_entry_point(self, entry_point: str):
    """
    Extract test cases with the given entry point.
    """

    hardcoded_cases = {
        "find_zero": "",
        "decode_cyclic": "",
        "decode_shift": "",
        "by_length": "",
        "add": "",
        "triangle_area": "",
        "correct_bracketing": "",
        "solve": "",
        "sum_squares": "",
        "starts_one_ends": "",
    }
    if entry_point in hardcoded_cases:
        return hardcoded_cases[entry_point]

    for case in self._test_cases:
        if case["entry_point"] == entry_point:
            return case["test"]

    return None

LiveCodeBench

LiveCodeBench(path: str = None, mode: str = 'all', timeout: int = 60, k: Union[int, list] = 1, num_process: int = 6, scenario: str = 'code_generation', version: str = 'release_latest', start_date: str = None, end_date: str = None, use_cot_for_execution: bool = False, **kwargs)

Bases: CodingBenchmark

Benchmark class for evaluating LLM capabilities on real-world programming tasks.

LiveCodeBench provides a framework for evaluating different scenarios of code-related tasks:

1. Code Generation: generating code from problem descriptions
2. Test Output Prediction: predicting test outputs given test code
3. Code Execution: generating code that executes correctly

The benchmark supports different evaluation modes, metrics, and can be customized with various parameters like timeouts, sample dates, and processing options.

Attributes:

k: An integer or list of integers specifying which pass@k metrics to compute.
version: Release version of the dataset to use.
num_process: Number of processes to use for evaluation.
start_date: Filter problems to those after this date.
end_date: Filter problems to those before this date.
scenario: Type of programming task to evaluate ("code_generation", "test_output_prediction", or "code_execution").
use_cot_for_execution: Whether to use chain-of-thought processing for code execution.

Source code in evoagentx/benchmark/livecodebench.py
def __init__(
    self, 
    path: str = None, 
    mode: str = "all", 
    timeout: int = 60, 
    k: Union[int, list] = 1, 
    num_process: int = 6, 
    scenario: str = "code_generation", 
    version: str = "release_latest", 
    start_date: str = None, 
    end_date: str = None, 
    use_cot_for_execution: bool = False, 
    **kwargs
):
    path = os.path.expanduser(path or "~/.evoagentx/data/livecodebench")
    self.k = k 
    self.version = version
    self.num_process = num_process
    self.start_date = start_date
    self.end_date = end_date
    self.scenario = scenario 
    self.use_cot_for_execution = use_cot_for_execution
    assert scenario in VALID_SCENARIO, f"Invalid scenario: {scenario}. Available choices: {VALID_SCENARIO}." 
    super().__init__(name=type(self).__name__, path=path, mode=mode, timeout=timeout, **kwargs)
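
A hypothetical instantiation using only the constructor parameters documented above; evaluation then goes through evaluate, shown below:

# Evaluate code generation on problems from a chosen release window,
# reporting pass@1 and pass@5 with 6 worker processes.
bench = LiveCodeBench(
    mode="test",
    scenario="code_generation",
    version="release_latest",
    start_date="2024-01-01",  # illustrative date filters; the expected string format is an assumption
    end_date="2024-06-01",
    k=[1, 5],
    num_process=6,
    timeout=60,
)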

evaluate

evaluate(prediction: Any, label: Any) -> dict

Evaluate the solution code.

Parameters:

prediction (str | List[str]): The solution code(s). Required.
label (dict | List[dict]): The test cases and expected outputs. Required.

Returns:

dict: The evaluation metrics (pass@k).

Source code in evoagentx/benchmark/livecodebench.py
def evaluate(self, prediction: Any, label: Any) -> dict:
    """
    Evaluate the solution code.

    Args:
        prediction (str | List[str]): The solution code(s).
        label (dict | List[dict]): The test cases and expected outputs. 

    Returns:
        dict: The evaluation metrics (pass@k).
    """
    prediction, label = self._check_evaluation_inputs(prediction, label)
    k_list = [self.k] if isinstance(self.k, int) else self.k

    if self.scenario == "code_generation":
        solutions: List[str] = [extract_code_blocks(pred)[0] for pred in prediction]
        metrics, results, metadatas = codegen_metrics(
            samples_list=label, # label is already a list 
            generations_list=[solutions], # for a single example. 
            k_list=k_list, 
            num_process_evaluate=self.num_process,
            timeout=self.timeout
        )

    elif self.scenario == "test_output_prediction":
        pred_outputs = [extract_test_output_code(pred) for pred in prediction]
        metrics, results = test_output_metrics(
            samples=label, 
            generations=[pred_outputs], 
            k_list=k_list, 
        )
    elif self.scenario == "code_execution":
        pred_outputs = [extract_execution_code(pred, self.use_cot_for_execution) for pred in prediction]
        metrics, results = code_execution_metrics(
            samples=label, 
            generations=[pred_outputs], 
        )
    else:
        raise ValueError(f"Invalid scenario: {self.scenario}. Available choices: {VALID_SCENARIO}.")

    pass_at_k = {f"pass@{k}": float(metrics[f"pass@{k}"]) for k in k_list}
    return pass_at_k