Benchmarks

fast-rlm includes evaluation scripts under benchmarks/ for testing against standard long-context datasets.

Setup

Install benchmark dependencies:

uv sync --extra benchmarks

This adds the datasets library from Hugging Face.

Available Benchmarks

LongBench (NarrativeQA)

Dataset: THUDM/LongBench — a multi-task benchmark for long context understanding.

What it tests: Reading comprehension over long narratives. The agent receives a full story plus a question, and must find the answer by exploring the text through its REPL.

uv run benchmarks/longbench_benchmark.py
Full source: benchmarks/longbench_benchmark.py
import fast_rlm
from datasets import load_dataset

ds = load_dataset("THUDM/LongBench",
                  "narrativeqa",
                  split="test",
                  trust_remote_code=True)
idx = 140

example = ds[idx]

query = f"""
{example['input']}

{example['context']}
"""

data = fast_rlm.run(query, prefix=f"longbench_narrativeqa_idx{idx}")
print("Expected answer: ", example['answers'])

To test a different example, change idx:

idx = 100  # try different indices
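The script prints the expected answers but leaves the comparison to the reader. A minimal sketch of one way to score a run is token-level F1 against each reference answer, taking the best match. The helper names (`normalize`, `token_f1`, `best_f1`) are illustrative and not part of fast_rlm, and this is not claimed to match any official LongBench scorer.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and English articles, split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall between two strings."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def best_f1(prediction: str, references: list[str]) -> float:
    """Score against every reference answer and keep the highest F1."""
    return max((token_f1(prediction, r) for r in references), default=0.0)
```

Assuming `data["results"]` holds the agent's final answer as text, a call might look like `best_f1(str(data["results"]), example["answers"])`.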

Oolong Synth

Dataset: oolongbench/oolong-synth — synthetic long-context tasks including timeline ordering, user tracking, and counting.

What it tests: Precise information extraction from very long synthetic contexts. Tasks include tracking timelines, counting occurrences, and following user actions across large documents.

uv run benchmarks/oolong_synth_benchmark.py
Full source: benchmarks/oolong_synth_benchmark.py
import fast_rlm
from datasets import load_dataset

ds = load_dataset("oolongbench/oolong-synth",
                  split="test")
idx = 100

example = ds[idx]
print(example['answer'])

query = f"""
{example['context_window_text_with_labels']}

{example['question']}
"""

data = fast_rlm.run(query, prefix=f"oolong_synth_idx{idx}")
print("Expected answer: ", example['answer'])

You can filter by task type:

# Available task groups: 'timeline', 'user', 'counting'
ds = ds.filter(lambda x: x['task_group'] == 'counting')
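Once the dataset is filtered, it can be useful to run several examples and report a single number. The sketch below assumes that `answer` can be compared as a plain string and that the caller supplies `answer_fn`, a hypothetical wrapper that takes a query and returns the agent's final answer as text (for instance, a thin function around `fast_rlm.run`). The field names are the ones used in the script above.

```python
from typing import Callable, Iterable, Mapping


def build_query(example: Mapping) -> str:
    # Same layout as the script above: context first, then the question.
    return (
        f"\n{example['context_window_text_with_labels']}\n\n"
        f"{example['question']}\n"
    )


def exact_match_rate(
    examples: Iterable[Mapping],
    answer_fn: Callable[[str], str],  # hypothetical: query -> answer text
) -> float:
    """Fraction of examples whose answer matches exactly, ignoring case and outer whitespace."""
    total = 0
    correct = 0
    for example in examples:
        predicted = answer_fn(build_query(example))
        expected = str(example["answer"])
        correct += predicted.strip().lower() == expected.strip().lower()
        total += 1
    return correct / total if total else 0.0
```

To keep cost bounded, pass a small slice such as `ds.select(range(5))` rather than the whole split. Exact match is strict; a looser comparison may suit free-form answers better.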

Adding New Benchmarks

Create a new file in benchmarks/. The pattern is simple:

import fast_rlm
from datasets import load_dataset

# 1. Load a dataset
ds = load_dataset("your/dataset", split="test")

# 2. Pick an example
example = ds[0]

# 3. Build a query (question + context)
query = f"{example['question']}\n\n{example['context']}"

# 4. Run it
result = fast_rlm.run(query, prefix="my_benchmark")

# 5. Compare
print("Got:", result["results"])
print("Expected:", example["answer"])
print("Cost:", result["usage"]["cost"])

The usage field in every result gives you per-run cost and token tracking, useful for comparing efficiency across models and configurations.
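When comparing configurations over several runs, the per-run costs can be rolled up into a small summary. This is a sketch that relies only on the `result["usage"]["cost"]` field shown above and assumes it is numeric; `summarize_cost` is an illustrative helper, not a fast_rlm function.

```python
from typing import Mapping, Sequence


def summarize_cost(results: Sequence[Mapping]) -> dict:
    """Total and mean cost over a list of result dicts returned by separate runs."""
    costs = [r["usage"]["cost"] for r in results]
    total = sum(costs)
    return {
        "runs": len(costs),
        "total_cost": total,
        "mean_cost": total / len(costs) if costs else 0.0,
    }
```

Collect the return value of each `fast_rlm.run` call in a list, then pass that list to `summarize_cost` once per model or configuration.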