Get Started with Evaluation

This guide helps you set up your first evaluation. If you want to understand what evaluation is and why it matters, check out the Evaluation Overview first. For details on concepts like scores, datasets, and experiments, see Core Concepts.

Get API keys

  1. Create a Langfuse account or self-host Langfuse.
  2. Create new API credentials in the project settings.

Set up your AI agent

Use the Langfuse Skill in your editor's agent mode to automatically set up evaluations for your application.

What is a Skill? A reusable instruction package for AI coding agents. It gives your agent Langfuse-specific workflows and best practices out of the box.

Install the Langfuse Skill in your coding tool:

Langfuse has a Cursor Plugin that includes the skill automatically.

Claude Code stores its skills in a .claude/skills directory; you can install skills either globally or per project.

Copy the Langfuse Skill to your local Claude skills folder. We recommend using a symlink to keep the skill up to date.

You can do this using npm (skills CLI):

npx skills add langfuse/skills --skill "langfuse" --agent "claude-code"
Alternatively, you can do this manually:
  1. Clone the repo somewhere stable:
git clone https://github.com/langfuse/skills.git /path/to/langfuse-skills
  2. Make sure the Claude skills directory exists (common location):
mkdir -p ~/.claude/skills
  3. Symlink the skill folder:
ln -s /path/to/langfuse-skills/skills/langfuse ~/.claude/skills/langfuse

Codex stores its skills in a .agents/skills directory; you can install skills either globally or per project. See the Codex docs: Where to save skills.

Copy the Langfuse Skill to your local Codex skills folder. We recommend using a symlink to keep the skill up to date.

You can do this using npm (skills CLI):

npx skills add langfuse/skills --skill "langfuse" --agent "codex"
Alternatively, you can do this manually:
  1. Clone the repo somewhere stable:
git clone https://github.com/langfuse/skills.git /path/to/langfuse-skills
  2. Make sure the Codex skills directory exists (common location):
mkdir -p ~/.agents/skills
  3. Symlink the skill folder:
ln -s /path/to/langfuse-skills/skills/langfuse ~/.agents/skills/langfuse

For other AI coding agents, the skill folder location (<agent-skill-root>) depends on your tool; the npm command below installs to the correct location automatically.

Install via npm (skills CLI):

npx skills add langfuse/skills --skill "langfuse"

If you want to target a specific agent directly:

npx skills add langfuse/skills --skill "langfuse" --agent "<agent-id>"
Alternatively, you can do this manually:
  1. Clone the repo somewhere stable:
git clone https://github.com/langfuse/skills.git /path/to/langfuse-skills
  2. Make sure your agent's skills directory exists:
mkdir -p /path/to/<agent-skill-root>/skills
  3. Symlink the skill folder:
ln -s /path/to/langfuse-skills/skills/langfuse /path/to/<agent-skill-root>/skills/langfuse

Set up evals

Start a new agent session, then prompt it to set up evaluations:

"Set up Langfuse evaluations for this application. Help me choose the right evaluation approach and implement it."

The agent will analyze your codebase, recommend the best evaluation method, and help you implement it.

Pick your starting point

Different teams need different evaluation approaches. Pick the one that matches what you want to do right now — you can always add more later.

Not sure which to pick? Here's a rule of thumb:

  • Already have traces in Langfuse? Start with Monitor Production — you'll get scores on your existing data within minutes.
  • Building something new or changing prompts? Start with Test Before Shipping — create a dataset and run experiments to validate changes.
  • Need ground truth or expert review? Start with Human Review — build a labeled dataset from real traces.

Monitor Production

Use LLM-as-a-Judge to automatically evaluate live traces. An LLM scores your application's outputs against criteria you define — no code changes required.

Prerequisites: Traces flowing into Langfuse and an LLM connection configured.

Create an evaluator

Navigate to Evaluators in the sidebar and click + Set up Evaluator. Choose a managed evaluator (e.g., Hallucination, Helpfulness) or write your own evaluation prompt.

Select your target data

Choose Live Observations to evaluate individual operations (recommended) or Live Traces to evaluate complete workflows. Add filters to target specific operations — for example, only evaluate observations named chat-response.

Map variables and activate

Map the evaluator's variables (like {{input}} and {{output}}) to the corresponding fields in your traces. Preview how the evaluation prompt looks with real data, then save.

New matching traces will be scored automatically. Check the Scores tab on any trace to see results.


Test Before Shipping

Run your application against a fixed dataset and evaluate the outputs. This is how you catch regressions before deploying.

Prerequisites: Langfuse SDK installed (Python v3+ or JS/TS v4+).

Define test data

Start with a few representative inputs and expected outputs. You can use local data or create a dataset in Langfuse.
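As a minimal sketch, local test data is just a list of input/expected-output pairs. The key names here match the Python experiment example in this guide; add a quick shape check so a malformed item fails before the experiment runs:

```python
# A handful of representative test cases kept in plain Python.
# "input" is fed to the task; "expected_output" is passed to evaluators.
test_data = [
    {"input": "What is the capital of France?", "expected_output": "Paris"},
    {"input": "What is the capital of Germany?", "expected_output": "Berlin"},
    {"input": "What is the capital of Italy?", "expected_output": "Rome"},
]

# Sanity-check the shape before running an experiment against it.
for item in test_data:
    assert {"input", "expected_output"} <= item.keys()
```

A few hand-picked cases like this are enough to start; you can promote them to a Langfuse dataset later.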

Run an experiment

Use the experiment runner SDK to execute your application against every test case and optionally score the results.

from langfuse import get_client, Evaluation
from langfuse.openai import OpenAI

langfuse = get_client()

def my_task(*, item, **kwargs):
    response = OpenAI().chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": item["input"]}],
    )
    return response.choices[0].message.content

def check_answer(*, output, expected_output, **kwargs):
    is_correct = expected_output.lower() in output.lower()
    return Evaluation(name="correctness", value=1.0 if is_correct else 0.0)

result = langfuse.run_experiment(
    name="My First Experiment",
    data=[
        {"input": "What is the capital of France?", "expected_output": "Paris"},
        {"input": "What is the capital of Germany?", "expected_output": "Berlin"},
    ],
    task=my_task,
    evaluators=[check_answer],
)

print(result.format())
import { OpenAI } from "openai";
import { NodeSDK } from "@opentelemetry/sdk-node";
import { LangfuseClient, ExperimentItem } from "@langfuse/client";
import { observeOpenAI } from "@langfuse/openai";
import { LangfuseSpanProcessor } from "@langfuse/otel";

const otelSdk = new NodeSDK({ spanProcessors: [new LangfuseSpanProcessor()] });
otelSdk.start();

const langfuse = new LangfuseClient();

const testData: ExperimentItem[] = [
  { input: "What is the capital of France?", expectedOutput: "Paris" },
  { input: "What is the capital of Germany?", expectedOutput: "Berlin" },
];

const myTask = async (item: ExperimentItem) => {
  const response = await observeOpenAI(new OpenAI()).chat.completions.create({
    model: "gpt-4.1",
    messages: [{ role: "user", content: item.input as string }],
  });
  return response.choices[0].message.content;
};

const checkAnswer = async ({ output, expectedOutput }) => ({
  name: "correctness",
  value: expectedOutput && output.toLowerCase().includes(expectedOutput.toLowerCase()) ? 1.0 : 0.0,
});

const result = await langfuse.experiment.run({
  name: "My First Experiment",
  data: testData,
  task: myTask,
  evaluators: [checkAnswer],
});

console.log(await result.format());
await otelSdk.shutdown();
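
The check_answer / checkAnswer evaluators above use exact substring matching, which is brittle against casing, punctuation, and extra whitespace. A slightly more forgiving variant — a plain-Python sketch, shown here without the SDK's Evaluation wrapper — normalizes both strings before comparing:

```python
import re

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()

def fuzzy_correctness(*, output: str, expected_output: str, **kwargs) -> float:
    """Return 1.0 if the normalized expected answer appears in the output."""
    return 1.0 if normalize(expected_output) in normalize(output) else 0.0

# "Paris." with trailing punctuation still counts as correct.
print(fuzzy_correctness(output="The capital is Paris.", expected_output="paris"))  # → 1.0
```

To use it with the experiment runner, wrap the returned float in an Evaluation object as shown in the Python example above.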

Review results

The experiment runner prints a summary table. If you used a Langfuse dataset, results are also available in the Langfuse UI under Datasets where you can compare runs side by side.


Human Review

Set up annotation queues so domain experts can review traces and add scores manually. This is the best way to build ground truth data and calibrate automated evaluators.

Prerequisites: Traces in Langfuse and at least one score config.

Create a score config

Go to Settings → Score Configs and create a config that defines what you want to measure. For example, a categorical config with values correct, partially_correct, and incorrect.
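
Score configs are created in the UI; the plain-Python sketch below (field names are illustrative, not the Langfuse API) just shows the idea — a categorical config pairs each label with a numeric value so annotations can later be aggregated:

```python
# Illustrative only: each category label maps to a numeric value
# so labeled reviews can be averaged into a single quality metric.
score_config = {
    "name": "correctness",
    "data_type": "CATEGORICAL",
    "categories": [
        {"label": "correct", "value": 1.0},
        {"label": "partially_correct", "value": 0.5},
        {"label": "incorrect", "value": 0.0},
    ],
}

# Aggregating four hypothetical reviews into one metric.
labels = ["correct", "correct", "partially_correct", "incorrect"]
values = {c["label"]: c["value"] for c in score_config["categories"]}
avg = sum(values[label] for label in labels) / len(labels)
print(avg)  # → 0.625
```

Choosing numeric values up front makes it easy to track annotation quality over time.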

Create an annotation queue

Navigate to Annotation Queues and click New Queue. Give it a name, attach your score config, and optionally assign team members.

Add traces and start reviewing

Select traces from the Traces table and click Actions → Add to queue. Open the queue and work through items — score each one, add comments, then click Complete + next.

Next steps

Now that you have your first evaluation running, here are some recommended next steps.

Looking for something specific? Check the Evaluation Methods and Experiments sections for detailed guides.
