How-to guides: Evaluation
This section contains how-to guides related to evaluation.
📄️ Evaluate an LLM Application
Learn how to evaluate an LLM application on a dataset using the LangSmith SDK.
📄️ Bind an evaluator to a dataset in the UI
While you can specify evaluators to grade the results of your experiments programmatically (see this guide for more information), you can also bind evaluators to a dataset in the UI.
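For comparison, here is a minimal sketch of the programmatic approach using the Python SDK's `evaluate` helper; the target function, dataset name, and evaluator below are placeholders you would replace with your own.

```python
from langsmith.evaluation import evaluate


def correctness(run, example) -> dict:
    # Toy evaluator: compare the run's output to the reference output.
    # Swap in your own grading logic (e.g. an LLM-as-a-judge call).
    predicted = (run.outputs or {}).get("output")
    expected = (example.outputs or {}).get("answer")
    return {"key": "correctness", "score": int(predicted == expected)}


def my_app(inputs: dict) -> dict:
    # Placeholder target; call your LLM application here.
    return {"output": "Paris"}


results = evaluate(
    my_app,
    data="my-dataset",         # an existing dataset name (placeholder)
    evaluators=[correctness],  # evaluators specified programmatically
    experiment_prefix="programmatic-evaluators",
)
```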
📄️ Run an evaluation from the prompt playground
While you can kick off experiments easily using the SDK, as outlined here, it's often useful to run experiments directly in the prompt playground.
📄️ Evaluate on intermediate steps
While in many scenarios it is sufficient to evaluate the final output of your task, in some cases you might want to evaluate the intermediate steps of your pipeline.
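As a rough sketch of what this looks like with the Python SDK: a custom evaluator receives the full run tree, so it can inspect intermediate (child) runs rather than only the final output. The step name `retrieve_docs` and the `documents` output key are hypothetical, and this assumes child runs are populated on the run passed to the evaluator.

```python
from langsmith.schemas import Example, Run


def retrieved_any_documents(run: Run, example: Example) -> dict:
    # Look for a (hypothetical) retrieval step among the run's children.
    retrieval_runs = [
        child for child in (run.child_runs or []) if child.name == "retrieve_docs"
    ]
    if not retrieval_runs:
        return {"key": "retrieved_any_documents", "score": 0}
    # Score based on the intermediate step's output, not the final answer.
    docs = (retrieval_runs[0].outputs or {}).get("documents", [])
    return {"key": "retrieved_any_documents", "score": int(len(docs) > 0)}
```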
📄️ Use LangChain off-the-shelf evaluators (Python only)
Learn how to use LangChain's off-the-shelf evaluators to grade your experiment results (Python SDK only).
📄️ Compare experiment results
When you are iterating on your LLM application (for example, changing the model or the prompt), you will often want to compare the results of different experiments.
📄️ Evaluate an existing experiment
Currently, `evaluate_existing` is only supported in the Python SDK.
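A minimal sketch of adding evaluators to an experiment that has already run (the experiment name and evaluator are placeholders):

```python
from langsmith.evaluation import evaluate_existing


def always_pass(run, example) -> dict:
    # Placeholder evaluator applied retroactively to the existing runs.
    return {"key": "always_pass", "score": 1}


# Re-grade a previously run experiment, referenced by name or ID.
results = evaluate_existing("my-experiment-name", evaluators=[always_pass])
```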
📄️ Test LLM applications (Python only)
LangSmith functional tests are assertions and expectations designed to quickly identify obvious bugs and regressions in your AI system. Relative to evaluations, tests are typically designed to be fast and cheap to run, focusing on specific functionality and edge cases.
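As a rough illustration, a functional test might look like the sketch below. The `unit` decorator and `expect` helpers shown here come from recent versions of the Python SDK and may differ by release; `my_chatbot` is a placeholder for your application.

```python
from langsmith import expect, unit


def my_chatbot(question: str) -> str:
    # Placeholder for your real application code.
    return "The capital of France is Paris."


@unit  # Logs the test's inputs, outputs, and pass/fail status to LangSmith.
def test_knows_basic_geography() -> None:
    answer = my_chatbot("What is the capital of France?")
    # Hard assertion: the test fails fast on an obvious regression.
    assert "Paris" in answer
    # Expectation helper: logs the result as feedback on the test run.
    expect(answer).to_contain("capital")
```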
📄️ Run pairwise evaluations
Learn how to run pairwise evaluations that compare the outputs of two experiments against each other.
📄️ Audit evaluator scores
LLM-as-a-judge evaluators don't always get it right. Because of this, it is often useful for a human to manually audit the scores left by an evaluator and correct them where necessary. LangSmith allows you to make corrections on evaluator scores in the UI or SDK.
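A hedged sketch of the SDK path, assuming you already know the run whose evaluator feedback you want to audit (the run ID and feedback key below are placeholders):

```python
from langsmith import Client

client = Client()
run_id = "00000000-0000-0000-0000-000000000000"  # placeholder run ID

# Find the evaluator feedback attached to the run and record a human correction.
for feedback in client.list_feedback(run_ids=[run_id]):
    if feedback.key == "correctness":  # placeholder evaluator key
        client.update_feedback(
            feedback.id,
            correction={"score": 1},  # the score a human reviewer believes is right
        )
```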
📄️ Create few-shot evaluators
Using LLM-as-a-judge evaluators can be very helpful when you can't evaluate your system programmatically. However, improving and iterating on these prompts can add unnecessary overhead to your development process.
📄️ Fetch performance metrics for an experiment
Tracing projects and experiments use the same underlying data structure in our backend, which is called a "session."
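Because an experiment is just a session, you can read it back like any other tracing project. A rough sketch with the Python SDK; the `include_stats` flag and the exact names of the aggregate fields (`run_count`, `feedback_stats`, and so on) are assumptions that may vary by SDK version.

```python
from langsmith import Client

client = Client()

# Experiments and tracing projects are both "sessions", so read the
# experiment by name as if it were a project.
project = client.read_project(
    project_name="my-experiment-name",  # placeholder experiment name
    include_stats=True,                 # assumption: ask for aggregate metrics
)

# Assumed aggregate fields: run counts, latency, tokens, feedback statistics.
print(project.run_count, project.feedback_stats)
```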
📄️ Run evals with the REST API
Learn how to run evaluations directly against the LangSmith REST API.
📄️ Upload experiments run outside of LangSmith with the REST API
Some users prefer to manage their datasets and run their experiments outside of LangSmith, but want to use the LangSmith UI to view the results. This is supported via our `/datasets/upload-experiment` endpoint.
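A rough sketch of calling that endpoint with `requests`. The base URL assumes the default US cloud instance, and the request body is illustrative only (the field names are assumptions); consult the API reference for the exact schema.

```python
import os

import requests

resp = requests.post(
    # Assumes the default API host; the endpoint path is /datasets/upload-experiment.
    "https://api.smith.langchain.com/api/v1/datasets/upload-experiment",
    headers={"x-api-key": os.environ["LANGSMITH_API_KEY"]},
    json={
        # Illustrative payload; see the API reference for the real schema.
        "experiment_name": "my-external-experiment",
        "experiment_description": "Experiment run outside of LangSmith",
        "dataset_name": "my-external-dataset",
        "results": [
            {
                "row_id": "row-0",
                "inputs": {"question": "What is the capital of France?"},
                "expected_outputs": {"answer": "Paris"},
                "actual_outputs": {"answer": "Paris"},
            }
        ],
    },
)
resp.raise_for_status()
print(resp.json())
```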