Eval Observability

Tool	Category	Segment	Platform / Tool	Plan / License	Monthly Price USD	Pricing Model	Free Tier / OSS	Included Usage / Limits	Evaluation Capabilities	Observability / Tracing	Prompt / Dataset / Experiment Features	Integrations / Frameworks	Deployment / Hosting	Security / Privacy	Team / Governance	Best Fit	Main Limits / Caveats
LangSmith Developer	Eval Observability	LLM observability and eval platform	LangSmith	Developer	$0/seat/mo then pay-as-you-go	Per-seat plus trace/Fleet usage	✓	1 free seat; up to 5k base traces/month included; 50 Fleet runs/month; community support	Online/offline evals, annotation queues, prompt improvement and monitoring	Tracing for agent execution, run trees, monitoring and alerting	Prompt Hub, Playground, Canvas, datasets and human feedback workflows	LangChain, LangGraph and custom SDK/API integrations	LangSmith Cloud; self/hybrid hosting on Enterprise	Data retention and support are limited on Developer; enterprise has custom hosting/security	1 seat only; community support	Solo developers debugging/evaluating LangChain or LangGraph apps	5k traces/month and one-seat limit make it a prototype tier
LangSmith Plus / Enterprise	Eval Observability	LLM observability and eval platform	LangSmith	Plus / Enterprise	$39/seat/mo Plus; Enterprise custom	Per-seat plus pay-as-you-go trace/Fleet usage	Developer free plan exists	Plus includes 10k base traces/month, one dev-sized agent deployment, 500 Fleet runs/month, unlimited seats and up to 3 workspaces; Enterprise custom	Full online/offline evals, annotation workflows, prompt improvement and monitoring	Production tracing, monitoring/alerting and agent deployment visibility	Prompt Hub, Playground, datasets, annotation queues and Fleet workflows	LangChain, LangGraph, SDK/API and enterprise deployments	Cloud plus Enterprise hybrid/self-hosting options	Enterprise SSO/RBAC, support SLA and alternative hosting	Unlimited seats on Plus; Enterprise custom workspaces/seats/security	Teams operating LangChain/LangGraph apps in production	Trace overage and retention choices can drive cost; advanced hosting requires Enterprise
Langfuse Hobby	Eval Observability	Open-source LLM observability platform	Langfuse	Hobby Cloud / OSS	$0	Freemium units or free self-hosted OSS	✓	Cloud Hobby includes 50k units/month, 30 days data access, 2 users and all platform features with limits; self-hosted full product is open source	Online/offline evaluation, datasets, experiments, scores, LLM-as-judge evaluators and human annotation	LLM/agent tracing, sessions, token/cost tracking, OpenTelemetry and proxy logging	Prompt versioning/fetching/release management, playground and prompt experiments	Python/JS SDKs, OpenTelemetry Java/Go/custom, LiteLLM proxy and framework integrations	Langfuse Cloud or self-hosted Docker/Kubernetes/cloud deployment	Data regions US/EU/JP; Hobby has limited users/support and 30-day data access	2 users on Hobby; GitHub community support	Indie projects and teams wanting an open-source LangSmith alternative	Unit-based pricing counts observations/scores too, not only top-level traces
Langfuse Core / Pro / Enterprise	Eval Observability	Open-source LLM observability platform	Langfuse	Core / Pro / Enterprise	$29/mo Core; $199/mo Pro; $2,499/mo Enterprise	Monthly subscription plus unit overage	Hobby free plan exists	Core/Pro include 100k units/month and $8/100k additional; Core has 90 days data access; Pro has 3 years; Enterprise includes Pro+Teams and enterprise controls	Evaluation datasets, experiments, scores, LLM-as-judge, human annotation queues and external evaluation pipelines	High-volume tracing, token/cost tracking, multimodal beta, proxy/OpenTelemetry logging	Prompt management, prompt experiments, release labels, playground and webhooks/Slack	SDKs, OpenTelemetry, LiteLLM, framework integrations and public API	Cloud or self-host; Enterprise custom volume/marketplace/invoice options	Pro has SOC2/ISO27001 reports and BAA available; Enterprise adds audit logs, SCIM, SLA and dedicated support	Unlimited users from Core; Teams add-on adds SSO/RBAC/support on Pro	Teams needing open-source-friendly observability with predictable cloud tiers	Core/Pro both start with 100k units; Teams add-on and overages can materially change price
Braintrust Starter	Eval Observability	AI eval and observability platform	Braintrust	Starter	$0 platform fee	Usage-based free tier plus overage	✓	1 GB processed data/month then $4/GB; 10k scores/month then $2.50/1k; 14 days retention; $10/month Topics credit	Experiments, scorers, online scoring, eval datasets, human review scorers and sandbox evals by plan	Production logs, tracing, dashboards, topics and monitoring	Prompt playgrounds, datasets, experiments, exports and environment workflows	SDK/API, custom functions, AI provider gateway and app integrations	Braintrust Cloud; self-hosted customers can adjust some system limits	Starter has Google-only SSO, owner-only permission group and no SOC2/DPA/BAA	Unlimited users/projects/datasets in current Starter model, but limited advanced governance	Small teams starting evals without per-seat fees	Processed-data and score overages can appear once usage exceeds free allocation
Braintrust Pro / Enterprise	Eval Observability	AI eval and observability platform	Braintrust	Pro / Enterprise	$249/mo Pro; Enterprise custom	Monthly platform plus usage overage	Starter free tier exists	Pro includes 5 GB processed data/month then $3/GB, 50k scores/month then $1.50/1k, 30 days retention and launch Topics credit; Enterprise custom	Advanced evals, custom charts, environments, dataset snapshots, playground annotations and sandbox evals	Production observability, topics, dashboards, logs and monitoring	Datasets, experiments, prompts, functions, environments and exports	SDK/API, gateway, provider integrations and custom functions	Cloud; enterprise/self-host options by contract	Enterprise adds SAML/OIDC SSO, custom permission groups, retention, exports, SOC2, BAA and custom legal terms	Pro has Owner/engineer/viewer permission groups; Enterprise custom	Growing production teams needing eval/observability plus gateway workflows	Retention is 30 days on Pro; custom retention/SAML/BAA require Enterprise
Arize AX Free / Pro	Eval Observability	Hosted observability and eval platform	Arize AX / Phoenix Cloud	AX Free / AX Pro	$0 Free; $50/mo Pro	Hosted SaaS with span/GB quotas and overages	✓	AX Free: 25k spans/month, 1 GB/month, 15 days retention; AX Pro: 50k spans/month, 10 GB/month, 30 days retention, higher limits and email support	Online/offline evaluations, datasets, experiments, LLM-as-judge/code evals, session/agent path evals and labeling queues	Hosted tracing, product observability, custom metrics, monitors and Alyx agent assistance	Prompt management, prompt serving, prompt environment tags, replay and optimization	SDKs, OpenTelemetry and framework integrations	Hosted SaaS; Enterprise SaaS or self-hosted	AX Free/Pro regions US/EU/CA; Enterprise adds SOC2/HIPAA, SLA, dedicated support and self-host add-on	1 organization on AX Free/Pro; Enterprise custom	Teams wanting hosted Phoenix with simple span/GB pricing	AX Pro span/GB overage is separate; Enterprise required for advanced governance
Helicone Hobby	Eval Observability	LLM request observability	Helicone	Hobby	$0	Free request/storage quota	✓	10,000 requests/month, 1 GB storage, 1 seat and 1 organization	Prompt/request analysis and regression-style evaluation workflows, depending on feature use	Request logging, usage/cost tracking, metrics, caching, alerts/reporting on paid tiers	Prompts, experiments and query language stronger on Pro+	OpenAI-compatible proxy style, provider integrations and app SDK/API workflows	Hosted Helicone; self-host/open-source options should be checked in docs/repo	Free plan has one seat/org and limited storage	1 seat/1 org on Hobby	Indie apps tracking LLM request costs and latency quickly	Free quota is request-limited and storage-limited; team features start paid
Helicone Pro / Team	Eval Observability	LLM request observability	Helicone	Pro / Team	$79/mo Pro; $799/mo Team	Monthly plan plus usage-based pricing	Hobby free plan exists	Pro and Team include 10k free requests plus usage-based pricing; Pro has unlimited seats, alerts/reports and HQL; Team adds 5 organizations and scaling-company features	Evaluation and prompt iteration workflows through request logs, reports and query language	LLM request tracing, usage, cost, latency, alerts, reports and storage	HQL query language, prompt/request analysis and team reporting	Provider proxy/integration workflows for LLM apps	Hosted Helicone	Enterprise adds unlimited orgs and custom terms; Team/Enterprise for broader governance	Unlimited seats on Pro/Team; org count rises from 1 to 5 on Team	Teams needing request-level observability with simple fixed starting price	Usage-based charges apply after included quota; Team price jumps sharply
W&B Free	Eval Observability	AI app and model tracking platform	Weights & Biases / Weave	Free	$0/mo	Free cloud plan for personal/small projects	✓	Free plan includes AI application evaluations/tracing/scorers, experiment tracking, registry/lineage, CI/CD automations, Slack/email alerts, 5 GB storage and 1 GB/month Weave ingestion	AI application evaluations and scorers	Weave tracing for GenAI applications plus W&B experiment tracking	Datasets/registry/lineage and CI/CD automations	W&B SDK, Weave integrations, model tracking and app tracing workflows	Cloud-hosted Free or local personal server; corporate use rules differ for personal self-host	Free lacks enterprise security; academic research gets separate free program	Free is for personal development/small projects	Developers combining ML experiment tracking with LLM app tracing	Corporate/professional team use generally moves to Pro/Enterprise; ingestion/storage limits apply
W&B Pro / Enterprise	Eval Observability	AI app and model tracking platform	Weights & Biases / Weave	Pro / Enterprise	Starts at $60/mo Pro; Enterprise custom	Monthly plan plus storage/Weave ingestion/inference usage	Free plan exists	Pro starts at $60/month, includes up to 10 model seats, 100 GB storage and 1.5 GB/month Weave ingestion; additional storage $0.03/GB and Weave ingestion $0.10/MB	AI app evaluations/scorers plus ML experiment and model evaluation workflows	Weave production/development tracing and W&B experiment observability	Registry, lineage, automations, datasets and team collaboration	W&B SDK, Weave, model/inference integrations	Cloud or enterprise/private-hosted deployment options	Enterprise adds SSO, SCIM, audit logs, HIPAA option, customer-managed encryption and enterprise support	Pro for teams under stated guidelines; Enterprise for compliance/security	Teams already using W&B for ML who need GenAI app tracing	Weave ingestion overage can be expensive at high trace payload volume
Opik OSS / Free Cloud	Eval Observability	Open-source GenAI observability/eval platform	Comet Opik	OSS / Free Cloud	$0	Open source or free hosted cloud	✓	OSS full feature set; Free Cloud up to 10 team members, 25k spans/month and 60-day retention	Test suites, assertions, agent testing, evaluations and prompt/trace analysis	Agent tracing, execution graphs, sessions, token/cost tracking and multimedia logging	Agent Playground, prompts/configuration, datasets/experiments and comments	Python/TypeScript SDKs, public API and MCP server	Self-host OSS or Comet-hosted cloud	Free Cloud has usage limits; OSS data stays self-hosted	Free Cloud up to 10 team members	Teams wanting a very generous free LLM observability/eval stack	Cloud span quota is lower than some competitors; self-hosting requires ops
Opik Pro / Enterprise	Eval Observability	Open-source GenAI observability/eval platform	Comet Opik	Pro Cloud / Enterprise	$19/mo Pro; Enterprise custom	Monthly cloud plan or custom enterprise	OSS and Free Cloud exist	Pro Cloud includes up to 50 team members, 100k spans/month and 60-day retention; Enterprise custom usage and unlimited team members	Test suites/assertions, agent testing, playground and evaluation workflows	Tracing, execution graphs, sessions, token/cost tracking and error surfacing	Prompt/config management, datasets, experiments, annotations and export	SDKs, public API, MCP server and Comet ecosystem	Hosted cloud, self-host OSS or enterprise flexible deployments	Enterprise SSO, dedicated support/SLA and compliance reports	Up to 50 team members on Pro; Enterprise unlimited/custom	Small teams wanting low-cost hosted eval/observability	Pro retention is still 60 days; advanced compliance/deployment requires Enterprise
Promptfoo Community	Eval Observability	LLM security and eval CLI	Promptfoo	Community	$0	Open-source local/self-hosted tool	✓	All LLM evaluation features, all model providers/integrations, red teaming up to 10k probes/month, custom app integration and vulnerability scanning	Prompt/model/RAG evaluations, red teaming, factuality, hallucination and vulnerability testing	Local reports and scans rather than hosted trace observability by default	YAML/config-driven test cases, assertions, model comparison and CI integration	All model providers, custom integrations, CI/CD, app targets and security plugins	Run locally or self-host on own infrastructure	Data stays local/self-hosted in Community; community support	Individual/small team use; no hosted team collaboration	Developers adding eval and red-team tests to CI without SaaS	10k free red-team probes/month; team dashboards/API/cloud require Enterprise
Promptfoo Enterprise / On-Premise	Eval Observability	LLM security and eval platform	Promptfoo	Enterprise / On-Premise	Custom	Custom enterprise subscription/deployment	Community free plan exists	Custom red-team limits, team sharing, continuous monitoring, security/compliance dashboard, SSO, API access, managed cloud or on-prem deployment	Advanced LLM security testing, monitoring, red-teaming and evaluations at org scale	Continuous monitoring and centralized dashboards	Saved targets, attack profiles, API access and organization-specific configs	CI/CD, model providers, app integrations, Promptfoo API and managed/on-prem infrastructure	Managed cloud deployment or on-premise deployment with complete data isolation	SSO, granular permissions, compliance dashboard, support and SLA guarantees	Teams-based access controls and custom roles	Organizations needing formal AI security testing and red-team monitoring	Pricing is custom; advanced cloud/on-prem features unavailable in Community
DeepEval OSS	Eval Observability	Open-source LLM unit testing	DeepEval	Open source framework	$0 software	Open-source local framework; provider API costs separate	✓	Runs local evals/CI; most metrics are LLM-as-judge and default to OpenAI unless configured; can use Anthropic, Gemini, Ollama, Azure OpenAI or custom LLM	LLM unit tests, RAG/agent/multi-turn/safety/MCP metrics, synthetic data and benchmarks	Local testing reports; can integrate with Confident AI for hosted observability	Pytest-like CLI, evaluate(), metrics, datasets and CI/CD workflows	Python framework with provider/model integrations and Confident AI integration	Local/open-source; optional Confident AI cloud	Basic non-identifying telemetry is on by default and can be opted out of; cloud data stored in private AWS per FAQ	OSS used by developers/CI; no team governance locally	Engineering teams adding test assertions to LLM apps	Judge model calls can cost money; dependency/runtime compatibility matters in CI
Confident AI Free / Starter / Premium	Eval Observability	AI quality platform	Confident AI	Free / Starter / Premium	$0 Free; from $19.99/user/mo Starter; from $49.99/user/mo Premium	Per-user plus project and GB-month/eval-run overage	✓	Free: 2 users, 1 project, 5 test runs/week, 1 GB-month trace spans and 1 week retention; Starter/Premium add paid users/projects, online eval metric runs and retention controls	LLM eval benchmark/testing reports, unit/regression tests, online evals and custom metrics	LLM tracing, monitoring, alerts and trace span storage by GB-month	Prompt versioning, cloud dataset annotation, no-code workflows and pre-commit evals on Premium	DeepEval, DeepTeam, OpenTelemetry, TypeScript SDK and APIs	Hosted Confident AI; Enterprise dedicated on-prem available	SOC2/HIPAA/GDPR listed; data stored in private AWS per docs	Free limited to 2 seats/1 project; paid per user/project; Team custom	Teams wanting hosted DeepEval-style eval workflows	Self-serve plan math includes user/project/GB/eval overages; Free test runs are capped
Ragas OSS	Eval Observability	RAG and LLM evaluation framework	Ragas	Open source	$0 software	Open-source Python framework; optional services/consulting separate	✓	Library for systematic evaluation loops, metrics, experiments, datasets and testset generation; no hosted quota on docs page	RAG metrics such as context precision/recall, faithfulness, response relevancy plus agent/tool, SQL and general-purpose metrics	Integrates with observability tools including Arize and LangSmith; not a full tracing SaaS by itself	Experiments, evaluation datasets, metrics, prompt evaluation and test data generation	LangChain, LlamaIndex, Haystack, LangGraph, Gemini, Bedrock, Vertex AI and other integrations	Local Python library; can plug into external observability tools	Data handling depends on your runner/model providers; open-source code visible	No built-in team governance unless integrated with another platform	Teams evaluating RAG quality with standardized metrics	LLM-as-judge/testset generation can incur model costs; no hosted collaboration tier captured
OpenAI Evals OSS	Eval Observability	Open-source eval registry/framework	OpenAI Evals	Open source	$0 software	Open-source framework; model/API costs separate	✓	Framework for evaluating LLMs and LLM systems plus open-source benchmark registry	Benchmark and custom eval workflows for LLM systems	Not a tracing/production observability platform	Eval registry, custom eval definitions and scripts	OpenAI API and Python-based eval workflows; can be adapted to other model calls	Local/open-source repo	Data sent to configured model/API providers; repo license governs source	No team governance; repo/CI handles collaboration	Developers creating repeatable model/system benchmarks	Older/evolving repo; may require adaptation for modern agent app evals
EleutherAI lm-eval OSS	Eval Observability	Open-source model benchmark harness	lm-evaluation-harness	Open source	$0 software	Open-source benchmark harness	✓	Few-shot evaluation harness for language models with many tasks/backends; run costs depend on model backend/API	Standardized model benchmark evaluation and task suites	No production tracing; focused on offline benchmark runs	Task configs, metrics, model adapters and result reporting	Local/HF/vLLM/API-style backends, depending on harness support	Local/open-source	Self-managed data and model access	No SaaS governance	Researchers benchmarking base/instruct models	Best for model benchmarks, not app-level RAG/agent observability
OpenCompass OSS	Eval Observability	Open-source LLM benchmark platform	OpenCompass	Apache-2.0 / open source	$0 software	Open-source platform; API/model costs separate	✓	Supports many models and over 100 datasets; can evaluate open-source and API models with CLI or Python scripts	General/scientific/reasoning benchmarks, LLM judge, math evaluation and long-context benchmarks	No production app tracing; offline/leaderboard-oriented evaluation	Dataset configs, model configs, summarizers and benchmark result workflows	HuggingFace, vLLM, LMDeploy, OpenAI/API, ModelScope and other backends	Local/open-source; leaderboard/community infra separate	Self-managed API keys/data; Apache-2.0 repo	Community/open-source governance	Model evaluation teams needing broad benchmark coverage	Setup/dataset prep can be heavy; not app observability
EvalScope OSS	Eval Observability	Open-source model evaluation framework	EvalScope	Open source	$0 software	Open-source framework; model/API costs separate	✓	Streamlined/customizable framework for efficient LLM, VLM and AIGC evaluation and performance benchmarking	Model and application benchmark/evaluation workflows	No hosted tracing platform by default	Benchmarks, reports and performance testing workflows	ModelScope ecosystem plus local/model backends, depending on configuration	Local/open-source	Self-managed data/model/API usage	No SaaS governance by default	Teams evaluating LLM/VLM/AIGC model performance	Best for benchmark/performance eval, not production trace management
LightEval OSS	Eval Observability	Open-source LLM eval toolkit	Hugging Face LightEval	Open source	$0 software	Open-source toolkit; model/API costs separate	✓	All-in-one toolkit for evaluating LLMs across multiple backends	Offline model benchmark/evaluation toolkit	No production observability/tracing	Task configuration and evaluation reporting	Hugging Face ecosystem and multiple model backends	Local/open-source	Self-managed data/model execution	No team governance unless combined with HF/CI workflows	Researchers and model builders using Hugging Face workflows	Model-centric eval rather than app/RAG/agent observability
TruLens OSS	Eval Observability	Open-source LLM experiment tracking/evaluation	TruLens	Open source	$0 software	Open-source framework; model/API costs separate	✓	Evaluation and tracking for LLM experiments and AI agents	Feedback functions, RAG/agent app evaluation and experiment comparison	Tracking/tracing within local/app workflows; hosted governance depends on external platform	Experiment records, feedback functions, leaderboards and app-level eval workflows	Python ecosystem, LlamaIndex/LangChain-style app integrations	Local/open-source	Self-managed data unless connected to external services	No built-in cloud team governance in OSS row	Developers evaluating RAG/agent apps with feedback functions	Requires custom setup and model/provider calls; not a turnkey SaaS dashboard alone
LLM Comparator OSS	Eval Observability	LLM response comparator	LLM Comparator	Open source	$0 software	Open-source visualization tool	✓	Interactive data visualization tool for evaluating and analyzing LLM responses side-by-side	Side-by-side response comparison and qualitative eval analysis	No request tracing or monitoring	Datasets/outputs visualization rather than prompt registry	Browser/data visualization workflow	Local/open-source	Self-managed datasets/output files	No SaaS governance	Teams comparing model outputs and evaluator disagreement visually	Narrower than full eval suites; needs prepared outputs
agenttrace OSS	Eval Observability	AI coding agent observability	agenttrace	Open source	$0 software	Local-first TUI	✓	Local-first TUI observability for AI coding agents; tracks cost, tokens, tool failures, anomalies, health and CI gates across agent exports per local resource	Evaluation/quality gates for agent sessions and CI evidence	Cost/token/tool-failure observability for Claude Code, Codex CLI, Gemini CLI, Aider and Cursor exports	Session exports and CI gates rather than LLM prompt datasets	Claude Code, Codex CLI, Gemini CLI, Aider and Cursor exports	Local/self-hosted CLI/TUI	Local-first; data stays in workspace unless exported	Governance through local repo/CI policies	Developers monitoring AI coding agent runs and failures	Not a general LLM app observability platform; project maturity depends on repo
PR Triage OSS	Eval Observability	PR quality evaluation	PR Triage	MIT / BYOK	$0 software	Open-source BYOK web app	✓	Open-source PR evaluation tool scoring pull requests on six quality dimensions with diff evidence; bring your own key per local resource	PR/code-quality evaluation and evidence-backed scoring	No runtime LLM app tracing; focused on code review evidence	Score reports over PR diffs rather than prompt/dataset management	Git/PR diff workflows and BYOK model access	Hosted demo/web app or self-host from source if available	BYOK; data exposure depends on where hosted/model provider	No enterprise governance captured	Developers wanting lightweight PR eval reports	Narrow code-review use case; not general model/app eval platform
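
Several rows above quote a base fee plus an included allowance and an overage rate (for example Langfuse Core at $29/mo with 100k units and $8 per additional 100k, or Braintrust Pro at $249/mo with 5 GB then $3/GB and 50k scores then $1.50/1k). The sketch below shows one way to turn those figures into a rough monthly estimate. The `Meter` class, the function names and the usage numbers are hypothetical, and the code assumes overage is billed linearly (pro-rated) rather than in whole blocks, which the table does not state; actual vendor billing may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Meter:
    """One metered dimension of a plan (hypothetical helper, not a vendor API)."""
    included: float       # quantity covered by the base fee
    overage_price: float  # USD charged per `per` extra units
    per: float            # block size the overage price refers to


def overage_cost(meter: Meter, used: float) -> float:
    """USD overage for one meter, ASSUMING linear (pro-rated) billing."""
    extra = max(0.0, used - meter.included)
    return extra / meter.per * meter.overage_price


def monthly_cost(base_fee: float, meters: dict[str, Meter], usage: dict[str, float]) -> float:
    """Base fee plus the overage of every meter; unused meters count as zero usage."""
    total = base_fee + sum(
        overage_cost(meter, usage.get(name, 0.0)) for name, meter in meters.items()
    )
    return round(total, 2)


# Plan figures copied from the table rows above.
LANGFUSE_CORE = (29.0, {"units": Meter(100_000, 8.0, 100_000)})
BRAINTRUST_PRO = (249.0, {"gb": Meter(5, 3.0, 1), "scores": Meter(50_000, 1.5, 1_000)})

# Illustrative usage figures (assumptions, not measurements).
langfuse_estimate = monthly_cost(*LANGFUSE_CORE, {"units": 350_000})
braintrust_estimate = monthly_cost(*BRAINTRUST_PRO, {"gb": 12, "scores": 80_000})

print(f"Langfuse Core:  ${langfuse_estimate:.2f}")
print(f"Braintrust Pro: ${braintrust_estimate:.2f}")
```

Per-seat plans (LangSmith, Confident AI) and add-ons such as the Langfuse Teams add-on are not modeled here; they would need an extra seat-count term on top of the base fee.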