LLM evaluation framework — 14+ metrics
Open-source evaluation framework with 14+ metrics including faithfulness, relevancy, and hallucination detection. Integrates with CI/CD.
Tests, evals, and experiment tracking to measure and improve your AI output quality
AIchitect's Genome scanner detects DeepEval in your project via these signals:
- `deepeval` (package dependency)
- `CONFIDENT_API_KEY` (environment variable)

DeepEval sends evaluation results to Langfuse as trace scores via its Langfuse integration.
→ Quality metrics — faithfulness, hallucination rate, G-Eval scores — visible alongside the raw traces that produced them.
DeepEval uses OpenAI models, accessed via the OpenAI API, as the judge to score generated outputs on metrics such as faithfulness, relevance, and hallucination rate.
→ LLM-as-judge quality metrics powered by GPT-4o — structured, reproducible evaluation scores for any AI output.
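The LLM-as-judge pattern behind these scores can be sketched in plain Python. This is not DeepEval's actual implementation: the prompt wording and the stubbed judge are illustrative, and in practice the `judge` callable would wrap a GPT-4o chat-completion call:

```python
import json
from typing import Callable

JUDGE_PROMPT = """Rate how faithful the answer is to the context on a 0-1 scale.
Context: {context}
Answer: {answer}
Respond with JSON: {{"score": <float>, "reason": "<string>"}}"""

def judge_faithfulness(context: str, answer: str,
                       judge: Callable[[str], str]) -> tuple[float, str]:
    """Score an answer with an LLM judge. `judge` maps a prompt string to
    the model's raw text reply (e.g. a GPT-4o completion)."""
    reply = judge(JUDGE_PROMPT.format(context=context, answer=answer))
    parsed = json.loads(reply)
    score = min(1.0, max(0.0, float(parsed["score"])))  # clamp to [0, 1]
    return score, parsed.get("reason", "")

# Stubbed judge for demonstration; a real one would call the OpenAI API.
stub = lambda prompt: '{"score": 0.9, "reason": "Answer matches context."}'
score, reason = judge_faithfulness("Paris is in France.",
                                   "Paris is a French city.", stub)
```

Forcing the judge to emit structured JSON, then clamping the parsed score, is what makes these metrics machine-readable enough to gate a CI pipeline.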
Add to your GitHub README
[DeepEval on AIchitect](https://aichitect.dev/tool/deepeval)

Explore the full AI landscape
See how DeepEval fits into the bigger picture — browse all 207 tools and their relationships.