LLM evaluation framework — 14+ metrics
Open-source evaluation framework with 14+ metrics including faithfulness, relevancy, and hallucination detection. Integrates with CI/CD.
Tests, evals, and experiment tracking to measure and improve your AI output quality
AIchitect's Genome scanner detects DeepEval in your project via these signals:
- `deepeval` (package dependency)
- `CONFIDENT_API_KEY` (environment variable)

DeepEval sends evaluation results to Langfuse as trace scores via its Langfuse integration.
→ Quality metrics — faithfulness, hallucination rate, G-Eval scores — visible alongside the raw traces that produced them.
DeepEval uses OpenAI models, accessed via the OpenAI API, as the judge to score generated outputs on metrics such as faithfulness, relevance, and hallucination rate.
→ LLM-as-judge quality metrics powered by GPT-4o — structured, reproducible evaluation scores for any AI output.
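The LLM-as-judge pattern behind these scores can be sketched in plain Python. This is not DeepEval's actual implementation: the prompt wording and the stubbed judge are illustrative, and in practice the `judge` callable would wrap a GPT-4o chat-completion call:

```python
import json
from typing import Callable

JUDGE_PROMPT = """Rate how faithful the answer is to the context on a 0-1 scale.
Context: {context}
Answer: {answer}
Respond with JSON: {{"score": <float>, "reason": "<string>"}}"""

def judge_faithfulness(context: str, answer: str,
                       judge: Callable[[str], str]) -> tuple[float, str]:
    """Score an answer with an LLM judge. `judge` maps a prompt string to
    the model's raw text reply (e.g. a GPT-4o completion)."""
    reply = judge(JUDGE_PROMPT.format(context=context, answer=answer))
    parsed = json.loads(reply)
    score = min(1.0, max(0.0, float(parsed["score"])))  # clamp to [0, 1]
    return score, parsed.get("reason", "")

# Stubbed judge for demonstration; a real one would call the OpenAI API.
stub = lambda prompt: '{"score": 0.9, "reason": "Answer matches context."}'
score, reason = judge_faithfulness("Paris is in France.",
                                   "Paris is a French city.", stub)
```

Forcing the judge to emit structured JSON, then clamping the parsed score, is what makes these metrics machine-readable enough to gate a CI pipeline.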
Add to your GitHub README
[DeepEval on AIchitect](https://aichitect.dev/tool/deepeval)

Explore the full AI landscape
See how DeepEval fits into the bigger picture — browse all 207 tools and their relationships.