agent-evaluation - In-Depth Technical Topic Analysis

Giskard-AI / giskard-oss

🐢 Open-Source Evaluation & Testing library for LLM Agents

ai-security mlops fairness-ai responsible-ai ml-validation red-team-tools trustworthy-ai ml-testing llm ai-red-team ai-testing llmops llm-security llm-eval llm-evaluation rag-evaluation agent-evaluation

Updated Jul 29, 2026
Python

Next-generation AI Agent Optimization Platform: Cozeloop addresses challenges in AI agent development by providing full-lifecycle management capabilities from development, debugging, and evaluation to monitoring.

agent open-source playground ai monitoring evaluation openai observability agentops coze langchain llmops prompt-management llm-observability agent-evaluation eino agent-observability

Updated Jul 30, 2026
Go

ifixai-ai / iFixAi

Independent Auditing of AI Agents. Run by a human or by the agent itself to answer the most crucial question in the AI Agent Economy: is the agent doing what it is supposed to do? With iFixAi you can have this answer in less than 120 seconds.

Updated Jul 27, 2026
Python

truera / trulens

Evaluation and Tracking for LLM Experiments and AI Agents

machine-learning neural-networks ai-agents explainable-ml agentops ai-monitoring ai-observability llms llmops llm-eval evals llm-evaluation agent-evaluation

Updated Jul 29, 2026
Python

mozilla-ai / any-agent

A single interface to use and evaluate different agent frameworks

ai mcp agents a2a agent-evaluation

Updated Jul 1, 2026
Python

Ricky-7-Yan / intelligent-audit-system

AuditPilot: auditable enterprise AI agents for evidence-grounded workflows, governed tools, evaluation harnesses, human review, and remediation delivery.

python mcp audit multi-agent knowledge-graph human-in-the-loop rag fastapi ai-agent llmops agentic-rag agent-evaluation agent-runtime evaluation-harness

Updated Jul 28, 2026
Python

benchflow-ai / awesome-evals

A curated, non-BS library of the best resources for building and evaluating AI agents — papers, blogs, talks, tools, benchmarks. Maintained by BenchFlow.

awesome benchmarks awesome-list ai-agents rl-environments llm evals llm-evaluation agent-evaluation

Updated Jul 1, 2026

chirpz-ai / pandaprobe

Open-source agent engineering platform: traces, evals, and metrics to debug and improve your AI agents. Integrates with LangGraph, CrewAI, Claude Agent SDK, and more.

open-source monitoring self-hosted tracing crewai langgraph agentic-ai agent-evaluation agent-engineering openai-agents-sdk agent-observability claude-agent-sdk

Updated Jul 12, 2026
Python

TIGER-AI-Lab / ClawBench

Open-source benchmark for browser AI agents on daily tasks.

Updated Jul 28, 2026
Python

alphadl / AdaRubrics

AdaRubric: Adaptive Dynamic Rubric Evaluator for Agent Trajectories

rubric rlhf reward-model llm-evaluation agent-evaluation

Updated Jun 7, 2026
Python

rungalileo / agent-leaderboard

Ranking LLMs on agentic tasks

ai evaluation ai-agents synthetic-data ai-evaluation llms ai-benchmark agent-evaluation

Updated May 21, 2026
Jupyter Notebook

hwfengcs / DM-Code-Agent

Lightweight, auditable Python code agent (~1500 LOC) — ReAct + Planner + Reflexion + Hybrid RAG, with SWE-bench Lite eval and trace replay.

agent mcp rag llm llm-agent react-agent agent-skills agent-evaluation reflexion-agent code-agent swe-bench

Updated Jul 29, 2026
Python

amitshekhariitbhu / ai-agents-tutorial

Learn AI Agents step by step, from scratch - from function calling to agent loops to multi-agent systems, orchestration, and evaluation.

multi-agent-systems ai-agents ai-agent agent-orchestration agent-evaluation ai-agent-tutorial agent-loop harness-engineering tool-calling-agent

Updated Jul 2, 2026

Raidriar7170 / hermes-skilleval

Verification-gated skill routing and self-improvement harness for Hermes-style agent skills

benchmark retrieval ci python-cli reranking llm-agents agent-evaluation skill-routing release-gate

Updated Jul 21, 2026
Python

hidai25 / eval-view

Regression testing for AI agents. Snapshot behavior, diff tool calls, catch regressions in CI. Works with LangGraph, CrewAI, OpenAI, Anthropic.

python testing cli mcp evaluation pytest regression-testing ai-agents autogen llm anthropic langchain-agent openai-assistants crewai langgraph agentic-ai agent-evaluation agent-benchmark

Updated Jul 26, 2026
Python

samarailly51-pixel / claimpilot-harness

Crash-test insurance claim AI agents before production.

python testing insurance ai-agents prompt-injection llm-evals agent-evaluation

Updated Jul 28, 2026
Python

huangyiminghappy / ai-eval-platform

Open-source AI application evaluation platform for RAG, AI Agents, multi-turn conversations, LLM-as-Judge, endpoint evaluation, evaluation reports, and human blind testing.

rag blind-test human-evaluation ai-infra ai-agent ai-evaluation llm-evaluation rag-evaluation llm-as-judge openai-compatible agent-evaluation multi-turn-conversation

Updated Jul 20, 2026
Python

AMD-AGI / AgentKernelArena

AgentKernelArena provides an end-to-end siloed-benchmarking environment where different LLM-powered agents—such as Cursor Agent, Claude Code, Codex, SWE-agent, and GEAK—can be evaluated side-by-side on the same GPU kernel tasks, using objective and reproducible metrics.

llamas gpu-kernels agent-evaluation

Updated Jul 29, 2026
Python

UiPath / coder_eval

Test that your Claude Code skills, MCP servers, and CLIs actually work when an agent uses them — sandboxed YAML suites, activation checks, A/B experiments, CI gates.

gemini regression-testing codex evaluation-framework claude github-actions anthropic anthropic-claude coding-agents antigravity agent-skills claude-code agent-evaluation swe-bench agent-testing claude-agent-sdk claude-skills claude-code-skills skill-testing

Updated Jul 30, 2026
Python

ray-r-ren / forsy-trace-skill

Open skill for capturing AI agent work as structured traces.

reinforcement-learning process-supervision ai-agents post-training tool-use trajectory-data llm-agents agent-evaluation agent-workflows agent-traces

Updated Jun 6, 2026
Python

agent-evaluation - Technical Topic

Here are 537 public repositories matching this topic...