Live evaluation harness for production agents.
Eval-as-you-go: run regression tests against your live agent traffic, replay failed conversations, and A/B test prompts. The eval LLM is powered by Nebius.
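
To make the three features concrete, here is a minimal Python sketch of how they could fit together. It is not the harness's actual API: `Conversation`, `assign_variant`, `run_regression`, `replay_failed`, and `keyword_judge` are all hypothetical names, and the keyword check only stands in for a call to the eval LLM.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Conversation:
    """One logged exchange from live traffic (hypothetical shape)."""
    conversation_id: str
    user_message: str
    agent_reply: str


def assign_variant(conversation_id: str, variants: List[str]) -> str:
    """Deterministically bucket a conversation into a prompt variant for A/B tests."""
    digest = hashlib.sha256(conversation_id.encode("utf-8")).hexdigest()
    return variants[int(digest, 16) % len(variants)]


def run_regression(
    conversations: List[Conversation],
    judge: Callable[[Conversation], bool],
) -> Dict[str, List[str]]:
    """Score each logged conversation with a judge and group ids by outcome."""
    results: Dict[str, List[str]] = {"passed": [], "failed": []}
    for conv in conversations:
        key = "passed" if judge(conv) else "failed"
        results[key].append(conv.conversation_id)
    return results


def replay_failed(
    conversations: List[Conversation],
    failed_ids: List[str],
    agent: Callable[[str], str],
) -> List[Conversation]:
    """Re-run the agent on the user messages of failed conversations."""
    failed = set(failed_ids)
    return [
        Conversation(c.conversation_id, c.user_message, agent(c.user_message))
        for c in conversations
        if c.conversation_id in failed
    ]


def keyword_judge(conv: Conversation) -> bool:
    """Stand-in for the eval LLM: passes if the reply mentions a refund."""
    return "refund" in conv.agent_reply.lower()


traffic = [
    Conversation("c1", "I want my money back", "I have issued a refund."),
    Conversation("c2", "I want my money back", "Please hold."),
]
report = run_regression(traffic, keyword_judge)
replayed = replay_failed(traffic, report["failed"], lambda msg: "Your refund is on its way.")
rerun_report = run_regression(replayed, keyword_judge)
```

Hashing the conversation id keeps a conversation in the same prompt variant across retries, so A/B comparisons are not skewed by reassignment. In a real setup the judge would call the eval LLM instead of matching a keyword.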