George co-founded Plataformatec (with José Valim and others), the company behind the Elixir programming language.
He now works as a Member of Technical Staff at New Generation, bringing agents to e-commerce with Elixir.
You can’t assert response == "expected" when your LLM rephrases things every time. So how do you actually test AI features?
Tribunal is an open-source evaluation framework for Elixir that brings LLM testing into ExUnit.
It provides two modes for two problems: tests that block your deploys (safety checks, hallucination detection, faithfulness to source context), and evaluations that track quality over time (batch scoring across hundreds of inputs with pass thresholds).
In this talk, I’ll walk through building a real test suite for a RAG pipeline: deterministic assertions for the easy stuff, LLM-as-judge for faithfulness and hallucination, semantic similarity for fuzzy matching, and red team testing to find holes before users do.
You’ll leave with a practical playbook for CI/CD quality gates on LLM features.
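To make the two modes concrete, here is a minimal sketch in plain ExUnit. It does not use Tribunal's API: the EvalSketch module, its function names, the sample responses, and the 0.75 threshold are all hypothetical and chosen only for illustration. It shows a deterministic assertion on a single response and a batch check that passes only when enough responses clear the same check.

```elixir
ExUnit.start()

# Hypothetical helper module for illustration; not part of Tribunal.
defmodule EvalSketch do
  # Deterministic check: true when the response contains every
  # required term, ignoring case.
  def mentions_all?(response, terms) do
    downcased = String.downcase(response)

    Enum.all?(terms, fn term ->
      String.contains?(downcased, String.downcase(term))
    end)
  end

  # Fraction of responses for which `check` returns true.
  def pass_rate(responses, check) when responses != [] do
    Enum.count(responses, check) / length(responses)
  end

  # Batch gate: true when the pass rate reaches the threshold.
  def meets_threshold?(responses, check, threshold) do
    pass_rate(responses, check) >= threshold
  end
end

defmodule RagResponseTest do
  use ExUnit.Case, async: true

  # Made-up sample output standing in for a real LLM response.
  test "a single response mentions the refund window" do
    response = "Refunds are accepted within 30 days of purchase."
    assert EvalSketch.mentions_all?(response, ["refund", "30 days"])
  end

  # Made-up batch; in practice these would come from the pipeline under test.
  test "at least 75% of batch responses mention the refund window" do
    responses = [
      "Refunds are accepted within 30 days.",
      "You can get a refund within 30 days of purchase.",
      "Please contact support.",
      "A refund is available for 30 days after delivery."
    ]

    check = &EvalSketch.mentions_all?(&1, ["refund", "30 days"])
    assert EvalSketch.meets_threshold?(responses, check, 0.75)
  end
end
```

The first test is the kind that could block a deploy on a single bad output. The second tolerates individual misses and fails only when the pass rate drops below the threshold, which suits tracking quality across many inputs.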
Key takeaways:
Target audience: