
Evaluate a support ticket classifier with OpenAI GPT-4o-mini and n8n evaluations

Created by: Elvis Saravia (elvissaravia)

Last update: 15 hours ago

Measure how well your AI classifier actually performs. This template shows how to evaluate a support ticket classifier using n8n's built-in evaluation system, comparing AI predictions against expected labels with exact match scoring.
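As a rough sketch of what exact-match scoring means here (the function and field names are illustrative, not n8n's internals): a prediction counts as correct only when every compared field, such as category and urgency, matches the expected label exactly.

```python
def exact_match(predicted: dict, expected: dict, fields=("category", "urgency")) -> bool:
    """Score a prediction as correct only if every compared field
    matches the expected label exactly (whitespace- and case-insensitive)."""
    return all(
        str(predicted.get(f, "")).strip().lower() == str(expected.get(f, "")).strip().lower()
        for f in fields
    )

# The classifier got the category right but under-called the urgency,
# so the whole test case scores as a miss.
pred = {"category": "billing", "urgency": "low"}
gold = {"category": "billing", "urgency": "high"}
print(exact_match(pred, gold))  # False
```

This all-or-nothing scoring is deliberately strict: a ticket routed to the right queue at the wrong priority is still a failure worth catching.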

What you'll do

1. Open the workflow and review the production path: a webhook receives a ticket, the AI Agent classifies it by category and urgency, and the response is returned.
2. Open the Evaluations tab and click Run Test to feed each Data Table row through the AI Agent.
3. Inspect per-test-case scores and aggregate metrics to see which tickets the classifier got right and which it missed.
4. Tweak the prompt or model, re-run, and compare runs side by side.
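The test run described above can be pictured in plain Python. This is a conceptual sketch only: `classify` below is a keyword-heuristic stand-in for the AI Agent, and `test_rows` stands in for the Data Table.

```python
def classify(ticket: str) -> dict:
    """Stand-in for the AI Agent: a toy keyword heuristic, NOT the real model."""
    category = "billing" if "invoice" in ticket.lower() else "technical"
    urgency = "high" if "urgent" in ticket.lower() else "low"
    return {"category": category, "urgency": urgency}

# Stand-in for Data Table rows: each row pairs a ticket with its expected labels.
test_rows = [
    {"ticket": "Urgent: wrong invoice amount", "category": "billing", "urgency": "high"},
    {"ticket": "App crashes on login", "category": "technical", "urgency": "low"},
]

# Feed each row through the classifier and record a per-test-case score.
scores = []
for row in test_rows:
    pred = classify(row["ticket"])
    scores.append(pred["category"] == row["category"] and pred["urgency"] == row["urgency"])

# Aggregate metric for the whole run, comparable across prompt or model tweaks.
accuracy = sum(scores) / len(scores)
print(f"accuracy = {accuracy:.0%}")
```

Keeping both the per-row scores and the aggregate number matters: the aggregate tells you whether a prompt change helped overall, while the per-row view tells you exactly which tickets regressed.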

What you'll learn

- How n8n's Evaluation Trigger, Data Tables, and Evaluation node fit together
- How to use the "Check if Evaluating" operation to keep evaluation traffic out of production
- How to score structured AI outputs against known correct answers using exact match
- How to seed a test set from real execution history rather than synthetic examples
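The "Check if Evaluating" idea can be pictured as a simple branch (a conceptual sketch with hypothetical names, not n8n's API): evaluation runs get the prediction back for scoring, while production runs additionally trigger customer-facing side effects.

```python
def send_reply(ticket: str, result: dict) -> None:
    """Stand-in side effect: in production this would message the customer
    or update the ticketing system."""
    print(f"replying to ticket {ticket!r} with {result}")

def handle_ticket(ticket: str, is_evaluating: bool = False) -> dict:
    result = {"category": "billing", "urgency": "high"}  # stand-in for the AI Agent's output
    if is_evaluating:
        # Evaluation branch: return the prediction for scoring,
        # skipping all customer-facing side effects.
        return result
    # Production branch: classify AND act on the ticket.
    send_reply(ticket, result)
    return result
```

The point of the branch is isolation: you can replay hundreds of test tickets through the exact same classification logic without a single customer receiving a reply.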

Why it matters

Classification accuracy that looked great in testing can quietly drop the moment your inputs shift. Building an evaluation path next to your production workflow gives you a repeatable way to measure quality, catch regressions before users do, and ship prompt changes with data instead of vibes.

This template is a learning companion to the Production AI Playbook, a series that explores strategies, shares best practices, and provides practical examples for building reliable AI systems in n8n.