LLM evaluation platform Arena launches Agent Mode to benchmark GPT-5.5, Claude Opus 4.7, and Gemini 3.1 Pro on multi-step tasks · Digg