How we evaluate LLMs for Generative UI
Introduction
Large Language Models (LLMs) are transforming human-computer interaction by enabling intelligent, real-time understanding and response across domains like language, code, and reasoning. A key application is Generative User Interfaces (GenUI), where LLMs dynamically create and adapt interfaces based on user intent, preferences, and context, replacing static UIs with personalised, intelligent ones.
While working on the Thesys C1 API to enable Generative Interfaces, we encountered a fundamental problem - there are no existing frameworks for evaluating the subjective GenUI capabilities of LLMs. Popular human-driven evaluation tools like LMArena exist, but we could not use them as they are mostly text or code based. With LLM providers introducing newer, more intelligent foundation models, it is becoming increasingly important to have a framework for evaluating their GenUI capabilities, one that helps us shape our roadmap and decide which models to support next.
We now introduce Arena by Thesys - a tool designed to help us and our users put state-of-the-art LLMs to work and see their GenUI capabilities in action. While our internal evals (which probably need their own post) and end-to-end tests ensure a stable experience for our customers, Arena lets us and our users evaluate and rank different LLMs on their GenUI capabilities with the Thesys C1 API, giving us clear insight into which models best meet user requirements. We aim to use this information to decide which LLMs are the most compatible with the C1 API by Thesys, while enabling our users to evaluate which C1+LLM offering suits their use cases.
Design
Arena is built on top of the C1 API, using the familiar A/B testing format of popular LLM evaluation tools like LMArena and WebDev Arena. Combining C1 API output with Elo ratings gives us a human-evaluation setup with a robust ranking system: a tool that lets users evaluate the generation capabilities of different LLMs on top of our C1 API, and helps us select the models that best suit C1 users' needs.
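For reference, the rating mechanism is the standard Elo update. The TypeScript sketch below shows the expected-score and rating-update calculations; the K-factor of 32 is an illustrative default rather than the exact value Arena uses.

```typescript
// Standard Elo update for a pairwise comparison (a minimal sketch).
// K controls how much a single result moves a rating; 32 is an assumed default.
const K = 32;

// Expected score of model A against model B given their current ratings.
function expectedScore(ratingA: number, ratingB: number): number {
  return 1 / (1 + Math.pow(10, (ratingB - ratingA) / 400));
}

// Returns the updated ratings after one comparison.
// scoreA is 1 if A wins, 0 if B wins, and 0.5 for a tie.
function updateElo(
  ratingA: number,
  ratingB: number,
  scoreA: number
): [number, number] {
  const expectedA = expectedScore(ratingA, ratingB);
  const newA = ratingA + K * (scoreA - expectedA);
  const newB = ratingB + K * ((1 - scoreA) - (1 - expectedA));
  return [newA, newB];
}
```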

The intuitive interface allows for a straightforward evaluation methodology: users provide a common prompt that is passed to two randomly selected, anonymised LLMs. Each LLM uses the C1 API to respond with a generated interface, and the user then chooses their preferred response (Model A / Model B / Tie). Additionally, users can provide custom system prompts and ask as many follow-ups as they want before selecting their preference. This accurately simulates how the C1 API will perform when tailored to their exact use cases, and how different LLMs interpret the task of generating interfaces according to the user's prompts. The user's choice of a winner (or tie) is then used to update the Elo ratings of the two competing models via the Elo update described above.
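To make the flow concrete, here is a rough sketch of a single Arena round, reusing the updateElo helper from the sketch above. Names like generateUI and askUser are hypothetical placeholders for the C1-backed generation call and the preference prompt, not the actual C1 API surface.

```typescript
// Hypothetical sketch of one Arena round: pick two anonymised models,
// generate an interface for the same prompt with each, record the user's
// verdict, and update both ratings. Not the real Arena implementation.
type Verdict = "A" | "B" | "tie";

interface Model {
  id: string;
  rating: number;
}

async function runArenaRound(
  models: Model[],
  prompt: string,
  generateUI: (modelId: string, prompt: string) => Promise<string>, // assumed C1-backed call
  askUser: (uiA: string, uiB: string) => Promise<Verdict> // assumed preference prompt
): Promise<void> {
  // Randomly pick two distinct models; their identities stay hidden from the user.
  const shuffled = [...models].sort(() => Math.random() - 0.5);
  const [modelA, modelB] = shuffled;

  // Both models answer the same prompt and return a generated interface.
  const [uiA, uiB] = await Promise.all([
    generateUI(modelA.id, prompt),
    generateUI(modelB.id, prompt),
  ]);

  // The user compares the two interfaces and picks Model A, Model B, or Tie.
  const verdict = await askUser(uiA, uiB);
  const scoreA = verdict === "A" ? 1 : verdict === "tie" ? 0.5 : 0;

  // Apply the Elo update from the earlier sketch to both competing models.
  [modelA.rating, modelB.rating] = updateElo(modelA.rating, modelB.rating, scoreA);
}
```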

Results
With enough user preferences registered on Arena, the leaderboard gives a clear ranking of which models tend to deliver more satisfactory responses. We use this data internally to evaluate and prioritise the next models to support in the C1 API, and hope it helps companies building Generative UI solutions pick the right model for their deployments.

Arena results so far show that Anthropic's Claude lineup has emerged as the top scorer across a wide range of user prompts, in line with the leaderboard rankings of WebDev Arena - a tool used to evaluate the static UI generation capabilities of existing LLMs. This is generally expected: LLMs optimised for web development show a deeper understanding of the various UI components and how to use them, whereas other LLMs sometimes simply fall back to creating a text block and dumping the entire response into it.
Future Work
The initial release of Arena equips us with the right tools for measuring and quantifying the GenUI capabilities of different models on top of our middleware, bringing us one step closer to our north star of leveraging LLMs to accurately generate intelligent interfaces at runtime. However, the Arena leaderboard also points us to our next milestone - bringing other LLMs up to the performance of the Claude models on the C1 API. Arena's data will let us establish where certain models fall short and how we can improve our internal implementation of the C1 API to boost their performance. This effort is enormous, and every data point helps us make our offerings better. We are excited to see where the Arena rankings take us, and we hope the improvements that come from this effort will make C1 better and more consistent.
We believe everyone should be able to see the GenUI capabilities of the C1 API and experience different LLMs generating interfaces, so Arena is free for all users. Sign up and see different LLMs generate interfaces at https://console.thesys.dev/tools/arena now!