
GPT 5 vs. GPT 4.1

Rabi

August 12th, 2025 ⋅ 6 mins read

OpenAI recently announced their latest model, GPT 5, on August 7. Given its capabilities, we’re very excited to announce that we have added support for GPT 5 as an experimental model with the Thesys C1 API. This blog is intended to help users explore and understand GPT 5’s new behavioural characteristics and how they work with C1. We will cover insights from our internal testing of GPT 5 with the C1 API and support them with external, publicly available benchmarks and data collected so far.

While adding models to the C1 API, we focus on three main categories that impact a model’s performance in a use case like generative UI: instruction following, tool calling, and UI generation capabilities. We cover each of these in the following sections.

Instruction Following and Tool Calls

GPT 5 has seen improvements in multiple areas when it comes to developer experience and intent alignment. For a model to work well with the C1 API, it needs strong instruction following and tool calling capabilities, and the standard GPT 5 API largely delivers on both. Based on external benchmarks measuring instruction following, GPT 5 with ‘minimal’ reasoning scores virtually the same as GPT 4.1, while the other reasoning_effort values (‘low’, ‘medium’ and ‘high’) score significantly higher.

Model | Instruction Following Score
GPT-5 (minimal reasoning) | 76.86
GPT-4.1 | 77.05

Instruction Following Scores (source: LiveBench)

We use GPT 5 with reasoning_effort set to ‘minimal’ for the C1 API, which has an instruction following score virtually identical to GPT 4.1’s. In real-world use, however, GPT 5 tends to outperform GPT 4.1 on C1-specific instructions. With our internal system prompts for C1, we observed that we could reduce the amount of C1-specific instructions and still get better obedience and performance than with GPT 4.1.
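For reference, here is a minimal sketch of what this looks like in practice, assuming the C1 endpoint is OpenAI-compatible and that your SDK version supports the ‘minimal’ reasoning_effort value. The base URL and model string below are placeholders, not the exact identifiers; check the Thesys documentation for the values that match your C1 version.

```typescript
import OpenAI from "openai";

async function main() {
  // The C1 API is OpenAI-compatible, so the standard SDK can point at it.
  // Both baseURL and the model string are placeholders for illustration only.
  const client = new OpenAI({
    apiKey: process.env.THESYS_API_KEY,
    baseURL: "https://api.thesys.dev/v1/embed", // placeholder C1 base URL
  });

  const response = await client.chat.completions.create({
    model: "c1/openai/gpt-5/v-20250709", // placeholder GPT-5 model identifier
    reasoning_effort: "minimal",         // the setting discussed above (recent SDK versions)
    messages: [
      { role: "system", content: "You are a helpful assistant." },
      { role: "user", content: "How does an airplane work?" },
    ],
  });

  // The response content is a C1 UI specification, rendered client-side by the C1 SDK.
  console.log(response.choices[0].message.content);
}

main();
```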

Both GPT 4.1 and GPT 5 are very capable of following tool call instructions and invoking the right tools wherever needed. This can be further aided with additional system prompts to get the desired behaviour.
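As an illustration, the sketch below defines a single tool in the OpenAI function-calling format. The tool name and schema are invented for this example; both models decide when to invoke it based on the system prompt and the user’s message.

```typescript
import OpenAI from "openai";

// A hypothetical tool in the OpenAI function-calling format. A short system
// prompt can nudge the model toward (or away from) calling it.
const tools: OpenAI.Chat.Completions.ChatCompletionTool[] = [
  {
    type: "function",
    function: {
      name: "get_flight_status", // invented example tool
      description: "Look up the live status of a flight by its flight number.",
      parameters: {
        type: "object",
        properties: {
          flight_number: { type: "string", description: "e.g. UA123" },
        },
        required: ["flight_number"],
      },
    },
  },
];

// Pass `tools` in client.chat.completions.create({ ..., tools }); if the model
// chooses to call the tool, the call appears in response.choices[0].message.tool_calls.
```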

Thesys C1 LLM Arena

Thesys LLM Arena is our internal benchmarking environment for evaluating the human side of model performance - how much people actually like the generations they see. Beyond standard benchmarks, it tests models on the same prompts, tools, and C1 components, then factors in human subjectivity to assess which outputs feel more engaging, useful, or visually appealing in real-world Generative UI scenarios.

Based on the results of the Arena, GPT-5 is already performing better than GPT 4.1. As of writing, GPT 5 is at an Elo score of 1072, while GPT 4.1 is at 1025. For reference, our best-performing stable model (Claude Sonnet 4) is at 1122. This shows that GPT 5 tends to match user expectations slightly better when it comes to generating user interfaces, and we are continuously evaluating its performance to further improve GPT 5’s output for C1.
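As a rough way to read those numbers, the standard Elo expected-score formula (an assumption about how Arena ratings map to preference probability, not a Thesys-published conversion) puts GPT-5’s rating edge over GPT-4.1 at roughly a 57% chance of being preferred in a head-to-head comparison:

```typescript
// Standard Elo expected-score formula: E_A = 1 / (1 + 10^((R_B - R_A) / 400)).
const expectedWinRate = (ratingA: number, ratingB: number): number =>
  1 / (1 + Math.pow(10, (ratingB - ratingA) / 400));

console.log(expectedWinRate(1072, 1025).toFixed(3)); // ~0.567: GPT-5 vs GPT-4.1
console.log(expectedWinRate(1122, 1072).toFixed(3)); // ~0.571: Claude Sonnet 4 vs GPT-5
```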

To see GPT 5 in action with the C1 API, the Thesys Playground is an excellent place to try it out and compare its performance with other models, along with tool calls, MCP integration, and custom system prompts to match any use case!

UI Generation Capabilities

The following tests were carried out with multiple use cases in the Thesys Playground, and are meant to capture subjective insights for C1 with the GPT 5 and GPT 4.1 APIs.

Model settings:

  • GPT-4.1 is the default offered by OpenAI.
  • GPT-5 is set to use ‘minimal’ reasoning effort (emulating non-thinking), since GPT 4.1 is a non-thinking model.

Metrics: 

  1. C1 components generated: For the same user prompt with no additional custom system prompts, how many components does each model generate? This is tested with intentionally vague prompts (e.g. “how to deep dive into astronomy”). Open-ended prompts like these measure how explorative the model is in using the available components to generate a subjective answer.
  2. Total tokens generated: For the same user prompt with no guiding system prompts, how many tokens did the model use? This is a general indication of how talkative the model is, and can be adjusted with system prompts.

All metrics are calculated by giving the exact same user and system prompts to both the GPT-4.1 and GPT-5 APIs. Both run on C1 version 20250709, and results are averaged across multiple iterations.
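For concreteness, here is a minimal sketch of how the token metric could be collected, assuming the C1 endpoint returns a standard OpenAI-style usage object on each completion (component counting is omitted, since it depends on parsing the generated C1 UI specification):

```typescript
// Run the same request several times and average usage.total_tokens,
// mirroring how the metrics above are averaged across iterations.
async function averageTotalTokens(
  runOnce: () => Promise<{ usage?: { total_tokens?: number } }>,
  iterations = 5,
): Promise<number> {
  let sum = 0;
  for (let i = 0; i < iterations; i++) {
    const response = await runOnce();          // identical user + system prompt each time
    sum += response.usage?.total_tokens ?? 0;  // tokens consumed by this iteration
  }
  return sum / iterations;
}
```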

Test Case #1 - Subjective Open Ended

SYSTEM: “You are a bot from the company "superteachers". Students will ask you questions, respond to them how a teacher would respond to their student.”
USER: “How does an airplane work?” (a subjective, open-ended question)

GPT-5

It divided its answer into four sections: Forces, Pilot Controls, Engines, and Flight Phases. It briefly explained each section, then proceeded to explain wing shapes. It gave a real-life example of how you can experience “lift” and “drag”, the core concepts that allow a plane to fly. It gave the key physics idea behind flight, the theorems involved, and then summarised the learnings.

GPT 5 Generative UI response for query "How does an airplane work"

GPT-4.1

It gave a brief explanation of the forces responsible, then explained the main concept of lift. It then explained why engines are needed, and how the forces at play keep the plane from falling down.

GPT 4.1 Generative UI response for query "How does an airplane work"

Test Case #2: More precise, specific question

SYSTEM: “You are a bot from the company "superteachers". Students will ask you questions, respond to them how a teacher would respond to their student.”
USER: “What are the three laws of motion?”

GPT-5

It started with a brief explanation of each of the three laws of motion. Then, it divided its answer into four sections: first law, second law, third law, and quick check. In the first three sections, it explained each law in detail with examples, along with their formulae. In the quick check section, it gave three questions and a tip. It concluded with a follow-up question.

GPT 5 Generative UI response for query "What are the three laws of motion?"

GPT-4.1

It began with background on why these laws exist. It used lists to present each law and gave a brief explanation of each, with its formula and examples.

GPT 4.1 Generative UI response for query "What are the three laws of motion?"

Results

Component Utilisation

GPT-5 takes fuller advantage of C1’s components to create richer, more visual outputs.
Compared to previous models, GPT-5 delivers greater depth in its visual generations, offering users more context and detail for the same query—especially in educational or information-dense use cases.

Component Utilisation for GPT 4.1 vs GPT 5

Token Usage

GPT-5 uses more tokens for similar queries, resulting in denser responses.
In our tests, GPT-5 consistently generated longer outputs for the same prompts, packing in additional details, explanations, and context. This makes its responses especially useful for scenarios where completeness matters, though it can also increase token consumption compared to earlier models.

Token Usage for GPT 4.1 vs GPT 5

Conclusion

Based on our analysis and internal testing, we recommend:


Use GPT 5 for:

  • High verbosity/subjectivity use cases (audits, teaching, explanations)
  • Complex use cases requiring strong system prompt obedience
  • Rich UI generation with diverse component mixing

Use GPT 4.1 for:

  • Latency-sensitive applications
  • Simpler use cases with limited task sets
  • Direct, to-the-point interactions
  • When response speed is critical

GPT 5 vs GPT 4.1 model characteristics

Characteristic | GPT-5 with C1 (v-20250709) | GPT-4.1 with C1 (v-20250709)
Verbosity | Tends to be more verbose | Tends to be concise
Model characteristic | Exploratory and open-ended | To-the-point and objective
C1 component utilization | High | Average
Speed of response | Average | Very high
System prompt obedience | Very high | High
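If you want to encode these recommendations in an application, a simple routing heuristic might look like the sketch below. The model identifiers are placeholders, and the criteria are entirely up to your use case.

```typescript
// Hypothetical per-request model routing based on the characteristics above.
// Model identifiers are placeholders; use the exact strings from the Thesys docs.
type UseCase = { latencySensitive: boolean; needsRichUI: boolean };

function pickModel(useCase: UseCase): string {
  if (useCase.latencySensitive && !useCase.needsRichUI) {
    return "c1/openai/gpt-4.1/v-20250709"; // fast, concise, to-the-point
  }
  return "c1/openai/gpt-5/v-20250709";     // richer components, stronger prompt obedience
}
```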

All visualisations were generated with C1.
To see GPT 5 in action with the C1 API, visit the Thesys Playground to try it out and compare its performance with other models!
