At ClodHost, every user's Claude Code instance runs through our proxy server, which lets us swap the underlying AI model. We decided to put this to the test: give 16 different models the exact same complex prompt and see what happens.
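
Mechanically, the swap is just endpoint plumbing: Claude Code reads its API endpoint, credentials, and default model from environment variables, so a gateway can sit in the middle and route to whatever model we choose. A minimal sketch of the pattern (the proxy URL and model slug below are placeholders, not our production config):

    # Point Claude Code at a proxy instead of api.anthropic.com (sketch; values are placeholders).
    export ANTHROPIC_BASE_URL="https://proxy.example.com"      # proxy terminates here, forwards to OpenRouter
    export ANTHROPIC_AUTH_TOKEN="per-user-token"               # credential issued by the proxy
    export ANTHROPIC_MODEL="vendor/model-under-test"           # slug the proxy maps to the model being tested
    claude -p "configure apache to serve an ssl site from ..." # -p runs non-interactively with the given prompt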

The results surprised us. Most models produced working websites. But the quality gap was enormous — and the best-looking site was built by the free model.

The Challenge

We used a real customer prompt — the one that originally built WW2 Obscura, a site of obscure World War II trivia dedicated to a friend named Mike. The prompt asks Claude Code to configure a server from scratch and build a complete, content-rich website — no placeholders, no dummy data.

This is a genuinely complex task: server configuration, database design, backend API, frontend design, and extensive content generation — all from a single prompt.

The Exact Prompt

Every model received this identical prompt (only the domain/directory name varied):

Do not ask me anything, attempt to one-shot the whole site with no follow-up questions.

configure apache to serve an ssl site from /var/www/ww2trivia-{model}.clodhost.com/public (I'm using cloudflare for dns with "full" ssl). remember to open up the server firewall for ssl too.. then completely build the following in that directory, do not leave any parts to be implemented later or fill in with dummy data.. install a real database if you need to, whatever type makes sense, and install any other necessary server tools to complete the entire thing:

I have a friend named Mike Paynter who is a World War II buff and knows everything! He's also a Freemason and he is a bit religious, but that is beside the point. I want to create a site that he can go to, and it will have trivia of really obscure history and facts about World War II things that may not otherwise be easily or readily available with a Google search. I'd also like to scrape pictures and stories that are rare things that are likely not easily found by him that he would get a kick out of. They went on a trip to the Normandy coast and his wife hired a PhD level tour guide with great extensive knowledge of World War II and actually new names of people on the beaches and things that were really obscure. This is what I'm looking for I want something that he will get really excited about that. He can search and read things that he would normally not have found. You have to understand that his wife is the type of person who is very intense for example, she wrote a journal during the pandemic during Covid and it now resides in the Smithsonian library. Mike is a very intense person. He's very smart and very loving and he had a near death experience with his heart a few years back so now he cherishes every moment he's on the planet and he has traveled extensively with his wife which they want to do before they get too old. They have been to France and all over Europe and he can really recite some major things that happened with the Nazis and is very knowledgeable. He would love to hear heartfelt stories of people who were heroes or should've been heroes or people who have fought against adversity
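
For a sense of the server-side work every model had to do before writing a single line of content, here is a rough sketch of the Apache/SSL/firewall baseline the prompt demands. This is our illustration, not taken from any model's run; the domain is a placeholder, and Cloudflare's "full" SSL mode accepts a self-signed certificate at the origin.

    # Baseline server work the prompt asks for (sketch; domain and paths are placeholders).
    apt install -y apache2 openssl
    a2enmod ssl
    mkdir -p /var/www/ww2trivia-model.clodhost.com/public

    # Cloudflare "full" mode does not validate the origin certificate, so self-signed is enough:
    openssl req -x509 -nodes -days 365 -newkey rsa:2048 \
        -keyout /etc/ssl/private/ww2trivia.key -out /etc/ssl/certs/ww2trivia.crt \
        -subj "/CN=ww2trivia-model.clodhost.com"

    # A *:443 vhost with SSLEngine on, the cert/key above, and DocumentRoot pointed at public/, then:
    a2ensite ww2trivia-model && systemctl reload apache2

    # "remember to open up the server firewall for ssl too"
    ufw allow 443/tcp

Everything past that point (database, backend, frontend, and the content itself) is where the models diverged.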

The Results

All 16 models ran through Claude Code on identical Hetzner servers (2 vCPU, 2GB RAM), routed through our proxy to OpenRouter. Here's every model ranked by our quality grade, with full specs and API metrics:

Model | Grade | Cost | Released | Code Score | Context | In $/1M | Out $/1M | Speed | Active Time | Nudges | API Calls | s/Call | Tokens In | Tokens Out | Cache %
Qwen 3.6+ | B+ | FREE | Apr 2 | 41.6 | 1M | $0.00 | $0.00 | ~158 | 50m 51s | 1 | 263 | 11.6 | 10.4M | 244K | N/A
Claude Opus 4.6 | B | $3.20 | Feb 5 | 48.1 | 1M | $5.00 | $25.00 | ~25 | 36m 50s | 0 | 82 | 26.9 | 3.6M | 60K | 98.2%
Claude Sonnet 4.6 | B | $1.62 | Feb 17 | 50.9 | 1M | $3.00 | $15.00 | ~50 | 15m 31s | 0 | 82 | 11.4 | 3.4M | 171K | 91.6%
MiniMax M2.7 | C+ | $1.07 | Mar 18 | 41.9 | 205K | $0.30 | $1.20 | ~47 | 30m 20s | 2 | 159 | 11.4 | 13.1M | 154K | 94.6%
Qwen 3.5+ | C+ | $0.69 | Feb 16 | 41.3 | 1M | $0.50 | $1.50 | ~100 | 18m 19s | 0 | 46 | 23.9 | 2.2M | 69K | 0%
ChatGPT 5.4 | C | $5.04 | Mar 5 | 57.3 | 1M | $2.50 | $15.00 | ~73 | 40m 29s | 2 | 203 | 12.0 | 13.8M | 117K | 99.1%
Kimi K2.5 | C | $0.55 | Jan 27 | 39.5 | 262K | $0.38 | $1.72 | ~40 | 34m 25s | 0 | 37 | 55.8 | 1.3M | 62K | 48.0%
Gemini 3.1 Pro | D+ | $0.70 | Mar 26 | 55.5 | 1M | $2.00 | $12.00 | ~131 | 5m 38s | 0 | 22 | 15.4 | 450K | 42K | 58.3%
Grok 4.20 Beta | D | $0.34 | Feb 17 | 42.2 | 2M | $2.00 | $6.00 | ~233 | 3m 57s | 0 | 35 | 6.8 | 650K | 10K | 87.4%
Qwen3 Coder Flash | D | $0.46 | Jul 22 | 50.3 | 1M | $0.20 | $0.98 | ~130 | 21m 49s | 0 | 116 | 11.3 | 4.5M | 35K | 85.7%
Nemotron 3 Super | D | FREE | Mar 11 | 31.2 | 262K | $0.00 | $0.00 | ~293 | 29m 44s | 3 | 103 | 17.3 | 3.0M | 52K | N/A
GPT-5.4-mini | D- | $0.13 | Mar 17 | 54.4 | 400K | $0.75 | $4.50 | ~190 | 2m 34s | 0 | 32 | 4.8 | 797K | 10K | 94.6%
Gemini 3.1 Flash | F | $0.05 | Mar 26 | 42.6 | 1M | $0.50 | $3.00 | ~218 | 2m 4s | 0 | 40 | 3.1 | 384K | 5K | 57.7%
GPT-5.4-nano | F | $0.03 | Mar 17 | 52.4 | 400K | $0.20 | $1.25 | ~200 | 5m 24s | 0 | 23 | 14.1 | 903K | 11K | 47.6%
GPT-5.4 Pro | F | $12.47 | Mar 5 | 57.7 | 1M | $30.00 | $180.00 | ~15 | 32m 3s | 0 | 17 | 113.1 | 233K | 30K | 0%
Deepseek v3.2 | F | $0.30 | Dec 1 | 36.7 | 164K | $0.26 | $0.38 | ~41 | 29m 57s | 2 | 121 | 14.9 | 2.4M | 37K | 16.8%

Released = model release date. Code Score = SWE-bench Verified (higher = better at coding). Speed = output tokens/sec. Cache % = cached tokens / total input (N/A for free models that don't report caching). Costs reflect only API calls during the build.

Total cost for the entire experiment: $26.75

The Grades

B+ — Qwen 3.6+ Free ($0.00, 50m 51s)

The winner. 163 trivia facts and 22 in-depth stories (though some appear to be 2-3x duplicated). Beautiful warm earth-tone design with olive and brown hues, gold accents, a "Built for Mike" badge, and category tags with article counts across 12 categories. Working search, quiz functionality, a timeline feature, and a "Surprise Me" button. The most feature-complete site in the competition.

It needed one nudge and took the longest to build (50 minutes), but produced a site that looks like it had a professional designer behind it. For zero dollars.

Verdict: Best bang-for-buck ratio in history — infinite, because the buck is zero. Slight ding for duplicate content.

B — Claude Opus 4.6 ($3.20, 36m 50s)

60 trivia facts and 12 in-depth stories. Professional dark navy and gold design, stat counters, "Explore Categories" and "Random Fact" buttons. Search works well. No nudges needed — Opus worked through errors autonomously, adapting its approach when file writes failed.

Nice design but slightly marred by overuse of emojis in headers. No quiz functionality.

Verdict: Reliable and autonomous. Twice the price of Sonnet for similar quality. Worth it when you need zero babysitting.

B — Claude Sonnet 4.6 ($1.62, 15m 31s)

50 trivia facts and 11 long-form stories. Elegant dark theme called "The Hidden History of WW2" with gold accents, browseable category cards (Animals in War, Bizarre Weapons, Forgotten Battles, Home Front Secrets, Medical Miracles). Working quiz functionality, interesting layout with emoji logos done more tastefully than Opus. A bit busy visually.

Completed in just 15 minutes with zero nudges — the fastest autonomous completion of any high-quality site.

Verdict: The best value in the competition. Fast, cheap, autonomous, and solid output. The default pick for production use.

C+ — MiniMax M2.7 ($1.07, 30m 20s)

22 articles and 27 trivia facts. Dark, atmospheric design with gold text reading "The WWII Stories They Didn't Teach You." Nice welcome page and decent layout with a moody, magazine-style aesthetic. No quiz functionality. Simple but presentable.

Needed 2 nudges to complete.

Verdict: Cheap and stylish, but thin on content and features. Good aesthetics, limited substance.

C+ — Qwen 3.5+ ($0.69, 18m 19s)

16 stories, 12 heroes honored, and 28 trivia quiz questions. A well-structured PHP site called "WW2 Trivia Archive" with Playfair Display fonts, hero section with CTA buttons, stat cards, and a clean dark theme. Features include separate Stories, Heroes, Trivia Quiz, and Search pages. No nudges needed — fully autonomous.

Interestingly, this is the older Qwen release — and while it produced fewer facts than its 3.6+ sibling (which had 163), the overall site is more polished and bug-free. A solid mid-tier result.

Verdict: Good value at $0.69. Fewer facts than Qwen 3.6+ but a clean, working site with quiz functionality. The B-team sibling that shows up and delivers.

C — ChatGPT 5.4 ($5.04, 40m 29s)

13 essays and 24 trivia facts. A weird single-page layout with no proper navigation bar. Search doesn't appear to work. Has an admin area and image upload functionality — but launched with a bunch of blank image placeholders instead of pulling actual images from Wikipedia or other sources. Why build the upload feature but not populate it?

The most expensive model at $5.04, needed 2 nudges, and took 40 minutes.

Verdict: Spent the most money and built the most infrastructure — admin panel, image uploads — but forgot the basics. A site with broken search and blank images isn't usable, regardless of how sophisticated the backend is.

C — Kimi K2.5 ($0.55, 34m 25s)

Kimi spent 34 minutes building an ambitious PHP site with separate pages for facts, heroes, Normandy stories, and operations. It set up MySQL, created a proper routing system, and populated the database with 24 facts across multiple categories. Working search functionality. However, the site launched with a fatal PHP error and broken SSL — both required manual fixes.

No quiz feature, and the navigation breaks on mobile. But the underlying architecture is solid and the content categories are well-organized.

Verdict: Decent content with working search, but needed help getting to a functional state. No mobile nav hurts.

D+ — Gemini 3.1 Pro ($0.70, 5m 38s)

Only 8 trivia facts and basically no features beyond a basic article display. Clean newspaper-style design called "The Observer's Dispatch" with a personalized welcome message for Mike and numbered archive entries. It looks nice, but there's almost nothing there.

Finished in under 6 minutes with zero nudges — one of the fastest runs in the test. But speed came at the cost of depth.

Verdict: Gemini treated this like a quick homework assignment. Clean design, minimal effort. Fast and cheap, but you get a skeleton, not a site.

D — Grok 4.20 Beta ($0.34, 3m 57s)

Only 9 short trivia facts, no real features, no search functionality. Super simple site layout called "WW2 Obscura" with a dedication to Mike Paynter. Done in under 4 minutes — among the fastest runs in the competition — but the result is a bare-bones page with almost no content.

Zero nudges needed, but only because there was so little to build.

Verdict: Lightning fast and dirt cheap, but 9 facts is not a trivia site. Grok sprinted through a marathon by skipping most of the course.

D — Nemotron 3 Super ($0.00, 29m 44s)

Has trivia facts, but the site is super bare-bones — just a list with a filter/search. Blue and white utilitarian design with no visual personality. The content is displayed as a plain list without engaging presentation.

Needed 3 nudges — the most of any model.

Verdict: Free, and it shows. Functional in the most literal sense, but no one would enjoy browsing this. Qwen (also free) produced a dramatically better result.

D — Qwen3 Coder Flash ($0.46, 21m 49s)

Only 7 facts in the database, a simple design with broken image links, and basically no features beyond a basic list page. Has a navigation menu (Home, Stories, Facts, About) but the content sections are empty. Built with PostgreSQL instead of MySQL — an unusual choice that didn't help the outcome. The "flash" version of Qwen's coding model, and it shows: fast inference but shallow output.

Zero nudges needed, but 22 minutes for 7 facts is poor efficiency.

Verdict: Cheap at $0.46, but the result is a skeleton with barely any content. The "fast and cheap" coder model traded quality for speed and came up short.

D- — GPT-5.4-mini ($0.13, 2m 34s)

Called "WWII Atlas" — only 11 facts, broken images pulling from Wikipedia URLs that don't resolve, broken SSL, and non-functional search. Nice dark design aesthetic, but nothing actually works. Built in 2.5 minutes, and it shows. Also firewalled off SSH port 22 during setup, locking itself out.

Zero nudges — but only because it ran so fast there was no time to intervene.

Verdict: The cheapest paid model at $0.13, but you get a broken site with broken images and broken search. Mini treated this like a speed run, not a real task.

F — Gemini 3.1 Flash ($0.05, 2m 4s)

The cheapest and fastest failure. Flash 3.1 configured Apache, created a MariaDB database, and wrote a single PHP file in 2 minutes. But it never created any database tables or inserted any data. The result: a page with a search bar, a title ("Obscure WWII Knowledge"), and the text "No results found." Zero facts, zero stories, zero content.

It also didn't set up the database user permissions correctly, causing a 500 error until we manually fixed it. Even then — empty database.
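
For reference, the part Flash skipped amounts to only a few statements of MariaDB setup, something like this (a sketch; the database, user, and table names are our own illustration, not taken from its run):

    # The missing pieces: a working grant plus at least one populated table (names are illustrative).
    mysql -e "CREATE DATABASE IF NOT EXISTS ww2 CHARACTER SET utf8mb4;"
    mysql -e "CREATE USER IF NOT EXISTS 'ww2'@'localhost' IDENTIFIED BY 'change-me';"
    mysql -e "GRANT ALL PRIVILEGES ON ww2.* TO 'ww2'@'localhost'; FLUSH PRIVILEGES;"
    mysql ww2 -e "CREATE TABLE facts (id INT AUTO_INCREMENT PRIMARY KEY, category VARCHAR(64), fact TEXT);"
    mysql ww2 -e "INSERT INTO facts (category, fact) VALUES ('Forgotten Battles', 'An actual fact would go here.');"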

Verdict: Flash spent $0.05 to create a search form that searches nothing. A monument to speed without substance.

F — GPT-5.4-nano ($0.03, 5m 24s)

Complete failure. 23 API calls over 5 minutes and nothing was built — no Apache, no database, no website files. The server was left essentially untouched. Nano couldn't even get past basic server configuration.

Verdict: At $0.03, the financial cost is trivial. But "no website" is the worst possible result. Nano is simply not capable of this kind of autonomous multi-step task.

F — GPT-5.4 Pro ($12.47, 32m — abandoned)

We had to pull the plug. After 32 minutes and $12.47, GPT-5.4 Pro had made 17 API calls and accomplished exactly one thing: checking what software was installed on the server. It hadn't run apt install. It hadn't created a single file. It hadn't configured Apache. It was still "tinkering" — planning its approach while burning through $30/1M input tokens and $180/1M output tokens.

At 113 seconds per API call with 0% cache hits, this was on track to cost $50-100+ and take 3+ hours. We killed it.

Verdict: The most expensive model in the world, and it couldn't get past whoami. GPT-5.4 Pro may excel at complex reasoning tasks, but autonomous server setup isn't one of them. The $12.47 it burned doing nothing is more than Sonnet spent building an entire working site.

F — Deepseek v3.2 ($0.30, 29m 57s)

No facts added to the database. The site has a proper header, banner image, and section layouts — but the "Featured Obscure Stories" and "Test Your Knowledge" sections are completely empty. The framework is there; the content isn't. 30 minutes of build time that produced an empty shell.

Needed 2 nudges.

Verdict: Built a house with no furniture. At $0.30 the financial cost is trivial, but the time cost (30 minutes for nothing usable) is not.

Autonomy: Could Any Model "One-Shot" It?

Eleven models ran with zero nudges, though as the notes below show, not all of them actually completed the task:

Model | Nudges | Notes
Claude Opus 4.6 | 0 | Fully autonomous, self-recovered from errors
Claude Sonnet 4.6 | 0 | Fully autonomous, best value
Gemini 3.1 Pro | 0 | Fully autonomous and fast — but minimal output
Grok 4.20 Beta | 0 | Fully autonomous, very fast — but 9 facts
Kimi K2.5 | 0 | No nudges, but site had fatal PHP error
GPT-5.4-mini | 0 | No nudges, but broken images/search/SSL
Gemini 3.1 Flash | 0 | No nudges — also no content (0 facts)
GPT-5.4-nano | 0 | No nudges — also no website
Qwen3 Coder Flash | 0 | No nudges, but only 7 facts and broken images
GPT-5.4 Pro | 0 | Abandoned — $12.47 spent with zero files created
Qwen 3.5+ | 0 | Fully autonomous, decent content depth
Qwen 3.6+ (Free) | 1 | Nearly autonomous, best overall result
Deepseek v3.2 | 2 | Stalled twice, still produced nothing usable
MiniMax M2.7 | 2 | Stalled twice, delivered after nudging
ChatGPT 5.4 | 2 | Needed intervention, expensive for the result
Nemotron 3 Super | 3 | Most nudges, bare-bones output

Note: Zero nudges doesn't mean the result was good. Grok, Mini, Nano, Gemini Flash, and Qwen3 Coder Flash all ran without intervention but produced poor or non-functional sites. GPT-5.4 Pro was abandoned after burning $12.47 without creating a single file. True autonomy requires both independence and quality — only Opus, Sonnet, and Qwen 3.5+ delivered both.

Key Takeaways

  1. Free can beat paid. Qwen (free, B+) outscored every paid model including Opus ($3.20) and ChatGPT ($5.04). The era of "you get what you pay for" in AI is over.
  2. Price doesn't predict quality. ChatGPT was the most expensive at $5.04 and scored a C. Sonnet scored a B for $1.62. Cost and quality are decoupled.
  3. Coding benchmarks don't predict real-world results. GPT-5.4 Pro has the highest coding score in the test (57.7) and the highest price ($30/1M input), yet spent $12.47 without creating a single file. ChatGPT 5.4 has the next-highest score (57.3) and scored a C. Benchmarks measure isolated problem-solving, not end-to-end project execution.
  4. Speed kills quality. The five fastest models (Gemini Flash 2m4s, Mini 2m34s, Grok 3m57s, Nano 5m24s, Gemini Pro 5m38s) all scored D+ or worse. The two best sites (Qwen, Opus) took 37-51 minutes. You can't rush a complex build.
  5. Content depth varies wildly. Qwen generated 163 facts; Grok generated 9. Same prompt, same tools, 18x difference in content output.
  6. Autonomy without quality is worthless. Eleven models needed zero nudges — but six of them produced broken or empty sites. GPT-5.4 Pro autonomously burned $12.47 doing nothing. The real differentiator is autonomous and high-quality.
  7. Expensive doesn't mean better. GPT-5.4 Pro ($30/$180 per million tokens) was abandoned after spending $12.47 with zero output. Meanwhile, Sonnet built a working B-grade site for $1.62 and the free Qwen model scored the highest grade of all.

Our Recommendations

For ClodHost users choosing a model:

Live Sites

Visit each site yourself and judge the quality:

Screenshots

Here's what each model built:

Qwen 3.6+ (Free) — B+
Claude Opus 4.6 — B
Claude Sonnet 4.6 — B
MiniMax M2.7 — C+
ChatGPT 5.4 — C
Gemini 3.1 Pro — D+
Grok 4.20 Beta — D
Nemotron 3 Super — D
Kimi K2.5 — C
GPT-5.4-mini — D-
Qwen 3.5+ — C+
GPT-5.4-nano — F
Gemini 3.1 Flash — F
Qwen3 Coder Flash — D
Deepseek v3.2 — F

Test conducted April 4-5, 2026 using Claude Code on identical Hetzner CPX11 servers (2 vCPU, 2GB RAM), routed through ClodHost's proxy to OpenRouter. Active time measured from API call timestamps. Costs reflect only API calls during the actual build. GPT-5.4 Pro was abandoned after 32 minutes due to excessive cost ($12.47) with no progress.


Round 2: Same Test, Fresh Servers

We ran the experiment again on freshly rebuilt servers: same prompt (slightly edited to remove personal details), same server specs. We dropped the three models that were discontinued or not released in 2026 (GPT-5.4 Pro, Deepseek v3.2, Qwen3 Coder Flash) and added MiMo v2 Pro (Xiaomi).

The question: how reproducible are these results? Does a model that scored a B in Round 1 reliably produce B-level output, or was that a lucky run?

Prompt Changes

We made two small edits to protect privacy: removed Mike's last name and removed one personal sentence. Everything else was identical.

Round 2 Results

14 models, identical Hetzner CPX11 servers, routed through the same OpenRouter proxy. Results sorted by quality grade:

Model | Grade | Cost | Released | Code Score | Context | In $/1M | Out $/1M | Speed | Active Time | Nudges | API Calls | s/Call | Tokens In | Tokens Out | Cache %
Claude Sonnet 4.6 | B+ | $4.14 | Feb 17 | 50.9 | 1M | $3.00 | $15.00 | ~51 | 55m 15s | 0 | 79 | 55.5 | 6.1M | 170K | 89.8%
Claude Opus 4.6 | B | $3.19 | Feb 5 | 48.1 | 1M | $5.00 | $25.00 | ~20 | 46m 4s | 1 | 61 | 44.6 | 2.9M | 56K | 96.5%
Qwen 3.5+ | B- | $0.83 | Feb 16 | 41.3 | 1M | $0.50 | $1.50 | ~57 | 17m 14s | 0 | 57 | 17.2 | 2.8M | 59K | 0%
Qwen 3.6+ | C+ | FREE | Apr 2 | 41.6 | 1M | $0.00 | $0.00 | ~15 | 43m 19s | 1 | 64 | 18.0 | 2.5M | 39K | N/A
MiniMax M2.7 | C+ | $0.31 | Mar 18 | 41.9 | 205K | $0.30 | $1.20 | ~19 | 46m 41s | 1 | 75 | 25.8 | 3.4M | 53K | 95.6%
Gemini 3.1 Pro | C+ | $0.46 | Mar 26 | 55.5 | 1M | $2.00 | $12.00 | ~63 | 2m 56s | 0 | 20 | 7.3 | 416K | 11K | 67.9%
Nemotron 3 Super | C | FREE | Mar 11 | 31.2 | 262K | $0.00 | $0.00 | ~36 | 16m 40s | 0 | 42 | 23.7 | 1.8M | 36K | N/A
Kimi K2.5 | C | $0.36 | Jan 27 | 39.5 | 262K | $0.38 | $1.72 | ~5 | 150m 57s | 4 | 49 | 8.4 | 1.6M | 43K | 79.2%
ChatGPT 5.4 | D+ | $0.68 | Mar 5 | 57.3 | 1M | $2.50 | $15.00 | ~8 | 45m 4s | 2 | 31 | 11.4 | 957K | 23K | 95.2%
MiMo v2 Pro* | D+ | $1.64 | Apr 1 | N/A | 1M | $0.70 | $2.80 | ~29 | 57m 58s | 5 | 55 | 35.3 | 4.8M | 102K | 90.0%
Grok 4.20 Beta | D | $0.22 | Feb 17 | 42.2 | 2M | $2.00 | $6.00 | ~105 | 1m 45s | 0 | 24 | 3.1 | 410K | 11K | 90.2%
GPT-5.4-nano | D- | $0.04 | Mar 17 | 52.4 | 400K | $0.20 | $1.25 | ~8 | 42m 17s | 3 | 15 | 8.2 | 380K | 21K | 78.3%
GPT-5.4-mini | D- | $0.21 | Mar 17 | 54.4 | 400K | $0.75 | $4.50 | ~6 | 127m 33s | 3 | 47 | 2.4 | 1.3M | 13K | 93.7%
Gemini 3.1 Flash | F | $0.04 | Mar 26 | 42.6 | 1M | $0.50 | $3.00 | ~0.4 | 134m 54s | 2 | 13 | 2.2 | 240K | 3K | 48.6%

* MiMo v2 Pro initially ran on the wrong model due to a proxy configuration error (routed to Qwen 3.6+ instead). After fixing the proxy config, MiMo was re-run on the correct model (xiaomi/mimo-v2-pro) and produced a working site. The stats above reflect the corrected run.

Active time = wall-clock span from first to last API call (including idle time between nudges). Speed = output tokens / active seconds (for example, Sonnet's 170K output tokens over 55m 15s, i.e. 3,315 seconds, works out to roughly 51 tokens/sec, matching the ~51 above). For free models, cache % is N/A (not reported).

Total cost for Round 2: $12.12 (down from $26.75 in Round 1, largely because we dropped GPT-5.4 Pro, which alone cost $12.47)

Round 2 vs Round 1: Key Differences

Round 2 Live Sites

Round 2 conducted April 5, 2026 on freshly rebuilt Hetzner CPX11 servers (2 vCPU, 2GB RAM). Same proxy and OpenRouter setup as Round 1. Prompt slightly edited for privacy (removed last name and one personal sentence).


Round 3: The Codex CLI Experiment

For Rounds 1 and 2, every model ran through Claude Code — Anthropic's CLI tool for Claude. But Claude Code isn't the only agentic coding CLI. OpenAI has Codex CLI and Google has Gemini CLI. We wanted to know: does the CLI tool matter as much as the model?

We reimaged the Round 2 servers via the Hetzner API (fresh Ubuntu installs), then ran 6 of our models through Codex CLI instead of Claude Code. Same prompt, same server specs — different tool.
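
Reimaging through the API is one call per server. Roughly (a sketch using the Hetzner Cloud rebuild action; the server ID and image name are placeholders):

    # Rebuild a server onto a fresh OS image (sketch; server ID and image are placeholders).
    curl -X POST "https://api.hetzner.cloud/v1/servers/12345678/actions/rebuild" \
        -H "Authorization: Bearer $HCLOUD_TOKEN" \
        -H "Content-Type: application/json" \
        -d '{"image": "ubuntu-24.04"}'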

The Setup

Codex CLI runs in a sandboxed environment by default. We used our proxy to route all 6 models through OpenRouter, with each model receiving the same WW2 trivia prompt. The key difference: Codex's sandbox restricts filesystem writes to a project directory, which immediately caused problems when models tried to write to /var/www/.

Round 3 Results

Only 2 of 6 models produced usable websites. The sandbox killed the rest.

Model | Grade | CLI | Result
Claude Sonnet 4.6 | B- | Codex | Working static site: "WWII Obscura — Dispatches from the Forgotten War." Vintage sepia design, 7 in-depth stories, working search with live filtering. 24KB of hand-crafted HTML+CSS.
ChatGPT 5.4 | B- | Codex | Full SPA called "Signal & Salt | WWII Deep Cuts." JavaScript-driven with Google Fonts, CSS grid, and a modular architecture. Polished design, editorial tone. 4.5KB initial HTML with dynamic JS content loading.
Claude Opus 4.6 | F | Codex | Produced only CSS files, no HTML. Codex sandbox blocked writes to /var/www/.
Qwen 3.6+ | F | Codex | Same as Opus — CSS only, no HTML. Sandbox restriction.
MiniMax M2.7 | F | Codex | Nothing produced. Process exited without creating any files.
Gemini 3.1 Pro | F | Codex | Nothing produced. Process exited without creating any files.

What Went Wrong

Codex CLI's sandbox is the story here. Unlike Claude Code (which gives models full shell access), Codex runs tool calls in an isolated environment that restricts writes outside the project directory. When a model tries to apt install apache2 or write to /var/www/, the sandbox blocks it.

Sonnet and ChatGPT 5.4 adapted — they figured out how to write files within the allowed directories and produced complete sites. The other four models couldn't navigate the restrictions and either produced partial output (CSS without HTML) or nothing at all.

This is a significant finding: the CLI tool's sandbox policy matters as much as the model. Opus and Qwen — both strong performers in Rounds 1 and 2 via Claude Code — scored F when running through Codex's restricted sandbox. The models aren't worse; the tool limited what they could do.

Round 3 Live Sites

Round 3 conducted April 5, 2026 using Codex CLI on freshly reimaged Hetzner CPX11 servers. Models routed through the ClodHost proxy to OpenRouter.


Round 4: The Gemini CLI Experiment

Next we tried Google's Gemini CLI — a tool designed specifically for the Gemini model family. We reimaged the same servers again and ran all 6 models through Gemini CLI with the -y flag (YOLO mode: auto-approve all tool calls) and -p for headless execution.
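
The invocation itself is one command per server (the prompt is elided here; flags as described above):

    # Headless, auto-approving Gemini CLI run (prompt elided; the key is issued by our proxy).
    export GEMINI_API_KEY="proxy-issued-key"
    gemini -y -p "Do not ask me anything, attempt to one-shot the whole site..."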

This required significant proxy engineering. Gemini CLI speaks Google's Gemini API format, not OpenAI's format. Our proxy had to translate between Gemini's contents/parts format and OpenAI's messages format, handle function call/response translation, convert responseMimeType/responseSchema structured output, and return a compatible models list. We also hit Cloudflare 524 timeout issues (100-second limit) that required switching the proxy to DNS-only mode.
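
Concretely, the proxy sits between two request shapes. A trimmed sketch of both sides (the proxy hostname and model slugs are placeholders, and the bodies are cut down to the fields that matter for the translation):

    # What Gemini CLI speaks (Google's format: role + parts under "contents"):
    curl -s "https://proxy.example.com/v1beta/models/gemini-3.1-pro:generateContent" \
        -H "x-goog-api-key: $GEMINI_API_KEY" -H "Content-Type: application/json" \
        -d '{"contents": [{"role": "user", "parts": [{"text": "configure apache to serve an ssl site..."}]}]}'

    # What OpenRouter expects (OpenAI format: role + string content under "messages"):
    curl -s "https://openrouter.ai/api/v1/chat/completions" \
        -H "Authorization: Bearer $OPENROUTER_KEY" -H "Content-Type: application/json" \
        -d '{"model": "model-under-test", "messages": [{"role": "user", "content": "configure apache to serve an ssl site..."}]}'

The proxy does the same translation in reverse for every response, and again for each tool call and tool result in between.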

Round 4 Results

Only 1 of 6 models produced a usable website. And it was the native one.

Model | Grade | CLI | Result
Gemini 3.1 Pro | B- | Gemini | Full PHP+SQLite site: "Mike's World War II Obscure History & Trivia." Bootstrap design with military olive navbar, hero section using real Normandy invasion photo from Wikimedia, 24 trivia cards with categories, working search, and a personal dedication to Mike. Set up Apache, SSL, and firewall — all autonomously.
Claude Opus 4.6 | F | Gemini | Created db.php and empty SQLite database, then got stuck in a loop trying to use Gemini CLI's generalist sub-agent tool — which requires a request parameter that Opus never provided correctly. Ran for 20+ minutes making API calls but never created any frontend files.
Claude Sonnet 4.6 | F | Gemini | Created only config.php before hitting Cloudflare 524 timeout errors and tool parameter mismatches (write_file missing file_path, read_file missing file_path). The model called tools with wrong parameter names.
ChatGPT 5.4 | F | Gemini | Never used any tools. Simply output text describing what it would build — a proposal with bullet points about Apache, PHP, SQLite — but never executed a single action. Ran three times with the same result.
MiniMax M2.7 | F | Gemini | Locked itself out of the server. The model ran ufw enable (firewall) which blocked SSH port 22, making the server permanently inaccessible. The Gemini CLI process then had no server to deploy to.
Qwen 3.6+ | F | Gemini | Same as MiniMax — locked itself out by enabling the firewall. Also hit write_file parameter errors and 429 rate limits on the free tier before losing server access entirely.

What Went Wrong

Gemini CLI is designed for Gemini models. Its tools have very specific parameter schemas — for example, write_file requires a file_path parameter, generalist requires a request parameter. When non-Gemini models produce tool calls, they use different parameter names (like path instead of file_path), causing every tool call to fail with a validation error.

This created a cascade of failures across the five non-Gemini models, detailed in the table above.

Only Gemini 3.1 Pro — the model Gemini CLI was built for — could use the tools correctly. And it did a good job: full PHP application, SQLite database with 24 real trivia entries, Bootstrap frontend, Apache SSL configuration, and firewall setup. It went from D+ (Round 1, Claude Code) to B- (Round 4, Gemini CLI) — a two-grade improvement just from using its native tool.

The Firewall Problem

Two models (MiniMax and Qwen) ran ufw enable as part of server setup but didn't add an SSH allow rule first. This immediately blocked port 22, making the servers permanently inaccessible. The models were then stranded — they could still execute shell commands through the Gemini CLI tool (which was already running on the server), but the tool parameter mismatches meant they couldn't write any files. A particularly unfortunate failure mode: the server is changed but the task can't be completed.
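
The fix is a one-line ordering change that neither model made (standard ufw usage; nothing here is specific to our setup):

    # Open SSH (and the web ports) before turning the firewall on, so access survives "ufw enable".
    ufw allow 22/tcp       # or: ufw allow OpenSSH
    ufw allow 80/tcp
    ufw allow 443/tcp
    ufw enable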

Round 4 Live Sites

Round 4 conducted April 5, 2026 using Gemini CLI v0.36.0 on freshly reimaged Hetzner CPX11 servers. Models routed through ClodHost's Gemini-format proxy to OpenRouter. Required proxy modifications: Gemini-to-OpenAI format translation, structured output handling, Cloudflare DNS-only mode for timeout avoidance.


Cross-Round Comparison

Here's how each model performed across all rounds and CLI tools:

Model | R1 (Claude Code) | R2 (Claude Code) | R3 (Codex CLI) | R4 (Gemini CLI)
Claude Sonnet 4.6 | B | B+ | B- | F
Claude Opus 4.6 | B | B | F | F
ChatGPT 5.4 | C | D+ | B- | F
Gemini 3.1 Pro | D+ | C+ | F | B-
Qwen 3.6+ | B+ | C+ | F | F
MiniMax M2.7 | C+ | C+ | F | F

The Big Takeaway: CLI Tools Are Not Interchangeable

The most striking finding isn't about models — it's about tools. The same model can score a B on one CLI and an F on another.

Each model performs best on its maker's CLI tool. This isn't surprising in hindsight — the CLI tools are designed with specific tool schemas, sandbox policies, and interaction patterns that their native models have been trained on. But the magnitude of the effect is dramatic: Gemini went from D+ to B- just by switching from Claude Code to Gemini CLI.

Updated Recommendations

For ClodHost users:

Rounds 3 and 4 conducted April 5, 2026 on freshly reimaged Hetzner CPX11 servers. Codex CLI used default sandbox settings. Gemini CLI v0.36.0 used with -y (YOLO mode) and -p (headless). All models routed through ClodHost's proxy to OpenRouter.

🤖 Prompt Used to Generate This Post
I want to run a little test of the various models. Can you set up on my account a bunch of new servers; each running a different model, and then send the initial prompt used for ww2obscura to them? Make sure to track total input tokens output tokens, cache hit %, cost, and total time to complete for each, then make a blog post comparing them all.