TL;DR
- GPT‑5 was an upgrade, but not a revolutionary one; the hype outlasted the substance.
- All the top models are very good now. Pick by task, price, privacy, and tooling.
- Benchmarks often change with each launch. Treat them as marketing confetti, not commandments.
- Agent coding tools are where a lot of the magic lives: Cursor (IDE), qwen‑code, gemini‑cli, claude‑code (CLIs).
- Do a tiny, honest home test (≈20 examples) and measure accuracy, latency, and cost. Then choose.
Once upon a tokenizer, we stopped asking “Who’s the smartest?” and started asking “Who’s the most useful for what I’m doing right now?” That question is kinder, cheaper, and gets better software shipped.
The lineup (fast portraits, real use cases)
- OpenAI GPT‑5
- Strengths: Solid general reasoning, broad ecosystem, lots of SDKs and integrations.
- Good for: Mixed workloads (chat + code + docs), teams already on OpenAI tooling.
- Anthropic Claude (current 4.x family)
- Strengths: Reliable long‑context reading, careful instruction following, strong coding ergonomics.
- Good for: Summarizing large docs, analysis with citations, writing/refactoring bigger code edits in one go.
- Google Gemini 2.5 Pro
- Strengths: Native multimodality, creative writing tone, deep ties to Google Workspace and Cloud, excellent coding with very long contexts.
- Good for: Brainstorming, content generation, working inside Google’s productivity stack, complex coding tasks requiring long context windows.
- Alibaba Qwen (Qwen3 family, plus coder variants)
- Strengths: Excellent coding options, strong open‑weight story for self‑hosting and fine‑tuning.
- Good for: Teams wanting control (on‑prem/VPC), cost tuning, or specialized coding agents.
- DeepSeek
- Strengths: Efficient architectures and strong math/general reasoning in open‑weight form.
- Good for: High‑volume/self‑hosted scenarios, research and engineering teams optimizing inference.
All five can do everyday tasks well. The interesting differences show up under constraints—very long inputs, strict privacy, tight latency budgets, high concurrency, or unusual domains.
Why benchmarks feel slippery now
- New test suites pop up whenever a new model drops. Surprise: the new model shines.
- Public leaderboards reflect crowd taste and prompt style as much as capability.
- Static academic sets tend to leak into pretraining corpora over time.
Use benchmarks as lightweight signals. Make decisions with your own data.
A tiny home test (takes an afternoon, saves a quarter)
- Collect 20 representative examples (docs, prompts, code snippets). Hide the answers in a separate file.
- Define success (accuracy/F1, human 1–5 score, or acceptance rate) and a latency/cost budget.
- Try 3–4 contenders (mix of vendors and model sizes).
- Keep prompts identical except where the API demands changes. Log every change.
- Pick the cheapest model that meets your bar. Keep a runner‑up as fallback.
Re‑run quarterly or when something starts to feel “off.”
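If you want that afternoon test to be repeatable, a small harness is enough. The sketch below is a minimal example rather than a framework: `call_model` is a placeholder for whichever SDK you use, the model names and per‑token prices are invented for illustration, and the exact‑match scoring should be swapped for whatever "success" means in your task.

```python
# Minimal home-test harness (sketch). call_model(), the model names, and the
# per-token prices are placeholders; only the bookkeeping pattern matters.
import json
import statistics
import time

MODELS = {
    "vendor-a-small": {"usd_per_1k_tokens": 0.0005},  # hypothetical pricing
    "vendor-b-large": {"usd_per_1k_tokens": 0.0100},
}

def call_model(model: str, prompt: str) -> tuple[str, int]:
    """Replace with a real SDK call; return (answer_text, total_tokens_used)."""
    raise NotImplementedError

def run_home_test(examples_path: str) -> None:
    # examples file: [{"prompt": "...", "expected": "..."}, ...], answers kept separately
    examples = json.load(open(examples_path))
    for model, meta in MODELS.items():
        correct, latencies, tokens = 0, [], 0
        for ex in examples:
            start = time.perf_counter()
            answer, used = call_model(model, ex["prompt"])
            latencies.append(time.perf_counter() - start)
            tokens += used
            correct += int(ex["expected"].strip().lower() in answer.lower())
        print(
            f"{model}: acc={correct / len(examples):.0%}  "
            f"median_latency={statistics.median(latencies):.2f}s  "
            f"est_cost=${tokens / 1000 * meta['usd_per_1k_tokens']:.4f}"
        )
```

Twenty examples will not give you statistical certainty, but they will reliably expose the model that is clearly wrong for your task, and the log makes the quarterly re‑run a one‑command affair.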
Agent coding tools: where your hands actually live
- Cursor (IDE)
- What it is: An AI‑native editor that understands your codebase, supports inline edits, and can run task‑oriented “agent” workflows.
- Why it matters: Captures your daily flow; backend models can be swapped without changing your habits.
- qwen‑code (CLI)
- What it is: A command‑line coding agent from the Qwen ecosystem that reads/writes files, runs shell commands, and helps with multi‑step changes.
- Why it matters: Pairs well with Qwen coder models and self‑hosting; good for scripted, reproducible refactors.
- gemini‑cli (CLI)
- What it is: Google’s terminal assistant with built‑in tools (search, web, file ops) and project memory.
- Why it matters: Great for brainstorming, content tasks, and teams already in Google’s stack.
- claude‑code (CLI)
- What it is: Anthropic’s coding agent for the terminal: project guides, diffs, git workflows, careful edits.
- Why it matters: Strong for larger, coherent edits and analysis on long files.
Tip: Learn one IDE (Cursor) and one CLI (pick qwen‑code, gemini‑cli, or claude‑code). The muscle memory is the real moat.
Picking a model by job, not by logo
- Long, careful reading of big docs → Claude or Gemini (large contexts). Also consider GPT if you’re already invested in its tools.
- General product work (docs + code + chat) → GPT or Claude; add Gemini for creative tone.
- Heavy coding workflows → Claude (coherent edits), GPT (refactors/tests), Qwen coder models (cost/control). Use the agent you like.
- Strict privacy or cost control → Qwen or DeepSeek open‑weights; self‑host in VPC; fine‑tune if needed.
- Multimodal creativity → Gemini; close second: GPT multimodal variants.
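If your application routes calls in code, these preferences can live in a small table so swapping vendors later is a one‑line change. The job labels and model identifiers below are placeholders, not recommendations; fill in whatever passed your own home test.

```python
# Sketch of routing by job, not by logo. Job labels and model names are
# placeholders for whichever models met your bar in the home test.
ROUTING = {
    "long_doc_reading": "long-context-model",
    "general_product_work": "general-model",
    "heavy_coding": "coding-model",
    "private_or_high_volume": "self-hosted-open-weights",
    "multimodal_creative": "multimodal-model",
}

def pick_model(job: str, default: str = "general-model") -> str:
    """Return the cheapest model known to meet the bar for this job."""
    return ROUTING.get(job, default)
```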
Cost and latency common sense
- Prefer smaller models for rote tasks; reserve bigger models for gnarly steps.
- Keep prompts short. Use IDs or summaries instead of pasting novels.
- Ask for structured output (JSON, diffs) to cut clarification turns.
- Cache reusable bits (instructions, context) however your stack allows.
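Two of those habits are easy to wire in directly: cache identical calls so repeated questions cost nothing, and ask for JSON so you parse instead of sending a clarifying follow‑up. The sketch below assumes a generic `call_model` wrapper around your SDK; the system prompt and field names are examples only.

```python
# Sketch: an in-process cache plus a JSON-only output contract. call_model()
# is a placeholder for your SDK; the system prompt and keys are illustrative.
import json
from functools import lru_cache

SYSTEM = "Return JSON only with keys [summary, risk]. If a value is unknown, use null."

def call_model(model: str, system: str, user: str) -> str:
    """Replace with a real SDK call returning the model's raw text output."""
    raise NotImplementedError

@lru_cache(maxsize=1024)
def cached_call(model: str, user: str) -> str:
    # lru_cache keys on the exact arguments, so an identical prompt is a free rerun
    return call_model(model, SYSTEM, user)

def summarize(model: str, text: str) -> dict:
    raw = cached_call(model, text)
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return {"summary": None, "risk": None}  # degrade gracefully instead of re-prompting
```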
A few reusable prompt recipes
- Careful assistant: “You are precise and concise. If uncertain, ask a clarifying question first. Prefer steps over flourishes.”
- Extraction: “Return JSON only with keys [field_one, field_two, …]. If a value is unknown, use null. Do not invent information.”
- Code edits: “Explain changes in 1–2 bullets, then produce a unified diff. If the request is risky, ask before editing.”
- Long‑doc answers: “Quote or cite section headings for each claim. If the answer is not present in the provided text, say ‘not found.’”
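The extraction recipe in particular pays off when you enforce its contract in code. Here is a minimal sketch, assuming a chat‑style message format and example field names:

```python
# Sketch: turning the "Extraction" recipe into a message plus a strict parser.
# The message shape mirrors common chat APIs; the field names are examples.
import json

FIELDS = ["field_one", "field_two"]

EXTRACTION_PROMPT = (
    f"Return JSON only with keys [{', '.join(FIELDS)}]. "
    "If a value is unknown, use null. Do not invent information."
)

def build_messages(document: str) -> list[dict]:
    return [
        {"role": "system", "content": EXTRACTION_PROMPT},
        {"role": "user", "content": document},
    ]

def parse_extraction(raw: str) -> dict:
    """Enforce the contract: every expected key present, unknown values as None."""
    data = json.loads(raw)
    return {key: data.get(key) for key in FIELDS}
```

Parsing against a fixed key list means a model that drifts from the recipe fails loudly in your code instead of silently in your data.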
What to ignore (with love)
- Leaderboard fireworks designed for launch week.
- Absolutes like “X is the best at everything.”
- One‑off Twitter threads with three screenshots and a vibe.
The kind ending
The moat is shallow now; the tables are many. Pick the one that fits your task, budget, and rules. Learn a tool you love for your hands. Measure a tiny bit, regularly. You’ll get excellent results from any of the top models, and genuinely helpful software assistance from whichever agent fits you best.