The grant application says "AI-powered artist management." Most technically literate readers have used ChatGPT, uploaded a PDF to it, run a query against a spreadsheet in it, maybe built a custom GPT or a Project. Against that reference frame, "AI-powered company" is easy to picture as the operator uploading a deal memo into a chat, asking a question, pasting the answer somewhere, moving on. That shape produces useful individual answers. It does not produce a company.
What's probably less familiar is the thing running underneath Different Gear's operations: a carefully architected company repository — typed, versioned, append-only records that have been accumulating for months across every artist, deal, tour, and decision — with agentic harnesses (Claude Code, Codex) running frontier reasoning models over it. The harnesses route across the repository to find the exact bytes any given question needs, read those slices, and produce answers cited back to specific files. The combination — structured persistent substrate, targeting harness, frontier model — produces accuracy and cross-cutting insight at a different altitude than a general chat session reaches in practice, given how people actually use one.
This page lays the shape out for technically literate readers.
The simplest chat interaction is a single round-trip: a prompt goes in, a completion comes out, and nothing persists.
Modern consumer chat interfaces have moved well past this — ChatGPT and Claude.ai can upload files, run code in a sandbox, call web search, persist shallow state across sessions via memory or Projects. The distinction that matters for operational work is not "can the AI read a file" (it can), and not "does it have any state at all" (it has some). The distinction is the combination of three things: a structured, versioned repository the same system reads from and writes to across weeks and months; a set of targeting primitives that let the model decide what to load rather than the user pasting or uploading; and a multi-turn, stop-safely agent posture instead of a one-shot completion.
An agentic harness is the program that orchestrates all three. It runs around a reasoning model in a loop: each turn, it gives the model a task and a set of tools; the model plans; it calls a tool (read a file, run a shell command, query an API, edit a file, run a test); the harness executes the tool and returns the result; the model reads the result, updates its plan, and either calls another tool or declares the task complete. Claude Code and Codex are two such harnesses, built by Anthropic and OpenAI respectively, running on their frontier reasoning models. They are purpose-built for software engineering work, which is why they also happen to be well-suited to running a structured company repository — the workloads are structurally similar enough that the same tools transfer.
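That loop can be sketched in a few lines of Python. Everything here is an illustrative stand-in — the message schema, the `call_model` function, and the tool dictionary are hypothetical, not either vendor's actual API — but the control flow is the one described above: plan, call a tool, observe the result, repeat until done.

```python
def run_task(task, tools, call_model, max_turns=20):
    """Drive a reasoning model through a plan -> tool -> observe loop.

    tools:      dict mapping tool name -> callable (read, grep, edit, ...)
    call_model: callable that takes the transcript and tool set and returns
                either {"type": "done", "answer": ...} or
                {"type": "tool", "tool": name, "args": {...}}
    """
    transcript = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        action = call_model(transcript, tools)  # model plans, picks next step
        if action["type"] == "done":
            return action["answer"]             # model declares the task complete
        # harness executes the tool and feeds the result back to the model
        result = tools[action["tool"]](**action["args"])
        transcript.append(
            {"role": "tool", "tool": action["tool"], "result": result}
        )
    raise RuntimeError("turn budget exhausted without a final answer")
```

The harness, not the model, is the thing that actually touches the filesystem; the model only ever emits structured requests and reads structured results.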
The implications are large. A harness can open ten files in sequence, trace a bug through a codebase, run the test that verifies the fix, notice the test still fails, adjust, and iterate — all without a human in the loop for the individual steps. It can read yesterday's decision log before making today's decision. It can check a deal term against the signed contract in the filesystem rather than recall it from a conversation that may not have happened.
Critically, a well-designed harness has a stop-safely posture. When the model encounters something it isn't sure about — an ambiguous instruction, a destructive operation it doesn't have authorization for, a file it doesn't recognize — it stops and asks, rather than guessing. This is the opposite of the autocomplete reflex and is a major contributor to accuracy on consequential tasks.
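The stop-safely posture can be made concrete with a small gate that sits between the model's tool request and its execution. The tool names and allowlist below are hypothetical, a sketch of the policy shape rather than any harness's real permission model:

```python
# Read-only tools execute freely; anything that mutates state, and anything
# the gate does not recognize, stops and asks the operator instead of guessing.
SAFE_TOOLS = {"read", "grep", "glob", "git_log"}        # illustrative allowlist
NEEDS_CONFIRMATION = {"edit", "shell", "delete"}        # illustrative denylist

def gate(tool_name, ask_operator):
    """Return True to execute the tool, False to skip it.

    ask_operator is a callable taking a question string and returning a bool;
    in a real harness this would be an interactive prompt.
    """
    if tool_name in SAFE_TOOLS:
        return True
    if tool_name in NEEDS_CONFIRMATION:
        return ask_operator(f"Allow destructive tool '{tool_name}'? [y/N] ")
    # unknown tool: the autocomplete reflex would guess; the harness stops
    return ask_operator(f"Unrecognized tool '{tool_name}' — proceed? [y/N] ")
```

The important property is the default: the unrecognized case escalates rather than proceeds.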
A 1M-token context window is a real capability and it enables things smaller windows cannot — §4 below gets into that. But "big context" is only half the story. The other half is what the harness does before loading anything: it routes to the exact bytes that are relevant, through a small vocabulary of targeting commands that most chat interfaces do not expose. The two work together. Targeting keeps the context clean so the model's attention lands where it matters; large context means when the right move is to hold the whole record in one pass, that option is available.
Concretely, when asked "why is the Sophia tour itinerary serving stale data," the Claude Code harness does not paste the repo into context. It runs something like this:
# 1. find the relevant code — not by guessing, by searching
rg -n "serve_itinerary|itinerary_cache" src/ --type py
# 2. narrow to the call site — one file, not ten
rg -l "itinerary_cache" src/different_gear/
# 3. read only the relevant slice — lines 48–92, not the whole file
# (the Read tool accepts offset + limit; the harness passes both)
read("src/different_gear/tour/serve.py", offset=48, limit=45)
# 4. check what changed recently — scoped to this file, this week
git log --since="1 week ago" -- src/different_gear/tour/serve.py
# 5. check the last commit that touched the cache logic
git blame -L 60,75 src/different_gear/tour/serve.py
# 6. run the failing test to see the actual error
pytest tests/test_itinerary_serve.py::test_cache_invalidation -x
Six commands. Each one returns on the order of 10–200 tokens of output. Total loaded into the model's context for this diagnostic: roughly 2,000 tokens of highly targeted evidence. The model then reasons over that evidence, proposes a fix, and the harness calls the Edit tool with an old_string / new_string pair that changes only the specific bytes that need to change — it does not rewrite the file.
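The Edit contract described above — change only the bytes the model intended, refuse anything ambiguous — reduces to a unique-match replacement. This is a hedged sketch of that contract, not Claude Code's actual implementation:

```python
def apply_edit(text, old_string, new_string):
    """Replace old_string with new_string only if it occurs exactly once.

    Zero matches means the model's picture of the file is stale; multiple
    matches mean the edit is ambiguous. Both are refusals, not guesses —
    the diff that lands is exactly the diff the model committed to.
    """
    count = text.count(old_string)
    if count != 1:
        raise ValueError(f"expected exactly 1 match, found {count}")
    return text.replace(old_string, new_string)
```

The refusal on zero matches is what forces the model to re-read the file before editing it, which is where much of the accuracy comes from.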
Contrast with a 1M-context chat window (Claude.ai desktop, the web app, the "cowork" surfaces). To ask the same question there, a user must either (a) paste the whole codebase into the window, which burns 500,000+ tokens of context on code that is 99% irrelevant to the bug, or (b) pre-select the files they think are relevant, which reintroduces the human-guesswork failure mode that a coding harness exists to reduce. In either case, the context is full of noise, the cache-friendly stable portion is gone because every turn reshuffles what's loaded, and the model's attention is spread across content it does not need.
The tools that make the difference are boring and unix-shaped on purpose:
- rg (ripgrep) — find a string or regex across the codebase in milliseconds. Returns file paths and line numbers, not file contents. The harness uses this to decide where to look before loading anything.
- glob — pattern-match file paths (`**/*.py`, `src/**/tour/*.ts`). Used to scope a subsequent search without reading any files.
- Read(path, offset, limit) — read a specific line range, not a whole file. A 3,000-line file costs 30,000 tokens if you read it all; reading the relevant 50 lines costs 500.
- git log --since --until -- <path> — scope history in time and location. "What changed in this file this week" is far cheaper than "here is the full git history."
- git blame -L <start>,<end> <file> — point at exactly the lines that are behaving badly and ask "who last touched these, and why." Returns author + commit message per line, not the whole history.
- Edit(file, old_string, new_string) — change bytes, not files. The diff that goes to the filesystem is exactly the diff the model intended, which is also the diff the model has to be confident about.
- Task(subagent, prompt) — dispatch an isolated sub-agent with its own context for open-ended exploration, and receive a compressed summary back. The parent harness's context never sees the sub-agent's 50,000-token exploration; it sees only the 300-token answer.

Each of these is a targeting primitive — the harness asking "where should I look?" before asking "what does it say?", the same way experienced humans debug. For narrow operational questions, this is the whole game. "What's the status of Sophia's Primavera SP offer" needs the specific offer file and the last few log entries that reference it, not her full record.
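The Read economics are worth making concrete. A sliced read can be sketched as a plain function; the real tool's exact semantics may differ, and the 1-indexed offset here is an assumption chosen to match the line numbers a grep returns:

```python
def read_slice(path, offset=1, limit=None):
    """Return lines [offset, offset + limit) of a file, 1-indexed.

    This is the 'read the relevant 50 lines, not the 3,000-line file'
    primitive: the caller pays tokens only for the slice it asked for.
    """
    with open(path, encoding="utf-8") as f:
        lines = f.readlines()
    end = len(lines) if limit is None else offset - 1 + limit
    return "".join(lines[offset - 1:end])
```

A harness that has just run `rg -n` already holds the line numbers it needs, so the offset arrives for free.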
Targeting breaks down when the question requires reasoning across many parts of the record simultaneously, because no single grep can find "everything this decision depends on." The interactions are the answer. Consider an inbound brand partnership offer for Sophia — a $75K deal from a beverage company, six-month exclusivity in beverages, North American territory, press announcement tied to a specific date. To advise on it responsibly, the operator needs to reason over, in one pass: every active brand deal and its exclusivity clauses; the 2026 touring calendar, including routing, hold dates, and press cadence; the release calendar; the label's marketing calendar and approvals; the company's standing positions on rate floor and brand fit; the prior decisions in the append-only log; and territory-level listener data.
Each of those lives in a different part of the substrate. A targeting-first approach would make six separate retrieval decisions up front, which is exactly the question the harness is being asked to help answer — if the operator already knew which brand deals to check, they wouldn't need the system for this decision. This is the task profile where the harness loads the whole artist record into one pass and reasons across it:
# targeting still does the discovery — it finds what to load, not what to read
rg --files clients/sophia-stel/ --type md \
| head -n 200 # file list, not file contents
# then the harness loads the relevant slices in parallel into the context:
# clients/sophia-stel/deals/brand/ (all active brand deals)
# clients/sophia-stel/touring/2026/ (routing, hold dates, press cadence)
# clients/sophia-stel/releases/ (singles + album calendar)
# clients/sophia-stel/label/a24/ (marketing calendar, approvals)
# clients/sophia-stel/log/ (append-only decision history)
# clients/sophia-stel/positions/ (company stances — rate floor, brand fit)
# clients/sophia-stel/signal/spotify/ (territory-level listener data)
# notes/positions/ (partnership-level stances)
# ~350,000 tokens loaded. Cache hit on the stable portion (positions, log
# history) means follow-up questions in the same session pay only for the
# deltas. Model reasons over all of it simultaneously, not six fragments
# retrieved in sequence, and produces an answer that names the conflicting
# beverage exclusivity, the tour-announce collision, and a suggested counter.
A retrieval-first approach at this altitude degrades in a specific way: the retriever finds the pieces that match the query, the model reasons over those pieces, and it produces an answer that looks right because it is right given what the model saw — but the exclusivity conflict in the unretrieved contract was the thing that actually mattered. The 1M window exists to let the model hold the whole web of interactions at once, so no single retrieval decision is load-bearing on the accuracy of the final answer.
Targeting and large context are not competing primitives; they serve different question shapes. Narrow operational recall ("what did we agree with the hotel?") is a targeting task. Cross-cutting strategic decisions ("should we take this deal?") are context-density tasks. A good harness does both well, and the grant-funded deal modelling and territory intelligence tools are specifically the second kind of workload.
The harness is the program. The model is the reasoning engine. The two currently in production at Different Gear: Claude Code, built by Anthropic and running its frontier Claude models, and Codex, built by OpenAI and running its frontier reasoning models.
"Frontier" here means the largest, most capable, most expensive models the labs produce — the ones they sell API access to rather than the small free-tier chatbots. They cost roughly 10–50× more per token than smaller models, and for work where accuracy compounds (artist contracts, tour finances, decision records), the cost difference is not the right thing to optimize.
These models are also reasoning models in the specific technical sense: they are trained to spend compute on extended internal deliberation before producing a final answer, and modern harnesses expose that explicitly. On a hard question, the model can think for several minutes before acting, which materially reduces the "looked right, was wrong" failure mode.
The model's context window is how much information it can reason over in a single pass. Claude Opus 4.x has a 1,000,000-token window — roughly 750,000 words, or several novels. For artist management, this unlocks a specific capability: the harness can load an artist's full campaign state — goals, contracts in force, active deals, tour itinerary, streaming trends, recent log entries, open decisions — into a single reasoning pass, and answer a question grounded in all of it simultaneously.
Compare this to the retrieval-augmented-generation (RAG) approach that was the default pattern for production LLM systems in 2023–2024, where smaller context windows forced systems to retrieve document fragments and stitch them together. RAG is still widely deployed — it's the right answer when the corpus is larger than any context window can hold — but it introduces a retrieval-accuracy problem on top of a reasoning-accuracy problem: if the retriever pulls the wrong fragment, the model reasons correctly over the wrong evidence. A model with a large enough context window to hold the entire relevant state largely sidesteps that failure mode, at the cost of some attention dilution across long contexts — a real phenomenon, mitigated in practice by careful substrate structuring (the topic of §5) and by the fact that frontier models in 2026 hold accuracy at length much better than their 2024 predecessors.
Prompt caching is the economic enabler. Loading 500,000 tokens of context costs money every time you load it. Anthropic's prompt caching allows the harness to cache the stable portion of the context (the substrate, the artist's accumulated record) for up to five minutes, so follow-up turns in the same session pay for only the new tokens. This is what makes the "load the whole state every turn" pattern viable as everyday practice rather than a one-off query.
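The arithmetic behind that claim is simple enough to write down. The prices below are placeholder assumptions for illustration only, not Anthropic's actual rates, and the sketch ignores the small premium real APIs charge for the initial cache write:

```python
def session_cost(stable_tokens, delta_tokens_per_turn, turns,
                 input_price_per_mtok=3.00,      # assumed uncached input price, $/1M tokens
                 cached_price_per_mtok=0.30):    # assumed cached-read price, $/1M tokens
    """Return (uncached_cost, cached_cost) in dollars for a multi-turn session.

    Without caching, every turn re-pays for the full stable context.
    With caching, turn 1 loads the stable portion at full price and later
    turns pay the cached rate for it, plus full price for the per-turn delta.
    """
    uncached = (stable_tokens + delta_tokens_per_turn) * turns \
        * input_price_per_mtok / 1e6
    cached = (stable_tokens * input_price_per_mtok
              + (turns - 1) * stable_tokens * cached_price_per_mtok
              + turns * delta_tokens_per_turn * input_price_per_mtok) / 1e6
    return uncached, cached
```

Under these assumed prices, a five-turn session over a 500K-token stable record with 2K tokens of new material per turn costs a few times less cached than uncached, which is the difference between "load the whole state every turn" being a stunt and being everyday practice.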
The non-obvious point about harnesses: because the model reasons only over what's in its context, the shape of the context directly determines the shape of the answers. Two harnesses with identical models will produce different-quality outputs depending on what's loaded in and how it's structured.
A naive pattern — dump the artist's email inbox into the context and ask a question — produces bad output, because the signal-to-noise ratio is terrible and the model spends its reasoning budget parsing formatting rather than deciding.
A structured pattern — load the relevant shape of the state, with fields labelled, dates normalized, and irrelevant history excluded — produces dramatically better output on the same underlying model.
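The difference between the two patterns is visible in code. A minimal sketch of the structured pattern, with hypothetical field names, is a renderer that flattens typed records into a labelled, predictable context block instead of pasting raw documents:

```python
def render_context(record):
    """Flatten a typed record into a labelled context block.

    record: dict mapping section name -> list of dict entries with
            already-normalized fields (dates as ISO strings, amounts as
            numbers). Sorting the fields keeps the layout stable, which
            also keeps the prompt-cache-friendly prefix stable.
    """
    lines = []
    for section, entries in record.items():
        lines.append(f"## {section}")
        for entry in entries:
            fields = ", ".join(f"{k}={v}" for k, v in sorted(entry.items()))
            lines.append(f"- {fields}")
    return "\n".join(lines)
```

The model spends its reasoning budget on the decision rather than on parsing formatting, and the stable layout means successive turns reuse the cached prefix.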
This is why Different Gear invests in the substrate rather than just uploading ad-hoc documents into a general chat session. Every record in the substrate is designed for machine consumption as well as human consumption. A snippet from the actual system showing the entity pointer pattern — the glue that lets a harness discover and load canonical artist state:
# .entities/artists/sophia-stel.yml
kind: artist
slug: sophia-stel
canonical_repo: different-gear
canonical_path: clients/sophia-stel
governed_by: different-gear
partnership_stance_inheritance: all
active_since: 2026-04-11
primary_operator: matty
advisory_operator: jarett
signals:
  - platform: spotify
    status: adapter_ready_oauth_deferred
The pointer tells the harness: Sophia's canonical record is at different-gear/clients/sophia-stel/; the Spotify signal adapter is ready but not yet wired. The harness reads this first, then follows the pointer to load whichever slices of Sophia's state the task needs. The pointer layer is what prevents context bloat: the harness doesn't load every artist's full history on every question — it loads what the pointer says is relevant.
And the resolver that turns the pointer into a canonical path:
# punkbrainlib/entity_intelligence.py (excerpt)
from pathlib import Path

def resolve_canonical_path(manifest, punk_brain_root):
    """Combine canonical_repo + canonical_path into an absolute path."""
    repo_slug = manifest.get("canonical_repo")
    canonical_rel = manifest.get("canonical_path")
    if not repo_slug or not canonical_rel:
        return None
    repo_manifest = load_repo_manifest(punk_brain_root, repo_slug)
    repo_path_str = repo_manifest.get("path")
    if not repo_path_str:
        return None
    path = Path(repo_path_str) / canonical_rel
    if not path.exists():
        return None
    return path
The function returns None on every missing-data case rather than raising an exception. In a general Python codebase this would be the wrong default — PEP 20 says errors should never pass silently, and deterministic callers want early, loud failure. But this resolver is called by an LLM planner operating inside a harness, and the failure mode changes shape: a loud exception mid-task invites the model to invent a plausible path around the error, while a quiet None forces the caller code to handle missing-data explicitly, which is also where any correct workflow wants the branch point. The pattern is scoped to harness-facing helpers; elsewhere in the codebase, raising is still the right default.
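What handling that quiet None looks like from the caller's side can be sketched as follows. The caller below is hypothetical, and it takes the resolver as a parameter so the sketch stands alone; the point is the explicit branch on missing data:

```python
def load_artist_state(manifest, punk_brain_root, resolver):
    """Hypothetical harness-facing caller around a resolver that returns
    None on any missing-data case.

    resolver: callable (manifest, punk_brain_root) -> path or None.
    Returns a structured result the planner must acknowledge, rather than
    letting a traceback leak into the model's loop, where it might route
    around the error with an invented path.
    """
    path = resolver(manifest, punk_brain_root)
    if path is None:
        # the branch point any correct workflow wants: missing is a state,
        # not an exception
        return {"status": "missing", "slug": manifest.get("slug")}
    return {"status": "ok", "path": path}
```

The planner sees `status: missing` as data it has to reason about, which is exactly the stop-safely posture expressed at the data layer.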
Described at the right altitude: a shared information substrate that holds canonical artist state, plus a set of views and tools that read from and write to it, orchestrated by harnesses (Claude Code and Codex) running frontier reasoning models.
The substrate is the product. The tools the company runs — tour logistics intake, live itinerary, automated advancing, deal modelling, territory intelligence — are views on the substrate. New tools are cheap to add because the hard part (the structured, accumulating record of what the company knows) is in place.
Signal adapters are the write-side: small, single-purpose modules structured to pull from each external platform and deposit structured records into the substrate. Four are in the repo today — Spotify, Resident Advisor, Meta Pixel, and a manual ticket-count adapter — with some wired live and others pending credential provisioning. Ticketing, email, calendar, and booking adapters are scoped into the grant-funded build. All adapters are append-only by design — a new day's Spotify snapshot never overwrites yesterday's; both are retained, which lets the harness reason over trends as well as current state.
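The append-only contract is enforceable in a few lines. This is an illustrative sketch of the write-side discipline: the directory layout mirrors the `clients/<slug>/signal/<platform>/` convention that appears in the artist-record listing above, but the date-stamped filename scheme is an assumption, not the system's actual one:

```python
import json
from datetime import date
from pathlib import Path

def write_snapshot(substrate_root, artist_slug, platform, payload, day=None):
    """Deposit a date-stamped snapshot and refuse to overwrite an existing one.

    A new day's snapshot lands beside yesterday's; both are retained, so a
    reader can reason over trends as well as current state.
    """
    day = day or date.today()
    out_dir = Path(substrate_root) / "clients" / artist_slug / "signal" / platform
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / f"{day.isoformat()}.json"
    if target.exists():
        # append-only: yesterday's record is never clobbered
        raise FileExistsError(f"snapshot for {day} already recorded: {target}")
    target.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return target
```

A trend query then becomes a sorted directory listing followed by a read of whichever date range the question needs.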
Three specific failure modes this architecture addresses:
Hallucinated or ungrounded facts. A chat session asked about a deal term reasons over whatever was pasted or uploaded into that session. The uploaded copy has no link back to a canonical location and no version history inside the session — the session cannot tell whether a later version exists, and the operator has to manage that discipline manually. The harness reads the canonical file directly from the company repository, quotes the specific clause, and cites the path. The evidence is the version currently in the record, the record's history is inspectable via git, and the citation lets the operator verify the model didn't paraphrase.
Shallow context across sessions. ChatGPT Memory and Projects offer some persistence, but the shape is limited: Memory stores a list of user-level notes, Projects hold a flat file shelf — useful, but not a schema, not typed, and not structured to record why a particular decision was made. The harness reads from a structured, append-only company log — positions the company has taken, exceptions, prior decisions and the reasoning attached to them — before answering. Today's answer is consistent with last week's decisions not because the model remembers, but because it reads the record that remembers, and that record is versioned in git with a commit for every change.
Cross-artist data bleed. This is the production-grade concern that distinguishes a pilot tool from a company platform. In the substrate, each artist's canonical record lives in its own directory, and the resolver will only load paths inside the artist the harness is currently operating on. Role-based access at the view layer — so that, for example, Sophia's agent sees Sophia's deals but not Lovefoxy's — is a load-bearing deliverable of the grant-funded Workstream 1/2 build rather than a property of the system as it stands today; today the isolation lives at the resolver layer and in operator discipline. The grant funds the production-grade enforcement.
Three additional accuracy properties fall out of harness design:
The system is explicitly not:
The company's claim is narrower and more defensible than "AI will transform artist management." It is: for a two-operator management company servicing a small developmental roster around a live breakout campaign, an agentic harness running a frontier reasoning model on a structured, artist-canonical substrate is designed to produce a qualitatively different shape of operational output than the same two operators working from inboxes, spreadsheets, and memory — more consistent across decisions, more grounded in the record, easier to audit — and the grant-funded build exists to both harden that shape into production form and measure its actual effect against Sophia's live campaign.