
Your MCP server is not an API adapter

MCP servers can cache, pre-aggregate, and transform data independently of upstream services. Here's how and why.

A lot of MCP servers I see in the wild look like this:

@mcp.tool()
async def get_thing(id: str):
    async with httpx.AsyncClient() as client:
        resp = await client.get(f"https://api.example.com/things/{id}")
    return resp.json()

Fetch, forward, done. A thin HTTP proxy with a JSON Schema wrapper. For some use cases, that’s enough.

The servers I keep coming back to do something different. They hold state and pre-compute answers. An agent hitting a thin wrapper might need three round trips and 30 seconds. The same agent hitting a server that does real work gets its answer in one call, under a millisecond.

Preloaded in-memory index

Here’s a failure mode I run into constantly: the agent needs to find something but doesn’t know the exact ID. Most APIs only support exact lookups. No ID, no result. The conversation dead-ends with “I couldn’t find that resource” and the user gives up.

I built a server that wraps a CDN management API. Hundreds of properties, and the agent regularly needs to find which one handles a given hostname. The API has a search endpoint, but it’s slow, requires exact matches, and sometimes returns 403 depending on account permissions.

So the server loads every property into memory at startup:

@dataclass
class PropertyIndex:
    _entries: list[PropertyEntry] = field(default_factory=list)
    _name_index: dict[str, int] = field(default_factory=dict)

    async def load(self, refresh_interval: int = 300):
        self._refresh_interval = refresh_interval
        await self._build_index()
        self._refresh_task = asyncio.create_task(self._refresh_loop())

    def search_by_name(self, query: str, limit: int = 50):
        names = [e.property_name for e in self._entries]
        matches = process.extract(
            query, names, scorer=fuzz.WRatio,
            limit=limit, score_cutoff=50,
        )
        return [self._entries[idx].to_dict() for _, score, idx in matches]

The index builds once at startup by fanning out parallel API calls, deduplicates the results, and refreshes every five minutes in the background. Lookups take under a millisecond.
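That build-and-refresh cycle looks roughly like this. A self-contained sketch, not the real implementation: the fetch coroutine, the dedupe key, and the class shape are all simplified stand-ins.

```python
import asyncio

class RefreshingIndex:
    def __init__(self, fetch, refresh_interval=300):
        self._fetch = fetch                    # coroutine returning all properties
        self._refresh_interval = refresh_interval
        self._entries = []
        self._refresh_task = None

    async def load(self):
        await self._build_index()
        self._refresh_task = asyncio.create_task(self._refresh_loop())

    async def _build_index(self):
        # Build into a fresh list, then swap: readers never see a half-built index.
        raw = await self._fetch()
        seen, entries = set(), []
        for e in raw:                          # dedupe on name, first one wins
            if e["name"] not in seen:
                seen.add(e["name"])
                entries.append(e)
        self._entries = entries

    async def _refresh_loop(self):
        while True:
            await asyncio.sleep(self._refresh_interval)
            try:
                await self._build_index()
            except Exception:
                pass                           # keep serving stale data, retry next tick
```

The atomic list swap is the important part: a lookup that arrives mid-refresh sees either the old index or the new one, never a partial build.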

Without this, the agent guesses at exact property names, picks the wrong one, retries, burns three turns. With the index, someone types “the CDN config for checkout” and gets the right answer first try. That’s the kind of difference that decides whether people keep using the agent or go back to doing it manually.

I did the same thing for a CI/CD server. The API lets you fetch a build config by ID, but there’s no fuzzy search. If you don’t know the ID, you’re stuck. The server caches all build configurations at startup, runs fuzzy matching against them. The agent says “find the deploy job for the payments service” and gets a ranked list instantly, even though the CI system itself can’t do that.

Embedded analytical database

I have another server that sits in front of a relational database. Some tables have 20 million rows. The agent needs to answer analytical questions, things like “which providers have the highest volume in this region?” or “show me the top performers for a given category.”

The database wasn’t designed for these queries. It was built for a web UI with narrow, well-indexed lookups. The agent’s access patterns are different: it asks broad analytical questions that require joins across tables the application never joins. Adding indexes wasn’t an option either, because the database is owned by another team and optimizing it for an AI agent’s query patterns wasn’t on anyone’s roadmap. Some of these queries took 10 to 30 seconds on a read replica, and in an agent loop where that latency gets multiplied by however many tool calls the agent needs, the conversation times out before it gets anywhere.

The server embeds DuckDB in-process and loads pre-aggregated views and lookup tables at startup. Some are straight copies of small reference tables. Others are materialized summaries that flatten joins the source database was never designed to run efficiently, the kind of cross-table aggregations that make sense for an analytical question but would be expensive on a schema built for transactional web UI lookups:

class DuckDBCache:
    async def start(self):
        self._conn = duckdb.connect(":memory:")
        for key, config in fast_configs.items():
            await self._load_table(key, config)
        self._ready = True
        self._deferred_task = asyncio.create_task(
            self._load_deferred(deferred_configs)
        )
        self._refresh_task = asyncio.create_task(self._refresh_loop())

Each table has a fingerprint query (a cheap COUNT(*) or checksum) that the refresh loop checks before doing a full reload. Large tables load in the background after the server is already taking requests. If something asks for a table that hasn’t loaded yet, it falls back to the source database.
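A minimal version of that fingerprint check, written against a generic DB-API connection; the config shape and method names here are my invention, not the server's actual code:

```python
class TableRefresher:
    def __init__(self, conn, tables):
        self.conn = conn
        # tables: {name: {"fingerprint_sql": ..., "load_sql": ...}}
        self.tables = tables
        self._fingerprints = {}

    def refresh_table(self, name):
        cfg = self.tables[name]
        # Cheap probe first: a COUNT(*) or checksum against the source.
        fp = self.conn.execute(cfg["fingerprint_sql"]).fetchone()[0]
        if self._fingerprints.get(name) == fp:
            return False                        # unchanged: skip the expensive reload
        self.conn.execute(cfg["load_sql"])      # full reload only when data moved
        self._fingerprints[name] = fp
        return True
```

The point is asymmetry: the fingerprint query costs almost nothing, so the refresh loop can run it often and pay for a full reload only when the source actually changed.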

The 30-second query now takes under a millisecond. The agent can actually have a back-and-forth with the user instead of timing out after the first question.

There’s a query-result cache on top of this too. It has a prewarm manifest, basically a list of common queries that run at startup so the first person to use the agent on Monday morning doesn’t sit through a cold start.

class QueryCache:
    async def get_or_compute(self, cache_key, compute_fn, ttl=None):
        cached = self.get(cache_key)
        if cached is not None:
            return cached

        result = await compute_fn()
        if "error" not in result:
            self._put(cache_key, result, ttl or self._default_ttl)
        return result

It skips caching error responses. If a query fails because the database is temporarily overloaded, you don’t want that failure served for the next hour. That one took a production outage to figure out.
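The prewarm pass is just the same cache path run eagerly at startup. A sketch, assuming the get_or_compute interface above; the manifest shape is an assumption:

```python
import asyncio

async def prewarm(cache, manifest):
    # manifest: list of (cache_key, compute_fn) pairs for the common queries.
    for key, compute in manifest:
        try:
            await cache.get_or_compute(key, compute)
        except Exception:
            pass  # a failed prewarm entry shouldn't block server startup
```

Because prewarming goes through get_or_compute, it inherits the error-skipping behavior for free: a query that fails at startup stays uncached and gets retried on first real use.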

Data transformation

Every server I build strips the upstream API response before returning it. Token usage scales with response size, and most APIs return 10x more data than the agent will ever look at.

One API I work with returns objects with 60+ fields. The server keeps maybe 8:

def _slim_record(r: dict):
    return _strip_nulls({
        "id": r.get("id"),
        "name": r.get("name"),
        "total_value": _cents_to_major(r.get("total_value_cents")),
        "annual_value": _cents_to_major(r.get("annual_value_cents")),
        "start_date": r.get("start_date"),
        "end_date": r.get("end_date"),
        "status": _effective_status(r),
    })

_cents_to_major converts cents to dollars, because the raw API stores monetary values in cents. Before I added this conversion, every report the agent generated had wrong numbers: each dollar amount was off by a factor of 100, so a $2,000 contract showed up as $200,000 because the agent treated cents as dollars. No amount of prompt engineering fixed it reliably. Moving the conversion into the server did.

_effective_status is the other one worth mentioning. The API’s status field can say “active” on a record that ended three months ago. The platform’s own UI derives the real status from multiple fields, so the MCP server does the same:

def _effective_status(r: dict) -> str:
    stage = r.get("stage")
    if stage in ("terminated", "not_renewed"):
        return "inactive"

    if r.get("end_date_not_applicable") or r.get("renewal_type") == "perpetual":
        return r.get("status", "undetermined")

    end_date = r.get("end_date")
    if end_date:
        if date.fromisoformat(end_date) < date.today():
            return "inactive"

    return r.get("status", "undetermined")

Now the agent gives the same answer a human would get looking at the UI. Stripping nulls across a list of 50 records also saves a few thousand tokens per response, which adds up.

A log aggregation server I built does something similar: auto-appends | json auto to queries that don’t have a field extraction operator, truncates raw log lines to 500 characters, converts epoch-millisecond timestamps to ISO 8601. Small fixes that add up to the agent not wasting turns fighting the format.
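Those transformations can be sketched like this; the operator list and function names are illustrative, not the log platform's real syntax:

```python
from datetime import datetime, timezone

FIELD_EXTRACTION_OPS = ("| json", "| logfmt", "| regexp")  # illustrative list

def normalize_query(q: str) -> str:
    # Auto-append a field-extraction stage when the query has none,
    # so results come back structured instead of as raw strings.
    if not any(op in q for op in FIELD_EXTRACTION_OPS):
        q = f"{q} | json auto"
    return q

def normalize_line(line: str, epoch_ms: int, max_len: int = 500) -> dict:
    return {
        # Truncate so one noisy log line can't flood the context window.
        "line": line[:max_len],
        # Epoch milliseconds -> ISO 8601, which models parse reliably.
        "ts": datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc).isoformat(),
    }
```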

Download once, serve from cache

Some data is expensive to fetch. PDF documents. Code bundles in tgz archives. The pattern: download on first access, extract the text, build a line offset index, serve everything from memory after that.

class CachedFile:
    def __init__(self, content: str):
        self.content = content
        # Record the starting offset of every line for O(1) range reads.
        offsets = [0]
        pos = 0
        while True:
            pos = content.find("\n", pos)
            if pos == -1:
                break
            pos += 1
            if pos < len(content):
                offsets.append(pos)
        self._offsets = offsets

    def _line_end_offset(self, line: int) -> int:
        # A line ends where the next line starts, or at end of file.
        if line < len(self._offsets):
            return self._offsets[line]
        return len(self.content)

    def get_lines(self, start: int, end: int) -> str:
        # 1-based, inclusive line range.
        return self.content[self._offsets[start - 1] : self._line_end_offset(end)]

I use this for CDN edge function code bundles and PDF documents (extracted with PyMuPDF). After the first download, the agent reads by line range, searches with regex, lists the file tree. No repeat downloads. Reading through a 200-page document becomes “just read” instead of “download, extract, read” on every question.
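The regex search is equally simple once everything is in memory. A sketch; the function name and hit limit are my own:

```python
import re

def search_lines(content: str, pattern: str, max_hits: int = 20):
    # Return (line_number, line) pairs, capped so one greedy pattern
    # can't dump the whole file into the agent's context.
    rx = re.compile(pattern)
    hits = []
    for n, line in enumerate(content.splitlines(), start=1):
        if rx.search(line):
            hits.append((n, line))
            if len(hits) >= max_hits:
                break
    return hits
```

Combined with the line-offset index, the agent's workflow becomes search, then read the matching range, all without touching the network again.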

When thin is fine

Not everything needs this treatment. A server that translates natural language to a query language and passes it to an API is fine as a thin wrapper. The translation is the value there. Same for simple lookup tools.

The question I ask: does the agent hit the same data twice? Does the API return more than the agent needs? Is the API response time slow enough that the agent loop feels broken? If yes, the server should be doing work.

The multiplier

When a person uses a web UI, they look at a page, think, click something else. One request at a time, processed by a human brain. An agent works differently. It makes five tool calls, stuffs all five responses into its context window, and reasons over them at once. A slow response gets multiplied by every call. A 60-field JSON blob gets multiplied by every call. It adds up fast.

I’ve measured the difference. CDN property lookups went from three agent turns to one once the fuzzy index was in place. Analytical queries went from timing out at 30 seconds to returning in under a millisecond from DuckDB. And every single dollar amount in every report was wrong until the server started converting cents for the agent.

You can try to fix that last one with prompt engineering. I tried for weeks. The agent still got it wrong often enough that I couldn’t trust the output.