Hatchet Documentation Audit
The Hatchet docs are dense and well-organized for a Mintlify-style developer platform — narrative concepts pages, a real llms.txt, a working SDK reference. But several platform-level cracks show up the moment you treat the site as a single corpus: a webhook example pointing at an undocumented domain, a SOC 2 posture undercut by 30-day audit log retention, duplicated URLs for the same concept pages, a fourth SDK that's promised on the home page and quietly absent in tab after tab, and copy-paste commands that ship literal UUIDs without marking them as placeholders.
1. Webhook example URL uses a different domain than the rest of the docs (critical)
Location: /v1/webhooks
Problem: Every Cloud reference in the docs uses cloud.hatchet.run (sign-up link, region availability page, quickstart). But the webhook documentation example URL is:
https://cloud.onhatchet.run/api/v1/stable/tenants/d60181b7-da6c-4d4c-92ec-8aa0fc74b3e5/webhooks/my-webhook
cloud.onhatchet.run is never introduced or explained anywhere else in the scraped content. There is no note clarifying whether this is a typo, an alternate domain, a legacy domain, or the real webhook ingress that differs from the dashboard domain. The GitHub cookbook ("Paste the Hatchet webhook URL") does not specify which domain, which means a developer copying the docs example will type the unfamiliar host into their upstream config.
Consequence: This is a silent production-failure mode. A developer wiring up Stripe or GitHub webhooks types the example domain (or copies the docs URL) and points the upstream at a hostname that either 404s or routes somewhere unexpected. Stripe and GitHub will retry against the bad host, eventually disable the endpoint, and drop events on the floor. The developer will be debugging the wrong end of the integration with no signal from the docs that there is a domain split at all.
The fix: If onhatchet.run is the real webhook ingress, add a section on the webhooks page (and the Stripe/GitHub/Slack cookbooks) explaining the domain split between dashboard and webhook ingestion. If it's a typo, change it to cloud.hatchet.run and grep the corpus for the same string.
2. 30-day audit log retention conflicts with the SOC 2 / HIPAA / GDPR posture (critical)
Location: /v1/security/audit-logs and /v1/security/index
Problem: The security index states: "Hatchet is SOC 2 Type II, HIPAA, and GDPR compliant." The audit-logs page then states: "Audit log entries are retained for 30 days. Entries older than 30 days are automatically removed." There is no documented export mechanism, no link to an audit-log API endpoint, and no guidance on how a customer is expected to retain audit evidence for the multi-year windows those frameworks expect.
Consequence: Regulated buyers will read these two pages back to back and reach opposite conclusions. Security teams will block adoption pending answers. Customers who do adopt may discover after an audit that they had no way to retrieve audit logs older than 30 days for evidence requests — a failure mode that surfaces only when it is most expensive.
The fix: Either document an audit-log export mechanism (API, S3 sink, SIEM forwarding) on the same page, or explicitly state "customers requiring long-term retention should export logs via X." Make the retention number a property of the Cloud plan, not an implicit hard limit on the product, and link to the Trust Center evidence for how the 30-day window is reconciled with the listed frameworks.
3. Duplicate URLs for core error-handling pages, plus a 404 in the same cluster (significant)
Location: /v1/retry-policies and /v1/error-handling/retry-policies (same pattern for /v1/timeouts vs /v1/error-handling/timeouts); also /v1/migration-guides
Problem: Both URLs return HTTP 200 for retry policies and timeouts (confirmed in the scraped status check: "/v1/error-handling/retry-policies": 200, "/v1/retry-policies": 200, "/v1/error-handling/timeouts": 200). The "From Celery to Hatchet" page links to the /v1/error-handling/... variants in its "Next steps" list, while other pages and the llms.txt-style canonical listing reference the flat /v1/retry-policies and /v1/timeouts paths. In the same path-cluster, https://docs.hatchet.run/v1/migration-guides returns 404 — a path that reads like an obvious internal-link target for a migration guide hub.
Consequence: Search engines and AI agents indexing the docs end up with two equally-weighted copies of the same page. Internal links split traffic, deep-linked bookmarks drift apart over time, and an agent ingesting llms.txt plus the migration guide will store two records for the same policy and may resolve to whichever it sees last. The /v1/migration-guides 404 means anything linking to that path (now or later) drops the developer on a dead page mid-migration.
The fix: Pick one canonical path per concept page, 301 the other, and update the "From Celery to Hatchet" next-steps list (and any other internal links) so every reference points at the canonical URL. Configure the canonical URL in Mintlify rather than hand-editing HTML. Either create a real /v1/migration-guides index or make sure nothing links to it.
4. Ruby is a first-class SDK on the home page but a placeholder across the user guide (significant)
Location: /v1 (home), /v1/tasks, /v1/timeouts, /v1/advanced-assignment/worker-affinity, /v1/inter-service-triggering, /v1/durable-event-waits, /v1/opentelemetry
Problem: The home page states: "It supports applications written in Python, Typescript, Go and Ruby." But across the user guide, the Ruby tab repeatedly appears as a bare #### Ruby heading with no content beneath it (visible in the scrape of tasks, timeouts, worker-affinity, and durable-event-waits). The OpenTelemetry page says outright: "OpenTelemetry support for the Ruby SDK is coming soon." The inter-service-triggering Ruby section is the only one with substantive prose, and that prose is a caveat about code duplication and type-safety risk, not example code.
Consequence: A Ruby developer who picks Hatchet on the strength of the home page lands on the first concept page and discovers they're a second-class citizen. Agents are worse — they will produce empty Ruby tabs, hallucinate Ruby snippets to fill the gap, or recommend Hatchet to Ruby teams that will hit feature parity walls in production (OTel, presumably others).
The fix: Either (a) qualify the home-page sentence ("Python, TypeScript, and Go, with Ruby in beta — see the Ruby SDK coverage matrix"), or (b) fill in the Ruby tabs with real code. A coverage matrix page listing per-feature SDK support (Python/TS/Go/Ruby × tasks/durable/OTel/streaming/…) would resolve this in one place.
5. Self-hosted docker-compose command hardcodes a tenant UUID with no "replace this" callout (significant)
Location: /self-hosting/docker-compose
Problem: The token-creation command reads literally:
docker compose run --no-deps setup-config /hatchet/hatchet-admin token create \
--config /hatchet/config --tenant-id 707d0855-80ab-4e1f-a156-f1c4546cbf52
The UUID 707d0855-80ab-4e1f-a156-f1c4546cbf52 is presented as part of the command with no surrounding text marking it as a placeholder, no <your-tenant-id> syntax, and no instruction on where to find your actual tenant ID.
Consequence: A human will eventually catch the mistake, but an AI coding agent extracting this snippet will run it verbatim and either error out or — worse, if that UUID happens to match a real tenant in a multi-tenant self-hosted install — issue a token against the wrong tenant. This is exactly the kind of copy-paste hazard agent-driven onboarding amplifies.
The fix: Replace the literal UUID with <your-tenant-id> (or $HATCHET_TENANT_ID) and add a one-liner above the snippet explaining where to obtain the tenant ID (dashboard URL, hatchet-admin tenant list, or wherever it lives).
6. Missed-schedule "no catch-up" behavior is buried in a side note (significant)
Location: /v1/scheduled-runs
Problem: The scheduled-runs page lists a one-liner under "Considerations": "Missed Schedules: If a scheduled task is missed (e.g., due to system downtime), Hatchet will not automatically run the missed instances when the service comes back online." This is a load-bearing semantic for a platform pitched on durability, and it is exactly the opposite of what Celery beat, cron, and most "durable" schedulers do by default. The migration guide ("From Celery to Hatchet") never highlights it.
Consequence: Teams migrating from Celery beat or system cron will assume catch-up runs by default and discover the difference only when an outage silently swallows a billing run, a digest, or a reconciliation job. Agents writing "migrate this cron to Hatchet" code will not surface the behavior change, because the docs treat it as a minor consideration rather than a contract.
The fix: Promote this into its own "Missed schedule semantics" subsection with a worked example, and add a callout to the Celery migration guide and the cron-runs page. If catch-up is configurable (or planned), say so; if it isn't, say that explicitly so callers can implement their own backstop.
7. No OpenAPI / machine-readable API spec surfaced from the docs (significant)
Location: /llms.txt, /reference/python/client, sitewide entry points
Problem: The llms.txt listing contains a sidebar of HTML pages — Client, Context, the cookbooks, self-hosting guides — and does not link to an OpenAPI document, a Swagger UI, or any machine-readable REST schema. The reference index exposes only Python Client and Context pages. The docs reference hatchet.runs.bulk_cancel, hatchet.runs.bulk_replay, and a REST API repeatedly (the bulk-retries page leads with "First, we'll start by fetching a task via the REST API"), but the actual REST endpoint, request shape, and auth contract are not surfaced in any parseable form from llms.txt or the reference sidebar.
Consequence: Coding agents (Claude Code, Cursor, Copilot) and OpenAPI-based tooling (openapi-generator, Stainless, Speakeasy, internal client generators) cannot discover Hatchet's REST surface from the documented entry points. Every integration becomes a prose-scraping exercise, and corrections to the API surface require manually editing the docs in the same patch.
The fix: Publish an OpenAPI 3.x document at a stable URL (e.g. /openapi.json), link it from llms.txt and the reference index, and either render it with a Swagger/Redoc page or use the Mintlify OpenAPI integration so the REST endpoints get first-class doc pages alongside the SDK reference.
8. 4 MB default payload size is documented only in a troubleshooting page (significant)
Location: /v1/troubleshooting/index ("Could not send task to worker")
Problem: The 4 MB default maximum payload size — a hard product limit that shapes how callers chunk inputs and outputs — appears only inside a troubleshooting bullet: "The payload is too large for the worker to accept or the Hatchet engine to send. The default maximum payload size is 4MB." There is no mention on /v1/tasks, no Limits page in the scraped sidebar, and the value is buried alongside the SERVER_GRPC_WORKER_STREAM_MAX_BACKLOG_SIZE self-host knob.
Consequence: Developers designing tasks (PDF pipelines, LLM context windows, file processing — all use cases Hatchet markets) don't learn the limit until a task fails in production. Agents asked "what's the max input size for a Hatchet task" will not find it on the Tasks page where it belongs.
The fix: Add an explicit Limits page (or a section on /v1/tasks) listing payload size, slot defaults, heartbeat windows, and audit retention in one place. Keep the troubleshooting bullet, but link it to the canonical limits documentation.
9. "Self-hosted quickstart" link target is ambiguous (minor)
Location: /v1/quickstart, /self-hosting/index
Problem: The Cloud quickstart says: "If you'd like to self-host Hatchet, please see the self-hosted quickstart instead" and links to /self-hosting. The /self-hosting index page then presents at least three on-ramps — the Hatchet CLI (wraps Hatchet Lite), Hatchet Lite directly, Docker Compose, and a Kubernetes Quickstart — without a clear "start here" arrow. The llms.txt listing surfaces both Hatchet Lite and Quickstart (under Kubernetes) as separate entries.
Consequence: A developer following the "self-hosted quickstart" promise lands on an index and has to make a deployment-strategy decision they weren't ready to make in order to follow the next step. Agents asked "how do I self-host Hatchet" will pick one of the four paths essentially at random.
The fix: Either (a) link the Cloud quickstart directly at the recommended self-hosted path (almost certainly the CLI / Hatchet Lite for first-run), or (b) add a short "If you're just trying Hatchet locally, start here; if you're deploying for real, start there" decision block at the top of /self-hosting/index.
10. Non-English characters slipped into the FAQ prose (minor)
Location: /v1/faq, the "Why am I seeing missed heartbeats" section
Problem: The bulleted reason reads: "Network disruption - the connection between the worker and the Hatchet engine is interrupted (DNS失败, firewall change, cloud network blip)." 失败 is the Chinese word for "failure" — almost certainly a translation tool artifact that was never reverted.
Consequence: Low impact, but it signals that the page was machine-translated or AI-edited and not proofread. For an English-speaking reader it looks unprofessional; for a non-English-speaking reader it raises questions about whether other sections were similarly tampered with.
The fix: Change to DNS failure. Grep the docs corpus for stray CJK characters while you're at it.
What they do well
- Real
llms.txtwith the full sidebar enumerated — most dev-tool sites still don't have one. - Reliability semantics are explicit and honest — "at least once," "tasks can run more than once, so your code should be idempotent," durable-task determinism warnings. Few queues are this candid.
- Tested-snippet promise on the home page — the docs claim inline code is generated from tested examples, which is the right model even where the Ruby tabs haven't caught up.
Top 3 recommendations
- Resolve the
cloud.onhatchet.runvscloud.hatchet.runsplit — either document the domain difference on the webhooks page and cookbooks, or fix the typo. This is the single highest-stakes ambiguity in the corpus. - Reconcile the audit-log story with the compliance posture — document an export path for SOC 2 / HIPAA / GDPR customers, or scope the 30-day retention to a specific plan.
- Publish an OpenAPI spec at a stable URL and link it from
llms.txtand the reference index. This is the biggest unlock for agent-driven adoption, and it would also force the REST surface (currently referenced casually in bulk-retries) into a single source of truth.