Don't want to fix it yourself?

Check out Manicule.

Visit Manicule

Report/May 15

Benchspan

docs.benchspan.com

Manicule Score

0100

Pages read22

Critical3

Significant7

Minor4

Surfacedocs.benchspan.com

Verdict

“two pages, two cutoffs for the same security boolean — and 0.5 lands on opposite sides”

Share on X

Benchspan Documentation Audit

Tight surface area (18 URLs, one API endpoint, two SDKs), but the docs already contradict themselves on the security-boolean contract, the SDK silently fills in a required field, and one integration example calls a role that the upstream API doesn't accept.

1. Injection-threshold contradiction between API reference and concept page (critical)

Location: /api-reference/scan vs /concepts/how-it-works

Problem: Two pages give two different cutoffs for the same field. The scan reference says injection is "true if the score crosses our injection threshold (score ≥ 0.5)". The how-it-works page says score (0–1): confidence level; >0.5 triggers detection. At a score of exactly 0.5, one page returns injection: true and the other returns benign.

Consequence: A developer tuning a custom warn/block policy on top of score will pick a boundary based on whichever doc they read first, and their behavior at the threshold will silently disagree with Benchspan's verdict field. For a security product whose entire job is a thresholded boolean, this is a contract bug, not a wording nit. Agents reading both pages get conflicting truth and can't reconcile.

The fix: Pick one (≥ 0.5 matches the example response "score":0.9999,"injection":true and is the standard convention). Replace >0.5 on the how-it-works page, and add a one-line sentence to both pages that says "the threshold is closed at 0.5".

2. Long-input handling contradicts itself: truncated vs 413 rejected (critical)

Location: /api-reference/scan vs /api-reference/errors

Problem: The scan reference says input has a "max 32,000 characters; longer inputs truncated." The errors page says 413 Payload Too Large: Input exceeds 32,000 character limit per request. Split content into chunks or truncate client-side. These are mutually exclusive behaviors — the API either silently truncates a 40,000-char input or rejects it with 413. It cannot do both.

Consequence: A developer scanning a 50KB Gmail body or Drive doc has no way to know whether the latter half of their content was scanned, dropped, or never reached the classifier. For an injection firewall, "silently dropped" and "rejected with 413" have very different threat-model implications: with truncation, a poisoned payload past byte 32k slips through unscanned. With 413, it fails closed. Pick one.

The fix: State the actual behavior on both pages, character-identically. If the API truncates, remove the 413 entry and document the truncation point. If it 413s, remove "longer inputs truncated" from the scan page. Whichever you pick, also document the byte/char counting rule (UTF-8 bytes vs Unicode code points vs JS string length).

3. Python SDK silently defaults `role="user"` for an API-required field (critical)

Location: /sdks/python vs /api-reference/scan and /concepts/roles

Problem: The scan endpoint documents role as required: role ("user" | "tool", required): Origin of text; tool-origin content follows dedicated classifier path. The Python SDK signature is scan(input, role="user", source=None) — role has a silent default. A developer who scans tool output but forgets the kwarg (guard.scan(email_body)) will route the call through the user-origin classifier path. The roles page explicitly says "the classifier treats tool-origin content as the dominant attack vector, trained specifically on injection patterns from scraped pages, emails, and documents."

Consequence: Tool output gets classified by the weaker, less specialized path with no error, no warning, and no dashboard signal. This is a silent misclassification on a security product — the exact failure mode the SDKs claim to prevent with "fail closed" defaults. Worse, it's invisible: scans still succeed, scores still come back, the developer never learns the wrong classifier ran.

The fix: Either make role required in the Python signature (raise TypeError on missing kwarg, matching the API contract), or add a prominent warning on /sdks/python and /concepts/roles that the default is user and call out the misclassification risk for tool output. The TypeScript SDK takes role inside ScanOptions with no documented default and should be clarified the same way.

4. API-key example length doesn't match the documented key length (significant)

Location: /api-reference/authentication

Problem: The page states: "Total length: 40 characters, Random component: 32 hexadecimal characters." The worked example immediately below is ag_live_1a2b3c4d5e6f7890abcdef1234567890ab. The prefix ag_live_ is 8 characters; the random component (1a2b3c4d5e6f7890abcdef1234567890ab) is 34 hex characters, making the total 42. The docs and the example disagree by two characters.

Consequence: Any developer writing key-format validation (regex, length check, secret scanner) by reading this page will reject real production keys, or accept keys that the server rejects, depending on which figure they trust. CI secret-scanners and pre-commit hooks built from the doc string will misfire. For a Bearer-token product, the canonical key shape is part of the wire contract.

The fix: Decide whether keys are 40 chars (32 hex random) or 42 chars (34 hex random) and fix the side that's wrong. Add a one-line regex (^ag_live_[0-9a-f]{32}$ or the right variant) to the page so machines can validate without prose-parsing.

5. Anthropic integration tells users to send a role that doesn't exist in the Anthropic API (significant)

Location: /integrations/anthropic

Problem: The Anthropic integration page instructs: "When Claude uses tools, outputs flow back as tool_result blocks. Include these in subsequent message arrays with role: 'tool' so Benchspan scans them." Anthropic's Messages API does not have a tool role — tool_result content blocks are nested inside user-role messages. A literal reading of this guidance produces a 400 invalid_request_error from Anthropic before Benchspan ever sees the payload.

Consequence: A developer who copies this pattern hits an upstream 400 and assumes Benchspan's middleware is broken, or worse, refactors their working Anthropic integration to "match the docs" and breaks it. For an integration page on a product that markets itself as a drop-in scanner, the upstream API shape has to be correct.

The fix: Rewrite the tool-handling sentence to match Anthropic's actual content-block shape: tool outputs return as tool_result blocks inside user messages, and Benchspan classifies the block content as role: "tool" internally (separate from the Anthropic message role). Show one correct end-to-end Anthropic + Benchspan example so the distinction is unambiguous.

6. "How it works" nav link 404s (significant)

Location: Docs landing page nav → /how-it-works

Problem: The landing page surfaces "How it works" as a top-level navigation section, but /how-it-works returns 404. The real page lives at /concepts/how-it-works. Similarly, /api-reference returns 404 (real entry is /api-reference/overview), and /introduction, /pricing, /changelog are all 404.

Consequence: Anyone typing the obvious URL, following a stale external link, or pasting a guessed path lands on 404s. Crawlers and search engines following the rendered nav will see broken canonical paths. For agents, missing index pages mean /api-reference can't be used as a directory link in summaries — they have to know /api-reference/overview is the entry point.

The fix: Either alias /how-it-works → /concepts/how-it-works (and /api-reference → /api-reference/overview) with 301s, or restructure the URL tree so the nav and the URL match. Pick one, not both.

7. Self-hosted deployment named in the Python SDK, present-but-unexplained in TS, no page either way (significant)

Location: /sdks/python and /sdks/typescript

Problem: The Python SDK explicitly frames its api_url constructor parameter as "Override for self-hosted deployments." The TypeScript SDK exposes the same option (apiUrl) with no documented purpose at all — it's just a constructor field. The sitemap has zero pages on self-hosted deployment: no install guide, no Docker image, no Kubernetes manifests, no licensing or eligibility note, no listing in the integrations section.

Consequence: Python users are told a feature exists with no way to actually use it. TypeScript users see a knob with no explanation of what it's for. Both outcomes generate the same support tickets, and enterprise buyers reading the Python page will ask sales about a product surface that has no documented existence.

The fix: If self-hosting is real, add /deployment/self-hosted with image coordinates, license terms, telemetry/key-management posture, and how to point either SDK at it. If it's only for staging/proxy use, remove the "self-hosted" wording from the Python page and add a one-line scope statement to both SDK pages explaining what apiUrl/api_url is actually for (regional endpoint, proxy, test stub).

8. Latency numbers in docs are ~7× the numbers on the marketing site (significant)

Location: /concepts/how-it-works, /api-reference/overview, /concepts/modes vs benchspan.com

Problem: Docs repeatedly say "sub-100 ms scan latency for typical tool outputs" and "typical latency is under 100 ms for inputs up to ~2,000 tokens." The marketing homepage says "average latency of 14ms with P99 at 42ms." Both are framed as observed performance, not theoretical ceilings.

Consequence: A developer evaluating Benchspan for a latency-sensitive path (voice agents, real-time chat — explicitly called out on /concepts/modes) needs to know whether their per-turn budget is 14ms or 100ms. The 7× gap changes whether block mode is viable inline.

The fix: Decide which figure is the ceiling and which is the typical, and use both consistently. Suggested: docs surface P50 ~14ms, P99 ~42ms, ceiling <100ms for ≤2,000 tokens with one canonical sentence reused on the three pages that currently say "sub-100ms."

9. SDK parity gap: Python `scan` throws, TypeScript `scan` doesn't (significant)

Location: /sdks/python vs /sdks/typescript

Problem: The TypeScript SDK exposes two clearly distinct methods: scan(input, options?) "evaluates text without throwing" and scanOrThrow(input, options?) "throws InjectionDetectedError when verdict is block." The Python SDK lists only scan(input, role, source) plus the note "InjectionDetectedError: Raised on block." There is no scan_or_throw and no documented way in Python to get a non-throwing scan when mode="block" is set on the constructor.

Consequence: Developers writing cross-language services will reasonably assume guard.scan(...) behaves the same way in both SDKs and write Python try/except where they wrote a TS conditional, or vice versa. Worse: a developer who wants a non-throwing "just give me the score" call in Python has no documented method to use — they have to construct a separate BenchGuard instance in warn mode just to avoid the exception.

The fix: Either ship scan_or_throw/scan_no_throw in Python and document the symmetry, or rename the TS methods to align with whichever Python ships. Add a one-line "throwing vs non-throwing semantics across SDKs" matrix to both SDK pages.

10. Rate limits documented only as a monthly quota; no per-second/per-minute cap (significant)

Location: /api-reference/errors, /api-reference/authentication

Problem: Both pages mention 429 Too Many Requests and reference a monthly quota ("Free tier allows 50,000 scans monthly"), but neither documents a per-second, per-minute, or burst rate limit. The errors page acknowledges a Retry-After header on 429 but never says what triggers one within the monthly budget.

Consequence: An agent processing a 1k-email Gmail backlog at 200 req/s has no documented way to know whether to throttle, what concurrency is safe, or whether the 50,000-scans/month math applies linearly. The first production load test discovers the limit by hitting it. Inline middleware-style integration (where every tool call hits the API) makes this especially load-sensitive.

The fix: Publish concrete numbers: requests/second per key, requests/minute per organization, concurrent-connection cap, and any per-IP throttling. Add them as a table on the errors page and a header on the auth page.

11. Mode precedence between constructor and per-request parameter is undocumented (minor)

Location: /sdks/python, /sdks/typescript, /api-reference/scan

Problem: mode can be set on the SDK constructor (BenchGuard(..., mode="block")) and also as a per-request body field on POST /v1/scan (mode ("block" | "warn", optional)). No page documents which wins when both are set, or whether the SDK forwards the constructor value as a per-request override.

Consequence: A team running a mixed deployment — block in production, warn for a specific evaluation crew — has no contract for how to compose the two settings. The first time someone sets mode="warn" per-call on a block-configured guard, the resulting behavior is an experiment, not a spec.

The fix: State the precedence rule explicitly on /api-reference/scan and both SDK pages: e.g., "per-request mode overrides constructor mode," then mirror that with an example on the modes concept page.

12. Classifier accuracy numbers live on the marketing site, not in the docs (minor)

Location: Docs (absent) vs benchspan.com homepage

Problem: The homepage advertises specific benchmark results: 99.9% catch rate on AgentDojo, 94% catch rate on InjecAgent, 0.19% false-alarm rate on production-like traffic. None of these numbers appear in /docs.benchspan.com. The model_version field returns classifier-v3, which implies v1 and v2 existed and were measured differently — but there's no per-version benchmark table.

Consequence: A security engineer evaluating Benchspan for compliance or red-team purposes has to cross-reference the marketing site to find the only quantitative claims about the product's accuracy, then has no way to know which model_version those numbers apply to. When classifier-v4 ships, customers on v3 have no documented baseline.

The fix: Add /concepts/accuracy (or a section on how-it-works) with the AgentDojo / InjecAgent / FPR numbers, pinned to a specific model_version, plus a note on how the numbers update across classifier versions.

13. No changelog despite versioned classifier in API responses (minor)

Location: /changelog (404) and /api-reference/scan

Problem: Every scan response includes model_version (e.g., classifier-v3). The sitemap has no changelog page, and /changelog returns 404. There's no record of when v2→v3 happened, what changed, or whether old API keys still target an old model.

Consequence: A customer who saw a score change for the same input between two weeks has no public reference to explain it. Auditors and red teams can't pin findings to a model revision. Agents summarizing "what's new in Benchspan" have nothing to cite.

The fix: Add /changelog with dated entries per classifier-v{N} bump, SDK release, and API behavior change. Link model_version in the scan-response docs to the corresponding changelog anchor.

14. No OpenAPI/machine-readable spec for the REST API (minor)

Location: /api-reference/* (and absent from /sitemap.xml)

Problem: The REST API has one endpoint (POST /v1/scan) with five request fields, six response fields, and five status codes. None of it is exposed as OpenAPI/JSON Schema — there is no /openapi.json, no Swagger UI, no machine-readable schema file anywhere in the sitemap. All parameter types, enums (role, mode, verdict), and constraints (32k char cap, score range, classifier version format) live only in prose.

Consequence: Coding agents (Cursor, Claude Code, Copilot) can't programmatically validate a generated request body. Codegen tools (openapi-generator, oazapfts, etc.) can't produce a typed client. Postman/Insomnia users have to hand-rebuild the collection. Less severe than a contradiction, but a missed nicety for an agent-first product.

The fix: Publish an OpenAPI 3.1 spec at /openapi.json (or /api-reference/openapi.json) covering the scan endpoint, both role enums, both mode enums, all three verdict values, and the 400/401/413/429/5xx error envelope. Link it from /llms-full.txt so agents discover it.

What they do well

Both llms.txt and llms-full.txt are present and serve real, non-trivial content — agents can index without scraping the rendered nav.
Roles model (user, tool, plus the explicit trust boundary excluding system and assistant) is clearly stated in /concepts/roles and consistent across SDK and integration pages.
Failure-mode posture is named explicitly ("SDKs default to failing closed") rather than left as folklore, which is rare for security middleware.

Top 3 recommendations

Fix the three correctness contradictions before anything else: threshold ≥0.5 vs >0.5, long-input truncation vs 413, and the 40-vs-42-character API key example. These are wire-contract bugs, not polish.
Either make Python's role required (matching the API contract) or loudly document the "user" default — a silent misclassification path on a security product is the worst kind of footgun.
Rewrite the Anthropic integration's tool-handling guidance to match Anthropic's actual tool_result-inside-user shape; a copy-paste from this page currently produces an upstream 400.

Code Verification

Runtime snippet checks

Completed

Total

PASS

FIXED

SKIP

FAIL

Failing pages

https://api.benchspan.com/v1/scan'`
https://docs.benchspan.com/quickstart
https://docs.benchspan.com/api-reference/overview
https://docs.benchspan.com/api-reference/scan

Summary

Verified 81 snippets across 17 pages on docs.benchspan.com using the user-supplied Benchspan API key. Install commands (pip/npm) resolved successfully against PyPI/npm, JSON sample payloads parse cleanly, and the Python/TypeScript SDK constructors instantiate without contacting the API. Every snippet that actually invokes POST https://api.benchspan.com/v1/scan (Python SDK quickstart, TypeScript SDK quickstart, the three API-reference scan examples — curl, httpx, fetch — and the overview curl example) fails: the endpoint returns HTTP 500 Internal Server Error (plaintext body, not JSON) for every call made with the supplied key, while requests with invalid keys return 401 Invalid or revoked API key and malformed bodies return 422 Unprocessable Entity. Authentication is therefore working, but the documented happy path on /v1/scan (which the docs say should return HTTP 200 with a JSON verdict) is broken. The 6 FAILs are all symptoms of that single upstream service fault rather than independent doc bugs. The 53 SKIPs are partial fragments (e.g. references to undefined guard, llm, messages, gmail.get_email, mailClient, userInput, type/interface declarations) or framework integrations that need third-party credentials (OpenAI, Anthropic, Google ADK) that were not provided.

Required credentials

BENCHSPAN_API_KEY (provided by user; prefix ag_live_…) — used for /v1/scan, Python BenchGuard, and TypeScript BenchGuard calls.
OPENAI_API_KEY — not provided; required to exercise /integrations/openai, /integrations/openai-agents, and /integrations/vercel-ai example flows end-to-end.
ANTHROPIC_API_KEY — not provided; required to exercise /integrations/anthropic and the langchain-anthropic example on /integrations/langchain.
Google ADK / Gemini credentials — not provided; required to exercise /integrations/google-adk.
LLM provider credentials for CrewAI — not provided; required to exercise /integrations/crewai.

Pages

https://docs.benchspan.com/quickstart

#	Lang	Status	Notes
1	bash	PASS	`pip install benchspan` resolved and installed `benchspan-0.3.0` from PyPI.
2	bash	PASS	`npm install @benchspan/sdk` resolved and installed from npm.
3	python	FAIL	`guard.scan("Ignore previous instructions…", role="tool")` raised `httpx.HTTPStatusError: Server error '500 Internal Server Error' for url 'https://api.benchspan.com/v1/scan'`. Diagnosis: upstream Benchspan scan service returns 500 for valid requests with the supplied valid API key (a bad key returns 401, so auth is fine). Suggested follow-up (not applied): Benchspan operators should investigate why `/v1/scan` 500s; until then the documented Python quickstart cannot succeed.
4	typescript	FAIL	`await guard.scan(...)` threw `Error: BenchGuard API error: 500 Internal Server Error`. Same root cause as #3.

https://docs.benchspan.com/api-reference/overview

#	Lang	Status	Notes
1	bash	FAIL	`curl -X POST .../v1/scan` with valid `Authorization: Bearer ag_live_…` returned `HTTP/1.1 500 Internal Server Error` and a plaintext body (`Internal Server Error`), not the documented JSON verdict. Reproduced on 3 sequential retries. Suggested follow-up (not applied): fix the `/v1/scan` upstream so it returns 200 + JSON as the docs promise.
2	json	PASS	Sample response body parses as valid JSON (`json.loads` succeeded).

https://docs.benchspan.com/api-reference/authentication

#	Lang	Status	Notes
1	text	SKIP	`Authorization: Bearer ag_live_<secret>` is a header template / partial fragment, not standalone runnable code.
2	text	SKIP	Example key `ag_live_1a2b3c4d…` is a format illustration, not runnable code.

https://docs.benchspan.com/api-reference/scan

#	Lang	Status	Notes
1	bash	FAIL	`curl -X POST .../v1/scan` with `source`+`agent` fields returned `HTTP 500 Internal Server Error`. Same upstream issue as overview curl.
2	python	FAIL	`httpx.post(...)` returned 500; subsequent calls hit `httpx.ReadTimeout`. `r.raise_for_status()` would raise per the snippet's own contract. Same root cause.
3	typescript	FAIL	`fetch(...)` returned 500 with plaintext body; `await r.json()` then threw `SyntaxError: Unexpected token 'I', "Internal S"... is not valid JSON`. Same upstream issue. Note: the docs separately claim 4xx/5xx responses include a JSON `detail` field, but this 500 returns plaintext.
4	json	PASS	Sample response body parses as valid JSON.

https://docs.benchspan.com/api-reference/errors

#	Lang	Status	Notes
1	json	PASS	`{"detail":"Human-readable description of what went wrong"}` parses as valid JSON.

https://docs.benchspan.com/sdks/python

#	Lang	Status	Notes
1	bash	PASS	`pip install benchspan` (re-verified) resolves and installs.
2	python	PASS	`BenchGuard(api_key=…, agent=…, mode="block", api_url=…)` constructed successfully (no API call on construction).
3	python	SKIP	`result = guard.scan("some text", role="tool", source="gmail.get_email")` — partial fragment; relies on a `guard` from a prior snippet and on a meaningful tool string. Would also hit the upstream 500 if combined.
4	python	SKIP	`await guard.scan_async(...)` — partial fragment; requires an async runner and a defined `guard`.
5	python	SKIP	`@guard.wrap` over `client.chat.completions.create` — partial fragment; requires OpenAI SDK + `OPENAI_API_KEY` (not provided) and a `messages` variable.
6	python	SKIP	`@guard.wrap_async` variant — partial fragment, same constraints.
7	python	SKIP	`llm.invoke(messages, config={"callbacks": [guard]})` / `Crew(callbacks=[guard])` — partial fragment; `llm`, `messages`, `agents`, `tasks` undefined.
8	python	SKIP	`Agent(name="...", hooks=guard.as_agent_hooks())` — partial fragment; requires `openai-agents` and an OpenAI key.
9	python	SKIP	`LlmAgent(...before_model_callback=guard.as_adk_callback())` — partial fragment; requires `google-adk` and Gemini credentials.
10	python	SKIP	`@dataclass class ScanResult` — type definition shown for reference; not a runnable example.
11	python	SKIP	`class InjectionDetectedError(Exception)` — type definition shown for reference.
12	python	SKIP	`try: call_llm(messages) except InjectionDetectedError` — partial fragment; `call_llm`, `messages` undefined.
13	python	PASS	`logging.getLogger("benchspan").setLevel(logging.WARNING)` executed without error.

https://docs.benchspan.com/sdks/typescript

#	Lang	Status	Notes
1	bash	PASS	`npm install @benchspan/sdk` (re-verified) resolves.
2	typescript	PASS	`new BenchGuard({ apiKey, agent, mode, apiUrl })` constructed (no API call on construction).
3	typescript	SKIP	`await guard.scan("some text", { role: "user" })` — partial fragment; relies on a `guard` from prior snippet. Would also hit the upstream 500 if combined.
4	typescript	SKIP	`await guard.scanOrThrow(toolOutput, { role: "tool" })` — partial fragment; `toolOutput` and `InjectionDetectedError` import not in the snippet.
5	typescript	SKIP	`guard.wrapCall(messages, () => client.chat.completions.create(...))` — partial fragment; requires `openai` and `OPENAI_API_KEY`, plus a `messages` variable.
6	typescript	SKIP	`{ role: ..., content: ..., name?: ... }` — bare object literal, not runnable.
7	typescript	SKIP	`interface ScanResult` — TypeScript type declaration, not executable JS.
8	typescript	SKIP	`interface BenchGuardConfig` — TypeScript type declaration.
9	typescript	SKIP	`interface ScanOptions` — TypeScript type declaration.
10	typescript	SKIP	`class InjectionDetectedError extends Error` — TypeScript type declaration.

https://docs.benchspan.com/concepts/how-it-works

No executable code snippets on this page (Mermaid architecture diagram only). 0 snippets evaluated.

https://docs.benchspan.com/concepts/modes

#	Lang	Status	Notes
1	python	SKIP	block-mode snippet — partial fragment; `llm` and `messages` undefined.
2	typescript	SKIP	block-mode snippet — partial fragment; `toolOutput` undefined.
3	python	SKIP	warn-mode snippet — partial fragment; `llm`/`messages` undefined.
4	typescript	SKIP	warn-mode snippet — partial fragment; `llm`/`prompt` undefined.

https://docs.benchspan.com/concepts/roles

#	Lang	Status	Notes
1	python	SKIP	`guard.scan("Please cancel my subscription", role="user")` etc. — partial fragment; `guard` and `gmail.get_email` undefined.
2	python	SKIP	`guard.scan(..., source=...)` — partial fragment.
3	python	SKIP	`BenchGuard(api_key=..., agent=...)` — partial fragment (no observable behaviour without follow-up scan).
4	typescript	SKIP	`await guard.scan("Please cancel my subscription", { role: "user" })` etc. — partial fragment; `gmail.getEmail` undefined.
5	typescript	SKIP	TS source-tag scan — partial fragment.
6	typescript	SKIP	TS `new BenchGuard({ ..., agent: ... })` — partial fragment.

https://docs.benchspan.com/integrations/langchain

#	Lang	Status	Notes
1	bash	PASS	`pip install benchspan langchain-anthropic` — both packages available on PyPI and resolvable.
2	python	SKIP	Full LangChain ChatAnthropic example — partial fragment; needs `ANTHROPIC_API_KEY` (not provided) and an `email_body` value.
3	python	SKIP	`try: llm.invoke(...) except InjectionDetectedError` — partial fragment.
4	bash	PASS	`npm install @benchspan/sdk @langchain/anthropic` — both packages available.
5	typescript	SKIP	LangChain JS example — partial fragment; needs `ANTHROPIC_API_KEY` and a `messages` value.
6	python	SKIP	CrewAI `Crew(callbacks=[guard])` snippet — partial fragment; `agents`/`tasks` undefined.

https://docs.benchspan.com/integrations/crewai

#	Lang	Status	Notes
1	bash	PASS	`pip install benchspan crewai` — both packages available on PyPI.
2	python	SKIP	Full CrewAI example — partial fragment; `web_search_tool`, `document_reader` undefined, and `Crew.kickoff()` would require an LLM provider key (not provided).

https://docs.benchspan.com/integrations/openai-agents

#	Lang	Status	Notes
1	bash	PASS	`pip install benchspan openai-agents` — packages available.
2	python	SKIP	Full OpenAI Agents example — partial fragment; `mail_client` undefined and would need `OPENAI_API_KEY` (not provided).
3	bash	PASS	`npm install @benchspan/sdk @openai/agents` — packages available.
4	typescript	SKIP	Full OpenAI Agents JS example — partial fragment; `mailClient` undefined and would need `OPENAI_API_KEY`.
5	python	SKIP	"Benign flow / attack flow" snippet — comment-only / pseudocode fragment.

https://docs.benchspan.com/integrations/vercel-ai

#	Lang	Status	Notes
1	bash	PASS	`npm install @benchspan/sdk ai @ai-sdk/openai` — packages available.
2	typescript	SKIP	`wrapLanguageModel` + `generateText` example — partial fragment; needs `OPENAI_API_KEY` and a `userInput`; usable only inside an HTTP handler (uses `Response.json`).
3	typescript	SKIP	`streamText` snippet — partial fragment.
4	typescript	SKIP	`tools: { read_email: tool(...) }` — partial fragment; `mailClient`, `z` undefined.

https://docs.benchspan.com/integrations/google-adk

#	Lang	Status	Notes
1	bash	PASS	`pip install benchspan google-adk` — packages available.
2	python	SKIP	`LlmAgent(..., before_model_callback=guard.as_adk_callback())` — partial fragment; needs Google Gemini credentials (not provided).
3	bash	PASS	`npm install @benchspan/sdk @google/adk` — packages available.
4	typescript	SKIP	TS Google ADK example — partial fragment; needs Google credentials.
5	python	SKIP	`try: await agent.run_async(...) except InjectionDetectedError` — partial fragment.

https://docs.benchspan.com/integrations/openai

#	Lang	Status	Notes
1	bash	PASS	`pip install benchspan openai` — packages available.
2	bash	PASS	`npm install @benchspan/sdk openai` — packages available.
3	python	SKIP	`@guard.wrap def call_llm(messages)` — partial fragment; `messages` undefined and would need `OPENAI_API_KEY`.
4	typescript	SKIP	`guard.wrapCall(messages, () => client.chat.completions.create(...))` — partial fragment; same constraints.
5	typescript	SKIP	Sample `messages` array literal — data structure, not runnable.
6	python	SKIP	Manual `guard.scan(user_input, role="user")` + chat completion — partial fragment; `user_input`, `client` undefined.
7	typescript	SKIP	TS manual scan equivalent — partial fragment.

https://docs.benchspan.com/integrations/anthropic

#	Lang	Status	Notes
1	bash	PASS	`pip install benchspan anthropic` — packages available.
2	bash	PASS	`npm install @benchspan/sdk @anthropic-ai/sdk` — packages available.
3	python	SKIP	`@guard.wrap` over `client.messages.create` — partial fragment; `messages` undefined and would need `ANTHROPIC_API_KEY`.
4	typescript	SKIP	TS Anthropic example — partial fragment; same constraints.
5	python	SKIP	`for block in response.content: if block.type == "tool_use"…` — partial fragment; `response`, `run_tool` undefined.
6	python	SKIP	`guard.scan(tool_output, role="tool", source=block.name)` — partial fragment.

Target history

Prior reports

Loading history.

Benchspan Documentation Audit

1. Injection-threshold contradiction between API reference and concept page (critical)

Location: /api-reference/scan vs /concepts/how-it-works

2. Long-input handling contradicts itself: truncated vs 413 rejected (critical)

Location: /api-reference/scan vs /api-reference/errors

3. Python SDK silently defaults `role="user"` for an API-required field (critical)

Location: /sdks/python vs /api-reference/scan and /concepts/roles

4. API-key example length doesn't match the documented key length (significant)

Location: /api-reference/authentication

5. Anthropic integration tells users to send a role that doesn't exist in the Anthropic API (significant)

Location: /integrations/anthropic

6. "How it works" nav link 404s (significant)

Location: Docs landing page nav → /how-it-works

7. Self-hosted deployment named in the Python SDK, present-but-unexplained in TS, no page either way (significant)

Location: /sdks/python and /sdks/typescript

8. Latency numbers in docs are ~7× the numbers on the marketing site (significant)

Location: /concepts/how-it-works, /api-reference/overview, /concepts/modes vs benchspan.com

9. SDK parity gap: Python `scan` throws, TypeScript `scan` doesn't (significant)

Location: /sdks/python vs /sdks/typescript

10. Rate limits documented only as a monthly quota; no per-second/per-minute cap (significant)

Location: /api-reference/errors, /api-reference/authentication

11. Mode precedence between constructor and per-request parameter is undocumented (minor)

Location: /sdks/python, /sdks/typescript, /api-reference/scan

12. Classifier accuracy numbers live on the marketing site, not in the docs (minor)

Location: Docs (absent) vs benchspan.com homepage

13. No changelog despite versioned classifier in API responses (minor)

Location: /changelog (404) and /api-reference/scan

14. No OpenAPI/machine-readable spec for the REST API (minor)

Location: /api-reference/* (and absent from /sitemap.xml)

What they do well

Both llms.txt and llms-full.txt are present and serve real, non-trivial content — agents can index without scraping the rendered nav.
Roles model (user, tool, plus the explicit trust boundary excluding system and assistant) is clearly stated in /concepts/roles and consistent across SDK and integration pages.
Failure-mode posture is named explicitly ("SDKs default to failing closed") rather than left as folklore, which is rare for security middleware.

Top 3 recommendations

Fix the three correctness contradictions before anything else: threshold ≥0.5 vs >0.5, long-input truncation vs 413, and the 40-vs-42-character API key example. These are wire-contract bugs, not polish.
Either make Python's role required (matching the API contract) or loudly document the "user" default — a silent misclassification path on a security product is the worst kind of footgun.
Rewrite the Anthropic integration's tool-handling guidance to match Anthropic's actual tool_result-inside-user shape; a copy-paste from this page currently produces an upstream 400.

Check out Manicule.

Benchspan

Benchspan Documentation Audit

1. Injection-threshold contradiction between API reference and concept page (critical)

2. Long-input handling contradicts itself: truncated vs 413 rejected (critical)

3. Python SDK silently defaults role="user" for an API-required field (critical)

4. API-key example length doesn't match the documented key length (significant)

5. Anthropic integration tells users to send a role that doesn't exist in the Anthropic API (significant)

6. "How it works" nav link 404s (significant)

7. Self-hosted deployment named in the Python SDK, present-but-unexplained in TS, no page either way (significant)

8. Latency numbers in docs are ~7× the numbers on the marketing site (significant)

9. SDK parity gap: Python scan throws, TypeScript scan doesn't (significant)

10. Rate limits documented only as a monthly quota; no per-second/per-minute cap (significant)

11. Mode precedence between constructor and per-request parameter is undocumented (minor)

12. Classifier accuracy numbers live on the marketing site, not in the docs (minor)

13. No changelog despite versioned classifier in API responses (minor)

14. No OpenAPI/machine-readable spec for the REST API (minor)

What they do well

Top 3 recommendations

Runtime snippet checks

Summary

Required credentials

Pages

Prior reports

Sources

Check out Manicule.

Benchspan

Benchspan Documentation Audit

1. Injection-threshold contradiction between API reference and concept page (critical)

2. Long-input handling contradicts itself: truncated vs 413 rejected (critical)

3. Python SDK silently defaults role="user" for an API-required field (critical)

4. API-key example length doesn't match the documented key length (significant)

5. Anthropic integration tells users to send a role that doesn't exist in the Anthropic API (significant)

6. "How it works" nav link 404s (significant)

7. Self-hosted deployment named in the Python SDK, present-but-unexplained in TS, no page either way (significant)

8. Latency numbers in docs are ~7× the numbers on the marketing site (significant)

9. SDK parity gap: Python scan throws, TypeScript scan doesn't (significant)

10. Rate limits documented only as a monthly quota; no per-second/per-minute cap (significant)

11. Mode precedence between constructor and per-request parameter is undocumented (minor)

12. Classifier accuracy numbers live on the marketing site, not in the docs (minor)

13. No changelog despite versioned classifier in API responses (minor)

14. No OpenAPI/machine-readable spec for the REST API (minor)

What they do well

Top 3 recommendations

Runtime snippet checks

Summary

Required credentials

Pages

Prior reports

Sources

3. Python SDK silently defaults `role="user"` for an API-required field (critical)

9. SDK parity gap: Python `scan` throws, TypeScript `scan` doesn't (significant)

3. Python SDK silently defaults `role="user"` for an API-required field (critical)

9. SDK parity gap: Python `scan` throws, TypeScript `scan` doesn't (significant)