Galtea

Galtea Documentation Audit

Galtea ships polished concept docs, a real OpenAPI spec, an llms.txt, and a Langfuse integration that genuinely works — but the SDK surface area is documented inconsistently across three core pages, several promised links 404, navigation forks into two parallel sidebars depending on which top-tab you clicked, and a key product-model contradiction sits between two adjacent concept pages.

1. Quickstart and SDK Usage page demo two different evaluation APIs (critical)

Location: /quickstart vs /sdk/usage vs /sdk/integrations/github-actions

Problem: The Quickstart teaches a two-step pattern: galtea.inference_results.generate(session=..., agent=my_agent, input=...) followed by galtea.evaluations.create(session_id=..., metrics=[...]). The SDK Usage page and the GitHub Actions page instead teach a single combined call: galtea.inference_results.create_and_evaluate(session_id=..., output=..., metrics=[...]). No page explains when to use which, whether one is deprecated, or how generate(agent=...) relates to create_and_evaluate(output=...).

Consequence: A developer (or coding agent) following the Quickstart writes code one way; following the SDK Usage page or copying the GitHub Actions template, they write it another way. Agents in particular will pick whichever page they last grepped and silently produce code that diverges from the team's actual conventions. There is no canonical answer in the docs.

The fix: Pick one canonical end-to-end flow and use it everywhere. If both inference_results.generate(...) + evaluations.create(...) and inference_results.create_and_evaluate(...) are supported, add a section to /sdk/usage titled "Two execution patterns" that explicitly contrasts them, lists when to use each, and links from the Quickstart, GitHub Actions page, and Agent Skill page to it.
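
Sketch: To make the contrast concrete, the two patterns can be laid side by side. The example below does not use the real SDK: Recorder is a hypothetical stand-in that only records which method was called with which keyword names, so the snippet runs without credentials. The method names and keyword names are the ones quoted from the pages above; the Session tuple, the agent lambda, and the "accuracy" metric name are illustrative assumptions.

```python
from collections import namedtuple


class Recorder:
    """Hypothetical stand-in for the SDK client: records calls instead of sending them."""

    def __init__(self, path="galtea", log=None):
        self._path = path
        self.log = [] if log is None else log

    def __getattr__(self, name):
        # Called only for attributes that are not set in __init__.
        if name.startswith("_"):
            raise AttributeError(name)
        return Recorder(f"{self._path}.{name}", self.log)

    def __call__(self, **kwargs):
        # Store the dotted method path and the sorted keyword names.
        self.log.append((self._path, sorted(kwargs)))


Session = namedtuple("Session", "id")  # hypothetical session object


def two_step(galtea, session, agent, user_input, metrics):
    """Quickstart pattern: generate() receives the agent, evaluation is a second call."""
    galtea.inference_results.generate(session=session, agent=agent, input=user_input)
    galtea.evaluations.create(session_id=session.id, metrics=metrics)


def combined(galtea, session, output, metrics):
    """SDK Usage / GitHub Actions pattern: the caller supplies the output itself."""
    galtea.inference_results.create_and_evaluate(
        session_id=session.id, output=output, metrics=metrics
    )


client = Recorder()
session = Session(id="session-1")
two_step(client, session, agent=lambda text: text.upper(), user_input="hi", metrics=["accuracy"])
combined(client, session, output="HI", metrics=["accuracy"])
for path, kwarg_names in client.log:
    print(path, kwarg_names)
```

The visible difference is who produces the output: the first pattern hands the agent to the SDK, the second expects the caller to have run the agent already. A "Two execution patterns" section would need to state that distinction explicitly.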

2. Product page documents Capabilities/Inabilities/Security Boundaries as required, Specification page calls them legacy (critical)

Location: /concepts/product vs /concepts/product/specification

Problem: /concepts/product lists Capabilities (Text, required), Inabilities (Text, required), and Security Boundaries (Text, required) as required Product Properties. /concepts/product/specification states: "Specifications replace the legacy free-text fields (Capabilities, Inabilities, Policies) on Products with structured, individually testable expectations linked to specific metrics." The two pages do not even agree on the name of the third field: the Product page calls it Security Boundaries, the Specification page calls it Policies.

Consequence: A developer reading the Product page believes they must provide three text fields to create a product. A developer reading the Specification page believes those fields are legacy and replaced. Both can't be true. The OpenAPI spec shows POST /products/generate-config and a products route, but the SDK Usage page says "Product registration isn't supported via the SDK" — so the only way to find out which page is correct is to use the dashboard and observe the form.

The fix: Resolve which fields are still required on Product creation today, then either (a) remove the legacy-field properties from the Product page and link to Specifications, or (b) keep them but mark them clearly as "Legacy — retained for backward compatibility; use Specifications instead" and downgrade required to optional.

3. Three documented tutorial / API-reference landing URLs 404 (significant)

Location: /tutorials/tracing, /tutorials/specification-driven-evaluations, /api-reference/introduction

Problem: Three URLs return 404. The 404 pages' own "did you mean" suggestions (Specification-Driven Evaluations, Run Test-Based Evaluations, Tracing Tutorial, Endpoint Connection Service) indicate that real pages exist at different canonical slugs. The "REST API" top-nav tab also has no proper landing page — /api-reference/introduction 404s and /api-reference only renders the Health check endpoint, not an index.

Consequence: Anyone navigating from the top nav, from search results, or from the 404 page's own recommendations hits a dead URL. Agents that follow cross-references will hit a 404 and either bail or invent a path.

The fix: Add an /api-reference (or /api-reference/introduction) landing page with an actual overview. Either restore /tutorials/tracing and /tutorials/specification-driven-evaluations at those slugs or add redirects from them to the canonical tutorial paths.
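
Sketch: A check like the following could run in CI to catch dead slugs before they ship. It is a minimal sketch using only the standard library; the base URL in the demo is a placeholder, and the status lookup is injectable so the logic can be exercised offline. Because urlopen follows redirects, a slug that redirects to a live canonical page counts as healthy, which matches the redirect option in the fix. Some servers reject HEAD requests, in which case the request method would need to change.

```python
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen


def http_status(url, timeout=10.0):
    """Return the final HTTP status for url, or None if the host is unreachable."""
    try:
        with urlopen(Request(url, method="HEAD"), timeout=timeout) as response:
            return response.status
    except HTTPError as err:  # must come before URLError, its parent class
        return err.code
    except URLError:
        return None


def dead_links(base_url, paths, status_of=http_status):
    """Return (path, status) pairs for every path that does not answer 200."""
    dead = []
    for path in paths:
        status = status_of(base_url.rstrip("/") + path)
        if status != 200:
            dead.append((path, status))
    return dead


# Offline demo with canned statuses; swap in the real docs host and drop status_of to go live.
canned = {
    "https://docs.example.com/tutorials/tracing": 404,
    "https://docs.example.com/quickstart": 200,
}
print(dead_links("https://docs.example.com/", ["/tutorials/tracing", "/quickstart"], status_of=canned.get))
```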

4. `/sdk/galtea-client` linked from SDK sidebar returns 404 (significant)

Location: /sdk/galtea-client

Problem: The SDK sidebar lists "Galtea Client" as a top-level page, but the URL 404s. The 404 page suggests "Galtea Client / Get minimum SDK version / Product" as candidates, indicating the page exists somewhere with a different slug — or was removed without updating the sidebar.

Consequence: A new SDK user clicking the most prominent SDK navigation entry lands on a 404. This is the kind of break that signals "the docs aren't maintained" within the first minute of a developer's evaluation.

The fix: Either restore the page at /sdk/galtea-client or remove the sidebar entry. If the canonical page is elsewhere, add a redirect.

5. Auth API reference slug doesn't match the endpoint path (significant)

Location: /api-reference/auth/get-current-user vs /api-reference/auth/get-user

Problem: The API page documents GET /auth/user, but the docs slug for that page is /api-reference/auth/get-current-user. The literal endpoint path — /api-reference/auth/get-user — 404s, and the 404 page's suggestions ("Langfuse Integration", "AgentInput", "User Groups Service") aren't even close to what the user was looking for.

Consequence: Anyone typing the endpoint name into the URL bar, copying it from the API page header, or instructing an agent to "look up the auth/user endpoint docs" lands on a 404. If this divergence exists for auth/user, the same pattern probably exists for other endpoints — invisible until a developer hits one cold.

The fix: Make the docs slug match the URL a reader would derive from the endpoint (/api-reference/auth/get-user), or add a redirect from that URL to the canonical doc slug. Audit every API reference slug for this divergence.
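
Sketch: The slug audit can be mechanised once a naming convention is chosen. The example below assumes the convention implied by the Location line above (HTTP method, lower-cased, prefixed to the last path segment, giving /api-reference/auth/get-user for GET /auth/user). That convention is an assumption, not something the docs state, and path parameters such as {id} are not handled.

```python
def expected_slug(method, path, prefix="/api-reference"):
    """Build the guessable slug, e.g. GET /auth/user -> /api-reference/auth/get-user."""
    parts = [segment for segment in path.strip("/").split("/") if segment]
    if not parts:
        raise ValueError(f"cannot derive a slug from path {path!r}")
    *parents, leaf = parts
    return "/".join([prefix, *parents, f"{method.lower()}-{leaf}"])


def slug_mismatches(endpoints, prefix="/api-reference"):
    """endpoints: iterable of (method, path, actual_slug). Returns the divergent ones."""
    mismatches = []
    for method, path, actual in endpoints:
        expected = expected_slug(method, path, prefix)
        if actual != expected:
            mismatches.append((method, path, actual, expected))
    return mismatches


documented = [("GET", "/auth/user", "/api-reference/auth/get-current-user")]
for method, path, actual, expected in slug_mismatches(documented):
    print(f"{method} {path}: docs slug {actual}, expected {expected}")
```

Each reported pair is a candidate for either a rename or a redirect entry.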

6. CLI installation page hides three of four install methods (significant)

Location: /cli/installation

Problem: The page advertises four install methods — Homebrew, Debian/Ubuntu, Fedora/RHEL, and Python — as tabs. In the static HTML extracted by the scraper, only the Homebrew commands actually render; the headings for the other three tabs appear but their command bodies do not. Meanwhile the 2026-05-04 changelog entry explicitly says: "Install it via apt, dnf, or pip, then authenticate with galtea login..."

Consequence: Users without JavaScript (including most LLM doc crawlers, agents indexing the page, and anyone reading the .md mirror) only see brew install galtea-ai/tap/galtea. apt/dnf/pip users — i.e. the majority of Linux developers — have no install instructions despite the changelog promising them.

The fix: Render all four tab bodies in the static HTML (Mintlify's <Tabs> should do this by default; check the source). At minimum, list the apt/dnf/pip commands in plain text below the tabs so they appear in the .md mirror and llms.txt-derived content.
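
Sketch: A regression check on the .md mirror could confirm that every advertised install method has a command in the static text. The patterns below are assumptions about what the commands look like (only the Homebrew command is quoted in this audit), so they would need adjusting to the real apt, dnf, and pip invocations.

```python
import re

# Assumed command shapes per install method; only the brew form is confirmed above.
INSTALL_PATTERNS = {
    "Homebrew": r"\bbrew\s+install\b",
    "Debian/Ubuntu": r"\bapt(?:-get)?\s+install\b",
    "Fedora/RHEL": r"\bdnf\s+install\b",
    "Python": r"\bpip3?\s+install\b",
}


def missing_install_methods(page_text):
    """Return the install methods whose command does not appear in page_text."""
    return [
        method
        for method, pattern in INSTALL_PATTERNS.items()
        if not re.search(pattern, page_text)
    ]


static_text = "Homebrew\nbrew install galtea-ai/tap/galtea\nDebian/Ubuntu\nFedora/RHEL\nPython\n"
print(missing_install_methods(static_text))
```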

7. GitHub Actions template wires env vars in YAML that the Python script never reads (significant)

Location: /sdk/integrations/github-actions

Problem: The workflow YAML sets five env vars: GALTEA_API_KEY, GALTEA_PRODUCT_ID, GALTEA_TEST_NAME, GALTEA_ACCURACY_METRIC_NAME, GALTEA_COMPLETENESS_METRIC_NAME. The accompanying evaluate.py script reads none of them: it hard-codes api_key="YOUR_API_KEY", imports an undefined _test_helpers.create_test_product, never references os.environ, and ignores GALTEA_TEST_NAME and the metric-name env vars entirely. It also notes "SDK doesn't expose products.create" while the OpenAPI spec lists POST /products.

Consequence: A developer who copy-pastes the template gets a failing CI job (missing _test_helpers module, placeholder API key) and has no way to inject the secret that the workflow took the trouble to plumb through. This is exactly the "copy-paste code completeness" failure mode that breaks agent-driven setup.

The fix: Replace api_key="YOUR_API_KEY" with api_key=os.environ["GALTEA_API_KEY"]; replace create_test_product with a call the SDK actually supports (the page itself says products.create is not exposed, so reading GALTEA_PRODUCT_ID from the environment is the obvious candidate), or document where _test_helpers lives; and either consume the GALTEA_TEST_NAME and metric-name env vars in the script or remove them from the YAML.
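
Sketch: The environment-handling half of the fix needs no SDK calls at all. The helper below is a hypothetical addition to evaluate.py (the name load_settings is not from the docs); it reads the five variables the workflow already sets and fails fast with a readable message when one is missing.

```python
import os

# The five variables the workflow YAML sets.
REQUIRED_VARS = (
    "GALTEA_API_KEY",
    "GALTEA_PRODUCT_ID",
    "GALTEA_TEST_NAME",
    "GALTEA_ACCURACY_METRIC_NAME",
    "GALTEA_COMPLETENESS_METRIC_NAME",
)


def load_settings(env=None):
    """Return the required settings as a dict, or raise if any are unset or empty."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing required environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}


# Demo with an explicit mapping; in CI, call load_settings() with no argument.
demo_env = {name: f"value-for-{name.lower()}" for name in REQUIRED_VARS}
settings = load_settings(demo_env)
print(sorted(settings))
```

The script would then pass settings["GALTEA_API_KEY"] wherever the template currently hard-codes the placeholder key, and use the product ID, test name, and metric names in place of its literals.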

8. Metric `source` values are inconsistent across pages and conflict with the documented `Evaluation Type` enum (significant)

Location: /sdk/usage, /concepts/metric/evaluation-types, /concepts/metric

Problem: The SDK Usage page passes source="partial_prompt" for an AI-evaluated metric and source="self_hosted" for a custom metric. The Evaluation Types page passes source="human_evaluation". The Metric concept page documents the field as Evaluation Type — Enum required — How outputs are scored: AI Evaluation, Human Evaluation, or Self-Hosted — with no mention of partial_prompt and using human-readable values instead of the snake_case strings the SDK actually requires.

Consequence: A developer reading the Metric concept page will try evaluation_type="AI Evaluation" and get an error. A developer reading the SDK Usage page learns source="partial_prompt" but never learns what other values are accepted. The mapping between "AI Evaluation" (concept) and partial_prompt (SDK string) is undocumented. Agents will hallucinate values.

The fix: Add a single canonical table to /concepts/metric/evaluation-types mapping every concept name to its exact SDK string (AI Evaluation → partial_prompt, Self-Hosted → self_hosted, Human Evaluation → human_evaluation). Reference that table from the Metric properties page and the SDK Usage page.
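
Sketch: The proposed table is small enough to express as a lookup. The three pairs below are the ones named in this audit; whether the SDK accepts further source values is unknown, so the mapping should be read as a starting point rather than the full enum.

```python
# Concept-page name -> SDK source string, as observed across the three pages.
SOURCE_BY_CONCEPT = {
    "AI Evaluation": "partial_prompt",
    "Human Evaluation": "human_evaluation",
    "Self-Hosted": "self_hosted",
}


def sdk_source(concept_name):
    """Translate a concept-page Evaluation Type into the string passed as source=."""
    try:
        return SOURCE_BY_CONCEPT[concept_name]
    except KeyError:
        accepted = ", ".join(sorted(SOURCE_BY_CONCEPT))
        raise ValueError(f"Unknown evaluation type {concept_name!r}; expected one of: {accepted}") from None


for concept, source in SOURCE_BY_CONCEPT.items():
    print(f"{concept:<17} source={source!r}")
```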

9. `CustomScoreEvaluationMetric.measure` example is truncated — only shows the failure branch (significant)

Location: /sdk/usage

Problem: The MyKeywordMetric.measure example shows if actual_output is None: return 0.0 and then stops. The docstring promises "Returns 1.0 if 'expected' is in actual_output, else 0.0" but the body that implements that promise is not in the snippet.

Consequence: A developer copy-pasting this class gets a function that falls off the end on the happy path — returning None instead of a float. Depending on downstream validation, that will either be rejected as a non-numeric score or quietly recorded as missing data; either way the metric does not behave as the docstring promises. Self-hosted metrics are the primary mechanism for deterministic custom scoring, so this is the example a developer is most likely to lift verbatim. Agents will copy the broken stub and produce silently-failing evaluations.

The fix: Complete the snippet: return 1.0 if "expected" in actual_output.lower() else 0.0 (or whatever the canonical match logic is), and add a pytest-style sanity check below.
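
Sketch: A completed version with its sanity check might look like the following. To stay runnable without the SDK installed, the class here is a plain Python class; in the docs it would subclass CustomScoreEvaluationMetric. The measure signature (a keyword actual_output plus ignored extras) is an assumption, since the truncated snippet does not show it, and the case-insensitive match follows the one-liner suggested in the fix.

```python
class MyKeywordMetric:
    """Returns 1.0 if 'expected' is in actual_output, else 0.0."""

    def measure(self, actual_output=None, **kwargs):
        # Failure branch shown in the docs: no output means a score of zero.
        if actual_output is None:
            return 0.0
        # Happy path missing from the docs: always return a float.
        return 1.0 if "expected" in actual_output.lower() else 0.0


def test_my_keyword_metric():
    metric = MyKeywordMetric()
    assert metric.measure(actual_output="Exactly as Expected") == 1.0
    assert metric.measure(actual_output="something else") == 0.0
    assert metric.measure(actual_output=None) == 0.0
    assert isinstance(metric.measure(actual_output=""), float)


test_my_keyword_metric()
```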

10. Evaluation Properties section is empty under its own heading (significant)

Location: /concepts/product/version/session/evaluation

Problem: The page renders the heading "Evaluation Properties" with no body — it jumps straight from the lifecycle diagram to "Result Properties". The evaluations resource is documented in the OpenAPI spec with CRUD plus a dozen specialized routes (Retry failed, Replay onto a new metric revision, Claim/Release/Submit human eval, etc.), so there is no shortage of properties to document.

Consequence: A developer using the concept page as the source-of-truth for the Evaluation entity sees only the post-run Result fields and has no view of the inputs (which metric, which session, which inference result, retry policy, who created it). They have to bounce to the API reference and reconstruct the shape themselves.

The fix: Populate the section with the Evaluation entity's creation-time fields (metric_id, session_id, inference_result_id, type, created_by, etc.) parallel to how Sessions and Inference Results are documented one level up.

11. Top-tab navigation forks into two parallel sidebars with no warning (significant)

Location: /sdk/integrations/agent-skill (Guides tab) vs /sdk/installation (default tab)

Problem: The Agent Skill page renders under a "Guides" top-tab whose left sidebar is Getting Started / Core Workflows / Production & Monitoring / Advanced / Integrations. The SDK installation page, which lives at a sibling /sdk/... URL, renders a completely different sidebar: Introduction / Overview / SDK / CLI / Concepts. The two share no entries.

Consequence: A developer who lands on either page sees only half the docs in their sidebar. Worse, the two sidebars don't cross-reference — readers don't know there's a second navigation universe until they happen to click a top-tab. Both sets of pages cite each other, so the most natural in-doc click teleports the reader to a different layout with different siblings.

The fix: Either unify the sidebar across top-tabs, or add a visible cue (banner, breadcrumb, "switch to Guides view" link) on every page that explains which view it lives in and how to get to the other. At minimum, surface the same set of "key pages" in both sidebars.

12. "How Everything Connects" diagram is missing — only labels render (minor)

Location: /concepts/overview

Problem: The introduction promises "For a diagram of how they all connect, start with the Concepts overview." The Concepts overview's "How Everything Connects" section then renders a bare list of relationship labels ("Defined by / Challenged by / Groups / Simulated as / Are a set of / Hold / ...") followed by a bare list of entity names. The dotted-line caption assumes a diagram is present.

Consequence: Anyone reading the docs through the .md mirror, in an agent context window, or with images disabled gets a meaningless word salad instead of the architectural map the docs explicitly send them to.

The fix: Inline a Mermaid (or static SVG with descriptive alt text) diagram so the relationships parse for non-visual consumers. The label list alone is worse than no diagram — it implies a broken render.

13. "Text Match (deprecated)" sits inline with live metrics in the available-metrics list (minor)

Location: /concepts/metric

Problem: The Available Metrics list mixes a deprecated entry — Text Match (deprecated) — Deterministic — Binary match using character-level fuzzy matching with a threshold. Use Text Similarity instead. — into the same flat list as every active metric. There is no separate "Deprecated" section, no strikethrough, no leading or trailing block grouping it apart.

Consequence: A developer scanning the list for a string-comparison metric is just as likely to pick Text Match as Text Similarity — the only signal is a parenthetical that's easy to miss. Agents extracting the metric catalog will surface Text Match as a valid option.

The fix: Move deprecated metrics to a separate "Deprecated metrics" section at the bottom of the page (or strike them through and visibly group them), and link the deprecation note to whatever replaces them.
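
Sketch: Until the page is restructured, consumers of the metric catalog can split the list themselves. The helper below keys off the literal "(deprecated)" marker that the page uses today; that marker is the only signal available, so the approach breaks if the wording changes.

```python
def split_deprecated(metric_names, marker="(deprecated)"):
    """Partition metric names into (active, deprecated) using the page's marker text."""
    active, deprecated = [], []
    for name in metric_names:
        bucket = deprecated if marker in name.lower() else active
        bucket.append(name)
    return active, deprecated


active, deprecated = split_deprecated(["Text Similarity", "Text Match (deprecated)"])
print("active:", active)
print("deprecated:", deprecated)
```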

14. Behavior test `test_variant` field has no documented enum (minor)

Location: /concepts/product/specification, /concepts/product/test

Problem: The Specification properties list test_variant — string — Variant of the test type. Applicable for ACCURACY and SECURITY test types. The Test concept page mentions "the SDK parameter variants is used to specify 'Threats' for Security tests" and a strategies parameter — but neither page enumerates the allowed values for either field. There is no table of accepted variants for ACCURACY, no list of Threats for SECURITY, and no examples.

Consequence: A developer (or agent) creating a Specification or generating a Security test by SDK has to guess the string. The platform's behavior for invalid values isn't documented either. Agents will hallucinate plausible-looking strings like "prompt_injection" and silently produce nothing.

The fix: Publish the full enum of test_variant values for ACCURACY and SECURITY test types, and the full list of strategies for Security tests, in the Specification properties table and again on the Test page.

15. `conversation_turns` parameter is Human-Evaluation-only with no parallel for AI metrics (minor)

Location: /concepts/metric/evaluation-parameters

Problem: The Parameter Reference table lists conversation_turns — Behavior tests (Human Evaluation only). This is the only parameter in the table with a "Human Evaluation only" qualifier, and the page doesn't explain why or what AI-evaluated Behavior metrics receive instead.

Consequence: A developer building an AI-judged Behavior metric who wants the model to see full turn-by-turn conversation history will reach for conversation_turns, discover that it is documented for Human Evaluation only, and find no documented alternative. The narrow scope of this parameter is buried in a single table cell.

The fix: Either lift conversation_turns into a callout explaining its constraint and pointing to the AI-evaluation equivalent (traces? full session passed implicitly?), or add a "Behavior test parameters for AI-evaluated metrics" subsection that contrasts the two paths.

16. `is_production` is documented as a parameter but described as not stored — easy to misread (minor)

Location: /concepts/product/version/session

Problem: The Session page lists Is Production — Boolean — A creation-time parameter... in the same property table as Custom ID, Status, Error, etc., then adds: "This is not a stored field on the session. Whether a session is a production session can be inferred at read time by checking whether test_case_id is null."

Consequence: A developer querying GET /sessions/{id} will try to read is_production from the response, find it absent, and either file a bug or assume the API is broken. The "not stored" caveat is buried in the field description rather than reflected in the field's typography or position.

The fix: Either lift is_production out of the property table into a separate "Creation-time parameters (not persisted)" section, or rename it (isProductionRequest) to signal it's input-only. At minimum, repeat the "not stored — infer from test_case_id == null" caveat in any place the response shape is documented.
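
Sketch: The read-time inference the Session page describes is a one-line check. The example assumes the session arrives as a dict parsed from the API response and that the key is spelled test_case_id as on the concept page; the actual JSON casing is not confirmed here. The key is required rather than defaulted, so a response that omits the field raises instead of being silently classified as production.

```python
def is_production_session(session):
    """Infer production status from a parsed session: no linked test case means production."""
    return session["test_case_id"] is None


print(is_production_session({"id": "s-1", "test_case_id": None}))
print(is_production_session({"id": "s-2", "test_case_id": "tc-42"}))
```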

17. OpenAPI service title doesn't match the public brand (minor)

Location: https://api.galtea.ai/openapi.json

Problem: The OpenAPI document's info.title is "Product Management Service API" with description "API documentation for Product Management Service". Every public-facing surface — docs, marketing site, SDK — uses "Galtea".

Consequence: Tools that auto-generate API clients from the spec (Postman, openapi-generator, agent skill builders) will label the resulting SDK "Product Management Service" rather than "Galtea". Tools that surface the spec title in dashboards or directory listings will not connect it to the Galtea brand.

The fix: Update info.title to "Galtea API" (or similar) and update the description. This is a one-line change in the spec source.
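
Sketch: A small guard in the spec's build or CI step would keep the title from drifting again. It operates on the parsed OpenAPI document; info.title is a standard OpenAPI field, while the brand default and the function name are illustrative.

```python
def title_mentions_brand(spec, brand="Galtea"):
    """Return True if the OpenAPI info.title contains the brand name (case-insensitive)."""
    title = spec.get("info", {}).get("title", "")
    return brand.lower() in title.lower()


current = {"info": {"title": "Product Management Service API"}}
proposed = {"info": {"title": "Galtea API"}}
print(title_mentions_brand(current), title_mentions_brand(proposed))
```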

What they do well

Real OpenAPI spec, real llms.txt, real .md mirrors of every doc page — agent ingestion is genuinely well-supported at the platform level, which makes the per-page inconsistencies above all the more solvable.
Concept model is unusually well-articulated — the Specification → Test → Session → Inference Result → Evaluation lineage is documented at a level most evaluation platforms don't reach.
Langfuse integration page is the gold standard for the rest of the docs to copy — single-import swap, observation-type mapping table, version constraint stated up front.

Top 3 recommendations

Pick one canonical SDK evaluation flow and rewrite the Quickstart, SDK Usage page, and GitHub Actions template to use it. Today those three pages teach two different patterns with no guidance on which to choose.
Resolve the Product-vs-Specification contradiction. Decide whether Capabilities/Inabilities/Security Boundaries are still required fields, and update both pages to agree.
Audit every cross-reference and slug for 404s. /sdk/galtea-client, /tutorials/tracing, /tutorials/specification-driven-evaluations, /api-reference/introduction, and the endpoint-path/doc-slug divergence on /auth/user all bite developers and agents identically — a docs site that links to its own 404s loses trust fast.
