How to Add a New Metadata Provider¶
This guide explains how to wire a new REST/JSON metadata source into the Tuvima Library enrichment pipeline. For standard providers that return JSON from a public HTTP endpoint, no C# code is required — you drop a config file and restart.
Prerequisites¶
- Familiarity with the hydration pipeline (
docs/architecture/hydration-and-providers.md) - The target API's base URL, search endpoint, and response schema
- An API key if the provider requires one (stored in the config file; never committed to git)
Step 1 — Understand where providers fit¶
All providers run during Stage 1 (RetailIdentification) of the hydration pipeline. They gather cover art, descriptions, ratings, and bridge IDs (ISBN, ASIN, TMDB ID, etc.) that Stage 2 (WikidataBridge) later uses for precise QID resolution.
The ConfigDrivenAdapter is the single implementation that reads every config_driven
provider config at startup and builds a live HTTP client for each one. Your config file
teaches it how to talk to the new API.
Provider configs live in:
config/providers/
apple_api.json
google_books.json
open_library.json
tmdb.json
musicbrainz.json
metron.json
wikidata_reconciliation.json ↠Stage 2, not Stage 1
Step 2 — Create the config file¶
Create config/providers/my_provider.json. The complete schema is documented below
with every field explained. Use apple_api.json (books/audio) or tmdb.json (video)
as your template depending on the media domain.
Top-level identity fields¶
{
"name": "my_provider", // Internal key, matches filename stem
"version": "1.0",
"display_name": "My Provider", // Shown in the Dashboard Settings UI
"enabled": true,
"weight": 0.75, // Provider-level confidence weight [0.0–1.0]
"domain": "Ebook", // "Ebook" | "Video" | "Music" | "Comics" | "Universal"
"icon": "images/providers/my_provider.svg",
"capability_tags": ["title", "author", "cover", "description"],
"available_fields": ["title", "author", "year", "cover", "description", "isbn"],
"field_weights": {
"title": 0.75,
"author": 0.80,
"cover": 0.85,
"description": 0.80,
"isbn": 0.90,
"year": 0.75
}
}
weight scales every confidence value this provider emits before they enter the
Priority Cascade. A provider returning unreliable data should use 0.5–0.6; a
high-quality source like TMDB uses 0.8.
field_weights are per-field multipliers on top of the provider weight. Use higher
values for fields the API returns reliably (bridge IDs, structured dates) and lower
for fields that vary in quality (descriptions, genres).
HTTP and authentication¶
{
"endpoints": {
"api": "https://api.myprovider.com/v2"
},
"throttle_ms": 300, // Minimum milliseconds between requests
"max_concurrency": 2, // Parallel request cap
"cache_ttl_hours": 72, // null = no caching, integer = cache TTL
"hydration_stages": [1], // Stage 1 = RetailIdentification; never use 2 here
"adapter_type": "config_driven", // Always this value for JSON providers
"http_client": {
"timeout_seconds": 10,
"user_agent": "Tuvima Library/1.0",
"api_key": null, // Set to your actual key; keep this file gitignored
"api_key_delivery": "bearer", // "bearer" | "query_param" | "header" | null
"api_key_param_name": null // Query param name when delivery = "query_param"
},
"requires_api_key": false // Shown in Setup Wizard
}
If api_key_delivery is "bearer", the adapter sends Authorization: Bearer <key>.
If "query_param", it appends ?<api_key_param_name>=<key> to every URL.
If "header", set the header name in api_key_param_name.
Provider GUID — pick one and never change it¶
Every provider needs a stable UUID. It is a foreign key in the metadata_claims table
(provider_id column). Changing it orphans historical claim rows.
Generate a UUID with [System.Guid]::NewGuid() (PowerShell) or any UUID v4 generator.
The DatabaseConnection seeds a row into provider_registry on startup using this ID.
If you ever change the UUID, update provider_registry directly or the FK will break.
Media type scope¶
{
"can_handle": {
"media_types": ["Books", "Audiobooks"],
"entity_types": ["Work", "MediaAsset"]
}
}
Valid media_types: Books, Audiobooks, Movies, TV, Music, Comics, Podcasts.
Search strategies¶
Search strategies tell the adapter which URL to call and in which order. The adapter
tries strategies in ascending priority order, stopping at the first one that returns
results. Define more specific lookups (by bridge ID) before broader title searches.
{
"search_strategies": [
{
"name": "isbn_lookup",
"priority": 0,
"required_fields": ["isbn"],
"url_template": "{base_url}/books?isbn={isbn}",
"results_path": "items",
"result_index": 0,
"tolerate_404": true,
"fetch_limit": 1,
"max_results": 1,
"media_types": ["Books"]
},
{
"name": "title_search",
"priority": 1,
"required_fields": ["title"],
"url_template": "{base_url}/search?q={query}&limit={limit}&lang={lang}",
"query_template": "{title} {author}",
"results_path": "results",
"result_index": 0,
"tolerate_404": false,
"fetch_limit": 5,
"max_results": 25,
"media_types": ["Books", "Audiobooks"]
}
]
}
URL template tokens (all resolved at runtime by ConfigDrivenAdapter):
| Token | Resolves to |
|---|---|
{base_url} |
The endpoints.api value |
{title} |
Lookup request title (URL-encoded) |
{author} |
Author from request metadata |
{query} |
query_template rendered with available fields |
{isbn} |
ISBN from request metadata |
{limit} |
max_results value |
{lang} |
Language code from LanguagePreferences.Metadata (e.g. en) |
{country} |
Country code derived from language setting |
{year} |
Year from request metadata |
results_path is a dot-separated JSON path to the array of results in the response
(e.g. "results" for { "results": [...] }, or "response.docs" for nested paths).
tolerate_404 controls whether a 404 status silently returns empty results (true)
or is treated as an error (false).
required_fields lists the claim keys that must be present in the lookup request for
this strategy to be attempted. The adapter skips the strategy when any required field
is missing.
Field mappings¶
Field mappings declare which JSON property in the response maps to which internal claim key, at what confidence, and with which optional transform.
{
"field_mappings": [
{
"claim_key": "title",
"json_path": "volumeInfo.title",
"confidence": 0.75,
"transform": null,
"transform_args": null,
"media_types": null
},
{
"claim_key": "author",
"json_path": "volumeInfo.authors",
"confidence": 0.75,
"transform": "array_first",
"transform_args": null,
"media_types": null
},
{
"claim_key": "cover",
"json_path": "volumeInfo.imageLinks.thumbnail",
"confidence": 0.70,
"transform": "replace",
"transform_args": "zoom=1|zoom=5",
"media_types": null
},
{
"claim_key": "year",
"json_path": "volumeInfo.publishedDate",
"confidence": 0.75,
"transform": "first_n_chars",
"transform_args": "4",
"media_types": null
},
{
"claim_key": "isbn",
"json_path": "volumeInfo.industryIdentifiers[?(@.type=='ISBN_13')].identifier",
"confidence": 0.95,
"transform": null,
"transform_args": null,
"media_types": ["Books"]
}
]
}
json_path supports dot notation and array indexing. The adapter resolves it against
the result object from results_path[result_index].
Available transforms (implemented in ConfigDrivenAdapter):
| Transform | transform_args |
Effect |
|---|---|---|
strip_html |
— | Strips HTML tags from the value |
first_n_chars |
"4" |
Truncates to N characters (useful for ISO date → year) |
array_first |
— | Takes the first element of a JSON array |
array_join |
", " |
Joins array elements with separator |
to_string |
— | Converts numeric values to string |
regex_replace |
"pattern\|replacement" |
Regex find-and-replace |
replace |
"old\|new" |
Literal string substitution |
media_types on a mapping restricts it to specific media types. null applies to all.
The effective confidence for a claim is: provider.weight × field_weights[claim_key] × mapping.confidence.
Bridge IDs and preferred lookup order¶
If the provider returns a bridge ID that Wikidata can use for precise QID resolution,
declare it in preferred_bridge_ids:
The Wikidata Stage 2 adapter reads these to know which claim keys to look for when building its edition-first QID lookup.
Language strategy¶
Three options:
| Value | Behaviour | Use when |
|---|---|---|
source |
Always query in English | Provider has poor or no localization |
localized |
Query in the user's metadata language | Provider supports localized results |
both |
Query in metadata language, fall back to English on empty | Provider supports localization but has incomplete coverage |
source is the safe default for most providers. Use localized only when the API
genuinely returns localized titles and descriptions (e.g. TMDB, Apple API).
Step 3 — Complete example: a hypothetical "BookHive" provider¶
config/providers/bookhive.json:
{
"name": "bookhive",
"version": "1.0",
"display_name": "BookHive",
"enabled": true,
"weight": 0.72,
"domain": "Ebook",
"icon": "images/providers/bookhive.svg",
"capability_tags": ["title", "author", "cover", "description", "isbn", "year"],
"available_fields": ["title", "author", "year", "cover", "description", "isbn", "genre"],
"field_weights": {
"title": 0.75,
"author": 0.78,
"cover": 0.82,
"description": 0.78,
"isbn": 0.92,
"year": 0.76,
"genre": 0.70
},
"endpoints": {
"api": "https://api.bookhive.example.com/v1"
},
"throttle_ms": 400,
"max_concurrency": 1,
"cache_ttl_hours": 168,
"hydration_stages": [1],
"adapter_type": "config_driven",
"provider_id": "ba000099-0000-4000-8000-000000000099",
"requires_api_key": true,
"http_client": {
"timeout_seconds": 10,
"user_agent": "Tuvima Library/1.0",
"api_key": "YOUR_KEY_HERE",
"api_key_delivery": "query_param",
"api_key_param_name": "apikey"
},
"can_handle": {
"media_types": ["Books"],
"entity_types": ["Work", "MediaAsset"]
},
"search_strategies": [
{
"name": "isbn_lookup",
"priority": 0,
"required_fields": ["isbn"],
"url_template": "{base_url}/isbn/{isbn}",
"results_path": "data",
"result_index": 0,
"tolerate_404": true,
"fetch_limit": 1,
"max_results": 1,
"media_types": ["Books"]
},
{
"name": "title_search",
"priority": 1,
"required_fields": ["title"],
"url_template": "{base_url}/search?q={query}&limit={limit}",
"query_template": "{title} {author}",
"results_path": "data.books",
"result_index": 0,
"tolerate_404": false,
"fetch_limit": 5,
"max_results": 20,
"media_types": ["Books"]
}
],
"field_mappings": [
{
"claim_key": "title",
"json_path": "title",
"confidence": 0.75,
"transform": null,
"transform_args": null,
"media_types": null
},
{
"claim_key": "author",
"json_path": "authors",
"confidence": 0.75,
"transform": "array_first",
"transform_args": null,
"media_types": null
},
{
"claim_key": "cover",
"json_path": "cover_url",
"confidence": 0.80,
"transform": null,
"transform_args": null,
"media_types": null
},
{
"claim_key": "description",
"json_path": "synopsis",
"confidence": 0.78,
"transform": "strip_html",
"transform_args": null,
"media_types": null
},
{
"claim_key": "year",
"json_path": "publish_date",
"confidence": 0.76,
"transform": "first_n_chars",
"transform_args": "4",
"media_types": null
},
{
"claim_key": "isbn",
"json_path": "isbn13",
"confidence": 0.92,
"transform": null,
"transform_args": null,
"media_types": ["Books"]
},
{
"claim_key": "genre",
"json_path": "genres",
"confidence": 0.68,
"transform": "array_join",
"transform_args": ", ",
"media_types": null
}
],
"preferred_bridge_ids": {
"Books": ["isbn"]
},
"language_strategy": "source"
}
Step 4 — How the runtime loads your config¶
At startup, ConfigDrivenAdapter scans config/providers/ for JSON files where
adapter_type == "config_driven". For each file it:
- Deserialises
ProviderConfigurationfrom the JSON. - Registers an HTTP client named after
provider_id. - Seeds a row in
provider_registry(id, name, version, is_enabled)viaDatabaseConnection— theINSERT OR IGNOREmeans existing rows are never overwritten. - Makes the adapter available as
IExternalMetadataProviderthrough DI.
Nothing else needs to change. Restart the Engine and the provider is active.
Step 5 — Verify the provider works¶
Use the debug lookup endpoint (development environment only) to run a live enrichment pass without writing anything to the database:
POST http://localhost:61495/debug/lookup
Content-Type: application/json
{
"title": "The Name of the Wind",
"author": "Patrick Rothfuss",
"mediaType": "Books",
"isbn": "9780756404741"
}
The response includes every claim the provider returned along with its source, confidence value, and which search strategy fired. If no claims appear, check:
enabled: truein your config file.- The
can_handle.media_typeslist includes the media type you're querying. - The
required_fieldsfor at least one strategy are satisfied by your request. - The API key is set correctly if
requires_api_key: true. - Check
engine.log(Serilog rolling log) for HTTP-level errors from the adapter.
You can also browse all registered providers via the Settings → Providers screen in
the Dashboard, or inspect the provider_registry table directly:
Reference: complete field listing¶
| Field | Type | Required | Default | Notes |
|---|---|---|---|---|
name |
string | yes | — | Matches filename stem |
version |
string | yes | — | Informational only |
display_name |
string | yes | — | Shown in Settings UI |
enabled |
bool | yes | — | false skips this provider entirely |
weight |
float | yes | — | Provider-level multiplier [0.0–1.0] |
domain |
string | yes | — | Ebook / Video / Music / Comics / Universal |
adapter_type |
string | yes | — | Must be config_driven |
provider_id |
UUID string | yes | — | Stable FK — never change after first use |
hydration_stages |
int[] | yes | — | [1] for Stage 1 (RetailIdentification) |
can_handle.media_types |
string[] | yes | — | Restricts to named media types |
can_handle.entity_types |
string[] | yes | — | Work / MediaAsset |
endpoints.api |
string | yes | — | Base URL for all requests |
throttle_ms |
int | no | 250 | Min ms between requests |
max_concurrency |
int | no | 1 | Parallel request cap |
cache_ttl_hours |
int? | no | null | null = no caching |
requires_api_key |
bool | no | false | Shown in Setup Wizard |
language_strategy |
string | no | "source" |
source / localized / both |
preferred_bridge_ids |
object | no | — | Keys used by Wikidata Stage 2 |