# Local AI Intelligence Layer

## Role in the System
AI is a core function of Tuvima Library, not an optional add-on. It replaces brittle heuristic and regex code that previously handled filename cleaning, media type disambiguation, and metadata scoring. The Engine requires AI models to be present and will not begin ingestion until they have been downloaded.
All inference runs entirely on the local machine. No cloud account, no subscription, no data leaves the NAS.
## Project Structure
All AI implementations live in MediaEngine.AI. This project sits alongside MediaEngine.Providers in the dependency chain — it references MediaEngine.Domain (for contracts) and MediaEngine.Storage (for configuration).
NuGet dependencies:
- LLamaSharp + LLamaSharp.Backend.Cpu (both MIT) — .NET native llama.cpp binding with GBNF grammar constraint support
- Whisper.net + Whisper.net.Runtime (both MIT) — .NET native whisper.cpp binding for speech-to-text and language detection
## Model Roles
Three model roles cover all AI workloads. Only one model is loaded into memory at a time, enforced by a SemaphoreSlim. Models auto-unload after a configurable idle timeout. On first run, ModelAutoDownloadService downloads all models to the /models Docker volume (environment variable: TUVIMA_MODELS_DIR).
| Role | Default model | Memory footprint | Use |
|---|---|---|---|
| text_fast | Llama 3.2 1B Q4_K_M | ~750 MB | On-demand tasks: search intent parsing, TL;DR, recommendation explanations |
| text_quality | Llama 3.2 3B Q4_K_M | ~2 GB | Batch tasks: ingestion manifest analysis, vibe tagging, QID disambiguation |
| audio | Whisper Medium | ~1.5 GB | Audio tasks: transcription, language detection, sync maps |
Model roles are configurable in config/ai.json. Any GGUF-format model can be substituted by updating the config.
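As a sketch of what that substitution looks like, the role-to-model mapping in config/ai.json might be shaped as follows (field names and paths here are illustrative assumptions, not the actual schema):

```json
{
  "models": {
    "text_fast": {
      "path": "/models/llama-3.2-1b-instruct-q4_k_m.gguf",
      "contextWindow": 4096
    },
    "text_quality": {
      "path": "/models/llama-3.2-3b-instruct-q4_k_m.gguf",
      "contextWindow": 8192
    },
    "audio": {
      "path": "/models/ggml-medium.bin"
    }
  },
  "idleUnloadMinutes": 10
}
```

Swapping in another GGUF model is then just a matter of pointing the role's path at a different file.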
## Structured Output
All LLM calls use GBNF grammar constraints — llama.cpp forces the model to produce valid JSON at the token level. This is model-agnostic and works with Llama, Mistral, Phi, Gemma, and Qwen models. JSON schema validation and retry logic serve as a safety net over the grammar constraint.
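As an illustration of how a grammar constrains output at the token level, a minimal GBNF grammar that only admits a flat JSON object with fixed keys might look like this (a sketch for explanation, not a grammar Tuvima ships):

```gbnf
root   ::= "{" ws "\"title\"" ws ":" ws string "," ws "\"year\"" ws ":" ws number "}"
string ::= "\"" ([^"\\] | "\\" .)* "\""
number ::= [0-9]+
ws     ::= [ \t\n]*
```

During sampling, llama.cpp masks out every token that would violate the grammar, so the model cannot emit anything other than an object matching this shape; the JSON schema check then only has to catch semantic problems, not syntax errors.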
## Features

### Ingestion (automatic, runs during file processing)
| Feature | Model | What it does |
|---|---|---|
| Smart Labeling | text_quality | Cleans raw filenames into structured title/author/year/series fields. Replaces TitleNormalizer.cs. |
| Media Type Classification | text_quality | Classifies ambiguous file formats (MP3, MP4, M4A) when heuristic signals are insufficient. Replaces AudioProcessor/VideoProcessor disambiguation heuristics. |
| Batch Manifest Builder | text_quality | Analyses an entire folder of files as a group before retail API calls, inferring series, author, and format context. Reduces retail API calls by 80–95% for bulk imports. Supersedes IngestionHintCache. |
| Audio Language Detection | audio | Detects the spoken language of audio files using Whisper. |
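The saving from the Batch Manifest Builder comes from querying once per inferred group rather than once per file. The sketch below shows only the grouping idea in Python (the project itself is C#/.NET; the filenames and the trailing-number rule are illustrative assumptions, not the real heuristics):

```python
import re
from collections import defaultdict

def group_manifest(filenames):
    """Group files that share a common prefix so one retail lookup
    can cover the whole group instead of one call per file."""
    groups = defaultdict(list)
    for name in filenames:
        # Strip the extension, then a trailing volume number like " 01",
        # to find the shared series/author prefix.
        key = re.sub(r"\s*\d+\s*$", "", name.rsplit(".", 1)[0])
        groups[key].append(name)
    return dict(groups)

files = [
    "Dune 01.mp3", "Dune 02.mp3", "Dune 03.mp3",
    "Hyperion.epub",
]
manifest = group_manifest(files)
# The three "Dune" tracks collapse into one group -> one API call, not three.
```

In the real feature the grouping and the inferred series/author/format context come from the text_quality model reasoning over the whole folder, not from a regex.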
### Alignment (automatic / on-demand)
| Feature | Model | What it does |
|---|---|---|
| QID Disambiguation | text_quality | When the Reconciliation API returns multiple Wikidata candidates, picks the best match using semantic reasoning over title, description, and existing metadata. |
| Series Alignment | text_quality | Infers correct reading/watching order within a series when Wikidata series position data is absent or inconsistent. Background service (3 AM daily). |
| Watching Order | text_fast | Generates a recommended cross-media consumption order for a Hub (read the book before watching the film, etc.). On-demand. |
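Conceptually, QID disambiguation is a ranking problem: given several Wikidata candidates, pick the one whose description best matches what is already known. The sketch below uses crude token overlap as a stand-in for the model's semantic reasoning (candidate data is made up, and this is not the scoring the feature uses):

```python
def score(candidate, context):
    """Token overlap between a candidate description and known metadata,
    a crude stand-in for the LLM's semantic comparison."""
    cand = set(candidate["description"].lower().split())
    ctx = set(context.lower().split())
    return len(cand & ctx)

candidates = [
    {"qid": "Q42", "description": "English writer and humorist"},
    {"qid": "Q5", "description": "common name of Homo sapiens"},
]
context = "author of the Hitchhiker's Guide, English writer"
best = max(candidates, key=lambda c: score(c, context))
```

The LLM replaces `score` with actual reasoning over title, description, and existing metadata, which is what lets it handle paraphrase and context that token overlap misses.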
### Enrichment (background / on-demand)
| Feature | Model | What it does |
|---|---|---|
| Vibe Tags | text_quality | Generates mood and atmosphere tags (3–5 per work) using per-category controlled vocabularies defined in config/ai.json. 25–30 tags per media type, organised by dimension: pacing, mood, atmosphere, tone, intensity. Vibes describe how something feels, not what it is — genre handles categorisation (from Wikidata/retail), vibes handle the emotional/atmospheric space. Background service (4 AM daily). |
| TL;DR | text_fast | Generates a 2–3 sentence plain-language summary of a work's description. On-demand. |
| Cover Art Validation | text_fast | Verifies that a downloaded cover image matches the expected work (catches mismatched covers from retail providers). |
| Audio Similarity | — | Chromaprint-based acoustic fingerprinting to detect duplicate audio files across formats. No LLM required. |
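Chromaprint fingerprints are sequences of 32-bit integers, and a common way to compare two fingerprints is the bit-error rate between aligned values. A minimal sketch of that comparison (the fingerprint values here are toy numbers, not real Chromaprint output, and real matching also has to handle alignment offsets):

```python
def bit_error_rate(fp_a, fp_b):
    """Fraction of differing bits between two aligned fingerprints of
    32-bit values. Lower is more similar; identical audio gives 0.0."""
    n = min(len(fp_a), len(fp_b))
    diff_bits = sum(bin(a ^ b).count("1") for a, b in zip(fp_a, fp_b))
    return diff_bits / (32 * n)

a = [0xDEADBEEF, 0x12345678]
b = [0xDEADBEEF, 0x12345679]  # differs from a in exactly one bit
rate = bit_error_rate(a, b)
```

Duplicates across formats (e.g. the same audiobook as MP3 and M4A) produce near-identical fingerprints, so a low bit-error rate flags them without any LLM involvement.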
### Syncing (scheduled / on-demand)
| Feature | Models | What it does |
|---|---|---|
| Immersive Bake | audio + text_quality | Generates a synchronized audiobook/ebook experience — aligns audio timestamps to text positions. |
| Subtitle Sync | audio | Corrects subtitle timing drift using Whisper-generated transcription as ground truth. |
| Cross-Media Scene Mapping | text_quality | Maps equivalent scenes across book, film, and audiobook editions of the same work. |
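Subtitle timing drift is usually linear: a constant offset plus a small speed mismatch. Given anchor pairs of (subtitle time, Whisper time), a least-squares line recovers both terms. A sketch under that assumption (the anchor values are invented; how anchors are matched to transcript words is a separate problem):

```python
def fit_drift(pairs):
    """Fit whisper_t = scale * sub_t + offset by ordinary least squares."""
    n = len(pairs)
    sx = sum(s for s, _ in pairs)
    sy = sum(w for _, w in pairs)
    sxx = sum(s * s for s, _ in pairs)
    sxy = sum(s * w for s, w in pairs)
    scale = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    offset = (sy - scale * sx) / n
    return scale, offset

# Example: subtitles lag 2 s behind the audio and run 1% slow.
anchors = [(10.0, 12.1), (100.0, 103.0), (200.0, 204.0)]
scale, offset = fit_drift(anchors)
# Remapped subtitle times on the audio timeline:
corrected = [scale * t + offset for t, _ in anchors]
```

Using the Whisper transcription as ground truth makes the correction robust even when the original subtitle file has no reliable timing metadata.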
### Personalization (background / on-demand)
| Feature | Model | What it does |
|---|---|---|
| Local Taste Profiling | text_quality | Builds a per-profile preference model from reading/watching history. Background service (Sunday 5 AM). |
| "Why" Factor | text_fast | Explains in plain language why a specific item was recommended to a user. |
### Discovery (user input)
| Feature | Model | What it does |
|---|---|---|
| Intent Search | text_fast | Translates a natural language search query ("something scary set in space") into structured filter parameters for the library query engine. |
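For the example query above, the structured output handed to the library query engine might be shaped like this (field names are an illustrative assumption, not the actual query schema):

```json
{
  "query": "something scary set in space",
  "filters": {
    "genres": ["horror", "science fiction"],
    "vibes": ["tense", "atmospheric"],
    "mediaTypes": []
  }
}
```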
### Advanced (manual)
| Feature | Model | What it does |
|---|---|---|
| User-Assisted URL Paste | text_quality | Extracts structured metadata from a pasted URL (publisher page, Wikipedia article, Goodreads link) when automated providers fail. |
## Relationship to the Priority Cascade
AI does not replace the Priority Cascade Engine. Wikidata remains the sole authority for canonical data. AI improves the quality of inputs to the cascade:
| AI feature | Cascade interaction |
|---|---|
| SmartLabeler | Cleans filenames before they are parsed into claims — better raw input for Tier D scoring |
| MediaTypeAdvisor | Emits a high-confidence media_type claim that enters the cascade as any other claim |
| QidDisambiguator | Accelerates Tier C resolution when multiple Wikidata candidates are returned |
| BatchManifestBuilder | Reduces the number of API calls, but does not change claim weights or cascade logic |
The cascade evaluates all claims — including those produced by AI features — using the same tier system.
## Genre vs Vibe — Discovery Model
Genres and vibes serve different purposes and come from different sources. Together they power the discovery and smart playlist features.
| Layer | Source | Example | Answers |
|---|---|---|---|
| Genre | Wikidata (P136) / retail providers | Science Fiction, Mystery, Biography | "What is it?" — categorical |
| Vibe | AI (VibeTagger, text_quality model) | atmospheric, slow-burn, haunting | "How does it feel?" — emotional/atmospheric |
| Taste Profile | AI (TasteProfileBackgroundService) | User prefers cerebral + atmospheric sci-fi | "What do I like?" — personalised |
| Intent Search | AI (text_fast model) | "something scary set in space" → genre:horror + genre:sci-fi + vibe:tense | "What am I in the mood for?" — natural language |
Intent Search is the bridge — it translates natural language into a structured query that combines genres, vibes, media types, and other metadata. The "Why" Factor then explains recommendations in plain language.
### Vibe vocabulary design principles
- Vibes must not overlap with genres. "Science Fiction" is a genre; "cerebral" and "futuristic-feeling" are vibes.
- Vibes are organised by dimension: pacing (page-turner, slow-burn), mood (dark, uplifting), atmosphere (atmospheric, gritty), tone (satirical, whimsical), intensity (epic, intimate).
- Cross-media tags (atmospheric, dark, intimate, haunting, raw, gentle) appear across most media types because those vibes are universal. Media-specific tags stay where they make sense (visually-stunning for film, driving for music, noir for comics).
- 25–30 tags per media type. The LLM selects 3–5 per work, so the larger vocabulary gives range without diluting results.
- Vocabularies are editable by the user in Settings > Intelligence > Vibe Vocabulary.
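A per-category vocabulary entry in config/ai.json might be shaped like the following sketch (the tags are examples drawn from the dimensions above; the real schema and key names may differ):

```json
{
  "vibeVocabularies": {
    "Books": {
      "pacing": ["page-turner", "slow-burn"],
      "mood": ["dark", "uplifting"],
      "atmosphere": ["atmospheric", "gritty"],
      "tone": ["satirical", "whimsical"],
      "intensity": ["epic", "intimate"]
    }
  }
}
```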
## API Endpoints

| Method | Route | Purpose |
|---|---|---|
| GET | /ai/status | Model load status, memory usage |
| GET | /ai/models | List configured models with download status |
| POST | /ai/models/{role}/download | Trigger model download |
| POST | /ai/models/{role}/load | Load a model into memory |
| POST | /ai/models/{role}/unload | Unload a model |
| GET | /ai/config | Current AI configuration |
| POST | /ai/enrich/tldr/{entityId} | Generate TL;DR for a work |
| POST | /ai/enrich/vibes/{entityId} | Generate vibe tags for a work |
| POST | /ai/enrich/search/intent | Parse a natural language search query |
| POST | /ai/enrich/extract-url | Extract metadata from a pasted URL |
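A GET /ai/status response might be shaped like this (illustrative fields only; the actual payload may differ):

```json
{
  "loadedRole": "text_quality",
  "memoryUsageBytes": 2147483648,
  "idleUnloadMinutes": 10,
  "models": {
    "text_fast": "downloaded",
    "text_quality": "loaded",
    "audio": "downloaded"
  }
}
```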
## Background Services

| Service | Schedule | Purpose |
|---|---|---|
| ModelAutoDownloadService | Startup | Downloads missing models to the /models volume |
| VibeBatchService | Daily at 4 AM | Processes the vibe tag queue for all untagged works |
| SeriesAlignmentBackgroundService | Daily at 3 AM | Resolves series order for works without Wikidata series position |
| TasteProfileBackgroundService | Weekly, Sunday at 5 AM | Rebuilds per-profile taste models from history |
## Configuration
All AI settings live in config/ai.json:
- Model definitions per role (path, quantization, context window)
- Per-feature enable flags
- Per-category vibe vocabularies (Books, Movies, Music, Podcasts, Comics)
- Scheduling parameters for background services
- Idle unload timeout