# MindForge Voice-to-Omni Pipeline White Paper

Generated: 2026-05-13T08:17:09Z

## Abstract

The M1ND3XPAND3RS voice dataset has been upgraded from a single TTS manifest into a three-lane creative training substrate: clean audio/text examples for voice fine-tuning, Director-chat JSONL for Unsloth/Qwen3.6 LoRA training, and operator proof surfaces that make the pipeline inspectable before any expensive or public action is taken.

## Dataset contract

The portable audio JSONL lane accepts one JSON object per line. The `audio` and `text` fields are required; `duration` and `dataset_id` are optional:

```jsonl
{"audio":"examples/example.wav","text":"This is an example audio transcript for training."}
{"audio":"/absolute/path/to/audio1.wav","text":"You can use absolute paths for audio files."}
{"audio":"relative/path/to/audio2.wav","text":"Or relative paths from the working directory."}
{"audio":"data/audio3.wav","text":"Each line is a JSON object with audio path and text.","duration":3.5}
{"audio":"data/audio4.wav","text":"Optional: add duration field to skip audio loading during filtering.","duration":2.8}
{"audio":"data/audio5.wav","text":"Optional: add dataset_id for multi-dataset training.","dataset_id":1}
```
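
The row shapes above can be checked with a small validator. The following is a minimal sketch, not the pipeline's actual validator: the function name `validate_row` is hypothetical, and two rules are assumptions not stated in the contract, namely that `duration` must be positive and that unknown fields are rejected.

```python
import json
from typing import Any

# Field names and types taken from the example rows above.
REQUIRED = {"audio": str, "text": str}
OPTIONAL = {"duration": (int, float), "dataset_id": int}


def validate_row(line: str) -> list:
    """Return a list of problems for one JSONL line; an empty list means the row is valid."""
    try:
        row: Any = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(row, dict):
        return ["row is not a JSON object"]

    problems = []
    for key, expected in REQUIRED.items():
        value = row.get(key)
        if not isinstance(value, expected) or not value.strip():
            problems.append(f"'{key}' must be a non-empty string")
    for key, expected in OPTIONAL.items():
        if key in row:
            value = row[key]
            # bool is a subclass of int in Python, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, expected):
                problems.append(f"'{key}' has the wrong type")
    duration = row.get("duration")
    # Assumption: a duration of zero or less is never meaningful.
    if isinstance(duration, (int, float)) and not isinstance(duration, bool) and duration <= 0:
        problems.append("'duration' must be positive")
    # Assumption: the contract is closed, so extra fields are flagged.
    unknown = set(row) - set(REQUIRED) - set(OPTIONAL)
    if unknown:
        problems.append(f"unknown fields: {sorted(unknown)}")
    return problems
```

Running `validate_row` over every line of a manifest and collecting the non-empty results gives a simple validation report for the proof lane.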

Current canonical examples:

- `train.jsonl`: VoxCPM/voice-safe lane with `audio`, `text`, `duration`.
- `data_train.jsonl`: portable `data/audioN.wav` lane with duration and dataset ids.
- `train.with_dataset_id.jsonl`: multi-dataset lane.
- `qwen_director_train.jsonl` and `qwen_director_validation.jsonl`: chat-SFT Director lane.

## The three-lane architecture

1. **Voice lane** — trains or serves speech. Spoken text stays clean. Control tags do not enter transcript fields.
2. **Director lane** — trains an LLM to emit strict MindForge/vLLM-Omni JSON. It can choose routes like TTS, image generation, video render, music bed, or operator review.
3. **Proof lane** — records manifests, validation reports, white paper, rendered explainer video, and morning receipts.
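
The separation between the voice lane and the Director lane can be illustrated with a guard on Director output. This is a sketch under stated assumptions: the field names `route` and `text`, the route identifiers, and the control-tag syntax (`<tag>` or `[tag]`) are all hypothetical, since the actual MindForge/vLLM-Omni schema is not reproduced here.

```python
import json
import re

# Hypothetical route identifiers derived from the routes listed above.
ALLOWED_ROUTES = {"tts", "image_generation", "video_render", "music_bed", "operator_review"}

# Assumed control-tag syntax: anything shaped like <tag> or [tag].
CONTROL_TAG = re.compile(r"<[^>]+>|\[[^\]]+\]")


def check_director_output(raw: str) -> dict:
    """Parse one Director response and reject anything outside the assumed schema."""
    decision = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    if not isinstance(decision, dict):
        raise ValueError("Director output must be a JSON object")
    route = decision.get("route")
    if not isinstance(route, str) or route not in ALLOWED_ROUTES:
        raise ValueError(f"unknown route: {route!r}")
    if route == "tts":
        text = decision.get("text")
        if not isinstance(text, str) or not text.strip():
            raise ValueError("tts route needs a non-empty 'text' field")
        # Voice-lane rule: spoken text stays clean, so control tags are rejected.
        if CONTROL_TAG.search(text):
            raise ValueError("control tags must not appear in spoken text")
    return decision
```

Raising on any violation, rather than silently repairing the output, keeps malformed Director responses from reaching the voice lane.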

## Why Unsloth here

Unsloth gives a practical LoRA path for Qwen-family chat SFT. The active training scaffold uses:

- Base model: `unsloth/Qwen3.6-27B`
- GPU: Modal A100-80GB
- Adapter: LoRA rank 32 / alpha 32
- Smoke: 5 steps, 32 train rows, 8 eval rows
- Full bounded follow-up: 60 steps after smoke passes
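
The smoke-then-full sequencing can be captured as plain configuration data. The sketch below is illustrative only and does not call Unsloth or Modal: the class name `DirectorLoraConfig` and the helper `next_run` are hypothetical, and the full-run row counts are taken from the training receipt in this paper. The `scaling` property uses the standard LoRA convention of alpha divided by rank.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class DirectorLoraConfig:
    """Hyperparameters listed in this section; defaults describe the smoke run."""
    base_model: str = "unsloth/Qwen3.6-27B"
    gpu: str = "A100-80GB"
    lora_rank: int = 32
    lora_alpha: int = 32
    max_steps: int = 5
    train_rows: int = 32
    eval_rows: int = 8

    @property
    def scaling(self) -> float:
        # Standard LoRA scaling factor applied to the adapter update.
        return self.lora_alpha / self.lora_rank


SMOKE = DirectorLoraConfig()
# Full bounded run: same adapter settings, larger step and row budgets.
FULL = replace(SMOKE, max_steps=60, train_rows=256, eval_rows=42)


def next_run(smoke_passed: bool) -> Optional[DirectorLoraConfig]:
    """Only hand out the full config once the smoke run has passed."""
    return FULL if smoke_passed else None
```

With rank 32 and alpha 32 the scaling factor is 1.0, so the adapter update is applied without extra amplification.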

## Completed training receipt

The bounded full Director LoRA run completed successfully. The run ID below still carries the `smoke` label, although the run used the full 60-step budget.

- Run ID: `qwen36-27b-mindforge-director-smoke-20260513T081919Z`
- Max steps: `60`
- Train rows used: `256`
- Eval rows used: `42`
- Runtime: `661.7805` seconds
- Final train loss: `0.40606151446700095`
- Adapter path on Modal volume: `/outputs/qwen36-27b-mindforge-director-smoke-20260513T081919Z/adapter`
- Adapter uploaded to: `https://huggingface.co/TheMindExpansionNetwork/mindforge-qwen36-27b-director-lora`
- HF upload commit: `54ebbb836da60d8fe5ca3408317ea02098ebdf37`
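
A few derived figures make the receipt easier to compare across runs. The sketch below only does arithmetic on the values listed above; the dictionary keys and the `summarize` helper are hypothetical, and the per-step figure assumes the reported runtime is dominated by the 60 training steps rather than model loading or evaluation.

```python
# Values copied from the receipt above; key names are illustrative.
receipt = {
    "max_steps": 60,
    "train_rows": 256,
    "eval_rows": 42,
    "runtime_seconds": 661.7805,
    "final_train_loss": 0.40606151446700095,
}


def summarize(r: dict) -> dict:
    """Derive simple comparison metrics from a training receipt."""
    total_rows = r["train_rows"] + r["eval_rows"]
    return {
        # Average wall-clock cost of one optimizer step.
        "seconds_per_step": r["runtime_seconds"] / r["max_steps"],
        "runtime_minutes": r["runtime_seconds"] / 60,
        # Share of the rows used that were held out for evaluation.
        "eval_fraction": r["eval_rows"] / total_rows,
    }


summary = summarize(receipt)
print({name: round(value, 3) for name, value in summary.items()})
```

For this receipt the run averages roughly 11 seconds per step, and about 14% of the rows used were evaluation rows.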

The Director model is not a speech model. It is the routing brain that keeps voice text clean while still enabling rich creative control.

## Safety and cost gates

Closed by default:

- public posting
- paid image/video generation without an explicit run
- voice-to-shell
- payment/outreach
- live stream mutation
- uncontrolled recursive cron

Allowed autonomously in this sprint:

- read-only health probes
- schema validation
- static pages and manifests
- bounded, explicitly requested Modal training runs (smoke or full)
- local/video proof pack generation
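
The two lists above amount to a default-deny policy: anything not explicitly allowed stays closed. A minimal sketch of that idea follows. The action identifiers and the `is_allowed` helper are hypothetical names chosen for illustration, not the pipeline's real gate implementation.

```python
# Hypothetical identifiers for the autonomously allowed actions listed above.
AUTONOMOUS_ACTIONS = frozenset({
    "health_probe",
    "schema_validation",
    "static_pages",
    "modal_training_run",
    "proof_pack_generation",
})

# Allowed actions that still need an explicit operator request.
NEEDS_EXPLICIT_REQUEST = frozenset({"modal_training_run"})


def is_allowed(action: str, explicitly_requested: bool = False) -> bool:
    """Default-deny gate: unknown or closed actions are always refused."""
    if action not in AUTONOMOUS_ACTIONS:
        return False
    if action in NEEDS_EXPLICIT_REQUEST:
        return explicitly_requested
    return True
```

Because the gate checks an allow list instead of a deny list, a newly added capability such as a new posting route stays closed until someone deliberately adds it.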

## Morning success criteria

By 7 AM, the operator should be able to open one page and see:

- dataset example rows and the validated schema
- Unsloth preflight result
- smoke/full training status or receipt
- adapter output path if training completed
- this white paper
- a narrated HyperFrames explainer video
- exact next human decisions

## Next human decisions

1. Promote the Director adapter to a named HF model repo if the training metrics and generation probe pass.
2. Connect ComfyUI Cloud or a RunPod/VPS ComfyUI endpoint for real workflow execution.
3. Approve one public-safe demo post or YouTube upload after manual review.

## Tiny goblin conclusion

This is the bridge from “we have voice clips” to “we have an autonomous creative machine with receipts.” Not just beep boop. Beep boop with a manifest, dude.