Accepting engagements for Q2 2026

Full control over how your AI behaves.

Fine-tuned LLMs on your data, your rules, your hardware. Delivered within 72 hours.

NVIDIA Grace Blackwell compute tray — the hardware that runs Pylox Forge.

Trained and served on Grace Blackwell, air-gapped from the internet.

Pilot-Fit Guarantee
Your fine-tune outperforms your baseline on an eval set you supply, or the setup fee is refunded.

Why fine-tune

Rented intelligence, or owned behavior.

Cloud LLMs are rented. Every call depends on a vendor's pricing, policy, and release schedule you don't control. A fine-tuned model is yours — you dictate the behavior.

What it knows

Trained on your documents, contracts, ticket history, and internal policies.

How it speaks

Your terminology, your format, your refusal phrasing, your brand voice.

Where it runs

On your hardware, our hardware, your cloud — anywhere you choose.

Who sees the data

Zero third-party calls, zero training-data exfiltration, zero vendor logging.

What it won't do

Your content policy, your compliance rules, your brand limits — not ours.

What it costs

One-time fine-tune plus marginal inference cost. Not per-token forever.

Built on infrastructure you trust

Open-weight foundation models, NVIDIA silicon, standard deployment targets.

NVIDIA
Meta Llama
Qwen
Hugging Face
RunPod
Lambda Labs

Process

From contract to production in 72 hours.

Every engagement follows the same pipeline. Every step is automated, every output auditable, every measurement reproducible. Nothing proprietary — you could run this pipeline yourself if you wanted. We just do it faster.

01

Data intake

JSONL, CSV, PDF, chat transcripts, Slack / Zendesk / Intercom exports. We handle the ingest.

02

Schema + PII redaction

Normalized to chat schema, PII redacted automatically before anything touches training.

03

Quality filter + dedup

MinHash deduplication, quality threshold enforcement, bad-row rejection with audit log.

04

Optional domain enrichment

Local on-prem models expand your corpus. Zero data to third parties. Opt in per engagement.

05

QLoRA training

On Grace Blackwell silicon, packed sequences, state-of-the-art training recipe.

06

DPO safety alignment

Refusal behavior and brand voice baked into the LoRA weights — not just a filter on top.

07

Automated benchmark

Academic, domain, performance, cost, and safety: every section run and reported.

08

Red-team verification

50-prompt attack suite. 70%+ block rate required before any adapter ships.

09

NVFP4 + EAGLE-3 deploy

State-of-the-art inference stack — every acceleration NVIDIA ships, running together.

10

Hugging Face push

Private or public repo. You own the weights. You can export, re-host, or modify forever.

11

Handoff

Endpoint keys delivered if Pylox-hosted — or adapter file shipped if you're running self-host.
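For readers who want to see what step 03 involves, here is a minimal sketch of MinHash near-duplicate removal with an audit log. It is an illustration under stated assumptions, not the production pipeline: the shingle size, the number of hash functions, the 0.8 similarity threshold, and every function name are hypothetical choices made for this example.

```python
import hashlib
import re

NUM_PERM = 64  # number of hash functions; illustrative, not a pipeline setting


def shingles(text: str, k: int = 3) -> set:
    """Return the set of k-word shingles of a text (lowercased, punctuation dropped)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(text: str, num_perm: int = NUM_PERM) -> list:
    """For each seeded hash function, keep the minimum hash over all shingles."""
    grams = [s.encode("utf-8") for s in shingles(text)]
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(hashlib.blake2b(g, digest_size=8, salt=salt).digest(), "big")
            for g in grams
        ))
    return signature


def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of matching signature slots, which estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def dedup(rows: list, threshold: float = 0.8):
    """Keep the first copy of each near-duplicate group; log every rejected row."""
    kept, kept_sigs, audit = [], [], []
    for index, row in enumerate(rows):
        sig = minhash_signature(row)
        duplicate_of = next(
            (i for i, s in kept_sigs if estimated_jaccard(sig, s) >= threshold), None
        )
        if duplicate_of is None:
            kept.append(row)
            kept_sigs.append((index, sig))
        else:
            audit.append({"row": index, "duplicate_of": duplicate_of,
                          "reason": "near-duplicate"})
    return kept, audit
```

This version compares each row against every kept row, which is fine for a demonstration. At corpus scale, MinHash is usually paired with locality-sensitive hashing so that only candidate pairs are compared.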

Infrastructure

The stack your model runs on.

Four pieces of NVIDIA's 2026 inference path, running together in one vLLM build.

Grace Blackwell

128 GB unified memory across CPU and GPU. Tensor cores run 4-bit math natively, with no conversion cost.

NVFP4 weights

4-bit floating point. Weights occupy a quarter of the memory a 16-bit checkpoint would, and the tensor cores process them directly.
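The "quarter of the memory" claim is simple arithmetic, sketched below for the three tier sizes. This is a nominal estimate only: it counts weight storage at the stated bit width and ignores any per-block scaling metadata, the KV cache, and activations, all of which add to the real footprint. The function name is hypothetical.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Nominal weight storage in GB (10^9 bytes): parameters x bits / 8."""
    return params_billion * bits_per_weight / 8


UNIFIED_MEMORY_GB = 128  # Grace Blackwell figure quoted above

for size in (8, 32, 70):
    fp16 = weight_memory_gb(size, 16)
    fp4 = weight_memory_gb(size, 4)
    fits = "fits" if fp4 <= UNIFIED_MEMORY_GB else "does not fit"
    print(f"{size}B: 16-bit {fp16:.0f} GB, 4-bit {fp4:.0f} GB ({fits} in {UNIFIED_MEMORY_GB} GB)")
```

Under these assumptions a 70B checkpoint needs about 140 GB at 16 bits, which exceeds 128 GB, and about 35 GB at 4 bits, which leaves room for cache and activations.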

EAGLE-3 speculative decoding

NVIDIA and RedHatAI draft heads. A lightweight draft head proposes several tokens ahead, and the full model verifies them in one forward pass.
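The propose-then-verify loop can be sketched with toy models. This is generic greedy speculative decoding, not EAGLE-3 itself (which drafts from the target model's internal features), and both "models" here are hypothetical stand-in functions. In a real stack the verification scores all drafted positions in a single batched forward pass; the list comprehension below only simulates that.

```python
from typing import Callable, List

Token = int
NextFn = Callable[[List[Token]], Token]


def speculative_step(context: List[Token], draft_next: NextFn,
                     target_next: NextFn, k: int = 4) -> List[Token]:
    """Draft k tokens, verify them against the target, return the accepted tokens."""
    # 1. The cheap draft model proposes k tokens one after another.
    proposal: List[Token] = []
    for _ in range(k):
        proposal.append(draft_next(context + proposal))

    # 2. The target model gives its own choice at each of the k + 1 positions.
    #    (A real model computes these in one forward pass.)
    target_choices = [target_next(context + proposal[:i]) for i in range(k + 1)]

    # 3. Accept the longest prefix where draft and target agree, then append the
    #    target's own token at the first disagreement (or a bonus token if all match).
    accepted: List[Token] = []
    for i in range(k):
        if proposal[i] != target_choices[i]:
            break
        accepted.append(proposal[i])
    accepted.append(target_choices[len(accepted)])
    return accepted


def generate(context: List[Token], draft_next: NextFn, target_next: NextFn,
             n_tokens: int, k: int = 4) -> List[Token]:
    """Generate n_tokens with speculative steps; output matches target-only greedy decoding."""
    out = list(context)
    while len(out) - len(context) < n_tokens:
        out.extend(speculative_step(out, draft_next, target_next, k))
    return out[len(context):len(context) + n_tokens]


def toy_target(ctx: List[Token]) -> Token:
    """Stand-in for the full model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    """Stand-in for the draft head: agrees with the target except after a 5."""
    return 0 if ctx[-1] == 5 else (ctx[-1] + 1) % 10
```

The key property is that every emitted token is one the target model would have chosen anyway, so the draft affects speed, not output. The speedup comes from how often the draft is right.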

FlashInfer kernels

Attention and GEMM kernels tuned for Blackwell. They ship together with NVFP4 in the same vLLM build.

Real throughput depends on prompt length, batch size, and traffic shape. We measure it on your workload and include the number in your engagement proposal.

Measured benchmarks

Throughput you can put in an SLA.

Figures labeled "measured" come from our own Grace Blackwell hardware, the same machine your fine-tune trains and serves on; figures labeled "projected" are estimates. NVFP4 quantization paired with EAGLE-3 speculative decoding pushes every adapter far past its baseline.

Tokens per second · single user

Tier    NVFP4 + EAGLE-3 range    On our Grace Blackwell
8B      80–400 tok/s             98.9 tok/s (measured)
32B     30–200 tok/s             36.6 tok/s (measured)
70B     10–120 tok/s             ~15 tok/s (projected)

Throughput depends on the GPU and its clock configuration: the low end of each range is entry-class hardware (GB10-class Blackwell, or L40S), and the high end is B200-class. The right-hand column is single-user, short-prompt throughput on our own Grace Blackwell; the 8B and 32B figures are measured and the 70B figure is projected. NVFP4 weights paired with EAGLE-3 speculative decoding is the fastest generally-available dense-model inference stack as of April 2026. Your real number depends on your hardware, prompt length, batch size, and traffic shape. We measure it during the consultation and put it in the engagement proposal.
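To turn a tokens-per-second figure into something closer to an SLA number, the rough conversion is output length divided by decode rate. The sketch below uses the single-user figures from the table above; it is an approximation that covers decode time only and ignores prompt processing, queueing, and batching effects. The function and dictionary names are hypothetical.

```python
# Single-user decode rates from the table above, in tokens per second.
# The 70B value is the projected figure, not a measurement.
SINGLE_USER_TOK_PER_S = {"8B": 98.9, "32B": 36.6, "70B": 15.0}


def decode_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Approximate time to generate output_tokens at a steady decode rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return output_tokens / tokens_per_second


for tier, rate in SINGLE_USER_TOK_PER_S.items():
    print(f"{tier}: 500-token reply in about {decode_seconds(500, rate):.1f} s")
```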

Cost at scale
Significantly cheaper under constant high traffic.

Once a workload hits steady, high-volume traffic, self-hosted inference on owned hardware runs far below metered API pricing. Your break-even point depends on your traffic shape — quiet, bursty workloads stay closer to parity; sustained heavy traffic is where the gap widens.

  • Fixed hardware cost — volume can grow without linear cost growth.
  • No per-token metering, no rate-limit surge pricing.
  • Data never leaves your network — before the first request is billed.
Your actual break-even is computed on your traffic volume and quoted in the engagement proposal.
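The break-even logic is a one-line formula: fixed monthly cost divided by the per-token saving. The sketch below shows it with placeholder numbers. Every dollar figure is hypothetical and chosen only to make the arithmetic easy to follow; none of them is a quote or a measured cost.

```python
def break_even_tokens_per_month(fixed_monthly_cost: float,
                                api_price_per_million: float,
                                self_host_marginal_per_million: float) -> float:
    """Monthly token volume at which self-hosting and a metered API cost the same."""
    saving_per_million = api_price_per_million - self_host_marginal_per_million
    if saving_per_million <= 0:
        raise ValueError("self-hosting never breaks even at these prices")
    return fixed_monthly_cost / saving_per_million * 1_000_000


def monthly_cost_self_host(tokens: float, fixed_monthly_cost: float,
                           marginal_per_million: float) -> float:
    """Fixed cost plus marginal inference cost for a month of traffic."""
    return fixed_monthly_cost + tokens / 1_000_000 * marginal_per_million


def monthly_cost_api(tokens: float, price_per_million: float) -> float:
    """Metered cost for the same traffic."""
    return tokens / 1_000_000 * price_per_million


# Placeholder inputs (hypothetical): amortized hardware and power per month,
# metered API price, and self-hosted marginal cost, both per million tokens.
FIXED, API_PRICE, MARGINAL = 1900.0, 10.0, 0.5
volume = break_even_tokens_per_month(FIXED, API_PRICE, MARGINAL)
print(f"Break-even at about {volume / 1e6:.0f}M tokens per month")
```

Below the break-even volume the metered API is cheaper; above it, the fixed cost is spread over more tokens and the gap widens, which matches the traffic-shape point made above.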
The guarantee

Your model beats your current baseline, or your money back.

We agree on the evaluation set and the minimum delta before work begins. If the shipped adapter doesn't clear the bar on the agreed benchmark, the engagement fee is refunded in full. No hedging, no clawback period, and one condition, spelled out in step 4: you don't deploy the shipped weights.

1
Benchmark agreed upfront
You pick the eval set. We commit to it in writing before training begins.
2
Delta in the contract
Minimum lift over your current baseline is stated in the SOW — not reverse-engineered after delivery.
3
Refund within 14 days
If the shipped model misses the bar, the engagement fee is wired back within 14 days — no appeal, no prorated clawback.
4
Refund requires deletion
Refund is conditional on not deploying the shipped weights. Using them in production past the 14-day evaluation window counts as acceptance and voids the refund.
Engagement

Pick the silicon. Keep the weights.

All tiers ship with DPO safety alignment, a runtime safety gateway, and a red-team verification report. You own the adapter weights outright — no lock-in, no revenue share, no license recall.

Custom
You choose · Under 70B
Starting at
$1,000
One-time setup (training + deploy)
Refresh from $500 · hosting on request
  • Any base model under 70B — Llama, Qwen, Gemma, Mistral, or your choice
  • Any data, any use case, any format
  • Self-host or Pylox-hosted
  • Full pipeline — train, benchmark, safety, deploy
  • Perfect for experiments and exploratory fine-tunes
Book a consultation
Small
Llama 3.1 8B · 8B parameters
Starting at
$3,000
One-time setup (training + deploy)
Refresh from $1,500 · hosting on request
  • Self-host or Pylox-hosted
  • NVFP4 + EAGLE-3 inference stack
  • DPO safety alignment baked in
  • Runtime safety gateway included
  • Refresh on-demand
Book a consultation
MOST POPULAR
Medium
Qwen 3 32B · 32B parameters
Starting at
$7,000
One-time setup (training + deploy)
Refresh from $2,500 · hosting on request
  • Self-host or Pylox-hosted
  • NVFP4 + EAGLE-3 inference stack
  • DPO safety alignment included
  • Full domain benchmark harness + report
  • Red-team verification report
Book a consultation
Large
Llama 3.3 70B · 70B parameters
Starting at
$15,000
One-time setup (training + deploy)
Refresh from $5,000 · hosting on request
  • Self-host or Pylox-hosted
  • NVFP4 + EAGLE-3 inference stack
  • DPO safety alignment included
  • Extended red-team + S2 safety audit
  • Dedicated account engineer
Book a consultation

All tiers · data never leaves your hardware · adapters you own

Sovereign Edge

Your silicon, your server room, your data.

For clients who need true on-prem — we bring the hardware, install it in your server room, train your model, and walk out. All inference runs on your box, behind your firewall, forever.

DGX Spark installed on-site

Grace Blackwell GB10 with 128 GB unified memory. Runs up to 70B fine-tunes with NVFP4 + EAGLE-3 acceleration. Sits in your server room forever — not rented, not subscription-locked.

Air-gapped training handoff

Encrypted drive pickup from your site. Training on our Grace Blackwell — never touches the internet. Drive and fine-tune returned in person with chain-of-custody documentation and wipe certificate.

Nationwide coverage

South Florida (Miami-Dade, Broward, Palm Beach): no travel fee, one-hour on-site emergency response.
Anywhere else in the USA: installation included, travel billed at cost.

Law firms. Hospitals. Hedge funds. Family offices. Wealth managers. The "this can never touch OpenAI" crowd.

Book a Sovereign Edge consultation

Security

Defense in depth, documented per model.

Safety isn't a toggle — it's a stack. Every engagement ships with all three layers configured, tested, and reported in writing.

Layer 01
Weights

Training-time DPO alignment

Refusal behavior baked into the LoRA weights during fine-tune. The model is taught what to decline before it ever sees production traffic.

Layer 02
Gateway

Runtime safety gateway

Meta Prompt Guard 2 (GPU-pinned) plus Llama Guard 3 sit in front of every inference call. Prompt injection, jailbreaks, and category violations are blocked before your adapter is called.
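The control flow of a gateway like this can be sketched in a few lines: run each check in order and call the adapter only if all of them pass. The sketch below is illustrative only. The two check functions are keyword-matching stand-ins written for this example; they are not the Prompt Guard or Llama Guard APIs, and all names are hypothetical.

```python
from typing import Callable, List, Optional

# A check returns a reason string when it blocks, or None when the prompt passes.
Check = Callable[[str], Optional[str]]


def make_gateway(checks: List[Check], adapter: Callable[[str], str]) -> Callable[[str], dict]:
    """Wrap an adapter so every prompt passes all checks before the adapter runs."""
    def handle(prompt: str) -> dict:
        for check in checks:
            reason = check(prompt)
            if reason is not None:
                # Blocked: the adapter is never called for this prompt.
                return {"blocked": True, "reason": reason, "output": None}
        return {"blocked": False, "reason": None, "output": adapter(prompt)}
    return handle


def toy_injection_check(prompt: str) -> Optional[str]:
    """Stand-in for a prompt-injection classifier (hypothetical keyword rule)."""
    return "prompt-injection" if "ignore previous instructions" in prompt.lower() else None


def toy_category_check(prompt: str) -> Optional[str]:
    """Stand-in for a content-category classifier (hypothetical keyword rule)."""
    return "category-violation" if "build a weapon" in prompt.lower() else None


def toy_adapter(prompt: str) -> str:
    """Stand-in for the fine-tuned model."""
    return f"answer to: {prompt}"


gateway = make_gateway([toy_injection_check, toy_category_check], toy_adapter)
```

Returning the block reason alongside the decision is a design choice that makes the gateway's behavior easy to log and audit.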

Layer 03
QA

Red-team verification

A 50-prompt attack suite runs against every shipped model. We require a ≥70% block rate. The full report is handed to you with the adapter.
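The ship gate described above reduces to a threshold check over the suite results. Here is a minimal sketch, assuming each attack prompt yields a simple blocked-or-not outcome; the record layout and function names are hypothetical. The comparison uses integer arithmetic so that exactly 70% passes without floating-point edge cases.

```python
from dataclasses import dataclass
from typing import List

SUITE_SIZE = 50            # prompts in the attack suite, as stated above
MIN_BLOCK_PERCENT = 70     # required block rate, as stated above


@dataclass
class AttackResult:
    prompt_id: str
    blocked: bool


def blocked_count(results: List[AttackResult]) -> int:
    """Number of attack prompts the model or gateway blocked."""
    return sum(1 for r in results if r.blocked)


def may_ship(results: List[AttackResult]) -> bool:
    """True only if the full suite ran and the block rate meets the threshold."""
    if len(results) != SUITE_SIZE:
        raise ValueError(f"expected {SUITE_SIZE} results, got {len(results)}")
    return blocked_count(results) * 100 >= MIN_BLOCK_PERCENT * len(results)


def report(results: List[AttackResult]) -> dict:
    """Summary suitable for the written report handed over with the adapter."""
    return {
        "total": len(results),
        "blocked": blocked_count(results),
        "missed": [r.prompt_id for r in results if not r.blocked],
        "ship": may_ship(results),
    }
```

With 50 prompts, the threshold works out to at least 35 blocked attacks.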

Support

SLA-backed operators, not a support queue.

Every ticket routes through the team that built your model. No offshored call center. No LLM chatbot triage. No tier-one handoff.

T1 Starter
Included
with every engagement
  • Sev-1: Next business day
  • Sev-2: 3 business days
  • Channel: Email only
  • Coverage: 9am–6pm ET · Mon–Fri
T2 Business
On request
monthly add-on, disclosed in consultation
  • Sev-1: Within 4 business hours
  • Sev-2: 1 business day
  • Channel: Email + Slack Connect
  • Coverage: 8am–8pm ET · 7-day sev-1
T3 Enterprise
On request
custom tier, disclosed in consultation
  • Sev-1: 1 hour · 24 / 7 / 365
  • Sev-2: 4 hours
  • Channel: Slack + direct phone line
  • Coverage: On-call rotation · named architect
FAQ

Questions buyers always ask.

Direct line

Ready to forge your private model?

Send the dataset you want to fine-tune on, the compliance constraint you're trying to solve, or the compute budget you've already approved. You'll get a scoped response the same day.

Base
Miami, FL
Response
Same day
Intake
5-min scope