Mockup for reviewTech-stack demonstration. Not affiliated with Nebius and not the live Builders Network.About this build →

Token Factory

Inference API for open models.

OpenAI-compatible API to start fast. Dedicated GPUs, post-training, and workload optimization when you scale. Start in minutes; engineer for production when it matters.

Open Playground Quickstart ↗

Quickstarts

Build, customize, and deploy.

Official Token Factory playlist on YouTube

Learn about Nebius Token Factory from the makers — architecture, deep dives, and live demos curated by the team.Watch on YouTube ↗

Post-training guide

Fine-tune open models on Token Factory: data prep, training runs, evaluation, and deployment back to the inference API.Read on Token Factory docs ↗

Token Factory cookbook

Reference recipes for Token Factory — inference patterns, fine-tuning runs, evals, and production deployment recipes.View on GitHub ↗

Token Factory quickstart

OpenAI-compatible API to start fast. Set up an API key, hit the inference endpoint, and ship your first request in under five minutes.Read on Token Factory docs ↗

Workshop: Build an Agentic Slack Bot

Deploy a web-connected AI agent in under 15 minutes. Build a Slack Pricing Assistant that searches competitor pricing with Tavily, runs inference through Nebius Token Factory, and returns structured recommendations. With Colin Lowenberg (Nebius) + Lakshya Agarwal (Tavily).Watch on Nebius.com ↗

YouTube

Video walkthroughs.

Official Token Factory playlist on YouTube — architecture, deep dives, and live demos from the team — plus a curated Build-with-Token-Factory series.

Build with Token Factory playlist on YouTube

AI projects, tools, and success stories created with Token Factory — see what other builders are shipping.Watch on YouTube ↗

Videos & Workshops

Learn from live builds.

Exploring Nebius Token Factory | Open LLMs, AI Agents, Batch Inference & Fine-Tuning

A tour of Nebius Token Factory covering its OpenAI-compatible API, batch inference, fine-tuning, and how open models like Qwen Coder plug into agent frameworks and tools like Hugging Face and OpenRouter.Watch on YouTube ↗

How to Fine-Tune GPT-OSS 20B using Nebius Token Factory

Hands-on tutorial fine-tuning the GPT-OSS 20B model on Nebius Token Factory: environment setup, JSONL dataset upload, creating a fine-tuning job, monitoring training, and downloading artifacts.Watch on YouTube ↗

How to Fine-Tune Open Source LLMs with Nebius Token Factory | Full Tutorial

End-to-end walkthrough of fine-tuning an open-source LLM with LoRA on Nebius Token Factory and deploying it as a production-ready API endpoint, including dataset prep and job configuration.Watch on YouTube ↗

Docs & reference

Integrate with your stack.

API reference, framework integrations (LangChain, LlamaIndex), agent frameworks (Agno, CrewAI, Pydantic AI), and post-training guides.

Agno integration

Lightweight multi-modal agent framework with Token Factory as a first-class provider.Read on Token Factory docs ↗

aisuite integration

Use Token Factory as a model backend in aisuite, the model-agnostic Python SDK from Andrew Ng.Read on Token Factory docs ↗

Autoscaling and cache-aware routing

How Token Factory scales inference workloads and routes requests based on cache locality. Rate limits, burst behavior, and tuning guidance.Read on Token Factory docs ↗

Batch inference

Submit large request batches asynchronously for offline workloads. Cheaper per-token than real-time inference; ideal for evals and back-fills.Read on Token Factory docs ↗

CrewAI integration

Build multi-agent crews with Token Factory models. Coordination, role-playing, and task pipelines all wired through Nebius inference.Read on Token Factory docs ↗

Dedicated endpoints

Pin a Token Factory model to dedicated GPU capacity for predictable latency at scale. Includes autoscaling rules and routing patterns.Read on Token Factory docs ↗

Deploy custom models

Take a model you trained anywhere and deploy it behind a Token Factory endpoint. Custom weight loading, scaling, and monitoring.Read on Token Factory docs ↗

Function calling and tools with Token Factory

Official Token Factory guide on defining tools, letting models pick functions from context (or forcing a specific call), with Python and JavaScript examples for building tool-using agents.Read on Token Factory docs ↗

LangChain integration

Use Token Factory chat models, embeddings, and retrievers inside LangChain via the langchain-nebius package.Read on Token Factory docs ↗

LiteLLM integration

Route LiteLLM through Token Factory as an OpenAI-compatible provider. Drop-in for projects already using the LiteLLM proxy.Read on Token Factory docs ↗

LlamaIndex integration

Wire Token Factory in as the inference layer for LlamaIndex RAG pipelines.Read on Token Factory docs ↗

Pydantic AI integration

Type-safe agent framework with Pydantic validation. Token Factory backs the inference layer.Read on Token Factory docs ↗

Structured output and JSON mode with Token Factory

Official Token Factory docs showing how to force JSON responses via json_object mode or a strict JSON schema (e.g. a Pydantic BaseModel), with Python, cURL and JavaScript samples.Read on Token Factory docs ↗

Switch to Token Factory

Migrate from OpenAI / other inference providers to Token Factory. Drop-in compatibility plus the cost and rate-limit upgrades that come with it.Read on Token Factory docs ↗

Token Factory API reference

Full reference for the Token Factory inference API — chat completions, embeddings, fine-tuning, and batch endpoints.Read on Token Factory docs ↗

Token Factory playground

Interactive playground for trying open models without writing code. Live-edit prompts, swap models, and copy the request as curl/JS/Python.Read on Nebius ↗

In-depth technical resources

Production inference is more than serving a model.

Architecture breakdowns for routing, MoE latency, speculative decoding, chat app design, and more.

Building an AI-Powered Finance Planner with Full-Stack Next.js and Nebius

Step-by-step build of Money-Guard, a Next.js dashboard that analyzes spending and answers questions about transactions using Meta Llama 3.1 70B served by Nebius AI Studio.Read on Nebius ↗

Create Your Own AI-Powered Code Generator and Reviewer

Build a full-stack Next.js code assistant that generates snippets across languages and returns automated reviews, powered by DeepSeek Coder on Nebius Token Factory.Read on Nebius ↗

How to Run Meta Llama 3.1 405B with the Nebius AI Studio API

A hands-on how-to for calling Meta Llama 3.1 405B through the Nebius OpenAI-compatible API, with working Python, JavaScript, and cURL examples.Read on Nebius ↗

Routing in LLM inference is the difference between scaling and stalling

Why request routing is the single biggest lever in production LLM inference, and how Token Factory routes intelligently.Read on Nebius ↗

The invisible architecture behind great chat apps

What separates a usable chat product from a janky one — caching, routing, streaming, and the production patterns that hide the seams.Read on Nebius ↗

Why large MoE models break latency budgets — and what speculative decoding changes

Production analysis of how speculative decoding alters latency for large mixture-of-experts inference workloads.Read on Nebius ↗

Adding Nebius Token Factory to a Rust Agent Without a Custom Provider

Tutorial showing how to use Rig's OpenAI-compatible base_url override to wire Nebius Token Factory into a Rust LLM agent — no custom provider code needed.Read on Rup12 ↗

Build a Job-Finding Agent with Google ADK, Nebius AI, Mistral OCR & Linkup

A multi-agent pipeline that reads resume PDFs with Mistral OCR, searches live job boards via Linkup, and uses Qwen3-14B on Nebius AI Studio to generate and filter matches, orchestrated with Google ADK.Read on DEV ↗

Building a Multi-Agent RAG System with Couchbase, CrewAI, and Nebius AI Studio

Build a semantic search engine that pairs Couchbase as the vector store with CrewAI multi-agent RAG, using Nebius AI Studio for both the Llama LLM and the e5-mistral embeddings.Read on DEV ↗

Fine-Tune Your LLM in Minutes with Nebius

A practical guide to fine-tuning open-source LLMs on Nebius three ways: the no-code Web Console, the Python SDK, and raw cURL API requests, with .jsonl dataset prep.Read on DEV ↗

How I Built an Agentic RAG App to Brainstorm Conference Talk Ideas

Combine Tavily live web research, Couchbase vector search over past KubeCon talks, and Nebius AI Studio (e5-mistral embeddings + Qwen3) to synthesize unique conference talk abstracts.Read on DEV ↗

I Built a Team of 5 Agents Using Google ADK, Meta Llama and Nemotron-Ultra-253B

Build an AI Trend Analyzer with five sequential ADK agents (Exa, Tavily, Firecrawl + summary/analysis) running Meta Llama 3.1 and Nemotron-Ultra-253B served through Nebius AI Studio.Read on DEV ↗

I Used Agent Skills to Fine-Tune an Open-Source LLM on Nebius Token Factory

A teacher-student distillation walkthrough for an insurance-claims chatbot, using Token Factory Data Lab batch inference, LoRA fine-tuning, serverless adapter deployment, and a Gradio comparison app. Ships with a companion Jupyter notebook.Read on Medium ↗

Text-to-SQL: Creating Embeddings with Nebius AI Studio (Part 1)

Part 1 of a text-to-SQL RAG series: turn SQL schema into annotated markdown and generate vector embeddings with Nebius AI Studio (BAAI/bge-en-icl), stored in Postgres with pgvector.Read on DEV ↗

Text-to-SQL: Generating SQL with Nebius AI Studio (Part 2)

Part 2 of the text-to-SQL RAG series: use the embeddings from Part 1 to retrieve relevant schema and generate correct SQL queries with Nebius AI Studio models.Read on DEV ↗

Text-to-SQL: Querying Databases with Nebius AI Studio and Agents (Part 3)

Part 3 of the text-to-SQL RAG series: wrap the pipeline in an agent that queries a live database end to end, powered by Nebius AI Studio models.Read on DEV ↗

Use DeepSeek R1 & V3 with Bolt.DIY & Cursor in 3 Steps

Get free Nebius AI Studio API keys, route DeepSeek R1/V3 through OpenRouter, and plug the EU-hosted models into Bolt.DIY and Cursor for coding.Read on DEV ↗

Using Nebius AI Models with LangChain/LangGraph via LiteLLM

Wire Nebius-hosted Qwen models into LangChain/LangGraph through LiteLLM, then build a ReAct agent that talks to databases over the Model Context Protocol.Read on DEV ↗

GitHub

Cookbooks and examples.

Post-training cookbook examples

End-to-end fine-tuning recipes for the Token Factory post-training stack — SFT, DPO, evaluation, and deployment.View on GitHub ↗

Ready to make your first API call?

Mockup for reviewStack demo — not the live Builders Network.About this build →

Brand