Mockup for reviewTech-stack demonstration. Not affiliated with Nebius and not the live Builders Network.About this build →
Token Factory

Inference API for open models.

OpenAI-compatible API to start fast. Dedicated GPUs, post-training, and workload optimization when you scale. Start in minutes; engineer for production when it matters.

YouTube

Video walkthroughs.

Official Token Factory playlist on YouTube — architecture, deep dives, and live demos from the team — plus a curated Build-with-Token-Factory series.
Docs & reference

Integrate with your stack.

API reference, framework integrations (LangChain, LlamaIndex), agent frameworks (Agno, CrewAI, Pydantic AI), and post-training guides.
Docs

Agno integration

Lightweight multi-modal agent framework with Token Factory as a first-class provider.Read on Token Factory docs
Docs

aisuite integration

Use Token Factory as a model backend in aisuite, the model-agnostic Python SDK from Andrew Ng.Read on Token Factory docs
Docs

Autoscaling and cache-aware routing

How Token Factory scales inference workloads and routes requests based on cache locality. Rate limits, burst behavior, and tuning guidance.Read on Token Factory docs
Docs

Batch inference

Submit large request batches asynchronously for offline workloads. Cheaper per-token than real-time inference; ideal for evals and back-fills.Read on Token Factory docs
Docs

CrewAI integration

Build multi-agent crews with Token Factory models. Coordination, role-playing, and task pipelines all wired through Nebius inference.Read on Token Factory docs
Docs

Dedicated endpoints

Pin a Token Factory model to dedicated GPU capacity for predictable latency at scale. Includes autoscaling rules and routing patterns.Read on Token Factory docs
Docs

Deploy custom models

Take a model you trained anywhere and deploy it behind a Token Factory endpoint. Custom weight loading, scaling, and monitoring.Read on Token Factory docs
Docs

Function calling and tools with Token Factory

Official Token Factory guide on defining tools, letting models pick functions from context (or forcing a specific call), with Python and JavaScript examples for building tool-using agents.Read on Token Factory docs
Docs

LangChain integration

Use Token Factory chat models, embeddings, and retrievers inside LangChain via the langchain-nebius package.Read on Token Factory docs
Docs

LiteLLM integration

Route LiteLLM through Token Factory as an OpenAI-compatible provider. Drop-in for projects already using the LiteLLM proxy.Read on Token Factory docs
Docs

LlamaIndex integration

Wire Token Factory in as the inference layer for LlamaIndex RAG pipelines.Read on Token Factory docs
Docs

Pydantic AI integration

Type-safe agent framework with Pydantic validation. Token Factory backs the inference layer.Read on Token Factory docs
Docs

Structured output and JSON mode with Token Factory

Official Token Factory docs showing how to force JSON responses via json_object mode or a strict JSON schema (e.g. a Pydantic BaseModel), with Python, cURL and JavaScript samples.Read on Token Factory docs
Docs

Switch to Token Factory

Migrate from OpenAI / other inference providers to Token Factory. Drop-in compatibility plus the cost and rate-limit upgrades that come with it.Read on Token Factory docs
Docs

Token Factory API reference

Full reference for the Token Factory inference API — chat completions, embeddings, fine-tuning, and batch endpoints.Read on Token Factory docs
Docs

Token Factory playground

Interactive playground for trying open models without writing code. Live-edit prompts, swap models, and copy the request as curl/JS/Python.Read on Nebius
In-depth technical resources

Production inference is more than serving a model.

Architecture breakdowns for routing, MoE latency, speculative decoding, chat app design, and more.
Guide

Building an AI-Powered Finance Planner with Full-Stack Next.js and Nebius

Step-by-step build of Money-Guard, a Next.js dashboard that analyzes spending and answers questions about transactions using Meta Llama 3.1 70B served by Nebius AI Studio.Read on Nebius
Guide

Create Your Own AI-Powered Code Generator and Reviewer

Build a full-stack Next.js code assistant that generates snippets across languages and returns automated reviews, powered by DeepSeek Coder on Nebius Token Factory.Read on Nebius
Guide

How to Run Meta Llama 3.1 405B with the Nebius AI Studio API

A hands-on how-to for calling Meta Llama 3.1 405B through the Nebius OpenAI-compatible API, with working Python, JavaScript, and cURL examples.Read on Nebius
Guide

Routing in LLM inference is the difference between scaling and stalling

Why request routing is the single biggest lever in production LLM inference, and how Token Factory routes intelligently.Read on Nebius
Guide

The invisible architecture behind great chat apps

What separates a usable chat product from a janky one — caching, routing, streaming, and the production patterns that hide the seams.Read on Nebius
Guide

Why large MoE models break latency budgets — and what speculative decoding changes

Production analysis of how speculative decoding alters latency for large mixture-of-experts inference workloads.Read on Nebius
Guide

Adding Nebius Token Factory to a Rust Agent Without a Custom Provider

Tutorial showing how to use Rig's OpenAI-compatible base_url override to wire Nebius Token Factory into a Rust LLM agent — no custom provider code needed.Read on Rup12
Guide

Build a Job-Finding Agent with Google ADK, Nebius AI, Mistral OCR & Linkup

A multi-agent pipeline that reads resume PDFs with Mistral OCR, searches live job boards via Linkup, and uses Qwen3-14B on Nebius AI Studio to generate and filter matches, orchestrated with Google ADK.Read on DEV
Guide

Building a Multi-Agent RAG System with Couchbase, CrewAI, and Nebius AI Studio

Build a semantic search engine that pairs Couchbase as the vector store with CrewAI multi-agent RAG, using Nebius AI Studio for both the Llama LLM and the e5-mistral embeddings.Read on DEV
Guide

Fine-Tune Your LLM in Minutes with Nebius

A practical guide to fine-tuning open-source LLMs on Nebius three ways: the no-code Web Console, the Python SDK, and raw cURL API requests, with .jsonl dataset prep.Read on DEV
Guide

How I Built an Agentic RAG App to Brainstorm Conference Talk Ideas

Combine Tavily live web research, Couchbase vector search over past KubeCon talks, and Nebius AI Studio (e5-mistral embeddings + Qwen3) to synthesize unique conference talk abstracts.Read on DEV
Guide

I Built a Team of 5 Agents Using Google ADK, Meta Llama and Nemotron-Ultra-253B

Build an AI Trend Analyzer with five sequential ADK agents (Exa, Tavily, Firecrawl + summary/analysis) running Meta Llama 3.1 and Nemotron-Ultra-253B served through Nebius AI Studio.Read on DEV
Guide

I Used Agent Skills to Fine-Tune an Open-Source LLM on Nebius Token Factory

A teacher-student distillation walkthrough for an insurance-claims chatbot, using Token Factory Data Lab batch inference, LoRA fine-tuning, serverless adapter deployment, and a Gradio comparison app. Ships with a companion Jupyter notebook.Read on Medium
Guide

Text-to-SQL: Creating Embeddings with Nebius AI Studio (Part 1)

Part 1 of a text-to-SQL RAG series: turn SQL schema into annotated markdown and generate vector embeddings with Nebius AI Studio (BAAI/bge-en-icl), stored in Postgres with pgvector.Read on DEV
Guide

Text-to-SQL: Generating SQL with Nebius AI Studio (Part 2)

Part 2 of the text-to-SQL RAG series: use the embeddings from Part 1 to retrieve relevant schema and generate correct SQL queries with Nebius AI Studio models.Read on DEV
Guide

Text-to-SQL: Querying Databases with Nebius AI Studio and Agents (Part 3)

Part 3 of the text-to-SQL RAG series: wrap the pipeline in an agent that queries a live database end to end, powered by Nebius AI Studio models.Read on DEV
Guide

Use DeepSeek R1 & V3 with Bolt.DIY & Cursor in 3 Steps

Get free Nebius AI Studio API keys, route DeepSeek R1/V3 through OpenRouter, and plug the EU-hosted models into Bolt.DIY and Cursor for coding.Read on DEV
Guide

Using Nebius AI Models with LangChain/LangGraph via LiteLLM

Wire Nebius-hosted Qwen models into LangChain/LangGraph through LiteLLM, then build a ReAct agent that talks to databases over the Model Context Protocol.Read on DEV

Ready to make your first API call?

Quickstart ↗
Mockup for reviewStack demo — not the live Builders Network.About this build →
Brand