Enterprise-grade security including TLS encryption, IP/ASN filtering and CSRF protection; keeping data under your control aids leak prevention. We can also tailor robust security to your information-security policy and data-security requirements.

ChatStream® | Run Local LLMs in Production

Run local LLMs at commercial quality

You want to run the latest open-source or proprietary LLMs reliably in your own environment. ChatStream is a distributed inference platform designed precisely for serving local LLMs in production.

Common challenges of running local LLMs

Re-building your inference server every time a new open model ships
GPUs choking and responses breaking down as concurrency grows
Separate implementations per inference engine (vLLM, TensorRT-LLM, etc.)
Needing to keep confidential data on-prem / your own GPUs
Building chat UI, auth and security all by yourself

ChatStream solves these with a one-stop inference platform plus front-end UI, so you can launch a commercial LLM service quickly and low-code.

The ChatStream approach

Broad model compatibility via HF Transformers (QCT) — fast to follow the newest models, works on older GPUs
Use high-speed engines like vLLM and TensorRT-LLM depending on the workload
Scale out across multi-GPU / multi-node with proprietary GPU load balancing
Control concurrency with queueing so the system never breaks down under load

Particularly strong with Japanese — ChatStream is built by engineers in Japan, with careful handling of Japanese prompts and tokenization. It is well suited to serving Japanese-focused models such as Swallow, ELYZA and CALM.

Broad model support

Host the latest open LLMs as-is — Llama, Qwen, Gemma, Mistral and domestic Japanese models.

Multiple inference engines

Supports QCT, vLLM, DeepSpeed, TGI and TensorRT-LLM, optimized per use case.

GPU load balancing

Distributes across multi-GPU / multi-node with a proprietary algorithm for stable high-volume access.

Runs anywhere

Deployable on Qualiteg GPU Cloud / AWS / Azure / GCP / on-premises.

Turn great inference engines into a production-grade service.

High-performance inference engines such as vLLM and TensorRT-LLM are now freely available as open source. But running them in production — guaranteeing quality and scale, and keeping them up without interruption — is a different kind of challenge. ChatStream delivers that as an integrated, ready-to-use platform.

Building it from scratch in-house

Selecting, tuning and GPU-optimizing inference engines
Implementing GPU load balancing, queueing and redundancy yourself
Designing and continuously validating for scale under traffic spikes
Monitoring, logging, incident response and keeping up with model updates
Building chat UI, authentication and security from the ground up
Securing specialists and maintaining it long-term

Requires deep expertise and a continuous operations team.

With ChatStream

A platform that bundles multiple inference engines, already integrated
GPU load balancing, queueing and scale-out as standard
Goes to production with quality and scale already secured
Web chat UI and AI agent capabilities included
Enterprise operations including auth, security and monitoring
Stable operation and support plans for ongoing operations

Everything you need in one platform — supported end to end, from adoption to operations.

What you can build with ChatStream

Source code, design drawings, technical documents, customer data — keep the confidential data you can't send to external cloud APIs entirely within your own environment (on-prem / your own GPUs). From manufacturing to any field that can't let data leave the building, we build real systems on local LLMs.

Fully local AI coding agent

Run a coding agent with a Claude Code-like experience fully on local — even on-prem. Assist with code generation, review and refactoring without sending your source or repos outside.

Design & technical-doc support for manufacturing

Use AI on drawings, specs and technical know-how without sending them outside — design support, searching past cases, drafting technical documents, all while protecting confidentiality.

Internal knowledge search / RAG

Cross-search internal documents, policies and manuals and answer with citations — without uploading confidential files to the cloud.

Automatic document generation

With tool extensions, generate deliverables such as PPTX, Excel and reports fully on local. Even the working data stays inside.

Customer support

Build response assistance and first-line automation grounded in FAQs, histories and manuals — without sending customer data outside.

AI in regulated industries

For finance, healthcare, government and public sectors where data can't leave — run LLMs securely on-premises / your own GPUs.

Wondering what you can build with local LLMs? We propose concrete configurations for your use case.

Talk to us

Supported inference engines

ChatStream isn't tied to a single engine. Serve with the right engine — or a combination — depending on the model, GPU generation and speed requirements.

QCT / QCHT Qualiteg

Qualiteg Classic (HF) Transformer. The core engine, based on HuggingFace Transformers.

Broadest model compatibility; fast to follow the newest open models
Works even on environments without the latest GPUs (older GPUs)
Memory-efficient, high-throughput via a PagedAttention / classic hybrid

vLLM UC Berkeley

A high-throughput inference engine powered by PagedAttention.

Large speedups and concurrency over classic methods
Shines on recent NVIDIA GPUs / Linux
Ideal for high-volume, high-load serving

DeepSpeed Microsoft

Strong at distributed inference, including ZeRO-Inference for large models.

Splits huge models across multi-GPU / multi-node
Runs large models even under memory constraints

TGI Hugging Face

Text Generation Inference. An inference server that integrates well with the Transformers ecosystem.

Easy to adopt with good HuggingFace integration
Efficient for short-to-mid context serving

TensorRT-LLM NVIDIA

A top-tier, ultra-fast engine optimized for NVIDIA GPUs.

Faster than vLLM / DeepSpeed
Maximum performance on NVIDIA, assuming model compilation
For production where latency and throughput must be pushed to the limit

ChatStream leverages PagedAttention (dynamically allocating the KV cache in blocks to improve throughput under load) and tensor parallelism via Megatron-LM. It's a hybrid design that automatically switches inference algorithms — classic methods at low load, PagedAttention under high load.

Built for large-scale, distributed serving

ChatStream Pool distributes requests across multiple inference servers (ChatStream Server). With multi-GPU and multi-node, it handles large models and high-volume access reliably.

Users / business systems

ChatStream Front (Web Chat UI)

ChatStream Pool

Request routing & GPU load balancing

ChatStream Server

Qwen3.5-32B

RTX PRO 6000 Blackwell

ChatStream Server

Llama 4 Scout

RTX PRO 5000 Blackwell ×N

ChatStream Server

Swallow / custom LLM

Your GPUs / on-prem

API Helper

Commercial APIs

GPT / Claude / Gemini

Host multiple local LLMs simultaneously, and switch to or combine with commercial LLM APIs (OpenAI / Anthropic / Google) when needed. The front provides an OpenAI-compatible API, so existing LLM apps and LangChain can connect as-is.

Multi-LLM orchestration — explainer video

Reference configuration

Model	Qwen3.5-32B / Llama 4 Scout and other latest open LLMs
GPU configuration	NVIDIA RTX PRO 5000 Blackwell (48GB) × 4 = 192GB / RTX PRO 6000 Blackwell (96GB) for larger models
Inference engine	vLLM / TensorRT-LLM / QCT (chosen per use case)
Scale	Scale out throughput by adding nodes. We propose sizing for your requirements.

* Configurations are examples. The optimal setup depends on model, quantization, token length and GPU generation.

From GPU configuration and model selection to sizing — talk to us.

Talk to us

Representative local LLMs supported

Broad support for the representative open / local LLMs (open-weight models) featured in Qualiteg's monthly Japanese LLM Ranking. Thanks to the HF Transformers foundation (QCT), new models can be followed quickly.

Open LLMs

Model	Provider	Parameters	Type / notes
`Qwen3.5 / Qwen3`	Alibaba	4B–397B (incl. MoE, A17B etc.)	Top-tier open class, many sizes
`DeepSeek-V3.2`	DeepSeek	671B (MoE, 37B active)	MoE / top-tier general
`DeepSeek-R1`	DeepSeek	671B (MoE, 37B active)	Reasoning-focused
`GLM-5 / GLM-4.6`	Zhipu AI	Large (MoE)	MoE / coding & agents
`Kimi-K2.5`	Moonshot AI	~1T (MoE, 32B active)	MoE / ultra-large
`MiniMax-M2.1`	MiniMax	230B (MoE, 10B active)	MoE / agents & coding
`ERNIE 4.5`	Baidu	21B (MoE, 3B active) etc.	MoE / reasoning
`K-EXAONE`	LG AI Research	236B (MoE, 23B active)	MoE / from Korea
`QwQ-32B`	Alibaba	32B	Reasoning-focused
`gpt-oss`	OpenAI	20B / 120B	MoE / open-weight
`Gemma 4 / Gemma 3`	Google	1B–27B class	Dense / latest lightweight–mid
`Ministral-3`	Mistral AI	3B–14B	Dense / lightweight, reasoning
`Llama 3.x / 4`	Meta	3B+ (Dense / MoE)	Multilingual / general
`Nemotron-Nano 9B JP`	NVIDIA	9B	Dense / Japanese-optimized

The latest model evaluations are published in our Japanese LLM Ranking.
“Not sure which model fits your business?” “Want a sense of the budget?” — just get in touch. We propose the optimal model selection for your use case.

Japanese / domestic models

Model	Provider	Parameters	Type / notes
`Swallow`	Swallow Team (Institute of Science Tokyo / Tokyo Tech LLM)	8B–120B	Japanese-enhanced gpt-oss / Qwen3 / Gemma-2-Llama (latest RL)
`rinna qwq-bakeneko-32B`	rinna	32B	QwQ-based reasoning, Japanese-tuned
`ABEJA-Qwen2.5-32B`	ABEJA	32B	Qwen2.5-based, Japanese-enhanced

* Models shown are representative examples from the Japanese LLM Ranking. Parameter sizes are approximate, based on representative published values, and vary by variant. Transformers-based open LLMs on HuggingFace and your own fine-tuned models can also be hosted. For specific models, please contact us.

Key features

Beyond the performance and scale demanded by commercial services — ready-to-use completeness, and quality and operations you can trust.

PERFORMANCE & SCALE

Fast, memory-efficient inference

PagedAttention allocates the KV cache dynamically to maximize concurrent requests per GPU
A classic/PagedAttention hybrid balances low latency at low load and throughput at high load
Handles high-volume access while keeping inference cost down

Concurrency control via queueing

Queues many concurrent requests and caps simultaneous generation for stability
Cooperative, token-level interleaved generation even on a single GPU
Non-blocking processing with async I/O

Scale-out distributed serving

Distributes across multi-GPU / multi-node with proprietary GPU load balancing
Supports tensor parallelism (PagedAttention / Megatron-LM) and data parallelism
Clustered serving of large models

READY TO USE & BUILD

Web Chat UI included

A customizable chat UI (web / mobile) is bundled by default
Start using it right after deployment — no separate front-end needed
Customize screens to match your brand and workflows

Full-fledged AI agent capabilities

Build business agents connected to internal data and tools
Add new abilities via tool extensions (custom tools / function calling)
Fully local generation of PPTX, Excel and more — produce deliverables without sending data outside

Low-code / no-code build

Inference platform and UI together, one-stop, for a fast launch
Build full LLM apps and APIs with simple customization
Easy to embed into existing systems via API

TRUST & OPERATIONS

Quality and scale, assured

Goes live after validating production-level load and scale
Designed to maintain quality during sudden concurrency spikes
We propose performance/capacity sizing for your requirements

Stable operation & support plans

Delivered on the premise of stable operation including monitoring
Support plans that keep helping after go-live
Help keeping up with model updates and version upgrades

Enterprise-grade security

ASN filtering, IP filtering, TLS and CSRF protection
On-prem / your-own-GPU operation that keeps data under your control
A secure setup that never sends confidential data outside

PROVEN IN OUR OWN PRODUCT

Bestllam®, our AI integration platform unifying 20+ LLMs, is also built on ChatStream

Bestllam®, an enterprise platform that lets you securely use 20+ of the latest LLMs under a single contract, is designed and built on the ChatStream platform as our own SaaS. We run ChatStream ourselves as a production commercial service, continuously refining its quality and stability. On top of ChatStream, Bestllam extends into an AI integration platform with commercial-API usage, leak prevention via LLM-Audit, and fault tolerance / BCP through a cloud database.

Visit Bestllam

“We want to use AI without sending data outside.” Local LLMs are the answer. Get in touch.

Talk to us

Secure AI transformation, supported from the upstream

We don't just provide the ChatStream platform. Qualiteg's consulting works alongside you from planning and PoC through business design and in-house enablement, supporting AI-driven transformation centered on secure local-LLM use.

Strategy & planning

Which tasks, which models, how much to bring in-house. We map the big picture and cost-effectiveness of AI adoption with you, from the upstream.

Building a secure AI platform

Built on local-LLM operation that keeps confidential data inside, we construct a secure inference platform with ChatStream and guardrails.

Transformation & enablement

We don't stop at deployment. We support operational adoption, internal rollout and people development, turning AI into lasting competitiveness.

Learn about LLM-infra & AI technology consulting

Frequently Asked Questions (FAQ)

About adopting local LLMs (on-premise / private LLMs), using generative AI without sending data outside, and supported models, GPUs and security.

What are local LLMs? How do they differ from cloud LLMs?

An LLM (large language model) is an AI that understands and answers text, like ChatGPT. Normally such AI sends your text over the internet to an external company's servers for processing. A "local LLM" instead runs that AI inside your own environment (your servers/GPUs, on-premises or private cloud). Because your questions and documents never leave your premises, you can safely use generative AI even with confidential or personal data. Think of it as using AI only inside your own locked room, rather than a rented external space.

What is ChatStream?

ChatStream is an LLM inference platform that serves local LLMs (open models or your own models) at commercial quality. With GPU load balancing, a Web Chat UI and APIs, you can quickly build internal generative-AI tools or commercial LLM services.

What is the difference between Bestllam and ChatStream?

Bestllam® is our AI integration platform designed and built on the ChatStream platform. ChatStream is an inference platform fully equipped to run local LLMs self-contained on-premises / your own GPUs — ideal when data must not leave your organization. On top of that, Bestllam also uses external commercial LLM APIs (GPT / Claude / Gemini, etc.), adds leak prevention via LLM-Audit (prompt-injection detection, input/output auditing), and raises fault tolerance / BCP with a cloud database. Use ChatStream for fully on-prem; use Bestllam when you also want to leverage commercial APIs in an integrated way.

Can I run LLMs on-premises or on my own GPUs?

Yes. It runs on-premises, on your own GPU servers, and in private clouds, so you can operate local LLMs while keeping data under your control.

Can I use AI without sending confidential data outside?

Yes. ChatStream keeps inference entirely within your environment, so you can use AI without sending customer data, technical documents or source code to external cloud APIs — reducing the risk of data leakage.

Can you also procure GPU machines?

Yes. We can arrange GPU servers, including our Integrity series. From selecting to procuring GPU machines suited to LLM inference, you can consult us all at once. See our LLM-infra technology consulting for details.

Can it handle high volume and concurrent requests?

Yes. With proprietary GPU load balancing, queueing and high-throughput inference via PagedAttention, it handles large concurrent access reliably, and scales out across multi-GPU / multi-node.

Which local LLMs (open models) are supported?

Broad support for major open LLMs published on HuggingFace — Llama, Qwen, DeepSeek, Gemma, Mistral, gpt-oss, Swallow, ELYZA, CALM, PLaMo and more — with fast follow-up to the latest models.

Can you host our own fine-tuned LLM?

Yes. Transformers-based custom and fine-tuned models can be hosted, so you can securely operate models trained on your own data inside your organization.

Which inference engines are supported?

QCT (Qualiteg Classic Transformer), vLLM, DeepSpeed, TGI and TensorRT-LLM, used selectively depending on the model, GPU generation and speed requirements.

What GPUs are needed? Can you advise on sizing?

Yes. We recommend suitable open models for what you want to do, and propose the best configuration including GPU selection, quantization and parallelization. We build with GPUs such as NVIDIA RTX PRO Blackwell (5000 / 6000) and A6000, computing the required GPU configuration from concurrency and model size. We support you with broad GPU expertise — see our LLM-infra technology consulting.

Is a Web Chat UI included?

Yes. A customizable Web Chat UI (PC / mobile) is bundled, so you can start using it right after deployment without building a separate front-end.

Is there an OpenAI-compatible API? Can I use it from existing systems or LangChain?

Yes. An OpenAI-compatible API is provided, so existing LLM apps and LangChain can connect as-is. You can also switch between or combine commercial APIs (OpenAI / Anthropic / Google).

Can you build RAG (internal document search)?

Yes. You can build RAG (retrieval-augmented generation) over internal documents, manuals and policies without sending data outside, for grounded internal Q&A.

Can you do AI agents and tool integration? PPTX / Excel generation too?

Yes. With tool extensions, you can build business AI agents that generate deliverables such as PPTX and Excel fully on local, so even the working data stays inside.

We want to use AI on drawings/technical docs in manufacturing without sending them outside. Is that possible?

Yes. You can use AI for design support, searching past cases and drafting technical documents without sending drawings, specs or technical documents outside — balancing manufacturing confidentiality with generative-AI use. We also have cases for IT and manufacturing customers and per-design-tool use cases, so please contact us.

Can it be used in regulated industries like finance, healthcare, government and legal?

Yes. Even in fields where data cannot leave — finance, healthcare, government, public sector, legal (lawyers / law firms) — you can run local LLMs securely on-premises / your own GPUs.

Can it run on the cloud (AWS / Azure / GCP)?

Yes. It can be built and operated on Qualiteg GPU Cloud, AWS, Azure, GCP, or on-premises.

What about security?

It includes enterprise-grade security such as TLS encryption, IP / ASN filtering and CSRF protection, and keeping data under your control makes it effective for leak prevention. Beyond these, we can achieve robust security tailored to your information-security policy and data-security requirements — please contact us for details.

Are there support plans?

Yes. We offer support plans that back stable operation after go-live, including help keeping up with model updates and version upgrades.

Do you have a commercial track record?

Yes. Bestllam®, our own SaaS integrating 20+ LLMs, is also built and operated on ChatStream, with a track record of stable operation in commercial service. For other deployment cases, please contact us.

Tell me about pricing, licensing and quotes.

We propose based on your deployment model (on-prem / cloud), models and scale. For quotes and licensing details, please feel free to contact us.

Related resources

Technical articles on LLM inference infrastructure, GPU configuration and local-LLM operations on the Qualiteg Blog.

Qualiteg Blog

KVキャッシュのオフロード戦略とGQAの実践的理解

Strategies for offloading the KV cache from GPU VRAM to CPU RAM/disk, and KV-cache reduction with GQA.

Qualiteg Blog

LLM学習の現実：GPU選びから学習コストまで徹底解説

GPU counts and costs for LLM training, with LLaMA 2 and DeepSeek-V3 examples.

Qualiteg Blog

ONNX RuntimeのcuDNN警告と対策

Cause and fix for the "libcudnn.so.9" error during GPU inference with ONNX Runtime.

Qualiteg Blog

LLM推論基盤プロビジョニング講座第5回：GPUノード構成から負荷試験

GPU node sizing, load testing, trade-offs and real server configurations.

Qualiteg Blog

LLM推論基盤プロビジョニング講座第4回：推論エンジンの選定

Characteristics and selection points of inference engines such as vLLM and TGI.

Qualiteg Blog

LLM推論基盤プロビジョニング講座第3回：推論時消費メモリ見積もり

The two main drivers of GPU memory use: model footprint and KV cache.

Qualiteg Blog

LLM推論基盤プロビジョニング講座第2回：リクエスト数を見積もる

How to estimate expected request volume to compute GPU node counts.

Qualiteg Blog

LLM推論基盤プロビジョニング講座第1回：基本概念と推論速度

Foundational concepts and how to think about inference speed.

Qualiteg Blog

GPUサーバーの最適容量計算：キューイング理論と実践的モデル

Computing max supported users for a GPU server using queueing theory.

Qualiteg Blog

2025年 NVIDIA GPU 一発検索ツール

Filter NVIDIA GPUs by generation and spec — Blackwell, Hopper, Ada Lovelace and more.

Qualiteg Blog

【ChatStream】大容量のLLMの推論に必要なGPUサーバー構成

GPU server/cluster configuration for large-LLM inference, with Llama3-70B as an example.

Qualiteg Blog

推論速度を向上させる Speculative Decoding とは

A speedup technique that uses a small model to look ahead and reduce the big model's load.

Considering ChatStream?

We propose solutions for commercial local-LLM serving and inference-platform builds, tailored to your requirements. Feel free to consult us about on-prem or your-own-GPU operation as well.

Get in touch

Delivery: Qualiteg GPU Cloud / AWS / Azure / GCP / on-premises

ChatStream®

Run local LLMs at commercial quality

Common challenges of running local LLMs

The ChatStream approach

Broad model support

Multiple inference engines

GPU load balancing

Runs anywhere

Turn great inference engines into a production-grade service.

Building it from scratch in-house

With ChatStream

What you can build with ChatStream

Fully local AI coding agent

Design & technical-doc support for manufacturing

Internal knowledge search / RAG

Automatic document generation

Customer support

AI in regulated industries

Supported inference engines

QCT / QCHT Qualiteg

vLLM UC Berkeley

DeepSpeed Microsoft

TGI Hugging Face

TensorRT-LLM NVIDIA

Built for large-scale, distributed serving

Reference configuration

Representative local LLMs supported

Open LLMs

Japanese / domestic models

Key features

PERFORMANCE & SCALE

Fast, memory-efficient inference

Concurrency control via queueing

Scale-out distributed serving

READY TO USE & BUILD

Web Chat UI included

Full-fledged AI agent capabilities

Low-code / no-code build

TRUST & OPERATIONS

Quality and scale, assured

Stable operation & support plans

Enterprise-grade security

Bestllam®, our AI integration platform unifying 20+ LLMs, is also built on ChatStream

Secure AI transformation, supported from the upstream

Strategy & planning

Building a secure AI platform

Transformation & enablement

Frequently Asked Questions (FAQ)

Related resources

KVキャッシュのオフロード戦略とGQAの実践的理解

LLM学習の現実：GPU選びから学習コストまで徹底解説

ONNX RuntimeのcuDNN警告と対策

LLM推論基盤プロビジョニング講座 第5回：GPUノード構成から負荷試験

LLM推論基盤プロビジョニング講座 第4回：推論エンジンの選定

LLM推論基盤プロビジョニング講座 第3回：推論時消費メモリ見積もり

LLM推論基盤プロビジョニング講座 第2回：リクエスト数を見積もる

LLM推論基盤プロビジョニング講座 第1回：基本概念と推論速度

GPUサーバーの最適容量計算：キューイング理論と実践的モデル

2025年 NVIDIA GPU 一発検索ツール

【ChatStream】大容量のLLMの推論に必要なGPUサーバー構成

推論速度を向上させる Speculative Decoding とは

Considering ChatStream?

Get in touch

LLM推論基盤プロビジョニング講座第5回：GPUノード構成から負荷試験

LLM推論基盤プロビジョニング講座第4回：推論エンジンの選定

LLM推論基盤プロビジョニング講座第3回：推論時消費メモリ見積もり

LLM推論基盤プロビジョニング講座第2回：リクエスト数を見積もる

LLM推論基盤プロビジョニング講座第1回：基本概念と推論速度