Run local LLMs at commercial quality
You want to run the latest open-source or proprietary LLMs reliably in your own environment. ChatStream is a distributed inference platform designed precisely for serving local LLMs in production.
Common challenges of running local LLMs
- Re-building your inference server every time a new open model ships
- GPUs choking and responses breaking down as concurrency grows
- Separate implementations per inference engine (vLLM, TensorRT-LLM, etc.)
- Needing to keep confidential data on-prem / your own GPUs
- Building chat UI, auth and security all by yourself
ChatStream solves these with a one-stop inference platform plus front-end UI, so you can launch a commercial LLM service quickly and low-code.
The ChatStream approach
- Broad model compatibility via HF Transformers (QCT) — fast to follow the newest models, works on older GPUs
- Use high-speed engines like vLLM and TensorRT-LLM depending on the workload
- Scale out across multi-GPU / multi-node with proprietary GPU load balancing
- Control concurrency with queueing so the system never breaks down under load
Particularly strong with Japanese — ChatStream is built by engineers in Japan, with careful handling of Japanese prompts and tokenization. It is well suited to serving Japanese-focused models such as Swallow, ELYZA and CALM.
Broad model support
Host the latest open LLMs as-is — Llama, Qwen, Gemma, Mistral and domestic Japanese models.
Multiple inference engines
Supports QCT, vLLM, DeepSpeed, TGI and TensorRT-LLM, optimized per use case.
GPU load balancing
Distributes across multi-GPU / multi-node with a proprietary algorithm for stable high-volume access.
Runs anywhere
Deployable on Qualiteg GPU Cloud / AWS / Azure / GCP / on-premises.
Turn great inference engines into a production-grade service.
High-performance inference engines such as vLLM and TensorRT-LLM are now freely available as open source. But running them in production — guaranteeing quality and scale, and keeping them up without interruption — is a different kind of challenge. ChatStream delivers that as an integrated, ready-to-use platform.
Building it from scratch in-house
- Selecting, tuning and GPU-optimizing inference engines
- Implementing GPU load balancing, queueing and redundancy yourself
- Designing and continuously validating for scale under traffic spikes
- Monitoring, logging, incident response and keeping up with model updates
- Building chat UI, authentication and security from the ground up
- Securing specialists and maintaining it long-term
Requires deep expertise and a continuous operations team.
With ChatStream
- A platform that bundles multiple inference engines, already integrated
- GPU load balancing, queueing and scale-out as standard
- Goes to production with quality and scale already secured
- Web chat UI and AI agent capabilities included
- Enterprise operations including auth, security and monitoring
- Stable operation and support plans for ongoing operations
Everything you need in one platform — supported end to end, from adoption to operations.
What you can build with ChatStream
Source code, design drawings, technical documents, customer data — keep the confidential data you can't send to external cloud APIs entirely within your own environment (on-prem / your own GPUs). From manufacturing to any field that can't let data leave the building, we build real systems on local LLMs.
Fully local AI coding agent
Run a coding agent with a Claude Code-like experience fully on local — even on-prem. Assist with code generation, review and refactoring without sending your source or repos outside.
Design & technical-doc support for manufacturing
Use AI on drawings, specs and technical know-how without sending them outside — design support, searching past cases, drafting technical documents, all while protecting confidentiality.
Internal knowledge search / RAG
Cross-search internal documents, policies and manuals and answer with citations — without uploading confidential files to the cloud.
Automatic document generation
With tool extensions, generate deliverables such as PPTX, Excel and reports fully on local. Even the working data stays inside.
Customer support
Build response assistance and first-line automation grounded in FAQs, histories and manuals — without sending customer data outside.
AI in regulated industries
For finance, healthcare, government and public sectors where data can't leave — run LLMs securely on-premises / your own GPUs.
Wondering what you can build with local LLMs? We propose concrete configurations for your use case.
Talk to usSupported inference engines
ChatStream isn't tied to a single engine. Serve with the right engine — or a combination — depending on the model, GPU generation and speed requirements.
QCT / QCHT Qualiteg
Qualiteg Classic (HF) Transformer. The core engine, based on HuggingFace Transformers.
- Broadest model compatibility; fast to follow the newest open models
- Works even on environments without the latest GPUs (older GPUs)
- Memory-efficient, high-throughput via a PagedAttention / classic hybrid
vLLM UC Berkeley
A high-throughput inference engine powered by PagedAttention.
- Large speedups and concurrency over classic methods
- Shines on recent NVIDIA GPUs / Linux
- Ideal for high-volume, high-load serving
DeepSpeed Microsoft
Strong at distributed inference, including ZeRO-Inference for large models.
- Splits huge models across multi-GPU / multi-node
- Runs large models even under memory constraints
TGI Hugging Face
Text Generation Inference. An inference server that integrates well with the Transformers ecosystem.
- Easy to adopt with good HuggingFace integration
- Efficient for short-to-mid context serving
TensorRT-LLM NVIDIA
A top-tier, ultra-fast engine optimized for NVIDIA GPUs.
- Faster than vLLM / DeepSpeed
- Maximum performance on NVIDIA, assuming model compilation
- For production where latency and throughput must be pushed to the limit
ChatStream leverages PagedAttention (dynamically allocating the KV cache in blocks to improve throughput under load) and tensor parallelism via Megatron-LM. It's a hybrid design that automatically switches inference algorithms — classic methods at low load, PagedAttention under high load.
Built for large-scale, distributed serving
ChatStream Pool distributes requests across multiple inference servers (ChatStream Server). With multi-GPU and multi-node, it handles large models and high-volume access reliably.
Multi-LLM orchestration — explainer video
Reference configuration
| Model | Qwen3.5-32B / Llama 4 Scout and other latest open LLMs |
| GPU configuration | NVIDIA RTX PRO 5000 Blackwell (48GB) × 4 = 192GB / RTX PRO 6000 Blackwell (96GB) for larger models |
| Inference engine | vLLM / TensorRT-LLM / QCT (chosen per use case) |
| Scale | Scale out throughput by adding nodes. We propose sizing for your requirements. |
* Configurations are examples. The optimal setup depends on model, quantization, token length and GPU generation.
From GPU configuration and model selection to sizing — talk to us.
Talk to usRepresentative local LLMs supported
Broad support for the representative open / local LLMs (open-weight models) featured in Qualiteg's monthly Japanese LLM Ranking. Thanks to the HF Transformers foundation (QCT), new models can be followed quickly.
Open LLMs
| Model | Provider | Parameters | Type / notes |
|---|---|---|---|
Qwen3.5 / Qwen3 | Alibaba | 4B–397B (incl. MoE, A17B etc.) | Top-tier open class, many sizes |
DeepSeek-V3.2 | DeepSeek | 671B (MoE, 37B active) | MoE / top-tier general |
DeepSeek-R1 | DeepSeek | 671B (MoE, 37B active) | Reasoning-focused |
GLM-5 / GLM-4.6 | Zhipu AI | Large (MoE) | MoE / coding & agents |
Kimi-K2.5 | Moonshot AI | ~1T (MoE, 32B active) | MoE / ultra-large |
MiniMax-M2.1 | MiniMax | 230B (MoE, 10B active) | MoE / agents & coding |
ERNIE 4.5 | Baidu | 21B (MoE, 3B active) etc. | MoE / reasoning |
K-EXAONE | LG AI Research | 236B (MoE, 23B active) | MoE / from Korea |
QwQ-32B | Alibaba | 32B | Reasoning-focused |
gpt-oss | OpenAI | 20B / 120B | MoE / open-weight |
Gemma 4 / Gemma 3 | 1B–27B class | Dense / latest lightweight–mid | |
Ministral-3 | Mistral AI | 3B–14B | Dense / lightweight, reasoning |
Llama 3.x / 4 | Meta | 3B+ (Dense / MoE) | Multilingual / general |
Nemotron-Nano 9B JP | NVIDIA | 9B | Dense / Japanese-optimized |
The latest model evaluations are published in our Japanese LLM Ranking.
“Not sure which model fits your business?” “Want a sense of the budget?” — just get in touch. We propose the optimal model selection for your use case.
Japanese / domestic models
| Model | Provider | Parameters | Type / notes |
|---|---|---|---|
Swallow | Swallow Team (Institute of Science Tokyo / Tokyo Tech LLM) | 8B–120B | Japanese-enhanced gpt-oss / Qwen3 / Gemma-2-Llama (latest RL) |
rinna qwq-bakeneko-32B | rinna | 32B | QwQ-based reasoning, Japanese-tuned |
ABEJA-Qwen2.5-32B | ABEJA | 32B | Qwen2.5-based, Japanese-enhanced |
* Models shown are representative examples from the Japanese LLM Ranking. Parameter sizes are approximate, based on representative published values, and vary by variant. Transformers-based open LLMs on HuggingFace and your own fine-tuned models can also be hosted. For specific models, please contact us.
Key features
Beyond the performance and scale demanded by commercial services — ready-to-use completeness, and quality and operations you can trust.
PERFORMANCE & SCALE
Fast, memory-efficient inference
- PagedAttention allocates the KV cache dynamically to maximize concurrent requests per GPU
- A classic/PagedAttention hybrid balances low latency at low load and throughput at high load
- Handles high-volume access while keeping inference cost down
Concurrency control via queueing
- Queues many concurrent requests and caps simultaneous generation for stability
- Cooperative, token-level interleaved generation even on a single GPU
- Non-blocking processing with async I/O
Scale-out distributed serving
- Distributes across multi-GPU / multi-node with proprietary GPU load balancing
- Supports tensor parallelism (PagedAttention / Megatron-LM) and data parallelism
- Clustered serving of large models
READY TO USE & BUILD
Web Chat UI included
- A customizable chat UI (web / mobile) is bundled by default
- Start using it right after deployment — no separate front-end needed
- Customize screens to match your brand and workflows
Full-fledged AI agent capabilities
- Build business agents connected to internal data and tools
- Add new abilities via tool extensions (custom tools / function calling)
- Fully local generation of PPTX, Excel and more — produce deliverables without sending data outside
Low-code / no-code build
- Inference platform and UI together, one-stop, for a fast launch
- Build full LLM apps and APIs with simple customization
- Easy to embed into existing systems via API
TRUST & OPERATIONS
Quality and scale, assured
- Goes live after validating production-level load and scale
- Designed to maintain quality during sudden concurrency spikes
- We propose performance/capacity sizing for your requirements
Stable operation & support plans
- Delivered on the premise of stable operation including monitoring
- Support plans that keep helping after go-live
- Help keeping up with model updates and version upgrades
Enterprise-grade security
- ASN filtering, IP filtering, TLS and CSRF protection
- On-prem / your-own-GPU operation that keeps data under your control
- A secure setup that never sends confidential data outside
“We want to use AI without sending data outside.” Local LLMs are the answer. Get in touch.
Talk to usSecure AI transformation, supported from the upstream
We don't just provide the ChatStream platform. Qualiteg's consulting works alongside you from planning and PoC through business design and in-house enablement, supporting AI-driven transformation centered on secure local-LLM use.
Strategy & planning
Which tasks, which models, how much to bring in-house. We map the big picture and cost-effectiveness of AI adoption with you, from the upstream.
Building a secure AI platform
Built on local-LLM operation that keeps confidential data inside, we construct a secure inference platform with ChatStream and guardrails.
Transformation & enablement
We don't stop at deployment. We support operational adoption, internal rollout and people development, turning AI into lasting competitiveness.
Frequently Asked Questions (FAQ)
About adopting local LLMs (on-premise / private LLMs), using generative AI without sending data outside, and supported models, GPUs and security.
What are local LLMs? How do they differ from cloud LLMs?
An LLM (large language model) is an AI that understands and answers text, like ChatGPT. Normally such AI sends your text over the internet to an external company's servers for processing. A "local LLM" instead runs that AI inside your own environment (your servers/GPUs, on-premises or private cloud). Because your questions and documents never leave your premises, you can safely use generative AI even with confidential or personal data. Think of it as using AI only inside your own locked room, rather than a rented external space.
What is ChatStream?
ChatStream is an LLM inference platform that serves local LLMs (open models or your own models) at commercial quality. With GPU load balancing, a Web Chat UI and APIs, you can quickly build internal generative-AI tools or commercial LLM services.
What is the difference between Bestllam and ChatStream?
Bestllam® is our AI integration platform designed and built on the ChatStream platform. ChatStream is an inference platform fully equipped to run local LLMs self-contained on-premises / your own GPUs — ideal when data must not leave your organization. On top of that, Bestllam also uses external commercial LLM APIs (GPT / Claude / Gemini, etc.), adds leak prevention via LLM-Audit (prompt-injection detection, input/output auditing), and raises fault tolerance / BCP with a cloud database. Use ChatStream for fully on-prem; use Bestllam when you also want to leverage commercial APIs in an integrated way.
Can I run LLMs on-premises or on my own GPUs?
Yes. It runs on-premises, on your own GPU servers, and in private clouds, so you can operate local LLMs while keeping data under your control.
Can I use AI without sending confidential data outside?
Yes. ChatStream keeps inference entirely within your environment, so you can use AI without sending customer data, technical documents or source code to external cloud APIs — reducing the risk of data leakage.
Can you also procure GPU machines?
Yes. We can arrange GPU servers, including our Integrity series. From selecting to procuring GPU machines suited to LLM inference, you can consult us all at once. See our LLM-infra technology consulting for details.
Can it handle high volume and concurrent requests?
Yes. With proprietary GPU load balancing, queueing and high-throughput inference via PagedAttention, it handles large concurrent access reliably, and scales out across multi-GPU / multi-node.
Which local LLMs (open models) are supported?
Broad support for major open LLMs published on HuggingFace — Llama, Qwen, DeepSeek, Gemma, Mistral, gpt-oss, Swallow, ELYZA, CALM, PLaMo and more — with fast follow-up to the latest models.
Can you host our own fine-tuned LLM?
Yes. Transformers-based custom and fine-tuned models can be hosted, so you can securely operate models trained on your own data inside your organization.
Which inference engines are supported?
QCT (Qualiteg Classic Transformer), vLLM, DeepSpeed, TGI and TensorRT-LLM, used selectively depending on the model, GPU generation and speed requirements.
What GPUs are needed? Can you advise on sizing?
Yes. We recommend suitable open models for what you want to do, and propose the best configuration including GPU selection, quantization and parallelization. We build with GPUs such as NVIDIA RTX PRO Blackwell (5000 / 6000) and A6000, computing the required GPU configuration from concurrency and model size. We support you with broad GPU expertise — see our LLM-infra technology consulting.
Is a Web Chat UI included?
Yes. A customizable Web Chat UI (PC / mobile) is bundled, so you can start using it right after deployment without building a separate front-end.
Is there an OpenAI-compatible API? Can I use it from existing systems or LangChain?
Yes. An OpenAI-compatible API is provided, so existing LLM apps and LangChain can connect as-is. You can also switch between or combine commercial APIs (OpenAI / Anthropic / Google).
Can you build RAG (internal document search)?
Yes. You can build RAG (retrieval-augmented generation) over internal documents, manuals and policies without sending data outside, for grounded internal Q&A.
Can you do AI agents and tool integration? PPTX / Excel generation too?
Yes. With tool extensions, you can build business AI agents that generate deliverables such as PPTX and Excel fully on local, so even the working data stays inside.
We want to use AI on drawings/technical docs in manufacturing without sending them outside. Is that possible?
Yes. You can use AI for design support, searching past cases and drafting technical documents without sending drawings, specs or technical documents outside — balancing manufacturing confidentiality with generative-AI use. We also have cases for IT and manufacturing customers and per-design-tool use cases, so please contact us.
Can it be used in regulated industries like finance, healthcare, government and legal?
Yes. Even in fields where data cannot leave — finance, healthcare, government, public sector, legal (lawyers / law firms) — you can run local LLMs securely on-premises / your own GPUs.
Can it run on the cloud (AWS / Azure / GCP)?
Yes. It can be built and operated on Qualiteg GPU Cloud, AWS, Azure, GCP, or on-premises.
What about security?
It includes enterprise-grade security such as TLS encryption, IP / ASN filtering and CSRF protection, and keeping data under your control makes it effective for leak prevention. Beyond these, we can achieve robust security tailored to your information-security policy and data-security requirements — please contact us for details.
Are there support plans?
Yes. We offer support plans that back stable operation after go-live, including help keeping up with model updates and version upgrades.
Do you have a commercial track record?
Yes. Bestllam®, our own SaaS integrating 20+ LLMs, is also built and operated on ChatStream, with a track record of stable operation in commercial service. For other deployment cases, please contact us.
Tell me about pricing, licensing and quotes.
We propose based on your deployment model (on-prem / cloud), models and scale. For quotes and licensing details, please feel free to contact us.
Related resources
Technical articles on LLM inference infrastructure, GPU configuration and local-LLM operations on the Qualiteg Blog.

KVキャッシュのオフロード戦略とGQAの実践的理解
Strategies for offloading the KV cache from GPU VRAM to CPU RAM/disk, and KV-cache reduction with GQA.

LLM学習の現実:GPU選びから学習コストまで徹底解説
GPU counts and costs for LLM training, with LLaMA 2 and DeepSeek-V3 examples.

ONNX RuntimeのcuDNN警告と対策
Cause and fix for the "libcudnn.so.9" error during GPU inference with ONNX Runtime.

LLM推論基盤プロビジョニング講座 第5回:GPUノード構成から負荷試験
GPU node sizing, load testing, trade-offs and real server configurations.

LLM推論基盤プロビジョニング講座 第4回:推論エンジンの選定
Characteristics and selection points of inference engines such as vLLM and TGI.

LLM推論基盤プロビジョニング講座 第3回:推論時消費メモリ見積もり
The two main drivers of GPU memory use: model footprint and KV cache.

LLM推論基盤プロビジョニング講座 第2回:リクエスト数を見積もる
How to estimate expected request volume to compute GPU node counts.

LLM推論基盤プロビジョニング講座 第1回:基本概念と推論速度
Foundational concepts and how to think about inference speed.

GPUサーバーの最適容量計算:キューイング理論と実践的モデル
Computing max supported users for a GPU server using queueing theory.

2025年 NVIDIA GPU 一発検索ツール
Filter NVIDIA GPUs by generation and spec — Blackwell, Hopper, Ada Lovelace and more.

【ChatStream】大容量のLLMの推論に必要なGPUサーバー構成
GPU server/cluster configuration for large-LLM inference, with Llama3-70B as an example.
推論速度を向上させる Speculative Decoding とは
A speedup technique that uses a small model to look ahead and reduce the big model's load.
Considering ChatStream?
We propose solutions for commercial local-LLM serving and inference-platform builds, tailored to your requirements. Feel free to consult us about on-prem or your-own-GPU operation as well.
Get in touch
Delivery: Qualiteg GPU Cloud / AWS / Azure / GCP / on-premises