Runtime Local LLM

Run large language models entirely on-device in Unreal Engine. Offline GGUF inference with token-by-token streaming, conversation context, and an editor model manager - with full Blueprint and C++ support.

UE 4.27 – 5.7
Blueprints & C++
Windows, Mac, Linux, Android, iOS, Quest
Any GGUF model

On-Device LLM Inference

Built on llama.cpp - load any GGUF model and run inference locally with no cloud services, no API keys, and no data leaving the device. Updated regularly to track upstream llama.cpp releases.

Fully Offline

No internet connection, no API keys, no telemetry. All inference runs locally on the user's device.

Token Streaming

Receive each generated token in real time via delegates - update chat UIs and trigger gameplay events as the model writes.

GPU Acceleration

Vulkan on Windows and Linux, Metal on Mac and iOS, CPU + intrinsics on Android and Meta Quest.

Editor Model Manager

Browse, download, import, delete, and test models directly in project settings. Models ship with packaged builds automatically.

How It Works

Manage models in the editor, load them at runtime with your chosen inference parameters, and send messages. Tokens stream back through delegates as the model generates - all on a background thread, with callbacks on the game thread. The steps below outline the flow; a C++ sketch follows them.

1. Manage Models

Download from the catalog, import custom GGUF files, or fetch from URL at runtime

2. Load a Model

Choose by name, file path, or URL with configurable inference parameters

3. Send Messages

Pass user messages and receive streaming responses, with conversation context preserved

4. Use the Output

Drive NPC dialogue, generate dynamic content, or feed into other plugins like TTS and lip sync
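
In C++, the whole loop is only a few calls. The sketch below is illustrative only - the component, struct, and delegate names (FLocalLLMParams, LoadModelByName, OnTokenGenerated, SendMessage) are hypothetical stand-ins, not the plugin's actual API; check the documentation for real signatures.

```cpp
// Illustrative sketch - all plugin-facing names below
// (LLM component, FLocalLLMParams, LoadModelByName,
// OnTokenGenerated, SendMessage) are hypothetical.

void AMyNPC::BeginPlay()
{
    Super::BeginPlay();

    // 1) Pick inference parameters for this load.
    FLocalLLMParams Params;                       // hypothetical struct
    Params.Temperature  = 0.7f;
    Params.MaxTokens    = 256;
    Params.SystemPrompt = TEXT("You are a gruff blacksmith NPC.");

    // 2) Load a model managed by the editor model manager.
    LLM->LoadModelByName(TEXT("TinyLlama-1.1B-Q4_K_M"), Params);

    // 3) Tokens arrive on the game thread as they are generated.
    LLM->OnTokenGenerated.AddDynamic(this, &AMyNPC::HandleToken);

    // 4) Send a message; conversation context persists across turns.
    LLM->SendMessage(TEXT("What do you sell?"));
}

void AMyNPC::HandleToken(const FString& Token)
{
    DialogueText += Token; // e.g. append to an on-screen widget
}
```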

Supported Models

Any model in GGUF format works. The editor includes a catalog of popular pre-defined models for one-click download, and you can import any custom GGUF file.

Model Families

Llama (Meta), Mistral / Mixtral, Phi (Microsoft), Gemma (Google), Qwen (Alibaba), TinyLlama, and any other GGUF community model.

Quantization Levels

From Q2_K (smallest, fastest) through Q4_K_M, Q5_K_M, Q8_0, up to F16 / F32. Pick the level that fits your target device's RAM and performance budget.

Up-to-Date llama.cpp

The plugin is updated regularly on Fab to track upstream llama.cpp releases, so the latest GGUF model formats remain supported as they are released.

Platform Acceleration

Hardware acceleration tuned per platform - GPU compute where available, CPU + intrinsics where not.

Windows - Vulkan GPU
Linux - Vulkan GPU
Mac - Metal GPU
iOS - Metal GPU
Android - CPU + intrinsics
Meta Quest - CPU + intrinsics

For mobile and VR devices, smaller quantizations (Q2_K through Q4_K_M) with compact models (1B–3B parameters) are recommended. Desktop platforms can run larger models at higher-precision quantization levels.
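
As a rough sizing rule (an approximation, not a plugin API): a GGUF file weighs in at about parameter count × bits per weight ÷ 8, and Q4_K_M averages roughly 4.5 bits per weight. The helper below is a hypothetical convenience for back-of-envelope estimates.

```cpp
// Rough rule of thumb, not a plugin API: GGUF size is about
// parameter count * bits-per-weight / 8, plus small overhead.
double EstimateModelGiB(double BillionParams, double BitsPerWeight)
{
    const double Bytes = BillionParams * 1e9 * BitsPerWeight / 8.0;
    return Bytes / (1024.0 * 1024.0 * 1024.0);
}

// EstimateModelGiB(3.0, 4.5)  ~= 1.6  -> fits mobile/VR RAM budgets
// EstimateModelGiB(7.0, 4.5)  ~= 3.7  -> comfortable on desktop
// EstimateModelGiB(7.0, 16.0) ~= 13.0 -> F16, workstation-class
```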

Editor Model Manager

A dedicated settings panel inside the Unreal Editor for browsing, downloading, importing, deleting, and testing models - no command-line tools or external downloads required.

Catalog Downloads

Browse a built-in catalog of popular models with one-click download. Run multiple downloads in parallel, with progress bars and cancel support.

Custom Model Import

Import any GGUF file from disk or a direct URL. Custom models are treated identically to catalog models and ship with packaged builds.

In-Editor Testing

A built-in test window lets you select a model, configure parameters, send prompts, and watch responses stream in real time - all without entering Play mode.

Full-Featured Runtime API

Beyond basic loading and inference, the plugin provides multiple model loading methods, async Blueprint nodes, runtime downloading, conversation context management, and configurable inference parameters.

Multiple Loading Methods

Load by model name (with a Blueprint dropdown in UE 5.4+), by absolute file path, or directly from a URL with automatic download. Pre-cache models without loading them into memory.
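
A hedged sketch of what the three paths might look like from C++ - the function names here (LoadModelByName, LoadModelFromPath, LoadModelFromURL, PreCacheModel) are illustrative, not confirmed API:

```cpp
// Hypothetical function names, for illustration only.
void AMyActor::LoadExamples(const FLocalLLMParams& Params)
{
    // By catalog or imported model name (Blueprint dropdown in UE 5.4+):
    LLM->LoadModelByName(TEXT("Phi-3-mini-Q4_K_M"), Params);

    // By absolute file path on the target device:
    LLM->LoadModelFromPath(TEXT("D:/Models/custom.gguf"), Params);

    // Directly from a URL, downloading automatically if needed:
    LLM->LoadModelFromURL(TEXT("https://example.com/model.gguf"), Params);

    // Download and cache for later without loading into memory:
    LLM->PreCacheModel(TEXT("https://example.com/model.gguf"));
}
```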

Token-by-Token Streaming

Each generated token fires a delegate on the game thread - immediately update chat UIs, trigger gameplay events, or pipe output into other systems as it arrives.
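
Because the delegate already fires on the game thread, streaming into UMG is a one-liner per token. A sketch, again with a hypothetical OnTokenGenerated delegate name:

```cpp
// Sketch: appending streamed tokens to a chat widget.
// OnTokenGenerated is a hypothetical delegate name.
void UChatWidget::NativeConstruct()
{
    Super::NativeConstruct();
    LLM->OnTokenGenerated.AddDynamic(this, &UChatWidget::AppendToken);
}

void UChatWidget::AppendToken(const FString& Token)
{
    // Already on the game thread, so direct widget access is safe.
    ResponseText += Token;
    ResponseTextBlock->SetText(FText::FromString(ResponseText));
}
```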

Conversation Context

Multi-turn conversations preserve message history automatically. Reset context at any time, optionally keeping the system prompt for persistent character behavior.
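
For example (hypothetical names again), a reset that keeps the system prompt lets an NPC forget the current exchange without breaking character:

```cpp
// Hypothetical API: the bKeepSystemPrompt flag is illustrative.
void AMyNPC::StartNewConversation()
{
    // Clears the message history but preserves the persona defined
    // in the system prompt, so the character stays consistent.
    LLM->ResetContext(/*bKeepSystemPrompt=*/true);
}
```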

Configurable Inference

Control temperature, Top-P, Top-K, repeat penalty, GPU layer offload, context size, seed, thread count, max tokens, and system prompt - per model load.
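
Gathered into one place, the knobs might look like this - the struct and field names are illustrative stand-ins for whatever the plugin actually exposes:

```cpp
// Hypothetical parameter struct; field names are illustrative.
FLocalLLMParams MakeNPCParams()
{
    FLocalLLMParams Params;
    Params.Temperature   = 0.8f;   // sampling randomness
    Params.TopP          = 0.95f;  // nucleus sampling cutoff
    Params.TopK          = 40;     // top-K candidate pool
    Params.RepeatPenalty = 1.1f;   // discourages repetition
    Params.GPULayers     = 32;     // layers offloaded to the GPU
    Params.ContextSize   = 4096;   // context window, in tokens
    Params.Seed          = -1;     // -1 = random seed
    Params.Threads       = 4;      // CPU threads for inference
    Params.MaxTokens     = 512;    // cap on generated tokens
    Params.SystemPrompt  = TEXT("You are a terse quest-giver.");
    return Params;
}
```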

Async Blueprint Nodes

Dedicated async nodes for loading, sending messages, and downloading - with output pins for tokens, completion, progress, and errors. No manual delegate binding needed.

Automatic Packaging

Models in Content/RuntimeLocalLLM/Models are auto-staged via NonUFS so they ship with packaged builds. No manual project config required.

Blueprint Example - Simple chat with streaming responses


Try the Demo

A Windows demo is available so you can experience on-device inference firsthand. The demo includes a chat interface with streaming responses, runtime model downloads via URL, and a settings menu for inference parameters - all built with Blueprints and UMG.

Chat interface with token streaming

Send messages and watch responses generate token by token in real time

Pre-bundled and downloadable models

Models ready out of the box, plus runtime URL downloads for additional models

Configurable parameters

In-game settings menu for temperature, max tokens, context size, and more

Source included with the plugin

Full Blueprint implementation in the plugin's Demo content folder, supporting UE 4.27+

Download Demo (Windows)

Video Tutorial

Demo Project Preview

Common Use Cases

A few of the workflows the plugin supports out of the box - all running locally with no external dependencies.

NPC Dialogue

Character-driven conversations with persistent context - NPCs remember past exchanges and stay in character via the system prompt.

Dynamic Content

Generate quest text, item descriptions, lore snippets, or barks at runtime - varying output per session without authoring every variant.

Offline Chatbots

In-game assistants and chatbots that work without internet - useful for offline games, air-gapped deployments, and privacy-sensitive applications.

AI Pipelines

Pair with Speech Recognizer for voice input, TTS for spoken responses, and Lip Sync for animation - building fully offline conversational characters.
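
A sketch of the wiring for that last case - every delegate and function name below (OnTextRecognized, SendMessage, OnResponseCompleted, Speak) is a hypothetical placeholder for the respective plugins' real APIs:

```cpp
// Illustrative wiring between plugins; all names below are
// hypothetical stand-ins for the actual plugin APIs.
void AConversationalNPC::BeginPlay()
{
    Super::BeginPlay();
    Recognizer->OnTextRecognized.AddDynamic(this, &AConversationalNPC::OnPlayerSpoke);
    LLM->OnResponseCompleted.AddDynamic(this, &AConversationalNPC::OnLLMResponse);
}

void AConversationalNPC::OnPlayerSpoke(const FString& Text)
{
    LLM->SendMessage(Text);  // recognized speech -> LLM prompt
}

void AConversationalNPC::OnLLMResponse(const FString& Response)
{
    TTS->Speak(Response);    // LLM reply -> audio -> lip sync
}
```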

Plugin Ecosystem

Runtime Local LLM is the offline AI brain of the Georgy Dev plugin suite - combine it with speech recognition, TTS, and lip sync for fully on-device conversational characters.

Runtime Audio Importer

Process and play TTS audio output at runtime. Essential companion for handling streaming audio data from any TTS provider.

Runtime Speech Recognizer

Offline speech-to-text via Whisper. Convert player speech to text and feed it directly into the local LLM.

Runtime Text To Speech

Offline TTS with 900+ voices across 47 languages - speak the LLM's responses out loud locally.

Runtime MetaHuman Lip Sync

Real-time lip sync for MetaHumans and custom characters, driven by the audio output of synthesized LLM responses.

AI Chatbot Integrator

Cloud AI alternative - OpenAI, Claude, DeepSeek, and others. Use alongside Runtime Local LLM for hybrid online/offline scenarios.

Documentation & Support

Comprehensive documentation covers the editor model manager, runtime API, inference parameters, ready-to-use examples (simple chat, NPC dialogue, pre-downloading), and the bundled demo project.

Full Documentation

Step-by-step guides for all features, with Blueprint and C++ examples

Community & Support

Active Discord community with developer support

Custom Development

Tailored integration or feature development - solutions@georgy.dev

Editor - Importing custom GGUF models

Ready to Add Local LLMs to Your Project?

Available on Fab for UE 4.27 – 5.7. Includes the editor model manager, runtime API, demo project, and full documentation.