Edge-First LLM Semantic Routing on a 4GB Jetson Nano
People: David Pickett
Idea: Testing whether a 4GB NVIDIA Jetson Nano can act as an autonomous routing brain - classifying incoming queries with a local embedding model and deciding to answer locally or escalate to more powerful servers across four compute tiers.
Details:
- The Jetson Nano can run llama.cpp natively with two models - nomic-embed-text for embeddings and gemma-3-1b for chat - both fitting comfortably in 4GB RAM
- LiteLLM's semantic router runs on the Jetson using the local embedding model to classify queries, adding only ~300MB of overhead
- Simple questions like "What is the capital of France?" get answered entirely on-device with no network calls - critical for spotty connectivity
- Coding queries were tested routing to a local hub with a 24GB GPU running Ollama; complex reasoning goes to MiniMax-M2.5 served by LM Studio on a regional server; and deep analysis hits DeepSeek-R1 on university HPC
- All four tiers - edge, local hub, regional, datacenter - route correctly from the Jetson in a single LiteLLM config
- A key gotcha was using LiteLLM's llamafile/ provider instead of openai/ for llama.cpp embeddings - the OpenAI SDK sends a null encoding_format field that llama.cpp rejects
- The UIUC KNN router also works on the Jetson but only inside a Docker container - the Jetson's ancient glibc 2.27 blocks native install
- Longformer embeddings for the KNN classifier take ~6 seconds per query on the Jetson's Cortex-A57 CPU vs sub-second on an M3 Mac (more investigation is needed to determine whether GPU acceleration can be made to work on a Jetson of that generation with modern Python tools)
- GPU acceleration in Docker is blocked because PyTorch's aarch64 wheels on PyPI are CPU-only - no CUDA support (but there may be a way to build a custom Dockerfile)
- NVIDIA's Triton-based router is too heavy for 4GB and will likely stay on the farm hub (or require a newer/bigger Jetson model)
- https://github.com/pickettd/litellm-local-semantic-router-example
- https://github.com/pickettd/local-uiuc-llmrouter-example
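A single LiteLLM config covering all four tiers, as the bullets describe, might look roughly like the sketch below. The model names, hostnames, and ports are placeholders I've assumed for illustration; the llamafile/ prefix on the embedding model is the workaround from the gotcha bullet, since the openai/ provider's null encoding_format is rejected by llama.cpp:

```yaml
model_list:
  # Tier 1: on-device chat (llama.cpp on the Jetson itself)
  - model_name: edge-chat
    litellm_params:
      model: openai/gemma-3-1b
      api_base: http://localhost:8082/v1
      api_key: "none"
  # On-device embeddings: llamafile/ provider, not openai/, so no
  # null encoding_format is sent to llama.cpp
  - model_name: edge-embed
    litellm_params:
      model: llamafile/nomic-embed-text
      api_base: http://localhost:8081/v1
  # Tier 2: local hub - Ollama on the 24GB GPU box (model name assumed)
  - model_name: hub-coder
    litellm_params:
      model: ollama/qwen2.5-coder
      api_base: http://hub.local:11434
  # Tier 3: regional server - LM Studio
  - model_name: regional-reasoning
    litellm_params:
      model: lm_studio/minimax-m2.5
      api_base: http://regional.example:1234/v1
  # Tier 4: datacenter/HPC - DeepSeek-R1 behind an OpenAI-compatible
  # endpoint (assumption)
  - model_name: hpc-deep
    litellm_params:
      model: openai/deepseek-r1
      api_base: http://hpc.example:8000/v1
```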
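The two-model llama.cpp setup in the first bullet can be sketched as a pair of llama-server launches, one exposing the embedding endpoint for the router and one serving chat. The model filenames, quantizations, and ports here are illustrative assumptions, not taken from the notes:

```shell
# Embedding endpoint for the semantic router (--embedding enables
# llama-server's OpenAI-compatible /v1/embeddings route):
./llama-server -m nomic-embed-text-v1.5.Q8_0.gguf --embedding --port 8081 &

# Chat endpoint for on-device answers (model file and port are assumptions):
./llama-server -m gemma-3-1b-it-Q4_K_M.gguf --port 8082 &
```

Both processes together stay within the Jetson's 4GB as described above.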
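The encoding_format gotcha boils down to what the request body contains: llama.cpp's OpenAI-compatible embeddings endpoint rejects a null encoding_format, so a working client simply omits the key. A minimal sketch of building such a payload by hand (model name is an illustrative assumption):

```python
import json

def build_embedding_payload(text: str, model: str = "nomic-embed-text") -> str:
    """Build a JSON body llama.cpp accepts: no encoding_format key at all.

    Clients that serialize encoding_format as null (as the OpenAI SDK did
    in this setup) get their request rejected by llama.cpp.
    """
    payload = {"model": model, "input": text}
    return json.dumps(payload)

body = build_embedding_payload("What is the capital of France?")
assert "encoding_format" not in body  # the field llama.cpp would reject when null
```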
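The routing decision itself - classify the query with a local embedding, then pick a tier - can be sketched as a nearest-exemplar lookup with cosine similarity. In the real setup the embeddings come from nomic-embed-text on the Jetson; the 3-dimensional vectors and per-tier exemplars below are made-up stand-ins, not anything from the project:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# One hypothetical exemplar embedding per compute tier:
TIER_EXEMPLARS = {
    "edge":       [0.9, 0.1, 0.0],  # simple factual questions
    "local-hub":  [0.1, 0.9, 0.0],  # coding queries
    "regional":   [0.0, 0.5, 0.8],  # complex reasoning
    "datacenter": [0.0, 0.1, 1.0],  # deep analysis
}

def route(query_embedding):
    """Return the tier whose exemplar is most similar to the query."""
    return max(TIER_EXEMPLARS,
               key=lambda t: cosine(query_embedding, TIER_EXEMPLARS[t]))

# A query embedding near the "edge" exemplar stays on-device:
print(route([0.85, 0.15, 0.05]))  # prints "edge"
```

The KNN router mentioned above generalizes this from one exemplar per tier to k nearest labeled examples.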