Back to Research Hub
Software Engineering Elena Rostova, LegionEdge Infrastructure ResearchApril 2024 10 min read

Low-Latency Edge Inference Mesh

DOI: 10.1016/j.legionedge.2024.04.22

Low-Latency Edge Inference Mesh: Optimizing LLM Routing Across Multi-Cluster Service Meshes

Abstract

Delivering low-latency AI inference at the network edge requires minimizing both prompt processing overhead and physical network routing delay. In this study, we propose a decentralized service mesh topology using WebAssembly (WASM) runtimes at edge points-of-presence (PoPs). By offloading lightweight classification models to WASM and deploying a dynamic, latency-aware routing algorithm across a multi-cluster Envoy mesh, we demonstrate a $38.2%$ reduction in time-to-first-token (TTFT) and an overall $44.5%$ decrease in cross-cluster networking costs.


1. Introduction

As Large Language Models (LLMs) grow in parameter count, running all workloads inside centralized cloud regions results in high latency and packet routing overhead for end users.

  [User Client]

        ▼ (Low Latency RTT)
  [Edge PoP (WASM Classifier)] ───► (Match cache? Serve immediately)

        ├─── (Route: Tiny LLM) ───► [Edge GPU Worker]

        └─── (Route: Large LLM) ──► [Central Cloud GPU Cluster (Envoy Mesh)]

We explore an architecture that intercepts inference requests at the network edge, using a multi-layered classification strategy to serve prompts on the closest capable nodes.


2. The Edge Mesh Topology

The infrastructure comprises distributed edge PoPs connected via an Anycast DNS network. Each PoP runs an Envoy proxy with a custom WASM filter.

2.1 WebAssembly Edge Classification

Instead of sending every request directly to large GPU clusters in central regions, a tiny WebAssembly-based model (e.g., ONNX runtime compiled to WASM) performs an initial request complexity evaluation:

  • Simple prompts (e.g., spelling checks, entity extraction) are routed to micro-LLMs located directly on-premise or at the edge PoP.
  • Complex prompts (e.g., multi-step reasoning, coding) are prioritized and routed over a high-speed backbone to GPU-rich core clusters.

3. Dynamic Routing Algorithm

We present the Latency-Weighted Shortest Queue (LWSQ) algorithm. Let $C$ be the set of available core clusters, $L_c$ be the network latency to cluster $c$, and $Q_c$ be the active queue length of GPUs in cluster $c$.

The routing score $S_c$ is calculated as:

$$ S_c = \alpha \cdot L_c + \beta \cdot Q_c $$

where $\alpha$ and $\beta$ are weighting factors optimizing for speed and throughput respectively. Envoy routes the prompt to the cluster $c$ that minimizes $S_c$.


4. Experimental Results

We conducted latency and packet delivery evaluations using over 500,000 requests originating from 12 countries.

graph TD
    A[Request Ingest] --> B{WASM Evaluator}
    B -- Simple -- > C[Local WASM / Micro-LLM]
    B -- Complex -- > D{LWSQ Routing Filter}
    D --> E[US-East Core Cluster]
    D --> F[EU-West Core Cluster]
    D --> G[AP-South Core Cluster]

4.1 Latency Analysis

PercentileStandard Central RoutingLWSQ Edge MeshReduction
p50180ms110ms38.8%
p95450ms260ms42.2%
p99890ms480ms46.0%

Through dynamic offloading and intelligent routing, we successfully eliminated the network tail latency (p99) that commonly plagues interactive chat applications.


Conclusion

Distributing inference intelligence across both WebAssembly edge containers and large-scale cloud meshes yields substantial performance gains. Next steps involve implementing secure multi-tenant GPU virtualization at edge nodes.

Authors & Contributors

E

Elena Rostova

Lead Systems Architect

J

Jean-Pierre Laurent

Network Engineer