Inference Cluster Dashboard

Real-time status of vLLM worker nodes

Nodes Active: 4 / 4

Worker Nodes

Node          Address          Status   Latency   Queue   Processing   Loaded Models
Upstream 1    vllm_1_0:8000    Online   74 ms     0       0            openai/gpt-oss-120b
Upstream 2    vllm_2_0:8000    Online   57 ms     0       0            openai/gpt-oss-120b
Upstream 3    vllm_3_0:8000    Online   53 ms     0       0            openai/gpt-oss-120b
Upstream 4    vllm_4_0:8000    Online   42 ms     0       0            openai/gpt-oss-120b
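
The Online and Latency readings come from each worker's /health endpoint, the same route HAProxy polls for rotation (see below). Here is a minimal probe sketch in Python; it assumes vLLM's standard /health route and that the vllm_*_0 hostnames resolve on the cluster's internal network:

import time

import requests

WORKERS = ["vllm_1_0:8000", "vllm_2_0:8000", "vllm_3_0:8000", "vllm_4_0:8000"]

for host in WORKERS:
    start = time.monotonic()
    try:
        # vLLM's OpenAI-compatible server answers GET /health with 200 when ready.
        resp = requests.get(f"http://{host}/health", timeout=2)
        latency_ms = (time.monotonic() - start) * 1000
        status = "Online" if resp.ok else f"HTTP {resp.status_code}"
        print(f"{host}  {status}  {latency_ms:.0f} ms")
    except requests.RequestException as exc:
        print(f"{host}  Offline ({exc.__class__.__name__})")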

Load Balancing Strategy

This cluster uses the Least Connections strategy provided by HAProxy.

  • New requests are routed to the worker with the fewest active connections (a sketch of this rule follows the configuration below).
  • This prevents any single GPU worker from becoming overwhelmed while others sit idle.
  • If a node fails its health checks (GET /health), it is automatically removed from rotation and restored once checks pass again.

The corresponding HAProxy directives:

balance leastconn
option httpchk GET /health
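
As an illustration of the routing rule (not HAProxy's actual implementation), here is a minimal Python sketch of the least-connections choice; the connection counts are hypothetical, since HAProxy tracks them internally:

def pick_worker(active: dict[str, int]) -> str:
    """Return the worker with the fewest active connections."""
    return min(active, key=active.get)

# Hypothetical snapshot of per-worker active connection counts.
active_connections = {
    "vllm_1_0:8000": 3,
    "vllm_2_0:8000": 1,
    "vllm_3_0:8000": 2,
    "vllm_4_0:8000": 4,
}

print(pick_worker(active_connections))  # -> vllm_2_0:8000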

Quick Integration

OpenAI-Compatible Endpoint
http://jerry.kaist.ac.kr/v1
OpenAI SDK (Python)
from openai import OpenAI

client = OpenAI(
    base_url="http://jerry.kaist.ac.kr/v1",
    api_key="YOUR_KEY"
)

response = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "Hi"}]
)
print(response.choices[0].message.content)
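
For longer generations, the same endpoint can stream tokens as they arrive. A minimal variant of the snippet above, assuming the load balancer passes through vLLM's streaming responses:

from openai import OpenAI

client = OpenAI(
    base_url="http://jerry.kaist.ac.kr/v1",
    api_key="YOUR_KEY"
)

stream = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "Hi"}],
    stream=True,  # deliver tokens incrementally instead of one final response
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:  # some chunks (e.g. the initial role marker) carry no text
        print(delta, end="", flush=True)
print()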
Example cURL
curl http://jerry.kaist.ac.kr/v1/chat/completions \
  -H "Authorization: Bearer YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-oss-120b",
    "messages": [{"role": "user", "content": "Hi"}]
  }'
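
To confirm connectivity and see which models the cluster is serving, you can query the standard /v1/models listing; given the dashboard above, it should return openai/gpt-oss-120b:

from openai import OpenAI

client = OpenAI(
    base_url="http://jerry.kaist.ac.kr/v1",
    api_key="YOUR_KEY"
)

# Every worker reports the same loaded model, so a single ID is expected here.
for model in client.models.list():
    print(model.id)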