Inference Cluster Dashboard

Real-time status of vLLM worker nodes

Nodes Active: 4 / 4

Worker Nodes

Node          Address          Status   Latency   Queue   Processing   Loaded Models
Upstream 1    vllm_1_0:8000    Online   74 ms     0       0            openai/gpt-oss-120b
Upstream 2    vllm_2_0:8000    Online   57 ms     0       0            openai/gpt-oss-120b
Upstream 3    vllm_3_0:8000    Online   53 ms     0       0            openai/gpt-oss-120b
Upstream 4    vllm_4_0:8000    Online   42 ms     0       0            openai/gpt-oss-120b
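
The Online and Latency readings come from each worker's /health endpoint, the same route HAProxy polls for rotation (see below). Here is a minimal probe sketch in Python; it assumes vLLM's standard /health route and that the vllm_*_0 hostnames resolve on the cluster's internal network:

import time

import requests

WORKERS = ["vllm_1_0:8000", "vllm_2_0:8000", "vllm_3_0:8000", "vllm_4_0:8000"]

for host in WORKERS:
    start = time.monotonic()
    try:
        # vLLM's OpenAI-compatible server answers GET /health with 200 when ready.
        resp = requests.get(f"http://{host}/health", timeout=2)
        latency_ms = (time.monotonic() - start) * 1000
        status = "Online" if resp.ok else f"HTTP {resp.status_code}"
        print(f"{host}  {status}  {latency_ms:.0f} ms")
    except requests.RequestException as exc:
        print(f"{host}  Offline ({exc.__class__.__name__})")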

Load Balancing Strategy

This cluster uses the Least Connections strategy provided by HAProxy.

  • New requests are routed to the worker with the fewest active connections (a sketch of this rule follows the configuration below).
  • This prevents any single GPU worker from becoming overwhelmed while others sit idle.
  • If a node fails its health checks (GET /health), it is automatically removed from rotation and restored once checks pass again.

The corresponding HAProxy directives:

balance leastconn
option httpchk GET /health
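
As an illustration of the routing rule (not HAProxy's actual implementation), here is a minimal Python sketch of the least-connections choice; the connection counts are hypothetical, since HAProxy tracks them internally:

def pick_worker(active: dict[str, int]) -> str:
    """Return the worker with the fewest active connections."""
    return min(active, key=active.get)

# Hypothetical snapshot of per-worker active connection counts.
active_connections = {
    "vllm_1_0:8000": 3,
    "vllm_2_0:8000": 1,
    "vllm_3_0:8000": 2,
    "vllm_4_0:8000": 4,
}

print(pick_worker(active_connections))  # -> vllm_2_0:8000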

Quick Integration

OpenAI-Compatible Endpoint
http://jerry.kaist.ac.kr/v1
OpenAI SDK (Python)
from openai import OpenAI

client = OpenAI(
    base_url="http://jerry.kaist.ac.kr/v1",
    api_key="YOUR_KEY"
)

response = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "Hi"}]
)
print(response.choices[0].message.content)
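
For longer generations, the same endpoint can stream tokens as they arrive. A minimal variant of the snippet above, assuming the load balancer passes through vLLM's streaming responses:

from openai import OpenAI

client = OpenAI(
    base_url="http://jerry.kaist.ac.kr/v1",
    api_key="YOUR_KEY"
)

stream = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "Hi"}],
    stream=True,  # deliver tokens incrementally instead of one final response
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:  # some chunks (e.g. the initial role marker) carry no text
        print(delta, end="", flush=True)
print()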
Example cURL
curl http://jerry.kaist.ac.kr/v1/chat/completions \
  -H "Authorization: Bearer YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-oss-120b",
    "messages": [{"role": "user", "content": "Hi"}]
  }'
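
To confirm connectivity and see which models the cluster is serving, you can query the standard /v1/models listing; given the dashboard above, it should return openai/gpt-oss-120b:

from openai import OpenAI

client = OpenAI(
    base_url="http://jerry.kaist.ac.kr/v1",
    api_key="YOUR_KEY"
)

# Every worker reports the same loaded model, so a single ID is expected here.
for model in client.models.list():
    print(model.id)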