# InferenceLatency.com

> Real-time AI inference latency, cost, and reliability monitoring across 15 providers. Free public JSON API. No auth required. MCP-enabled.

## What this is

InferenceLatency.com is a live inference routing intelligence platform. It measures time-to-first-token (TTFT) from a standardised one-token test prompt across 15 major AI inference providers simultaneously and returns ranked, structured JSON results. Every response includes an `ai_agent_guidance` field with a recommended provider and fallback order.

This platform is designed for:

- AI agents choosing which provider to route requests to
- Developers benchmarking providers against real workloads
- DevOps teams monitoring SLA compliance and reliability
- Cost-optimisation pipelines comparing price vs performance

## Measurement methodology

- Test prompt: "Hi" (standardised short input)
- Max tokens: 1 (measures pure TTFT, not generation speed)
- Timing: wall-clock milliseconds from request dispatch to first token received
- Concurrency: all providers tested simultaneously per request
- History: 48-hour rolling window stored in PostgreSQL; P50/P95/P99 computed from stored measurements
- Geographic simulation: additional per-continent latency modelled from provider datacenter locations
- Note: geographic figures are modelled estimates, not direct multi-region measurements

## Active providers monitored (15)

Groq (llama-3.3-70b-versatile), Cerebras (llama3.1-8b), SambaNova (Llama-3.3-70B), Cohere (command-r-plus), Mistral AI (mistral-small-latest), OpenRouter (mistral-small-3.2-24b-instruct), OpenAI (gpt-4o), Together AI (Llama-3.3-70B-Instruct-Turbo), Fireworks AI (deepseek-v3p2), DeepSeek (deepseek-chat), Hyperbolic (Llama-3.3-70B-Instruct), Perplexity (sonar), Anthropic (claude-sonnet-4-5), xAI (grok-3-mini), Nvidia NIM (meta/llama-3.1-8b-instruct)

## Core JSON endpoints

- GET /api/fastest — Fastest provider right now: {provider, model, latency_ms}
- GET /latency — All 15 providers ranked by TTFT, with ai_agent_guidance
- GET /throughput — Latency + tokens/sec per provider
- GET /api/status — Up/down availability per provider
- GET /cost-optimizer — Efficiency scores: cost_per_token vs latency trade-off
- GET /reliability-metrics — P50/P95/P99, error rates, SLA compliance, quality grades
- GET /geographic-latency — Latency across NA, EU, Asia, SA, Oceania (5 continents)
- GET /competitive-analysis — Market positioning intelligence and strategic recommendations
- GET /historical-performance — 48-hour rolling history with trend analysis
- GET /efficiency — Energy (Wh) and carbon (gCO2e) per inference per provider
- GET /benchmark — Custom prompt: ?prompt=YOUR_TEXT&max_tokens=N&providers=groq,openai
- GET /advanced-benchmark — Tool-calling performance, structured-output speed, reasoning effort
- GET /ai-agent/batch-test — Batch testing with consistency scoring
- GET /comprehensive-report — Full report covering all dimensions at once

## Human-readable views

Every endpoint above has a /human variant returning HTML, e.g. /latency/human, /cost-optimizer/human

## MCP integration (Model Context Protocol)

- Endpoint: https://inferencelatency.com/mcp
- Transport: SSE
- Protocol version: 2024-11-05
- Auth: none
- Tools: all 28 API endpoints exposed as native MCP tools
- Claude Desktop: `{"mcpServers":{"inferencelatency":{"command":"mcp-proxy","args":["https://inferencelatency.com/mcp"]}}}`
- Cursor / Windsurf / any SSE client: add the URL https://inferencelatency.com/mcp

## Authentication

None. All endpoints are public. No API key, no signup, no registration.

## Machine-readable resources

- OpenAPI 3.1 spec: https://inferencelatency.com/openapi.json
- AI plugin manifest: https://inferencelatency.com/.well-known/ai-plugin.json
- MCP manifest: https://inferencelatency.com/.well-known/mcp.json
- Full detail: https://inferencelatency.com/llms-full.txt

## Contact

support@inferencelatency.com
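## Example: routing with ai_agent_guidance

A minimal Python sketch of the routing loop described above: fetch `/latency` (no auth required) and turn the `ai_agent_guidance` field into an ordered provider list to try. The exact sub-field names inside `ai_agent_guidance` (`recommended_provider`, `fallback_order`) are illustrative assumptions, not the documented schema; consult https://inferencelatency.com/openapi.json for the authoritative response shape.

```python
"""Sketch: pick a provider order from InferenceLatency.com's guidance field.

The sub-field names "recommended_provider" and "fallback_order" below are
assumptions for illustration; check /openapi.json for the real schema.
"""
import json
import urllib.request

BASE = "https://inferencelatency.com"


def fetch_json(path: str) -> dict:
    # Plain GET with a timeout; all endpoints are public, so no auth header.
    with urllib.request.urlopen(f"{BASE}{path}", timeout=10) as resp:
        return json.load(resp)


def route_order(latency_payload: dict) -> list[str]:
    """Return providers in the order an agent should try them:
    the recommended provider first, then the fallbacks, de-duplicated."""
    guidance = latency_payload.get("ai_agent_guidance", {})
    primary = guidance.get("recommended_provider")
    fallbacks = guidance.get("fallback_order", [])
    order = [primary] + [p for p in fallbacks if p != primary]
    return [p for p in order if p]  # drop a missing primary


if __name__ == "__main__":
    data = fetch_json("/latency")  # all 15 providers ranked by TTFT
    print(route_order(data))
```

An agent would then attempt its real request against each provider in the returned order, falling through on errors or timeouts.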
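## Example: building a /benchmark request

The `/benchmark` endpoint takes its custom prompt via query parameters (`prompt`, `max_tokens`, `providers`). A small sketch of building that URL safely with the standard library, so prompts with spaces or special characters are encoded correctly; the default `max_tokens` value here is an arbitrary choice, not a documented default.

```python
"""Sketch: construct a /benchmark URL with properly encoded query params."""
from urllib.parse import urlencode

BASE = "https://inferencelatency.com"


def benchmark_url(prompt: str, max_tokens: int = 16,
                  providers: tuple[str, ...] = ()) -> str:
    # "providers" is sent as one comma-separated value, per the endpoint list.
    params = {"prompt": prompt, "max_tokens": max_tokens}
    if providers:
        params["providers"] = ",".join(providers)
    return f"{BASE}/benchmark?{urlencode(params)}"
```

For example, `benchmark_url("Hello", 8, ("groq", "openai"))` yields a URL that restricts the benchmark to those two providers; omit `providers` to test all of them.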