# InferenceLatency.com

> Real-time AI inference latency, cost, and reliability monitoring across 15 providers. Free public JSON API. No auth required. MCP-enabled.

## What this is

InferenceLatency.com is a live inference routing intelligence platform. It measures time-to-first-token (TTFT) from a standardised one-token test prompt across 15 major AI inference providers simultaneously and returns ranked, structured JSON results. Every response includes an `ai_agent_guidance` field with a recommended provider and fallback order.

This platform is designed for:

- AI agents choosing which provider to route requests to
- Developers benchmarking providers against real workloads
- DevOps teams monitoring SLA compliance and reliability
- Cost-optimisation pipelines comparing price vs performance

## Measurement methodology

- Test prompt: "Hi" (standardised short input)
- Max tokens: 1 (measures pure TTFT, not generation speed)
- Timing: wall-clock milliseconds from request dispatch to first token received
- Concurrency: all providers tested simultaneously per request
- History: 48-hour rolling window stored in PostgreSQL; P50/P95/P99 computed from stored measurements
- Geographic simulation: additional per-continent latency modelled from provider datacenter locations
- Note: geographic figures are modelled estimates, not direct multi-region measurements

## Active providers monitored (15)

Groq (llama-3.3-70b-versatile), Cerebras (llama3.1-8b), SambaNova (Llama-3.3-70B), Cohere (command-r-plus), Mistral AI (mistral-small-latest), OpenRouter (mistral-small-3.2-24b-instruct), OpenAI (gpt-4o), Together AI (Llama-3.3-70B-Instruct-Turbo), Fireworks AI (deepseek-v3p2), DeepSeek (deepseek-chat), Hyperbolic (Llama-3.3-70B-Instruct), Perplexity (sonar), Anthropic (claude-sonnet-4-5), xAI (grok-3-mini), Nvidia NIM (meta/llama-3.1-8b-instruct)

## Core JSON endpoints

- GET /api/fastest — Fastest provider right now: {provider, model, latency_ms}
- GET /latency — All 15 providers ranked by TTFT, with ai_agent_guidance
- GET /throughput — Latency + tokens/sec per provider
- GET /api/status — Up/down availability per provider
- GET /cost-optimizer — Efficiency scores: cost_per_token vs latency trade-off
- GET /reliability-metrics — P50/P95/P99, error rates, SLA compliance, quality grades
- GET /geographic-latency — Latency across NA, EU, Asia, SA, Oceania (5 continents)
- GET /competitive-analysis — Market positioning intelligence and strategic recommendations
- GET /historical-performance — 48-hour rolling history with trend analysis
- GET /efficiency — Energy (Wh) and carbon (gCO2e) per inference per provider
- GET /benchmark — Custom prompt: ?prompt=YOUR_TEXT&max_tokens=N&providers=groq,openai
- GET /advanced-benchmark — Tool-calling performance, structured-output speed, reasoning effort
- GET /ai-agent/batch-test — Batch testing with consistency scoring
- GET /comprehensive-report — Full report covering all dimensions at once

## Human-readable views

Every endpoint above has a /human variant returning HTML, e.g. /latency/human, /cost-optimizer/human

## MCP integration (Model Context Protocol)

- Endpoint: https://inferencelatency.com/mcp
- Transport: SSE
- Protocol version: 2024-11-05
- Auth: none
- Tools: all 28 API endpoints exposed as native MCP tools
- Claude Desktop: `{"mcpServers":{"inferencelatency":{"command":"mcp-proxy","args":["https://inferencelatency.com/mcp"]}}}`
- Cursor / Windsurf / any SSE client: add the URL https://inferencelatency.com/mcp

## Authentication

None. All endpoints are public. No API key, no signup, no registration.

## Machine-readable resources

- OpenAPI 3.1 spec: https://inferencelatency.com/openapi.json
- AI plugin manifest: https://inferencelatency.com/.well-known/ai-plugin.json
- MCP manifest: https://inferencelatency.com/.well-known/mcp.json
- Full detail: https://inferencelatency.com/llms-full.txt

## Contact

support@inferencelatency.com
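## Example: routing with ai_agent_guidance

A minimal Python sketch of the routing loop described above: fetch `/latency` (no auth required) and turn the `ai_agent_guidance` field into an ordered provider list to try. The exact sub-field names inside `ai_agent_guidance` (`recommended_provider`, `fallback_order`) are illustrative assumptions, not the documented schema; consult https://inferencelatency.com/openapi.json for the authoritative response shape.

```python
"""Sketch: pick a provider order from InferenceLatency.com's guidance field.

The sub-field names "recommended_provider" and "fallback_order" below are
assumptions for illustration; check /openapi.json for the real schema.
"""
import json
import urllib.request

BASE = "https://inferencelatency.com"


def fetch_json(path: str) -> dict:
    # Plain GET with a timeout; all endpoints are public, so no auth header.
    with urllib.request.urlopen(f"{BASE}{path}", timeout=10) as resp:
        return json.load(resp)


def route_order(latency_payload: dict) -> list[str]:
    """Return providers in the order an agent should try them:
    the recommended provider first, then the fallbacks, de-duplicated."""
    guidance = latency_payload.get("ai_agent_guidance", {})
    primary = guidance.get("recommended_provider")
    fallbacks = guidance.get("fallback_order", [])
    order = [primary] + [p for p in fallbacks if p != primary]
    return [p for p in order if p]  # drop a missing primary


if __name__ == "__main__":
    data = fetch_json("/latency")  # all 15 providers ranked by TTFT
    print(route_order(data))
```

An agent would then attempt its real request against each provider in the returned order, falling through on errors or timeouts.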
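## Example: building a /benchmark request

The `/benchmark` endpoint takes its custom prompt via query parameters (`prompt`, `max_tokens`, `providers`). A small sketch of building that URL safely with the standard library, so prompts with spaces or special characters are encoded correctly; the default `max_tokens` value here is an arbitrary choice, not a documented default.

```python
"""Sketch: construct a /benchmark URL with properly encoded query params."""
from urllib.parse import urlencode

BASE = "https://inferencelatency.com"


def benchmark_url(prompt: str, max_tokens: int = 16,
                  providers: tuple[str, ...] = ()) -> str:
    # "providers" is sent as one comma-separated value, per the endpoint list.
    params = {"prompt": prompt, "max_tokens": max_tokens}
    if providers:
        params["providers"] = ",".join(providers)
    return f"{BASE}/benchmark?{urlencode(params)}"
```

For example, `benchmark_url("Hello", 8, ("groq", "openai"))` yields a URL that restricts the benchmark to those two providers; omit `providers` to test all of them.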