Real-time AI inference monitoring

Compare latency, cost & reliability across 15 providers. No auth required.

15 providers · 28 endpoints · 24/7 monitoring · 5 continents · 99.7% uptime
OpenAI GPT-4o
Anthropic Claude
Groq
Cerebras
SambaNova
Cohere
OpenRouter
Mistral AI
Together AI
Fireworks AI
DeepSeek
Hyperbolic
Perplexity
xAI (Grok)
Nvidia NIM
Google Gemini ⚠
🏆 Check out our partners at inferencewars.com for the latest inference war reports and provider leaderboards
GET /latency: Live latency test across all 15 providers, ranked fastest to slowest. Returns latency in ms, success rate, and AI agent guidance.
GET /throughput: Latency plus tokens/sec for each provider, with full cost estimates and model metadata.
GET /api/fastest: Returns the single fastest provider right now (name, model, latency_ms). Ideal for agent routing decisions.
GET /api/status: Quick up/down availability check. Lightweight health ping for all providers.
GET /benchmark?prompt=hello&max_tokens=5: Benchmark all providers with your own prompt and token limit.
GET /advanced-benchmark: Tool calling, structured output speed, and reasoning effort impact. Built for AI agent workflows.
GET /cost-optimizer: Cost-performance efficiency scores and budget recommendations.
GET /reliability-metrics: P50 / P95 / P99 percentiles, error rates, and SLA compliance tracking.
GET /geographic-latency: Latency variation across 5 continents with regional performance insights.
GET /competitive-analysis: Industry benchmarking with market positioning and strategic recommendations.
GET /historical-performance: 48-hour rolling performance history with trend analysis and percentiles.
GET /efficiency: Energy (Wh) and carbon emissions (gCO₂e) per inference for sustainable AI decisions.
GET /status-page: Real-time provider health monitoring with uptime tracking.
GET /docs: Interactive Swagger UI with the full endpoint reference, parameters, and response schemas.
GET /openapi.json: OpenAPI 3.1 spec. Import into Postman, Insomnia, or any API client.
GET /llms.txt: Machine-readable guide for LLM crawlers (ChatGPT, Claude, Perplexity).
GET /submit: Submit a new LLM provider for automatic integration and benchmarking.
GET /health: Service health status and provider configuration check.
GET /analytics: Platform usage statistics and visitor analytics.
GET /ai-agent/batch-test: Enhanced batch testing with consistency scoring and statistics.
GET /admin/stats: Comprehensive platform monitoring statistics.

Claude Desktop (MCP)

Add to your claude_desktop_config.json to use all endpoints as AI tools.

{
  "mcpServers": {
    "inferencelatency": {
      "command": "mcp-proxy",
      "args": [
        "https://inferencelatency.com/mcp"
      ]
    }
  }
}

Cursor / Windsurf (MCP)

Add the SSE URL directly in your IDE MCP settings. No auth required.

MCP SSE URL:
https://inferencelatency.com/mcp

Transport: SSE
Auth: None required
# Get fastest provider right now
curl https://inferencelatency.com/api/fastest

# Full latency ranking (all 15 providers)
curl https://inferencelatency.com/latency

# Custom benchmark with your prompt
curl "https://inferencelatency.com/benchmark?prompt=Explain+RAG&max_tokens=50"

# Cost optimizer
curl https://inferencelatency.com/cost-optimizer
HDR X-MCP-Endpoint: Returned on every response; points to https://inferencelatency.com/mcp
HDR X-OpenAPI-Spec: Returned on every response; points to https://inferencelatency.com/openapi.json
GET /.well-known/ai-plugin.json: OpenAI / ChatGPT plugin manifest.
GET /.well-known/mcp.json: MCP capability discovery manifest.
GET /llms.txt: Plain-text guide for LLM crawlers (ChatGPT, Perplexity, Claude).
GET /.well-known/agents.json: Agent discovery manifest for automated platform discovery.

What is InferenceLatency.com?

InferenceLatency.com is a real-time AI infrastructure intelligence platform. It continuously tests 15 major inference providers — measuring latency, throughput, reliability, and cost — and exposes that data via clean JSON APIs. No dashboards to log into. No subscription required.

Who is it for?

AI agents that need to route requests to the fastest available provider. Developers benchmarking which provider to use. DevOps teams building AI reliability pipelines. Researchers tracking inference performance trends across regions and models.

How are measurements taken?

Every test sends a standardised short prompt with a 1-token limit to each provider simultaneously using their official APIs. Timing is measured in milliseconds from request start to first token received (TTFT). Results are stored in a rolling 48-hour database to compute P50, P95, and P99 percentiles.
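A minimal sketch of the percentile step described above, using the nearest-rank method over a window of TTFT samples (the sample values are invented for illustration; the platform's actual aggregation method is not published here):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile over latency samples in ms."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))  # 1-indexed rank
    return ordered[rank - 1]

# Hypothetical rolling window of TTFT samples for one provider (ms)
window = [112, 98, 341, 105, 120, 890, 101, 99, 130, 115]
p50, p95, p99 = (percentile(window, p) for p in (50, 95, 99))
# p50 == 112; the single 890 ms outlier dominates P95 and P99
```

Nearest-rank is the simplest percentile definition; interpolating variants give smoother values on small windows but the ordering conclusions are the same.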

How do I use this in my agent pipeline?

Point your agent at GET /api/fastest for the current fastest provider, or /latency for the full ranked list. The JSON response includes an ai_agent_guidance field with a recommended provider and fallback order.
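One way an agent could consume that response and walk the fallback order. The payload shape below is an assumption for illustration; only the ai_agent_guidance field name comes from the description above, and the model name is invented:

```python
def pick_provider(payload, healthy):
    """Pick the recommended provider from a /api/fastest-style payload,
    walking the fallback order if the recommendation is unhealthy.
    Payload shape is assumed, not the API's documented schema."""
    guidance = payload.get("ai_agent_guidance", {})
    candidates = [guidance.get("recommended"), *guidance.get("fallback_order", [])]
    for name in candidates:
        if name and name in healthy:
            return name
    return None

# Illustrative payload; fetch the real one from /api/fastest at runtime
sample = {
    "provider": "Groq", "model": "example-model", "latency_ms": 412,
    "ai_agent_guidance": {"recommended": "Groq",
                          "fallback_order": ["Cerebras", "SambaNova"]},
}
choice = pick_provider(sample, healthy={"Cerebras", "SambaNova"})
# Groq is not in the healthy set, so the first fallback is chosen
```

Keeping the health check separate from the ranking lets the agent combine /api/fastest with its own circuit-breaker state.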

How should I start measuring and evaluating inference performance?

Standardise your test harness: use identical prompts, same region, same token limit. Log TTFT (time to first token), tokens/sec, and P95/P99. Use the /benchmark endpoint to get side-by-side results on your actual workload.
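The harness idea can be sketched as a small timing loop. The provider client here is a stand-in lambda, not a real SDK call; any function that returns at the first token will make the timing approximate TTFT:

```python
import time
from statistics import mean

def bench(call, prompt, runs=5):
    """Time `call(prompt)` over several runs, returning mean and P95 in ms.
    `call` should return once the first token arrives so the timing
    approximates TTFT; the provider client itself is hypothetical."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        call(prompt)
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    return {"mean_ms": mean(samples),
            "p95_ms": samples[max(0, round(0.95 * len(samples)) - 1)]}

# Stand-in provider: pretend the first token arrives after ~10 ms
result = bench(lambda prompt: time.sleep(0.01), "ping", runs=3)
```

Swap the lambda for one wrapper per provider, all fed the same prompt and token limit, and the results become directly comparable.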
What drives cost per token in AI inference?

GPU utilisation, batch size, quantisation level, caching effectiveness, and cold-start behaviour are the primary cost drivers. Use /cost-optimizer to get efficiency scores that balance latency and cost for your use case.
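One simple way such a latency/cost blend could be computed, as a sketch: min-max normalise both axes and take a weighted average. The formula, weights, and numbers are illustrative assumptions, not the /cost-optimizer endpoint's actual method:

```python
def efficiency_scores(providers, w_latency=0.5):
    """providers: {name: (latency_ms, usd_per_1k_tokens)}.
    Min-max normalise both axes and blend into a 0..1 score,
    higher is better. Purely an illustrative scoring scheme."""
    lats = [lat for lat, _ in providers.values()]
    costs = [cost for _, cost in providers.values()]

    def norm(x, lo, hi):
        return 0.0 if hi == lo else (x - lo) / (hi - lo)

    return {
        name: round(w_latency * (1.0 - norm(lat, min(lats), max(lats)))
                    + (1.0 - w_latency) * (1.0 - norm(cost, min(costs), max(costs))), 3)
        for name, (lat, cost) in providers.items()
    }

# Invented example providers
scores = efficiency_scores({"fast_pricey": (300, 0.50),
                            "slow_cheap": (900, 0.10),
                            "balanced": (500, 0.30)})
```

With equal weights the two extremes tie and the balanced provider scores highest; shifting w_latency moves the winner toward whichever axis your workload cares about.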
How do I balance cost and performance without hurting UX?

Stream outputs for responsiveness, keep queue wait bounded, and use modest batching for longer generations. The /cost-optimizer endpoint provides provider-specific recommendations and efficiency scores.

Which provider is fastest right now?

Check /api/fastest for a live answer. Historically, Groq and Cerebras lead for sub-second inference, followed by SambaNova and Cohere. Rankings shift based on load and region, so real-time data is always more reliable than static benchmarks.

Does this work with streaming vs non-streaming modes?

Current measurements use non-streaming mode for consistent comparison. Streaming improves perceived latency significantly. Use /advanced-benchmark for workload-specific testing including tool calling and structured output scenarios.

Questions or partnership enquiries: support@inferencelatency.com