Decide if the call is worth it, where to send it, and how to optimise latency and cost across 15 providers.
GET /v1/should-call
InferenceLatency.com is a real-time AI inference monitoring platform that continuously tests 15 major LLM API providers — including OpenAI, Anthropic, Groq, Cerebras, Mistral AI, DeepSeek, xAI, Cohere, Together AI, Fireworks AI, Hyperbolic, Perplexity, NVIDIA NIM, SambaNova, and Google Gemini — measuring time to first token (TTFT), throughput, cost, and reliability. Results are available via free JSON APIs with no authentication required.
The primary endpoint is GET /v1/should-call — a pre-inference decision engine that AI agents can call before every LLM request to get a recommended provider, expected latency, expected cost, and a confidence score. Calling this endpoint before inference typically reduces latency by 20–60% and cost by 10–40% compared to hardcoded provider selection.
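A minimal shell sketch of that pre-inference pattern, using the response fields documented below (should_call, recommended_provider); the exact response shape may differ:

# Ask the decision engine before committing to an LLM call (illustrative sketch).
decision=$(curl -s "https://inferencelatency.com/v1/should-call")
if [ "$(echo "$decision" | jq -r '.should_call')" = "true" ]; then
  provider=$(echo "$decision" | jq -r '.recommended_provider')
  echo "Route this request to: $provider"
else
  echo "Skip or defer this call"
fi

jq is used here only for illustration; any JSON parser works.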
Measurements use a standardised short prompt with a 1-token output limit, sent simultaneously to all providers via their official APIs. Timing is millisecond-precise from request start to first token received (TTFT). Results are stored in a rolling 48-hour database to compute statistically reliable P50, P95, and P99 latency percentiles — the metrics that matter most for SLA planning and production reliability.
Ask in plain English which provider to use. Results are live-tested. Use ?priority=speed|cost|balanced|reliability and ?use_case=chat|code|reasoning for tailored results.
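If these parameters are passed to the HTTP API (the /api/fastest endpoint is used below as an assumption; the text above does not say which endpoint accepts them), a tailored query might look like:

# Hypothetical example: priority and use_case as query parameters on /api/fastest.
curl -s "https://inferencelatency.com/api/fastest?priority=speed&use_case=code"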
Add the HTTP URL to your IDE's MCP settings. Streamable HTTP transport; no auth required.
https://inferencelatency.com/mcp
Add to your claude_desktop_config.json. Uses SSE transport via mcp-proxy.
{
  "mcpServers": {
    "inferencelatency": {
      "command": "mcp-proxy",
      "args": [
        "https://inferencelatency.com/sse"
      ]
    }
  }
}
Find and connect via the Smithery MCP registry. No configuration needed.
https://smithery.ai/server/inferencelatency
# Get fastest provider right now
curl https://inferencelatency.com/api/fastest

# Full latency ranking (all 15 providers)
curl https://inferencelatency.com/latency

# Custom benchmark with your prompt
curl "https://inferencelatency.com/benchmark?prompt=Explain+RAG&max_tokens=50"

# Cost optimizer
curl https://inferencelatency.com/cost-optimizer
https://inferencelatency.com/mcp
https://inferencelatency.com/openapi.json
InferenceLatency.com is a real-time AI inference monitoring and routing platform. It continuously tests 15 major LLM API providers — OpenAI, Anthropic, Groq, Cerebras, Mistral AI, DeepSeek, xAI (Grok), Cohere, Together AI, Fireworks AI, Hyperbolic, Perplexity, NVIDIA NIM, SambaNova, and Google Gemini — measuring time to first token (TTFT), throughput, cost, and reliability, and exposes that data via free JSON APIs. No authentication required for most endpoints.
AI agents and automated pipelines that need to route requests to the optimal provider before each inference call. Developers benchmarking which LLM API to use for a given workload. DevOps and platform teams building AI reliability pipelines and SLA dashboards. Researchers tracking inference performance trends across providers, regions, and models over time.
Every test sends a standardised short prompt with a 1-token output limit to each provider simultaneously using their official APIs. Timing is measured in milliseconds from the moment the HTTP request is sent until the first token is received — this is the TTFT (time to first token) metric. Results are stored in a rolling 48-hour database and used to compute P50, P95, and P99 latency percentiles, giving a statistically reliable picture of both typical and worst-case performance. Providers are tested from a consistent geographic location so results are directly comparable.
TTFT is the duration from sending a request until the first token of the response arrives. It is the most important latency metric for interactive AI applications — it determines how quickly users see the response begin. A lower TTFT means a faster-feeling application. Throughput (tokens per second) matters more for long-form generation tasks. InferenceLatency.com tracks both.
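For intuition, curl's own timing variables can approximate TTFT against any streaming endpoint; time_starttransfer reports the time until the first response byte arrives. The endpoint, model name, and payload below are placeholders, not the platform's actual test harness:

# Rough TTFT proxy: time from request start to first byte of a streaming response.
curl -s -o /dev/null -w "TTFT (approx): %{time_starttransfer}s\n" \
  -X POST "https://api.example-provider.com/v1/chat/completions" \
  -H "Authorization: Bearer $PROVIDER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "example-model", "stream": true, "max_tokens": 1,
       "messages": [{"role": "user", "content": "ping"}]}'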
The primary integration point is GET /v1/should-call — call this before every LLM request to get a recommendation on whether to proceed, which provider to use, expected TTFT, and expected cost. For simpler use cases, /api/fastest returns the current top-3 providers. All responses include an ai_agent_guidance field with a recommended provider and fallback order.
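A sketch of the simpler path; the presence of ai_agent_guidance is stated above, but its internal structure is an assumption here:

# Current top-3 providers plus the agent guidance block (structure illustrative).
curl -s https://inferencelatency.com/api/fastest | jq '.ai_agent_guidance'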
Use the cost_sensitivity and latency_sensitivity settings to get a routing recommendation that automatically balances these factors based on live data.
GET /v1/should-call returns a complete pre-inference decision: should_call (boolean), recommended_provider, expected_latency_ms, expected_cost, and confidence_score. It accepts optional parameters for task_type, latency_sensitivity, and cost_sensitivity. No authentication required. Free for up to 30 requests per day without an API key.
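A tuned request might pass those optional parameters; the parameter names come from the documentation above, while the example values are illustrative:

# Decision tailored to a code task that is latency-critical but not cost-sensitive
# (the values shown are guesses at the accepted enums).
curl -s "https://inferencelatency.com/v1/should-call?task_type=code&latency_sensitivity=high&cost_sensitivity=low"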