Call this before every inference request

Decide if the call is worth it, where to send it, and how to optimise latency and cost across 15 providers.

15 providers · 28 endpoints · 24/7 monitoring · 5 continents · 99.7% uptime
First call — call this before every inference request
GET /v1/should-call
No parameters required • Returns: should_call, recommended_provider, expected_latency_ms, expected_cost, confidence_score
Value: typically reduces latency 20–60% and cost 10–40%

What is InferenceLatency.com?

InferenceLatency.com is a real-time AI inference monitoring platform that continuously tests 15 major LLM API providers — including OpenAI, Anthropic, Groq, Cerebras, Mistral AI, DeepSeek, xAI, Cohere, Together AI, Fireworks AI, Hyperbolic, Perplexity, NVIDIA NIM, SambaNova, and Google Gemini — measuring time to first token (TTFT), throughput, cost, and reliability. Results are available via free JSON APIs with no authentication required.

The primary endpoint is GET /v1/should-call — a pre-inference decision engine that AI agents can call before every LLM request to get a recommended provider, expected latency, expected cost, and a confidence score. Calling this endpoint before inference typically reduces latency by 20–60% and cost by 10–40% compared to hardcoded provider selection.
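
As a rough sketch of that integration (the endpoint and its documented response fields are listed above; the helper name and the parameter values used here are illustrative), an agent might gate each call like this in Python:

import requests  # assumes the requests library is available

def route_next_call(task_type="chat", latency_sensitivity="high", cost_sensitivity="medium"):
    # Ask the routing engine whether and where to send the next inference call.
    # All parameters are optional; the endpoint also works with none at all.
    resp = requests.get(
        "https://inferencelatency.com/v1/should-call",
        params={
            "task_type": task_type,
            "latency_sensitivity": latency_sensitivity,
            "cost_sensitivity": cost_sensitivity,  # the value "medium" is an assumption
        },
        timeout=5,
    )
    decision = resp.json()
    # Documented fields: should_call, recommended_provider,
    # expected_latency_ms, expected_cost, confidence_score
    if decision.get("should_call"):
        return decision["recommended_provider"], decision["expected_latency_ms"]
    return None, None  # defer or skip the call

If the routing service itself is unreachable, fall back to your existing hardcoded provider so the check never becomes a single point of failure.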

Measurements use a standardised short prompt with a 1-token output limit, sent simultaneously to all providers via their official APIs. Timing is millisecond-precise from request start to first token received (TTFT). Results are stored in a rolling 48-hour database to compute statistically reliable P50, P95, and P99 latency percentiles — the metrics that matter most for SLA planning and production reliability.

OpenAI GPT-4o
Anthropic Claude
Groq
Cerebras
SambaNova
Cohere
OpenRouter
Mistral AI
Together AI
Fireworks AI
DeepSeek
Hyperbolic
Perplexity
xAI (Grok)
NVIDIA NIM
Google Gemini
🏆 Check out our partners at inferencewars.com for the latest inference war reports and provider leaderboards

Ask in plain English which provider to use. Results are live-tested.

GET /latency
Live latency test across all 15 providers, ranked fastest to slowest. Returns ms, success rate, and AI agent guidance.
GET /throughput
Latency + tokens/sec for each provider. Full cost estimates and model metadata.
GET /api/fastest
Routing engine: returns top-3 scored providers. Add ?priority=speed|cost|balanced|reliability and ?use_case=chat|code|reasoning for tailored results (see the example after this list).
GET /api/recommend?query=…
Natural language recommendation engine. Pass any query and get ranked providers with live-tested results and reasoning.
GET /api/status
Quick up/down availability check. Lightweight health ping for all providers.
GET /benchmark?prompt=hello&max_tokens=5
Custom prompt benchmark across all providers with your own prompt and token limit.
GET /advanced-benchmark
Tool calling, structured output speed, and reasoning effort impact. Built for AI agent workflows.
GET /cost-optimizer
Cost-performance efficiency scores and budget recommendations.
GET /reliability-metrics
P50 / P95 / P99 percentiles, error rates, and SLA compliance tracking.
GET /geographic-latency
Latency variation across 5 continents with regional performance insights.
GET /competitive-analysis
Industry benchmarking with market positioning and strategic recommendations.
GET /historical-performance
48-hour rolling performance history with trend analysis and percentiles.
GET /efficiency
Energy (Wh) and carbon emissions (gCO₂e) per inference for sustainable AI decisions.
GET /status-page
Real-time provider health monitoring with uptime tracking.
GET /docs
Interactive Swagger UI — full endpoint reference with parameters and response schemas.
GET /openapi.json
OpenAPI 3.1 spec. Import into Postman, Insomnia, or any API client.
GET /llms.txt
Machine-readable guide for LLM crawlers — ChatGPT, Claude, Perplexity.
GET /submit
Submit a new LLM provider for automatic integration and benchmarking.
GET /health
Service health status and provider configuration check.
GET /analytics
Platform usage statistics and visitor analytics.
GET /ai-agent/batch-test
Enhanced batch testing with consistency scoring and statistics.
GET /admin/stats
Comprehensive platform monitoring statistics.
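
As referenced in the /api/fastest entry above, a parameterised query is a one-liner; everything except the documented ai_agent_guidance field name is an assumption to verify against /docs:

import requests

# Top-3 providers scored for a cost-sensitive coding workload
resp = requests.get(
    "https://inferencelatency.com/api/fastest",
    params={"priority": "cost", "use_case": "code"},
    timeout=5,
)
data = resp.json()
# Every response includes an ai_agent_guidance field with a recommended
# provider and fallback order; inspect it before wiring it into a pipeline.
print(data.get("ai_agent_guidance"))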

Cursor / Windsurf (MCP)

Add the HTTP URL below to your IDE's MCP settings. Streamable HTTP transport — no auth required.

https://inferencelatency.com/mcp

Claude Desktop (MCP)

Add to your claude_desktop_config.json. Uses SSE transport via mcp-proxy.

{
  "mcpServers": {
    "inferencelatency": {
      "command": "mcp-proxy",
      "args": [
        "https://inferencelatency.com/sse"
      ]
    }
  }
}

Smithery Registry

Find and connect via the Smithery MCP registry. No configuration needed.

https://smithery.ai/server/inferencelatency
# Get fastest provider right now
curl https://inferencelatency.com/api/fastest

# Full latency ranking (all 15 providers)
curl https://inferencelatency.com/latency

# Custom benchmark with your prompt
curl "https://inferencelatency.com/benchmark?prompt=Explain+RAG&max_tokens=50"

# Cost optimizer
curl https://inferencelatency.com/cost-optimizer
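
# Pre-inference routing decision (no parameters required)
curl https://inferencelatency.com/v1/should-call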

Header: X-MCP-Endpoint
Returned on every response — https://inferencelatency.com/mcp
Header: X-OpenAPI-Spec
Returned on every response — https://inferencelatency.com/openapi.json
GET /.well-known/ai-plugin.json
OpenAI / ChatGPT plugin manifest
GET /.well-known/mcp.json
MCP capability discovery manifest
GET /llms.txt
Plain-text guide for LLM crawlers (ChatGPT, Perplexity, Claude)
GET /.well-known/agents.json
Agent discovery manifest for automated platform discovery

What is InferenceLatency.com?

InferenceLatency.com is a real-time AI inference monitoring and routing platform. It continuously tests 15 major LLM API providers — OpenAI, Anthropic, Groq, Cerebras, Mistral AI, DeepSeek, xAI (Grok), Cohere, Together AI, Fireworks AI, Hyperbolic, Perplexity, NVIDIA NIM, SambaNova, and Google Gemini — measuring time to first token (TTFT), throughput, cost, and reliability, and exposes that data via free JSON APIs. No authentication required for most endpoints.

Who is it for?

AI agents and automated pipelines that need to route requests to the optimal provider before each inference call. Developers benchmarking which LLM API to use for a given workload. DevOps and platform teams building AI reliability pipelines and SLA dashboards. Researchers tracking inference performance trends across providers, regions, and models over time.

How are measurements taken?

Every test sends a standardised short prompt with a 1-token output limit to each provider simultaneously using their official APIs. Timing is measured in milliseconds from the moment the HTTP request is sent until the first token is received — this is the TTFT (time to first token) metric. Results are stored in a rolling 48-hour database and used to compute P50, P95, and P99 latency percentiles, giving a statistically reliable picture of both typical and worst-case performance. Providers are tested from a consistent geographic location so results are directly comparable.

What is time to first token (TTFT)?

TTFT is the duration from sending a request until the first token of the response arrives. It is the most important latency metric for interactive AI applications — it determines how quickly users see the response begin. A lower TTFT means a faster-feeling application. Throughput (tokens per second) matters more for long-form generation tasks. InferenceLatency.com tracks both.
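
To make the metric concrete, here is a minimal way to measure TTFT yourself with any OpenAI-compatible streaming SDK; the client setup and model name are illustrative and are not part of InferenceLatency.com's own harness:

import time
from openai import OpenAI  # any OpenAI-compatible SDK behaves the same way

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
start = time.perf_counter()
stream = client.chat.completions.create(
    model="gpt-4o",  # illustrative model name
    messages=[{"role": "user", "content": "hi"}],
    max_tokens=1,    # 1-token output limit, mirroring the standardised test
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        # The first chunk carrying content marks the first token's arrival.
        print(f"TTFT: {(time.perf_counter() - start) * 1000:.0f} ms")
        break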

How do I use this in my agent pipeline?

The primary integration point is GET /v1/should-call — call this before every LLM request to get a recommendation on whether to proceed, which provider to use, expected TTFT, and expected cost. For simpler use cases, /api/fastest returns the current top-3 providers. All responses include an ai_agent_guidance field with a recommended provider and fallback order.
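
A sketch of that fallback pattern, assuming the fallback list lives under ai_agent_guidance (the exact key names and the call_llm stub are hypothetical placeholders for your own provider clients):

import requests

def provider_order():
    # /api/fastest returns the current top-3 scored providers plus an
    # ai_agent_guidance field with a recommended provider and fallback order.
    # The key names below are assumptions; confirm them against /docs.
    data = requests.get("https://inferencelatency.com/api/fastest", timeout=5).json()
    guidance = data.get("ai_agent_guidance", {})
    order = [guidance.get("recommended_provider"), *guidance.get("fallback_order", [])]
    return [p for p in order if p]

def call_llm(provider, prompt):
    # Hypothetical dispatch to your own provider SDKs; replace with real calls.
    raise NotImplementedError(provider)

prompt = "Summarise this ticket in one sentence."
for provider in provider_order():
    try:
        answer = call_llm(provider, prompt)
        break  # stop at the first provider that succeeds
    except Exception:
        continue  # fall through to the next provider in the recommended order
else:
    answer = None  # every provider failed; treat as an outage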

Which AI API is fastest right now?

Rankings shift constantly based on provider load, model updates, and infrastructure changes. Groq (using LPU hardware) and Cerebras consistently lead for sub-second TTFT, followed by SambaNova and Cohere. Check /api/fastest for a live answer, or /v1/should-call for a full pre-inference routing decision including cost.

How do I choose between OpenAI and Groq?

Groq is typically 3–10x faster for TTFT due to its LPU hardware, making it ideal for latency-sensitive chat and real-time applications. OpenAI GPT-4o offers stronger reasoning, function calling, and a larger context window (128k). For cost-sensitive workloads, Groq and DeepSeek are significantly cheaper per token. Use /v1/should-call?task_type=chat&latency_sensitivity=high to get a live data-driven recommendation for your specific needs.

What is P95 or P99 latency and why should I track it?

P95 latency is the response time within which 95% of requests complete; the slowest 5% take longer. P99 covers the slowest 1%. Tracking these percentiles is critical for SLA planning because the median (P50) can look healthy while a minority of users experience unacceptable delays. InferenceLatency.com tracks P50, P95, and P99 for all 15 providers in a rolling 48-hour database, accessible at /reliability-metrics.
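
For intuition, the percentile arithmetic is simple; the sample values below are invented for the example:

import math

def percentile(values, pct):
    # Nearest-rank percentile: smallest sample that at least pct% of
    # the samples are less than or equal to.
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[rank - 1]

# Hypothetical TTFT samples in ms; a real 48-hour window holds far more points.
ttft_ms = [412, 388, 455, 501, 397, 1620, 430, 476, 389, 2210]
print("P50:", percentile(ttft_ms, 50))  # typical request
print("P95:", percentile(ttft_ms, 95))  # 95% of requests finish within this
print("P99:", percentile(ttft_ms, 99))  # worst-case tail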

How should I start measuring and evaluating inference performance?

Standardise your test harness: use identical prompts, the same geographic region, and the same token limit across all providers. Log TTFT (time to first token), tokens per second, and P95/P99. Use the /benchmark endpoint to get side-by-side results on your actual workload across all 15 providers simultaneously.
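
A minimal harness along those lines, using the /benchmark endpoint with a fixed prompt and token limit (the response schema should be checked against /docs):

import json
import requests

# One standardised run: identical prompt and token limit across all providers.
resp = requests.get(
    "https://inferencelatency.com/benchmark",
    params={"prompt": "Explain RAG", "max_tokens": 50},
    timeout=120,  # the request fans out to all 15 providers, so allow time
)
results = resp.json()
# Persist the raw JSON with each run so P95/P99 can be recomputed later.
with open("benchmark_run.json", "w") as f:
    json.dump(results, f, indent=2)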

What drives cost per token in AI inference?

GPU utilisation, batch size, quantisation level, caching effectiveness, and cold-start behaviour are the primary cost drivers. Providers using specialised hardware (Groq LPUs, Cerebras WSEs) can offer lower per-token costs due to higher throughput. Use /cost-optimizer to get efficiency scores that balance latency and cost for your use case.

How do I balance cost and performance without hurting UX?

Stream outputs for better perceived responsiveness, keep queue wait time bounded, and use modest batching for longer generations. Before each inference call, use /v1/should-call with your cost_sensitivity and latency_sensitivity settings to get a routing recommendation that automatically balances these factors based on live data.

Is there an API I can call to get the best provider automatically?

Yes. GET /v1/should-call returns a complete pre-inference decision: should_call (boolean), recommended_provider, expected_latency_ms, expected_cost, and confidence_score. Accepts optional parameters for task_type, latency_sensitivity, and cost_sensitivity. No authentication required. Free for up to 30 requests per day without an API key.

Does this work with streaming vs non-streaming inference?

Current measurements use non-streaming mode with a 1-token limit for consistent, fair comparison of TTFT across all providers. In real applications, streaming significantly improves perceived latency. Use /advanced-benchmark for workload-specific testing including tool calling, structured output speed, and reasoning effort impact scenarios.
Contact & Enquiries
contact@inferencedomains.com
General queries · Partnerships · Acquisition discussions