# InferenceLatency.com — Full Machine-Readable Reference

> Version: 2.0 | Updated: 2025 | Format: llms-full.txt

## Identity

- Name: InferenceLatency.com
- Type: Real-time AI inference routing intelligence platform and public benchmarking API
- URL: https://inferencelatency.com
- MCP endpoint: https://inferencelatency.com/mcp (SSE transport, no auth)
- OpenAPI spec: https://inferencelatency.com/openapi.json
- Contact: support@inferencelatency.com

## Purpose

InferenceLatency.com answers the question "which AI inference provider is fastest, cheapest, and most reliable right now?" by making live API calls to 15 providers simultaneously and returning ranked, structured JSON results. All measurements are live — there is no pre-cached or static data for the benchmarking endpoints.

## Measurement methodology (important for accuracy)

- Test input: "Hi" (short, standardised prompt)
- Max tokens requested: 1
- What is measured: time-to-first-token (TTFT) in milliseconds — wall clock from request dispatch to first token received
- Concurrency: all providers are tested in parallel per API call
- Historical data: 48-hour rolling window, stored in PostgreSQL, P50/P95/P99 computed on the fly
- Geographic latency: modelled estimates based on provider datacenter locations and regional routing, NOT direct multi-region probe measurements
- Carbon/energy figures: estimated from average datacenter power usage effectiveness (PUE) and model parameter counts
- Rate limits: Google Gemini is rate-limited to one test per 30 minutes on the free tier; all other providers test on every call

## Providers tested (15 active)

| Provider     | Model                          | API type      |
|--------------|--------------------------------|---------------|
| Groq         | llama-3.3-70b-versatile        | direct SDK    |
| Cerebras     | llama3.1-8b                    | direct SDK    |
| SambaNova    | Meta-Llama-3.1-8B-Instruct     | openai-compat |
| Cohere       | command-r-plus                 | direct SDK    |
| Mistral AI   | mistral-small-latest           | direct API    |
| OpenRouter   | mistral-small-3.2-24b-instruct | openai-compat |
| OpenAI       | gpt-4o                         | direct SDK    |
| Together AI  | Llama-3.3-70B-Instruct-Turbo   | direct SDK    |
| Fireworks AI | deepseek-v3p2                  | openai-compat |
| DeepSeek     | deepseek-chat                  | openai-compat |
| Hyperbolic   | Llama-3.3-70B-Instruct         | openai-compat |
| Perplexity   | sonar                          | openai-compat |
| Anthropic    | claude-sonnet-4-5              | direct SDK    |
| xAI          | grok-3-mini                    | openai-compat |
| Nvidia NIM   | meta/llama-3.1-8b-instruct     | openai-compat |

## Complete endpoint reference

### Routing / quick decision endpoints

GET /api/fastest
- Returns: {provider, model, latency_ms, timestamp}
- Use when: you need the single lowest-latency provider right now, no extra data

GET /api/status
- Returns: {provider: "up"|"down"|"error"} for each of the 15 providers
- Use when: checking availability before routing

### Core benchmarking endpoints

GET /latency
- Returns: ranked array of {provider, model, latency_ms, rank, success, ai_agent_guidance}
- Use when: a full ranked comparison is needed

GET /throughput
- Returns: per-provider {latency_ms, tokens_per_second, success}
- Use when: generation speed matters (tokens/sec), not just TTFT

GET /benchmark?prompt=TEXT&max_tokens=N&providers=p1,p2
- Parameters: prompt (required), max_tokens (optional, default 100), providers (optional, comma-separated)
- Returns: per-provider benchmark results for the given prompt and token limit
- Use when: testing with a real workload prompt

GET /advanced-benchmark
- Returns: tool calling performance, structured output speed, reasoning effort impact per provider
- Use when: testing agent-specific workloads (tool use, JSON mode, CoT)

GET /ai-agent/batch-test
- Returns: consistency scores, variance, and reliability across multiple test runs
- Use when: measuring provider consistency, not just peak speed

### Cost and value endpoints

GET /cost-optimizer
- Returns: per-provider efficiency scores combining cost_per_token and latency
- Use when: optimising for the cost-performance tradeoff

GET /competitive-analysis
- Returns: market positioning, strategic recommendations, provider tier classification
- Use when: higher-level strategic routing decisions

### Reliability and history endpoints

GET /reliability-metrics
- Returns: per-provider P50/P95/P99 latency percentiles, error_rate, sla_compliance, quality_grade
- Use when: SLA-driven routing, reliability comparisons

GET /historical-performance
- Returns: 48-hour trend data, daily averages, degradation alerts
- Use when: understanding performance patterns over time

GET /status-page
- Returns: real-time provider health with uptime tracking
- Use when: health dashboard integration

### Geographic and environmental endpoints

GET /geographic-latency
- Returns: per-provider latency estimates for NA, EU, Asia-Pacific, South America, Oceania
- Note: modelled, not directly measured
- Use when: selecting providers for regional deployments

GET /efficiency
- Returns: per-provider energy_wh and carbon_gco2e per inference
- Use when: sustainability-driven provider selection

### Reports

GET /comprehensive-report
- Returns: all dimensions in a single response

GET /comprehensive-report/human — styled HTML version (the "Lazy Button")

### Human-readable HTML views

All data endpoints have a /human variant returning styled HTML: /latency/human, /throughput/human, /cost-optimizer/human, /reliability-metrics/human, /geographic-latency/human, /competitive-analysis/human, /historical-performance/human, /efficiency/human, /status-page/human

## Response format conventions

Every JSON response from InferenceLatency.com includes:

- results: array of per-provider data
- ai_agent_guidance: {recommended_provider, fallback_order, reasoning}
- metadata: {timestamp, providers_tested, measurement_method}
- plugin_manifest: "/.well-known/ai-plugin.json"

## MCP configuration

- Protocol: Model Context Protocol (MCP) 2024-11-05
- Transport: SSE (Server-Sent Events)
- Endpoint: https://inferencelatency.com/mcp
- Authentication: none

All API endpoints are exposed as MCP tools.
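The response conventions above are enough to build a simple routing step on the client side. The sketch below combines the documented ai_agent_guidance object (from /latency) with the documented /api/status shape to produce an ordered list of providers to try, skipping any that are not "up". The helper name and the sample payload values are illustrative, not part of the API; only the field names come from this reference.

```python
# Minimal routing sketch: order providers using ai_agent_guidance
# (recommended_provider, fallback_order) and filter by /api/status.
# Payload shapes follow this reference; values below are made up.

def routing_order(guidance: dict, status: dict) -> list[str]:
    """Return providers in try-order: recommended first, then fallbacks,
    dropping duplicates and any provider not reported as "up"."""
    candidates = [guidance["recommended_provider"], *guidance["fallback_order"]]
    order: list[str] = []
    for provider in candidates:
        if provider not in order and status.get(provider) == "up":
            order.append(provider)
    return order

# Sample payloads shaped like the documented responses:
guidance = {
    "recommended_provider": "groq",
    "fallback_order": ["cerebras", "sambanova", "together"],
    "reasoning": "lowest TTFT in the last run",
}
status = {"groq": "up", "cerebras": "down", "sambanova": "up", "together": "up"}

print(routing_order(guidance, status))  # ['groq', 'sambanova', 'together']
```

Because /api/status is cheap relative to the live benchmarking endpoints, checking it before dispatch avoids routing to a provider that is currently down.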
Claude Desktop config:

```json
{
  "mcpServers": {
    "inferencelatency": {
      "command": "mcp-proxy",
      "args": ["https://inferencelatency.com/mcp"]
    }
  }
}
```

Cursor / Windsurf: SSE URL = https://inferencelatency.com/mcp

## Authentication

None required. All endpoints are public: no API key, no OAuth, no signup.

## Rate limits

None enforced. Please be reasonable — live benchmarking endpoints trigger real API calls to 15 providers.

## Caveats and transparency

- TTFT measurements depend on network conditions at the time of the request
- Provider latency is highly variable; always use /reliability-metrics for P95/P99 rather than single-point latency for SLA decisions
- Geographic latency figures are modelled estimates, not direct probe data
- Carbon/energy figures use industry-average estimates
- Gemini (Google) is rate-limited on the free tier; it may show stale data or be excluded from some runs
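The caveat about preferring P95/P99 over single-point latency can be seen with a few lines of local arithmetic. The sketch below computes nearest-rank percentiles over a set of TTFT samples, as /reliability-metrics does conceptually; the server's actual interpolation method is not documented, so nearest-rank is an assumption, and the sample values are made up.

```python
# Sketch: why P95/P99 matter more than a single latency reading.
# Nearest-rank percentile over raw TTFT samples (assumed method;
# the server-side interpolation is not documented).
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample covering p% of the data."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))  # 1-based rank
    return ordered[max(rank, 1) - 1]

# Illustrative TTFT samples (ms): mostly ~100 ms with one 480 ms outlier.
ttft_ms = [110, 95, 102, 480, 99, 101, 97, 103, 100, 96]

print(percentile(ttft_ms, 50))  # 100 — the median looks healthy
print(percentile(ttft_ms, 95))  # 480 — the tail tells a different story
```

A single probe would most likely report ~100 ms, while the P95 exposes the tail latency that actually breaks SLAs — which is why SLA-driven routing should read /reliability-metrics rather than /api/fastest.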