# InferenceLatency.com — Full Machine-Readable Reference

> Version: 2.0 | Updated: 2025 | Format: llms-full.txt

## Identity

- Name: InferenceLatency.com
- Type: Real-time AI inference routing intelligence platform and public benchmarking API
- URL: https://inferencelatency.com
- MCP endpoint: https://inferencelatency.com/mcp (SSE transport, no auth)
- OpenAPI spec: https://inferencelatency.com/openapi.json
- Contact: support@inferencelatency.com

## Purpose

InferenceLatency.com answers the question "which AI inference provider is fastest, cheapest, and most reliable right now?" by making live API calls to 15 providers simultaneously and returning ranked, structured JSON results. All measurements are live — there is no pre-cached or static data for the benchmarking endpoints.

## Measurement methodology (important for accuracy)

- Test input: "Hi" (short, standardised prompt)
- Max tokens requested: 1
- What is measured: time-to-first-token (TTFT) in milliseconds — wall clock from request dispatch to first token received
- Concurrency: all providers are tested in parallel per API call
- Historical data: 48-hour rolling window, stored in PostgreSQL, P50/P95/P99 computed on the fly
- Geographic latency: modelled estimates based on provider datacenter locations and regional routing, NOT direct multi-region probe measurements
- Carbon/energy figures: estimated from average datacenter power usage effectiveness (PUE) and model parameter counts
- Rate limits: Google Gemini is rate-limited to one test per 30 minutes on the free tier; all other providers test on every call

## Providers tested (15 active)

| Provider     | Model                          | API type      |
|--------------|--------------------------------|---------------|
| Groq         | llama-3.3-70b-versatile        | direct SDK    |
| Cerebras     | llama3.1-8b                    | direct SDK    |
| SambaNova    | Meta-Llama-3.1-8B-Instruct     | openai-compat |
| Cohere       | command-r-plus                 | direct SDK    |
| Mistral AI   | mistral-small-latest           | direct API    |
| OpenRouter   | mistral-small-3.2-24b-instruct | openai-compat |
| OpenAI       | gpt-4o                         | direct SDK    |
| Together AI  | Llama-3.3-70B-Instruct-Turbo   | direct SDK    |
| Fireworks AI | deepseek-v3p2                  | openai-compat |
| DeepSeek     | deepseek-chat                  | openai-compat |
| Hyperbolic   | Llama-3.3-70B-Instruct         | openai-compat |
| Perplexity   | sonar                          | openai-compat |
| Anthropic    | claude-sonnet-4-5              | direct SDK    |
| xAI          | grok-3-mini                    | openai-compat |
| Nvidia NIM   | meta/llama-3.1-8b-instruct     | openai-compat |

## Complete endpoint reference

### Routing / quick decision endpoints

GET /api/fastest
- Returns: {provider, model, latency_ms, timestamp}
- Use when: you need the single lowest-latency provider right now, no extra data

GET /api/status
- Returns: {provider: "up"|"down"|"error"} for each of the 15 providers
- Use when: checking availability before routing

### Core benchmarking endpoints

GET /latency
- Returns: ranked array of {provider, model, latency_ms, rank, success, ai_agent_guidance}
- Use when: a full ranked comparison is needed

GET /throughput
- Returns: per-provider {latency_ms, tokens_per_second, success}
- Use when: generation speed matters (tokens/sec), not just TTFT

GET /benchmark?prompt=TEXT&max_tokens=N&providers=p1,p2
- Parameters: prompt (required), max_tokens (optional, default 100), providers (optional, comma-separated)
- Returns: per-provider benchmark results for the given prompt and token limit
- Use when: testing with a real workload prompt

GET /advanced-benchmark
- Returns: tool calling performance, structured output speed, reasoning effort impact per provider
- Use when: testing agent-specific workloads (tool use, JSON mode, CoT)

GET /ai-agent/batch-test
- Returns: consistency scores, variance, and reliability across multiple test runs
- Use when: measuring provider consistency, not just peak speed

### Cost and value endpoints

GET /cost-optimizer
- Returns: per-provider efficiency scores combining cost_per_token and latency
- Use when: optimising for the cost-performance tradeoff

GET /competitive-analysis
- Returns: market positioning, strategic recommendations, provider tier classification
- Use when: higher-level strategic routing decisions

### Reliability and history endpoints

GET /reliability-metrics
- Returns: per-provider P50/P95/P99 latency percentiles, error_rate, sla_compliance, quality_grade
- Use when: SLA-driven routing, reliability comparisons

GET /historical-performance
- Returns: 48-hour trend data, daily averages, degradation alerts
- Use when: understanding performance patterns over time

GET /status-page
- Returns: real-time provider health with uptime tracking
- Use when: health dashboard integration

### Geographic and environmental endpoints

GET /geographic-latency
- Returns: per-provider latency estimates for NA, EU, Asia-Pacific, South America, Oceania
- Note: modelled, not directly measured
- Use when: selecting providers for regional deployments

GET /efficiency
- Returns: per-provider energy_wh and carbon_gco2e per inference
- Use when: sustainability-driven provider selection

### Reports

GET /comprehensive-report
- Returns: all dimensions in a single response

GET /comprehensive-report/human — styled HTML version (the "Lazy Button")

### Human-readable HTML views

All data endpoints have a /human variant returning styled HTML: /latency/human, /throughput/human, /cost-optimizer/human, /reliability-metrics/human, /geographic-latency/human, /competitive-analysis/human, /historical-performance/human, /efficiency/human, /status-page/human

## Response format conventions

Every JSON response from InferenceLatency.com includes:

- results: array of per-provider data
- ai_agent_guidance: {recommended_provider, fallback_order, reasoning}
- metadata: {timestamp, providers_tested, measurement_method}
- plugin_manifest: "/.well-known/ai-plugin.json"

## MCP configuration

- Protocol: Model Context Protocol (MCP) 2024-11-05
- Transport: SSE (Server-Sent Events)
- Endpoint: https://inferencelatency.com/mcp
- Authentication: none

All API endpoints are exposed as MCP tools.
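The response conventions above are enough to build a simple routing step on the client side. The sketch below combines the documented ai_agent_guidance object (from /latency) with the documented /api/status shape to produce an ordered list of providers to try, skipping any that are not "up". The helper name and the sample payload values are illustrative, not part of the API; only the field names come from this reference.

```python
# Minimal routing sketch: order providers using ai_agent_guidance
# (recommended_provider, fallback_order) and filter by /api/status.
# Payload shapes follow this reference; values below are made up.

def routing_order(guidance: dict, status: dict) -> list[str]:
    """Return providers in try-order: recommended first, then fallbacks,
    dropping duplicates and any provider not reported as "up"."""
    candidates = [guidance["recommended_provider"], *guidance["fallback_order"]]
    order: list[str] = []
    for provider in candidates:
        if provider not in order and status.get(provider) == "up":
            order.append(provider)
    return order

# Sample payloads shaped like the documented responses:
guidance = {
    "recommended_provider": "groq",
    "fallback_order": ["cerebras", "sambanova", "together"],
    "reasoning": "lowest TTFT in the last run",
}
status = {"groq": "up", "cerebras": "down", "sambanova": "up", "together": "up"}

print(routing_order(guidance, status))  # ['groq', 'sambanova', 'together']
```

Because /api/status is cheap relative to the live benchmarking endpoints, checking it before dispatch avoids routing to a provider that is currently down.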
Claude Desktop config:

```json
{
  "mcpServers": {
    "inferencelatency": {
      "command": "mcp-proxy",
      "args": ["https://inferencelatency.com/mcp"]
    }
  }
}
```

Cursor / Windsurf: SSE URL = https://inferencelatency.com/mcp

## Authentication

None required. All endpoints are public: no API key, no OAuth, no signup.

## Rate limits

None enforced. Please be reasonable — live benchmarking endpoints trigger real API calls to 15 providers.

## Caveats and transparency

- TTFT measurements depend on network conditions at the time of the request
- Provider latency is highly variable; always use /reliability-metrics for P95/P99 rather than single-point latency for SLA decisions
- Geographic latency figures are modelled estimates, not direct probe data
- Carbon/energy figures use industry-average estimates
- Gemini (Google) is rate-limited on the free tier; it may show stale data or be excluded from some runs
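The caveat about preferring P95/P99 over single-point latency can be seen with a few lines of local arithmetic. The sketch below computes nearest-rank percentiles over a set of TTFT samples, as /reliability-metrics does conceptually; the server's actual interpolation method is not documented, so nearest-rank is an assumption, and the sample values are made up.

```python
# Sketch: why P95/P99 matter more than a single latency reading.
# Nearest-rank percentile over raw TTFT samples (assumed method;
# the server-side interpolation is not documented).
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample covering p% of the data."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))  # 1-based rank
    return ordered[max(rank, 1) - 1]

# Illustrative TTFT samples (ms): mostly ~100 ms with one 480 ms outlier.
ttft_ms = [110, 95, 102, 480, 99, 101, 97, 103, 100, 96]

print(percentile(ttft_ms, 50))  # 100 — the median looks healthy
print(percentile(ttft_ms, 95))  # 480 — the tail tells a different story
```

A single probe would most likely report ~100 ms, while the P95 exposes the tail latency that actually breaks SLAs — which is why SLA-driven routing should read /reliability-metrics rather than /api/fastest.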