OpenAI and Anthropic incident beta

Turn failing AI API traces into next actions

Paste one failed OpenAI or Anthropic event. Get ranked causes, evidence, and the safest code change to try.

Join the incident beta

See supported failures

OpenAI traces

Anthropic failures

causely.megaloop.app/incident/req_8b2c9a

failed event 429

provider=openai

status=429 rate_limit_exceeded

request=req_8b2c9a, responses.create

clue=burst after model switch, 38 req in 7s

retry=fixed 250ms delay, no jitter

{ request: { model: 'gpt-4.1-mini' },

headers: { retry_after_ms: 0 },

error: 'rate limit reached' }

Top hypothesis

Acceleration window exceeded

rank 1

Burst traffic followed a model switch. The retry path waits a fixed delay, so requests re-collide.

38 req in 7s

no jitter

429 body match

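import OpenAI from "openai";

const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

// The call from the failed event: responses.create on gpt-4.1-mini.
// The input string is a placeholder.
const apiCall = () =>
  client.responses.create({ model: "gpt-4.1-mini", input: "ping" });

// retry() is a backoff helper you supply; it is not exported by the openai package.
await retry(apiCall, {
  retries: 4,
  minDelayMs: 500,
  factor: 2,
  jitter: true, // randomize each delay so retries stop re-colliding
});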
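The snippet above calls a `retry` helper without defining it. A minimal sketch of one, assuming the same option names (`retries`, `minDelayMs`, `factor`, `jitter`), could look like the block below. The names `RetryOptions`, `delayFor`, and `sleep` are illustrative and not part of any SDK, and the sketch treats every error as retryable, which real code should narrow.

```typescript
// Hypothetical helper: retry, RetryOptions, and delayFor are illustrative
// names and are not part of the openai package.
interface RetryOptions {
  retries: number;    // maximum number of retries after the first attempt
  minDelayMs: number; // base delay before the first retry
  factor: number;     // multiplier applied on each further retry
  jitter: boolean;    // randomize each delay so clients do not re-collide
}

// Delay before retry number `attempt` (0-based). `rand` is injectable for tests.
function delayFor(
  attempt: number,
  opts: RetryOptions,
  rand: () => number = Math.random,
): number {
  const ceiling = opts.minDelayMs * Math.pow(opts.factor, attempt);
  // Full jitter: pick uniformly in [0, ceiling) instead of waiting exactly ceiling.
  return opts.jitter ? rand() * ceiling : ceiling;
}

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

async function retry<T>(fn: () => Promise<T>, opts: RetryOptions): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      // Assumption: every error is retryable. Real code should only retry
      // transient failures such as 429 or 5xx responses.
      if (attempt >= opts.retries) throw err;
      await sleep(delayFor(attempt, opts));
    }
  }
}
```

With `jitter: false` the waits are 500, 1000, 2000, and 4000 ms. With `jitter: true` each wait is drawn at random below those values, which is what keeps a burst of 38 requests from retrying in lockstep.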

The painful moment is not a missing chart.

It is the gap between a raw provider failure and the next safe change.

429 rate limit

A burst looks like quota trouble until the request timing is visible.

Model mismatch

Version drift hides inside payloads and vendor error bodies.

529 overload

Provider pressure needs a different next step than local retry bugs.
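To illustrate why these failures call for different next steps, here is a rough sketch that routes a failed event by status and error code. The type names, the `nextStep` function, and the mapping itself are assumptions made for illustration, not causely's actual ranking logic. The only provider details relied on are that OpenAI reports exhausted quota as a 429 with the code `insufficient_quota` and that Anthropic uses 529 for overload.

```typescript
// Illustrative only: these types and the routing rules are assumptions.
type Provider = "openai" | "anthropic";

type NextStep =
  | "fix-local-retry"   // add backoff and jitter on the client
  | "check-quota"       // billing or quota cap, retrying will not help
  | "wait-for-provider" // provider-side pressure, back off or fail over
  | "inspect-request";  // payload, model name, or version problem

interface FailedEvent {
  provider: Provider;
  status: number;
  errorCode?: string; // e.g. "rate_limit_exceeded", as in the sample event
}

function nextStep(e: FailedEvent): NextStep {
  // 529 signals provider overload, not a local retry bug.
  if (e.status === 529) return "wait-for-provider";
  if (e.status === 429) {
    // A 429 can mean quota exhaustion or a burst; the error code separates them.
    return e.errorCode === "insufficient_quota" ? "check-quota" : "fix-local-retry";
  }
  if (e.status >= 500) return "wait-for-provider";
  // Remaining 4xx responses usually point at the request itself.
  return "inspect-request";
}
```

A status code alone cannot rank causes. In the sample event the request timing (38 requests in 7s) is what moves the burst hypothesis ahead of quota.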

Ranked hypotheses, not oracles

What is likely, why, and what to try first.

The detail view keeps uncertainty visible while making the next SDK change copyable.

ranked cause detail

Burst exceeded provider acceleration window

Evidence matches timing, provider status, and retry behavior.

Quota or billing cap

Possible, but weaker than the event timing.

Malformed retry payload

Possible, but weaker than the event timing.

selected clue

confidence: medium

safe patch

const delay = backoff({
  attempt,
  min: 500,
  factor: 2,
  jitter: true,
});

await sleep(delay);

Try capped exponential backoff before scaling queues.

Caveat: confirm quota is healthy before widening concurrency.
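The patch above recommends capped backoff, but the snippet passes no upper bound. One way to sketch a `backoff` function with a cap is shown below. The `max` parameter and its 8000 ms default are hypothetical values chosen for illustration, and `rand` exists only so the function can be tested deterministically.

```typescript
// Hypothetical signature: `max` and `rand` are additions for illustration
// and do not appear in the original patch.
interface BackoffInput {
  attempt: number;      // 0-based retry counter
  min: number;          // base delay in milliseconds
  factor: number;       // growth multiplier per attempt
  jitter: boolean;      // randomize the delay below the capped value
  max?: number;         // upper bound in milliseconds (assumed default: 8000)
  rand?: () => number;  // random source in [0, 1), injectable for tests
}

function backoff({
  attempt,
  min,
  factor,
  jitter,
  max = 8000,
  rand = Math.random,
}: BackoffInput): number {
  // Exponential growth, clamped so late attempts never wait longer than max.
  const capped = Math.min(max, min * Math.pow(factor, attempt));
  return jitter ? rand() * capped : capped;
}
```

Without the cap, attempt 10 at `min: 500, factor: 2` would wait 512 seconds. With it, the wait stays at or below `max`.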

Narrow scope is the trust signal.

causely ranks hypotheses for recurring OpenAI and Anthropic failure classes.

/rate limits

/quota or billing

/request shape

/model or version

/timeouts

/OpenAI 500s

/Anthropic 529s

/acceleration limits

Ranked hypotheses, not guaranteed root cause.

The beta is judged by useful suggestions.

Pilot numbers are targets, not claimed proof.

100

labeled pilot incidents target

pilot teams target

60%

top suggestion useful target

<10m

median time to next action target

Founding pilot

Join if you can bring real incidents.

For teams shipping customer-facing AI features on OpenAI or Anthropic.


Waitlist access first

Incident review with labeled outcomes

Pilot terms finalized with first teams

Specific about what it does not do.

The first public beta stays narrow enough for engineers to trust.

It does not declare a root cause. It ranks likely causes and next actions. The engineer still owns the incident.

Bring one failing trace.

Help shape a beta judged by useful next actions.