Troubleshooting

Common problems and fixes for failed scans, CI key rejections, concurrency limits, and customer agents that do not claim jobs.

Crawl failures

When a run shows Failed on the dashboard or in run history, open the run detail—the error message is the starting point. The run detail page also links here when a crawl fails.

Symptom	What to check
Run failed immediately	Confirm the sitemap URL is reachable from the execution route (cloud vs customer agent). Check auth, firewall, and that the URL returns valid XML—not an HTML login page.
Sitemap fetch or parse error	Typical messages: `Failed to fetch sitemap`, HTTP 4xx/5xx on the sitemap, `is not valid XML`, or an empty urlset. Verify the sitemap in a browser or with `curl`; fix redirects, compression, and namespace issues.
Timeout or cancelled mid-crawl	Large sites may need a higher page limit or lower concurrency in advanced options. Agent crawls can also hit API report limits—see Agents and Customer agent setup.
Run stuck on Pending or Running	Pending on an agent route usually means no agent claimed the job (pool mismatch or agent offline). Running with no progress may be a stuck claim—see Agents. Cloud runs that hang may eventually fail via stale-job reconciliation.
Only one crawl at a time per site	A second start for the same sitemap while another run is active may be rejected or queued depending on trigger. Finish or cancel the active run first.
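
The sitemap checks in the table above can be approximated locally before starting a run. The following is a minimal sketch, not Signal Diff's own validation: the function names and the returned messages are hypothetical, and it does not handle gzip-compressed sitemaps.

```python
import urllib.error
import urllib.request
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def diagnose_sitemap(body: bytes) -> str:
    """Classify a sitemap response body; returns 'ok' or a short reason."""
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        # Most HTML (for example a login page) is not well-formed XML.
        if b"<html" in body[:2048].lower():
            return "HTML page, not a sitemap"
        return "not valid XML"

    tag = root.tag.rsplit("}", 1)[-1]  # drop the "{namespace}" prefix
    if tag.lower() == "html":
        return "HTML page, not a sitemap"
    if tag not in ("urlset", "sitemapindex"):
        return f"unexpected root element <{tag}>"

    # Any non-empty <loc> counts, whether or not it is namespaced.
    has_loc = any(
        el.tag.rsplit("}", 1)[-1] == "loc" and (el.text or "").strip()
        for el in root.iter()
    )
    if not has_loc:
        return "empty urlset"
    if not root.tag.startswith(SITEMAP_NS):
        return "missing or unexpected sitemap namespace"
    return "ok"


def fetch_and_diagnose(url: str, timeout: float = 10.0) -> str:
    """Fetch the sitemap from this machine and classify the response."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return diagnose_sitemap(resp.read())
    except urllib.error.HTTPError as exc:  # must precede URLError (subclass)
        return f"HTTP {exc.code} on the sitemap"
    except urllib.error.URLError as exc:
        return f"failed to fetch sitemap: {exc.reason}"
```

Run `fetch_and_diagnose` from the same network as the execution route. A sitemap that is reachable from your laptop may still be blocked for a cloud run, and the reverse holds for internal URLs served to a customer agent.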

To interpret completed runs (errors vs warnings vs deploy diff), see Reading your report.

Rate limits (429 — too many concurrent crawls)

Signal Diff limits how many crawls can be Pending or Running at once per account. Manual dashboard runs, schedules, CI triggers, and agent-backed jobs all count toward the same tenant limit.

Account state	Max concurrent crawls
Free (no active paid API key)	1
Paid (at least one active API key)	3

When you exceed the limit, new starts return `429 Too Many Requests` with a message like `Too many crawls are already running for your account.`

Open the dashboard and cancel Pending or Running jobs, or wait for them to finish. This includes agent crawls you forgot about.
CI pipelines that retry quickly can hit 429 until an earlier run finishes; stagger workflows or cancel stuck runs.
Plan limits and upgrade path: Plans and limits — concurrent crawls and Pricing.
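
One way to keep a pipeline from retrying too quickly after a 429 is exponential backoff between attempts. The sketch below assumes a hypothetical `trigger` callable that starts a crawl and returns the HTTP status code. The delays are illustrative and not values recommended by Signal Diff.

```python
import time
from typing import Callable


def start_with_backoff(
    trigger: Callable[[], int],
    max_attempts: int = 5,
    base_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call trigger() until it returns something other than 429.

    Waits base_delay, then 2x, 4x, ... seconds between attempts and
    returns the last status code seen.
    """
    status = 429
    for attempt in range(max_attempts):
        status = trigger()
        if status != 429:
            return status
        # No point sleeping after the final attempt.
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return status
```

Backoff only helps when an earlier run will actually finish. If a run is stuck in Pending or Running, cancel it instead of waiting.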

CI (401, 404, and fail_mode)

CI triggers require a paid plan, repository secrets, and a reachable sitemap (or agent routing for internal URLs). Step-by-step setup: CI and GitHub Actions setup.

Symptom	Likely fix
401 on `POST /api/trigger/ci`	Wrong, revoked, or expired `SIGNALDIFF_CI_API_KEY`. Create or rotate on Developers → API keys, update every repository secret, and re-run. See API keys lifecycle guide.
404 on `/api/trigger/ci`	`SIGNALDIFF_API_BASE_URL` must be the site origin only—for example `https://signaldiff.dev`, not `https://signaldiff.dev/api`. The action posts to `{origin}/api/trigger/ci`; an extra `/api` produces `/api/api/trigger/ci`.
429 on CI trigger	Tenant concurrent crawl limit—see Rate limits. Wait for or cancel active runs on your account.
Workflow failed but crawl “succeeded”	`fail_mode` gates the job after the crawl completes. `error` fails when `errorCount > 0`; `errorOrWarning` also fails on warnings; `none` never fails on finding counts alone. Crawl execution failures (timeout, unreachable sitemap) always fail regardless of mode.
Workflow passed but you expected failure	Check `fail_mode: none` (report-only). For pull requests, gating on new findings since the CI baseline is often better than raw totals—see Baselines and diffs.
CI crawl stays Pending (agent mode)	`execution_mode: agent` requires a running enrolled agent with a matching `agent_pool_id`—see Agents.
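
The 404 row above comes down to how the trigger URL is built from `SIGNALDIFF_API_BASE_URL`. The sketch below is illustrative only: `naive_trigger_url` mimics the `{origin}/api/trigger/ci` concatenation described above, and `ci_trigger_url` is a hypothetical pre-check that reduces the configured value to its origin first.

```python
from urllib.parse import urlsplit

TRIGGER_PATH = "/api/trigger/ci"


def naive_trigger_url(base_url: str) -> str:
    """Plain concatenation: any path in base_url leaks into the result."""
    return base_url.rstrip("/") + TRIGGER_PATH


def ci_trigger_url(base_url: str) -> str:
    """Keep only scheme and host (the origin), then append the path."""
    parts = urlsplit(base_url)
    if not parts.scheme or not parts.netloc:
        raise ValueError(f"not an absolute URL: {base_url!r}")
    return f"{parts.scheme}://{parts.netloc}{TRIGGER_PATH}"


if __name__ == "__main__":
    bad = "https://signaldiff.dev/api"
    print(naive_trigger_url(bad))  # https://signaldiff.dev/api/api/trigger/ci
    print(ci_trigger_url(bad))     # https://signaldiff.dev/api/trigger/ci
```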

Full fail_mode table and PR comment permissions: CI setup — fail mode.
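
The `fail_mode` rules in the table above can be read as a small decision function. This is a sketch of the documented behavior, not the action's actual source. The function name and the `crawl_failed` flag are assumptions made for illustration.

```python
def ci_should_fail(
    fail_mode: str,
    error_count: int,
    warning_count: int,
    crawl_failed: bool = False,
) -> bool:
    """Return True when the CI job should fail for the given results."""
    # Execution failures (timeout, unreachable sitemap) fail in every mode.
    if crawl_failed:
        return True
    if fail_mode == "none":
        return False  # report-only: finding counts never fail the job
    if fail_mode == "error":
        return error_count > 0
    if fail_mode == "errorOrWarning":
        return error_count > 0 or warning_count > 0
    raise ValueError(f"unknown fail_mode: {fail_mode!r}")
```

For example, a crawl that completes with 0 errors and 3 warnings passes under `error` but fails under `errorOrWarning`, which explains a workflow that failed even though the crawl itself succeeded.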

Customer agents

Agents pull work over HTTPS; jobs stay Pending until an authenticated agent claims them. For enroll, install, pools, and rotation, use Customer agents and the full Customer agent setup guide—this section covers common failure modes.

Symptom	What to check
Heartbeat fails (401/403)	Credential expired or rotated—re-enroll on Customer agents and update `appsettings.json` or container env on every host. Verify `ApiBaseUrl` matches your site origin.
Heartbeat OK, no crawls	Job `executionMode` must be `agent`. API `Features:EnableAgentRouting` must be true. Jobs belong to the GitHub user who started them—the agent credential tenant must match.
Jobs stay Pending	Agent not running, outbound HTTPS blocked, or pool mismatch: crawl `agentPoolId` must match the enrolled agent (empty string = default pool on both sides).
UI shows Running, agent idle	Stuck run: a prior claim set Running but the agent exited before `report`. Agents only claim Pending jobs. Wait for stale reconciliation (hours) or clear the job when the UI allows.
429 on enroll / heartbeat / claim	Per-agent protocol rate limits—avoid duplicate processes with the same agent ID and reduce aggressive polling.
Report or progress errors (500/403)	Large crawls use chunked upload; oversized batches or missing API endpoints can fail finalize. Cap pages with crawl `MaxPages` or see report page coverage for storage limits and `AgentProtocol:MaxReportPages`. Late progress POSTs after finalize may log harmless 403 replay errors—upgrade the agent if noisy.
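
When a job stays Pending, it can help to walk the matching conditions from the table in one pass. The checklist below is a hypothetical diagnostic, not the server's claim logic: the `Job` and `Agent` types and their field names are assumptions that mirror `executionMode`, `agentPoolId`, and the tenant rule described above.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Job:
    execution_mode: str  # crawl executionMode, e.g. "agent"
    agent_pool_id: str   # crawl agentPoolId; "" means the default pool
    status: str          # "Pending", "Running", ...
    tenant: str          # account that started the job


@dataclass
class Agent:
    pool_id: str         # pool the agent enrolled in; "" means default
    tenant: str          # account the agent credential belongs to
    running: bool        # process is up and heartbeating


def claim_blockers(job: Job, agent: Agent) -> list[str]:
    """List the reasons this agent would not claim this job."""
    reasons = []
    if job.execution_mode != "agent":
        reasons.append("executionMode is not 'agent'")
    if job.status != "Pending":
        reasons.append(f"job is {job.status}; agents only claim Pending jobs")
    if (job.agent_pool_id or "") != (agent.pool_id or ""):
        reasons.append("pool mismatch between crawl and agent")
    if job.tenant != agent.tenant:
        reasons.append("agent credential belongs to a different tenant")
    if not agent.running:
        reasons.append("agent is not running")
    return reasons
```

An empty list means none of these documented conditions explains the stall, so the next things to check are outbound HTTPS from the agent host and the `Features:EnableAgentRouting` setting on the API.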

Production checklist: keep heartbeats running, alert when claims stop, and align schedule/CI pool IDs with enrolled agents. For operator detail on stale heartbeats and fleet status, see the repository doc `docs/agent-offline-and-heartbeat.md`, which links back to this page as the canonical user-facing summary.

Report page coverage (large sites)

Signal Diff crawls every URL in your sitemap, but per-page detail in the dashboard, HTML export, and stored run payload is capped to keep Cosmos documents and API responses bounded. Site-wide counts (errors, warnings, info, total pages) always reflect the full crawl.

When stored detail covers fewer pages than were crawled, the run report shows a yellow banner reading "Not all findings are listed below", along with the stored and total page counts.

Default cap and selection order

The API stores at most 25 pages per run by default (`AgentProtocol:MaxReportPages`). When a crawl exceeds that limit, pages are ranked and the highest-priority rows are kept:

1. Pages with error-level findings (including HTTP errors)
2. Pages with warning-level findings
3. Pages with info-level findings
4. Clean pages (no findings), in crawl order

Within each page, individual findings may also be trimmed when a page has many issues (the highest-severity findings are kept first). Customer agents use the same selection before upload; align `SignalDiffAgent:MaxReportPages` with the API setting if you raise the cap.
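
The page selection order can be sketched as a stable sort on each page's worst severity, followed by a cut at the cap. This is an illustration under assumptions rather than the API's implementation: the `Page` type is hypothetical, and keeping crawl order inside the error, warning, and info tiers is an assumption (the text above only states it for clean pages).

```python
from dataclasses import dataclass, field

SEVERITY_RANK = {"error": 0, "warning": 1, "info": 2}
CLEAN_RANK = 3  # pages with no findings sort last


@dataclass
class Page:
    url: str
    # Severity of each finding on the page; HTTP errors count as "error".
    severities: list = field(default_factory=list)


def page_rank(page: Page) -> int:
    """Rank a page by its most severe finding (lower sorts first)."""
    return min((SEVERITY_RANK[s] for s in page.severities), default=CLEAN_RANK)


def select_pages(pages: list, max_pages: int = 25) -> list:
    """Keep the highest-priority pages, up to max_pages.

    sorted() is stable, so pages with equal rank stay in crawl order.
    """
    return sorted(pages, key=page_rank)[:max_pages]
```

With a cap of 3 and a crawl order of clean, warning, error, info, clean, the stored pages would be the error page, the warning page, and the info page. Both clean pages are dropped from per-page detail but still count toward the site-wide totals.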

Deploy diff URL list

The deploy diff card compares this run to your baseline. The URLs changed count is site-wide, but the expandable path-level list is capped at 100 rows, prioritizing new errors, new warnings, finding changes, then title/description/status changes.

What you can do

Goal	Option
Sample a large site	Set crawl `MaxPages` in advanced options, schedules, or CI payload so the crawl itself stops after N URLs.
See more per-page detail in reports	Operators can raise `AgentProtocol:MaxReportPages` on the API (Azure app setting `AgentProtocol__MaxReportPages`). Higher values increase Cosmos document size and API response time—see the operator runbook below.
Agent report upload failures	Very large chunked uploads can fail at the gateway. Lower `MaxPages` on the crawl or reduce stored pages before re-running. See Agents for chunked upload errors.

Operator guidance for raising the API cap: repository doc docs/agent-offline-and-heartbeat.md (MaxReportPages). Dashboard overview cards and top issues use site-wide totals even when per-page detail is partial—see Reading your report — export.

Other common issues

Symptom	What to check
Cannot sign in	Use Sign in with GitHub and complete authorization. Clear site cookies or try a private window if you loop back to the home page.
No deploy diff or run history	The first complete run has no baseline yet. On all plans, runs older than 30 days are removed—see Baselines and diffs — retention. Anonymous try-a-scan flows do not keep history.
Schedule did not run	Verify that the schedule is enabled and that its cron expression or time is set in UTC. A "Last skipped" status often means concurrency limits or another active run; see Schedules troubleshooting.