When a pipeline starts failing, most teams lose time in the same way: they jump straight into the last visible error, patch the symptom, rerun the job, and hope the next stage passes. This article gives you a reusable CI/CD troubleshooting checklist for that exact moment. Use it when builds slow down, tests turn flaky, artifact publishing breaks, or deployments fail after what looked like a harmless change. The goal is not to guess faster. It is to narrow the failure domain, collect the right evidence, and work through the pipeline in a stable order so you can fix the issue without creating a new one.
Overview
A useful CI/CD troubleshooting process starts with a simple rule: treat the pipeline as a chain of dependencies, not as one black box. A failing build may be caused by source changes, dependency resolution, runner state, secrets, network access, registry permissions, environment drift, cluster policy, or a deployment strategy mismatch. The visible failure point is often not the root cause.
This checklist is designed to be revisited whenever workflows change, new tooling is introduced, or seasonal planning pushes teams to upgrade runtimes, operating systems, base images, package managers, Kubernetes versions, or deployment controls. The more complex your pipeline becomes, the more useful a stable diagnostic order becomes.
Before making changes, capture the following:
- The exact failing stage and job name
- The first failing run and the last known good run
- The commit, branch, tag, and triggering event
- Whether the failure is deterministic or intermittent
- Any recent changes to build images, secrets, runners, package registries, cluster policy, or deployment targets
That small snapshot prevents a common mistake when debugging CI pipeline issues: changing three variables at once and losing the original signal.
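One way to make the snapshot hard to skip is to give it a fixed shape. The sketch below is illustrative only: the `FailureSnapshot` class and its field names are assumptions made for this example, not part of any CI platform's API.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class FailureSnapshot:
    """Evidence recorded before anyone changes anything (hypothetical shape)."""
    stage: str
    job: str
    first_failing_run: str
    last_good_run: str
    commit: str
    ref: str             # branch or tag name
    trigger: str         # for example "push", "pull_request", "schedule"
    deterministic: bool  # False means the failure is intermittent
    recent_changes: tuple[str, ...] = ()


def to_json(snapshot: FailureSnapshot) -> str:
    """Serialize the snapshot so it can be pasted into an incident note."""
    return json.dumps(asdict(snapshot), indent=2, sort_keys=True)
```

Because the dataclass is frozen, the recorded evidence cannot be edited by accident while the investigation is under way.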
If you want to make this checklist part of team operations, turn each section into a runbook page or incident template. CI/CD failures usually repeat in families, even when the exact error message changes.
Checklist by scenario
Use the scenario that best matches the first visible symptom, but keep moving outward if the first checks do not explain the failure.
1. Build fails early during checkout or setup
What you get from this section: a fast way to rule out source and runner setup issues.
- Confirm the trigger context. Is the pipeline running on a pull request, branch push, tag, scheduled job, or manual dispatch? Permissions and available secrets often differ by event type.
- Verify repository access. Checkout failures can come from revoked deploy keys, expired tokens, shallow clone assumptions, or submodule credentials.
- Check runner image changes. A new base image or hosted runner version may change shell behavior, installed tools, path defaults, or certificate bundles.
- Compare environment variables. Missing variables, renamed keys, or masked values can break setup scripts before the real build starts.
- Reproduce setup locally or in a disposable container. If setup cannot be reproduced outside the CI platform, the problem may be runner-specific rather than code-specific.
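The environment variable comparison above can be sketched as a set difference. This is a minimal example that assumes you can export the variable names from a known good run and from the failing run. It compares names only, never values, so the output is safe to paste into a ticket. The function name `diff_env_names` is hypothetical.

```python
def diff_env_names(good: set[str], failing: set[str]) -> dict[str, list[str]]:
    """Compare variable NAMES from a good run and a failing run.

    Values are deliberately not compared, so no secret can leak into the report.
    """
    return {
        "missing_in_failing": sorted(good - failing),
        "new_in_failing": sorted(failing - good),
    }
```

A name that disappears while a similar name appears is often a sign of a renamed key rather than a missing secret.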
2. Dependency install or package restore fails
What you get from this section: a focused way to separate application issues from dependency and registry issues.
- Check lockfiles first. If lockfiles changed unexpectedly, package resolution may now select different versions.
- Look for registry authentication failures. Private registries, artifact repositories, and package feeds often fail with vague timeout or not found messages when the real issue is authorization.
- Inspect rate limits and mirrors. Temporary external registry issues can look like build instability.
- Review cache behavior. Corrupt or stale caches can produce failures that disappear after manual cache invalidation.
- Verify runtime alignment. Dependency managers can fail if the job now uses a different language version than the project expects.
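For the lockfile check, a content hash gives a quick yes or no answer before you read any diff. The sketch below assumes you have already read the lockfile bytes at the last known good commit and at the failing commit. The helper names are hypothetical.

```python
import hashlib


def fingerprint(content: bytes) -> str:
    """Return a SHA-256 hex digest of the lockfile content."""
    return hashlib.sha256(content).hexdigest()


def lockfile_drifted(last_good: bytes, failing: bytes) -> bool:
    """True when the lockfile content differs between the two runs."""
    return fingerprint(last_good) != fingerprint(failing)
```

If the fingerprints match, dependency resolution inputs did not change in the repository, which points the search toward registries, caches, or the runtime instead.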
If your pipeline also publishes container images, keep registry behavior in mind. A change in naming, auth flow, or retention policy can affect both dependency pulls and artifact pushes. For related registry tradeoffs, see Container Registry Comparison: ECR vs GCR vs ACR vs Docker Hub.
3. Tests fail unexpectedly or become flaky
What you get from this section: a practical sequence for test failures that are real, intermittent, or environment-specific.
- Separate deterministic failures from flaky ones. If reruns pass without code changes, do not assume the code is fixed.
- Identify test scope. Unit, integration, end-to-end, contract, and smoke tests fail for different reasons and should not share the same first-response checklist.
- Check time and ordering assumptions. Tests often fail in CI because of slower startup, clock skew, time zone differences, or parallel execution.
- Review shared resources. Databases, test buckets, message queues, and mock servers can leak state across jobs.
- Inspect recent infrastructure changes. A test suite may surface a network, DNS, or certificate issue that is not really a test problem.
- Confirm retry policy. Retries can reduce noise, but they can also hide growing instability.
Flaky tests should be treated as reliability work, not as normal background noise. If you can classify the top three flake patterns and fix them, you will usually shorten recovery time more than you would by adding reruns.
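Separating deterministic failures from flaky ones can be done from rerun history alone. The sketch below assumes each result is a `(test_name, commit, passed)` tuple, which is an invented record shape for illustration. A test counts as flaky when the same commit produced both a pass and a fail.

```python
from collections import defaultdict

# Higher rank wins when one test earns different labels on different commits.
RANK = {"passing": 0, "deterministic-failure": 1, "flaky": 2}


def classify_tests(results):
    """Label each test from (test_name, commit, passed) records."""
    outcomes = defaultdict(set)
    for name, commit, passed in results:
        outcomes[(name, commit)].add(passed)

    labels = {}
    for (name, _commit), seen in outcomes.items():
        if seen == {True, False}:
            label = "flaky"  # same code, different results
        elif seen == {False}:
            label = "deterministic-failure"
        else:
            label = "passing"
        current = labels.get(name, "passing")
        labels[name] = max(current, label, key=RANK.__getitem__)
    return labels
```

The key design choice is grouping by commit: a test that fails on one commit and passes on the next may simply have been fixed, so only mixed results on identical code count as flake.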
4. Artifact build passes, but image publish or package publish fails
What you get from this section: a clean handoff checklist between build success and delivery success.
- Confirm artifact naming and tagging rules. Small changes in branch naming, semantic version parsing, or tag normalization can break publish steps.
- Validate credentials and scopes. Publishing often requires broader permissions than reading.
- Check immutable tag behavior. Some registries or repositories reject overwrites; the error may appear after the build succeeds.
- Review signing, provenance, or scanning steps. Security controls inserted late in the pipeline are a common source of newly failing builds.
- Inspect storage and retention policy effects. Quotas, retention rules, and cleanup jobs can silently remove artifacts that downstream stages still depend on.
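Tag normalization is a frequent source of publish failures, so it helps to make the rule explicit and testable. The sketch below assumes container image tags, which may contain ASCII letters, digits, underscores, periods, and hyphens, may not start with a period or hyphen, and are limited to 128 characters. The function name `normalize_tag` is hypothetical, and other artifact types will have different rules.

```python
import re

MAX_TAG_LENGTH = 128  # upper bound for container image tags


def normalize_tag(ref: str) -> str:
    """Turn a branch or ref name into a valid container image tag."""
    # Replace every character outside the allowed set with a hyphen.
    tag = re.sub(r"[^A-Za-z0-9_.-]", "-", ref)
    # A tag may not begin with a period or a hyphen.
    tag = tag.lstrip(".-")
    tag = tag[:MAX_TAG_LENGTH]
    if not tag:
        raise ValueError(f"cannot derive a tag from {ref!r}")
    return tag
```

Note that normalization is lossy: `feature/x` and `feature-x` map to the same tag, which matters when the registry rejects overwrites.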
5. Deployment fails before rollout starts
What you get from this section: fast checks for configuration and authentication failures in the delivery stage.
- Verify target environment selection. Make sure the job is pointing at the intended account, project, namespace, subscription, or cluster.
- Check identity assumptions. Service account tokens, federated identity, and workload identity mappings often drift over time. For a broader identity design perspective, see Workload Identity for AI Agents: Separating Who Runs from What They Can Do.
- Review secret injection. Missing or malformed secrets can block a deployment before any rollout logic begins.
- Compare config rendering. Template expansion, variable substitution, and environment overlays may generate invalid manifests even when source templates look correct.
- Check admission controls and policy gates. Security policy, schema validation, or required labels may now reject the release.
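Config rendering problems are easier to catch if unresolved variables are reported before anything is sent to the target. The sketch below assumes a `${NAME}` placeholder syntax purely for illustration. Real templating tools use their own syntax and usually have their own strict modes, so treat this as a picture of the idea rather than a replacement for them.

```python
import re

# Assumed placeholder syntax for this example: ${NAME}
PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def render(template: str, values: dict[str, str]) -> tuple[str, list[str]]:
    """Substitute known values and report every placeholder left unresolved."""
    missing: list[str] = []

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in values:
            missing.append(name)
            return match.group(0)  # leave the placeholder visible in the output
        return values[name]

    rendered = PLACEHOLDER.sub(substitute, template)
    return rendered, sorted(set(missing))
```

A deploy job can fail fast when the second element of the result is not empty, which turns a confusing manifest rejection into a clear missing-variable message.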
6. Deployment starts, but rollout stalls or fails health checks
What you get from this section: a structured path through release-time failures.
- Read the platform events, not just the pipeline log. CI output may only say that the rollout timed out. The real evidence is often in deployment events, pod events, service events, or ingress controller logs.
- Check image availability. A valid deployment can still fail if nodes cannot pull the image due to auth, tag mismatch, or registry access problems.
- Inspect readiness and liveness probes. Probe settings that worked before can fail after startup behavior, dependencies, or resource requests change.
- Review resource pressure. CPU, memory, ephemeral storage, and node scheduling constraints frequently surface during rollout, not build time.
- Validate network paths. Service discovery, ingress, TLS termination, and outbound dependencies can block health checks.
- Compare version compatibility. If the environment was upgraded recently, review platform compatibility. For Kubernetes changes, see Kubernetes Version Skew Policy and Upgrade Planning Guide.
If ingress or traffic routing changed around the same time, cross-check your controller and config assumptions with Kubernetes Ingress Controller Comparison: NGINX vs Traefik vs HAProxy vs Cloud Native Options.
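For the probe check in the list above, a little arithmetic often explains why a previously healthy rollout now fails. The sketch below uses a deliberately simplified model: the first probe fires after the initial delay, later probes fire once per period, and the probe gives up on the Nth consecutive failure. The field names mirror the Kubernetes probe settings `initialDelaySeconds`, `periodSeconds`, and `failureThreshold`, but real behavior also involves timeouts and jitter, so treat the result as a rough estimate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Probe:
    initial_delay_seconds: int
    period_seconds: int
    failure_threshold: int


def seconds_until_probe_gives_up(probe: Probe) -> int:
    """Approximate time of the last allowed failure under the simplified model."""
    # First attempt at initial_delay; the Nth attempt is (N - 1) periods later.
    return probe.initial_delay_seconds + (probe.failure_threshold - 1) * probe.period_seconds


def startup_fits(probe: Probe, observed_startup_seconds: float) -> bool:
    """True when the observed startup time fits inside the probe budget."""
    return observed_startup_seconds <= seconds_until_probe_gives_up(probe)
```

If startup time grew after a dependency or resource change and no longer fits the budget, the probe settings and the workload have drifted apart even though neither looks wrong on its own.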
7. Pipeline is slow, timing out, or failing only under load
What you get from this section: a troubleshooting path for pipelines that degrade gradually instead of breaking all at once.
- Measure stage duration changes. Identify whether the slowdown is in checkout, dependency restore, test startup, image build, scan, publish, or deploy.
- Check queue time separately from runtime. A slow pipeline may really be a capacity problem on runners.
- Inspect parallelism and contention. Shared caches, databases, or limited external APIs may behave well with one job and poorly with many.
- Review logging and tracing overhead. Helpful instrumentation can become expensive if configured too broadly.
- Check large artifact movement. Uploading and downloading large caches or images can dominate total runtime.
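Measuring stage duration changes is mostly a comparison against a baseline. The sketch below assumes you can export per-stage durations in seconds for a baseline run and the current run, with queue time recorded as its own entry. The stage names and the 20 percent threshold are illustrative assumptions.

```python
def slowest_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    min_increase: float = 0.2,
) -> list[tuple[str, float]]:
    """Return (stage, relative increase) pairs, largest increase first."""
    regressions = []
    for stage, before in baseline.items():
        after = current.get(stage)
        if after is None or before <= 0:
            continue  # stage removed or no usable baseline
        increase = (after - before) / before
        if increase >= min_increase:
            regressions.append((stage, round(increase, 2)))
    return sorted(regressions, key=lambda item: item[1], reverse=True)
```

Keeping queue time as a separate entry makes a runner capacity problem show up as its own line instead of being blamed on whichever stage happens to run first.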
If you need better visibility into delivery path latency, events, and collector placement, see OpenTelemetry Collectors Explained: Deployment Patterns, Tradeoffs, and Update Guide.
What to double-check
This section is the reusable core of any checklist for failing builds. Even experienced teams skip these checks because they look obvious.
- Was anything outside the repository changed? Hosted runners, base images, external registries, secret values, cloud IAM bindings, and cluster policies change more often than teams realize.
- Did the failure start after a routine upgrade? Language runtimes, package managers, Docker versions, Kubernetes APIs, and shell defaults can all break previously stable jobs.
- Is the failure message the first error or a downstream error? The first non-recoverable error is usually earlier in the log than the final summary.
- Can the failing step run with debug output safely? More logs help, but never expose secrets while increasing verbosity.
- Is the issue environment-specific? If production deploys fail but staging deploys pass, compare identity, policy, secrets, and traffic behavior before comparing application code.
- Did caching hide or create the issue? Pipelines often appear healthy until a cache expires, a runner changes, or a clean build runs for the first time in weeks.
- Are rollback assumptions still valid? A fast rollback may fail if schema changes, migrations, feature flags, or external contracts are not backward-compatible.
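Finding the first error rather than the last one can be partly automated. The sketch below is a simple heuristic that assumes failures are logged with the words "error", "fatal", or "exception". Real logs vary widely, so the pattern is an assumption you would tune per tool.

```python
import re
from typing import Optional

# Heuristic pattern; whole words only, so "errors" in "0 errors" does not match.
ERROR_PATTERN = re.compile(r"\b(error|fatal|exception)\b", re.IGNORECASE)


def first_error(log_text: str) -> Optional[tuple[int, str]]:
    """Return (line number, line) for the earliest matching line, or None."""
    for number, line in enumerate(log_text.splitlines(), start=1):
        if ERROR_PATTERN.search(line):
            return number, line.strip()
    return None
```

Starting from the earliest match and reading forward usually gets to the cause faster than reading backward from the final summary.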
For deployment failure troubleshooting, one high-value practice is to keep a short diff between environments: cluster version, ingress controller, registry endpoint, identity method, secret provider, and policy engine. That diff becomes invaluable when only one environment is failing.
Common mistakes
A checklist is only useful if it helps teams avoid familiar traps. These are the mistakes that tend to stretch incidents and make future failures harder to debug.
- Treating every failure as a code issue. Many pipeline failures are caused by infrastructure, identity, or dependency changes rather than the application commit under review.
- Rerunning without capturing evidence. A successful rerun does not explain the original failure. Save logs, timestamps, and environment details before retrying.
- Editing the pipeline and the app at the same time. Change one layer first. Otherwise, you may remove the evidence needed to find the root cause.
- Ignoring flaky failures because reruns are easy. Flakiness accumulates until delivery speed drops and trust in the pipeline erodes.
- Relying on CI logs alone. For deployments, platform events and workload logs are often more informative than the pipeline output.
- Not validating rollback paths. A deployment strategy is incomplete if rollback has not been exercised under realistic conditions.
- Keeping tribal knowledge in chat only. If a fix worked once, convert it into a short runbook note or checklist update.
As a practical rule, if the same class of CI/CD incident happens twice, it deserves one of three things: a guardrail, an alert, or a documented check. That is where maintaining a CI/CD best-practices checklist becomes more valuable than one-off heroics.
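The "happens twice" rule is easy to apply if incidents are tagged with a category when they are closed. The sketch below assumes a plain list of category labels, one per incident. The labels themselves are examples, not a fixed taxonomy.

```python
from collections import Counter


def recurring_categories(incidents: list[str], threshold: int = 2) -> list[tuple[str, int]]:
    """Return (category, count) for every category seen at least `threshold` times."""
    counts = Counter(incidents)
    # most_common() orders by count, highest first.
    return [(category, n) for category, n in counts.most_common() if n >= threshold]
```

Anything this function returns is a candidate for a guardrail, an alert, or a documented check.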
When to revisit
Revisit this checklist before planned upgrades, during quarterly workflow reviews, and any time your team changes build images, runners, package sources, container registries, deployment tooling, cluster versions, or identity models. Those changes often introduce failures that look new but follow old patterns.
A practical maintenance routine looks like this:
- After each meaningful incident, add one checklist improvement. Keep it small and specific.
- Before seasonal planning cycles, review pipeline assumptions. Check runtimes, deprecations, secrets rotation, registry access, and runner capacity.
- When workflows or tools change, run a clean-path validation. Test a full build, artifact publish, deploy, health check, and rollback.
- Track the top recurring failure categories. Even a simple list of build, dependency, test, publish, auth, policy, and rollout failures will improve prioritization.
- Turn slow manual checks into automated preflight tests. Validate credentials, registry access, required variables, manifest schema, and target connectivity before the main job starts.
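An automated preflight step can start very small. The sketch below assumes each check is a function that returns `None` when healthy or a short problem description otherwise. The names `require_variables` and `run_preflight`, and the variable names in the usage note, are hypothetical. The runner collects every failure instead of stopping at the first one, so a single run reports everything that needs fixing.

```python
from typing import Callable, Mapping, Optional

Check = Callable[[], Optional[str]]  # returns a problem description or None


def require_variables(env: Mapping[str, str], names: list[str]) -> Check:
    """Build a check that fails when any named variable is unset or empty."""
    def check() -> Optional[str]:
        missing = [name for name in names if not env.get(name)]
        return f"missing variables: {', '.join(missing)}" if missing else None
    return check


def run_preflight(checks: Mapping[str, Check]) -> dict[str, str]:
    """Run every check and return a mapping of check name to problem."""
    failures: dict[str, str] = {}
    for name, check in checks.items():
        try:
            problem = check()
        except Exception as exc:  # a crashing check is itself a finding
            problem = f"check crashed: {exc}"
        if problem:
            failures[name] = problem
    return failures
```

In a job, you might pass `os.environ` as `env` and exit with a non-zero status when the returned dictionary is not empty. Checks for registry access or target connectivity can follow the same shape.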
If you only adopt one habit from this article, make it this: diagnose pipelines from the outside in. Start with trigger context and recent changes, move through environment and dependencies, then inspect runtime and deployment behavior. That order reduces guesswork, makes failures easier to classify, and gives your team a checklist worth returning to whenever builds or releases start failing in new ways.