The High-Stakes Problem
In high-scale distributed systems, maintenance windows are a relic of the past. If you are operating a platform that demands 99.99% availability, dropping connections during a deployment is unacceptable.
The naïve approach to CI/CD—pushing code, building a container, and restarting a service—inevitably leads to the dreaded "502 Bad Gateway" or broken pipe errors for users with active sessions. This occurs because the orchestration layer kills the active process before the new process is ready to accept traffic, or before in-flight requests have completed.
Zero-downtime deployment is not just a DevOps trick; it is an architectural requirement. It demands a pipeline that respects connection draining, performs rigorous health checks before traffic shifting, and supports atomic rollbacks.
Today, we will dissect a production-grade pipeline using GitHub Actions and AWS ECS (Fargate). We are choosing ECS for this demonstration because its integration with Application Load Balancers (ALB) provides the most straightforward mechanism for connection draining and target group swapping, though the principles apply equally to Kubernetes.
Technical Deep Dive: The Solution & Code
To achieve zero downtime, we cannot simply replace a binary. We must orchestrate a "Rolling Update."
The logic flows as follows:
- Build & Push: CI passes, Docker image is built and pushed to ECR with a unique SHA tag (ensuring immutability).
- Register Task: A new Task Definition is registered pointing to the new image.
- Provision: The orchestrator spins up the new containers (tasks) alongside the old ones.
- Health Check: The Load Balancer hits the `/health` endpoint of the new tasks.
- Drain & Switch: Once healthy, the ALB routes new traffic to the new tasks and begins "draining" (finishing in-flight requests on) the old tasks.
- Termination: Old tasks receive a `SIGTERM` and shut down gracefully.
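Several of the steps above are driven by settings that live in the task definition rather than the pipeline. As a sketch of what a task-definition file for this setup might contain (the field names are real ECS container-definition parameters, but the values are illustrative assumptions; note also that the ALB's health check is configured separately, on the target group):

```json
{
  "family": "production-api",
  "containerDefinitions": [
    {
      "name": "api-container",
      "image": "placeholder-overwritten-by-pipeline",
      "portMappings": [{ "containerPort": 3000, "protocol": "tcp" }],
      "healthCheck": {
        "command": ["CMD-SHELL", "curl -f http://localhost:3000/health || exit 1"],
        "interval": 15,
        "timeout": 5,
        "retries": 3
      },
      "stopTimeout": 30
    }
  ]
}
```

`stopTimeout` is the window between `SIGTERM` and a hard `SIGKILL`; your application's in-process shutdown timeout must fit inside it.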
The GitHub Actions Workflow
Below is the optimized workflow configuration. Note the usage of `wait-for-service-stability`. This is critical. It forces the GitHub Action to hang until the ECS scheduler confirms the new tasks are passing ALB health checks and the old ones have drained. If this fails, the pipeline fails.
```yaml
name: Production Deployment

on:
  push:
    branches: [ "main" ]

env:
  AWS_REGION: us-east-1
  ECR_REPOSITORY: codingclave-core-api
  ECS_SERVICE: production-api-svc
  ECS_CLUSTER: production-cluster
  ECS_TASK_DEFINITION: .aws/task-definition.json
  CONTAINER_NAME: api-container

jobs:
  deploy:
    name: Build & Deploy
    runs-on: ubuntu-latest
    permissions:
      id-token: write
      contents: read

    steps:
      - name: Checkout Code
        uses: actions/checkout@v4

      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/github-actions-deploy-role
          aws-region: ${{ env.AWS_REGION }}

      - name: Login to Amazon ECR
        id: login-ecr
        uses: aws-actions/amazon-ecr-login@v2

      - name: Build, Tag, and Push Image to Amazon ECR
        id: build-image
        env:
          ECR_REGISTRY: ${{ steps.login-ecr.outputs.registry }}
          IMAGE_TAG: ${{ github.sha }}
        run: |
          docker build -t $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG .
          docker push $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG
          echo "image=$ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG" >> $GITHUB_OUTPUT

      - name: Render New Task Definition
        id: task-def
        uses: aws-actions/amazon-ecs-render-task-definition@v1
        with:
          task-definition: ${{ env.ECS_TASK_DEFINITION }}
          container-name: ${{ env.CONTAINER_NAME }}
          image: ${{ steps.build-image.outputs.image }}

      - name: Deploy to Amazon ECS
        uses: aws-actions/amazon-ecs-deploy-task-definition@v1
        with:
          task-definition: ${{ steps.task-def.outputs.task-definition }}
          service: ${{ env.ECS_SERVICE }}
          cluster: ${{ env.ECS_CLUSTER }}
          # CRITICAL: Ensures zero-downtime verification
          wait-for-service-stability: true
          # Optional: Force new deployment to ensure fresh pull
          force-new-deployment: true
```
Application-Side Requirements
The infrastructure cannot do this alone. Your application code must handle the `SIGTERM` signal. When ECS scales down the old tasks, it sends `SIGTERM`. Your app must catch this signal, stop accepting new requests, finish processing current requests, and then exit.
Here is a stripped-down Node.js implementation:
```javascript
// `app` is your Express (or similar) application; `port` is your listen port.
const server = app.listen(port);

process.on('SIGTERM', () => {
  console.log('SIGTERM received. Performing graceful shutdown...');

  // Stop accepting new connections; the callback fires once
  // in-flight requests have completed.
  server.close(() => {
    console.log('HTTP server closed. Connections drained.');
    // Close DB connections here
    process.exit(0);
  });

  // Force shutdown after timeout if connections hang.
  // Keep this below the ECS stopTimeout, or the task is SIGKILLed mid-flight.
  // unref() lets the process exit naturally if draining finishes first.
  setTimeout(() => {
    console.error('Forcing shutdown.');
    process.exit(1);
  }, 10000).unref();
});
```
Architecture & Performance Benefits
Implementing this pipeline shifts your deployment strategy from "Replacement" to "Traffic Shifting."
- Immutability and Determinism: By tagging images with the Git SHA and forcing a new Task Definition, we ensure that what ran in the CI test suite is exactly what is running in production. There is no "patching" of live servers.
- Connection Draining: The symbiotic relationship between the ALB and the application's `SIGTERM` handler ensures no user receives a severed connection error. The user experience is perfectly smooth, even during peak-load deployments.
- Automatic Rollback Logic: The `wait-for-service-stability` flag acts as a gatekeeper. If the new container starts but fails the ALB health check (e.g., due to a runtime configuration error), ECS will not route traffic to it. The deployment times out, the workflow fails, and the old tasks remain serving traffic. The system fails safe.
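One caveat: `wait-for-service-stability` fails the pipeline, but it does not by itself revert the service; the broken deployment simply never becomes stable. ECS's deployment circuit breaker, a native ECS feature, can be enabled on the service to roll back to the last steady state automatically. A sketch of the relevant service-level settings (values are illustrative, not taken from this article's setup):

```json
{
  "deploymentConfiguration": {
    "deploymentCircuitBreaker": { "enable": true, "rollback": true },
    "maximumPercent": 200,
    "minimumHealthyPercent": 100
  }
}
```

With `maximumPercent` at 200 and `minimumHealthyPercent` at 100, ECS can run the new tasks alongside the old ones at full capacity, which is what makes the rolling update downtime-free.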
How CodingClave Can Help
While the configuration above provides a solid foundation, implementing zero-downtime pipelines in a complex enterprise environment is rarely this straightforward.
The code above works for stateless microservices. But what happens when you introduce schema migrations that break backward compatibility? How do you handle long-running WebSocket connections or background worker queues during a rolling update? A misconfigured pipeline can lead to data corruption or split-brain scenarios that are far costlier than simple downtime.
This is where CodingClave operates.
We specialize in high-scale, risk-averse architecture. We don't just write YAML files; we engineer deployment strategies that account for database locking, state management, and multi-region redundancy. We turn "deployment anxiety" into a non-event.
If your team is struggling with fragile deployments or scaling pains, do not wait for the next outage to force your hand.
Book a consultation with us today. Let's audit your infrastructure and build a roadmap to true high-availability.