The High-Stakes Problem

In high-scale distributed systems, maintenance windows are a relic of the past. If you are operating a platform that demands 99.99% availability, dropping connections during a deployment is unacceptable.

The naïve approach to CI/CD—pushing code, building a container, and restarting a service—inevitably leads to the dreaded "502 Bad Gateway" or broken pipe errors for users with active sessions. This occurs because the orchestration layer kills the active process before the new process is ready to accept traffic, or before in-flight requests have completed.

Zero-downtime deployment is not just a DevOps trick; it is an architectural requirement. It demands a pipeline that respects connection draining, performs rigorous health checks before shifting traffic, and supports atomic rollbacks.

Today, we will dissect a production-grade pipeline using GitHub Actions and AWS ECS (Fargate). We are choosing ECS for this demonstration because its integration with Application Load Balancers (ALB) provides the most straightforward mechanism for connection draining and target group swapping, though the principles apply equally to Kubernetes.

Technical Deep Dive: The Solution & Code

To achieve zero downtime, we cannot simply replace a binary. We must orchestrate a "Rolling Update."

The logic flows as follows:

  1. Build & Push: CI passes, Docker image is built and pushed to ECR with a unique SHA tag (ensuring immutability).
  2. Register Task: A new Task Definition is registered pointing to the new image.
  3. Provision: The orchestrator spins up the new containers (tasks) alongside the old ones.
  4. Health Check: The Load Balancer hits the /health endpoint of the new tasks.
  5. Drain & Switch: Once healthy, the ALB routes new traffic to the new tasks and begins "draining" (finishing in-flight requests) the old tasks.
  6. Termination: Old tasks receive a SIGTERM and shut down gracefully.

The GitHub Actions Workflow

Below is the optimized workflow configuration. Note the use of wait-for-service-stability. This is critical: it forces the GitHub Actions job to wait until the ECS scheduler confirms the new tasks are passing ALB health checks and the old ones have drained. If stability is never reached, the step times out and the pipeline fails.

name: Production Deployment

on:
  push:
    branches: [ "main" ]

env:
  AWS_REGION: us-east-1
  ECR_REPOSITORY: codingclave-core-api
  ECS_SERVICE: production-api-svc
  ECS_CLUSTER: production-cluster
  ECS_TASK_DEFINITION: .aws/task-definition.json
  CONTAINER_NAME: api-container

jobs:
  deploy:
    name: Build & Deploy
    runs-on: ubuntu-latest
    permissions:
      id-token: write
      contents: read

    steps:
    - name: Checkout Code
      uses: actions/checkout@v4

    - name: Configure AWS Credentials
      uses: aws-actions/configure-aws-credentials@v4
      with:
        role-to-assume: arn:aws:iam::123456789012:role/github-actions-deploy-role
        aws-region: ${{ env.AWS_REGION }}

    - name: Login to Amazon ECR
      id: login-ecr
      uses: aws-actions/amazon-ecr-login@v2

    - name: Build, Tag, and Push Image to Amazon ECR
      id: build-image
      env:
        ECR_REGISTRY: ${{ steps.login-ecr.outputs.registry }}
        IMAGE_TAG: ${{ github.sha }}
      run: |
        docker build -t $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG .
        docker push $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG
        echo "image=$ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG" >> $GITHUB_OUTPUT

    - name: Render New Task Definition
      id: task-def
      uses: aws-actions/amazon-ecs-render-task-definition@v1
      with:
        task-definition: ${{ env.ECS_TASK_DEFINITION }}
        container-name: ${{ env.CONTAINER_NAME }}
        image: ${{ steps.build-image.outputs.image }}

    - name: Deploy to Amazon ECS
      uses: aws-actions/amazon-ecs-deploy-task-definition@v1
      with:
        task-definition: ${{ steps.task-def.outputs.task-definition }}
        service: ${{ env.ECS_SERVICE }}
        cluster: ${{ env.ECS_CLUSTER }}
        # CRITICAL: Ensures zero-downtime verification
        wait-for-service-stability: true
        # Optional: Force new deployment to ensure fresh pull
        force-new-deployment: true 
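As a complement to wait-for-service-stability, ECS's deployment circuit breaker can roll back a failed deployment on the AWS side even if the CI job is cancelled or the runner dies mid-deploy. A one-time service configuration sketch (cluster and service names match the env block above; the shorthand syntax assumes a recent AWS CLI v2):

```shell
# Enable automatic rollback when a deployment repeatedly fails health checks.
aws ecs update-service \
  --cluster production-cluster \
  --service production-api-svc \
  --deployment-configuration \
    'deploymentCircuitBreaker={enable=true,rollback=true},maximumPercent=200,minimumHealthyPercent=100'
```

minimumHealthyPercent=100 with maximumPercent=200 tells ECS to bring the new tasks up alongside the old ones before tearing anything down, which is the rolling-update behavior described earlier.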

Application-Side Requirements

The infrastructure cannot do this alone. Your application code must handle the SIGTERM signal. When ECS scales down the old tasks, it sends SIGTERM. Your app must catch this signal, stop accepting new requests, finish processing current requests, and then exit.

Here is a stripped-down Node.js implementation (assuming Express):

const express = require('express');

const app = express();
const port = process.env.PORT || 3000;
// ... routes, middleware, /health endpoint ...

const server = app.listen(port);

process.on('SIGTERM', () => {
  console.log('SIGTERM received. Performing graceful shutdown...');

  // Stop accepting new connections; in-flight requests are allowed to finish.
  server.close(() => {
    console.log('HTTP server closed. Connections drained.');
    // Close DB connections here
    process.exit(0);
  });

  // Force shutdown after a timeout if connections hang. unref() keeps this
  // timer from holding the process open once the graceful close completes.
  setTimeout(() => {
    console.error('Forcing shutdown.');
    process.exit(1);
  }, 10000).unref();
});

Architecture & Performance Benefits

Implementing this pipeline shifts your deployment strategy from "Replacement" to "Traffic Shifting."

  1. Immutability and Determinism: By tagging images with the Git SHA and forcing a new Task Definition, we ensure that what ran in the CI test suite is exactly what is running in production. There is no "patching" of live servers.
  2. Connection Draining: The ALB's deregistration delay and the application's SIGTERM handler work in tandem, so no user receives a severed connection. Deployments are invisible to users, even at peak load.
  3. Automatic Rollback Logic: The wait-for-service-stability flag acts as a gatekeeper. If the new container starts but fails the ALB health check (e.g., due to a runtime configuration error), ECS will not route traffic to it. The deployment times out, the workflow fails, and the old tasks remain serving traffic. The system fails safe.
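The draining window in point 2 is governed by the target group's deregistration delay, which defaults to 300 seconds; for an API with short-lived requests, tightening it speeds up every deployment. A sketch (the target group ARN is a placeholder):

```shell
# Shorten how long the ALB waits for in-flight requests before
# deregistering an old task. Default is 300 seconds.
aws elbv2 modify-target-group-attributes \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/example-tg/0123456789abcdef \
  --attributes Key=deregistration_delay.timeout_seconds,Value=30
```

Whatever value you choose, keep it below the setTimeout in the SIGTERM handler above, or the ALB may still be routing requests to a task that has already force-exited.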

How CodingClave Can Help

While the configuration above provides a solid foundation, implementing zero-downtime pipelines in a complex enterprise environment is rarely this straightforward.

The code above works for stateless microservices. But what happens when you introduce schema migrations that break backward compatibility? How do you handle long-running WebSocket connections or background worker queues during a rolling update? A misconfigured pipeline can lead to data corruption or split-brain scenarios that are far costlier than simple downtime.

This is where CodingClave operates.

We specialize in high-scale, risk-averse architecture. We don't just write YAML files; we engineer deployment strategies that account for database locking, state management, and multi-region redundancy. We turn "deployment anxiety" into a non-event.

If your team is struggling with fragile deployments or scaling pains, do not wait for the next outage to force your hand.

Book a consultation with us today. Let's audit your infrastructure and build a roadmap to true high-availability.