Blue/Green Deployment (4): Operations and Optimization

TL;DR

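The core of operating Blue/Green is a switchover checklist, monitoring, and explicit criteria for immediate rollback.
Automation should focus less on the deployment itself and more on reliably managing the 5-10 minute observation window after the switch.
Runbooks and thresholds need regular updates based on real incidents to reduce recurrence.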

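Switchover/Rollback Checklist

Switchover stability is largely determined by how complete the checklist is. Keeping pre-deployment verification, post-switch monitoring, and rollback conditions as separate lists speeds up the response.

Before the Switch (Green Verification)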

PRE-DEPLOYMENT CHECKLIST

Infrastructure

All Green Pods/instances are Running & Ready
Health check endpoint responds (HTTP 200)
Resource usage within normal range (CPU < 80%, Memory < 80%)

Application

Smoke tests pass
Critical API endpoints tested
Dependency connectivity verified (DB, Redis, external APIs)

Database

Migration scripts ran successfully
Schema backward compatibility verified
Data backfill completed, if needed

Monitoring

Alert channels verified (Slack, PagerDuty)
Dashboards accessible
Rollback owner on standby

Right After the Switch (5-Minute Monitoring)

POST-DEPLOYMENT MONITORING

Critical Metrics (first 5 minutes)

HTTP 5xx error rate < 0.1%
No spike in HTTP 4xx error rate
P99 latency increase < 20% over the pre-switch baseline
Request throughput (RPS) holds steady

Application Logs

No spike in ERROR/CRITICAL log entries
No exception stack traces
No connection refused/timeout errors

Business Metrics

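No sharp drop in key conversion rates (login, payment)
User sessions preserved

Rollback Decision Criteria

Metric	Threshold	Action
5xx error rate	> 1% (for 5 minutes)	Roll back immediately
P99 latency	> 2x baseline	Investigate, then decide
Core feature failure	Payment/login failures	Roll back immediately
CPU utilization	> 90% (sustained)	Scale up or roll back
OOM kill	Any occurrence	Roll back immediately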
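
The decision table above can be encoded so that a pipeline step or an on-call script reaches the same verdict every time. The following is a minimal sketch under stated assumptions: the `Snapshot` fields and the `Action` names are hypothetical, and the metrics are assumed to have been aggregated over the observation window already.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    CONTINUE = "continue"
    INVESTIGATE = "investigate"
    SCALE_OR_ROLLBACK = "scale_or_rollback"
    ROLLBACK = "rollback"


@dataclass
class Snapshot:
    """Metrics aggregated over the observation window (hypothetical field names)."""
    error_rate_5xx: float   # fraction, e.g. 0.012 means 1.2%
    p99_latency_s: float    # current P99 in seconds
    baseline_p99_s: float   # P99 measured before the switch
    core_feature_ok: bool   # payment/login checks are passing
    cpu_utilization: float  # fraction, sustained value
    oom_kills: int          # OOM kills seen since the switch


def decide(s: Snapshot) -> Action:
    """Map a metrics snapshot to the action in the rollback decision table."""
    # Immediate-rollback rows are checked first so they always win.
    if s.error_rate_5xx > 0.01 or not s.core_feature_ok or s.oom_kills > 0:
        return Action.ROLLBACK
    if s.cpu_utilization > 0.90:
        return Action.SCALE_OR_ROLLBACK
    if s.p99_latency_s > 2 * s.baseline_p99_s:
        return Action.INVESTIGATE
    return Action.CONTINUE
```

Checking the immediate-rollback rows first is a design choice: when several thresholds are breached at once, the strictest action is the one returned.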

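Monitoring Setup

What matters is not watching many metrics but defining SLIs/SLOs that allow an immediate decision. Alert rules are effective only when each one is designed together with an action that can be taken right away (rollback or scale).

Defining Core SLIs/SLOs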

yaml

# SLO example
slos:
  availability:
    target: 99.9%  # allows about 43 minutes of downtime per month
    window: 30d
    indicator: |
      sum(rate(http_requests_total{status!~"5.."}[5m])) /
      sum(rate(http_requests_total[5m]))

  latency:
    target: 95%    # 95% of requests complete within 200ms
    window: 30d
    indicator: |
      histogram_quantile(0.95,
        sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
      )

  error_budget:
    monthly: 43.2m  # 99.9% SLO = 43.2 minutes/month
    alert_threshold: 50%  # alert when 50% of the budget is consumed
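
The 43.2-minute figure in the SLO example follows directly from the target and the window. The sketch below reproduces that arithmetic and the 50% alert threshold; the function names are hypothetical, and downtime is assumed to be tracked in minutes.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the window."""
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def budget_consumed(downtime_minutes: float, slo_target: float,
                    window_days: int = 30) -> float:
    """Fraction of the error budget already spent (1.0 means fully spent)."""
    return downtime_minutes / error_budget_minutes(slo_target, window_days)


def should_alert(downtime_minutes: float, slo_target: float = 0.999,
                 threshold: float = 0.5) -> bool:
    """True once the consumed share of the budget reaches the alert threshold."""
    return budget_consumed(downtime_minutes, slo_target) >= threshold


print(round(error_budget_minutes(0.999), 1))  # 43.2
```

With a 99.9% target over 30 days, 22 minutes of accumulated downtime would cross the 50% threshold, while 10 minutes would not.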

Prometheus Alert Rules

yaml

# prometheus-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: blue-green-alerts
spec:
  groups:
  - name: deployment
    rules:
    # Spike in 5xx error rate
    - alert: HighErrorRate
      expr: |
        sum(rate(http_requests_total{status=~"5.."}[5m])) /
        sum(rate(http_requests_total[5m])) > 0.01
      for: 2m
      labels:
        severity: critical
        action: rollback
      annotations:
        summary: "High error rate detected after deployment"
        description: "Error rate is {{ $value | humanizePercentage }} (threshold: 1%)"
        runbook_url: "https://wiki.example.com/runbooks/high-error-rate"
 
    # Latency spike
    - alert: HighLatency
      expr: |
        histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
        > 0.5
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "P99 latency exceeded 500ms"
        description: "Current P99: {{ $value | humanizeDuration }}"
 
    # Pod restart detection
    - alert: PodRestartLoop
      expr: |
        increase(kube_pod_container_status_restarts_total{
          pod=~"my-app-.*"
        }[15m]) > 3
      labels:
        severity: critical
        action: rollback
      annotations:
        summary: "Pod restart loop detected"
 
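    # Post-deployment rollback detection
    # Note: this expression fires whenever the controller has not yet observed the latest spec,
    # so it matches any pending rollout, not only rollbacks.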
    - alert: DeploymentRollback
      expr: |
        kube_deployment_status_observed_generation{deployment="my-app"}
        < kube_deployment_metadata_generation{deployment="my-app"}
      for: 1m
      labels:
        severity: info
      annotations:
        summary: "Deployment rollback in progress"

Grafana Dashboard

json

{
  "dashboard": {
    "title": "Blue/Green Deployment Dashboard",
    "panels": [
      {
        "title": "Traffic Distribution",
        "type": "piechart",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{color=\"blue\"}[5m]))",
            "legendFormat": "Blue"
          },
          {
            "expr": "sum(rate(http_requests_total{color=\"green\"}[5m]))",
            "legendFormat": "Green"
          }
        ]
      },
      {
        "title": "Error Rate by Version",
        "type": "timeseries",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{status=~\"5..\",color=\"blue\"}[5m])) / sum(rate(http_requests_total{color=\"blue\"}[5m]))",
            "legendFormat": "Blue 5xx"
          },
          {
            "expr": "sum(rate(http_requests_total{status=~\"5..\",color=\"green\"}[5m])) / sum(rate(http_requests_total{color=\"green\"}[5m]))",
            "legendFormat": "Green 5xx"
          }
        ]
      },
      {
        "title": "P99 Latency Comparison",
        "type": "timeseries",
        "targets": [
          {
            "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{color=\"blue\"}[5m])) by (le))",
            "legendFormat": "Blue P99"
          },
          {
            "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{color=\"green\"}[5m])) by (le))",
            "legendFormat": "Green P99"
          }
        ]
      },
      {
        "title": "Deployment Events",
        "type": "annotations",
        "datasource": "-- Grafana --",
        "query": "tags=deployment"
      }
    ]
  }
}

Recording Deployment Events

python

# deploy_marker.py - record deployment events in Grafana
import os
from datetime import datetime

import requests


def mark_deployment(version: str, environment: str, color: str) -> bool:
    """Create a deployment event marker (annotation) in Grafana."""
    grafana_url = "https://grafana.example.com"
    api_key = os.environ["GRAFANA_API_KEY"]  # raises KeyError if unset

    annotation = {
        "time": int(datetime.now().timestamp() * 1000),  # epoch milliseconds
        "tags": ["deployment", environment, color],
        "text": f"Deployed {version} to {color} ({environment})",
    }

    response = requests.post(
        f"{grafana_url}/api/annotations",
        json=annotation,
        headers={"Authorization": f"Bearer {api_key}"},
        timeout=10,  # do not block the pipeline if Grafana is unreachable
    )
    return response.status_code == 200

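Cost Optimization

Because Blue/Green runs two environments, costs rise, so cost policy should distinguish normal operation from deployment windows. Combining Spot/Preemptible instances, autoscaling, and scale-down outside deployment windows can balance stability and cost.

Cost Analysis

Using Spot/Preemptible Instances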

yaml

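# Use Spot instances for the Green environment (AWS EKS)
# Note: the label and taint key below depend on how the Spot node group is labeled in your cluster.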
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app-green
spec:
  template:
    spec:
      nodeSelector:
        node.kubernetes.io/lifecycle: spot
      tolerations:
      - key: "node.kubernetes.io/lifecycle"
        operator: "Equal"
        value: "spot"
        effect: "NoSchedule"
      # Graceful shutdown in case of a Spot interruption
      terminationGracePeriodSeconds: 30
      containers:
      - name: app
        lifecycle:
          preStop:
            exec:
              command: ["/bin/sh", "-c", "sleep 10"]

Autoscaling Configuration

yaml

# green-hpa.yaml - keep the Green environment at a minimum
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-green
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app-green
  minReplicas: 1  # minimum kept during normal operation
  maxReplicas: 10 # expand during deployment
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0  # scale up immediately
      policies:
      - type: Percent
        value: 100
        periodSeconds: 15
    scaleDown:
      stabilizationWindowSeconds: 300  # wait 5 minutes before scaling down
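
To reason about how this HPA reacts when Green starts taking traffic, it helps to work through the replica calculation by hand. The sketch below applies the ratio formula documented for the Horizontal Pod Autoscaler, `ceil(currentReplicas * currentMetric / targetMetric)`, clamped to the min/max above. It is a simplification: the controller's tolerance band and the `behavior` rate limits are not modeled, and the function name is hypothetical.

```python
import math


def desired_replicas(current: int, current_util: float,
                     target_util: float = 70.0,
                     min_replicas: int = 1, max_replicas: int = 10) -> int:
    """Approximate the HPA target: scale by the utilization ratio, then clamp."""
    raw = math.ceil(current * current_util / target_util)
    return max(min_replicas, min(max_replicas, raw))


# 2 replicas at 140% average CPU against a 70% target
print(desired_replicas(2, 140.0))  # 4
```

A Green environment idling at one replica therefore needs several scaling rounds to absorb full production load, which is one reason to pre-scale Green before the switch rather than rely on the HPA alone.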

Scaling Down Green Outside Deployment Windows

bash

#!/bin/bash
# scale-green.sh - scale down the Green environment outside business hours

HOUR=$(date +%H)  # 00-23
DAY=$(date +%u)   # 1 (Mon) to 7 (Sun)

# Scale down outside business hours (Mon-Fri 09:00-18:00)
# -ge 18 so that 18:00-18:59 counts as off-hours, matching the stated window
if [ "$DAY" -gt 5 ] || [ "$HOUR" -lt 9 ] || [ "$HOUR" -ge 18 ]; then
  echo "Off-hours: Scaling down Green environment"
  kubectl scale deployment my-app-green --replicas=0
else
  echo "Business hours: Maintaining Green environment"
  kubectl scale deployment my-app-green --replicas=1
fi
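
The boundary hours are the easiest part of this script to get wrong, so it can be worth pinning the rule down in a form that is unit-testable. This is a minimal sketch, assuming the window is Mon-Fri with 09:00 inclusive and 18:00 exclusive; the function names are hypothetical.

```python
from datetime import datetime


def is_business_hours(now: datetime) -> bool:
    """True for Mon-Fri, from 09:00 (inclusive) to 18:00 (exclusive)."""
    return now.isoweekday() <= 5 and 9 <= now.hour < 18


def target_replicas(now: datetime) -> int:
    """Replica count the Green Deployment should have at the given time."""
    return 1 if is_business_hours(now) else 0
```

For example, `target_replicas(datetime(2024, 1, 8, 18, 0))` (a Monday at 18:00) returns 0, while 17:59 on the same day returns 1.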

yaml

# Automate with CronJobs
# Schedules follow the kube-controller-manager time zone unless spec.timeZone is set.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: scale-green-down
spec:
  schedule: "0 19 * * 1-5"  # weekdays 19:00
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: deployment-scaler
          containers:
          - name: kubectl
            image: bitnami/kubectl
            command:
            - kubectl
            - scale
            - deployment/my-app-green
            - --replicas=0
          restartPolicy: OnFailure
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: scale-green-up
spec:
  schedule: "0 8 * * 1-5"  # weekdays 08:00
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: deployment-scaler
          containers:
          - name: kubectl
            image: bitnami/kubectl
            command:
            - kubectl
            - scale
            - deployment/my-app-green
            - --replicas=1
          restartPolicy: OnFailure

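Networking Considerations

A large share of switchover failures come from DNS, CDN, and session handling rather than application code. Agreeing on TTL, caching, and connection draining policies in advance substantially reduces incidents during the switch.

DNS TTL Management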

bash

# Example: setting the TTL in Route53
aws route53 change-resource-record-sets \
  --hosted-zone-id Z123456 \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "api.example.com",
        "Type": "A",
        "TTL": 60,
        "ResourceRecords": [{"Value": "1.2.3.4"}]
      }
    }]
  }'
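
Lowering the TTL only helps if it happens far enough ahead of the switch, because resolvers that cached the record just before the change may keep it for the full previous TTL. The sketch below works out that timing; the function names are hypothetical, and it assumes resolvers honor TTLs, which some clients and intermediate caches do not.

```python
from datetime import datetime, timedelta


def earliest_safe_cutover(ttl_lowered_at: datetime, previous_ttl_s: int) -> datetime:
    """Earliest time by which caches holding the old TTL should have expired."""
    return ttl_lowered_at + timedelta(seconds=previous_ttl_s)


def worst_case_stale_window(current_ttl_s: int) -> timedelta:
    """Upper bound on how long a resolver may serve the old record after the switch."""
    return timedelta(seconds=current_ttl_s)
```

For example, if the record previously had a one-hour TTL and was lowered to 60 seconds at 10:00, the switch should wait until 11:00; after the switch, the stale window is then about one minute.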

CDN Cache Invalidation

bash

# CloudFront cache invalidation
aws cloudfront create-invalidation \
  --distribution-id E123456789 \
  --paths "/*"
 
# Fastly cache purge
curl -X POST "https://api.fastly.com/service/{service_id}/purge_all" \
  -H "Fastly-Key: $FASTLY_API_KEY"

Connection Draining

yaml

# Kubernetes Service - connection draining settings
apiVersion: v1
kind: Service
metadata:
  name: my-app
  annotations:
    # AWS Classic Load Balancer (these Service annotations do not configure an ALB)
    service.beta.kubernetes.io/aws-load-balancer-connection-draining-enabled: "true"
    service.beta.kubernetes.io/aws-load-balancer-connection-draining-timeout: "60"
spec:
  # ...

yaml

# Pod - graceful shutdown
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 60
      containers:
      - name: app
        lifecycle:
          preStop:
            exec:
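              # Stop accepting new connections and wait for in-flight ones to finish
              command:
              - /bin/sh
              - -c
              - |
                # Make the health check fail (the readiness probe must check for this file)
                touch /tmp/shutdown
                # Wait for in-flight connections to finish
                sleep 30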
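
The `touch /tmp/shutdown` step only has an effect if the application's readiness endpoint actually looks at that file. The following is a minimal sketch of such an endpoint using the Python standard library; the `/ready` path, port 8080, and the function names are assumptions for illustration, not part of the configuration above.

```python
import os
from http.server import BaseHTTPRequestHandler, HTTPServer

SHUTDOWN_MARKER = "/tmp/shutdown"  # same file the preStop hook creates


def readiness_status(marker_path: str = SHUTDOWN_MARKER) -> int:
    """Return 503 once the shutdown marker exists, 200 otherwise."""
    return 503 if os.path.exists(marker_path) else 200


class ReadinessHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/ready":  # hypothetical probe path
            self.send_error(404)
            return
        status = readiness_status()
        body = b"ok" if status == 200 else b"draining"
        self.send_response(status)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8080) -> None:
    """Run the readiness endpoint until the process is stopped."""
    HTTPServer(("0.0.0.0", port), ReadinessHandler).serve_forever()
```

Once the probe starts returning 503, the Pod is removed from the Service endpoints, and the `sleep 30` gives in-flight requests time to finish before the container receives SIGTERM.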

Handling Sticky Sessions

yaml

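# Externalize sessions to Redis (Spring Boot)
# Property names below are the Spring Boot 2.x form; Boot 3.x uses spring.data.redis.* and no longer has spring.session.store-type.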
# application.yml
spring:
  session:
    store-type: redis
    redis:
      namespace: spring:session
  redis:
    host: redis.example.com
    port: 6379
 
---
# When using Session Affinity in Kubernetes
apiVersion: v1
kind: Service
metadata:
  name: my-app
spec:
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 3600

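Troubleshooting Guide

How fast a problem gets resolved depends on whether the "symptom -> candidate causes -> immediate action" flow is written down. Standardizing the patterns that recur in operations lets new team members respond the same way.

Common Problems and Fixes

Problem 1: Green environment never becomes Ready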

bash

# Diagnose
kubectl get pods -l color=green
kubectl describe pod my-app-green-xxx

# Common causes
# 1. Image pull failure
kubectl get events --field-selector reason=Failed

# 2. Readiness probe failure
kubectl logs my-app-green-xxx

# 3. Insufficient resources
kubectl describe node | grep -A5 "Allocated resources"

Problem 2: 5xx errors spike after the switch

bash

# Roll back immediately
kubectl patch svc my-app -p '{"spec":{"selector":{"color":"blue"}}}'

# Root-cause analysis
# 1. Check logs
kubectl logs -l color=green --tail=100

# 2. Check dependency connectivity (TCP check; port 5432 does not speak HTTP, so curl is not meaningful here)
kubectl exec my-app-green-xxx -- nc -zv db-service 5432

# 3. Compare environment variables
kubectl get deployment my-app-blue -o json | jq '.spec.template.spec.containers[0].env'
kubectl get deployment my-app-green -o json | jq '.spec.template.spec.containers[0].env'
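
Comparing two `jq` outputs by eye is error-prone once a Deployment has more than a handful of variables. The sketch below diffs the two JSON arrays printed by the commands above; the function names are hypothetical, and entries that use `valueFrom` are compared by their reference rather than by the resolved secret or config value.

```python
import json


def env_to_dict(env_json: str) -> dict:
    """Turn a container `env` JSON array into {name: value-or-valueFrom}."""
    entries = json.loads(env_json) or []  # jq prints null when env is absent
    return {e["name"]: e.get("value", e.get("valueFrom")) for e in entries}


def diff_env(blue_json: str, green_json: str) -> dict:
    """Report variables missing on either side and variables whose values differ."""
    blue, green = env_to_dict(blue_json), env_to_dict(green_json)
    return {
        "only_in_blue": sorted(blue.keys() - green.keys()),
        "only_in_green": sorted(green.keys() - blue.keys()),
        "changed": sorted(k for k in blue.keys() & green.keys() if blue[k] != green[k]),
    }
```

A variable listed under `only_in_blue` is a common cause of post-switch 5xx errors, since Green may be starting with a missing connection string or feature setting.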

Problem 3: Sessions lost

bash

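# Count sessions in Redis (SCAN-based; KEYS blocks the server on large keyspaces)
redis-cli --scan --pattern "spring:session:*" | wc -l

# Fix: verify session migration
# 1. Check connectivity to the session store
kubectl exec my-app-green-xxx -- nc -zv redis.example.com 6379

# 2. Check the session key format (cross-version compatibility)
# Spring Session stores each session as a Redis hash, so use HGETALL rather than GET
redis-cli HGETALL "spring:session:sessions:xxx"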

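Troubleshooting Flowchart

Case Studies

Real cases are the basis for adapting design principles to operational reality. Recording failures and rollbacks, not only successes, is what improves the safety of the next deployment.

Case 1: E-commerce Platform (Payment System)

Situation: Blue/Green deployment for a payment module update

Environment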

Item	Spec
Infrastructure	AWS EKS (Kubernetes 1.28)
Deployment tool	Argo Rollouts
Traffic	10,000 RPS (peak)

Deployment Strategy

Results

Metric	Value
Downtime	0 seconds
Rollback needed	No
Total duration	55 minutes

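Case 2: API Server (with a DB Schema Change)

Situation: Adding a new column to the users table

Deployment Plan (Expand & Contract)

Problem and Resolution

Stage	Details
Problem	Null errors in some APIs after deploying v2 in Phase 1
Cause	Older clients sent data without the new column
Resolution	Added a default value and redeployed

Lessons

DB schema changes need a deployment plan of at least two weeks
Client compatibility testing is mandatory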
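
The fix in this case, supplying a default when older clients omit the new column, can be illustrated at the request-parsing layer. This is a sketch only: the column name `locale`, its default, and the `UserRecord` shape are hypothetical stand-ins, since the case study does not name the actual column.

```python
from dataclasses import dataclass
from typing import Any, Mapping

DEFAULT_LOCALE = "en"  # hypothetical default for the hypothetical new column


@dataclass
class UserRecord:
    user_id: int
    name: str
    locale: str  # new column introduced in the "expand" phase


def parse_user_payload(payload: Mapping[str, Any]) -> UserRecord:
    """Accept payloads from both old clients (no locale) and new clients."""
    # Old clients omit the field or send null; fall back to the default.
    locale = payload.get("locale") or DEFAULT_LOCALE
    return UserRecord(
        user_id=int(payload["user_id"]),
        name=str(payload["name"]),
        locale=locale,
    )
```

Handling the default in the application, in addition to a column default in the database, keeps the new version tolerant of old clients for as long as both are in circulation.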

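Case 3: Microservices (Simultaneous Multi-Service Deployment)

Situation: Three services updated together (API Gateway, Auth, User)

Deployment Order (Dependency-Aware)

Automation (GitHub Actions)

Stage	Description
Build	Parallel builds per service
Deploy	Sequential deployment (in dependency order)
Approval	Manual approval at each stage

Problem and Resolution

Stage	Details
Problem	Auth calls failed after switching the User service
Cause	The User service was not aware of an API change in the new Auth version
Resolution	Rolled back User immediately, then redeployed after patching the Auth API for backward compatibility

Lessons

API contract testing is mandatory when deploying microservices together
Adopting Consumer-Driven Contract Testing is recommended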
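
The idea behind a consumer-driven contract is that the consumer (here, the User service) declares which fields it reads from the provider (Auth), and the provider's build fails if a response no longer satisfies that declaration. The sketch below shows the core check in plain Python rather than with a dedicated contract-testing tool; the field names and the contract itself are hypothetical.

```python
from typing import Any, List, Mapping

# Hypothetical contract: fields the User service reads from an Auth response.
USER_EXPECTS_FROM_AUTH = {"user_id": int, "scopes": list, "expires_at": int}


def contract_violations(response: Mapping[str, Any],
                        contract: Mapping[str, type]) -> List[str]:
    """List every way the provider response breaks the consumer's expectations."""
    problems = []
    for field, expected_type in contract.items():
        if field not in response:
            problems.append(f"missing field: {field}")
        elif not isinstance(response[field], expected_type):
            problems.append(
                f"wrong type for {field}: expected {expected_type.__name__}, "
                f"got {type(response[field]).__name__}"
            )
    return problems
```

Run against a sample response from the new Auth version in CI, a renamed or retyped field shows up as a failed check before the User service is ever switched.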

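Series Summary

The core of the whole series is "verifiable switchover" rather than "automatic switchover". For tooling, it is safest to start simple, matched to the team's skills and operating practices, and expand gradually.

Key Points

Topic	Key Content
Basics	Two environments run side by side; immediate rollback is possible
Kubernetes	Argo Rollouts or Istio recommended
Cloud	ALB Weight, Cloud Run Tags, App Service Slots
CI/CD	Automated verification → manual approval → monitoring
Operations	SLO-based monitoring, cost optimization, checklists

Recommendations

Start simple: begin with a plain Kubernetes Service selector
Evolve gradually: adopt Argo Rollouts, then Istio
Automate: integrate Blue/Green into the CI/CD pipeline
Monitoring first: build SLOs and alerting before deploying
Practice rollbacks: run rollback drills regularly

Next Topics

Canary deployment: the gradual variant of Blue/Green
GitOps: declarative deployment with Argo CD
Chaos Engineering: validating deployment stability
Feature Flags: release control at the code level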