Building a Production-Grade EKS Platform on AWS with Terraform and GitOps


Overview

In this post I walk through how I built a fully automated, production-style Kubernetes platform on AWS EKS using Terraform, GitHub Actions OIDC, and ArgoCD — all optimized for cost without sacrificing reliability. Every component is provisioned as code, deployed without storing a single static AWS credential, and observable from day one.

The full source code is available at: https://github.com/rajasekhar-cloud25/infrastructure

Interactive architecture diagram: https://codepen.io/qckuhtdx-the-scripter/pen/myrLwxP

The Problem with Static Credentials

The traditional approach looks like this:

# ❌ The wrong way — credentials stored permanently in GitHub
env:
  AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
  AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}

Problems with this approach:

  • Credentials are long-lived — if leaked, they work until manually rotated

  • They exist permanently in GitHub's secret store

  • Any workflow in the repo can use them

  • Rotation requires updating secrets in every repo that uses them

  • No audit trail of which workflow run used which credential


The Solution: OIDC Token Exchange

GitHub Actions supports OpenID Connect (OIDC). Instead of storing credentials, the workflow requests a short-lived token from GitHub's OIDC provider and exchanges it with AWS STS for temporary credentials tied to a specific IAM role:

GitHub Actions runner
  │
  ├─ requests OIDC token from GitHub
  │   (signed JWT containing: repo, branch, workflow, run ID)
  │
  ├─ calls AWS STS AssumeRoleWithWebIdentity
  │   (presents the OIDC token + role ARN)
  │
  ├─ AWS validates: is this token from GitHub? ✅
  │                 is the repo/branch in the trust policy? ✅
  │
  └─ AWS returns: temporary access key + secret + session token
      (expires in 1 hour, scoped to this specific IAM role)

The credentials are minted per job and expire on their own (after one hour by default), whether or not the job is still running. No long-lived AWS key is stored in GitHub, so there is nothing permanent to leak.
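
The two validation questions in the diagram (is this token from GitHub, and is the repo/branch allowed?) are answered by the role's trust policy. Below is a minimal sketch of a branch-scoped trust policy, assuming the GitHub OIDC provider already exists in the account. The variable names and the restriction to the main branch are illustrative and not taken from the repository.

```hcl
# Hypothetical inputs: names are illustrative, not from the repository.
variable "github_oidc_provider_arn" {
  type        = string
  description = "ARN of the IAM OIDC provider for token.actions.githubusercontent.com"
}

variable "github_repo" {
  type        = string
  description = "Repository in owner/name form"
}

data "aws_iam_policy_document" "github_actions_trust" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [var.github_oidc_provider_arn]
    }

    # "Is this token meant for AWS?" The audience must be STS.
    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:aud"
      values   = ["sts.amazonaws.com"]
    }

    # "Is the repo/branch allowed?" Only workflow runs on main match.
    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:sub"
      values   = ["repo:${var.github_repo}:ref:refs/heads/main"]
    }
  }
}
```

Tokens issued for pull-request runs carry a `pull_request` subject rather than a branch ref, so a role shared with a plan-on-PR workflow would need a broader or additional `sub` pattern.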


Project Structure

The infrastructure is organized as a set of Terraform modules, each with a single responsibility:

infrastructure/
  .github/              ← GitHub Actions workflows
  Main/                 ← Root module, wires everything together
  vpc/                  ← VPC, subnets, IGW, NAT GW, route tables, SGs
  iam/                  ← IAM roles, IRSA roles, GitHub OIDC trust
  eks/                  ← EKS cluster, node group, access entries
  ecr/                  ← ECR repositories (separate workspace)
  eip/                  ← Elastic IPs for NLB (separate workspace)
  k8s_namespaces/       ← All K8s namespaces pre-created
  kubernetes-ingress/   ← NGINX Ingress Controller + NLB + Route53
  argocd_deployment/    ← ArgoCD via local Helm chart
  s3/                   ← Terraform state bucket bootstrap
  charts/               ← Local Helm charts (ArgoCD, NGINX)
  bootstrap.sh          ← Creates S3 bucket + DynamoDB table

The two most important design decisions here are that ECR and the EIPs each live in their own workspace, separate from the cluster. Container images and static IP addresses therefore survive a full cluster destroy and recreate: no DNS updates, no image rebuilds.


Step 1 — Bootstrapping State

Before any Terraform can run, the S3 state backend needs to exist. The bootstrap.sh script handles this:

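#!/bin/bash
# Creates the Terraform state bucket and lock table. Run once per account.
set -euo pipefail

ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
BUCKET_NAME="ecommerce-demo-terraform-state-${ACCOUNT_ID}"

# us-east-1 needs no LocationConstraint; other regions do
aws s3api create-bucket --bucket "$BUCKET_NAME" --region us-east-1
aws s3api put-bucket-versioning \
  --bucket "$BUCKET_NAME" \
  --versioning-configuration Status=Enabled

# LockID is the hash key the S3 backend expects for state locking.
# The table is created in the same region as the bucket.
aws dynamodb create-table \
  --region us-east-1 \
  --table-name terraform-state-lock \
  --attribute-definitions AttributeName=LockID,AttributeType=S \
  --key-schema AttributeName=LockID,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST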

The bucket name is derived from the AWS account ID at runtime — no secrets needed anywhere.
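
A backend block that consumes these resources might look like the sketch below. The state key is a hypothetical path. The bucket is deliberately left out, because backend blocks cannot interpolate variables and the bucket name depends on the account ID.

```hcl
# Sketch of a backend block for the resources bootstrap.sh creates.
# The key is a hypothetical state path; the bucket is supplied at init time.
terraform {
  backend "s3" {
    key            = "eks-demo/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-state-lock"
    encrypt        = true
  }
}
```

With this partial configuration, the bucket would be passed on the command line, for example `terraform init -backend-config="bucket=ecommerce-demo-terraform-state-<ACCOUNT_ID>"`.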


Step 2 — GitHub Actions OIDC (Zero Static Credentials)

All three workflows authenticate to AWS using OIDC: the GitHub Actions token is exchanged for short-lived credentials on an IAM role. No AWS_ACCESS_KEY_ID is ever stored in GitHub secrets.

# .github/workflows/tf-apply.yaml
jobs:
  plan:
    permissions:
      id-token: write
      contents: read
    steps:
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
          aws-region: ${{ vars.AWS_REGION }}

Three workflows are defined:

| Workflow | Trigger | Purpose |
|---|---|---|
| tf-plan | On every PR | Runs terraform plan, posts diff as comment |
| tf-apply | Merge to main | Requires manual approval, then applies |
| tf-destroy | Manual only | Requires typing "destroy" to confirm |

Step 3 — VPC Module

The VPC module creates everything the cluster needs to run in a private, secure network:

# modules/vpc/main.tf (key resources)
resource "aws_vpc" "main" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_hostnames = true
  enable_dns_support   = true
}

# Public subnets — NLB, NAT Gateway, Internet Gateway
resource "aws_subnet" "public" {
  count                   = 2
  vpc_id                  = aws_vpc.main.id
  cidr_block              = cidrsubnet("10.0.0.0/16", 8, count.index)
  availability_zone       = data.aws_availability_zones.available.names[count.index]
  map_public_ip_on_launch = true

  tags = {
    "kubernetes.io/role/elb" = "1"
  }
}

# Private subnets — EKS nodes only
resource "aws_subnet" "private" {
  count             = 2
  vpc_id            = aws_vpc.main.id
  cidr_block        = cidrsubnet("10.0.0.0/16", 8, count.index + 10)
  availability_zone = data.aws_availability_zones.available.names[count.index]

  tags = {
    "kubernetes.io/role/internal-elb"            = "1"
    "kubernetes.io/cluster/${var.resource_name}" = "shared"
  }
}

resource "aws_internet_gateway" "main" {
  vpc_id = aws_vpc.main.id
}

resource "aws_nat_gateway" "main" {
  allocation_id = aws_eip.nat.id
  subnet_id     = aws_subnet.public[0].id  # Single AZ — cost optimized
}

Why a single NAT Gateway? One NAT Gateway instead of two saves ~$32/month. For a production portfolio demo this is an acceptable tradeoff — the only risk is losing outbound internet for private nodes if that one AZ goes down.
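
To make the subnet layout concrete, here is a small sketch that evaluates the same cidrsubnet calls outside the resources. The local and output names are illustrative; the expressions mirror the ones in the module excerpt.

```hcl
# cidrsubnet(prefix, newbits, netnum): 8 new bits turn the /16 into /24s,
# and netnum selects which /24.
locals {
  vpc_cidr = "10.0.0.0/16"

  public_cidrs  = [for i in range(2) : cidrsubnet(local.vpc_cidr, 8, i)]
  private_cidrs = [for i in range(2) : cidrsubnet(local.vpc_cidr, 8, i + 10)]
}

output "public_cidrs" {
  value = local.public_cidrs # ["10.0.0.0/24", "10.0.1.0/24"]
}

output "private_cidrs" {
  value = local.private_cidrs # ["10.0.10.0/24", "10.0.11.0/24"]
}
```

The offset of 10 for private subnets leaves 10.0.2.0/24 through 10.0.9.0/24 free if more public subnets are ever added.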

Route tables are straightforward:

  • Public RT: 0.0.0.0/0 → Internet Gateway

  • Private RT: 0.0.0.0/0 → NAT Gateway
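
The two routes above could be expressed roughly as follows. This is a sketch that assumes the VPC, subnet, Internet Gateway, and NAT Gateway resources from the excerpt live in the same module; the route table resource names are illustrative.

```hcl
# Assumes the aws_vpc, aws_subnet, aws_internet_gateway and aws_nat_gateway
# resources from the VPC excerpt live in the same module.
resource "aws_route_table" "public" {
  vpc_id = aws_vpc.main.id

  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.main.id
  }
}

resource "aws_route_table" "private" {
  vpc_id = aws_vpc.main.id

  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = aws_nat_gateway.main.id
  }
}

# One association per subnet; both private subnets share the single NAT route.
resource "aws_route_table_association" "public" {
  count          = 2
  subnet_id      = aws_subnet.public[count.index].id
  route_table_id = aws_route_table.public.id
}

resource "aws_route_table_association" "private" {
  count          = 2
  subnet_id      = aws_subnet.private[count.index].id
  route_table_id = aws_route_table.private.id
}
```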


Step 4 — IAM Module

The IAM module handles all permissions with a single consolidated design. The key insight is that the EKS OIDC provider is created inside the IAM module — this avoids a circular dependency where IAM needs EKS and EKS needs IAM.

# modules/iam/main.tf

# EKS cluster role
resource "aws_iam_role" "eks_cluster" {
  name               = "${var.resource_name}-eks-cluster-role"
  assume_role_policy = data.aws_iam_policy_document.eks_cluster_assume.json
}

# GitHub Actions OIDC role (pre-created provider as data source)
data "aws_iam_openid_connect_provider" "github" {
  url = "https://token.actions.githubusercontent.com"
}

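resource "aws_iam_role" "github_actions" {
  name = "${var.resource_name}-github-actions-role"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Federated = data.aws_iam_openid_connect_provider.github.arn }
      Action    = "sts:AssumeRoleWithWebIdentity"
      Condition = {
        # Only tokens minted for AWS STS are accepted
        StringEquals = {
          "token.actions.githubusercontent.com:aud" = "sts.amazonaws.com"
        }
        # Any branch, tag, or PR in this one repository
        StringLike = {
          "token.actions.githubusercontent.com:sub" = "repo:${var.github_repo}:*"
        }
      }
    }]
  })
}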

# IRSA roles — one per component, least privilege
resource "aws_iam_role" "external_secrets" {
  name = "${var.resource_name}-external-secrets-role"
  # Trust policy scoped to the ESO service account only
  assume_role_policy = data.aws_iam_policy_document.irsa_external_secrets.json
}

Every pod that needs AWS access gets its own IRSA role — no shared node-level credentials:

| Component | IRSA Permissions |
|---|---|
| EBS CSI Driver | ec2:CreateVolume, ec2:AttachVolume |
| Cluster Autoscaler | autoscaling:SetDesiredCapacity |
| External Secrets | secretsmanager:GetSecretValue |
| cert-manager | route53:ChangeResourceRecordSets |
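
The IAM excerpt references a trust policy document for the External Secrets role without showing it. One plausible shape is sketched below. The variable names and the service account name (external-secrets in the external-secrets namespace) are assumptions for illustration.

```hcl
# Hypothetical inputs: the variable names and the service account name
# are assumptions for illustration.
variable "eks_oidc_provider_arn" {
  type = string
}

variable "eks_oidc_issuer_url" {
  type        = string
  description = "OIDC issuer URL of the EKS cluster, including https://"
}

locals {
  # Condition keys use the issuer without the https:// scheme
  oidc_issuer_host = replace(var.eks_oidc_issuer_url, "https://", "")
}

data "aws_iam_policy_document" "irsa_external_secrets" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [var.eks_oidc_provider_arn]
    }

    # Only this namespace/service account pair may assume the role
    condition {
      test     = "StringEquals"
      variable = "${local.oidc_issuer_host}:sub"
      values   = ["system:serviceaccount:external-secrets:external-secrets"]
    }

    condition {
      test     = "StringEquals"
      variable = "${local.oidc_issuer_host}:aud"
      values   = ["sts.amazonaws.com"]
    }
  }
}
```

Pinning the `sub` claim to a single service account is what keeps each role usable by one component only.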

Step 5 — EKS Module

The EKS cluster runs on SPOT t3.small instances for cost optimization. It uses the API_AND_CONFIG_MAP authentication mode, so access is granted through access entries rather than by editing the legacy aws-auth ConfigMap:

# modules/eks/main.tf
resource "aws_eks_cluster" "main" {
  name     = var.resource_name
  version  = var.cluster_version
  role_arn = var.cluster_role_arn

  vpc_config {
    subnet_ids              = var.private_subnet_ids
    endpoint_private_access = true
    endpoint_public_access  = true
  }

  access_config {
    authentication_mode                         = "API_AND_CONFIG_MAP"
    bootstrap_cluster_creator_admin_permissions = true
  }
}

resource "aws_eks_node_group" "main" {
  cluster_name    = aws_eks_cluster.main.name
  node_role_arn   = var.node_role_arn
  subnet_ids      = var.private_subnet_ids
  instance_types  = ["t3.small"]
  capacity_type   = "SPOT"        # ~60% cheaper than on-demand

  scaling_config {
    desired_size = 5
    min_size     = 2
    max_size     = 5
  }
}

# Access entries — no aws-auth ConfigMap editing required
resource "aws_eks_access_entry" "github_actions" {
  cluster_name  = aws_eks_cluster.main.name
  principal_arn = var.github_actions_role_arn
  type          = "STANDARD"
}
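
An access entry of type STANDARD identifies the principal but grants no Kubernetes permissions by itself; those come from an associated access policy or from Kubernetes groups bound through RBAC. Here is a sketch of the policy association, where the choice of the cluster-admin policy is an assumption and not something the excerpt confirms:

```hcl
# Sketch: attach an EKS access policy to the access entry's principal.
# AmazonEKSClusterAdminPolicy is an assumed choice; a narrower policy
# or a namespace scope may fit better.
resource "aws_eks_access_policy_association" "github_actions_admin" {
  cluster_name  = aws_eks_cluster.main.name
  principal_arn = aws_eks_access_entry.github_actions.principal_arn
  policy_arn    = "arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy"

  access_scope {
    type = "cluster"
  }
}
```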

The Kubernetes and Helm providers authenticate using exec with aws eks get-token — this avoids plan-time failures when the cluster doesn't exist yet:

provider "kubernetes" {
  host                   = module.eks.cluster_endpoint
  cluster_ca_certificate = base64decode(module.eks.cluster_ca_certificate)
  exec {
    api_version = "client.authentication.k8s.io/v1beta1"
    command     = "aws"
    args        = ["eks", "get-token", "--cluster-name", module.eks.cluster_name, "--region", var.aws_region]
  }
}
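
Only the Kubernetes provider is shown above. The Helm provider would take the same settings; the sketch below assumes Helm provider 2.x, which matches the `set` block syntax used in the later sections.

```hcl
# Assumes Helm provider 2.x, where cluster settings go in a nested
# kubernetes block (3.x turns this into an attribute assignment).
provider "helm" {
  kubernetes {
    host                   = module.eks.cluster_endpoint
    cluster_ca_certificate = base64decode(module.eks.cluster_ca_certificate)
    exec {
      api_version = "client.authentication.k8s.io/v1beta1"
      command     = "aws"
      args        = ["eks", "get-token", "--cluster-name", module.eks.cluster_name, "--region", var.aws_region]
    }
  }
}
```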

Step 6 — ECR Module (Separate Workspace)

ECR repositories are managed in their own Terraform workspace so images are never accidentally deleted when the main cluster is torn down:

# ecr/main.tf
resource "aws_ecr_repository" "app" {
  for_each             = toset(var.repository_names)
  name                 = each.value
  image_tag_mutability = "MUTABLE"

  image_scanning_configuration {
    scan_on_push = true
  }
}

GitHub Actions builds and pushes on every merge:

- name: Build and push
  run: |
    aws ecr get-login-password | docker login --username AWS \
      --password-stdin "$ECR_REGISTRY"
    docker buildx build --platform linux/amd64 \
      --push -t "$ECR_REGISTRY/$ECR_REPO:$IMAGE_TAG" .

Step 7 — EIP Module (Separate Workspace)

Static Elastic IPs are created in their own workspace — separate from the main cluster. This means the NLB always gets the same IP addresses, Route53 A records never need updating, and the cluster can be completely rebuilt without changing DNS:

# eip/main.tf
resource "aws_eip" "nlb" {
  count  = 2
  domain = "vpc"

  tags = {
    Name = "${var.resource_name}-nlb-eip-${count.index}"
  }
}

The allocation IDs are then passed as a variable to the main workspace:

# environments/eks-demo-dev.tfvars
nlb_eip_allocation_ids = [
  "eipalloc-09595a182e792f01f",
  "eipalloc-032c83197c359b3fe"
]
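
The hand-off between the two workspaces could be sketched as an output on one side and a validated variable on the other. The output name is hypothetical; the variable name matches the tfvars entry above.

```hcl
# In the eip workspace: expose the allocation IDs after apply.
output "nlb_eip_allocation_ids" {
  description = "Allocation IDs to copy into the main workspace's tfvars"
  value       = aws_eip.nlb[*].id
}

# In the main workspace: declare the matching input.
variable "nlb_eip_allocation_ids" {
  type        = list(string)
  description = "Pre-allocated EIPs attached to the NLB"

  validation {
    condition     = alltrue([for id in var.nlb_eip_allocation_ids : can(regex("^eipalloc-", id))])
    error_message = "Each entry must be an EIP allocation ID (eipalloc-...)."
  }
}
```

The validation catches a public IP address pasted in place of an allocation ID at plan time, before the Helm release ever sees it.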

Step 8 — Kubernetes Namespaces Module

All namespaces are created before any Helm chart runs. This prevents race conditions where a chart tries to create resources in a namespace that doesn't exist yet:

# modules/k8s_namespaces/main.tf
resource "kubernetes_namespace" "namespaces" {
  for_each = toset([
    "argocd",
    "nginx-ingress",
    "monitoring",
    "external-secrets",
    "cert-manager",
    "eks-demo",
    "shared-os",
    "kubecost"
  ])

  metadata {
    name = each.value
  }
}

This module runs before kubernetes-ingress and argocd_deployment in the dependency graph.
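
In the root module, that ordering can be made explicit with module-level depends_on. A sketch follows; the source paths are assumptions based on the directory listing, and each module's required inputs are omitted for brevity.

```hcl
# Sketch of root-module wiring. Source paths are assumptions and the
# modules' required inputs are omitted.
module "k8s_namespaces" {
  source = "../k8s_namespaces"
}

module "kubernetes_ingress" {
  source = "../kubernetes-ingress"

  # Forces namespace creation to finish before the Helm release starts
  depends_on = [module.k8s_namespaces]
}

module "argocd_deployment" {
  source = "../argocd_deployment"

  depends_on = [module.k8s_namespaces]
}
```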


Step 9 — NGINX Ingress Controller + NLB + Route53

The NGINX Ingress Controller is the traffic gateway for the entire cluster. It is deployed via a local Helm chart with NLB annotations that attach the static EIPs:

# modules/kubernetes-ingress/main.tf
resource "helm_release" "nginx_ingress" {
  name      = "nginx-ingress"
  chart     = "${path.module}/../charts/kubernetes-ingress"
  namespace = "nginx-ingress"
  timeout   = 600
  wait      = true
  atomic    = true

  set {
    name  = "controller.service.annotations.service\\.beta\\.kubernetes\\.io/aws-load-balancer-type"
    value = "nlb"
  }
  set {
    name  = "controller.service.annotations.service\\.beta\\.kubernetes\\.io/aws-load-balancer-eip-allocations"
    value = join("\\,", var.nlb_eip_allocation_ids)
  }
  set {
    name  = "controller.service.annotations.service\\.beta\\.kubernetes\\.io/aws-load-balancer-subnets"
    value = join("\\,", var.public_subnet_ids)
  }
  set {
    name  = "controller.service.annotations.service\\.beta\\.kubernetes\\.io/aws-load-balancer-ssl-cert"
    value = var.acm_certificate_arn
  }
  set {
    name  = "controller.service.annotations.service\\.beta\\.kubernetes\\.io/aws-load-balancer-ssl-ports"
    value = "443"
  }
}

Why NGINX over AWS ALB Controller?

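  • NGINX is free. By default the AWS Load Balancer Controller provisions a separate ALB for each Ingress (~$16/mo each) unless Ingresses are explicitly grouped

  • No per-app ACM ARN annotation required: one cert at the NLB level covers everything

  • Portable: the same Ingress resources work on any cloud or on-premises, and only the load balancer annotations are AWS-specific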

Route53 A records point directly to the static EIP addresses:

data "aws_eip" "nlb" {
  count = length(var.nlb_eip_allocation_ids)
  id    = var.nlb_eip_allocation_ids[count.index]
}

resource "aws_route53_record" "dns_records" {
  for_each = toset(var.dns_names)
  zone_id  = var.route53_zone_id
  name     = "${each.value}.${var.domain_name}"
  type     = "A"
  ttl      = 300
  records  = data.aws_eip.nlb[*].public_ip
}

TLS flow:

User → HTTPS
  → NLB (ACM wildcard *.reddycloud.com terminates TLS)
  → HTTP → NGINX Ingress
  → HTTP → App pod

No cert-manager needed. AWS handles certificate renewal automatically.
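
The certificate ARN passed to the ingress module has to come from somewhere. One option, sketched here as an assumption and not as the repository's approach, is to look up the issued wildcard certificate with a data source:

```hcl
# Sketch: look up the issued wildcard certificate instead of hard-coding its ARN.
# The certificate must exist in the same region as the NLB.
data "aws_acm_certificate" "wildcard" {
  domain      = "*.reddycloud.com"
  statuses    = ["ISSUED"]
  most_recent = true
}

# The ARN can then feed the acm_certificate_arn input of the ingress module.
output "acm_certificate_arn" {
  value = data.aws_acm_certificate.wildcard.arn
}
```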


Step 10 — ArgoCD Deployment

ArgoCD is deployed via a local Helm chart with CRDs managed separately using the alekc/kubectl provider to avoid Helm CRD conflicts:

# modules/argocd_deployment/main.tf

# CRDs managed outside Helm to avoid upgrade conflicts
data "http" "argocd_crds" {
  for_each = toset(local.crd_files)
  url      = each.value
}

resource "kubectl_manifest" "argocd_crds" {
  for_each          = toset(local.crd_files)
  yaml_body         = data.http.argocd_crds[each.value].response_body
  server_side_apply = true
  force_conflicts   = true
  wait              = true
}

resource "helm_release" "argocd" {
  name       = "argocd-chart"
  chart      = "${path.module}/../charts/argocd"
  version    = "9.4.17"
  namespace  = "argocd"
  skip_crds  = true    # CRDs managed by kubectl_manifest above
  replace    = true
  wait       = true
  timeout    = 600

  values = [file("${path.module}/../charts/argocd/clusterValues/values.EksDemo.yaml")]

  depends_on = [kubectl_manifest.argocd_crds]
}
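
The excerpt relies on local.crd_files and on two extra providers that are not shown. A minimal sketch of both follows. The argocd_version variable is hypothetical, and the URLs assume the CRD manifests are read from the upstream argo-cd repository at that tag.

```hcl
terraform {
  required_providers {
    kubectl = {
      source = "alekc/kubectl"
    }
    http = {
      source = "hashicorp/http"
    }
  }
}

# Hypothetical input: the Argo CD release tag whose CRDs match the chart.
variable "argocd_version" {
  type = string
}

locals {
  argocd_crd_base = "https://raw.githubusercontent.com/argoproj/argo-cd/${var.argocd_version}/manifests/crds"

  # One URL per CRD manifest; the data "http" block iterates over these.
  crd_files = [
    "${local.argocd_crd_base}/application-crd.yaml",
    "${local.argocd_crd_base}/applicationset-crd.yaml",
    "${local.argocd_crd_base}/appproject-crd.yaml",
  ]
}
```

Keeping the tag in one variable makes it harder for the CRDs and the chart's application version to drift apart during upgrades.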

ArgoCD values for NGINX ingress integration:

# charts/argocd/clusterValues/values.EksDemo.yaml
configs:
  params:
    server.insecure: "true"  # TLS terminated at NLB

server:
  extraArgs:
    - --insecure

  ingress:
    enabled: true
    ingressClassName: "nginx"
    annotations:
      nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
      nginx.ingress.kubernetes.io/backend-protocol: "HTTP"
    hostname: argocd.reddycloud.com
    paths: /
    pathType: Prefix
    https: false

ArgoCD runs in --insecure mode because TLS is already terminated at the NLB. The user always sees HTTPS — ArgoCD just receives plain HTTP from NGINX.


Secrets Management — External Secrets Operator

No secrets are hardcoded anywhere. The External Secrets Operator pulls from AWS Secrets Manager using IRSA:

# K8s manifest deployed via ArgoCD
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: postgres-creds
  namespace: eks-demo
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-store
    kind: ClusterSecretStore
  target:
    name: postgres-creds
    creationPolicy: Owner
  data:
    - secretKey: POSTGRES_PASSWORD
      remoteRef:
        key: ecommerce-k8s-demo/postgres
        property: password
    - secretKey: POSTGRES_USER
      remoteRef:
        key: ecommerce-k8s-demo/postgres
        property: username

The flow:

AWS Secrets Manager
  → ExternalSecret CRD (IRSA authenticated)
  → Kubernetes Secret (auto-created, kept in sync)
  → App pod (env var or volume mount)
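
On the AWS side, the IRSA role only needs read access to the secrets under this application's prefix. Here is a sketch of such a policy; the policy name is illustrative, and aws_region is assumed to be an existing input variable, as in the provider configuration.

```hcl
# Sketch of a least-privilege policy for the External Secrets role.
# var.aws_region is assumed to be declared elsewhere in the module.
data "aws_caller_identity" "current" {}

data "aws_iam_policy_document" "external_secrets_read" {
  statement {
    effect  = "Allow"
    actions = ["secretsmanager:GetSecretValue"]

    # The trailing wildcard also covers the random suffix that
    # Secrets Manager appends to secret ARNs.
    resources = [
      "arn:aws:secretsmanager:${var.aws_region}:${data.aws_caller_identity.current.account_id}:secret:ecommerce-k8s-demo/*",
    ]
  }
}

resource "aws_iam_role_policy" "external_secrets_read" {
  name   = "read-ecommerce-secrets"
  role   = aws_iam_role.external_secrets.id
  policy = data.aws_iam_policy_document.external_secrets_read.json
}
```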

Observability Stack

The full observability stack is deployed via ArgoCD:

| Signal | Collector | Storage | Query |
|---|---|---|---|
| Metrics | Prometheus (ServiceMonitor scrape) | TSDB on EBS | Grafana PromQL |
| Traces | OTel Collector (OTLP gRPC :4317) | Jaeger | Grafana / Jaeger UI |
| Logs | Promtail DaemonSet | Loki | Grafana LogQL |
| Search | OpenSearch client (direct) | OpenSearch index | OpenSearch Dashboards |

The OTel Collector pipeline:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch:
  resource:

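exporters:
  # Recent Collector releases no longer ship the dedicated jaeger exporter;
  # on those, an otlp exporter pointed at Jaeger's OTLP port replaces it
  jaeger:
    endpoint: jaeger-collector:14250
  # Prometheus must have its remote-write receiver enabled to accept this
  prometheusremotewrite:
    endpoint: http://prometheus:9090/api/v1/write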

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, resource]
      exporters: [jaeger]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheusremotewrite]

Cost Breakdown

| Resource | Cost | Optimization |
|---|---|---|
| EKS cluster | ~$7.20/mo | Fixed control plane cost |
| SPOT t3.small × 5 | ~$14/mo | ~60% vs on-demand |
| NAT Gateway | ~$5/mo | Single AZ vs per-AZ |
| NLB | ~$16/mo | One NLB for everything |
| EBS volumes | ~$3/mo | gp3 storage class |
| Route53 | ~$0.50/mo | Hosted zone |
| Total | ~$46/mo | vs ~$200+ on-demand multi-AZ |
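
As a quick check on the total, the line items can be summed in Terraform itself. This is only an arithmetic sketch using the figures from the table; the local and output names are illustrative.

```hcl
# Monthly figures copied from the table, in USD.
locals {
  monthly_cost = {
    eks_cluster = 7.20
    spot_nodes  = 14
    nat_gateway = 5
    nlb         = 16
    ebs_volumes = 3
    route53     = 0.50
  }
}

output "monthly_total_usd" {
  # 7.20 + 14 + 5 + 16 + 3 + 0.50 = 45.70, which the table rounds to ~$46
  value = sum(values(local.monthly_cost))
}
```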

Key Takeaways

Zero static credentials — GitHub Actions OIDC means no AWS keys ever touch GitHub secrets. IRSA means no AWS keys ever touch EKS nodes.

Destroy-safe architecture — EIPs and ECR in separate workspaces means the cluster can be completely torn down and rebuilt without updating DNS or rebuilding images.

Single ACM cert covers everything — One wildcard cert on the NLB eliminates cert-manager, Let's Encrypt rate limits, and per-app TLS configuration.

Cost matters — SPOT instances, single NAT Gateway, NGINX instead of per-ALB cost, pods instead of managed services. Same production patterns at a fraction of the cost.


Source code: github.com/rajreddy/ecommerce-k8s-demo

Interactive architecture: codepen.io/qckuhtdx-the-scripter/pen/myrLwxP

Domain: reddycloud.com