# Building a Production-Grade EKS Platform on AWS with Terraform and GitOps

## Overview

In this post I walk through how I built a fully automated, production-style Kubernetes platform on AWS EKS using Terraform, GitHub Actions OIDC, and ArgoCD — all optimized for cost without sacrificing reliability. Every component is provisioned as code, deployed without storing a single static AWS credential, and observable from day one.

The full source code is available at: **https://github.com/rajasekhar-cloud25/infrastructure**

Interactive architecture diagram: [**View live →**](https://codepen.io/qckuhtdx-the-scripter/pen/myrLwxP)

## The Problem with Static Credentials

The traditional approach looks like this:

```yaml
# ❌ The wrong way — credentials stored permanently in GitHub
env:
  AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
  AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
```

Problems with this approach:

*   Credentials are long-lived — if leaked, they work until manually rotated
    
*   They exist permanently in GitHub's secret store
    
*   Any workflow in the repo can use them
    
*   Rotation requires updating secrets in every repo that uses them
    
*   No audit trail of which workflow run used which credential
    

* * *

## The Solution: OIDC Token Exchange

GitHub Actions supports OpenID Connect (OIDC). Instead of storing credentials, the workflow requests a short-lived token from GitHub's OIDC provider and exchanges it for temporary credentials for an AWS IAM role:

```plaintext
GitHub Actions runner
  │
  ├─ requests OIDC token from GitHub
  │   (signed JWT containing: repo, branch, workflow, run ID)
  │
  ├─ calls AWS STS AssumeRoleWithWebIdentity
  │   (presents the OIDC token + role ARN)
  │
  ├─ AWS validates: is this token from GitHub? ✅
  │                 is the repo/branch in the trust policy? ✅
  │
  └─ AWS returns: temporary access key + secret + session token
      (expires in 1 hour, scoped to this specific IAM role)
```

The credentials are temporary: they expire when the session ends (one hour by default), whether or not the job is still running. Nothing long-lived is stored in GitHub, so there is no permanent key to leak.
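
For the exchange above to work, the AWS account needs an IAM OIDC identity provider for GitHub's token issuer. This post consumes it as a pre-created data source (Step 4), so the following is only a sketch of how it might be registered once per account. The resource name `github` and the use of the `hashicorp/tls` provider to look up a certificate thumbprint are assumptions, not taken from the repository.

```hcl
# Hypothetical one-time setup, kept outside the cluster workspace so it
# survives a cluster destroy.

# Fetch the TLS certificate chain of GitHub's OIDC issuer.
data "tls_certificate" "github" {
  url = "https://token.actions.githubusercontent.com"
}

resource "aws_iam_openid_connect_provider" "github" {
  url = "https://token.actions.githubusercontent.com"

  # Audience requested by aws-actions/configure-aws-credentials by default.
  client_id_list = ["sts.amazonaws.com"]

  # Required by older AWS provider versions; harmless on newer ones.
  thumbprint_list = [data.tls_certificate.github.certificates[0].sha1_fingerprint]
}
```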

* * *

## Project Structure

The infrastructure is organized as a set of Terraform modules, each with a single responsibility:

```plaintext
infrastructure/
  .github/              ← GitHub Actions workflows
  Main/                 ← Root module, wires everything together
  vpc/                  ← VPC, subnets, IGW, NAT GW, route tables, SGs
  iam/                  ← IAM roles, IRSA roles, GitHub OIDC trust
  eks/                  ← EKS cluster, node group, access entries
  ecr/                  ← ECR repositories (separate workspace)
  eip/                  ← Elastic IPs for NLB (separate workspace)
  k8s_namespaces/       ← All K8s namespaces pre-created
  kubernetes-ingress/   ← NGINX Ingress Controller + NLB + Route53
  argocd_deployment/    ← ArgoCD via local Helm chart
  s3/                   ← Terraform state bucket bootstrap
  charts/               ← Local Helm charts (ArgoCD, NGINX)
  bootstrap.sh          ← Creates S3 bucket + DynamoDB table
```

The most important design decision here is that **ECR and EIPs live in separate workspaces**. Container images and static IP addresses therefore survive a full cluster destroy and recreate, with no DNS updates and no image rebuilds.

* * *

## Step 1 — Bootstrapping State

Before any Terraform can run, the S3 state backend needs to exist. The `bootstrap.sh` script handles this:

```bash
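#!/bin/bash
# Creates the S3 state bucket and the DynamoDB lock table.
set -euo pipefail  # stop on the first failed command or unset variable

REGION="us-east-1"
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
BUCKET_NAME="ecommerce-demo-terraform-state-${ACCOUNT_ID}"

# us-east-1 is the one region where create-bucket takes no LocationConstraint
aws s3api create-bucket --bucket "$BUCKET_NAME" --region "$REGION"

# Versioning makes it possible to recover an earlier state file
aws s3api put-bucket-versioning \
  --bucket "$BUCKET_NAME" \
  --versioning-configuration Status=Enabled

# Lock table for the S3 backend; LockID is the key name Terraform expects
aws dynamodb create-table \
  --region "$REGION" \
  --table-name terraform-state-lock \
  --attribute-definitions AttributeName=LockID,AttributeType=S \
  --key-schema AttributeName=LockID,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST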

The bucket name is derived from the AWS account ID at runtime — no secrets needed anywhere.
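
Because the bucket name contains the account ID and `backend` blocks cannot reference variables, one way to consume the bootstrapped resources is a partial backend configuration. This is a sketch under assumptions: the file name, the state `key`, and the `-backend-config` invocation are illustrative and not copied from the repository.

```hcl
# Main/backend.tf (hypothetical file name)
terraform {
  backend "s3" {
    # bucket is intentionally omitted and supplied at init time, e.g.:
    #   terraform init -backend-config="bucket=ecommerce-demo-terraform-state-<ACCOUNT_ID>"
    key            = "main/terraform.tfstate" # assumed key, one per workspace
    region         = "us-east-1"
    dynamodb_table = "terraform-state-lock" # table created by bootstrap.sh
    encrypt        = true
  }
}
```

Giving the ECR and EIP workspaces their own `key` values is what would keep their state independent of the cluster state.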

* * *

## Step 2 — GitHub Actions OIDC (Zero Static Credentials)

All three workflows authenticate to AWS using OIDC: the GitHub Actions token is exchanged for short-lived credentials scoped to an IAM role. No `AWS_ACCESS_KEY_ID` is ever stored in GitHub secrets.

```yaml
# .github/workflows/tf-apply.yaml
jobs:
  plan:
    permissions:
      id-token: write
      contents: read
    steps:
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
          aws-region: ${{ vars.AWS_REGION }}
```

Three workflows are defined:

| Workflow | Trigger | Purpose |
| --- | --- | --- |
| `tf-plan` | On every PR | Runs `terraform plan`, posts diff as comment |
| `tf-apply` | Merge to main | Requires manual approval, then applies |
| `tf-destroy` | Manual only | Requires typing "destroy" to confirm |

* * *

## Step 3 — VPC Module

The VPC module creates everything the cluster needs to run in a private, secure network:

```hcl
# modules/vpc/main.tf (key resources)
resource "aws_vpc" "main" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_hostnames = true
  enable_dns_support   = true
}

# Public subnets — NLB, NAT Gateway, Internet Gateway
resource "aws_subnet" "public" {
  count                   = 2
  vpc_id                  = aws_vpc.main.id
  cidr_block              = cidrsubnet("10.0.0.0/16", 8, count.index)
  availability_zone       = data.aws_availability_zones.available.names[count.index]
  map_public_ip_on_launch = true

  tags = {
    "kubernetes.io/role/elb" = "1"
  }
}

# Private subnets — EKS nodes only
resource "aws_subnet" "private" {
  count             = 2
  vpc_id            = aws_vpc.main.id
  cidr_block        = cidrsubnet("10.0.0.0/16", 8, count.index + 10)
  availability_zone = data.aws_availability_zones.available.names[count.index]

  tags = {
    "kubernetes.io/role/internal-elb"                    = "1"
    "kubernetes.io/cluster/${var.resource_name}"         = "shared"
  }
}

resource "aws_internet_gateway" "main" {
  vpc_id = aws_vpc.main.id
}

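resource "aws_nat_gateway" "main" {
  allocation_id = aws_eip.nat.id
  subnet_id     = aws_subnet.public[0].id # Single AZ to keep costs down

  # The NAT Gateway needs the Internet Gateway to exist first
  depends_on = [aws_internet_gateway.main]
}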
```

**Why a single NAT Gateway?** One NAT Gateway instead of two saves ~$32/month. For a portfolio demo this is an acceptable tradeoff: the main risk is that nodes in both private subnets lose outbound internet access if the NAT Gateway's AZ goes down.

Route tables are straightforward:

*   **Public RT:** `0.0.0.0/0 → Internet Gateway`
    
*   **Private RT:** `0.0.0.0/0 → NAT Gateway`
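
The two routes above could be expressed roughly as follows. This is a sketch rather than the repository's code: the resource names `public` and `private` and the `aws_eip.nat` definition are assumptions that match the references in the VPC snippet.

```hcl
# Elastic IP consumed by the NAT Gateway (referenced as aws_eip.nat above)
resource "aws_eip" "nat" {
  domain = "vpc"
}

# Public route table: default route to the Internet Gateway
resource "aws_route_table" "public" {
  vpc_id = aws_vpc.main.id

  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.main.id
  }
}

# Private route table: default route to the single NAT Gateway
resource "aws_route_table" "private" {
  vpc_id = aws_vpc.main.id

  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = aws_nat_gateway.main.id
  }
}

# Attach each subnet to its route table
resource "aws_route_table_association" "public" {
  count          = 2
  subnet_id      = aws_subnet.public[count.index].id
  route_table_id = aws_route_table.public.id
}

resource "aws_route_table_association" "private" {
  count          = 2
  subnet_id      = aws_subnet.private[count.index].id
  route_table_id = aws_route_table.private.id
}
```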
    

* * *

## Step 4 — IAM Module

The IAM module handles all permissions with a single consolidated design. The key insight is that the EKS OIDC provider is created inside the IAM module — this avoids a circular dependency where IAM needs EKS and EKS needs IAM.

```hcl
# modules/iam/main.tf

# EKS cluster role
resource "aws_iam_role" "eks_cluster" {
  name               = "${var.resource_name}-eks-cluster-role"
  assume_role_policy = data.aws_iam_policy_document.eks_cluster_assume.json
}

# GitHub Actions OIDC role (pre-created provider as data source)
data "aws_iam_openid_connect_provider" "github" {
  url = "https://token.actions.githubusercontent.com"
}

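resource "aws_iam_role" "github_actions" {
  name = "${var.resource_name}-github-actions-role"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Federated = data.aws_iam_openid_connect_provider.github.arn }
      Action    = "sts:AssumeRoleWithWebIdentity"
      Condition = {
        # The token must have been issued for AWS STS
        StringEquals = {
          "token.actions.githubusercontent.com:aud" = "sts.amazonaws.com"
        }
        # Matches any branch, tag, or PR in this repo; narrow the suffix to restrict further
        StringLike = {
          "token.actions.githubusercontent.com:sub" = "repo:${var.github_repo}:*"
        }
      }
    }]
  })
}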

# IRSA roles — one per component, least privilege
resource "aws_iam_role" "external_secrets" {
  name = "${var.resource_name}-external-secrets-role"
  # Trust policy scoped to the ESO service account only
  assume_role_policy = data.aws_iam_policy_document.irsa_external_secrets.json
}
```
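
The `irsa_external_secrets` policy document referenced above is not shown in the post. A minimal sketch of what such a trust policy usually looks like follows. It assumes the cluster's OIDC provider is available as `aws_iam_openid_connect_provider.eks` and that the operator runs as the `external-secrets` service account in the `external-secrets` namespace; both names are assumptions.

```hcl
locals {
  # Issuer host and path without the scheme, as used in IAM condition keys
  eks_oidc_issuer = replace(aws_iam_openid_connect_provider.eks.url, "https://", "")
}

data "aws_iam_policy_document" "irsa_external_secrets" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [aws_iam_openid_connect_provider.eks.arn]
    }

    # Only this one service account may assume the role
    condition {
      test     = "StringEquals"
      variable = "${local.eks_oidc_issuer}:sub"
      values   = ["system:serviceaccount:external-secrets:external-secrets"]
    }

    # The projected token must have been issued for AWS STS
    condition {
      test     = "StringEquals"
      variable = "${local.eks_oidc_issuer}:aud"
      values   = ["sts.amazonaws.com"]
    }
  }
}
```

Pinning the `sub` claim with `StringEquals` rather than a wildcard is what keeps the role scoped to a single component.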

Every pod that needs AWS access gets its own IRSA role — no shared node-level credentials:

| Component | IRSA Permissions |
| --- | --- |
| EBS CSI Driver | `ec2:CreateVolume`, `ec2:AttachVolume` |
| Cluster Autoscaler | `autoscaling:SetDesiredCapacity` |
| External Secrets | `secretsmanager:GetSecretValue` |
| cert-manager | `route53:ChangeResourceRecordSets` |

* * *

## Step 5 — EKS Module

The EKS cluster runs on Spot t3.small instances for cost optimization. It uses the `API_AND_CONFIG_MAP` authentication mode, so access is granted through access entries rather than by editing the legacy `aws-auth` ConfigMap:

```hcl
# modules/eks/main.tf
resource "aws_eks_cluster" "main" {
  name     = var.resource_name
  version  = var.cluster_version
  role_arn = var.cluster_role_arn

  vpc_config {
    subnet_ids              = var.private_subnet_ids
    endpoint_private_access = true
    endpoint_public_access  = true
  }

  access_config {
    authentication_mode                         = "API_AND_CONFIG_MAP"
    bootstrap_cluster_creator_admin_permissions = true
  }
}

resource "aws_eks_node_group" "main" {
  cluster_name    = aws_eks_cluster.main.name
  node_role_arn   = var.node_role_arn
  subnet_ids      = var.private_subnet_ids
  instance_types  = ["t3.small"]
  capacity_type   = "SPOT"        # ~60% cheaper than on-demand

  scaling_config {
    desired_size = 5
    min_size     = 2
    max_size     = 5
  }
}

# Access entries — no aws-auth ConfigMap editing required
resource "aws_eks_access_entry" "github_actions" {
  cluster_name  = aws_eks_cluster.main.name
  principal_arn = var.github_actions_role_arn
  type          = "STANDARD"
}
```
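
A `STANDARD` access entry on its own only registers the principal; it carries no Kubernetes permissions until an access policy or a Kubernetes group is attached. The repository may already handle this elsewhere. As a sketch, an association granting cluster-wide admin to the GitHub Actions role could look like the following, where the resource name `github_actions_admin` and the choice of the admin policy are assumptions.

```hcl
resource "aws_eks_access_policy_association" "github_actions_admin" {
  cluster_name  = aws_eks_cluster.main.name
  principal_arn = aws_eks_access_entry.github_actions.principal_arn

  # AWS-managed access policy; a narrower one may be more appropriate
  policy_arn = "arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy"

  access_scope {
    type = "cluster" # use "namespace" plus a namespaces list to scope it down
  }
}
```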

The Kubernetes and Helm providers authenticate using `exec` with `aws eks get-token` — this avoids plan-time failures when the cluster doesn't exist yet:

```hcl
provider "kubernetes" {
  host                   = module.eks.cluster_endpoint
  cluster_ca_certificate = base64decode(module.eks.cluster_ca_certificate)
  exec {
    api_version = "client.authentication.k8s.io/v1beta1"
    command     = "aws"
    args        = ["eks", "get-token", "--cluster-name", module.eks.cluster_name, "--region", var.aws_region]
  }
}
```
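
The Helm provider can be configured the same way. The sketch below assumes version 2.x of the Helm provider, which matches the `set { ... }` block syntax used later in this post; version 3.x expresses `kubernetes` and `exec` as attributes rather than blocks.

```hcl
provider "helm" {
  kubernetes {
    host                   = module.eks.cluster_endpoint
    cluster_ca_certificate = base64decode(module.eks.cluster_ca_certificate)

    # Fetch a fresh token on every run instead of storing one in state
    exec {
      api_version = "client.authentication.k8s.io/v1beta1"
      command     = "aws"
      args        = ["eks", "get-token", "--cluster-name", module.eks.cluster_name, "--region", var.aws_region]
    }
  }
}
```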

* * *

## Step 6 — ECR Module (Separate Workspace)

ECR repositories are managed in their own Terraform workspace so images are never accidentally deleted when the main cluster is torn down:

```hcl
# ecr/main.tf
resource "aws_ecr_repository" "app" {
  for_each             = toset(var.repository_names)
  name                 = each.value
  image_tag_mutability = "MUTABLE"

  image_scanning_configuration {
    scan_on_push = true
  }
}
```

GitHub Actions builds and pushes on every merge:

```yaml
- name: Build and push
  run: |
    aws ecr get-login-password | docker login --username AWS \
      --password-stdin "$ECR_REGISTRY"
    docker buildx build --platform linux/amd64 \
      --push -t "$ECR_REGISTRY/$ECR_REPO:$IMAGE_TAG" .
```

* * *

## Step 7 — EIP Module (Separate Workspace)

Static Elastic IPs are created in their own workspace — separate from the main cluster. This means the NLB always gets the same IP addresses, Route53 A records never need updating, and the cluster can be completely rebuilt without changing DNS:

```hcl
# eip/main.tf
resource "aws_eip" "nlb" {
  count  = 2
  domain = "vpc"

  tags = {
    Name = "${var.resource_name}-nlb-eip-${count.index}"
  }
}
```

The allocation IDs are then passed as a variable to the main workspace:

```hcl
# environments/eks-demo-dev.tfvars
nlb_eip_allocation_ids = [
  "eipalloc-09595a182e792f01f",
  "eipalloc-032c83197c359b3fe"
]
```
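
Copying allocation IDs by hand works, but the main workspace could also read them from the EIP workspace's state. The following is a sketch of that alternative, not what the repository does: the output name, the `var.state_bucket` variable, and the state `key` are all assumptions.

```hcl
# eip/outputs.tf (hypothetical): expose the allocation IDs
output "nlb_eip_allocation_ids" {
  value = aws_eip.nlb[*].allocation_id
}

# Main workspace: read the output instead of hard-coding the IDs
data "terraform_remote_state" "eip" {
  backend = "s3"

  config = {
    bucket = var.state_bucket        # hypothetical variable
    key    = "eip/terraform.tfstate" # assumed state key
    region = "us-east-1"
  }
}

locals {
  nlb_eip_allocation_ids = data.terraform_remote_state.eip.outputs.nlb_eip_allocation_ids
}
```

The tradeoff is coupling: the main workspace then needs read access to the EIP state, whereas a `.tfvars` entry keeps the two fully independent.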

* * *

## Step 8 — Kubernetes Namespaces Module

All namespaces are created before any Helm chart runs. This prevents race conditions where a chart tries to create resources in a namespace that doesn't exist yet:

```hcl
# modules/k8s_namespaces/main.tf
resource "kubernetes_namespace" "namespaces" {
  for_each = toset([
    "argocd",
    "nginx-ingress",
    "monitoring",
    "external-secrets",
    "cert-manager",
    "eks-demo",
    "shared-os",
    "kubecost"
  ])

  metadata {
    name = each.value
  }
}
```

This module runs before `kubernetes-ingress` and `argocd_deployment` in the dependency graph.
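
One way to get that ordering is a module-level `depends_on` in the root module. This is a sketch: the module names and relative `source` paths are inferred from the project structure, and each module's inputs are left out.

```hcl
# Main/main.tf (sketch; inputs for each module are not shown)
module "k8s_namespaces" {
  source = "../k8s_namespaces"
}

module "kubernetes_ingress" {
  source = "../kubernetes-ingress"

  # Wait for every namespace before installing the chart
  depends_on = [module.k8s_namespaces]
}

module "argocd_deployment" {
  source = "../argocd_deployment"

  depends_on = [module.k8s_namespaces]
}
```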

* * *

## Step 9 — NGINX Ingress Controller + NLB + Route53

The NGINX Ingress Controller is the traffic gateway for the entire cluster. It is deployed via a local Helm chart with NLB annotations that attach the static EIPs:

```hcl
# modules/kubernetes-ingress/main.tf
resource "helm_release" "nginx_ingress" {
  name      = "nginx-ingress"
  chart     = "${path.module}/../charts/kubernetes-ingress"
  namespace = "nginx-ingress"
  timeout   = 600
  wait      = true
  atomic    = true

  set {
    name  = "controller.service.annotations.service\\.beta\\.kubernetes\\.io/aws-load-balancer-type"
    value = "nlb"
  }
  set {
    name  = "controller.service.annotations.service\\.beta\\.kubernetes\\.io/aws-load-balancer-eip-allocations"
    value = join("\\,", var.nlb_eip_allocation_ids)
  }
  set {
    name  = "controller.service.annotations.service\\.beta\\.kubernetes\\.io/aws-load-balancer-subnets"
    value = join("\\,", var.public_subnet_ids)
  }
  set {
    name  = "controller.service.annotations.service\\.beta\\.kubernetes\\.io/aws-load-balancer-ssl-cert"
    value = var.acm_certificate_arn
  }
  set {
    name  = "controller.service.annotations.service\\.beta\\.kubernetes\\.io/aws-load-balancer-ssl-ports"
    value = "443"
  }
}
```
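
An NLB takes one Elastic IP per subnet, so the two lists passed to the annotations above need the same length. A small guard can catch a mismatch at plan time instead of during load balancer provisioning. This is a sketch: `terraform_data` requires Terraform 1.4 or newer, and the resource name `nlb_eip_guard` is made up for illustration.

```hcl
resource "terraform_data" "nlb_eip_guard" {
  lifecycle {
    precondition {
      condition     = length(var.nlb_eip_allocation_ids) == length(var.public_subnet_ids)
      error_message = "Provide exactly one EIP allocation ID per public subnet used by the NLB."
    }
  }
}
```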

**Why NGINX over AWS ALB Controller?**

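*   NGINX is free: by default the AWS Load Balancer Controller provisions a separate ALB for each Ingress (~$16/mo each) unless Ingresses are explicitly grouped with an IngressGroup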
    
*   No per-app ACM ARN annotation required — one cert at the NLB level covers everything
    
*   Portable — works identically on any cloud or on-premises
    

Route53 A records point directly to the static EIP addresses:

```hcl
data "aws_eip" "nlb" {
  count = length(var.nlb_eip_allocation_ids)
  id    = var.nlb_eip_allocation_ids[count.index]
}

resource "aws_route53_record" "dns_records" {
  for_each = toset(var.dns_names)
  zone_id  = var.route53_zone_id
  name     = "${each.value}.${var.domain_name}"
  type     = "A"
  ttl      = 300
  records  = data.aws_eip.nlb[*].public_ip
}
```

TLS flow:

```plaintext
User → HTTPS
  → NLB (ACM wildcard *.reddycloud.com terminates TLS)
  → HTTP → NGINX Ingress
  → HTTP → App pod
```

No cert-manager needed. AWS handles certificate renewal automatically.

* * *

## Step 10 — ArgoCD Deployment

ArgoCD is deployed via a local Helm chart with CRDs managed separately using the `alekc/kubectl` provider to avoid Helm CRD conflicts:

```hcl
# modules/argocd_deployment/main.tf

# CRDs managed outside Helm to avoid upgrade conflicts
data "http" "argocd_crds" {
  for_each = toset(local.crd_files)
  url      = each.value
}

resource "kubectl_manifest" "argocd_crds" {
  for_each          = toset(local.crd_files)
  yaml_body         = data.http.argocd_crds[each.value].response_body
  server_side_apply = true
  force_conflicts   = true
  wait              = true
}

resource "helm_release" "argocd" {
  name       = "argocd-chart"
  chart      = "${path.module}/../charts/argocd"
  version    = "9.4.17"
  namespace  = "argocd"
  skip_crds  = true    # CRDs managed by kubectl_manifest above
  replace    = true
  wait       = true
  timeout    = 600

  values = [file("${path.module}/../charts/argocd/clusterValues/values.EksDemo.yaml")]

  depends_on = [kubectl_manifest.argocd_crds]
}
```
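
The snippet above relies on `local.crd_files` and two non-default providers that are not shown. A sketch of how they might be declared follows. The `argocd_version` variable is hypothetical, and the URLs assume the CRD layout of the upstream `argoproj/argo-cd` repository.

```hcl
terraform {
  required_providers {
    kubectl = {
      source = "alekc/kubectl"
    }
    http = {
      source = "hashicorp/http"
    }
  }
}

variable "argocd_version" {
  description = "Argo CD git tag to fetch CRDs from (hypothetical variable)"
  type        = string
}

locals {
  argocd_crd_base = "https://raw.githubusercontent.com/argoproj/argo-cd/${var.argocd_version}/manifests/crds"

  # One URL per CRD; each file holds a single YAML document
  crd_files = [
    "${local.argocd_crd_base}/application-crd.yaml",
    "${local.argocd_crd_base}/applicationset-crd.yaml",
    "${local.argocd_crd_base}/appproject-crd.yaml",
  ]
}
```

The tag should correspond to the Argo CD application version shipped by the chart, otherwise the CRDs and the controller can drift apart.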

ArgoCD values for NGINX ingress integration:

```yaml
# charts/argocd/clusterValues/values.EksDemo.yaml
configs:
  params:
    server.insecure: "true"  # TLS terminated at NLB

server:
  extraArgs:
    - --insecure

  ingress:
    enabled: true
    ingressClassName: "nginx"
    annotations:
      nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
      nginx.ingress.kubernetes.io/backend-protocol: "HTTP"
    hostname: argocd.reddycloud.com
    paths: /
    pathType: Prefix
    https: false
```

ArgoCD runs in `--insecure` mode because TLS is already terminated at the NLB. The user always sees HTTPS — ArgoCD just receives plain HTTP from NGINX.

* * *

## Secrets Management — External Secrets Operator

No secrets are hardcoded anywhere. The External Secrets Operator pulls from AWS Secrets Manager using IRSA:

```yaml
# K8s manifest deployed via ArgoCD
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: postgres-creds
  namespace: eks-demo
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-store
    kind: ClusterSecretStore
  target:
    name: postgres-creds
    creationPolicy: Owner
  data:
    - secretKey: POSTGRES_PASSWORD
      remoteRef:
        key: ecommerce-k8s-demo/postgres
        property: password
    - secretKey: POSTGRES_USER
      remoteRef:
        key: ecommerce-k8s-demo/postgres
        property: username
```

The flow:

```plaintext
AWS Secrets Manager
  → ExternalSecret CRD (IRSA authenticated)
  → Kubernetes Secret (auto-created, kept in sync)
  → App pod (env var or volume mount)
```
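
On the AWS side, the IRSA role from Step 4 needs a permissions policy that matches this flow. A least-privilege sketch, assuming all application secrets live under the `ecommerce-k8s-demo/` prefix and that `var.aws_region` is available in the IAM module (both assumptions):

```hcl
data "aws_caller_identity" "current" {}

data "aws_iam_policy_document" "external_secrets" {
  statement {
    effect  = "Allow"
    actions = ["secretsmanager:GetSecretValue"]

    # Secrets Manager appends a random suffix to secret ARNs, hence the wildcard
    resources = [
      "arn:aws:secretsmanager:${var.aws_region}:${data.aws_caller_identity.current.account_id}:secret:ecommerce-k8s-demo/*",
    ]
  }
}

resource "aws_iam_role_policy" "external_secrets" {
  name   = "${var.resource_name}-external-secrets"
  role   = aws_iam_role.external_secrets.id
  policy = data.aws_iam_policy_document.external_secrets.json
}
```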

* * *

## Observability Stack

The full observability stack is deployed via ArgoCD:

| Signal | Collector | Storage | Query |
| --- | --- | --- | --- |
| Metrics | Prometheus (ServiceMonitor scrape) | TSDB on EBS | Grafana PromQL |
| Traces | OTel Collector (OTLP gRPC :4317) | Jaeger | Grafana / Jaeger UI |
| Logs | Promtail DaemonSet | Loki | Grafana LogQL |
| Search | OpenSearch client (direct) | OpenSearch index | OpenSearch Dashboards |

The OTel Collector pipeline:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch:
  resource:

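exporters:
  # The dedicated jaeger exporter has been removed from recent Collector
  # releases; Jaeger accepts OTLP directly on port 4317.
  otlp/jaeger:
    endpoint: jaeger-collector:4317
    tls:
      insecure: true  # assumes plain gRPC inside the cluster
  prometheusremotewrite:
    endpoint: http://prometheus:9090/api/v1/write

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, resource]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheusremotewrite]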
```

* * *

## Cost Breakdown

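| Resource | Cost | Optimization |
| --- | --- | --- |
| EKS cluster | ~$7.20/mo | Control plane is billed at $0.10/hour, so this is equivalent to about 72 hours of uptime |
| SPOT t3.small × 5 | ~$14/mo | ~60% vs on-demand |
| NAT Gateway | ~$5/mo | Single AZ vs per-AZ; a NAT Gateway left running all month is ~$32 (see Step 3) |
| NLB | ~$16/mo | One NLB for everything |
| EBS volumes | ~$3/mo | gp3 storage class |
| Route53 | ~$0.50/mo | Hosted zone |
| **Total** | **~$46/mo** | vs ~$200+ on-demand multi-AZ |

The EKS and NAT Gateway figures appear to reflect partial-month usage, which fits a cluster that is destroyed when idle. Left running all month, the control plane alone costs about $73.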

* * *

## Key Takeaways

**Zero static credentials.** GitHub Actions OIDC means no AWS keys ever touch GitHub secrets, and IRSA means no AWS keys ever touch EKS nodes.

**Destroy-safe architecture.** Keeping EIPs and ECR in separate workspaces means the cluster can be completely torn down and rebuilt without updating DNS or rebuilding images.

**Single ACM cert covers everything.** One wildcard cert on the NLB eliminates cert-manager, Let's Encrypt rate limits, and per-app TLS configuration.

**Cost matters.** Spot instances, a single NAT Gateway, one NGINX-fronted NLB instead of an ALB per Ingress, and in-cluster pods instead of managed services deliver the same production patterns at a fraction of the cost.

* * *

*Source code: github.com/rajreddy/ecommerce-k8s-demo · Interactive architecture: codepen.io/qckuhtdx-the-scripter/pen/myrLwxP · Domain: reddycloud.com*