Building a Production-Grade EKS Platform on AWS with Terraform and GitOps
Overview
In this post I walk through how I built a fully automated, production-style Kubernetes platform on AWS EKS using Terraform, GitHub Actions OIDC, and ArgoCD — all optimized for cost without sacrificing reliability. Every component is provisioned as code, deployed without storing a single static AWS credential, and observable from day one.
The full source code is available at: https://github.com/rajasekhar-cloud25/infrastructure
Interactive architecture diagram: https://codepen.io/qckuhtdx-the-scripter/pen/myrLwxP
The Problem with Static Credentials
The traditional approach looks like this:
# ❌ The wrong way: credentials stored permanently in GitHub
env:
  AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
  AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
Problems with this approach:
Credentials are long-lived — if leaked, they work until manually rotated
They exist permanently in GitHub's secret store
Any workflow in the repo can use them
Rotation requires updating secrets in every repo that uses them
No audit trail of which workflow run used which credential
The Solution: OIDC Token Exchange
GitHub Actions supports OpenID Connect (OIDC). Instead of storing credentials, the workflow requests a short-lived token from GitHub's OIDC provider and exchanges it with AWS STS for temporary credentials tied to an IAM role:
GitHub Actions runner
│
├─ requests OIDC token from GitHub
│    (signed JWT containing: repo, branch, workflow, run ID)
│
├─ calls AWS STS AssumeRoleWithWebIdentity
│    (presents the OIDC token + role ARN)
│
├─ AWS validates: is this token from GitHub? ✅
│                 is the repo/branch in the trust policy? ✅
│
└─ AWS returns: temporary access key + secret + session token
     (expires in 1 hour by default, scoped to this specific IAM role)
The credentials exist only for the lifetime of the session (one hour by default), which comfortably covers a typical job. Nothing long-lived is stored in GitHub, so there is no standing secret to leak or rotate.
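To make the exchange more concrete, here is a minimal Python sketch of the claim checks described in the flow above. It is an illustration under stated assumptions, not how AWS implements validation: the token is built locally and unsigned, the repository name `my-org/infrastructure` is hypothetical, signature verification against GitHub's public keys is skipped entirely, and `fnmatchcase` only approximates IAM's `StringLike` wildcard matching. The audience value `sts.amazonaws.com` is the one the official `configure-aws-credentials` action requests by default.

```python
import base64
import json
from fnmatch import fnmatchcase


def b64url(data: bytes) -> str:
    """Base64url-encode without padding, the way JWT segments are encoded."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def decode_claims(jwt: str) -> dict:
    """Return the payload claims of a JWT WITHOUT verifying the signature."""
    payload = jwt.split(".")[1]
    payload += "=" * (-len(payload) % 4)  # restore the stripped padding
    return json.loads(base64.urlsafe_b64decode(payload))


def trust_policy_allows(claims: dict, sub_pattern: str) -> bool:
    """Approximate the trust-policy checks on issuer, audience and subject."""
    return (
        claims.get("iss") == "https://token.actions.githubusercontent.com"
        and claims.get("aud") == "sts.amazonaws.com"
        and fnmatchcase(claims.get("sub", ""), sub_pattern)
    )


# Hypothetical token for illustration only: header.payload.signature
claims_in = {
    "iss": "https://token.actions.githubusercontent.com",
    "aud": "sts.amazonaws.com",
    "sub": "repo:my-org/infrastructure:ref:refs/heads/main",
}
token = ".".join([
    b64url(b'{"alg":"none"}'),
    b64url(json.dumps(claims_in).encode()),
    "",
])

claims = decode_claims(token)
print(trust_policy_allows(claims, "repo:my-org/infrastructure:*"))
print(trust_policy_allows(claims, "repo:someone-else/fork:*"))
```

The first call matches because the `sub` claim starts with the repository named in the pattern; the second does not, which is why a workflow in a fork or an unrelated repository cannot assume the role.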
Project Structure
The infrastructure is organized as a set of Terraform modules, each with a single responsibility:
infrastructure/
.github/ ← GitHub Actions workflows
Main/ ← Root module, wires everything together
vpc/ ← VPC, subnets, IGW, NAT GW, route tables, SGs
iam/ ← IAM roles, IRSA roles, GitHub OIDC trust
eks/ ← EKS cluster, node group, access entries
ecr/ ← ECR repositories (separate workspace)
eip/ ← Elastic IPs for NLB (separate workspace)
k8s_namespaces/ ← All K8s namespaces pre-created
kubernetes-ingress/ ← NGINX Ingress Controller + NLB + Route53
argocd_deployment/ ← ArgoCD via local Helm chart
s3/ ← Terraform state bucket bootstrap
charts/ ← Local Helm charts (ArgoCD, NGINX)
bootstrap.sh ← Creates S3 bucket + DynamoDB table
The two most important design decisions here: ECR and EIPs live in separate workspaces. This means container images and static IP addresses survive a full cluster destroy and recreate — no DNS updates, no image rebuilds.
Step 1 — Bootstrapping State
Before any Terraform can run, the S3 state backend needs to exist. The bootstrap.sh script handles this:
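#!/bin/bash
# Stop on the first failed command or unset variable
set -euo pipefail

ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
BUCKET_NAME="ecommerce-demo-terraform-state-${ACCOUNT_ID}"

# us-east-1 needs no LocationConstraint; other regions do
aws s3api create-bucket --bucket "$BUCKET_NAME" --region us-east-1

# Versioning makes it possible to roll back a damaged state file
aws s3api put-bucket-versioning \
  --bucket "$BUCKET_NAME" \
  --versioning-configuration Status=Enabled

# Lock table for the S3 backend; the hash key must be named LockID
aws dynamodb create-table \
  --region us-east-1 \
  --table-name terraform-state-lock \
  --attribute-definitions AttributeName=LockID,AttributeType=S \
  --key-schema AttributeName=LockID,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST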
The bucket name is derived from the AWS account ID at runtime — no secrets needed anywhere.
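One way to consume those names is Terraform's partial backend configuration, where settings are passed to `terraform init` as `-backend-config` flags. The Python sketch below only assembles that command line. Whether the repository wires its backend this way is an assumption, and the account ID and state key are placeholders.

```python
def backend_config(account_id: str, state_key: str, region: str = "us-east-1") -> dict:
    """Build S3 backend settings that match what bootstrap.sh creates."""
    return {
        "bucket": f"ecommerce-demo-terraform-state-{account_id}",
        "key": state_key,
        "region": region,
        "dynamodb_table": "terraform-state-lock",
        "encrypt": "true",
    }


def init_command(config: dict) -> list[str]:
    """Render `terraform init` with one -backend-config flag per setting."""
    flags = [f"-backend-config={k}={v}" for k, v in config.items()]
    return ["terraform", "init", *flags]


# Placeholder account ID and a hypothetical state key
cmd = init_command(backend_config("123456789012", "main/terraform.tfstate"))
print(" ".join(cmd))
```

Because the bucket name is computed from the account ID, the same function works unchanged in any AWS account without a hardcoded name.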
Step 2 — GitHub Actions OIDC (Zero Static Credentials)
All three workflows authenticate to AWS using OIDC — the GitHub Actions token is exchanged for a short-lived IAM role. No AWS_ACCESS_KEY_ID is ever stored in GitHub secrets.
# .github/workflows/tf-apply.yaml
jobs:
  plan:
    permissions:
      id-token: write   # required to request the OIDC token
      contents: read
    steps:
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
          aws-region: ${{ vars.AWS_REGION }}
Three workflows are defined:
| Workflow | Trigger | Purpose |
|---|---|---|
| tf-plan | On every PR | Runs terraform plan, posts diff as comment |
| tf-apply | Merge to main | Requires manual approval, then applies |
| tf-destroy | Manual only | Requires typing "destroy" to confirm |
Step 3 — VPC Module
The VPC module creates everything the cluster needs to run in a private, secure network:
# modules/vpc/main.tf (key resources)
resource "aws_vpc" "main" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_hostnames = true
  enable_dns_support   = true
}

# Public subnets: NLB, NAT Gateway, Internet Gateway
resource "aws_subnet" "public" {
  count                   = 2
  vpc_id                  = aws_vpc.main.id
  cidr_block              = cidrsubnet("10.0.0.0/16", 8, count.index)
  availability_zone       = data.aws_availability_zones.available.names[count.index]
  map_public_ip_on_launch = true
  tags = {
    "kubernetes.io/role/elb" = "1"
  }
}

# Private subnets: EKS nodes only
resource "aws_subnet" "private" {
  count             = 2
  vpc_id            = aws_vpc.main.id
  cidr_block        = cidrsubnet("10.0.0.0/16", 8, count.index + 10)
  availability_zone = data.aws_availability_zones.available.names[count.index]
  tags = {
    "kubernetes.io/role/internal-elb"            = "1"
    "kubernetes.io/cluster/${var.resource_name}" = "shared"
  }
}

resource "aws_internet_gateway" "main" {
  vpc_id = aws_vpc.main.id
}

resource "aws_nat_gateway" "main" {
  allocation_id = aws_eip.nat.id
  subnet_id     = aws_subnet.public[0].id # Single AZ (cost optimized)

  # The NAT Gateway needs the Internet Gateway to exist first
  depends_on = [aws_internet_gateway.main]
}
Why a single NAT Gateway? One NAT Gateway instead of two saves ~$32/month. For a portfolio demo this is an acceptable tradeoff: the main risk is that private nodes in both AZs lose outbound internet access if the NAT Gateway's AZ goes down.
Route tables are straightforward:
Public RT:
  0.0.0.0/0 → Internet Gateway
Private RT:
  0.0.0.0/0 → NAT Gateway
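The subnet ranges in the VPC module come from Terraform's `cidrsubnet(prefix, newbits, netnum)` function. As a rough illustration of what it computes, the Python sketch below mimics the function for IPv4 using only the standard library and evaluates it with the same arguments as the module. The Python helper is written for this post and is not part of Terraform.

```python
import ipaddress


def cidrsubnet(prefix: str, newbits: int, netnum: int) -> str:
    """Mimic Terraform's cidrsubnet() for IPv4: extend the prefix by
    `newbits` bits and pick subnet number `netnum`."""
    base = ipaddress.ip_network(prefix)
    new_prefixlen = base.prefixlen + newbits
    if new_prefixlen > 32 or netnum >= 2 ** newbits:
        raise ValueError("subnet does not fit inside the parent prefix")
    # Shift the subnet number into the bits just below the parent prefix
    offset = netnum << (32 - new_prefixlen)
    address = ipaddress.ip_address(int(base.network_address) + offset)
    return f"{address}/{new_prefixlen}"


public = [cidrsubnet("10.0.0.0/16", 8, i) for i in range(2)]
private = [cidrsubnet("10.0.0.0/16", 8, i + 10) for i in range(2)]
print(public)   # ['10.0.0.0/24', '10.0.1.0/24']
print(private)  # ['10.0.10.0/24', '10.0.11.0/24']
```

The `+ 10` offset on the private subnets leaves 10.0.2.0/24 through 10.0.9.0/24 free, so more public subnets can be added later without renumbering.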
Step 4 — IAM Module
The IAM module handles all permissions with a single consolidated design. The key insight is that the EKS OIDC provider is created inside the IAM module — this avoids a circular dependency where IAM needs EKS and EKS needs IAM.
# modules/iam/main.tf

# EKS cluster role
resource "aws_iam_role" "eks_cluster" {
  name               = "${var.resource_name}-eks-cluster-role"
  assume_role_policy = data.aws_iam_policy_document.eks_cluster_assume.json
}

# GitHub Actions OIDC role (pre-created provider as data source)
data "aws_iam_openid_connect_provider" "github" {
  url = "https://token.actions.githubusercontent.com"
}

resource "aws_iam_role" "github_actions" {
  name = "${var.resource_name}-github-actions-role"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Federated = data.aws_iam_openid_connect_provider.github.arn }
      Action    = "sts:AssumeRoleWithWebIdentity"
      Condition = {
        # Only accept tokens that were minted for AWS STS
        StringEquals = {
          "token.actions.githubusercontent.com:aud" = "sts.amazonaws.com"
        }
        # Any branch, tag, or PR of this one repository
        StringLike = {
          "token.actions.githubusercontent.com:sub" = "repo:${var.github_repo}:*"
        }
      }
    }]
  })
}

# IRSA roles: one per component, least privilege
resource "aws_iam_role" "external_secrets" {
  name = "${var.resource_name}-external-secrets-role"
  # Trust policy scoped to the ESO service account only
  assume_role_policy = data.aws_iam_policy_document.irsa_external_secrets.json
}
Every pod that needs AWS access gets its own IRSA role — no shared node-level credentials:
| Component | IRSA Permissions |
|---|---|
| EBS CSI Driver | ec2:CreateVolume, ec2:AttachVolume |
| Cluster Autoscaler | autoscaling:SetDesiredCapacity |
| External Secrets | secretsmanager:GetSecretValue |
| cert-manager | route53:ChangeResourceRecordSets |
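What makes an IRSA role usable by a single pod identity is the trust policy, which pins the role to one Kubernetes service account through the cluster's OIDC issuer. The Python sketch below builds such a policy document. The account ID, issuer ID, and service account name are placeholders, and the helper is illustrative rather than taken from the repository.

```python
import json


def irsa_trust_policy(account_id: str, oidc_issuer: str,
                      namespace: str, service_account: str) -> dict:
    """Build a trust policy that only one Kubernetes service account can use.

    `oidc_issuer` is the cluster's OIDC issuer URL without the https:// prefix.
    """
    provider_arn = f"arn:aws:iam::{account_id}:oidc-provider/{oidc_issuer}"
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Federated": provider_arn},
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                "StringEquals": {
                    # Exactly one namespace/service-account pair may assume the role
                    f"{oidc_issuer}:sub":
                        f"system:serviceaccount:{namespace}:{service_account}",
                    f"{oidc_issuer}:aud": "sts.amazonaws.com",
                }
            },
        }],
    }


# Placeholder account, issuer ID and service account name
policy = irsa_trust_policy(
    "123456789012",
    "oidc.eks.us-east-1.amazonaws.com/id/EXAMPLE0123456789",
    "external-secrets",
    "external-secrets",
)
print(json.dumps(policy, indent=2))
```

Using `StringEquals` on the `sub` claim, rather than a wildcard, is what keeps a pod in another namespace from borrowing the role.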
Step 5 — EKS Module
The EKS cluster runs on SPOT t3.small instances for cost optimization. It uses the API_AND_CONFIG_MAP authentication mode, so access is granted through access entries instead of edits to the legacy aws-auth ConfigMap:
# modules/eks/main.tf
resource "aws_eks_cluster" "main" {
  name     = var.resource_name
  version  = var.cluster_version
  role_arn = var.cluster_role_arn

  vpc_config {
    subnet_ids              = var.private_subnet_ids
    endpoint_private_access = true
    endpoint_public_access  = true
  }

  access_config {
    authentication_mode                         = "API_AND_CONFIG_MAP"
    bootstrap_cluster_creator_admin_permissions = true
  }
}

resource "aws_eks_node_group" "main" {
  cluster_name   = aws_eks_cluster.main.name
  node_role_arn  = var.node_role_arn
  subnet_ids     = var.private_subnet_ids
  instance_types = ["t3.small"]
  capacity_type  = "SPOT" # ~60% cheaper than on-demand

  scaling_config {
    desired_size = 5
    min_size     = 2
    max_size     = 5
  }
}

# Access entries: no aws-auth ConfigMap editing required.
# A STANDARD entry grants nothing by itself; pair it with an
# aws_eks_access_policy_association or kubernetes_groups.
resource "aws_eks_access_entry" "github_actions" {
  cluster_name  = aws_eks_cluster.main.name
  principal_arn = var.github_actions_role_arn
  type          = "STANDARD"
}
The Kubernetes and Helm providers authenticate using exec with aws eks get-token — this avoids plan-time failures when the cluster doesn't exist yet:
provider "kubernetes" {
  host                   = module.eks.cluster_endpoint
  cluster_ca_certificate = base64decode(module.eks.cluster_ca_certificate)

  exec {
    api_version = "client.authentication.k8s.io/v1beta1"
    command     = "aws"
    args        = ["eks", "get-token", "--cluster-name", module.eks.cluster_name, "--region", var.aws_region]
  }
}
Step 6 — ECR Module (Separate Workspace)
ECR repositories are managed in their own Terraform workspace so images are never accidentally deleted when the main cluster is torn down:
# ecr/main.tf
resource "aws_ecr_repository" "app" {
  for_each             = toset(var.repository_names)
  name                 = each.value
  image_tag_mutability = "MUTABLE"

  image_scanning_configuration {
    scan_on_push = true
  }
}
GitHub Actions builds and pushes on every merge:
- name: Build and push
  run: |
    aws ecr get-login-password | docker login --username AWS \
      --password-stdin $ECR_REGISTRY
    docker buildx build --platform linux/amd64 \
      --push -t $ECR_REGISTRY/$ECR_REPO:$IMAGE_TAG .
Step 7 — EIP Module (Separate Workspace)
Static Elastic IPs are created in their own workspace — separate from the main cluster. This means the NLB always gets the same IP addresses, Route53 A records never need updating, and the cluster can be completely rebuilt without changing DNS:
# eip/main.tf
resource "aws_eip" "nlb" {
  count  = 2
  domain = "vpc"
  tags = {
    Name = "${var.resource_name}-nlb-eip-${count.index}"
  }
}
The allocation IDs are then passed as a variable to the main workspace:
# environments/eks-demo-dev.tfvars
nlb_eip_allocation_ids = [
  "eipalloc-09595a182e792f01f",
  "eipalloc-032c83197c359b3fe"
]
Step 8 — Kubernetes Namespaces Module
All namespaces are created before any Helm chart runs. This prevents race conditions where a chart tries to create resources in a namespace that doesn't exist yet:
# modules/k8s_namespaces/main.tf
resource "kubernetes_namespace" "namespaces" {
  for_each = toset([
    "argocd",
    "nginx-ingress",
    "monitoring",
    "external-secrets",
    "cert-manager",
    "eks-demo",
    "shared-os",
    "kubecost"
  ])

  metadata {
    name = each.value
  }
}
This module runs before kubernetes-ingress and argocd_deployment in the dependency graph.
Step 9 — NGINX Ingress Controller + NLB + Route53
The NGINX Ingress Controller is the traffic gateway for the entire cluster. It is deployed via a local Helm chart with NLB annotations that attach the static EIPs:
# modules/kubernetes-ingress/main.tf
resource "helm_release" "nginx_ingress" {
  name      = "nginx-ingress"
  chart     = "${path.module}/../charts/kubernetes-ingress"
  namespace = "nginx-ingress"
  timeout   = 600
  wait      = true
  atomic    = true

  set {
    name  = "controller.service.annotations.service\\.beta\\.kubernetes\\.io/aws-load-balancer-type"
    value = "nlb"
  }
  set {
    name  = "controller.service.annotations.service\\.beta\\.kubernetes\\.io/aws-load-balancer-eip-allocations"
    value = join("\\,", var.nlb_eip_allocation_ids)
  }
  set {
    name  = "controller.service.annotations.service\\.beta\\.kubernetes\\.io/aws-load-balancer-subnets"
    value = join("\\,", var.public_subnet_ids)
  }
  set {
    name  = "controller.service.annotations.service\\.beta\\.kubernetes\\.io/aws-load-balancer-ssl-cert"
    value = var.acm_certificate_arn
  }
  set {
    name  = "controller.service.annotations.service\\.beta\\.kubernetes\\.io/aws-load-balancer-ssl-ports"
    value = "443"
  }
}
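The doubled backslashes in the `set` blocks are easy to get wrong. In HCL, `\\.` is a literal backslash followed by a dot, which is what Helm expects so that dots inside an annotation key are not read as path separators; commas inside a value need the same treatment. The Python sketch below reproduces the strings Helm ends up receiving. The helper names and the allocation IDs are made up for illustration.

```python
def helm_set_name(annotation: str) -> str:
    """Path for `--set` that targets one Service annotation.

    Dots inside the annotation key are escaped so Helm does not
    treat them as path separators.
    """
    escaped = annotation.replace(".", "\\.")
    return f"controller.service.annotations.{escaped}"


def helm_set_list(values: list[str]) -> str:
    """Join values with escaped commas so Helm keeps them as one string."""
    return "\\,".join(values)


name = helm_set_name("service.beta.kubernetes.io/aws-load-balancer-eip-allocations")
value = helm_set_list(["eipalloc-aaa", "eipalloc-bbb"])  # placeholder IDs
print(name)
print(value)
```

Without the escaped comma, Helm would split the value at the comma and try to parse the second allocation ID as a separate `key=value` pair.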
Why NGINX over AWS ALB Controller?
NGINX is free to run, whereas the ALB Controller by default provisions a new ALB per Ingress (~$16/mo each) unless Ingresses are explicitly grouped
No per-app ACM ARN annotation required — one cert at the NLB level covers everything
Portable — works identically on any cloud or on-premises
Route53 A records point directly to the static EIP addresses:
data "aws_eip" "nlb" {
  count = length(var.nlb_eip_allocation_ids)
  id    = var.nlb_eip_allocation_ids[count.index]
}

resource "aws_route53_record" "dns_records" {
  for_each = toset(var.dns_names)
  zone_id  = var.route53_zone_id
  name     = "${each.value}.${var.domain_name}"
  type     = "A"
  ttl      = 300
  records  = data.aws_eip.nlb[*].public_ip
}
TLS flow:
User → HTTPS
→ NLB (ACM wildcard *.reddycloud.com terminates TLS)
→ HTTP → NGINX Ingress
→ HTTP → App pod
No cert-manager needed. AWS handles certificate renewal automatically.
Step 10 — ArgoCD Deployment
ArgoCD is deployed via a local Helm chart with CRDs managed separately using the alekc/kubectl provider to avoid Helm CRD conflicts:
# modules/argocd_deployment/main.tf

# CRDs managed outside Helm to avoid upgrade conflicts
data "http" "argocd_crds" {
  for_each = toset(local.crd_files)
  url      = each.value
}

resource "kubectl_manifest" "argocd_crds" {
  for_each          = toset(local.crd_files)
  yaml_body         = data.http.argocd_crds[each.value].response_body
  server_side_apply = true
  force_conflicts   = true
  wait              = true
}

resource "helm_release" "argocd" {
  name       = "argocd-chart"
  chart      = "${path.module}/../charts/argocd"
  version    = "9.4.17"
  namespace  = "argocd"
  skip_crds  = true # CRDs managed by kubectl_manifest above
  replace    = true
  wait       = true
  timeout    = 600
  values     = [file("${path.module}/../charts/argocd/clusterValues/values.EksDemo.yaml")]
  depends_on = [kubectl_manifest.argocd_crds]
}
ArgoCD values for NGINX ingress integration:
# charts/argocd/clusterValues/values.EksDemo.yaml
configs:
  params:
    server.insecure: "true"  # TLS terminated at NLB
server:
  extraArgs:
    - --insecure
  ingress:
    enabled: true
    ingressClassName: "nginx"
    annotations:
      nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
      nginx.ingress.kubernetes.io/backend-protocol: "HTTP"
    hostname: argocd.reddycloud.com
    paths: /
    pathType: Prefix
    https: false
ArgoCD runs in --insecure mode because TLS is already terminated at the NLB. The user always sees HTTPS — ArgoCD just receives plain HTTP from NGINX.
Secrets Management — External Secrets Operator
No secrets are hardcoded anywhere. The External Secrets Operator pulls from AWS Secrets Manager using IRSA:
# K8s manifest deployed via ArgoCD
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: postgres-creds
  namespace: eks-demo
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-store
    kind: ClusterSecretStore
  target:
    name: postgres-creds
    creationPolicy: Owner
  data:
    - secretKey: POSTGRES_PASSWORD
      remoteRef:
        key: ecommerce-k8s-demo/postgres
        property: password
    - secretKey: POSTGRES_USER
      remoteRef:
        key: ecommerce-k8s-demo/postgres
        property: username
The flow:
AWS Secrets Manager
→ ExternalSecret CRD (IRSA authenticated)
→ Kubernetes Secret (auto-created, kept in sync)
→ App pod (env var or volume mount)
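The mapping step in that flow can be pictured as a small lookup: each `data` entry names a remote secret (`key`), picks one field out of its JSON body (`property`), and stores the result under `secretKey` in the Kubernetes Secret. The Python sketch below models only that mapping with an in-memory stand-in for Secrets Manager. It is a simplified illustration, not the operator's actual code, and the credential values are made up.

```python
import json


def resolve_external_secret(data_entries: list[dict], remote_store: dict) -> dict:
    """Map each ExternalSecret `data` entry to a key in the resulting
    Kubernetes Secret by reading one property from a JSON secret."""
    result = {}
    for entry in data_entries:
        ref = entry["remoteRef"]
        remote_json = json.loads(remote_store[ref["key"]])
        result[entry["secretKey"]] = remote_json[ref["property"]]
    return result


# Stand-in for AWS Secrets Manager; the values are made up
fake_secrets_manager = {
    "ecommerce-k8s-demo/postgres": json.dumps(
        {"username": "app_user", "password": "example-only"}
    ),
}
data = [
    {"secretKey": "POSTGRES_PASSWORD",
     "remoteRef": {"key": "ecommerce-k8s-demo/postgres", "property": "password"}},
    {"secretKey": "POSTGRES_USER",
     "remoteRef": {"key": "ecommerce-k8s-demo/postgres", "property": "username"}},
]
print(resolve_external_secret(data, fake_secrets_manager))
```

Both entries read from the same remote secret, so a single JSON document in Secrets Manager can back several keys of one Kubernetes Secret.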
Observability Stack
The full observability stack is deployed via ArgoCD:
| Signal | Collector | Storage | Query |
|---|---|---|---|
| Metrics | Prometheus (ServiceMonitor scrape) | TSDB on EBS | Grafana PromQL |
| Traces | OTel Collector (OTLP gRPC :4317) | Jaeger | Grafana / Jaeger UI |
| Logs | Promtail DaemonSet | Loki | Grafana LogQL |
| Search | OpenSearch client (direct) | OpenSearch index | OpenSearch Dashboards |
The OTel Collector pipeline:
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
processors:
  batch:
  resource:
exporters:
  # The jaeger exporter exists only in older Collector releases;
  # newer ones export OTLP to Jaeger instead
  jaeger:
    endpoint: jaeger-collector:14250
  prometheusremotewrite:
    endpoint: http://prometheus:9090/api/v1/write
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, resource]
      exporters: [jaeger]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheusremotewrite]
Cost Breakdown
| Resource | Cost | Optimization |
|---|---|---|
| EKS cluster | ~$7.20/mo | Fixed control plane cost |
| SPOT t3.small × 5 | ~$14/mo | ~60% vs on-demand |
| NAT Gateway | ~$5/mo | Single AZ vs per-AZ |
| NLB | ~$16/mo | One NLB for everything |
| EBS volumes | ~$3/mo | gp3 storage class |
| Route53 | ~$0.50/mo | Hosted zone |
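| Total | ~$46/mo | vs ~$200+ on-demand multi-AZ |
The EKS and NAT Gateway rows sit well below 24/7 list pricing in us-east-1 ($0.10/hour, about $73/month, for the control plane and $0.045/hour, about $33/month, per NAT Gateway), so this total only holds for a cluster that runs part of the month.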
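As a quick check of the arithmetic, the six line items in the cost table can be summed directly. The sketch below works in cents to avoid floating-point rounding and uses only the figures quoted in the table.

```python
# Monthly figures from the cost table, in cents to avoid float rounding
monthly_cents = {
    "EKS cluster": 720,
    "SPOT t3.small x 5": 1400,
    "NAT Gateway": 500,
    "NLB": 1600,
    "EBS volumes": 300,
    "Route53": 50,
}

total_cents = sum(monthly_cents.values())
print(f"Total: ${total_cents / 100:.2f}/mo")  # Total: $45.70/mo
```

The sum comes to $45.70, which the table rounds to ~$46/mo.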
Key Takeaways
Zero static credentials — GitHub Actions OIDC means no AWS keys ever touch GitHub secrets. IRSA means no AWS keys ever touch EKS nodes.
Destroy-safe architecture — EIPs and ECR in separate workspaces means the cluster can be completely torn down and rebuilt without updating DNS or rebuilding images.
Single ACM cert covers everything — One wildcard cert on the NLB eliminates cert-manager, Let's Encrypt rate limits, and per-app TLS configuration.
Cost matters — SPOT instances, single NAT Gateway, NGINX instead of per-ALB cost, pods instead of managed services. Same production patterns at a fraction of the cost.
Source code: github.com/rajreddy/ecommerce-k8s-demo
Interactive architecture: codepen.io/qckuhtdx-the-scripter/pen/myrLwxP
Domain: reddycloud.com

