<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[raja.cloud]]></title><description><![CDATA[raja.cloud]]></description><link>https://blog.rajasekharcloud.com</link><image><url>https://cdn.hashnode.com/res/hashnode/image/upload/v1593680282896/kNC7E8IR4.png</url><title>raja.cloud</title><link>https://blog.rajasekharcloud.com</link></image><generator>RSS for Node</generator><lastBuildDate>Thu, 16 Apr 2026 16:36:10 GMT</lastBuildDate><atom:link href="https://blog.rajasekharcloud.com/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Building a Production Ecommerce Platform on AWS EKS]]></title><description><![CDATA[Building and Deploying a Full-Stack Ecommerce Platform: From Zero to Production on AWS EKS
By Rajasekhar Reddy | Senior DevOps Engineer | April 2026

For years, I wanted to build an ecommerce applicat]]></description><link>https://blog.rajasekharcloud.com/building-a-production-ecommerce-platform-on-aws-eks</link><guid isPermaLink="true">https://blog.rajasekharcloud.com/building-a-production-ecommerce-platform-on-aws-eks</guid><category><![CDATA[FastAPI]]></category><category><![CDATA[EKS]]></category><category><![CDATA[Terraform]]></category><category><![CDATA[Python]]></category><category><![CDATA[ecommerce]]></category><category><![CDATA[eCommerce Website Development]]></category><dc:creator><![CDATA[Rajasekhar Reddy]]></dc:creator><pubDate>Sat, 11 Apr 2026 15:10:17 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/69c772587cf2706510bf589c/437ca7a4-5316-40a4-8afe-703a2695d988.svg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1>Building and Deploying a Full-Stack Ecommerce Platform: From Zero to Production on AWS EKS</h1>
<p><em>By Rajasekhar Reddy | Senior DevOps Engineer | April 2026</em></p>
<hr />
<p>For years, I wanted to build an ecommerce application. I started multiple times and never finished. The scope was always too big, the technology choices paralyzing, the motivation fading after the first few bugs.</p>
<p>Then I decided to stop planning and start shipping. I built a complete ecommerce platform with real payments, deployed it to AWS EKS with full observability, and made it production-ready — all in a few focused sessions.</p>
<p>This is the story of how I built Raj Store, what architectural decisions I made, and why I chose a modular monolith over microservices. Whether you're a developer looking for inspiration, a DevOps engineer curious about full-stack deployment, or a business stakeholder evaluating technology approaches — there's something here for you.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69c772587cf2706510bf589c/71d48cc9-8843-405a-8262-985858f8cd24.png" alt="" style="display:block;margin:0 auto" />

<p><em>Raj Store — 100 products, real images, search with filters, hover animations</em></p>
<hr />
<h2>The Problem I Was Solving</h2>
<p>Every ecommerce tutorial teaches you the basics — a product list, a cart, maybe a checkout page. But none of them show you what it takes to go from "it works on localhost" to "it's running in production with real payments, real observability, and real deployment pipelines."</p>
<p>I wanted to bridge that gap. Not just build an app, but deploy it the way a real engineering team would.</p>
<hr />
<h2>What I Built</h2>
<h3>The Application</h3>
<p>Raj Store is a full-featured ecommerce platform built with Python (FastAPI) on the backend and server-rendered HTML with Jinja2 + TailwindCSS on the frontend. No React, no separate frontend repo — just clean server-side rendering with HTMX for interactivity.</p>
<p><strong>Core features:</strong></p>
<ul>
<li><p>User authentication with JWT tokens stored in secure HTTP-only cookies</p>
</li>
<li><p>Product catalog with 100+ real products seeded from DummyJSON</p>
</li>
<li><p>Full-text search with category filters, price range, and sorting</p>
</li>
<li><p>Shopping cart with quantity management</p>
</li>
<li><p>Real payment processing via Stripe Checkout (test mode)</p>
</li>
<li><p>Complete order lifecycle: Pending → Confirmed → Shipped → Delivered</p>
</li>
<li><p>Amazon-style order numbers (ORD-20260408-A7B2C9)</p>
</li>
<li><p>Admin panel with dashboard, order management, and Stripe refunds</p>
</li>
<li><p>Product reviews and 5-star ratings (purchase-gated — only buyers can review)</p>
</li>
<li><p>Wishlist for saving products</p>
</li>
<li><p>International shipping addresses with 26-country dropdown</p>
</li>
<li><p>Email notifications for registration, order confirmation, and status updates</p>
</li>
</ul>
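<p>One small detail worth making concrete: the Amazon-style order numbers. Here is a sketch (my illustration, not necessarily the project's exact implementation) of generating that format with nothing but the standard library:</p>
<pre><code class="language-python">import secrets
import string
from datetime import datetime, timezone

def generate_order_number(now=None):
    """Return an Amazon-style order number like ORD-20260408-A7B2C9."""
    now = now or datetime.now(timezone.utc)
    date_part = now.strftime("%Y%m%d")
    # secrets (not random) so one order ID is not guessable from another
    alphabet = string.ascii_uppercase + string.digits
    suffix = "".join(secrets.choice(alphabet) for _ in range(6))
    return f"ORD-{date_part}-{suffix}"
</code></pre>
<p>The date prefix makes IDs sortable and human-scannable in the admin panel, while the random suffix keeps them unguessable.</p>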
<img src="https://cdn.hashnode.com/uploads/covers/69c772587cf2706510bf589c/628fcdf5-4a11-44e1-965a-9f2d5f9099a3.png" alt="" style="display:block;margin:0 auto" />

<p><em>Product detail page with image, pricing, stock status, Add to Cart, wishlist, and customer reviews</em></p>
<img src="screenshots/cart.png" alt="Shopping Cart" style="display:block;margin:0 auto" />

<p><em>Shopping cart with subtotals, total calculation, and Stripe checkout button</em></p>
<hr />
<h2>Architecture: Why I Chose a Modular Monolith</h2>
<p>This is probably the most important decision I made, and it goes against what most tutorials tell you.</p>
<p>When I started planning, my instinct was to split everything into microservices: auth-service, product-service, cart-service, order-service, payment-service. Seven separate applications, seven databases, seven deployment pipelines.</p>
<p>Then I asked myself: <em>who is this for?</em></p>
<p>I'm a solo developer building a portfolio project. I don't have a team of 50 engineers who need to deploy independently. I don't have millions of requests per second that require different scaling strategies for different components. I don't have organizational boundaries that mandate service separation.</p>
<p>What I do have is one application that needs to work reliably, be easy to debug, and be impressive to demo.</p>
<p><strong>A modular monolith was the right call.</strong> Here's the actual code structure:</p>
<pre><code class="language-plaintext">app/
├── main.py              # Entry point, router registration
├── config.py            # Environment-based settings (Pydantic)
├── database.py          # SQLAlchemy engine + session
├── dependencies.py      # Auth dependencies
├── otel.py              # OpenTelemetry setup (graceful degradation)
├── models/              # SQLAlchemy ORM models
│   ├── user.py
│   ├── product.py
│   ├── cart.py
│   ├── order.py
│   ├── review.py
│   └── ...
├── schemas/             # Pydantic request/response schemas
├── services/            # Business logic layer
│   ├── product_service.py
│   ├── cart_service.py
│   ├── order_service.py
│   ├── search_service.py   # OpenSearch integration
│   └── email_service.py
└── routers/             # HTTP endpoints
    ├── auth.py
    ├── pages.py         # Server-rendered HTML pages
    ├── product.py
    ├── cart.py
    ├── order.py         # Stripe checkout + payments
    ├── admin.py
    └── ...
</code></pre>
<p>Each "module" (auth, products, cart, orders) has its own model, schema, service, and router — but they share one database, one process, and one deployment. This is exactly how Shopify (Rails monolith), GitHub (Rails monolith), and Stack Overflow (.NET monolith) are built.</p>
<p>The key insight: <strong>you can always extract a microservice later when you have real traffic patterns showing which module needs independent scaling.</strong> Starting with microservices before you have that data is premature optimization at the architectural level.</p>
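<p>To make the "one database, one process" point concrete, here is a minimal self-contained sketch — stdlib <code>sqlite3</code> standing in for PostgreSQL, with invented table and function names — of why cross-module transactions are trivial in a monolith:</p>
<pre><code class="language-python">import sqlite3

# One shared database for all modules -- the essence of the modular monolith.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE carts (user_id INTEGER, product TEXT, qty INTEGER)")
conn.execute("CREATE TABLE orders (user_id INTEGER, product TEXT, qty INTEGER, status TEXT)")

# cart "module": owns cart logic, but uses the shared connection
def add_to_cart(user_id, product, qty):
    conn.execute("INSERT INTO carts VALUES (?, ?, ?)", (user_id, product, qty))

# order "module": checkout reads the cart and writes orders in ONE transaction --
# trivial here, but a distributed-saga problem once cart and orders are separate services
def checkout(user_id):
    with conn:  # commit everything or nothing
        items = conn.execute(
            "SELECT product, qty FROM carts WHERE user_id = ?", (user_id,)
        ).fetchall()
        for product, qty in items:
            conn.execute(
                "INSERT INTO orders VALUES (?, ?, ?, 'pending')", (user_id, product, qty)
            )
        conn.execute("DELETE FROM carts WHERE user_id = ?", (user_id,))
    return len(items)

add_to_cart(1, "keyboard", 2)
print(checkout(1))
</code></pre>
<p>If checkout fails halfway, the transaction rolls back and the cart is untouched — no compensating actions, no message queues.</p>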
<hr />
<h2>The Tech Stack</h2>
<table>
<thead>
<tr>
<th>Layer</th>
<th>Technology</th>
<th>Why</th>
</tr>
</thead>
<tbody><tr>
<td>Backend</td>
<td>FastAPI (Python 3.11)</td>
<td>Async, type-safe, auto-generated OpenAPI docs</td>
</tr>
<tr>
<td>ORM</td>
<td>SQLAlchemy 2.x</td>
<td>Industry standard, works with SQLite and PostgreSQL</td>
</tr>
<tr>
<td>Templates</td>
<td>Jinja2 + TailwindCSS</td>
<td>Server-rendered, no JavaScript framework needed</td>
</tr>
<tr>
<td>Interactivity</td>
<td>HTMX</td>
<td>Add-to-cart without page reload, minimal JS</td>
</tr>
<tr>
<td>Payments</td>
<td>Stripe Checkout</td>
<td>PCI compliant, hosted payment page, real refunds</td>
</tr>
<tr>
<td>Database</td>
<td>PostgreSQL 16</td>
<td>Production-grade, running as StatefulSet on EKS</td>
</tr>
<tr>
<td>Search</td>
<td>OpenSearch</td>
<td>Fuzzy full-text search with relevance ranking</td>
</tr>
<tr>
<td>Cache</td>
<td>Redis 7 (ready, not yet enabled)</td>
<td>Product listing cache, session store</td>
</tr>
<tr>
<td>Tracing</td>
<td>OpenTelemetry + Jaeger</td>
<td>Distributed tracing across every HTTP request</td>
</tr>
<tr>
<td>Metrics</td>
<td>Prometheus + Grafana</td>
<td>Request rate, latency percentiles, pod health</td>
</tr>
<tr>
<td>Analytics</td>
<td>Apache Superset</td>
<td>Revenue dashboards, sales analytics, customer insights</td>
</tr>
</tbody></table>
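<p>For the OpenSearch row above, here is a hedged sketch of the kind of query body a search service might build for fuzzy full-text search with filters — field names, boosts, and sort handling are my assumptions, not the project's actual mapping:</p>
<pre><code class="language-python">def build_search_query(term, category=None, min_price=None, max_price=None, sort=None):
    """Build an OpenSearch query body: fuzzy full-text match plus filters."""
    filters = []
    if category:
        filters.append({"term": {"category": category}})
    price_range = {}
    if min_price is not None:
        price_range["gte"] = min_price
    if max_price is not None:
        price_range["lte"] = max_price
    if price_range:
        filters.append({"range": {"price": price_range}})

    body = {
        "query": {
            "bool": {
                "must": [{
                    "multi_match": {
                        "query": term,
                        "fields": ["title^2", "description"],  # boost title matches
                        "fuzziness": "AUTO",  # tolerate typos like "labtop"
                    }
                }],
                "filter": filters,  # filters don't affect relevance scoring
            }
        }
    }
    if sort:
        body["sort"] = [{sort: {"order": "asc"}}]
    return body
</code></pre>
<p>Putting category and price in the <code>filter</code> clause (rather than <code>must</code>) keeps relevance ranking driven purely by the text match.</p>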
<hr />
<h2>Deployment: Production-Grade EKS</h2>
<p>This is where my DevOps background made the difference. The application runs on AWS EKS with a full production deployment pipeline.</p>
<h3>Infrastructure (Terraform)</h3>
<p>Everything is Infrastructure as Code. One <code>terraform apply</code> creates:</p>
<ul>
<li><p><strong>VPC</strong> with private subnets across 3 availability zones</p>
</li>
<li><p><strong>EKS cluster</strong> with managed node groups (5 × t3.small)</p>
</li>
<li><p><strong>IAM roles</strong> with IRSA (IAM Roles for Service Accounts) — no long-lived credentials anywhere</p>
</li>
<li><p><strong>EBS CSI driver</strong> for persistent volume provisioning</p>
</li>
<li><p><strong>ECR</strong> for container image storage</p>
</li>
</ul>
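<p>A hedged sketch of what that Terraform can look like, using the community <code>terraform-aws-modules</code> VPC and EKS modules — names, CIDRs, and sizes here are illustrative, not the project's exact configuration:</p>
<pre><code class="language-hcl">module "vpc" {
  source = "terraform-aws-modules/vpc/aws"

  name            = "raj-store"
  cidr            = "10.0.0.0/16"
  azs             = ["us-east-1a", "us-east-1b", "us-east-1c"]
  private_subnets = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
}

module "eks" {
  source = "terraform-aws-modules/eks/aws"

  cluster_name    = "raj-store"
  cluster_version = "1.29"
  vpc_id          = module.vpc.vpc_id
  subnet_ids      = module.vpc.private_subnets

  eks_managed_node_groups = {
    default = {
      instance_types = ["t3.small"]
      min_size       = 3
      desired_size   = 5
      max_size       = 6
    }
  }
}
</code></pre>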
<h3>CI/CD Pipeline</h3>
<p>The deployment pipeline uses GitHub Actions with OIDC authentication to AWS — no access keys stored anywhere.</p>
<pre><code class="language-plaintext">Developer pushes code to main
    ↓
GitHub Actions triggers (OIDC → AWS)
    ↓
Docker image built and pushed to ECR
    ↓
ArgoCD detects new image tag in git
    ↓
ArgoCD syncs Helm chart to EKS
    ↓
Rolling update with zero downtime
</code></pre>
<p><em>GitHub Actions build pipeline — OIDC auth, Docker build, ECR push</em></p>
<h3>GitOps with ArgoCD</h3>
<p>ArgoCD watches the Helm chart in the git repository. Any change to the chart (new image tag, config change, resource limit adjustment) is automatically applied to the cluster. No manual <code>kubectl apply</code> needed.</p>
<img src="screenshots/argocd.png" alt="ArgoCD Sync" style="display:block;margin:0 auto" />

<p><em>ArgoCD application view showing all Kubernetes resources in sync</em></p>
<h3>Secrets Management</h3>
<p>Application secrets (database passwords, Stripe API keys, JWT signing keys) are stored in <strong>AWS Secrets Manager</strong> and synced to Kubernetes via <strong>External Secrets Operator</strong>. The git repository contains zero secrets — only references to which keys to fetch.</p>
<pre><code class="language-plaintext">AWS Secrets Manager (raj-store/prod)
    ↓ (ESO syncs every 1 hour)
Kubernetes Secret (raj-store-secrets)
    ↓ (mounted as env vars)
FastAPI Pod
</code></pre>
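<p>That sync is declared with an <code>ExternalSecret</code> manifest roughly like the following — the secret-store name is an assumption; only the AWS key path (<code>raj-store/prod</code>) comes from the diagram above:</p>
<pre><code class="language-yaml">apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: raj-store-secrets
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager   # ClusterSecretStore authenticated via IRSA
    kind: ClusterSecretStore
  target:
    name: raj-store-secrets     # Kubernetes Secret that ESO creates/updates
  dataFrom:
    - extract:
        key: raj-store/prod     # AWS Secrets Manager entry
</code></pre>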
<p>Non-sensitive configuration (service hostnames, ports, feature flags) lives in a Kubernetes ConfigMap managed by the Helm chart.</p>
<hr />
<h2>Observability: Seeing Everything</h2>
<p>This is where the project goes from "deployed app" to "production-ready platform." I instrumented the application with three pillars of observability: traces, metrics, and analytics.</p>
<h3>Distributed Tracing (OpenTelemetry + Jaeger)</h3>
<p>Every HTTP request is automatically traced — from the NGINX ingress through the FastAPI route handler, into every SQL query, and out to external API calls (like Stripe).</p>
<img src="screenshots/jaeger-traces.png" alt="Jaeger Traces" style="display:block;margin:0 auto" />

<p><em>Jaeger showing distributed traces — every request broken down into spans with timing</em></p>
<p>The OpenTelemetry integration is auto-instrumented. I added ~30 lines of setup code, and every FastAPI route, every SQLAlchemy query, and every outbound HTTP call is traced automatically. Health check and readiness probe endpoints are excluded from traces to reduce noise.</p>
<p>When debugging a slow checkout, I can see exactly where time is spent:</p>
<pre><code class="language-plaintext">POST /orders/checkout-form           2.3s total
├── get_current_user                     2ms
├── get_all_cart_items (SQL)            18ms
├── get_user_addresses (SQL)             8ms
└── stripe.checkout.Session.create   2200ms  ← Stripe API call
</code></pre>
<h3>Metrics (Prometheus + Grafana)</h3>
<p>Prometheus scrapes metrics from the FastAPI application via the <code>/metrics</code> endpoint (provided by <code>prometheus-fastapi-instrumentator</code>). Grafana dashboards show:</p>
<ul>
<li><p>Request rate per endpoint</p>
</li>
<li><p>Response time percentiles (p50, p95, p99)</p>
</li>
<li><p>Error rate percentage</p>
</li>
<li><p>Pod CPU and memory utilization</p>
</li>
<li><p>HTTP status code distribution</p>
</li>
</ul>
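<p>Assuming the instrumentator's default metric names (<code>http_requests_total</code> and the <code>http_request_duration_seconds</code> histogram — verify against your own <code>/metrics</code> output), the panels above map to PromQL roughly like:</p>
<pre><code class="language-plaintext"># Request rate per endpoint
sum(rate(http_requests_total[5m])) by (handler)

# p95 latency across all requests
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# Error rate: 5xx as a fraction of all requests
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))
</code></pre>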
<img src="screenshots/grafana-dashboard.png" alt="Grafana Dashboard" style="display:block;margin:0 auto" />

<p><em>Grafana dashboard — request rate, latency percentiles, error rate, pod resources</em></p>
<h3>Business Analytics (Apache Superset)</h3>
<p>Superset connects directly to the PostgreSQL database (via a read-only user for security) and provides SQL-powered dashboards for business metrics:</p>
<ul>
<li><p>Revenue trends over time</p>
</li>
<li><p>Top-selling products by revenue</p>
</li>
<li><p>Order status distribution (pending vs confirmed vs shipped vs delivered)</p>
</li>
<li><p>Customer geography</p>
</li>
<li><p>Average order value</p>
</li>
<li><p>Product rating distribution</p>
</li>
</ul>
<img src="screenshots/superset-dashboard.png" alt="Superset Dashboard" style="display:block;margin:0 auto" />

<p><em>Superset analytics dashboard — revenue, top products, order status, customer insights</em></p>
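<p>The read-only PostgreSQL user Superset connects with can be created using standard grants — database and role names here are illustrative:</p>
<pre><code class="language-sql">-- Role that can read every table but change nothing
CREATE ROLE superset_ro LOGIN PASSWORD 'change-me';
GRANT CONNECT ON DATABASE rajstore TO superset_ro;
GRANT USAGE ON SCHEMA public TO superset_ro;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO superset_ro;
-- Make tables created later readable too
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO superset_ro;
</code></pre>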
<hr />
<h2>Stripe Integration: Real Payments</h2>
<p>The payment flow uses Stripe Checkout — a hosted payment page that handles card validation, 3D Secure, and PCI compliance without any card data touching my server.</p>
<p><strong>Checkout flow:</strong></p>
<ol>
<li><p>User clicks "Proceed to Checkout" in the cart</p>
</li>
<li><p>App creates a Stripe Checkout Session with line items</p>
</li>
<li><p>User is redirected to Stripe's hosted payment page</p>
</li>
<li><p>After payment, Stripe redirects back with a session ID</p>
</li>
<li><p>App verifies payment status with Stripe API</p>
</li>
<li><p>Order is created with status "confirmed" and the Stripe payment intent is saved</p>
</li>
<li><p>Confirmation email is sent</p>
</li>
</ol>
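<p>Step 2 can be sketched like this. The session shape follows Stripe's documented Checkout API, but the helper itself is my illustration — the Stripe client is injected so the payment call lives in one place and is easy to stub in tests:</p>
<pre><code class="language-python">def create_checkout_session(stripe_client, cart_items, success_url, cancel_url):
    """Create a Stripe Checkout Session from cart items.

    `stripe_client` is the real `stripe` module in production.
    Stripe expects amounts as integer cents.
    """
    line_items = [
        {
            "price_data": {
                "currency": "usd",
                "product_data": {"name": item["name"]},
                "unit_amount": round(item["price"] * 100),  # dollars to cents
            },
            "quantity": item["qty"],
        }
        for item in cart_items
    ]
    return stripe_client.checkout.Session.create(
        mode="payment",
        line_items=line_items,
        # Stripe substitutes the real session ID into this placeholder
        success_url=success_url + "?session_id={CHECKOUT_SESSION_ID}",
        cancel_url=cancel_url,
    )
</code></pre>
<p>The <code>{CHECKOUT_SESSION_ID}</code> placeholder in the success URL is what lets step 5 verify the payment after the redirect.</p>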
<p><strong>Refunds</strong> are handled through the admin panel. The admin clicks a "Refund" button, which calls the Stripe Refund API with the saved payment intent. The order status updates to "refunded" and the customer receives an email notification. Real money flows back through Stripe — verified in the Stripe Dashboard.</p>
<img src="screenshots/stripe-checkout.png" alt="Stripe Checkout" style="display:block;margin:0 auto" />

<p><em>Stripe Checkout page — real payment processing in test mode</em></p>
<img src="screenshots/order-confirmation.png" alt="Order Confirmation" style="display:block;margin:0 auto" />

<p><em>Order confirmation page with Amazon-style order number</em></p>
<hr />
<h2>What I'd Do Differently</h2>
<p><strong>1. Start with PostgreSQL from day one.</strong> I started with SQLite for local development and migrated to PostgreSQL for production. While SQLAlchemy abstracts most differences, there were small quirks (like <code>ALTER TABLE</code> behavior) that caused unnecessary debugging. Starting with PostgreSQL via Docker Compose would have been smoother.</p>
<p><strong>2. Add Alembic migrations early.</strong> I used <code>Base.metadata.create_all()</code> for table creation and manual <code>ALTER TABLE</code> for schema changes. This works for a solo project but doesn't scale. Alembic would have given me versioned, repeatable migrations from the start.</p>
<p><strong>3. Use Stripe webhooks instead of redirect-based verification.</strong> The current flow relies on the success redirect to verify payment. If the user's browser crashes after payment but before the redirect, the order isn't created even though the card was charged. Webhooks solve this by having Stripe notify the server directly — independent of the browser.</p>
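<p>For context, Stripe's documented webhook signing scheme — HMAC-SHA256 over <code>"timestamp.payload"</code>, sent as <code>t=…,v1=…</code> in the <code>Stripe-Signature</code> header — can be verified with the standard library. This is a simplified sketch; in production you'd use <code>stripe.Webhook.construct_event</code>, which also handles multiple signatures:</p>
<pre><code class="language-python">import hashlib
import hmac
import time

def verify_stripe_signature(payload: bytes, sig_header: str, secret: str,
                            tolerance: int = 300) -&gt; bool:
    """Check a Stripe-style webhook signature against the endpoint secret."""
    parts = dict(p.split("=", 1) for p in sig_header.split(","))
    timestamp, signature = parts["t"], parts["v1"]
    # Reject stale timestamps to prevent replay attacks
    if abs(time.time() - int(timestamp)) &gt; tolerance:
        return False
    signed_payload = f"{timestamp}.".encode() + payload
    expected = hmac.new(secret.encode(), signed_payload, hashlib.sha256).hexdigest()
    # Constant-time comparison avoids timing side channels
    return hmac.compare_digest(expected, signature)
</code></pre>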
<hr />
<h2>For DevOps Engineers</h2>
<p>If you're a DevOps engineer looking to level up your application development skills, this project covers:</p>
<ul>
<li><p>How authentication actually works in web apps (JWT, cookies, middleware)</p>
</li>
<li><p>Why database schema design matters (foreign keys, indexes, relationships)</p>
</li>
<li><p>How payment gateways integrate (Stripe's redirect-based flow)</p>
</li>
<li><p>What "full-stack observability" means in practice (traces + metrics + analytics)</p>
</li>
<li><p>How to deploy a real application, not just infrastructure</p>
</li>
<li><p>When microservices make sense vs when a monolith is the right choice</p>
</li>
</ul>
<p>The biggest insight: <strong>once you've built and deployed your own application, you understand developers' problems from the inside.</strong> That makes you a dramatically better DevOps engineer.</p>
<hr />
<h2>For Business Stakeholders</h2>
<p>If you're evaluating this as a technology approach:</p>
<ul>
<li><p>The modular monolith architecture keeps development velocity high while maintaining code quality</p>
</li>
<li><p>GitOps deployment means every change is auditable, reversible, and automated</p>
</li>
<li><p>Full observability means issues are detected and debugged in minutes, not hours</p>
</li>
<li><p>Scaling is straightforward — increase replica count for the app, upgrade node sizes for the database</p>
</li>
<li><p>The same codebase serves both the HTML storefront and a REST API (future mobile app ready)</p>
</li>
</ul>
<hr />
<h2>Try It Yourself</h2>
<ul>
<li><p>Browse products, search, filter by category</p>
</li>
<li><p>Create an account and place a test order (use card <code>4242 4242 4242 4242</code>)</p>
</li>
<li><p>Check your order history</p>
</li>
<li><p>Leave a product review</p>
</li>
</ul>
<p>Source code: <a href="https://github.com/rajasekhar-cloud25/ecommerce-api"><strong>github.com/rajasekhar-cloud25/ecommerce-api</strong></a></p>
<hr />
<p><em>Rajasekhar Reddy is a Senior DevOps Engineer with 7+ years of experience across AWS, Azure, GCP, and hybrid cloud environments. He holds CKA and Azure Administrator certifications. Connect on</em> <a href="https://rajasekharcloud.com"><em>rajasekharcloud.com</em></a><em>.</em></p>
]]></content:encoded></item><item><title><![CDATA[🤖 How to Build an AI Agent Using MCP and Connect It to Salesforce (Step-by-Step Guide)]]></title><description><![CDATA[AI agents are changing how developers build applications. Instead of hardcoding every step, we can build systems where AI decides what to do, when to do it, and which tools to use.
In this post I’ll w]]></description><link>https://blog.rajasekharcloud.com/how-to-build-an-ai-agent-using-mcp-and-connect-it-to-salesforce-step-by-step-guide</link><guid isPermaLink="true">https://blog.rajasekharcloud.com/how-to-build-an-ai-agent-using-mcp-and-connect-it-to-salesforce-step-by-step-guide</guid><category><![CDATA[ai-agent]]></category><category><![CDATA[Salesforce AI]]></category><dc:creator><![CDATA[reddyj4]]></dc:creator><pubDate>Thu, 02 Apr 2026 11:53:29 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/69c77c787cf2706510c9caad/9e51d970-d9f7-4331-af8e-d004982b24b4.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>AI agents are changing how developers build applications. Instead of hardcoding every step, we can build systems where AI <strong>decides what to do, when to do it, and which tools to use</strong>.</p>
<p>In this post I’ll walk you through building a simple AI agent using MCP (Model Context Protocol) and connecting it to Salesforce to fetch real data. The agent will:</p>
<ul>
<li><p>Discover and call tools (Salesforce queries, notifications, updates)</p>
</li>
<li><p>Execute tool functions when needed</p>
</li>
<li><p>Return structured, actionable results</p>
</li>
</ul>
<hr />
<h2>🧠 What is MCP?</h2>
<p>MCP (Model Context Protocol) is a design approach where the model is given:</p>
<ul>
<li><p>A list of available tools (name, description, input schema)</p>
</li>
<li><p>Context (conversation + environment data)</p>
</li>
<li><p>A protocol for asking to call tools and receiving the results</p>
</li>
</ul>
<p>The flow becomes:</p>
<ol>
<li><p>AI understands the user goal</p>
</li>
<li><p>AI chooses a tool and (optionally) constructs structured arguments</p>
</li>
<li><p>System executes the tool</p>
</li>
<li><p>Tool output is fed back into the model for final response or next step</p>
</li>
</ol>
<p>This helps you keep the agent small, auditable, and safe.</p>
<hr />
<h2>⚙️ Prerequisites</h2>
<ul>
<li><p>Node.js (v18+)</p>
</li>
<li><p>Basic JavaScript / Node knowledge</p>
</li>
<li><p>OpenAI API key</p>
</li>
<li><p>Salesforce Developer org (or any org with API access)</p>
</li>
<li><p>dotenv for env variables</p>
</li>
</ul>
<hr />
<h2>🏗️ Project Setup</h2>
<pre><code class="language-bash">mkdir mcp-agent
cd mcp-agent
npm init -y
npm install express openai axios dotenv
</code></pre>
<p>Create a <code>.env</code>:</p>
<pre><code class="language-env">OPENAI_API_KEY=your_openai_key
SF_CLIENT_ID=your_client_id
SF_CLIENT_SECRET=your_client_secret
SF_REFRESH_TOKEN=your_refresh_token
SF_INSTANCE_URL=https://your-instance.salesforce.com
PORT=3000
</code></pre>
<p>Notes:</p>
<ul>
<li><p>Use OAuth with a refresh token (offline access) so your service can refresh access tokens without interactive login.</p>
</li>
<li><p>Store secrets securely (vault/secret manager) in production, not plain <code>.env</code>.</p>
</li>
</ul>
<hr />
<h2>🔗 Connect to Salesforce (recommended approach)</h2>
<ol>
<li><p>In Salesforce: Setup → App Manager → New Connected App</p>
<ul>
<li><p>Enable OAuth</p>
</li>
<li><p>Callback URL: <a href="http://localhost:3000/callback">http://localhost:3000/callback</a> (for dev)</p>
</li>
<li><p>Scopes: api, refresh_token, offline_access (and others only if needed)</p>
</li>
</ul>
</li>
<li><p>Use the refresh token to obtain short-lived access tokens. Example token refresh helper:</p>
</li>
</ol>
<pre><code class="language-javascript">// sfAuth.js
import axios from "axios";

export async function getAccessToken() {
  const params = new URLSearchParams();
  params.append("grant_type", "refresh_token");
  params.append("client_id", process.env.SF_CLIENT_ID);
  params.append("client_secret", process.env.SF_CLIENT_SECRET);
  params.append("refresh_token", process.env.SF_REFRESH_TOKEN);

  const res = await axios.post(
    `${process.env.SF_INSTANCE_URL}/services/oauth2/token`,
    params
  );
  return res.data.access_token;
}
</code></pre>
<p>(Adjust URL to token endpoint if using a different Salesforce instance domain.)</p>
<hr />
<h2>🔧 Creating Salesforce Tools (MCP)</h2>
<p>Tools are plain functions your agent can call. Keep them small, idiomatic, and idempotent where possible.</p>
<p>Example: fetch Accounts.</p>
<pre><code class="language-javascript">// tools.js
import axios from "axios";
import { getAccessToken } from "./sfAuth.js";

export async function getAccounts(limit = 10) {
  const accessToken = await getAccessToken();

  const soql = `SELECT Id, Name, Type, Industry, LastModifiedDate FROM Account ORDER BY LastModifiedDate DESC LIMIT ${Number(limit)}`;
  const encoded = encodeURIComponent(soql);
  const url = `${process.env.SF_INSTANCE_URL}/services/data/v59.0/query/?q=${encoded}`;

  const res = await axios.get(url, {
    headers: {
      Authorization: `Bearer ${accessToken}`,
      Accept: "application/json",
    },
  });

  // Return minimal fields and count
  return {
    records: res.data.records,
    totalSize: res.data.totalSize,
  };
}

export async function getAccountById(id) {
  const accessToken = await getAccessToken();
  const url = `${process.env.SF_INSTANCE_URL}/services/data/v59.0/sobjects/Account/${id}`;
  const res = await axios.get(url, {
    headers: { Authorization: `Bearer ${accessToken}` },
  });
  return res.data;
}
</code></pre>
<p>Add other tools similarly: query opportunities, create tasks, update fields, post Chatter messages, etc. Each tool should return structured JSON.</p>
<hr />
<h2>🧩 Registering Tools (tool metadata for the model)</h2>
<p>Expose metadata the model can use to decide which tool to call. When you use function-calling (or a simple MCP pattern), the metadata helps the model produce structured calls.</p>
<pre><code class="language-javascript">// mcp.js
import { getAccounts, getAccountById } from "./tools.js";

export const tools = [
  {
    name: "getAccounts",
    description: "Fetch recent Salesforce accounts. Args: { limit: number }",
    function: getAccounts,
    // If using function-calling features, include a JSON Schema for args:
    parameters: {
      type: "object",
      properties: {
        limit: { type: "integer", description: "Max number of accounts to fetch" },
      },
      required: [],
    },
  },
  {
    name: "getAccountById",
    description: "Fetch a single Account by Salesforce Id. Args: { id: string }",
    function: getAccountById,
    parameters: {
      type: "object",
      properties: {
        id: { type: "string" },
      },
      required: ["id"],
    },
  },
];
</code></pre>
<hr />
<h2>🤖 Building the Agent Orchestrator</h2>
<p>Pattern used here (MCP loop):</p>
<ol>
<li><p>Send user input + tool metadata to the model.</p>
</li>
<li><p>If the model returns a function/tool call, run that function locally.</p>
</li>
<li><p>Return the tool output to the model as a new message and ask for the final answer.</p>
</li>
<li><p>Repeat if the model requests additional tools.</p>
</li>
</ol>
<p>Example agent using the OpenAI function-calling pattern (pseudo-real code for the official Node SDK):</p>
<pre><code class="language-javascript">// agent.js
import OpenAI from "openai";
import { tools } from "./mcp.js";

const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// small helper to map tool metadata for the model's function parameter
function buildFunctionDefs(tools) {
  return tools.map(t =&gt; ({
    name: t.name,
    description: t.description,
    parameters: t.parameters || { type: "object" },
  }));
}

export async function runAgent(userInput) {
  // 1) Ask the model what to do
  const initial = await client.chat.completions.create({
    model: "gpt-4o", // pick a model in your account that supports function-calling
    messages: [
      {
        role: "system",
        content:
          "You are an assistant that can call tools. When you want to call a tool, respond with a function call using JSON arguments matching the declared schema.",
      },
      { role: "user", content: userInput },
    ],
    functions: buildFunctionDefs(tools),
    function_call: "auto",
  });

  const message = initial.choices[0].message;

  // 2) If the model wants to call a function, execute it
  if (message.function_call) {
    const { name, arguments: argsStr } = message.function_call;
    let args = {};
    try {
      args = argsStr ? JSON.parse(argsStr) : {};
    } catch (err) {
      // Bad JSON from model — tell it to reformat
      return {
        error: "Model returned invalid JSON for function call arguments",
        detail: err.message,
      };
    }

    // find the tool function and run it
    const tool = tools.find(t =&gt; t.name === name);
    if (!tool) {
      return { error: `Unknown tool: ${name}` };
    }

    let toolOutput;
    try {
      // Note: spreading Object.values relies on the argument order matching the
      // schema's property order — fine for these single-argument tools, but pass
      // a single args object for multi-argument tools.
      toolOutput = await tool.function(...Object.values(args));
    } catch (err) {
      toolOutput = { error: err.message };
    }

    // 3) Send the tool output back to the model and ask for finalization
    const followUp = await client.chat.completions.create({
      model: "gpt-4o",
      messages: [
        { role: "system", content: "You are an assistant that can call tools." },
        { role: "user", content: userInput },
        message, // original model function call
        {
          role: "function",
          name,
          content: JSON.stringify(toolOutput),
        },
        {
          role: "user",
          content: "Based on the tool output, provide a concise summary and next steps.",
        },
      ],
    });

    const final = followUp.choices[0].message.content;
    return { result: final, toolOutput };
  } else {
    // Model didn't call a tool — just return its text
    return { result: message.content };
  }
}
</code></pre>
<p>Notes:</p>
<ul>
<li><p>The above uses the Chat Completions function-calling flow. If you're using the newer Responses API, adapt accordingly to send tool metadata and handle tool calls similarly.</p>
</li>
<li><p>Validate model-returned JSON and guard against unexpected inputs.</p>
</li>
</ul>
<hr />
<h2>🖥️ Example Express Server</h2>
<pre><code class="language-javascript">// server.js
import express from "express";
import dotenv from "dotenv";
import { runAgent } from "./agent.js";

dotenv.config();
const app = express();
app.use(express.json());

app.post("/agent", async (req, res) =&gt; {
  try {
    const { input } = req.body;
    const out = await runAgent(input);
    res.json(out);
  } catch (err) {
    console.error(err);
    res.status(500).json({ error: err.message });
  }
});

const port = process.env.PORT || 3000;
app.listen(port, () =&gt; console.log(`Agent server running on ${port}`));
</code></pre>
<hr />
<h2>✅ Practical Patterns &amp; Tips</h2>
<ul>
<li><p>Start with a small set of tools (read-only queries first), then expand to mutate data (create/update) with caution.</p>
</li>
<li><p>Limit scopes in your connected app. Grant only what you need.</p>
</li>
<li><p>Log function calls with correlation IDs for auditing.</p>
</li>
<li><p>Sanitize and validate any model-provided arguments before executing tools.</p>
</li>
<li><p>Add rate limiting and retries when calling external APIs (Salesforce/OpenAI).</p>
</li>
<li><p>Return structured results (JSON) from tools so the model can reason about data reliably.</p>
</li>
<li><p>Implement a “dry-run” or “preview” mode where the agent suggests actions but does not execute them unless explicitly approved.</p>
</li>
</ul>
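<p>To make the "sanitize and validate" tip concrete, here is a small validator (my illustration) that checks model-supplied arguments against a tool's JSON-Schema-style <code>parameters</code> block before the tool function ever runs:</p>
<pre><code class="language-javascript">// Type checks for the JSON-Schema "type" values used in the tool metadata.
const typeChecks = {
  string: function (v) { return typeof v === "string"; },
  integer: Number.isInteger,
  number: function (v) { return typeof v === "number"; },
  boolean: function (v) { return typeof v === "boolean"; },
  object: function (v) { return typeof v === "object"; },
};

// Return a list of problems; an empty array means the args are safe to pass on.
function validateArgs(schema, args) {
  const errors = [];
  for (const key of schema.required || []) {
    if (!(key in args)) errors.push(`missing required argument: ${key}`);
  }
  for (const [key, value] of Object.entries(args)) {
    const spec = (schema.properties || {})[key];
    if (!spec) {
      errors.push(`unexpected argument: ${key}`);
    } else {
      const check = typeChecks[spec.type] || function () { return true; };
      if (!check(value)) errors.push(`argument ${key} should be ${spec.type}`);
    }
  }
  return errors;
}
</code></pre>
<p>Call it in the orchestrator right after parsing <code>message.function_call.arguments</code>, and return the error list to the model instead of executing when it is non-empty.</p>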
<hr />
<h2>🛡️ Security &amp; Compliance</h2>
<ul>
<li><p>Never embed long-lived credentials in code. Use refresh-token + client secret flow and replace secrets using a secure secret store.</p>
</li>
<li><p>Rate-limit model and API usage, and monitor costs.</p>
</li>
<li><p>Add RBAC and approvals for destructive operations (e.g., mass-updates).</p>
</li>
<li><p>Keep sensitive data out of prompts when possible; redact or transform before sending to OpenAI if needed.</p>
</li>
</ul>
<hr />
<h2>🧪 Testing &amp; Iteration</h2>
<ul>
<li><p>Start with unit tests for each tool (mock Salesforce responses).</p>
</li>
<li><p>Test the agent with typical and adversarial prompts to see how it selects tools.</p>
</li>
<li><p>Add guardrails: deterministic schemas, explicit allowed-values lists, and user confirmations for risky actions.</p>
</li>
</ul>
<hr />
<h2>📈 Next Steps / Ideas</h2>
<ul>
<li><p>Add human-in-the-loop approvals for any write operations.</p>
</li>
<li><p>Expand tools to query related records, compute metrics, or create tasks.</p>
</li>
<li><p>Build a UI that visualizes the agent’s chosen tool-calls and outputs for auditability.</p>
</li>
<li><p>Record conversations and actions for compliance and debugging.</p>
</li>
</ul>
<hr />
<h2>Final Thoughts</h2>
<p>Using MCP lets you design agents that are flexible yet auditable: the model chooses tools and the system executes them in a controlled environment. Start small, instrument heavily, and gradually add capabilities and safety checks. With a minimal set of tools and a solid orchestration loop, you can automate meaningful Salesforce tasks and free up time for higher-value work.</p>
]]></content:encoded></item><item><title><![CDATA[Building a Production-Grade EKS Platform on AWS with Terraform and GitOps]]></title><description><![CDATA[https://codepen.io/qckuhtdx-the-scripter/pen/myrLwxP

Building a Production-Grade EKS Platform on AWS with Terraform and GitOps
Overview
In this post I walk through how I built a fully automated, prod]]></description><link>https://blog.rajasekharcloud.com/building-a-production-grade-eks-platform-on-aws-with-terraform-and-gitops</link><guid isPermaLink="true">https://blog.rajasekharcloud.com/building-a-production-grade-eks-platform-on-aws-with-terraform-and-gitops</guid><category><![CDATA[EKS]]></category><category><![CDATA[Terraform]]></category><category><![CDATA[GitHub Actions]]></category><category><![CDATA[ArgoCD]]></category><category><![CDATA[AWS]]></category><dc:creator><![CDATA[Rajasekhar Reddy]]></dc:creator><pubDate>Wed, 01 Apr 2026 17:27:16 GMT</pubDate><content:encoded><![CDATA[<p><a class="embed-card" href="https://codepen.io/qckuhtdx-the-scripter/pen/myrLwxP">https://codepen.io/qckuhtdx-the-scripter/pen/myrLwxP</a></p>

<h1>Building a Production-Grade EKS Platform on AWS with Terraform and GitOps</h1>
<h2>Overview</h2>
<p>In this post I walk through how I built a fully automated, production-style Kubernetes platform on AWS EKS using Terraform, GitHub Actions OIDC, and ArgoCD — all optimized for cost without sacrificing reliability. Every component is provisioned as code, deployed without storing a single static AWS credential, and observable from day one.</p>
<p>The full source code is available at: <strong><a href="https://github.com/rajasekhar-cloud25/infrastructure">https://github.com/rajasekhar-cloud25/infrastructure</a></strong></p>
<p>Interactive architecture diagram: <a href="https://codepen.io/qckuhtdx-the-scripter/pen/myrLwxP"><strong>View live →</strong></a></p>
<h2>The Problem with Static Credentials</h2>
<p>The traditional approach looks like this:</p>
<pre><code class="language-yaml"># ❌ The wrong way — credentials stored permanently in GitHub
env:
  AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
  AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
</code></pre>
<p>Problems with this approach:</p>
<ul>
<li><p>Credentials are long-lived — if leaked, they work until manually rotated</p>
</li>
<li><p>They exist permanently in GitHub's secret store</p>
</li>
<li><p>Any workflow in the repo can use them</p>
</li>
<li><p>Rotation requires updating secrets in every repo that uses them</p>
</li>
<li><p>No audit trail of which workflow run used which credential</p>
</li>
</ul>
<hr />
<h2>The Solution: OIDC Token Exchange</h2>
<p>GitHub Actions supports OpenID Connect (OIDC). Instead of storing credentials, the workflow requests a short-lived token from GitHub's OIDC provider and exchanges it for an AWS IAM role:</p>
<pre><code class="language-plaintext">GitHub Actions runner
  │
  ├─ requests OIDC token from GitHub
  │   (signed JWT containing: repo, branch, workflow, run ID)
  │
  ├─ calls AWS STS AssumeRoleWithWebIdentity
  │   (presents the OIDC token + role ARN)
  │
  ├─ AWS validates: is this token from GitHub? ✅
  │                 is the repo/branch in the trust policy? ✅
  │
  └─ AWS returns: temporary access key + secret + session token
      (expires in 1 hour, scoped to this specific IAM role)
</code></pre>
<p>The credentials exist only for the duration of the job. When the job ends, the credentials expire. Nothing is stored. Nothing can leak.</p>
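<p>The whole trust decision comes down to claim matching: the <code>StringLike</code> condition in the role's trust policy is a glob match against the token's <code>sub</code> claim. A toy bash sketch of just that check (illustrative only, not how STS is implemented):</p>
<pre><code class="language-bash">#!/bin/bash
# Illustrative: StringLike behaves like a shell glob on the "sub" claim.
sub_claim_allowed() {
  local token_sub="$1" allowed_pattern="$2"
  # An unquoted variable in a case pattern is glob-matched, like StringLike
  case "$token_sub" in
    $allowed_pattern) return 0 ;;
    *)                return 1 ;;
  esac
}

if sub_claim_allowed "repo:my-org/infrastructure:ref:refs/heads/main" \
                     "repo:my-org/infrastructure:*"; then
  echo "allowed"
fi
</code></pre>
<p>This is why a trust policy can be scoped to one repo, one branch, or one environment just by tightening the pattern.</p>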
<hr />
<h2>Project Structure</h2>
<p>The infrastructure is organized as a set of Terraform modules, each with a single responsibility:</p>
<pre><code class="language-plaintext">infrastructure/
  .github/              ← GitHub Actions workflows
  Main/                 ← Root module, wires everything together
  vpc/                  ← VPC, subnets, IGW, NAT GW, route tables, SGs
  iam/                  ← IAM roles, IRSA roles, GitHub OIDC trust
  eks/                  ← EKS cluster, node group, access entries
  ecr/                  ← ECR repositories (separate workspace)
  eip/                  ← Elastic IPs for NLB (separate workspace)
  k8s_namespaces/       ← All K8s namespaces pre-created
  kubernetes-ingress/   ← NGINX Ingress Controller + NLB + Route53
  argocd_deployment/    ← ArgoCD via local Helm chart
  s3/                   ← Terraform state bucket bootstrap
  charts/               ← Local Helm charts (ArgoCD, NGINX)
  bootstrap.sh          ← Creates S3 bucket + DynamoDB table
</code></pre>
<p>The two most important design decisions here are that <strong>ECR and EIPs live in separate workspaces</strong>. This means container images and static IP addresses survive a full cluster destroy and recreate — no DNS updates, no image rebuilds.</p>
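<p>Because the workspaces are separate, the main workspace cannot reference those resources directly. One option, sketched here with illustrative bucket, key, and output names (the repo actually passes values through tfvars), is <code>terraform_remote_state</code>:</p>
<pre><code class="language-hcl"># Hypothetical wiring: read the EIP workspace's outputs instead of
# hardcoding allocation IDs (all names here are illustrative).
data "terraform_remote_state" "eip" {
  backend = "s3"
  config = {
    bucket = "ecommerce-demo-terraform-state-123456789012"
    key    = "eip/terraform.tfstate"
    region = "us-east-1"
  }
}

locals {
  nlb_eip_allocation_ids = data.terraform_remote_state.eip.outputs.nlb_eip_allocation_ids
}
</code></pre>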
<hr />
<h2>Step 1 — Bootstrapping State</h2>
<p>Before any Terraform can run, the S3 state backend needs to exist. The <code>bootstrap.sh</code> script handles this:</p>
<pre><code class="language-bash">#!/bin/bash
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
BUCKET_NAME="ecommerce-demo-terraform-state-${ACCOUNT_ID}"

aws s3api create-bucket --bucket $BUCKET_NAME --region us-east-1
aws s3api put-bucket-versioning \
  --bucket $BUCKET_NAME \
  --versioning-configuration Status=Enabled

aws dynamodb create-table \
  --table-name terraform-state-lock \
  --attribute-definitions AttributeName=LockID,AttributeType=S \
  --key-schema AttributeName=LockID,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST
</code></pre>
<p>The bucket name is derived from the AWS account ID at runtime — no secrets needed anywhere.</p>
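<p>Once the bucket and lock table exist, every workspace points at them with a standard S3 backend block (the <code>key</code> below is illustrative; each workspace gets its own):</p>
<pre><code class="language-hcl">terraform {
  backend "s3" {
    bucket         = "ecommerce-demo-terraform-state-123456789012"  # account-suffixed bucket from bootstrap.sh
    key            = "main/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-state-lock"
    encrypt        = true
  }
}
</code></pre>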
<hr />
<h2>Step 2 — GitHub Actions OIDC (Zero Static Credentials)</h2>
<p>All three workflows authenticate to AWS using OIDC — the GitHub Actions token is exchanged for a short-lived IAM role. No <code>AWS_ACCESS_KEY_ID</code> is ever stored in GitHub secrets.</p>
<pre><code class="language-yaml"># .github/workflows/tf-apply.yaml
jobs:
  plan:
    permissions:
      id-token: write
      contents: read
    steps:
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
          aws-region: ${{ vars.AWS_REGION }}
</code></pre>
<p>Three workflows are defined:</p>
<table>
<thead>
<tr>
<th>Workflow</th>
<th>Trigger</th>
<th>Purpose</th>
</tr>
</thead>
<tbody><tr>
<td><code>tf-plan</code></td>
<td>On every PR</td>
<td>Runs <code>terraform plan</code>, posts diff as comment</td>
</tr>
<tr>
<td><code>tf-apply</code></td>
<td>Merge to main</td>
<td>Requires manual approval, then applies</td>
</tr>
<tr>
<td><code>tf-destroy</code></td>
<td>Manual only</td>
<td>Requires typing "destroy" to confirm</td>
</tr>
</tbody></table>
<hr />
<h2>Step 3 — VPC Module</h2>
<p>The VPC module creates everything the cluster needs to run in a private, secure network:</p>
<pre><code class="language-hcl"># modules/vpc/main.tf (key resources)
resource "aws_vpc" "main" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_hostnames = true
  enable_dns_support   = true
}

# Public subnets — NLB, NAT Gateway, Internet Gateway
resource "aws_subnet" "public" {
  count                   = 2
  vpc_id                  = aws_vpc.main.id
  cidr_block              = cidrsubnet("10.0.0.0/16", 8, count.index)
  availability_zone       = data.aws_availability_zones.available.names[count.index]
  map_public_ip_on_launch = true

  tags = {
    "kubernetes.io/role/elb" = "1"
  }
}

# Private subnets — EKS nodes only
resource "aws_subnet" "private" {
  count             = 2
  vpc_id            = aws_vpc.main.id
  cidr_block        = cidrsubnet("10.0.0.0/16", 8, count.index + 10)
  availability_zone = data.aws_availability_zones.available.names[count.index]

  tags = {
    "kubernetes.io/role/internal-elb"                    = "1"
    "kubernetes.io/cluster/${var.resource_name}"         = "shared"
  }
}

resource "aws_internet_gateway" "main" {
  vpc_id = aws_vpc.main.id
}

resource "aws_nat_gateway" "main" {
  allocation_id = aws_eip.nat.id
  subnet_id     = aws_subnet.public[0].id  # Single AZ — cost optimized
}
</code></pre>
<p><strong>Why a single NAT Gateway?</strong> One NAT Gateway instead of two saves ~$32/month. For a production portfolio demo this is an acceptable tradeoff — the only risk is losing outbound internet for private nodes if that one AZ goes down.</p>
<p>Route tables are straightforward:</p>
<ul>
<li><p><strong>Public RT:</strong> <code>0.0.0.0/0 → Internet Gateway</code></p>
</li>
<li><p><strong>Private RT:</strong> <code>0.0.0.0/0 → NAT Gateway</code></p>
</li>
</ul>
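<p>In HCL, those two route tables come out to roughly this (a minimal sketch; resource names are illustrative):</p>
<pre><code class="language-hcl">resource "aws_route_table" "public" {
  vpc_id = aws_vpc.main.id
  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.main.id
  }
}

resource "aws_route_table" "private" {
  vpc_id = aws_vpc.main.id
  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = aws_nat_gateway.main.id  # all private egress through one NAT GW
  }
}

resource "aws_route_table_association" "public" {
  count          = 2
  subnet_id      = aws_subnet.public[count.index].id
  route_table_id = aws_route_table.public.id
}
</code></pre>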
<hr />
<h2>Step 4 — IAM Module</h2>
<p>The IAM module handles all permissions with a single consolidated design. The key insight is that the EKS OIDC provider is created inside the IAM module — this avoids a circular dependency where IAM needs EKS and EKS needs IAM.</p>
<pre><code class="language-hcl"># modules/iam/main.tf

# EKS cluster role
resource "aws_iam_role" "eks_cluster" {
  name               = "${var.resource_name}-eks-cluster-role"
  assume_role_policy = data.aws_iam_policy_document.eks_cluster_assume.json
}

# GitHub Actions OIDC role (pre-created provider as data source)
data "aws_iam_openid_connect_provider" "github" {
  url = "https://token.actions.githubusercontent.com"
}

resource "aws_iam_role" "github_actions" {
  name = "${var.resource_name}-github-actions-role"
  assume_role_policy = jsonencode({
    Statement = [{
      Effect    = "Allow"
      Principal = { Federated = data.aws_iam_openid_connect_provider.github.arn }
      Action    = "sts:AssumeRoleWithWebIdentity"
      Condition = {
        StringLike = {
          "token.actions.githubusercontent.com:sub" = "repo:${var.github_repo}:*"
        }
      }
    }]
  })
}

# IRSA roles — one per component, least privilege
resource "aws_iam_role" "external_secrets" {
  name = "${var.resource_name}-external-secrets-role"
  # Trust policy scoped to the ESO service account only
  assume_role_policy = data.aws_iam_policy_document.irsa_external_secrets.json
}
</code></pre>
<p>Every pod that needs AWS access gets its own IRSA role — no shared node-level credentials:</p>
<table>
<thead>
<tr>
<th>Component</th>
<th>IRSA Permissions</th>
</tr>
</thead>
<tbody><tr>
<td>EBS CSI Driver</td>
<td><code>ec2:CreateVolume</code>, <code>ec2:AttachVolume</code></td>
</tr>
<tr>
<td>Cluster Autoscaler</td>
<td><code>autoscaling:SetDesiredCapacity</code></td>
</tr>
<tr>
<td>External Secrets</td>
<td><code>secretsmanager:GetSecretValue</code></td>
</tr>
<tr>
<td>cert-manager</td>
<td><code>route53:ChangeResourceRecordSets</code></td>
</tr>
</tbody></table>
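<p>The trust policy behind each IRSA role pins the role to one Kubernetes service account. A sketch for the External Secrets role, assuming the cluster's OIDC issuer URL is exposed as a variable and the service account is <code>external-secrets/external-secrets</code>:</p>
<pre><code class="language-hcl">data "aws_iam_policy_document" "irsa_external_secrets" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [aws_iam_openid_connect_provider.eks.arn]
    }

    # Only this exact service account can assume the role
    condition {
      test     = "StringEquals"
      variable = "${replace(var.eks_oidc_issuer_url, "https://", "")}:sub"
      values   = ["system:serviceaccount:external-secrets:external-secrets"]
    }
  }
}
</code></pre>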
<hr />
<h2>Step 5 — EKS Module</h2>
<p>The EKS cluster runs on SPOT t3.small instances for cost optimization, and uses <code>API_AND_CONFIG_MAP</code> authentication mode with access entries instead of the legacy <code>aws-auth</code> ConfigMap:</p>
<pre><code class="language-hcl"># modules/eks/main.tf
resource "aws_eks_cluster" "main" {
  name     = var.resource_name
  version  = var.cluster_version
  role_arn = var.cluster_role_arn

  vpc_config {
    subnet_ids              = var.private_subnet_ids
    endpoint_private_access = true
    endpoint_public_access  = true
  }

  access_config {
    authentication_mode                         = "API_AND_CONFIG_MAP"
    bootstrap_cluster_creator_admin_permissions = true
  }
}

resource "aws_eks_node_group" "main" {
  cluster_name    = aws_eks_cluster.main.name
  node_role_arn   = var.node_role_arn
  subnet_ids      = var.private_subnet_ids
  instance_types  = ["t3.small"]
  capacity_type   = "SPOT"        # ~60% cheaper than on-demand

  scaling_config {
    desired_size = 5
    min_size     = 2
    max_size     = 5
  }
}

# Access entries — no aws-auth ConfigMap editing required
resource "aws_eks_access_entry" "github_actions" {
  cluster_name  = aws_eks_cluster.main.name
  principal_arn = var.github_actions_role_arn
  type          = "STANDARD"
}
</code></pre>
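<p>An access entry on its own grants no Kubernetes permissions; it is paired with an access policy association. A sketch using an AWS-managed policy (the cluster-admin choice here is an assumption; scope it down for least privilege):</p>
<pre><code class="language-hcl">resource "aws_eks_access_policy_association" "github_actions" {
  cluster_name  = aws_eks_cluster.main.name
  principal_arn = var.github_actions_role_arn
  policy_arn    = "arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy"

  access_scope {
    type = "cluster"
  }
}
</code></pre>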
<p>The Kubernetes and Helm providers authenticate using <code>exec</code> with <code>aws eks get-token</code> — this avoids plan-time failures when the cluster doesn't exist yet:</p>
<pre><code class="language-hcl">provider "kubernetes" {
  host                   = module.eks.cluster_endpoint
  cluster_ca_certificate = base64decode(module.eks.cluster_ca_certificate)
  exec {
    api_version = "client.authentication.k8s.io/v1beta1"
    command     = "aws"
    args        = ["eks", "get-token", "--cluster-name", module.eks.cluster_name, "--region", var.aws_region]
  }
}
</code></pre>
<hr />
<h2>Step 6 — ECR Module (Separate Workspace)</h2>
<p>ECR repositories are managed in their own Terraform workspace so images are never accidentally deleted when the main cluster is torn down:</p>
<pre><code class="language-hcl"># ecr/main.tf
resource "aws_ecr_repository" "app" {
  for_each             = toset(var.repository_names)
  name                 = each.value
  image_tag_mutability = "MUTABLE"

  image_scanning_configuration {
    scan_on_push = true
  }
}
</code></pre>
<p>GitHub Actions builds and pushes on every merge:</p>
<pre><code class="language-yaml">- name: Build and push
  run: |
    aws ecr get-login-password | docker login --username AWS \
      --password-stdin $ECR_REGISTRY
    docker buildx build --platform linux/amd64 \
      --push -t $ECR_REGISTRY/$ECR_REPO:$IMAGE_TAG .
</code></pre>
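<p>Since the repositories outlive the cluster, it is worth capping their growth. A lifecycle policy sketch that expires untagged images (the 14-day window is an arbitrary choice):</p>
<pre><code class="language-hcl">resource "aws_ecr_lifecycle_policy" "app" {
  for_each   = aws_ecr_repository.app
  repository = each.value.name

  policy = jsonencode({
    rules = [{
      rulePriority = 1
      description  = "Expire untagged images after 14 days"
      selection = {
        tagStatus   = "untagged"
        countType   = "sinceImagePushed"
        countUnit   = "days"
        countNumber = 14
      }
      action = { type = "expire" }
    }]
  })
}
</code></pre>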
<hr />
<h2>Step 7 — EIP Module (Separate Workspace)</h2>
<p>Static Elastic IPs are created in their own workspace — separate from the main cluster. This means the NLB always gets the same IP addresses, Route53 A records never need updating, and the cluster can be completely rebuilt without changing DNS:</p>
<pre><code class="language-hcl"># eip/main.tf
resource "aws_eip" "nlb" {
  count  = 2
  domain = "vpc"

  tags = {
    Name = "${var.resource_name}-nlb-eip-${count.index}"
  }
}
</code></pre>
<p>The allocation IDs are then passed as a variable to the main workspace:</p>
<pre><code class="language-hcl"># environments/eks-demo-dev.tfvars
nlb_eip_allocation_ids = [
  "eipalloc-09595a182e792f01f",
  "eipalloc-032c83197c359b3fe"
]
</code></pre>
<hr />
<h2>Step 8 — Kubernetes Namespaces Module</h2>
<p>All namespaces are created before any Helm chart runs. This prevents race conditions where a chart tries to create resources in a namespace that doesn't exist yet:</p>
<pre><code class="language-hcl"># modules/k8s_namespaces/main.tf
resource "kubernetes_namespace" "namespaces" {
  for_each = toset([
    "argocd",
    "nginx-ingress",
    "monitoring",
    "external-secrets",
    "cert-manager",
    "eks-demo",
    "shared-os",
    "kubecost"
  ])

  metadata {
    name = each.value
  }
}
</code></pre>
<p>This module runs before <code>kubernetes-ingress</code> and <code>argocd_deployment</code> in the dependency graph.</p>
<hr />
<h2>Step 9 — NGINX Ingress Controller + NLB + Route53</h2>
<p>The NGINX Ingress Controller is the traffic gateway for the entire cluster. It is deployed via a local Helm chart with NLB annotations that attach the static EIPs:</p>
<pre><code class="language-hcl"># modules/kubernetes-ingress/main.tf
resource "helm_release" "nginx_ingress" {
  name      = "nginx-ingress"
  chart     = "${path.module}/../charts/kubernetes-ingress"
  namespace = "nginx-ingress"
  timeout   = 600
  wait      = true
  atomic    = true

  set {
    name  = "controller.service.annotations.service\\.beta\\.kubernetes\\.io/aws-load-balancer-type"
    value = "nlb"
  }
  set {
    name  = "controller.service.annotations.service\\.beta\\.kubernetes\\.io/aws-load-balancer-eip-allocations"
    value = join("\\,", var.nlb_eip_allocation_ids)
  }
  set {
    name  = "controller.service.annotations.service\\.beta\\.kubernetes\\.io/aws-load-balancer-subnets"
    value = join("\\,", var.public_subnet_ids)
  }
  set {
    name  = "controller.service.annotations.service\\.beta\\.kubernetes\\.io/aws-load-balancer-ssl-cert"
    value = var.acm_certificate_arn
  }
  set {
    name  = "controller.service.annotations.service\\.beta\\.kubernetes\\.io/aws-load-balancer-ssl-ports"
    value = "443"
  }
}
</code></pre>
<p><strong>Why NGINX over AWS ALB Controller?</strong></p>
<ul>
<li><p>NGINX is free — ALB Controller creates a new ALB per ingress (~$16/mo each)</p>
</li>
<li><p>No per-app ACM ARN annotation required — one cert at the NLB level covers everything</p>
</li>
<li><p>Portable — works identically on any cloud or on-premises</p>
</li>
</ul>
<p>Route53 A records point directly to the static EIP addresses:</p>
<pre><code class="language-hcl">data "aws_eip" "nlb" {
  count = length(var.nlb_eip_allocation_ids)
  id    = var.nlb_eip_allocation_ids[count.index]
}

resource "aws_route53_record" "dns_records" {
  for_each = toset(var.dns_names)
  zone_id  = var.route53_zone_id
  name     = "${each.value}.${var.domain_name}"
  type     = "A"
  ttl      = 300
  records  = data.aws_eip.nlb[*].public_ip
}
</code></pre>
<p>TLS flow:</p>
<pre><code class="language-plaintext">User → HTTPS
  → NLB (ACM wildcard *.reddycloud.com terminates TLS)
  → HTTP → NGINX Ingress
  → HTTP → App pod
</code></pre>
<p>No cert-manager needed. AWS handles certificate renewal automatically.</p>
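<p>The wildcard certificate itself is an input to the ingress module (<code>var.acm_certificate_arn</code>). One way to provision it in Terraform with DNS validation, sketched with illustrative resource names:</p>
<pre><code class="language-hcl">resource "aws_acm_certificate" "wildcard" {
  domain_name       = "*.reddycloud.com"
  validation_method = "DNS"
}

# Create the DNS validation records ACM asks for
resource "aws_route53_record" "cert_validation" {
  for_each = {
    for dvo in aws_acm_certificate.wildcard.domain_validation_options :
    dvo.domain_name =&gt; dvo
  }
  zone_id = var.route53_zone_id
  name    = each.value.resource_record_name
  type    = each.value.resource_record_type
  ttl     = 300
  records = [each.value.resource_record_value]
}

resource "aws_acm_certificate_validation" "wildcard" {
  certificate_arn         = aws_acm_certificate.wildcard.arn
  validation_record_fqdns = [for r in aws_route53_record.cert_validation : r.fqdn]
}
</code></pre>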
<hr />
<h2>Step 10 — ArgoCD Deployment</h2>
<p>ArgoCD is deployed via a local Helm chart with CRDs managed separately using the <code>alekc/kubectl</code> provider to avoid Helm CRD conflicts:</p>
<pre><code class="language-hcl"># modules/argocd_deployment/main.tf

# CRDs managed outside Helm to avoid upgrade conflicts
data "http" "argocd_crds" {
  for_each = toset(local.crd_files)
  url      = each.value
}

resource "kubectl_manifest" "argocd_crds" {
  for_each          = toset(local.crd_files)
  yaml_body         = data.http.argocd_crds[each.value].response_body
  server_side_apply = true
  force_conflicts   = true
  wait              = true
}

resource "helm_release" "argocd" {
  name       = "argocd-chart"
  chart      = "${path.module}/../charts/argocd"
  version    = "9.4.17"
  namespace  = "argocd"
  skip_crds  = true    # CRDs managed by kubectl_manifest above
  replace    = true
  wait       = true
  timeout    = 600

  values = [file("${path.module}/../charts/argocd/clusterValues/values.EksDemo.yaml")]

  depends_on = [kubectl_manifest.argocd_crds]
}
</code></pre>
<p>ArgoCD values for NGINX ingress integration:</p>
<pre><code class="language-yaml"># charts/argocd/clusterValues/values.EksDemo.yaml
configs:
  params:
    server.insecure: "true"  # TLS terminated at NLB

server:
  extraArgs:
    - --insecure

  ingress:
    enabled: true
    ingressClassName: "nginx"
    annotations:
      nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
      nginx.ingress.kubernetes.io/backend-protocol: "HTTP"
    hostname: argocd.reddycloud.com
    paths: /
    pathType: Prefix
    https: false
</code></pre>
<p>ArgoCD runs in <code>--insecure</code> mode because TLS is already terminated at the NLB. The user always sees HTTPS — ArgoCD just receives plain HTTP from NGINX.</p>
<hr />
<h2>Secrets Management — External Secrets Operator</h2>
<p>No secrets are hardcoded anywhere. The External Secrets Operator pulls from AWS Secrets Manager using IRSA:</p>
<pre><code class="language-yaml"># K8s manifest deployed via ArgoCD
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: postgres-creds
  namespace: eks-demo
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-store
    kind: ClusterSecretStore
  target:
    name: postgres-creds
    creationPolicy: Owner
  data:
    - secretKey: POSTGRES_PASSWORD
      remoteRef:
        key: ecommerce-k8s-demo/postgres
        property: password
    - secretKey: POSTGRES_USER
      remoteRef:
        key: ecommerce-k8s-demo/postgres
        property: username
</code></pre>
<p>The flow:</p>
<pre><code class="language-plaintext">AWS Secrets Manager
  → ExternalSecret CRD (IRSA authenticated)
  → Kubernetes Secret (auto-created, kept in sync)
  → App pod (env var or volume mount)
</code></pre>
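<p>The <code>ClusterSecretStore</code> referenced by name above looks roughly like this (the service-account name is an assumption; it must match the IRSA-annotated account):</p>
<pre><code class="language-yaml">apiVersion: external-secrets.io/v1beta1
kind: ClusterSecretStore
metadata:
  name: aws-secrets-store
spec:
  provider:
    aws:
      service: SecretsManager
      region: us-east-1
      auth:
        jwt:
          serviceAccountRef:
            name: external-secrets
            namespace: external-secrets
</code></pre>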
<hr />
<h2>Observability Stack</h2>
<p>The full observability stack is deployed via ArgoCD:</p>
<table>
<thead>
<tr>
<th>Signal</th>
<th>Collector</th>
<th>Storage</th>
<th>Query</th>
</tr>
</thead>
<tbody><tr>
<td>Metrics</td>
<td>Prometheus (ServiceMonitor scrape)</td>
<td>TSDB on EBS</td>
<td>Grafana PromQL</td>
</tr>
<tr>
<td>Traces</td>
<td>OTel Collector (OTLP gRPC :4317)</td>
<td>Jaeger</td>
<td>Grafana / Jaeger UI</td>
</tr>
<tr>
<td>Logs</td>
<td>Promtail DaemonSet</td>
<td>Loki</td>
<td>Grafana LogQL</td>
</tr>
<tr>
<td>Search</td>
<td>OpenSearch client (direct)</td>
<td>OpenSearch index</td>
<td>OpenSearch Dashboards</td>
</tr>
</tbody></table>
<p>The OTel Collector pipeline:</p>
<pre><code class="language-yaml">receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch:
  resource:

exporters:
  jaeger:
    endpoint: jaeger-collector:14250
  prometheusremotewrite:
    endpoint: http://prometheus:9090/api/v1/write

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, resource]
      exporters: [jaeger]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheusremotewrite]
</code></pre>
<hr />
<h2>Cost Breakdown</h2>
<table>
<thead>
<tr>
<th>Resource</th>
<th>Cost</th>
<th>Optimization</th>
</tr>
</thead>
<tbody><tr>
<td>EKS cluster</td>
<td>~$7.20/mo</td>
<td>Fixed control plane cost</td>
</tr>
<tr>
<td>SPOT t3.small × 5</td>
<td>~$14/mo</td>
<td>~60% vs on-demand</td>
</tr>
<tr>
<td>NAT Gateway</td>
<td>~$5/mo</td>
<td>Single AZ vs per-AZ</td>
</tr>
<tr>
<td>NLB</td>
<td>~$16/mo</td>
<td>One NLB for everything</td>
</tr>
<tr>
<td>EBS volumes</td>
<td>~$3/mo</td>
<td>gp3 storage class</td>
</tr>
<tr>
<td>Route53</td>
<td>~$0.50/mo</td>
<td>Hosted zone</td>
</tr>
<tr>
<td><strong>Total</strong></td>
<td><strong>~$46/mo</strong></td>
<td>vs ~$200+ on-demand multi-AZ</td>
</tr>
</tbody></table>
<hr />
<h2>Key Takeaways</h2>
<p><strong>Zero static credentials</strong> — GitHub Actions OIDC means no AWS keys ever touch GitHub secrets. IRSA means no AWS keys ever touch EKS nodes.</p>
<p><strong>Destroy-safe architecture</strong> — EIPs and ECR in separate workspaces means the cluster can be completely torn down and rebuilt without updating DNS or rebuilding images.</p>
<p><strong>Single ACM cert covers everything</strong> — One wildcard cert on the NLB eliminates cert-manager, Let's Encrypt rate limits, and per-app TLS configuration.</p>
<p><strong>Cost matters</strong> — SPOT instances, a single NAT Gateway, NGINX instead of a per-ingress ALB, and in-cluster pods instead of managed services. The same production patterns at a fraction of the cost.</p>
<hr />
<p><em>Source code: github.com/rajreddy/ecommerce-k8s-demo · Interactive architecture: codepen.io/qckuhtdx-the-scripter/pen/myrLwxP · Domain: reddycloud.com</em></p>
]]></content:encoded></item></channel></rss>