<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[raja.cloud]]></title><description><![CDATA[raja.cloud]]></description><link>https://blog.rajasekharcloud.com</link><image><url>https://cdn.hashnode.com/res/hashnode/image/upload/v1593680282896/kNC7E8IR4.png</url><title>raja.cloud</title><link>https://blog.rajasekharcloud.com</link></image><generator>RSS for Node</generator><lastBuildDate>Thu, 16 Apr 2026 16:36:10 GMT</lastBuildDate><atom:link href="https://blog.rajasekharcloud.com/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Building a Production Ecommerce Platform on AWS EKS]]></title><description><![CDATA[Building and Deploying a Full-Stack Ecommerce Platform: From Zero to Production on AWS EKS
By Rajasekhar Reddy | Senior DevOps Engineer | April 2026

For years, I wanted to build an ecommerce applicat]]></description><link>https://blog.rajasekharcloud.com/building-a-production-ecommerce-platform-on-aws-eks</link><guid isPermaLink="true">https://blog.rajasekharcloud.com/building-a-production-ecommerce-platform-on-aws-eks</guid><category><![CDATA[FastAPI]]></category><category><![CDATA[EKS]]></category><category><![CDATA[Terraform]]></category><category><![CDATA[Python]]></category><category><![CDATA[ecommerce]]></category><category><![CDATA[eCommerce Website Development]]></category><dc:creator><![CDATA[Rajasekhar Reddy]]></dc:creator><pubDate>Sat, 11 Apr 2026 15:10:17 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/69c772587cf2706510bf589c/437ca7a4-5316-40a4-8afe-703a2695d988.svg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1>Building and Deploying a Full-Stack Ecommerce Platform: From Zero to Production on AWS EKS</h1>
<p><em>By Rajasekhar Reddy | Senior DevOps Engineer | April 2026</em></p>
<hr />
<p>For years, I wanted to build an ecommerce application. I started multiple times and never finished. The scope was always too big, the technology choices paralyzing, the motivation fading after the first few bugs.</p>
<p>Then I decided to stop planning and start shipping. I built a complete ecommerce platform with real payments, deployed it to AWS EKS with full observability, and made it production-ready — all in a few focused sessions.</p>
<p>This is the story of how I built Raj Store, what architectural decisions I made, and why I chose a modular monolith over microservices. Whether you're a developer looking for inspiration, a DevOps engineer curious about full-stack deployment, or a business stakeholder evaluating technology approaches — there's something here for you.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69c772587cf2706510bf589c/71d48cc9-8843-405a-8262-985858f8cd24.png" alt="" style="display:block;margin:0 auto" />

<p><em>Raj Store — 100 products, real images, search with filters, hover animations</em></p>
<hr />
<h2>The Problem I Was Solving</h2>
<p>Every ecommerce tutorial teaches you the basics — a product list, a cart, maybe a checkout page. But none of them show you what it takes to go from "it works on localhost" to "it's running in production with real payments, real observability, and real deployment pipelines."</p>
<p>I wanted to bridge that gap. Not just build an app, but deploy it the way a real engineering team would.</p>
<hr />
<h2>What I Built</h2>
<h3>The Application</h3>
<p>Raj Store is a full-featured ecommerce platform built with Python (FastAPI) on the backend and server-rendered HTML with Jinja2 + TailwindCSS on the frontend. No React, no separate frontend repo — just clean server-side rendering with HTMX for interactivity.</p>
<p><strong>Core features:</strong></p>
<ul>
<li><p>User authentication with JWT tokens stored in secure HTTP-only cookies</p>
</li>
<li><p>Product catalog with 100+ real products seeded from DummyJSON</p>
</li>
<li><p>Full-text search with category filters, price range, and sorting</p>
</li>
<li><p>Shopping cart with quantity management</p>
</li>
<li><p>Real payment processing via Stripe Checkout (test mode)</p>
</li>
<li><p>Complete order lifecycle: Pending → Confirmed → Shipped → Delivered</p>
</li>
<li><p>Amazon-style order numbers (ORD-20260408-A7B2C9)</p>
</li>
<li><p>Admin panel with dashboard, order management, and Stripe refunds</p>
</li>
<li><p>Product reviews and 5-star ratings (purchase-gated — only buyers can review)</p>
</li>
<li><p>Wishlist for saving products</p>
</li>
<li><p>International shipping addresses with 26-country dropdown</p>
</li>
<li><p>Email notifications for registration, order confirmation, and status updates</p>
</li>
</ul>
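<p>One small detail worth making concrete: the Amazon-style order numbers. Here is a sketch (my illustration, not necessarily the project's exact implementation) of generating that format with nothing but the standard library:</p>
<pre><code class="language-python">import secrets
import string
from datetime import datetime, timezone

def generate_order_number(now=None):
    """Return an Amazon-style order number like ORD-20260408-A7B2C9."""
    now = now or datetime.now(timezone.utc)
    date_part = now.strftime("%Y%m%d")
    # secrets (not random) so one order ID is not guessable from another
    alphabet = string.ascii_uppercase + string.digits
    suffix = "".join(secrets.choice(alphabet) for _ in range(6))
    return f"ORD-{date_part}-{suffix}"
</code></pre>
<p>The date prefix makes IDs sortable and human-scannable in the admin panel, while the random suffix keeps them unguessable.</p>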
<img src="https://cdn.hashnode.com/uploads/covers/69c772587cf2706510bf589c/628fcdf5-4a11-44e1-965a-9f2d5f9099a3.png" alt="" style="display:block;margin:0 auto" />

<p><em>Product detail page with image, pricing, stock status, Add to Cart, wishlist, and customer reviews</em></p>
<img src="screenshots/cart.png" alt="Shopping Cart" style="display:block;margin:0 auto" />

<p><em>Shopping cart with subtotals, total calculation, and Stripe checkout button</em></p>
<hr />
<h2>Architecture: Why I Chose a Modular Monolith</h2>
<p>This is probably the most important decision I made, and it goes against what most tutorials tell you.</p>
<p>When I started planning, my instinct was to split everything into microservices: auth-service, product-service, cart-service, order-service, payment-service. Seven separate applications, seven databases, seven deployment pipelines.</p>
<p>Then I asked myself: <em>who is this for?</em></p>
<p>I'm a solo developer building a portfolio project. I don't have a team of 50 engineers who need to deploy independently. I don't have millions of requests per second that require different scaling strategies for different components. I don't have organizational boundaries that mandate service separation.</p>
<p>What I do have is one application that needs to work reliably, be easy to debug, and be impressive to demo.</p>
<p><strong>A modular monolith was the right call.</strong> Here's the actual code structure:</p>
<pre><code class="language-plaintext">app/
├── main.py              # Entry point, router registration
├── config.py            # Environment-based settings (Pydantic)
├── database.py          # SQLAlchemy engine + session
├── dependencies.py      # Auth dependencies
├── otel.py              # OpenTelemetry setup (graceful degradation)
├── models/              # SQLAlchemy ORM models
│   ├── user.py
│   ├── product.py
│   ├── cart.py
│   ├── order.py
│   ├── review.py
│   └── ...
├── schemas/             # Pydantic request/response schemas
├── services/            # Business logic layer
│   ├── product_service.py
│   ├── cart_service.py
│   ├── order_service.py
│   ├── search_service.py   # OpenSearch integration
│   └── email_service.py
└── routers/             # HTTP endpoints
    ├── auth.py
    ├── pages.py         # Server-rendered HTML pages
    ├── product.py
    ├── cart.py
    ├── order.py         # Stripe checkout + payments
    ├── admin.py
    └── ...
</code></pre>
<p>Each "module" (auth, products, cart, orders) has its own model, schema, service, and router — but they share one database, one process, and one deployment. This is exactly how Shopify (Rails monolith), GitHub (Rails monolith), and Stack Overflow (.NET monolith) are built.</p>
<p>The key insight: <strong>you can always extract a microservice later when you have real traffic patterns showing which module needs independent scaling.</strong> Starting with microservices before you have that data is premature optimization at the architectural level.</p>
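<p>To make the "one database, one process" point concrete, here is a minimal self-contained sketch — stdlib <code>sqlite3</code> standing in for PostgreSQL, with invented table and function names — of why cross-module transactions are trivial in a monolith:</p>
<pre><code class="language-python">import sqlite3

# One shared database for all modules -- the essence of the modular monolith.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE carts (user_id INTEGER, product TEXT, qty INTEGER)")
conn.execute("CREATE TABLE orders (user_id INTEGER, product TEXT, qty INTEGER, status TEXT)")

# cart "module": owns cart logic, but uses the shared connection
def add_to_cart(user_id, product, qty):
    conn.execute("INSERT INTO carts VALUES (?, ?, ?)", (user_id, product, qty))

# order "module": checkout reads the cart and writes orders in ONE transaction --
# trivial here, but a distributed-saga problem once cart and orders are separate services
def checkout(user_id):
    with conn:  # commit everything or nothing
        items = conn.execute(
            "SELECT product, qty FROM carts WHERE user_id = ?", (user_id,)
        ).fetchall()
        for product, qty in items:
            conn.execute(
                "INSERT INTO orders VALUES (?, ?, ?, 'pending')", (user_id, product, qty)
            )
        conn.execute("DELETE FROM carts WHERE user_id = ?", (user_id,))
    return len(items)

add_to_cart(1, "keyboard", 2)
print(checkout(1))
</code></pre>
<p>If checkout fails halfway, the transaction rolls back and the cart is untouched — no compensating actions, no message queues.</p>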
<hr />
<h2>The Tech Stack</h2>
<table>
<thead>
<tr>
<th>Layer</th>
<th>Technology</th>
<th>Why</th>
</tr>
</thead>
<tbody><tr>
<td>Backend</td>
<td>FastAPI (Python 3.11)</td>
<td>Async, type-safe, auto-generated OpenAPI docs</td>
</tr>
<tr>
<td>ORM</td>
<td>SQLAlchemy 2.x</td>
<td>Industry standard, works with SQLite and PostgreSQL</td>
</tr>
<tr>
<td>Templates</td>
<td>Jinja2 + TailwindCSS</td>
<td>Server-rendered, no JavaScript framework needed</td>
</tr>
<tr>
<td>Interactivity</td>
<td>HTMX</td>
<td>Add-to-cart without page reload, minimal JS</td>
</tr>
<tr>
<td>Payments</td>
<td>Stripe Checkout</td>
<td>PCI compliant, hosted payment page, real refunds</td>
</tr>
<tr>
<td>Database</td>
<td>PostgreSQL 16</td>
<td>Production-grade, running as StatefulSet on EKS</td>
</tr>
<tr>
<td>Search</td>
<td>OpenSearch</td>
<td>Fuzzy full-text search with relevance ranking</td>
</tr>
<tr>
<td>Cache</td>
<td>Redis 7 (ready, not yet enabled)</td>
<td>Product listing cache, session store</td>
</tr>
<tr>
<td>Tracing</td>
<td>OpenTelemetry + Jaeger</td>
<td>Distributed tracing across every HTTP request</td>
</tr>
<tr>
<td>Metrics</td>
<td>Prometheus + Grafana</td>
<td>Request rate, latency percentiles, pod health</td>
</tr>
<tr>
<td>Analytics</td>
<td>Apache Superset</td>
<td>Revenue dashboards, sales analytics, customer insights</td>
</tr>
</tbody></table>
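<p>For the OpenSearch row above, here is a hedged sketch of the kind of query body a search service might build for fuzzy full-text search with filters — field names, boosts, and sort handling are my assumptions, not the project's actual mapping:</p>
<pre><code class="language-python">def build_search_query(term, category=None, min_price=None, max_price=None, sort=None):
    """Build an OpenSearch query body: fuzzy full-text match plus filters."""
    filters = []
    if category:
        filters.append({"term": {"category": category}})
    price_range = {}
    if min_price is not None:
        price_range["gte"] = min_price
    if max_price is not None:
        price_range["lte"] = max_price
    if price_range:
        filters.append({"range": {"price": price_range}})

    body = {
        "query": {
            "bool": {
                "must": [{
                    "multi_match": {
                        "query": term,
                        "fields": ["title^2", "description"],  # boost title matches
                        "fuzziness": "AUTO",  # tolerate typos like "labtop"
                    }
                }],
                "filter": filters,  # filters don't affect relevance scoring
            }
        }
    }
    if sort:
        body["sort"] = [{sort: {"order": "asc"}}]
    return body
</code></pre>
<p>Putting category and price in the <code>filter</code> clause (rather than <code>must</code>) keeps relevance ranking driven purely by the text match.</p>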
<hr />
<h2>Deployment: Production-Grade EKS</h2>
<p>This is where my DevOps background made the difference. The application runs on AWS EKS with a full production deployment pipeline.</p>
<h3>Infrastructure (Terraform)</h3>
<p>Everything is Infrastructure as Code. One <code>terraform apply</code> creates:</p>
<ul>
<li><p><strong>VPC</strong> with private subnets across 3 availability zones</p>
</li>
<li><p><strong>EKS cluster</strong> with managed node groups (5 × t3.small)</p>
</li>
<li><p><strong>IAM roles</strong> with IRSA (IAM Roles for Service Accounts) — no long-lived credentials anywhere</p>
</li>
<li><p><strong>EBS CSI driver</strong> for persistent volume provisioning</p>
</li>
<li><p><strong>ECR</strong> for container image storage</p>
</li>
</ul>
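<p>A hedged sketch of what that Terraform can look like, using the community <code>terraform-aws-modules</code> VPC and EKS modules — names, CIDRs, and sizes here are illustrative, not the project's exact configuration:</p>
<pre><code class="language-hcl">module "vpc" {
  source = "terraform-aws-modules/vpc/aws"

  name            = "raj-store"
  cidr            = "10.0.0.0/16"
  azs             = ["us-east-1a", "us-east-1b", "us-east-1c"]
  private_subnets = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
}

module "eks" {
  source = "terraform-aws-modules/eks/aws"

  cluster_name    = "raj-store"
  cluster_version = "1.29"
  vpc_id          = module.vpc.vpc_id
  subnet_ids      = module.vpc.private_subnets

  eks_managed_node_groups = {
    default = {
      instance_types = ["t3.small"]
      min_size       = 3
      desired_size   = 5
      max_size       = 6
    }
  }
}
</code></pre>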
<h3>CI/CD Pipeline</h3>
<p>The deployment pipeline uses GitHub Actions with OIDC authentication to AWS — no access keys stored anywhere.</p>
<pre><code class="language-plaintext">Developer pushes code to main
    ↓
GitHub Actions triggers (OIDC → AWS)
    ↓
Docker image built and pushed to ECR
    ↓
ArgoCD detects new image tag in git
    ↓
ArgoCD syncs Helm chart to EKS
    ↓
Rolling update with zero downtime
</code></pre>
<p><em>GitHub Actions build pipeline — OIDC auth, Docker build, ECR push</em></p>
<h3>GitOps with ArgoCD</h3>
<p>ArgoCD watches the Helm chart in the git repository. Any change to the chart (new image tag, config change, resource limit adjustment) is automatically applied to the cluster. No manual <code>kubectl apply</code> needed.</p>
<img src="screenshots/argocd.png" alt="ArgoCD Sync" style="display:block;margin:0 auto" />

<p><em>ArgoCD application view showing all Kubernetes resources in sync</em></p>
<h3>Secrets Management</h3>
<p>Application secrets (database passwords, Stripe API keys, JWT signing keys) are stored in <strong>AWS Secrets Manager</strong> and synced to Kubernetes via <strong>External Secrets Operator</strong>. The git repository contains zero secrets — only references to which keys to fetch.</p>
<pre><code class="language-plaintext">AWS Secrets Manager (raj-store/prod)
    ↓ (ESO syncs every 1 hour)
Kubernetes Secret (raj-store-secrets)
    ↓ (mounted as env vars)
FastAPI Pod
</code></pre>
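<p>That sync is declared with an <code>ExternalSecret</code> manifest roughly like the following — the secret-store name is an assumption; only the AWS key path (<code>raj-store/prod</code>) comes from the diagram above:</p>
<pre><code class="language-yaml">apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: raj-store-secrets
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager   # ClusterSecretStore authenticated via IRSA
    kind: ClusterSecretStore
  target:
    name: raj-store-secrets     # Kubernetes Secret that ESO creates/updates
  dataFrom:
    - extract:
        key: raj-store/prod     # AWS Secrets Manager entry
</code></pre>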
<p>Non-sensitive configuration (service hostnames, ports, feature flags) lives in a Kubernetes ConfigMap managed by the Helm chart.</p>
<hr />
<h2>Observability: Seeing Everything</h2>
<p>This is where the project goes from "deployed app" to "production-ready platform." I instrumented the application with three pillars of observability: traces, metrics, and analytics.</p>
<h3>Distributed Tracing (OpenTelemetry + Jaeger)</h3>
<p>Every HTTP request is automatically traced — from the NGINX ingress through the FastAPI route handler, into every SQL query, and out to external API calls (like Stripe).</p>
<img src="screenshots/jaeger-traces.png" alt="Jaeger Traces" style="display:block;margin:0 auto" />

<p><em>Jaeger showing distributed traces — every request broken down into spans with timing</em></p>
<p>The OpenTelemetry integration is auto-instrumented. I added ~30 lines of setup code, and every FastAPI route, every SQLAlchemy query, and every outbound HTTP call is traced automatically. Health check and readiness probe endpoints are excluded from traces to reduce noise.</p>
<p>When debugging a slow checkout, I can see exactly where time is spent:</p>
<pre><code class="language-plaintext">POST /orders/checkout-form           2.3s total
├── get_current_user                     2ms
├── get_all_cart_items (SQL)            18ms
├── get_user_addresses (SQL)             8ms
└── stripe.checkout.Session.create   2200ms  ← Stripe API call
</code></pre>
<h3>Metrics (Prometheus + Grafana)</h3>
<p>Prometheus scrapes metrics from the FastAPI application via the <code>/metrics</code> endpoint (provided by <code>prometheus-fastapi-instrumentator</code>). Grafana dashboards show:</p>
<ul>
<li><p>Request rate per endpoint</p>
</li>
<li><p>Response time percentiles (p50, p95, p99)</p>
</li>
<li><p>Error rate percentage</p>
</li>
<li><p>Pod CPU and memory utilization</p>
</li>
<li><p>HTTP status code distribution</p>
</li>
</ul>
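<p>Assuming the instrumentator's default metric names (<code>http_requests_total</code> and the <code>http_request_duration_seconds</code> histogram — verify against your own <code>/metrics</code> output), the panels above map to PromQL roughly like:</p>
<pre><code class="language-plaintext"># Request rate per endpoint
sum(rate(http_requests_total[5m])) by (handler)

# p95 latency across all requests
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# Error rate: 5xx as a fraction of all requests
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))
</code></pre>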
<img src="screenshots/grafana-dashboard.png" alt="Grafana Dashboard" style="display:block;margin:0 auto" />

<p><em>Grafana dashboard — request rate, latency percentiles, error rate, pod resources</em></p>
<h3>Business Analytics (Apache Superset)</h3>
<p>Superset connects directly to the PostgreSQL database (via a read-only user for security) and provides SQL-powered dashboards for business metrics:</p>
<ul>
<li><p>Revenue trends over time</p>
</li>
<li><p>Top-selling products by revenue</p>
</li>
<li><p>Order status distribution (pending vs confirmed vs shipped vs delivered)</p>
</li>
<li><p>Customer geography</p>
</li>
<li><p>Average order value</p>
</li>
<li><p>Product rating distribution</p>
</li>
</ul>
<img src="screenshots/superset-dashboard.png" alt="Superset Dashboard" style="display:block;margin:0 auto" />

<p><em>Superset analytics dashboard — revenue, top products, order status, customer insights</em></p>
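<p>The read-only PostgreSQL user Superset connects with can be created using standard grants — database and role names here are illustrative:</p>
<pre><code class="language-sql">-- Role that can read every table but change nothing
CREATE ROLE superset_ro LOGIN PASSWORD 'change-me';
GRANT CONNECT ON DATABASE rajstore TO superset_ro;
GRANT USAGE ON SCHEMA public TO superset_ro;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO superset_ro;
-- Make tables created later readable too
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO superset_ro;
</code></pre>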
<hr />
<h2>Stripe Integration: Real Payments</h2>
<p>The payment flow uses Stripe Checkout — a hosted payment page that handles card validation, 3D Secure, and PCI compliance without any card data touching my server.</p>
<p><strong>Checkout flow:</strong></p>
<ol>
<li><p>User clicks "Proceed to Checkout" in the cart</p>
</li>
<li><p>App creates a Stripe Checkout Session with line items</p>
</li>
<li><p>User is redirected to Stripe's hosted payment page</p>
</li>
<li><p>After payment, Stripe redirects back with a session ID</p>
</li>
<li><p>App verifies payment status with Stripe API</p>
</li>
<li><p>Order is created with status "confirmed" and the Stripe payment intent is saved</p>
</li>
<li><p>Confirmation email is sent</p>
</li>
</ol>
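<p>Step 2 can be sketched like this. The session shape follows Stripe's documented Checkout API, but the helper itself is my illustration — the Stripe client is injected so the payment call lives in one place and is easy to stub in tests:</p>
<pre><code class="language-python">def create_checkout_session(stripe_client, cart_items, success_url, cancel_url):
    """Create a Stripe Checkout Session from cart items.

    `stripe_client` is the real `stripe` module in production.
    Stripe expects amounts as integer cents.
    """
    line_items = [
        {
            "price_data": {
                "currency": "usd",
                "product_data": {"name": item["name"]},
                "unit_amount": round(item["price"] * 100),  # dollars to cents
            },
            "quantity": item["qty"],
        }
        for item in cart_items
    ]
    return stripe_client.checkout.Session.create(
        mode="payment",
        line_items=line_items,
        # Stripe substitutes the real session ID into this placeholder
        success_url=success_url + "?session_id={CHECKOUT_SESSION_ID}",
        cancel_url=cancel_url,
    )
</code></pre>
<p>The <code>{CHECKOUT_SESSION_ID}</code> placeholder in the success URL is what lets step 5 verify the payment after the redirect.</p>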
<p><strong>Refunds</strong> are handled through the admin panel. The admin clicks a "Refund" button, which calls the Stripe Refund API with the saved payment intent. The order status updates to "refunded" and the customer receives an email notification. Real money flows back through Stripe — verified in the Stripe Dashboard.</p>
<img src="screenshots/stripe-checkout.png" alt="Stripe Checkout" style="display:block;margin:0 auto" />

<p><em>Stripe Checkout page — real payment processing in test mode</em></p>
<img src="screenshots/order-confirmation.png" alt="Order Confirmation" style="display:block;margin:0 auto" />

<p><em>Order confirmation page with Amazon-style order number</em></p>
<hr />
<h2>What I'd Do Differently</h2>
<p><strong>1. Start with PostgreSQL from day one.</strong> I started with SQLite for local development and migrated to PostgreSQL for production. While SQLAlchemy abstracts most differences, there were small quirks (like <code>ALTER TABLE</code> behavior) that caused unnecessary debugging. Starting with PostgreSQL via Docker Compose would have been smoother.</p>
<p><strong>2. Add Alembic migrations early.</strong> I used <code>Base.metadata.create_all()</code> for table creation and manual <code>ALTER TABLE</code> for schema changes. This works for a solo project but doesn't scale. Alembic would have given me versioned, repeatable migrations from the start.</p>
<p><strong>3. Use Stripe webhooks instead of redirect-based verification.</strong> The current flow relies on the success redirect to verify payment. If the user's browser crashes after payment but before the redirect, the order isn't created even though the card was charged. Webhooks solve this by having Stripe notify the server directly — independent of the browser.</p>
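<p>For context, Stripe's documented webhook signing scheme — HMAC-SHA256 over <code>"timestamp.payload"</code>, sent as <code>t=…,v1=…</code> in the <code>Stripe-Signature</code> header — can be verified with the standard library. This is a simplified sketch; in production you'd use <code>stripe.Webhook.construct_event</code>, which also handles multiple signatures:</p>
<pre><code class="language-python">import hashlib
import hmac
import time

def verify_stripe_signature(payload: bytes, sig_header: str, secret: str,
                            tolerance: int = 300) -&gt; bool:
    """Check a Stripe-style webhook signature against the endpoint secret."""
    parts = dict(p.split("=", 1) for p in sig_header.split(","))
    timestamp, signature = parts["t"], parts["v1"]
    # Reject stale timestamps to prevent replay attacks
    if abs(time.time() - int(timestamp)) &gt; tolerance:
        return False
    signed_payload = f"{timestamp}.".encode() + payload
    expected = hmac.new(secret.encode(), signed_payload, hashlib.sha256).hexdigest()
    # Constant-time comparison avoids timing side channels
    return hmac.compare_digest(expected, signature)
</code></pre>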
<hr />
<h2>For DevOps Engineers</h2>
<p>If you're a DevOps engineer looking to level up your application development skills, this project covers:</p>
<ul>
<li><p>How authentication actually works in web apps (JWT, cookies, middleware)</p>
</li>
<li><p>Why database schema design matters (foreign keys, indexes, relationships)</p>
</li>
<li><p>How payment gateways integrate (Stripe's redirect-based flow)</p>
</li>
<li><p>What "full-stack observability" means in practice (traces + metrics + analytics)</p>
</li>
<li><p>How to deploy a real application, not just infrastructure</p>
</li>
<li><p>When microservices make sense vs when a monolith is the right choice</p>
</li>
</ul>
<p>The biggest insight: <strong>once you've built and deployed your own application, you understand developers' problems from the inside.</strong> That makes you a dramatically better DevOps engineer.</p>
<hr />
<h2>For Business Stakeholders</h2>
<p>If you're evaluating this as a technology approach:</p>
<ul>
<li><p>The modular monolith architecture keeps development velocity high while maintaining code quality</p>
</li>
<li><p>GitOps deployment means every change is auditable, reversible, and automated</p>
</li>
<li><p>Full observability means issues are detected and debugged in minutes, not hours</p>
</li>
<li><p>Scaling is straightforward — increase replica count for the app, upgrade node sizes for the database</p>
</li>
<li><p>The same codebase serves both the HTML storefront and a REST API (future mobile app ready)</p>
</li>
</ul>
<hr />
<h2>Try It Yourself</h2>
<ul>
<li><p>Browse products, search, filter by category</p>
</li>
<li><p>Create an account and place a test order (use card <code>4242 4242 4242 4242</code>)</p>
</li>
<li><p>Check your order history</p>
</li>
<li><p>Leave a product review</p>
</li>
</ul>
<p>Source code: <a href="https://github.com/rajasekhar-cloud25/ecommerce-api"><strong>github.com/rajasekhar-cloud25/ecommerce-api</strong></a></p>
<hr />
<p><em>Rajasekhar Reddy is a Senior DevOps Engineer with 7+ years of experience across AWS, Azure, GCP, and hybrid cloud environments. He holds CKA and Azure Administrator certifications. Connect on</em> <a href="https://rajasekharcloud.com"><em>rajasekharcloud.com</em></a><em>.</em></p>
]]></content:encoded></item><item><title><![CDATA[🤖 How to Build an AI Agent Using MCP and Connect It to Salesforce (Step-by-Step Guide)]]></title><description><![CDATA[AI agents are changing how developers build applications. Instead of hardcoding every step, we can build systems where AI decides what to do, when to do it, and which tools to use.
In this post I’ll w]]></description><link>https://blog.rajasekharcloud.com/how-to-build-an-ai-agent-using-mcp-and-connect-it-to-salesforce-step-by-step-guide</link><guid isPermaLink="true">https://blog.rajasekharcloud.com/how-to-build-an-ai-agent-using-mcp-and-connect-it-to-salesforce-step-by-step-guide</guid><category><![CDATA[ai-agent]]></category><category><![CDATA[Salesforce AI]]></category><dc:creator><![CDATA[reddyj4]]></dc:creator><pubDate>Thu, 02 Apr 2026 11:53:29 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/69c77c787cf2706510c9caad/9e51d970-d9f7-4331-af8e-d004982b24b4.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>AI agents are changing how developers build applications. Instead of hardcoding every step, we can build systems where AI <strong>decides what to do, when to do it, and which tools to use</strong>.</p>
<p>In this post I’ll walk you through building a simple AI agent using MCP (Model Context Protocol) and connecting it to Salesforce to fetch real data. The agent will:</p>
<ul>
<li><p>Discover and call tools (Salesforce queries, notifications, updates)</p>
</li>
<li><p>Execute tool functions when needed</p>
</li>
<li><p>Return structured, actionable results</p>
</li>
</ul>
<hr />
<h2>🧠 What is MCP?</h2>
<p>MCP (Model Context Protocol) is a design approach where the model is given:</p>
<ul>
<li><p>A list of available tools (name, description, input schema)</p>
</li>
<li><p>Context (conversation + environment data)</p>
</li>
<li><p>A protocol for asking to call tools and receiving the results</p>
</li>
</ul>
<p>The flow becomes:</p>
<ol>
<li><p>AI understands the user goal</p>
</li>
<li><p>AI chooses a tool and (optionally) constructs structured arguments</p>
</li>
<li><p>System executes the tool</p>
</li>
<li><p>Tool output is fed back into the model for final response or next step</p>
</li>
</ol>
<p>This helps you keep the agent small, auditable, and safe.</p>
<hr />
<h2>⚙️ Prerequisites</h2>
<ul>
<li><p>Node.js (v18+)</p>
</li>
<li><p>Basic JavaScript / Node knowledge</p>
</li>
<li><p>OpenAI API key</p>
</li>
<li><p>Salesforce Developer org (or any org with API access)</p>
</li>
<li><p>dotenv for env variables</p>
</li>
</ul>
<hr />
<h2>🏗️ Project Setup</h2>
<pre><code class="language-bash">mkdir mcp-agent
cd mcp-agent
npm init -y
npm install express openai axios dotenv
</code></pre>
<p>Create a <code>.env</code>:</p>
<pre><code class="language-env">OPENAI_API_KEY=your_openai_key
SF_CLIENT_ID=your_client_id
SF_CLIENT_SECRET=your_client_secret
SF_REFRESH_TOKEN=your_refresh_token
SF_INSTANCE_URL=https://your-instance.salesforce.com
PORT=3000
</code></pre>
<p>Notes:</p>
<ul>
<li><p>Use OAuth with a refresh token (offline access) so your service can refresh access tokens without interactive login.</p>
</li>
<li><p>Store secrets securely (vault/secret manager) in production, not plain <code>.env</code>.</p>
</li>
</ul>
<hr />
<h2>🔗 Connect to Salesforce (recommended approach)</h2>
<ol>
<li><p>In Salesforce: Setup → App Manager → New Connected App</p>
<ul>
<li><p>Enable OAuth</p>
</li>
<li><p>Callback URL: <a href="http://localhost:3000/callback">http://localhost:3000/callback</a> (for dev)</p>
</li>
<li><p>Scopes: api, refresh_token, offline_access (and others only if needed)</p>
</li>
</ul>
</li>
<li><p>Use the refresh token to obtain short-lived access tokens. Example token refresh helper:</p>
</li>
</ol>
<pre><code class="language-javascript">// sfAuth.js
import axios from "axios";

export async function getAccessToken() {
  const params = new URLSearchParams();
  params.append("grant_type", "refresh_token");
  params.append("client_id", process.env.SF_CLIENT_ID);
  params.append("client_secret", process.env.SF_CLIENT_SECRET);
  params.append("refresh_token", process.env.SF_REFRESH_TOKEN);

  const res = await axios.post(
    `${process.env.SF_INSTANCE_URL}/services/oauth2/token`,
    params
  );
  return res.data.access_token;
}
</code></pre>
<p>(Adjust URL to token endpoint if using a different Salesforce instance domain.)</p>
<hr />
<h2>🔧 Creating Salesforce Tools (MCP)</h2>
<p>Tools are plain functions your agent can call. Keep them small, idiomatic, and idempotent where possible.</p>
<p>Example: fetch Accounts.</p>
<pre><code class="language-javascript">// tools.js
import axios from "axios";
import { getAccessToken } from "./sfAuth.js";

export async function getAccounts(limit = 10) {
  const accessToken = await getAccessToken();

  const soql = `SELECT Id, Name, Type, Industry, LastModifiedDate FROM Account ORDER BY LastModifiedDate DESC LIMIT ${Number(limit)}`;
  const encoded = encodeURIComponent(soql);
  const url = `${process.env.SF_INSTANCE_URL}/services/data/v59.0/query/?q=${encoded}`;

  const res = await axios.get(url, {
    headers: {
      Authorization: `Bearer ${accessToken}`,
      Accept: "application/json",
    },
  });

  // Return minimal fields and count
  return {
    records: res.data.records,
    totalSize: res.data.totalSize,
  };
}

export async function getAccountById(id) {
  const accessToken = await getAccessToken();
  const url = `${process.env.SF_INSTANCE_URL}/services/data/v59.0/sobjects/Account/${id}`;
  const res = await axios.get(url, {
    headers: { Authorization: `Bearer ${accessToken}` },
  });
  return res.data;
}
</code></pre>
<p>Add other tools similarly: query opportunities, create tasks, update fields, post Chatter messages, etc. Each tool should return structured JSON.</p>
<hr />
<h2>🧩 Registering Tools (tool metadata for the model)</h2>
<p>Expose metadata the model can use to decide which tool to call. When you use function-calling (or a simple MCP pattern), the metadata helps the model produce structured calls.</p>
<pre><code class="language-javascript">// mcp.js
import { getAccounts, getAccountById } from "./tools.js";

export const tools = [
  {
    name: "getAccounts",
    description: "Fetch recent Salesforce accounts. Args: { limit: number }",
    function: getAccounts,
    // If using function-calling features, include a JSON Schema for args:
    parameters: {
      type: "object",
      properties: {
        limit: { type: "integer", description: "Max number of accounts to fetch" },
      },
      required: [],
    },
  },
  {
    name: "getAccountById",
    description: "Fetch a single Account by Salesforce Id. Args: { id: string }",
    function: getAccountById,
    parameters: {
      type: "object",
      properties: {
        id: { type: "string" },
      },
      required: ["id"],
    },
  },
];
</code></pre>
<hr />
<h2>🤖 Building the Agent Orchestrator</h2>
<p>Pattern used here (MCP loop):</p>
<ol>
<li><p>Send user input + tool metadata to the model.</p>
</li>
<li><p>If the model returns a function/tool call, run that function locally.</p>
</li>
<li><p>Return the tool output to the model as a new message and ask for the final answer.</p>
</li>
<li><p>Repeat if the model requests additional tools.</p>
</li>
</ol>
<p>Example agent using the OpenAI function-calling pattern (pseudo-real code for the official Node SDK):</p>
<pre><code class="language-javascript">// agent.js
import OpenAI from "openai";
import { tools } from "./mcp.js";

const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// small helper to map tool metadata for the model's function parameter
function buildFunctionDefs(tools) {
  return tools.map(t =&gt; ({
    name: t.name,
    description: t.description,
    parameters: t.parameters || { type: "object" },
  }));
}

export async function runAgent(userInput) {
  // 1) Ask the model what to do
  const initial = await client.chat.completions.create({
    model: "gpt-4o", // pick a model in your account that supports function-calling
    messages: [
      {
        role: "system",
        content:
          "You are an assistant that can call tools. When you want to call a tool, respond with a function call using JSON arguments matching the declared schema.",
      },
      { role: "user", content: userInput },
    ],
    functions: buildFunctionDefs(tools),
    function_call: "auto",
  });

  const message = initial.choices[0].message;

  // 2) If the model wants to call a function, execute it
  if (message.function_call) {
    const { name, arguments: argsStr } = message.function_call;
    let args = {};
    try {
      args = argsStr ? JSON.parse(argsStr) : {};
    } catch (err) {
      // Bad JSON from model — tell it to reformat
      return {
        error: "Model returned invalid JSON for function call arguments",
        detail: err.message,
      };
    }

    // find the tool function and run it
    const tool = tools.find(t =&gt; t.name === name);
    if (!tool) {
      return { error: `Unknown tool: ${name}` };
    }

    let toolOutput;
    try {
      // Note: spreading Object.values relies on the argument order matching the
      // schema's property order — fine for these single-argument tools, but pass
      // a single args object for multi-argument tools.
      toolOutput = await tool.function(...Object.values(args));
    } catch (err) {
      toolOutput = { error: err.message };
    }

    // 3) Send the tool output back to the model and ask for finalization
    const followUp = await client.chat.completions.create({
      model: "gpt-4o",
      messages: [
        { role: "system", content: "You are an assistant that can call tools." },
        { role: "user", content: userInput },
        message, // original model function call
        {
          role: "function",
          name,
          content: JSON.stringify(toolOutput),
        },
        {
          role: "user",
          content: "Based on the tool output, provide a concise summary and next steps.",
        },
      ],
    });

    const final = followUp.choices[0].message.content;
    return { result: final, toolOutput };
  } else {
    // Model didn't call a tool — just return its text
    return { result: message.content };
  }
}
</code></pre>
<p>Notes:</p>
<ul>
<li><p>The above uses the Chat Completions function-calling flow. If you're using the newer Responses API, adapt accordingly to send tool metadata and handle tool calls similarly.</p>
</li>
<li><p>Validate model-returned JSON and guard against unexpected inputs.</p>
</li>
</ul>
<hr />
<h2>🖥️ Example Express Server</h2>
<pre><code class="language-javascript">// server.js
import express from "express";
import dotenv from "dotenv";
import { runAgent } from "./agent.js";

dotenv.config();
const app = express();
app.use(express.json());

app.post("/agent", async (req, res) =&gt; {
  try {
    const { input } = req.body;
    const out = await runAgent(input);
    res.json(out);
  } catch (err) {
    console.error(err);
    res.status(500).json({ error: err.message });
  }
});

const port = process.env.PORT || 3000;
app.listen(port, () =&gt; console.log(`Agent server running on ${port}`));
</code></pre>
<hr />
<h2>✅ Practical Patterns &amp; Tips</h2>
<ul>
<li><p>Start with a small set of tools (read-only queries first), then expand to mutate data (create/update) with caution.</p>
</li>
<li><p>Limit scopes in your connected app. Grant only what you need.</p>
</li>
<li><p>Log function calls with correlation IDs for auditing.</p>
</li>
<li><p>Sanitize and validate any model-provided arguments before executing tools.</p>
</li>
<li><p>Add rate limiting and retries when calling external APIs (Salesforce/OpenAI).</p>
</li>
<li><p>Return structured results (JSON) from tools so the model can reason about data reliably.</p>
</li>
<li><p>Implement a “dry-run” or “preview” mode where the agent suggests actions but does not execute them unless explicitly approved.</p>
</li>
</ul>
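<p>To make the "sanitize and validate" tip concrete, here is a small validator (my illustration) that checks model-supplied arguments against a tool's JSON-Schema-style <code>parameters</code> block before the tool function ever runs:</p>
<pre><code class="language-javascript">// Type checks for the JSON-Schema "type" values used in the tool metadata.
const typeChecks = {
  string: function (v) { return typeof v === "string"; },
  integer: Number.isInteger,
  number: function (v) { return typeof v === "number"; },
  boolean: function (v) { return typeof v === "boolean"; },
  object: function (v) { return typeof v === "object"; },
};

// Return a list of problems; an empty array means the args are safe to pass on.
function validateArgs(schema, args) {
  const errors = [];
  for (const key of schema.required || []) {
    if (!(key in args)) errors.push(`missing required argument: ${key}`);
  }
  for (const [key, value] of Object.entries(args)) {
    const spec = (schema.properties || {})[key];
    if (!spec) {
      errors.push(`unexpected argument: ${key}`);
    } else {
      const check = typeChecks[spec.type] || function () { return true; };
      if (!check(value)) errors.push(`argument ${key} should be ${spec.type}`);
    }
  }
  return errors;
}
</code></pre>
<p>Call it in the orchestrator right after parsing <code>message.function_call.arguments</code>, and return the error list to the model instead of executing when it is non-empty.</p>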
<hr />
<h2>🛡️ Security &amp; Compliance</h2>
<ul>
<li><p>Never embed long-lived credentials in code. Use refresh-token + client secret flow and replace secrets using a secure secret store.</p>
</li>
<li><p>Rate-limit model and API usage, and monitor costs.</p>
</li>
<li><p>Add RBAC and approvals for destructive operations (e.g., mass-updates).</p>
</li>
<li><p>Keep sensitive data out of prompts when possible; redact or transform before sending to OpenAI if needed.</p>
</li>
</ul>
<hr />
<h2>🧪 Testing &amp; Iteration</h2>
<ul>
<li><p>Start with unit tests for each tool (mock Salesforce responses).</p>
</li>
<li><p>Test the agent with typical and adversarial prompts to see how it selects tools.</p>
</li>
<li><p>Add guardrails: deterministic schemas, explicit allowed-values lists, and user confirmations for risky actions.</p>
</li>
</ul>
<hr />
<h2>📈 Next Steps / Ideas</h2>
<ul>
<li><p>Add human-in-the-loop approvals for any write operations.</p>
</li>
<li><p>Expand tools to query related records, compute metrics, or create tasks.</p>
</li>
<li><p>Build a UI that visualizes the agent’s chosen tool-calls and outputs for auditability.</p>
</li>
<li><p>Record conversations and actions for compliance and debugging.</p>
</li>
</ul>
<hr />
<h2>Final Thoughts</h2>
<p>Using MCP lets you design agents that are flexible yet auditable: the model chooses tools and the system executes them in a controlled environment. Start small, instrument heavily, and gradually add capabilities and safety checks. With a minimal set of tools and a solid orchestration loop, you can automate meaningful Salesforce tasks and free up time for higher-value work.</p>
]]></content:encoded></item><item><title><![CDATA[Building a Production-Grade EKS Platform on AWS with Terraform and GitOps]]></title><description><![CDATA[https://codepen.io/qckuhtdx-the-scripter/pen/myrLwxP

Building a Production-Grade EKS Platform on AWS with Terraform and GitOps
Overview
In this post I walk through how I built a fully automated, prod]]></description><link>https://blog.rajasekharcloud.com/building-a-production-grade-eks-platform-on-aws-with-terraform-and-gitops</link><guid isPermaLink="true">https://blog.rajasekharcloud.com/building-a-production-grade-eks-platform-on-aws-with-terraform-and-gitops</guid><category><![CDATA[EKS]]></category><category><![CDATA[Terraform]]></category><category><![CDATA[GitHub Actions]]></category><category><![CDATA[ArgoCD]]></category><category><![CDATA[AWS]]></category><dc:creator><![CDATA[Rajasekhar Reddy]]></dc:creator><pubDate>Wed, 01 Apr 2026 17:27:16 GMT</pubDate><content:encoded><![CDATA[<p><a class="embed-card" href="https://codepen.io/qckuhtdx-the-scripter/pen/myrLwxP">https://codepen.io/qckuhtdx-the-scripter/pen/myrLwxP</a></p>

<h1>Building a Production-Grade EKS Platform on AWS with Terraform and GitOps</h1>
<h2>Overview</h2>
<p>In this post I walk through how I built a fully automated, production-style Kubernetes platform on AWS EKS using Terraform, GitHub Actions OIDC, and ArgoCD — all optimized for cost without sacrificing reliability. Every component is provisioned as code, deployed without storing a single static AWS credential, and observable from day one.</p>
<p>The full source code is available at: <strong><a href="https://github.com/rajasekhar-cloud25/infrastructure">https://github.com/rajasekhar-cloud25/infrastructure</a></strong></p>
<p>Interactive architecture diagram: <a href="https://codepen.io/qckuhtdx-the-scripter/pen/myrLwxP"><strong>View live →</strong></a></p>
<h2>The Problem with Static Credentials</h2>
<p>The traditional approach looks like this:</p>
<pre><code class="language-yaml"># ❌ The wrong way — credentials stored permanently in GitHub
env:
  AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
  AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
</code></pre>
<p>Problems with this approach:</p>
<ul>
<li><p>Credentials are long-lived — if leaked, they work until manually rotated</p>
</li>
<li><p>They exist permanently in GitHub's secret store</p>
</li>
<li><p>Any workflow in the repo can use them</p>
</li>
<li><p>Rotation requires updating secrets in every repo that uses them</p>
</li>
<li><p>No audit trail of which workflow run used which credential</p>
</li>
</ul>
<hr />
<h2>The Solution: OIDC Token Exchange</h2>
<p>GitHub Actions supports OpenID Connect (OIDC). Instead of storing credentials, the workflow requests a short-lived token from GitHub's OIDC provider and exchanges it for an AWS IAM role:</p>
<pre><code class="language-plaintext">GitHub Actions runner
  │
  ├─ requests OIDC token from GitHub
  │   (signed JWT containing: repo, branch, workflow, run ID)
  │
  ├─ calls AWS STS AssumeRoleWithWebIdentity
  │   (presents the OIDC token + role ARN)
  │
  ├─ AWS validates: is this token from GitHub? ✅
  │                 is the repo/branch in the trust policy? ✅
  │
  └─ AWS returns: temporary access key + secret + session token
      (expires in 1 hour, scoped to this specific IAM role)
</code></pre>
<p>The credentials exist only for the duration of the job. When the job ends, the credentials expire. Nothing is stored. Nothing can leak.</p>
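<p>The whole trust decision comes down to claim matching: the <code>StringLike</code> condition in the role's trust policy is a glob match against the token's <code>sub</code> claim. A toy bash sketch of just that check (illustrative only, not how STS is implemented):</p>
<pre><code class="language-bash">#!/bin/bash
# Illustrative: StringLike behaves like a shell glob on the "sub" claim.
sub_claim_allowed() {
  local token_sub="$1" allowed_pattern="$2"
  # An unquoted variable in a case pattern is glob-matched, like StringLike
  case "$token_sub" in
    $allowed_pattern) return 0 ;;
    *)                return 1 ;;
  esac
}

if sub_claim_allowed "repo:my-org/infrastructure:ref:refs/heads/main" \
                     "repo:my-org/infrastructure:*"; then
  echo "allowed"
fi
</code></pre>
<p>This is why a trust policy can be scoped to one repo, one branch, or one environment just by tightening the pattern.</p>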
<hr />
<h2>Project Structure</h2>
<p>The infrastructure is organized as a set of Terraform modules, each with a single responsibility:</p>
<pre><code class="language-plaintext">infrastructure/
  .github/              ← GitHub Actions workflows
  Main/                 ← Root module, wires everything together
  vpc/                  ← VPC, subnets, IGW, NAT GW, route tables, SGs
  iam/                  ← IAM roles, IRSA roles, GitHub OIDC trust
  eks/                  ← EKS cluster, node group, access entries
  ecr/                  ← ECR repositories (separate workspace)
  eip/                  ← Elastic IPs for NLB (separate workspace)
  k8s_namespaces/       ← All K8s namespaces pre-created
  kubernetes-ingress/   ← NGINX Ingress Controller + NLB + Route53
  argocd_deployment/    ← ArgoCD via local Helm chart
  s3/                   ← Terraform state bucket bootstrap
  charts/               ← Local Helm charts (ArgoCD, NGINX)
  bootstrap.sh          ← Creates S3 bucket + DynamoDB table
</code></pre>
<p>The two most important design decisions here are that <strong>ECR and EIPs live in separate workspaces</strong>. This means container images and static IP addresses survive a full cluster destroy and recreate — no DNS updates, no image rebuilds.</p>
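<p>Because the workspaces are separate, the main workspace cannot reference those resources directly. One option, sketched here with illustrative bucket, key, and output names (the repo actually passes values through tfvars), is <code>terraform_remote_state</code>:</p>
<pre><code class="language-hcl"># Hypothetical wiring: read the EIP workspace's outputs instead of
# hardcoding allocation IDs (all names here are illustrative).
data "terraform_remote_state" "eip" {
  backend = "s3"
  config = {
    bucket = "ecommerce-demo-terraform-state-123456789012"
    key    = "eip/terraform.tfstate"
    region = "us-east-1"
  }
}

locals {
  nlb_eip_allocation_ids = data.terraform_remote_state.eip.outputs.nlb_eip_allocation_ids
}
</code></pre>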
<hr />
<h2>Step 1 — Bootstrapping State</h2>
<p>Before any Terraform can run, the S3 state backend needs to exist. The <code>bootstrap.sh</code> script handles this:</p>
<pre><code class="language-bash">#!/bin/bash
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
BUCKET_NAME="ecommerce-demo-terraform-state-${ACCOUNT_ID}"

aws s3api create-bucket --bucket $BUCKET_NAME --region us-east-1
aws s3api put-bucket-versioning \
  --bucket $BUCKET_NAME \
  --versioning-configuration Status=Enabled

aws dynamodb create-table \
  --table-name terraform-state-lock \
  --attribute-definitions AttributeName=LockID,AttributeType=S \
  --key-schema AttributeName=LockID,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST
</code></pre>
<p>The bucket name is derived from the AWS account ID at runtime — no secrets needed anywhere.</p>
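<p>Once the bucket and lock table exist, every workspace points at them with a standard S3 backend block (the <code>key</code> below is illustrative; each workspace gets its own):</p>
<pre><code class="language-hcl">terraform {
  backend "s3" {
    bucket         = "ecommerce-demo-terraform-state-123456789012"  # account-suffixed bucket from bootstrap.sh
    key            = "main/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-state-lock"
    encrypt        = true
  }
}
</code></pre>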
<hr />
<h2>Step 2 — GitHub Actions OIDC (Zero Static Credentials)</h2>
<p>All three workflows authenticate to AWS using OIDC — the GitHub Actions token is exchanged for a short-lived IAM role. No <code>AWS_ACCESS_KEY_ID</code> is ever stored in GitHub secrets.</p>
<pre><code class="language-yaml"># .github/workflows/tf-apply.yaml
jobs:
  plan:
    permissions:
      id-token: write
      contents: read
    steps:
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
          aws-region: ${{ vars.AWS_REGION }}
</code></pre>
<p>Three workflows are defined:</p>
<table>
<thead>
<tr>
<th>Workflow</th>
<th>Trigger</th>
<th>Purpose</th>
</tr>
</thead>
<tbody><tr>
<td><code>tf-plan</code></td>
<td>On every PR</td>
<td>Runs <code>terraform plan</code>, posts diff as comment</td>
</tr>
<tr>
<td><code>tf-apply</code></td>
<td>Merge to main</td>
<td>Requires manual approval, then applies</td>
</tr>
<tr>
<td><code>tf-destroy</code></td>
<td>Manual only</td>
<td>Requires typing "destroy" to confirm</td>
</tr>
</tbody></table>
<hr />
<h2>Step 3 — VPC Module</h2>
<p>The VPC module creates everything the cluster needs to run in a private, secure network:</p>
<pre><code class="language-hcl"># modules/vpc/main.tf (key resources)
resource "aws_vpc" "main" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_hostnames = true
  enable_dns_support   = true
}

# Public subnets — NLB, NAT Gateway, Internet Gateway
resource "aws_subnet" "public" {
  count                   = 2
  vpc_id                  = aws_vpc.main.id
  cidr_block              = cidrsubnet("10.0.0.0/16", 8, count.index)
  availability_zone       = data.aws_availability_zones.available.names[count.index]
  map_public_ip_on_launch = true

  tags = {
    "kubernetes.io/role/elb" = "1"
  }
}

# Private subnets — EKS nodes only
resource "aws_subnet" "private" {
  count             = 2
  vpc_id            = aws_vpc.main.id
  cidr_block        = cidrsubnet("10.0.0.0/16", 8, count.index + 10)
  availability_zone = data.aws_availability_zones.available.names[count.index]

  tags = {
    "kubernetes.io/role/internal-elb"                    = "1"
    "kubernetes.io/cluster/${var.resource_name}"         = "shared"
  }
}

resource "aws_internet_gateway" "main" {
  vpc_id = aws_vpc.main.id
}

resource "aws_nat_gateway" "main" {
  allocation_id = aws_eip.nat.id
  subnet_id     = aws_subnet.public[0].id  # Single AZ — cost optimized
}
</code></pre>
<p><strong>Why a single NAT Gateway?</strong> One NAT Gateway instead of two saves ~$32/month. For a production portfolio demo this is an acceptable tradeoff — the only risk is losing outbound internet for private nodes if that one AZ goes down.</p>
<p>Route tables are straightforward:</p>
<ul>
<li><p><strong>Public RT:</strong> <code>0.0.0.0/0 → Internet Gateway</code></p>
</li>
<li><p><strong>Private RT:</strong> <code>0.0.0.0/0 → NAT Gateway</code></p>
</li>
</ul>
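<p>In HCL, those two route tables come out to roughly this (a minimal sketch; resource names are illustrative):</p>
<pre><code class="language-hcl">resource "aws_route_table" "public" {
  vpc_id = aws_vpc.main.id
  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.main.id
  }
}

resource "aws_route_table" "private" {
  vpc_id = aws_vpc.main.id
  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = aws_nat_gateway.main.id  # all private egress through one NAT GW
  }
}

resource "aws_route_table_association" "public" {
  count          = 2
  subnet_id      = aws_subnet.public[count.index].id
  route_table_id = aws_route_table.public.id
}
</code></pre>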
<hr />
<h2>Step 4 — IAM Module</h2>
<p>The IAM module handles all permissions with a single consolidated design. The key insight is that the EKS OIDC provider is created inside the IAM module — this avoids a circular dependency where IAM needs EKS and EKS needs IAM.</p>
<pre><code class="language-hcl"># modules/iam/main.tf

# EKS cluster role
resource "aws_iam_role" "eks_cluster" {
  name               = "${var.resource_name}-eks-cluster-role"
  assume_role_policy = data.aws_iam_policy_document.eks_cluster_assume.json
}

# GitHub Actions OIDC role (pre-created provider as data source)
data "aws_iam_openid_connect_provider" "github" {
  url = "https://token.actions.githubusercontent.com"
}

resource "aws_iam_role" "github_actions" {
  name = "${var.resource_name}-github-actions-role"
  assume_role_policy = jsonencode({
    Statement = [{
      Effect    = "Allow"
      Principal = { Federated = data.aws_iam_openid_connect_provider.github.arn }
      Action    = "sts:AssumeRoleWithWebIdentity"
      Condition = {
        StringLike = {
          "token.actions.githubusercontent.com:sub" = "repo:${var.github_repo}:*"
        }
      }
    }]
  })
}

# IRSA roles — one per component, least privilege
resource "aws_iam_role" "external_secrets" {
  name = "${var.resource_name}-external-secrets-role"
  # Trust policy scoped to the ESO service account only
  assume_role_policy = data.aws_iam_policy_document.irsa_external_secrets.json
}
</code></pre>
<p>Every pod that needs AWS access gets its own IRSA role — no shared node-level credentials:</p>
<table>
<thead>
<tr>
<th>Component</th>
<th>IRSA Permissions</th>
</tr>
</thead>
<tbody><tr>
<td>EBS CSI Driver</td>
<td><code>ec2:CreateVolume</code>, <code>ec2:AttachVolume</code></td>
</tr>
<tr>
<td>Cluster Autoscaler</td>
<td><code>autoscaling:SetDesiredCapacity</code></td>
</tr>
<tr>
<td>External Secrets</td>
<td><code>secretsmanager:GetSecretValue</code></td>
</tr>
<tr>
<td>cert-manager</td>
<td><code>route53:ChangeResourceRecordSets</code></td>
</tr>
</tbody></table>
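<p>The trust policy behind each IRSA role pins the role to one Kubernetes service account. A sketch for the External Secrets role, assuming the cluster's OIDC issuer URL is exposed as a variable and the service account is <code>external-secrets/external-secrets</code>:</p>
<pre><code class="language-hcl">data "aws_iam_policy_document" "irsa_external_secrets" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [aws_iam_openid_connect_provider.eks.arn]
    }

    # Only this exact service account can assume the role
    condition {
      test     = "StringEquals"
      variable = "${replace(var.eks_oidc_issuer_url, "https://", "")}:sub"
      values   = ["system:serviceaccount:external-secrets:external-secrets"]
    }
  }
}
</code></pre>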
<hr />
<h2>Step 5 — EKS Module</h2>
<p>The EKS cluster runs on SPOT t3.small instances for cost optimization, and uses <code>API_AND_CONFIG_MAP</code> authentication mode with access entries instead of the legacy <code>aws-auth</code> ConfigMap:</p>
<pre><code class="language-hcl"># modules/eks/main.tf
resource "aws_eks_cluster" "main" {
  name     = var.resource_name
  version  = var.cluster_version
  role_arn = var.cluster_role_arn

  vpc_config {
    subnet_ids              = var.private_subnet_ids
    endpoint_private_access = true
    endpoint_public_access  = true
  }

  access_config {
    authentication_mode                         = "API_AND_CONFIG_MAP"
    bootstrap_cluster_creator_admin_permissions = true
  }
}

resource "aws_eks_node_group" "main" {
  cluster_name    = aws_eks_cluster.main.name
  node_role_arn   = var.node_role_arn
  subnet_ids      = var.private_subnet_ids
  instance_types  = ["t3.small"]
  capacity_type   = "SPOT"        # ~60% cheaper than on-demand

  scaling_config {
    desired_size = 5
    min_size     = 2
    max_size     = 5
  }
}

# Access entries — no aws-auth ConfigMap editing required
resource "aws_eks_access_entry" "github_actions" {
  cluster_name  = aws_eks_cluster.main.name
  principal_arn = var.github_actions_role_arn
  type          = "STANDARD"
}
</code></pre>
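<p>An access entry on its own grants no Kubernetes permissions; it is paired with an access policy association. A sketch using an AWS-managed policy (the cluster-admin choice here is an assumption; scope it down for least privilege):</p>
<pre><code class="language-hcl">resource "aws_eks_access_policy_association" "github_actions" {
  cluster_name  = aws_eks_cluster.main.name
  principal_arn = var.github_actions_role_arn
  policy_arn    = "arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy"

  access_scope {
    type = "cluster"
  }
}
</code></pre>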
<p>The Kubernetes and Helm providers authenticate using <code>exec</code> with <code>aws eks get-token</code> — this avoids plan-time failures when the cluster doesn't exist yet:</p>
<pre><code class="language-hcl">provider "kubernetes" {
  host                   = module.eks.cluster_endpoint
  cluster_ca_certificate = base64decode(module.eks.cluster_ca_certificate)
  exec {
    api_version = "client.authentication.k8s.io/v1beta1"
    command     = "aws"
    args        = ["eks", "get-token", "--cluster-name", module.eks.cluster_name, "--region", var.aws_region]
  }
}
</code></pre>
<hr />
<h2>Step 6 — ECR Module (Separate Workspace)</h2>
<p>ECR repositories are managed in their own Terraform workspace so images are never accidentally deleted when the main cluster is torn down:</p>
<pre><code class="language-hcl"># ecr/main.tf
resource "aws_ecr_repository" "app" {
  for_each             = toset(var.repository_names)
  name                 = each.value
  image_tag_mutability = "MUTABLE"

  image_scanning_configuration {
    scan_on_push = true
  }
}
</code></pre>
<p>GitHub Actions builds and pushes on every merge:</p>
<pre><code class="language-yaml">- name: Build and push
  run: |
    aws ecr get-login-password | docker login --username AWS \
      --password-stdin $ECR_REGISTRY
    docker buildx build --platform linux/amd64 \
      --push -t $ECR_REGISTRY/$ECR_REPO:$IMAGE_TAG .
</code></pre>
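<p>Since the repositories outlive the cluster, it is worth capping their growth. A lifecycle policy sketch that expires untagged images (the 14-day window is an arbitrary choice):</p>
<pre><code class="language-hcl">resource "aws_ecr_lifecycle_policy" "app" {
  for_each   = aws_ecr_repository.app
  repository = each.value.name

  policy = jsonencode({
    rules = [{
      rulePriority = 1
      description  = "Expire untagged images after 14 days"
      selection = {
        tagStatus   = "untagged"
        countType   = "sinceImagePushed"
        countUnit   = "days"
        countNumber = 14
      }
      action = { type = "expire" }
    }]
  })
}
</code></pre>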
<hr />
<h2>Step 7 — EIP Module (Separate Workspace)</h2>
<p>Static Elastic IPs are created in their own workspace — separate from the main cluster. This means the NLB always gets the same IP addresses, Route53 A records never need updating, and the cluster can be completely rebuilt without changing DNS:</p>
<pre><code class="language-hcl"># eip/main.tf
resource "aws_eip" "nlb" {
  count  = 2
  domain = "vpc"

  tags = {
    Name = "${var.resource_name}-nlb-eip-${count.index}"
  }
}
</code></pre>
<p>The allocation IDs are then passed as a variable to the main workspace:</p>
<pre><code class="language-hcl"># environments/eks-demo-dev.tfvars
nlb_eip_allocation_ids = [
  "eipalloc-09595a182e792f01f",
  "eipalloc-032c83197c359b3fe"
]
</code></pre>
<hr />
<h2>Step 8 — Kubernetes Namespaces Module</h2>
<p>All namespaces are created before any Helm chart runs. This prevents race conditions where a chart tries to create resources in a namespace that doesn't exist yet:</p>
<pre><code class="language-hcl"># modules/k8s_namespaces/main.tf
resource "kubernetes_namespace" "namespaces" {
  for_each = toset([
    "argocd",
    "nginx-ingress",
    "monitoring",
    "external-secrets",
    "cert-manager",
    "eks-demo",
    "shared-os",
    "kubecost"
  ])

  metadata {
    name = each.value
  }
}
</code></pre>
<p>This module runs before <code>kubernetes-ingress</code> and <code>argocd_deployment</code> in the dependency graph.</p>
<hr />
<h2>Step 9 — NGINX Ingress Controller + NLB + Route53</h2>
<p>The NGINX Ingress Controller is the traffic gateway for the entire cluster. It is deployed via a local Helm chart with NLB annotations that attach the static EIPs:</p>
<pre><code class="language-hcl"># modules/kubernetes-ingress/main.tf
resource "helm_release" "nginx_ingress" {
  name      = "nginx-ingress"
  chart     = "${path.module}/../charts/kubernetes-ingress"
  namespace = "nginx-ingress"
  timeout   = 600
  wait      = true
  atomic    = true

  set {
    name  = "controller.service.annotations.service\\.beta\\.kubernetes\\.io/aws-load-balancer-type"
    value = "nlb"
  }
  set {
    name  = "controller.service.annotations.service\\.beta\\.kubernetes\\.io/aws-load-balancer-eip-allocations"
    value = join("\\,", var.nlb_eip_allocation_ids)
  }
  set {
    name  = "controller.service.annotations.service\\.beta\\.kubernetes\\.io/aws-load-balancer-subnets"
    value = join("\\,", var.public_subnet_ids)
  }
  set {
    name  = "controller.service.annotations.service\\.beta\\.kubernetes\\.io/aws-load-balancer-ssl-cert"
    value = var.acm_certificate_arn
  }
  set {
    name  = "controller.service.annotations.service\\.beta\\.kubernetes\\.io/aws-load-balancer-ssl-ports"
    value = "443"
  }
}
</code></pre>
<p><strong>Why NGINX over AWS ALB Controller?</strong></p>
<ul>
<li><p>NGINX is free — ALB Controller creates a new ALB per ingress (~$16/mo each)</p>
</li>
<li><p>No per-app ACM ARN annotation required — one cert at the NLB level covers everything</p>
</li>
<li><p>Portable — works identically on any cloud or on-premises</p>
</li>
</ul>
<p>Route53 A records point directly to the static EIP addresses:</p>
<pre><code class="language-hcl">data "aws_eip" "nlb" {
  count = length(var.nlb_eip_allocation_ids)
  id    = var.nlb_eip_allocation_ids[count.index]
}

resource "aws_route53_record" "dns_records" {
  for_each = toset(var.dns_names)
  zone_id  = var.route53_zone_id
  name     = "${each.value}.${var.domain_name}"
  type     = "A"
  ttl      = 300
  records  = data.aws_eip.nlb[*].public_ip
}
</code></pre>
<p>TLS flow:</p>
<pre><code class="language-plaintext">User → HTTPS
  → NLB (ACM wildcard *.reddycloud.com terminates TLS)
  → HTTP → NGINX Ingress
  → HTTP → App pod
</code></pre>
<p>No cert-manager needed. AWS handles certificate renewal automatically.</p>
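<p>The wildcard certificate itself is an input to the ingress module (<code>var.acm_certificate_arn</code>). One way to provision it in Terraform with DNS validation, sketched with illustrative resource names:</p>
<pre><code class="language-hcl">resource "aws_acm_certificate" "wildcard" {
  domain_name       = "*.reddycloud.com"
  validation_method = "DNS"
}

# Create the DNS validation records ACM asks for
resource "aws_route53_record" "cert_validation" {
  for_each = {
    for dvo in aws_acm_certificate.wildcard.domain_validation_options :
    dvo.domain_name =&gt; dvo
  }
  zone_id = var.route53_zone_id
  name    = each.value.resource_record_name
  type    = each.value.resource_record_type
  ttl     = 300
  records = [each.value.resource_record_value]
}

resource "aws_acm_certificate_validation" "wildcard" {
  certificate_arn         = aws_acm_certificate.wildcard.arn
  validation_record_fqdns = [for r in aws_route53_record.cert_validation : r.fqdn]
}
</code></pre>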
<hr />
<h2>Step 10 — ArgoCD Deployment</h2>
<p>ArgoCD is deployed via a local Helm chart with CRDs managed separately using the <code>alekc/kubectl</code> provider to avoid Helm CRD conflicts:</p>
<pre><code class="language-hcl"># modules/argocd_deployment/main.tf

# CRDs managed outside Helm to avoid upgrade conflicts
data "http" "argocd_crds" {
  for_each = toset(local.crd_files)
  url      = each.value
}

resource "kubectl_manifest" "argocd_crds" {
  for_each          = toset(local.crd_files)
  yaml_body         = data.http.argocd_crds[each.value].response_body
  server_side_apply = true
  force_conflicts   = true
  wait              = true
}

resource "helm_release" "argocd" {
  name       = "argocd-chart"
  chart      = "${path.module}/../charts/argocd"
  version    = "9.4.17"
  namespace  = "argocd"
  skip_crds  = true    # CRDs managed by kubectl_manifest above
  replace    = true
  wait       = true
  timeout    = 600

  values = [file("${path.module}/../charts/argocd/clusterValues/values.EksDemo.yaml")]

  depends_on = [kubectl_manifest.argocd_crds]
}
</code></pre>
<p>ArgoCD values for NGINX ingress integration:</p>
<pre><code class="language-yaml"># charts/argocd/clusterValues/values.EksDemo.yaml
configs:
  params:
    server.insecure: "true"  # TLS terminated at NLB

server:
  extraArgs:
    - --insecure

  ingress:
    enabled: true
    ingressClassName: "nginx"
    annotations:
      nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
      nginx.ingress.kubernetes.io/backend-protocol: "HTTP"
    hostname: argocd.reddycloud.com
    paths: /
    pathType: Prefix
    https: false
</code></pre>
<p>ArgoCD runs in <code>--insecure</code> mode because TLS is already terminated at the NLB. The user always sees HTTPS — ArgoCD just receives plain HTTP from NGINX.</p>
<hr />
<h2>Secrets Management — External Secrets Operator</h2>
<p>No secrets are hardcoded anywhere. The External Secrets Operator pulls from AWS Secrets Manager using IRSA:</p>
<pre><code class="language-yaml"># K8s manifest deployed via ArgoCD
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: postgres-creds
  namespace: eks-demo
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-store
    kind: ClusterSecretStore
  target:
    name: postgres-creds
    creationPolicy: Owner
  data:
    - secretKey: POSTGRES_PASSWORD
      remoteRef:
        key: ecommerce-k8s-demo/postgres
        property: password
    - secretKey: POSTGRES_USER
      remoteRef:
        key: ecommerce-k8s-demo/postgres
        property: username
</code></pre>
<p>The flow:</p>
<pre><code class="language-plaintext">AWS Secrets Manager
  → ExternalSecret CRD (IRSA authenticated)
  → Kubernetes Secret (auto-created, kept in sync)
  → App pod (env var or volume mount)
</code></pre>
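<p>The <code>ClusterSecretStore</code> referenced by name above looks roughly like this (the service-account name is an assumption; it must match the IRSA-annotated account):</p>
<pre><code class="language-yaml">apiVersion: external-secrets.io/v1beta1
kind: ClusterSecretStore
metadata:
  name: aws-secrets-store
spec:
  provider:
    aws:
      service: SecretsManager
      region: us-east-1
      auth:
        jwt:
          serviceAccountRef:
            name: external-secrets
            namespace: external-secrets
</code></pre>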
<hr />
<h2>Observability Stack</h2>
<p>The full observability stack is deployed via ArgoCD:</p>
<table>
<thead>
<tr>
<th>Signal</th>
<th>Collector</th>
<th>Storage</th>
<th>Query</th>
</tr>
</thead>
<tbody><tr>
<td>Metrics</td>
<td>Prometheus (ServiceMonitor scrape)</td>
<td>TSDB on EBS</td>
<td>Grafana PromQL</td>
</tr>
<tr>
<td>Traces</td>
<td>OTel Collector (OTLP gRPC :4317)</td>
<td>Jaeger</td>
<td>Grafana / Jaeger UI</td>
</tr>
<tr>
<td>Logs</td>
<td>Promtail DaemonSet</td>
<td>Loki</td>
<td>Grafana LogQL</td>
</tr>
<tr>
<td>Search</td>
<td>OpenSearch client (direct)</td>
<td>OpenSearch index</td>
<td>OpenSearch Dashboards</td>
</tr>
</tbody></table>
<p>The OTel Collector pipeline:</p>
<pre><code class="language-yaml">receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch:
  resource:

exporters:
  jaeger:
    endpoint: jaeger-collector:14250
  prometheusremotewrite:
    endpoint: http://prometheus:9090/api/v1/write

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, resource]
      exporters: [jaeger]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheusremotewrite]
</code></pre>
<hr />
<h2>Cost Breakdown</h2>
<table>
<thead>
<tr>
<th>Resource</th>
<th>Cost</th>
<th>Optimization</th>
</tr>
</thead>
<tbody><tr>
<td>EKS cluster</td>
<td>~$7.20/mo</td>
<td>Fixed control plane cost</td>
</tr>
<tr>
<td>SPOT t3.small × 5</td>
<td>~$14/mo</td>
<td>~60% vs on-demand</td>
</tr>
<tr>
<td>NAT Gateway</td>
<td>~$5/mo</td>
<td>Single AZ vs per-AZ</td>
</tr>
<tr>
<td>NLB</td>
<td>~$16/mo</td>
<td>One NLB for everything</td>
</tr>
<tr>
<td>EBS volumes</td>
<td>~$3/mo</td>
<td>gp3 storage class</td>
</tr>
<tr>
<td>Route53</td>
<td>~$0.50/mo</td>
<td>Hosted zone</td>
</tr>
<tr>
<td><strong>Total</strong></td>
<td><strong>~$46/mo</strong></td>
<td>vs ~$200+ on-demand multi-AZ</td>
</tr>
</tbody></table>
<hr />
<h2>Key Takeaways</h2>
<p><strong>Zero static credentials</strong> — GitHub Actions OIDC means no AWS keys ever touch GitHub secrets. IRSA means no AWS keys ever touch EKS nodes.</p>
<p><strong>Destroy-safe architecture</strong> — EIPs and ECR in separate workspaces means the cluster can be completely torn down and rebuilt without updating DNS or rebuilding images.</p>
<p><strong>Single ACM cert covers everything</strong> — One wildcard cert on the NLB eliminates cert-manager, Let's Encrypt rate limits, and per-app TLS configuration.</p>
<p><strong>Cost matters</strong> — SPOT instances, a single NAT Gateway, NGINX instead of a per-ingress ALB, and in-cluster pods instead of managed services. The same production patterns at a fraction of the cost.</p>
<hr />
<p><em>Source code: github.com/rajreddy/ecommerce-k8s-demo · Interactive architecture: codepen.io/qckuhtdx-the-scripter/pen/myrLwxP · Domain: reddycloud.com</em></p>
]]></content:encoded></item></channel></rss>