<?xml version='1.0' encoding='UTF-8'?>
<rss xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/" version="2.0">
  <channel>
    <title>Rich Gibbs</title>
    <link>https://blog.richgibbs.dev/feed.xml</link>
    <description>Practical, opinionated notes on Linux server security, AWS hygiene, and indie-founder operations from Rich Gibbs.</description>
    <atom:link href="https://blog.richgibbs.dev/feed.xml" rel="self"/>
    <docs>http://www.rssboard.org/rss-specification</docs>
    <generator>python-feedgen</generator>
    <language>en</language>
    <lastBuildDate>Sun, 31 May 2026 17:11:00 +0000</lastBuildDate>
    <item>
      <title>Ubuntu/Debian EC2 hardening checklist (2026)</title>
      <link>https://blog.richgibbs.dev/ubuntu-debian-ec2-hardening-checklist-2026/</link>
      <description>A practical 2026 hardening checklist for Ubuntu and Debian EC2 instances: SSH, UFW, IMDSv2, updates, logging, backups, and Docker basics.</description>
      <content:encoded>&lt;p&gt;You spun up an EC2 instance, pointed a domain at it, and now real traffic — and real bots — can reach it. Most &amp;ldquo;hardening guides&amp;rdquo; online are either copy-paste cargo cult from 2014 or vendor whitepapers selling a SIEM. This is the version I actually run on Ubuntu 22.04, Ubuntu 24.04, and Debian 12 boxes, written for solo founders and small teams who don&amp;rsquo;t have a dedicated security person.&lt;/p&gt;
&lt;p&gt;Work through it top to bottom on a fresh box. On an existing box, treat it as a diff: read each section, run the audit command, fix the gap, move on.&lt;/p&gt;
&lt;h2 id="why-this-checklist"&gt;Why this checklist&lt;/h2&gt;
&lt;p&gt;The threats most small EC2 fleets actually get hit by aren&amp;rsquo;t APTs. They&amp;rsquo;re:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;SSH brute force from random botnets&lt;/li&gt;
&lt;li&gt;Exposed services you forgot were listening (Redis, Postgres, Docker API, an old admin panel)&lt;/li&gt;
&lt;li&gt;Stolen IAM credentials via SSRF on a misconfigured app reaching the EC2 metadata service&lt;/li&gt;
&lt;li&gt;An unpatched kernel or library with a known CVE&lt;/li&gt;
&lt;li&gt;A compromised dependency or container image that opens a reverse shell&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Everything below is aimed at those concrete risks. There&amp;rsquo;s no checklist on earth that makes you &amp;ldquo;secure&amp;rdquo; — but a tight baseline closes the cheap, automated attack paths so an attacker has to actually work.&lt;/p&gt;
&lt;h2 id="threat-model-assumptions"&gt;Threat model assumptions&lt;/h2&gt;
&lt;p&gt;Before any commands, make these explicit:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;This is a single-tenant Linux server (or small fleet) on AWS EC2.&lt;/li&gt;
&lt;li&gt;You are the only admin, or there&amp;rsquo;s a tiny ops team with shared SSH keys.&lt;/li&gt;
&lt;li&gt;The instance runs a public-facing web app and/or some background workers.&lt;/li&gt;
&lt;li&gt;You&amp;rsquo;re not in a regulated environment yet (PCI/HIPAA/SOC 2 controls are &lt;em&gt;not&lt;/em&gt; what this checklist gives you).&lt;/li&gt;
&lt;li&gt;You can tolerate a few minutes of downtime to reboot for kernel updates.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If any of those don&amp;rsquo;t match, adjust before applying.&lt;/p&gt;
&lt;h2 id="1-ssh"&gt;1. SSH&lt;/h2&gt;
&lt;p&gt;SSH is still the single biggest &amp;ldquo;front door&amp;rdquo; on a Linux server.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Use keys, not passwords. Disable root login. Limit who can log in.&lt;/strong&gt;&lt;/p&gt;
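&lt;p&gt;Edit &lt;code&gt;/etc/ssh/sshd_config&lt;/code&gt;, or drop a file in &lt;code&gt;/etc/ssh/sshd_config.d/&lt;/code&gt; (supported on Ubuntu 22.04+ and Debian 12). sshd keeps the first value it reads for each keyword, so a drop-in only takes effect if no earlier-sorting file in that directory already sets the same option:&lt;/p&gt;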
&lt;pre class="codehilite"&gt;&lt;code&gt;PermitRootLogin no
PasswordAuthentication no
KbdInteractiveAuthentication no
ChallengeResponseAuthentication no
PubkeyAuthentication yes
PermitEmptyPasswords no
X11Forwarding no
MaxAuthTries 3
LoginGraceTime 20
ClientAliveInterval 300
ClientAliveCountMax 2
AllowUsers ubuntu deploy
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Replace &lt;code&gt;ubuntu deploy&lt;/code&gt; with the actual non-root accounts you use. Then validate and reload:&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;sudo sshd -t
sudo systemctl reload ssh
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Optional but worth it on small boxes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Move SSH off port 22. It doesn&amp;rsquo;t stop a determined attacker, but it cuts most of the log noise from internet-wide scanners. If you do this, update the EC2 security group and any host firewall rule to match the new port.&lt;/li&gt;
&lt;li&gt;Restrict the SSH security group to your office/VPN IP, your home IP, or a bastion. &lt;code&gt;0.0.0.0/0&lt;/code&gt; on port 22 is a choice, not a default.&lt;/li&gt;
&lt;li&gt;Install &lt;code&gt;fail2ban&lt;/code&gt; for cheap brute-force throttling:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;sudo apt-get update &amp;amp;&amp;amp; sudo apt-get install -y fail2ban
sudo systemctl enable --now fail2ban
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Audit:&lt;/strong&gt;&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;sudo sshd -T | grep -Ei 'permitrootlogin|passwordauth|pubkeyauth|allowusers|port'
&lt;/code&gt;&lt;/pre&gt;
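&lt;p&gt;To turn that audit into a pass/fail check, one option is a small filter over the &lt;code&gt;sshd -T&lt;/code&gt; output. This is a minimal sketch rather than part of the checklist itself: &lt;code&gt;check_sshd&lt;/code&gt; is a hypothetical helper name, and the expected values simply mirror the config above.&lt;/p&gt;

```shell
# check_sshd (hypothetical helper): reads "sshd -T" output on stdin and
# prints one FINDING line for each setting that differs from the baseline.
check_sshd() {
  awk '
    BEGIN {
      want["permitrootlogin"] = "no"
      want["passwordauthentication"] = "no"
      want["kbdinteractiveauthentication"] = "no"
      want["pubkeyauthentication"] = "yes"
    }
    ($1 in want) {
      seen[$1] = 1
      if ($2 != want[$1]) print "FINDING: " $1 " is " $2 ", expected " want[$1]
    }
    END {
      # A keyword that never appeared is also worth a look.
      for (k in want) if (!(k in seen)) print "FINDING: " k " not reported by sshd -T"
    }
  '
}
```

&lt;p&gt;Run it as &lt;code&gt;sudo sshd -T | check_sshd&lt;/code&gt;. No output means the baseline holds.&lt;/p&gt;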

&lt;h2 id="2-firewall-and-listeners"&gt;2. Firewall and listeners&lt;/h2&gt;
&lt;p&gt;The cheapest mistake on EC2 is a service binding to &lt;code&gt;0.0.0.0&lt;/code&gt; that you thought was on &lt;code&gt;127.0.0.1&lt;/code&gt;. Defense in depth: lock it down at the OS &lt;em&gt;and&lt;/em&gt; at the security group.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;See what&amp;rsquo;s actually listening:&lt;/strong&gt;&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;sudo ss -tulpn
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Anything bound to &lt;code&gt;0.0.0.0&lt;/code&gt; or &lt;code&gt;::&lt;/code&gt; that isn&amp;rsquo;t your web server, SSH, or something you explicitly want public is a finding. Common offenders: Redis (6379), Postgres (5432), MySQL (3306), Docker API (2375/2376), Elasticsearch (9200), Memcached (11211), &lt;code&gt;node&lt;/code&gt; dev servers.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Bind to localhost&lt;/strong&gt; in the service config (e.g. &lt;code&gt;bind 127.0.0.1&lt;/code&gt; in &lt;code&gt;/etc/redis/redis.conf&lt;/code&gt;, &lt;code&gt;listen_addresses = 'localhost'&lt;/code&gt; in &lt;code&gt;postgresql.conf&lt;/code&gt;).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Then layer UFW&lt;/strong&gt; on top:&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;sudo apt-get install -y ufw
sudo ufw default deny incoming
sudo ufw default allow outgoing
sudo ufw allow OpenSSH
sudo ufw allow 80/tcp
sudo ufw allow 443/tcp
sudo ufw enable
sudo ufw status verbose
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;strong&gt;On the AWS side&lt;/strong&gt;, the security group is your real perimeter. Rules of thumb:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;One SG per role (web, db, worker), not one giant SG that allows everything internally.&lt;/li&gt;
&lt;li&gt;DB and cache SGs accept traffic &lt;em&gt;only&lt;/em&gt; from the app SG, never from &lt;code&gt;0.0.0.0/0&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;SSH SG limited to known IPs or a bastion/VPN SG.&lt;/li&gt;
&lt;li&gt;No &lt;code&gt;0.0.0.0/0&lt;/code&gt; on anything except 80/443 on the public web tier.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Audit:&lt;/strong&gt;&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;# wildcard binds show up as 0.0.0.0, [::], or * (dual-stack sockets)
sudo ss -tulpn | awk '$5 ~ /^(0\.0\.0\.0|\[::\]|\*):/'
sudo ufw status numbered
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Cross-check the AWS console / CLI:&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;aws ec2 describe-security-groups \
  --query 'SecurityGroups[].{Name:GroupName,Ingress:IpPermissions}' \
  --output json
&lt;/code&gt;&lt;/pre&gt;
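&lt;p&gt;The listener audit gets more useful when it only prints ports you did not expect. A minimal sketch, assuming the column layout of &lt;code&gt;ss -tulpn&lt;/code&gt; (local address in field 5, process in field 7); &lt;code&gt;public_listeners&lt;/code&gt; is a hypothetical helper name and the allowlist is whatever you pass as arguments.&lt;/p&gt;

```shell
# public_listeners (hypothetical helper): reads "ss -tulpn" output on stdin and
# prints port, local address, and process for every socket bound to all
# interfaces whose port is not in the allowlist passed as arguments.
public_listeners() {
  awk -v allowed=" $* " '
    $5 ~ /^(0\.0\.0\.0|\[::\]|\*):[0-9]+$/ {
      n = split($5, parts, ":")   # the port is always the last piece
      port = parts[n]
      if (index(allowed, " " port " ") == 0) print port, $5, $7
    }
  '
}
```

&lt;p&gt;For a web box, &lt;code&gt;sudo ss -tulpn | public_listeners 22 80 443&lt;/code&gt; should print nothing. Every line it does print is a finding to explain or fix.&lt;/p&gt;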

&lt;h2 id="3-os-updates-and-reboots"&gt;3. OS updates and reboots&lt;/h2&gt;
&lt;p&gt;Unpatched kernels and OpenSSL/libc libraries are the most boring and most common way servers get owned.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Enable unattended security upgrades:&lt;/strong&gt;&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;sudo apt-get install -y unattended-upgrades apt-listchanges
sudo dpkg-reconfigure -plow unattended-upgrades
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Check &lt;code&gt;/etc/apt/apt.conf.d/50unattended-upgrades&lt;/code&gt; includes the security pocket and that &lt;code&gt;Unattended-Upgrade::Automatic-Reboot&lt;/code&gt; is set deliberately. On a single box with a real user, automatic reboots at 3am can be fine; on production-critical workers, prefer notification + manual.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Patch now:&lt;/strong&gt;&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;sudo apt-get update
sudo apt-get -y dist-upgrade
sudo apt-get -y autoremove --purge
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;strong&gt;Detect a needed reboot:&lt;/strong&gt;&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;[ -f /var/run/reboot-required ] &amp;amp;&amp;amp; cat /var/run/reboot-required.pkgs
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;If the kernel was updated, schedule a reboot. Live-patching (Ubuntu Pro / Livepatch) is great if you&amp;rsquo;re paying for it, but it doesn&amp;rsquo;t cover everything — you&amp;rsquo;ll still need occasional reboots.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Audit:&lt;/strong&gt;&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;apt list --upgradable 2&amp;gt;/dev/null
uname -r
&lt;/code&gt;&lt;/pre&gt;
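&lt;p&gt;If the upgradable list is long, it helps to separate security fixes from routine updates. A minimal sketch, assuming the usual &lt;code&gt;package/pocket version arch&lt;/code&gt; line format of &lt;code&gt;apt list --upgradable&lt;/code&gt;; &lt;code&gt;pending_security&lt;/code&gt; is a hypothetical helper name.&lt;/p&gt;

```shell
# pending_security (hypothetical helper): reads "apt list --upgradable" output
# on stdin and prints the names of packages offered from a -security pocket
# (for example jammy-security or stable-security).
pending_security() {
  awk -F'[/ ]' '$2 ~ /-security/ { print $1 }'
}
```

&lt;p&gt;Usage: &lt;code&gt;apt list --upgradable | pending_security&lt;/code&gt;. An empty result means no pending update comes from a security pocket; it says nothing about whether a reboot is still outstanding.&lt;/p&gt;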

&lt;h2 id="4-admin-surface"&gt;4. Admin surface&lt;/h2&gt;
&lt;p&gt;Every account that can &lt;code&gt;sudo&lt;/code&gt; is part of your admin surface. Trim it.&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;getent group sudo
getent group adm
awk -F: '($3 == 0) {print}' /etc/passwd   # any extra UID 0 account is a finding
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Rules:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;One sudo user per real human, no shared logins where avoidable.&lt;/li&gt;
&lt;li&gt;Service accounts (&lt;code&gt;www-data&lt;/code&gt;, &lt;code&gt;postgres&lt;/code&gt;) should not have a login shell or sudo. Use &lt;code&gt;usermod -s /usr/sbin/nologin &amp;lt;user&amp;gt;&lt;/code&gt; if needed. A &lt;code&gt;deploy&lt;/code&gt; account that logs in over SSH is the exception for the shell; it still should not get blanket sudo.&lt;/li&gt;
&lt;li&gt;Rotate or remove SSH keys when someone leaves the team. &lt;code&gt;~/.ssh/authorized_keys&lt;/code&gt; for every login user is your source of truth — review it.&lt;/li&gt;
&lt;li&gt;Disable cloud-init&amp;rsquo;s default password if any (&lt;code&gt;cloud-init&lt;/code&gt; shouldn&amp;rsquo;t set one on official AMIs, but check).&lt;/li&gt;
&lt;li&gt;If you must allow &lt;code&gt;sudo&lt;/code&gt; without a password for automation, scope it to specific commands in &lt;code&gt;/etc/sudoers.d/&lt;/code&gt;, not blanket &lt;code&gt;NOPASSWD: ALL&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Audit:&lt;/strong&gt;&lt;/p&gt;
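&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;# Read the real home directory (field 6) so root and non-/home accounts are covered
awk -F: '$7 ~ /sh$/ {print $1, $6}' /etc/passwd | while read -r u home; do
  echo &amp;quot;== $u ==&amp;quot;; sudo cat &amp;quot;$home/.ssh/authorized_keys&amp;quot; 2&amp;gt;/dev/null
done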
&lt;/code&gt;&lt;/pre&gt;
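&lt;p&gt;The last rule above (no blanket &lt;code&gt;NOPASSWD: ALL&lt;/code&gt;) is easy to check mechanically. A minimal sketch that scans sudoers-format text; &lt;code&gt;blanket_nopasswd&lt;/code&gt; is a hypothetical helper name, and it only catches the simple one-line form of the rule.&lt;/p&gt;

```shell
# blanket_nopasswd (hypothetical helper): reads sudoers-format text on stdin
# and prints every non-comment rule that grants NOPASSWD on ALL commands.
blanket_nopasswd() {
  awk '
    /^[[:space:]]*#/ { next }                           # skip comments and #include lines
    /NOPASSWD:[[:space:]]*ALL[[:space:]]*$/ { print }   # rule ends in NOPASSWD: ALL
  '
}
```

&lt;p&gt;Usage: &lt;code&gt;sudo cat /etc/sudoers /etc/sudoers.d/* | blanket_nopasswd&lt;/code&gt;. Cloud images often ship a rule like this for the default login user, so expect at least one hit on a fresh instance and decide whether to keep it.&lt;/p&gt;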

&lt;h2 id="5-ec2-metadata-service-imdsv2"&gt;5. EC2 metadata service (IMDSv2)&lt;/h2&gt;
&lt;p&gt;This one is non-negotiable in 2026. The EC2 instance metadata service hands out IAM role credentials. With IMDSv1 enabled, any server-side request forgery (SSRF) bug in your app can pop those credentials and walk into your AWS account.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Force IMDSv2 only&lt;/strong&gt;, with a low hop limit. First confirm from the instance that token-based (v2) requests work:&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;TOKEN=$(curl -sX PUT &amp;quot;http://169.254.169.254/latest/api/token&amp;quot; \
  -H &amp;quot;X-aws-ec2-metadata-token-ttl-seconds: 60&amp;quot;)
curl -sH &amp;quot;X-aws-ec2-metadata-token: $TOKEN&amp;quot; \
  http://169.254.169.254/latest/meta-data/instance-id
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;If that works but the same call without a token also works, IMDSv1 is still enabled. For old AMIs, containers, or ASGs you cannot blindly rotate, follow the &lt;a href="/aws-imdsv2-migration-without-breaking-things/"&gt;IMDSv2 migration sequence&lt;/a&gt; before making it mandatory.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Enforce v2 on the instance&lt;/strong&gt; (run from your laptop with the AWS CLI):&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;aws ec2 modify-instance-metadata-options \
  --instance-id i-xxxxxxxxxxxxxxxxx \
  --http-tokens required \
  --http-endpoint enabled \
  --http-put-response-hop-limit 1
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;code&gt;hop-limit 1&lt;/code&gt; means a container or proxy can&amp;rsquo;t trivially relay a request to the metadata service. If you run Docker with bridge networking, you may need &lt;code&gt;2&lt;/code&gt; — but start at &lt;code&gt;1&lt;/code&gt;, raise only if needed, and never to &lt;code&gt;64&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Also: the IAM role attached to the instance should be &lt;strong&gt;least privilege&lt;/strong&gt;. &amp;ldquo;Read this one S3 bucket and write to this one log group&amp;rdquo; beats &lt;code&gt;AdministratorAccess&lt;/code&gt; every time.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Audit:&lt;/strong&gt;&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;aws ec2 describe-instances \
  --query 'Reservations[].Instances[].[InstanceId,MetadataOptions.HttpTokens,MetadataOptions.HttpPutResponseHopLimit]' \
  --output table
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Anything where &lt;code&gt;HttpTokens&lt;/code&gt; is not &lt;code&gt;required&lt;/code&gt; is a finding.&lt;/p&gt;
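&lt;p&gt;With more than a handful of instances, the table is easier to act on if only the offenders are printed. A minimal sketch, assuming you run the same query with &lt;code&gt;--output text&lt;/code&gt; so each row is instance ID, &lt;code&gt;HttpTokens&lt;/code&gt;, and hop limit; &lt;code&gt;imds_findings&lt;/code&gt; is a hypothetical helper name.&lt;/p&gt;

```shell
# imds_findings (hypothetical helper): reads rows of InstanceId, HttpTokens and
# HopLimit (tab- or space-separated) on stdin and prints every instance that
# still accepts IMDSv1, meaning HttpTokens is anything other than "required".
imds_findings() {
  awk 'NF == 3 {
    if ($2 != "required")
      print $1 " allows IMDSv1 (HttpTokens=" $2 ", hop limit " $3 ")"
  }'
}
```

&lt;p&gt;Pipe the text-format audit output into &lt;code&gt;imds_findings&lt;/code&gt;. Each printed instance is a candidate for the &lt;code&gt;modify-instance-metadata-options&lt;/code&gt; call shown earlier.&lt;/p&gt;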
&lt;hr&gt;
&lt;h2 id="mid-article-cta"&gt;Mid-article CTA&lt;/h2&gt;
&lt;p&gt;If you&amp;rsquo;d rather have someone else go through this list on your servers and hand you back a clear report, that&amp;rsquo;s exactly what &lt;strong&gt;&lt;a href="https://richgibbs.dev/quickcheck/"&gt;Tuck Sentinel QuickCheck&lt;/a&gt;&lt;/strong&gt; does: a one-shot, read-only audit of a single Linux box with prioritized findings and copy-pasteable fixes. You can see what the output looks like in this &lt;strong&gt;&lt;a href="https://richgibbs.dev/quickcheck/sample-report/"&gt;sample report&lt;/a&gt;&lt;/strong&gt; before deciding.&lt;/p&gt;
&lt;p&gt;Back to the checklist.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="6-logging-and-time-sync"&gt;6. Logging and time sync&lt;/h2&gt;
&lt;p&gt;You can&amp;rsquo;t investigate what you didn&amp;rsquo;t record, and you can&amp;rsquo;t correlate logs that disagree on what time it is.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Time sync.&lt;/strong&gt; Ubuntu 22.04+ and Debian 12 ship &lt;code&gt;systemd-timesyncd&lt;/code&gt; or &lt;code&gt;chrony&lt;/code&gt;. Either is fine, just make sure one is running:&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;timedatectl
# or
chronyc tracking
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;If you&amp;rsquo;re on AWS, the local time source &lt;code&gt;169.254.169.123&lt;/code&gt; is reliable and low-latency. &lt;code&gt;chrony&lt;/code&gt; config example:&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code&gt;server 169.254.169.123 prefer iburst minpoll 4 maxpoll 4
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;strong&gt;Logging.&lt;/strong&gt; &lt;code&gt;journald&lt;/code&gt; is the default. A few sane settings in &lt;code&gt;/etc/systemd/journald.conf&lt;/code&gt;:&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code&gt;Storage=persistent
SystemMaxUse=1G
SystemMaxFileSize=128M
ForwardToSyslog=no
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Then:&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;sudo systemctl restart systemd-journald
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;For anything beyond a single box, ship logs off the instance: CloudWatch Logs, a Loki/Grafana stack, or any hosted log service. The reason isn&amp;rsquo;t compliance; it&amp;rsquo;s that one of the first things an attacker tries is &lt;code&gt;rm /var/log/*&lt;/code&gt; and &lt;code&gt;journalctl --rotate --vacuum-time=1s&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Auditd&lt;/strong&gt; is worth installing if you want a record of which user ran which command:&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;sudo apt-get install -y auditd
sudo systemctl enable --now auditd
&lt;/code&gt;&lt;/pre&gt;

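&lt;p&gt;You don&amp;rsquo;t need elaborate rules to start. Out of the box, auditd mostly records login and authentication events; per-command logging needs an explicit &lt;code&gt;execve&lt;/code&gt; rule (for example &lt;code&gt;-a always,exit -F arch=b64 -S execve&lt;/code&gt; in a file under &lt;code&gt;/etc/audit/rules.d/&lt;/code&gt;). Either way, shipping &lt;code&gt;/var/log/audit/audit.log&lt;/code&gt; off-box is already a huge upgrade.&lt;/p&gt;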
&lt;p&gt;&lt;strong&gt;Audit:&lt;/strong&gt;&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;journalctl --disk-usage
timedatectl | grep 'System clock synchronized'
&lt;/code&gt;&lt;/pre&gt;

&lt;h2 id="7-backups-and-restore-drills"&gt;7. Backups and restore drills&lt;/h2&gt;
&lt;p&gt;A backup you&amp;rsquo;ve never restored is a wish, not a backup.&lt;/p&gt;
&lt;p&gt;For a small EC2 setup:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Use &lt;strong&gt;AWS Backup&lt;/strong&gt; or scheduled &lt;strong&gt;EBS snapshots&lt;/strong&gt; for the volume(s).&lt;/li&gt;
&lt;li&gt;For databases, also take &lt;strong&gt;logical&lt;/strong&gt; backups (&lt;code&gt;pg_dump&lt;/code&gt;, &lt;code&gt;mysqldump&lt;/code&gt;) on a schedule and copy them to S3 with versioning + lifecycle to Glacier.&lt;/li&gt;
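&lt;li&gt;Encrypt at rest (EBS encryption + S3 SSE-KMS). EBS encryption by default is an opt-in, per-region account setting, so check it in every region you use with &lt;code&gt;aws ec2 get-ebs-encryption-by-default&lt;/code&gt; and turn it on where it is off.&lt;/li&gt;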
&lt;li&gt;Keep at least one backup copy in a &lt;strong&gt;different AWS account or region&lt;/strong&gt;. Ransomware-style attackers will delete in-region snapshots if they get the chance.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Restore drill&lt;/strong&gt; — once a quarter, on a throwaway instance:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Pick a recent snapshot/dump.&lt;/li&gt;
&lt;li&gt;Spin up a new instance/volume from it.&lt;/li&gt;
&lt;li&gt;Verify the app starts and recent data is present.&lt;/li&gt;
&lt;li&gt;Time how long it took. That&amp;rsquo;s your real RTO.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;If you&amp;rsquo;ve never done step 4, you don&amp;rsquo;t know your RTO; you have a hope.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Audit:&lt;/strong&gt;&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;aws ec2 describe-snapshots --owner-ids self \
  --query 'Snapshots[?StartTime&amp;gt;=`2026-01-01`].[SnapshotId,StartTime,VolumeSize,Description]' \
  --output table
&lt;/code&gt;&lt;/pre&gt;
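&lt;p&gt;The question that matters most in that table is when the last snapshot was taken. A minimal sketch, assuming you query &lt;code&gt;'Snapshots[].[SnapshotId,StartTime]'&lt;/code&gt; with &lt;code&gt;--output text&lt;/code&gt;; &lt;code&gt;latest_snapshot&lt;/code&gt; is a hypothetical helper name.&lt;/p&gt;

```shell
# latest_snapshot (hypothetical helper): reads rows of SnapshotId and StartTime
# on stdin and prints the most recent row. ISO 8601 timestamps in the same
# timezone sort correctly as plain strings, so no date parsing is needed.
latest_snapshot() {
  sort -k2,2 | tail -n 1
}
```

&lt;p&gt;If the timestamp it prints is older than your backup schedule says it should be, the schedule is not running, whatever the console claims.&lt;/p&gt;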

&lt;h2 id="8-docker-basics-if-applicable"&gt;8. Docker basics (if applicable)&lt;/h2&gt;
&lt;p&gt;If you don&amp;rsquo;t run Docker on the box, skip this. If you do, the most common foot-guns:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Don&amp;rsquo;t expose the Docker daemon over TCP.&lt;/strong&gt; &lt;code&gt;2375&lt;/code&gt; unauthenticated is root-on-box for anyone who can reach it. Use the local socket (&lt;code&gt;/var/run/docker.sock&lt;/code&gt;) and SSH for remote control.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Mind the &lt;code&gt;-p&lt;/code&gt; flag.&lt;/strong&gt; &lt;code&gt;-p 5432:5432&lt;/code&gt; binds to &lt;code&gt;0.0.0.0&lt;/code&gt; and bypasses UFW on most Docker setups (Docker writes its own iptables rules). If you only need the port locally, use &lt;code&gt;-p 127.0.0.1:5432:5432&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Run containers as non-root&lt;/strong&gt; where possible. &lt;code&gt;USER&lt;/code&gt; directive in your Dockerfile, or &lt;code&gt;--user 1000:1000&lt;/code&gt; at runtime.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Pin base images&lt;/strong&gt; to a digest (&lt;code&gt;FROM ubuntu:24.04@sha256:...&lt;/code&gt;) for production, and rebuild on a schedule to pick up CVE fixes.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Don&amp;rsquo;t bind-mount the Docker socket into containers&lt;/strong&gt; unless you fully understand that&amp;rsquo;s equivalent to giving that container root on the host.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Set &lt;code&gt;--read-only&lt;/code&gt; and &lt;code&gt;--cap-drop=ALL&lt;/code&gt;&lt;/strong&gt; for containers that don&amp;rsquo;t need to write to their filesystem or hold extra capabilities; add back only what&amp;rsquo;s needed.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A useful audit one-liner:&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;docker ps --format '{{.Names}} {{.Ports}}' | grep -E '0\.0\.0\.0|:::|\[::\]'
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Anything in that list is reachable from the public internet (modulo the security group). Decide if that&amp;rsquo;s intentional.&lt;/p&gt;
&lt;p&gt;For containerd/k8s setups this barely scratches the surface, but on a single EC2 box running a few containers, those bullets close most of the cheap holes.&lt;/p&gt;
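&lt;p&gt;For a recurring check it is handy to get just the container names that publish on every interface. A minimal sketch, assuming input in the &lt;code&gt;{{.Names}} {{.Ports}}&lt;/code&gt; format used in the one-liner above; &lt;code&gt;exposed_containers&lt;/code&gt; is a hypothetical helper name.&lt;/p&gt;

```shell
# exposed_containers (hypothetical helper): reads lines of container name
# followed by its Ports column on stdin and prints the names of containers
# that publish at least one port on all interfaces (IPv4 or IPv6 wildcard).
exposed_containers() {
  awk '/(0\.0\.0\.0:|:::|\[::\]:)[0-9]/ { print $1 }'
}
```

&lt;p&gt;Usage: &lt;code&gt;docker ps --format '{{.Names}} {{.Ports}}' | exposed_containers&lt;/code&gt;. Containers published only on &lt;code&gt;127.0.0.1&lt;/code&gt;, or with no published ports, are not listed.&lt;/p&gt;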
&lt;h2 id="what-this-is-not"&gt;What this is not&lt;/h2&gt;
&lt;p&gt;Be honest with yourself about what a checklist like this does and doesn&amp;rsquo;t do.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;It is not a penetration test.&lt;/strong&gt; Nobody is exploiting your application logic, your auth flows, or your business rules here. A pentest is a different (and more expensive) thing.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;It is not compliance.&lt;/strong&gt; SOC 2, HIPAA, PCI, ISO 27001 all require documented policies, evidence collection, access reviews, vendor management, and a lot more. A hardened box is &lt;em&gt;part&lt;/em&gt; of that, not a substitute.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;It is not a guarantee.&lt;/strong&gt; New CVEs ship every week. Your application code changes. Someone leaks a key on GitHub. Hardening is a continuous practice, not a one-time event.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;It is not opinionated about your app stack.&lt;/strong&gt; TLS configuration, WAF rules, secrets management, dependency scanning, CI/CD security — all out of scope here.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;What it &lt;em&gt;does&lt;/em&gt; do: dramatically reduce the set of &amp;ldquo;stupid ways your server gets owned by a bot at 3am&amp;rdquo; and give you a baseline you can re-run on every new instance.&lt;/p&gt;
&lt;h2 id="end-article-cta"&gt;End-article CTA&lt;/h2&gt;
&lt;p&gt;If you got this far and want to skip the manual audit, that&amp;rsquo;s exactly what I built &lt;strong&gt;&lt;a href="https://richgibbs.dev/quickcheck/"&gt;Tuck Sentinel QuickCheck&lt;/a&gt;&lt;/strong&gt; for: a single-instance, read-only Linux audit that runs the kind of checks above and produces a prioritized report with concrete fixes — no agent left behind, no ongoing access. Take a look at the &lt;strong&gt;&lt;a href="https://richgibbs.dev/quickcheck/sample-report/"&gt;sample report&lt;/a&gt;&lt;/strong&gt; to see exactly what you&amp;rsquo;d get.&lt;/p&gt;
&lt;p&gt;Either way: run the checklist. Future-you will thank present-you.&lt;/p&gt;
&lt;h2 id="about-tuck-sentinel"&gt;About Tuck Sentinel&lt;/h2&gt;
&lt;p&gt;Tuck Sentinel is a small, focused security tooling project from indie operator Rich Gibbs. It produces practical, no-nonsense audits and content for solo founders and small teams running their own Linux infrastructure — the kind of work most SOC platforms ignore because the deal size is too small. Start with QuickCheck if you want a one-shot review of a single server.&lt;/p&gt;
</content:encoded>
      <guid isPermaLink="true">https://blog.richgibbs.dev/ubuntu-debian-ec2-hardening-checklist-2026/</guid>
      <category>ubuntu</category>
      <category>debian</category>
      <category>ec2</category>
      <category>hardening</category>
      <category>security</category>
      <category>devops</category>
      <category>sysadmin</category>
      <category>aws</category>
      <pubDate>Sun, 10 May 2026 00:20:00 +0000</pubDate>
    </item>
    <item>
      <title>The Indie Founder's VPS Security 101</title>
      <link>https://blog.richgibbs.dev/indie-founder-vps-security-101/</link>
      <description>A practical, no-nonsense guide for solo founders running one Linux VPS. Lock the doors, watch the right things, and skip the security theater.</description>
      <content:encoded>&lt;p&gt;You shipped the thing. It runs on one Linux box at DigitalOcean or Hetzner or wherever. Customers are starting to show up, and somewhere in the back of your head a little voice is asking: &lt;em&gt;is this thing actually safe?&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;This guide is for that voice.&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s written for solo founders and very small teams who are not security professionals but can copy a command into a terminal. The goal is &amp;ldquo;secure enough that you can sleep&amp;rdquo; — not &amp;ldquo;audit-grade fortress.&amp;rdquo; Those are different jobs, and treating one like the other is how you waste a weekend installing seven intrusion detection tools and shipping nothing for a month.&lt;/p&gt;
&lt;h2 id="what-secure-enough-looks-like-for-one-box"&gt;What &amp;ldquo;secure enough&amp;rdquo; looks like for one box&lt;/h2&gt;
&lt;p&gt;For a single VPS running your SaaS, &amp;ldquo;secure enough&amp;rdquo; is a short list:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Nobody can log in as root from the internet.&lt;/li&gt;
&lt;li&gt;Logging in requires a key you have, not a password someone could guess.&lt;/li&gt;
&lt;li&gt;Only the ports you actually use are open.&lt;/li&gt;
&lt;li&gt;The OS gets security patches automatically.&lt;/li&gt;
&lt;li&gt;You&amp;rsquo;d notice if something obviously bad started happening.&lt;/li&gt;
&lt;li&gt;If the disk caught fire tomorrow, you could rebuild from a backup before the end of the day.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;That&amp;rsquo;s the whole bar. Everything else is optimization. Hit those six and you&amp;rsquo;ve already done more than the majority of small-team production servers I&amp;rsquo;ve seen.&lt;/p&gt;
&lt;h2 id="first-day-setup"&gt;First-day setup&lt;/h2&gt;
&lt;p&gt;Do these once, when the server is fresh. They take about twenty minutes.&lt;/p&gt;
&lt;h3 id="1-create-a-non-root-user-with-sudo"&gt;1. Create a non-root user with sudo&lt;/h3&gt;
&lt;p&gt;Logging in as root is a footgun. One typo and you&amp;rsquo;ve nuked the box. Make a normal user instead.&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;# As root, on a fresh server
adduser deploy
usermod -aG sudo deploy
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Pick a real password for &lt;code&gt;deploy&lt;/code&gt; even though you&amp;rsquo;ll be using SSH keys — you&amp;rsquo;ll need it for &lt;code&gt;sudo&lt;/code&gt; prompts.&lt;/p&gt;
&lt;h3 id="2-set-up-ssh-keys-and-disable-password-login"&gt;2. Set up SSH keys and disable password login&lt;/h3&gt;
&lt;p&gt;On your laptop, if you don&amp;rsquo;t already have a key:&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;ssh-keygen -t ed25519 -C &amp;quot;you@laptop&amp;quot;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Copy it to the server:&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;ssh-copy-id deploy@your.server.ip
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Now log in as &lt;code&gt;deploy&lt;/code&gt; and confirm &lt;code&gt;sudo&lt;/code&gt; works:&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;ssh deploy@your.server.ip
sudo whoami   # should print: root
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Once you&amp;rsquo;re sure key login works, lock down SSH. Edit &lt;code&gt;/etc/ssh/sshd_config&lt;/code&gt; (or drop a file in &lt;code&gt;/etc/ssh/sshd_config.d/&lt;/code&gt;):&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;sudo tee /etc/ssh/sshd_config.d/99-hardening.conf &amp;gt;/dev/null &amp;lt;&amp;lt;'EOF'
PermitRootLogin no
PasswordAuthentication no
KbdInteractiveAuthentication no
EOF

sudo sshd -t          # test config — must print nothing
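sudo systemctl reload ssh
sudo sshd -T | grep -i '^passwordauthentication'   # effective value; should print: passwordauthentication no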
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;strong&gt;Do not close your existing SSH session yet.&lt;/strong&gt; Open a second terminal and confirm you can log in fresh. If that works, you&amp;rsquo;re good. If it doesn&amp;rsquo;t, you&amp;rsquo;ve still got the first session to fix things.&lt;/p&gt;
&lt;h3 id="3-turn-on-the-firewall"&gt;3. Turn on the firewall&lt;/h3&gt;
&lt;p&gt;Ubuntu ships with &lt;code&gt;ufw&lt;/code&gt;, which is a friendly wrapper around iptables/nftables. Default-deny inbound, allow only what you need:&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;sudo ufw default deny incoming
sudo ufw default allow outgoing
sudo ufw allow OpenSSH
sudo ufw allow 80/tcp
sudo ufw allow 443/tcp
sudo ufw enable
sudo ufw status verbose
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;If you don&amp;rsquo;t run a web server on this box, drop the 80/443 lines. The rule is simple: open a port only when something on the box actually needs to listen on it.&lt;/p&gt;
&lt;h3 id="4-enable-automatic-security-updates"&gt;4. Enable automatic security updates&lt;/h3&gt;
&lt;p&gt;Most successful attacks are not clever zero-days — they&amp;rsquo;re known bugs in software you forgot to patch. Let the OS patch itself.&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;sudo apt update
sudo apt install -y unattended-upgrades
sudo dpkg-reconfigure -plow unattended-upgrades   # answer &amp;quot;Yes&amp;quot;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Then check &lt;code&gt;/etc/apt/apt.conf.d/50unattended-upgrades&lt;/code&gt; and make sure security updates are uncommented. On Ubuntu the default config already covers &lt;code&gt;${distro_id}:${distro_codename}-security&lt;/code&gt;, which is what you want.&lt;/p&gt;
&lt;p&gt;For peace of mind, let it reboot the box automatically when an update requires it, at a time you choose:&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;sudo tee /etc/apt/apt.conf.d/51auto-reboot &amp;gt;/dev/null &amp;lt;&amp;lt;'EOF'
Unattended-Upgrade::Automatic-Reboot &amp;quot;true&amp;quot;;
Unattended-Upgrade::Automatic-Reboot-Time &amp;quot;04:00&amp;quot;;
EOF
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Pick a time when nobody&amp;rsquo;s using the app. Yes, this means the box reboots itself sometimes. That&amp;rsquo;s fine. Your app should already survive a reboot — and if it doesn&amp;rsquo;t, that&amp;rsquo;s a bigger problem than security.&lt;/p&gt;
&lt;p&gt;That&amp;rsquo;s first-day setup. Non-root sudo user, keys-only SSH, default-deny firewall, automatic patching. You&amp;rsquo;re now ahead of a lot of production servers.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="worried-you-missed-something-on-first-day-setup"&gt;Worried you missed something on first-day setup?&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://richgibbs.dev/quickcheck/"&gt;&lt;strong&gt;Run a free QuickCheck on your server →&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s a read-only scan that flags the boring stuff: SSH still allows passwords, port 22 open to the world, no automatic updates configured, sketchy listening services, and so on. No agent, no signup wall. Here&amp;rsquo;s a &lt;a href="https://richgibbs.dev/quickcheck/sample-report/"&gt;sample report&lt;/a&gt; if you&amp;rsquo;d like to see the format first.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="what-to-actually-monitor"&gt;What to actually monitor&lt;/h2&gt;
&lt;p&gt;You don&amp;rsquo;t need a SIEM. You need a few things you can eyeball once a week (or get a tiny script to email you about). For a single VPS, this short list catches almost everything that matters.&lt;/p&gt;
&lt;h3 id="failed-logins"&gt;Failed logins&lt;/h3&gt;
&lt;p&gt;If somebody is hammering your SSH port, this shows it:&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;sudo journalctl -u ssh --since &amp;quot;24 hours ago&amp;quot; | grep -i &amp;quot;failed\|invalid&amp;quot;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;A handful of attempts per day is internet background noise. Thousands per hour from one IP are worth blocking with &lt;code&gt;ufw&lt;/code&gt;, or a reason to install &lt;code&gt;fail2ban&lt;/code&gt; and let it do the blocking.&lt;/p&gt;
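&lt;p&gt;To see who is behind the noise, the same log can be grouped by source address. A rough sketch, assuming IPv4 sources and the default OpenSSH log wording (&lt;code&gt;... from 1.2.3.4 port ...&lt;/code&gt;):&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;# Count failed or invalid attempts per source IP, busiest first
sudo journalctl -u ssh --since &amp;quot;24 hours ago&amp;quot; \
  | grep -Ei &amp;quot;failed|invalid&amp;quot; \
  | grep -oE &amp;quot;from ([0-9]{1,3}\.){3}[0-9]{1,3}&amp;quot; \
  | awk '{print $2}' | sort | uniq -c | sort -rn | head
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The first column is the attempt count, the second the address. One persistent offender can be dropped with &lt;code&gt;sudo ufw insert 1 deny from ADDRESS&lt;/code&gt;, where &lt;code&gt;ADDRESS&lt;/code&gt; is the IP from the output.&lt;/p&gt;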
&lt;h3 id="listening-ports"&gt;Listening ports&lt;/h3&gt;
&lt;p&gt;What&amp;rsquo;s actually accepting connections on this box? Run this every so often and make sure nothing surprising is there:&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;sudo ss -tulpn
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;You&amp;rsquo;re looking for things bound to &lt;code&gt;0.0.0.0:&lt;/code&gt; or &lt;code&gt;:::&lt;/code&gt;. Anything bound to &lt;code&gt;127.0.0.1&lt;/code&gt; is fine — only your box can talk to it. The classic mistake: running a dev database with &lt;code&gt;bind = 0.0.0.0&lt;/code&gt; and no password. Don&amp;rsquo;t do that.&lt;/p&gt;
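&lt;p&gt;To cut the output down to the sockets that matter, the loopback entries can be filtered out. A small sketch, assuming an iproute2 &lt;code&gt;ss&lt;/code&gt; recent enough to support &lt;code&gt;-H&lt;/code&gt;:&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;# -H drops the header row; column 5 is the local address:port.
# Anything still printed is listening on a non-loopback address.
sudo ss -tulpnH | awk '$5 !~ /^(127\.|\[::1\])/'
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Every line left over should be a service you can name and a port your firewall rules mention.&lt;/p&gt;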
&lt;h3 id="disk-free"&gt;Disk free&lt;/h3&gt;
&lt;p&gt;Servers don&amp;rsquo;t usually die from hackers. They die from full disks at 3 AM.&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;df -h /
du -sh /var/log /var/lib/docker 2&amp;gt;/dev/null
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;If &lt;code&gt;/&lt;/code&gt; is over 80% full, plan on cleaning it up before it hits 100% and your database refuses to write.&lt;/p&gt;
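&lt;p&gt;The 80% rule is easy to turn into a check. A minimal sketch, assuming GNU coreutils (&lt;code&gt;df --output&lt;/code&gt;, &lt;code&gt;sort -h&lt;/code&gt;); the threshold variable is just an illustration:&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;# Warn when the root filesystem is over the threshold
THRESHOLD=80
USED=$(df --output=pcent / | tail -n 1 | tr -dc '0-9')
if [ &amp;quot;$USED&amp;quot; -gt &amp;quot;$THRESHOLD&amp;quot; ]; then
  echo &amp;quot;WARNING: / is ${USED}% full&amp;quot;
  # Largest directories under /var, staying on one filesystem
  sudo du -xh --max-depth=1 /var 2&amp;gt;/dev/null | sort -rh | head -n 5
fi
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;It prints nothing while the disk is healthy, which makes it safe to run from cron.&lt;/p&gt;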
&lt;h3 id="package-updates-available"&gt;Package updates available&lt;/h3&gt;
&lt;p&gt;Even with unattended-upgrades, it&amp;rsquo;s worth a manual sanity check now and then:&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;sudo apt update
apt list --upgradable 2&amp;gt;/dev/null
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;And: is a reboot pending after a kernel update?&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;[ -f /var/run/reboot-required ] &amp;amp;&amp;amp; cat /var/run/reboot-required
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;If yes, schedule one. A patched kernel that hasn&amp;rsquo;t been booted into is just a download.&lt;/p&gt;
&lt;p&gt;You can wire any of these into a weekly cron that emails you a one-page digest. Five lines of bash. Don&amp;rsquo;t overthink it.&lt;/p&gt;
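&lt;p&gt;One way that digest could look. This is a sketch, not a finished tool: the file name &lt;code&gt;/etc/cron.weekly/server-digest&lt;/code&gt; is made up for the example, and cron only emails the output if a mail transfer agent is installed on the box.&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;#!/bin/sh
# Hypothetical /etc/cron.weekly/server-digest: prints the four checks above.
# Scripts in cron.weekly run as root, so no sudo is needed here.
echo &amp;quot;== Failed SSH logins (7 days) ==&amp;quot;
journalctl -u ssh --since &amp;quot;7 days ago&amp;quot; | grep -ci &amp;quot;failed\|invalid&amp;quot;
echo &amp;quot;== Non-loopback listeners ==&amp;quot;
ss -tulpnH | awk '$5 !~ /^(127\.|\[::1\])/'
echo &amp;quot;== Disk ==&amp;quot;
df -h /
echo &amp;quot;== Pending updates and reboot ==&amp;quot;
apt list --upgradable 2&amp;gt;/dev/null | tail -n +2
[ -f /var/run/reboot-required ] &amp;amp;&amp;amp; cat /var/run/reboot-required
exit 0
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Make it executable with &lt;code&gt;chmod +x&lt;/code&gt;. Without a mail setup, redirect the output to a file and read that instead.&lt;/p&gt;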
&lt;h2 id="backups-and-restore-drills"&gt;Backups and restore drills&lt;/h2&gt;
&lt;p&gt;This is the boring section everyone skips. Skip it and you have a hobby project, not a business.&lt;/p&gt;
&lt;p&gt;The minimum viable backup setup for a single VPS:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Database&lt;/strong&gt;: nightly dump (&lt;code&gt;pg_dump&lt;/code&gt;, &lt;code&gt;mysqldump&lt;/code&gt;, or your equivalent), encrypted, sent off-box. To S3, B2, or any object store. Keep at least 7 daily and 4 weekly copies.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;User-uploaded files&lt;/strong&gt;: same deal — sync to object storage on a schedule. &lt;code&gt;restic&lt;/code&gt; and &lt;code&gt;rclone&lt;/code&gt; both work fine.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Config&lt;/strong&gt;: keep it in git. If your &lt;code&gt;nginx.conf&lt;/code&gt; lives only on the server, it&amp;rsquo;s already half-lost.&lt;/li&gt;
&lt;/ul&gt;
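&lt;p&gt;The database bullet could be sketched roughly like this. Everything named here is a placeholder for illustration: the database &lt;code&gt;appdb&lt;/code&gt;, the bucket &lt;code&gt;example-backups&lt;/code&gt;, and the passphrase file path. It assumes Postgres, GnuPG 2.x, and an AWS CLI with write access to the bucket.&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;#!/usr/bin/env bash
# Nightly dump: compressed, encrypted, streamed straight off the box.
# appdb, example-backups, and the passphrase path are placeholders.
set -euo pipefail
STAMP=$(date -u +%Y-%m-%d)
pg_dump appdb \
  | gzip \
  | gpg --batch --pinentry-mode loopback --symmetric \
        --passphrase-file /etc/backup/passphrase \
  | aws s3 cp - &amp;quot;s3://example-backups/db/appdb-${STAMP}.sql.gz.gpg&amp;quot;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Nothing touches local disk, so a full disk cannot silently break the backup. Retention is not handled here; keep the passphrase somewhere that survives losing the server.&lt;/p&gt;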
&lt;p&gt;That&amp;rsquo;s the easy part. Here&amp;rsquo;s the part people skip:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Actually do a restore. From scratch. On a fresh VPS. Once.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Spin up a new box. Pull last night&amp;rsquo;s backup. Restore the database. Boot the app. Did it work? How long did it take? What did you forget? (Spoiler: an environment variable, an SSL cert, a cron job, or a system package.)&lt;/p&gt;
&lt;p&gt;If you&amp;rsquo;ve never done this drill, you don&amp;rsquo;t have backups. You have files you hope will work. There is a meaningful difference, and you really, really don&amp;rsquo;t want to discover it during an outage.&lt;/p&gt;
&lt;p&gt;Re-do the drill at least once a year, or any time you make a big infrastructure change.&lt;/p&gt;
&lt;h2 id="dont-over-do-it"&gt;Don&amp;rsquo;t over-do it&lt;/h2&gt;
&lt;p&gt;There is a tempting path where, in the name of &amp;ldquo;being thorough,&amp;rdquo; you install:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;An intrusion detection system&lt;/li&gt;
&lt;li&gt;A second intrusion detection system in case the first one misses something&lt;/li&gt;
&lt;li&gt;A file integrity monitor&lt;/li&gt;
&lt;li&gt;A custom auditd ruleset you found on a blog&lt;/li&gt;
&lt;li&gt;An EDR agent&lt;/li&gt;
&lt;li&gt;A SIEM forwarder&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;…on a VPS that hosts one Rails app and gets 200 visitors a day.&lt;/p&gt;
&lt;p&gt;Don&amp;rsquo;t. Each of these has a cost: CPU, memory, alert noise you&amp;rsquo;ll learn to ignore, and your time. For one small box, the basics in this article handle 95% of realistic risk. Adding more tools without tuning them often makes you &lt;em&gt;less&lt;/em&gt; secure, because real signals get buried in junk alerts you stop reading.&lt;/p&gt;
&lt;p&gt;If your business actually grows into the territory where you need that stuff (regulated data, big customer base, real compliance), you&amp;rsquo;ll know — and at that point you&amp;rsquo;ll also have the budget to do it properly. Until then: keep the surface small, keep it patched, and keep watching the four things in the monitoring section.&lt;/p&gt;
&lt;h2 id="common-mistakes"&gt;Common mistakes&lt;/h2&gt;
&lt;p&gt;The same handful of things bite small-team servers over and over:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Port 22 open to the entire internet with password login still enabled.&lt;/strong&gt; This is the #1 thing scanners look for. Even with a strong password, you&amp;rsquo;re contributing to the noise. Keys only.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Logging in as root.&lt;/strong&gt; Either directly, or via a sudoers rule that means a single mistake takes the whole box down. Make a real user.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Skipping reboots after kernel updates.&lt;/strong&gt; A patched-but-not-rebooted kernel still runs the old, vulnerable kernel. &lt;code&gt;unattended-upgrades&lt;/code&gt; with &lt;code&gt;Automatic-Reboot "true"&lt;/code&gt; fixes this for free.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;IMDSv1 left enabled on AWS.&lt;/strong&gt; If you&amp;rsquo;re on EC2/Lightsail, the legacy instance metadata endpoint can be reached by anything that can make an outbound HTTP request from the box — including a bug in your app. Use the &lt;a href="/aws-imdsv2-migration-without-breaking-things/"&gt;IMDSv2 migration playbook&lt;/a&gt; to enforce &lt;code&gt;HttpTokens=required&lt;/code&gt; without breaking older agents or SDKs.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dev services bound to &lt;code&gt;0.0.0.0&lt;/code&gt;.&lt;/strong&gt; Postgres, Redis, MongoDB, Elasticsearch, a debug UI, that one Jupyter notebook you spun up &amp;ldquo;just for a sec&amp;rdquo; — anything that listens on all interfaces with no auth is a free shell waiting to happen. Bind to &lt;code&gt;127.0.0.1&lt;/code&gt;, or at minimum require a password and put it behind the firewall.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;No backups, or backups that have never been restored.&lt;/strong&gt; See the backups section above. This is the one that ends businesses.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Storing secrets in committed &lt;code&gt;.env&lt;/code&gt; files.&lt;/strong&gt; You&amp;rsquo;ll forget, push to a public repo, and your API keys are now public. Check in a &lt;code&gt;.env.example&lt;/code&gt; with dummy values and add the real &lt;code&gt;.env&lt;/code&gt; to &lt;code&gt;.gitignore&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;None of these are exotic. All of them are still everywhere.&lt;/p&gt;
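&lt;p&gt;A few of these can be checked in under a minute. A quick sketch, assuming OpenSSH and a git checkout of your app in the current directory:&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;# Effective SSH settings, after all config includes are applied
sudo sshd -T | grep -Ei '^(passwordauthentication|permitrootlogin|pubkeyauthentication) '

# Is .env tracked by git? Silence means it is not.
git ls-files --error-unmatch .env 2&amp;gt;/dev/null &amp;amp;&amp;amp; echo &amp;quot;WARNING: .env is committed&amp;quot;

# Is a reboot pending?
[ -f /var/run/reboot-required ] &amp;amp;&amp;amp; echo &amp;quot;reboot pending&amp;quot;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;You want &lt;code&gt;passwordauthentication no&lt;/code&gt; and &lt;code&gt;permitrootlogin no&lt;/code&gt; (or &lt;code&gt;prohibit-password&lt;/code&gt;) in the first output, and nothing from the other two.&lt;/p&gt;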
&lt;h2 id="worth-a-free-second-opinion"&gt;Worth a free second opinion?&lt;/h2&gt;
&lt;p&gt;Even after a careful first-day setup, things drift. A teammate enables password auth &amp;ldquo;just for a minute.&amp;rdquo; A new service starts listening on &lt;code&gt;0.0.0.0&lt;/code&gt;. Auto-updates silently break and stop running. The point of a periodic external check is to catch that drift before it matters.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://richgibbs.dev/quickcheck/"&gt;&lt;strong&gt;Run a QuickCheck on your VPS →&lt;/strong&gt;&lt;/a&gt; — read-only, no install, takes a few minutes. Or look at a &lt;a href="https://richgibbs.dev/quickcheck/sample-report/"&gt;sample report&lt;/a&gt; first to see what it covers.&lt;/p&gt;
&lt;h2 id="what-this-is-not"&gt;What this is not&lt;/h2&gt;
&lt;p&gt;This article is a sensible starting checklist for one Linux VPS run by one person or a tiny team. It is &lt;strong&gt;not&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A replacement for security advice from someone who knows your specific stack and threat model.&lt;/li&gt;
&lt;li&gt;A compliance program. If you handle health data, payment data, or anything else regulated, you need more than a blog post.&lt;/li&gt;
&lt;li&gt;A guarantee. Nothing in security is. The goal is to make yourself a much less appealing target than the millions of other servers on the internet that haven&amp;rsquo;t done any of this.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Do the basics, do them well, then go back to building the actual product. That&amp;rsquo;s the job.&lt;/p&gt;
&lt;h2 id="about-tuck-sentinel"&gt;About Tuck Sentinel&lt;/h2&gt;
&lt;p&gt;Tuck Sentinel is a small operation focused on practical security checks for indie founders and small teams running production on a VPS. We build &lt;a href="https://richgibbs.dev/quickcheck/"&gt;QuickCheck&lt;/a&gt;, a free read-only scan that highlights the boring-but-important configuration issues most one-person ops teams miss. No agents, no upsell maze — just the things worth fixing.&lt;/p&gt;
</content:encoded>
      <guid isPermaLink="true">https://blog.richgibbs.dev/indie-founder-vps-security-101/</guid>
      <category>vps</category>
      <category>security</category>
      <category>linux</category>
      <category>indie-founder</category>
      <category>ubuntu</category>
      <category>debian</category>
      <category>sysadmin</category>
      <pubDate>Sun, 10 May 2026 00:22:00 +0000</pubDate>
    </item>
    <item>
      <title>AWS IMDSv2 Migration Without Breaking Things</title>
      <link>https://blog.richgibbs.dev/aws-imdsv2-migration-without-breaking-things/</link>
      <description>A practical, indie-founder guide to migrating EC2 instances from IMDSv1 to IMDSv2 without breaking SDKs, containers, kubelet, or the ECS agent.</description>
      <content:encoded>&lt;p&gt;If you have EC2 instances older than a year or two, some of them probably still allow IMDSv1. The Instance Metadata Service is the HTTP endpoint at &lt;code&gt;169.254.169.254&lt;/code&gt; every EC2 instance can hit to learn about itself: instance ID, region, attached IAM role, and the temporary credentials that come with it. IMDSv1 is the original unauthenticated GET protocol. IMDSv2 is the session-token version that blocks a class of SSRF and confused-deputy attacks from walking off with your IAM Role credentials.&lt;/p&gt;
&lt;p&gt;AWS has been nudging everyone toward IMDSv2 for years, but existing fleets, AMIs baked before the change, and ASGs pinned to old launch templates are full of IMDSv1-allowing instances. Migration is conceptually simple — flip a setting per instance — and operationally annoying, because flipping it on the wrong workload breaks credential lookups for SDKs, kubelet, the ECS agent, or your own scripts.&lt;/p&gt;
&lt;p&gt;This guide walks through the migration the way an operator actually has to do it: detect what is still using v1, change instances in safe waves, validate, and have a rollback path. If you are using the migration window to clean up the rest of the instance, pair this with the broader &lt;a href="/ubuntu-debian-ec2-hardening-checklist-2026/#5-ec2-metadata-service-imdsv2"&gt;Ubuntu/Debian EC2 hardening checklist&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id="why-migrate"&gt;Why Migrate&lt;/h2&gt;
&lt;p&gt;IMDSv1 is a plain HTTP &lt;code&gt;GET&lt;/code&gt; against the link-local address. Anything inside the instance that can make an outbound HTTP request — including a vulnerable web app with SSRF — can read instance metadata, including the &lt;strong&gt;IAM Role Credentials&lt;/strong&gt; path:&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code&gt;GET http://169.254.169.254/latest/meta-data/iam/security-credentials/&amp;lt;role-name&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;That returns short-lived credentials for whatever role is attached to the instance. With IMDSv1, no proof of locality is required. An SSRF in a public-facing service can pivot directly to your IAM credentials.&lt;/p&gt;
&lt;p&gt;IMDSv2 changes the protocol in two important ways:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Session tokens.&lt;/strong&gt; Callers &lt;code&gt;PUT&lt;/code&gt; to &lt;code&gt;/latest/api/token&lt;/code&gt; for a session token, then send it back as &lt;code&gt;X-aws-ec2-metadata-token&lt;/code&gt;. SSRF primitives that only allow &lt;code&gt;GET&lt;/code&gt; are blocked.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Hop limit.&lt;/strong&gt; The &lt;code&gt;PUT&lt;/code&gt; response that carries the token is sent with an IP TTL equal to the configured hop limit. The default is 1, so the response is dropped before it reaches a container behind a Docker bridge or a pod behind a CNI unless you raise the limit.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Set IMDSv2 to &lt;strong&gt;required&lt;/strong&gt; and v1 stops responding. That&amp;rsquo;s the goal state.&lt;/p&gt;
&lt;h2 id="what-breaks"&gt;What Breaks&lt;/h2&gt;
&lt;p&gt;The realistic breakage list is short and well-known. Knowing it upfront is most of the migration.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Old AWS SDKs.&lt;/strong&gt; Anything older than the published cutoffs only knows IMDSv1: AWS CLI v1 &amp;lt; 1.18.x, boto3 &amp;lt; 1.12.x, AWS SDK for Java v1 &amp;lt; 1.11.678, Go SDK v1 &amp;lt; 1.25.38, .NET SDK before late-2019. Modern SDKs try v2 first and fall back to v1 only when the token request fails; once v2 is &lt;em&gt;required&lt;/em&gt; there is no v1 to fall back to, so v1-only clients simply fail.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Containers behind Docker bridge or CNI.&lt;/strong&gt; The default hop limit of 1 denies pods/containers that route through the bridge. Raise the hop limit to 2 or, better, use IRSA or EKS Pod Identity on EKS, or task roles on ECS, so workloads don&amp;rsquo;t depend on instance metadata at all.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;kubelet&lt;/code&gt;&lt;/strong&gt; on self-managed nodes. Older kubelets only spoke v1. Modern EKS-optimized AMIs are fine; legacy kops clusters and old custom AMIs are the usual offenders.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;ECS agent.&lt;/strong&gt; &lt;code&gt;amazon-ecs-init&lt;/code&gt; &amp;gt;= 1.50 supports IMDSv2. Old ECS-optimized AMIs not re-rolled in years can fail credential fetch.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;CloudWatch / SSM agent.&lt;/strong&gt; Recent versions are fine; very old pinned versions are not.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Custom scripts.&lt;/strong&gt; &lt;code&gt;curl http://169.254.169.254/latest/meta-data/...&lt;/code&gt; without a token will 401 once v1 is off.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Third-party agents in old AMIs.&lt;/strong&gt; Old Datadog, New Relic, Splunk, or backup agents from years-old golden images can be v1-only.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That&amp;rsquo;s the whole list. Everything else either works on day one or never touched IMDS.&lt;/p&gt;
&lt;h2 id="detect-imdsv1-use"&gt;Detect IMDSv1 Use&lt;/h2&gt;
&lt;p&gt;Don&amp;rsquo;t flip the switch blind. Find the callers first.&lt;/p&gt;
&lt;h3 id="cloudwatch-metric-metadatanotoken"&gt;CloudWatch metric: &lt;code&gt;MetadataNoToken&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;Every EC2 instance emits a CloudWatch metric called &lt;code&gt;MetadataNoToken&lt;/code&gt; in the &lt;code&gt;AWS/EC2&lt;/code&gt; namespace. It increments every time something on the instance hits IMDSv1. This is the single most useful signal you have.&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name MetadataNoToken \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --statistics Sum \
  --period 3600 \
  --start-time &amp;quot;$(date -u -d '7 days ago' +%Y-%m-%dT%H:%M:%SZ)&amp;quot; \
  --end-time   &amp;quot;$(date -u +%Y-%m-%dT%H:%M:%SZ)&amp;quot;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;If &lt;code&gt;Sum&lt;/code&gt; across the last 7 days is &lt;code&gt;0&lt;/code&gt;, that instance is not making any IMDSv1 calls and is safe to switch. Anything non-zero means something is still hitting v1.&lt;/p&gt;
&lt;p&gt;For a fleet view, query across all instance IDs or use CloudWatch Metrics Insights / Metric Math to graph &lt;code&gt;MetadataNoToken&lt;/code&gt; aggregated. Tag the noisy instances and dig in.&lt;/p&gt;
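&lt;p&gt;For a small fleet, a plain loop is often enough. A sketch that assumes GNU &lt;code&gt;date&lt;/code&gt; and works against whatever region and credentials your AWS CLI defaults to:&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;# 7-day MetadataNoToken total for every running instance that still allows v1
START=$(date -u -d '7 days ago' +%Y-%m-%dT%H:%M:%SZ)
END=$(date -u +%Y-%m-%dT%H:%M:%SZ)
for ID in $(aws ec2 describe-instances \
    --filters &amp;quot;Name=metadata-options.http-tokens,Values=optional&amp;quot; \
              &amp;quot;Name=instance-state-name,Values=running&amp;quot; \
    --query 'Reservations[].Instances[].InstanceId' --output text); do
  SUM=$(aws cloudwatch get-metric-statistics \
    --namespace AWS/EC2 --metric-name MetadataNoToken \
    --dimensions Name=InstanceId,Value=&amp;quot;$ID&amp;quot; \
    --statistics Sum --period 86400 \
    --start-time &amp;quot;$START&amp;quot; --end-time &amp;quot;$END&amp;quot; \
    --query 'Datapoints[].Sum' --output text \
    | tr '\t' '\n' | awk '{s+=$1} END {print s+0}')
  echo &amp;quot;$ID $SUM&amp;quot;
done
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Instances that print &lt;code&gt;0&lt;/code&gt; are the easy first wave. Anything else needs its callers found before you touch it.&lt;/p&gt;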
&lt;h3 id="inventory-which-instances-even-allow-v1"&gt;Inventory: which instances even allow v1?&lt;/h3&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;aws ec2 describe-instances \
  --query 'Reservations[].Instances[].{
    Id:InstanceId,
    State:State.Name,
    HttpTokens:MetadataOptions.HttpTokens,
    HopLimit:MetadataOptions.HttpPutResponseHopLimit,
    Endpoint:MetadataOptions.HttpEndpoint
  }' \
  --output table
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;code&gt;HttpTokens&lt;/code&gt; is what you care about. It will be one of:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;optional&lt;/code&gt; — IMDSv1 still allowed (the thing you&amp;rsquo;re trying to remove)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;required&lt;/code&gt; — IMDSv2 only (the goal state)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A simple &amp;ldquo;what&amp;rsquo;s left?&amp;rdquo; query:&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;aws ec2 describe-instances \
  --filters &amp;quot;Name=metadata-options.http-tokens,Values=optional&amp;quot; \
            &amp;quot;Name=instance-state-name,Values=running&amp;quot; \
  --query 'Reservations[].Instances[].InstanceId' \
  --output text
&lt;/code&gt;&lt;/pre&gt;

&lt;h3 id="cloudtrail-and-vpc-flow-logs"&gt;CloudTrail and VPC flow logs&lt;/h3&gt;
&lt;p&gt;CloudTrail does &lt;strong&gt;not&lt;/strong&gt; log calls to IMDS itself — those never leave the instance. What it &lt;em&gt;does&lt;/em&gt; show is the AWS API calls made &lt;em&gt;with&lt;/em&gt; the credentials IMDS handed out, via &lt;code&gt;userIdentity.sessionContext&lt;/code&gt; and the &lt;code&gt;accessKeyId&lt;/code&gt; of the temporary credentials. Useful for finding workloads still authenticating via instance role that should have moved to IRSA or task roles.&lt;/p&gt;
&lt;p&gt;VPC flow logs do not see &lt;code&gt;169.254.169.254&lt;/code&gt; traffic either — link-local stays inside the host. Stick to &lt;code&gt;MetadataNoToken&lt;/code&gt; plus the inventory query.&lt;/p&gt;
&lt;h3 id="on-host-detection"&gt;On-host detection&lt;/h3&gt;
&lt;p&gt;If you have shell access to a candidate instance, run something quick before you change settings:&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;# Try IMDSv1: prints 200 if v1 is still allowed, 401 if tokens are required
curl -s -o /dev/null -w &amp;quot;%{http_code}\n&amp;quot; \
  http://169.254.169.254/latest/meta-data/instance-id

# Try IMDSv2: should print the instance ID whether tokens are optional or required
TOKEN=$(curl -s -X PUT &amp;quot;http://169.254.169.254/latest/api/token&amp;quot; \
  -H &amp;quot;X-aws-ec2-metadata-token-ttl-seconds: 21600&amp;quot;)
curl -s -H &amp;quot;X-aws-ec2-metadata-token: $TOKEN&amp;quot; \
  http://169.254.169.254/latest/meta-data/instance-id
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;To find callers on a host, &lt;code&gt;auditd&lt;/code&gt; rules on connects to &lt;code&gt;169.254.169.254&lt;/code&gt; plus &lt;code&gt;ss -tnp&lt;/code&gt; snapshots usually identify the offending process. On a Kubernetes node, look at old DaemonSets and sidecars first.&lt;/p&gt;
&lt;h2 id="migration-steps"&gt;Migration Steps&lt;/h2&gt;
&lt;p&gt;The flow that has worked reliably for small and mid-size fleets:&lt;/p&gt;
&lt;h3 id="1-baseline-and-freeze-new-imdsv1"&gt;1. Baseline and freeze new IMDSv1&lt;/h3&gt;
&lt;p&gt;Set account-level defaults so anything launched from now on is IMDSv2-required, and mark the AMIs you own as v2-only. The defaults are per region, so repeat the first command in each region you use:&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;# Default IMDS options for new instances in this region
aws ec2 modify-instance-metadata-defaults \
  --http-tokens required \
  --http-put-response-hop-limit 2 \
  --http-endpoint enabled

# Mark an AMI you own as IMDSv2-only (repeat per AMI)
aws ec2 modify-image-attribute \
  --image-id ami-xxxxxxxxxxxxxxxxx \
  --imds-support v2.0
&lt;/code&gt;&lt;/pre&gt;

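&lt;p&gt;Run &lt;code&gt;modify-image-attribute --imds-support v2.0&lt;/code&gt; on each AMI you control; there is no account-wide switch for existing images. Once set, instances launched from that AMI default to v2-required. AWS documents the flag as irreversible, so confirm the software baked into the AMI speaks IMDSv2 before you set it.&lt;/p&gt;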
&lt;p&gt;Also create a new launch template version that requires IMDSv2 for every template your Auto Scaling groups use:&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;aws ec2 create-launch-template-version \
  --launch-template-id lt-0123456789abcdef0 \
  --source-version 1 \
  --launch-template-data '{
    &amp;quot;MetadataOptions&amp;quot;: {
      &amp;quot;HttpTokens&amp;quot;: &amp;quot;required&amp;quot;,
      &amp;quot;HttpPutResponseHopLimit&amp;quot;: 2,
      &amp;quot;HttpEndpoint&amp;quot;: &amp;quot;enabled&amp;quot;
    }
  }'
&lt;/code&gt;&lt;/pre&gt;

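&lt;p&gt;This stops the bleeding, provided your Auto Scaling groups actually reference the new template version (&lt;code&gt;$Latest&lt;/code&gt;, or a default version you have updated). Old instances may still be on v1, but no new ones are.&lt;/p&gt;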
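&lt;p&gt;It is worth reading the settings back before moving on. A short sketch using the placeholder AMI and launch template IDs from the commands above:&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;# Account-level defaults for the current region
aws ec2 get-instance-metadata-defaults

# IMDS support flag on an AMI you own
aws ec2 describe-images --image-ids ami-xxxxxxxxxxxxxxxxx \
  --query 'Images[].{Id:ImageId,Imds:ImdsSupport}' --output table

# What the default and latest launch template versions require
aws ec2 describe-launch-template-versions \
  --launch-template-id lt-0123456789abcdef0 --versions '$Default' '$Latest' \
  --query 'LaunchTemplateVersions[].{Version:VersionNumber,Default:DefaultVersion,Tokens:LaunchTemplateData.MetadataOptions.HttpTokens}' \
  --output table
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;If the default version still shows no &lt;code&gt;Tokens&lt;/code&gt; value, groups pinned to &lt;code&gt;$Default&lt;/code&gt; will keep launching with the old settings.&lt;/p&gt;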
&lt;h3 id="2-sort-instances-into-waves"&gt;2. Sort instances into waves&lt;/h3&gt;
&lt;p&gt;Pull the list of &lt;code&gt;HttpTokens=optional&lt;/code&gt; instances. Group them by:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Wave 0 — disposable.&lt;/strong&gt; Stateless workers, batch nodes, dev/test. Cheap to break, cheap to recreate. Migrate first.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Wave 1 — replaceable through autoscaling.&lt;/strong&gt; ASG-managed web tiers, ECS/EKS nodes. New launches are already v2-required; old nodes get rotated out by simply triggering an instance refresh.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Wave 2 — stateful or hand-built.&lt;/strong&gt; Bastions, databases on EC2, single-instance services, anything pet-shaped.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For waves 0 and 1, prefer &lt;strong&gt;rotation over modification&lt;/strong&gt; — relaunch from updated launch templates rather than mutating live instances. Less risky, fewer surprises.&lt;/p&gt;
&lt;h3 id="3-optional-try-optional-required-with-a-hop-bump"&gt;3. Optional: try &lt;code&gt;optional&lt;/code&gt; → &lt;code&gt;required&lt;/code&gt; with a hop bump&lt;/h3&gt;
&lt;p&gt;For a stateful instance you cannot easily relaunch, raise the hop limit first (so containers keep working), then flip tokens to required:&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;# Step A: bump hop limit while still allowing v1
aws ec2 modify-instance-metadata-options \
  --instance-id i-0123456789abcdef0 \
  --http-put-response-hop-limit 2 \
  --http-tokens optional \
  --http-endpoint enabled

# Verify everything still works for at least one full agent cycle
# (CloudWatch agent, SSM agent, your app, container credential lookups)

# Step B: require v2
aws ec2 modify-instance-metadata-options \
  --instance-id i-0123456789abcdef0 \
  --http-tokens required
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Watch &lt;code&gt;MetadataNoToken&lt;/code&gt; after step A — if any callers are still using v1, they will keep showing up in the metric. Fix or upgrade them before step B.&lt;/p&gt;
&lt;h3 id="4-roll-auto-scaling-groups"&gt;4. Roll Auto Scaling groups&lt;/h3&gt;
&lt;p&gt;After the launch template is updated:&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;aws autoscaling start-instance-refresh \
  --auto-scaling-group-name my-asg \
  --preferences '{&amp;quot;MinHealthyPercentage&amp;quot;: 90, &amp;quot;InstanceWarmup&amp;quot;: 300}'
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;For EKS managed node groups, the equivalent is updating the node group to a new launch template version and letting AWS drain and replace nodes. For ECS, update the capacity provider&amp;rsquo;s launch template and either drain instances or wait for natural turnover.&lt;/p&gt;
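&lt;p&gt;An instance refresh runs in the background, so it helps to poll it. A minimal sketch, reusing the example group name &lt;code&gt;my-asg&lt;/code&gt; from above:&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;# Status of the most recent instance refresh for the group
aws autoscaling describe-instance-refreshes \
  --auto-scaling-group-name my-asg \
  --max-records 1 \
  --query 'InstanceRefreshes[0].{Status:Status,Percent:PercentageComplete,Reason:StatusReason}' \
  --output table
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Wait for &lt;code&gt;Successful&lt;/code&gt; before you run the inventory query again; a &lt;code&gt;Failed&lt;/code&gt; or &lt;code&gt;Cancelled&lt;/code&gt; refresh leaves a mix of old and new instances behind.&lt;/p&gt;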
&lt;h3 id="5-sweep-and-confirm"&gt;5. Sweep and confirm&lt;/h3&gt;
&lt;p&gt;After each wave, re-run the inventory query and the &lt;code&gt;MetadataNoToken&lt;/code&gt; check. Anything still on &lt;code&gt;optional&lt;/code&gt; should have a name attached to it and a reason.&lt;/p&gt;
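&lt;p&gt;The inventory query only sees one region at a time. A sketch of the same sweep across every region enabled for the account, assuming your credentials can call &lt;code&gt;describe-regions&lt;/code&gt;:&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;# List running instances that still allow IMDSv1, region by region
for REGION in $(aws ec2 describe-regions \
    --query 'Regions[].RegionName' --output text); do
  IDS=$(aws ec2 describe-instances --region &amp;quot;$REGION&amp;quot; \
    --filters &amp;quot;Name=metadata-options.http-tokens,Values=optional&amp;quot; \
              &amp;quot;Name=instance-state-name,Values=running&amp;quot; \
    --query 'Reservations[].Instances[].InstanceId' --output text)
  [ -n &amp;quot;$IDS&amp;quot; ] &amp;amp;&amp;amp; echo &amp;quot;$REGION: $IDS&amp;quot;
done
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;No output means no running instance in any enabled region still accepts v1.&lt;/p&gt;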
&lt;blockquote&gt;
&lt;p&gt;Want a one-shot read-only audit that tells you which of your EC2 instances still allow IMDSv1, plus a dozen other quiet AWS posture issues? That&amp;rsquo;s exactly what &lt;a href="https://richgibbs.dev/quickcheck/"&gt;QuickCheck&lt;/a&gt; is built for. Skim a &lt;a href="https://richgibbs.dev/quickcheck/sample-report/"&gt;sample report&lt;/a&gt; before you decide.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2 id="validation"&gt;Validation&lt;/h2&gt;
&lt;p&gt;After you flip an instance, you want fast confirmation it&amp;rsquo;s actually on v2 and nothing is silently failing.&lt;/p&gt;
&lt;h3 id="confirm-v2-required-at-the-api-level"&gt;Confirm v2-required at the API level&lt;/h3&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;aws ec2 describe-instances \
  --instance-ids i-0123456789abcdef0 \
  --query 'Reservations[0].Instances[0].MetadataOptions'
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Expected:&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-json"&gt;{
  &amp;quot;State&amp;quot;: &amp;quot;applied&amp;quot;,
  &amp;quot;HttpTokens&amp;quot;: &amp;quot;required&amp;quot;,
  &amp;quot;HttpPutResponseHopLimit&amp;quot;: 2,
  &amp;quot;HttpEndpoint&amp;quot;: &amp;quot;enabled&amp;quot;,
  &amp;quot;HttpProtocolIpv6&amp;quot;: &amp;quot;disabled&amp;quot;,
  &amp;quot;InstanceMetadataTags&amp;quot;: &amp;quot;disabled&amp;quot;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;code&gt;State: applied&lt;/code&gt; matters — &lt;code&gt;pending&lt;/code&gt; means the change has not landed yet.&lt;/p&gt;
&lt;h3 id="confirm-v1-is-actually-rejected-on-the-host"&gt;Confirm v1 is actually rejected on the host&lt;/h3&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;# Should now return 401 Unauthorized
curl -s -o /dev/null -w &amp;quot;v1: %{http_code}\n&amp;quot; \
  http://169.254.169.254/latest/meta-data/instance-id

# Should return 200 with the instance ID
TOKEN=$(curl -s -X PUT &amp;quot;http://169.254.169.254/latest/api/token&amp;quot; \
  -H &amp;quot;X-aws-ec2-metadata-token-ttl-seconds: 21600&amp;quot;)
curl -s -H &amp;quot;X-aws-ec2-metadata-token: $TOKEN&amp;quot; \
  -w &amp;quot;\nv2: %{http_code}\n&amp;quot; \
  http://169.254.169.254/latest/meta-data/instance-id
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;code&gt;v1: 401&lt;/code&gt; and &lt;code&gt;v2: 200&lt;/code&gt; is the correct pair.&lt;/p&gt;
&lt;h3 id="confirm-credentials-still-resolve"&gt;Confirm credentials still resolve&lt;/h3&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;TOKEN=$(curl -s -X PUT &amp;quot;http://169.254.169.254/latest/api/token&amp;quot; \
  -H &amp;quot;X-aws-ec2-metadata-token-ttl-seconds: 21600&amp;quot;)
ROLE=$(curl -s -H &amp;quot;X-aws-ec2-metadata-token: $TOKEN&amp;quot; \
  http://169.254.169.254/latest/meta-data/iam/security-credentials/)
curl -s -H &amp;quot;X-aws-ec2-metadata-token: $TOKEN&amp;quot; \
  http://169.254.169.254/latest/meta-data/iam/security-credentials/$ROLE \
  | head -c 200; echo
&lt;/code&gt;&lt;/pre&gt;

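&lt;p&gt;The output is cut at 200 bytes, so expect the start of a JSON document beginning with &lt;code&gt;Code&lt;/code&gt; (which should be &lt;code&gt;Success&lt;/code&gt;) and &lt;code&gt;AccessKeyId&lt;/code&gt;. The full response also carries &lt;code&gt;SecretAccessKey&lt;/code&gt;, &lt;code&gt;Token&lt;/code&gt;, and &lt;code&gt;Expiration&lt;/code&gt;.&lt;/p&gt;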
&lt;h3 id="confirm-app-level-health"&gt;Confirm app-level health&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;aws sts get-caller-identity&lt;/code&gt; from the instance using whichever SDK your workloads use.&lt;/li&gt;
&lt;li&gt;Container credential lookups from inside one container per host (especially if you raised the hop limit).&lt;/li&gt;
&lt;li&gt;ECS agent: &lt;code&gt;curl -s http://localhost:51678/v1/metadata&lt;/code&gt; should still respond.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;kubelet&lt;/code&gt; health: nodes still &lt;code&gt;Ready&lt;/code&gt;, image pulls from ECR still work.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="confirm-metadatanotoken-is-zero"&gt;Confirm &lt;code&gt;MetadataNoToken&lt;/code&gt; is zero&lt;/h3&gt;
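&lt;p&gt;After 24 to 48 hours on v2-required, &lt;code&gt;MetadataNoToken&lt;/code&gt; should be a flat zero line. AWS documents this metric as counting &lt;em&gt;successful&lt;/em&gt; token-less requests, so treat the zero as confirmation that v1 is off, not as proof that nothing is trying: a caller that still speaks v1 is now getting 401s and failing. Check agent and application logs for those failures, and the &lt;code&gt;MetadataNoTokenRejected&lt;/code&gt; metric if it is available for your instances.&lt;/p&gt;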
&lt;h2 id="rollback"&gt;Rollback&lt;/h2&gt;
&lt;p&gt;You want this written down before you need it.&lt;/p&gt;
&lt;p&gt;Per-instance rollback is one CLI call:&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;aws ec2 modify-instance-metadata-options \
  --instance-id i-0123456789abcdef0 \
  --http-tokens optional \
  --http-put-response-hop-limit 2 \
  --http-endpoint enabled
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;That re-enables IMDSv1 immediately, no instance restart required. It is the same call you used to flip forward — just with &lt;code&gt;optional&lt;/code&gt; instead of &lt;code&gt;required&lt;/code&gt;.&lt;/p&gt;
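&lt;p&gt;To confirm the rollback actually landed, you can read the metadata options back. This is a small sketch assuming AWS CLI v2 and &lt;code&gt;ec2:DescribeInstances&lt;/code&gt; permission; &lt;code&gt;imds_options&lt;/code&gt; is a hypothetical wrapper name, not an AWS command.&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;# Hypothetical helper: print the current metadata options for one instance.
imds_options() {
  aws ec2 describe-instances \
    --instance-ids "$1" \
    --query 'Reservations[].Instances[].MetadataOptions' \
    --output json
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;After a rollback, &lt;code&gt;imds_options i-0123456789abcdef0&lt;/code&gt; should show &lt;code&gt;HttpTokens&lt;/code&gt; as &lt;code&gt;optional&lt;/code&gt; and &lt;code&gt;State&lt;/code&gt; as &lt;code&gt;applied&lt;/code&gt;.&lt;/p&gt;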
&lt;p&gt;Launch template rollback: revert to the previous version.&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;aws ec2 modify-launch-template \
  --launch-template-id lt-0123456789abcdef0 \
  --default-version 1
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Auto Scaling rollback: trigger another instance refresh against the previous LT version, or roll forward with a fixed template once you know what broke. Avoid the temptation to mutate live ASG instances; relaunch is cleaner.&lt;/p&gt;
&lt;p&gt;For account-level defaults, you can re-relax them, but generally do not. Once new instances are v2-required by default, leave that in place even if you have to roll back individual stragglers.&lt;/p&gt;
&lt;h2 id="quickcheck-cta"&gt;QuickCheck CTA&lt;/h2&gt;
&lt;p&gt;If you&amp;rsquo;d rather not hand-roll the inventory queries and CloudWatch checks across every account and region, &lt;strong&gt;&lt;a href="https://richgibbs.dev/quickcheck/"&gt;QuickCheck&lt;/a&gt;&lt;/strong&gt; runs a read-only, one-shot review of your AWS posture and produces a plain-English report. IMDSv1 stragglers are one of the dozen things it surfaces — alongside open security groups, public S3, missing MFA on root, untagged keys, and a few other &amp;ldquo;you&amp;rsquo;d rather know&amp;rdquo; items. See an example in the &lt;a href="https://richgibbs.dev/quickcheck/sample-report/"&gt;sample report&lt;/a&gt;. It is not magic and not a replacement for proper cloud security tooling, but it is a fast way to know where you stand before you start migrating.&lt;/p&gt;
&lt;h2 id="what-this-is-not"&gt;What This Is Not&lt;/h2&gt;
&lt;p&gt;To set expectations clearly:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;This is &lt;strong&gt;not a penetration test&lt;/strong&gt;. It is a configuration migration, not an adversarial exercise.&lt;/li&gt;
&lt;li&gt;This is &lt;strong&gt;not a certification or compliance attestation&lt;/strong&gt;. Migrating to IMDSv2 is a control improvement; it does not by itself constitute SOC 2, ISO 27001, PCI, or anything else. Your auditor still wants the artifacts they always want.&lt;/li&gt;
&lt;li&gt;This is &lt;strong&gt;not a guarantee&lt;/strong&gt;. Cloud security is a portfolio of controls. IMDSv2 closes one well-known SSRF-to-credentials path; it does not address misconfigured security groups, overly broad IAM policies, leaked long-lived keys, or vulnerable application code. Treat it as one item on the list.&lt;/li&gt;
&lt;li&gt;This is &lt;strong&gt;not a substitute&lt;/strong&gt; for moving workloads to IRSA / EKS Pod Identity / ECS task roles where those fit. IMDSv2 makes instance metadata safer; per-workload identity is still the better long-term answer for containers.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Migrate to IMDSv2 because it is cheap, well-understood, and removes a real foot-gun. Then keep going.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="about-tuck-sentinel"&gt;About Tuck Sentinel&lt;/h2&gt;
&lt;p&gt;Tuck Sentinel is the security-focused side of an indie operator workshop by Rich Gibbs. It builds small, sharp tools — like QuickCheck — for founders and small teams who want a competent read of their cloud posture without an enterprise platform. The bias: fast, honest, read-only assessments and migrations you can actually finish.&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-json"&gt;{
  &amp;quot;@context&amp;quot;: &amp;quot;https://schema.org&amp;quot;,
  &amp;quot;@type&amp;quot;: &amp;quot;Article&amp;quot;,
  &amp;quot;headline&amp;quot;: &amp;quot;AWS IMDSv2 Migration Without Breaking Things&amp;quot;,
  &amp;quot;description&amp;quot;: &amp;quot;A practical, indie-founder guide to migrating EC2 instances from IMDSv1 to IMDSv2 without breaking SDKs, containers, kubelet, or the ECS agent.&amp;quot;,
  &amp;quot;author&amp;quot;: {
    &amp;quot;@type&amp;quot;: &amp;quot;Organization&amp;quot;,
    &amp;quot;name&amp;quot;: &amp;quot;Tuck Sentinel&amp;quot;
  },
  &amp;quot;publisher&amp;quot;: {
    &amp;quot;@type&amp;quot;: &amp;quot;Organization&amp;quot;,
    &amp;quot;name&amp;quot;: &amp;quot;Tuck Sentinel&amp;quot;,
    &amp;quot;url&amp;quot;: &amp;quot;https://richgibbs.dev/&amp;quot;
  },
  &amp;quot;mainEntityOfPage&amp;quot;: {
    &amp;quot;@type&amp;quot;: &amp;quot;WebPage&amp;quot;,
    &amp;quot;@id&amp;quot;: &amp;quot;https://example.com/blog/aws-imdsv2-migration-without-breaking-things&amp;quot;
  },
  &amp;quot;image&amp;quot;: &amp;quot;https://example.com/og/aws-imdsv2-migration.png&amp;quot;,
  &amp;quot;articleSection&amp;quot;: &amp;quot;Cloud Security&amp;quot;,
  &amp;quot;keywords&amp;quot;: &amp;quot;AWS, EC2, IMDSv2, IMDSv1, cloud security, IAM, SSRF, migration&amp;quot;,
  &amp;quot;about&amp;quot;: [
    { &amp;quot;@type&amp;quot;: &amp;quot;Thing&amp;quot;, &amp;quot;name&amp;quot;: &amp;quot;AWS EC2 Instance Metadata Service&amp;quot; },
    { &amp;quot;@type&amp;quot;: &amp;quot;Thing&amp;quot;, &amp;quot;name&amp;quot;: &amp;quot;IMDSv2&amp;quot; },
    { &amp;quot;@type&amp;quot;: &amp;quot;Thing&amp;quot;, &amp;quot;name&amp;quot;: &amp;quot;Cloud Security Posture&amp;quot; }
  ]
}
&lt;/code&gt;&lt;/pre&gt;</content:encoded>
      <guid isPermaLink="true">https://blog.richgibbs.dev/aws-imdsv2-migration-without-breaking-things/</guid>
      <category>aws</category>
      <category>ec2</category>
      <category>imdsv2</category>
      <category>security</category>
      <category>devops</category>
      <category>cloud-security</category>
      <pubDate>Sun, 10 May 2026 00:22:00 +0000</pubDate>
    </item>
    <item>
      <title>SPF, DKIM, DMARC for indie founders: the 20-minute checklist</title>
      <link>https://blog.richgibbs.dev/spf-dkim-dmarc-indie-founder-checklist/</link>
      <description>If your password resets and receipts keep landing in spam, your server is fine and your email DNS probably isn't. Here is the boring, working playbook to fix it in one sitting.</description>
      <content:encoded>&lt;p&gt;You shipped a product. Stripe sends receipts. Postmark sends magic links. Mailchimp blasts your launch list. You replied to a support ticket from your own founder address.&lt;/p&gt;
&lt;p&gt;Then someone said &lt;em&gt;&amp;ldquo;hey, your password reset went to spam.&amp;rdquo;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;This guide is for that moment.&lt;/p&gt;
&lt;p&gt;It is not a deliverability bible. It is the smallest correct version of the SPF / DKIM / DMARC story for a solo founder or a 2-3 person SaaS team, with one custom domain and two-to-five tools that send email on its behalf. If you can edit DNS and copy a record, you can finish it tonight.&lt;/p&gt;
&lt;p&gt;We are also not selling you a deliverability platform. The point of this post is for you to do it yourself, &lt;em&gt;correctly&lt;/em&gt;, in one sitting.&lt;/p&gt;
&lt;h2 id="what-set-up-email-dns-actually-means-in-2026"&gt;What &amp;ldquo;set up email DNS&amp;rdquo; actually means in 2026&lt;/h2&gt;
&lt;p&gt;Mailbox providers — Gmail, Yahoo, Outlook, Apple, ProtonMail — use three DNS-anchored signals to decide whether a message is plausibly from your domain at all:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;SPF&lt;/strong&gt; says &lt;em&gt;&amp;ldquo;these IP addresses / hostnames are allowed to send mail using my domain in the envelope sender.&amp;rdquo;&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;DKIM&lt;/strong&gt; says &lt;em&gt;&amp;ldquo;messages from my domain will carry a cryptographic signature in the headers, signed by a key whose public half lives in DNS.&amp;rdquo;&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;DMARC&lt;/strong&gt; says &lt;em&gt;&amp;ldquo;if SPF and DKIM both fail to align with my visible From: domain, here is what you should do — nothing, quarantine to spam, or reject — and please send me reports.&amp;rdquo;&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In 2024–2025 Gmail and Yahoo started requiring all three from any sender shipping more than 5,000 messages a day to their users, and they have been quietly tightening the rules for low-volume senders ever since. In practice, by 2026:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;If your domain has no SPF and no DKIM, password resets and receipts will sometimes silently disappear into spam.&lt;/li&gt;
&lt;li&gt;If your domain has no DMARC at all, anyone can spoof &amp;ldquo;from your domain&amp;rdquo; until enough recipients complain.&lt;/li&gt;
&lt;li&gt;If your DMARC record is malformed, mailbox providers behave the same as if it isn&amp;rsquo;t there — except now your reports vanish too.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You do not need to be perfect. You need to be &lt;em&gt;not broken&lt;/em&gt;.&lt;/p&gt;
&lt;h2 id="the-20-minute-checklist"&gt;The 20-minute checklist&lt;/h2&gt;
&lt;p&gt;Before you touch DNS, do the boring inventory step. This is the part most founders skip and most spam problems come from.&lt;/p&gt;
&lt;h3 id="1-list-every-tool-that-sends-mail-from-your-domain-3-minutes"&gt;1. List every tool that sends mail &amp;ldquo;from&amp;rdquo; your domain (3 minutes)&lt;/h3&gt;
&lt;p&gt;Open a notes file. Write the domain you want to fix at the top. Then list every place that sends email &lt;em&gt;as&lt;/em&gt; that domain. Real examples for a typical indie SaaS:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Founder mail (you replying from &lt;code&gt;you@yourdomain.com&lt;/code&gt;) — Google Workspace or Fastmail.&lt;/li&gt;
&lt;li&gt;Transactional / product mail — Postmark, Resend, Mailgun, AWS SES, SendGrid, Mailtrap.&lt;/li&gt;
&lt;li&gt;Marketing / newsletter — ConvertKit, Mailchimp, Beehiiv, Buttondown, Substack custom domain.&lt;/li&gt;
&lt;li&gt;Helpdesk — Help Scout, Front, HubSpot, Zendesk, Plain.&lt;/li&gt;
&lt;li&gt;App-platform notifications — Vercel/Render/Heroku notifications using your domain, GitHub on a custom domain.&lt;/li&gt;
&lt;li&gt;Stripe receipts and Tally form notifications, when configured to &amp;ldquo;send from&amp;rdquo; your domain rather than the platform default.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you cannot remember, search your inbox for &lt;code&gt;from:@yourdomain.com&lt;/code&gt; and note every &amp;ldquo;tool integration&amp;rdquo; message you find from the last 90 days.&lt;/p&gt;
&lt;p&gt;This list is the single most useful artifact in this entire process. If anyone ever asks you &amp;ldquo;do you know who sends as your domain?&amp;rdquo;, you can answer in one screen.&lt;/p&gt;
&lt;h3 id="2-pick-exactly-one-spf-record-5-minutes"&gt;2. Pick exactly one SPF record (5 minutes)&lt;/h3&gt;
&lt;p&gt;SPF is one TXT record at the apex of your domain (&lt;code&gt;yourdomain.com&lt;/code&gt;, not &lt;code&gt;mail.yourdomain.com&lt;/code&gt;). You are allowed exactly one. If there are two SPF TXT records in DNS, every conforming mailbox server treats the result as &lt;code&gt;permerror&lt;/code&gt; and ignores both.&lt;/p&gt;
&lt;p&gt;A working SPF for the example list above might be:&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code&gt;v=spf1 include:_spf.google.com include:spf.mtasv.net include:mailgun.org include:spf.constantcontact.com -all
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Rules:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Start with &lt;code&gt;v=spf1&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;One &lt;code&gt;include:&lt;/code&gt; per provider, taken from each provider&amp;rsquo;s docs. Do not invent them.&lt;/li&gt;
&lt;li&gt;End with &lt;code&gt;-all&lt;/code&gt; (hard fail) or &lt;code&gt;~all&lt;/code&gt; (soft fail). Use &lt;code&gt;~all&lt;/code&gt; while you are setting up DMARC, then move to &lt;code&gt;-all&lt;/code&gt; once DMARC reports are clean.&lt;/li&gt;
&lt;li&gt;Do not put &lt;code&gt;+all&lt;/code&gt; anywhere. Ever. That tells the world &lt;em&gt;anyone&lt;/em&gt; can send as you.&lt;/li&gt;
&lt;li&gt;Do not exceed 10 DNS lookups in total. Every &lt;code&gt;include:&lt;/code&gt;, &lt;code&gt;a&lt;/code&gt;, &lt;code&gt;mx&lt;/code&gt;, &lt;code&gt;exists:&lt;/code&gt;, and &lt;code&gt;redirect=&lt;/code&gt; term counts, including the ones nested inside each provider&amp;rsquo;s own record. Stacks like Google Workspace + Mailgun + Mailchimp + Constant Contact + Help Scout will quietly exceed 10. If you see &lt;code&gt;permerror&lt;/code&gt; in reports, this is usually why.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you use &lt;code&gt;mail.yourdomain.com&lt;/code&gt; as a separate sending subdomain (some providers configure it that way), publish a &lt;em&gt;separate&lt;/em&gt; SPF record at that subdomain.&lt;/p&gt;
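&lt;p&gt;For a quick sanity check on the lookup budget, a rough counter like this can help. It is a sketch, not a full SPF evaluator: &lt;code&gt;count_spf_terms&lt;/code&gt; is a hypothetical helper that only counts the lookup-triggering terms in the one record string you give it and does not follow the lookups nested inside each &lt;code&gt;include:&lt;/code&gt;.&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;# Hypothetical helper: count top-level SPF terms that cost a DNS lookup
# (include, a, mx, ptr, exists, redirect). Nested lookups are NOT followed.
count_spf_terms() {
  printf '%s\n' "$1" \
    | tr -d '"' \
    | tr ' ' '\n' \
    | grep -Ec '^[+?~-]?(include:|exists:|redirect=|ptr|a$|a[:/]|mx$|mx[:/])'
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Feed it the live record with &lt;code&gt;count_spf_terms "$(dig +short TXT yourdomain.com | grep 'v=spf1')"&lt;/code&gt;. Treat the result as a lower bound, because each provider&amp;rsquo;s own record usually adds more lookups.&lt;/p&gt;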
&lt;h3 id="3-add-dkim-for-each-sending-tool-5-minutes"&gt;3. Add DKIM for each sending tool (5 minutes)&lt;/h3&gt;
&lt;p&gt;DKIM is per-provider. Every provider that sends mail for you should give you one or more &lt;code&gt;selector._domainkey.yourdomain.com&lt;/code&gt; CNAME or TXT records to add.&lt;/p&gt;
&lt;p&gt;Examples of selectors you&amp;rsquo;ll see in a real indie SaaS:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Google Workspace: &lt;code&gt;google._domainkey&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Postmark: &lt;code&gt;&amp;lt;assigned&amp;gt;._domainkey&lt;/code&gt; (Postmark assigns the selector when you verify the domain)&lt;/li&gt;
&lt;li&gt;Mailgun: &lt;code&gt;mailo._domainkey&lt;/code&gt; and &lt;code&gt;pic._domainkey&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;ConvertKit / Mailchimp: their dashboard prints the exact CNAMEs.&lt;/li&gt;
&lt;li&gt;Resend: &lt;code&gt;resend._domainkey&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Two rules that catch people:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;DKIM records &lt;em&gt;do not&lt;/em&gt; show up in plain &lt;code&gt;dig TXT yourdomain.com&lt;/code&gt;. You have to query the selector explicitly: &lt;code&gt;dig TXT selector._domainkey.yourdomain.com&lt;/code&gt;. If you cannot remember selectors, you cannot validate your own DKIM from public DNS — write them down.&lt;/li&gt;
&lt;li&gt;&amp;ldquo;DKIM is set up&amp;rdquo; is not the same as &amp;ldquo;messages are being signed.&amp;rdquo; Each provider has its own toggle for &amp;ldquo;sign outbound mail with this key.&amp;rdquo; If signing is off in the provider dashboard, the selector record alone is useless.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The &lt;code&gt;Authentication-Results&lt;/code&gt; header in any actual sent email is the source of truth. If it says &lt;code&gt;dkim=pass&lt;/code&gt; from your visible domain, signing is real.&lt;/p&gt;
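&lt;p&gt;Once the selectors are written down, checking that each one resolves is a short loop. A minimal sketch, assuming &lt;code&gt;dig&lt;/code&gt; is installed (the &lt;code&gt;dnsutils&lt;/code&gt; package on Debian and Ubuntu); &lt;code&gt;check_dkim_selectors&lt;/code&gt; is a name invented for this example. It only proves the record is published, not that the provider is signing with it.&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;# Hypothetical helper: print the published DKIM record for each selector.
# Usage: check_dkim_selectors DOMAIN SELECTOR [SELECTOR...]
check_dkim_selectors() {
  local domain="$1"
  shift
  local selector record
  for selector in "$@"; do
    record="$(dig +short TXT "${selector}._domainkey.${domain}")"
    printf '%s: %s\n' "$selector" "${record:-NO RECORD FOUND}"
  done
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;For example: &lt;code&gt;check_dkim_selectors yourdomain.com google resend&lt;/code&gt;.&lt;/p&gt;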
&lt;h3 id="4-publish-a-cautious-dmarc-3-minutes"&gt;4. Publish a &lt;em&gt;cautious&lt;/em&gt; DMARC (3 minutes)&lt;/h3&gt;
&lt;p&gt;DMARC is one TXT record at &lt;code&gt;_dmarc.yourdomain.com&lt;/code&gt;. Start safe:&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code&gt;v=DMARC1; p=none; rua=mailto:dmarc-reports@yourdomain.com; adkim=r; aspf=r; pct=100
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Translation:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;p=none&lt;/code&gt; — do not block anything yet, just ask for reports.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;rua=mailto:&lt;/code&gt; — a real mailbox you actually read; &lt;em&gt;not&lt;/em&gt; a personal Gmail you ignore. Many founders use a forwarding alias like &lt;code&gt;dmarc-reports@yourdomain.com&lt;/code&gt; that lands in a labeled folder.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;adkim=r; aspf=r&lt;/code&gt; — relaxed alignment. Strict alignment is for later.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A 14-day &lt;code&gt;p=none&lt;/code&gt; window before you tighten anything is the difference between &amp;ldquo;I learned my newsletter platform sends as &lt;code&gt;mail.mydomain.com&lt;/code&gt;&amp;rdquo; and &amp;ldquo;I broke my newsletter for two days.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;After 14 days of clean reports — meaning every legitimate sender shows up in the reports as passing SPF &lt;em&gt;or&lt;/em&gt; DKIM aligned with &lt;code&gt;yourdomain.com&lt;/code&gt; — move to:&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code&gt;v=DMARC1; p=quarantine; rua=mailto:dmarc-reports@yourdomain.com; pct=25
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The &lt;code&gt;pct=25&lt;/code&gt; ramp is intentional. It means &amp;ldquo;quarantine 25% of messages that fail alignment&amp;rdquo; so you can detect any forgotten sender before going full &lt;code&gt;p=quarantine&lt;/code&gt; or &lt;code&gt;p=reject&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;If you are an indie founder, you may stop at &lt;code&gt;p=quarantine&lt;/code&gt; forever. &lt;code&gt;p=reject&lt;/code&gt; is for senders who are confident no legitimate mail anywhere uses their domain incorrectly.&lt;/p&gt;
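&lt;p&gt;When you are partway through the ramp, it is easy to forget which policy is actually live. This sketch pulls one tag out of a DMARC record string. &lt;code&gt;dmarc_tag&lt;/code&gt; is a hypothetical helper, and it assumes the record is a single well-formed &lt;code&gt;tag=value;&lt;/code&gt; list.&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;# Hypothetical helper: print the value of one tag from a DMARC record.
# Usage: dmarc_tag RECORD TAG
dmarc_tag() {
  printf '%s\n' "$1" \
    | tr -d '" ' \
    | tr ';' '\n' \
    | grep "^$2=" \
    | cut -d= -f2-
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Against live DNS: &lt;code&gt;dmarc_tag "$(dig +short TXT _dmarc.yourdomain.com)" p&lt;/code&gt; prints the published policy, and the same call with &lt;code&gt;pct&lt;/code&gt; prints the current ramp percentage.&lt;/p&gt;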
&lt;h3 id="5-verify-the-result-with-one-real-email-4-minutes"&gt;5. Verify the result with one real email (4 minutes)&lt;/h3&gt;
&lt;p&gt;Send one email to yourself at Gmail, Yahoo, and Outlook from each sending tool you care about most (founder mail, password reset, newsletter). Open the message header.&lt;/p&gt;
&lt;p&gt;You are looking for an &lt;code&gt;Authentication-Results&lt;/code&gt; line that says all three of:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;spf=pass&lt;/code&gt; with &lt;code&gt;smtp.mailfrom=&lt;/code&gt; showing &lt;code&gt;yourdomain.com&lt;/code&gt; or one of its subdomains (that is what relaxed alignment accepts), not a lookalike domain that merely contains the string.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;dkim=pass&lt;/code&gt; with &lt;code&gt;header.d=yourdomain.com&lt;/code&gt; (alignment) — &lt;em&gt;not&lt;/em&gt; &lt;code&gt;header.d=postmarkapp.com&lt;/code&gt; or &lt;code&gt;header.d=mailgun.org&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;dmarc=pass&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;code&gt;dkim=pass header.d=mailgun.org&lt;/code&gt; while your visible From: is &lt;code&gt;support@yourdomain.com&lt;/code&gt; is the most common deliverability bug among indie founders. The message is technically signed, but DMARC-wise it is unsigned by &lt;em&gt;your&lt;/em&gt; domain. Fix it by completing the provider&amp;rsquo;s &amp;ldquo;Use my own domain&amp;rdquo; / &amp;ldquo;Custom domain DKIM&amp;rdquo; configuration. Postmark, Mailgun, Resend, SendGrid, Mailchimp, ConvertKit, and AWS SES all support this; they just don&amp;rsquo;t enable it by default.&lt;/p&gt;
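&lt;p&gt;If you save a test message as a raw file (&amp;ldquo;Show original&amp;rdquo; in Gmail, then download), you can pull the signing domains out without squinting at headers. A minimal sketch; &lt;code&gt;dkim_signing_domains&lt;/code&gt; is a hypothetical helper and the file name is whatever you saved the message as.&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;# Hypothetical helper: list every header.d= domain found in a raw message file.
# Usage: dkim_signing_domains message.eml
dkim_signing_domains() {
  grep -Eo 'header\.d=[^ ;]+' "$1" | cut -d= -f2 | sort -u
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;If the output lists only your provider&amp;rsquo;s domain and never &lt;code&gt;yourdomain.com&lt;/code&gt;, that sender is signing but not aligned.&lt;/p&gt;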
&lt;h2 id="things-to-deliberately-ignore-in-v1"&gt;Things to deliberately ignore in v1&lt;/h2&gt;
&lt;p&gt;You do not need:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;BIMI. Useful only after DMARC is at &lt;code&gt;p=quarantine&lt;/code&gt; or stricter for a long time, and even then it is a logo-display feature, not a deliverability feature.&lt;/li&gt;
&lt;li&gt;ARC. Mailing-list specific.&lt;/li&gt;
&lt;li&gt;DKIM key rotation. Whatever your provider gave you is fine until they tell you to rotate.&lt;/li&gt;
&lt;li&gt;Per-subdomain DMARC strictness (&lt;code&gt;sp=&lt;/code&gt;). Default is fine until you operate dedicated sending subdomains.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You also do not need:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A paid &amp;ldquo;deliverability platform&amp;rdquo; subscription.&lt;/li&gt;
&lt;li&gt;A reputation-monitoring agency.&lt;/li&gt;
&lt;li&gt;An IP warmup schedule (you are using shared IPs from your ESP; they handle warmup).&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="common-gotchas-an-indie-founder-will-hit"&gt;Common gotchas an indie founder will hit&lt;/h2&gt;
&lt;p&gt;These are the failure modes I see most often when reviewing single-domain setups:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Two SPF records.&lt;/strong&gt; Often a leftover from when you were trying providers. Merge into one.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;+all&lt;/code&gt; left over from a Google guide that said &amp;ldquo;for testing only.&amp;rdquo;&lt;/strong&gt; Remove.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;DMARC &lt;code&gt;rua&lt;/code&gt; pointing at &lt;code&gt;you@yourdomain.com&lt;/code&gt; itself.&lt;/strong&gt; Your inbox will fill with unreadable XML aggregate reports. Use a sub-alias (&lt;code&gt;dmarc-reports@&lt;/code&gt;) that auto-files.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;DKIM &amp;ldquo;set up&amp;rdquo; but provider has signing disabled.&lt;/strong&gt; Toggle it on in the provider, and confirm with a real test message header.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Marketing tool added later, but DKIM never aligned.&lt;/strong&gt; New newsletter platform turns SPF green, leaves DKIM &lt;code&gt;header.d=&lt;/code&gt; pointing at the platform&amp;rsquo;s domain. DMARC fails alignment for that one tool.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Personal Gmail &amp;ldquo;Send mail as&amp;rdquo; alias used to reply from &lt;code&gt;you@yourdomain.com&lt;/code&gt;.&lt;/strong&gt; Even if Workspace is fine, that alias often sends as &lt;code&gt;gmail.com&lt;/code&gt; underneath. Reply-To is fine; the sending identity matters for alignment.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Subdomain forgotten.&lt;/strong&gt; Stripe receipts sometimes go through &lt;code&gt;mail.yourdomain.com&lt;/code&gt;. If subdomain SPF/DKIM is missing, mailbox providers can still apply the apex DMARC. Check at the &lt;em&gt;exact&lt;/em&gt; subdomain.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If any of those sound like a problem you cannot debug from your provider&amp;rsquo;s dashboard alone, that is the moment a second pair of eyes is worth more than another deliverability article.&lt;/p&gt;
&lt;h2 id="next-step-a-99-second-pair-of-eyes"&gt;Next step: a $99 second pair of eyes&lt;/h2&gt;
&lt;p&gt;Once you&amp;rsquo;ve done the 20-minute pass above, the question is usually not &lt;em&gt;&amp;ldquo;is the record there?&amp;rdquo;&lt;/em&gt; It&amp;rsquo;s &lt;em&gt;&amp;ldquo;are all these records aligned with the way I actually send mail?&amp;rdquo;&lt;/em&gt; That answer lives partly in DNS and partly in a few real message headers.&lt;/p&gt;
&lt;p&gt;If you&amp;rsquo;d like a written, prioritized fix list for one domain — SPF, DKIM, DMARC, MX, sender-tool inventory, and the obvious mistakes — that is exactly the &lt;a href="https://richgibbs.dev/quickcheck/inbox-dns/"&gt;Inbox/DNS QuickCheck&lt;/a&gt; we offer. $99, one domain, no DNS login needed, 24-hour turnaround. No managed retainers, no inbox-placement guarantees, no spam help.&lt;/p&gt;
&lt;p&gt;If you&amp;rsquo;d rather DIY but want the printable, fillable Markdown version of the entire process — sender inventory template, SPF builder, DKIM provider reference, DMARC ramp, Authentication-Results decoder — that&amp;rsquo;s the &lt;a href="https://gibbs21.gumroad.com/l/inbox-dns-pack"&gt;Indie Founder Email DNS Pack&lt;/a&gt;, $19 (pay what you want, $9 minimum) on Gumroad.&lt;/p&gt;
&lt;p&gt;That is also the point at which most founders realize there was one tool nobody remembered to align. That tool is almost always a marketing platform.&lt;/p&gt;
&lt;p&gt;You don&amp;rsquo;t have to buy anything to follow the checklist above. It is the whole working answer for most one-domain indie SaaS setups. The QuickCheck exists for when you&amp;rsquo;ve done the obvious and still have a quiet 5–10% of legitimate mail disappearing into spam, and you want a second set of eyes before you tighten DMARC further.&lt;/p&gt;
&lt;p&gt;Either way, the goal is the same: your password resets, your receipts, and your founder replies should reach the inbox. The boring DNS hygiene above is most of the answer.&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id="related-downloadable-pack"&gt;Related downloadable pack&lt;/h3&gt;
&lt;p&gt;If you&amp;rsquo;ve already finished the checklist above and tightened DMARC to &lt;code&gt;p=quarantine&lt;/code&gt;, and now a specific sender — newsletter tool, Stripe receipts, a sub-domain — has started being quarantined or hard-bounced (Gmail &lt;code&gt;5.7.26&lt;/code&gt;, Microsoft &lt;code&gt;5.7.509&lt;/code&gt; / &lt;code&gt;5.7.515&lt;/code&gt;), the &lt;strong&gt;DMARC Quarantine Pack&lt;/strong&gt; is the focused diagnostic runbook for that exact moment. It includes a DSN decoder cheat-sheet, three real-world incident walkthroughs (marketing-tool DKIM drift, forgotten sub-domain, forwarding/ARC breakage), and a single-file Python aggregate-XML reader so you can read your own DMARC reports without paying for a SaaS dashboard.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://gibbs21.gumroad.com/l/dmarc-quarantine-pack"&gt;DMARC Quarantine Pack — $29 on Gumroad&lt;/a&gt; · 14-day refund, no questions.&lt;/p&gt;</content:encoded>
      <guid isPermaLink="true">https://blog.richgibbs.dev/spf-dkim-dmarc-indie-founder-checklist/</guid>
      <category>email</category>
      <category>dns</category>
      <category>spf</category>
      <category>dkim</category>
      <category>dmarc</category>
      <category>deliverability</category>
      <category>indie-founder</category>
      <category>saas</category>
      <pubDate>Sun, 10 May 2026 04:30:00 +0000</pubDate>
    </item>
    <item>
      <title>Cloudflare Email Routing for indie founders: the 10-minute support@ setup</title>
      <link>https://blog.richgibbs.dev/cloudflare-email-routing-indie-founders-10-minute-setup/</link>
      <description>Stop paying $6/user/month for a Workspace seat to forward support@yourdomain.com. Cloudflare Email Routing does the same job for free, in ten minutes, with one caveat you need to know about.</description>
      <content:encoded>&lt;p&gt;You launched. Your domain has a website, a payment link, a privacy page that says &amp;ldquo;support@yourdomain.com,&amp;rdquo; and… no actual mailbox at that address. Real customer mail is silently bouncing.&lt;/p&gt;
&lt;p&gt;You don&amp;rsquo;t need a $6-per-user-per-month Workspace seat to fix this. Cloudflare Email Routing forwards &lt;code&gt;support@yourdomain.com&lt;/code&gt; (and any other alias you want) to a Gmail/Fastmail/Proton mailbox you already pay for, for free, in about ten minutes.&lt;/p&gt;
&lt;p&gt;This post is the boring, working playbook to set it up — plus the one thing it can&amp;rsquo;t do that surprises every founder who tries it for the first time.&lt;/p&gt;
&lt;h2 id="what-email-routing-actually-is"&gt;What Email Routing actually is&lt;/h2&gt;
&lt;p&gt;Cloudflare Email Routing is &lt;strong&gt;inbound-only forwarding&lt;/strong&gt; for any domain whose DNS lives at Cloudflare. You publish a few MX and TXT records that Cloudflare manages for you, define some routing rules in the dashboard (&amp;ldquo;send anything to &lt;code&gt;support@yourdomain.com&lt;/code&gt; to my Gmail&amp;rdquo;), and incoming mail gets re-injected into your real mailbox.&lt;/p&gt;
&lt;p&gt;What it is not:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Not a mailbox. You can&amp;rsquo;t log in to a Cloudflare interface to read mail.&lt;/li&gt;
&lt;li&gt;Not an outbound SMTP server. You can&amp;rsquo;t &lt;em&gt;send&lt;/em&gt; from &lt;code&gt;support@yourdomain.com&lt;/code&gt; through Email Routing. Replies will go from whatever mailbox you forwarded &lt;em&gt;into&lt;/em&gt;, unless you also configure your replying client (more on this below).&lt;/li&gt;
&lt;li&gt;Not a deliverability service. It accepts mail from the public internet and re-delivers it. SPF/DKIM/DMARC for your domain are still your job.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="the-10-minute-path"&gt;The 10-minute path&lt;/h2&gt;
&lt;p&gt;Two prerequisites:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Your domain&amp;rsquo;s nameservers are Cloudflare&amp;rsquo;s. (If they aren&amp;rsquo;t, follow Cloudflare&amp;rsquo;s &amp;ldquo;Add a site&amp;rdquo; flow first; it&amp;rsquo;s free, takes about 5 minutes plus DNS propagation.)&lt;/li&gt;
&lt;li&gt;You have a Gmail / Fastmail / Proton / Mailbox.org / etc. mailbox you actually read.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id="1-enable-email-routing-1-minute"&gt;1. Enable Email Routing (1 minute)&lt;/h3&gt;
&lt;p&gt;In the Cloudflare dashboard:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Pick your zone → &lt;strong&gt;Email&lt;/strong&gt; → &lt;strong&gt;Email Routing&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Get started&lt;/strong&gt;. Cloudflare will offer to add the required DNS records automatically. Say yes.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Cloudflare will publish:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Three MX records pointing at &lt;code&gt;route1.mx.cloudflare.net&lt;/code&gt;, &lt;code&gt;route2.mx.cloudflare.net&lt;/code&gt;, &lt;code&gt;route3.mx.cloudflare.net&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;A TXT record at the apex with &lt;code&gt;v=spf1 include:_spf.mx.cloudflare.net ~all&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;A DKIM TXT record (&lt;code&gt;cf2024-1._domainkey&lt;/code&gt;) for the routing service.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you already have an SPF record at the apex, &lt;strong&gt;stop and merge them by hand&lt;/strong&gt;. You should never have two SPF records. We&amp;rsquo;ll come back to that in the gotchas.&lt;/p&gt;
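&lt;p&gt;After Cloudflare finishes, it is worth confirming from outside the dashboard what is actually published. A rough sketch, assuming &lt;code&gt;dig&lt;/code&gt; is available; &lt;code&gt;check_routing_dns&lt;/code&gt; is a name made up for this example.&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;# Hypothetical helper: show the MX records and count SPF records at the apex.
check_routing_dns() {
  local domain="$1"
  echo "MX records:"
  dig +short MX "$domain"
  echo "SPF records (must be exactly 1): $(dig +short TXT "$domain" | grep -c 'v=spf1')"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Run &lt;code&gt;check_routing_dns yourdomain.com&lt;/code&gt;. You want the three Cloudflare route hosts under MX and an SPF count of 1. A count of 2 means the merge still needs doing.&lt;/p&gt;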
&lt;h3 id="2-verify-the-destination-address-2-minutes"&gt;2. Verify the destination address (2 minutes)&lt;/h3&gt;
&lt;p&gt;Still in &lt;strong&gt;Email&lt;/strong&gt; → &lt;strong&gt;Email Routing&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Destination addresses&lt;/strong&gt; → &lt;strong&gt;Add destination address&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Enter the personal mailbox you want forwarded mail to land in.&lt;/li&gt;
&lt;li&gt;Cloudflare emails a verification link. Click it.&lt;/li&gt;
&lt;li&gt;The destination&amp;rsquo;s status should flip to &lt;strong&gt;Verified&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You can verify multiple destinations and route different aliases to different mailboxes. Useful if &lt;code&gt;billing@&lt;/code&gt; should go to a finance address and &lt;code&gt;security@&lt;/code&gt; should go to a different one.&lt;/p&gt;
&lt;h3 id="3-add-a-routing-rule-1-minute"&gt;3. Add a routing rule (1 minute)&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Custom addresses&lt;/strong&gt; → &lt;strong&gt;Create address&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Custom address&lt;/code&gt;: &lt;code&gt;support&lt;/code&gt; (so the full address is &lt;code&gt;support@yourdomain.com&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Action&lt;/code&gt;: Send to an email.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Destination&lt;/code&gt;: pick the verified address.&lt;/li&gt;
&lt;li&gt;Save.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Repeat for every alias you advertise: &lt;code&gt;hello@&lt;/code&gt;, &lt;code&gt;billing@&lt;/code&gt;, &lt;code&gt;legal@&lt;/code&gt;, &lt;code&gt;security@&lt;/code&gt;, &lt;code&gt;dmarc-reports@&lt;/code&gt; (very useful, more on this in a minute).&lt;/p&gt;
&lt;h3 id="4-optional-catch-all-30-seconds"&gt;4. (Optional) Catch-all (30 seconds)&lt;/h3&gt;
&lt;p&gt;In &lt;strong&gt;Custom addresses&lt;/strong&gt; → &lt;strong&gt;Catch-all address&lt;/strong&gt;, set it to either:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Drop&lt;/em&gt; — anything not matched is silently dropped. Good for spam hygiene.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Send to&lt;/em&gt; — any unknown alias is forwarded to your fallback mailbox. Good if you advertise lots of aliases on signup forms and don&amp;rsquo;t want misspellings to bounce.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;There is no third option. Pick one. &amp;ldquo;Drop&amp;rdquo; is what most indie SaaS founders should use.&lt;/p&gt;
&lt;h3 id="5-send-a-real-test-1-minute"&gt;5. Send a real test (1 minute)&lt;/h3&gt;
&lt;p&gt;From a &lt;em&gt;different&lt;/em&gt; mailbox (not the destination — Gmail will helpfully suppress mail you sent to yourself), email &lt;code&gt;support@yourdomain.com&lt;/code&gt;. It should arrive at the destination within a few seconds, with the original sender preserved in the From: header.&lt;/p&gt;
&lt;p&gt;That&amp;rsquo;s the whole inbound setup. The rest of your ten minutes goes to getting a working &lt;strong&gt;outbound&lt;/strong&gt; support address, which is the part that catches everyone.&lt;/p&gt;
&lt;h2 id="the-one-thing-email-routing-does-not-do-and-what-to-do-instead"&gt;The one thing Email Routing does &lt;em&gt;not&lt;/em&gt; do — and what to do instead&lt;/h2&gt;
&lt;p&gt;Email Routing is &lt;strong&gt;inbound only&lt;/strong&gt;. If you reply to a customer&amp;rsquo;s email and you do nothing else, your reply will go from &lt;code&gt;your.personal@gmail.com&lt;/code&gt;, not from &lt;code&gt;support@yourdomain.com&lt;/code&gt;. The customer sees a different address than the one they wrote to, the conversation feels off, and you might also leak a personal address you didn&amp;rsquo;t mean to advertise.&lt;/p&gt;
&lt;p&gt;Three options, in order of effort:&lt;/p&gt;
&lt;h3 id="option-a-live-with-replies-coming-from-the-personal-mailbox"&gt;Option A — Live with replies coming from the personal mailbox&lt;/h3&gt;
&lt;p&gt;OK for a v0 SaaS while you have ten customers. Set the Reply-To header in your mail tool to &lt;code&gt;support@yourdomain.com&lt;/code&gt; so further replies route correctly. Most mail clients let you set a default Reply-To per identity. Customers will see your personal address in the visible From, which they will probably tolerate while you&amp;rsquo;re small.&lt;/p&gt;
&lt;h3 id="option-b-use-gmails-send-mail-as-with-a-real-outbound-smtp-server"&gt;Option B — Use Gmail&amp;rsquo;s &amp;ldquo;Send mail as&amp;rdquo; with a real outbound SMTP server&lt;/h3&gt;
&lt;p&gt;In Gmail: &lt;strong&gt;Settings&lt;/strong&gt; → &lt;strong&gt;Accounts and Import&lt;/strong&gt; → &lt;strong&gt;Send mail as&lt;/strong&gt; → &lt;strong&gt;Add another email address&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;You will need:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A real outbound SMTP host that authorizes you to send as &lt;code&gt;support@yourdomain.com&lt;/code&gt;. &lt;strong&gt;Gmail itself will not let you do this without an SMTP server&lt;/strong&gt;; the old path that sent external addresses through Gmail&amp;rsquo;s own servers is gone.&lt;/li&gt;
&lt;li&gt;An SMTP host can be a paid Workspace seat ($6/mo/user — the thing we just avoided), or a transactional ESP like Postmark / Resend / Mailgun / SES configured with your custom domain and DKIM.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you already use Postmark/Resend/Mailgun/SES for product mail, set up an authorized &amp;ldquo;transactional support&amp;rdquo; sender there and feed the SMTP credentials into Gmail&amp;rsquo;s Send-mail-as flow. Postmark and Resend both have specific docs for this. Now your replies go from &lt;code&gt;support@yourdomain.com&lt;/code&gt; over a path that aligns with DKIM.&lt;/p&gt;
&lt;h3 id="option-c-use-a-help-desk-tool-with-custom-domain-support"&gt;Option C — Use a help-desk tool with custom-domain support&lt;/h3&gt;
&lt;p&gt;Help Scout, Plain, Front, Missive, HubSpot Service. These all accept inbound mail forwarded to a tool-specific address (you point Cloudflare Email Routing at it instead of your Gmail) and send outbound replies as &lt;code&gt;support@yourdomain.com&lt;/code&gt; with their own DKIM you authorize. Per-seat pricing varies; some have free tiers up to a few mailboxes.&lt;/p&gt;
&lt;p&gt;For an indie SaaS at 0–500 customers, Option B is usually the sweet spot. For a two- or three-person team that wants conversation handling, Option C earns its keep.&lt;/p&gt;
&lt;h2 id="common-gotchas"&gt;Common gotchas&lt;/h2&gt;
&lt;p&gt;These are the things I see indie founders get wrong with Email Routing.&lt;/p&gt;
&lt;h3 id="gotcha-1-dual-spf-records"&gt;Gotcha 1: dual SPF records&lt;/h3&gt;
&lt;p&gt;If your DNS already had an SPF record (because you set up Postmark or Mailgun before adding Email Routing), Cloudflare will silently publish a &lt;em&gt;second&lt;/em&gt; one. Conforming receivers will treat dual SPF as &lt;code&gt;permerror&lt;/code&gt; and ignore both. Result: legitimate inbound delivery may still work via MX, but your &lt;em&gt;outbound&lt;/em&gt; SPF alignment quietly breaks.&lt;/p&gt;
&lt;p&gt;Fix: keep one record at the apex. If you also send via Postmark and Mailgun and have routing on:&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code&gt;v=spf1 include:_spf.mx.cloudflare.net include:spf.mtasv.net include:_spf.mailgun.org ~all
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Verify with &lt;code&gt;dig +short TXT yourdomain.com | grep spf1&lt;/code&gt;. You should see exactly one line.&lt;/p&gt;
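&lt;p&gt;If you would rather build the merged record than hand-edit it, here is a minimal Python sketch. The &lt;code&gt;merge_spf&lt;/code&gt; helper is hypothetical (it is not part of Cloudflare or any ESP tooling). It assumes each input is a complete &lt;code&gt;v=spf1&lt;/code&gt; TXT string and that the first &lt;code&gt;all&lt;/code&gt; term it sees should win. It does not validate syntax or count DNS lookups.&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code&gt;ALL_TERMS = ("all", "+all", "-all", "~all", "?all")


def merge_spf(records):
    """Merge several SPF TXT strings into a single record (sketch only)."""
    mechanisms = []
    final_all = None
    for record in records:
        parts = record.split()
        if not parts or parts[0].lower() != "v=spf1":
            continue  # not an SPF record, ignore it
        for term in parts[1:]:
            if term.lower() in ALL_TERMS:
                # Assumption: the first "all" policy seen wins.
                final_all = final_all or term
            elif term not in mechanisms:
                mechanisms.append(term)  # keep order, drop duplicates
    return " ".join(["v=spf1"] + mechanisms + [final_all or "~all"])


merged = merge_spf([
    "v=spf1 include:_spf.mx.cloudflare.net ~all",
    "v=spf1 include:spf.mtasv.net include:_spf.mailgun.org ~all",
])
print(merged)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Publish the printed value as the only SPF TXT record at the apex and delete the originals.&lt;/p&gt;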
&lt;h3 id="gotcha-2-forwarded-mail-lands-in-spam"&gt;Gotcha 2: forwarded mail lands in spam&lt;/h3&gt;
&lt;p&gt;Forwarding rewrites the SMTP envelope. The original sender&amp;rsquo;s SPF/DKIM may no longer align by the time Gmail receives the forwarded copy. Symptoms: real customer mail to &lt;code&gt;support@&lt;/code&gt; shows up in Gmail&amp;rsquo;s Spam folder.&lt;/p&gt;
&lt;p&gt;Fix:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;In Gmail, open one such message → &lt;strong&gt;More&lt;/strong&gt; → &lt;strong&gt;Filter messages like this&lt;/strong&gt; → set criteria to &lt;code&gt;to:support@yourdomain.com OR deliveredto:support@yourdomain.com&lt;/code&gt;, then &lt;strong&gt;Never send to spam&lt;/strong&gt; + apply a &lt;code&gt;Support&lt;/code&gt; label + optionally categorize as Primary.&lt;/li&gt;
&lt;li&gt;Test from a non-destination address; do not test from another mailbox owned by the same Google account.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is a Gmail-side fix, not a Cloudflare problem. Email Routing sets ARC headers correctly; some receivers still ding forwarded mail.&lt;/p&gt;
&lt;h3 id="gotcha-3-dmarc-reports-vanish"&gt;Gotcha 3: DMARC reports vanish&lt;/h3&gt;
&lt;p&gt;You set up DMARC at &lt;code&gt;_dmarc.yourdomain.com&lt;/code&gt; with &lt;code&gt;rua=mailto:dmarc-reports@yourdomain.com&lt;/code&gt;. That alias must actually route somewhere. If you forgot to add a Cloudflare Email Routing rule for &lt;code&gt;dmarc-reports&lt;/code&gt;, the reports get dropped, and you&amp;rsquo;ll think DMARC is broken when really you just have no inbox to read the reports in.&lt;/p&gt;
&lt;p&gt;Fix: add &lt;code&gt;dmarc-reports@&lt;/code&gt; as a routed alias. In Gmail set a filter to auto-label and skip the inbox. Aggregate reports are XML and noisy.&lt;/p&gt;
&lt;h3 id="gotcha-4-your-send-mail-as-alias-still-routes-through-gmailcom"&gt;Gotcha 4: your &amp;ldquo;send mail as&amp;rdquo; alias still routes through gmail.com&lt;/h3&gt;
&lt;p&gt;Even with Send-mail-as configured, if you don&amp;rsquo;t enable &amp;ldquo;Treat as alias&amp;rdquo; or you don&amp;rsquo;t use a true outbound SMTP host, Gmail will sometimes send through &lt;code&gt;gmail.com&lt;/code&gt; and tag it as a forwarded sender. The visible From: looks right, but &lt;code&gt;Authentication-Results&lt;/code&gt; will tell on you.&lt;/p&gt;
&lt;p&gt;Fix: read the Authentication-Results header on a real reply (View original in Gmail). You want &lt;code&gt;dkim=pass header.d=yourdomain.com&lt;/code&gt;, not &lt;code&gt;header.d=gmail.com&lt;/code&gt;.&lt;/p&gt;
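&lt;p&gt;The same check can be scripted once you have copied the header out of View original. This is a rough Python sketch, not a full header parser. The function names are made up for illustration, the sample header is invented, and the alignment test is a simplification that accepts the From: domain or any subdomain of it. Some receivers report the signing domain as &lt;code&gt;header.i=@domain&lt;/code&gt; instead of &lt;code&gt;header.d=domain&lt;/code&gt;, so the pattern accepts both.&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code&gt;import re

# Matches a passing DKIM result and captures its signing domain.
# Assumption: the receiver reports header.d=domain or header.i=@domain.
DKIM_PASS = re.compile(
    r"dkim=pass\b[^;]*?header\.(?:d|i)=(?:[^@\s;]*@)?([\w.-]+)",
    re.IGNORECASE,
)


def dkim_pass_domains(auth_results):
    """Return the signing domain of every dkim=pass entry in the header."""
    return [domain.lower() for domain in DKIM_PASS.findall(auth_results)]


def reply_is_aligned(auth_results, from_domain):
    """True if a passing DKIM signature is for from_domain or a subdomain."""
    from_domain = from_domain.lower()
    return any(
        domain == from_domain or domain.endswith("." + from_domain)
        for domain in dkim_pass_domains(auth_results)
    )


# Invented sample: a reply that went out through gmail.com.
header = (
    "mx.google.com; dkim=pass header.i=@gmail.com header.s=20230601; "
    "spf=pass smtp.mailfrom=your.personal@gmail.com; "
    "dmarc=pass header.from=gmail.com"
)
print(dkim_pass_domains(header))
print(reply_is_aligned(header, "yourdomain.com"))
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;If the second line prints &lt;code&gt;False&lt;/code&gt; for a reply you sent as &lt;code&gt;support@yourdomain.com&lt;/code&gt;, the message was signed by someone other than your domain.&lt;/p&gt;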
&lt;h3 id="gotcha-5-paid-workspace-already-exists-for-the-domain"&gt;Gotcha 5: paid Workspace already exists for the domain&lt;/h3&gt;
&lt;p&gt;If your domain already has Google Workspace or Microsoft 365 MX records, the Cloudflare dashboard will warn before overwriting them. &lt;strong&gt;Do not click through that warning&lt;/strong&gt; unless you intend to abandon the existing mailbox. Cloudflare&amp;rsquo;s MX records replace whatever was there, including your live Workspace inbox.&lt;/p&gt;
&lt;p&gt;Fix: pick one. Either keep Workspace and don&amp;rsquo;t enable Email Routing, or commit to moving every address over to Email Routing.&lt;/p&gt;
&lt;h2 id="authentication-after-email-routing-is-on"&gt;Authentication after Email Routing is on&lt;/h2&gt;
&lt;p&gt;Once routing is live, &lt;code&gt;dig&lt;/code&gt; your domain. You should see exactly:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;3 MX records → &lt;code&gt;route{1,2,3}.mx.cloudflare.net&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;1 SPF TXT (only) at the apex.&lt;/li&gt;
&lt;li&gt;1 DKIM TXT (&lt;code&gt;cf2024-1._domainkey&lt;/code&gt;) for the routing service.&lt;/li&gt;
&lt;li&gt;Your existing DKIMs from product/transactional senders.&lt;/li&gt;
&lt;li&gt;1 DMARC TXT at &lt;code&gt;_dmarc&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If any of those is duplicated or missing, you have homework. The matching &lt;a href="/spf-dkim-dmarc-indie-founder-checklist/"&gt;SPF/DKIM/DMARC checklist&lt;/a&gt; walks the rest.&lt;/p&gt;
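&lt;p&gt;The record counts in that list are easy to check by script. Here is a minimal Python sketch that takes the lines you get from &lt;code&gt;dig +short&lt;/code&gt; and reports what is off. &lt;code&gt;audit_dns&lt;/code&gt; is a hypothetical helper, the MX priorities in the sample are invented, and DKIM is left out because the selectors differ per sender.&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code&gt;EXPECTED_MX = {
    "route1.mx.cloudflare.net",
    "route2.mx.cloudflare.net",
    "route3.mx.cloudflare.net",
}


def audit_dns(mx_lines, apex_txt, dmarc_txt):
    """Return a list of problems; an empty list means the counts look right."""
    problems = []
    # dig +short MX prints "priority host."; keep the host without the dot.
    hosts = {line.split()[-1].rstrip(".").lower() for line in mx_lines if line.strip()}
    if hosts != EXPECTED_MX:
        problems.append("MX records are not exactly the three Cloudflare routes")
    # dig +short TXT wraps each value in double quotes, so strip them first.
    spf = [t for t in apex_txt if t.strip('"').lower().startswith("v=spf1")]
    if len(spf) != 1:
        problems.append("expected 1 SPF record at the apex, found %d" % len(spf))
    dmarc = [t for t in dmarc_txt if t.strip('"').lower().startswith("v=dmarc1")]
    if len(dmarc) != 1:
        problems.append("expected 1 DMARC record, found %d" % len(dmarc))
    return problems


report = audit_dns(
    ["13 route1.mx.cloudflare.net.", "86 route2.mx.cloudflare.net.", "24 route3.mx.cloudflare.net."],
    ['"v=spf1 include:_spf.mx.cloudflare.net ~all"', '"v=spf1 include:spf.mtasv.net ~all"'],
    ['"v=DMARC1; p=none; rua=mailto:dmarc-reports@yourdomain.com"'],
)
print(report)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The sample input has two SPF records on purpose, so the report flags the dual-SPF case from Gotcha 1.&lt;/p&gt;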
&lt;h2 id="when-to-upgrade-past-email-routing"&gt;When to upgrade past Email Routing&lt;/h2&gt;
&lt;p&gt;Email Routing is the right answer when:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You need 1–10 aliases on one domain.&lt;/li&gt;
&lt;li&gt;Volume is &amp;ldquo;real customer mail,&amp;rdquo; not &amp;ldquo;we send 50k newsletters a month from this address.&amp;rdquo;&lt;/li&gt;
&lt;li&gt;A 5–60 second delay on inbound is fine.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Outgrow it when:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You want a shared inbox for two or more people without forwarding to the same Gmail.&lt;/li&gt;
&lt;li&gt;You need calendar/contacts/Drive on the domain (that&amp;rsquo;s Workspace&amp;rsquo;s actual value, not the email forwarding).&lt;/li&gt;
&lt;li&gt;You need server-to-server inbound webhooks (Email Routing supports a &amp;ldquo;Send to a Worker&amp;rdquo; action for this; useful but past the 10-minute mark).&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="want-a-written-second-pair-of-eyes-on-your-setup"&gt;Want a written second pair of eyes on your setup?&lt;/h2&gt;
&lt;p&gt;Once routing is live and SPF/DKIM/DMARC are published, the most useful thing you can do is verify &lt;em&gt;every authorized sender&lt;/em&gt; is aligned with your visible From: address. That&amp;rsquo;s exactly the &lt;a href="https://richgibbs.dev/quickcheck/inbox-dns/"&gt;Inbox/DNS QuickCheck&lt;/a&gt; — a $99 written report on one domain, delivered within 24 hours, no DNS login required.&lt;/p&gt;
&lt;p&gt;If you&amp;rsquo;d rather DIY the whole thing, the same content in printable, fillable Markdown form (sender inventory template, SPF builder, DMARC ramp, Authentication-Results decoder) is in the &lt;a href="https://gibbs21.gumroad.com/l/inbox-dns-pack"&gt;Indie Founder Email DNS Pack&lt;/a&gt; — $19 (pay what you want, $9 minimum) on Gumroad. Either is fine. The point is to do it once, well, then never think about it again.&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id="related-downloadable-pack"&gt;Related downloadable pack&lt;/h3&gt;
&lt;p&gt;If you set up Cloudflare Email Routing and &lt;em&gt;also&lt;/em&gt; publish DMARC at &lt;code&gt;p=quarantine&lt;/code&gt; or stricter, a small but real failure mode is forwarded mail that breaks SPF or DKIM alignment at the receiving mailbox. The &lt;strong&gt;DMARC Quarantine Pack&lt;/strong&gt; ($29) is the focused runbook for diagnosing that and related cases — Gmail &lt;code&gt;5.7.26&lt;/code&gt; and Microsoft &lt;code&gt;5.7.509&lt;/code&gt; / &lt;code&gt;5.7.515&lt;/code&gt; decoded with source citations, three incident walkthroughs (one of them is a forwarding/ARC case), and a single-file Python aggregate-XML reader for reading your own reports.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://gibbs21.gumroad.com/l/dmarc-quarantine-pack"&gt;DMARC Quarantine Pack — $29 on Gumroad&lt;/a&gt; · 14-day refund, no questions.&lt;/p&gt;</content:encoded>
      <guid isPermaLink="true">https://blog.richgibbs.dev/cloudflare-email-routing-indie-founders-10-minute-setup/</guid>
      <category>email</category>
      <category>cloudflare</category>
      <category>email-routing</category>
      <category>dns</category>
      <category>indie-founder</category>
      <category>saas</category>
      <category>support</category>
      <pubDate>Sun, 10 May 2026 05:00:00 +0000</pubDate>
    </item>
    <item>
      <title>I had 80,000 unread emails. Here's the cleanup playbook (no apps, no OAuth)</title>
      <link>https://blog.richgibbs.dev/i-had-80000-unread-emails-cleanup-playbook/</link>
      <description>A working, non-SaaS playbook for clearing tens of thousands of old unread emails from a personal Gmail. Survey first, delete second. The 30-day Trash window is your safety net.</description>
      <content:encoded>&lt;p&gt;Last weekend I sat down to clean out my personal Gmail.&lt;/p&gt;
&lt;p&gt;I had 80,675 unread messages older than one year. Most were mail I&amp;rsquo;d long since stopped caring about: newsletters from companies I no longer use, receipts from a 2019 ride-share account, password-reset emails from accounts that no longer exist, every &amp;ldquo;weekly digest&amp;rdquo; I&amp;rsquo;d ever opted into and forgotten.&lt;/p&gt;
&lt;p&gt;The cleanup itself took about 20 minutes once I had a plan. The &lt;em&gt;having a plan&lt;/em&gt; part took three evenings.&lt;/p&gt;
&lt;p&gt;This post is the playbook I actually used. It isn&amp;rsquo;t a SaaS pitch. It doesn&amp;rsquo;t ask you to log into anything. It&amp;rsquo;s the boring, working sequence for someone who has tens of thousands of old unread emails in a personal Gmail and wants them gone tonight, without nuking something they&amp;rsquo;ll wish they&amp;rsquo;d kept.&lt;/p&gt;
&lt;h2 id="why-most-inbox-zero-advice-fails-on-a-real-mailbox"&gt;Why most &amp;ldquo;inbox zero&amp;rdquo; advice fails on a real mailbox&lt;/h2&gt;
&lt;p&gt;If you Google &amp;ldquo;how to delete old unread emails Gmail bulk,&amp;rdquo; you get three kinds of answers:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;SaaS apps that want full mailbox OAuth.&lt;/strong&gt; Mailstrom, Clean Email, Cleanfox. They work, but the permission cost is large for a job that runs once.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Blog posts from 2014.&lt;/strong&gt; They reference Outlook 2010, IMAP folders, and Gmail&amp;rsquo;s old desktop UI. The screenshots don&amp;rsquo;t match anything you actually see today.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&amp;ldquo;5 tips&amp;rdquo; listicles&lt;/strong&gt; that assume you have 800 unread emails, not 80,000. The &amp;ldquo;select all&amp;rdquo; trick doesn&amp;rsquo;t survive paging through 1,600 pages of 50 messages each.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;None of these are useful when you&amp;rsquo;re staring down a five-figure unread count.&lt;/p&gt;
&lt;p&gt;The reason isn&amp;rsquo;t the operations — Gmail&amp;rsquo;s &lt;code&gt;older_than:&lt;/code&gt; operator does most of the heavy lifting. The reason is &lt;strong&gt;order&lt;/strong&gt;. If you don&amp;rsquo;t survey first, you&amp;rsquo;ll start deleting things you wanted to keep, panic, stop halfway, and end up with a mailbox that&amp;rsquo;s somehow worse than when you started.&lt;/p&gt;
&lt;h2 id="the-order-i-followed-and-you-can-copy"&gt;The order I followed (and you can copy)&lt;/h2&gt;
&lt;p&gt;The whole playbook is five steps. None of them involve installing anything.&lt;/p&gt;
&lt;h3 id="1-survey-before-you-touch-anything"&gt;1. Survey before you touch anything&lt;/h3&gt;
&lt;p&gt;Open Gmail. Open the search bar. Run these queries one at a time and write the result count down on a piece of paper:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;is:unread older_than:1y&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;is:unread older_than:3y&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;from:newsletters older_than:6m&lt;/code&gt; (or substitute a sender domain you know is noise)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;has:attachment older_than:2y larger:10M&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;category:promotions older_than:6m&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Five numbers. That&amp;rsquo;s your map.&lt;/p&gt;
&lt;p&gt;If the first number is over 5,000, congratulations — you have the same shape of problem most indie founders have. Mine was 80,675 against the first query. Yours will be different but the playbook scales.&lt;/p&gt;
&lt;p&gt;If you want a more thorough survey — top 20 senders, oldest cohorts by year, attachment age buckets, label sprawl — that&amp;rsquo;s what the &lt;a href="https://richgibbs.dev/quickcheck/inbox-cleanup/"&gt;Inbox Cleanup Pack&lt;/a&gt; ships: a small read-only shell script that calls the Gmail API under your own OAuth client and writes a single &lt;code&gt;survey.json&lt;/code&gt; file. No message bodies, no subjects, no message ids. Just counts. You can also do the survey by hand with the queries above; the script just makes it faster on big mailboxes.&lt;/p&gt;
&lt;h3 id="2-filter-the-recurring-noise-first"&gt;2. Filter the recurring noise &lt;em&gt;first&lt;/em&gt;&lt;/h3&gt;
&lt;p&gt;Before you delete anything, kill the inbound flow.&lt;/p&gt;
&lt;p&gt;Open Gmail → Settings → Filters and Blocked Addresses → Create a new filter.&lt;/p&gt;
&lt;p&gt;For each of the top 5 newsletter senders you can name off the top of your head, create a filter:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;from:(news@somecompany.com)&lt;/code&gt; → &amp;ldquo;Skip the Inbox&amp;rdquo; + &amp;ldquo;Mark as read&amp;rdquo; + &amp;ldquo;Apply label: Newsletters&amp;rdquo; + &amp;ldquo;Also apply filter to existing matching conversations.&amp;rdquo;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The &amp;ldquo;also apply&amp;rdquo; checkbox is the part most people miss. It silently archives the existing 4,000 unread newsletters from that sender in one click. No manual select-all required.&lt;/p&gt;
&lt;p&gt;Repeat this for your top 5–10 noisy senders. You&amp;rsquo;ll be surprised how much of the unread count is concentrated in a small number of domains. In my case, six senders accounted for 41% of the 80,675.&lt;/p&gt;
&lt;h3 id="3-bulk-archive-the-old-promotions-cohort"&gt;3. Bulk-archive the old promotions cohort&lt;/h3&gt;
&lt;p&gt;Now that the inbound is filtered, attack the standing cohort.&lt;/p&gt;
&lt;p&gt;In the search bar:&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code&gt;is:unread category:promotions older_than:1y
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Click the small &lt;strong&gt;&amp;ldquo;Select all conversations that match this search&amp;rdquo;&lt;/strong&gt; link that appears above the message list. (The plain &amp;ldquo;select all&amp;rdquo; checkbox at the top only ticks the 50 visible messages — this is the most common gotcha and the reason people quit halfway.)&lt;/p&gt;
&lt;p&gt;Then &lt;strong&gt;Archive&lt;/strong&gt;, not Delete. Archiving keeps the messages in All Mail; deleting moves them to Trash. For promotions older than a year, archive is enough — they&amp;rsquo;ll never come up in your inbox again unless you specifically search for them.&lt;/p&gt;
&lt;h3 id="4-delete-the-truly-dead-with-older_than-and-a-safety-net"&gt;4. Delete the truly dead — with &lt;code&gt;older_than:&lt;/code&gt; and a safety net&lt;/h3&gt;
&lt;p&gt;For the cohort that genuinely has no future use:&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code&gt;is:unread older_than:2y
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Same &amp;ldquo;Select all conversations that match this search&amp;rdquo; link, then &lt;strong&gt;Delete&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Two things to know:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Gmail&amp;rsquo;s Trash auto-purges after 30 days.&lt;/strong&gt; That&amp;rsquo;s your real undo window. If you delete 50,000 messages today, you have 30 days to walk into Trash and pull anything important back.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Deleting from Gmail does not delete your Google Takeout exports.&lt;/strong&gt; If you&amp;rsquo;ve ever exported your mail with Takeout, that snapshot still exists wherever you saved it (Drive, if that was the destination you picked). The Trash purge is a Gmail-only concept.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I deliberately did not delete anything younger than two years. The marginal value of &amp;ldquo;unread receipt from 2024&amp;rdquo; is small but non-zero — there&amp;rsquo;s still a chance you&amp;rsquo;ll need to find one. The marginal value of &amp;ldquo;unread newsletter from 2019&amp;rdquo; is zero.&lt;/p&gt;
&lt;h3 id="5-set-up-the-maintenance-youll-actually-keep"&gt;5. Set up the maintenance you&amp;rsquo;ll actually keep&lt;/h3&gt;
&lt;p&gt;The cleanup is a one-day job. Maintenance is what keeps you from being here again in two years.&lt;/p&gt;
&lt;p&gt;Three kinds of filter that survive long-term:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;One filter per platform that sends transactional mail you don&amp;rsquo;t read in real time (Stripe receipts, Vercel deploy notifications, GitHub digest mails) → &amp;ldquo;Skip the Inbox&amp;rdquo; + &amp;ldquo;Apply label: Transactional.&amp;rdquo;&lt;/li&gt;
&lt;li&gt;One filter for &lt;code&gt;unsubscribe&lt;/code&gt; in body → &amp;ldquo;Apply label: Newsletter.&amp;rdquo; This labels every newsletter going forward without skipping the inbox; once a quarter you can sweep the label.&lt;/li&gt;
&lt;li&gt;One filter for &lt;code&gt;from:(*@yourdomain.com)&lt;/code&gt; → &amp;ldquo;Star&amp;rdquo; or &amp;ldquo;Mark as important.&amp;rdquo; Mail from your own domain to yourself is almost always something you actually wanted to act on.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Don&amp;rsquo;t go past three. Filter sprawl is the second-biggest cause of inbox bankruptcy, after newsletter sprawl.&lt;/p&gt;
&lt;h2 id="what-i-deliberately-did-not-do"&gt;What I deliberately did not do&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;No third-party apps.&lt;/strong&gt; No Mailstrom, no Clean Email. They work for the people they fit; they&amp;rsquo;re the wrong shape for a one-time cleanup.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;No &amp;ldquo;inbox zero&amp;rdquo; rules.&lt;/strong&gt; Inbox zero is a discipline, not a software problem. Either you&amp;rsquo;ll keep up or you won&amp;rsquo;t; no app changes that.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;No deletion of mail younger than two years&lt;/strong&gt; — too much chance of needing one of them.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;No bulk-unsubscribe service.&lt;/strong&gt; Most of them either MITM your unsubscribe (and re-sell the implied opt-in signal) or get blocked by sender reputation systems. Manual unsubscribe from the noisiest five senders, then filter the rest, beats a bulk service every time.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;No DNS or sender-config changes.&lt;/strong&gt; That&amp;rsquo;s a different problem — see the &lt;a href="https://gibbs21.gumroad.com/l/inbox-dns-pack"&gt;Inbox/DNS Pack&lt;/a&gt; and &lt;a href="https://richgibbs.dev/quickcheck/inbox-dns/"&gt;Inbox/DNS QuickCheck&lt;/a&gt; for the SPF/DKIM/DMARC side.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="the-boring-summary"&gt;The boring summary&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Survey before you touch anything.&lt;/li&gt;
&lt;li&gt;Filter the recurring noise first, with &amp;ldquo;Also apply filter to existing matching conversations&amp;rdquo; checked.&lt;/li&gt;
&lt;li&gt;Archive (not delete) the old Promotions cohort.&lt;/li&gt;
&lt;li&gt;Delete the cohort older than two years, knowing the 30-day Trash window is your safety net.&lt;/li&gt;
&lt;li&gt;Three maintenance filters, no more.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That&amp;rsquo;s the whole playbook. It&amp;rsquo;s enough for most personal Gmails carrying tens of thousands of unread.&lt;/p&gt;
&lt;h2 id="want-this-packaged"&gt;Want this packaged?&lt;/h2&gt;
&lt;p&gt;If you&amp;rsquo;d rather have the survey script, the printable Markdown version of the cleanup order, the full filter templates, and the cohort-by-cohort cleanup-order I followed, that&amp;rsquo;s exactly the &lt;a href="https://richgibbs.dev/quickcheck/inbox-cleanup/"&gt;Inbox Cleanup Pack&lt;/a&gt; — $19 (pay-what-you-want, $9 minimum) on Gumroad.&lt;/p&gt;
&lt;p&gt;If you&amp;rsquo;d rather hand me the counts-only &lt;code&gt;survey.json&lt;/code&gt; from the script and get a written, prioritized cleanup plan tailored to your mailbox in 24 hours, that&amp;rsquo;s the $79 &lt;a href="https://richgibbs.dev/quickcheck/inbox-cleanup/"&gt;Inbox Cleanup QuickCheck&lt;/a&gt;. I never see message content; just the counts.&lt;/p&gt;
&lt;p&gt;If you&amp;rsquo;re a small Workspace team with up to 10 mailboxes — typical pre-migration scenario — the $499 &lt;a href="https://richgibbs.dev/quickcheck/inbox-cleanup/"&gt;Enterprise tier&lt;/a&gt; handles it under your own internal-app OAuth path, no third-party permissions added.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;strong&gt;Either way, the playbook above is the working answer for most personal Gmails.&lt;/strong&gt; The product exists for the case where you&amp;rsquo;d rather pay $19 for a pre-written cleanup order than reverse-engineer one yourself, or pay $79 for a custom plan, or have a teammate run the same survey across 10 mailboxes before a migration.&lt;/p&gt;
&lt;p&gt;The 30-day Trash window is your safety net. Use it.&lt;/p&gt;
&lt;p&gt;— Rich&lt;/p&gt;</content:encoded>
      <guid isPermaLink="true">https://blog.richgibbs.dev/i-had-80000-unread-emails-cleanup-playbook/</guid>
      <category>email</category>
      <category>gmail</category>
      <category>inbox-cleanup</category>
      <category>inbox-zero</category>
      <category>indie-founder</category>
      <category>productivity</category>
      <pubDate>Sun, 10 May 2026 17:11:52 +0000</pubDate>
    </item>
    <item>
      <title>I wouldn't give a SaaS my Gmail to clean it. Here's the 30-line read-only alternative.</title>
      <link>https://blog.richgibbs.dev/delete-thousands-emails-gmail-without-oauth-scope-creep/</link>
      <description>The Survey-then-Delete method for cleaning a five-figure unread Gmail backlog using a read-only script that runs under your own Google account. No third-party OAuth scope creep, no mailbox handed to a SaaS, no tokens leaving your laptop.</description>
      <content:encoded>&lt;p&gt;I sat down three weekends in a row to clean out my personal Gmail.&lt;/p&gt;
&lt;p&gt;The first two weekends I did what most people do. I opened the &amp;ldquo;free inbox cleaner&amp;rdquo; tab everyone keeps tweeting about, read the OAuth consent screen — &lt;em&gt;Read, compose, send, and permanently delete all your email from Gmail&lt;/em&gt; — closed the tab, and went back to scrolling. The cost felt wrong for a job I&amp;rsquo;d only run once.&lt;/p&gt;
&lt;p&gt;The third weekend I wrote my own script. Survey-only, read-only, runs under my own Google account, no tokens ever leave my laptop. The actual cleanup, once I had a plan, took less than half an hour: I cleared 80,675 messages older than a year, archived another 14,000-odd, and built three filters that have kept the backlog at zero ever since.&lt;/p&gt;
&lt;p&gt;This is what I learned about doing it without handing a stranger the keys to my mailbox.&lt;/p&gt;
&lt;h2 id="the-oauth-scope-problem-nobody-wants-to-say-out-loud"&gt;The OAuth scope problem nobody wants to say out loud&lt;/h2&gt;
&lt;p&gt;A free email cleanup app is not, in fact, free.&lt;/p&gt;
&lt;p&gt;When you click &amp;ldquo;Sign in with Google&amp;rdquo; and the consent screen asks for &lt;code&gt;https://mail.google.com/&lt;/code&gt; — that&amp;rsquo;s the &lt;strong&gt;all-of-Gmail scope&lt;/strong&gt;. It&amp;rsquo;s not &amp;ldquo;look at counts.&amp;rdquo; It&amp;rsquo;s not &amp;ldquo;look at senders.&amp;rdquo; It&amp;rsquo;s &lt;em&gt;read every message, write every message, delete every message, send mail as you to anyone&lt;/em&gt;. There is no narrower scope that lets a third-party app do bulk cleanup the way most of these tools do it.&lt;/p&gt;
&lt;p&gt;A few honest consequences of granting that scope:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;The app&amp;rsquo;s server can read message bodies, attachments, calendar invites, and 2FA codes&lt;/strong&gt; at any time it holds a valid token. Most don&amp;rsquo;t &lt;em&gt;advertise&lt;/em&gt; doing that. The capability exists either way.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;OAuth refresh tokens are long-lived by default.&lt;/strong&gt; Google&amp;rsquo;s generally stay valid until you revoke access or the token sits unused for about six months. Removing the app from your Google account dashboard kills the token, not the copies of your mail the vendor already pulled. If the vendor&amp;rsquo;s database was already scraped, the bird has flown.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;You are now downstream&lt;/strong&gt; of every breach that vendor will ever have. The 2014–2024 history of mailbox-OAuth apps is not encouraging on that point: look up any of the big-name &amp;ldquo;smart inbox&amp;rdquo; companies and you&amp;rsquo;ll find at least one incident.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This isn&amp;rsquo;t a hit piece on any specific tool. I won&amp;rsquo;t name any. The economics of &amp;ldquo;free, ad-funded inbox cleaner with full mailbox OAuth&amp;rdquo; are the same regardless of who&amp;rsquo;s running it. The product is the inbox.&lt;/p&gt;
&lt;p&gt;For a recurring assistant you trust — a calendar app, a CRM you live in — that scope is sometimes a fair trade. For a &lt;em&gt;one-time cleanup&lt;/em&gt;, it isn&amp;rsquo;t. The right tool for a one-time job is one that doesn&amp;rsquo;t outlive the job.&lt;/p&gt;
&lt;h2 id="survey-then-delete-the-methodology"&gt;Survey-then-Delete: the methodology&lt;/h2&gt;
&lt;p&gt;Here&amp;rsquo;s the method that actually worked. I call it &lt;strong&gt;Survey-then-Delete&lt;/strong&gt; because reversing those two words is what causes most cleanups to fail halfway.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Survey, counts only.&lt;/strong&gt; Don&amp;rsquo;t look at bodies. Don&amp;rsquo;t look at subjects. Don&amp;rsquo;t even pull message IDs. Just ask Gmail &amp;ldquo;how many messages match this query?&amp;rdquo; for a handful of useful queries — top senders, age cohorts, attachment sizes.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Identify the top 12 senders&lt;/strong&gt; by volume. In every five-figure mailbox I&amp;rsquo;ve audited, fewer than 15 senders account for 40–70% of the noise. This is universal.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Filter the recurring inbound first.&lt;/strong&gt; For each of those senders, build a Gmail filter that skips the inbox, marks as read, and &amp;ldquo;Also apply filter to existing matching conversations.&amp;rdquo; That single checkbox is where most manual cleanups stall.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Bulk delete by sender and age cohort, not by clicking individual messages.&lt;/strong&gt; Use Gmail&amp;rsquo;s &lt;code&gt;from:&lt;/code&gt; and &lt;code&gt;older_than:&lt;/code&gt; operators. The 30-day Trash window is your safety net — anything you delete is recoverable for 30 days.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Run two maintenance filters&lt;/strong&gt; so you never have to do this again.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Notice what&amp;rsquo;s &lt;em&gt;not&lt;/em&gt; on the list: no mailbox migration, no archive-everything panic, no &amp;ldquo;select all 80,000 and pray.&amp;rdquo; You don&amp;rsquo;t even need to know which individual messages you&amp;rsquo;re deleting. You&amp;rsquo;re operating on counts and senders, like a sysadmin culling logs, not on individual emails.&lt;/p&gt;
&lt;p&gt;That&amp;rsquo;s the whole product worldview. Cleanup is a one-time &lt;em&gt;cohort&lt;/em&gt; operation, and a third-party app with permanent mailbox access is overkill for it.&lt;/p&gt;
&lt;h2 id="what-the-survey-actually-looks-like"&gt;What the survey actually looks like&lt;/h2&gt;
&lt;p&gt;The survey step is the part the script does. It calls the Gmail API under your own OAuth — read-only scope, &lt;code&gt;gmail.metadata&lt;/code&gt; plus &lt;code&gt;gmail.readonly&lt;/code&gt; for counts — and writes a single &lt;code&gt;survey.json&lt;/code&gt; to your laptop. No message bodies, no subjects, no message IDs. Just counts.&lt;/p&gt;
&lt;p&gt;Here&amp;rsquo;s a redacted version of the output, the way the script renders it so you can read it before deciding anything:&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code&gt;sender                          count   oldest       recommended action
─────────────────────────────────────────────────────────────────────────────────────
news@&amp;lt;redacted-saas&amp;gt;.com       11,842   2017-03-12   filter+delete (&amp;gt;1y)
deals@&amp;lt;redacted-airline&amp;gt;.com    7,901   2014-08-19   filter+delete (&amp;gt;1y)
updates@&amp;lt;redacted-network&amp;gt;.com  5,617   2016-01-08   filter+delete (&amp;gt;1y)
no-reply@&amp;lt;redacted-bank&amp;gt;.com    3,402   2018-04-22   filter only (keep — statements)
receipts@&amp;lt;redacted-cart&amp;gt;.com    2,883   2019-06-30   filter only (keep — receipts)
hello@&amp;lt;redacted-newsletter&amp;gt;     2,114   2020-11-04   filter+delete (&amp;gt;2y)
… 6 more rows …                              
─────────────────────────────────────────────────────────────────────────────────────
top 12 senders                 51,883   covers 64.3% of unread &amp;gt;1y
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;That&amp;rsquo;s the whole output. Four columns, twelve rows, one summary line. With that table you can decide, in 90 seconds, which senders you want to &lt;em&gt;filter and delete&lt;/em&gt; (most of them), which you want to &lt;em&gt;filter only&lt;/em&gt; (anything with statements, receipts, security alerts), and which you want to leave alone (the remaining 36% or so, from senders you might still care about).&lt;/p&gt;
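&lt;p&gt;The percentages are plain arithmetic on the counts, which is why counts are all the survey needs. Here is a small Python sketch using the six sender rows visible in the table and the 80,675 total. The variable and function names are illustrative and are not taken from the script itself.&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code&gt;TOTAL_UNREAD = 80675  # unread messages older than one year

# (sender, count, recommended action) for the six rows shown in the table.
ROWS = [
    ("news@redacted-saas", 11842, "filter+delete"),
    ("deals@redacted-airline", 7901, "filter+delete"),
    ("updates@redacted-network", 5617, "filter+delete"),
    ("no-reply@redacted-bank", 3402, "filter only"),
    ("receipts@redacted-cart", 2883, "filter only"),
    ("hello@redacted-newsletter", 2114, "filter+delete"),
]


def coverage(rows, total):
    """Percentage of the backlog covered by the given sender rows."""
    return round(sum(count for _, count, _ in rows) / total * 100, 1)


def deletable(rows):
    """Messages that would move to Trash: only rows marked filter+delete."""
    return sum(count for _, count, action in rows if action == "filter+delete")


print(coverage(ROWS, TOTAL_UNREAD))           # share covered by these six rows
print(deletable(ROWS))                        # messages those rows would trash
print(round(51883 / TOTAL_UNREAD * 100, 1))   # the summary line for all twelve
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The six visible rows alone cover 41.8% of the backlog, and the last line reproduces the 64.3% in the summary row.&lt;/p&gt;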
&lt;p&gt;The actual deletion is a second pass — different command, explicit confirmation, dry-run by default. You read the count, you say yes, Gmail moves the cohort to Trash, the 30-day undo window protects you.&lt;/p&gt;
&lt;h2 id="safety-properties-in-plain-english"&gt;Safety properties, in plain English&lt;/h2&gt;
&lt;p&gt;This is the bit I want to be very precise about, because &amp;ldquo;we never see your mail&amp;rdquo; is something every cleaner says, and most of them are stretching.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Read-only OAuth at survey time.&lt;/strong&gt; The survey command requests &lt;code&gt;gmail.metadata&lt;/code&gt; + &lt;code&gt;gmail.readonly&lt;/code&gt;. Those scopes &lt;em&gt;cannot&lt;/em&gt; delete, send, or modify mail. Google enforces this at the API edge; it&amp;rsquo;s not a promise, it&amp;rsquo;s a permission.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Deletion runs under a separate, on-demand &lt;code&gt;gmail.modify&lt;/code&gt; scope&lt;/strong&gt; that you grant only when you actually want to delete, and revoke from your Google account afterwards in one click. The script doesn&amp;rsquo;t ask for &lt;code&gt;mail.google.com/&lt;/code&gt; (the all-powerful scope) — ever.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The OAuth client is yours.&lt;/strong&gt; You create the Google Cloud project in your own account and paste the client ID and secret into a config file on your laptop. The tokens are written to a file in your home directory with &lt;code&gt;0600&lt;/code&gt; permissions. &lt;strong&gt;They never touch our infrastructure.&lt;/strong&gt; I literally cannot read your mail; the credentials only exist on your machine.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The Enterprise tier sidesteps the same problem differently:&lt;/strong&gt; your IT admin publishes the script as an &lt;em&gt;Internal&lt;/em&gt; app inside your Google Cloud organization, which means it&amp;rsquo;s exempt from Google&amp;rsquo;s app verification process and the 100-user cap, but also that there&amp;rsquo;s no &amp;ldquo;third-party app&amp;rdquo; to revoke — the script runs as you, on your own org&amp;rsquo;s Cloud project.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you&amp;rsquo;re the kind of person who reads OAuth scope strings before clicking through them — same — that&amp;rsquo;s the design.&lt;/p&gt;
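&lt;p&gt;The &lt;code&gt;0600&lt;/code&gt; claim is easy to reproduce and to check for yourself. Here is a minimal Python sketch of the idea, not the pack&amp;rsquo;s own shell script; the function name, the temporary path, and the token contents are all made up for illustration. It creates the file with owner-only permissions from the start instead of writing it first and tightening it afterwards:&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-python"&gt;import json
import os
import stat
import tempfile
from pathlib import Path

def write_token_file(path, token):
    """Write token data to path, readable and writable by the owner only."""
    # Passing mode 0o600 to os.open applies the permissions at creation
    # time, so the file is never briefly readable by other users.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as fh:
        json.dump(token, fh)
    # If the file already existed with looser permissions, tighten them.
    os.chmod(path, 0o600)

# Hypothetical demo in a throwaway directory (POSIX permissions assumed).
demo_path = Path(tempfile.mkdtemp()) / "token.json"
write_token_file(demo_path, {"refresh_token": "example-not-a-real-token"})
mode = stat.S_IMODE(demo_path.stat().st_mode)
print(oct(mode))
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;On a Linux or macOS laptop, &lt;code&gt;ls -l&lt;/code&gt; on a file written this way should show &lt;code&gt;-rw-------&lt;/code&gt;. Run the same check against whatever token file the script wrote.&lt;/p&gt;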
&lt;h2 id="the-three-ways-to-do-this"&gt;The three ways to do this&lt;/h2&gt;
&lt;p&gt;Pick the one that matches how much DIY you want to wrangle.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;$19 — Inbox Cleanup Pack (DIY).&lt;/strong&gt; &lt;a href="https://gibbs21.gumroad.com/l/inbox-cleanup-pack"&gt;Get it on Gumroad →&lt;/a&gt; Pay-what-you-want, $9 floor. The same shell script I used (read-only survey + opt-in deletion), the Gmail filter templates, the exact cohort-by-cohort cleanup order, and the printable Markdown playbook. You run everything on your own laptop under your own Google Cloud OAuth client. No third-party permissions added to your account.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;$79 — Inbox Cleanup QuickCheck (we write the plan).&lt;/strong&gt; &lt;a href="https://buy.stripe.com/cNi28tdxa3AC5NLdtJ5ZC03"&gt;Buy on Stripe →&lt;/a&gt; You run the same survey script. You send me the &lt;code&gt;survey.json&lt;/code&gt; file (counts only — no message bodies, no subjects, no IDs). I send back a written, prioritized cleanup plan tailored to your top senders, your age cohorts, and your tolerance for &amp;ldquo;delete vs archive.&amp;rdquo; Delivered within 24 hours, plus one async clarification pass within 14 days — up to 30 minutes&amp;rsquo; worth of follow-up questions over email at &lt;code&gt;support@richgibbs.dev&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;$499 — Inbox Cleanup Enterprise (up to 10 Workspace mailboxes).&lt;/strong&gt; &lt;a href="https://buy.stripe.com/28E14peBe9Z0b85cpF5ZC04"&gt;Buy on Stripe →&lt;/a&gt; For pre-migration or pre-acquisition cleanups across a small team — typically a 2-to-10-person Google Workspace org. Your IT admin publishes our script as an Internal app under your own Cloud project (no third-party verification, no app-store entry, no shared tokens). You run the survey across up to 10 mailboxes, send the merged &lt;code&gt;survey.json&lt;/code&gt;, and we write a per-mailbox plan plus the cross-mailbox patterns (shared newsletters worth bulk-filtering across the org, etc.). 5 business day SLA, one async clarification pass within 14 days — up to 30 minutes&amp;rsquo; worth of follow-up questions, via email to &lt;code&gt;support@richgibbs.dev&lt;/code&gt;. More details on the &lt;a href="https://richgibbs.dev/quickcheck/inbox-cleanup/"&gt;Inbox Cleanup service page&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;All three deliverables are async-only. Email is the only follow-up channel.&lt;/p&gt;
&lt;h2 id="related-reading"&gt;Related reading&lt;/h2&gt;
&lt;p&gt;If you came in through the deliverability rabbit-hole — receipts going to spam, password resets vanishing — the inbox problem is downstream of the &lt;em&gt;outbox&lt;/em&gt; problem, and both are fixable in one sitting:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="/spf-dkim-dmarc-indie-founder-checklist/"&gt;SPF, DKIM, DMARC for indie founders: the 20-minute checklist&lt;/a&gt; — the matching DNS-side hygiene pass for your sending domain.&lt;/li&gt;
&lt;li&gt;&lt;a href="/cloudflare-email-routing-indie-founders-10-minute-setup/"&gt;Cloudflare Email Routing for indie founders: the 10-minute support@ setup&lt;/a&gt; — if you don&amp;rsquo;t even have a &lt;code&gt;support@yourdomain.com&lt;/code&gt; yet, start here before you do anything else.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="the-short-version"&gt;The short version&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;The &amp;ldquo;free inbox cleaner&amp;rdquo; model is a scope-creep trap for a job that runs once.&lt;/li&gt;
&lt;li&gt;Survey-then-Delete: count first, identify the top 12 senders, filter the inbound, then bulk-delete by sender and age cohort.&lt;/li&gt;
&lt;li&gt;Read-only OAuth at survey time; on-demand &lt;code&gt;gmail.modify&lt;/code&gt; only when you&amp;rsquo;re actually deleting; tokens live on &lt;em&gt;your&lt;/em&gt; laptop, not ours.&lt;/li&gt;
&lt;li&gt;$19 if you want to run it yourself. $79 if you want me to write the cleanup plan. $499 if you need to do it across a small Workspace team without exposing tokens to a third party.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The 30-day Trash window is your safety net. So is reading the OAuth scope string before you click &amp;ldquo;Allow.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;— Rich&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;em&gt;Tuck Sentinel — independent. Not affiliated with, endorsed by, or certified by Google, Yahoo, Microsoft, AWS, Cloudflare, Stripe, Tally, or any email or cloud provider.&lt;/em&gt;&lt;/p&gt;</content:encoded>
      <guid isPermaLink="true">https://blog.richgibbs.dev/delete-thousands-emails-gmail-without-oauth-scope-creep/</guid>
      <category>email</category>
      <category>gmail</category>
      <category>inbox-cleanup</category>
      <category>oauth</category>
      <category>privacy</category>
      <category>indie-founder</category>
      <pubDate>Mon, 11 May 2026 16:55:00 +0000</pubDate>
    </item>
    <item>
      <title>DMARC aggregate reports without a SaaS: read your own rua XML in 30 minutes</title>
      <link>https://blog.richgibbs.dev/dmarc-aggregate-reports-without-a-saas/</link>
      <description>You don't need Postmark, Valimail, or dmarcian to read DMARC aggregate reports. 120 lines of stdlib Python on the same VPS that runs your mail will tell you the truth — here is what those XML reports actually contain and how to parse them yourself.</description>
      <content:encoded>&lt;p&gt;You published a DMARC record. The &lt;code&gt;rua=mailto:&lt;/code&gt; part is pointing at a real mailbox you actually read. Reports started arriving 24 hours later. They are zipped XML files with names like &lt;code&gt;google.com!yourdomain.com!1715472000!1715558400.zip&lt;/code&gt;, you cannot read them, and every blog post you find tells you to sign up for Postmark, Valimail, dmarcian, EasyDMARC, or some other $20–$200/month SaaS to &amp;ldquo;decode&amp;rdquo; them.&lt;/p&gt;
&lt;p&gt;You don&amp;rsquo;t need any of that. The DMARC aggregate-report format is a stable, well-defined XML schema published in &lt;a href="https://datatracker.ietf.org/doc/html/rfc7489"&gt;RFC 7489&lt;/a&gt; (&lt;a href="https://datatracker.ietf.org/doc/html/rfc7489#section-7.2"&gt;§7.2&lt;/a&gt;), and a working reader takes about 120 lines of stdlib-only Python — no extra packages, no API keys, runs in cron on the same $20 VPS that already runs your mail.&lt;/p&gt;
&lt;p&gt;This post is the &lt;em&gt;reading&lt;/em&gt; half of that story. What the reports actually contain, what every field means in practice, the 20-line skeleton to walk the XML yourself, what the full reader adds beyond the skeleton, and the cron-friendly workflow that makes the data actionable. If you ever decide you want a SaaS, you will be a much better customer for it.&lt;/p&gt;
&lt;h2 id="why-dmarc-aggregate-rua-reports-exist"&gt;Why DMARC aggregate (&lt;code&gt;rua&lt;/code&gt;) reports exist&lt;/h2&gt;
&lt;p&gt;DMARC, defined in &lt;a href="https://datatracker.ietf.org/doc/html/rfc7489"&gt;RFC 7489&lt;/a&gt;, is a policy layer on top of SPF and DKIM. A receiver (Gmail, Microsoft, Yahoo, Apple, ProtonMail, …) checks each incoming message and decides three things: does SPF pass and align with the &lt;code&gt;RFC5322.From&lt;/code&gt; domain, does DKIM pass and align, and what does the domain owner&amp;rsquo;s published &lt;code&gt;_dmarc&lt;/code&gt; TXT record say to do when neither aligns.&lt;/p&gt;
&lt;p&gt;The receiver acts on every message immediately. But the &lt;em&gt;domain owner&lt;/em&gt; (you) has no idea what happened until somebody complains. Aggregate reports close that loop. Paraphrasing &lt;a href="https://datatracker.ietf.org/doc/html/rfc7489#section-7.2"&gt;RFC 7489 §7.2&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Aggregate reports are most useful when they all contain the same data; thus this section describes a single report format, generated daily, sent via email, encoded as XML.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;In practice, every major receiver who supports DMARC sends one aggregate report per UTC day per sender domain they saw mail from, to the address (or addresses) listed in the &lt;code&gt;rua=&lt;/code&gt; tag of your &lt;code&gt;_dmarc&lt;/code&gt; record. Each report says: &lt;em&gt;&amp;ldquo;Here is every source IP that claimed to be sending as your domain in the last 24 hours, how many messages each one sent, and what we decided about each one.&amp;rdquo;&lt;/em&gt; It does &lt;strong&gt;not&lt;/strong&gt; contain message bodies, subjects, recipients, or any other PII. It is metadata only.&lt;/p&gt;
&lt;p&gt;That is what makes the format safe to receive, store, and parse on a $20 VPS. Failure reports (&lt;code&gt;ruf=&lt;/code&gt;, separate spec) sometimes carry redacted message content; aggregate reports do not. We are talking about &lt;code&gt;rua&lt;/code&gt; only.&lt;/p&gt;
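&lt;p&gt;The attachment filename already tells you who sent a report and which window it covers. A small sketch, assuming the &lt;code&gt;receiver!domain!begin!end&lt;/code&gt; naming from RFC 7489 §7.2.1.1; the helper name is mine, and the optional fifth unique-id part is kept but not interpreted:&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-python"&gt;from datetime import datetime, timezone

def parse_report_name(filename):
    """Split an aggregate-report filename into its '!'-separated parts."""
    stem = filename
    # Peel off ".gz" / ".zip" first, then ".xml", so "x.xml.gz" works too.
    for ext in (".gz", ".zip", ".xml"):
        if stem.endswith(ext):
            stem = stem[: -len(ext)]
    parts = stem.split("!")
    receiver, domain, begin, end = parts[:4]
    return {
        "receiver": receiver,
        "domain": domain,
        # The timestamps are Unix epoch seconds, interpreted as UTC.
        "begin": datetime.fromtimestamp(int(begin), tz=timezone.utc),
        "end": datetime.fromtimestamp(int(end), tz=timezone.utc),
        "unique_id": parts[4] if len(parts) == 5 else None,
    }

info = parse_report_name("google.com!yourdomain.com!1715472000!1715558400.zip")
print(info["receiver"], info["begin"].isoformat(), info["end"].isoformat())
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;For the example filename this prints &lt;code&gt;google.com 2024-05-12T00:00:00+00:00 2024-05-13T00:00:00+00:00&lt;/code&gt;: one UTC day, midnight to midnight. The same two timestamps show up again inside the XML as &lt;code&gt;date_range&lt;/code&gt;.&lt;/p&gt;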
&lt;h2 id="the-shape-of-a-real-aggregate-xml-report"&gt;The shape of a real aggregate XML report&lt;/h2&gt;
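&lt;p&gt;A real Gmail aggregate report, opened in a text editor after decompressing it (&lt;code&gt;unzip&lt;/code&gt; for a &lt;code&gt;.zip&lt;/code&gt;, &lt;code&gt;gunzip&lt;/code&gt; for a &lt;code&gt;.gz&lt;/code&gt;), looks roughly like this (one record shown; a real report typically has 5–50):&lt;/p&gt;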
&lt;pre class="codehilite"&gt;&lt;code class="language-xml"&gt;&amp;lt;feedback&amp;gt;
  &amp;lt;report_metadata&amp;gt;
    &amp;lt;org_name&amp;gt;google.com&amp;lt;/org_name&amp;gt;
    &amp;lt;email&amp;gt;noreply-dmarc-support@google.com&amp;lt;/email&amp;gt;
    &amp;lt;report_id&amp;gt;1234567890123456789&amp;lt;/report_id&amp;gt;
    &amp;lt;date_range&amp;gt;
      &amp;lt;begin&amp;gt;1715472000&amp;lt;/begin&amp;gt;
      &amp;lt;end&amp;gt;1715558400&amp;lt;/end&amp;gt;
    &amp;lt;/date_range&amp;gt;
  &amp;lt;/report_metadata&amp;gt;
  &amp;lt;policy_published&amp;gt;
    &amp;lt;domain&amp;gt;yourdomain.com&amp;lt;/domain&amp;gt;
    &amp;lt;adkim&amp;gt;r&amp;lt;/adkim&amp;gt;
    &amp;lt;aspf&amp;gt;r&amp;lt;/aspf&amp;gt;
    &amp;lt;p&amp;gt;quarantine&amp;lt;/p&amp;gt;
    &amp;lt;sp&amp;gt;quarantine&amp;lt;/sp&amp;gt;
    &amp;lt;pct&amp;gt;100&amp;lt;/pct&amp;gt;
  &amp;lt;/policy_published&amp;gt;
  &amp;lt;record&amp;gt;
    &amp;lt;row&amp;gt;
      &amp;lt;source_ip&amp;gt;50.31.156.6&amp;lt;/source_ip&amp;gt;
      &amp;lt;count&amp;gt;42&amp;lt;/count&amp;gt;
      &amp;lt;policy_evaluated&amp;gt;
        &amp;lt;disposition&amp;gt;none&amp;lt;/disposition&amp;gt;
        &amp;lt;dkim&amp;gt;pass&amp;lt;/dkim&amp;gt;
        &amp;lt;spf&amp;gt;pass&amp;lt;/spf&amp;gt;
      &amp;lt;/policy_evaluated&amp;gt;
    &amp;lt;/row&amp;gt;
    &amp;lt;identifiers&amp;gt;
      &amp;lt;header_from&amp;gt;yourdomain.com&amp;lt;/header_from&amp;gt;
    &amp;lt;/identifiers&amp;gt;
    &amp;lt;auth_results&amp;gt;
      &amp;lt;dkim&amp;gt;
        &amp;lt;domain&amp;gt;yourdomain.com&amp;lt;/domain&amp;gt;
        &amp;lt;selector&amp;gt;pm&amp;lt;/selector&amp;gt;
        &amp;lt;result&amp;gt;pass&amp;lt;/result&amp;gt;
      &amp;lt;/dkim&amp;gt;
      &amp;lt;spf&amp;gt;
        &amp;lt;domain&amp;gt;pm-bounces.yourdomain.com&amp;lt;/domain&amp;gt;
        &amp;lt;result&amp;gt;pass&amp;lt;/result&amp;gt;
      &amp;lt;/spf&amp;gt;
    &amp;lt;/auth_results&amp;gt;
  &amp;lt;/record&amp;gt;
&amp;lt;/feedback&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Every aggregate report from any receiver has the same top-level shape: one &lt;code&gt;&amp;lt;feedback&amp;gt;&lt;/code&gt; element, one &lt;code&gt;&amp;lt;report_metadata&amp;gt;&lt;/code&gt; block, one &lt;code&gt;&amp;lt;policy_published&amp;gt;&lt;/code&gt; block, and one &lt;code&gt;&amp;lt;record&amp;gt;&lt;/code&gt; per source IP per disposition outcome. The schema is fixed by &lt;a href="https://datatracker.ietf.org/doc/html/rfc7489#appendix-C"&gt;RFC 7489 Appendix C&lt;/a&gt;; receivers don&amp;rsquo;t get to invent new fields.&lt;/p&gt;
&lt;h2 id="the-report-decoder-ring"&gt;The report decoder ring&lt;/h2&gt;
&lt;p&gt;Once you have the XML in front of you, three fields do most of the work.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;&amp;lt;source_ip&amp;gt;&lt;/code&gt;&lt;/strong&gt; is the IP address the receiver saw the message arrive from. If it is one of your sending platform&amp;rsquo;s IPs (a Postmark, Resend, Mailgun, SES, ConvertKit, Mailchimp range), that is good. If it is an IP you have never heard of &lt;em&gt;and&lt;/em&gt; the count is non-trivial &lt;em&gt;and&lt;/em&gt; alignment failed, that is either a forwarder you forgot about or somebody actively spoofing your domain. Both are worth investigating, but in 2026, the boring forwarder explanation is the answer about 95 % of the time.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;&amp;lt;policy_evaluated&amp;gt;&lt;/code&gt;&lt;/strong&gt; is the receiver&amp;rsquo;s verdict on this batch of messages. Three sub-fields matter:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;&amp;lt;disposition&amp;gt;&lt;/code&gt; — what the receiver did. &lt;code&gt;none&lt;/code&gt; means delivered normally; &lt;code&gt;quarantine&lt;/code&gt; means spam-foldered; &lt;code&gt;reject&lt;/code&gt; means refused at SMTP time. This is the &lt;em&gt;applied&lt;/em&gt; outcome, after any &lt;code&gt;pct=&lt;/code&gt; ramp and local override.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;&amp;lt;dkim&amp;gt;&lt;/code&gt; — whether DKIM passed &lt;em&gt;and aligned&lt;/em&gt; with the &lt;code&gt;RFC5322.From&lt;/code&gt; domain in &lt;code&gt;&amp;lt;identifiers&amp;gt;&amp;lt;header_from&amp;gt;&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;&amp;lt;spf&amp;gt;&lt;/code&gt; — same, for SPF alignment (the &lt;code&gt;MAIL FROM&lt;/code&gt; / Return-Path domain must align with &lt;code&gt;header_from&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The single most common indie-founder confusion is the difference between &lt;code&gt;&amp;lt;auth_results&amp;gt;&lt;/code&gt; (the raw SPF/DKIM verification result on whatever domains the message presented) and &lt;code&gt;&amp;lt;policy_evaluated&amp;gt;&lt;/code&gt; (whether those results &lt;em&gt;aligned&lt;/em&gt; with the visible From: domain). A message can have &lt;code&gt;&amp;lt;auth_results&amp;gt;&amp;lt;dkim&amp;gt;&amp;lt;result&amp;gt;pass&amp;lt;/result&amp;gt;&amp;lt;/dkim&amp;gt;&lt;/code&gt; and still show &lt;code&gt;&amp;lt;policy_evaluated&amp;gt;&amp;lt;dkim&amp;gt;fail&amp;lt;/dkim&amp;gt;&lt;/code&gt;: DKIM technically passed, but the signing domain was &lt;code&gt;mailgun.org&lt;/code&gt; instead of your domain, so DMARC alignment failed. That is the most common deliverability bug covered in this article. Fix it by enabling custom-domain DKIM signing on the offending provider (the setting&amp;rsquo;s exact name varies by provider).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;&amp;lt;header_from&amp;gt;&lt;/code&gt;&lt;/strong&gt; under &lt;code&gt;&amp;lt;identifiers&amp;gt;&lt;/code&gt; is the &lt;code&gt;RFC5322.From&lt;/code&gt; domain — what the recipient sees. If this is ever a domain other than yours (a subdomain you forgot, an old sending domain), every alignment decision in the same record is being judged against &lt;em&gt;that&lt;/em&gt; domain, not your apex.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://datatracker.ietf.org/doc/html/rfc7960"&gt;RFC 7960&lt;/a&gt; (&amp;ldquo;Interoperability Issues Between DMARC and Indirect Email Flows&amp;rdquo;) is the official, RFC-blessed description of why honest forwarders break DMARC alignment. Plain forward-to-personal-inbox aliases usually break SPF, because the forwarder&amp;rsquo;s IP is not in your SPF record. Mailing lists and any hop that rewrites signed headers or the body also break DKIM, so they show &lt;code&gt;&amp;lt;policy_evaluated&amp;gt;&amp;lt;dkim&amp;gt;fail&amp;lt;/dkim&amp;gt;&lt;/code&gt; on aggregate reports while not being malicious. That is the moment to read the &lt;a href="https://datatracker.ietf.org/doc/html/rfc8617"&gt;ARC spec, RFC 8617&lt;/a&gt;, and decide whether to enable ARC on your forwarder or just stop forwarding mail you publish DMARC for.&lt;/p&gt;
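&lt;p&gt;The &lt;code&gt;auth_results&lt;/code&gt; versus &lt;code&gt;policy_evaluated&lt;/code&gt; distinction is easy to check mechanically. A sketch using the element names from the sample report above; the function name and the labels it returns are my own invention, and it trusts the receiver&amp;rsquo;s alignment verdict rather than recomputing alignment itself:&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-python"&gt;import xml.etree.ElementTree as ET

def dkim_status(rec):
    """Label one record by comparing raw DKIM results with the DMARC verdict."""
    aligned = rec.findtext("row/policy_evaluated/dkim", default="?")
    # Domains whose DKIM signature verified, whether or not they align.
    passing = [
        sig.findtext("domain", default="?")
        for sig in rec.findall("auth_results/dkim")
        if sig.findtext("result") == "pass"
    ]
    if aligned == "pass":
        return "aligned"
    if passing:
        # A signature verified, but not for a domain aligned with header_from.
        return "passed-but-unaligned: " + ", ".join(passing)
    return "no-valid-signature"

# Hypothetical record: the provider signed with its own domain.
sample = ET.fromstring(
    "&amp;lt;record&amp;gt;"
    "&amp;lt;row&amp;gt;&amp;lt;policy_evaluated&amp;gt;&amp;lt;dkim&amp;gt;fail&amp;lt;/dkim&amp;gt;&amp;lt;/policy_evaluated&amp;gt;&amp;lt;/row&amp;gt;"
    "&amp;lt;identifiers&amp;gt;&amp;lt;header_from&amp;gt;yourdomain.com&amp;lt;/header_from&amp;gt;&amp;lt;/identifiers&amp;gt;"
    "&amp;lt;auth_results&amp;gt;&amp;lt;dkim&amp;gt;&amp;lt;domain&amp;gt;mailgun.org&amp;lt;/domain&amp;gt;"
    "&amp;lt;result&amp;gt;pass&amp;lt;/result&amp;gt;&amp;lt;/dkim&amp;gt;&amp;lt;/auth_results&amp;gt;"
    "&amp;lt;/record&amp;gt;"
)
print(dkim_status(sample))
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;For that sample it prints &lt;code&gt;passed-but-unaligned: mailgun.org&lt;/code&gt;, which is the custom-domain DKIM bug described above.&lt;/p&gt;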
&lt;h2 id="a-20-line-stdlib-only-skeleton"&gt;A 20-line stdlib-only skeleton&lt;/h2&gt;
&lt;p&gt;You can read every report on disk with nothing but the Python standard library. Here is the smallest correct skeleton that walks every record in a single XML file:&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-python"&gt;import xml.etree.ElementTree as ET
from pathlib import Path

def walk(path: Path):
    tree = ET.parse(path)
    root = tree.getroot()
    org = root.findtext(&amp;quot;report_metadata/org_name&amp;quot;, default=&amp;quot;?&amp;quot;)
    dom = root.findtext(&amp;quot;policy_published/domain&amp;quot;, default=&amp;quot;?&amp;quot;)
    for rec in root.findall(&amp;quot;record&amp;quot;):
        ip    = rec.findtext(&amp;quot;row/source_ip&amp;quot;, default=&amp;quot;?&amp;quot;)
        count = int(rec.findtext(&amp;quot;row/count&amp;quot;, default=&amp;quot;0&amp;quot;))
        disp  = rec.findtext(&amp;quot;row/policy_evaluated/disposition&amp;quot;, default=&amp;quot;?&amp;quot;)
        dkim  = rec.findtext(&amp;quot;row/policy_evaluated/dkim&amp;quot;, default=&amp;quot;?&amp;quot;)
        spf   = rec.findtext(&amp;quot;row/policy_evaluated/spf&amp;quot;, default=&amp;quot;?&amp;quot;)
        hfrom = rec.findtext(&amp;quot;identifiers/header_from&amp;quot;, default=&amp;quot;?&amp;quot;)
        yield (org, dom, ip, count, disp, dkim, spf, hfrom)

for f in Path(&amp;quot;reports&amp;quot;).glob(&amp;quot;*.xml&amp;quot;):
    for r in walk(f):
        print(&amp;quot;\t&amp;quot;.join(map(str, r)))
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;That is the whole reading layer. Run it against a directory of decompressed XML reports and you have a tab-separated table you can pipe into &lt;code&gt;awk -F'\t'&lt;/code&gt;, &lt;code&gt;sort -t$'\t' -k4,4n&lt;/code&gt; (tab as the explicit delimiter, because &lt;code&gt;org_name&lt;/code&gt; values can contain spaces), or just &lt;code&gt;grep fail&lt;/code&gt;.&lt;/p&gt;
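&lt;p&gt;One caveat before you trust an empty result: the bare paths in the skeleton only match elements that are not in an XML namespace. Not every receiver declares one, but if a report puts a default namespace on &lt;code&gt;&amp;lt;feedback&amp;gt;&lt;/code&gt;, ElementTree sees every tag as &lt;code&gt;{uri}record&lt;/code&gt; and &lt;code&gt;findall("record")&lt;/code&gt; quietly returns nothing. One hedged workaround is to strip namespaces after parsing. &lt;code&gt;strip_namespaces&lt;/code&gt; is my own helper name, and the namespace URI below is a placeholder:&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-python"&gt;import xml.etree.ElementTree as ET

def strip_namespaces(root):
    """Remove any {namespace} prefix from every tag, in place."""
    for el in root.iter():
        # ElementTree stores namespaced tags as "{uri}localname".
        if isinstance(el.tag, str) and el.tag.startswith("{"):
            el.tag = el.tag.split("}", 1)[1]
    return root

# Placeholder namespace, for illustration only.
doc = (
    '&amp;lt;feedback xmlns="urn:example:dmarc"&amp;gt;'
    "&amp;lt;record&amp;gt;&amp;lt;row&amp;gt;&amp;lt;count&amp;gt;3&amp;lt;/count&amp;gt;&amp;lt;/row&amp;gt;&amp;lt;/record&amp;gt;"
    "&amp;lt;/feedback&amp;gt;"
)
root = ET.fromstring(doc)
print(len(root.findall("record")))                    # before stripping
print(len(strip_namespaces(root).findall("record")))  # after stripping
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The first count is 0 and the second is 1. Calling the helper on &lt;code&gt;root&lt;/code&gt; right after &lt;code&gt;tree.getroot()&lt;/code&gt; leaves un-namespaced reports unchanged.&lt;/p&gt;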
&lt;p&gt;What a &lt;em&gt;full&lt;/em&gt; reader adds on top of this 20-line skeleton — and what the paid pack ships pre-built — is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Transparent &lt;code&gt;.gz&lt;/code&gt;, &lt;code&gt;.zip&lt;/code&gt;, and raw-&lt;code&gt;.xml&lt;/code&gt; handling (receivers disagree on compression; some send a &lt;code&gt;.zip&lt;/code&gt; containing an &lt;code&gt;.xml&lt;/code&gt;, some send a &lt;code&gt;.xml.gz&lt;/code&gt;, Microsoft used to email both).&lt;/li&gt;
&lt;li&gt;Grouping by source domain and by sending sub-domain, so the report says &lt;em&gt;&amp;ldquo;Postmark sent 1,420 messages on your behalf today, all aligned&amp;rdquo;&lt;/em&gt; instead of one row per IP.&lt;/li&gt;
&lt;li&gt;Disposition rollups: how many &lt;code&gt;none&lt;/code&gt; vs. &lt;code&gt;quarantine&lt;/code&gt; vs. &lt;code&gt;reject&lt;/code&gt; per sender, per day.&lt;/li&gt;
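&lt;li&gt;ARC-aware handling (per &lt;a href="https://datatracker.ietf.org/doc/html/rfc8617"&gt;RFC 8617&lt;/a&gt;). The Appendix C schema has no ARC field under &lt;code&gt;&amp;lt;auth_results&amp;gt;&lt;/code&gt;; receivers that honor ARC report it as an override &lt;code&gt;&amp;lt;reason&amp;gt;&lt;/code&gt; under &lt;code&gt;&amp;lt;policy_evaluated&amp;gt;&lt;/code&gt; (Google, for example, typically uses &lt;code&gt;local_policy&lt;/code&gt; with an &lt;code&gt;arc=pass&lt;/code&gt; comment), so legit forwarders can be flagged as &lt;em&gt;&amp;ldquo;ARC-rescued, ignore&amp;rdquo;&lt;/em&gt; instead of &lt;em&gt;&amp;ldquo;DKIM fail, panic.&amp;rdquo;&lt;/em&gt;&lt;/li&gt;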
&lt;li&gt;A multi-day rolling view so a one-bad-day spike does not page you but a seven-day trend does.&lt;/li&gt;
&lt;li&gt;An &amp;ldquo;unknown sender&amp;rdquo; alert for any source IP that has never appeared in your historical reports and is sending more than N messages a day.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The 20-line skeleton is enough to &lt;em&gt;learn the data&lt;/em&gt;. The 120-line full reader is what you keep in cron.&lt;/p&gt;
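&lt;p&gt;The first bullet in that list, transparent compression handling, is also stdlib-only. This is a sketch of one way to do it rather than the pack&amp;rsquo;s code: it checks the file&amp;rsquo;s leading magic bytes instead of trusting the extension, and &lt;code&gt;read_report_bytes&lt;/code&gt; is a name I made up:&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-python"&gt;import gzip
import io
import zipfile
from pathlib import Path

def read_report_bytes(path):
    """Return the raw XML bytes from a .xml, .xml.gz, or .zip report file."""
    data = Path(path).read_bytes()
    if data[:2] == b"\x1f\x8b":        # gzip magic number
        return gzip.decompress(data)
    if data[:4] == b"PK\x03\x04":      # zip local-file header
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            # Prefer a member ending in .xml; otherwise take whatever is there.
            names = [n for n in zf.namelist() if n.lower().endswith(".xml")]
            names = names or zf.namelist()
            return zf.read(names[0])
    return data                         # assume it is already plain XML
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;In the skeleton, &lt;code&gt;ET.fromstring(read_report_bytes(path))&lt;/code&gt; would replace &lt;code&gt;ET.parse(path).getroot()&lt;/code&gt;. These attachments arrive from the open internet, so a reader you leave in cron should also cap the decompressed size before parsing.&lt;/p&gt;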
&lt;h2 id="a-cron-friendly-daily-workflow"&gt;A cron-friendly daily workflow&lt;/h2&gt;
&lt;p&gt;Once you can read the reports, the workflow is short.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Use a dedicated mailbox.&lt;/strong&gt; Point &lt;code&gt;rua=mailto:dmarc-reports@yourdomain.com&lt;/code&gt; at an alias you do not read directly. Cloudflare Email Routing forwarding into a labeled Gmail folder works perfectly for this; so does a Postfix &lt;code&gt;.forward&lt;/code&gt; into a Maildir on the same VPS. Google&amp;rsquo;s own &lt;a href="https://support.google.com/a/answer/2466580"&gt;Workspace Admin Help DMARC guide&lt;/a&gt; recommends the same separation.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Fetch on a schedule.&lt;/strong&gt; Pull the new attachments out of that mailbox once an hour. IMAP, the Gmail API (under your own internal-app OAuth client), or a simple &lt;code&gt;notmuch new&lt;/code&gt; + maildir scan all work. Drop the attachments under &lt;code&gt;~/dmarc/reports/&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Parse and roll up.&lt;/strong&gt; Run the reader nightly. Append a row per &lt;code&gt;(date, sender_domain, source_ip, count, dkim_aligned, spf_aligned, disposition)&lt;/code&gt; to a CSV or SQLite file. This is the historical record you query when something breaks.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Alert only on change.&lt;/strong&gt; Mail yourself when (a) a brand-new source IP appears and sends more than ~50 messages, (b) a previously-aligned sender&amp;rsquo;s DKIM-alignment rate drops below 95 % for two consecutive days, or (c) any &lt;code&gt;disposition=reject&lt;/code&gt; count goes above zero for a sender you care about.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;That is the entire pipeline. There is no dashboard, no per-domain license, no &amp;ldquo;trust score.&amp;rdquo; The data is the data.&lt;/p&gt;
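&lt;p&gt;Steps 3 and 4 fit in a few lines of &lt;code&gt;sqlite3&lt;/code&gt;. This is a sketch under my own assumptions: the table and column names are invented here (they mirror the tuple in step 3), the IPs and dates are sample values, and only alert rule (a) is implemented, with the ~50 threshold passed in as a parameter:&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-python"&gt;import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS dmarc_rows (
    date TEXT, sender_domain TEXT, source_ip TEXT, count INTEGER,
    dkim_aligned TEXT, spf_aligned TEXT, disposition TEXT
)
"""

def new_source_ips(conn, rows, report_date, threshold=50):
    """Store one day's rows; return never-seen IPs that sent more than threshold."""
    conn.execute(SCHEMA)
    # Snapshot the IPs already on record before adding today's rows.
    known = {ip for (ip,) in conn.execute("SELECT DISTINCT source_ip FROM dmarc_rows")}
    totals = {}
    for row in rows:
        # row = (sender_domain, source_ip, count, dkim_aligned, spf_aligned, disposition)
        conn.execute("INSERT INTO dmarc_rows VALUES (?, ?, ?, ?, ?, ?, ?)", (report_date, *row))
        totals[row[1]] = totals.get(row[1], 0) + row[2]
    conn.commit()
    return sorted(ip for ip, n in totals.items() if ip not in known and n &amp;gt; threshold)

conn = sqlite3.connect(":memory:")
day1 = [("yourdomain.com", "50.31.156.6", 42, "pass", "pass", "none")]
day2 = day1 + [("yourdomain.com", "203.0.113.9", 120, "fail", "fail", "quarantine")]
print(new_source_ips(conn, day1, "2024-05-12"))
print(new_source_ips(conn, day2, "2024-05-13"))
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The first call prints &lt;code&gt;[]&lt;/code&gt; (42 messages is under the threshold) and the second prints &lt;code&gt;['203.0.113.9']&lt;/code&gt;. In cron you would point &lt;code&gt;sqlite3.connect&lt;/code&gt; at a file instead of &lt;code&gt;:memory:&lt;/code&gt; and mail yourself whenever the returned list is non-empty.&lt;/p&gt;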
&lt;h2 id="when-you-actually-do-need-a-saas"&gt;When you actually do need a SaaS&lt;/h2&gt;
&lt;p&gt;Be honest: the boring DIY pipeline above is correct for a single domain, one or two sending sub-domains, and 5–50 reports a day. The point at which a SaaS starts pulling its weight is roughly:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;More than ~5 active domains, especially if a deliverability team wants a shared dashboard.&lt;/li&gt;
&lt;li&gt;High-volume marketing senders (&amp;gt;500k messages/month) where you want forensic (&lt;code&gt;ruf=&lt;/code&gt;) reports correlated with bounce categories.&lt;/li&gt;
&lt;li&gt;Anything that needs SPF/DKIM hygiene enforced across an org with 50+ employees and rotating contractors.&lt;/li&gt;
&lt;li&gt;Compliance contexts (&lt;a href="https://learn.microsoft.com/en-us/defender-office-365/anti-spam-protection-about"&gt;Microsoft anti-spam configuration docs&lt;/a&gt; are worth reading here too) where someone external wants an audit trail of the policy itself.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For one indie founder with one domain and three senders? You are the worst customer dmarcian will ever have. Read your own reports.&lt;/p&gt;
&lt;h2 id="related-downloadable-pack"&gt;Related downloadable pack&lt;/h2&gt;
&lt;p&gt;If you want the full Python reader (gzip- and zip-aware, sub-domain rollups, ARC handling, the &lt;code&gt;unknown_sender&lt;/code&gt; alert function, plus three real incident walkthroughs — marketing-tool DKIM drift, forgotten sub-domain, forwarder/ARC breakage — and a DSN decoder cheat-sheet for Gmail &lt;code&gt;5.7.26&lt;/code&gt; and Microsoft &lt;code&gt;5.7.509&lt;/code&gt; / &lt;code&gt;5.7.515&lt;/code&gt;) in one bundle, the &lt;strong&gt;&lt;a href="https://gibbs21.gumroad.com/l/dmarc-quarantine-pack"&gt;DMARC Quarantine Pack — $29 on Gumroad&lt;/a&gt;&lt;/strong&gt; has it. 14-day refund, no questions.&lt;/p&gt;
&lt;h2 id="related-posts"&gt;Related posts&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="/spf-dkim-dmarc-indie-founder-checklist/"&gt;SPF, DKIM, DMARC for indie founders: the 20-minute checklist&lt;/a&gt;&lt;/strong&gt; — the prerequisite. Publish a sane &lt;code&gt;_dmarc&lt;/code&gt; record and a &lt;code&gt;rua=&lt;/code&gt; target first; then the aggregate reports in this post will actually start arriving.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="/cloudflare-email-routing-indie-founders-10-minute-setup/"&gt;Cloudflare Email Routing for indie founders: the 10-minute support@ setup&lt;/a&gt;&lt;/strong&gt; — the cleanest way to give your &lt;code&gt;dmarc-reports@yourdomain.com&lt;/code&gt; alias a real destination without paying for a Workspace seat, and the post that explains the one forwarder hop ARC (&lt;a href="https://datatracker.ietf.org/doc/html/rfc8617"&gt;RFC 8617&lt;/a&gt;) is designed to rescue.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Related downloadable pack:&lt;/strong&gt; &lt;a href="https://gibbs21.gumroad.com/l/dmarc-quarantine-pack"&gt;DMARC Quarantine Pack — $29 on Gumroad&lt;/a&gt; — the full single-file Python reader, three real-incident walkthroughs, and the DSN decoder cheat-sheet for when DMARC moves from &lt;code&gt;p=none&lt;/code&gt; to &lt;code&gt;p=quarantine&lt;/code&gt; and a specific sender starts getting bounced. 14-day refund, no questions.&lt;/li&gt;
&lt;/ul&gt;</content:encoded>
      <guid isPermaLink="true">https://blog.richgibbs.dev/dmarc-aggregate-reports-without-a-saas/</guid>
      <category>email</category>
      <category>dns</category>
      <category>dmarc</category>
      <category>deliverability</category>
      <category>indie-founder</category>
      <category>python</category>
      <category>rua</category>
      <category>aggregate-reports</category>
      <category>saas</category>
      <pubDate>Tue, 12 May 2026 14:45:00 +0000</pubDate>
    </item>
    <item>
      <title>Security audit vs penetration test: which one does an indie founder actually need?</title>
      <link>https://blog.richgibbs.dev/security-audit-vs-penetration-test-indie-founder-2026/</link>
      <description>A read-only security audit and a penetration test are not the same thing, and asking for the wrong one will either waste your money or leave the actual problem in place. Here is the boring, working distinction for a 1-5 person SaaS team in 2026.</description>
      <content:encoded>&lt;p&gt;You shipped a product. It runs on one or two VPSes, maybe a Postgres, a Stripe webhook, a small Workspace tenant, and a marketing site. A prospect asks for &amp;ldquo;your latest pen test.&amp;rdquo; A compliance template asks if you&amp;rsquo;ve had a &amp;ldquo;security audit in the past 12 months.&amp;rdquo; A vendor pitches you a $9,000 engagement. A friend tells you to &amp;ldquo;just run a Nessus scan.&amp;rdquo; Three of those four things are different jobs, and one of them is barely a job at all.&lt;/p&gt;
&lt;p&gt;This guide is the plain-English version of &lt;strong&gt;security audit vs penetration test&lt;/strong&gt; for an indie founder or a 1-5 person SaaS team, in 2026, with no security staff and no compliance department behind you. It tells you what each thing actually is, what each one is for, what each one costs roughly, and how to pick. It is deliberately short on jargon and long on &amp;ldquo;use this when, use that when.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;We are also not selling you a $9,000 engagement. The whole point is that &lt;em&gt;most&lt;/em&gt; indie founders need the cheaper, read-only thing first, and only sometimes need the more expensive intrusive thing later.&lt;/p&gt;
&lt;h2 id="the-actual-definitions-from-the-standards-not-from-a-sales-deck"&gt;The actual definitions (from the standards, not from a sales deck)&lt;/h2&gt;
&lt;p&gt;The cleanest split comes from NIST Special Publication 800-115, &lt;em&gt;Technical Guide to Information Security Testing and Assessment&lt;/em&gt;, which divides &amp;ldquo;security testing and examination&amp;rdquo; into three top-level techniques:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Review techniques&lt;/strong&gt; — documentation review, log review, configuration review, network sniffing, file integrity checking. Non-intrusive. The system stays untouched. (NIST SP 800-115, §3, &amp;ldquo;Review Techniques.&amp;rdquo;) &lt;a href="https://csrc.nist.gov/pubs/sp/800/115/final"&gt;https://csrc.nist.gov/pubs/sp/800/115/final&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Target identification and analysis techniques&lt;/strong&gt; — network discovery, port and service identification, vulnerability scanning, wireless scanning. Mostly non-intrusive, can be noisy. (NIST SP 800-115, §4.)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Target vulnerability validation techniques&lt;/strong&gt; — password cracking, &lt;strong&gt;penetration testing&lt;/strong&gt;, social engineering. Actively exploit the things the scanner found, to prove they are real. Intrusive by design. (NIST SP 800-115, §5.)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In everyday founder language:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;security audit&lt;/strong&gt; (sometimes &amp;ldquo;security assessment,&amp;rdquo; &amp;ldquo;security review,&amp;rdquo; &amp;ldquo;configuration audit&amp;rdquo;) is mostly &lt;em&gt;review&lt;/em&gt; plus &lt;em&gt;vulnerability scanning&lt;/em&gt;. Read-only. Nobody is trying to log into your box without permission. It asks: &lt;em&gt;given the configuration you have, what would a competent attacker probably notice first?&lt;/em&gt; The deliverable is a prioritized fix list.&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;penetration test&lt;/strong&gt; is &lt;em&gt;validation&lt;/em&gt;. With your written permission, a human tester (or a team) actually tries to break in — exploit a finding, chain two boring misconfigs into one bad outcome, see how far they can get from the outside without your help. The deliverable is a report of &lt;em&gt;what they actually achieved&lt;/em&gt;, with reproduction steps, plus the fix list.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;OWASP&amp;rsquo;s &lt;em&gt;Web Security Testing Guide&lt;/em&gt; (WSTG) makes the same distinction in its introduction: a vulnerability assessment lists potential issues, while a penetration test attempts to exploit them and demonstrate impact. &lt;a href="https://owasp.org/www-project-web-security-testing-guide/stable/"&gt;https://owasp.org/www-project-web-security-testing-guide/stable/&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;CIS Controls v8 puts penetration testing in its own control (Control 18, &amp;ldquo;Penetration Testing&amp;rdquo;), separate from the configuration/audit controls earlier in the framework — it&amp;rsquo;s deliberately the &lt;em&gt;last&lt;/em&gt; control, on the assumption you&amp;rsquo;ve already done the boring read-only hygiene of the first seventeen. &lt;a href="https://www.cisecurity.org/controls/v8"&gt;https://www.cisecurity.org/controls/v8&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;That ordering is the answer to &amp;ldquo;which one do I need first,&amp;rdquo; for almost every indie founder reading this post.&lt;/p&gt;
&lt;h2 id="what-an-indie-founder-security-audit-actually-looks-like"&gt;What an indie-founder security audit actually looks like&lt;/h2&gt;
&lt;p&gt;For a 1-5 person SaaS team running on a VPS or two (or EC2 / Lightsail / Hetzner / DigitalOcean / Fly / Render), a read-only security audit usually covers, at a minimum:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;SSH posture.&lt;/strong&gt; Key-only auth, no password login, no root login, sane &lt;code&gt;MaxAuthTries&lt;/code&gt;, sane &lt;code&gt;LoginGraceTime&lt;/code&gt;, your actual &lt;code&gt;authorized_keys&lt;/code&gt; files, stale keys from ex-contractors.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Firewall / security-group state.&lt;/strong&gt; What ports are actually open to &lt;code&gt;0.0.0.0/0&lt;/code&gt; versus what you &lt;em&gt;think&lt;/em&gt; are open. SG rules that opened temporarily during an incident and never closed.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Patch state.&lt;/strong&gt; Whether &lt;code&gt;unattended-upgrades&lt;/code&gt; (or the equivalent) is actually running, when the last reboot was, how many security updates are pending.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Instance metadata posture.&lt;/strong&gt; IMDSv2 required (on AWS), no v1 fallback, sensible hop limits. (See the EC2 hardening checklist below for the migration path.)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;TLS posture.&lt;/strong&gt; Cert expiry windows, weak ciphers, the Cloudflare-origin cert versus the in-host cert, SNI/origin drift.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Container exposure.&lt;/strong&gt; Whether the Docker socket is mounted into anything web-facing; whether any container runs as root with a public port bound to &lt;code&gt;0.0.0.0&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Email DNS hygiene.&lt;/strong&gt; SPF/DKIM/DMARC published, aligned, and not contradicting each other for any of the senders you actually use.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Backup posture.&lt;/strong&gt; Whether the off-host copy exists, whether anyone has ever tested a restore, retention windows.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Logging posture.&lt;/strong&gt; Whether &lt;code&gt;auditd&lt;/code&gt; / &lt;code&gt;journald&lt;/code&gt; are actually retaining anything useful, whether logs ship off-host, whether they survive a host being terminated.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;OAuth and third-party access.&lt;/strong&gt; Which third-party SaaS still has live tokens against your Workspace mailbox / Drive / Calendar. The Zapier from 18 months ago, the contractor&amp;rsquo;s n8n, the AI summarizer somebody clicked through.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;None of that requires the auditor to exploit anything. Most of it the auditor can read off the host with a short-lived, read-only SSH key plus a few API tokens scoped to &lt;code&gt;Describe*&lt;/code&gt; actions, plus a handful of externally observable signals (DNS, TLS handshakes, HTTP headers).&lt;/p&gt;
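&lt;p&gt;As a rough illustration of how read-only that is, several of the checks above come down to commands like the following. This is a minimal sketch, assuming an Ubuntu or Debian host with &lt;code&gt;sudo&lt;/code&gt; and an AWS CLI profile limited to &lt;code&gt;Describe*&lt;/code&gt; actions; it is not the script behind any particular audit.&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;# SSH posture: the effective sshd settings, not just what the config file says
sudo sshd -T | grep -Ei '^(passwordauthentication|permitrootlogin|maxauthtries|logingracetime) '

# Listening sockets and the processes that own them
sudo ss -tulpn

# Patch state: pending upgrades, unattended-upgrades status, last boot time
apt list --upgradable
systemctl is-enabled unattended-upgrades
uptime -s

# From a workstation: security groups with any rule open to the whole internet
aws ec2 describe-security-groups \
  --filters Name=ip-permission.cidr,Values=0.0.0.0/0 \
  --query 'SecurityGroups[].[GroupId,GroupName]' --output table
&lt;/code&gt;&lt;/pre&gt;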
&lt;p&gt;Cost range for an indie-scale read-only audit on one host in 2026: roughly &lt;strong&gt;$100–$500&lt;/strong&gt; for a fixed-scope one-shot deliverable (our $149 VPS/EC2 Hardening QuickCheck sits at the low end of that), and a few thousand for a thorough multi-host audit with a written report. The expensive consultancy engagements you see quoted at $5k–$15k are typically wrapping an audit &lt;em&gt;and&lt;/em&gt; a light pen test together, plus several hours of advisory time.&lt;/p&gt;
&lt;p&gt;There is also a free version of the externally observable subset — DNS records, exposed ports, TLS posture, HTTP security headers, public IMDS reachability — which you can run yourself in a few minutes; see &lt;a href="https://blog.richgibbs.dev/quickcheck-mini/"&gt;QuickCheck Mini&lt;/a&gt; for the script-only version we ship.&lt;/p&gt;
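&lt;p&gt;A hand-rolled version of a few of those external checks might look like this. It is only a sketch: &lt;code&gt;example.com&lt;/code&gt; is a placeholder, it assumes &lt;code&gt;dig&lt;/code&gt;, &lt;code&gt;openssl&lt;/code&gt;, and &lt;code&gt;curl&lt;/code&gt; are installed, and it is not the QuickCheck Mini script itself.&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;DOMAIN=example.com   # placeholder: use your own domain

# Email DNS: SPF and DMARC records as the outside world sees them
dig +short TXT &amp;quot;$DOMAIN&amp;quot;
dig +short TXT &amp;quot;_dmarc.$DOMAIN&amp;quot;

# TLS: validity window of the certificate actually being served
echo | openssl s_client -connect &amp;quot;$DOMAIN:443&amp;quot; -servername &amp;quot;$DOMAIN&amp;quot; 2&amp;gt;/dev/null \
  | openssl x509 -noout -dates

# HTTP response headers (look for Strict-Transport-Security and similar)
curl -sI &amp;quot;https://$DOMAIN&amp;quot;
&lt;/code&gt;&lt;/pre&gt;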
&lt;h2 id="what-a-penetration-test-actually-looks-like"&gt;What a penetration test actually looks like&lt;/h2&gt;
&lt;p&gt;A real pen test is somebody (or a small team) given a defined scope, a defined window, and written authorization, trying to compromise that scope. They might:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Phish your contractors using a lookalike domain (only if you signed off on social-engineering scope).&lt;/li&gt;
&lt;li&gt;Find an exposed &lt;code&gt;.env&lt;/code&gt; file in a misconfigured nginx alias and pivot to your database.&lt;/li&gt;
&lt;li&gt;Find an outdated dependency, write or borrow an exploit, and get a shell.&lt;/li&gt;
&lt;li&gt;Chain a low-severity SSRF in your app into AWS credential theft via IMDSv1 (this is why IMDSv2-only is non-negotiable).&lt;/li&gt;
&lt;li&gt;Try to escalate from a low-privilege application user to root on the host.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The deliverable is &lt;em&gt;what they achieved&lt;/em&gt;, with full reproduction steps, screenshots, and a prioritized remediation list. NIST SP 800-115 §5.2 frames this as the &amp;ldquo;Planning / Discovery / Attack / Reporting&amp;rdquo; four-phase pattern that almost every pen-test methodology since has adopted. CIS Controls v8 Control 18 explicitly states the goal is &amp;ldquo;to identify vulnerabilities and attack vectors that may be used to exploit enterprise systems.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;That is much more expensive than an audit, for three reasons:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;It is human-time-intensive.&lt;/strong&gt; A meaningful external pen test of one small SaaS is one to two weeks of senior practitioner time, not an automated scan.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;It carries production risk.&lt;/strong&gt; Even a careful tester occasionally trips a fail2ban rule, fills a log volume, or knocks a small VPS over. You need monitoring and a rollback plan.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The report has to be defensible.&lt;/strong&gt; Reproduction steps, CVSS-style severity (or equivalent), screenshots, retest. The write-up alone is usually a third of the engagement.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Cost range for a small-scope external SaaS pen test in 2026: roughly &lt;strong&gt;$5,000–$15,000&lt;/strong&gt; for a one-or-two-week engagement covering a single web application plus its hosting surface. More for cloud-environment pen tests or anything involving social engineering.&lt;/p&gt;
&lt;h2 id="when-the-audit-is-the-right-answer-which-is-most-of-the-time"&gt;When the &lt;em&gt;audit&lt;/em&gt; is the right answer (which is most of the time)&lt;/h2&gt;
&lt;p&gt;For an indie founder, ask for the audit first if any of these are true:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You have never had anyone look at the box other than you.&lt;/li&gt;
&lt;li&gt;A prospect asked for &amp;ldquo;evidence of security review&amp;rdquo; or your last security questionnaire response.&lt;/li&gt;
&lt;li&gt;You are about to enable a new sender (marketing tool, transactional ESP) and your DMARC is at &lt;code&gt;p=none&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;You inherited the infrastructure from a co-founder or contractor who left.&lt;/li&gt;
&lt;li&gt;You just migrated to a new VPS or cloud account and want to know what carried over wrong.&lt;/li&gt;
&lt;li&gt;A friend told you to &amp;ldquo;run a pen test&amp;rdquo; and you don&amp;rsquo;t yet have a fix list to validate against.&lt;/li&gt;
&lt;li&gt;You are pre-revenue or sub-$1M ARR and your security budget is &amp;ldquo;one weekend per quarter.&amp;rdquo;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The reason is simple: a pen test against a host that has never had a configuration audit will mostly produce findings that the audit would have produced for a fraction of the cost. You pay a senior tester $200/hour to discover that SSH still allows password login, that IMDSv1 is enabled, and that DMARC is at &lt;code&gt;p=none&lt;/code&gt;. That is not a good use of your money or of the tester&amp;rsquo;s time.&lt;/p&gt;
&lt;p&gt;CIS Controls v8 makes the same point structurally: penetration testing is Control 18, &lt;em&gt;after&lt;/em&gt; asset inventory, secure configuration, vulnerability management, audit logging, account management, and the rest. The intended flow is audit → fix → audit again → &lt;em&gt;then&lt;/em&gt; pen test, to see what got past the audit.&lt;/p&gt;
&lt;h2 id="when-the-pen-test-is-the-right-answer"&gt;When the &lt;em&gt;pen test&lt;/em&gt; is the right answer&lt;/h2&gt;
&lt;p&gt;Ask for the pen test when:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You&amp;rsquo;ve already done the audit, fixed the findings, and want to know what an attacker would still notice.&lt;/li&gt;
&lt;li&gt;A specific enterprise customer or insurer is contractually requiring an external pen test letter.&lt;/li&gt;
&lt;li&gt;You handle regulated data (PCI scope, HIPAA, certain regional privacy frameworks) where pen testing has a specific cadence requirement.&lt;/li&gt;
&lt;li&gt;Your application has meaningful auth/authz complexity (multi-tenant, SSO, role hierarchies) that a configuration audit cannot meaningfully validate without exploiting it.&lt;/li&gt;
&lt;li&gt;You&amp;rsquo;re raising and a sophisticated investor&amp;rsquo;s diligence shop is going to run their own tester against you anyway, and you&amp;rsquo;d rather be the one who knows the findings first.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If none of those apply, the honest answer most indie-scale auditors will give you — and the one this site gives — is &lt;em&gt;do the audit first.&lt;/em&gt;&lt;/p&gt;
&lt;h2 id="a-quick-decision-table"&gt;A quick decision table&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Situation&lt;/th&gt;
&lt;th&gt;Probably the right ask&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&amp;ldquo;I&amp;rsquo;ve never had anyone look at the box.&amp;rdquo;&lt;/td&gt;
&lt;td&gt;Read-only audit&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&amp;ldquo;Prospect wants &amp;lsquo;evidence of security review.&amp;rsquo;&amp;rdquo;&lt;/td&gt;
&lt;td&gt;Read-only audit + letter&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&amp;ldquo;We&amp;rsquo;re about to go from a $19/mo Postgres to a $999/mo customer.&amp;rdquo;&lt;/td&gt;
&lt;td&gt;Audit now, pen test in 6 mo&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&amp;ldquo;Compliance says pen test annually.&amp;rdquo;&lt;/td&gt;
&lt;td&gt;Pen test (after audit)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&amp;ldquo;We just migrated cloud accounts.&amp;rdquo;&lt;/td&gt;
&lt;td&gt;Read-only audit&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&amp;ldquo;Auditor already gave us a fix list six months ago.&amp;rdquo;&lt;/td&gt;
&lt;td&gt;Pen test, against the fix list&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&amp;ldquo;Insurer is requiring a pen-test letter for renewal.&amp;rdquo;&lt;/td&gt;
&lt;td&gt;Pen test (scoped to their ask)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&amp;ldquo;We don&amp;rsquo;t even know what we run.&amp;rdquo;&lt;/td&gt;
&lt;td&gt;Asset inventory, then audit&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2 id="what-to-deliberately-ignore-in-v1"&gt;What to deliberately ignore in v1&lt;/h2&gt;
&lt;p&gt;You do not need, on day one:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A SOC 2 Type II report. That is a control-attestation engagement, not a security test. Different deliverable, different price tier, different timeline.&lt;/li&gt;
&lt;li&gt;A red-team engagement. That is pen testing with social engineering and stealth requirements bolted on. Massively more expensive, only useful once the audit and pen-test layers are mature.&lt;/li&gt;
&lt;li&gt;A bug-bounty program. Useful for a public-facing app that is &lt;em&gt;already&lt;/em&gt; hardened. A bug bounty on day one mostly buys you reports about missing security headers from people farming for $50.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You also do not need a dedicated SIEM, a managed-detection vendor, or a third-party &amp;ldquo;compliance platform&amp;rdquo; subscription until you have actual customers asking for one of those things by name.&lt;/p&gt;
&lt;h2 id="common-indie-founder-gotchas"&gt;Common indie-founder gotchas&lt;/h2&gt;
&lt;p&gt;These are the failure modes that show up most often when somebody asks &amp;ldquo;audit or pen test?&amp;rdquo; for a 1-5 person team:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Asking for a pen test when the real ask is an audit.&lt;/strong&gt; The deliverable comes back full of &amp;ldquo;informational&amp;rdquo; findings about TLS posture and SSH config, and the founder concludes &amp;ldquo;pen tests are useless.&amp;rdquo; They aren&amp;rsquo;t — that was an audit wearing a pen-test invoice.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Asking for an audit when the real ask is a SOC 2 readiness assessment.&lt;/strong&gt; Different deliverable. An audit will not get you a SOC 2 report.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Letting a vendor define &amp;ldquo;audit&amp;rdquo; to mean &amp;ldquo;we ran a Nessus scan against your IP.&amp;rdquo;&lt;/strong&gt; A scan is one input to an audit, not the audit itself. Without the configuration review and the human prioritization, you&amp;rsquo;re getting a 200-page PDF of CVE noise.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Pen testing without an asset inventory.&lt;/strong&gt; The tester scopes &amp;ldquo;your production environment&amp;rdquo; and you forget the marketing site on Fly, the staging box on Hetzner, and the contractor&amp;rsquo;s old IP. The actual attack surface goes untested.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Treating an audit letter as a deliverability or compliance shortcut.&lt;/strong&gt; A clean audit letter says you passed an audit. It does not say your DMARC is green or that you are PCI compliant.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If any of those sound like a problem you&amp;rsquo;ve already had once and would rather not have again, that is exactly the kind of clarity a short, read-only, fixed-scope audit is for.&lt;/p&gt;
&lt;h2 id="further-reading-on-this-site"&gt;Further reading on this site&lt;/h2&gt;
&lt;p&gt;Two pieces in the same cluster, both prerequisites to whichever direction you choose:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="/indie-founder-vps-security-101/"&gt;Indie founder VPS security 101 →&lt;/a&gt; — the one-VPS baseline that every audit will assume you&amp;rsquo;ve already done. If you haven&amp;rsquo;t, do this first; it will save you most of the audit findings before anyone gets to your box.&lt;/li&gt;
&lt;li&gt;&lt;a href="/ubuntu-debian-ec2-hardening-checklist-2026/"&gt;Ubuntu / Debian EC2 hardening checklist (2026) →&lt;/a&gt; — the longer, EC2-flavored version of the same baseline. Security groups, IMDSv2, patch cadence, the things any external auditor will ask you about first.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h3 id="related-read-only-audit-paid"&gt;Related read-only audit (paid)&lt;/h3&gt;
&lt;p&gt;If you&amp;rsquo;ve done the baseline above and would like a written, prioritized fix list for one VPS or EC2 instance — SSH posture, firewall/SG rules, IMDSv2, patch state, TLS, Docker exposure, logging, backups — that is exactly the &lt;a href="https://richgibbs.dev/quickcheck/"&gt;VPS/EC2 Hardening QuickCheck&lt;/a&gt; we offer. $149, one host, read-only, no agents installed, 24-hour turnaround. No managed retainers, no exploitation, no surprise add-ons. Multi-host setups upgrade to the $249 tier; everything stays read-only.&lt;/p&gt;
&lt;p&gt;If you&amp;rsquo;d rather DIY the externally observable subset first, the free &lt;a href="https://blog.richgibbs.dev/quickcheck-mini/"&gt;QuickCheck Mini&lt;/a&gt; script runs the DNS / ports / TLS / headers / public-IMDS checks against your own host in about a minute.&lt;/p&gt;
&lt;p&gt;14-day refund on the paid QuickCheck, no questions asked.&lt;/p&gt;</content:encoded>
      <guid isPermaLink="true">https://blog.richgibbs.dev/security-audit-vs-penetration-test-indie-founder-2026/</guid>
      <category>security</category>
      <category>audit</category>
      <category>penetration-testing</category>
      <category>indie-founder</category>
      <category>saas</category>
      <category>nist</category>
      <category>vps</category>
      <category>aws</category>
      <pubDate>Thu, 14 May 2026 14:30:00 +0000</pubDate>
    </item>
    <item>
      <title>Encrypting Your EBS Root Volume Without Rebuilding the Server (AWS 2026)</title>
      <link>https://blog.richgibbs.dev/encrypting-ebs-root-volume-without-rebuilding/</link>
      <description>A practical, indie-founder guide to migrating an unencrypted EC2 root volume to KMS-encrypted EBS — without rebuilding the instance, losing data, or fighting AZ mismatch and root device name traps.</description>
      <content:encoded>&lt;p&gt;You checked your EC2 console, opened the &lt;strong&gt;Volumes&lt;/strong&gt; view, and noticed it: the &lt;code&gt;Encrypted&lt;/code&gt; column on your root volume says &lt;strong&gt;No&lt;/strong&gt;. You probably launched that instance from a community AMI a year or two ago, before AWS started defaulting to encrypted EBS in most regions. Everything since then — your OS, your app code, your customer data, your secrets — has been sitting on an unencrypted block device. Snapshots inherit that state, AMIs you shared with another account inherited it, and any future restore continues to inherit it until you do something about it.&lt;/p&gt;
&lt;p&gt;This guide walks through how to fix that without rebuilding the box from scratch. The path is well-known to anyone who has done it before, and full of small AWS-specific traps if you haven&amp;rsquo;t. The summary: &lt;strong&gt;you cannot encrypt an existing EBS volume in place. You snapshot it, copy the snapshot with encryption enabled, create a new volume from the copy, and swap the root.&lt;/strong&gt; That&amp;rsquo;s it — but every step has at least one way to get wrong on the first try.&lt;/p&gt;
&lt;p&gt;It pairs naturally with the &lt;a href="/ubuntu-debian-ec2-hardening-checklist-2026/"&gt;Ubuntu/Debian EC2 hardening checklist&lt;/a&gt; and the &lt;a href="/aws-imdsv2-migration-without-breaking-things/"&gt;AWS IMDSv2 migration guide&lt;/a&gt; — same broader topic: tightening cloud posture on workloads that were launched before you cared about any of this.&lt;/p&gt;
&lt;h2 id="why-this-matters"&gt;Why this matters&lt;/h2&gt;
&lt;p&gt;The standard pushback on &amp;ldquo;we should encrypt EBS&amp;rdquo; is: &lt;em&gt;the disk never leaves AWS&amp;rsquo;s data center, so what are we protecting against?&lt;/em&gt; That argument misses where the risk actually lives.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Snapshots and AMIs are portable.&lt;/strong&gt; An unencrypted snapshot can be shared to another AWS account or made public with a single API call. An encrypted snapshot cannot be made public at all, and sharing it with a specific account only works with a customer-managed KMS key that the other account has been given permission to use. The encryption flag is a hard guard against accidental cross-account data leakage, including the classic &amp;ldquo;I made this AMI public to debug something&amp;rdquo; mistake.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;KMS gives you a second access-control layer.&lt;/strong&gt; With unencrypted EBS, anyone with &lt;code&gt;ec2:CreateVolume&lt;/code&gt; and &lt;code&gt;ec2:AttachVolume&lt;/code&gt; on the snapshot can mount it on a new instance and read everything. With KMS, they also need &lt;code&gt;kms:Decrypt&lt;/code&gt; and (often) &lt;code&gt;kms:CreateGrant&lt;/code&gt; on the key. That separation is the single biggest practical reason to encrypt — it forces a deliberate second permission for &amp;ldquo;I want to actually read this data&amp;rdquo;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Compliance, even the informal kind.&lt;/strong&gt; SOC 2, ISO 27001, PCI, HIPAA, most state privacy laws, and most enterprise customer security questionnaires ask whether data at rest is encrypted. The honest answer for an unencrypted EBS volume is &amp;ldquo;no&amp;rdquo;. Encrypted-with-KMS gets you a clean &amp;ldquo;yes&amp;rdquo; with zero application changes.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Future-proofing.&lt;/strong&gt; AWS now enables &amp;ldquo;Always encrypt new EBS volumes&amp;rdquo; by default in most regions for new accounts. Older accounts keep their old default. New volumes you create going forward will be encrypted; the old one keeps being the outlier. Migrating now eliminates the special case before it becomes the only special case left.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;None of this changes how your app behaves. EBS encryption is transparent. The OS doesn&amp;rsquo;t know. The application doesn&amp;rsquo;t know. There is no measurable performance hit on current-generation instance types. The reason most people haven&amp;rsquo;t done it is that the migration is fiddly, not that the destination is.&lt;/p&gt;
&lt;h2 id="what-enable-default-encryption-does-and-doesnt-fix"&gt;What &amp;ldquo;enable default encryption&amp;rdquo; does and doesn&amp;rsquo;t fix&lt;/h2&gt;
&lt;p&gt;In EC2 → &lt;strong&gt;Account attributes&lt;/strong&gt; → &lt;strong&gt;EBS encryption&lt;/strong&gt;, there&amp;rsquo;s a setting called &lt;strong&gt;Always encrypt new EBS volumes&lt;/strong&gt;. Turning it on is good and you should do it. But understand what it does:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;✅ New volumes you create from scratch are encrypted.&lt;/li&gt;
&lt;li&gt;✅ New snapshots you take of &lt;em&gt;already-encrypted&lt;/em&gt; volumes stay encrypted.&lt;/li&gt;
&lt;li&gt;✅ Volumes restored from encrypted snapshots stay encrypted.&lt;/li&gt;
&lt;li&gt;❌ Existing unencrypted volumes are &lt;strong&gt;not&lt;/strong&gt; retroactively encrypted.&lt;/li&gt;
&lt;li&gt;❌ Snapshots of existing unencrypted volumes are &lt;strong&gt;not&lt;/strong&gt; encrypted by default — they inherit the source&amp;rsquo;s state.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That last point trips people up. Once &amp;ldquo;default encryption&amp;rdquo; is on, you might assume taking a fresh snapshot of an old volume gives you an encrypted snapshot. It does not. Snapshots match the source volume&amp;rsquo;s encryption state. You have to make the encrypted copy explicit, which is exactly what the migration path below does.&lt;/p&gt;
&lt;p&gt;So &lt;strong&gt;step zero&lt;/strong&gt; is: flip on default encryption for the region (EC2 → Account attributes → EBS encryption → Always encrypt new EBS volumes), pick a default KMS key (&lt;code&gt;aws/ebs&lt;/code&gt; is fine to start, or a CMK you control), and then deal with the old volume.&lt;/p&gt;
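&lt;p&gt;The same step zero can be done from the CLI. A minimal sketch, assuming &lt;code&gt;us-east-1&lt;/code&gt; as a placeholder region (the setting is per region, so repeat it for each region you use) and that the AWS-managed default key is acceptable:&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;REGION=us-east-1   # placeholder

# Turn on default encryption for new volumes in this region, then read it back
aws ec2 enable-ebs-encryption-by-default --region &amp;quot;$REGION&amp;quot;
aws ec2 get-ebs-encryption-by-default --region &amp;quot;$REGION&amp;quot;

# Show which KMS key new volumes will use by default
aws ec2 get-ebs-default-kms-key-id --region &amp;quot;$REGION&amp;quot;

# List the volumes that are still unencrypted and where they are attached
aws ec2 describe-volumes --region &amp;quot;$REGION&amp;quot; \
  --filters Name=encrypted,Values=false \
  --query 'Volumes[].[VolumeId,Size,Attachments[0].InstanceId]' --output table
&lt;/code&gt;&lt;/pre&gt;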
&lt;h2 id="the-two-migration-paths"&gt;The two migration paths&lt;/h2&gt;
&lt;p&gt;You have two viable options for the root volume itself. Pick based on how much downtime you can take and whether you want to swap instance types or AMIs at the same time.&lt;/p&gt;
&lt;h3 id="path-a-swap-the-root-volume-on-the-same-instance-downtime-path"&gt;Path A — Swap the root volume on the same instance (downtime path)&lt;/h3&gt;
&lt;p&gt;Same instance ID, same Elastic IP, same security groups, same instance profile, same launch template. You take a planned outage (typically 5-15 minutes), encrypt the volume, swap it back in, boot, verify.&lt;/p&gt;
&lt;p&gt;Use Path A when:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The instance has external state (Elastic IP attached directly, hard-coded references to the instance ID, attached IAM role, manually-created security group memberships) you don&amp;rsquo;t want to redo.&lt;/li&gt;
&lt;li&gt;You&amp;rsquo;re fine with a single short maintenance window.&lt;/li&gt;
&lt;li&gt;You haven&amp;rsquo;t already planned an OS or AMI refresh.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="path-b-build-a-new-instance-from-an-encrypted-ami-rebuild-path"&gt;Path B — Build a new instance from an encrypted AMI (rebuild path)&lt;/h3&gt;
&lt;p&gt;Snapshot → copy with encryption → create AMI → launch a fresh instance from the encrypted AMI → swap the Elastic IP (or DNS) → terminate the old box.&lt;/p&gt;
&lt;p&gt;Use Path B when:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You also want to refresh the OS or instance type.&lt;/li&gt;
&lt;li&gt;You can run two instances side-by-side briefly and cut over with a CNAME or EIP move.&lt;/li&gt;
&lt;li&gt;You&amp;rsquo;d rather not stop a production box, even briefly.&lt;/li&gt;
&lt;li&gt;The instance has accumulated hand-applied config you&amp;rsquo;d like to leave behind anyway.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Path B is more cleanup work but lower risk because the old volume stays untouched until you&amp;rsquo;re confident the new instance is healthy. Path A is faster and surgical.&lt;/p&gt;
&lt;p&gt;The rest of this guide focuses on Path A, because that&amp;rsquo;s what most solo operators on a single-box deployment actually want. The same KMS copy step is the core of Path B; the difference is what you do with the resulting AMI afterwards.&lt;/p&gt;
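&lt;p&gt;For orientation, one way the first half of Path B can look from the CLI is sketched below. It is a variation on the sequence described above, not the only route: it images the instance first and lets &lt;code&gt;copy-image&lt;/code&gt; encrypt the backing snapshots, instead of copying a snapshot by hand and building an AMI from it. The instance ID, region, and image names are placeholders.&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;INSTANCE_ID=i-0123456789abcdef0   # placeholder
REGION=us-east-1                  # placeholder
KMS_KEY_ID=&amp;quot;alias/aws/ebs&amp;quot;        # or a customer-managed key ARN

# Image the running box without rebooting it.
# With --no-reboot, filesystem consistency is not guaranteed.
SRC_AMI=$(aws ec2 create-image --instance-id &amp;quot;$INSTANCE_ID&amp;quot; --region &amp;quot;$REGION&amp;quot; \
  --name &amp;quot;pre-encryption-$(date +%Y%m%d)&amp;quot; --no-reboot \
  --query 'ImageId' --output text)
aws ec2 wait image-available --image-ids &amp;quot;$SRC_AMI&amp;quot; --region &amp;quot;$REGION&amp;quot;

# Copy the AMI with encryption; its backing snapshots are encrypted in the copy
ENC_AMI=$(aws ec2 copy-image --source-image-id &amp;quot;$SRC_AMI&amp;quot; --source-region &amp;quot;$REGION&amp;quot; \
  --region &amp;quot;$REGION&amp;quot; --name &amp;quot;encrypted-$(date +%Y%m%d)&amp;quot; \
  --encrypted --kms-key-id &amp;quot;$KMS_KEY_ID&amp;quot; \
  --query 'ImageId' --output text)
aws ec2 wait image-available --image-ids &amp;quot;$ENC_AMI&amp;quot; --region &amp;quot;$REGION&amp;quot;

echo &amp;quot;Launch the replacement instance from: $ENC_AMI&amp;quot;
&lt;/code&gt;&lt;/pre&gt;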
&lt;blockquote&gt;
&lt;p&gt;Want a read-only audit that tells you which EBS volumes in your account are still unencrypted, plus open security groups, IMDSv1 stragglers, exposed IAM users, and a few other &amp;ldquo;you&amp;rsquo;d rather know&amp;rdquo; items? That&amp;rsquo;s exactly what &lt;a href="https://richgibbs.dev/quickcheck/"&gt;QuickCheck&lt;/a&gt; is built for. One run, plain-English report, no install on your account.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2 id="path-a-step-by-step"&gt;Path A: Step-by-step&lt;/h2&gt;
&lt;h3 id="1-inventory-and-prepare"&gt;1. Inventory and prepare&lt;/h3&gt;
&lt;p&gt;First, capture the things you&amp;rsquo;ll need to put back.&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;INSTANCE_ID=i-0123456789abcdef0
REGION=us-east-1

aws ec2 describe-instances \
  --instance-ids &amp;quot;$INSTANCE_ID&amp;quot; \
  --region &amp;quot;$REGION&amp;quot; \
  --query 'Reservations[0].Instances[0].{
    AZ:Placement.AvailabilityZone,
    Root:RootDeviceName,
    RootVol:BlockDeviceMappings[?DeviceName==`/dev/xvda` || DeviceName==`/dev/sda1`].Ebs.VolumeId | [0],
    Type:InstanceType,
    AMI:ImageId
  }'
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Write down &lt;code&gt;AZ&lt;/code&gt; (needed in step 5), &lt;code&gt;Root&lt;/code&gt; (will be &lt;code&gt;/dev/xvda&lt;/code&gt; or &lt;code&gt;/dev/sda1&lt;/code&gt;; the difference matters in step 6), and &lt;code&gt;RootVol&lt;/code&gt; (the source volume ID).&lt;/p&gt;
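&lt;p&gt;If &lt;code&gt;RootVol&lt;/code&gt; comes back as &lt;code&gt;null&lt;/code&gt;, the root device is probably named something other than the two values hard-coded in the query. A sketch of a lookup that avoids guessing, by reading the root device name first and then finding the volume attached at it (&lt;code&gt;ROOT_NAME&lt;/code&gt; and &lt;code&gt;ROOT_VOL_ID&lt;/code&gt; are illustrative variable names, not used elsewhere in this guide):&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;# Root device name as EC2 records it for this instance
ROOT_NAME=$(aws ec2 describe-instances --instance-ids &amp;quot;$INSTANCE_ID&amp;quot; --region &amp;quot;$REGION&amp;quot; \
  --query 'Reservations[0].Instances[0].RootDeviceName' --output text)

# The volume currently attached to this instance at that device name
ROOT_VOL_ID=$(aws ec2 describe-volumes --region &amp;quot;$REGION&amp;quot; \
  --filters Name=attachment.instance-id,Values=&amp;quot;$INSTANCE_ID&amp;quot; \
            Name=attachment.device,Values=&amp;quot;$ROOT_NAME&amp;quot; \
  --query 'Volumes[0].VolumeId' --output text)

echo &amp;quot;root device: $ROOT_NAME  root volume: $ROOT_VOL_ID&amp;quot;
&lt;/code&gt;&lt;/pre&gt;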
&lt;p&gt;Pick the KMS key you&amp;rsquo;ll encrypt with. The default AWS-managed &lt;code&gt;alias/aws/ebs&lt;/code&gt; is fine for most setups; use a customer-managed key (CMK) if you want explicit grant control or cross-account isolation later.&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;KMS_KEY_ID=&amp;quot;alias/aws/ebs&amp;quot;   # or arn:aws:kms:us-east-1:&amp;lt;acct&amp;gt;:key/&amp;lt;uuid&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

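&lt;p&gt;Confirm the key exists, is enabled, and is the right type. This does not prove your caller has permission to encrypt with it; a missing KMS permission only surfaces later, when the snapshot copy or volume creation fails:&lt;/p&gt;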
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;aws kms describe-key --key-id &amp;quot;$KMS_KEY_ID&amp;quot; --region &amp;quot;$REGION&amp;quot; \
  --query 'KeyMetadata.{KeyState:KeyState,KeyUsage:KeyUsage,Arn:Arn}'
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;code&gt;KeyState&lt;/code&gt; must be &lt;code&gt;Enabled&lt;/code&gt; and &lt;code&gt;KeyUsage&lt;/code&gt; must be &lt;code&gt;ENCRYPT_DECRYPT&lt;/code&gt;.&lt;/p&gt;
&lt;h3 id="2-stop-the-instance"&gt;2. Stop the instance&lt;/h3&gt;
&lt;p&gt;EBS root volumes can only be detached when the instance is stopped. There is no online path for the root. Plan a maintenance window now.&lt;/p&gt;
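&lt;p&gt;One thing worth checking before you stop: a stop/start cycle releases an auto-assigned public IPv4 address, while an Elastic IP stays associated. A quick way to see which one you have, as a sketch using the variables from step 1:&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;aws ec2 describe-addresses --region &amp;quot;$REGION&amp;quot; \
  --filters Name=instance-id,Values=&amp;quot;$INSTANCE_ID&amp;quot; \
  --query 'Addresses[].[PublicIp,AllocationId]' --output text
# Empty output means no Elastic IP is associated with the instance,
# so expect a different public IP after the restart.
&lt;/code&gt;&lt;/pre&gt;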
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;aws ec2 stop-instances --instance-ids &amp;quot;$INSTANCE_ID&amp;quot; --region &amp;quot;$REGION&amp;quot;
aws ec2 wait instance-stopped --instance-ids &amp;quot;$INSTANCE_ID&amp;quot; --region &amp;quot;$REGION&amp;quot;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This is the start of your downtime clock. From here, the only thing protecting you is that the original volume still exists and is unmodified.&lt;/p&gt;
&lt;h3 id="3-snapshot-the-unencrypted-volume"&gt;3. Snapshot the unencrypted volume&lt;/h3&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;SRC_VOL=vol-0aaaabbbbccccdddd0   # from step 1

SNAP_ID=$(aws ec2 create-snapshot \
  --volume-id &amp;quot;$SRC_VOL&amp;quot; \
  --description &amp;quot;pre-encryption snapshot of $SRC_VOL&amp;quot; \
  --region &amp;quot;$REGION&amp;quot; \
  --query 'SnapshotId' --output text)

aws ec2 wait snapshot-completed --snapshot-ids &amp;quot;$SNAP_ID&amp;quot; --region &amp;quot;$REGION&amp;quot;
echo &amp;quot;Source snapshot: $SNAP_ID&amp;quot;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This snapshot is &lt;strong&gt;unencrypted&lt;/strong&gt; (inherits source state). It&amp;rsquo;s also your rollback insurance for the rest of the migration — do not delete it until you&amp;rsquo;re done and verified.&lt;/p&gt;
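&lt;p&gt;On a large volume that has never been snapshotted, the first snapshot can outlast the CLI waiter, which by default polls every 15 seconds for up to 40 attempts and then exits with an error even though the snapshot is still progressing. A small polling loop is one way around that; this is a sketch, and the 30-second interval is arbitrary:&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;while true; do
  # Prints something like: pending  45%
  STATUS=$(aws ec2 describe-snapshots --snapshot-ids &amp;quot;$SNAP_ID&amp;quot; --region &amp;quot;$REGION&amp;quot; \
    --query 'Snapshots[0].[State,Progress]' --output text)
  echo &amp;quot;$STATUS&amp;quot;
  case &amp;quot;$STATUS&amp;quot; in
    completed*) break ;;
    error*) echo &amp;quot;snapshot failed; do not continue&amp;quot;; break ;;
  esac
  sleep 30
done
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The same loop works for the encrypted copy in step 4 by swapping in &lt;code&gt;$ENC_SNAP_ID&lt;/code&gt;.&lt;/p&gt;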
&lt;h3 id="4-copy-the-snapshot-with-kms-encryption"&gt;4. Copy the snapshot with KMS encryption&lt;/h3&gt;
&lt;p&gt;This is the only step in the whole process where encryption actually happens. It&amp;rsquo;s a same-region copy with &lt;code&gt;--encrypted&lt;/code&gt; and a key ID.&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;ENC_SNAP_ID=$(aws ec2 copy-snapshot \
  --source-region &amp;quot;$REGION&amp;quot; \
  --region &amp;quot;$REGION&amp;quot; \
  --source-snapshot-id &amp;quot;$SNAP_ID&amp;quot; \
  --description &amp;quot;encrypted copy of $SNAP_ID&amp;quot; \
  --encrypted \
  --kms-key-id &amp;quot;$KMS_KEY_ID&amp;quot; \
  --query 'SnapshotId' --output text)

aws ec2 wait snapshot-completed --snapshot-ids &amp;quot;$ENC_SNAP_ID&amp;quot; --region &amp;quot;$REGION&amp;quot;
echo &amp;quot;Encrypted snapshot: $ENC_SNAP_ID&amp;quot;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Confirm it&amp;rsquo;s actually encrypted:&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;aws ec2 describe-snapshots --snapshot-ids &amp;quot;$ENC_SNAP_ID&amp;quot; --region &amp;quot;$REGION&amp;quot; \
  --query 'Snapshots[0].{Encrypted:Encrypted,KmsKeyId:KmsKeyId}'
&lt;/code&gt;&lt;/pre&gt;

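&lt;p&gt;You want &lt;code&gt;"Encrypted": true&lt;/code&gt; and a &lt;code&gt;KmsKeyId&lt;/code&gt; ARN that matches the key you used. If you passed an alias, compare against the &lt;code&gt;Arn&lt;/code&gt; that &lt;code&gt;describe-key&lt;/code&gt; returned in step 1, since the API reports the key ARN rather than the alias.&lt;/p&gt;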
&lt;h3 id="5-create-the-new-encrypted-volume-in-the-right-az"&gt;5. Create the new encrypted volume in the right AZ&lt;/h3&gt;
&lt;p&gt;This is the single most common place to lose 20 minutes. &lt;strong&gt;The new volume must be in the same Availability Zone as the instance.&lt;/strong&gt; EBS volumes are AZ-scoped; you can&amp;rsquo;t attach a &lt;code&gt;us-east-1a&lt;/code&gt; volume to a &lt;code&gt;us-east-1b&lt;/code&gt; instance.&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;AZ=us-east-1b   # from step 1

NEW_VOL=$(aws ec2 create-volume \
  --snapshot-id &amp;quot;$ENC_SNAP_ID&amp;quot; \
  --availability-zone &amp;quot;$AZ&amp;quot; \
  --volume-type gp3 \
  --encrypted \
  --kms-key-id &amp;quot;$KMS_KEY_ID&amp;quot; \
  --region &amp;quot;$REGION&amp;quot; \
  --query 'VolumeId' --output text)

aws ec2 wait volume-available --volume-ids &amp;quot;$NEW_VOL&amp;quot; --region &amp;quot;$REGION&amp;quot;
echo &amp;quot;New encrypted root: $NEW_VOL&amp;quot;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Notes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;gp3&lt;/code&gt; is the modern default. If your old volume was &lt;code&gt;gp2&lt;/code&gt; or &lt;code&gt;io1&lt;/code&gt;, this is a fine moment to upgrade. &lt;code&gt;gp3&lt;/code&gt; is cheaper than &lt;code&gt;gp2&lt;/code&gt; at equivalent performance.&lt;/li&gt;
&lt;li&gt;If you need a specific size (larger than the snapshot), add &lt;code&gt;--size N&lt;/code&gt; in GiB. You can grow but not shrink.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="6-detach-the-old-root-attach-the-new-one"&gt;6. Detach the old root, attach the new one&lt;/h3&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;aws ec2 detach-volume --volume-id &amp;quot;$SRC_VOL&amp;quot; --region &amp;quot;$REGION&amp;quot;
aws ec2 wait volume-available --volume-ids &amp;quot;$SRC_VOL&amp;quot; --region &amp;quot;$REGION&amp;quot;

# Use the ROOT device name from step 1. It is set by the AMI, not the instance type:
# typically /dev/sda1 on Ubuntu AMIs and /dev/xvda on Amazon Linux and Debian AMIs.
ROOT_DEV=/dev/xvda   # replace with the value from step 1

aws ec2 attach-volume \
  --instance-id &amp;quot;$INSTANCE_ID&amp;quot; \
  --volume-id &amp;quot;$NEW_VOL&amp;quot; \
  --device &amp;quot;$ROOT_DEV&amp;quot; \
  --region &amp;quot;$REGION&amp;quot;

aws ec2 wait volume-in-use --volume-ids &amp;quot;$NEW_VOL&amp;quot; --region &amp;quot;$REGION&amp;quot;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Get the root device name wrong and the instance will not start: EC2 looks for a volume attached at &lt;code&gt;RootDeviceName&lt;/code&gt; and finds none there. Use the &lt;code&gt;Root&lt;/code&gt; value from step 1, not your assumptions.&lt;/p&gt;
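&lt;p&gt;One way to catch both the AZ and the root-device mistakes before touching any volume is a small pre-flight comparison. This is a minimal sketch, not part of the runbook above: &lt;code&gt;expect_match&lt;/code&gt; is a hypothetical helper, and the commented wiring assumes the &lt;code&gt;$INSTANCE_ID&lt;/code&gt;, &lt;code&gt;$REGION&lt;/code&gt;, &lt;code&gt;$AZ&lt;/code&gt;, and &lt;code&gt;$ROOT_DEV&lt;/code&gt; variables from the earlier steps.&lt;/p&gt;

```shell
#!/usr/bin/env bash
# Hypothetical helper: compare a planned value with what the EC2 API reports.
# Prints a one-line verdict and returns non-zero on mismatch.
expect_match() {
  local label="$1" planned="$2" reported="$3"
  if [ "$planned" = "$reported" ]; then
    echo "ok: $label = $planned"
  else
    echo "MISMATCH: $label planned=$planned reported=$reported"
    return 1
  fi
}

# Example wiring (commented out so the sketch has no side effects):
# expect_match AZ "$AZ" "$(aws ec2 describe-instances --instance-ids "$INSTANCE_ID" \
#   --region "$REGION" \
#   --query 'Reservations[0].Instances[0].Placement.AvailabilityZone' --output text)" || exit 1
# expect_match root "$ROOT_DEV" "$(aws ec2 describe-instances --instance-ids "$INSTANCE_ID" \
#   --region "$REGION" \
#   --query 'Reservations[0].Instances[0].RootDeviceName' --output text)" || exit 1
```

&lt;p&gt;Running the two comparisons while the instance is still stopped costs a few seconds and turns a failed start into a readable one-line message.&lt;/p&gt;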
&lt;h3 id="7-start-the-instance-and-verify"&gt;7. Start the instance and verify&lt;/h3&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;aws ec2 start-instances --instance-ids &amp;quot;$INSTANCE_ID&amp;quot; --region &amp;quot;$REGION&amp;quot;
aws ec2 wait instance-running --instance-ids &amp;quot;$INSTANCE_ID&amp;quot; --region &amp;quot;$REGION&amp;quot;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;SSH in and confirm:&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;lsblk -f
# nvme0n1p1 (or xvda1) should be your root /, mounted, ext4/xfs, same UUID as before

mount | grep ' on / '
# Should show your root device

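# From your workstation (the instance may not have AWS credentials for this):
aws ec2 describe-volumes --volume-ids &amp;quot;$NEW_VOL&amp;quot; --region &amp;quot;$REGION&amp;quot; \
  --query 'Volumes[0].{Encrypted:Encrypted,KmsKeyId:KmsKeyId}'
# Encrypted: true, KmsKeyId: matches your key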
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Run your app&amp;rsquo;s health check. If it had systemd services, &lt;code&gt;systemctl --failed&lt;/code&gt; should be empty. Check that any external monitoring is green.&lt;/p&gt;
&lt;h3 id="8-clean-up"&gt;8. Clean up&lt;/h3&gt;
&lt;p&gt;Once you&amp;rsquo;re confident — wait a day or two on stateful boxes — delete the old unencrypted volume and the unencrypted snapshot.&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;aws ec2 delete-volume   --volume-id   &amp;quot;$SRC_VOL&amp;quot;   --region &amp;quot;$REGION&amp;quot;
aws ec2 delete-snapshot --snapshot-id &amp;quot;$SNAP_ID&amp;quot;   --region &amp;quot;$REGION&amp;quot;
# Keep $ENC_SNAP_ID — it's your encrypted baseline going forward.
&lt;/code&gt;&lt;/pre&gt;

&lt;h2 id="pitfalls-to-know-up-front"&gt;Pitfalls to know up front&lt;/h2&gt;
&lt;p&gt;A few traps that have caused real outages:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;AZ mismatch on create-volume.&lt;/strong&gt; Already covered, worth repeating. &lt;code&gt;create-volume&lt;/code&gt; itself succeeds in any AZ; the failure surfaces later, when &lt;code&gt;attach-volume&lt;/code&gt; rejects the volume with &lt;code&gt;InvalidVolume.ZoneMismatch&lt;/code&gt;. Recreate the volume in the right AZ; the cost is just the create/delete time.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Root device name &lt;code&gt;/dev/xvda&lt;/code&gt; vs &lt;code&gt;/dev/sda1&lt;/code&gt; vs &lt;code&gt;/dev/nvme0n1&lt;/code&gt;.&lt;/strong&gt; The &lt;em&gt;AWS API&lt;/em&gt; root device name is &lt;code&gt;/dev/xvda&lt;/code&gt; or &lt;code&gt;/dev/sda1&lt;/code&gt;. The kernel may surface the volume as &lt;code&gt;/dev/nvme0n1&lt;/code&gt;. Use the &lt;strong&gt;API name&lt;/strong&gt; for &lt;code&gt;attach-volume --device&lt;/code&gt;; the kernel name is irrelevant at attach time.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;KMS permission gaps.&lt;/strong&gt; If you&amp;rsquo;re using a CMK in a different account, or restricting your IAM role tightly, you need &lt;code&gt;kms:Decrypt&lt;/code&gt;, &lt;code&gt;kms:GenerateDataKeyWithoutPlaintext&lt;/code&gt;, &lt;code&gt;kms:ReEncrypt*&lt;/code&gt;, &lt;code&gt;kms:CreateGrant&lt;/code&gt;, and &lt;code&gt;kms:DescribeKey&lt;/code&gt; somewhere in the chain. The failure is often indirect: the volume never becomes available, or the instance stops right after starting with a state reason such as &lt;code&gt;Client.InternalError&lt;/code&gt; or &lt;code&gt;Client.InvalidKMSKey.InvalidState&lt;/code&gt;. Don&amp;rsquo;t grant &lt;code&gt;kms:*&lt;/code&gt; on the key to make it go away; grant exactly those five.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Forgetting the instance profile.&lt;/strong&gt; Detaching and reattaching volumes does not touch the IAM role on the instance. But if you were using IMDSv1 and your migration coincides with an AMI change, double-check the role survived. Pair this work with the &lt;a href="/aws-imdsv2-migration-without-breaking-things/"&gt;IMDSv2 migration&lt;/a&gt; and you only do the maintenance window once.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Application state outside the root volume.&lt;/strong&gt; If your app keeps data on a secondary EBS volume, &lt;strong&gt;encrypt that one too&lt;/strong&gt; with the same process. Root encryption alone leaves your real data unencrypted on disk, which is the worst possible posture: you get the operational cost of the migration with none of the actual data protection.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Backups inherit, but old backups don&amp;rsquo;t migrate.&lt;/strong&gt; Once the volume is encrypted, AWS Backup / DLM snapshots inherit the encryption. Old snapshots from before today are still unencrypted. Either re-snapshot via the same copy-with-encryption trick or expire them through retention policy.&lt;/li&gt;
&lt;/ul&gt;
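&lt;p&gt;The last two pitfalls, secondary volumes and old snapshots, are easier to close with an inventory than from memory. The following is a sketch under stated assumptions: &lt;code&gt;list_unencrypted&lt;/code&gt; is a hypothetical wrapper name, it covers one region per call, and it relies on the &lt;code&gt;encrypted&lt;/code&gt; filter of &lt;code&gt;describe-volumes&lt;/code&gt; and &lt;code&gt;describe-snapshots&lt;/code&gt;.&lt;/p&gt;

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: list unencrypted EBS volumes, then unencrypted
# snapshots owned by this account, for a single region. Read-only.
list_unencrypted() {
  local region="$1"
  aws ec2 describe-volumes --region "$region" \
    --filters Name=encrypted,Values=false \
    --query 'Volumes[].[VolumeId,AvailabilityZone,State]' --output text
  aws ec2 describe-snapshots --region "$region" --owner-ids self \
    --filters Name=encrypted,Values=false \
    --query 'Snapshots[].[SnapshotId,VolumeId,StartTime]' --output text
}

# Example (not run here): list_unencrypted "$REGION"
```

&lt;p&gt;Empty output for every region you use is the state to aim for; anything listed either goes through the same copy-with-encryption process or gets expired.&lt;/p&gt;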
&lt;h2 id="rollback-plan"&gt;Rollback plan&lt;/h2&gt;
&lt;p&gt;You should write this down before you start, not figure it out at 11 p.m.&lt;/p&gt;
&lt;p&gt;If the new volume won&amp;rsquo;t boot:&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;# 1. Stop the instance again
aws ec2 stop-instances --instance-ids &amp;quot;$INSTANCE_ID&amp;quot; --region &amp;quot;$REGION&amp;quot;
aws ec2 wait instance-stopped --instance-ids &amp;quot;$INSTANCE_ID&amp;quot; --region &amp;quot;$REGION&amp;quot;

# 2. Detach the new (broken) volume
aws ec2 detach-volume --volume-id &amp;quot;$NEW_VOL&amp;quot; --region &amp;quot;$REGION&amp;quot;
aws ec2 wait volume-available --volume-ids &amp;quot;$NEW_VOL&amp;quot; --region &amp;quot;$REGION&amp;quot;

# 3. Reattach the original unencrypted volume as the root
aws ec2 attach-volume \
  --instance-id &amp;quot;$INSTANCE_ID&amp;quot; \
  --volume-id &amp;quot;$SRC_VOL&amp;quot; \
  --device &amp;quot;$ROOT_DEV&amp;quot; \
  --region &amp;quot;$REGION&amp;quot;

# 4. Start the instance
aws ec2 start-instances --instance-ids &amp;quot;$INSTANCE_ID&amp;quot; --region &amp;quot;$REGION&amp;quot;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;You are now back where you started, with the encrypted copy intact so you can investigate why the boot failed. Typical causes: the wrong root device name (so the new volume attached as a non-root device), a KMS key the instance could not use, or a source volume with a corrupt boot record that the snapshot faithfully preserved.&lt;/p&gt;
&lt;p&gt;Do not delete &lt;code&gt;$SRC_VOL&lt;/code&gt; or &lt;code&gt;$SNAP_ID&lt;/code&gt; until rollback is no longer a concern — typically 24-72 hours of successful operation.&lt;/p&gt;
&lt;h2 id="what-this-is-not"&gt;What this is not&lt;/h2&gt;
&lt;p&gt;To set expectations clearly:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;This is &lt;strong&gt;not full-disk encryption against a local-machine threat model&lt;/strong&gt;. KMS-encrypted EBS protects data at rest as managed by AWS. It does not stop a process running on the live instance from reading its own filesystem. For that, you need LUKS or filesystem-level encryption on top, with key management of your own — a separate project.&lt;/li&gt;
&lt;li&gt;This is &lt;strong&gt;not a compliance attestation&lt;/strong&gt;. Migrating to encrypted EBS is a posture improvement, not a certification. Your SOC 2 / ISO / HIPAA auditor will still want to see your key management policy, KMS key rotation status, and the inventory query that proves &lt;em&gt;all&lt;/em&gt; volumes are encrypted, not just the one you remembered.&lt;/li&gt;
&lt;li&gt;This is &lt;strong&gt;not a guarantee&lt;/strong&gt;. EBS encryption closes a specific cross-account data leakage path and gives you a KMS-grant access boundary. It does not address misconfigured security groups, leaked long-lived IAM keys, S3 buckets without encryption, or application vulnerabilities. Treat it as one item on the list.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Encrypt EBS because it is cheap, well-understood, and removes a real foot-gun in your snapshot and sharing surface. Then keep going on the rest of the list.&lt;/p&gt;
&lt;h2 id="quickcheck-cta"&gt;QuickCheck CTA&lt;/h2&gt;
&lt;p&gt;If you&amp;rsquo;d rather not write the inventory queries to find every unencrypted volume across every account and region — and then chase down the missing KMS grants, the public AMIs, and the other quiet posture issues that pile up over time — &lt;strong&gt;&lt;a href="https://richgibbs.dev/quickcheck/"&gt;QuickCheck&lt;/a&gt;&lt;/strong&gt; runs a read-only, one-shot review of your AWS posture and produces a plain-English report. Unencrypted EBS volumes are one of the dozen items it surfaces, alongside open security groups, IMDSv1 stragglers, missing MFA on root, untagged keys, and a few other &amp;ldquo;you&amp;rsquo;d rather know&amp;rdquo; things. See an example in the &lt;a href="https://richgibbs.dev/quickcheck/sample-report/"&gt;sample report&lt;/a&gt;. Not magic, not a replacement for proper cloud security tooling, but a fast way to know where you stand.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="about-tuck-sentinel"&gt;About Tuck Sentinel&lt;/h2&gt;
&lt;p&gt;Tuck Sentinel is the security-focused side of an indie operator workshop by Rich Gibbs. It builds small, sharp tools — like QuickCheck — for founders and small teams who want a competent read of their cloud posture without an enterprise platform. The bias: fast, honest, read-only assessments and migrations you can actually finish.&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-json"&gt;{
  &amp;quot;@context&amp;quot;: &amp;quot;https://schema.org&amp;quot;,
  &amp;quot;@type&amp;quot;: &amp;quot;Article&amp;quot;,
  &amp;quot;headline&amp;quot;: &amp;quot;Encrypting Your EBS Root Volume Without Rebuilding the Server (AWS 2026)&amp;quot;,
  &amp;quot;description&amp;quot;: &amp;quot;A practical, indie-founder guide to migrating an unencrypted EC2 root volume to KMS-encrypted EBS — without rebuilding the instance, losing data, or fighting AZ mismatch and root device name traps.&amp;quot;,
  &amp;quot;author&amp;quot;: {
    &amp;quot;@type&amp;quot;: &amp;quot;Organization&amp;quot;,
    &amp;quot;name&amp;quot;: &amp;quot;Tuck Sentinel&amp;quot;
  },
  &amp;quot;publisher&amp;quot;: {
    &amp;quot;@type&amp;quot;: &amp;quot;Organization&amp;quot;,
    &amp;quot;name&amp;quot;: &amp;quot;Tuck Sentinel&amp;quot;,
    &amp;quot;url&amp;quot;: &amp;quot;https://richgibbs.dev/&amp;quot;
  },
  &amp;quot;mainEntityOfPage&amp;quot;: {
    &amp;quot;@type&amp;quot;: &amp;quot;WebPage&amp;quot;,
    &amp;quot;@id&amp;quot;: &amp;quot;https://blog.richgibbs.dev/encrypting-ebs-root-volume-without-rebuilding/&amp;quot;
  },
  &amp;quot;image&amp;quot;: &amp;quot;https://blog.richgibbs.dev/static/og-default.png&amp;quot;,
  &amp;quot;articleSection&amp;quot;: &amp;quot;Cloud Security&amp;quot;,
  &amp;quot;keywords&amp;quot;: &amp;quot;AWS, EC2, EBS, KMS, encryption, cloud security, snapshot, migration&amp;quot;,
  &amp;quot;about&amp;quot;: [
    { &amp;quot;@type&amp;quot;: &amp;quot;Thing&amp;quot;, &amp;quot;name&amp;quot;: &amp;quot;AWS EBS encryption&amp;quot; },
    { &amp;quot;@type&amp;quot;: &amp;quot;Thing&amp;quot;, &amp;quot;name&amp;quot;: &amp;quot;AWS KMS&amp;quot; },
    { &amp;quot;@type&amp;quot;: &amp;quot;Thing&amp;quot;, &amp;quot;name&amp;quot;: &amp;quot;Cloud Security Posture&amp;quot; }
  ]
}
&lt;/code&gt;&lt;/pre&gt;</content:encoded>
      <guid isPermaLink="true">https://blog.richgibbs.dev/encrypting-ebs-root-volume-without-rebuilding/</guid>
      <category>aws</category>
      <category>ec2</category>
      <category>ebs</category>
      <category>kms</category>
      <category>encryption</category>
      <category>security</category>
      <category>cloud-security</category>
      <pubDate>Thu, 14 May 2026 23:30:00 +0000</pubDate>
    </item>
    <item>
      <title>EC2 read-only hardening audit: what Inspector misses, and what to check by hand (2026)</title>
      <link>https://blog.richgibbs.dev/ec2-read-only-hardening-audit-approach-indie-2026/</link>
      <description>AWS Inspector and IAM Access Analyzer are great at IAM-side and CVE-side findings, and they will quietly miss several of the EC2 instance-level problems most likely to get a small SaaS owned. Here is the read-only EC2 hardening audit a one-person ops team can actually run in an hour.</description>
      <content:encoded>&lt;p&gt;You turned on AWS Inspector, you wired up IAM Access Analyzer, and the consoles are mostly green. Your EC2 fleet is two-to-five instances, the team is one-to-five people, and you&amp;rsquo;d like to believe the AWS-native tooling is enough.&lt;/p&gt;
&lt;p&gt;This post is the honest answer to the question &lt;em&gt;&amp;ldquo;is an Inspector scan the same thing as an EC2 read-only hardening audit?&amp;rdquo;&lt;/em&gt; — no. They overlap on maybe 30 % of the surface that actually gets indie SaaS owned. The other 70 % is instance-level configuration that AWS-native tooling, by design, doesn&amp;rsquo;t look at.&lt;/p&gt;
&lt;p&gt;What follows is the read-only EC2 hardening audit a solo founder or small ops team can actually run in about an hour on a single instance: five categories of checks, each runnable from inside the box with no agent, no third-party SaaS, and no write access. None of it replaces Inspector. It is the &lt;em&gt;other&lt;/em&gt; layer.&lt;/p&gt;
&lt;h2 id="what-aws-inspector-and-access-analyzer-actually-cover"&gt;What AWS Inspector and Access Analyzer actually cover&lt;/h2&gt;
&lt;p&gt;It is worth being precise about what the native tools do well, because the gap is what we&amp;rsquo;re going to walk.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Amazon Inspector&lt;/strong&gt; is a managed vulnerability-management service. It does three things, per the AWS docs: (1) continuously scans EC2 instances for software vulnerabilities and unintended network exposure, (2) scans container images in ECR, and (3) scans Lambda functions and layers. The EC2 scan is essentially CVE-package matching against the OS package inventory plus a network-reachability layer powered by the same engine as VPC Reachability Analyzer. See &lt;a href="https://docs.aws.amazon.com/inspector/latest/user/what-is-inspector.html"&gt;What is Amazon Inspector?&lt;/a&gt; for the canonical scope statement.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;IAM Access Analyzer&lt;/strong&gt; is a policy-and-resource analyzer. Per the &lt;a href="https://docs.aws.amazon.com/IAM/latest/UserGuide/what-is-access-analyzer.html"&gt;Access Analyzer guide&lt;/a&gt;, it identifies resources in your account shared with external principals, validates IAM policies against best practices, and (in the newer &amp;ldquo;unused access&amp;rdquo; findings) flags unused permissions and roles. It does not look at anything &lt;em&gt;inside&lt;/em&gt; an EC2 instance.&lt;/p&gt;
&lt;p&gt;Both are valuable. Neither was designed to answer &amp;ldquo;is &lt;code&gt;sshd&lt;/code&gt; accepting password auth on this box right now?&amp;rdquo;, because that&amp;rsquo;s not a control-plane question — it&amp;rsquo;s an instance-level configuration question, and the AWS control plane has no opinion about the contents of &lt;code&gt;/etc/ssh/sshd_config&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://docs.aws.amazon.com/wellarchitected/latest/security-pillar/welcome.html"&gt;AWS Well-Architected Framework — Security Pillar&lt;/a&gt; is explicit that defense in depth requires both: the &lt;em&gt;protect compute&lt;/em&gt; design principle calls out reducing attack surface, hardening operating systems, and enforcing service-level configuration as distinct from identity and detective controls.&lt;/p&gt;
&lt;p&gt;That instance-level layer is what the rest of this post is about.&lt;/p&gt;
&lt;h2 id="the-five-check-read-only-ec2-hardening-audit"&gt;The five-check read-only EC2 hardening audit&lt;/h2&gt;
&lt;p&gt;Each of the five sections below is something an ordinary SSH session, with &lt;code&gt;sudo&lt;/code&gt; for a few read-only commands, can answer in under ten minutes. Read-only means: no &lt;code&gt;apt install&lt;/code&gt;, no agent, no IAM changes, no Inspector activation toggles. You are just observing.&lt;/p&gt;
&lt;p&gt;If you&amp;rsquo;d rather skip ahead and have somebody else run this against one host, that&amp;rsquo;s what the &lt;a href="https://richgibbs.dev/quickcheck/"&gt;VPS/EC2 Hardening QuickCheck&lt;/a&gt; at the end of this post exists for. Otherwise, keep reading.&lt;/p&gt;
&lt;h3 id="1-imdsv2-enforcement-the-ssrf-gate-inspector-wont-fail-you-on"&gt;1. IMDSv2 enforcement (the SSRF gate Inspector won&amp;rsquo;t fail you on)&lt;/h3&gt;
&lt;p&gt;The Instance Metadata Service version 1 lets any process on the box — including any application code with an SSRF bug — fetch the instance&amp;rsquo;s IAM role credentials over plain HTTP with no token. IMDSv2 requires a session token obtained via &lt;code&gt;PUT&lt;/code&gt;, which an SSRF attacker generally cannot mint.&lt;/p&gt;
&lt;p&gt;AWS publishes the migration story at &lt;a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/configuring-IMDS-existing-instances.html"&gt;Configure the Instance Metadata Service for an existing instance&lt;/a&gt;. The instance attribute you want is &lt;code&gt;HttpTokens=required&lt;/code&gt;. Also set &lt;code&gt;HttpPutResponseHopLimit=1&lt;/code&gt; so that container workloads on the host, which sit one network hop further away, can&amp;rsquo;t use the metadata service as a confused deputy. The full option reference is in &lt;a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/configuring-instance-metadata-service.html"&gt;Use IMDSv2&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Read-only checks from inside the box:&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;# Should 401 without a token (IMDSv2 enforced):
curl -s -o /dev/null -w &amp;quot;%{http_code}\n&amp;quot; \
  http://169.254.169.254/latest/meta-data/

# Should succeed with a token:
TOKEN=$(curl -s -X PUT &amp;quot;http://169.254.169.254/latest/api/token&amp;quot; \
  -H &amp;quot;X-aws-ec2-metadata-token-ttl-seconds: 60&amp;quot;)
curl -s -H &amp;quot;X-aws-ec2-metadata-token: $TOKEN&amp;quot; \
  http://169.254.169.254/latest/meta-data/iam/info
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;If the first call returns &lt;code&gt;200&lt;/code&gt;, IMDSv1 is still accepted on this instance and any SSRF in your app stack is one HTTP request away from your role credentials. Inspector will not fail this for you; it shows up as a &amp;ldquo;finding&amp;rdquo; only if you&amp;rsquo;ve enabled the relevant network-reachability rule package and the instance is publicly reachable on the metadata port (which it never is). The check you want is the local one, above.&lt;/p&gt;
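&lt;p&gt;If you run the tokenless probe across several hosts, it helps to turn the status code into a verdict. A minimal sketch: &lt;code&gt;classify_imds_status&lt;/code&gt; and its verdict labels are hypothetical, and the mapping assumes &lt;code&gt;401&lt;/code&gt; means a token is required, &lt;code&gt;403&lt;/code&gt; means the metadata endpoint is turned off, and &lt;code&gt;000&lt;/code&gt; is what curl prints when it gets no HTTP response at all.&lt;/p&gt;

```shell
#!/usr/bin/env bash
# Hypothetical helper: map the HTTP status of a tokenless IMDS request
# to a short verdict string.
classify_imds_status() {
  case "$1" in
    401) echo "imdsv2-enforced" ;;
    200) echo "imdsv1-accepted" ;;
    403) echo "endpoint-disabled" ;;
    000) echo "unreachable" ;;
    *)   echo "unknown:$1" ;;
  esac
}

# Example (not run here): feed it the probe from the block above.
# classify_imds_status "$(curl -s -o /dev/null -m 2 -w '%{http_code}' \
#   http://169.254.169.254/latest/meta-data/)"
```

&lt;p&gt;Only &lt;code&gt;imdsv1-accepted&lt;/code&gt; is a finding; the other verdicts are either the desired state or a sign you are not on EC2 at all.&lt;/p&gt;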
&lt;h3 id="2-ssh-posture-the-line-sshd_config-actually-applies"&gt;2. SSH posture (the line &lt;code&gt;sshd_config&lt;/code&gt; actually applies)&lt;/h3&gt;
&lt;p&gt;Inspector does not parse &lt;code&gt;sshd_config&lt;/code&gt;. The CIS Benchmarks do; see the &lt;a href="https://www.cisecurity.org/benchmark/amazon_web_services"&gt;CIS Amazon Web Services Foundations Benchmark and the CIS Ubuntu / Amazon Linux benchmarks&lt;/a&gt; for the canonical control list. For a read-only audit you don&amp;rsquo;t need to map every CIS control; you need the handful of settings that account for most real-world EC2 compromises.&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;sudo sshd -T | grep -Ei \
  '^(permitrootlogin|passwordauthentication|pubkeyauthentication|kbdinteractiveauthentication|permitemptypasswords|x11forwarding|clientaliveinterval|clientalivecountmax|maxauthtries|allowusers|allowgroups) '
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The &lt;code&gt;sshd -T&lt;/code&gt; form dumps the &lt;em&gt;effective&lt;/em&gt; configuration after &lt;code&gt;Include&lt;/code&gt; files are resolved, which is the only honest way to audit SSH. Grepping &lt;code&gt;/etc/ssh/sshd_config&lt;/code&gt; by hand will miss the &lt;code&gt;cloud-init&lt;/code&gt; drop-in that re-enables password auth on Ubuntu AMIs. &lt;code&gt;Match&lt;/code&gt; blocks are only evaluated when you supply connection details, so to see what a specific client gets (say, through the &lt;code&gt;Match Address&lt;/code&gt; block a contractor added six months ago), add &lt;code&gt;-C&lt;/code&gt;, for example &lt;code&gt;sudo sshd -T -C user=ubuntu,host=localhost,addr=203.0.113.10&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;What you want to see, roughly:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;permitrootlogin no&lt;/code&gt; (or &lt;code&gt;prohibit-password&lt;/code&gt;, never &lt;code&gt;yes&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;&lt;code&gt;passwordauthentication no&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;kbdinteractiveauthentication no&lt;/code&gt; (the new name for &lt;code&gt;challengeresponseauthentication&lt;/code&gt;; defaults flipped on several distributions in 2024).&lt;/li&gt;
&lt;li&gt;&lt;code&gt;permitemptypasswords no&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;A &lt;code&gt;clientaliveinterval&lt;/code&gt; and &lt;code&gt;clientalivecountmax&lt;/code&gt; pair that actually evicts dead sessions.&lt;/li&gt;
&lt;li&gt;An explicit &lt;code&gt;allowusers&lt;/code&gt; or &lt;code&gt;allowgroups&lt;/code&gt; line so a freshly-created system user can&amp;rsquo;t SSH by default.&lt;/li&gt;
&lt;/ul&gt;
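&lt;p&gt;The expectations in the list above can be checked mechanically instead of by eye. The sketch below is one possible encoding, not a CIS-complete check: &lt;code&gt;audit_sshd&lt;/code&gt; is a hypothetical function that reads &lt;code&gt;sshd -T&lt;/code&gt; output on stdin, and the FAIL/WARN wording is an assumption of this example.&lt;/p&gt;

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `sshd -T` output on stdin and print one line per
# setting that contradicts the expectations listed above. Silent when clean.
audit_sshd() {
  awk '
    $1 == "permitrootlogin"              { if ($2 == "yes") print "FAIL permitrootlogin " $2 }
    $1 == "passwordauthentication"       { if ($2 != "no")  print "FAIL passwordauthentication " $2 }
    $1 == "kbdinteractiveauthentication" { if ($2 != "no")  print "FAIL kbdinteractiveauthentication " $2 }
    $1 == "permitemptypasswords"         { if ($2 != "no")  print "FAIL permitemptypasswords " $2 }
    $1 == "clientaliveinterval"          { if ($2 == 0)     print "FAIL clientaliveinterval 0" }
    $1 == "allowusers" || $1 == "allowgroups" { allow = 1 }
    END { if (!allow) print "WARN no allowusers/allowgroups line" }
  '
}

# Example (not run here): sudo sshd -T | audit_sshd
```

&lt;p&gt;Because the function only reads stdin, the same check works on output saved from another host.&lt;/p&gt;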
&lt;h3 id="3-security-group-surface-from-the-hosts-perspective-not-the-consoles"&gt;3. Security-group surface (from the host&amp;rsquo;s perspective, not the console&amp;rsquo;s)&lt;/h3&gt;
&lt;p&gt;You can absolutely list security groups via the AWS Console. The console will show you the &lt;em&gt;configured&lt;/em&gt; rules. It will not show you which of those open ports a process on the box is actually listening on, which is the surface that matters when somebody is scanning your public IP.&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;sudo ss -tulpenH | awk '{print $1, $5, $7}'
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Cross-reference every listening &lt;code&gt;0.0.0.0:&lt;/code&gt; or &lt;code&gt;[::]:&lt;/code&gt; socket with the SG inbound rules attached to the instance&amp;rsquo;s primary ENI. Three patterns to flag:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;A daemon listening on &lt;code&gt;0.0.0.0&lt;/code&gt; for a service that should be loopback only (Redis on &lt;code&gt;:6379&lt;/code&gt;, Postgres on &lt;code&gt;:5432&lt;/code&gt;, an internal admin endpoint on &lt;code&gt;:9000&lt;/code&gt;). Loopback-bind in the daemon config; do not rely on the SG alone.&lt;/li&gt;
&lt;li&gt;An SG rule open to &lt;code&gt;0.0.0.0/0&lt;/code&gt; for a port no process is currently listening on. That is drift — somebody opened it during an incident and never closed it.&lt;/li&gt;
&lt;li&gt;A daemon listening on a port that &lt;em&gt;is&lt;/em&gt; covered by an open SG rule but should not be (a dev-mode HTTP server on &lt;code&gt;:3000&lt;/code&gt;, a Jupyter on &lt;code&gt;:8888&lt;/code&gt;).&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The Well-Architected Security Pillar&amp;rsquo;s &lt;a href="https://docs.aws.amazon.com/wellarchitected/latest/security-pillar/sec-protect-networks.html"&gt;Protect networks&lt;/a&gt; design principle calls this out as &amp;ldquo;minimize the attack surface&amp;rdquo; — and the read-only version is just &lt;code&gt;ss&lt;/code&gt; plus the SG rule list.&lt;/p&gt;
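&lt;p&gt;To make the cross-reference less tedious, the wildcard listeners can be reduced to a short protocol/port list first. This is a sketch that assumes the column layout of &lt;code&gt;ss -tulnH&lt;/code&gt; (local address in the fifth column); &lt;code&gt;wildcard_listeners&lt;/code&gt; is a hypothetical name, and interface-scoped addresses are deliberately left out.&lt;/p&gt;

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `ss -tulnH` output on stdin and print the unique
# protocol/port pairs bound to every address (0.0.0.0, [::] or *).
wildcard_listeners() {
  awk '$5 ~ /^(0\.0\.0\.0|\[::\]|\*):[0-9]+$/ {
         n = split($5, parts, ":")   # the port is the last colon-separated piece
         print $1, parts[n]
       }' | sort -u
}

# Example (not run here): sudo ss -tulnH | wildcard_listeners
```

&lt;p&gt;Each line of output is a port that should either have a deliberate security-group rule or be rebound to loopback.&lt;/p&gt;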
&lt;h3 id="4-patch-state-and-unattended-upgrades-reality-check"&gt;4. Patch state and unattended-upgrades reality check&lt;/h3&gt;
&lt;p&gt;Inspector will tell you which packages have known CVEs. It will not tell you whether the box is actually applying security updates on a schedule, which is the difference between &amp;ldquo;we patched on Monday&amp;rdquo; and &amp;ldquo;we have not rebooted since the 2024 OpenSSL CVE.&amp;rdquo;&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;# When did the kernel last reboot into the running version?
uptime -s
uname -r

# Is unattended-upgrades actually installed and running? (Debian/Ubuntu)
systemctl is-enabled unattended-upgrades 2&amp;gt;/dev/null
systemctl is-active  unattended-upgrades 2&amp;gt;/dev/null
# The service does not schedule runs by itself; these apt timers do:
systemctl is-enabled apt-daily.timer apt-daily-upgrade.timer 2&amp;gt;/dev/null
grep -E '^(APT::Periodic::Update-Package-Lists|APT::Periodic::Unattended-Upgrade)' \
  /etc/apt/apt.conf.d/20auto-upgrades 2&amp;gt;/dev/null

# Amazon Linux 2023 / RHEL family:
systemctl is-enabled dnf-automatic.timer 2&amp;gt;/dev/null
systemctl is-active  dnf-automatic.timer 2&amp;gt;/dev/null
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;An instance whose last boot was 11 months ago and whose &lt;code&gt;unattended-upgrades&lt;/code&gt; was disabled the day the founder hit a noisy &lt;code&gt;apt upgrade&lt;/code&gt; is the most common &amp;ldquo;we never got owned because we got lucky&amp;rdquo; setup in indie SaaS. The fix is two lines of config. The audit step is just running the commands above.&lt;/p&gt;
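&lt;p&gt;Turning the boot timestamp into a number makes the &amp;ldquo;11 months&amp;rdquo; case hard to overlook. A minimal sketch, assuming GNU &lt;code&gt;date&lt;/code&gt; (the default on Ubuntu and Debian); &lt;code&gt;days_between&lt;/code&gt; and &lt;code&gt;reboot_verdict&lt;/code&gt; are hypothetical helpers, and the 30-day default threshold is an arbitrary choice for this example, not a recommendation from any benchmark.&lt;/p&gt;

```shell
#!/usr/bin/env bash
# Hypothetical helper: whole days between two timestamps (GNU date syntax).
days_between() {
  local start end
  start=$(date -u -d "$1" +%s) || return 1
  end=$(date -u -d "$2" +%s) || return 1
  echo $(( (end - start) / 86400 ))
}

# Hypothetical helper: flag an uptime above a threshold in days (default 30).
reboot_verdict() {
  if [ "$1" -gt "${2:-30}" ]; then
    echo "STALE: up $1 days"
  else
    echo "ok: up $1 days"
  fi
}

# Example (not run here):
# reboot_verdict "$(days_between "$(uptime -s)" "$(date '+%Y-%m-%d %H:%M:%S')")"
# On Debian/Ubuntu, the file /var/run/reboot-required also signals a pending reboot.
```

&lt;p&gt;A long uptime is not a finding by itself; it becomes one when kernel or libc updates have been installed since the last boot.&lt;/p&gt;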
&lt;h3 id="5-docker-socket-exposure"&gt;5. Docker socket exposure&lt;/h3&gt;
&lt;p&gt;If the box runs Docker, the single most consequential misconfiguration is exposing the Docker daemon socket — either by mounting &lt;code&gt;/var/run/docker.sock&lt;/code&gt; into a container that handles untrusted input, or by listening on a TCP port without TLS. &lt;a href="https://docs.docker.com/engine/security/"&gt;Docker&amp;rsquo;s own security documentation&lt;/a&gt; calls this out as effectively root on the host. The &lt;a href="https://www.cisecurity.org/benchmark/docker"&gt;CIS Docker Benchmark&lt;/a&gt; §2 covers the daemon-configuration controls in detail.&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;# Is the daemon listening on TCP anywhere?
sudo ss -tlpn | grep -E 'docker|2375|2376'

# Which running containers have the socket mounted?
# Matches /var/run/docker.sock and /run/docker.sock (the former is usually a symlink).
docker ps --format '{{.ID}} {{.Image}} {{.Names}}' 2&amp;gt;/dev/null | \
  while read -r id image name; do
    if docker inspect --format '{{range .Mounts}}{{.Source}} {{end}}' &amp;quot;$id&amp;quot; 2&amp;gt;/dev/null | \
         grep -q 'docker\.sock'; then
      echo &amp;quot;SOCKET MOUNT: $name ($image)&amp;quot;
    fi
  done
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;A container that mounts the socket can start a new container that mounts the host&amp;rsquo;s &lt;code&gt;/&lt;/code&gt; and chroot into it. That is not a CVE — Inspector will never flag it — but it is an instant root-equivalent path the moment that container has a code-execution bug.&lt;/p&gt;
&lt;h2 id="what-this-audit-deliberately-does-not-do"&gt;What this audit deliberately does not do&lt;/h2&gt;
&lt;p&gt;A read-only EC2 hardening audit is one layer. It is not:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;A penetration test.&lt;/strong&gt; Nothing here exploits anything. We are reading config and listening sockets, not attacking the application. The audit-vs-pentest framing for indie founders is in &lt;a href="/security-audit-vs-penetration-test-indie-founder-2026/"&gt;a separate post&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;A CVE scan.&lt;/strong&gt; Inspector is the right tool for that, and you should leave it on. The point of the post is the &lt;em&gt;other&lt;/em&gt; layer, not a replacement for the package-vulnerability layer.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;A compliance attestation.&lt;/strong&gt; SOC 2 / HIPAA / ISO 27001 require documented policies, evidence collection, vendor management, and access reviews — none of which this audit produces. A hardened instance is a &lt;em&gt;part&lt;/em&gt; of those, not a substitute.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;A continuous control.&lt;/strong&gt; This is a point-in-time read. The continuous version is your patch cadence, your unattended-upgrades, your SG-drift detection, and your IMDSv2-required default in the launch template.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The CIS AWS Foundations Benchmark and the AWS Well-Architected Security Pillar both treat instance-level hardening and IAM-level controls as distinct layers, and there is a reason for that: they fail differently, and one rarely catches the other.&lt;/p&gt;
&lt;h2 id="when-the-read-only-audit-isnt-enough"&gt;When the read-only audit isn&amp;rsquo;t enough&lt;/h2&gt;
&lt;p&gt;If you&amp;rsquo;ve run the five checks above on one instance and it lit up in two or more sections, the bottleneck is usually not the audit — it is the time to do the remediation. At that point you have three options:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;DIY the fixes in a maintenance window.&lt;/strong&gt; Most of the items above are 10–30-minute changes. The &lt;a href="/ubuntu-debian-ec2-hardening-checklist-2026/"&gt;Ubuntu / Debian EC2 hardening checklist&lt;/a&gt; is the matching to-do list.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Hire someone to run the audit on more than one instance.&lt;/strong&gt; A multi-host audit is mostly cross-referencing the same five categories with the launch templates and AMI lineage that produced them.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Stop here and accept the residual risk.&lt;/strong&gt; Sometimes the right answer.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;There is no fourth option where AWS-native tooling alone closes this gap. Inspector, Access Analyzer, GuardDuty, and Security Hub are all great at what they&amp;rsquo;re designed to do; none of them parse &lt;code&gt;sshd_config&lt;/code&gt;, none of them check whether the metadata service still accepts unauthenticated requests, and none of them tell you that a container is mounting the Docker socket.&lt;/p&gt;
&lt;h2 id="further-reading-on-this-site"&gt;Further reading on this site&lt;/h2&gt;
&lt;p&gt;Two pieces in the same cluster, both prerequisites or natural follow-ons to this audit:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="/ubuntu-debian-ec2-hardening-checklist-2026/"&gt;Ubuntu / Debian EC2 hardening checklist (2026) →&lt;/a&gt; — the longer remediation companion to this audit. If something here failed, the fix list is there.&lt;/li&gt;
&lt;li&gt;&lt;a href="/security-audit-vs-penetration-test-indie-founder-2026/"&gt;Security audit vs penetration test: which one does an indie founder actually need? →&lt;/a&gt; — the broader framing for what an EC2 read-only hardening audit is, and what it deliberately is not.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h3 id="related-read-only-audit-paid"&gt;Related read-only audit (paid)&lt;/h3&gt;
&lt;p&gt;If you&amp;rsquo;ve run the five checks above and would rather hand a single instance to somebody else for a written, prioritized fix list — IMDSv2 enforcement, SSH posture, SG surface, patch cadence, Docker exposure, plus the five other categories the QuickCheck covers — that is exactly the &lt;a href="https://richgibbs.dev/quickcheck/"&gt;VPS/EC2 Hardening QuickCheck&lt;/a&gt; we offer. $149, one host, read-only, no agents installed, 24-hour turnaround. No managed retainers, no exploitation, no surprise add-ons. Multi-host setups upgrade to the $249 tier; everything stays read-only.&lt;/p&gt;
&lt;p&gt;If you&amp;rsquo;d rather DIY the externally observable subset first, the free &lt;a href="https://blog.richgibbs.dev/quickcheck-mini/"&gt;QuickCheck Mini&lt;/a&gt; script runs the DNS / ports / TLS / headers / public-IMDS checks against your own host in about a minute.&lt;/p&gt;
&lt;p&gt;14-day refund on the paid QuickCheck, no questions asked.&lt;/p&gt;</content:encoded>
      <guid isPermaLink="true">https://blog.richgibbs.dev/ec2-read-only-hardening-audit-approach-indie-2026/</guid>
      <category>aws</category>
      <category>ec2</category>
      <category>hardening</category>
      <category>audit</category>
      <category>inspector</category>
      <category>imdsv2</category>
      <category>indie-founder</category>
      <category>saas</category>
      <pubDate>Sat, 16 May 2026 14:05:00 +0000</pubDate>
    </item>
    <item>
      <title>AI/API bill jumped? Find the token burn before it eats the month</title>
      <link>https://blog.richgibbs.dev/ai-api-cost-rescue-quickcheck/</link>
      <description>A practical checklist for founders running agents, internal AI tools, or automation hosts: stuck jobs, expensive model defaults, fallback loops, cache misses, and missing budget controls.</description>
      <content:encoded>&lt;p&gt;AI bills usually do not explode because the model suddenly got smarter. They explode because something operational and boring broke.&lt;/p&gt;
&lt;p&gt;A cron job starts failing and retries forever. A background agent keeps using a frontier model for work that should be a cheap classifier or no model at all. A fallback path silently routes to a paid provider after the cheap provider hits quota. A browser automation loop keeps resubmitting the same task. The overall prompt-cache hit rate looks high, but one uncached workflow still burns the month.&lt;/p&gt;
&lt;p&gt;That is the pattern I look for in an AI/API cost rescue pass.&lt;/p&gt;
&lt;h2 id="the-first-hour-checklist"&gt;The first-hour checklist&lt;/h2&gt;
&lt;p&gt;Start with the live account dashboard, but do not stop there. A dashboard tells you that money moved. It rarely tells you which boring system behavior caused it.&lt;/p&gt;
&lt;p&gt;The fastest useful pass is:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;List every recurring job, cron, queue worker, background agent, and scheduled task.&lt;/li&gt;
&lt;li&gt;Mark anything that has failed more than once in a row.&lt;/li&gt;
&lt;li&gt;Mark anything using a frontier model by default.&lt;/li&gt;
&lt;li&gt;Mark anything with automatic fallback to another paid model.&lt;/li&gt;
&lt;li&gt;Check whether failed fallback runs are still billed even when they produce no user-visible output.&lt;/li&gt;
&lt;li&gt;Check cache hit rate and the workflows that miss cache.&lt;/li&gt;
&lt;li&gt;Check which calls are interactive and which are unattended background work.&lt;/li&gt;
&lt;li&gt;Disable non-revenue background jobs until there is a reason to re-enable them.&lt;/li&gt;
&lt;li&gt;Put a daily dollar alert below the panic threshold, not above it.&lt;/li&gt;
&lt;li&gt;Write down the kill switch before the next incident.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This is not elegant. It works.&lt;/p&gt;
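&lt;p&gt;Once the job list from step 1 exists, the marking steps in the checklist above can be run as a small script. The sketch below is a minimal illustration under assumptions, not a finished tool: the &lt;code&gt;Job&lt;/code&gt; fields, the &lt;code&gt;FRONTIER_MODELS&lt;/code&gt; set, and every job and model name are hypothetical placeholders to be filled in from your own inventory.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    model: str               # model the job calls by default
    recent_ok: list[bool]    # run results, oldest first (True = success)
    has_paid_fallback: bool  # True if a failure routes to another paid model
    unattended: bool         # True for cron, queue, or background work


# Hypothetical tier list: replace with the model names you treat as "frontier".
FRONTIER_MODELS = {"big-reasoning-model"}


def triage(job: Job) -> list[str]:
    """Return the checklist flags that apply to one job."""
    flags = []
    # The last two runs both failing means "failed more than once in a row".
    if job.recent_ok[-2:] == [False, False]:
        flags.append("failed more than once in a row")
    if job.model in FRONTIER_MODELS:
        flags.append("frontier model by default")
    if job.has_paid_fallback:
        flags.append("automatic paid fallback")
    if job.unattended:
        flags.append("unattended background work")
    return flags


jobs = [
    Job("nightly-digest", "big-reasoning-model", [True, False, False], True, True),
    Job("support-chat", "small-model", [True, True, True], False, False),
]

for job in jobs:
    print(job.name, triage(job))
```

&lt;p&gt;A job that collects several flags at once is probably the first one worth pausing while you investigate.&lt;/p&gt;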
&lt;h2 id="common-spend-leaks"&gt;Common spend leaks&lt;/h2&gt;
&lt;p&gt;The most common leaks are not exotic:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Failed scheduled jobs.&lt;/strong&gt; A job that fails every 30 minutes can still create model calls, fallback attempts, summaries, traces, or notifications.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Wrong model defaults.&lt;/strong&gt; High-reasoning models are useful for hard work. They are wasteful for health checks, digests, log summaries, and polling.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Fallback cascades.&lt;/strong&gt; Cheap provider fails, expensive provider wakes up, then another fallback wakes up after that.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Retry loops.&lt;/strong&gt; Browser automation, queue workers, and agent sessions often retry the full prompt instead of a small recovery step.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;No budget boundary between product and ops.&lt;/strong&gt; Customer-facing work, internal experiments, monitoring, and background housekeeping all hit the same billing account.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Helpful automation with no revenue path.&lt;/strong&gt; Daily reports, market scans, and content queues feel productive while they quietly spend money.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The fix is usually less glamorous than the diagnosis: pause jobs, lower model tiers, cap retries, separate production and experiments, and make expensive paths explicit instead of automatic.&lt;/p&gt;
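&lt;p&gt;Capping retries, one of the fixes named above, can be as small as a wrapper function. This is a sketch under assumptions rather than a definitive implementation: &lt;code&gt;call_with_cap&lt;/code&gt;, &lt;code&gt;RetryBudgetExceeded&lt;/code&gt;, and the default of three attempts are illustrative names and values, not part of any provider SDK.&lt;/p&gt;

```python
import time


class RetryBudgetExceeded(Exception):
    """Raised when a call has used up its retry allowance."""


def call_with_cap(call, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Run call() up to max_attempts times, then stop instead of looping forever."""
    last_error = None
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception as exc:  # in real code, catch the provider's specific errors
            last_error = exc
            if attempt + 1 != max_attempts:
                # Exponential backoff between tries: 1s, 2s, 4s, ...
                sleep(base_delay * 2 ** attempt)
    # Fail loudly so the caller decides whether an expensive fallback is worth it.
    raise RetryBudgetExceeded(f"gave up after {max_attempts} attempts") from last_error
```

&lt;p&gt;Raising an error instead of quietly falling back is the design choice that matters here: the expensive path becomes an explicit decision by the caller, not an automatic one.&lt;/p&gt;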
&lt;h2 id="what-to-send-for-a-review"&gt;What to send for a review&lt;/h2&gt;
&lt;p&gt;Send redacted evidence only:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Billing screenshots with account identifiers hidden.&lt;/li&gt;
&lt;li&gt;Provider usage by day and model.&lt;/li&gt;
&lt;li&gt;Cron/task/job list.&lt;/li&gt;
&lt;li&gt;Model routing config with secrets removed.&lt;/li&gt;
&lt;li&gt;Recent failure summaries with tokens, keys, emails, customer records, and private logs removed.&lt;/li&gt;
&lt;li&gt;A short note explaining what changed before the bill jumped.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Do not send API keys, tokens, SSH private keys, .env files, customer records, raw private logs, payment details, or regulated personal data.&lt;/p&gt;
&lt;h2 id="fixed-scope-offer"&gt;Fixed-scope offer&lt;/h2&gt;
&lt;style&gt;
.first-aid-cta a.button:focus-visible {
  outline: 3px solid currentColor;
  outline-offset: 3px;
  box-shadow: 0 0 0 6px rgba(11, 110, 253, .25);
}
@media (max-width: 480px) {
  .first-aid-cta a.button { display: block; text-align: center; width: 100%; }
}
&lt;/style&gt;

&lt;div class="cta first-aid-cta"&gt;
&lt;h3&gt;AI/API bill jumped?&lt;/h3&gt;
&lt;p&gt;Get a same-day $9 first-aid triage. Send redacted billing screenshots, usage by day, routing/config notes with secrets removed, and what changed. I will return a 1-page kill list: likely spend leak, first things to pause, cheaper routing/caching/batching checks, and whether the full $499 rescue is warranted.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Redacted evidence only.&lt;/strong&gt; No API keys, tokens, SSH keys, .env files, customer records, raw private logs, payment details, or regulated data.&lt;/p&gt;
&lt;p&gt;&lt;a class="button" href="https://buy.stripe.com/00w00lakYc78a41gFV5ZC07" target="_blank" rel="noopener noreferrer"&gt;Buy $9 triage&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://buy.stripe.com/9B600l2Sw1sua41ahx5ZC06"&gt;Need deeper help? See the $499 AI/API Cost Rescue.&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Same-day review depends on receiving enough redacted evidence. This is first-aid triage, not guaranteed cost recovery.&lt;/small&gt;&lt;/p&gt;
&lt;/div&gt;

&lt;p&gt;Permissions first, then prompts. If the agent stack is connected to GitHub, Gmail, Slack, Stripe, or AWS and you have never written down what it is allowed to do, do that before optimizing tokens. The free &lt;a href="/agent-permission-map-before-real-tool-access/"&gt;agent permission map checklist&lt;/a&gt; is the one-page version: account, verbs, spend, approvals, logs, kill switches.&lt;/p&gt;
&lt;p&gt;I am offering a fixed-scope &lt;strong&gt;AI/API Cost Rescue QuickCheck&lt;/strong&gt; for founders running agents, internal AI tools, or automation hosts.&lt;/p&gt;
&lt;p&gt;Price: &lt;strong&gt;$499&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;You get a written report within 24 hours after complete redacted intake:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Spend leak map: what appears to be burning money and why.&lt;/li&gt;
&lt;li&gt;Kill list: what to pause first.&lt;/li&gt;
&lt;li&gt;Model routing plan: what should use cheaper models, caching, batching, or no model.&lt;/li&gt;
&lt;li&gt;Budget guardrails: caps, alerts, and daily checks.&lt;/li&gt;
&lt;li&gt;One async clarification pass within 7 days.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is advisory and fixed-scope. Implementation can be quoted separately if needed.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://buy.stripe.com/9B600l2Sw1sua41ahx5ZC06"&gt;Buy the AI/API Cost Rescue QuickCheck&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;If you are not ready to buy, still do the first-hour checklist above. The important thing is to stop unattended spend before optimizing anything else.&lt;/p&gt;</content:encoded>
      <guid isPermaLink="true">https://blog.richgibbs.dev/ai-api-cost-rescue-quickcheck/</guid>
      <category>ai</category>
      <category>api-costs</category>
      <category>agents</category>
      <category>openai</category>
      <category>anthropic</category>
      <category>cost-control</category>
      <category>indie-founder</category>
      <pubDate>Thu, 21 May 2026 22:35:00 +0000</pubDate>
    </item>
    <item>
      <title>Redacted evidence beats account access: how to get a useful QuickCheck without handing over credentials</title>
      <link>https://blog.richgibbs.dev/redacted-evidence-without-account-access/</link>
      <description>A practical guide for founders who need AI/API cost help, email DNS review, inbox cleanup, or server hygiene advice without sending passwords, API keys, SSH keys, mailbox access, or customer data.</description>
      <content:encoded>&lt;p&gt;The default small-business support pattern is backwards.&lt;/p&gt;
&lt;p&gt;Someone asks for help. The helper asks for login access. The founder sends a password, an API key, an SSH key, a mailbox invite, or a pile of raw logs. Everyone moves faster for ten minutes, and the security risk gets worse for months.&lt;/p&gt;
&lt;p&gt;For a fixed-scope review, that is usually unnecessary.&lt;/p&gt;
&lt;p&gt;Most useful early answers do not require account custody. They require clean evidence: screenshots with identifiers hidden, DNS records, job lists, command output, billing summaries, and a short explanation of what changed.&lt;/p&gt;
&lt;p&gt;That is the boundary QuickCheck is built around.&lt;/p&gt;
&lt;h2 id="the-point-is-diagnosis-not-ownership"&gt;The point is diagnosis, not ownership&lt;/h2&gt;
&lt;p&gt;A first-pass review should answer a narrow question:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Why did this AI/API bill jump?&lt;/li&gt;
&lt;li&gt;Why does this domain fail email authentication?&lt;/li&gt;
&lt;li&gt;Why is this inbox impossible to work in?&lt;/li&gt;
&lt;li&gt;What obvious server hygiene problems should be fixed first?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Those questions do not usually require control of the account. They require enough context to separate likely causes from noise.&lt;/p&gt;
&lt;p&gt;There is a big difference between:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&amp;ldquo;Here is a redacted screenshot of usage by model and day.&amp;rdquo;&lt;/li&gt;
&lt;li&gt;&amp;ldquo;Here is my OpenAI API key.&amp;rdquo;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The first one helps. The second one creates a new incident.&lt;/p&gt;
&lt;p&gt;OpenAI&amp;rsquo;s API key safety guidance is blunt on this: do not expose keys in client-side environments, do not commit keys to repositories, use environment variables or a key management service, monitor usage, rotate keys when needed, and use IP allowlisting where appropriate.&lt;/p&gt;
&lt;p&gt;The same principle applies outside OpenAI. A credential is not evidence. A credential is authority.&lt;/p&gt;
&lt;h2 id="what-never-to-send"&gt;What never to send&lt;/h2&gt;
&lt;p&gt;Do not send:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;API keys.&lt;/li&gt;
&lt;li&gt;SSH private keys.&lt;/li&gt;
&lt;li&gt;Passwords.&lt;/li&gt;
&lt;li&gt;OAuth refresh tokens.&lt;/li&gt;
&lt;li&gt;Environment files.&lt;/li&gt;
&lt;li&gt;Full customer records.&lt;/li&gt;
&lt;li&gt;Raw mailbox exports.&lt;/li&gt;
&lt;li&gt;Payment details.&lt;/li&gt;
&lt;li&gt;Regulated personal data.&lt;/li&gt;
&lt;li&gt;Unredacted private logs.&lt;/li&gt;
&lt;li&gt;Screenshots that expose account IDs, billing addresses, or unrelated customer names.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If the review cannot be completed without one of those, the scope is probably no longer &amp;ldquo;QuickCheck.&amp;rdquo; It is implementation, incident response, migration, or hands-on administration.&lt;/p&gt;
&lt;p&gt;That can be valid work. It should be separately scoped, authorized, and handled with a different risk model.&lt;/p&gt;
&lt;h2 id="what-good-redacted-evidence-looks-like"&gt;What good redacted evidence looks like&lt;/h2&gt;
&lt;p&gt;Good evidence has three traits:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;It shows the shape of the problem.&lt;/li&gt;
&lt;li&gt;It hides secrets and unrelated people.&lt;/li&gt;
&lt;li&gt;It is specific enough to produce a written recommendation.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;For screenshots, redact:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;API keys and tokens.&lt;/li&gt;
&lt;li&gt;Account IDs.&lt;/li&gt;
&lt;li&gt;Email addresses unless needed for the finding.&lt;/li&gt;
&lt;li&gt;Customer names.&lt;/li&gt;
&lt;li&gt;Billing address and card details.&lt;/li&gt;
&lt;li&gt;Internal hostnames if they are not relevant.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Do not blur the numbers that matter. If the question is cost, keep the spend, dates, model names, token counts, request counts, and usage trend visible.&lt;/p&gt;
&lt;p&gt;Do not redact the thing being reviewed. If the question is DMARC alignment, the domain and DNS records matter. If the question is server hygiene, the open ports and service names matter.&lt;/p&gt;
&lt;h2 id="aiapi-cost-review-what-to-send"&gt;AI/API cost review: what to send&lt;/h2&gt;
&lt;p&gt;For an AI/API cost review, send redacted evidence like:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Provider usage by day.&lt;/li&gt;
&lt;li&gt;Usage by model.&lt;/li&gt;
&lt;li&gt;Spend by project, workspace, or API key label.&lt;/li&gt;
&lt;li&gt;Recent billing screenshot with account identifiers hidden.&lt;/li&gt;
&lt;li&gt;Cron, job, queue, or agent task list.&lt;/li&gt;
&lt;li&gt;Model routing summary.&lt;/li&gt;
&lt;li&gt;Retry and fallback rules.&lt;/li&gt;
&lt;li&gt;Recent failure summary with secrets removed.&lt;/li&gt;
&lt;li&gt;What changed before the bill moved.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The common mistake is sending only the invoice. The invoice proves money moved; it rarely shows the cause.&lt;/p&gt;
&lt;p&gt;The useful evidence is the operational layer around the invoice: jobs, routes, retries, fallbacks, model defaults, cache misses, and unattended work.&lt;/p&gt;
&lt;p&gt;OpenAI&amp;rsquo;s production guidance calls out project separation, billing limits, key tracking, and staging projects. That is exactly the kind of structure that makes a spend spike diagnosable without exposing secrets.&lt;/p&gt;
&lt;h2 id="email-dns-review-what-to-send"&gt;Email DNS review: what to send&lt;/h2&gt;
&lt;p&gt;For an Inbox/DNS QuickCheck, send:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Domain name.&lt;/li&gt;
&lt;li&gt;Current SPF record.&lt;/li&gt;
&lt;li&gt;Current DKIM selector records, if known.&lt;/li&gt;
&lt;li&gt;Current DMARC record.&lt;/li&gt;
&lt;li&gt;MX records.&lt;/li&gt;
&lt;li&gt;Sending services used by the domain.&lt;/li&gt;
&lt;li&gt;A redacted authentication result from a failed message, if available.&lt;/li&gt;
&lt;li&gt;Whether the problem is password resets, receipts, newsletters, support replies, or all mail.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Do not send mailbox access. Do not forward private customer threads unless the headers are the only way to prove the issue and the body is removed.&lt;/p&gt;
&lt;p&gt;Most email DNS problems are visible from public DNS and message headers. The mailbox contents are usually irrelevant.&lt;/p&gt;
&lt;h2 id="inbox-cleanup-review-what-to-send"&gt;Inbox cleanup review: what to send&lt;/h2&gt;
&lt;p&gt;For inbox cleanup, the safe path is survey first, delete second.&lt;/p&gt;
&lt;p&gt;Send:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Approximate unread count.&lt;/li&gt;
&lt;li&gt;Approximate total count.&lt;/li&gt;
&lt;li&gt;Storage/quota screenshot with personal details hidden.&lt;/li&gt;
&lt;li&gt;Top sender/category counts from a read-only survey.&lt;/li&gt;
&lt;li&gt;The kinds of messages you are comfortable deleting.&lt;/li&gt;
&lt;li&gt;The kinds of messages that must never be touched.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Do not hand a third-party app full mailbox OAuth just to find out that newsletters, old notifications, and automated receipts are the bulk of the mess.&lt;/p&gt;
&lt;p&gt;The useful first answer is a cleanup plan: which senders to archive, which queries to test, what to delete only after a review window, and what filters should prevent the pile from coming back.&lt;/p&gt;
&lt;h2 id="server-hygiene-review-what-to-send"&gt;Server hygiene review: what to send&lt;/h2&gt;
&lt;p&gt;For a small VPS or EC2 hygiene review, send customer-run read-only output, not credentials:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;OS and version.&lt;/li&gt;
&lt;li&gt;Listening ports.&lt;/li&gt;
&lt;li&gt;Firewall status.&lt;/li&gt;
&lt;li&gt;SSH effective configuration.&lt;/li&gt;
&lt;li&gt;Running services.&lt;/li&gt;
&lt;li&gt;Update status.&lt;/li&gt;
&lt;li&gt;Disk usage.&lt;/li&gt;
&lt;li&gt;Web server vhost summary, if relevant.&lt;/li&gt;
&lt;li&gt;Backup/snapshot status, if known.&lt;/li&gt;
&lt;li&gt;Any public URL that should be checked.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Do not send SSH keys. Do not send cloud console credentials. Do not send a root password.&lt;/p&gt;
&lt;p&gt;For a fixed-scope advisory pass, a read-only collector or manually copied command output is enough to identify the usual issues: password SSH, root login, missing firewall, stale packages, exposed admin ports, weak headers, no backups, no fail2ban, or public files that should not be public.&lt;/p&gt;
&lt;h2 id="when-access-actually-is-needed"&gt;When access actually is needed&lt;/h2&gt;
&lt;p&gt;Sometimes access is the right answer.&lt;/p&gt;
&lt;p&gt;Implementation needs access. Incident response may need access. A migration may need access. A production fix during an outage may need access.&lt;/p&gt;
&lt;p&gt;But that should be a deliberate second step, not the default first ask.&lt;/p&gt;
&lt;p&gt;A good review should end with one of three outcomes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&amp;ldquo;Here is the fix list; you can do it yourself.&amp;rdquo;&lt;/li&gt;
&lt;li&gt;&amp;ldquo;Here is the fix list; I can implement it if you approve a separate scope.&amp;rdquo;&lt;/li&gt;
&lt;li&gt;&amp;ldquo;This is riskier than a QuickCheck; stop and handle it as incident response.&amp;rdquo;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That boundary protects both sides.&lt;/p&gt;
&lt;h2 id="a-simple-redaction-pass-before-you-send-anything"&gt;A simple redaction pass before you send anything&lt;/h2&gt;
&lt;p&gt;Before sending evidence:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Put the files in one folder.&lt;/li&gt;
&lt;li&gt;Rename screenshots by topic, not by account name.&lt;/li&gt;
&lt;li&gt;Redact keys, tokens, account IDs, addresses, unrelated people, and payment details.&lt;/li&gt;
&lt;li&gt;Leave the problem data visible.&lt;/li&gt;
&lt;li&gt;Add a short note: what broke, when it started, what changed, and what outcome you want.&lt;/li&gt;
&lt;li&gt;Re-open every screenshot once after redaction and look for missed secrets.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;If a screenshot contains both a secret and a useful number, crop it or cover only the secret. Do not hide the whole panel and expect a useful diagnosis.&lt;/p&gt;
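&lt;p&gt;For text evidence such as command output or failure summaries, the same cover-only-the-secret rule can be approximated with a few regular expressions. The sketch below is illustrative only: the patterns (an &lt;code&gt;sk-&lt;/code&gt; style API key, an &lt;code&gt;AKIA&lt;/code&gt; style AWS access key ID, and email addresses) and the sample log line are assumptions, and a pattern pass does not replace the manual re-check in step 6.&lt;/p&gt;

```python
import re

# Illustrative patterns only; extend them for the providers you actually use.
PATTERNS = [
    (re.compile(r"sk-[A-Za-z0-9_-]{16,}"), "[REDACTED_API_KEY]"),
    (re.compile(r"AKIA[0-9A-Z]{16}"), "[REDACTED_AWS_KEY_ID]"),
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), "[REDACTED_EMAIL]"),
]


def redact(text: str) -> str:
    """Replace secret-looking substrings and leave everything else visible."""
    for pattern, label in PATTERNS:
        text = pattern.sub(label, text)
    return text


# Hypothetical log line: the date, model, and token count are the useful part.
sample = (
    "2026-05-20 model=small-model tokens=18234 "
    "key=sk-abcdefghijklmnop1234 user=jo@example.com"
)
print(redact(sample))
```

&lt;p&gt;The date, model name, and token count survive, which is the point: the numbers that explain the problem stay readable while the credential and the email address do not.&lt;/p&gt;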
&lt;h2 id="a-worked-example-ai-agents-touching-real-tools"&gt;A worked example: AI agents touching real tools&lt;/h2&gt;
&lt;p&gt;The same rule applies when you are reviewing an AI agent or workflow that has been wired into GitHub, Gmail, Slack, Stripe, or AWS. The reviewer does not need your tokens or your admin console — they need a written description of which account the agent runs as, what verbs it can perform, and what currently requires human approval. The &lt;a href="/agent-permission-map-before-real-tool-access/"&gt;agent permission map checklist&lt;/a&gt; is the redacted-evidence version of that conversation: nine columns, one row per connected tool, no secrets attached.&lt;/p&gt;
&lt;h2 id="the-working-rule"&gt;The working rule&lt;/h2&gt;
&lt;p&gt;If someone needs evidence, send evidence.&lt;/p&gt;
&lt;p&gt;If someone needs authority, stop and scope that separately.&lt;/p&gt;
&lt;p&gt;That one distinction prevents a lot of unnecessary risk.&lt;/p&gt;
&lt;div class="cta"&gt;
&lt;h3&gt;Need a small review without handing over credentials?&lt;/h3&gt;
&lt;p&gt;QuickCheck is fixed-scope advisory work built around redacted evidence: AI/API cost triage, Inbox/DNS checks, inbox cleanup planning, and one-host server hygiene reviews.&lt;/p&gt;
&lt;p&gt;&lt;a class="button" href="https://richgibbs.dev/quickcheck/"&gt;See QuickCheck options&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Redacted evidence only. No API keys, tokens, SSH keys, passwords, mailbox custody, raw private logs, payment details, or regulated personal data.&lt;/small&gt;&lt;/p&gt;
&lt;/div&gt;

&lt;h2 id="sources-worth-keeping-open"&gt;Sources worth keeping open&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://help.openai.com/en/articles/5112595-best-practices-for-api-key-safety"&gt;OpenAI API key safety&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://platform.openai.com/docs/guides/production-best-practices"&gt;OpenAI production best practices&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</content:encoded>
      <guid isPermaLink="true">https://blog.richgibbs.dev/redacted-evidence-without-account-access/</guid>
      <category>privacy</category>
      <category>quickcheck</category>
      <category>security</category>
      <category>ai</category>
      <category>email-dns</category>
      <category>server-hygiene</category>
      <category>indie-founder</category>
      <pubDate>Fri, 22 May 2026 22:06:00 +0000</pubDate>
    </item>
    <item>
      <title>Before an AI agent gets real tool access, map what it can actually do</title>
      <link>https://blog.richgibbs.dev/agent-permission-map-before-real-tool-access/</link>
      <description>A permission-map checklist for AI agents touching GitHub, Gmail, Slack, Stripe, AWS, or MCP: account, verbs, spend, approvals, logs, kill switches.</description>
      <content:encoded>&lt;p&gt;Most teams I talk to get the agent demo working before anyone writes down what it can do in production.&lt;/p&gt;
&lt;p&gt;Not the model. Not the prompt. The operating permissions - what account it runs as, what it can touch, what it can spend, and who has to approve.&lt;/p&gt;
&lt;p&gt;That gap is where the surprises live.&lt;/p&gt;
&lt;h2 id="the-table-that-should-exist-before-the-agent-goes-live"&gt;The table that should exist before the agent goes live&lt;/h2&gt;
&lt;p&gt;For any agent touching real tools, I would want a single table that answers:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;which account it runs as&lt;/li&gt;
&lt;li&gt;what it can read&lt;/li&gt;
&lt;li&gt;what it can change&lt;/li&gt;
&lt;li&gt;what it can send externally&lt;/li&gt;
&lt;li&gt;what it can spend&lt;/li&gt;
&lt;li&gt;what it can deploy or break&lt;/li&gt;
&lt;li&gt;what requires human approval&lt;/li&gt;
&lt;li&gt;what gets logged&lt;/li&gt;
&lt;li&gt;how to turn it off fast&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Nine columns. One row per connected tool. That is the whole artifact.&lt;/p&gt;
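&lt;p&gt;The table fits in a spreadsheet, but if you would rather keep it next to the code, one possible shape is a small data class with a check for blank cells. This is a sketch, not a required format: the field names are my shorthand for the nine questions above, and the GitHub row is a made-up example.&lt;/p&gt;

```python
from dataclasses import dataclass, fields


@dataclass
class ToolRow:
    account: str
    can_read: str
    can_change: str
    can_send_externally: str
    can_spend: str
    can_deploy_or_break: str
    needs_approval: str
    logged: str
    kill_switch: str


def missing_columns(row: ToolRow) -> list[str]:
    """Names of columns left blank, so gaps are visible before go-live."""
    return [f.name for f in fields(row) if not getattr(row, f.name).strip()]


# Hypothetical example row for a GitHub-connected agent.
github = ToolRow(
    account="bot user with a repo-scoped token",
    can_read="source, issues, PRs",
    can_change="open PRs, comment",
    can_send_externally="nothing",
    can_spend="nothing",
    can_deploy_or_break="none; merges are human-only",
    needs_approval="merge",
    logged="trigger, tool call, object, result",
    kill_switch="",  # deliberately blank to show the check
)

print(missing_columns(github))
```

&lt;p&gt;A blank cell is a finding in itself: here the row has no written kill switch, so that is the first thing to fix.&lt;/p&gt;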
&lt;h2 id="read-only-is-not-the-safe-answer-you-think-it-is"&gt;&amp;ldquo;Read-only&amp;rdquo; is not the safe answer you think it is&lt;/h2&gt;
&lt;p&gt;The uncomfortable bit is that &amp;ldquo;read-only&amp;rdquo; access is still access.&lt;/p&gt;
&lt;p&gt;A read-only agent may see source code, support tickets, invoices, Slack channels, customer records, logs, or internal docs. That is a real blast radius even if it never writes a byte.&lt;/p&gt;
&lt;p&gt;And a write-capable agent needs &lt;em&gt;verbs&lt;/em&gt;, not labels. &amp;ldquo;Write access to GitHub&amp;rdquo; is not a permission. These are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;comment&lt;/li&gt;
&lt;li&gt;merge&lt;/li&gt;
&lt;li&gt;email&lt;/li&gt;
&lt;li&gt;refund&lt;/li&gt;
&lt;li&gt;invite&lt;/li&gt;
&lt;li&gt;export&lt;/li&gt;
&lt;li&gt;deploy&lt;/li&gt;
&lt;li&gt;delete&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The label tells you nothing. The verbs tell you everything.&lt;/p&gt;
&lt;h2 id="a-default-rule-set-i-would-start-from"&gt;A default rule set I would start from&lt;/h2&gt;
&lt;p&gt;This is the boring, working starting point. Tighten or loosen per tool, but start here:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Outbound to humans&lt;/strong&gt;: draft-only for anything that leaves the company - email, Slack DMs to customers, social, support replies. A human ships the final.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Code&lt;/strong&gt;: PRs allowed; merges require human approval.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Money&lt;/strong&gt;: read-only billing lookups allowed; refunds require human approval.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cloud&lt;/strong&gt;: inspection allowed; resource changes require human approval.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Logs&lt;/strong&gt;: every run shows trigger, tool call, object touched, result, and human approval if one was required.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Kill switch&lt;/strong&gt;: one obvious off-switch per connected tool. Not a config flag buried in a repo. A button or a single command.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The point is not that these rules are correct for every team. The point is that &lt;em&gt;some&lt;/em&gt; version of this exists in writing before the agent becomes production infrastructure.&lt;/p&gt;
&lt;h2 id="what-this-is-not"&gt;What this is not&lt;/h2&gt;
&lt;p&gt;A few things to be honest about:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;This is not a security audit. It is a map.&lt;/li&gt;
&lt;li&gt;It is not a compliance review, not a pentest, not an IAM audit.&lt;/li&gt;
&lt;li&gt;It does not make any agent &amp;ldquo;safe&amp;rdquo; or &amp;ldquo;least-privileged&amp;rdquo; or &amp;ldquo;production-ready.&amp;rdquo; Those words do a lot of work that the table does not do.&lt;/li&gt;
&lt;li&gt;It does not replace your provider&amp;rsquo;s own controls - org policies, role scopes, OAuth scopes, MCP allow-lists. It tells you which of those you actually need to set.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The table is the first artifact, not the last one.&lt;/p&gt;
&lt;h2 id="the-boring-summary"&gt;The boring summary&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Write the permission map before the agent touches a real tool, not after.&lt;/li&gt;
&lt;li&gt;One row per connected tool. Nine columns: account, read, change, send-external, spend, deploy-or-break, approvals, logs, kill switch.&lt;/li&gt;
&lt;li&gt;Treat &amp;ldquo;read-only&amp;rdquo; as a real surface area, not a free pass.&lt;/li&gt;
&lt;li&gt;Describe write access in verbs (comment, merge, email, refund, invite, export, deploy, delete), not labels.&lt;/li&gt;
&lt;li&gt;Drafts for anything that leaves the company. Approvals for money, merges, and resource changes.&lt;/li&gt;
&lt;li&gt;One kill switch per tool, obvious enough to use at 2am.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The table does not need to be fancy. It just needs to exist.&lt;/p&gt;
&lt;p&gt;— Rich&lt;/p&gt;
&lt;h2 id="frequently-asked-questions"&gt;Frequently asked questions&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;What is an agent permission map?&lt;/strong&gt;
A single table, one row per connected tool, that records which account the agent runs as, what it can read, what it can change, what it can send externally, what it can spend, what it can deploy or break, what requires a human&amp;rsquo;s approval, what gets logged, and how to turn it off fast. It is a written description, not a scanner.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Is &amp;ldquo;read-only&amp;rdquo; access actually safe for an AI agent?&lt;/strong&gt;
No. Read-only access can still expose source code, support tickets, invoices, Slack history, customer records, logs, and internal docs. The map should treat read-only as a real blast radius and decide whether the agent really needs it.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;What is a kill switch for an AI agent?&lt;/strong&gt;
One obvious off-switch per connected tool that a human can use at 2am: a revoked OAuth token, a disabled GitHub App, a paused Zapier/Make scenario, a removed Slack app, or a single command that disables the agent&amp;rsquo;s outbound calls. Not a config flag buried in a repo.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Do I need a security audit, an IAM audit, or this map?&lt;/strong&gt;
The map is the cheaper first artifact and almost always the missing one. It is not a security audit, an IAM audit, a compliance review, or a penetration test. If you eventually need any of those, the map makes them shorter and cheaper because the scope is already written down.&lt;/p&gt;
&lt;h2 id="want-a-written-permission-map-review"&gt;Want a written permission-map review?&lt;/h2&gt;
&lt;p&gt;This lane is in private validation. If you are connecting an agent to GitHub, Gmail, Slack, Stripe, AWS, MCP servers, support tools, or a deploy path and you want a fixed-scope, async, written permission-map review, email &lt;a href="mailto:support@richgibbs.dev?subject=Agent%20Permission%20Map%20review"&gt;support@richgibbs.dev&lt;/a&gt; with the tool the agent touches and the riskiest verb it can perform.&lt;/p&gt;
&lt;p&gt;What you get back in this window is the checklist and the report template I am working from, not a free review of your stack. No public price during the validation window. The QuickCheck site lists the related paid offers that already exist for &lt;a href="https://richgibbs.dev/quickcheck/"&gt;AI/API cost rescue, inbox/DNS, and one-host server hygiene&lt;/a&gt;; the Agent Permission Map is a separate lane that is not yet a productized QuickCheck.&lt;/p&gt;
&lt;p&gt;I will not ask for credentials, OAuth grants, or admin access. I cannot accept regulated data (health, financial-account, payment-card, student, children&amp;rsquo;s, HR).&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;em&gt;Advisory checklist for operators. Not legal advice, not a security audit, not an IAM audit, not a compliance review, not a penetration test, and not a certification. No affiliation with or endorsement by GitHub, Google, Slack, Stripe, AWS, OpenAI, Anthropic, Microsoft, Cloudflare, or any other vendor named above. The author will not ask for credentials, OAuth grants, or admin access, and cannot accept regulated data (health, financial-account, payment-card, student, children&amp;rsquo;s, HR).&lt;/em&gt;&lt;/p&gt;
&lt;script type="application/ld+json"&gt;{"@context":"https://schema.org","@type":"FAQPage","mainEntity":[{"@type":"Question","name":"What is an agent permission map?","acceptedAnswer":{"@type":"Answer","text":"A single written table, one row per connected tool, that records which account the AI agent runs as, what it can read, what it can change, what it can send externally, what it can spend, what it can deploy or break, what requires human approval, what gets logged, and how to turn it off fast. It is a description of operating permissions, not a scanner."}},{"@type":"Question","name":"Is read-only access actually safe for an AI agent?","acceptedAnswer":{"@type":"Answer","text":"No. Read-only access can still expose source code, support tickets, invoices, Slack history, customer records, logs, and internal docs. The permission map should treat read-only as a real blast radius and decide whether the agent really needs it."}},{"@type":"Question","name":"What is a kill switch for an AI agent?","acceptedAnswer":{"@type":"Answer","text":"One obvious off-switch per connected tool that a human can use at 2am: a revoked OAuth token, a disabled GitHub App, a paused Zapier or Make scenario, a removed Slack app, or a single command that disables the agent's outbound calls. Not a config flag buried in a repo."}},{"@type":"Question","name":"Do I need a security audit, an IAM audit, or an agent permission map?","acceptedAnswer":{"@type":"Answer","text":"The agent permission map is the cheaper first artifact and almost always the missing one. It is not a security audit, an IAM audit, a compliance review, or a penetration test. If any of those become necessary later, the map makes them shorter and cheaper because the scope is already written down."}}]}&lt;/script&gt;</content:encoded>
      <guid isPermaLink="true">https://blog.richgibbs.dev/agent-permission-map-before-real-tool-access/</guid>
      <category>ai-agents</category>
      <category>agent-permissions</category>
      <category>mcp</category>
      <category>github</category>
      <category>gmail</category>
      <category>slack</category>
      <category>stripe</category>
      <category>aws</category>
      <category>indie-founder</category>
      <category>operations</category>
      <pubDate>Sat, 23 May 2026 16:40:00 +0000</pubDate>
    </item>
    <item>
      <title>Docker Compose on one VPS: the production checklist before you outgrow it</title>
      <link>https://blog.richgibbs.dev/docker-compose-one-vps-production-checklist/</link>
      <description>A practical checklist for running Docker Compose on a single VPS: restart policy, health checks, log rotation, backups, deploy path, ports, secrets, rollback, and alerts.</description>
      <content:encoded>&lt;p&gt;Docker Compose is fine for a one-server app.&lt;/p&gt;
&lt;p&gt;That sentence makes some people twitch, so here is the boundary: one VPS, one operator, a small app, a reverse proxy, a database you understand, and a business where &amp;ldquo;five minutes down while I fix it&amp;rdquo; is annoying but not catastrophic.&lt;/p&gt;
&lt;p&gt;That setup does not need Kubernetes on day one. It does need a boring checklist, because most Compose outages are not deep container problems. They are simpler:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;the process did not come back after reboot&lt;/li&gt;
&lt;li&gt;logs filled the disk&lt;/li&gt;
&lt;li&gt;a private port was accidentally published to the internet&lt;/li&gt;
&lt;li&gt;the &lt;code&gt;.env&lt;/code&gt; file became the only copy of production secrets&lt;/li&gt;
&lt;li&gt;the database volume was &amp;ldquo;backed up&amp;rdquo; but never restored&lt;/li&gt;
&lt;li&gt;the deploy command was whatever you typed last time from shell history&lt;/li&gt;
&lt;li&gt;the health check said &amp;ldquo;container running&amp;rdquo; while the app was dead&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Compose is not the problem in that story. Undocumented production habits are.&lt;/p&gt;
&lt;p&gt;This is the checklist I would want in place before trusting a Docker Compose app on one VPS.&lt;/p&gt;
&lt;h2 id="1-write-down-the-actual-shape-of-the-system"&gt;1. Write down the actual shape of the system&lt;/h2&gt;
&lt;p&gt;Before changing YAML, write one short inventory:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;public hostnames&lt;/li&gt;
&lt;li&gt;Compose project directory&lt;/li&gt;
&lt;li&gt;services in &lt;code&gt;compose.yml&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;published ports&lt;/li&gt;
&lt;li&gt;named volumes&lt;/li&gt;
&lt;li&gt;bind mounts&lt;/li&gt;
&lt;li&gt;environment files&lt;/li&gt;
&lt;li&gt;backup target&lt;/li&gt;
&lt;li&gt;deploy command&lt;/li&gt;
&lt;li&gt;rollback command&lt;/li&gt;
&lt;li&gt;who gets paged when it breaks, even if &amp;ldquo;who&amp;rdquo; is just you&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The act of writing that list will catch half the mistakes.&lt;/p&gt;
&lt;p&gt;If you cannot say which volume contains production data, you do not have a deployment. You have a container that happens to be running.&lt;/p&gt;
&lt;h2 id="2-make-the-reverse-proxy-the-only-public-door"&gt;2. Make the reverse proxy the only public door&lt;/h2&gt;
&lt;p&gt;For a one-box Compose app, I want exactly one public entry path:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;80/tcp&lt;/code&gt; and &lt;code&gt;443/tcp&lt;/code&gt; to Caddy, nginx, Traefik, or Apache&lt;/li&gt;
&lt;li&gt;app containers reachable only on the Docker network or loopback&lt;/li&gt;
&lt;li&gt;database and cache ports not published publicly&lt;/li&gt;
&lt;li&gt;admin tools disabled or bound to &lt;code&gt;127.0.0.1&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The dangerous Compose line is usually this:&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-yaml"&gt;ports:
  - &amp;quot;5432:5432&amp;quot;
&lt;/code&gt;&lt;/pre&gt;

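&lt;p&gt;That publishes Postgres on every interface. Do not count on the host firewall to save you: Docker writes its own iptables rules for published ports, and that traffic is handled before UFW&amp;rsquo;s rules apply.&lt;/p&gt;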
&lt;p&gt;Prefer one of these shapes:&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-yaml"&gt;ports:
  - &amp;quot;127.0.0.1:3000:3000&amp;quot;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;or no host port at all, with the reverse proxy joining the same Compose network.&lt;/p&gt;
&lt;p&gt;Then verify from outside the box:&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;nmap -Pn -p 1-10000 your.server.ip
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;You should see SSH, HTTP, HTTPS, and very little else. If Redis, Postgres, MySQL, Meilisearch, Elasticsearch, or an admin dashboard shows up, stop and fix that before touching the application.&lt;/p&gt;
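&lt;p&gt;From inside the box, a quick cross-check is to compare what Docker has published with what the host is actually listening on. A minimal sketch, assuming the Docker CLI and &lt;code&gt;ss&lt;/code&gt; (from iproute2) are installed:&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;# Ports Docker has published, one line per container
docker ps --format '{{.Names}}\t{{.Ports}}'

# Every listening TCP socket on the host, with the owning process
sudo ss -tlnp
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Anything bound to &lt;code&gt;0.0.0.0&lt;/code&gt; or &lt;code&gt;[::]&lt;/code&gt; that is not SSH, HTTP, or HTTPS deserves a second look.&lt;/p&gt;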
&lt;h2 id="3-use-a-restart-policy-but-do-not-confuse-it-with-health"&gt;3. Use a restart policy, but do not confuse it with health&lt;/h2&gt;
&lt;p&gt;Docker documents restart policies for the basic &amp;ldquo;come back after exit or daemon restart&amp;rdquo; behavior. For a one-VPS app, &lt;code&gt;unless-stopped&lt;/code&gt; is usually the least surprising default.&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-yaml"&gt;services:
  web:
    image: registry.example.com/myapp:2026-05-24
    restart: unless-stopped
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;That gets you through process exits and host reboots.&lt;/p&gt;
&lt;p&gt;It does not prove the app is healthy.&lt;/p&gt;
&lt;p&gt;A wedged app can keep a process alive forever. A web server can return 500s while the container is &amp;ldquo;Up&amp;rdquo;. A worker can be connected to the wrong queue and look fine from Docker&amp;rsquo;s point of view.&lt;/p&gt;
&lt;p&gt;So pair restart policy with a real health check.&lt;/p&gt;
&lt;h2 id="4-add-health-checks-that-test-the-thing-users-need"&gt;4. Add health checks that test the thing users need&lt;/h2&gt;
&lt;p&gt;Docker Compose supports service health checks in the Compose file. The command can be a list form or shell form; the important part is that it tests behavior, not just process existence.&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-yaml"&gt;services:
  web:
    image: registry.example.com/myapp:2026-05-24
    restart: unless-stopped
    healthcheck:
      test: [&amp;quot;CMD-SHELL&amp;quot;, &amp;quot;wget -qO- http://127.0.0.1:3000/healthz &amp;gt;/dev/null || exit 1&amp;quot;]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 30s
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;A good health check answers one narrow question: &amp;ldquo;Can this service do the small thing users depend on?&amp;rdquo;&lt;/p&gt;
&lt;p&gt;For a web app, that might be &lt;code&gt;/healthz&lt;/code&gt; returning 200 after checking the database connection.&lt;/p&gt;
&lt;p&gt;For a worker, it might be a queue heartbeat or a lightweight dependency check.&lt;/p&gt;
&lt;p&gt;Do not make health checks expensive. Do not run migrations inside them. Do not call third-party APIs every 30 seconds. If the health check creates its own outage, it failed the assignment.&lt;/p&gt;
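&lt;p&gt;One thing to keep in mind: on a plain Docker host, a failing health check only marks the container &lt;code&gt;unhealthy&lt;/code&gt;. The restart policy reacts to exits, not to health status, so something still has to look at the result. A minimal sketch of that check, assuming the &lt;code&gt;web&lt;/code&gt; service from the example above and that the second command runs from the Compose project directory:&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;# List containers whose health check is currently failing
docker ps --filter health=unhealthy --format '{{.Names}}'

# Show the health status of one service's container
docker inspect --format '{{.State.Health.Status}}' &amp;quot;$(docker compose ps -q web)&amp;quot;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Run the first command from cron or a systemd timer and alert when it prints anything.&lt;/p&gt;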
&lt;h2 id="5-put-log-rotation-in-the-compose-file"&gt;5. Put log rotation in the Compose file&lt;/h2&gt;
&lt;p&gt;The most boring production outage is a full disk.&lt;/p&gt;
&lt;p&gt;Docker&amp;rsquo;s default &lt;code&gt;json-file&lt;/code&gt; logging driver writes container output to JSON files on the host. Docker&amp;rsquo;s docs call out options such as &lt;code&gt;max-size&lt;/code&gt; and &lt;code&gt;max-file&lt;/code&gt; for limiting those logs. Use them.&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-yaml"&gt;services:
  web:
    logging:
      driver: &amp;quot;json-file&amp;quot;
      options:
        max-size: &amp;quot;10m&amp;quot;
        max-file: &amp;quot;5&amp;quot;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Then check the host:&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;docker system df
du -h /var/lib/docker/containers 2&amp;gt;/dev/null | sort -h | tail
df -h /
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;If logs can fill the root volume, your uptime depends on how chatty the app gets during an incident. That is not a plan.&lt;/p&gt;
&lt;h2 id="6-treat-env-as-config-not-a-secret-vault"&gt;6. Treat &lt;code&gt;.env&lt;/code&gt; as config, not a secret vault&lt;/h2&gt;
&lt;p&gt;Compose reads environment files because it is convenient. Convenience is not the same as secret management.&lt;/p&gt;
&lt;p&gt;For a one-person VPS, I am not going to pretend every app needs Vault, SOPS, KMS, and a ceremony. But the minimum line is still clear:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;.env&lt;/code&gt; is not committed&lt;/li&gt;
&lt;li&gt;&lt;code&gt;.env&lt;/code&gt; is mode &lt;code&gt;0600&lt;/code&gt; or at least not world-readable&lt;/li&gt;
&lt;li&gt;secrets are not printed in deploy logs&lt;/li&gt;
&lt;li&gt;backups do not spray &lt;code&gt;.env&lt;/code&gt; into random buckets&lt;/li&gt;
&lt;li&gt;the production &lt;code&gt;.env&lt;/code&gt; file has a second recoverable copy somewhere intentional&lt;/li&gt;
&lt;li&gt;old API keys get rotated when contractors, incidents, or accidental exposure make that necessary&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Docker Compose also has &lt;code&gt;secrets&lt;/code&gt; and &lt;code&gt;configs&lt;/code&gt; concepts in the Compose specification. On a single VPS, even a simple file-mounted secret can be better than passing everything as environment variables, because it gives you a clearer boundary for &amp;ldquo;this file is sensitive.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;The main rule is not fancy: know where secrets live, know who can read them, and know how to rotate them.&lt;/p&gt;
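&lt;p&gt;The first two items on that list can be audited in a few seconds. A minimal sketch, assuming the project lives in &lt;code&gt;/opt/myapp&lt;/code&gt;, is a git checkout, and the host has GNU &lt;code&gt;stat&lt;/code&gt;:&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;cd /opt/myapp

# Restrict the file to its owner, then confirm mode and ownership
chmod 600 .env
stat -c '%a %U:%G %n' .env

# Prints the ignore rule that matches; no output means .env is NOT ignored
git check-ignore -v .env
&lt;/code&gt;&lt;/pre&gt;
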
&lt;h2 id="7-name-volumes-like-you-will-have-to-restore-them-at-2am"&gt;7. Name volumes like you will have to restore them at 2am&lt;/h2&gt;
&lt;p&gt;This is the difference between a recoverable Compose app and a science project.&lt;/p&gt;
&lt;p&gt;Bad:&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-yaml"&gt;volumes:
  data:
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Better:&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-yaml"&gt;volumes:
  postgres_data:
  uploads_data:
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Best: a README next to the Compose file that says:&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-text"&gt;postgres_data  -&amp;gt; production database
uploads_data   -&amp;gt; user uploads
backup target  -&amp;gt; s3://example-backups/myapp/
restore drill  -&amp;gt; docs/restore.md
last tested    -&amp;gt; 2026-05-24
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;If your database is inside Compose, backup and restore are production features. Not ops chores. Not future hardening. Product features.&lt;/p&gt;
&lt;p&gt;At minimum:&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;docker compose exec -T db pg_dump -U app app &amp;gt; backup.sql
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;and a documented restore command that you have run on a clean database.&lt;/p&gt;
&lt;p&gt;A backup you have never restored is just a comforting file.&lt;/p&gt;
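&lt;p&gt;A restore drill does not have to be elaborate. A minimal sketch, assuming the same &lt;code&gt;db&lt;/code&gt; service and &lt;code&gt;app&lt;/code&gt; user as the dump command above, a plain-SQL dump, and a scratch database name (&lt;code&gt;restore_test&lt;/code&gt;) that is purely illustrative:&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;# Create a scratch database and load the dump into it, stopping on the first error
docker compose exec -T db createdb -U app restore_test
docker compose exec -T db psql -U app -d restore_test -v ON_ERROR_STOP=1 &amp;lt; backup.sql

# Look at what came back, then clean up
docker compose exec -T db psql -U app -d restore_test -c '\dt'
docker compose exec -T db dropdb -U app restore_test
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;If the table list looks wrong or the load stops with an error, you found out on a quiet afternoon instead of during an incident.&lt;/p&gt;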
&lt;h2 id="8-keep-the-deploy-command-boring-and-repeatable"&gt;8. Keep the deploy command boring and repeatable&lt;/h2&gt;
&lt;p&gt;The deploy path should be a script or runbook, not shell history.&lt;/p&gt;
&lt;p&gt;For a pull-based deploy:&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;set -euo pipefail

cd /opt/myapp
docker compose config &amp;gt;/dev/null
docker compose pull
docker compose up -d --remove-orphans
docker compose ps
docker compose logs --since=10m --tail=200 web
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;For a build-on-host deploy:&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;set -euo pipefail

cd /opt/myapp
docker compose config &amp;gt;/dev/null
docker compose build --pull
docker compose up -d --remove-orphans
docker compose ps
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;That is not sophisticated. That is the point.&lt;/p&gt;
&lt;p&gt;You want the same commands every time so that when a deploy fails, you are debugging the deploy, not your memory of the deploy.&lt;/p&gt;
&lt;h2 id="9-have-a-rollback-that-does-not-require-creativity"&gt;9. Have a rollback that does not require creativity&lt;/h2&gt;
&lt;p&gt;If images are tagged only as &lt;code&gt;latest&lt;/code&gt;, rollback is guesswork.&lt;/p&gt;
&lt;p&gt;Prefer immutable-ish tags:&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-yaml"&gt;services:
  web:
    image: registry.example.com/myapp:2026-05-24-1842
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Then rollback is boring:&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;cd /opt/myapp
git checkout previous-known-good-compose-file
docker compose pull
docker compose up -d --remove-orphans
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;If you build on the host, keep the previous image around long enough to go back.&lt;/p&gt;
&lt;p&gt;If the database migration is not backwards-compatible, write that down before deploying. The worst rollback plan is &amp;ldquo;we can roll the app back, but not the data.&amp;rdquo;&lt;/p&gt;
&lt;h2 id="10-start-compose-from-systemd-not-a-forgotten-terminal"&gt;10. Start Compose from systemd, not a forgotten terminal&lt;/h2&gt;
&lt;p&gt;If the host reboots, the app should return without you SSHing in.&lt;/p&gt;
&lt;p&gt;One plain systemd unit can be enough:&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-ini"&gt;[Unit]
Description=My app Docker Compose stack
Requires=docker.service
After=docker.service network-online.target
Wants=network-online.target

[Service]
Type=oneshot
RemainAfterExit=yes
WorkingDirectory=/opt/myapp
ExecStart=/usr/bin/docker compose up -d --remove-orphans
ExecStop=/usr/bin/docker compose down
TimeoutStartSec=0

[Install]
WantedBy=multi-user.target
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Save it as &lt;code&gt;/etc/systemd/system/myapp-compose.service&lt;/code&gt;, then:&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;sudo systemctl daemon-reload
sudo systemctl enable --now myapp-compose.service
systemctl status myapp-compose.service
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This does not replace container restart policies. It gives the host one obvious owner for the stack lifecycle.&lt;/p&gt;
&lt;h2 id="11-alert-on-the-boring-host-signals"&gt;11. Alert on the boring host signals&lt;/h2&gt;
&lt;p&gt;For a one-VPS Compose app, the first useful alerts are not advanced:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;root disk over 80 percent&lt;/li&gt;
&lt;li&gt;memory pressure or swap thrash&lt;/li&gt;
&lt;li&gt;public HTTP check failing&lt;/li&gt;
&lt;li&gt;TLS certificate near expiry&lt;/li&gt;
&lt;li&gt;Docker daemon down&lt;/li&gt;
&lt;li&gt;app health endpoint failing&lt;/li&gt;
&lt;li&gt;backup job missing its last-success marker&lt;/li&gt;
&lt;li&gt;root-owned files accidentally created in bind mounts&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That list catches more real incidents than a beautiful dashboard that nobody reads.&lt;/p&gt;
&lt;p&gt;Start with cron, systemd timers, Uptime Kuma, Healthchecks.io, Better Stack, or whatever you will actually maintain. The tool matters less than the habit: an external check must notice when the box is not serving users.&lt;/p&gt;
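&lt;p&gt;Two of those signals, certificate expiry and the backup marker, fit in a short cron script. This is a sketch under stated assumptions, not a finished monitor: the hostname and the marker path are hypothetical, and it assumes your backup job touches the marker file on success.&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;#!/usr/bin/env bash
set -u

HOST=example.com                          # hypothetical hostname
MARKER=/var/backups/myapp/last-success    # hypothetical file the backup job touches

# TLS certificate: alert if it expires within 14 days or cannot be fetched
if ! echo | openssl s_client -servername &amp;quot;$HOST&amp;quot; -connect &amp;quot;$HOST:443&amp;quot; 2&amp;gt;/dev/null \
    | openssl x509 -noout -checkend $((14 * 24 * 3600)) &amp;gt;/dev/null; then
  echo &amp;quot;ALERT: certificate for $HOST expires within 14 days or could not be fetched&amp;quot;
fi

# Backup marker: alert if it is missing or older than 26 hours (1560 minutes)
if [ -z &amp;quot;$(find &amp;quot;$MARKER&amp;quot; -mmin -1560 2&amp;gt;/dev/null)&amp;quot; ]; then
  echo &amp;quot;ALERT: no successful backup recorded in the last 26 hours&amp;quot;
fi
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The script only prints. Cron mails any output if &lt;code&gt;MAILTO&lt;/code&gt; is set, or you can pipe it to whatever notifier you already use.&lt;/p&gt;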
&lt;h2 id="12-know-when-compose-is-no-longer-the-right-tool"&gt;12. Know when Compose is no longer the right tool&lt;/h2&gt;
&lt;p&gt;Compose on one VPS stops being cute when:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;downtime during one-host maintenance is unacceptable&lt;/li&gt;
&lt;li&gt;the database needs managed backups, replication, or point-in-time recovery&lt;/li&gt;
&lt;li&gt;one deploy must roll across multiple hosts&lt;/li&gt;
&lt;li&gt;the team needs per-service ownership and access control&lt;/li&gt;
&lt;li&gt;traffic bursts require horizontal scaling&lt;/li&gt;
&lt;li&gt;compliance or customer contracts require stronger operational evidence&lt;/li&gt;
&lt;li&gt;the restore path depends on one person remembering everything&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That does not mean &amp;ldquo;move to Kubernetes.&amp;rdquo; It means the shape changed.&lt;/p&gt;
&lt;p&gt;Maybe the next step is managed Postgres and the app still runs in Compose. Maybe it is a second VPS. Maybe it is Fly.io, Render, ECS, Nomad, Kubernetes, or something boring from your cloud provider.&lt;/p&gt;
&lt;p&gt;The important part is not defending Compose forever. It is knowing what risk you accepted while Compose was the right level of machinery.&lt;/p&gt;
&lt;h2 id="the-short-version"&gt;The short version&lt;/h2&gt;
&lt;p&gt;Before you trust Docker Compose on one VPS, make these true:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;one public door: reverse proxy only&lt;/li&gt;
&lt;li&gt;no accidental public database/cache/admin ports&lt;/li&gt;
&lt;li&gt;&lt;code&gt;restart: unless-stopped&lt;/code&gt; or another deliberate restart policy&lt;/li&gt;
&lt;li&gt;health checks that test real behavior&lt;/li&gt;
&lt;li&gt;Docker log rotation&lt;/li&gt;
&lt;li&gt;&lt;code&gt;.env&lt;/code&gt; protected and recoverable&lt;/li&gt;
&lt;li&gt;named volumes with a restore-tested backup path&lt;/li&gt;
&lt;li&gt;one repeatable deploy command&lt;/li&gt;
&lt;li&gt;one boring rollback path&lt;/li&gt;
&lt;li&gt;systemd owns the stack on boot&lt;/li&gt;
&lt;li&gt;external checks catch HTTP, disk, cert, Docker, and backup failures&lt;/li&gt;
&lt;li&gt;a written threshold for when this setup has been outgrown&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Compose is a good tool for a one-server business when the operator is honest about its edges.&lt;/p&gt;
&lt;p&gt;The danger is not that Compose is too small.&lt;/p&gt;
&lt;p&gt;The danger is pretending a running container is the same thing as a production system.&lt;/p&gt;
&lt;h2 id="sources"&gt;Sources&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.docker.com/engine/containers/start-containers-automatically/"&gt;Docker Docs: Start containers automatically&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.docker.com/reference/compose-file/"&gt;Docker Docs: Compose file reference&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.docker.com/reference/compose-file/services/#healthcheck"&gt;Docker Docs: Compose service healthcheck&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.docker.com/engine/logging/drivers/json-file/"&gt;Docker Docs: JSON file logging driver&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.docker.com/reference/compose-file/secrets/"&gt;Docker Docs: Compose secrets&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</content:encoded>
      <guid isPermaLink="true">https://blog.richgibbs.dev/docker-compose-one-vps-production-checklist/</guid>
      <category>docker</category>
      <category>docker-compose</category>
      <category>vps</category>
      <category>devops</category>
      <category>sysadmin</category>
      <category>indie-founder</category>
      <category>operations</category>
      <category>self-hosting</category>
      <pubDate>Sun, 24 May 2026 11:48:00 +0000</pubDate>
    </item>
    <item>
      <title>I ran a read-only server audit. Here's what I found that the scanners missed.</title>
      <link>https://blog.richgibbs.dev/i-ran-a-read-only-server-audit-2026/</link>
      <description>Found a world-readable archive with API keys sitting in /var/backups/. A read-only audit finds the quiet risks that scanners miss.</description>
      <content:encoded>&lt;p&gt;I keep my servers boring. SSH keys only. UFW defaults-deny. Unattended-upgrades on a timer. Fail2ban because why not. Nothing fancy — just the basics that every indie founder&amp;rsquo;s &amp;ldquo;I&amp;rsquo;ll get to it&amp;rdquo; list already has.&lt;/p&gt;
&lt;p&gt;I thought I was fine. And I mostly was. But &amp;ldquo;mostly&amp;rdquo; is the kind of word that keeps incident responders employed.&lt;/p&gt;
&lt;p&gt;Last month I ran a structured read-only audit on my own infrastructure. Same process I use for the &lt;a href="https://richgibbs.dev/quickcheck/"&gt;QuickCheck&lt;/a&gt; — just a systematic posture review. No exploits. No intrusive scans. Just a checklist of things that tend to drift when you&amp;rsquo;re focused on shipping instead of hardening.&lt;/p&gt;
&lt;p&gt;I wasn&amp;rsquo;t expecting to find much. That&amp;rsquo;s what made the find annoying.&lt;/p&gt;
&lt;h2 id="the-find"&gt;The find&lt;/h2&gt;
&lt;p&gt;In a &lt;code&gt;/var/backups/&lt;/code&gt; directory on a utility box, there was a compressed archive from nine months ago. Inside: a full &lt;code&gt;.env&lt;/code&gt; file from an old deployment script — database host, API keys, a service account token — world-readable. The backup job that created it had been disabled when we moved to a new deploy pipeline, but the artifact was still there. Anyone with a shell on that box — a compromised dependency, an unattended &lt;code&gt;curl | bash&lt;/code&gt;, a stray container — could have read the whole thing.&lt;/p&gt;
&lt;p&gt;No, this wasn&amp;rsquo;t an exposed CVE or an active exploit. It was quieter than that. It was the kind of thing that only matters &lt;em&gt;after&lt;/em&gt; something else goes wrong. But if something had gone wrong — if someone had gotten a foothold — that file would have turned a limited incident into a full credential dump in about 30 seconds.&lt;/p&gt;
&lt;h2 id="what-i-deliberately-did-not-do"&gt;What I deliberately did not do&lt;/h2&gt;
&lt;p&gt;I did not run a vulnerability scanner. I did not attempt to crack anything. I did not change any configuration during the review. This was a read-only posture check: what&amp;rsquo;s listening, who has access, where are the seams, what did I forget about.&lt;/p&gt;
&lt;p&gt;That distinction matters. A pentest answers &amp;ldquo;can someone break in right now?&amp;rdquo; A read-only audit answers &amp;ldquo;if someone does break in, how bad is it?&amp;rdquo; They&amp;rsquo;re different tools for different questions. Most solo devs need the second one long before they need the first. (I wrote more about that distinction &lt;a href="https://richgibbs.dev/security-audit-vs-penetration-test-indie-founder-2026/"&gt;here&lt;/a&gt;.)&lt;/p&gt;
&lt;h2 id="the-fix"&gt;The fix&lt;/h2&gt;
&lt;p&gt;The fix took fifteen minutes: delete the archive, rotate the exposed keys, add &lt;code&gt;backup-cleanup&lt;/code&gt; to the deploy checklist. No firewall changes. No architecture redesign. No downtime. The risk wasn&amp;rsquo;t in the remediation. It was in the nine months the file sat there without anyone noticing.&lt;/p&gt;
&lt;p&gt;That&amp;rsquo;s the part that stuck with me. It wasn&amp;rsquo;t a sophisticated attack that would have exposed me. It was a backup artifact from a retired pipeline, doing exactly what backups do — persisting data after the original source is gone.&lt;/p&gt;
&lt;h2 id="what-i-do-differently-now"&gt;What I do differently now&lt;/h2&gt;
&lt;p&gt;I added a quarterly check to my calendar: &amp;ldquo;read-only audit, one box.&amp;rdquo; Same methodology each time. Check users and groups. Check world-readable files in unexpected places. Check what&amp;rsquo;s listening on interfaces it shouldn&amp;rsquo;t be. Check cron jobs and systemd timers that might outlive their purpose. Takes about an hour. Has paid for itself twice over in peace of mind.&lt;/p&gt;
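&lt;p&gt;For a rough idea of what that hour looks like, here is a minimal sketch of the four checks. Every command only reads. The directories and file patterns are assumptions for a typical Debian or Ubuntu box, so adjust them to where your own leftovers tend to collect.&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;# Users and groups: regular accounts with their shells, and who can sudo
getent passwd | awk -F: '$3 &amp;gt;= 1000 {print $1, $7}'
getent group sudo

# World-readable archives, dumps, and env files in places that collect leftovers
sudo find /var/backups /opt /srv /home -xdev -type f -perm -004 \
  \( -name '*.tar*' -o -name '*.zip' -o -name '*.sql*' -o -name '.env*' \) -ls 2&amp;gt;/dev/null

# What is listening, and on which interface
sudo ss -tulnp

# Scheduled jobs that may have outlived their purpose
systemctl list-timers --all
sudo ls -la /etc/cron.d /var/spool/cron/crontabs
&lt;/code&gt;&lt;/pre&gt;
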
&lt;p&gt;Your mileage will depend on your setup — age of the server, number of people who&amp;rsquo;ve had access, how many experiments got deployed and never cleaned up. But the basic principle holds: the best time to audit is before you have a reason to. The second-best time is now.&lt;/p&gt;
&lt;h2 id="if-you-want-a-second-set-of-eyes"&gt;If you want a second set of eyes&lt;/h2&gt;
&lt;p&gt;Every time I tell this story at a meetup or on a thread, someone asks if I do these audits for other people. So I productized the methodology: the &lt;a href="https://richgibbs.dev/quickcheck/"&gt;Hardening QuickCheck&lt;/a&gt; is the same structured read-only review I run on my own boxes. You get a report with what I found, why it matters, and what to do next. No production changes. No credentials stored. Just a prioritized write-up from someone who has been surprised by a backup file too.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://richgibbs.dev/quickcheck/sample-report/"&gt;See what a report looks like&lt;/a&gt; if you want to judge the format before committing.&lt;/p&gt;
&lt;p&gt;And to be clear: this is a posture review, not a penetration test or a compliance certification. It won&amp;rsquo;t guarantee your server is unhackable. What it will do is tell you where the quiet risks are — including the ones that scanners miss because they aren&amp;rsquo;t looking for your old &lt;code&gt;deploy-backup-2024.tar.gz&lt;/code&gt;.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;em&gt;Read-only audit, not a pentest. Your mileage depends on your setup. Not compliance certification or a guarantee of security.&lt;/em&gt;&lt;/p&gt;</content:encoded>
      <guid isPermaLink="true">https://blog.richgibbs.dev/i-ran-a-read-only-server-audit-2026/</guid>
      <category>server</category>
      <category>audit</category>
      <category>security</category>
      <category>quickcheck</category>
      <category>vps</category>
      <category>ec2</category>
      <category>hardening</category>
      <category>indie-founder</category>
      <category>backup</category>
      <pubDate>Wed, 27 May 2026 14:30:00 +0000</pubDate>
    </item>
    <item>
      <title>Server monitoring &amp; alerting for indie founders who self-host</title>
      <link>https://blog.richgibbs.dev/server-monitoring-indie-founder-2026/</link>
      <description>You have the server running. Now you need to know when it is on fire before your users do. The minimal, working monitoring stack that actually gets used.</description>
      <content:encoded>&lt;p&gt;You moved everything to a VPS. It is cheaper, faster, and under your control.&lt;/p&gt;
&lt;p&gt;Then one night your site goes down and you only find out when a customer emails you at 2 a.m.&lt;/p&gt;
&lt;p&gt;This post is the smallest monitoring and alerting stack that actually gets used by solo founders and small teams who self-host.&lt;/p&gt;
&lt;p&gt;It is not a 47-metric Prometheus + Grafana + PagerDuty architecture. It is the boring, reliable version that tells you the server is down or the disk is full before your users notice.&lt;/p&gt;
&lt;h2 id="what-you-actually-need-to-monitor"&gt;What you actually need to monitor&lt;/h2&gt;
&lt;p&gt;For most indie setups the critical signals are simple:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Is the server reachable?&lt;/li&gt;
&lt;li&gt;Is the web service responding with 200?&lt;/li&gt;
&lt;li&gt;Is disk space under 80%?&lt;/li&gt;
&lt;li&gt;Are the important processes still running?&lt;/li&gt;
&lt;li&gt;Are there any new security updates that require a reboot?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Everything else (CPU load graphs, memory heatmaps, 500 custom metrics) is nice-to-have until you have the basics covered and actually look at the alerts.&lt;/p&gt;
&lt;h2 id="the-minimal-stack-that-works-in-2026"&gt;The minimal stack that works in 2026&lt;/h2&gt;
&lt;h3 id="1-uptime-http-check"&gt;1. Uptime / HTTP check&lt;/h3&gt;
&lt;p&gt;Use a simple external ping service that hits your domain every 5 minutes and alerts on failure or slow response.&lt;/p&gt;
&lt;p&gt;Options that stay free or cheap for low volume:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;UptimeRobot (free tier is still generous)&lt;/li&gt;
&lt;li&gt;Freshping&lt;/li&gt;
&lt;li&gt;Or self-hosted with a small script + cron that curls and emails on failure&lt;/li&gt;
&lt;/ul&gt;
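&lt;p&gt;The self-hosted route can be as small as this. A minimal sketch, assuming &lt;code&gt;curl&lt;/code&gt; and a working &lt;code&gt;mail&lt;/code&gt; command (for example from mailutils); the URL and address are placeholders:&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;#!/usr/bin/env bash
# Hypothetical values: replace with your own URL and address
URL=https://example.com/healthz
TO=you@example.com

# -f treats HTTP 4xx/5xx as failure; --max-time gives up after 10 seconds
if ! curl -fsS --max-time 10 -o /dev/null &amp;quot;$URL&amp;quot;; then
  echo &amp;quot;$URL failed at $(date -u +%FT%TZ)&amp;quot; | mail -s &amp;quot;DOWN: $URL&amp;quot; &amp;quot;$TO&amp;quot;
fi
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Run it from a different machine than the one it checks. A script on the same box cannot tell you the box is down.&lt;/p&gt;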
&lt;h3 id="2-disk-basic-system-alerts"&gt;2. Disk + basic system alerts&lt;/h3&gt;
&lt;p&gt;Install a lightweight agent that watches disk, load, and processes.&lt;/p&gt;
&lt;p&gt;Common choices:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;monit&lt;/code&gt; (simple, config-file based, emails directly)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;netdata&lt;/code&gt; (beautiful dashboards, works great on small VPSes)&lt;/li&gt;
&lt;li&gt;Basic &lt;code&gt;cron&lt;/code&gt; + &lt;code&gt;df&lt;/code&gt; + &lt;code&gt;mail&lt;/code&gt; scripts for the ultra-minimal route&lt;/li&gt;
&lt;/ul&gt;
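&lt;p&gt;For the ultra-minimal route, a sketch of the disk check might look like this. It assumes GNU &lt;code&gt;df&lt;/code&gt; and a working &lt;code&gt;mail&lt;/code&gt; command; the address is a placeholder.&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;#!/usr/bin/env bash
set -eu

THRESHOLD=80   # percent used on the root filesystem

# df prints a header plus a value like &amp;quot; 42%&amp;quot;; keep only the digits
usage=$(df --output=pcent / | tail -n 1 | tr -dc '0-9')

if [ &amp;quot;$usage&amp;quot; -ge &amp;quot;$THRESHOLD&amp;quot; ]; then
  echo &amp;quot;Root filesystem on $(hostname) is at ${usage}%&amp;quot; \
    | mail -s &amp;quot;Disk alert: $(hostname)&amp;quot; you@example.com
fi
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Schedule it every 15 minutes or so from cron. It stays silent until the threshold is crossed.&lt;/p&gt;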
&lt;h3 id="3-security-update-alerts"&gt;3. Security update alerts&lt;/h3&gt;
&lt;p&gt;Most VPS providers now surface kernel and package updates. The ones that require reboot are the important ones.&lt;/p&gt;
&lt;p&gt;A simple weekly cron that runs &lt;code&gt;unattended-upgrades --dry-run&lt;/code&gt; and emails you when action is needed is often enough.&lt;/p&gt;
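&lt;p&gt;For the reboot part specifically, Debian and Ubuntu systems with unattended-upgrades leave a marker file when an installed update needs a reboot. A minimal sketch that checks for it, again assuming a working &lt;code&gt;mail&lt;/code&gt; command and a placeholder address:&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;# The marker file exists only while a reboot is pending
if [ -f /var/run/reboot-required ]; then
  # Include the list of packages that asked for the reboot, if present
  { cat /var/run/reboot-required; cat /var/run/reboot-required.pkgs 2&amp;gt;/dev/null; } \
    | mail -s &amp;quot;Reboot required on $(hostname)&amp;quot; you@example.com
fi
&lt;/code&gt;&lt;/pre&gt;
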
&lt;h2 id="the-alerting-rule-that-matters-most"&gt;The alerting rule that matters most&lt;/h2&gt;
&lt;p&gt;Only alert when a human actually needs to do something.&lt;/p&gt;
&lt;p&gt;Bad alerts (the kind everyone eventually mutes):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&amp;ldquo;CPU was above 60% for 3 minutes&amp;rdquo;&lt;/li&gt;
&lt;li&gt;&amp;ldquo;Disk was at 72%&amp;rdquo;&lt;/li&gt;
&lt;li&gt;&amp;ldquo;Response time was 800ms&amp;rdquo;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Good alerts:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&amp;ldquo;Site returned 5xx for 5 consecutive checks&amp;rdquo;&lt;/li&gt;
&lt;li&gt;&amp;ldquo;Disk is at 92%; you have ~48 hours before it fills&amp;rdquo;&lt;/li&gt;
&lt;li&gt;&amp;ldquo;Security updates require reboot (kernel)&amp;rdquo;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The goal is fewer alerts that you actually read and act on.&lt;/p&gt;
&lt;h2 id="quick-start-recommendation"&gt;Quick start recommendation&lt;/h2&gt;
&lt;p&gt;For most new self-hosted setups in 2026 the fastest path that does not require learning a whole monitoring platform is:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;UptimeRobot (or equivalent) for HTTP uptime&lt;/li&gt;
&lt;li&gt;Netdata or monit for disk + process alerts&lt;/li&gt;
&lt;li&gt;One weekly cron for security updates&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Total setup time: under an hour.&lt;/p&gt;
&lt;p&gt;You will sleep better knowing the server will tell you when something is wrong instead of waiting for customer emails.&lt;/p&gt;
</content:encoded>
      <guid isPermaLink="true">https://blog.richgibbs.dev/server-monitoring-indie-founder-2026/</guid>
      <category>server</category>
      <category>monitoring</category>
      <category>alerting</category>
      <category>self-hosted</category>
      <category>vps</category>
      <category>indie-founder</category>
      <category>uptime</category>
      <pubDate>Fri, 29 May 2026 04:40:00 +0000</pubDate>
    </item>
    <item>
      <title>DKIM key rotation for indie founders: the 15-minute zero-downtime swap</title>
      <link>https://blog.richgibbs.dev/dkim-key-rotation-indie-founder-2026/</link>
      <description>You have DKIM set up. Now you need to rotate the keys before they expire or leak. The dual-selector trick that lets you swap without bouncing a single message.</description>
      <content:encoded>&lt;p&gt;You set up DKIM once during the initial SPF/DKIM/DMARC checklist. The key has been signing every outbound message since. It works. You moved on to the next fire.&lt;/p&gt;
&lt;p&gt;Months later the same key is still the only one. If it leaks, if the laptop it was generated on disappears, or if you simply want a repeatable hygiene habit, you have nothing. Most indie founders never rotate DKIM keys. They treat the initial setup as a one-time event.&lt;/p&gt;
&lt;p&gt;This post fixes that. It assumes you already have working SPF, DKIM, and DMARC records. The goal is a 15-minute zero-downtime rotation using two selectors, a short TTL window, and one monitoring step that catches the gotcha almost everyone misses.&lt;/p&gt;
&lt;h2 id="the-dual-selector-trick"&gt;The dual-selector trick&lt;/h2&gt;
&lt;p&gt;The usual approach is dangerous: generate a new key, replace the old DNS record, update the MTA config, and pray nothing is in flight. A message signed during the cutover can fail DKIM and bounce or land in spam.&lt;/p&gt;
&lt;p&gt;Instead, keep two selectors active for a brief, controlled window. Name them by year and letter so they are obviously temporary: &lt;code&gt;s2026a&lt;/code&gt; and &lt;code&gt;s2026b&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The sequence is simple.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Generate the new key pair. Publish only the public key under the new selector (&lt;code&gt;s2026b._domainkey&lt;/code&gt;). Leave the old selector (&lt;code&gt;s2026a._domainkey&lt;/code&gt;) exactly as it is.&lt;/li&gt;
&lt;li&gt;Set a short TTL on both TXT records (300 or 600 seconds) for the duration of the rotation.&lt;/li&gt;
&lt;li&gt;Wait one full TTL so the new record has propagated everywhere.&lt;/li&gt;
&lt;li&gt;Update your mail server configuration to sign with the new selector (&lt;code&gt;s2026b&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Wait another short buffer (one TTL plus five minutes).&lt;/li&gt;
&lt;li&gt;Remove the old selector DNS record.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;At no point is mail signed with a selector whose public key is not published. No message is lost. The overlap window is measured in minutes, not days.&lt;/p&gt;
&lt;p&gt;Most people who try to rotate without this pattern either cause a brief outage or leave the old selector in DNS &amp;ldquo;just in case&amp;rdquo; for months. The dual-selector method removes both problems.&lt;/p&gt;
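&lt;p&gt;For OpenDKIM, step 4 usually comes down to two small table files. This is a sketch that assumes the common &lt;code&gt;KeyTable&lt;/code&gt;/&lt;code&gt;SigningTable&lt;/code&gt; layout. The file paths and key locations are placeholders, so check the matching settings in your own &lt;code&gt;opendkim.conf&lt;/code&gt;:&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-text"&gt;# /etc/opendkim/KeyTable (assumed path): keep both keys listed during the overlap
s2026a._domainkey.yourdomain.com yourdomain.com:s2026a:/etc/opendkim/keys/yourdomain.com/s2026a.private
s2026b._domainkey.yourdomain.com yourdomain.com:s2026b:/etc/opendkim/keys/yourdomain.com/s2026b.private

# /etc/opendkim/SigningTable (assumed path): the cutover is this one line, s2026a becomes s2026b
*@yourdomain.com s2026b._domainkey.yourdomain.com
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The wildcard pattern only works if &lt;code&gt;opendkim.conf&lt;/code&gt; loads the signing table with the &lt;code&gt;refile:&lt;/code&gt; prefix. Restart or reload OpenDKIM after the edit, then send a test message before moving on.&lt;/p&gt;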
&lt;h2 id="ed25519-vs-2048-bit-rsa-in-2026"&gt;ed25519 vs 2048-bit RSA in 2026&lt;/h2&gt;
&lt;p&gt;Use ed25519 unless you have a documented reason not to.&lt;/p&gt;
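&lt;p&gt;RFC 8463 added Ed25519-SHA256 support to DKIM. An Ed25519 public key is 44 base64 characters, against roughly 390 for a 2048-bit RSA key, so the TXT record fits in a single DNS string with room to spare, and signing is faster. The caveat is receiver support: not every large mailbox provider verifies Ed25519 signatures, and some hosted sending services only sign with RSA. Send an Ed25519-signed test message to the providers your customers actually use and check the &lt;code&gt;Authentication-Results&lt;/code&gt; header before you depend on it.&lt;/p&gt;
&lt;p&gt;2048-bit RSA remains the safe fallback. A receiver that does not understand Ed25519 treats the message as if it carried no valid DKIM signature, so if you cannot confirm support, sign with RSA alone or add the Ed25519 signature alongside the RSA one rather than replacing it.&lt;/p&gt;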
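&lt;p&gt;Generate the new key with your normal tool. With rspamd, &lt;code&gt;rspamadm dkim_keygen&lt;/code&gt; accepts a key type:&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;rspamadm dkim_keygen -s s2026b -d yourdomain.com -t ed25519 -k s2026b.private
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Stock &lt;code&gt;opendkim-genkey&lt;/code&gt; generates RSA keys only, and its &lt;code&gt;-t&lt;/code&gt; flag means test mode (it adds &lt;code&gt;t=y&lt;/code&gt; to the record), not key type. The RSA equivalent is &lt;code&gt;opendkim-genkey -b 2048 -s s2026b -d yourdomain.com&lt;/code&gt;. The rest of the rotation process is identical regardless of algorithm.&lt;/p&gt;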
&lt;h2 id="the-stale-selector-gotcha"&gt;The stale-selector gotcha&lt;/h2&gt;
&lt;p&gt;This is the part that quietly breaks deliverability weeks after you think you are done.&lt;/p&gt;
&lt;p&gt;Receivers verify a signature when the message arrives, not when it was sent. A message that was signed with &lt;code&gt;s2026a&lt;/code&gt; before your cutover and then sat in a retry queue, was greylisted, or is redelivered later by a forwarder or mailing list can reach its final mailbox long after you deleted the record.&lt;/p&gt;
&lt;p&gt;When that happens the receiving server looks for &lt;code&gt;s2026a._domainkey&lt;/code&gt; and finds nothing. DKIM fails even though the mail was legitimately sent from your domain.&lt;/p&gt;
&lt;p&gt;You will not see this in your own mail logs. The only reliable signal lives in the DMARC aggregate reports you already receive at your &lt;code&gt;rua&lt;/code&gt; address.&lt;/p&gt;
&lt;p&gt;Watch those reports for at least seven days after the planned removal date; most reporters send one aggregate per day. If any row still references the retired selector, republish the record (or leave it live if you have not removed it yet) and check again after another 30 days. Most forwarders and queues clear within a week; some legacy systems take longer.&lt;/p&gt;
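&lt;p&gt;If you save the report attachments to disk, a coarse text search is enough to spot stragglers. A minimal sketch, assuming the reports are already unpacked into a directory of &lt;code&gt;.xml&lt;/code&gt; files. The function name and directory are hypothetical, and the search matches the selector string anywhere in the file instead of parsing the &lt;code&gt;&amp;lt;selector&amp;gt;&lt;/code&gt; element:&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;# List unpacked aggregate-report XML files that mention a given selector.
# Usage: reports_mentioning s2026a ./rua-reports
reports_mentioning() {
  selector="$1"
  dir="$2"
  # -r recurse, -l print file names only, -F treat the selector as a literal string
  grep -rlF --include='*.xml' -- "$selector" "$dir"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;No output (and a non-zero exit status) means none of the saved reports mention the selector. Any file it does list is worth opening to see which source IP and result the row carries.&lt;/p&gt;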
&lt;h2 id="rotation-cadence"&gt;Rotation cadence&lt;/h2&gt;
&lt;p&gt;Rotate every 6–12 months. Put the task on the calendar the same week you review SPF includes or rotate other long-lived secrets. Treat it as scheduled maintenance, not an emergency response to a suspected compromise.&lt;/p&gt;
&lt;p&gt;If you only rotate when something feels wrong, you will rotate far too late and far too rarely.&lt;/p&gt;
&lt;h2 id="verification-that-the-swap-actually-completed"&gt;Verification that the swap actually completed&lt;/h2&gt;
&lt;p&gt;Three quick checks confirm the rotation succeeded.&lt;/p&gt;
&lt;p&gt;First, send a test message to any Gmail address and view the full headers. The &lt;code&gt;DKIM-Signature&lt;/code&gt; line must show &lt;code&gt;s=s2026b&lt;/code&gt; (or whatever new selector you chose). If it still shows the old selector, your MTA config change did not take effect.&lt;/p&gt;
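&lt;p&gt;If you would rather not eyeball headers, save the test message as a file (Gmail&amp;rsquo;s &amp;ldquo;Show original&amp;rdquo; view has a download option) and pull the selector out. A rough sketch; &lt;code&gt;dkim_selectors&lt;/code&gt; is a hypothetical helper name, and it assumes a plain RFC 5322 message with folded headers:&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;# Print the s= selector of every DKIM-Signature header in a saved message.
# Usage: dkim_selectors test-message.eml
dkim_selectors() {
  # Read the whole file, join folded header lines, keep only DKIM-Signature
  # headers, then extract the value of each s= tag.
  sed -e ':a' -e 'N' -e '$!ba' -e 's/\n[[:blank:]]\{1,\}/ /g' "$1" |
    grep -i '^DKIM-Signature:' |
    grep -o '[; ]s=[^; ]*' |
    sed 's/^[; ]s=//'
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;A message can carry more than one signature (your provider may add its own), so expect one line per &lt;code&gt;DKIM-Signature&lt;/code&gt; header and look for yours among them.&lt;/p&gt;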
&lt;p&gt;Second, examine the DMARC &lt;code&gt;rua&lt;/code&gt; reports for the domain. After the cutover date you should see zero rows where the DKIM result is &lt;code&gt;fail&lt;/code&gt; and the selector is the old one. Persistent failures from the retired selector are the stale-selector problem described above.&lt;/p&gt;
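&lt;p&gt;Third, if you sign with OpenDKIM, run its key test on the mail server itself:&lt;/p&gt;
&lt;pre class="codehilite"&gt;&lt;code class="language-bash"&gt;opendkim-testkey -d yourdomain.com -s s2026b -vvv
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;With &lt;code&gt;-vvv&lt;/code&gt; it should print &lt;code&gt;key OK&lt;/code&gt;, meaning the private key on disk matches the published record; without the verbose flag it stays silent on success. A &lt;code&gt;key not secure&lt;/code&gt; line only means the lookup was not DNSSEC-validated. Do the same for the old selector only while it is still published; once removed it will correctly fail the test.&lt;/p&gt;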
&lt;p&gt;If you want a deeper walkthrough of reading &lt;code&gt;rua&lt;/code&gt; reports without paying for a SaaS dashboard, see the post on &lt;a href="/dmarc-aggregate-reports-without-a-saas"&gt;DMARC aggregate reports without a SaaS&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id="you-already-did-the-hard-part"&gt;You already did the hard part&lt;/h2&gt;
&lt;p&gt;The original &lt;a href="/spf-dkim-dmarc-indie-founder-checklist"&gt;SPF/DKIM/DMARC checklist&lt;/a&gt; got your records green. This post gives you the habit that keeps them green after the first key ages out.&lt;/p&gt;
&lt;p&gt;The dual-selector pattern is the smallest possible change that eliminates the risk of a rotation outage. The monitoring step with &lt;code&gt;rua&lt;/code&gt; reports is the only way to know the job is truly finished instead of hoping the queues drained.&lt;/p&gt;
&lt;h2 id="the-19-inboxdns-pack-and-the-79-quickcheck"&gt;The $19 Inbox/DNS Pack and the $79 QuickCheck&lt;/h2&gt;
&lt;p&gt;If you want the exact dual-selector DNS templates plus the 20-line &lt;code&gt;opendkim&lt;/code&gt; or &lt;code&gt;rspamd&lt;/code&gt; swap script, the &lt;a href="/email-dns-mini/"&gt;Inbox/DNS Pack&lt;/a&gt; contains both ready to copy. No guessing at record syntax. No writing the flip logic from scratch.&lt;/p&gt;
&lt;p&gt;After the rotation, if you want an outside set of eyes on a week of your actual &lt;code&gt;rua&lt;/code&gt; reports to confirm the old selector is dead (including any forwarders still relaying mail signed with it), the &lt;a href="/quickcheck/inbox-cleanup/"&gt;Inbox Cleanup QuickCheck&lt;/a&gt; is the option that removes the remaining uncertainty. You send the reports; I confirm the cutover completed and flag anything still signing with the retired selector.&lt;/p&gt;
&lt;p&gt;Both are designed for solo founders who already understand the basics and just need the repeatable execution piece.&lt;/p&gt;
&lt;p&gt;Do the rotation on a schedule. Use two selectors. Watch the reports for a week. That is the entire maintenance loop.&lt;/p&gt;</content:encoded>
      <guid isPermaLink="true">https://blog.richgibbs.dev/dkim-key-rotation-indie-founder-2026/</guid>
      <category>email</category>
      <category>dns</category>
      <category>dkim</category>
      <category>key-rotation</category>
      <category>deliverability</category>
      <category>indie-founder</category>
      <category>saas</category>
      <pubDate>Sun, 31 May 2026 17:10:00 +0000</pubDate>
    </item>
  </channel>
</rss>
