Rich Gibbs

Q: What is an agent permission map?

A single written table, one row per connected tool, that records which account the AI agent runs as, what it can read, what it can change, what it can send externally, what it can spend, what it can deploy or break, what requires human approval, what gets logged, and how to turn it off fast. It is a description of operating permissions, not a scanner.

Q: Do I need a security audit, an IAM audit, or an agent permission map?

The agent permission map is the cheaper first artifact and almost always the missing one. It is not a security audit, an IAM audit, a compliance review, or a penetration test. If any of those become necessary later, the map makes them shorter and cheaper because the scope is already written down.

Ubuntu/Debian EC2 hardening checklist (2026)

Sun, 10 May 2026 00:20:00 +0000

You spun up an EC2 instance, pointed a domain at it, and now real traffic — and real bots — can reach it. Most “hardening guides” online are either copy-paste cargo cult from 2014 or vendor whitepapers selling a SIEM. This is the version I actually run on Ubuntu 22.04, Ubuntu 24.04, and Debian 12 boxes, written for solo founders and small teams who don’t have a dedicated security person.

Work through it top to bottom on a fresh box. On an existing box, treat it as a diff: read each section, run the audit command, fix the gap, move on.

Why this checklist

The threats most small EC2 fleets actually get hit by aren’t APTs. They’re:

SSH brute force from random botnets
Exposed services you forgot were listening (Redis, Postgres, Docker API, an old admin panel)
Stolen IAM credentials via SSRF on a misconfigured app reaching the EC2 metadata service
An unpatched kernel or library with a known CVE
A compromised dependency or container image that opens a reverse shell

Everything below is aimed at those concrete risks. There’s no checklist on earth that makes you “secure” — but a tight baseline closes the cheap, automated attack paths so an attacker has to actually work.

Threat model assumptions

Before any commands, make these explicit:

This is a single-tenant Linux server (or small fleet) on AWS EC2.
You are the only admin, or there’s a tiny ops team with shared SSH keys.
The instance runs a public-facing web app and/or some background workers.
You’re not in a regulated environment yet (PCI/HIPAA/SOC 2 controls are not what this checklist gives you).
You can tolerate a few minutes of downtime to reboot for kernel updates.

If any of those don’t match, adjust before applying.

1. SSH

SSH is still the single biggest “front door” on a Linux server.

Use keys, not passwords. Disable root login. Limit who can log in.

Edit /etc/ssh/sshd_config (or drop a file in /etc/ssh/sshd_config.d/ on Ubuntu 22.04+):

PermitRootLogin no
PasswordAuthentication no
KbdInteractiveAuthentication no
ChallengeResponseAuthentication no
PubkeyAuthentication yes
PermitEmptyPasswords no
X11Forwarding no
MaxAuthTries 3
LoginGraceTime 20
ClientAliveInterval 300
ClientAliveCountMax 2
AllowUsers ubuntu deploy

Replace ubuntu deploy with the actual non-root accounts you use. Then validate and reload:

sudo sshd -t
sudo systemctl reload ssh

Optional but worth it on small boxes:

Move SSH off port 22. It doesn’t stop a determined attacker, but it cuts most of the log noise from internet-wide scanners. If you do this, update the EC2 security group and the UFW rule too (ufw allow OpenSSH only opens port 22).
Restrict the SSH security group to your office/VPN IP, your home IP, or a bastion. 0.0.0.0/0 on port 22 is a choice, not a default.
Install fail2ban for cheap brute-force throttling:

sudo apt-get update && sudo apt-get install -y fail2ban
sudo systemctl enable --now fail2ban

Audit:

sudo sshd -T | grep -Ei 'permitrootlogin|passwordauth|pubkeyauth|allowusers|port'
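
If you’d rather get a pass/fail than eyeball the grep output, a small wrapper can compare the effective config with the baseline above. This is a minimal sketch: check_sshd is a hypothetical helper name, and the four settings it checks are only the ones from this section, so adjust the list to your own policy.

```shell
# check_sshd: read `sshd -T` output on stdin and report any setting
# that differs from the baseline in this section (hypothetical helper).
check_sshd() {
  awk '
    BEGIN {
      want["permitrootlogin"]        = "no"
      want["passwordauthentication"] = "no"
      want["pubkeyauthentication"]   = "yes"
      want["permitemptypasswords"]   = "no"
    }
    {
      key = tolower($1)
      if (key in want) {
        seen[key] = 1
        if ($2 != want[key]) {
          printf "FAIL %s is %s, want %s\n", key, $2, want[key]
          bad = 1
        }
      }
    }
    END {
      # A setting that never appeared in the input is also a failure.
      for (k in want) if (!(k in seen)) { printf "FAIL %s not reported\n", k; bad = 1 }
      if (!bad) print "OK"
      exit bad
    }'
}
```

Run it as sudo sshd -T | check_sshd. It prints OK, or one FAIL line per mismatch, and exits non-zero when anything fails, so it can gate a provisioning script.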

2. Firewall and listeners

The cheapest mistake on EC2 is a service binding to 0.0.0.0 that you thought was on 127.0.0.1. Defense in depth: lock it down at the OS and at the security group.

See what’s actually listening:

sudo ss -tulpn

Anything bound to 0.0.0.0 or :: that isn’t your web server, SSH, or something you explicitly want public is a finding. Common offenders: Redis (6379), Postgres (5432), MySQL (3306), Docker API (2375/2376), Elasticsearch (9200), Memcached (11211), node dev servers.

Bind to localhost in the service config (e.g. bind 127.0.0.1 in /etc/redis/redis.conf, listen_addresses = 'localhost' in postgresql.conf).

Then layer UFW on top:

sudo apt-get install -y ufw
sudo ufw default deny incoming
sudo ufw default allow outgoing
sudo ufw allow OpenSSH
sudo ufw allow 80/tcp
sudo ufw allow 443/tcp
sudo ufw enable
sudo ufw status verbose

On the AWS side, the security group is your real perimeter. Rules of thumb:

One SG per role (web, db, worker), not one giant SG that allows everything internally.
DB and cache SGs accept traffic only from the app SG, never from 0.0.0.0/0.
SSH SG limited to known IPs or a bastion/VPN SG.
No 0.0.0.0/0 on anything except 80/443 on the public web tier.

Audit:

sudo ss -tulpn | awk '$5 ~ /^(0\.0\.0\.0|\[::\]|\*):/'   # ss prints dual-stack wildcard sockets as *:port
sudo ufw status numbered
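
To turn the listener audit into a list of actual findings, you can filter out the ports you meant to expose. The sketch below assumes the standard ss -tulpn column layout (local address in column 5, process in column 7). unexpected_listeners and the ALLOWED_PORTS variable are hypothetical names, not part of any tool.

```shell
# unexpected_listeners: read `ss -tulpn` output on stdin and print sockets
# bound to all interfaces whose port is not in the allow-list.
# ALLOWED_PORTS is a hypothetical space-separated list; default 22 80 443.
unexpected_listeners() {
  awk -v allowed="${ALLOWED_PORTS:-22 80 443}" '
    BEGIN { n = split(allowed, a, " "); for (i = 1; i <= n; i++) ok[a[i]] = 1 }
    NR == 1 && $1 == "Netid" { next }          # skip the header row
    {
      addr = $5
      port = addr; sub(/.*:/, "", port)        # text after the last colon
      host = addr; sub(/:[^:]*$/, "", host)    # text before the last colon
      if ((host == "0.0.0.0" || host == "[::]" || host == "*") && !(port in ok))
        print $1, addr, $7
    }'
}
```

Usage: sudo ss -tulpn | unexpected_listeners. Empty output means every wildcard-bound socket is on a port you allowed; anything printed is a finding to chase down.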

Cross-check the AWS console / CLI:

aws ec2 describe-security-groups \
  --query 'SecurityGroups[].{Name:GroupName,Ingress:IpPermissions}' \
  --output json

3. OS updates and reboots

Unpatched kernels and OpenSSL/libc libraries are the most boring and most common way servers get owned.

Enable unattended security upgrades:

sudo apt-get install -y unattended-upgrades apt-listchanges
sudo dpkg-reconfigure -plow unattended-upgrades

Check /etc/apt/apt.conf.d/50unattended-upgrades includes the security pocket and that Unattended-Upgrade::Automatic-Reboot is set deliberately. On a single box with a real user, automatic reboots at 3am can be fine; on production-critical workers, prefer notification + manual.

Patch now:

sudo apt-get update
sudo apt-get -y dist-upgrade
sudo apt-get -y autoremove --purge

Detect a needed reboot:

[ -f /var/run/reboot-required ] && cat /var/run/reboot-required.pkgs

If the kernel was updated, schedule a reboot. Live-patching (Ubuntu Pro / Livepatch) is great if you’re paying for it, but it doesn’t cover everything — you’ll still need occasional reboots.

Audit:

apt list --upgradable 2>/dev/null
uname -r

4. Admin surface

Every account that can sudo is part of your admin surface. Trim it.

getent group sudo
getent group adm
awk -F: '($3 == 0) {print}' /etc/passwd   # any extra UID 0 account is a finding

Rules:

One sudo user per real human, no shared logins where avoidable.
Service accounts (www-data, postgres, deploy) should not have shell or sudo. Use usermod -s /usr/sbin/nologin <user> if needed.
Rotate or remove SSH keys when someone leaves the team. ~/.ssh/authorized_keys for every login user is your source of truth — review it.
Disable cloud-init’s default password if any (cloud-init shouldn’t set one on official AMIs, but check).
If you must allow sudo without a password for automation, scope it to specific commands in /etc/sudoers.d/, not blanket NOPASSWD: ALL.

Audit:

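# Use each account's real home directory (passwd field 6), so root and non-/home users are covered
awk -F: '$7 ~ /sh$/ {print $1, $6}' /etc/passwd | while read -r u home; do
  echo "== $u =="; sudo cat "$home/.ssh/authorized_keys" 2>/dev/null
done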

5. EC2 metadata service (IMDSv2)

This one is non-negotiable in 2026. The EC2 instance metadata service hands out IAM role credentials. With IMDSv1 enabled, any server-side request forgery (SSRF) bug in your app can pop those credentials and walk into your AWS account.

First confirm that IMDSv2 works from the instance by requesting a session token and using it:

TOKEN=$(curl -sX PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 60")
curl -sH "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/instance-id

If that works but the same call without a token also works, you’re still on IMDSv1. For old AMIs, containers, or ASGs you cannot blindly rotate, follow the IMDSv2 migration sequence before making it mandatory.

Enforce v2 on the instance (run from your laptop with the AWS CLI):

aws ec2 modify-instance-metadata-options \
  --instance-id i-xxxxxxxxxxxxxxxxx \
  --http-tokens required \
  --http-endpoint enabled \
  --http-put-response-hop-limit 1

hop-limit 1 means a container or proxy can’t trivially relay a request to the metadata service. If you run Docker with bridge networking, you may need 2 — but start at 1, raise only if needed, and never to 64.

Also: the IAM role attached to the instance should be least privilege. “Read this one S3 bucket and write to this one log group” beats AdministratorAccess every time.

Audit:

aws ec2 describe-instances \
  --query 'Reservations[].Instances[].[InstanceId,MetadataOptions.HttpTokens,MetadataOptions.HttpPutResponseHopLimit]' \
  --output table

Anything where HttpTokens is not required is a finding.
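
On more than a handful of instances the table gets long, so it can help to print only the findings. A minimal sketch, assuming you run the same describe-instances query with --output text (which yields tab-separated rows of instance ID, HttpTokens, hop limit). imds_findings and MAX_HOPS are hypothetical names, and the default ceiling of 2 is my reading of the hop-limit advice above rather than an AWS rule.

```shell
# imds_findings: read "<instance-id> <HttpTokens> <hop-limit>" rows on stdin
# and print the instances that still need attention.
# MAX_HOPS is a hypothetical ceiling for the hop limit (default 2).
imds_findings() {
  awk -v max="${MAX_HOPS:-2}" '
    $2 != "required" { print $1 ": HttpTokens=" $2; bad = 1; next }
    $3 + 0 > max + 0 { print $1 ": hop limit " $3; bad = 1 }
    END              { exit bad }'
}
```

Pipe the audit query (with --output text instead of table) into imds_findings. It exits non-zero when there is at least one finding.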

If you’d rather have someone else go through this list on your servers and hand you back a clear report, that’s exactly what Tuck Sentinel QuickCheck does: a one-shot, read-only audit of a single Linux box with prioritized findings and copy-pasteable fixes. You can see what the output looks like in this sample report before deciding.

Back to the checklist.

6. Logging and time sync

You can’t investigate what you didn’t record, and you can’t correlate logs that disagree on what time it is.

Time sync. Ubuntu 22.04+ and Debian 12 ship systemd-timesyncd or chrony. Either is fine, just make sure one is running:

timedatectl
# or
chronyc tracking

If you’re on AWS, the local time source 169.254.169.123 is reliable and low-latency. chrony config example:

server 169.254.169.123 prefer iburst minpoll 4 maxpoll 4

Logging. journald is the default. A few sane settings in /etc/systemd/journald.conf:

Storage=persistent
SystemMaxUse=1G
SystemMaxFileSize=128M
ForwardToSyslog=no

Then:

sudo systemctl restart systemd-journald

For anything beyond a single box, ship logs off the instance — CloudWatch Logs, a Loki/Grafana stack, or any hosted log service. The reason isn’t compliance, it’s that the first thing an attacker tries to do is rm /var/log/* and journalctl --rotate --vacuum-time=1s.

Auditd is worth installing if you want a kernel-level record of logins and sudo use and, once you add an execve rule, of which user ran which command:

sudo apt-get install -y auditd
sudo systemctl enable --now auditd

You don’t need elaborate rules to start; the defaults plus shipping /var/log/audit/audit.log off-box is already a huge upgrade.

Audit:

journalctl --disk-usage
timedatectl | grep 'System clock synchronized'

7. Backups and restore drills

A backup you’ve never restored is a wish, not a backup.

For a small EC2 setup:

Use AWS Backup or scheduled EBS snapshots for the volume(s).
For databases, also take logical backups (pg_dump, mysqldump) on a schedule and copy them to S3 with versioning + lifecycle to Glacier.
Encrypt at rest (EBS encryption + S3 SSE-KMS). EBS encryption by default is a per-region account setting that stays off until you enable it, so check it in every region you use.
Keep at least one backup copy in a different AWS account or region. Ransomware-style attackers will delete in-region snapshots if they get the chance.

Restore drill — once a quarter, on a throwaway instance:

Pick a recent snapshot/dump.
Spin up a new instance/volume from it.
Verify the app starts and recent data is present.
Time how long it took. That’s your real RTO.

If you’ve never timed a restore, you don’t know your RTO; you have a hope.
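
Timing the drill is easy to forget in the moment, so wrap the restore in something that records it for you. A minimal sketch: time_drill is a hypothetical helper, and whatever command you pass it (a restore script, a pg_restore invocation) is yours to supply.

```shell
# time_drill: run a restore command and report its exit status and
# wall-clock duration in seconds (hypothetical helper).
time_drill() {
  start=$(date +%s)
  if "$@"; then status=0; else status=$?; fi
  end=$(date +%s)
  echo "restore exit=$status elapsed=$((end - start))s"
  return "$status"
}
```

For example, time_drill ./restore.sh, where restore.sh stands in for your own restore steps. The elapsed figure only covers what the command does, so add the time you spent provisioning the instance by hand.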

Audit:

aws ec2 describe-snapshots --owner-ids self \
  --query 'Snapshots[?StartTime>=`2026-01-01`].[SnapshotId,StartTime,VolumeSize,Description]' \
  --output table

8. Docker basics (if applicable)

If you don’t run Docker on the box, skip this. If you do, the most common foot-guns:

Don’t expose the Docker daemon over TCP. 2375 unauthenticated is root-on-box for anyone who can reach it. Use the local socket (/var/run/docker.sock) and SSH for remote control.
Mind the -p flag. -p 5432:5432 binds to 0.0.0.0 and bypasses UFW on most Docker setups (Docker writes its own iptables rules). If you only need the port locally, use -p 127.0.0.1:5432:5432.
Run containers as non-root where possible. USER directive in your Dockerfile, or --user 1000:1000 at runtime.
Pin base images to a digest (FROM ubuntu:24.04@sha256:...) for production, and rebuild on a schedule to pick up CVE fixes.
Don’t bind-mount the Docker socket into containers unless you fully understand that’s equivalent to giving that container root on the host.
Set --read-only and --cap-drop=ALL for containers that don’t need to write to their filesystem or hold extra capabilities; add back only what’s needed.

A useful audit one-liner:

docker ps --format '{{.Names}} {{.Ports}}' | grep -E '0\.0\.0\.0|:::|\[::\]'

Anything in that list is reachable from the public internet (modulo the security group). Decide if that’s intentional.

For containerd/k8s setups this barely scratches the surface — but on a single EC2 box running a few containers, those bullets close ~80% of the cheap holes.

What this is not

Be honest with yourself about what a checklist like this does and doesn’t do.

It is not a penetration test. Nobody is exploiting your application logic, your auth flows, or your business rules here. A pentest is a different (and more expensive) thing.
It is not compliance. SOC 2, HIPAA, PCI, ISO 27001 all require documented policies, evidence collection, access reviews, vendor management, and a lot more. A hardened box is part of that, not a substitute.
It is not a guarantee. New CVEs ship every week. Your application code changes. Someone leaks a key on GitHub. Hardening is a continuous practice, not a one-time event.
It is not opinionated about your app stack. TLS configuration, WAF rules, secrets management, dependency scanning, CI/CD security — all out of scope here.

What it does do: dramatically reduce the set of “stupid ways your server gets owned by a bot at 3am” and give you a baseline you can re-run on every new instance.

If you got this far and want to skip the manual audit, that’s exactly what I built Tuck Sentinel QuickCheck for: a single-instance, read-only Linux audit that runs the kind of checks above and produces a prioritized report with concrete fixes — no agent left behind, no ongoing access. Take a look at the sample report to see exactly what you’d get.

Either way: run the checklist. Future-you will thank present-you.

About Tuck Sentinel

Tuck Sentinel is a small, focused security tooling project from indie operator Rich Gibbs. It produces practical, no-nonsense audits and content for solo founders and small teams running their own Linux infrastructure — the kind of work most SOC platforms ignore because the deal size is too small. Start with QuickCheck if you want a one-shot review of a single server.

The Indie Founder's VPS Security 101

Sun, 10 May 2026 00:22:00 +0000

You shipped the thing. It runs on one Linux box at DigitalOcean or Hetzner or wherever. Customers are starting to show up, and somewhere in the back of your head a little voice is asking: is this thing actually safe?

This guide is for that voice.

It’s written for solo founders and very small teams who are not security professionals but can copy a command into a terminal. The goal is “secure enough that you can sleep” — not “audit-grade fortress.” Those are different jobs, and treating one like the other is how you waste a weekend installing seven intrusion detection tools and shipping nothing for a month.

What “secure enough” looks like for one box

For a single VPS running your SaaS, “secure enough” is a short list:

Nobody can log in as root from the internet.
Logging in requires a key you have, not a password someone could guess.
Only the ports you actually use are open.
The OS gets security patches automatically.
You’d notice if something obviously bad started happening.
If the disk caught fire tomorrow, you could rebuild from a backup before the end of the day.

That’s the whole bar. Everything else is optimization. Hit those six and you’ve already done more than the majority of small-team production servers I’ve seen.

First-day setup

Do these once, when the server is fresh. They take about twenty minutes.

1. Create a non-root user with sudo

Logging in as root is a footgun. One typo and you’ve nuked the box. Make a normal user instead.

# As root, on a fresh server
adduser deploy
usermod -aG sudo deploy

Pick a real password for deploy even though you’ll be using SSH keys — you’ll need it for sudo prompts.

2. Switch SSH to key-only login

On your laptop, if you don’t already have a key:

ssh-keygen -t ed25519 -C "you@laptop"

Copy it to the server:

ssh-copy-id deploy@your.server.ip

Now log in as deploy and confirm sudo works:

ssh deploy@your.server.ip
sudo whoami   # should print: root

Once you’re sure key login works, lock down SSH. Edit /etc/ssh/sshd_config (or drop a file in /etc/ssh/sshd_config.d/):

sudo tee /etc/ssh/sshd_config.d/99-hardening.conf >/dev/null <<'EOF'
PermitRootLogin no
PasswordAuthentication no
KbdInteractiveAuthentication no
EOF

sudo sshd -t          # test config — must print nothing
sudo systemctl reload ssh

Do not close your existing SSH session yet. Open a second terminal and confirm you can log in fresh. If that works, you’re good. If it doesn’t, you’ve still got the first session to fix things.

3. Turn on the firewall

Ubuntu ships with ufw, which is a friendly wrapper around iptables/nftables. Default-deny inbound, allow only what you need:

sudo ufw default deny incoming
sudo ufw default allow outgoing
sudo ufw allow OpenSSH
sudo ufw allow 80/tcp
sudo ufw allow 443/tcp
sudo ufw enable
sudo ufw status verbose

If you don’t run a web server on this box, drop the 80/443 lines. The rule is simple: open a port only when something on the box actually needs to listen on it.

4. Enable automatic security updates

Most successful attacks are not clever zero-days — they’re known bugs in software you forgot to patch. Let the OS patch itself.

sudo apt update
sudo apt install -y unattended-upgrades
sudo dpkg-reconfigure -plow unattended-upgrades   # answer "Yes"

Then check /etc/apt/apt.conf.d/50unattended-upgrades and make sure security updates are uncommented. On Ubuntu the default config already covers ${distro_id}:${distro_codename}-security, which is what you want.

For peace of mind, let it reboot on its own when an update needs one, at a time you choose:

sudo tee /etc/apt/apt.conf.d/51auto-reboot >/dev/null <<'EOF'
Unattended-Upgrade::Automatic-Reboot "true";
Unattended-Upgrade::Automatic-Reboot-Time "04:00";
EOF

Pick a time when nobody’s using the app. Yes, this means the box reboots itself sometimes. That’s fine. Your app should already survive a reboot — and if it doesn’t, that’s a bigger problem than security.

That’s first-day setup. Non-root sudo user, keys-only SSH, default-deny firewall, automatic patching. You’re now ahead of a lot of production servers.

Worried you missed something on first-day setup?

Run a free QuickCheck on your server →

It’s a read-only scan that flags the boring stuff: SSH still allows passwords, port 22 open to the world, no automatic updates configured, sketchy listening services, and so on. No agent, no signup wall. Here’s a sample report if you’d like to see the format first.

What to actually monitor

You don’t need a SIEM. You need a few things you can eyeball once a week (or get a tiny script to email you about). For a single VPS, this short list catches almost everything that matters.

Failed logins

If somebody is hammering your SSH port, this shows it:

sudo journalctl -u ssh --since "24 hours ago" | grep -i "failed\|invalid"

A handful of attempts per day is internet background noise. Thousands per hour from one IP is worth blocking with ufw or installing fail2ban.
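
To see which addresses are doing the hammering, you can count log lines per source IP. This is a rough sketch, assuming the usual OpenSSH message wording ("Failed password ... from IP" and "Invalid user ... from IP"). top_failed_ips is a hypothetical helper, and it counts log lines rather than distinct attempts, so one attempt can show up twice.

```shell
# top_failed_ips [N]: read sshd log lines on stdin and print the N source
# IPs (default 10) with the most failed or invalid login lines.
top_failed_ips() {
  awk '
    /Failed password|Invalid user/ {
      # The source address is the word right after "from".
      for (i = 1; i < NF; i++) if ($i == "from") { count[$(i + 1)]++; break }
    }
    END { for (ip in count) print count[ip], ip }' |
    sort -k1,1nr -k2,2 | head -n "${1:-10}"
}
```

Usage: sudo journalctl -u ssh --since "24 hours ago" | top_failed_ips 5. An address with a count in the thousands is the one to block.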

Listening ports

What’s actually accepting connections on this box? Run this every so often and make sure nothing surprising is there:

sudo ss -tulpn

You’re looking for things bound to 0.0.0.0: or :::. Anything bound to 127.0.0.1 is fine — only your box can talk to it. The classic mistake: running a dev database with bind = 0.0.0.0 and no password. Don’t do that.

Disk free

Servers don’t usually die from hackers. They die from full disks at 3 AM.

df -h /
du -sh /var/log /var/lib/docker 2>/dev/null

If / is over 80% full, plan on cleaning it up before it hits 100% and your database refuses to write.
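
If you want the 80% rule checked for you, a few lines of awk over df will do it. A minimal sketch, assuming POSIX df -P output (capacity in column 5, mount point in column 6); disk_alert is a hypothetical helper name.

```shell
# disk_alert [THRESHOLD]: read `df -P` output on stdin and print every
# filesystem whose usage is above THRESHOLD percent (default 80).
disk_alert() {
  awk -v t="${1:-80}" '
    NR > 1 {
      use = $5; sub(/%/, "", use)              # "85%" -> "85"
      if (use + 0 > t + 0) { print $6 " is " $5 " full"; bad = 1 }
    }
    END { exit bad }'
}
```

Usage: df -P / | disk_alert 80. It stays silent and exits 0 while you are under the threshold, which makes it easy to drop into cron.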

Package updates available

Even with unattended-upgrades, it’s worth a manual sanity check now and then:

sudo apt update
apt list --upgradable 2>/dev/null

And: is a reboot pending after a kernel update?

[ -f /var/run/reboot-required ] && cat /var/run/reboot-required

If yes, schedule one. A patched kernel that hasn’t been booted into is just a download.

You can wire any of these into a weekly cron that emails you a one-page digest. Five lines of bash. Don’t overthink it.
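
Here is roughly what that digest could look like. It is a sketch, not a finished tool: section and weekly_digest are hypothetical names, the checks are the unprivileged ones from this section (the failed-login check is left out because it needs sudo), and emailing the output assumes you already have a working mail command on the box.

```shell
# section TITLE CMD...: print a header, run the command, and keep going
# even if the command fails or is missing.
section() {
  title=$1; shift
  printf '== %s ==\n' "$title"
  "$@" 2>&1 || printf '(command exited non-zero)\n'
  printf '\n'
}

# weekly_digest: the checks from this section that run without root.
weekly_digest() {
  section "Disk" df -h /
  section "Listening sockets" ss -tuln
  section "Upgradable packages" apt list --upgradable
  section "Reboot required" sh -c \
    '[ -f /var/run/reboot-required ] && cat /var/run/reboot-required || echo "no reboot pending"'
}
```

Save it as a script that ends with a call to weekly_digest, then pipe the output to mail from a weekly cron entry.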

Backups and restore drills

This is the boring section everyone skips. Skip it and you have a hobby project, not a business.

The minimum viable backup setup for a single VPS:

Database: nightly dump (pg_dump, mysqldump, or your equivalent), encrypted, sent off-box. To S3, B2, or any object store. Keep at least 7 daily and 4 weekly copies.
User-uploaded files: same deal — sync to object storage on a schedule. restic and rclone both work fine.
Config: keep it in git. If your nginx.conf lives only on the server, it’s already half-lost.
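
The retention half of that is where disks quietly fill up, so here is a sketch of the daily pruning step. It assumes dumps are named db-YYYY-MM-DD.sql.gz in one directory, which is a naming scheme I made up for the example; prune_daily is likewise a hypothetical helper. It handles only the daily copies, so keep weekly copies in a separate directory or let your object store’s lifecycle rules handle them.

```shell
# prune_daily DIR [KEEP]: delete all but the newest KEEP (default 7) files
# named db-YYYY-MM-DD.sql.gz in DIR. Other files are left alone.
prune_daily() {
  dir=$1
  keep=${2:-7}
  # Dated names sort chronologically, so a reverse sort lists newest first;
  # tail then selects everything after the first KEEP entries.
  ls -1 "$dir" |
    grep -E '^db-[0-9]{4}-[0-9]{2}-[0-9]{2}\.sql\.gz$' |
    sort -r |
    tail -n +"$((keep + 1))" |
    while IFS= read -r f; do rm -- "$dir/$f"; done
}
```

Call it after the nightly dump has been written and uploaded, for example prune_daily /var/backups/db 7 (the path is a placeholder).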

That’s the easy part. Here’s the part people skip:

Actually do a restore. From scratch. On a fresh VPS. Once.

Spin up a new box. Pull last night’s backup. Restore the database. Boot the app. Did it work? How long did it take? What did you forget? (Spoiler: an environment variable, an SSL cert, a cron job, or a system package.)

If you’ve never done this drill, you don’t have backups. You have files you hope will work. There is a meaningful difference, and you really, really don’t want to discover it during an outage.

Re-do the drill at least once a year, or any time you make a big infrastructure change.

Don’t over-do it

There is a tempting path where, in the name of “being thorough,” you install:

An intrusion detection system
A second intrusion detection system in case the first one misses something
A file integrity monitor
A custom auditd ruleset you found on a blog
An EDR agent
A SIEM forwarder

…on a VPS that hosts one Rails app and gets 200 visitors a day.

Don’t. Each of these has a cost: CPU, memory, alert noise you’ll learn to ignore, and your time. For one small box, the basics in this article handle 95% of realistic risk. Adding more tools without tuning them often makes you less secure, because real signals get buried in junk alerts you stop reading.

If your business actually grows into the territory where you need that stuff (regulated data, big customer base, real compliance), you’ll know — and at that point you’ll also have the budget to do it properly. Until then: keep the surface small, keep it patched, and keep watching the four things in the monitoring section.

Common mistakes

The same handful of things bite small-team servers over and over:

Port 22 open to the entire internet with password login still enabled. This is the #1 thing scanners look for. Even with a strong password, you’re contributing to the noise. Keys only.
Logging in as root. Either directly, or via a sudoers rule that means a single mistake takes the whole box down. Make a real user.
Skipping reboots after kernel updates. A patched-but-not-rebooted box still runs the old, vulnerable kernel. unattended-upgrades with Automatic-Reboot "true" fixes this for free.
IMDSv1 left enabled on AWS. If you’re on EC2/Lightsail, the legacy instance metadata endpoint can be reached by anything that can make an outbound HTTP request from the box — including a bug in your app. Use the IMDSv2 migration playbook to enforce HttpTokens=required without breaking older agents or SDKs.
Dev services bound to 0.0.0.0. Postgres, Redis, MongoDB, Elasticsearch, a debug UI, that one Jupyter notebook you spun up “just for a sec” — anything that listens on all interfaces with no auth is a free shell waiting to happen. Bind to 127.0.0.1, or at minimum require a password and put it behind the firewall.
No backups, or backups that have never been restored. See previous section. This is the one that ends businesses.
Storing secrets in committed .env files. You’ll forget, push to a public repo, and your API keys are now public. Use a .env.example checked in, and the real .env ignored.

None of these are exotic. All of them are still everywhere.

Worth a free second opinion?

Even after a careful first-day setup, things drift. A teammate enables password auth “just for a minute.” A new service starts listening on 0.0.0.0. Auto-updates silently break and stop running. The point of a periodic external check is to catch that drift before it matters.

Run a QuickCheck on your VPS → — read-only, no install, takes a few minutes. Or look at a sample report first to see what it covers.

What this is not

This article is a sensible starting checklist for one Linux VPS run by one person or a tiny team. It is not:

A replacement for security advice from someone who knows your specific stack and threat model.
A compliance program. If you handle health data, payment data, or anything else regulated, you need more than a blog post.
A guarantee. Nothing in security is. The goal is to make yourself a much less appealing target than the millions of other servers on the internet that haven’t done any of this.

Do the basics, do them well, then go back to building the actual product. That’s the job.

About Tuck Sentinel

Tuck Sentinel is a small operation focused on practical security checks for indie founders and small teams running production on a VPS. We build QuickCheck, a free read-only scan that highlights the boring-but-important configuration issues most one-person ops teams miss. No agents, no upsell maze — just the things worth fixing.

AWS IMDSv2 Migration Without Breaking Things

Sun, 10 May 2026 00:22:00 +0000

If you have EC2 instances older than a year or two, some of them probably still allow IMDSv1. The Instance Metadata Service is the HTTP endpoint at 169.254.169.254 every EC2 instance can hit to learn about itself: instance ID, region, attached IAM role, and the temporary credentials that come with it. IMDSv1 is the original unauthenticated GET protocol. IMDSv2 is the session-token version that blocks a class of SSRF and confused-deputy attacks from walking off with your IAM Role credentials.

AWS has been nudging everyone toward IMDSv2 for years, but existing fleets, AMIs baked before the change, and ASGs pinned to old launch templates are full of IMDSv1-allowing instances. Migration is conceptually simple — flip a setting per instance — and operationally annoying, because flipping it on the wrong workload breaks credential lookups for SDKs, kubelet, the ECS agent, or your own scripts.

This guide walks through the migration the way an operator actually has to do it: detect what is still using v1, change instances in safe waves, validate, and have a rollback path. If you are using the migration window to clean up the rest of the instance, pair this with the broader Ubuntu/Debian EC2 hardening checklist.

Why Migrate

IMDSv1 is a plain HTTP GET against the link-local address. Anything inside the instance that can make an outbound HTTP request — including a vulnerable web app with SSRF — can read instance metadata, including the IAM Role Credentials path:

GET http://169.254.169.254/latest/meta-data/iam/security-credentials/<role-name>

That returns short-lived credentials for whatever role is attached to the instance. With IMDSv1, no proof of locality is required. An SSRF in a public-facing service can pivot directly to your IAM credentials.

IMDSv2 changes the protocol in two important ways:

Session tokens. Callers PUT to /latest/api/token for a session token, then send it back as X-aws-ec2-metadata-token. SSRF primitives that only allow GET are blocked.
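Hop limit. The response to the token PUT is sent with an IP TTL equal to the instance’s configured hop limit. The usual default is 1, so a container behind a Docker bridge or a pod behind a CNI, one network hop further away, never receives the token unless you raise the limit.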

Set IMDSv2 to required and v1 stops responding. That’s the goal state.

What Breaks

The realistic breakage list is short and well-known. Knowing it upfront is most of the migration.

Old AWS SDKs. Anything older than the published cutoffs only knows IMDSv1: AWS CLI v1 < 1.18.x, boto3 < 1.12.x, AWS SDK for Java v1 < 1.11.678, Go SDK v1 < 1.25.38, .NET SDK before late-2019 (confirm exact minimums against AWS’s current list). Modern SDKs try v2 first and fall back to v1 only if that fails; once v2 is required there is no v1 to fall back to, so a v1-only SDK simply fails to find credentials.
Containers behind Docker bridge or CNI. The default hop limit of 1 denies pods/containers that route through the bridge. Raise the hop limit to 2, or better, use IRSA or EKS Pod Identity on EKS, or task roles on ECS, so workloads don’t depend on instance metadata at all.
kubelet on self-managed nodes. Older kubelets only spoke v1. Modern EKS-optimized AMIs are fine; legacy kops clusters and old custom AMIs are the usual offenders.
ECS agent. amazon-ecs-init >= 1.50 supports IMDSv2. Old ECS-optimized AMIs not re-rolled in years can fail credential fetch.
CloudWatch / SSM agent. Recent versions fine; very old pinned versions not.
Custom scripts. curl http://169.254.169.254/latest/meta-data/... without a token will 401 once v1 is off.
Third-party agents in old AMIs. Old Datadog, New Relic, Splunk, or backup agents from years-old golden images can be v1-only.

That’s the whole list. Everything else either works on day one or never touched IMDS.

Detect IMDSv1 Use

Don’t flip the switch blind. Find the callers first.

CloudWatch metric: `MetadataNoToken`

Every EC2 instance emits a CloudWatch metric called MetadataNoToken in the AWS/EC2 namespace. It increments every time something on the instance hits IMDSv1. This is the single most useful signal you have.

aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name MetadataNoToken \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --statistics Sum \
  --period 3600 \
  --start-time "$(date -u -d '7 days ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time   "$(date -u +%Y-%m-%dT%H:%M:%SZ)"

If Sum across the last 7 days is 0, that instance is not making any IMDSv1 calls and is safe to switch. Anything non-zero means something is still hitting v1.

For a fleet view, query across all instance IDs or use CloudWatch Metrics Insights / Metric Math to graph MetadataNoToken aggregated. Tag the noisy instances and dig in.
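If you save the JSON output of the get-metric-statistics call above, a few lines of Python can turn it into a yes/no answer per instance. This is a sketch that assumes the CLI was run with JSON output and that the response has the usual Datapoints list with a Sum field; the function names are mine.

```python
import json


def imdsv1_call_count(stats_json: str) -> int:
    """Add up the Sum field of every datapoint in get-metric-statistics output."""
    payload = json.loads(stats_json)
    return int(sum(dp.get("Sum", 0) for dp in payload.get("Datapoints", [])))


def safe_to_require_v2(stats_json: str) -> bool:
    """True when no tokenless (IMDSv1) calls were recorded in the window."""
    return imdsv1_call_count(stats_json) == 0


if __name__ == "__main__":
    sample = json.dumps({
        "Label": "MetadataNoToken",
        "Datapoints": [
            {"Timestamp": "2026-05-01T00:00:00Z", "Sum": 0.0, "Unit": "Count"},
            {"Timestamp": "2026-05-01T01:00:00Z", "Sum": 3.0, "Unit": "Count"},
        ],
    })
    print(imdsv1_call_count(sample), safe_to_require_v2(sample))
```

An empty Datapoints list also counts as zero here, so make sure the instance was actually running during the window before you trust a clean result.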

Inventory: which instances even allow v1?

aws ec2 describe-instances \
  --query 'Reservations[].Instances[].{
    Id:InstanceId,
    State:State.Name,
    HttpTokens:MetadataOptions.HttpTokens,
    HopLimit:MetadataOptions.HttpPutResponseHopLimit,
    Endpoint:MetadataOptions.HttpEndpoint
  }' \
  --output table

HttpTokens is what you care about. It will be one of:

optional — IMDSv1 still allowed (the thing you’re trying to remove)
required — IMDSv2 only (the goal state)

A simple “what’s left?” query:

aws ec2 describe-instances \
  --filters "Name=metadata-options.http-tokens,Values=optional" \
            "Name=instance-state-name,Values=running" \
  --query 'Reservations[].Instances[].InstanceId' \
  --output text

CloudTrail and VPC flow logs

CloudTrail does not log calls to IMDS itself — those never leave the instance. What it does show is the AWS API calls made with the credentials IMDS handed out, via userIdentity.sessionContext and the accessKeyId of the temporary credentials. Useful for finding workloads still authenticating via instance role that should have moved to IRSA or task roles.

VPC flow logs do not see 169.254.169.254 traffic either — link-local stays inside the host. Stick to MetadataNoToken plus the inventory query.

On-host detection

If you have shell access to a candidate instance, run something quick before you change settings:

# Try IMDSv1: prints 200 if v1 is still on, 401 if tokens are already required
curl -s -o /dev/null -w "%{http_code}\n" \
  http://169.254.169.254/latest/meta-data/instance-id

# Try IMDSv2: should print the instance ID on any instance with IMDS enabled
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/instance-id

To find callers on a host, auditd rules on connects to 169.254.169.254 plus ss -tnp snapshots usually identify the offending process. On a Kubernetes node, look at old DaemonSets and sidecars first.

Migration Steps

The flow that has worked reliably for small and mid-size fleets:

1. Baseline and freeze new IMDSv1

Set the account-level defaults so anything launched from now on requires IMDSv2, and mark the AMIs you own as v2-only. The defaults apply per region, so repeat the first command in every region you use:

# Default IMDS options for new instances in this region
aws ec2 modify-instance-metadata-defaults \
  --http-tokens required \
  --http-put-response-hop-limit 2 \
  --http-endpoint enabled

# Mark an existing AMI as IMDSv2-only (repeat for each AMI you own)
aws ec2 modify-image-attribute \
  --image-id ami-xxxxxxxxxxxxxxxxx \
  --imds-support v2.0

Use modify-image-attribute --imds-support v2.0 on each AMI you control. Once set, instances launched from that AMI get v2-required automatically.

Also add a launch template version that requires IMDSv2, for standalone launches and Auto Scaling groups alike:

aws ec2 create-launch-template-version \
  --launch-template-id lt-0123456789abcdef0 \
  --source-version 1 \
  --launch-template-data '{
    "MetadataOptions": {
      "HttpTokens": "required",
      "HttpPutResponseHopLimit": 2,
      "HttpEndpoint": "enabled"
    }
  }'

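This stops the bleeding, provided your Auto Scaling groups and launch calls actually point at the new version ($Latest, or the default version once you update it). Old instances may still be on v1, but no new ones are.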

2. Sort instances into waves

Pull the list of HttpTokens=optional instances. Group them by:

Wave 0 — disposable. Stateless workers, batch nodes, dev/test. Cheap to break, cheap to recreate. Migrate first.
Wave 1 — replaceable through autoscaling. ASG-managed web tiers, ECS/EKS nodes. New launches are already v2-required; old nodes get rotated out by simply triggering an instance refresh.
Wave 2 — stateful or hand-built. Bastions, databases on EC2, single-instance services, anything pet-shaped.

For waves 0 and 1, prefer rotation over modification — relaunch from updated launch templates rather than mutating live instances. Less risky, fewer surprises.
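One way to sketch the wave sort in code, assuming you have already pulled each instance’s tags into a dict. The aws:autoscaling:groupName tag is the one EC2 Auto Scaling puts on instances it launches; the disposable tag is a hypothetical convention of my own, so substitute whatever marks throwaway boxes in your account.

```python
from typing import Dict, List

ASG_TAG = "aws:autoscaling:groupName"  # added by EC2 Auto Scaling to its instances
DISPOSABLE_TAG = "disposable"          # hypothetical tag; use your own convention


def wave_for(tags: Dict[str, str]) -> int:
    """Return 0 (disposable), 1 (ASG-managed), or 2 (stateful or hand-built)."""
    if tags.get(DISPOSABLE_TAG, "").lower() == "true":
        return 0
    if ASG_TAG in tags:
        return 1
    return 2


def sort_into_waves(instances: Dict[str, Dict[str, str]]) -> Dict[int, List[str]]:
    """Group instance IDs by migration wave, keeping IDs in sorted order."""
    waves: Dict[int, List[str]] = {0: [], 1: [], 2: []}
    for instance_id, tags in sorted(instances.items()):
        waves[wave_for(tags)].append(instance_id)
    return waves


if __name__ == "__main__":
    fleet = {
        "i-aaa": {"disposable": "true"},
        "i-bbb": {"aws:autoscaling:groupName": "web"},
        "i-ccc": {"Name": "bastion"},
    }
    print(sort_into_waves(fleet))
```

Anything that lands in wave 2 by default deserves a human look; untagged instances end up there on purpose.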

3. Optional: try `optional` → `required` with a hop bump

For a stateful instance you cannot easily relaunch, raise the hop limit first (so containers keep working), then flip tokens to required:

# Step A: bump hop limit while still allowing v1
aws ec2 modify-instance-metadata-options \
  --instance-id i-0123456789abcdef0 \
  --http-put-response-hop-limit 2 \
  --http-tokens optional \
  --http-endpoint enabled

# Verify everything still works for at least one full agent cycle
# (CloudWatch agent, SSM agent, your app, container credential lookups)

# Step B: require v2
aws ec2 modify-instance-metadata-options \
  --instance-id i-0123456789abcdef0 \
  --http-tokens required

Watch MetadataNoToken after step A — if any callers are still using v1, they will keep showing up in the metric. Fix or upgrade them before step B.

4. Roll Auto Scaling groups

After the launch template is updated:

aws autoscaling start-instance-refresh \
  --auto-scaling-group-name my-asg \
  --preferences '{"MinHealthyPercentage": 90, "InstanceWarmup": 300}'

For EKS managed node groups, the equivalent is updating the node group to a new launch template version and letting AWS drain and replace nodes. For ECS, update the capacity provider’s launch template and either drain instances or wait for natural turnover.

5. Sweep and confirm

After each wave, re-run the inventory query and the MetadataNoToken check. Anything still on optional should have a name attached to it and a reason.

Want a one-shot read-only audit that tells you which of your EC2 instances still allow IMDSv1, plus a dozen other quiet AWS posture issues? That’s exactly what QuickCheck is built for. Skim a sample report before you decide.

Validation

After you flip an instance, you want fast confirmation it’s actually on v2 and nothing is silently failing.

Confirm v2-required at the API level

aws ec2 describe-instances \
  --instance-ids i-0123456789abcdef0 \
  --query 'Reservations[0].Instances[0].MetadataOptions'

Expected:

{
  "State": "applied",
  "HttpTokens": "required",
  "HttpPutResponseHopLimit": 2,
  "HttpEndpoint": "enabled",
  "HttpProtocolIpv6": "disabled",
  "InstanceMetadataTags": "disabled"
}

State: applied matters — pending means the change has not landed yet.
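If you are checking more than a handful of instances, the same comparison is easy to script. A small sketch that takes the MetadataOptions object from describe-instances as a dict and lists what is off; the function name and the default hop limit of 2 are assumptions based on the settings used in this guide.

```python
from typing import Dict, List


def metadata_option_problems(options: Dict[str, object],
                             expected_hop_limit: int = 2) -> List[str]:
    """Return human-readable problems; an empty list means the instance is done."""
    problems: List[str] = []
    if options.get("State") != "applied":
        problems.append(f"State is {options.get('State')!r}; change has not landed")
    if options.get("HttpTokens") != "required":
        problems.append("HttpTokens is not 'required'; IMDSv1 is still allowed")
    if options.get("HttpEndpoint") != "enabled":
        problems.append("HttpEndpoint is not 'enabled'")
    if options.get("HttpPutResponseHopLimit") != expected_hop_limit:
        problems.append(
            f"hop limit is {options.get('HttpPutResponseHopLimit')!r}, "
            f"expected {expected_hop_limit}"
        )
    return problems


if __name__ == "__main__":
    print(metadata_option_problems({
        "State": "pending",
        "HttpTokens": "optional",
        "HttpPutResponseHopLimit": 2,
        "HttpEndpoint": "enabled",
    }))
```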

Confirm v1 is actually rejected on the host

# Should now return 401 Unauthorized
curl -s -o /dev/null -w "v1: %{http_code}\n" \
  http://169.254.169.254/latest/meta-data/instance-id

# Should return 200 with the instance ID
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  -w "\nv2: %{http_code}\n" \
  http://169.254.169.254/latest/meta-data/instance-id

v1: 401 and v2: 200 is the correct pair.

Confirm credentials still resolve

TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
ROLE=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/iam/security-credentials/)
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/iam/security-credentials/$ROLE \
  | head -c 200; echo

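The full response contains AccessKeyId, SecretAccessKey, Token, and Expiration. The head -c 200 cut shows only the first few fields, which is enough to confirm "Code": "Success" and an AccessKeyId without dumping the whole secret into your terminal.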

Confirm app-level health

aws sts get-caller-identity from the instance using whichever SDK your workloads use.
Container credential lookups from inside one container per host (especially if you raised the hop limit).
ECS agent: curl -s http://localhost:51678/v1/metadata should still respond.
kubelet health: nodes still Ready, image pulls from ECR still work.

Confirm `MetadataNoToken` is zero

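After 24–48 hours on v2-required, MetadataNoToken should be a flat zero line. That metric counts only successful tokenless calls, so also check MetadataNoTokenRejected in the same namespace: a non-zero value there means something is still calling v1 and is now being refused. Find it.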

Rollback

You want this written down before you need it.

Per-instance rollback is one CLI call:

aws ec2 modify-instance-metadata-options \
  --instance-id i-0123456789abcdef0 \
  --http-tokens optional \
  --http-put-response-hop-limit 2 \
  --http-endpoint enabled

That re-enables IMDSv1 immediately, no instance restart required. It is the same call you used to flip forward — just with optional instead of required.

Launch template rollback: revert to the previous version.

aws ec2 modify-launch-template \
  --launch-template-id lt-0123456789abcdef0 \
  --default-version 1

Auto Scaling rollback: trigger another instance refresh against the previous LT version, or roll forward with a fixed template once you know what broke. Avoid the temptation to mutate live ASG instances; relaunch is cleaner.

For account-level defaults, you can re-relax them, but generally do not. Once new instances are v2-required by default, leave that in place even if you have to roll back individual stragglers.

QuickCheck CTA

If you’d rather not hand-roll the inventory queries and CloudWatch checks across every account and region, QuickCheck runs a read-only, one-shot review of your AWS posture and produces a plain-English report. IMDSv1 stragglers are one of the dozen things it surfaces — alongside open security groups, public S3, missing MFA on root, untagged keys, and a few other “you’d rather know” items. See an example in the sample report. It is not magic and not a replacement for proper cloud security tooling, but it is a fast way to know where you stand before you start migrating.

What This Is Not

To set expectations clearly:

This is not a penetration test. It is a configuration migration, not an adversarial exercise.
This is not a certification or compliance attestation. Migrating to IMDSv2 is a control improvement; it does not by itself constitute SOC 2, ISO 27001, PCI, or anything else. Your auditor still wants the artifacts they always want.
This is not a guarantee. Cloud security is a portfolio of controls. IMDSv2 closes one well-known SSRF-to-credentials path; it does not address misconfigured security groups, overly broad IAM policies, leaked long-lived keys, or vulnerable application code. Treat it as one item on the list.
This is not a substitute for moving workloads to IRSA / EKS Pod Identity / ECS task roles where those fit. IMDSv2 makes instance metadata safer; per-workload identity is still the better long-term answer for containers.

Migrate to IMDSv2 because it is cheap, well-understood, and removes a real foot-gun. Then keep going.

About Tuck Sentinel

Tuck Sentinel is the security-focused side of an indie operator workshop by Rich Gibbs. It builds small, sharp tools — like QuickCheck — for founders and small teams who want a competent read of their cloud posture without an enterprise platform. The bias: fast, honest, read-only assessments and migrations you can actually finish.

SPF, DKIM, DMARC for indie founders: the 20-minute checklist

Sun, 10 May 2026 04:30:00 +0000

You shipped a product. Stripe sends receipts. Postmark sends magic links. Mailchimp blasts your launch list. You replied to a support ticket from your own founder address.

Then someone said “hey, your password reset went to spam.”

This guide is for that moment.

It is not a deliverability bible. It is the smallest correct version of the SPF / DKIM / DMARC story for a solo founder or a 2-3 person SaaS team, with one custom domain and two-to-five tools that send email on its behalf. If you can edit DNS and copy a record, you can finish it tonight.

We are also not selling you a deliverability platform. The point of this post is for you to do it yourself, correctly, in one sitting.

What “set up email DNS” actually means in 2026

Mailbox providers — Gmail, Yahoo, Outlook, Apple, ProtonMail — use three DNS-anchored signals to decide whether a message is plausibly from your domain at all:

SPF says “these IP addresses / hostnames are allowed to send mail using my domain in the envelope sender.”
DKIM says “messages from my domain will carry a cryptographic signature in the headers, signed by a key whose public half lives in DNS.”
DMARC says “if SPF and DKIM both fail to align with my visible From: domain, here is what you should do — nothing, quarantine to spam, or reject — and please send me reports.”

In 2024–2025 Gmail and Yahoo started requiring all three from any sender shipping more than 5,000 messages a day to their users, and they have been quietly tightening the rules for low-volume senders ever since. In practice, by 2026:

If your domain has no SPF and no DKIM, password resets and receipts will sometimes silently disappear into spam.
If your domain has no DMARC at all, anyone can spoof “from your domain” until enough recipients complain.
If your DMARC record is malformed, mailbox providers behave the same as if it isn’t there — except now your reports vanish too.

You do not need to be perfect. You need to be not broken.

The 20-minute checklist

Before you touch DNS, do the boring inventory step. This is the part most founders skip and most spam problems come from.

1. List every tool that sends mail “from” your domain (3 minutes)

Open a notes file. Write the domain you want to fix at the top. Then list every place that sends email as that domain. Real examples for a typical indie SaaS:

Founder mail (you replying from you@yourdomain.com) — Google Workspace or Fastmail.
Transactional / product mail — Postmark, Resend, Mailgun, AWS SES, SendGrid, Mailtrap.
Marketing / newsletter — ConvertKit, Mailchimp, Beehiiv, Buttondown, Substack custom domain.
Helpdesk — Help Scout, Front, HubSpot, Zendesk, Plain.
App-platform notifications — Vercel/Render/Heroku notifications using your domain, GitHub on a custom domain.
Stripe receipts and Tally form notifications, when configured to “send from” your domain rather than the platform default.

If you cannot remember, search your inbox for from:@yourdomain.com and note every “tool integration” message you find from the last 90 days.

This list is the single most useful artifact in this entire process. If anyone ever asks you “do you know who sends as your domain?”, you can answer in one screen.

2. Pick exactly one SPF record (5 minutes)

SPF is one TXT record at the apex of your domain (yourdomain.com, not mail.yourdomain.com). You are allowed exactly one. If there are two SPF TXT records in DNS, every conforming mailbox server treats the result as permerror and ignores both.

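A working SPF for a domain that sends through Google Workspace, Postmark, Mailgun, and Constant Contact might look like this (treat the include: hostnames as illustrative and copy the current ones from each provider’s docs):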

v=spf1 include:_spf.google.com include:spf.mtasv.net include:_spf.mailgun.org include:_spf.constantcontact.com -all

Rules:

Start with v=spf1.
One include: per provider, taken from each provider’s docs. Do not invent them.
End with -all (hard fail) or ~all (soft fail). Use ~all while you are setting up DMARC, then move to -all once DMARC reports are clean.
Do not put +all anywhere. Ever. That tells the world anyone can send as you.
Do not exceed 10 DNS lookups in total. Every include:, a, mx, exists, and ptr mechanism and every redirect= counts, including the ones nested inside each provider’s include; ip4: and ip6: are free. Tools like Google Workspace + Mailgun + Mailchimp + Constant Contact + Help Scout will quietly exceed 10. If you see permerror reports, this is usually why.
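To make the lookup limit concrete, here is a rough counter. It is a sketch, not a full SPF evaluator: instead of querying DNS it walks a dict of records you paste in (domain to SPF string), and the domains in the demo are made up.

```python
from typing import Dict, Tuple

# Mechanisms that cost one DNS lookup each; ip4, ip6 and all cost none.
LOOKUP_MECHANISMS = ("include", "a", "mx", "ptr", "exists")


def count_spf_lookups(domain: str, records: Dict[str, str],
                      _path: Tuple[str, ...] = ()) -> int:
    """Count lookup-causing terms for a domain, following include and redirect."""
    if domain in _path:  # include loop: stop instead of recursing forever
        return 0
    path = _path + (domain,)
    total = 0
    for term in records.get(domain, "").split()[1:]:  # skip the v=spf1 tag
        term = term.lstrip("+-~?").lower()
        if term.startswith("redirect="):
            total += 1 + count_spf_lookups(term.split("=", 1)[1], records, path)
            continue
        name, _, target = term.partition(":")
        name = name.split("/")[0]  # drop a CIDR suffix such as mx/24
        if name in LOOKUP_MECHANISMS:
            total += 1
            if name == "include":
                total += count_spf_lookups(target, records, path)
    return total


if __name__ == "__main__":
    demo = {
        "example.test": "v=spf1 include:_spf.a.test include:_spf.b.test mx ~all",
        "_spf.a.test": "v=spf1 include:_spf.a1.test ip4:192.0.2.0/24 ~all",
        "_spf.a1.test": "v=spf1 ip4:198.51.100.0/24 ~all",
        "_spf.b.test": "v=spf1 a -all",
    }
    print(count_spf_lookups("example.test", demo))
```

The nested includes are the part people miss: one include: line in your record can cost three or four lookups once the provider’s own record is expanded.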

If you use mail.yourdomain.com as a separate sending subdomain (some providers configure it that way), publish a separate SPF record at that subdomain.

3. Add DKIM for each sending tool (5 minutes)

DKIM is per-provider. Every provider that sends mail for you should give you one or more selector._domainkey.yourdomain.com CNAME or TXT records to add.

Examples of selectors you’ll see in a real indie SaaS:

Google Workspace: google._domainkey
Postmark: <assigned>._domainkey (Postmark assigns the selector when you verify the domain)
Mailgun: mailo._domainkey and pic._domainkey
ConvertKit / Mailchimp: their dashboard prints the exact CNAMEs.
Resend: resend._domainkey

Two rules that catch people:

DKIM records do not show up in plain dig TXT yourdomain.com. You have to query the selector explicitly: dig TXT selector._domainkey.yourdomain.com. If you cannot remember selectors, you cannot validate your own DKIM from public DNS — write them down.
“DKIM is set up” is not the same as “messages are being signed.” Each provider has its own toggle for “sign outbound mail with this key.” If signing is off in the provider dashboard, the selector record alone is useless.

The Authentication-Results header in any actual sent email is the source of truth. If it says dkim=pass from your visible domain, signing is real.
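For the DNS side of that check, two tiny helpers cover most of it. A sketch under the assumption that you already have the TXT value in hand (from dig or your DNS dashboard); both function names are mine. An empty p= tag means the key has been revoked, so a record can exist and still be useless.

```python
from typing import Dict


def dkim_query_name(selector: str, domain: str) -> str:
    """Build the DNS name to query for a given selector."""
    return f"{selector}._domainkey.{domain}"


def dkim_has_public_key(txt_value: str) -> bool:
    """True if the DKIM TXT value carries a non-empty p= tag."""
    tags: Dict[str, str] = {}
    for part in txt_value.split(";"):
        if "=" in part:
            key, value = part.split("=", 1)
            # Long keys are often split across strings; drop any whitespace.
            tags[key.strip().lower()] = "".join(value.split())
    return bool(tags.get("p"))


if __name__ == "__main__":
    print(dkim_query_name("google", "yourdomain.com"))
    print(dkim_has_public_key("v=DKIM1; k=rsa; p="))
```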

4. Publish a cautious DMARC (3 minutes)

DMARC is one TXT record at _dmarc.yourdomain.com. Start safe:

v=DMARC1; p=none; rua=mailto:dmarc-reports@yourdomain.com; adkim=r; aspf=r; pct=100

Translation:

p=none — do not block anything yet, just ask for reports.
rua=mailto: — a real mailbox you actually read; not a personal Gmail you ignore. Many founders use a forwarding alias like dmarc-reports@yourdomain.com that lands in a labeled folder.
adkim=r; aspf=r — relaxed alignment. Strict alignment is for later.

A 14-day p=none window before you tighten anything is the difference between “I learned my newsletter platform sends as mail.mydomain.com” and “I broke my newsletter for two days.”

After 14 days of clean reports — meaning every legitimate sender shows up in the reports as passing SPF or DKIM aligned with yourdomain.com — move to:

v=DMARC1; p=quarantine; rua=mailto:dmarc-reports@yourdomain.com; pct=25

The pct=25 ramp is intentional. It means “quarantine 25% of the messages that fail alignment and treat the other 75% as if the policy were still p=none,” so you can detect any forgotten sender before going full p=quarantine or p=reject.

If you are an indie founder, you may stop at p=quarantine forever. p=reject is for senders who are confident no legitimate mail anywhere uses their domain incorrectly.
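Since a malformed DMARC record behaves like no record at all, a quick sanity check before you paste it into DNS is cheap insurance. This is a sketch that covers only the tags used in this post, not the full spec; the function names are mine, and treating a missing rua as a problem is this checklist’s opinion rather than a protocol requirement.

```python
from typing import Dict, List


def parse_dmarc(record: str) -> Dict[str, str]:
    """Split a DMARC record into a tag -> value dict."""
    tags: Dict[str, str] = {}
    for part in record.split(";"):
        if "=" in part:
            key, value = part.split("=", 1)
            tags[key.strip().lower()] = value.strip()
    return tags


def dmarc_problems(record: str) -> List[str]:
    """Return a list of problems; an empty list means the basics look right."""
    tags = parse_dmarc(record)
    problems: List[str] = []
    if not record.strip().startswith("v=DMARC1"):
        problems.append("record must start with v=DMARC1")
    if tags.get("p") not in ("none", "quarantine", "reject"):
        problems.append("p must be none, quarantine, or reject")
    if "rua" not in tags:
        problems.append("no rua address, aggregate reports will not be sent")
    pct = tags.get("pct", "100")  # pct defaults to 100 when omitted
    if not pct.isdigit() or not 0 <= int(pct) <= 100:
        problems.append("pct must be an integer from 0 to 100")
    return problems


if __name__ == "__main__":
    print(dmarc_problems("v=DMARC1; p=monitor"))
```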

5. Verify the result with one real email (4 minutes)

Send one email to yourself at Gmail, Yahoo, and Outlook from each sending tool you care about most (founder mail, password reset, newsletter). Open the message header.

You are looking for an Authentication-Results line that says all three of:

spf=pass with smtp.mailfrom= on yourdomain.com or one of its subdomains (for example bounce@mail.yourdomain.com), which is what relaxed alignment accepts.
dkim=pass with header.d=yourdomain.com (alignment) — not header.d=postmarkapp.com or header.d=mailgun.org.
dmarc=pass.

dkim=pass header.d=mailgun.org while your visible From: is support@yourdomain.com is the most common deliverability bug among indie founders. The message is technically signed, but DMARC-wise it is unsigned by your domain. Fix it by completing the provider’s “Use my own domain” / “Custom domain DKIM” configuration. Postmark, Mailgun, Resend, SendGrid, Mailchimp, ConvertKit, and AWS SES all support this; they just don’t enable it by default.
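If you would rather not eyeball headers, the alignment check can be sketched in a few lines. Assumptions to note: the parsing is deliberately loose and only handles the common single-header layout; some receivers (Gmail among them) report the signer as header.i=@domain rather than header.d=, so the sketch accepts either; and "aligned" is approximated as "equal to or a subdomain of the From: domain you pass in". The function names are mine.

```python
import re
from typing import Dict, Iterator, Tuple


def clauses(header: str) -> Iterator[Tuple[str, str, Dict[str, str]]]:
    """Yield (method, verdict, properties) for each 'method=verdict ...' clause."""
    for chunk in header.lower().split(";"):
        match = re.match(r"\s*(\w+)=(\w+)", chunk)
        if not match:
            continue  # e.g. the leading authserv-id such as mx.google.com
        props = dict(re.findall(r"([a-z]+\.[\w-]+)=([^\s;()]+)", chunk))
        yield match.group(1), match.group(2), props


def _domain(value: str) -> str:
    """Take the domain part of an address or of an @domain signer identity."""
    return value.rsplit("@", 1)[-1].strip("<>")


def _under(domain: str, org_domain: str) -> bool:
    return domain == org_domain or domain.endswith("." + org_domain)


def check_alignment(header: str, from_domain: str) -> Dict[str, bool]:
    """Report whether SPF and DKIM passed *for your domain*, and DMARC passed."""
    from_domain = from_domain.lower()
    result = {"spf_aligned": False, "dkim_aligned": False, "dmarc_pass": False}
    for method, verdict, props in clauses(header):
        if verdict != "pass":
            continue
        if method == "spf":
            sender = _domain(props.get("smtp.mailfrom", ""))
            result["spf_aligned"] |= _under(sender, from_domain)
        elif method == "dkim":
            signer = props.get("header.d") or props.get("header.i", "")
            result["dkim_aligned"] |= _under(_domain(signer), from_domain)
        elif method == "dmarc":
            result["dmarc_pass"] = True
    return result


if __name__ == "__main__":
    bad = ("mx.example.net; dkim=pass header.d=mailgun.org; "
           "spf=pass smtp.mailfrom=bounce@mailgun.org; "
           "dmarc=fail header.from=yourdomain.com")
    print(check_alignment(bad, "yourdomain.com"))
```

A message with two DKIM signatures (the provider’s and yours) is handled by looking at every dkim= clause, so one aligned signature is enough.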

Things to deliberately ignore in v1

You do not need:

BIMI. Useful only after DMARC is at p=quarantine or stricter for a long time, and even then it is a logo-display feature, not a deliverability feature.
ARC. Mailing-list specific.
DKIM key rotation. Whatever your provider gave you is fine until they tell you to rotate.
Per-subdomain DMARC strictness (sp=). Default is fine until you operate dedicated sending subdomains.

You also do not need:

A paid “deliverability platform” subscription.
A reputation-monitoring agency.
An IP warmup schedule (you are using shared IPs from your ESP; they handle warmup).

Common gotchas an indie founder will hit

These are the failure modes I see most often when reviewing single-domain setups:

Two SPF records. Often a leftover from when you were trying providers. Merge into one.
+all left over from a Google guide that said “for testing only.” Remove.
DMARC rua pointing at you@yourdomain.com itself. Your inbox will fill with unreadable XML aggregate reports. Use a sub-alias (dmarc-reports@) that auto-files.
DKIM “set up” but provider has signing disabled. Toggle it on in the provider, and confirm with a real test message header.
Marketing tool added later, but DKIM never aligned. New newsletter platform turns SPF green, leaves DKIM header.d= pointing at the platform’s domain. DMARC fails alignment for that one tool.
Personal Gmail “Send mail as” alias used to reply from you@yourdomain.com. Even if Workspace is fine, that alias often sends as gmail.com underneath. Reply-To is fine; the sending identity matters for alignment.
Subdomain forgotten. Stripe receipts sometimes go through mail.yourdomain.com. If subdomain SPF/DKIM is missing, mailbox providers can still apply the apex DMARC. Check at the exact subdomain.

If any of those sound like a problem you cannot debug from your provider’s dashboard alone, that is the moment a second pair of eyes is worth more than another deliverability article.

Next step: a $99 second pair of eyes

Once you’ve done the 20-minute pass above, the question is usually not “is the record there?” It’s “are all these records aligned with the way I actually send mail?” That answer lives partly in DNS and partly in a few real message headers.

If you’d like a written, prioritized fix list for one domain — SPF, DKIM, DMARC, MX, sender-tool inventory, and the obvious mistakes — that is exactly the Inbox/DNS QuickCheck we offer. $99, one domain, no DNS login needed, 24-hour turnaround. No managed retainers, no inbox-placement guarantees, no spam help.

If you’d rather DIY but want the printable, fillable Markdown version of the entire process — sender inventory template, SPF builder, DKIM provider reference, DMARC ramp, Authentication-Results decoder — that’s the Indie Founder Email DNS Pack, $19 (pay what you want, $9 minimum) on Gumroad.

That is also the point at which most founders realize there was one tool nobody remembered to align. That tool is almost always a marketing platform.

You don’t have to buy anything to follow the checklist above. The above is the whole working answer for most one-domain indie SaaS. The QuickCheck exists for when you’ve done the obvious and still have a quiet 5–10 % of legitimate mail disappearing into spam, and you want a second set of eyes before you tighten DMARC further.

Either way, the goal is the same: your password resets, your receipts, and your founder replies should reach the inbox. The boring DNS hygiene above is most of the answer.

If you’ve already finished the checklist above and tightened DMARC to p=quarantine, and now a specific sender — newsletter tool, Stripe receipts, a sub-domain — has started being quarantined or hard-bounced (Gmail 5.7.26, Microsoft 5.7.509 / 5.7.515), the DMARC Quarantine Pack is the focused diagnostic runbook for that exact moment. It includes a DSN decoder cheat-sheet, three real-world incident walkthroughs (marketing-tool DKIM drift, forgotten sub-domain, forwarding/ARC breakage), and a single-file Python aggregate-XML reader so you can read your own DMARC reports without paying for a SaaS dashboard.

DMARC Quarantine Pack — $29 on Gumroad · 14-day refund, no questions.

Cloudflare Email Routing for indie founders: the 10-minute support@ setup

Sun, 10 May 2026 05:00:00 +0000

You launched. Your domain has a website, a payment link, a privacy page that says “support@yourdomain.com,” and… no actual mailbox at that address. Real customer mail is silently bouncing.

You don’t need a $6-per-user-per-month Workspace seat to fix this. Cloudflare Email Routing forwards support@yourdomain.com (and any other alias you want) to a Gmail/Fastmail/Proton mailbox you already pay for, for free, in about ten minutes.

This post is the boring, working playbook to set it up — plus the one thing it can’t do that surprises every founder who tries it for the first time.

What Email Routing actually is

Cloudflare Email Routing is inbound-only forwarding for any domain whose DNS lives at Cloudflare. You publish a few MX and TXT records that Cloudflare manages for you, define some routing rules in the dashboard (“send anything to support@yourdomain.com to my Gmail”), and incoming mail gets re-injected into your real mailbox.

What it is not:

Not a mailbox. You can’t log in to a Cloudflare interface to read mail.
Not an outbound SMTP server. You can’t send from support@yourdomain.com through Email Routing. Replies will go from whatever mailbox you forwarded into, unless you also configure your replying client (more on this below).
Not a deliverability service. It accepts mail from the public internet and re-delivers it. SPF/DKIM/DMARC for your domain are still your job.

The 10-minute path

Two prerequisites:

Your domain’s nameservers are Cloudflare’s. (If they aren’t, follow Cloudflare’s “Add a site” flow first; it’s free, takes about 5 minutes plus DNS propagation.)
You have a Gmail / Fastmail / Proton / Mailbox.org / etc. mailbox you actually read.

1. Enable Email Routing (1 minute)

In the Cloudflare dashboard:

Pick your zone → Email → Email Routing.
Click Get started. Cloudflare will offer to add the required DNS records automatically. Say yes.

Cloudflare will publish:

Three MX records pointing at route1.mx.cloudflare.net, route2.mx.cloudflare.net, route3.mx.cloudflare.net.
A TXT record at the apex with v=spf1 include:_spf.mx.cloudflare.net ~all.
A DKIM TXT record (cf2024-1._domainkey) for the routing service.

If you already have an SPF record at the apex, stop and merge them by hand. You should never have two SPF records. We’ll come back to that in the gotchas.

2. Verify the destination address (2 minutes)

Still in Email → Email Routing:

Destination addresses → Add destination address.
Enter the personal mailbox you want forwarded mail to land in.
Cloudflare emails a verification link. Click it.
The destination’s status should flip to Verified.

You can verify multiple destinations and route different aliases to different mailboxes. Useful if billing@ should go to a finance address and security@ should go to a different one.

3. Add a routing rule (1 minute)

Custom addresses → Create address.
Custom address: support (so the full address is support@yourdomain.com).
Action: Send to an email.
Destination: pick the verified address.
Save.

Repeat for every alias you advertise: hello@, billing@, legal@, security@, dmarc-reports@ (very useful, more on this in a minute).

4. (Optional) Catch-all (30 seconds)

In Custom addresses → Catch-all address, set it to either:

Drop — anything not matched is silently dropped. Good for spam hygiene.
Send to — any unknown alias is forwarded to your fallback mailbox. Good if you advertise lots of aliases on signup forms and don’t want misspellings to bounce.

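For a plain forwarding setup those are the two choices that matter (there is also a Send to a Worker action, which is out of scope here). Pick one. “Drop” is what most indie SaaS founders should use.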

5. Send a real test (1 minute)

From a different mailbox (not the destination — Gmail will helpfully suppress mail you sent to yourself), email support@yourdomain.com. It should arrive at the destination within a few seconds, with the original sender preserved in the From: header.

That’s the whole inbound setup. What remains is the gap between this and a working outbound support address, which is the part that catches everyone.

The one thing Email Routing does not do — and what to do instead

Email Routing is inbound only. If you reply to a customer’s email and you do nothing else, your reply will go from your.personal@gmail.com, not from support@yourdomain.com. The customer sees a different address than the one they wrote to, the conversation feels off, and you might also leak a personal address you didn’t mean to advertise.

Three options, in order of effort:

Option A — Live with replies coming from the personal mailbox

OK for a v0 SaaS while you have ten customers. Set the Reply-To header in your mail tool to support@yourdomain.com so further replies route correctly. Most mail clients let you set a default Reply-To per identity. Customers will see your personal address in the visible From, which they will probably tolerate while you’re small.

Option B — Use Gmail’s “Send mail as” with a real outbound SMTP server

In Gmail: Settings → Accounts and Import → Send mail as → Add another email address.

You will need:

A real outbound SMTP host that authorizes you to send as support@yourdomain.com. Gmail itself will not let you do this without an SMTP server; the old option of sending an external address through Gmail’s own servers is gone.
An SMTP host can be a paid Workspace seat ($6/mo/user — the thing we just avoided), or a transactional ESP like Postmark / Resend / Mailgun / SES configured with your custom domain and DKIM.

If you already use Postmark/Resend/Mailgun/SES for product mail, set up an authorized “transactional support” sender there and feed the SMTP credentials into Gmail’s Send-mail-as flow. Postmark and Resend both have specific docs for this. Now your replies go from support@yourdomain.com over a path that aligns with DKIM.

Option C — Use a help-desk tool with custom-domain support

Help Scout, Plain, Front, Missive, HubSpot Service. These all accept inbound mail forwarded to a tool-specific address (you point Cloudflare Email Routing at it instead of your Gmail) and send outbound replies as support@yourdomain.com with their own DKIM you authorize. Per-seat pricing varies; some have free tiers up to a few mailboxes.

For an indie SaaS at 0–500 customers, Option B is usually the sweet spot. For a 2-3 person team that wants conversation handling, Option C earns its keep.

Common gotchas

These are the things I see indie founders get wrong with Email Routing.

Gotcha 1: dual SPF records

If your DNS already had an SPF record (because you set up Postmark or Mailgun before adding Email Routing), Cloudflare will silently publish a second one. Conforming receivers will treat dual SPF as permerror and ignore both. Result: legitimate inbound delivery may still work via MX, but your outbound SPF alignment quietly breaks.

Fix: keep one record at the apex. If you also send via Postmark and Mailgun and have routing on:

v=spf1 include:_spf.mx.cloudflare.net include:spf.mtasv.net include:_spf.mailgun.org ~all

Verify with dig +short TXT yourdomain.com | grep spf1. You should see exactly one line.
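The merge itself is mechanical enough to sketch. This assumes you paste in the TXT values you currently see at the apex; the function names are mine. The merged record ends in -all only if every input did, otherwise ~all, so the result is never stricter than what you had (and a stray +all or ?all is not carried over).

```python
from typing import List


def spf_records(txt_values: List[str]) -> List[str]:
    """Keep only the TXT values that are SPF records."""
    return [v for v in txt_values if v.split()[:1] and v.split()[0].lower() == "v=spf1"]


def merge_spf(txt_values: List[str]) -> str:
    """Merge several SPF records into one, de-duplicating mechanisms in order."""
    mechanisms: List[str] = []
    all_terms: List[str] = []
    for record in spf_records(txt_values):
        for term in record.split()[1:]:
            if term.lstrip("+-~?").lower() == "all":
                all_terms.append(term)
            elif term not in mechanisms:
                mechanisms.append(term)
    # Hard fail only when every source record agreed on it.
    final = "-all" if all_terms and all(t == "-all" for t in all_terms) else "~all"
    return " ".join(["v=spf1", *mechanisms, final])


if __name__ == "__main__":
    apex_txt = [
        "v=spf1 include:_spf.mx.cloudflare.net ~all",
        "v=spf1 include:spf.mtasv.net include:_spf.mailgun.org -all",
        "google-site-verification=abc",
    ]
    print(len(spf_records(apex_txt)), "SPF records found")
    print(merge_spf(apex_txt))
```

After publishing the merged value, delete the old records and re-run the dig check; the count should drop to one.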

Gotcha 2: forwarded mail lands in spam

Forwarding changes the delivering server and rewrites the SMTP envelope, so SPF no longer aligns with the original sender’s domain by the time Gmail receives the forwarded copy; DKIM usually survives as long as the message is not modified. Symptoms: real customer mail to support@ shows up in Gmail’s Spam folder.

Fix:

In Gmail, open one such message → More → Filter messages like this → set criteria to to:support@yourdomain.com OR deliveredto:support@yourdomain.com, then Never send to spam + apply a Support label + optionally categorize as Primary.
Test from a non-destination address; do not test from another mailbox owned by the same Google account.

This is a Gmail filter, not a Cloudflare problem. Email Routing sets ARC headers correctly; some receivers still ding forwarded mail.

Gotcha 3: DMARC reports vanish

You set up DMARC at _dmarc.yourdomain.com with rua=mailto:dmarc-reports@yourdomain.com. That alias must actually route somewhere. If you forgot to add a Cloudflare Email Routing rule for dmarc-reports, the reports get dropped, and you’ll think DMARC is broken when really you just have no inbox to read it from.

Fix: add dmarc-reports@ as a routed alias. In Gmail set a filter to auto-label and skip the inbox. Aggregate reports are XML and noisy.
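Once the reports do arrive, you do not have to read the XML by hand. A minimal sketch with the standard library, assuming the usual aggregate report layout (record, row, source_ip, count, policy_evaluated) with no XML namespace, and that you have already unzipped the attachment; the function names are mine.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

Row = Tuple[str, int, str, str]  # (source_ip, message count, dkim, spf)


def summarize_report(xml_text: str) -> List[Row]:
    """Return one tuple per <record> in a DMARC aggregate report."""
    root = ET.fromstring(xml_text)
    rows: List[Row] = []
    for record in root.iter("record"):
        row = record.find("row")
        if row is None:
            continue
        rows.append((
            row.findtext("source_ip", default="?"),
            int(row.findtext("count", default="0")),
            row.findtext("policy_evaluated/dkim", default="?"),
            row.findtext("policy_evaluated/spf", default="?"),
        ))
    return rows


def failing_sources(xml_text: str) -> List[Row]:
    """Rows where neither DKIM nor SPF passed alignment, i.e. DMARC failures."""
    return [r for r in summarize_report(xml_text) if r[2] != "pass" and r[3] != "pass"]


if __name__ == "__main__":
    sample = """<feedback>
      <record><row><source_ip>198.51.100.7</source_ip><count>3</count>
        <policy_evaluated><disposition>none</disposition>
          <dkim>fail</dkim><spf>fail</spf></policy_evaluated></row></record>
    </feedback>"""
    for ip, count, dkim, spf in failing_sources(sample):
        print(f"{ip}: {count} messages, dkim={dkim}, spf={spf}")
```

The failing rows are the ones to chase: each source IP there is either a sender you forgot to align or someone spoofing you.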

Gotcha 4: your “send mail as” alias still routes through gmail.com

Even with Send-mail-as configured, if you don’t enable “Treat as alias” or you don’t use a true outbound SMTP host, Gmail will sometimes send through gmail.com and tag it as a forwarded sender. The visible From: looks right, but Authentication-Results will tell on you.

Fix: read the Authentication-Results header on a real reply (View original in Gmail). You want dkim=pass with your own domain as the signing domain, i.e. header.d=yourdomain.com (Gmail usually prints it as header.i=@yourdomain.com), not gmail.com.
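If you check this often, a few lines of Python can read the header for you. This is a sketch using a regular expression, not a full RFC 8601 parser. It accepts the signing domain in either the header.d= form or the header.i=@ form, since receivers differ in which one they print. The function names are made up for illustration.

```python
import re

# "dkim=<result>" followed, within the same clause, by the signing identity.
_DKIM_CLAUSE = re.compile(
    r"dkim=(\w+)[^;]*?header\.[di]=(?:[^\s;@]*@)?([\w.-]+)",
    re.IGNORECASE,
)


def dkim_results(auth_results):
    """Return a list of (result, signing_domain) pairs from the header text."""
    # Folded headers contain newlines and tabs; flatten them first.
    flat = " ".join(auth_results.split())
    return [(res.lower(), dom.lower()) for res, dom in _DKIM_CLAUSE.findall(flat)]


def signed_by(auth_results, domain):
    """True if at least one passing DKIM signature uses the given domain."""
    return ("pass", domain.lower()) in dkim_results(auth_results)


if __name__ == "__main__":
    header = """mx.google.com;
       dkim=pass header.i=@yourdomain.com header.s=pm header.b=abc123;
       spf=pass smtp.mailfrom=bounce@pm-bounces.yourdomain.com"""
    print(signed_by(header, "yourdomain.com"))
```

Paste the header value from View original into `header` and swap in your own domain.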

Gotcha 5: paid Workspace already exists for the domain

If your domain previously had Google Workspace MX records, or M365 MX records, the dashboard will warn before overwriting them. Do not click through that warning unless you intend to abandon the existing mailbox. Cloudflare’s MX records replace whatever was there — including your live Workspace inbox.

Fix: pick one. Either keep Workspace and don’t enable Email Routing, or migrate everything to Email Routing first.

Authentication after Email Routing is on

Once routing is live, dig your domain. You should see exactly:

3 MX records → route{1,2,3}.mx.cloudflare.net.
1 SPF TXT (only) at the apex.
1 DKIM TXT (cf2024-1._domainkey) for the routing service.
Your existing DKIMs from product/transactional senders.
1 DMARC TXT at _dmarc.

If any of those is duplicated or missing, you have homework. The matching SPF/DKIM/DMARC checklist walks the rest.

When to upgrade past Email Routing

Email Routing is the right answer when:

You need 1–10 aliases on one domain.
Volume is “real customer mail,” not “we send 50k newsletters a month from this address.”
A 5–60 second delay on inbound is fine.

Outgrow it when:

You want a shared inbox for two or more people without forwarding to the same Gmail.
You need calendar/contacts/Drive on the domain (that’s Workspace’s actual value, not the email forwarding).
You need server-to-server inbound webhooks (Email Routing supports a “Send to a Worker” action for this; useful but past the 10-minute mark).

Want a written second pair of eyes on your setup?

Once routing is live and SPF/DKIM/DMARC are published, the most useful thing you can do is verify every authorized sender is aligned with your visible From: address. That’s exactly the Inbox/DNS QuickCheck — a $99 written report on one domain, delivered within 24 hours, no DNS login required.

If you’d rather DIY the whole thing, the same content in printable, fillable Markdown form (sender inventory template, SPF builder, DMARC ramp, Authentication-Results decoder) is in the Indie Founder Email DNS Pack — $19 (pay what you want, $9 minimum) on Gumroad. Either is fine. The point is to do it once, well, then never think about it again.

If you set up Cloudflare Email Routing and also publish DMARC at p=quarantine or stricter, a small but real failure mode is forwarded mail that breaks SPF or DKIM alignment at the receiving mailbox. The DMARC Quarantine Pack ($29) is the focused runbook for diagnosing that and related cases — Gmail 5.7.26 and Microsoft 5.7.509 / 5.7.515 decoded with source citations, three incident walkthroughs (one of them is a forwarding/ARC case), and a single-file Python aggregate-XML reader for reading your own reports.

DMARC Quarantine Pack — $29 on Gumroad · 14-day refund, no questions.

I had 80,000 unread emails. Here's the cleanup playbook (no apps, no OAuth)

Sun, 10 May 2026 17:11:52 +0000

Last weekend I sat down to clean out my personal Gmail.

I had 80,675 unread messages older than one year. Most were newsletters from companies I’d long since stopped caring about — receipts from a 2019 ride-share account, password-reset emails from accounts that no longer exist, every “weekly digest” I’d ever opted into and forgotten.

The cleanup itself took about 20 minutes once I had a plan. The having-a-plan part took three evenings.

This post is the playbook I actually used. It isn’t a SaaS pitch. It doesn’t ask you to log into anything. It’s the boring, working sequence for someone who has tens of thousands of old unread emails in a personal Gmail and wants them gone tonight, without nuking something they’ll wish they’d kept.

Why most “inbox zero” advice fails on a real mailbox

If you Google “how to delete old unread emails Gmail bulk,” you get three kinds of answers:

SaaS apps that want full mailbox OAuth. Mailstrom, Clean Email, Cleanfox. They work, but the permission cost is large for a job that runs once.
Blog posts from 2014. They reference Outlook 2010, IMAP folders, and Gmail’s old desktop UI. The screenshots don’t match anything you actually see today.
“5 tips” listicles that assume you have 800 unread emails, not 80,000. The “select all” trick doesn’t survive paging through 1,600 pages of 50 messages each.

None of these are useful when you’re staring down a five-figure unread count.

The reason isn’t the operations — Gmail’s older_than: operator does most of the heavy lifting. The reason is order. If you don’t survey first, you’ll start deleting things you wanted to keep, panic, stop halfway, and end up with a mailbox that’s somehow worse than when you started.

The order I followed (and you can copy)

The whole playbook is five steps. None of them involve installing anything.

1. Survey before you touch anything

Open Gmail. Open the search bar. Run these queries one at a time and write the result count down on a piece of paper:

is:unread older_than:1y
is:unread older_than:3y
from:newsletters older_than:6m (or substitute a sender domain you know is noise)
has:attachment older_than:2y larger:10M
category:promotions older_than:6m

Five numbers. That’s your map.

If the first number is over 5,000, congratulations — you have the same shape of problem most indie founders have. Mine was 80,675 against the first query. Yours will be different but the playbook scales.

If you want a more thorough survey — top 20 senders, oldest cohorts by year, attachment age buckets, label sprawl — that’s what the Inbox Cleanup Pack ships: a small read-only shell script that calls the Gmail API under your own OAuth client and writes a single survey.json file. No message bodies, no subjects, no message ids. Just counts. You can also do the survey by hand with the queries above; the script just makes it faster on big mailboxes.

2. Filter the recurring noise first

Before you delete anything, kill the inbound flow.

Open Gmail → Settings → Filters and Blocked Addresses → Create a new filter.

For each of the top 5 newsletter senders you can name off the top of your head, create a filter:

from:(news@somecompany.com) → “Skip the Inbox” + “Mark as read” + “Apply label: Newsletters” + “Also apply filter to existing matching conversations.”

The “also apply” checkbox is the part most people miss. It silently archives the existing 4,000 unread newsletters from that sender in one click. No manual select-all required.

Repeat this for your top 5–10 noisy senders. You’ll be surprised how much of the unread count is concentrated in a small number of domains. In my case, six senders accounted for 41% of the 80,675.

3. Bulk-archive the old promotions cohort

Now that the inbound is filtered, attack the standing cohort.

In the search bar:

is:unread category:promotions older_than:1y

Click the small “Select all conversations that match this search” link that appears above the message list. (The plain “select all” checkbox at the top only ticks the 50 visible messages — this is the most common gotcha and the reason people quit halfway.)

Then Archive, not Delete. Archiving keeps the messages in All Mail; deleting moves them to Trash. For promotions older than a year, archive is enough — they’ll never come up in your inbox again unless you specifically search for them.

4. Delete the truly dead — with `older_than:` and a safety net

For the cohort that genuinely has no future use:

is:unread older_than:2y

Same “Select all conversations that match this search” link, then Delete.

Two things to know:

Gmail’s Trash auto-purges after 30 days. That’s your real undo window. If you delete 50,000 messages today, you have 30 days to walk into Trash and pull anything important back.
Deleting from Gmail does not touch earlier Google Takeout exports. If you’ve ever exported your mail with Takeout, that snapshot still exists wherever you saved it (Drive, for example). The Trash purge is a Gmail-only concept.
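It helps to put the end of that undo window on a calendar the day you delete. A tiny sketch of the date arithmetic, assuming the 30-day retention described above (the constant and function names are mine):

```python
from datetime import date, timedelta

TRASH_RETENTION_DAYS = 30  # Gmail's Trash retention, as described above


def trash_purge_date(deleted_on):
    """Return the date on which mail deleted on `deleted_on` leaves Trash."""
    return deleted_on + timedelta(days=TRASH_RETENTION_DAYS)


def days_left(deleted_on, today):
    """Days remaining in the undo window; 0 once the window has closed."""
    return max((trash_purge_date(deleted_on) - today).days, 0)


if __name__ == "__main__":
    deleted = date.today()
    print("check Trash before", trash_purge_date(deleted).isoformat())
```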

I deliberately did not delete anything younger than two years. The marginal value of “unread receipt from 2024” is small but non-zero — there’s still a chance you’ll need to find one. The marginal value of “unread newsletter from 2019” is zero.

5. Set up the maintenance you’ll actually keep

The cleanup is a one-day job. Maintenance is what keeps you from being here again in two years.

Three filters that survive long-term:

One filter per platform that sends transactional mail you don’t read in real time (Stripe receipts, Vercel deploy notifications, GitHub digest mails) → “Skip the Inbox” + “Apply label: Transactional.”
One filter for unsubscribe in body → “Apply label: Newsletter.” This labels every newsletter going forward without skipping the inbox; once a quarter you can sweep the label.
One filter for from:(*@yourdomain.com) → “Star” or “Mark as important.” Mail from your own domain to yourself is almost always something you actually wanted to act on.

Don’t go past three. Filter sprawl is the second-biggest cause of inbox bankruptcy, right after newsletter sprawl.

What I deliberately did not do

No third-party apps. No Mailstrom, no Clean Email. They work for the people they fit; they’re the wrong shape for a one-time cleanup.
No “inbox zero” rules. Inbox zero is a discipline, not a software problem. Either you’ll keep up or you won’t; no app changes that.
No deletion of mail younger than two years — too much chance of needing one of them.
No bulk-unsubscribe service. Most of them either MITM your unsubscribe (and re-sell the implied opt-in signal) or get blocked by sender reputation systems. Manual unsubscribe from the noisiest five senders, then filter the rest, beats a bulk service every time.
No DNS or sender-config changes. That’s a different problem — see the Inbox/DNS Pack and Inbox/DNS QuickCheck for the SPF/DKIM/DMARC side.

The boring summary

Survey before you touch anything.
Filter the recurring noise first, with “Also apply filter to existing matching conversations” checked.
Archive (not delete) the old Promotions cohort.
Delete the cohort older than two years, knowing the 30-day Trash window is your safety net.
Three maintenance filters, no more.

That’s the whole playbook. It’s enough for most personal Gmails carrying tens of thousands of unread.

Want this packaged?

If you’d rather have the survey script, the printable Markdown version of the cleanup order, the full filter templates, and the cohort-by-cohort cleanup-order I followed, that’s exactly the Inbox Cleanup Pack — $19 (pay-what-you-want, $9 minimum) on Gumroad.

If you’d rather hand me the counts-only survey.json from the script and get a written, prioritized cleanup plan tailored to your mailbox in 24 hours, that’s the $79 Inbox Cleanup QuickCheck. I never see message content; just the counts.

If you’re a small Workspace team with up to 10 mailboxes — typical pre-migration scenario — the $499 Enterprise tier handles it under your own internal-app OAuth path, no third-party permissions added.

Either way, the playbook above is the working answer for most personal Gmails. The product exists for the case where you’d rather pay $19 for a pre-written cleanup order than reverse-engineer one yourself, or pay $79 for a custom plan, or have a teammate run the same survey across 10 mailboxes before a migration.

The 30-day Trash window is your safety net. Use it.

— Rich

I wouldn't give a SaaS my Gmail to clean it. Here's the 30-line read-only alternative.

Mon, 11 May 2026 16:55:00 +0000

I sat down three weekends in a row to clean out my personal Gmail.

The first two weekends I did what most people do. I opened the “free inbox cleaner” tab everyone keeps tweeting about, read the OAuth consent screen — Read, compose, send, and permanently delete all your email from Gmail — closed the tab, and went back to scrolling. The cost felt wrong for a job I’d only run once.

The third weekend I wrote my own script. Survey-only, read-only, runs under my own Google account, no tokens ever leave my laptop. The actual cleanup, once I had a plan, took less than half an hour: I cleared 80,675 messages older than a year, archived another 14,000-odd, and built three filters that have kept the backlog at zero ever since.

This is what I learned about doing it without handing a stranger the keys to my mailbox.

The OAuth scope problem nobody wants to say out loud

A free email cleanup app is not, in fact, free.

When you click “Sign in with Google” and the consent screen asks for https://mail.google.com/ — that’s the all-of-Gmail scope. It’s not “look at counts.” It’s not “look at senders.” It’s read every message, write every message, delete every message, send mail as you to anyone. There is no narrower scope that lets a third-party app do bulk cleanup the way most of these tools do it.

A few honest consequences of granting that scope:

The app’s server can read message bodies, attachments, contacts, calendar invites, and 2FA codes at any time it holds a valid token. Most don’t advertise doing that. The capability exists either way.
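OAuth refresh tokens stay valid for months, often indefinitely, until they are revoked. Removing the app from your Google account dashboard invalidates its tokens from that point on, but it cannot recall anything that was read or copied while a token was still valid. If the vendor’s database was already scraped, the bird has flown.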
You are now an upstream dependency of every breach that vendor will ever have. The 2014–2024 history of mailbox-OAuth apps is not encouraging on that point — look up any of the big-name “smart inbox” companies and you’ll find at least one incident.

This isn’t a hit piece on any specific tool. I won’t name any. The economics of “free, ad-funded inbox cleaner with full mailbox OAuth” are the same regardless of who’s running it. The product is the inbox.

For a recurring assistant you trust — a calendar app, a CRM you live in — that scope is sometimes a fair trade. For a one-time cleanup, it isn’t. The right tool for a one-time job is one that doesn’t outlive the job.

Survey-then-Delete: the methodology

Here’s the method that actually worked. I call it Survey-then-Delete because reversing those two words is what causes most cleanups to fail halfway.

Survey, counts only. Don’t look at bodies. Don’t look at subjects. Don’t even pull message IDs. Just ask Gmail “how many messages match this query?” for a handful of useful queries — top senders, age cohorts, attachment sizes.
Identify the top 12 senders by volume. In every five-figure mailbox I’ve audited, fewer than 15 senders account for 40–70% of the noise, so expect the same concentration in yours.
Filter the recurring inbound first. For each of those senders, build a Gmail filter that skips the inbox, marks as read, and “Also apply filter to existing matching conversations.” That single checkbox is where most manual cleanups stall.
Bulk delete by sender and age cohort, not by clicking individual messages. Use Gmail’s from: and older_than: operators. The 30-day Trash window is your safety net — anything you delete is recoverable for 30 days.
Run three maintenance filters so you never have to do this again.
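The bulk-delete step comes down to composing search strings from a sender and an age cohort. Here is a small sketch that builds them from the operators used in this post. The helper names are hypothetical, and the output is meant to be pasted into Gmail's search bar, not sent to any API.

```python
def cohort_query(sender=None, older_than=None, unread_only=True, category=None):
    """Build a Gmail search string from the operators used in this post."""
    parts = []
    if unread_only:
        parts.append("is:unread")
    if sender:
        parts.append(f"from:({sender})")
    if category:
        parts.append(f"category:{category}")
    if older_than:
        parts.append(f"older_than:{older_than}")
    return " ".join(parts)


def deletion_plan(senders, older_than="1y"):
    """One search query per sender, in the order given (noisiest first)."""
    return [cohort_query(sender=s, older_than=older_than) for s in senders]


if __name__ == "__main__":
    for query in deletion_plan(["news@somecompany.com"], older_than="1y"):
        print(query)
```

Run each query, read the count, and only then use the select-all-matching link to delete.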

Notice what’s not on the list: no mailbox migration, no archive-everything panic, no “select all 80,000 and pray.” You don’t even need to know which individual messages you’re deleting. You’re operating on counts and senders, like a sysadmin culling logs, not on individual emails.

That’s the whole product worldview. Cleanup is a one-time cohort operation, and a third-party app with permanent mailbox access is overkill for it.

What the survey actually looks like

The survey step is the part the script does. It calls the Gmail API under your own OAuth — read-only scope, gmail.metadata plus gmail.readonly for counts — and writes a single survey.json to your laptop. No message bodies, no subjects, no message IDs. Just counts.

Here’s a redacted version of what one row looks like, the way the script renders it so you can read it before deciding anything:

sender                          count   oldest       recommended action
─────────────────────────────────────────────────────────────────────────────────────
news@<redacted-saas>.com       11,842   2017-03-12   filter+delete (>1y)
deals@<redacted-airline>.com    7,901   2014-08-19   filter+delete (>1y)
updates@<redacted-network>.com  5,617   2016-01-08   filter+delete (>1y)
no-reply@<redacted-bank>.com    3,402   2018-04-22   filter only (keep — statements)
receipts@<redacted-cart>.com    2,883   2019-06-30   filter only (keep — receipts)
hello@<redacted-newsletter>     2,114   2020-11-04   filter+delete (>2y)
… 6 more rows …                              
─────────────────────────────────────────────────────────────────────────────────────
top 12 senders                 51,883   covers 64.3% of unread >1y

That’s the whole output. Four columns, twelve rows, one summary line. With that table you can decide, in 90 seconds, which senders you want to filter and delete (most of them), which you want to filter only (anything with statements, receipts, security alerts), and which you want to leave alone (the remaining 36%-ish tail of senders you might still care about).
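The summary line is plain arithmetic: add up the top senders and divide by the total. A sketch of that calculation follows, using the six visible rows from the table and the 80,675 total. The `coverage` function is illustrative and is not the script's actual code.

```python
def coverage(sender_counts, total_unread, top_n=12):
    """Return (covered_count, percent_of_total) for the top_n senders."""
    top = sorted(sender_counts.values(), reverse=True)[:top_n]
    covered = sum(top)
    percent = round(100 * covered / total_unread, 1) if total_unread else 0.0
    return covered, percent


if __name__ == "__main__":
    # The six rows shown in the table above; the other six are elided there.
    visible_rows = {
        "news": 11842,
        "deals": 7901,
        "updates": 5617,
        "no-reply": 3402,
        "receipts": 2883,
        "hello": 2114,
    }
    print(coverage(visible_rows, 80675, top_n=6))
```

With all twelve rows filled in, the same function would give the 51,883 and 64.3% figures from the summary line.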

The actual deletion is a second pass — different command, explicit confirmation, dry-run by default. You read the count, you say yes, Gmail moves the cohort to Trash, the 30-day undo window protects you.

Safety properties, in plain English

This is the bit I want to be very precise about, because “we never see your mail” is something every cleaner says, and most of them are stretching.

Read-only OAuth at survey time. The survey command requests gmail.metadata + gmail.readonly. Those scopes cannot delete, send, or modify mail. Google enforces this at the API edge; it’s not a promise, it’s a permission.
Deletion runs under a separate, on-demand gmail.modify scope that you grant only when you actually want to delete, and revoke from your Google account afterwards in one click. The script doesn’t ask for mail.google.com/ (the all-powerful scope) — ever.
The OAuth client is yours. You create the Google Cloud project in your own account, paste the client ID and secret into a config file on your laptop. The tokens are written to a file in your home directory with 0600 permissions. They never touch our infrastructure. I literally cannot read your mail; the credentials only exist on your machine.
The Enterprise tier sidesteps the same problem differently: your IT admin publishes the script as an Internal app inside your Google Cloud organization, which means it’s exempt from Google’s app verification process and the 100-user cap, but also that there’s no “third-party app” to revoke — the script runs as you, on your own org’s Cloud project.
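For the curious, this is roughly what writing a token file with 0600 permissions looks like. It is a sketch under my own assumptions (a POSIX system, a JSON payload, a hypothetical function name), not a copy of the script.

```python
import json
import os
import stat
from pathlib import Path


def write_private_json(path, payload):
    """Write `payload` as JSON to `path`, accessible by the owner only."""
    path = Path(path)
    # Passing mode 0o600 to os.open sets permissions at creation time, so the
    # file is never briefly readable by others. O_TRUNC replaces old content.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(payload, handle)
    # If the file already existed with looser permissions, tighten them.
    os.chmod(path, 0o600)
    return path


def is_private(path):
    """True if group and others have no permissions on the file."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode & 0o077 == 0
```

You can run `is_private` against any credentials file on your machine to confirm that nobody but your own user can read it.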

If you’re the kind of person who reads OAuth scope strings before clicking through them — same — that’s the design.

The three ways to do this

Pick the one that matches how much DIY you want to wrangle.

$19 — Inbox Cleanup Pack (DIY). Get it on Gumroad → Pay-what-you-want, $9 floor. The same shell script I used (read-only survey + opt-in deletion), the Gmail filter templates, the exact cohort-by-cohort cleanup order, and the printable Markdown playbook. You run everything on your own laptop under your own Google Cloud OAuth client. No third-party permissions added to your account.

$79 — Inbox Cleanup QuickCheck (we write the plan). Buy on Stripe → You run the same survey script. You send me the survey.json file (counts only — no message bodies, no subjects, no IDs). I send back a written, prioritized cleanup plan tailored to your top senders, your age cohorts, and your tolerance for “delete vs archive.” Delivered within 24 hours, plus one async clarification pass within 14 days — up to 30 minutes’ worth of follow-up questions over email at support@richgibbs.dev.

$499 — Inbox Cleanup Enterprise (up to 10 Workspace mailboxes). Buy on Stripe → For pre-migration or pre-acquisition cleanups across a small team — typically a 2-to-10-person Google Workspace org. Your IT admin publishes our script as an Internal app under your own Cloud project (no third-party verification, no app-store entry, no shared tokens). You run the survey across up to 10 mailboxes, send the merged survey.json, and we write a per-mailbox plan plus the cross-mailbox patterns (shared newsletters worth bulk-filtering across the org, etc.). 5 business day SLA, one async clarification pass within 14 days — up to 30 minutes’ worth of follow-up questions, via email to support@richgibbs.dev. More details on the Inbox Cleanup service page.

All three deliverables are async-only. Email is the only follow-up channel.

If you came in through the deliverability rabbit-hole — receipts going to spam, password resets vanishing — the inbox problem is downstream of the outbox problem, and both are fixable in one sitting:

SPF, DKIM, DMARC for indie founders: the 20-minute checklist — the matching DNS-side hygiene pass for your sending domain.
Cloudflare Email Routing for indie founders: the 10-minute support@ setup — if you don’t even have a support@yourdomain.com yet, start here before you do anything else.

The short version

The “free inbox cleaner” model is a scope-creep trap for a job that runs once.
Survey-then-Delete: count first, identify the top 12 senders, filter the inbound, then bulk-delete by sender and age cohort.
Read-only OAuth at survey time; on-demand gmail.modify only when you’re actually deleting; tokens live on your laptop, not ours.
$19 if you want to run it yourself. $79 if you want me to write the cleanup plan. $499 if you need to do it across a small Workspace team without exposing tokens to a third party.

The 30-day Trash window is your safety net. So is reading the OAuth scope string before you click “Allow.”

— Rich

Tuck Sentinel — independent. Not affiliated with, endorsed by, or certified by Google, Yahoo, Microsoft, AWS, Cloudflare, Stripe, Tally, or any email or cloud provider.

DMARC aggregate reports without a SaaS: read your own rua XML in 30 minutes

Tue, 12 May 2026 14:45:00 +0000

You published a DMARC record. The rua=mailto: part is pointing at a real mailbox you actually read. Reports started arriving 24 hours later. They are zipped XML files with names like google.com!yourdomain.com!1715472000!1715558400.zip, you cannot read them, and every blog post you find tells you to sign up for Postmark, Valimail, dmarcian, EasyDMARC, or some other $20–$200/month SaaS to “decode” them.

You don’t need any of that. The DMARC aggregate-report format is a stable, well-defined XML schema published in RFC 7489 (§7.2), and a working reader takes about 120 lines of stdlib-only Python — no extra packages, no API keys, runs in cron on the same $20 VPS that already runs your mail.

This post is the reading half of that story. What the reports actually contain, what every field means in practice, the 20-line skeleton to walk the XML yourself, what the full reader adds beyond the skeleton, and the cron-friendly workflow that makes the data actionable. If you ever decide you want a SaaS, you will be a much better customer for it.

Why DMARC aggregate (`rua`) reports exist

DMARC, defined in RFC 7489, is a policy layer on top of SPF and DKIM. A receiver (Gmail, Microsoft, Yahoo, Apple, ProtonMail, …) checks each incoming message and decides three things: does SPF pass and align with the RFC5322.From domain, does DKIM pass and align, and what does the domain owner’s published _dmarc TXT record say to do when neither aligns.

The receiver acts on every message immediately. But the domain owner (you) has no idea what happened until somebody complains. Aggregate reports close that loop. RFC 7489 §7.2 describes them; in paraphrase:

Aggregate reports are most useful when they all contain the same data; thus this section describes a single report format, generated daily, sent via email, encoded as XML.

In practice, every major receiver who supports DMARC sends one aggregate report per UTC day per sender domain they saw mail from, to the address (or addresses) listed in the rua= tag of your _dmarc record. Each report says: “Here is every source IP that claimed to be sending as your domain in the last 24 hours, how many messages each one sent, and what we decided about each one.” It does not contain message bodies, subjects, recipients, or any other PII. It is metadata only.

That is what makes the format safe to receive, store, and parse on a $20 VPS. Failure reports (ruf=, separate spec) sometimes carry redacted message content; aggregate reports do not. We are talking about rua only.
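The attachment name already tells you who sent the report and which day it covers. A minimal sketch that splits a name like google.com!yourdomain.com!1715472000!1715558400.zip into its parts, assuming the receiver!domain!begin!end layout seen in that example (function names are mine):

```python
from datetime import datetime, timezone


def _leading_digits(text):
    """Return the run of digits at the start of `text` (may be empty)."""
    digits = ""
    for ch in text:
        if not ch.isdigit():
            break
        digits += ch
    return digits


def _to_utc(seconds):
    """Convert epoch seconds (as a string) to an aware UTC datetime."""
    return datetime.fromtimestamp(int(seconds), tz=timezone.utc)


def parse_report_name(filename):
    """Split 'receiver!domain!begin!end...' into names and UTC datetimes."""
    fields = filename.rsplit("/", 1)[-1].split("!")
    if len(fields) < 4:
        raise ValueError(f"unexpected report name: {filename!r}")
    receiver, domain = fields[0], fields[1]
    # The end field may still carry ".zip" or ".xml.gz"; keep only the digits.
    begin, end = _leading_digits(fields[2]), _leading_digits(fields[3])
    if not begin or not end:
        raise ValueError(f"missing timestamps in: {filename!r}")
    return receiver, domain, _to_utc(begin), _to_utc(end)


if __name__ == "__main__":
    name = "google.com!yourdomain.com!1715472000!1715558400.zip"
    receiver, domain, begin, end = parse_report_name(name)
    # Covers 2024-05-12 00:00 UTC to 2024-05-13 00:00 UTC.
    print(receiver, domain, begin.isoformat(), end.isoformat())
```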

The shape of a real aggregate XML report

A real Gmail aggregate report, opened in a text editor after unzipping, looks roughly like this (one record shown; a real report typically has 5–50):

<feedback>
  <report_metadata>
    <org_name>google.com</org_name>
    <email>noreply-dmarc-support@google.com</email>
    <report_id>1234567890123456789</report_id>
    <date_range>
      <begin>1715472000</begin>
      <end>1715558400</end>
    </date_range>
  </report_metadata>
  <policy_published>
    <domain>yourdomain.com</domain>
    <adkim>r</adkim>
    <aspf>r</aspf>
    <p>quarantine</p>
    <sp>quarantine</sp>
    <pct>100</pct>
  </policy_published>
  <record>
    <row>
      <source_ip>50.31.156.6</source_ip>
      <count>42</count>
      <policy_evaluated>
        <disposition>none</disposition>
        <dkim>pass</dkim>
        <spf>pass</spf>
      </policy_evaluated>
    </row>
    <identifiers>
      <header_from>yourdomain.com</header_from>
    </identifiers>
    <auth_results>
      <dkim>
        <domain>yourdomain.com</domain>
        <selector>pm</selector>
        <result>pass</result>
      </dkim>
      <spf>
        <domain>pm-bounces.yourdomain.com</domain>
        <result>pass</result>
      </spf>
    </auth_results>
  </record>
</feedback>

Every aggregate report from any receiver has the same top-level shape: one <feedback> element, one <report_metadata> block, one <policy_published> block, and one <record> per source IP per disposition outcome. The schema is fixed by RFC 7489 Appendix C; receivers don’t get to invent new fields.

The report decoder ring

Once you have the XML in front of you, three fields do most of the work.

<source_ip> is the IP address the receiver saw the message arrive from. If it is one of your sending platform’s IPs (a Postmark, Resend, Mailgun, SES, ConvertKit, Mailchimp range), that is good. If it is an IP you have never heard of and the count is non-trivial and alignment failed, that is either a forwarder you forgot about or somebody actively spoofing your domain. Both are worth investigating, but in 2026, the boring forwarder explanation is the answer about 95 % of the time.

<policy_evaluated> is the receiver’s verdict on this batch of messages. Three sub-fields matter:

<disposition> — what the receiver did. none means delivered normally; quarantine means spam-foldered; reject means refused at SMTP time. This is the applied outcome, after any pct= ramp and local override.
<dkim> — whether DKIM passed and aligned with the RFC5322.From domain in <identifiers><header_from>.
<spf> — same, for SPF alignment (the MAIL FROM / Return-Path domain must align with header_from).

The single most common indie-founder confusion is the difference between <auth_results> (the raw SPF/DKIM verification result on whatever domains the message presented) and <policy_evaluated> (whether those results aligned with the visible From: domain). A message can have <auth_results><dkim><result>pass</result></dkim> and still show <policy_evaluated><dkim>fail</dkim> — DKIM technically passed, but the signing domain was mailgun.org instead of your domain, so DMARC alignment failed. That is the most common deliverability bug in this whole article. Fix it by enabling “Custom domain DKIM” on the offending provider.
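You can find exactly those records mechanically: DKIM verified in <auth_results>, yet <policy_evaluated> still says fail. The sketch below uses only the element names from the sample report above; the function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def unaligned_dkim_passes(xml_text):
    """Yield (source_ip, header_from, signing_domains) for every record where
    a DKIM signature verified but DMARC's DKIM alignment still failed."""
    root = ET.fromstring(xml_text)
    for rec in root.findall("record"):
        evaluated = rec.findtext("row/policy_evaluated/dkim", default="")
        if evaluated.strip().lower() != "fail":
            continue  # DKIM aligned, nothing to report
        # Collect the domains of all signatures that verified.
        passing = [
            (sig.findtext("domain") or "").strip().lower()
            for sig in rec.findall("auth_results/dkim")
            if (sig.findtext("result") or "").strip().lower() == "pass"
        ]
        if passing:
            yield (
                rec.findtext("row/source_ip", default="?"),
                rec.findtext("identifiers/header_from", default="?"),
                passing,
            )
```

Each tuple names a provider that signs with its own domain instead of yours, which is the one to switch to custom-domain DKIM.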

<header_from> under <identifiers> is the RFC5322.From domain — what the recipient sees. If this is ever a domain other than yours (a subdomain you forgot, an old sending domain), every alignment decision in the same record is being judged against that domain, not your apex.

RFC 7960 (“Interoperability Issues Between DMARC and Indirect Email Flows”) is the official, RFC-blessed description of why honest forwarders break DMARC alignment — mailing lists, forward-to-personal-inbox aliases, and any hop that rewrites headers will show <policy_evaluated><dkim>fail</dkim> on aggregate reports while not being malicious. That is the moment to read the ARC spec, RFC 8617, and decide whether to enable ARC on your forwarder or just stop forwarding mail you publish DMARC for.

A 20-line stdlib-only skeleton

You can read every report on disk with nothing but the Python standard library. Here is the smallest correct skeleton that walks every record in a single XML file:

import xml.etree.ElementTree as ET
from pathlib import Path

def walk(path: Path):
    tree = ET.parse(path)
    root = tree.getroot()
    org = root.findtext("report_metadata/org_name", default="?")
    dom = root.findtext("policy_published/domain", default="?")
    for rec in root.findall("record"):
        ip    = rec.findtext("row/source_ip", default="?")
        count = int(rec.findtext("row/count", default="0"))
        disp  = rec.findtext("row/policy_evaluated/disposition", default="?")
        dkim  = rec.findtext("row/policy_evaluated/dkim", default="?")
        spf   = rec.findtext("row/policy_evaluated/spf", default="?")
        hfrom = rec.findtext("identifiers/header_from", default="?")
        yield (org, dom, ip, count, disp, dkim, spf, hfrom)

for f in Path("reports").glob("*.xml"):
    for r in walk(f):
        print("\t".join(map(str, r)))

That is the whole reading layer. Run it against a directory of un-gzipped XML reports and you have a tab-separated table you can pipe into awk, sort -k4 -n, or just grep fail.

What a full reader adds on top of this 20-line skeleton — and what the paid pack ships pre-built — is:

Transparent .gz, .zip, and raw-.xml handling (receivers disagree on compression; some send a .zip containing an .xml, some send a .xml.gz, Microsoft used to email both).
Grouping by source domain and by sending sub-domain, so the report says “Postmark sent 1,420 messages on your behalf today, all aligned” instead of one row per IP.
Disposition rollups: how many none vs. quarantine vs. reject per sender, per day.
ARC results from <auth_results> (per RFC 8617), so legit forwarders are flagged as “ARC-rescued, ignore” instead of “DKIM fail, panic.”
A multi-day rolling view so a one-bad-day spike does not page you but a seven-day trend does.
An “unknown sender” alert for any source IP that has never appeared in your historical reports and is sending more than N messages a day.
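The first item on that list is the easiest to sketch with the standard library alone. This is my own minimal version of transparent compression handling, keyed on the file extension. It is an assumption about one reasonable approach, not the pack's implementation.

```python
import gzip
import zipfile
from pathlib import Path


def read_report_xml(path):
    """Return a list of XML byte strings found in one report attachment."""
    path = Path(path)
    name = path.name.lower()
    if name.endswith(".zip"):
        # A zip may hold more than one XML member; return them all.
        with zipfile.ZipFile(path) as archive:
            return [
                archive.read(member)
                for member in archive.namelist()
                if member.lower().endswith(".xml")
            ]
    if name.endswith(".gz"):
        with gzip.open(path, "rb") as handle:
            return [handle.read()]
    if name.endswith(".xml"):
        return [path.read_bytes()]
    return []  # not a report attachment we recognise
```

Each returned byte string can go straight into `xml.etree.ElementTree.fromstring`, so the skeleton's record loop works unchanged once it parses bytes instead of a path.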

The 20-line skeleton is enough to learn the data. The 120-line full reader is what you keep in cron.

A cron-friendly daily workflow

Once you can read the reports, the workflow is short.

Use a dedicated mailbox. Point rua=mailto:dmarc-reports@yourdomain.com at an alias you do not read directly. Cloudflare Email Routing forwarding into a labeled Gmail folder works perfectly for this; so does a Postfix .forward into a Maildir on the same VPS. Google’s own Workspace Admin Help DMARC guide recommends the same separation.
Fetch on a schedule. Pull the new attachments out of that mailbox once an hour. IMAP, the Gmail API (under your own internal-app OAuth client), or a simple notmuch new + maildir scan all work. Drop the attachments under ~/dmarc/reports/.
Parse and roll up. Run the reader nightly. Append a row per (date, sender_domain, source_ip, count, dkim_aligned, spf_aligned, disposition) to a CSV or SQLite file. This is the historical record you query when something breaks.
Alert only on change. Mail yourself when (a) a brand-new source IP appears and sends more than ~50 messages, (b) a previously-aligned sender’s DKIM-alignment rate drops below 95% for two consecutive days, or (c) any disposition=reject count goes above zero for a sender you care about.
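The three alert rules in the last step can be sketched as one function over the rolled-up rows. Everything named here (Row, find_alerts, the watched set) is hypothetical and only loosely mirrors the columns listed above; rule (b) is simplified to "below the threshold yesterday and today" without checking that the sender was aligned before that.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    date: str             # ISO date such as "2026-05-10", so string order is date order
    sender_domain: str
    source_ip: str
    count: int
    dkim_aligned: bool
    disposition: str      # "none", "quarantine", or "reject"


def dkim_rate(rows, sender, date):
    """Share of a sender's messages on one day that were DKIM-aligned, or None if no mail."""
    day = [r for r in rows if r.sender_domain == sender and r.date == date]
    total = sum(r.count for r in day)
    if total == 0:
        return None
    return sum(r.count for r in day if r.dkim_aligned) / total


def find_alerts(rows, today, yesterday, watched, new_ip_threshold=50, min_dkim_rate=0.95):
    alerts = []
    todays = [r for r in rows if r.date == today]
    known_ips = {r.source_ip for r in rows if r.date < today}

    # (a) a never-seen source IP sending more than the threshold
    per_ip = defaultdict(int)
    for r in todays:
        per_ip[r.source_ip] += r.count
    for ip, n in sorted(per_ip.items()):
        if ip not in known_ips and n > new_ip_threshold:
            alerts.append(f"new source {ip} sent {n} messages")

    # (b) DKIM alignment under the threshold two days running
    for sender in sorted({r.sender_domain for r in todays}):
        rates = [dkim_rate(rows, sender, d) for d in (yesterday, today)]
        if all(rate is not None and rate < min_dkim_rate for rate in rates):
            alerts.append(f"{sender} DKIM alignment below {min_dkim_rate:.0%} for two days")

    # (c) any rejects at all for a sender you care about
    rejected = defaultdict(int)
    for r in todays:
        if r.disposition == "reject":
            rejected[r.sender_domain] += r.count
    for sender, n in sorted(rejected.items()):
        if sender in watched and n > 0:
            alerts.append(f"{sender} had {n} rejected messages")
    return alerts
```

An empty list means no mail gets sent, which is the whole point of alerting on change rather than on volume.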

That is the entire pipeline. There is no dashboard, no per-domain license, no “trust score.” The data is the data.

When you actually do need a SaaS

Be honest: the boring DIY pipeline above is correct for a single domain, one or two sending sub-domains, and 5–50 reports a day. The point at which a SaaS starts pulling its weight is roughly:

More than ~5 active domains, especially if a deliverability team wants a shared dashboard.
High-volume marketing senders (>500k messages/month) where you want forensic (ruf=) reports correlated with bounce categories.
Anything that needs SPF/DKIM hygiene enforced across an org with 50+ employees and rotating contractors.
Compliance contexts (Microsoft anti-spam configuration docs are worth reading here too) where someone external wants an audit trail of the policy itself.

For one indie founder with one domain and three senders? You are the worst customer dmarcian will ever have. Read your own reports.

If you want the full Python reader in one bundle, the DMARC Quarantine Pack ($29 on Gumroad) has it: gzip- and zip-aware parsing, sub-domain rollups, ARC handling, the unknown_sender alert function, three real incident walkthroughs (marketing-tool DKIM drift, forgotten sub-domain, forwarder/ARC breakage), and a DSN decoder cheat-sheet for Gmail 5.7.26 and Microsoft 5.7.509 / 5.7.515. 14-day refund, no questions.

SPF, DKIM, DMARC for indie founders: the 20-minute checklist — the prerequisite. Publish a sane _dmarc record and a rua= target first; then the aggregate reports in this post will actually start arriving.
Cloudflare Email Routing for indie founders: the 10-minute support@ setup — the cleanest way to give your dmarc-reports@yourdomain.com alias a real destination without paying for a Workspace seat, and the post that explains the one forwarder hop ARC (RFC 8617) is designed to rescue.
Related downloadable pack: DMARC Quarantine Pack — $29 on Gumroad — the full single-file Python reader, three real-incident walkthroughs, and the DSN decoder cheat-sheet for when DMARC moves from p=none to p=quarantine and a specific sender starts getting bounced. 14-day refund, no questions.

Security audit vs penetration test: which one does an indie founder actually need?

Thu, 14 May 2026 14:30:00 +0000

You shipped a product. It runs on one or two VPS, maybe a Postgres, a Stripe webhook, a small Workspace tenant, and a marketing site. A prospect asks for “your latest pen test.” A compliance template asks if you’ve had a “security audit in the past 12 months.” A vendor pitches you a $9,000 engagement. A friend tells you to “just run a Nessus scan.” Three of those four things are different jobs, and one of them is barely a job at all.

This guide is the plain-English version of security audit vs penetration test for an indie founder or a 1-5 person SaaS team, in 2026, with no security staff and no compliance department behind you. It tells you what each thing actually is, what each one is for, what each one costs roughly, and how to pick. It is deliberately short on jargon and long on “use this when, use that when.”

We are also not selling you a $9,000 engagement. The whole point is that most indie founders need the cheaper, read-only thing first, and only sometimes need the more expensive intrusive thing later.

The actual definitions (from the standards, not from a sales deck)

The cleanest split comes from NIST Special Publication 800-115, Technical Guide to Information Security Testing and Assessment, which divides “security testing and examination” into three top-level techniques:

Review techniques — documentation review, log review, configuration review, network sniffing, file integrity checking. Non-intrusive. The system stays untouched. (NIST SP 800-115, §3, “Review Techniques.”) https://csrc.nist.gov/pubs/sp/800/115/final
Target identification and analysis techniques — network discovery, port and service identification, vulnerability scanning, wireless scanning. Mostly non-intrusive, can be noisy. (NIST SP 800-115, §4.)
Target vulnerability validation techniques — password cracking, penetration testing, social engineering. Actively exploit the things the scanner found, to prove they are real. Intrusive by design. (NIST SP 800-115, §5.)

In everyday founder language:

A security audit (sometimes “security assessment,” “security review,” “configuration audit”) is mostly review plus vulnerability scanning. Read-only. Nobody is trying to log into your box without permission. It asks: given the configuration you have, what would a competent attacker probably notice first? The deliverable is a prioritized fix list.
A penetration test is validation. With your written permission, a human tester (or a team) actually tries to break in — exploit a finding, chain two boring misconfigs into one bad outcome, see how far they can get from the outside without your help. The deliverable is a report of what they actually achieved, with reproduction steps, plus the fix list.

OWASP’s Web Security Testing Guide (WSTG) makes the same distinction in its introduction: a vulnerability assessment lists potential issues, while a penetration test attempts to exploit them and demonstrate impact. https://owasp.org/www-project-web-security-testing-guide/stable/

CIS Controls v8 puts penetration testing in its own control (Control 18, “Penetration Testing”), separate from the configuration/audit controls earlier in the framework — it’s deliberately the last control, on the assumption you’ve already done the boring read-only hygiene of the first seventeen. https://www.cisecurity.org/controls/v8

That ordering is the answer to “which one do I need first,” for almost every indie founder reading this post.

What an indie-founder security audit actually looks like

For a 1-5 person SaaS team running on a VPS or two (or EC2 / Lightsail / Hetzner / DigitalOcean / Fly / Render), a read-only security audit usually covers, at a minimum:

SSH posture. Key-only auth, no password login, no root login, sane MaxAuthTries, sane LoginGraceTime, your actual authorized_keys files, stale keys from ex-contractors.
Firewall / security-group state. What ports are actually open to 0.0.0.0/0 versus what you think are open. SG rules that opened temporarily during an incident and never closed.
Patch state. Whether unattended-upgrades (or the equivalent) is actually running, when the last reboot was, how many security updates are pending.
Instance metadata posture. IMDSv2 required (on AWS), no v1 fallback, sensible hop limits. (See the EC2 hardening checklist below for the migration path.)
TLS posture. Cert expiry windows, weak ciphers, the Cloudflare-origin cert versus the in-host cert, SNI/origin drift.
Container exposure. Whether the Docker socket is mounted into anything web-facing; whether any container runs as root with a public port bound to 0.0.0.0.
Email DNS hygiene. SPF/DKIM/DMARC published, aligned, and not contradicting each other for any of the senders you actually use.
Backup posture. Whether the off-host copy exists, whether anyone has ever tested a restore, retention windows.
Logging posture. Whether auditd / journald are actually retaining anything useful, whether logs ship off-host, whether they survive a host being terminated.
OAuth and third-party access. Which third-party SaaS still has live tokens against your Workspace mailbox / Drive / Calendar. The Zapier from 18 months ago, the contractor’s n8n, the AI summarizer somebody clicked through.

None of that requires the auditor to exploit anything. Most of it the auditor can read off the host with a short-lived, read-only SSH key plus a few API tokens scoped to Describe* actions, plus a handful of externally observable signals (DNS, TLS handshakes, HTTP headers).
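As an illustration of how read-only this can be, here is a rough sketch of the SSH posture check. It assumes the auditor captured the effective configuration with sshd -T (which prints one lowercase "keyword value" pair per line) and passes that text in. The function name ssh_findings and the MaxAuthTries ceiling of 4 are choices made for this example, not a standard.

```python
def ssh_findings(effective_config: str, max_auth_tries: int = 4) -> list[str]:
    """Flag the SSH settings from the list above, given `sshd -T` output as text."""
    settings = {}
    for line in effective_config.splitlines():
        key, _, value = line.strip().partition(" ")
        if key:
            # Keep the first value seen for each keyword.
            settings.setdefault(key.lower(), value.strip().lower())

    findings = []
    # A missing keyword is reported too: better a false alarm than a silent gap.
    if settings.get("passwordauthentication") != "no":
        findings.append("password login is not disabled")
    if settings.get("permitrootlogin") != "no":
        findings.append("root login is not fully disabled")
    tries = settings.get("maxauthtries", "")
    if not tries.isdigit() or int(tries) > max_auth_tries:
        findings.append(f"MaxAuthTries is above {max_auth_tries} or unset")
    return findings
```

Nothing in it logs in, guesses a password, or changes a file; it only reads text the host already produced. Stale authorized_keys entries still need a human to decide who should have access.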

Cost range for an indie-scale read-only audit on one host in 2026: roughly $100–$500 for a fixed-scope one-shot deliverable (our $149 VPS/EC2 Hardening QuickCheck sits at the low end of that), and a few thousand for a thorough multi-host audit with a written report. The expensive consultancy engagements you see quoted at $5k–$15k are typically wrapping an audit and a light pen test together, plus several hours of advisory time.

There is also a free version of the externally observable subset — DNS records, exposed ports, TLS posture, HTTP security headers, public IMDS reachability — which you can run yourself in a few minutes; see QuickCheck Mini for the script-only version we ship.

What a penetration test actually looks like

A real pen test is somebody (or a small team) given a defined scope, a defined window, and written authorization, trying to compromise that scope. They might:

Phish your contractors using a lookalike domain (only if you signed off on social-engineering scope).
Find an exposed .env file in a misconfigured nginx alias and pivot to your database.
Find an outdated dependency, write or borrow an exploit, and get a shell.
Chain a low-severity SSRF in your app into AWS credential theft via IMDSv1 (this is why IMDSv2-only is non-negotiable).
Try to escalate from a low-privilege application user to root on the host.

The deliverable is what they achieved, with full reproduction steps, screenshots, and a prioritized remediation list. NIST SP 800-115 §5.2 frames this as the “Planning / Discovery / Attack / Reporting” four-phase pattern that almost every pen-test methodology since has adopted. CIS Controls v8 Control 18 explicitly states the goal is “to identify vulnerabilities and attack vectors that may be used to exploit enterprise systems.”

That is much more expensive than an audit, for three reasons:

It is human-time-intensive. A meaningful external pen test of one small SaaS is one to two weeks of senior practitioner time, not an automated scan.
It carries production risk. Even a careful tester occasionally trips a fail2ban rule, fills a log volume, or knocks a small VPS over. You need monitoring and a rollback plan.
The report has to be defensible. Reproduction steps, CVSS-style severity (or equivalent), screenshots, retest. The write-up alone is usually a third of the engagement.

Cost range for a small-scope external SaaS pen test in 2026: roughly $5,000–$15,000 for a one-or-two-week engagement covering a single web application plus its hosting surface. More for cloud-environment pen tests or anything involving social engineering.

When the audit is the right answer (which is most of the time)

For an indie founder, ask for the audit first if any of these are true:

You have never had anyone look at the box other than you.
A prospect asked for “evidence of security review” or your last security questionnaire response.
You are about to enable a new sender (marketing tool, transactional ESP) and your DMARC is at p=none.
You inherited the infrastructure from a co-founder or contractor who left.
You just migrated to a new VPS or cloud account and want to know what carried over wrong.
A friend told you to “run a pen test” and you don’t yet have a fix list to validate against.
You are pre-revenue or sub-$1M ARR and your security budget is “one weekend per quarter.”

The reason is simple: a pen test against a host that has never had a configuration audit will mostly produce findings that the audit would have produced for a fraction of the cost. You pay a senior tester $200/hour to discover that SSH still allows password login, that IMDSv1 is enabled, and that DMARC is at p=none. That is not a good use of your money or the tester’s time.

CIS Controls v8 makes the same point structurally: penetration testing is Control 18, after asset inventory, secure configuration, vulnerability management, audit logging, account management, and the rest. The intended flow is audit → fix → audit again → then pen test, to see what got past the audit.

When the pen test is the right answer

Ask for the pen test when:

You’ve already done the audit, fixed the findings, and want to know what an attacker would still notice.
A specific enterprise customer or insurer is contractually requiring an external pen test letter.
You handle regulated data (PCI scope, HIPAA, certain regional privacy frameworks) where pen testing has a specific cadence requirement.
Your application has meaningful auth/authz complexity (multi-tenant, SSO, role hierarchies) that a configuration audit cannot meaningfully validate without exploiting it.
You’re raising and a sophisticated investor’s diligence shop is going to run their own tester against you anyway, and you’d rather be the one who knows the findings first.

If none of those apply, the honest answer most indie-scale auditors will give you — and the one this site gives — is do the audit first.

A quick decision table

Situation	Probably the right ask
“I’ve never had anyone look at the box.”	Read-only audit
“Prospect wants ‘evidence of security review.’”	Read-only audit + letter
“We’re about to go from $19/mo Postgres to $999/mo customer.”	Audit now, pen test in 6 mo
“Compliance says pen test annually.”	Pen test (after audit)
“We just migrated cloud accounts.”	Read-only audit
“Auditor already gave us a fix list six months ago.”	Pen test, against the fix list
“Insurer is requiring a pen-test letter for renewal.”	Pen test (scoped to their ask)
“We don’t even know what we run.”	Asset inventory, then audit

What to deliberately ignore in v1

You do not need, on day one:

A SOC 2 Type II report. That is a control-attestation engagement, not a security test. Different deliverable, different price tier, different timeline.
A red-team engagement. That is pen testing with social engineering and stealth requirements bolted on. Massively more expensive, only useful once the audit and pen-test layers are mature.
A bug-bounty program. Useful for a public-facing app that is already hardened. A bug bounty on day one mostly buys you reports about missing security headers from people farming for $50.

You also do not need a dedicated SIEM, a managed-detection vendor, or a third-party “compliance platform” subscription until you have actual customers asking for one of those things by name.

Common indie-founder gotchas

These are the failure modes that show up most often when somebody asks “audit or pen test?” for a 1-5 person team:

Asking for a pen test when the real ask is an audit. The deliverable comes back full of “informational” findings about TLS posture and SSH config, and the founder concludes “pen tests are useless.” They aren’t — that was an audit wearing a pen-test invoice.
Asking for an audit when the real ask is a SOC 2 readiness assessment. Different deliverable. An audit will not get you a SOC 2 report.
Letting a vendor define “audit” to mean “we ran a Nessus scan against your IP.” A scan is one input to an audit, not the audit itself. Without the configuration review and the human prioritization, you’re getting a 200-page PDF of CVE noise.
Pen testing without an asset inventory. The tester scopes “your production environment” and you forget the marketing site on Fly, the staging box on Hetzner, and the contractor’s old IP. The actual attack surface goes untested.
Treating an audit letter as a deliverability or compliance shortcut. A clean audit letter says you passed an audit. It does not say your DMARC is green or that you are PCI compliant.

If any of those sound like a problem you’ve already had once and would rather not have again, that is exactly the kind of clarity a short, read-only, fixed-scope audit is for.

Encrypting Your EBS Root Volume Without Rebuilding the Server (AWS 2026)

Thu, 14 May 2026 23:30:00 +0000

You checked your EC2 console, opened the Volumes view, and noticed it: the Encrypted column on your root volume says No. You probably launched that instance from a community AMI a year or two ago, before AWS started defaulting to encrypted EBS in most regions. Everything since then — your OS, your app code, your customer data, your secrets — has been sitting on an unencrypted block device. Snapshots inherit that state, AMIs you shared with another account inherited it, and any future restore continues to inherit it until you do something about it.

This guide walks through how to fix that without rebuilding the box from scratch. The path is well-known to anyone who has done it before, and full of small AWS-specific traps if you haven’t. The summary: you cannot encrypt an existing EBS volume in place. You snapshot it, copy the snapshot with encryption enabled, create a new volume from the copy, and swap the root. That’s it — but every step has at least one way to get wrong on the first try.

It pairs naturally with the Ubuntu/Debian EC2 hardening checklist and the AWS IMDSv2 migration guide — same broader topic: tightening cloud posture on workloads that were launched before you cared about any of this.

Why this matters

The standard pushback on “we should encrypt EBS” is: the disk never leaves AWS’s data center, so what are we protecting against? That argument misses where the risk actually lives.

Snapshots and AMIs are portable. An unencrypted snapshot can be shared to another AWS account or made public with a single API call. An encrypted snapshot can’t — sharing requires KMS grants. The encryption flag is a hard guard against accidental cross-account data leakage, including the classic “I made this AMI public to debug something” mistake.
KMS gives you a second access-control layer. With unencrypted EBS, anyone with ec2:CreateVolume and ec2:AttachVolume on the snapshot can mount it on a new instance and read everything. With KMS, they also need kms:Decrypt and (often) kms:CreateGrant on the key. That separation is the single biggest practical reason to encrypt — it forces a deliberate second permission for “I want to actually read this data”.
Compliance, even the informal kind. SOC 2, ISO 27001, PCI, HIPAA, most state privacy laws, and most enterprise customer security questionnaires ask whether data at rest is encrypted. The honest answer for an unencrypted EBS volume is “no”. Encrypted-with-KMS gets you a clean “yes” with zero application changes.
Future-proofing. AWS now enables “Always encrypt new EBS volumes” by default in most regions for new accounts. Older accounts keep their old default. New volumes you create going forward will be encrypted; the old one keeps being the outlier. Migrating now eliminates the special case before it becomes the only special case left.

None of this changes how your app behaves. EBS encryption is transparent. The OS doesn’t know. The application doesn’t know. There is no measurable performance hit on current-generation instance types. The reason most people haven’t done it is that the migration is fiddly, not that the destination is.

What “enable default encryption” does and doesn’t fix

In EC2 → Account attributes → EBS encryption, there’s a setting called Always encrypt new EBS volumes. Turning it on is good and you should do it. But understand what it does:

✅ New volumes you create from scratch are encrypted.
✅ New snapshots you take of already-encrypted volumes stay encrypted.
✅ Volumes restored from encrypted snapshots stay encrypted.
❌ Existing unencrypted volumes are not retroactively encrypted.
❌ Snapshots of existing unencrypted volumes are not encrypted by default — they inherit the source’s state.

That last point trips people up. Once “default encryption” is on, you might assume taking a fresh snapshot of an old volume gives you an encrypted snapshot. It does not. Snapshots match the source volume’s encryption state. You have to make the encrypted copy explicit, which is exactly what the migration path below does.

So step zero is: flip on default encryption for the region (EC2 → Account attributes → EBS encryption → Always encrypt new EBS volumes), pick a default KMS key (aws/ebs is fine to start, or a CMK you control), and then deal with the old volume.
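If you would rather script step zero than click through the console, a minimal sketch follows. It assumes you pass in a boto3 EC2 client for the region you care about (the setting is per region); the function name ensure_default_encryption is made up for this example. The client is a parameter so the logic can be exercised without touching a real account.

```python
from typing import Optional


def ensure_default_encryption(ec2, kms_key_id: Optional[str] = None) -> bool:
    """Turn on EBS encryption by default for the client's region.

    `ec2` is expected to behave like boto3.client("ec2", region_name=...).
    Returns True if the on/off flag had to be flipped, False if it was already on.
    """
    flipped = False
    if not ec2.get_ebs_encryption_by_default()["EbsEncryptionByDefault"]:
        ec2.enable_ebs_encryption_by_default()
        flipped = True
    if kms_key_id is not None:
        # Optional: point the regional default at a key you control
        # instead of the AWS-managed aws/ebs key.
        ec2.modify_ebs_default_kms_key_id(KmsKeyId=kms_key_id)
    return flipped
```

Typical use would be ensure_default_encryption(boto3.client("ec2", region_name="us-east-1")), repeated for each region you run in. It does nothing for the old volume, which is the rest of this guide.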

The two migration paths

You have two viable options for the root volume itself. Pick based on how much downtime you can take and whether you want to swap instance types or AMIs at the same time.

Path A — Swap the root volume on the same instance (downtime path)

Same instance ID, same Elastic IP, same security groups, same instance profile, same launch template. You take a planned outage (typically 5-15 minutes), encrypt the volume, swap it back in, boot, verify.

Use Path A when:

The instance has external state (Elastic IP attached directly, hard-coded references to the instance ID, attached IAM role, manually-created security group memberships) you don’t want to redo.
You’re fine with a single short maintenance window.
You haven’t already planned an OS or AMI refresh.

Path B — Build a new instance from an encrypted AMI (rebuild path)

Snapshot → copy with encryption → create AMI → launch a fresh instance from the encrypted AMI → swap the Elastic IP (or DNS) → terminate the old box.

Use Path B when:

You also want to refresh the OS or instance type.
You can run two instances side-by-side briefly and cut over with a CNAME or EIP move.
You’d rather not stop a production box, even briefly.
The instance has accumulated hand-applied config you’d like to leave behind anyway.

Path B is more cleanup work but lower risk because the old volume stays untouched until you’re confident the new instance is healthy. Path A is faster and surgical.

The rest of this guide focuses on Path A, because that’s what most solo operators on a single-box deployment actually want. The same KMS copy step is the core of Path B; the difference is what you do with the resulting AMI afterwards.

Want a read-only audit that tells you which EBS volumes in your account are still unencrypted, plus open security groups, IMDSv1 stragglers, exposed IAM users, and a few other “you’d rather know” items? That’s exactly what QuickCheck is built for. One run, plain-English report, no install on your account.

Path A: Step-by-step

1. Inventory and prepare

First, capture the things you’ll need to put back.

INSTANCE_ID=i-0123456789abcdef0
REGION=us-east-1

aws ec2 describe-instances \
  --instance-ids "$INSTANCE_ID" \
  --region "$REGION" \
  --query 'Reservations[0].Instances[0].{
    AZ:Placement.AvailabilityZone,
    Root:RootDeviceName,
    RootVol:BlockDeviceMappings[?DeviceName==`/dev/xvda` || DeviceName==`/dev/sda1`].Ebs.VolumeId | [0],
    Type:InstanceType,
    AMI:ImageId
  }'

Write down AZ, Root (will be /dev/xvda or /dev/sda1 — the difference matters in step 5), and RootVol (the source volume ID).
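If you script this inventory instead of reading it off the CLI output, one option is to match the block device mapping against RootDeviceName rather than against a fixed list of device names. A small sketch, assuming you already hold one instance dict from a describe_instances response; the helper name root_volume_info is made up for this example.

```python
def root_volume_info(instance: dict) -> dict:
    """Pull AZ, root device name, and root volume ID from one describe_instances instance."""
    root_name = instance["RootDeviceName"]
    for mapping in instance.get("BlockDeviceMappings", []):
        if mapping["DeviceName"] == root_name and "Ebs" in mapping:
            return {
                "az": instance["Placement"]["AvailabilityZone"],
                "root_device": root_name,
                "volume_id": mapping["Ebs"]["VolumeId"],
            }
    raise LookupError(f"no EBS volume attached at root device {root_name}")
```

With boto3 the input would be ec2.describe_instances(InstanceIds=[instance_id])["Reservations"][0]["Instances"][0]. Matching on RootDeviceName means the same code works whether the AMI calls its root /dev/xvda or /dev/sda1.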

Pick the KMS key you’ll encrypt with. The default AWS-managed alias/aws/ebs is fine for most setups; use a customer-managed key (CMK) if you want explicit grant control or cross-account isolation later.

KMS_KEY_ID="alias/aws/ebs"   # or arn:aws:kms:us-east-1:<acct>:key/<uuid>

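Confirm the key exists, is enabled, and is a symmetric encrypt/decrypt key. This does not prove your caller has every KMS permission the migration needs (see the KMS pitfall further down), but it catches a disabled or mistyped key before the downtime clock starts: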

aws kms describe-key --key-id "$KMS_KEY_ID" --region "$REGION" \
  --query 'KeyMetadata.{KeyState:KeyState,KeyUsage:KeyUsage,Arn:Arn}'

KeyState must be Enabled and KeyUsage must be ENCRYPT_DECRYPT.

2. Stop the instance

EBS root volumes can only be detached when the instance is stopped. There is no online path for the root. Plan a maintenance window now.

aws ec2 stop-instances --instance-ids "$INSTANCE_ID" --region "$REGION"
aws ec2 wait instance-stopped --instance-ids "$INSTANCE_ID" --region "$REGION"

This is the start of your downtime clock. From here, the only thing protecting you is that the original volume still exists and is unmodified.

3. Snapshot the unencrypted volume

SRC_VOL=vol-0aaaabbbbccccdddd0   # from step 1

SNAP_ID=$(aws ec2 create-snapshot \
  --volume-id "$SRC_VOL" \
  --description "pre-encryption snapshot of $SRC_VOL" \
  --region "$REGION" \
  --query 'SnapshotId' --output text)

aws ec2 wait snapshot-completed --snapshot-ids "$SNAP_ID" --region "$REGION"
echo "Source snapshot: $SNAP_ID"

This snapshot is unencrypted (inherits source state). It’s also your rollback insurance for the rest of the migration — do not delete it until you’re done and verified.

4. Copy the snapshot with KMS encryption

This is the only step in the whole process where encryption actually happens. It’s a same-region copy with --encrypted and a key ID.

ENC_SNAP_ID=$(aws ec2 copy-snapshot \
  --source-region "$REGION" \
  --region "$REGION" \
  --source-snapshot-id "$SNAP_ID" \
  --description "encrypted copy of $SNAP_ID" \
  --encrypted \
  --kms-key-id "$KMS_KEY_ID" \
  --query 'SnapshotId' --output text)

aws ec2 wait snapshot-completed --snapshot-ids "$ENC_SNAP_ID" --region "$REGION"
echo "Encrypted snapshot: $ENC_SNAP_ID"

Confirm it’s actually encrypted:

aws ec2 describe-snapshots --snapshot-ids "$ENC_SNAP_ID" --region "$REGION" \
  --query 'Snapshots[0].{Encrypted:Encrypted,KmsKeyId:KmsKeyId}'

You want "Encrypted": true and a KmsKeyId ARN that matches the key you used.

5. Create the new encrypted volume in the right AZ

This is the single most common place to lose 20 minutes. The new volume must be in the same Availability Zone as the instance. EBS volumes are AZ-scoped; you can’t attach a us-east-1a volume to a us-east-1b instance.

AZ=us-east-1b   # from step 1

NEW_VOL=$(aws ec2 create-volume \
  --snapshot-id "$ENC_SNAP_ID" \
  --availability-zone "$AZ" \
  --volume-type gp3 \
  --encrypted \
  --kms-key-id "$KMS_KEY_ID" \
  --region "$REGION" \
  --query 'VolumeId' --output text)

aws ec2 wait volume-available --volume-ids "$NEW_VOL" --region "$REGION"
echo "New encrypted root: $NEW_VOL"

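Notes:

gp3 is the modern default. If your old volume was gp2, this is a fine moment to upgrade: gp3 is cheaper than gp2 per GiB. gp3 starts at a 3,000 IOPS / 125 MiB/s baseline regardless of size, so if the old volume was a large gp2 (over 1,000 GiB) or a provisioned-IOPS io1, check what it was delivering first and add --iops / --throughput as needed.
If you need a specific size (larger than the snapshot), add --size N in GiB. You can grow but not shrink.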
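Because the AZ check is the one people skip, it can be worth a tiny preflight before touching the old root. A sketch, assuming you pass in one volume dict from a describe_volumes response and the AZ you wrote down in step 1; the function name preflight_problems is made up for this example.

```python
def preflight_problems(volume: dict, instance_az: str) -> list[str]:
    """List reasons the new volume is not ready to become the root. Empty means go."""
    problems = []
    if volume["AvailabilityZone"] != instance_az:
        problems.append(
            f"volume is in {volume['AvailabilityZone']}, instance is in {instance_az}"
        )
    if not volume.get("Encrypted", False):
        problems.append("volume is not encrypted")
    if volume.get("State") != "available":
        problems.append(f"volume state is {volume.get('State')!r}, expected 'available'")
    return problems
```

Run it after the volume-available wait and before the detach in the next step, so a mistake costs you a volume recreate instead of extra downtime.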

6. Detach the old root, attach the new one

aws ec2 detach-volume --volume-id "$SRC_VOL" --region "$REGION"
aws ec2 wait volume-available --volume-ids "$SRC_VOL" --region "$REGION"

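# Use the root device name the API reported in step 1 (RootDeviceName), not a guess.
# It comes from the AMI: Amazon Linux AMIs typically use /dev/xvda, Ubuntu AMIs /dev/sda1.
ROOT_DEV=$(aws ec2 describe-instances \
  --instance-ids "$INSTANCE_ID" --region "$REGION" \
  --query 'Reservations[0].Instances[0].RootDeviceName' --output text)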

aws ec2 attach-volume \
  --instance-id "$INSTANCE_ID" \
  --volume-id "$NEW_VOL" \
  --device "$ROOT_DEV" \
  --region "$REGION"

aws ec2 wait volume-in-use --volume-ids "$NEW_VOL" --region "$REGION"

Get the root device name wrong and the instance will refuse to boot — it can’t find the kernel because EC2 looks at RootDeviceName to know what to chain-load. Check Root from step 1, not your assumptions.

7. Start the instance and verify

aws ec2 start-instances --instance-ids "$INSTANCE_ID" --region "$REGION"
aws ec2 wait instance-running --instance-ids "$INSTANCE_ID" --region "$REGION"

SSH in and confirm:

lsblk -f
# nvme0n1p1 (or xvda1) should be your root /, mounted, ext4/xfs, same UUID as before

mount | grep ' on / '
# Should show your root device

# Back on your workstation (where $NEW_VOL and $REGION are set):
aws ec2 describe-volumes --volume-ids "$NEW_VOL" --region "$REGION" \
  --query 'Volumes[0].{Encrypted:Encrypted,KmsKeyId:KmsKeyId}'
# Encrypted: true, KmsKeyId: matches your key

Run your app’s health check. If it had systemd services, systemctl --failed should be empty. Check that any external monitoring is green.

8. Clean up

Once you’re confident — wait a day or two on stateful boxes — delete the old unencrypted volume and the unencrypted snapshot.

aws ec2 delete-volume   --volume-id   "$SRC_VOL"   --region "$REGION"
aws ec2 delete-snapshot --snapshot-id "$SNAP_ID"   --region "$REGION"
# Keep $ENC_SNAP_ID — it's your encrypted baseline going forward.

Pitfalls to know up front

A few traps that have caused real outages:

AZ mismatch between volume and instance. Already covered, worth repeating. create-volume succeeds in whatever AZ you name; the failure shows up at attach-volume, as InvalidVolume.ZoneMismatch. Recreate the volume in the right AZ; cost is just the create/delete time.
Root device name /dev/xvda vs /dev/sda1 vs /dev/nvme0n1. The AWS API root device name is /dev/xvda or /dev/sda1. The kernel may surface the volume as /dev/nvme0n1. Use the API name for attach-volume --device; the kernel name is irrelevant at attach time.
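KMS permission gaps. If you’re using a CMK in a different account, or restricting your IAM role tightly, you need kms:Decrypt, kms:GenerateDataKeyWithoutPlaintext, kms:ReEncrypt*, kms:CreateGrant, and kms:DescribeKey somewhere in the chain. The failure is often asynchronous rather than an immediate API error: the copied snapshot ends up in the error state (check StateMessage in describe-snapshots), or the instance stops again right after start with a Client.InternalError state reason. Don’t grant kms:* on the key to make it go away; grant exactly those five.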
Forgetting the instance profile. Detaching and reattaching volumes does not touch the IAM role on the instance. But if you were using IMDSv1 and your migration coincides with an AMI change, double-check the role survived. Pair this work with the IMDSv2 migration and you only do the maintenance window once.
Application state outside the root volume. If your app keeps data on a secondary EBS volume, encrypt that one too with the same process. Root encryption alone leaves your real data unencrypted on disk, which is the worst possible posture: you get the operational cost of the migration with none of the actual data protection.
Backups inherit, but old backups don’t migrate. Once the volume is encrypted, AWS Backup / DLM snapshots inherit the encryption. Old snapshots from before today are still unencrypted. Either re-snapshot via the same copy-with-encryption trick or expire them through retention policy.
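For the KMS pitfall above, "exactly those five" can be written down as an identity policy. A minimal sketch: the function name, the Sid, and the single-key Resource scoping are choices made for this example, and your key policy still has to allow the account or role to use the key.

```python
import json

# The five actions named in the KMS pitfall above, and nothing else.
MIGRATION_KMS_ACTIONS = [
    "kms:Decrypt",
    "kms:GenerateDataKeyWithoutPlaintext",
    "kms:ReEncrypt*",
    "kms:CreateGrant",
    "kms:DescribeKey",
]


def migration_kms_policy(key_arn: str) -> dict:
    """Build an IAM policy document scoped to one KMS key."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "EbsEncryptionMigration",
                "Effect": "Allow",
                "Action": list(MIGRATION_KMS_ACTIONS),
                "Resource": key_arn,
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder ARN; substitute the key you picked in step 1.
    print(json.dumps(migration_kms_policy("arn:aws:kms:us-east-1:111122223333:key/EXAMPLE"), indent=2))
```

Attach the output to the role doing the migration and remove it afterwards if that role does not otherwise manage volumes.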

Rollback plan

You should write this down before you start, not figure it out at 11 p.m.

If the new volume won’t boot:

# 1. Stop the instance again
aws ec2 stop-instances --instance-ids "$INSTANCE_ID" --region "$REGION"
aws ec2 wait instance-stopped --instance-ids "$INSTANCE_ID" --region "$REGION"

# 2. Detach the new (broken) volume
aws ec2 detach-volume --volume-id "$NEW_VOL" --region "$REGION"
aws ec2 wait volume-available --volume-ids "$NEW_VOL" --region "$REGION"

# 3. Reattach the original unencrypted volume as the root
aws ec2 attach-volume \
  --instance-id "$INSTANCE_ID" \
  --volume-id "$SRC_VOL" \
  --device "$ROOT_DEV" \
  --region "$REGION"

# 4. Start the instance
aws ec2 start-instances --instance-ids "$INSTANCE_ID" --region "$REGION"

You are now back where you started, with the encrypted copy intact for forensic investigation of why the boot failed. Typical causes: the wrong root device name (the volume attached, but as a secondary device, leaving the instance without a root), a KMS permission gap that stops the instance right after it starts, or a source volume with a corrupt boot record that the snapshot faithfully preserved.

Do not delete $SRC_VOL or $SNAP_ID until rollback is no longer a concern — typically 24-72 hours of successful operation.

What this is not

To set expectations clearly:

This is not full-disk encryption against a local-machine threat model. KMS-encrypted EBS protects data at rest as managed by AWS. It does not stop a process running on the live instance from reading its own filesystem. For that, you need LUKS or filesystem-level encryption on top, with key management of your own — a separate project.
This is not a compliance attestation. Migrating to encrypted EBS is a posture improvement, not a certification. Your SOC 2 / ISO / HIPAA auditor will still want to see your key management policy, KMS key rotation status, and the inventory query that proves all volumes are encrypted, not just the one you remembered.
This is not a guarantee. EBS encryption closes a specific cross-account data leakage path and gives you a KMS-grant access boundary. It does not address misconfigured security groups, leaked long-lived IAM keys, S3 buckets without encryption, or application vulnerabilities. Treat it as one item on the list.
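The "inventory query" an auditor will ask for can be a few lines. A sketch, assuming a boto3 EC2 client for one region is passed in; the function name unencrypted_volumes is made up for this example. It asks the API to filter on the encrypted flag and re-checks each result locally.

```python
def unencrypted_volumes(ec2) -> list[str]:
    """Return the IDs of every unencrypted EBS volume visible to this regional client."""
    paginator = ec2.get_paginator("describe_volumes")
    pages = paginator.paginate(Filters=[{"Name": "encrypted", "Values": ["false"]}])
    found = []
    for page in pages:
        for volume in page["Volumes"]:
            # Belt and braces: trust the field on the volume, not only the filter.
            if not volume.get("Encrypted", False):
                found.append(volume["VolumeId"])
    return sorted(found)
```

Call it once per region with boto3.client("ec2", region_name=region). An empty list in every region you use is the evidence, and a non-empty one is your to-do list for the same snapshot, copy, and swap process.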

Encrypt EBS because it is cheap, well-understood, and removes a real foot-gun in your snapshot and sharing surface. Then keep going on the rest of the list.

QuickCheck CTA

If you’d rather not write the inventory queries to find every unencrypted volume across every account and region — and then chase down the missing KMS grants, the public AMIs, and the other quiet posture issues that pile up over time — QuickCheck runs a read-only, one-shot review of your AWS posture and produces a plain-English report. Unencrypted EBS volumes are one of the dozen items it surfaces, alongside open security groups, IMDSv1 stragglers, missing MFA on root, untagged keys, and a few other “you’d rather know” things. See an example in the sample report. Not magic, not a replacement for proper cloud security tooling, but a fast way to know where you stand.

About Tuck Sentinel


EC2 read-only hardening audit: what Inspector misses, and what to check by hand (2026)

Sat, 16 May 2026 14:05:00 +0000

You turned on AWS Inspector, you wired up IAM Access Analyzer, and the consoles are mostly green. Your EC2 fleet is two-to-five instances, the team is one-to-five people, and you’d like to believe the AWS-native tooling is enough.

This post is the honest answer to the question “is an Inspector scan the same thing as an EC2 read-only hardening audit?”: no. They overlap on maybe 30% of the surface that actually gets indie SaaS owned. The other 70% is instance-level configuration that AWS-native tooling, by design, doesn’t look at.

What follows is the read-only EC2 hardening audit a solo founder or small ops team can actually run in about an hour on a single instance: five categories of checks, each runnable from inside the box with no agent, no third-party SaaS, and no write access. None of it replaces Inspector. It is the other layer.

What AWS Inspector and Access Analyzer actually cover

It is worth being precise about what the native tools do well, because the gap is what we’re going to walk.

Amazon Inspector is a managed vulnerability-management service. It does three things, per the AWS docs: (1) continuously scans EC2 instances for software vulnerabilities and unintended network exposure, (2) scans container images in ECR, and (3) scans Lambda functions and layers. The EC2 scan is essentially CVE-package matching against the OS package inventory plus a network-reachability layer powered by the same engine as VPC Reachability Analyzer. See What is Amazon Inspector? for the canonical scope statement.

IAM Access Analyzer is a policy-and-resource analyzer. Per the Access Analyzer guide, it identifies resources in your account shared with external principals, validates IAM policies against best practices, and (in the newer “unused access” findings) flags unused permissions and roles. It does not look at anything inside an EC2 instance.

Both are valuable. Neither was designed to answer “is sshd accepting password auth on this box right now?”, because that’s not a control-plane question — it’s an instance-level configuration question, and the AWS control plane has no opinion about the contents of /etc/ssh/sshd_config.

The AWS Well-Architected Framework — Security Pillar is explicit that defense in depth requires both: the protect compute design principle calls out reducing attack surface, hardening operating systems, and enforcing service-level configuration as distinct from identity and detective controls.

That instance-level layer is what the rest of this post is about.

The five-check read-only EC2 hardening audit

Each of the five sections below is something a single SSH session can answer in under ten minutes. A few commands use sudo to read root-owned state, but none of them change anything. Read-only means: no apt install, no agent, no IAM changes, no Inspector activation toggles. You are just observing.

If you’d rather skip ahead and have somebody else run this against one host, that’s what the VPS/EC2 Hardening QuickCheck at the end of this post exists for. Otherwise, keep reading.

1. IMDSv2 enforcement (the SSRF gate Inspector won’t fail you on)

The Instance Metadata Service version 1 lets any process on the box — including any application code with an SSRF bug — fetch the instance’s IAM role credentials over plain HTTP with no token. IMDSv2 requires a session token obtained via PUT, which an SSRF attacker generally cannot mint.

AWS publishes the migration story at Configure the Instance Metadata Service for an existing instance. The instance attribute you want is HttpTokens=required. Pair it with HttpPutResponseHopLimit=1, which stops the token response from travelling more than one network hop, so container workloads behind a bridge network on the host can’t use the metadata service as a confused deputy. The full option reference is in Use IMDSv2.

Read-only checks from inside the box:

# Should 401 without a token (IMDSv2 enforced):
curl -s -o /dev/null -w "%{http_code}\n" \
  http://169.254.169.254/latest/meta-data/

# Should succeed with a token:
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 60")
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/iam/info

If the first call returns 200, IMDSv1 is still accepted on this instance and any SSRF in your app stack is one HTTP request away from your role credentials. Inspector will not fail this for you: its EC2 coverage is package vulnerabilities and network reachability, and the metadata endpoint is a link-local address that is never reachable from outside the instance. The check you want is the local one, above.
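
If you run this on more than one box, it helps to turn the status code into a verdict line. A minimal sketch; the function names and the wording of the verdicts are mine, and the two-second timeout is an arbitrary choice.

```shell
# Map the HTTP status of an unauthenticated metadata request to a verdict.
imds_verdict() {
  case "$1" in
    401) echo "PASS: token required (IMDSv2 enforced)" ;;
    200) echo "FAIL: IMDSv1 still accepted" ;;
    000) echo "UNKNOWN: no response (not on EC2, or endpoint unreachable)" ;;
    *)   echo "UNKNOWN: unexpected status $1" ;;
  esac
}

# Probe the metadata service without a token; give up after 2 seconds.
# curl prints 000 as the status when it gets no HTTP response at all.
imds_probe() {
  curl -s -o /dev/null -m 2 -w '%{http_code}' \
    http://169.254.169.254/latest/meta-data/ || true
}
```

On the instance, imds_verdict "$(imds_probe)" prints one line you can paste into the audit notes.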

2. SSH posture (the line sshd_config actually applies)

Inspector does not parse sshd_config. The CIS Benchmarks do: the OS-level CIS benchmarks for Ubuntu and Amazon Linux carry the SSH controls, while the CIS Amazon Web Services Foundations Benchmark covers the account layer. For a read-only audit you don’t need to map every CIS control; you need the handful that account for most real-world EC2 compromises.

sudo sshd -T | grep -Ei \
  '^(permitrootlogin|passwordauthentication|pubkeyauthentication|kbdinteractiveauthentication|permitemptypasswords|x11forwarding|clientaliveinterval|clientalivecountmax|maxauthtries|allowusers|allowgroups) '

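The sshd -T form dumps the effective runtime configuration, including Include files, which is the only honest way to audit SSH. Match blocks are only applied when you pass connection parameters with -C, for example sudo sshd -T -C user=ubuntu,host=localhost,addr=203.0.113.10, so repeat the check for the user and source address you care about. Grepping /etc/ssh/sshd_config by hand will miss the cloud-init drop-in that re-enables password auth on Ubuntu AMIs, or the Match Address block a contractor added six months ago.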

What you want to see, roughly:

permitrootlogin no (or prohibit-password, which sshd -T prints as without-password; never yes).
passwordauthentication no.
kbdinteractiveauthentication no (the new name for challengeresponseauthentication; defaults flipped on several distributions in 2024).
permitemptypasswords no.
A clientaliveinterval and clientalivecountmax pair that actually evicts dead sessions.
An explicit allowusers or allowgroups line so a freshly-created system user can’t SSH by default.
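
The list above can be turned into a filter over the sshd -T output. This is a sketch of my own, not a CIS implementation: it flags only the values named above, and treating clientaliveinterval 0 as a failure is my reading of "actually evicts dead sessions".

```shell
# Read `sshd -T` output on stdin and print one FAIL line per risky setting.
# Prints nothing when every checked value is acceptable.
sshd_findings() {
  awk '
    $1 == "permitrootlogin" && $2 == "yes"             { print "FAIL: " $0 }
    $1 == "passwordauthentication" && $2 != "no"       { print "FAIL: " $0 }
    $1 == "kbdinteractiveauthentication" && $2 != "no" { print "FAIL: " $0 }
    $1 == "permitemptypasswords" && $2 != "no"         { print "FAIL: " $0 }
    $1 == "clientaliveinterval" && $2 == 0 {
      print "FAIL: " $0 " (dead sessions never evicted)"
    }
    $1 == "allowusers" || $1 == "allowgroups"          { allow = 1 }
    END { if (!allow) print "FAIL: no allowusers/allowgroups line" }
  '
}
```

Usage on the box: sudo sshd -T | sshd_findings. Empty output means none of the checked settings tripped, not that SSH is "secure".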

3. Security-group surface (from the host’s perspective, not the console’s)

You can absolutely list security groups via the AWS Console. The console will show you the configured rules. It will not show you which of those rules the local box’s own listening sockets actually answer on, which is the surface that matters when somebody is scanning your public IP.

sudo ss -tulpenH | awk '{print $1, $5, $7}'

Cross-reference every listening 0.0.0.0: or [::]: socket with the SG inbound rules attached to the instance’s primary ENI. Three patterns to flag:

A daemon listening on 0.0.0.0 for a service that should be loopback only (Redis on :6379, Postgres on :5432, an internal admin endpoint on :9000). Loopback-bind in the daemon config; do not rely on the SG alone.
An SG rule open to 0.0.0.0/0 for a port no process is currently listening on. That is drift — somebody opened it during an incident and never closed it.
A daemon listening on a port that is covered by an open SG rule but should not be (a dev-mode HTTP server on :3000, a Jupyter on :8888).

The Well-Architected Security Pillar’s Protect networks design principle calls this out as “minimize the attack surface” — and the read-only version is just ss plus the SG rule list.
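
The cross-reference itself is a set difference. A minimal sketch, assuming you have copied the instance's inbound SG rules into a sorted text file of proto/port lines by hand; the file names and function names are hypothetical.

```shell
# Read `ss -tulnH` output on stdin; print "proto/port" for every socket
# bound to all interfaces (0.0.0.0, [::] or *), sorted and de-duplicated.
wildcard_ports() {
  awk '{
    n = split($5, part, ":")   # the port is the text after the last colon
    port = part[n]
    addr = substr($5, 1, length($5) - length(port) - 1)
    if (addr == "0.0.0.0" || addr == "[::]" || addr == "*")
      print $1 "/" port
  }' | LC_ALL=C sort -u
}

# Lines present in the first sorted file but not in the second.
only_in_first() {
  LC_ALL=C comm -23 "$1" "$2"
}
```

With sudo ss -tulnH | wildcard_ports > listening.txt and a sorted sg_open.txt, only_in_first sg_open.txt listening.txt shows SG rules nothing answers on (drift), and swapping the arguments shows wildcard listeners the SG does not expose today but would the moment somebody opens the port.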

4. Patch state and unattended-upgrades reality check

Inspector will tell you which packages have known CVEs. It will not tell you whether the box is actually applying security updates on a schedule, which is the difference between “we patched on Monday” and “we have not rebooted since the 2024 OpenSSL CVE.”

# When did the box last boot, and which kernel is it running?
uptime -s
uname -r

# Is unattended-upgrades actually installed and enabled? (Debian/Ubuntu)
systemctl is-enabled unattended-upgrades 2>/dev/null
systemctl is-active  unattended-upgrades 2>/dev/null
# The scheduled runs are driven by the apt timers, so check those too:
systemctl is-active  apt-daily.timer apt-daily-upgrade.timer 2>/dev/null
grep -E '^(APT::Periodic::Update-Package-Lists|APT::Periodic::Unattended-Upgrade)' \
  /etc/apt/apt.conf.d/20auto-upgrades 2>/dev/null

# Amazon Linux 2023 / RHEL family:
systemctl is-enabled dnf-automatic.timer 2>/dev/null
systemctl is-active  dnf-automatic.timer 2>/dev/null

An instance whose last boot was 11 months ago and whose unattended-upgrades was disabled the day the founder hit a noisy apt upgrade is the most common ‘we never got owned because we got lucky’ setup in indie SaaS. The fix is two lines of config. The audit step is just running the commands above.
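
To put a number on "last boot was 11 months ago", here is a small sketch that reads the boot time from /proc/stat. The 30-day default is an arbitrary threshold for illustration, and the function names are mine.

```shell
# Whole days between two Unix timestamps (start, end).
days_between() {
  echo $(( ($2 - $1) / 86400 ))
}

# Boot time as a Unix timestamp, read from the kernel (Linux only).
boot_epoch() {
  awk '$1 == "btime" { print $2 }' /proc/stat
}

# Flag a host whose uptime in days exceeds a limit (default 30).
reboot_verdict() {
  days="$1"; limit="${2:-30}"
  if [ "$days" -gt "$limit" ]; then
    echo "STALE: up $days days (limit $limit)"
  else
    echo "OK: up $days days"
  fi
}
```

Usage: reboot_verdict "$(days_between "$(boot_epoch)" "$(date +%s)")". On Debian and Ubuntu, the presence of /var/run/reboot-required is a second, cheaper signal that a pending update is waiting on a reboot.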

5. Docker socket exposure

If the box runs Docker, the single most consequential misconfiguration is exposing the Docker daemon socket — either by mounting /var/run/docker.sock into a container that handles untrusted input, or by listening on a TCP port without TLS. Docker’s own security documentation calls this out as effectively root on the host. The CIS Docker Benchmark §2 covers the daemon-configuration controls in detail.

# Is the daemon listening on TCP anywhere?
sudo ss -tlpn | grep -E 'docker|2375|2376'

# Which running containers have the socket mounted?
docker ps --format '{{.ID}} {{.Image}} {{.Names}}' 2>/dev/null | \
  while read -r id image name; do
    # /run and /var/run are the same directory on most modern hosts
    if docker inspect "$id" 2>/dev/null | \
         grep -Eq '"(/var)?/run/docker\.sock"'; then
      echo "SOCKET MOUNT: $name ($image)"
    fi
  done

A container that mounts the socket can start a new container that mounts the host’s / and chroot into it. That is not a CVE — Inspector will never flag it — but it is an instant root-equivalent path the moment that container has a code-execution bug.
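
Grepping the whole inspect JSON can match the path in places other than a mount. A stricter variant, sketched here under the assumption that only mount sources matter, asks Docker for the mount list directly; the function names are hypothetical.

```shell
# Read mount source paths on stdin (one per line); succeed if any of them
# is the Docker socket under either /var/run or /run.
has_socket_mount() {
  grep -Eq '^(/var)?/run/docker\.sock$'
}

# Print the names of running containers that mount the socket.
socket_mount_containers() {
  docker ps --format '{{.ID}} {{.Names}}' | while read -r id name; do
    if docker inspect \
         --format '{{range .Mounts}}{{println .Source}}{{end}}' "$id" \
         | has_socket_mount; then
      echo "SOCKET MOUNT: $name"
    fi
  done
}
```

It only sees running containers, so a stopped container with the same mount will not show up until it starts again.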

What this audit deliberately does not do

A read-only EC2 hardening audit is one layer. It is not:

A penetration test. Nothing here exploits anything. We are reading config and listening sockets, not attacking the application. The audit-vs-pentest framing for indie founders is in a separate post.
A CVE scan. Inspector is the right tool for that, and you should leave it on. The point of the post is the other layer, not a replacement for the package-vulnerability layer.
A compliance attestation. SOC 2 / HIPAA / ISO 27001 require documented policies, evidence collection, vendor management, and access reviews — none of which this audit produces. A hardened instance is a part of those, not a substitute.
A continuous control. This is a point-in-time read. The continuous version is your patch cadence, your unattended-upgrades, your SG-drift detection, and your IMDSv2-required default in the launch template.

The CIS AWS Foundations Benchmark and the AWS Well-Architected Security Pillar both treat instance-level hardening and IAM-level controls as distinct layers, and there is a reason for that: they fail differently, and one rarely catches the other.

When the read-only audit isn’t enough

If you’ve run the five checks above on one instance and it lit up in two or more sections, the bottleneck is usually not the audit — it is the time to do the remediation. At that point you have three options:

DIY the fixes in a maintenance window. Most of the items above are 10–30-minute changes. The Ubuntu / Debian EC2 hardening checklist is the matching to-do list.
Hire someone to run the audit on more than one instance. A multi-host audit is mostly cross-referencing the same five categories with the launch templates and AMI lineage that produced them.
Stop here and accept the residual risk. Sometimes the right answer.

There is no fourth option where AWS-native tooling alone closes this gap. Inspector, Access Analyzer, GuardDuty, and Security Hub are all great at what they’re designed to do; none of them parse sshd_config, and none of them tell you that a container is mounting the Docker socket. Security Hub can flag an instance whose metadata options still allow IMDSv1 (control EC2.8), but that is a control-plane reading; the local curl check is what confirms how the box actually answers.

AI/API bill jumped? Find the token burn before it eats the month

Thu, 21 May 2026 22:35:00 +0000

AI bills usually do not explode because the model suddenly got smarter. They explode because something operational and boring broke.

A cron job starts failing and retries forever. A background agent keeps using a frontier model for work that should be a cheap classifier or no model at all. A fallback path silently routes to a paid provider after the cheap provider hits quota. A browser automation loop keeps resubmitting the same task. The overall prompt cache hit rate is high, but one uncached workflow still burns the month.

That is the pattern I look for in an AI/API cost rescue pass.

The first-hour checklist

Start with the live account dashboard, but do not stop there. A dashboard tells you that money moved. It rarely tells you which boring system behavior caused it.

The fastest useful pass is:

List every recurring job, cron, queue worker, background agent, and scheduled task.
Mark anything that has failed more than once in a row.
Mark anything using a frontier model by default.
Mark anything with automatic fallback to another paid model.
Check whether failed fallback runs are still billed even when they produce no user-visible output.
Check cache hit rate and the workflows that miss cache.
Check which calls are interactive and which are unattended background work.
Disable non-revenue background jobs until there is a reason to re-enable them.
Put a daily dollar alert below the panic threshold, not above it.
Write down the kill switch before the next incident.

This is not elegant. It works.
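
The daily dollar alert can start as a few lines of awk over an export. A minimal sketch, assuming a hypothetical CSV with a header row and the columns date,model,usd; real provider exports use different column names, so adjust the field numbers.

```shell
# Sum spend per day from CSV on stdin (columns: date,model,usd; the header
# row is skipped) and mark days whose total exceeds a dollar threshold.
# Usage: daily_spend 25 < usage.csv
daily_spend() {
  awk -F, -v limit="$1" '
    NR > 1 { total[$1] += $3 }
    END {
      for (day in total) {
        flag = (total[day] > limit) ? " ALERT" : ""
        printf "%s %.2f%s\n", day, total[day], flag
      }
    }' | LC_ALL=C sort
}
```

Set the threshold below the number that would make you panic, so the ALERT line shows up while the fix is still cheap.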

Common spend leaks

The most common leaks are not exotic:

Failed scheduled jobs. A job that fails every 30 minutes can still create model calls, fallback attempts, summaries, traces, or notifications.
Wrong model defaults. High-reasoning models are useful for hard work. They are wasteful for health checks, digests, log summaries, and polling.
Fallback cascades. Cheap provider fails, expensive provider wakes up, then another fallback wakes up after that.
Retry loops. Browser automation, queue workers, and agent sessions often retry the full prompt instead of a small recovery step.
No budget boundary between product and ops. Customer-facing work, internal experiments, monitoring, and background housekeeping all hit the same billing account.
Helpful automation with no revenue path. Daily reports, market scans, and content queues feel productive while they quietly spend money.

The fix is usually less glamorous than the diagnosis: pause jobs, lower model tiers, cap retries, separate production and experiments, and make expensive paths explicit instead of automatic.
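
"Cap retries" can be as small as a wrapper around the job. A sketch under my own naming; RETRY_DELAY is a hypothetical environment variable, and the default of 5 seconds is arbitrary.

```shell
# Run a command at most $1 times, pausing RETRY_DELAY seconds (default 5)
# between attempts. Returns the last exit status instead of looping forever.
retry_capped() {
  max="$1"; shift
  attempt=1
  while :; do
    "$@"
    status=$?
    [ "$status" -eq 0 ] && return 0
    if [ "$attempt" -ge "$max" ]; then
      echo "giving up after $attempt attempts (exit $status)" >&2
      return "$status"
    fi
    attempt=$((attempt + 1))
    sleep "${RETRY_DELAY:-5}"
  done
}
```

A cron line such as retry_capped 3 ./run_digest.sh (script name hypothetical) then fails loudly after three tries rather than billing for a fourth.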

What to send for a review

Send redacted evidence only:

Billing screenshots with account identifiers hidden.
Provider usage by day and model.
Cron/task/job list.
Model routing config with secrets removed.
Recent failure summaries with tokens, keys, emails, customer records, and private logs removed.
A short note explaining what changed before the bill jumped.

Do not send API keys, tokens, SSH private keys, .env files, customer records, raw private logs, payment details, or regulated personal data.

Fixed-scope offer

AI/API bill jumped?

Get a same-day $9 first-aid triage. Send redacted billing screenshots, usage by day, routing/config notes with secrets removed, and what changed. I will return a 1-page kill list: likely spend leak, first things to pause, cheaper routing/caching/batching checks, and whether the full $499 rescue is warranted.

Redacted evidence only. No API keys, tokens, SSH keys, .env files, customer records, raw private logs, payment details, or regulated data.

Buy $9 triage

Need deeper help? See the $499 AI/API Cost Rescue.

Same-day review depends on receiving enough redacted evidence. This is first-aid triage, not guaranteed cost recovery.

Permissions first, then prompts. If the agent stack is connected to GitHub, Gmail, Slack, Stripe, or AWS and you have never written down what it is allowed to do, do that before optimizing tokens. The free agent permission map checklist is the one-page version: account, verbs, spend, approvals, logs, kill switches.

I am offering a fixed-scope AI/API Cost Rescue QuickCheck for founders running agents, internal AI tools, or automation hosts.

Price: $499

You get a written report within 24 hours after complete redacted intake:

Spend leak map: what appears to be burning money and why.
Kill list: what to pause first.
Model routing plan: what should use cheaper models, caching, batching, or no model.
Budget guardrails: caps, alerts, and daily checks.
One async clarification pass within 7 days.

This is advisory and fixed-scope. Implementation can be quoted separately if needed.

Buy the AI/API Cost Rescue QuickCheck

If you are not ready to buy, still do the first-hour checklist above. The important thing is to stop unattended spend before optimizing anything else.

Redacted evidence beats account access: how to get a useful QuickCheck without handing over credentials

Fri, 22 May 2026 22:06:00 +0000

The default small-business support pattern is backwards.

Someone asks for help. The helper asks for login access. The founder sends a password, an API key, an SSH key, a mailbox invite, or a pile of raw logs. Everyone moves faster for ten minutes, and the security risk gets worse for months.

For a fixed-scope review, that is usually unnecessary.

Most useful early answers do not require account custody. They require clean evidence: screenshots with identifiers hidden, DNS records, job lists, command output, billing summaries, and a short explanation of what changed.

That is the boundary QuickCheck is built around.

The point is diagnosis, not ownership

A first-pass review should answer a narrow question:

Why did this AI/API bill jump?
Why does this domain fail email authentication?
Why is this inbox impossible to work in?
What obvious server hygiene problems should be fixed first?

Those questions do not usually require control of the account. They require enough context to separate likely causes from noise.

There is a big difference between:

“Here is a redacted screenshot of usage by model and day.”
“Here is my OpenAI API key.”

The first one helps. The second one creates a new incident.

OpenAI’s API key safety guidance is blunt on this: do not expose keys in client-side environments, do not commit keys to repositories, use environment variables or a key management service, monitor usage, rotate keys when needed, and use IP allowlisting where appropriate.

The same principle applies outside OpenAI. A credential is not evidence. A credential is authority.

What never to send

Do not send:

API keys.
SSH private keys.
Passwords.
OAuth refresh tokens.
Environment files.
Full customer records.
Raw mailbox exports.
Payment details.
Regulated personal data.
Unredacted private logs.
Screenshots that expose account IDs, billing addresses, or unrelated customer names.

If the review cannot be completed without one of those, the scope is probably no longer “QuickCheck.” It is implementation, incident response, migration, or hands-on administration.

That can be valid work. It should be separately scoped, authorized, and handled with a different risk model.

What good redacted evidence looks like

Good evidence has three traits:

It shows the shape of the problem.
It hides secrets and unrelated people.
It is specific enough to produce a written recommendation.

For screenshots, redact:

API keys and tokens.
Account IDs.
Email addresses unless needed for the finding.
Customer names.
Billing address and card details.
Internal hostnames if they are not relevant.

Do not blur the numbers that matter. If the question is cost, keep the spend, dates, model names, token counts, request counts, and usage trend visible.

Do not redact the thing being reviewed. If the question is DMARC alignment, the domain and DNS records matter. If the question is server hygiene, the open ports and service names matter.

AI/API cost review: what to send

For an AI/API cost review, send redacted evidence like:

Provider usage by day.
Usage by model.
Spend by project, workspace, or API key label.
Recent billing screenshot with account identifiers hidden.
Cron, job, queue, or agent task list.
Model routing summary.
Retry and fallback rules.
Recent failure summary with secrets removed.
What changed before the bill moved.

The common mistake is sending only the invoice. The invoice proves money moved; it rarely shows the cause.

The useful evidence is the operational layer around the invoice: jobs, routes, retries, fallbacks, model defaults, cache misses, and unattended work.

OpenAI’s production guidance calls out project separation, billing limits, key tracking, and staging projects. That is exactly the kind of structure that makes a spend spike diagnosable without exposing secrets.

Email DNS review: what to send

For an Inbox/DNS QuickCheck, send:

Domain name.
Current SPF record.
Current DKIM selector records, if known.
Current DMARC record.
MX records.
Sending services used by the domain.
A redacted authentication result from a failed message, if available.
Whether the problem is password resets, receipts, newsletters, support replies, or all mail.

Do not send mailbox access. Do not forward private customer threads unless the headers are the only way to prove the issue and the body is removed.

Most email DNS problems are visible from public DNS and message headers. The mailbox contents are usually irrelevant.
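
The public-DNS half of that evidence takes three lookups. A minimal sketch, assuming dig is installed and using example.com as a placeholder; the helper name is mine.

```shell
# Extract the policy (p= tag) from a DMARC TXT record read on stdin.
# The sp= tag is ignored because it does not start the tag with "p=".
dmarc_policy() {
  tr -d '"' | tr ';' '\n' \
    | sed -n 's/^[[:space:]]*p=[[:space:]]*//p' \
    | sed 's/[[:space:]]*$//' \
    | head -n 1
}
```

Usage: dig +short TXT _dmarc.example.com | dmarc_policy prints none, quarantine, or reject. The SPF and MX records come from dig +short TXT example.com and dig +short MX example.com, and all three are safe to paste into a review request because they are already public.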

Inbox cleanup review: what to send

For inbox cleanup, the safe path is survey first, delete second.

Send:

Approximate unread count.
Approximate total count.
Storage/quota screenshot with personal details hidden.
Top sender/category counts from a read-only survey.
The kinds of messages you are comfortable deleting.
The kinds of messages that must never be touched.

Do not hand a third-party app full mailbox OAuth just to find out that newsletters, old notifications, and automated receipts are the bulk of the mess.

The useful first answer is a cleanup plan: which senders to archive, which queries to test, what to delete only after a review window, and what filters should prevent the pile from coming back.

Server hygiene review: what to send

For a small VPS or EC2 hygiene review, send customer-run read-only output, not credentials:

OS and version.
Listening ports.
Firewall status.
SSH effective configuration.
Running services.
Update status.
Disk usage.
Web server vhost summary, if relevant.
Backup/snapshot status, if known.
Any public URL that should be checked.

Do not send SSH keys. Do not send cloud console credentials. Do not send a root password.

For a fixed-scope advisory pass, a read-only collector or manually copied command output is enough to identify the usual issues: password SSH, root login, missing firewall, stale packages, exposed admin ports, weak headers, no backups, no fail2ban, or public files that should not be public.
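
A read-only collector does not need to be more than a loop over commands. This is a sketch of the idea, not the collector I use; the section titles and the command list are illustrative, and the output still needs a redaction pass before it leaves the box.

```shell
# Print a titled section with the output of one read-only command.
# A missing or failing command is recorded instead of aborting the run.
section() {
  title="$1"; shift
  echo "== $title =="
  if command -v "$1" >/dev/null 2>&1; then
    "$@" 2>&1 || echo "(exit status $?)"
  else
    echo "(not installed: $1)"
  fi
  echo
}

# Collect a handful of the items from the list above.
collect_hygiene() {
  section "OS"              cat /etc/os-release
  section "Listening ports" ss -tulnH
  section "Firewall"        ufw status
  section "Disk usage"      df -h
  section "Failed units"    systemctl --failed --no-pager
}
```

Usage: collect_hygiene > hygiene.txt, then read the file yourself before sending it.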

When access actually is needed

Sometimes access is the right answer.

Implementation needs access. Incident response may need access. A migration may need access. A production fix during an outage may need access.

But that should be a deliberate second step, not the default first ask.

A good review should end with one of three outcomes:

“Here is the fix list; you can do it yourself.”
“Here is the fix list; I can implement it if you approve a separate scope.”
“This is riskier than a QuickCheck; stop and handle it as incident response.”

That boundary protects both sides.

A simple redaction pass before you send anything

Before sending evidence:

Put the files in one folder.
Rename screenshots by topic, not by account name.
Redact keys, tokens, account IDs, addresses, unrelated people, and payment details.
Leave the problem data visible.
Add a short note: what broke, when it started, what changed, and what outcome you want.
Re-open every screenshot once after redaction and look for missed secrets.

If a screenshot contains both a secret and a useful number, crop it or cover only the secret. Do not hide the whole panel and expect a useful diagnosis.
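
For text evidence (command output, config dumps), a first redaction pass can be automated. A rough sketch with patterns I picked for illustration: AWS access key IDs, sk- style API keys, email addresses, and 12-digit account IDs. It will miss secrets in other formats, so it does not replace re-reading the file.

```shell
# Replace a few common secret and identifier shapes on stdin with
# placeholders. Numbers that are not exactly 12 digits are left alone.
redact() {
  sed -E \
    -e 's/AKIA[0-9A-Z]{16}/[REDACTED_AWS_KEY_ID]/g' \
    -e 's/sk-[A-Za-z0-9_-]{20,}/[REDACTED_API_KEY]/g' \
    -e 's/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/[REDACTED_EMAIL]/g' \
    -e 's/(^|[^0-9])[0-9]{12}([^0-9]|$)/\1[REDACTED_ACCOUNT_ID]\2/g'
}
```

Usage: redact < raw_output.txt > evidence.txt. Spend figures, dates, and model names pass through untouched, which is the point: hide the secret, keep the number.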

A worked example: AI agents touching real tools

The same rule applies when you are reviewing an AI agent or workflow that has been wired into GitHub, Gmail, Slack, Stripe, or AWS. The reviewer does not need your tokens or your admin console — they need a written description of which account the agent runs as, what verbs it can perform, and what currently requires human approval. The agent permission map checklist is the redacted-evidence version of that conversation: nine columns, one row per connected tool, no secrets attached.

The working rule

If someone needs evidence, send evidence.

If someone needs authority, stop and scope that separately.

That one distinction prevents a lot of unnecessary risk.

Need a small review without handing over credentials?

QuickCheck is fixed-scope advisory work built around redacted evidence: AI/API cost triage, Inbox/DNS checks, inbox cleanup planning, and one-host server hygiene reviews.

See QuickCheck options

Redacted evidence only. No API keys, tokens, SSH keys, passwords, mailbox custody, raw private logs, payment details, or regulated personal data.

Before an AI agent gets real tool access, map what it can actually do

Sat, 23 May 2026 16:40:00 +0000

Most teams I talk to get the agent demo working before anyone writes down what it can do in production.

Not the model. Not the prompt. The operating permissions - what account it runs as, what it can touch, what it can spend, and who has to approve.

That gap is where the surprises live.

The table that should exist before the agent goes live

For any agent touching real tools, I would want a single table that answers:

which account it runs as
what it can read
what it can change
what it can send externally
what it can spend
what it can deploy or break
what requires human approval
what gets logged
how to turn it off fast

Nine columns. One row per connected tool. That is the whole artifact.
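
The table can live in a plain tab-separated file. A minimal sketch, assuming a leading tool column as the row label in front of the nine columns; the column identifiers and function names are my own shorthand.

```shell
# Header for the map: a row label plus the nine columns, tab-separated.
map_header() {
  printf 'tool\taccount\tread\tchange\tsend_external\tspend\tdeploy_or_break\tapprovals\tlogs\tkill_switch\n'
}

# Read a tab-separated map on stdin (header first) and report rows that
# do not have exactly ten non-empty cells. Prints nothing for a full map.
map_gaps() {
  awk -F '\t' '
    NR == 1 { next }
    {
      if (NF != 10) {
        print "row " NR ": expected 10 cells, found " NF
        next
      }
      for (i = 1; i <= NF; i++)
        if ($i == "") print "row " NR ": column " i " is empty"
    }'
}
```

An empty cell is the useful finding here: it is a question nobody has answered yet, and writing "none" is a decision while leaving it blank is not.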

“Read-only” is not the safe answer you think it is

The uncomfortable bit is that “read-only” still matters.

A read-only agent may see source code, support tickets, invoices, Slack channels, customer records, logs, or internal docs. That is a real blast radius even if it never writes a byte.

And a write-capable agent needs verbs, not labels. “Write access to GitHub” is not a permission. These are:

comment
merge
email
refund
invite
export
deploy
delete

The label tells you nothing. The verbs tell you everything.

A default rule set I would start from

This is the boring, working starting point. Tighten or loosen per tool, but start here:

Outbound to humans: draft-only for anything that leaves the company - email, Slack DMs to customers, social, support replies. A human ships the final.
Code: PRs allowed, merges require human approval.
Money: billing lookups are read-only, refunds require human approval.
Cloud: inspection allowed, resource changes require human approval.
Logs: every run shows trigger, tool call, object touched, result, and human approval if one was required.
Kill switch: one obvious off-switch per connected tool. Not a config flag buried in a repo. A button or a single command.

The point is not that these rules are correct for every team. The point is that some version of this exists in writing before the agent becomes production infrastructure.
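
Written as code, the starting rule set is a lookup from verb to decision. A sketch only: the verb names are hypothetical labels for the rules above, and sending unlisted verbs to "approve" is my assumption, on the theory that an unknown action should get a human in the loop.

```shell
# Map an agent verb to a default decision: allow, draft-only, or approve.
# Anything not listed falls through to "approve".
verb_policy() {
  case "$1" in
    open_pr|billing_lookup|inspect)      echo "allow" ;;
    email|slack_dm|social|support_reply) echo "draft-only" ;;
    merge|refund|resource_change)        echo "approve" ;;
    *)                                   echo "approve" ;;
  esac
}
```

For example, verb_policy merge prints approve and verb_policy email prints draft-only. The value is less in the function than in being forced to name every verb the agent actually has.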

What this is not

A few things to be honest about:

This is not a security audit. It is a map.
It is not a compliance review, not a pentest, not an IAM audit.
It does not make any agent “safe” or “least-privileged” or “production-ready.” Those words do a lot of work that the table does not do.
It does not replace your provider’s own controls - org policies, role scopes, OAuth scopes, MCP allow-lists. It tells you which of those you actually need to set.

The table is the first artifact, not the last one.

The boring summary

Write the permission map before the agent touches a real tool, not after.
One row per connected tool. Nine columns: account, read, change, send-external, spend, deploy-or-break, approvals, logs, kill switch.
Treat “read-only” as a real surface area, not a free pass.
Describe write access in verbs (comment, merge, email, refund, invite, export, deploy, delete), not labels.
Drafts for anything that leaves the company. Approvals for money, merges, and resource changes.
One kill switch per tool, obvious enough to use at 2am.

The table does not need to be fancy. It just needs to exist.

— Rich

Frequently asked questions

What is an agent permission map? A single table, one row per connected tool, that records which account the agent runs as, what it can read, what it can change, what it can send externally, what it can spend, what it can deploy or break, what requires a human’s approval, what gets logged, and how to turn it off fast. It is a written description, not a scanner.

Is “read-only” access actually safe for an AI agent? No. Read-only access can still expose source code, support tickets, invoices, Slack history, customer records, logs, and internal docs. The map should treat read-only as a real blast radius and decide whether the agent really needs it.

What is a kill switch for an AI agent? One obvious off-switch per connected tool that a human can use at 2am. A revoked OAuth token, a disabled GitHub App, a paused Zapier/Make scenario, a removed Slack app, or a single command that disables the agent’s outbound calls. Not a config flag buried in a repo.

Do I need a security audit, an IAM audit, or this map? The map is the cheaper first artifact and almost always the missing one. It is not a security audit, an IAM audit, a compliance review, or a penetration test. If you eventually need any of those, the map makes them shorter and cheaper because the scope is already written down.

Want a written permission-map review?

This lane is in private validation. If you are connecting an agent to GitHub, Gmail, Slack, Stripe, AWS, MCP servers, support tools, or a deploy path and you want a fixed-scope, async, written permission-map review, email support@richgibbs.dev with the tool the agent touches and the riskiest verb it can do.

What you get back in this window is the checklist and the report template I am working from, not a free review of your stack. No public price during the validation window. The QuickCheck site lists the related paid offers that already exist for AI/API cost rescue, inbox/DNS, and one-host server hygiene; the Agent Permission Map is a separate lane that is not yet a productized QuickCheck.

I will not ask for credentials, OAuth grants, or admin access. I cannot accept regulated data (health, financial-account, payment-card, student, children’s, HR).

Advisory checklist for operators. Not legal advice, not a security audit, not an IAM audit, not a compliance review, not a penetration test, and not a certification. No affiliation with or endorsement by GitHub, Google, Slack, Stripe, AWS, OpenAI, Anthropic, Microsoft, Cloudflare, or any other vendor named above. The author will not ask for credentials, OAuth grants, or admin access, and cannot accept regulated data (health, financial-account, payment-card, student, children’s, HR).

Docker Compose on one VPS: the production checklist before you outgrow it

Sun, 24 May 2026 11:48:00 +0000

Docker Compose is fine for a one-server app.

That sentence makes some people twitch, so here is the boundary: one VPS, one operator, a small app, a reverse proxy, a database you understand, and a business where “five minutes down while I fix it” is annoying but not catastrophic.

That setup does not need Kubernetes on day one. It does need a boring checklist, because most Compose outages are not deep container problems. They are simpler:

the process did not come back after reboot
logs filled the disk
a private port was accidentally published to the internet
the .env file became the only copy of production secrets
the database volume was “backed up” but never restored
the deploy command was whatever you typed last time from shell history
the health check said “container running” while the app was dead

Compose is not the problem in that story. Undocumented production habits are.

This is the checklist I would want in place before trusting a Docker Compose app on one VPS.

1. Write down the actual shape of the system

Before changing YAML, write one short inventory:

public hostnames
Compose project directory
services in compose.yml
published ports
named volumes
bind mounts
environment files
backup target
deploy command
rollback command
who gets paged when it breaks, even if “who” is just you

The act of writing that list will catch half the mistakes.

If you cannot say which volume contains production data, you do not have a deployment. You have a container that happens to be running.

2. Make the reverse proxy the only public door

For a one-box Compose app, I want exactly one public entry path:

80/tcp and 443/tcp to Caddy, nginx, Traefik, or Apache
app containers reachable only on the Docker network or loopback
database and cache ports not published publicly
admin tools disabled or bound to 127.0.0.1

The dangerous Compose line is usually this:

ports:
  - "5432:5432"

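That publishes Postgres on every interface. Do not assume the host firewall saves you: Docker adds its own iptables rules for published ports, and they usually take effect ahead of ufw rules.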

Prefer one of these shapes:

ports:
  - "127.0.0.1:3000:3000"

or no host port at all, with the reverse proxy joining the same Compose network.

Then verify from outside the box:

nmap -Pn -p 1-10000 your.server.ip

You should see SSH, HTTP, HTTPS, and very little else. If Redis, Postgres, MySQL, Meilisearch, Elasticsearch, or an admin dashboard shows up, stop and fix that before touching the application.
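A rough pre-deploy check for the same mistake is to grep the Compose file for short-form port mappings that are not pinned to loopback. This is a heuristic sketch, not a YAML parser: risky_ports is a hypothetical helper name, it assumes the short "HOST:CONTAINER" syntax shown above, and it ignores long-form ports entries and IPv6 binds.

```shell
# Print short-form port mappings that are not bound to loopback.
# Reads Compose YAML on stdin. Heuristic only: long-form "ports:" entries
# and IPv6 binds are not handled.
risky_ports() {
  grep -E '^[[:space:]]*-[[:space:]]*"?([0-9.]+:)?[0-9]+:[0-9]+' \
    | sed -E 's/^[[:space:]]*-[[:space:]]*//; s/"//g; s/[[:space:]]*#.*$//' \
    | grep -Ev '^(127\.0\.0\.1|localhost):' || true
}
```

Run it as risky_ports < compose.yml. Empty output means every short-form mapping it recognized is bound to 127.0.0.1. It does not replace the external nmap check.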

3. Use a restart policy, but do not confuse it with health

Docker documents restart policies for the basic “come back after exit or daemon restart” behavior. For a one-VPS app, unless-stopped is usually the least surprising default.

services:
  web:
    image: registry.example.com/myapp:2026-05-24
    restart: unless-stopped

That gets you through process exits and host reboots.

It does not prove the app is healthy.

A wedged app can keep a process alive forever. A web server can return 500s while the container is “Up”. A worker can be connected to the wrong queue and look fine from Docker’s point of view.

So pair restart policy with a real health check.

4. Add health checks that test the thing users need

Docker Compose supports service health checks in the Compose file. The command can be a list form or shell form; the important part is that it tests behavior, not just process existence.

services:
  web:
    image: registry.example.com/myapp:2026-05-24
    restart: unless-stopped
    healthcheck:
      test: ["CMD-SHELL", "wget -qO- http://127.0.0.1:3000/healthz >/dev/null || exit 1"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 30s

A good health check answers one narrow question: “Can this service do the small thing users depend on?”

For a web app, that might be /healthz returning 200 after checking the database connection.

For a worker, it might be a queue heartbeat or a lightweight dependency check.

Do not make health checks expensive. Do not run migrations inside them. Do not call third-party APIs every 30 seconds. If the health check creates its own outage, it failed the assignment.

5. Put log rotation in the Compose file

The most boring production outage is a full disk.

Docker’s default json-file logging driver writes container output to JSON files on the host. Docker’s docs call out options such as max-size and max-file for limiting those logs. Use them.

services:
  web:
    logging:
      driver: "json-file"
      options:
        max-size: "10m"
        max-file: "5"

Then check the host:

docker system df
du -h /var/lib/docker/containers 2>/dev/null | sort -h | tail
df -h /

If logs can fill the root volume, your uptime depends on how chatty the app gets during an incident. That is not a plan.
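To see what those two options buy you, multiply them out. The sketch below assumes every container uses the same json-file settings; log_cap_mb is a hypothetical helper, and the result is an approximate ceiling, since Docker rotates a file once it passes max-size.

```shell
# Approximate worst-case json-file log usage in MB:
# max-size (MB) x max-file x number of containers.
log_cap_mb() {
  local max_size_mb="$1" max_file="$2" containers="$3"
  echo $(( max_size_mb * max_file * containers ))
}
```

With the settings above and four containers, log_cap_mb 10 5 4 prints 200, so logs top out around 200 MB instead of growing until the disk is full.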

6. Treat `.env` as config, not a secret vault

Compose reads environment files because it is convenient. Convenience is not the same as secret management.

For a one-person VPS, I am not going to pretend every app needs Vault, SOPS, KMS, and a ceremony. But the minimum line is still clear:

.env is not committed
.env is mode 0600 or at least not world-readable
secrets are not printed in deploy logs
backups do not spray .env into random buckets
the production .env file has a second recoverable copy somewhere intentional
old API keys get rotated when contractors, incidents, or accidental exposure make that necessary

Docker Compose also has secrets and configs concepts in the Compose specification. On a single VPS, even a simple file-mounted secret can be better than passing everything as environment variables, because it gives you a clearer boundary for “this file is sensitive.”

The main rule is not fancy: know where secrets live, know who can read them, and know how to rotate them.
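The file-mode item is easy to check mechanically. A minimal sketch, assuming GNU stat (the default on Ubuntu and Debian); env_file_is_private is a hypothetical helper name.

```shell
# Return 0 if the file has no group or other permission bits set (for example 0600).
# Assumes GNU stat, as shipped on Ubuntu and Debian.
env_file_is_private() {
  local mode
  mode="$(stat -c '%a' "$1")" || return 2
  # The last two octal digits are the group and other permission bits.
  case "$mode" in
    *00) return 0 ;;
    *) return 1 ;;
  esac
}
```

In a deploy script, something like env_file_is_private /opt/myapp/.env || exit 1 stops the deploy before a loosely permissioned file goes any further.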

7. Name volumes like you will have to restore them at 2am

This is the difference between a recoverable Compose app and a science project.

Bad:

volumes:
  data:

Better:

volumes:
  postgres_data:
  uploads_data:

Best: a README next to the Compose file that says:

postgres_data  -> production database
uploads_data   -> user uploads
backup target  -> s3://example-backups/myapp/
restore drill  -> docs/restore.md
last tested    -> 2026-05-24

If your database is inside Compose, backup and restore are production features. Not ops chores. Not future hardening. Product features.

At minimum:

docker compose exec -T db pg_dump -U app app > backup.sql

and a documented restore command that you have run on a clean database.

A backup you have never restored is just a comforting file.
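Between restore drills, a cheap sanity check catches the most common silent failure, a truncated dump. The sketch below assumes the plain SQL format produced by the pg_dump command above, which ends with a "PostgreSQL database dump complete" comment; dump_looks_complete is a hypothetical helper name. It is not a substitute for actually restoring.

```shell
# Return 0 if a plain-format pg_dump file is non-empty and ends with the
# completion comment that pg_dump writes as its last step.
dump_looks_complete() {
  local dump="$1"
  [ -s "$dump" ] || return 1
  tail -n 10 "$dump" | grep -q 'PostgreSQL database dump complete'
}
```

A backup job can run dump_looks_complete backup.sql before uploading and refuse to overwrite the last good copy when it fails.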

8. Keep the deploy command boring and repeatable

The deploy path should be a script or runbook, not shell history.

For a pull-based deploy:

#!/usr/bin/env bash
set -euo pipefail

cd /opt/myapp
docker compose config >/dev/null
docker compose pull
docker compose up -d --remove-orphans
docker compose ps
docker compose logs --since=10m --tail=200 web

For a build-on-host deploy:

#!/usr/bin/env bash
set -euo pipefail

cd /opt/myapp
docker compose config >/dev/null
docker compose build --pull
docker compose up -d --remove-orphans
docker compose ps

That is not sophisticated. That is the point.

You want the same commands every time so that when a deploy fails, you are debugging the deploy, not your memory of the deploy.
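One addition worth considering is a health gate at the end of the script, so a deploy that starts containers but serves errors exits non-zero. A minimal sketch: check_with_retries is a hypothetical helper that runs any command up to N times with a fixed delay between attempts.

```shell
# Run a check command up to <retries> times, sleeping <delay> seconds between
# attempts. Returns 0 on the first success, 1 if every attempt fails.
check_with_retries() {
  local retries="$1" delay="$2" attempt=1
  shift 2
  while [ "$attempt" -le "$retries" ]; do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -lt "$retries" ]; then
      sleep "$delay"
    fi
    attempt=$(( attempt + 1 ))
  done
  return 1
}
```

After docker compose up -d, a line such as check_with_retries 10 3 curl -fsS -o /dev/null http://127.0.0.1:3000/healthz gives the app about 30 seconds to answer. The URL assumes the /healthz endpoint and port from the health check example.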

9. Have a rollback that does not require creativity

If images are tagged only as latest, rollback is guesswork.

Prefer immutable-ish tags:

services:
  web:
    image: registry.example.com/myapp:2026-05-24-1842

Then rollback is boring:

cd /opt/myapp
git checkout previous-known-good-compose-file
docker compose pull
docker compose up -d --remove-orphans

If you build on the host, keep the previous image around long enough to go back.

If the database migration is not backwards-compatible, write that down before deploying. The worst rollback plan is “we can roll the app back, but not the data.”
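A small sketch of the tagging habit, assuming you append each deployed tag to a plain text log with one tag per line. The log file and both helper names are hypothetical.

```shell
# UTC timestamp tag in the same shape as 2026-05-24-1842.
build_tag() {
  date -u +%Y-%m-%d-%H%M
}

# Print the tag deployed before the current one from a one-tag-per-line log.
# Fails if the log holds fewer than two tags.
previous_tag() {
  local log="$1"
  [ "$(wc -l < "$log")" -ge 2 ] || return 1
  tail -n 2 "$log" | head -n 1
}
```

With that log in place, the rollback target is whatever previous_tag prints, not whatever you remember deploying last week.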

10. Start Compose from systemd, not a forgotten terminal

If the host reboots, the app should return without you SSHing in.

One plain systemd unit can be enough. Save it as /etc/systemd/system/myapp-compose.service:

[Unit]
Description=My app Docker Compose stack
Requires=docker.service
After=docker.service network-online.target
Wants=network-online.target

[Service]
Type=oneshot
RemainAfterExit=yes
WorkingDirectory=/opt/myapp
ExecStart=/usr/bin/docker compose up -d --remove-orphans
ExecStop=/usr/bin/docker compose down
TimeoutStartSec=0

[Install]
WantedBy=multi-user.target

Then:

sudo systemctl daemon-reload
sudo systemctl enable --now myapp-compose.service
systemctl status myapp-compose.service

This does not replace container restart policies. It gives the host one obvious owner for the stack lifecycle.

11. Alert on the boring host signals

For a one-VPS Compose app, the first useful alerts are not advanced:

root disk over 80 percent
memory pressure or swap thrash
public HTTP check failing
TLS certificate near expiry
Docker daemon down
app health endpoint failing
backup job missing its last-success marker
root-owned files accidentally created in bind mounts

That list catches more real incidents than a beautiful dashboard that nobody reads.

Start with cron, systemd timers, Uptime Kuma, Healthchecks.io, Better Stack, or whatever you will actually maintain. The tool matters less than the habit: an external check must notice when the box is not serving users.
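The backup marker is the signal people skip most often, so here is one way it could work. This sketch assumes the backup job touches a marker file on success; the marker path and the marker_is_fresh name are hypothetical, and it relies on find -mmin as available in GNU findutils.

```shell
# Return 0 if the marker file exists and was modified within the last
# <max_age_min> minutes. Relies on find -mmin (GNU findutils).
marker_is_fresh() {
  local marker="$1" max_age_min="$2"
  [ -f "$marker" ] || return 1
  [ -n "$(find "$marker" -mmin "-$max_age_min")" ]
}
```

A daily cron entry could run marker_is_fresh /var/lib/myapp/backup.ok 1560 (26 hours) and alert when it fails, which catches both a failing backup job and one that silently stopped running.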

12. Know when Compose is no longer the right tool

Compose on one VPS stops being cute when:

downtime during one-host maintenance is unacceptable
the database needs managed backups, replication, or point-in-time recovery
one deploy must roll across multiple hosts
the team needs per-service ownership and access control
traffic bursts require horizontal scaling
compliance or customer contracts require stronger operational evidence
the restore path depends on one person remembering everything

That does not mean “move to Kubernetes.” It means the shape changed.

Maybe the next step is managed Postgres and the app still runs in Compose. Maybe it is a second VPS. Maybe it is Fly.io, Render, ECS, Nomad, Kubernetes, or something boring from your cloud provider.

The important part is not defending Compose forever. It is knowing what risk you accepted while Compose was the right level of machinery.

The short version

Before you trust Docker Compose on one VPS, make these true:

one public door: reverse proxy only
no accidental public database/cache/admin ports
restart: unless-stopped or another deliberate restart policy
health checks that test real behavior
Docker log rotation
.env protected and recoverable
named volumes with a restore-tested backup path
one repeatable deploy command
one boring rollback path
systemd owns the stack on boot
external checks catch HTTP, disk, cert, Docker, and backup failures
a written threshold for when this setup has been outgrown

Compose is a good tool for a one-server business when the operator is honest about its edges.

The danger is not that Compose is too small.

The danger is pretending a running container is the same thing as a production system.


I ran a read-only server audit. Here's what I found that the scanners missed.

Wed, 27 May 2026 14:30:00 +0000

I keep my servers boring. SSH keys only. UFW default-deny. Unattended-upgrades on a timer. Fail2ban because why not. Nothing fancy, just the basics that every indie founder’s “I’ll get to it” list already has.

I thought I was fine. And I mostly was. But “mostly” is the kind of word that keeps incident responders employed.

Last month I ran a structured read-only audit on my own infrastructure. Same process I use for the QuickCheck — just a systematic posture review. No exploits. No intrusive scans. Just a checklist of things that tend to drift when you’re focused on shipping instead of hardening.

I wasn’t expecting to find much. That’s what made the find annoying.

The find

In a /var/backups/ directory on a utility box, there was a compressed archive from nine months ago. Inside: a full .env file from an old deployment script — database host, API keys, a service account token — world-readable. The backup job that created it had been disabled when we moved to a new deploy pipeline, but the artifact was still there. Anyone with a shell on that box — a compromised dependency, an unattended curl | bash, a stray container — could have read the whole thing.

No, this wasn’t an exposed CVE or an active exploit. It was quieter than that. It was the kind of thing that only matters after something else goes wrong. But if something had gone wrong — if someone had gotten a foothold — that file would have turned a limited incident into a full credential dump in about 30 seconds.

What I deliberately did not do

I did not run a vulnerability scanner. I did not attempt to crack anything. I did not change any configuration during the review. This was a read-only posture check: what’s listening, who has access, where are the seams, what did I forget about.

That distinction matters. A pentest answers “can someone break in right now?” A read-only audit answers “if someone does break in, how bad is it?” They’re different tools for different questions. Most solo devs need the second one long before they need the first. (I wrote more about that distinction in a separate post.)

The fix

The fix took fifteen minutes: delete the archive, rotate the exposed keys, add backup-cleanup to the deploy checklist. No firewall changes. No architecture redesign. No downtime. The risk wasn’t in the remediation. It was in the nine months the file sat there without anyone noticing.

That’s the part that stuck with me. It wasn’t a sophisticated attack that would have exposed me. It was a backup artifact from a retired pipeline, doing exactly what backups do — persisting data after the original source is gone.

What I do differently now

I added a quarterly check to my calendar: “read-only audit, one box.” Same methodology each time. Check users and groups. Check world-readable files in unexpected places. Check what’s listening on interfaces it shouldn’t be. Check cron jobs and systemd timers that might outlive their purpose. Takes about an hour. Has paid for itself twice over in peace of mind.
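The world-readable check is the one that found the archive, and it stays read-only. A minimal sketch, with world_readable_files as a hypothetical helper name:

```shell
# List regular files under a directory that any local user can read.
# Read-only: it changes nothing on disk.
world_readable_files() {
  local dir="$1"
  find "$dir" -xdev -type f -perm -004 2>/dev/null | sort
}
```

Point it at the places that collect leftovers, for example world_readable_files /var/backups, then scan the output for .env files, archives, and SQL dumps. Most world-readable files are harmless; the names are what tell you which ones are not.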

Your mileage will depend on your setup — age of the server, number of people who’ve had access, how many experiments got deployed and never cleaned up. But the basic principle holds: the best time to audit is before you have a reason to. The second-best time is now.

If you want a second set of eyes

Every time I tell this story at a meetup or on a thread, someone asks if I do these audits for other people. So I productized the methodology: the Hardening QuickCheck is the same structured read-only review I run on my own boxes. You get a report with what I found, why it matters, and what to do next. No production changes. No credentials stored. Just a prioritized write-up from someone who has been surprised by a backup file too.

See what a report looks like if you want to judge the format before committing.

And to be clear: this is a posture review, not a penetration test or a compliance certification. It won’t guarantee your server is unhackable. What it will do is tell you where the quiet risks are — including the ones that scanners miss because they aren’t looking for your old deploy-backup-2024.tar.gz.

Read-only audit, not a pentest. Your mileage depends on your setup. Not compliance certification or a guarantee of security.

Server monitoring & alerting for indie founders who self-host

Fri, 29 May 2026 04:40:00 +0000

You moved everything to a VPS. It is cheaper, faster, and under your control.

Then one night your site goes down and you only find out when a customer emails you at 2 a.m.

This post is the smallest monitoring and alerting stack that actually gets used by solo founders and small teams who self-host.

It is not a 47-metric Prometheus + Grafana + PagerDuty architecture. It is the boring, reliable version that tells you the server is down or the disk is full before your users notice.

What you actually need to monitor

For most indie setups the critical signals are simple:

Is the server reachable?
Is the web service responding with 200?
Is disk space under 80%?
Are the important processes still running?
Are there any new security updates that require a reboot?

Everything else (CPU load graphs, memory heatmaps, 500 custom metrics) is nice-to-have until you have the basics covered and actually look at the alerts.

The minimal stack that works in 2026

1. Uptime / HTTP check

Use a simple external ping service that hits your domain every 5 minutes and alerts on failure or slow response.

Options that stay free or cheap for low volume:

UptimeRobot (free tier is still generous)
Freshping
Or self-hosted with a small script + cron that curls and emails on failure

2. Disk + basic system alerts

Install a lightweight agent that watches disk, load, and processes.

Common choices:

monit (simple, config-file based, emails directly)
netdata (beautiful dashboards, works great on small VPSes)
Basic cron + df + mail scripts for the ultra-minimal route
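For the ultra-minimal route, the disk check is only a few lines. This is a sketch rather than a finished monitor: both helper names are hypothetical, and how the alert gets delivered (mail, a webhook, a push notification) is left to whatever you already have working.

```shell
# Print the used percentage of the filesystem holding a path, without the % sign.
disk_used_pct() {
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Return 0 when usage is at or above the limit.
disk_over_limit() {
  local path="$1" limit="$2" used
  used="$(disk_used_pct "$path")"
  [ "$used" -ge "$limit" ]
}
```

From cron, disk_over_limit / 90 && send-the-alert keeps the rule to one line, where send-the-alert stands in for your own delivery command. A high limit fits the alerting rule below: it fires when someone has to act, not when the disk is merely filling.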

3. Security update alerts

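Most VPS providers now surface kernel and package updates. The ones that require a reboot are the important ones. On Debian and Ubuntu, a package that needs one typically leaves a /var/run/reboot-required marker file.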

A simple weekly cron that runs unattended-upgrades --dry-run and emails you when action is needed is often enough.
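For the reboot half of that check, Debian and Ubuntu typically leave a marker file at /var/run/reboot-required, often with a .pkgs file next to it naming the packages involved. A minimal sketch built on that assumption; reboot_needed is a hypothetical helper, and the marker path is a parameter so it can be tested.

```shell
# Return 0 if the reboot marker exists, printing the packages that requested
# the reboot when the companion .pkgs file is present. Return 1 otherwise.
reboot_needed() {
  local marker="${1:-/var/run/reboot-required}"
  [ -f "$marker" ] || return 1
  if [ -f "$marker.pkgs" ]; then
    sort -u "$marker.pkgs"
  fi
  return 0
}
```

A weekly cron job can call reboot_needed and only send mail when it returns 0, which turns "updates exist" into "a human needs to schedule a reboot".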

The alerting rule that matters most

Only alert when a human actually needs to do something.

Bad alerts (the kind everyone eventually mutes):

“CPU was above 60% for 3 minutes”
“Disk was at 72%”
“Response time was 800ms”

Good alerts:

“Site returned 5xx for 5 consecutive checks”
“Disk is at 92%: you have ~48 hours before it fills”
“Security updates require reboot (kernel)”

The goal is fewer alerts that you actually read and act on.
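The "5 consecutive checks" rule needs a little state between runs. One way to sketch it is a counter in a file that resets on the first good response. The state file path and both helper names are hypothetical.

```shell
# Treat connection errors (curl reports 000) and 5xx responses as failures.
is_failure() {
  case "$1" in
    000|5[0-9][0-9]) return 0 ;;
    *) return 1 ;;
  esac
}

# Update the consecutive-failure counter in a state file and print it.
record_check() {
  local state="$1" code="$2" count=0
  if [ -f "$state" ]; then
    count="$(cat "$state")"
  fi
  count="${count:-0}"
  if is_failure "$code"; then
    count=$(( count + 1 ))
  else
    count=0
  fi
  echo "$count" > "$state"
  echo "$count"
}
```

A cron job can fetch the status with curl -s -o /dev/null -m 10 -w '%{http_code}' against your URL, pass the result to record_check, and alert only when the printed count equals 5. Comparing with "equals" rather than "at least" means one alert per outage instead of one every five minutes.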

Quick start recommendation

For most new self-hosted setups in 2026 the fastest path that does not require learning a whole monitoring platform is:

UptimeRobot (or equivalent) for HTTP uptime
Netdata or monit for disk + process alerts
One weekly cron for security updates

Total setup time: under an hour.

You will sleep better knowing the server will tell you when something is wrong instead of waiting for customer emails.


DKIM key rotation for indie founders: the 15-minute zero-downtime swap

Sun, 31 May 2026 17:10:00 +0000

You set up DKIM once during the initial SPF/DKIM/DMARC checklist. The key has been signing every outbound message since. It works. You moved on to the next fire.

Months later the same key is still the only one. If it leaks, if the laptop it was generated on disappears, or if you simply want a repeatable hygiene habit, you have nothing. Most indie founders never rotate DKIM keys. They treat the initial setup as a one-time event.

This post fixes that. It assumes you already have working SPF, DKIM, and DMARC records. The goal is a 15-minute zero-downtime rotation using two selectors, a short TTL window, and one monitoring step that catches the gotcha almost everyone misses.

The dual-selector trick

The usual approach is dangerous: generate a new key, replace the old DNS record, update the MTA config, and pray nothing is in flight. A message signed during the cutover can fail DKIM and bounce or land in spam.

Instead, keep two selectors active for a brief, controlled window. Name them by year and letter so they are obviously temporary: s2026a and s2026b.

The sequence is simple.

Generate the new key pair. Publish only the public key under the new selector (s2026b._domainkey). Leave the old selector (s2026a._domainkey) exactly as it is.
Set a short TTL on both TXT records (300 or 600 seconds) for the duration of the rotation.
Wait one full TTL so the new record has propagated everywhere.
Update your mail server configuration to sign with the new selector (s2026b).
Wait another short buffer (one TTL plus five minutes).
Remove the old selector DNS record.

At no point is mail signed with a selector whose public key is not published. No message is lost. The overlap window is measured in minutes, not days.

Most people who try to rotate without this pattern either cause a brief outage or leave the old selector in DNS “just in case” for months. The dual-selector method removes both problems.
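The waits in that sequence add up to a concrete timeline. As a worked sketch, with offsets in seconds from the moment the new selector is published (rotation_plan is a hypothetical helper, and the five-minute buffer is the 300 in the arithmetic):

```shell
# Print when to switch signing and when to remove the old selector,
# as second offsets from publishing the new record.
rotation_plan() {
  local ttl="$1" sign_at remove_at
  sign_at="$ttl"                           # wait one full TTL after publishing
  remove_at=$(( sign_at + ttl + 300 ))     # then one more TTL plus five minutes
  printf 'publish_new=0 switch_signing=%d remove_old=%d\n' "$sign_at" "$remove_at"
}
```

With a 300-second TTL, rotation_plan 300 prints publish_new=0 switch_signing=300 remove_old=900, which is where the 15 minutes comes from. A 600-second TTL stretches the same plan to 25 minutes.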

ed25519 vs 2048-bit RSA in 2026

Use ed25519 unless you have a documented reason not to.

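RFC 8463 added Ed25519-SHA256 support to DKIM. An Ed25519 public key is 44 base64 characters, against roughly 400 for 2048-bit RSA, so the TXT record fits comfortably in a single DNS string. Signing and verification are faster. Receiver support is the caveat: not every mailbox provider verifies Ed25519 signatures, so send a test message to the providers you care about (Google, Microsoft, Yahoo) and confirm dkim=pass in the Authentication-Results header before relying on it alone.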

2048-bit RSA remains the safe fallback when you are dealing with ancient internal mail servers or a very old recipient MTA that has never been updated. In 2026 that situation is rare for normal SaaS and transactional mail.

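Generate the new key with a tool that supports Ed25519. Stock opendkim-genkey is not one of them: it generates RSA keys only, and its -t flag turns on DKIM test mode (t=y) instead of selecting a key type. OpenSSL works everywhere:

openssl genpkey -algorithm ed25519 -out s2026b.private
openssl pkey -in s2026b.private -pubout -outform DER | tail -c 32 | base64

The second command prints the base64 public key for the p= tag of the s2026b._domainkey TXT record, next to v=DKIM1; k=ed25519. With rspamd, rspamadm dkim_keygen -s s2026b -d yourdomain.com -t ed25519 should do the same job. The rest of the rotation process is identical regardless of algorithm.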

The stale-selector gotcha

This is the part that quietly breaks deliverability weeks after you think you are done.

Receivers verify DKIM when a message arrives, not when it was signed, and they look up the public key by the selector named in the signature. A message that was queued before your cutover, or that a forwarder or mailing list holds and re-delivers later, can arrive signed with s2026a long after you deleted the record.

When that happens the receiving server looks for s2026a._domainkey and finds nothing. DKIM fails even though the mail was legitimately sent from your domain.

You will not see this in your own mail logs. The only reliable signal lives in the DMARC aggregate reports you already receive at your rua address.

Watch those reports for at least seven days after the planned removal date. Any signature that still references the retired selector means the record needs to stay live for another 30 days. Most forwarders and queues clear within a week; some legacy systems take longer.
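Aggregate reports are XML, and the DKIM results usually carry a selector element, so a rough count of how often the retired selector still shows up needs nothing more than grep. A minimal sketch: count_selector is a hypothetical helper, the selector element is optional in the report format, and this counts every mention regardless of pass or fail.

```shell
# Count DKIM auth results in a DMARC aggregate report (XML on stdin)
# that name the given selector.
count_selector() {
  local selector="$1"
  { grep -oF "<selector>${selector}</selector>" || true; } | wc -l | tr -d ' '
}
```

Reports usually arrive compressed, so something like zcat report.xml.gz | count_selector s2026a is the typical call. A count that drops to zero and stays there for a week is the signal to stop watching.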

Rotation cadence

Rotate every 6–12 months. Put the task on the calendar the same week you review SPF includes or rotate other long-lived secrets. Treat it as scheduled maintenance, not an emergency response to a suspected compromise.

If you only rotate when something feels wrong, you will rotate far too late and far too rarely.

Verification that the swap actually completed

Three quick checks confirm the rotation succeeded.

First, send a test message to any Gmail address and view the full headers. The DKIM-Signature line must show s=s2026b (or whatever new selector you chose). If it still shows the old selector, your MTA config change did not take effect.

Second, examine the DMARC rua reports for the domain. After the cutover date you should see zero dkim=fail rows that mention the old selector. Persistent failures from the retired selector are the stale-selector problem described above.

Third, run the built-in test from your MTA on the mail server itself:

opendkim-testkey -d yourdomain.com -s s2026b

It should report that the key matches the published record. Do the same for the old selector only while it is still published; once removed it will correctly fail the test.
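The first check can be scripted if you save the raw message. The sketch below reads a message on stdin, unfolds the DKIM-Signature headers, and prints the selector from each one. dkim_selectors is a hypothetical helper name, and it matches the s= tag only at a tag boundary so base64 data ending in "s=" is not mistaken for it.

```shell
# Print the s= selector of every DKIM-Signature header in a raw message (stdin).
dkim_selectors() {
  awk '
    # Continuation lines start with whitespace and belong to the previous header.
    /^[ \t]/ { if (inhdr) buf = buf $0; next }
    { flush(); inhdr = 0 }
    tolower($0) ~ /^dkim-signature:/ { inhdr = 1; buf = substr($0, 16) }
    END { flush() }
    function flush(   n, i, tags, t) {
      if (!inhdr) return
      n = split(buf, tags, ";")
      for (i = 1; i <= n; i++) {
        t = tags[i]
        gsub(/[ \t\r]/, "", t)
        if (t ~ /^s=/) print substr(t, 3)
      }
    }
  '
}
```

Run dkim_selectors < message.eml on the test message. After the cutover it should print s2026b and nothing else for your domain's signature.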

If you want a deeper walkthrough of reading rua reports without paying for a SaaS dashboard, see the post on DMARC aggregate reports without a SaaS.

You already did the hard part

The original SPF/DKIM/DMARC checklist got your records green. This post gives you the habit that keeps them green after the first key ages out.

The dual-selector pattern is the smallest possible change that eliminates the risk of a rotation outage. The monitoring step with rua reports is the only way to know the job is truly finished instead of hoping the last mail signed with the old key has drained.

The $19 Inbox/DNS Pack and the $79 QuickCheck

If you want the exact dual-selector DNS templates plus the 20-line opendkim or rspamd swap script, the Inbox/DNS Pack contains both ready to copy. No guessing at record syntax. No writing the flip logic from scratch.

After the rotation, if you want an outside set of eyes on a week of your actual rua reports to confirm the old selector is dead (including any forwarders still relaying mail signed with it), the Inbox Cleanup QuickCheck is the option that removes the remaining uncertainty. You send the reports; I confirm the cutover completed and flag anything still signing with the retired selector.

Both are designed for solo founders who already understand the basics and just need the repeatable execution piece.

Do the rotation on a schedule. Use two selectors. Watch the reports for a week. That is the entire maintenance loop.
