driftd: A Self-Hosted Terraform Drift Detection Daemon

Terraform is fantastic until reality diverges from state. Someone clicks through the AWS console. An auto-scaling policy changes an instance type. A DNS record gets edited manually. The state file says one thing, the actual infrastructure says another — and you only find out when something breaks.

I built driftd to catch this quietly before it becomes a problem. It's a single Go binary that runs as a daemon, periodically scans your Terraform state against live AWS and Cloudflare resources, and surfaces any drift through a web UI, REST API, and Slack notifications.

The Problem with Drift

If you've run Terraform in production for any length of time, you've seen it: the dreaded "changes detected on refresh" output that shows up unexpectedly. In a solo setup this is annoying. In a team with multiple engineers, multiple environments, and a mix of Terraform-managed and manually adjusted resources, it compounds fast.

The existing solutions require one of three things: a full Terraform Cloud subscription, a heavy observability stack, or a CI job that runs terraform plan against every workspace on a schedule. The CI route means storing credentials, managing state backend access, and waiting on plan execution times.

I wanted something simpler: a daemon that understands Terraform state files and knows how to query the matching resources directly.

Architecture: One Binary, Everything Inside

The entire tool ships as a single statically linked Go binary. There's no separate database server, no frontend deployment step, no sidecar. Everything is embedded:

driftd (single binary)
├── Cobra CLI          — serve, scan, workspace, version commands
├── SQLite database    — scan history, drift results, workspace config
├── HTTP API           — REST endpoints at /api/v1
├── React frontend     — embedded via go:embed from ui/dist
└── Scheduler          — cron-based scan dispatcher

The internal structure follows a clean layered approach:

internal/
├── config/      Viper-based YAML configuration
├── database/    SQLite init + schema migrations
├── models/      Domain types (Workspace, ScanRun, DriftResult)
├── store/       CRUD layer over SQLite using sqlx
├── api/         HTTP handlers using stdlib net/http
├── scanner/     Core drift detection logic
│   ├── state.go        State file readers (S3 + local)
│   ├── fetcher.go      AWS resource fetchers
│   ├── diff.go         Comparison with smart field ignoring
│   └── cloudflare_fetcher.go
├── scheduler/   Cron-based scan scheduling
└── notifier/    Slack + webhook notifications

Key Technology Decisions

A few choices that shaped the design:

Pure Go SQLite (modernc.org/sqlite) — no CGO, no C compiler required, cross-compiles cleanly. The driver name is "sqlite" (not "sqlite3"). Paired with sqlx for raw SQL instead of an ORM — gives full control over queries and makes the data model explicit.

Single writer enforced: db.SetMaxOpenConns(1) plus WAL mode. SQLite allows only one writer at a time; capping the connection pool at one connection serializes writes inside the process, so they queue up instead of failing with SQLITE_BUSY.
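A minimal sketch of that setup, using only database/sql. The helper names buildDSN and openDB are made up for illustration, and the busy_timeout value is an assumption rather than something driftd is known to set. The _pragma query parameter is the modernc.org/sqlite way of running PRAGMA statements on each new connection; the real binary would also need the blank import noted in the comment.

```go
package main

import (
	"database/sql"
	"fmt"
)

// buildDSN returns a DSN for the modernc.org/sqlite driver. Each _pragma
// value is executed as a PRAGMA statement on every new connection.
func buildDSN(path string) string {
	return "file:" + path +
		"?_pragma=journal_mode(WAL)" +
		"&_pragma=foreign_keys(1)" +
		"&_pragma=busy_timeout(5000)" // assumed value, in milliseconds
}

// openDB opens the database and caps the pool at a single connection so
// every statement is serialized through one handle.
// The driver must be registered elsewhere with: import _ "modernc.org/sqlite"
func openDB(path string) (*sql.DB, error) {
	db, err := sql.Open("sqlite", buildDSN(path)) // driver name is "sqlite"
	if err != nil {
		return nil, err
	}
	db.SetMaxOpenConns(1)
	return db, nil
}

func main() {
	fmt.Println(buildDSN("./driftd.db"))

	// Without the blank driver import, sql.Open reports an unknown driver.
	if _, err := openDB("./driftd.db"); err != nil {
		fmt.Println("open failed:", err)
	}
}
```

One trade-off worth knowing: a pool of one also serializes reads, which is usually fine for a daemon with this kind of load.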

ULIDs for IDs: sortable, URL-safe, timestamp-embedded. Because the timestamp is the leading component, ORDER BY id returns rows in creation order (to millisecond resolution), with no separate timestamp column to sort on.

Interface-based fetchers — each resource type implements:

type AWSFetcher interface {
    ResourceType() string
    Fetch(ctx context.Context, resourceID string, stateAttrs map[string]any) (map[string]any, error)
}

Adding a new resource type means implementing this interface and registering it — no changes to the core scanner loop.

Stdlib net/http — no Gin, no Echo, no Chi. For a daemon with a handful of REST endpoints the standard library is sufficient and removes a dependency.

How a Scan Works

The scan loop is straightforward:

Read state file — from a local path or S3 object, depending on workspace configuration
Walk managed resources — iterate every resource block in the state
Fetch live attributes — for each resource type driftd supports, query the cloud provider directly
Compare — diff the state attributes against the live attributes, ignoring computed fields that are expected to vary
Store results — persist each resource's status (in_sync, drifted, or deleted) with a timestamp and the diff payload
Notify — if drift is found and notifications are configured, fan out to Slack and/or webhooks
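The first two steps can be sketched with encoding/json alone. The struct below models only the handful of fields from a version 4 state file that this example needs; walkManaged and managedResource are hypothetical names, and the address it builds ignores modules and count/for_each index keys for brevity.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// tfState models only the parts of a version 4 state file used here.
type tfState struct {
	Resources []struct {
		Mode      string `json:"mode"`
		Type      string `json:"type"`
		Name      string `json:"name"`
		Instances []struct {
			Attributes map[string]any `json:"attributes"`
		} `json:"instances"`
	} `json:"resources"`
}

// managedResource is a flattened view: one entry per resource instance.
type managedResource struct {
	Address string
	Type    string
	Attrs   map[string]any
}

// walkManaged parses raw state and returns every managed resource instance,
// skipping data sources.
func walkManaged(raw []byte) ([]managedResource, error) {
	var st tfState
	if err := json.Unmarshal(raw, &st); err != nil {
		return nil, fmt.Errorf("parse state: %w", err)
	}
	var out []managedResource
	for _, r := range st.Resources {
		if r.Mode != "managed" {
			continue
		}
		for _, inst := range r.Instances {
			out = append(out, managedResource{
				Address: r.Type + "." + r.Name,
				Type:    r.Type,
				Attrs:   inst.Attributes,
			})
		}
	}
	return out, nil
}

// sample is a trimmed, made-up state file with one resource and one data source.
const sample = `{"version":4,"resources":[
 {"mode":"managed","type":"aws_instance","name":"web",
  "instances":[{"attributes":{"id":"i-0abc","instance_type":"t3.micro"}}]},
 {"mode":"data","type":"aws_ami","name":"ubuntu",
  "instances":[{"attributes":{"id":"ami-0abc"}}]}
]}`

func main() {
	res, err := walkManaged([]byte(sample))
	if err != nil {
		panic(err)
	}
	for _, r := range res {
		fmt.Println(r.Address, r.Attrs["id"])
	}
}
```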

The comparison step uses per-resource-type ignore lists to avoid noise. For EC2 instances, fields like arn, public_dns, and public_ip are excluded: these are computed by AWS rather than set in configuration, and values such as the public IP can change on an instance stop/start without anyone editing the resource.
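A simplified version of that comparison might look like the following. The ignored map, the change type, and diffAttrs are illustrative names; the aws_instance entries come from the paragraph above, and treating a nil live map as "deleted" is an assumption about how a missing resource is signalled.

```go
package main

import (
	"fmt"
	"reflect"
)

// ignored lists attributes to skip, keyed by resource type.
var ignored = map[string]map[string]bool{
	"aws_instance": {"arn": true, "public_dns": true, "public_ip": true},
}

// change holds both sides of one differing attribute.
type change struct{ State, Live any }

// diffAttrs compares each attribute the fetcher returned against the value
// recorded in state, skipping ignored keys. A nil live map is treated as a
// resource that no longer exists.
func diffAttrs(resourceType string, state, live map[string]any) (string, map[string]change) {
	if live == nil {
		return "deleted", nil
	}
	changes := map[string]change{}
	for key, got := range live {
		if ignored[resourceType][key] {
			continue
		}
		if want := state[key]; !reflect.DeepEqual(want, got) {
			changes[key] = change{State: want, Live: got}
		}
	}
	if len(changes) == 0 {
		return "in_sync", nil
	}
	return "drifted", changes
}

func main() {
	state := map[string]any{"instance_type": "t3.micro", "public_ip": "203.0.113.10"}
	live := map[string]any{"instance_type": "t3.large", "public_ip": "203.0.113.99"}

	status, changes := diffAttrs("aws_instance", state, live)
	fmt.Println(status, changes)
}
```

reflect.DeepEqual is strict about types: a number decoded from state JSON is a float64, so a fetcher returning an int for the same value would register as drift unless values are normalized first.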

Supported Resources

AWS

Terraform Type	Compared Attributes
`aws_instance`	instance_type, ami, availability_zone, tags
`aws_s3_bucket`	bucket, region
`aws_security_group`	name, description, vpc_id
`aws_db_instance`	instance_class, engine, engine_version, db_instance_status
`aws_vpc`	cidr_block, state, is_default

Cloudflare

Cloudflare fetchers activate automatically when credentials are present in the environment:

export CLOUDFLARE_API_TOKEN=<scoped-token>   # preferred
# or
export CLOUDFLARE_API_KEY=<global-key>
export CLOUDFLARE_EMAIL=<account-email>
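The activation check could be as small as the sketch below. cloudflareAuthMode is a hypothetical helper, and giving the token precedence over the key/email pair is an assumption based on the "preferred" note above.

```go
package main

import (
	"fmt"
	"os"
)

// cloudflareAuthMode reports which credential style is available.
// getenv is injected so the logic can be exercised without touching the
// real environment.
func cloudflareAuthMode(getenv func(string) string) string {
	if getenv("CLOUDFLARE_API_TOKEN") != "" {
		return "token"
	}
	// The global key is only usable together with the account email.
	if getenv("CLOUDFLARE_API_KEY") != "" && getenv("CLOUDFLARE_EMAIL") != "" {
		return "key"
	}
	return "none"
}

func main() {
	switch mode := cloudflareAuthMode(os.Getenv); mode {
	case "none":
		fmt.Println("cloudflare fetchers disabled: no credentials in environment")
	default:
		fmt.Println("cloudflare fetchers enabled using", mode, "credentials")
	}
}
```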

Terraform Type	Compared Attributes
`cloudflare_record`	name, type, content, proxied, ttl, comment
`cloudflare_ruleset`	name, kind, phase, description
`cloudflare_zone_settings_override`	full settings block

Quick Start

# Build (requires Node.js for the embedded frontend)
make build

# Copy and edit the config
cp driftd.yaml.example driftd.yaml

# Start the server (binds to :8080 by default)
./driftd serve

# Open the UI
open http://localhost:8080

Add a Workspace

# Terraform state stored in S3
./driftd workspace add \
  --name production \
  --source-type s3 \
  --source-config '{"bucket":"my-tfstate","key":"prod/terraform.tfstate","region":"us-east-1"}' \
  --region us-east-1 \
  --schedule "@every 6h"

# Local state file
./driftd workspace add \
  --name local-env \
  --source-type local \
  --source-config '{"path":"/path/to/terraform.tfstate"}' \
  --region us-east-1

Run a One-Shot Scan

./driftd scan --workspace production

Configuration

server:
  port: 8080
  host: "0.0.0.0"

database:
  path: "./driftd.db"

log:
  level: "info"   # debug, info, warn, error

aws:
  region: "us-east-1"
  # profile: "my-profile"   # optional named profile

notifications:
  slack_webhook_url: "https://hooks.slack.com/services/..."
  webhook_url: "https://your-receiver.example.com/hook"
  on_drift: true      # notify when drift is detected
  on_delete: true     # notify when a resource disappears
  on_all_scans: false # notify after every scan regardless of result

AWS credentials follow the standard SDK chain: environment variables → ~/.aws/credentials → IAM instance profile. No credentials are ever stored in the config file or database.

Scan Scheduling

Each workspace can have its own schedule. driftd accepts standard cron expressions and the robfig/cron descriptors:

@every 15m        every 15 minutes
@every 6h         every 6 hours
@daily            once a day at midnight
0 */6 * * *       every 6 hours (standard cron)

An empty schedule means manual-only. The scheduler tracks running scans per workspace to avoid concurrent runs — if a scan is already in flight when the next trigger fires, the new invocation is skipped and logged.
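The skip-if-running behavior boils down to a mutex-guarded set of workspace names. This is a sketch of the idea rather than the scheduler's actual code; inflight, tryStart, finish, and trigger are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
)

// inflight tracks which workspaces currently have a scan running.
type inflight struct {
	mu      sync.Mutex
	running map[string]bool
}

func newInflight() *inflight { return &inflight{running: map[string]bool{}} }

// tryStart marks the workspace as running and returns true, or returns
// false if a scan for it is already in flight.
func (f *inflight) tryStart(ws string) bool {
	f.mu.Lock()
	defer f.mu.Unlock()
	if f.running[ws] {
		return false
	}
	f.running[ws] = true
	return true
}

// finish clears the running mark for the workspace.
func (f *inflight) finish(ws string) {
	f.mu.Lock()
	defer f.mu.Unlock()
	delete(f.running, ws)
}

// trigger is what a cron callback would call: it runs scan unless one is
// already running for the same workspace, in which case it logs and skips.
func (f *inflight) trigger(ws string, scan func()) bool {
	if !f.tryStart(ws) {
		fmt.Println("scan already running, skipping:", ws)
		return false
	}
	defer f.finish(ws)
	scan()
	return true
}

func main() {
	g := newInflight()
	g.trigger("production", func() {
		// A second trigger that fires mid-scan is skipped.
		g.trigger("production", func() {})
	})
	g.trigger("production", func() { fmt.Println("next scheduled scan runs") })
}
```

The mutex is held only while the map is touched, never for the duration of a scan, so triggers for other workspaces are not blocked.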

REST API

All endpoints are under /api/v1. Responses use a consistent envelope:

{ "data": { ... } }

Errors return:

{ "error": "descriptive message" }

Method	Path	Description
`GET`	`/workspaces`	List workspaces
`POST`	`/workspaces`	Create workspace
`GET`	`/workspaces/:id`	Get workspace
`PUT`	`/workspaces/:id`	Update workspace
`DELETE`	`/workspaces/:id`	Delete workspace
`GET`	`/workspaces/:id/scans`	Scan history for a workspace
`GET`	`/scans/:id`	Get a specific scan run
`GET`	`/scans/:id/results`	Drift results for a scan
`GET`	`/health`	Health check

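Deleting a workspace cascades: all scan runs and drift results for that workspace are removed automatically via SQLite foreign key constraints. SQLite only enforces foreign keys when PRAGMA foreign_keys is enabled on the connection, so the cascade depends on that pragma being set.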

Notifications

The notification pipeline uses a composite MultiNotifier that fans out to all configured receivers. A single notifier failing (e.g., Slack webhook returning 5xx) is logged as a warning and doesn't prevent other notifiers from firing.

Trigger conditions are evaluated once per scan:

on_drift: true — at least one resource has drifted status
on_delete: true — at least one resource has deleted status (no longer exists in the live account)
on_all_scans: true — always notify, including clean scans
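Expressed as code, the three conditions reduce to one predicate evaluated against a scan summary. NotifyConfig, ScanSummary, and shouldNotify are illustrative names; the field meanings follow the config keys above.

```go
package main

import "fmt"

// NotifyConfig mirrors the three notification toggles from the config file.
type NotifyConfig struct {
	OnDrift    bool
	OnDelete   bool
	OnAllScans bool
}

// ScanSummary holds per-status resource counts for one scan.
type ScanSummary struct {
	Drifted int
	Deleted int
}

// shouldNotify is evaluated once per scan, after all results are stored.
func shouldNotify(cfg NotifyConfig, s ScanSummary) bool {
	switch {
	case cfg.OnAllScans:
		return true
	case cfg.OnDrift && s.Drifted > 0:
		return true
	case cfg.OnDelete && s.Deleted > 0:
		return true
	}
	return false
}

func main() {
	cfg := NotifyConfig{OnDrift: true, OnDelete: true}
	fmt.Println("clean scan:", shouldNotify(cfg, ScanSummary{}))
	fmt.Println("one drifted:", shouldNotify(cfg, ScanSummary{Drifted: 1}))
	fmt.Println("one deleted:", shouldNotify(cfg, ScanSummary{Deleted: 1}))
}
```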

The Slack payload includes the workspace name, scan timestamp, and a formatted list of drifted resources with their attribute differences.

What's Not Included (Intentionally)

driftd is deliberately scoped. It does not:

Run terraform plan — no Terraform binary required, no state locking during scans
Apply changes — read-only, never modifies infrastructure
Store cloud credentials — uses the standard AWS SDK chain and environment variables for Cloudflare
Require a cloud database — SQLite is sufficient for the scan volumes a typical team produces

The goal was a tool I could drop on a small VM or into a Kubernetes pod and forget about, not another service that needs its own managed database and credential rotation pipeline.

CLI Reference

driftd serve                         Start the HTTP server and scheduler
driftd scan --workspace <name>       Run a one-shot scan
driftd workspace list                List all workspaces
driftd workspace add --name ...      Create a workspace
driftd workspace delete <name>       Delete a workspace
driftd version                       Print version

Use --config / -c on any command to point at a non-default config file.

The repository is at github.com/georg-nikola/driftd. Contributions welcome — especially additional resource fetchers for other AWS resource types or cloud providers.