2026-03-19
Building a Mini Backstage: An Internal Developer Portal on Kubernetes
Why Build a Developer Portal?
Every team with more than a handful of services eventually hits the same problem: no one knows what's running, who owns it, or where the docs are. The answer to this is usually Backstage — Spotify's open-source developer portal. It's excellent, but it's also a TypeScript monolith with a plugin ecosystem that can feel like overkill when you just want a catalogue of services with status indicators and links.
So I built my own — a mini Backstage that does exactly what I need and nothing more.
Repository: dev-portal on GitHub
What It Does
The portal is a service catalogue with the following per-service fields:
- Name, team, description — the basics
- Status — healthy, degraded, down, or unknown
- Health check URL — if provided, the backend pings it every 60 seconds and updates status automatically
- Links — docs, GitHub, and dashboard URLs surfaced as icon buttons on each card
- Tags — for filtering
The frontend has a sidebar with live filters (status, team, tags), a debounced search bar, and sidebar stats that always reflect the full dataset. All filtering is done client-side against an in-memory array fetched on load and refreshed every 30 seconds — snappy UX with no extra round trips.
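The filter logic described above is small enough to sketch. The real implementation lives in app.js; the version below is written in Python only to stay consistent with the backend examples, and the names `ServiceCard`, `matches`, and `stats` are hypothetical. It assumes that an empty filter means "no restriction" and that a service matches the tag filter if it carries at least one selected tag. Debouncing is a UI timing concern and is left out.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceCard:
    """Hypothetical shape of one catalogue entry held in memory."""
    name: str
    team: str
    status: str
    description: str = ""
    tags: list[str] = field(default_factory=list)


def matches(svc, *, statuses=frozenset(), teams=frozenset(), tags=frozenset(), query=""):
    """Return True if the service passes every active filter."""
    if statuses and svc.status not in statuses:
        return False
    if teams and svc.team not in teams:
        return False
    # Assumption: any one selected tag is enough for a match.
    if tags and not set(tags).intersection(svc.tags):
        return False
    q = query.strip().lower()
    if q:
        haystack = " ".join([svc.name, svc.team, svc.description, *svc.tags]).lower()
        if q not in haystack:
            return False
    return True


def stats(all_services):
    """Count services per status over the full, unfiltered dataset."""
    counts = {}
    for svc in all_services:
        counts[svc.status] = counts.get(svc.status, 0) + 1
    return counts
```

Because `stats` only ever sees the full array, the sidebar numbers stay stable no matter which filters are active.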
Stack
The same stack I use across my homelab projects, because consistency pays dividends at 2am when something breaks:
| Layer | Technology |
|---|---|
| Frontend | Vanilla JS, HTML, CSS — no build step, no framework |
| Backend | FastAPI (Python), async SQLAlchemy, asyncpg |
| Database | PostgreSQL 16 |
| Ingress | Traefik v3 via IngressRoute CRDs |
| Tunnel | Cloudflare Tunnel → no open inbound ports |
| GitOps | ArgoCD + Image Updater |
| CI | GitHub Actions — build, Trivy scan, Playwright tests |
The frontend is intentionally framework-free. For a portal this size it's the right call: zero build complexity, instant load, and the whole thing is four files (index.html, style.css, app.js, and nginx.conf).
Architecture
Browser
   ↓
Cloudflare Edge (DNS / WAF / DDoS protection)
   ↓
Cloudflare Tunnel (outbound-only, zero open ports)
   ↓
Traefik (IngressRoute: /api → API, everything else → frontend nginx)
   ↓
┌─────────────────────┐    ┌────────────────────────────────┐
│  nginx (static)     │    │  FastAPI (API)                 │
│  dev-portal:80      │    │  dev-portal-api:8000           │
│                     │    │  ├── /api/services CRUD        │
│  Serves HTML/JS/CSS │    │  ├── /api/services/{id}/check  │
│                     │    │  ├── /api/health               │
└─────────────────────┘    │  └── background health checker │
                           └──────────────┬─────────────────┘
                                          │
                           ┌──────────────▼─────────────────┐
                           │  PostgreSQL StatefulSet        │
                           │  NetworkPolicy: API pods only  │
                           └────────────────────────────────┘
The Traefik routing uses two IngressRoutes:
# Frontend: everything that isn't /api
match: Host(`dev-portal.georg-nikola.com`) && !PathPrefix(`/api`)
# API: requests with /api prefix
match: Host(`dev-portal.georg-nikola.com`) && PathPrefix(`/api`)
No StripPrefix middleware — the FastAPI app registers its routers with prefix="/api", so the paths are preserved end-to-end. This mirrors how my movie-picker API is wired up.
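To make the split concrete, here is a toy model of the two rules in Python. It is not Traefik code: `pick_backend` is a hypothetical helper, and it treats PathPrefix as a plain string-prefix test, which is a simplification. The backend names come from the architecture diagram.

```python
from typing import Optional

PORTAL_HOST = "dev-portal.georg-nikola.com"


def pick_backend(host: str, path: str) -> Optional[str]:
    """Mimic the two IngressRoute rules for a given Host header and path."""
    if host.lower() != PORTAL_HOST:
        return None  # neither rule matches this host
    if path.startswith("/api"):
        # Host && PathPrefix(`/api`): the path is forwarded unchanged.
        return "dev-portal-api:8000"
    # Host && !PathPrefix(`/api`): everything else goes to nginx.
    return "dev-portal:80"
```

Since nothing strips the prefix, a request for /api/services arrives at FastAPI as /api/services, which is why the routers need prefix="/api".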
The Background Health Checker
The most interesting piece of the backend is the async background task that polls service health endpoints:
async def _run_status_checks():
    while True:
        try:
            await asyncio.sleep(settings.status_check_interval)  # 60s
            async with AsyncSessionLocal() as db:
                result = await db.execute(
                    select(Service).where(Service.status_url.isnot(None))
                )
                svcs = result.scalars().all()
            for svc in svcs:
                new_status = await _ping_url(svc.status_url)
                async with AsyncSessionLocal() as db:
                    obj = await db.get(Service, svc.id)
                    if obj:
                        obj.status = new_status
                        obj.last_checked_at = datetime.now(timezone.utc)
                        await db.commit()
        except asyncio.CancelledError:
            break
        except Exception as exc:
            logger.exception("Error in background status checker: %s", exc)
It's launched as an asyncio.Task in the FastAPI lifespan context manager and cancelled cleanly on shutdown. Each ping returns healthy for HTTP 2xx, degraded for non-2xx, and down for connection errors or timeouts.
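The start-and-cancel lifecycle can be sketched with nothing but the standard library. This is a minimal illustration, not the portal's actual code: `run_status_checks` is a stand-in loop with the real work elided, and the 60-second interval is hard-coded here rather than read from settings.

```python
import asyncio
from contextlib import asynccontextmanager, suppress


async def run_status_checks(interval: float) -> None:
    """Stand-in for the checker loop: sleep, do work, repeat until cancelled."""
    while True:
        try:
            await asyncio.sleep(interval)
            # ... query services and ping their health URLs here ...
        except asyncio.CancelledError:
            break


@asynccontextmanager
async def lifespan(app):
    # Startup: schedule the checker on the running event loop.
    task = asyncio.create_task(run_status_checks(interval=60))
    try:
        yield
    finally:
        # Shutdown: request cancellation, then wait for the loop to exit.
        task.cancel()
        # The task raises CancelledError itself if it was cancelled before
        # it ever got to run, so that case is suppressed here.
        with suppress(asyncio.CancelledError):
            await task
```

FastAPI accepts a context manager like this through its `lifespan` argument, for example `FastAPI(lifespan=lifespan)`. Awaiting the task after cancelling it ensures shutdown does not proceed until the loop has actually stopped.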
There's also a POST /api/services/{id}/check endpoint that triggers an immediate check for a single service — surfaced on each card as a "Check now" button.
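Both the background loop and the "Check now" button depend on the same mapping from an HTTP outcome to a status. The body of `_ping_url` is not shown in this post, so the following is only a sketch of that mapping. It uses the standard library and a worker thread to stay dependency-free; the names `classify_http_status` and `ping_url` and the 5-second timeout are assumptions, and a real async service would more likely use an async HTTP client.

```python
import asyncio
import urllib.error
import urllib.request


def classify_http_status(code: int) -> str:
    """healthy for 2xx, degraded for any other HTTP status."""
    return "healthy" if 200 <= code < 300 else "degraded"


def _ping_sync(url: str, timeout: float) -> str:
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return classify_http_status(resp.status)
    except urllib.error.HTTPError as err:
        # The server answered, but with a non-2xx status code.
        return classify_http_status(err.code)
    except OSError:
        # URLError, connection failures, and timeouts are all OSError subclasses.
        return "down"


async def ping_url(url: str, timeout: float = 5.0) -> str:
    """Run the blocking request in a worker thread so the event loop stays free."""
    return await asyncio.to_thread(_ping_sync, url, timeout)
```

The order of the `except` clauses matters: `HTTPError` is a subclass of `URLError`, which is itself an `OSError`, so the more specific case has to come first.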
Playwright Integration Tests
I added a full Playwright test suite that runs on every CI push. The tests cover 44 assertions across 15 test groups:
- Page load and brand rendering
- Add service modal (open, form fields, close via Cancel and Escape)
- Creating a service and verifying the card appears with correct name, status badge, team, tags, and links
- Sidebar stats updating after create
- Search debouncing and empty state
- Status, team, and tag filter toggle behaviour
- Edit modal pre-filling and saving
- Delete flow with confirm modal (including cancel)
- Validation (empty name blocked)
- Toast notifications
- Zero console errors
# Run against docker-compose
docker compose up -d
python tests/test_dev_portal.py --url http://localhost:8080
# Run against the deployed cluster via port-forward
kubectl port-forward svc/dev-portal 18080:80 -n dev-portal &
kubectl port-forward svc/dev-portal-api 18001:8000 -n dev-portal &
python tests/test_dev_portal.py --url http://localhost:18080 --api-url http://localhost:18001
The --api-url flag uses Playwright's page.route() to intercept /api/* requests and redirect them to the separately port-forwarded API service. This lets the tests run against the actual deployed cluster without needing to touch DNS.
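A rough sketch of how such an interception could look with Playwright's sync API follows. The helper names are hypothetical and the real test file may be structured differently. It assumes the handler uses `route.continue_(url=...)`, which Playwright allows as long as the new URL keeps the original protocol.

```python
from urllib.parse import urlsplit, urlunsplit


def rewrite_api_url(request_url: str, api_base: str) -> str:
    """Keep path and query, but swap scheme and host:port for those of api_base."""
    original = urlsplit(request_url)
    target = urlsplit(api_base)
    return urlunsplit((target.scheme, target.netloc, original.path, original.query, ""))


def install_api_redirect(page, api_base: str) -> None:
    """Register a route handler on a Playwright page that forwards /api/* calls."""

    def handler(route):
        # Send the request on to the port-forwarded API instead of the frontend origin.
        route.continue_(url=rewrite_api_url(route.request.url, api_base))

    page.route("**/api/**", handler)
```

With the port-forwards above, a browser request to http://localhost:18080/api/services would be sent to http://localhost:18001/api/services instead, while static assets still load from port 18080.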
In CI (GitHub Actions), the Playwright job runs after both Docker builds complete, spins up the full stack with docker compose up --build, waits for the API health check, then runs the suite.
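The "wait for the API health check" step is a simple poll-until-ready loop. The workflow itself is not shown here, so this is a hedged sketch of the idea in Python; `api_is_healthy` and `wait_until` are hypothetical names, and the timeout values are arbitrary.

```python
import time
import urllib.error
import urllib.request


def api_is_healthy(url: str) -> bool:
    """Return True if a GET to url answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return 200 <= resp.status < 300
    except OSError:
        # Covers HTTPError, URLError, refused connections, and timeouts.
        return False


def wait_until(check, timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Call check() until it returns True or timeout seconds have passed."""
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A CI step could call something like `wait_until(lambda: api_is_healthy("http://localhost:8080/api/health"))` and fail the job if it returns False. That URL assumes the compose stack exposes the API under /api on port 8080, as the local test command suggests.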
Deployment with ArgoCD and Helm
Three ArgoCD Application manifests manage the deployment (in my private talos-configs repo):
- postgresql-dev-portal — PostgreSQL StatefulSet with 5Gi PVC, NetworkPolicy restricting access to API pods only
- dev-portal-api — FastAPI, readOnlyRootFilesystem: true, /tmp as tmpfs emptyDir, non-root user
- dev-portal — nginx serving static assets
All three are annotated for ArgoCD Image Updater with the semver strategy. Pushing a v*.*.* git tag triggers a CI build that publishes ghcr.io/georg-nikola/dev-portal:X.Y.Z and ghcr.io/georg-nikola/dev-portal-api:X.Y.Z, and ArgoCD then rolls out the new images.
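The semver strategy boils down to "pick the highest version among the available tags". The snippet below illustrates that selection rule in Python. It is not Image Updater's implementation: the helper names are hypothetical, and it only handles plain X.Y.Z tags with an optional leading v, ignoring pre-release and build suffixes.

```python
import re

# Plain MAJOR.MINOR.PATCH, optionally prefixed with "v" as in the git tag.
SEMVER_TAG = re.compile(r"v?(\d+)\.(\d+)\.(\d+)")


def parse_version(tag: str):
    """Return (major, minor, patch) as integers, or None for non-semver tags."""
    m = SEMVER_TAG.fullmatch(tag)
    return tuple(int(part) for part in m.groups()) if m else None


def newest_release(tags):
    """Return the highest X.Y.Z among tags as a string, or None if there is none."""
    versions = [v for v in map(parse_version, tags) if v is not None]
    if not versions:
        return None
    return ".".join(str(part) for part in max(versions))
```

Comparing integer tuples rather than strings is what makes 1.10.0 rank above 1.9.0.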
Debugging: The Cloudflare Tunnel Trap
Getting the portal live took longer than building it, for two reasons.
First, the Helm chart templates had tls: certResolver: cloudflare in the IngressRoutes. This Traefik instance doesn't have a cert resolver configured — TLS is terminated by Cloudflare at the edge before traffic reaches the cluster via the tunnel. Traefik silently dropped both routes with "Router uses a nonexistent certificate resolver" in its logs, resulting in a clean 404 with no obvious error visible to the browser. The fix: remove the tls block entirely, matching how every other app in the cluster is configured.
Second, even after fixing Traefik, the site still returned 404. This time the problem was that dev-portal.georg-nikola.com had never made it into the Cloudflare tunnel's ingress config. My initial Python patch to the ConfigMap had silently failed, so the entry wasn't saved. Without it, cloudflared hit its catch-all http_status:404 rule for every request, and traffic never even reached Traefik.
Both bugs produced the same symptom (404), which made them harder to separate. The breakthrough was testing routing directly against Traefik via port-forward with an explicit Host: header — that returned 200, which proved Traefik was fine and pointed the finger squarely at the tunnel.
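The same probe can be scripted. The sketch below sends a GET to a local port while presenting the portal's hostname in the Host header, using only the standard library. `probe_with_host` is a hypothetical helper, and the address and port depend on how Traefik is port-forwarded in a given cluster.

```python
import http.client


def probe_with_host(address: str, port: int, host_header: str, path: str = "/") -> int:
    """GET path from address:port while presenting host_header; return the HTTP status."""
    conn = http.client.HTTPConnection(address, port, timeout=5)
    try:
        # An explicit Host header stops http.client from deriving one from address:port,
        # so the router sees the public hostname and can match its Host(...) rule.
        conn.request("GET", path, headers={"Host": host_header})
        return conn.getresponse().status
    finally:
        conn.close()
```

For example, with Traefik forwarded to a local port of your choosing (8000 here is only a placeholder), `probe_with_host("127.0.0.1", 8000, "dev-portal.georg-nikola.com")` returning 200 shows the route is matching and the fault lies further upstream.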
Lesson: always verify the tunnel config explicitly after patching it. I've now added the entry to the config.yaml in talos-configs so it's version-controlled and won't drift again.
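One way to make that verification routine is a tiny check against the parsed tunnel config. This sketch assumes the config has already been loaded into a dict (for instance with a YAML parser) and follows cloudflared's `ingress` list of `hostname`/`service` rules. `find_ingress_rule` is a hypothetical helper that only does exact hostname matches, so wildcard rules are not considered.

```python
from typing import Optional


def find_ingress_rule(config: dict, hostname: str) -> Optional[dict]:
    """Return the first ingress rule whose hostname equals `hostname`, else None."""
    for rule in config.get("ingress", []):
        # The catch-all rule has no hostname key, so .get() returns None for it.
        if rule.get("hostname") == hostname:
            return rule
    return None
```

If this returns None for dev-portal.georg-nikola.com, requests for that host will fall through to the catch-all rule, which is exactly the failure described above.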
What's Next
A few things I'm considering adding:
- Dependency graph — draw edges between services that call each other
- On-call ownership — link a service to a rotation or contact
- Changelog per service — embed the last few GitHub releases from the API
- Read-only API key for external tooling to query the catalogue
For now though, it does exactly what I wanted: one URL, every service, current status, all the links. No YAML schemas to write, no plugin registry to manage.
Live at dev-portal.georg-nikola.com — source at github.com/georg-nikola/dev-portal.