
Phase 2: I Rebuilt the Build Pipeline, Hardened Everything, and Broke It Again

By Avik Mukherjee · May 24, 2026 · 12 min read · Updated May 24, 2026

Where Part 1 Left Off#

Part 1 ended with a working loop: GitHub webhook → NATS → worker builds inside node:22-alpine → artifacts land in MinIO → API serves static files or cold-starts a Node container for SSR.

It worked. It also had problems I knew about but had not fixed:

  • The worker ran as root with the Docker socket mounted
  • Build containers could reach Postgres, NATS, and MinIO on the same network
  • Logs lived in an in-memory buffer and disappeared if the API restarted mid-build
  • Output detection (standalone vs static) was fragile
  • Every bug in Part 1 was a file-path or plumbing problem

I wrote an AUDIT.md — 1100 lines of everything wrong and everything I wanted next. Then I started checking items off. That became Phase 2.


The Big Architectural Shift: Stop Serving Artifacts, Start Serving Images#

The insight that changed everything: Vercel does not reassemble your app from S3 on every request. It builds a container (or serverless bundle), pushes it to a registry, and runs that.

My MinIO pipeline was clever for a day-one demo. For a real Next.js app with 600+ npm packages and a multi-gigabyte standalone output, it was wrong:

  • list_objects_v2 pagination bugs
  • _next/ vs .next/ path mapping
  • Downloading thousands of files on every cold start
  • Bind mount path mismatches between host and container

So I replaced the whole middle:

code
Before:
  git clone → npm build → docker cp → MinIO → download on serve → node container

After:
  git clone → nixpacks plan → buildctl → push to local registry → docker run image

The serve path became one line of intent: run the image Nixpacks built, wire Traefik labels, done.

| Layer | Phase 1 | Phase 2 |
| --- | --- | --- |
| Build planner | Manual output detection | Nixpacks (auto-detects Node, Bun, pnpm, etc.) |
| Image builder | docker build in worker | BuildKit via buildctl |
| Artifact store | MinIO | Local Docker registry (localhost:5000) |
| Serve runtime | Download + bind mount + node:22-alpine | docker run the built image |
| Traefik routing | API proxy | Direct labels on serve container |

Key commits: 2b529e2, fee3e41, 950f85a, c1f7c9f.


May 18 — Railpack, Then Nixpacks, Then BuildKit (Three Times)#

I started with Railpack. It generates Dockerfiles and understands frameworks. Good idea. Then I hit the same class of problem as Part 1: getting the built image out of the build environment and into a registry without hanging on large tarballs.

The progression in my git history tells the story:

code
fee3e41  Use railpack build and BuildKit for builds
f348499  Use railpack --name and push image manually
99163a4  Add BuildKit container and use as builder host
7a67da4  Remove BuildKit service and references
950f85a  Replace railpack with nixpacks for builds
c1f7c9f  Refactor build process to use buildctl for Docker image push

I yo-yoed on BuildKit because I did not understand the split of responsibilities at first:

  • Nixpacks generates the Dockerfile and build plan
  • BuildKit executes the build and pushes layers to the registry
  • buildctl is the CLI the worker uses to talk to BuildKit

The final worker pipeline:

sh
# 1. Nixpacks writes .nixpacks/Dockerfile and prints the plan
nixpacks build -o . --install-cmd "bun install" .

# 2. BuildKit builds and pushes; the worker never sees the tarball
buildctl build \
  --frontend dockerfile.v0 \
  --local context=. \
  --local dockerfile=.nixpacks \
  --opt network=vercel-clone_build-net \
  --output type=image,name=registry:5000/deployment-{id}:latest,push=true
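
As a rough sketch of how a Rust worker might assemble the second step, the function below builds the buildctl argument list for one deployment. The helper name `buildctl_args` and the `u64` id type are assumptions for illustration; the flags themselves are copied from the command above.

```rust
/// Builds the argument list for the `buildctl build` step.
/// Hypothetical helper: the name and the `u64` id type are illustrative.
fn buildctl_args(deployment_id: u64) -> Vec<String> {
    // BuildKit pushes straight to the registry, so the image name carries
    // the Docker-network hostname (`registry:5000`).
    let output = format!(
        "type=image,name=registry:5000/deployment-{deployment_id}:latest,push=true"
    );
    [
        "build",
        "--frontend", "dockerfile.v0",
        "--local", "context=.",
        "--local", "dockerfile=.nixpacks",
        "--opt", "network=vercel-clone_build-net",
        "--output", output.as_str(),
    ]
    .iter()
    .map(|arg| arg.to_string())
    .collect()
}
```

The result can be handed to `std::process::Command::new("buildctl").args(...)`. Passing an argument vector instead of a shell string means the deployment id never goes through shell quoting.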

Two registry hostnames on purpose:

  • registry:5000 — Docker-network hostname BuildKit uses to push
  • localhost:5000 — host-accessible hostname the API uses to docker run

That split caused a bug later. Worth it to avoid streaming multi-gigabyte images back through the worker.
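
One way to keep the two hostnames from drifting apart is to derive both references from the same deployment id. This is a minimal sketch, assuming a numeric id; the `ImageRef` type and its method names are hypothetical and not taken from the project.

```rust
/// Identifies one built image. The two methods differ only in registry host.
/// Hypothetical type: the real worker and API may store a plain string.
struct ImageRef {
    deployment_id: u64,
}

impl ImageRef {
    /// Reference BuildKit pushes to; resolvable only on the Docker network.
    fn push_ref(&self) -> String {
        format!("registry:5000/deployment-{}:latest", self.deployment_id)
    }

    /// Reference the host Docker daemon uses for `docker run`.
    fn run_ref(&self) -> String {
        format!("localhost:5000/deployment-{}:latest", self.deployment_id)
    }
}
```

Both strings name the same repository and tag, so a mismatch between what was pushed and what is run can only come from the hostname.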


May 17–18 — Security and Ops (Before the Fun Bugs)#

While refactoring the build, I landed infrastructure work from the audit:

Non-root runtime (12050b8, f4e8583): API and worker drop to appuser via gosu in a shared entrypoint. Docker socket GID mapping for Linux. BuildKit socket gets chmod 666 when group mapping fails.

Network isolation (docker-compose.yml): build-net for builds (internet + registry only), serve-net for preview containers (Traefik-facing only). Internal services no longer reachable from user code.

NATS hardening (3458d48): TLS config, auth, DLQ stream for failed jobs, admin API to list and replay:

code
GET  /v1/admin/failed-jobs
POST /v1/admin/failed-jobs/{sequence}/replay

Log persistence (6dc34e1, 5b3b399): Every log line inserts into build_log_lines as it arrives. Terminal builds aggregate into deployments.build_log. Logs survive API restarts.

Observability: Prometheus metrics, Grafana, Loki, Alloy — the stack is there even if I mostly stared at docker compose logs during debugging.

Graceful shutdown, build concurrency semaphore, resource limits on every compose service — all from the audit checklist.
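
The build concurrency semaphore can be pictured with a std-only counting semaphore. This is a sketch of the idea, not the project's code: an async worker would more likely use its runtime's semaphore, and the names `BuildSemaphore` and `BuildPermit` are made up here.

```rust
use std::sync::{Arc, Condvar, Mutex};

/// Counting semaphore that caps how many builds run at once.
struct BuildSemaphore {
    permits: Mutex<usize>,
    available: Condvar,
}

/// Held for the duration of one build; returns the permit when dropped.
struct BuildPermit {
    sem: Arc<BuildSemaphore>,
}

impl BuildSemaphore {
    fn new(max_concurrent: usize) -> Arc<Self> {
        Arc::new(Self {
            permits: Mutex::new(max_concurrent),
            available: Condvar::new(),
        })
    }

    /// Blocks the calling thread until a permit is free.
    fn acquire(self: &Arc<Self>) -> BuildPermit {
        let mut permits = self.permits.lock().unwrap();
        while *permits == 0 {
            permits = self.available.wait(permits).unwrap();
        }
        *permits -= 1;
        BuildPermit { sem: Arc::clone(self) }
    }

    /// Returns a permit only if one is free right now.
    fn try_acquire(self: &Arc<Self>) -> Option<BuildPermit> {
        let mut permits = self.permits.lock().unwrap();
        if *permits == 0 {
            return None;
        }
        *permits -= 1;
        Some(BuildPermit { sem: Arc::clone(self) })
    }
}

impl Drop for BuildPermit {
    fn drop(&mut self) {
        // Give the permit back and wake one waiting builder.
        *self.sem.permits.lock().unwrap() += 1;
        self.sem.available.notify_one();
    }
}
```

Tying the permit to a value that is dropped at the end of the build means a failed or panicking build still frees its slot.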


May 24 — The Real Test: My Actual Portfolio#

Part 1 bugs showed up with toy projects. Phase 2 bugs showed up when I deployed my real site: avikmukherjee-portfolio-v2 — Next.js 16, Bun, 619 packages, ~140 second builds.

This is where Phase 2 gets honest.


Now the Bugs Start (Again)#

Everything above took a few days of refactoring. The next few hours were debugging on a real project. Continuing the numbering from Part 1 — but the shape of the bug list is different this time. Phase 1's failures were mostly file-path plumbing. Phase 2's were coordination and permissions across three separate containers.

Bug 10: The Non-Root Permission Gauntlet#

I checked "run as non-root" off the audit list. Then I spent an evening discovering that three different processes each need access to a different Unix socket, and OrbStack makes all of them root:root mode 660.

Round 1 — worker → BuildKit. First deploy after dropping root:

code
dial unix /var/run/buildkit/buildkitd.sock: connect: permission denied

Round 2 — BuildKit daemon → home directory. Got past the socket. BuildKit tried to set up the build session for uid 1001:

code
#2 ERROR: mkdir /home/appuser: permission denied

BuildKit runs in its own container. buildctl runs as appuser with HOME=/home/appuser. The daemon tries to create that path inside its filesystem. /home is root-owned. Fail.

Round 3 — API → Docker. Build succeeded. State: Ready. Preview URL: 404. No serve container existed:

code
failed to start deployment container: permission denied
  while trying to connect to the docker API at unix:///var/run/docker.sock

I had fixed the worker's sockets. The API still could not docker run the finished image. The deployment was Ready in Postgres. Nothing was serving it. State machine and runtime were disconnected.

The fixes are all variations on the same theme — map the socket GID, and when that fails (gid 0 on Docker Desktop / OrbStack cannot be assigned to a secondary group), fall back to chmod 666:

sh
# Worker: BuildKit socket
if [ -S /var/run/buildkit/buildkitd.sock ]; then
  chmod 666 /var/run/buildkit/buildkitd.sock 2>/dev/null || true
fi
 
# BuildKit container: pre-create the worker's home dir
mkdir -p /home/appuser && chown 1001:1001 /home/appuser
 
# API: Docker socket — test first, chmod if appuser still can't connect
if ! gosu appuser docker info >/dev/null 2>&1; then
  chmod 666 /var/run/docker.sock 2>/dev/null || true
fi

Not elegant. Not what you'd ship to production. But "non-root" in a Docker-in-Docker local setup is not one checkbox — it is a separate negotiation with every daemon your processes talk to.
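
A startup preflight would have surfaced all three rounds before the first deploy. The sketch below, which is an illustration and not the project's code, tries to connect to a Unix socket as the current user and classifies the failure. `SocketAccess` and `check_socket` are hypothetical names.

```rust
use std::io::ErrorKind;
use std::os::unix::net::UnixStream;
use std::path::Path;

/// Result of trying to open a Unix socket as the current user.
#[derive(Debug, PartialEq)]
enum SocketAccess {
    Reachable,
    Missing,
    PermissionDenied,
    Other(ErrorKind),
}

/// Connects to `path` and maps the outcome to a `SocketAccess` value.
/// A successful connect only proves the socket accepts this user; it does
/// not prove the daemon behind it is healthy.
fn check_socket(path: &Path) -> SocketAccess {
    match UnixStream::connect(path) {
        Ok(_) => SocketAccess::Reachable,
        Err(err) => match err.kind() {
            ErrorKind::NotFound => SocketAccess::Missing,
            ErrorKind::PermissionDenied => SocketAccess::PermissionDenied,
            kind => SocketAccess::Other(kind),
        },
    }
}
```

Run against /var/run/buildkit/buildkitd.sock in the worker and /var/run/docker.sock in the API, a check like this turns "permission denied in the middle of a deploy" into a refusal to start.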

Bug 11: Git Clone Exit Code 128 (The Duplicate Job)#

Build logs showed something confusing:

code
build started
cloning repository
error: git clone failed with exit code 128
[stderr] #10 RUN bun install
[stderr] #10 0.385 bun install v1.3.0

Clone failed — but bun install was running in stderr. Two jobs for the same deployment were in flight. The first cloned successfully. The second hit:

code
fatal: destination path '.' already exists and is not an empty directory.

Root cause: duplicate NATS messages. I chose JetStream specifically for at-least-once delivery — if the worker crashes, the job is not lost. What I did not plan for: "at least once" includes twice while the first run is still going. Double publish, redelivery during an in-flight build, worker restart picking up an unacked message while a spawned task is still running — all produce the same symptom.

The fix is idempotency at the worker, not tighter NATS config:

rust
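// `insert` returns false when this deployment id is already in the set,
// which means a build for it is still running.
if !active.insert(deployment_id) {
    tracing::warn!(%deployment_id, "duplicate build job while in flight, skipping");
    // Ack the duplicate so JetStream stops redelivering it.
    received.ack().await;
    continue;
}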

Duplicate messages get acked and dropped. The work directory is wiped before each clone so a legit retry does not inherit stale files. This is the bug that actually taught me something about the queue I chose — not in the NATS docs, not in the tutorials, just in the logs looking like two builds at once.
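
The snippet above shows the id entering the set but not leaving it. One possible way to guarantee the release is a guard that removes the id when it is dropped. This is a std-only sketch under assumptions: `InFlight`, `InFlightGuard`, and the `u64` id type are hypothetical, and the real worker may track ids differently.

```rust
use std::collections::HashSet;
use std::sync::{Arc, Mutex};

/// Set of deployment ids that currently have a build running.
#[derive(Clone, Default)]
struct InFlight {
    ids: Arc<Mutex<HashSet<u64>>>,
}

/// Removes its id from the set when dropped, including on early return.
struct InFlightGuard {
    ids: Arc<Mutex<HashSet<u64>>>,
    id: u64,
}

impl InFlight {
    /// Returns a guard if `id` was not already building, `None` for a duplicate.
    fn try_start(&self, id: u64) -> Option<InFlightGuard> {
        let inserted = self.ids.lock().unwrap().insert(id);
        inserted.then(|| InFlightGuard {
            ids: Arc::clone(&self.ids),
            id,
        })
    }
}

impl Drop for InFlightGuard {
    fn drop(&mut self) {
        // If the lock is poisoned, skip cleanup rather than panic inside drop.
        if let Ok(mut ids) = self.ids.lock() {
            ids.remove(&self.id);
        }
    }
}
```

The build task holds the guard for its whole lifetime, so a later legitimate retry for the same deployment is accepted once the first run has finished, whether it succeeded or failed.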

Bug 12: Build Logs Exist in the DB but the UI Shows Nothing#

After a failed deploy, Postgres had 20 log lines. The dashboard said "No build logs available."

Two separate bugs:

Frontend: BuildLogViewer only opened SSE for active states. On error or ready, it never connected — so it never replayed from build_log_lines.

Streaming: Logs were persisted incrementally, but EventSource reconnect during an active build cleared the visible output.

Fix: always connect SSE (replay historical lines from DB first, then tail live). Show build_log immediately for terminal states. Direct DOM append with 32ms batching so 2000 lines of BuildKit output does not freeze the tab.
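
Replaying from the database and then tailing live means the same line can arrive twice at the seam. A small cursor is one way to drop the overlap. The sketch assumes each persisted line carries a monotonically increasing sequence number per deployment, which the post does not state; `LogLine` and `LogCursor` are hypothetical names.

```rust
/// One build log line. `seq` is an assumed per-deployment sequence number.
#[derive(Debug, Clone, PartialEq)]
struct LogLine {
    seq: u64,
    text: String,
}

/// Tracks the highest sequence number already sent to one client.
#[derive(Default)]
struct LogCursor {
    last_seq: Option<u64>,
}

impl LogCursor {
    /// Returns only the lines the client has not seen and advances the cursor.
    /// Expects `lines` in ascending `seq` order.
    fn unseen(&mut self, lines: &[LogLine]) -> Vec<LogLine> {
        let mut fresh = Vec::new();
        for line in lines {
            if self.last_seq.map_or(true, |last| line.seq > last) {
                self.last_seq = Some(line.seq);
                fresh.push(line.clone());
            }
        }
        fresh
    }
}
```

On connect, the handler would call `unseen` with the historical lines first and then with each live batch. If each SSE event also carries its sequence number as the event id, a reconnecting EventSource sends it back in the Last-Event-ID header, and the cursor can start from there instead of replaying everything.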

Two smaller fixes are worth mentioning but not worth their own sections. First, docker run --pull always: BuildKit pushes to registry:5000 inside the Docker network, so the host daemon does not have the image until something pulls it. Ready in the DB ≠ image on the host. Second, http:// instead of https:// for *.localhost previews when SERVE_TLS=false. Each cost me twenty minutes. Neither taught me anything new.
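
For illustration, here is how the serve command could be assembled with the pull flag and Traefik labels in one place. The label keys are standard Traefik Docker-provider labels; the helper name, the router naming scheme, the container port, and the network name passed in are all assumptions, not the project's actual values.

```rust
/// Builds the `docker run` arguments for one preview container.
/// Hypothetical helper: `host`, `port`, and `network` are illustrative inputs.
fn docker_run_args(deployment_id: u64, host: &str, port: u16, network: &str) -> Vec<String> {
    let name = format!("deployment-{deployment_id}");
    let router_rule = format!("traefik.http.routers.{name}.rule=Host(`{host}`)");
    let service_port =
        format!("traefik.http.services.{name}.loadbalancer.server.port={port}");
    // The host daemon pulls through the host-visible registry name.
    let image = format!("localhost:5000/{name}:latest");
    [
        "run", "--detach",
        // Without this, the host daemon may not have the freshly pushed image.
        "--pull", "always",
        "--name", name.as_str(),
        "--network", network,
        "--label", "traefik.enable=true",
        "--label", router_rule.as_str(),
        "--label", service_port.as_str(),
        image.as_str(),
    ]
    .iter()
    .map(|arg| arg.to_string())
    .collect()
}
```

A call such as `docker_run_args(42, "6b84844b-preview.localhost", 3000, "serve-net")` yields a vector that can be passed to `std::process::Command::new("docker").args(...)`.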


What the Architecture Looks Like Now#

code
GitHub push / manual deploy
        │
        ▼
   Axum API ──publish──► NATS JetStream (build.jobs)
        │                        │
        │                        ▼
        │                 Build Worker (appuser)
        │                   git clone
        │                   nixpacks plan
        │                   buildctl → BuildKit → registry:5000
        │                   publish logs → NATS (persisted to Postgres)
        │                   publish result → NATS
        │                        │
        ◄── state + image_ref ───┘
        │
        ▼
   docker run --pull always
   localhost:5000/deployment-{id}:latest
   on serve-net with Traefik labels
        │
        ▼
   http://{hash}-preview.localhost  →  your app

No MinIO download on serve. No output type detection. No _next/ path rewriting. Nixpacks handles the Dockerfile; BuildKit handles the build; the registry holds the artifact; Traefik routes by Host header.

I deployed my portfolio three times in one debugging session. The third one worked end-to-end without manual intervention.


What It Looks Like Working#

Push to GitHub or click deploy in the dashboard. Build starts. Logs stream line by line — Nixpacks plan, then two thousand lines of BuildKit pulling layers and running bun install and next build. Build finishes in about two minutes. State goes to ready. A URL like http://6b84844b-preview.localhost/ is live. HTML, CSS, JavaScript, fonts — everything loads.

That is the thing I wanted at the start of Part 1. Phase 2 just took a different route to get there.


What Phase 2 Taught Me That Phase 1 Didn't#

Part 1 was about plumbing — which directory does Next.js write to, which path does the Docker daemon resolve, which S3 prefix maps to which URL. One process, one filesystem, one mistake at a time.

Phase 2 was about boundaries — worker, BuildKit daemon, API, registry, Traefik each own a piece of the deploy, and "success" in one layer does not imply success in the next.

Three things I did not have words for after Part 1:

The registry is not the runtime. Pushing an image and running an image are different operations on different network paths. BuildKit pushes to registry:5000 from inside Docker's network. The API pulls from localhost:5000 on the host. You can have a successful build, a ready row in Postgres, and no container — because nothing bridged those two hostnames. Phase 1's MinIO pipeline had the same shape (upload ≠ serve) but the failure mode was slower and messier. The registry version is sharper: one missing flag.
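
One way to encode "pushed is not running" is to make Ready unreachable from a build result alone. The state and event names below are hypothetical; the post only names a Ready state and an error state, so treat this as a sketch of the idea, not the project's state machine.

```rust
/// Hypothetical deployment states.
#[derive(Debug, Clone, PartialEq)]
enum DeployState {
    Building,
    /// The image is in the registry but nothing is serving it yet.
    ImagePushed { image_ref: String },
    Ready { container_id: String },
    Error(String),
}

/// Events reported by the build worker and by the serve step.
enum DeployEvent {
    BuildSucceeded { image_ref: String },
    BuildFailed(String),
    ContainerStarted { container_id: String },
    ContainerFailed(String),
}

/// Pure transition function: Ready is reachable only from ImagePushed,
/// and only once a container has actually started.
fn next_state(state: DeployState, event: DeployEvent) -> DeployState {
    match (state, event) {
        (DeployState::Building, DeployEvent::BuildSucceeded { image_ref }) => {
            DeployState::ImagePushed { image_ref }
        }
        (DeployState::Building, DeployEvent::BuildFailed(reason)) => {
            DeployState::Error(reason)
        }
        (DeployState::ImagePushed { .. }, DeployEvent::ContainerStarted { container_id }) => {
            DeployState::Ready { container_id }
        }
        (DeployState::ImagePushed { .. }, DeployEvent::ContainerFailed(reason)) => {
            DeployState::Error(reason)
        }
        // Any other combination is out of order; keep the current state.
        (unchanged, _) => unchanged,
    }
}
```

With a shape like this, a failed docker run lands the deployment in an error state with a reason instead of leaving a Ready row that serves a 404.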

Audit items are not features until you deploy a real app. Non-root, network isolation, TLS, DLQ — all landed before I ran my portfolio through the pipeline. Every permission bug only appeared under a real 619-package Next.js build with Bun and BuildKit. Toy projects do not mount three Unix sockets with incompatible ownership models.

Reliability guarantees have opposites. I wanted at-least-once delivery so jobs survive worker restarts. The opposite of a lost job is a duplicate job. JetStream delivered exactly what I asked for; I just had not designed the worker to be idempotent. That is a different class of bug than Part 1's — not "wrong path" but "wrong assumption about how many times this code runs."

If Part 1 answered "what happens when I click deploy," Phase 2 answered "what happens in the gaps between the boxes on the architecture diagram."


The Numbers#

| Metric | Phase 1 | Phase 2 |
| --- | --- | --- |
| Commits (approx.) | ~57 | ~90+ |
| Build time (real Next.js app) | ~2 min | ~2 min |
| Serve cold start | Download 2653 files + start Node | docker pull + container start (~30s) |
| Log lines per build | Hundreds | 2000+ (BuildKit is verbose) |
| Manual interventions to get a preview URL | Several | 2 (then 0 after fixes) |

The Bottom Line#

Phase 1 proved the idea. Phase 2 made it resemble something you could explain without apologizing.

The build pipeline is image-based. The worker is non-root (with pragmatic socket exceptions for local dev). Logs persist and replay. Failed jobs land in a DLQ you can inspect and replay. Preview URLs actually serve traffic when the state says Ready.

The deploy button is not magic. It is a webhook, a queue, a build container, a registry, and a reverse proxy. Surprisingly few moving parts once you trace all of them — but each moving part has its own permission model, and they do not agree with each other by default.

I still understand what happens when I click deploy. Now I also understand what happens when deploy clicks back.


Continued from Part 1: I Built a Vercel Clone in Rust in One Day. All code is on GitHub.
