Running Claude Code as a CI Agent — 30 Days of Automated Commits on a Real Repo

Beginner · 11m read · Full-stack developers

Byline: vybecoding.ai Editorial Pipeline

Between April 1 and April 30, 2026, Claude Code made 422 direct commits to main on the vybecoding.ai repo.

Primary Focus

ai development

AI Tools Covered

AI-first, Next.js, Convex

Guide Curriculum

Module 1: The Architecture
  • 1.1: Why tmux slots, not GitHub Actions
  • 1.2: Session management

Module 2: Task Scope
  • 2.1: What actually worked
  • 2.2: What broke

Module 3: The False Positive
  • 3.1: Commit 91620ce33b (the Vercel 404 incident)

Module 4: Configuration and Gotchas
  • 4.1: The `--print` exit code trap
  • 4.2: `--dangerously-skip-permissions` scope control
  • 4.3: Weekly cadence and what it reveals

About the Author

@hiram-clark

Hiram Clark is the founder and managing editor of vybecoding.ai and sets editorial direction for the guides and news published here. Articles are drafted with AI assistance and edited before publication. He works hands-on with the AI development tools, workflows, and infrastructure covered on the site.

Full Guide Content

Module 1: The Architecture

1.1: Why tmux slots, not GitHub Actions

The intuitive CI setup is a GitHub Actions workflow that checks out the repo, installs Claude Code, and runs claude --print on a schedule. Our ci-cd.yml and quality.yml workflows handle linting, typecheck, unit tests, and security audits — but they do not invoke Claude Code directly.

The reason is latency and state. GitHub Actions runners are ephemeral. Each job provisions a fresh environment, and Claude Code's value compounds when it can read the actual repo state that reflects hours of prior commits. Spinning up a runner per task loses that accumulated context and adds 60–90 seconds of cold-start per run for a tool that often finishes in under 2 minutes.

Instead, we run 8 parallel tmux slots (numbered within claude-01 through claude-09) as cron-triggered sessions on a persistent host. Each slot is assigned a task queue. The commit pattern is direct to main: `claude-{slot}-[{date}]: {action} {file} [{complexity}:{count} files]`. The existing CI pipeline picks up each commit post-hoc and validates it with linting, type checking, and tests. Claude proposes via commit; CI acts as an after-the-fact check rather than a pre-merge gate.

This is a pull model in the sense that CI pulls in each commit after it lands; the agent does not open PRs. It commits, and the validation pipeline either passes the commit or surfaces failures. For a team repo where commits from multiple authors land on the same branch, this would require more ceremony. For a single-operator repo like ours, it is the minimum viable architecture.
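
The commit pattern described above can be assembled mechanically. Here is a minimal sketch, assuming each placeholder maps to one positional argument. The function name `format_commit_message`, the ISO date format, and the sample values are illustrative assumptions, not taken from the repo's tooling.

```shell
# Hypothetical helper: build a commit subject in the pattern
#   claude-{slot}-[{date}]: {action} {file} [{complexity}:{count} files]
# Arguments: slot date action file complexity count
format_commit_message() {
  local slot="$1" date="$2" action="$3" file="$4" complexity="$5" count="$6"
  printf 'claude-%s-[%s]: %s %s [%s:%s files]\n' \
    "$slot" "$date" "$action" "$file" "$complexity" "$count"
}

# Example call with made-up values:
#   format_commit_message 03 2026-04-14 fix convex/schema.ts SIMPLE 1
```

Keeping the subject machine-generated is what makes the complexity suffix usable for later counting.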

1.2: Session management

Each cron trigger kills the existing tmux pane and starts a fresh Claude Code session before executing the task. This is not optional.

We observed a clear quality degradation in sessions running longer than 8 hours. Commits from the same slot at hour 1 are specific and targeted. By hour 8, the agent becomes conservative — it hedges on whether a change is safe, adds comments it wouldn't have added earlier, and sometimes skips edits it should make. This is context window accumulation: the session has absorbed many rounds of tool output and prior diff context, and the signal-to-noise ratio drops.

The fix is a hard restart on each scheduled trigger:

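#!/bin/bash
# Restart one agent slot with a fresh Claude Code session.
set -euo pipefail

SLOT="${1:?usage: $0 <slot>}"
SESSION="vybeclaw"
WINDOW="claude-${SLOT}"

# Interrupt whatever is running, then kill the window; do not resume.
# Errors are ignored so the first run (no window yet) still works.
tmux send-keys -t "${SESSION}:${WINDOW}" C-c 2>/dev/null || true
sleep 2
tmux kill-window -t "${SESSION}:${WINDOW}" 2>/dev/null || true

# Respawn the window and start a fresh session for the queued task
tmux new-window -t "${SESSION}:" -n "${WINDOW}"
tmux send-keys -t "${SESSION}:${WINDOW}" \
  "claude --print --dangerously-skip-permissions < /tmp/task-${SLOT}.txt > /tmp/log-${SLOT}.txt 2>&1" \
  Enter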

Fresh session per task. No exceptions.


Module 2: Task Scope

2.1: What actually worked

Over April 2026, 300 of 422 commits were classified SIMPLE, 61 MEDIUM, and 35 COMPLEX; the remaining 26 are not covered by these three counts. The complexity classification comes from the commit message suffix, not a post-hoc assessment: the agent tags its own work based on file count and scope.
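
Because the classification lives in the commit subject, the counts can be recomputed from `git log`. Here is a minimal sketch, assuming every tagged subject ends in the `[{complexity}:{count} files]` suffix. The helper name `tally_complexity` is hypothetical, and subjects without the suffix are skipped.

```shell
# Hypothetical helper: read commit subjects on stdin and count the complexity
# tags found in a trailing "[{complexity}:{count} files]" suffix.
# Prints one "TAG COUNT" line per tag, sorted by tag name.
tally_complexity() {
  sed -n 's/.*\[\([A-Z][A-Z]*\):[0-9][0-9]* files]$/\1/p' \
    | sort | uniq -c | awk '{print $2, $1}'
}

# Usage sketch against a repo (the date range is an example):
#   git log --since=2026-04-01 --until=2026-05-01 --format=%s | tally_complexity
```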

The tasks that produced reliable, zero-regression output:

  • Single-file component edits with TypeScript error output as the spec. Paste the tsc error into the task. The agent has a complete, unambiguous success criterion.
  • Adding an optional field to a Convex schema. The schema change, the corresponding TypeScript type, and any immediate callers are all in the same file graph — the agent can enumerate them without guessing.
  • Updating import paths after a file move. Mechanical, verifiable by tsc. Works every time.
  • Fixing a failing lint rule. The ESLint output is the entire spec. No judgment required.
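
What these tasks share is that the tool output doubles as the spec. Here is a minimal sketch of turning that output into a task file, assuming one file per task. The helper name `build_task` and the prompt wording are illustrative assumptions, not the prompts used on the repo.

```shell
# Hypothetical helper: turn compiler or linter output for one file into a task prompt.
# Arguments: file path, captured error text.
build_task() {
  local file="$1" errors="$2"
  printf 'Fix the errors below in %s. Edit only that file.\n' "$file"
  printf 'Done means: the same check reports no errors for %s.\n\n' "$file"
  printf '%s\n' "$errors"
}

# Usage sketch (the task path follows the /tmp/task-{slot}.txt convention):
#   errors="$(npm run typecheck 2>&1 | grep '^app/admin/layout.tsx' || true)"
#   [ -n "$errors" ] && build_task app/admin/layout.tsx "$errors" > /tmp/task-03.txt
```

The prompt states the success criterion explicitly so the agent can check its own work with the same command.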

The top-edited files over the period:

| File | Edits |
|---|---|
| app/admin/social/studio/page.tsx | 51 |
| components/admin/studio/PromptEditorPanel.tsx | 25 |
| app/admin/layout.tsx | 22 |
| convex/agentControlPlane.ts | 16 |
| convex/schema.ts | 9 |

The concentration on admin pages and Studio components reflects active development with frequent requirements changes — the agent was continuously reconciling component state with new backend contracts. convex/schema.ts at 9 edits shows the agent's bread and butter: incremental optional-field additions where the schema validator is the entire spec. app/admin/social/studio/page.tsx at 51 edits across the month — a complex admin UI — was handled without regressions because each task was scoped to a single interaction pattern.

2.2: What broke

Three categories produced failures:

  • Cross-file refactors without enumerated callers. If the task says "rename this function," the agent needs to know where it's called. Without an explicit caller list, it finds what it can via grep and misses dynamic call sites. The TypeScript check catches some of this, but not all, particularly in Convex functions where the call happens through the generated api object.
  • UI changes where correctness requires visual inspection. The agent can modify JSX and confirm that TypeScript passes. It cannot confirm that the layout renders correctly at mobile breakpoints. We do not use Claude Code for visual redesigns.
  • Tasks where "done" requires runtime behavior. This is the one that cost us.
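
For the first category, a caller list can be drafted before the task is queued. Here is a minimal sketch, assuming a whole-word text search over `.ts` and `.tsx` files is an acceptable starting point. The helper name `list_callers` is hypothetical, and its output still needs a human pass to add dynamic call sites that a text search cannot see.

```shell
# Hypothetical helper: list file:line locations that mention a symbol as a
# whole word, so the task can carry an explicit caller list.
# Arguments: symbol name, optional root directory (defaults to the current one).
list_callers() {
  local symbol="$1" root="${2:-.}"
  find "$root" -type f \( -name '*.ts' -o -name '*.tsx' \) \
    -exec grep -nwH -e "$symbol" {} + | cut -d: -f1,2 | sort
}

# Usage sketch: append the list to the task text before queuing it.
#   list_callers getGuide . >> /tmp/task-03.txt
```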

Module 3: The False Positive

3.1: Commit 91620ce33b (the Vercel 404 incident)

In April 2026, the agent added `outputFileTracingExcludes` to `next.config.ts`, excluding `content/` and `v2/` from the Vercel build bundle. The reasoning was sound by static analysis: these are large directories that add significant bundle size. The agent correctly identified the directories as large; it incorrectly inferred they were safe to exclude.

Vercel's output file tracing determines which files are bundled into the serverless function for each route. Excluding content/** removed the actual guide and content files that server-side routes read at request time. The deployed functions had no content to serve. The result was 404s across all guide routes.

The commit was reverted same day (commit message: fix: revert content/ and v2/ exclusions — runtime reads break site).

The failure class: static analysis is insufficient to infer runtime read patterns. File size tells you nothing about whether a path is read at build time, request time, or not at all. The agent had no mechanism to discover that getStaticProps and server actions read those paths dynamically.

This is the exact boundary condition where human review catches what the agent misses. The CI pipeline passed — TypeScript compiled, lint passed, no unit tests cover Vercel build output. The failure was only visible after deployment.

The lesson is not "don't use Claude Code for config changes." It is: tasks whose correctness can only be verified via production behavior require human review before merge, regardless of CI pass rate.
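
One way to make that review less manual is a smoke test run against a deployed preview before a config change is accepted. Here is a minimal sketch, assuming `curl` is available on the host. The function names, the 200-only success rule, and the routes you pass in are assumptions rather than part of the pipeline described here.

```shell
# Hypothetical post-deploy smoke test. fetch_status is split out so it can be
# replaced with a stub in tests; it prints the HTTP status code for a URL.
fetch_status() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$1"
}

# smoke_test BASE_URL ROUTE...
# Prints one [FAIL] line per route that is not a 200; returns non-zero if any failed.
smoke_test() {
  local base_url="$1" failed=0 route status
  shift
  for route in "$@"; do
    # If the request itself fails, fall back to 000 when nothing was printed.
    status="$(fetch_status "${base_url}${route}")" || status="${status:-000}"
    if [ "$status" != "200" ]; then
      echo "[FAIL] ${route} returned ${status}"
      failed=1
    fi
  done
  return "$failed"
}
```

A check like this would have flagged the guide routes as 404 on the preview deployment, which is the signal the static checks could not provide.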


Module 4: Configuration and Gotchas

4.1: The `--print` exit code trap

In our runs, `claude --print` has exited non-zero on tasks that completed successfully. Whether or not this counts as a bug in the current version, it is the behavior we observed, and trusting `$?` alone as the success signal will break your automation.

In our pipeline, this caused Sprint 36 to be classified as TIMEOUT twice despite all six stories passing with Grade A+ 97/100. The exit code check triggered the fallback chain unnecessarily. Two valid runs were logged as failures.

The fix: after capturing the exit code, grep the log for a task-specific success marker before classifying as failure.

# Capture output and exit code
claude --print --dangerously-skip-permissions < task.txt > /tmp/run.log 2>&1
EXIT_CODE=$?

# Do NOT trust EXIT_CODE alone.
# On a non-zero exit, check for a task-specific success marker first.
# grep -E gives portable alternation; SUCCESS and COMPLETE are broad, so tighten them to your format.
if [ "$EXIT_CODE" -ne 0 ] && grep -Eq 'GRADE:|SUCCESS|COMPLETE|Grade:' /tmp/run.log; then
  EXIT_CODE=0
  echo "[RESCUED] non-zero exit overridden: success marker found in log"
fi

if [ "$EXIT_CODE" -ne 0 ]; then
  echo "[FAILED] no success marker, treating as real failure"
  cat /tmp/run.log
  exit 1
fi

Pick a marker that is specific to your task output format. Generic markers like "done" will rescue genuine failures. A grade or structured completion token works.
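
To illustrate what "specific" can mean here, the sketch below anchors the marker to a whole line. The `Grade: <letter> <score>/100` format, the `MARKER_PATTERN` variable, and the `classify_run` name are assumptions for illustration; substitute whatever structured token your tasks actually emit.

```shell
# Hypothetical classifier: decide the outcome from the exit code plus an anchored marker.
# MARKER_PATTERN is an assumed format, matched against a whole log line.
MARKER_PATTERN='^Grade: [A-F][+-]? [0-9]+/100$'

# classify_run EXIT_CODE LOG_FILE
# Prints OK, RESCUED (non-zero exit but marker present), or FAILED.
classify_run() {
  local exit_code="$1" log_file="$2"
  if [ "$exit_code" -eq 0 ]; then
    echo "OK"
  elif grep -Eq "$MARKER_PATTERN" "$log_file"; then
    echo "RESCUED"
  else
    echo "FAILED"
  fi
}
```

Because the pattern must match an entire line, incidental words such as "upgrade" or "incomplete" elsewhere in the log cannot rescue a failed run.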

4.2: `--dangerously-skip-permissions` scope control

In our setup, non-interactive operation relies on `--dangerously-skip-permissions`. The flag bypasses the confirmation prompts that appear when Claude Code attempts file writes, shell commands, or network access. In an interactive session, these prompts are the safety mechanism. In CI, they hang the process indefinitely.

The replacement safety mechanism is CLAUDE.md project rules. The agent reads the project CLAUDE.md before every task and respects its constraints. Our project rules explicitly prohibit:

  • rm -rf operations
  • Direct database mutations via CLI
  • Changes to .env.* files
  • Commits to non-main branches without explicit instruction

The relevant excerpt from our CLAUDE.md:

# CLAUDE.md — Agent Constraints

## Prohibited Operations
- Never run `rm -rf` or destructive shell commands
- Never modify `.env.local`, `.env.production.local`, or any `.env.*` file
- Never run `DROP TABLE` or destructive database operations
- Never push to remote without explicit instruction in the task

## Required After Any Code Change
- Run `npm run typecheck` and confirm exit 0
- Run `npm run lint` and confirm exit 0

These rules are enforced at the model level, not at the shell level. They are not a substitute for a sandbox. For a repo where a mistake is catastrophic (production database, billing code), add shell-level restrictions on top.
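
As one example of a shell-level restriction, the changed paths of a commit can be checked outside the model. Here is a minimal sketch, assuming the protected set mirrors the `.env.*` rule above, with plain `.env` added as an extra assumption. The helper name `check_protected_paths` is hypothetical.

```shell
# Hypothetical guard: read changed paths on stdin and fail if any is a protected file.
# Prints one [BLOCKED] line per violation; returns non-zero if any were found.
check_protected_paths() {
  local path base violations=0
  while IFS= read -r path; do
    base="${path##*/}"   # file name without its directory
    case "$base" in
      .env|.env.*)
        echo "[BLOCKED] ${path}"
        violations=1
        ;;
    esac
  done
  return "$violations"
}

# Usage sketch: inspect the latest commit before it is pushed or deployed.
#   git diff --name-only HEAD~1 HEAD | check_protected_paths || exit 1
```

A check like this does not depend on the model honoring CLAUDE.md, which is the property the model-level rules lack.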

4.3: Weekly cadence and what it reveals

The April 2026 breakdown by week: 63 commits (week 1), 183 (week 2), 110 (week 3), 47 (week 4), which accounts for 403 of the month's 422. Week 2's spike corresponds to a period of active feature development where we were feeding the agent a high volume of TypeScript errors from an in-progress admin UI refactor. Week 4's drop reflects the codebase stabilizing, with fewer clear-criterion tasks available.

This is the natural shape of agent-assisted development. The agent is most productive when there is a large backlog of well-specified, verifiable tasks. When the backlog thins and tasks require more judgment, throughput drops and you should not force it. Fighting for commits in week 4 by loosening task scope is how you get more false positives.

The one outage we had came from a task that fell outside the "verifiable by static analysis" boundary. Forty-three days without a regression before it. The architecture works for the task class it was designed for. Scope discipline is the entire product.