2026-04-22

How to Track What Your AI Agent Is Doing (Without Watching It All Day)

My AI co-founder runs for about 14 hours a day while I'm at my day job.

For the first two months, I had no idea what it was actually doing during those 14 hours. I'd come home, check Telegram, see a summary report, and mostly trust it. Sometimes I'd spot something weird in my Twitter replies or notice a Reddit comment that didn't sound right. But I had no systematic way to know whether the agent was working, failing quietly, or making decisions I wouldn't have approved.

That's the blind spot most people don't talk about when they build AI agents. You get obsessed with what the agent does. You never build the system to see what it actually did.

Here's how I fixed that.

---

Why AI Agent Monitoring Is Different from Regular App Monitoring

Traditional monitoring is binary. Your server is up or it's down. Your API returned 200 or it errored. Something broke or it didn't.

AI agent monitoring doesn't work that way. An agent can complete every task successfully — no errors, no exceptions, no failed API calls — and still have done the wrong thing in every single case. Wrong tone. Wrong facts. Wrong decision about whether to post or not post. Wrong interpretation of a nuanced instruction.

The failure modes are qualitative, not technical. That's what makes tracking an AI agent genuinely hard.

You need to monitor three separate layers, and most people only monitor one.

---

Layer 1 — Activity Logs: What Did the Agent Actually Do?

The first layer is raw activity. Before you can evaluate quality, you need a complete record of what happened.

Every action your agent takes should write a log entry. Not a summary. A full record: timestamp, task type, input context, output, and outcome. If your agent posts a tweet, the log should contain the tweet text, the timestamp, the trigger that caused it, and whether it posted successfully.

My setup writes these logs to a file on disk in the Vault, and copies a summary to Telegram at the end of each day. The disk log is the source of truth. Telegram is just the readable summary.
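A minimal version of that disk logger, as a sketch: one JSON object per line, appended to a log file. The path and field names here are illustrative, not the exact ones I use.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

LOG_PATH = Path("vault/activity_log.jsonl")  # illustrative location

def log_action(task: str, trigger: str, output: str, outcome: str) -> dict:
    """Append one full activity record to the disk log and return it."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "task": task,
        "trigger": trigger,
        "output": output,    # the actual text, not a summary
        "outcome": outcome,  # e.g. "posted", "skipped", "error"
    }
    LOG_PATH.parent.mkdir(parents=True, exist_ok=True)
    with LOG_PATH.open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")
    return entry
```

One object per line keeps the file append-only and trivially greppable, which is all the source of truth needs to be.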

Two rules for activity logs:

Log outputs, not just actions. It's not enough to know the agent ran the "post to Twitter" task. You need to know what it actually posted. Agents can succeed at the action level while failing at the output level.

Log the reasoning when possible. If your agent has to make a decision — post this or skip it, respond now or queue it — log why it made that call. This is what lets you retrain your prompts when something goes wrong. Without the reasoning, all you have is the outcome, and you can't fix what you can't trace.
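Concretely, a decision entry carries the call and the why together. The field names here are my own convention, not a standard:

```python
# One decision record: the outcome plus the reasoning that produced it.
decision_entry = {
    "timestamp": "2026-04-21T18:42:07+00:00",
    "task": "reply_to_mention",
    "decision": "skip",
    "reasoning": "Thread is adversarial; identity file says don't engage bait.",
    "confidence": 0.55,
}

def explain(entry: dict) -> str:
    """Render a one-line trace you can scan when something goes wrong."""
    return f"{entry['task']} -> {entry['decision']} ({entry['reasoning']})"
```

When an output looks wrong a week later, that one extra field is the difference between fixing the prompt and guessing.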

---

Layer 2 — Quality Signals: Was the Output Actually Good?

Activity logs tell you what happened. Quality signals tell you whether it was any good.

This is where most solo founders give up, because quality is subjective. But you can proxy it with a handful of concrete signals that correlate with output quality:

Engagement signals. If your agent posts social content, track engagement per post. Not to optimize for likes, but to catch outliers. A tweet that gets zero engagement from an account with active followers means something went wrong with the content. A Reddit comment that gets downvoted in a thread where you normally get upvotes is a signal.

Review flags. Build your agent to flag its own uncertain outputs for human review. Every agent I run has a confidence threshold. Below it, the output gets routed to a Telegram review queue instead of posting directly. These flagged items are the most useful quality signal I have — they show me exactly where the agent's judgment breaks down.
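The routing rule itself is one comparison. A sketch, assuming per-task thresholds you tune yourself (the numbers here are placeholders):

```python
# Per-task confidence thresholds -- tune these as you learn failure modes.
THRESHOLDS = {"tweet": 0.8, "reddit_comment": 0.7}
DEFAULT_THRESHOLD = 0.75

def route(task_type: str, confidence: float) -> str:
    """Return where an output goes: straight to posting, or human review."""
    threshold = THRESHOLDS.get(task_type, DEFAULT_THRESHOLD)
    return "post" if confidence >= threshold else "review_queue"
```

The thresholds start as guesses; the review queue itself tells you whether they're set too low or too high.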

Output audits. Once a week, I read through 10–15 randomly selected outputs from the activity log. Not all of them. Just a sample. This catches the slow drift that doesn't show up in engagement data: the subtle tone shift, the phrasing that's gotten repetitive, the response pattern that's drifted from my actual voice.

The weekly audit takes about 20 minutes. It's caught more problems than all my automated checks combined.
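Pulling the weekly sample is a few lines, assuming the JSON-lines log format described earlier:

```python
import json
import random
from pathlib import Path

def weekly_sample(log_path: str, k: int = 15) -> list[dict]:
    """Return up to k randomly chosen log entries to read by hand."""
    lines = Path(log_path).read_text(encoding="utf-8").splitlines()
    entries = [json.loads(line) for line in lines if line.strip()]
    return random.sample(entries, min(k, len(entries)))
```

The random selection matters: if you only read the flagged or recent outputs, you never see the confident-but-drifting ones.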

---

Layer 3 — Decision Audits: Are the Right Things Getting Done?

The third layer is the hardest to build and the most valuable to have.

Activity logs show you what happened. Quality signals show you whether individual outputs were good. Decision audits show you whether the agent is making the right strategic calls at the task level.

The questions you're trying to answer here: Did the agent work on the right things? Did it skip tasks it shouldn't have? Did it escalate the decisions that deserved a human?

My setup handles this through a weekly review file. Every Sunday, Evo writes a structured review to the Vault that covers: tasks completed, tasks skipped, tasks routed for review, and any anomalies it detected in its own behavior. I read this on Sunday evening in about 10 minutes.

The key is that the agent writes this itself. Not as a polished summary, but as a structured output in a consistent format that I can scan quickly. The format matters. If the format changes, I can't spot the diff.
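One way to keep the format fixed is to generate the review from a template rather than asking for free-form prose. A sketch, with section names matching the review described above:

```python
from collections import Counter

# Fixed section order -- the consistency is what makes the review scannable.
REVIEW_SECTIONS = ("completed", "skipped", "routed_for_review", "anomalies")

def weekly_review(entries: list[dict]) -> str:
    """Render the Sunday review in a fixed, diffable format."""
    counts = Counter(e["outcome"] for e in entries)
    lines = ["WEEKLY REVIEW"]
    for section in REVIEW_SECTIONS:
        lines.append(f"{section}: {counts.get(section, 0)}")
    return "\n".join(lines)
```

Because the layout never changes, a week that looks different jumps out immediately.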

---

The Practical Setup (What I Actually Use)

Here's the exact stack, simplified for a solo founder:

Disk logs: Every agent task writes a JSON entry to a log file in the Vault. Timestamp, task, output, outcome, confidence score, review flag (true/false).

Daily Telegram summary: At the end of each operational day, the agent reads the log file, generates a plain-text summary, and sends it via Telegram. This is the thing I actually read every day. It takes about 90 seconds to skim.

Review queue: Outputs below my confidence threshold (set per task type) route to a separate Telegram queue. I review these before bed. Usually 2–5 items per day.

Weekly audit: Sunday morning, the agent writes a structured weekly review to the Vault. I read it Sunday evening.

Monthly log review: On the first of each month, I do a deeper scan of the previous month's logs. This is where I catch patterns: tasks that are consistently low-confidence, output types that consistently need editing, decisions that consistently go the wrong way.
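The monthly pattern scan is mostly a group-by over the same log. A sketch of the confidence part, using the log fields from earlier (cutoff value is illustrative):

```python
from collections import defaultdict

def low_confidence_tasks(entries: list[dict], cutoff: float = 0.7) -> dict[str, float]:
    """Mean confidence per task type, keeping only tasks below the cutoff."""
    scores = defaultdict(list)
    for e in entries:
        scores[e["task"]].append(e["confidence"])
    means = {task: sum(v) / len(v) for task, v in scores.items()}
    return {task: round(m, 2) for task, m in means.items() if m < cutoff}
```

A task type that shows up here month after month is a prompt problem, not a bad day.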

That's the whole system. Nothing fancy. No enterprise observability stack. No Langfuse or Arize or anything that costs money per trace. Just structured logs, daily summaries, and a habit of actually reading them.

---

The One Thing Most People Miss

There's a temptation to automate the monitoring layer as aggressively as you automate everything else. Build a system that automatically flags issues, automatically routes problems, automatically generates insight reports.

Don't do that yet.

The monitoring layer needs to stay human-readable and human-reviewed, at least until you deeply understand your agent's failure modes. Automated monitoring is only useful when you know exactly what you're monitoring for. Early on, you don't know. The process of reading logs, doing weekly audits, and manually reviewing flagged outputs is how you learn.

Once you know your agent's three most common failure modes, you can automate detection of those three things. Not before.

The goal of monitoring isn't to outsource your judgment about whether the agent is working. It's to give your judgment something to work with.

---

What This Connects To

If you haven't already built the foundation this runs on, start with the identity file — your agent's voice and behavior spec. That's what quality monitoring actually checks against. There's no meaningful quality signal without a clear definition of what good looks like.

The AI agent guardrails system I wrote about last week is the gate between monitoring and action: what happens when your monitoring catches something. Monitoring and guardrails are two sides of the same system. You need both.

And if you want the full architecture: how identity files, memory systems, source-of-truth documents, guardrails, and monitoring fit together into a working co-founder setup, that's what Book 1 covers. It's $7 and it's the exact system I run.

---

The Short Version

Your AI agent is doing things while you're not watching. You need three layers to know whether that work was any good:

1. Activity logs — a complete record of every output, not just every action

2. Quality signals — engagement data, review flags, and a weekly manual audit

3. Decision audits — a weekly structured review of whether the agent is making the right calls

Build all three before you trust your agent to run unsupervised. Most problems I've caught came from the audit layer, not the automated signals. The habit of reading matters more than the sophistication of the system.

Start with a simple log file and a daily Telegram summary. Add the rest as you learn what you're actually monitoring for.

---

*Published by Michael Olivieri / Xero AI*


