The Problem
On a normal Tuesday I counted 14 browser tabs. Slack, Linear, GitHub, two Google Docs, Notion, three different ChatGPT conversations, Gmail, calendar, two API doc sites, a Stack Overflow answer, and the bug I was supposed to be fixing.
That's the work that happens around the work. Tab-switching. Context loss. Pasting the same paragraph into a new chat because the old chat doesn't remember last Tuesday. By 3pm I'd written maybe forty lines of code, but I'd been at the keyboard since nine.
Browser-based AI chats are stateless. They can't see your files. They can't run your scripts. They forget who you are between conversations. So you pay a tax every time you want help: explaining the project, pasting the relevant snippets, restating the architecture, repeating constraints you've already mentioned in six earlier chats. The model gets smarter every quarter. The interface stays a silo.
The result is a developer who spends real hours each week feeding an LLM the same context it already had yesterday, then copying the answers back into a real workspace where the actual code lives.
The Solution
The AIOS is a local repo that treats the AI as a teammate instead of a chatbot. It runs inside the terminal, reads my actual files, executes scripts on my machine, talks to the same APIs I use, and remembers what I told it last week. The memory lives in markdown. Versioned. Editable. Mine.
Architecturally it's boring on purpose. No custom server. No vector database. No embedding pipeline. Just a structured folder tree of context files, skill definitions, and reference docs that the AI loads progressively as needed. The AI is the runtime; the repo is the operating system.
The whole thing rests on four layers that have to be built in order:
- Context — what the system knows about my voice, my stack, my priorities, my current sprint
- Connections — how it reaches local files, third-party APIs, and the terminal without breaking anything
- Capabilities — modular markdown "skills" that wrap repeatable workflows into deterministic recipes
- Cadence — the move from user-triggered runs to scheduled loops and remote cron routines
You can't build cadence on missing context. You can't run skills against connections that don't exist. The order matters, and rushing it is how you end up with an agent that produces confident garbage.
What Got Built
The Directory Tree
The repo enforces strict separation of concerns. Every kind of information has exactly one home:
├── .aios/
│   ├── skills/
│   │   ├── audit/skill.md
│   │   ├── level-up/skill.md
│   │   └── onboarding/skill.md
│   └── aios_master.md
├── archives/
├── context/
│   ├── identity.md
│   ├── business_architecture.md
│   └── core_priorities.md
├── references/
│   ├── system_frameworks.md
│   └── tracker_api_endpoints.md
├── decisions/
│   └── architecture_log.md
└── .env
aios_master.md is the master prompt. It maps the active skills, declares workspace rules, and points to every other relevant directory.
context/ holds the long-lived files: who I am, how I write code, what my business mechanics actually are, what the next 90 days look like.
references/ is the local cache for external documentation. API endpoints, library specs, framework notes — all stored once, read cheaply at runtime, no HTTP round-trips.
decisions/ is the architecture log. Every time something structural changes, it gets a dated entry. Future-me reads this; past-me writes it.
archives/ is where deprecated configs and old logs go to die without polluting the working tree.
That's it. Five folders, one env file, one master prompt. Anyone could read the repo top-to-bottom in twenty minutes and understand the whole system.
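The layout is small enough to scaffold in a few lines of Python. This is a minimal sketch, not the actual bootstrap script: `scaffold` is a hypothetical helper, and the paths are copied from the tree above.

```python
from pathlib import Path

# Layout copied from the tree above; the scaffold helper itself is hypothetical.
DIRECTORIES = [
    ".aios/skills/audit",
    ".aios/skills/level-up",
    ".aios/skills/onboarding",
    "archives",
    "context",
    "references",
    "decisions",
]
FILES = [
    ".aios/aios_master.md",
    ".aios/skills/audit/skill.md",
    ".aios/skills/level-up/skill.md",
    ".aios/skills/onboarding/skill.md",
    "context/identity.md",
    "context/business_architecture.md",
    "context/core_priorities.md",
    "references/system_frameworks.md",
    "references/tracker_api_endpoints.md",
    "decisions/architecture_log.md",
    ".env",
]


def scaffold(root: Path) -> list[Path]:
    """Create any missing directories and empty files; return what was created."""
    created = []
    for rel in DIRECTORIES:
        path = root / rel
        if not path.exists():
            path.mkdir(parents=True)
            created.append(path)
    for rel in FILES:
        path = root / rel
        if not path.exists():
            path.parent.mkdir(parents=True, exist_ok=True)
            path.touch()  # only reached for missing files, so nothing is overwritten
            created.append(path)
    return created
```

Calling `scaffold(Path.cwd())` a second time is safe: nothing already on disk is touched, and the call returns an empty list.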
Onboarding: Day 1 Bootstrapping
A fresh AIOS is useless. It has no context, no voice, no priorities. The first thing it needs is an intake.
I run an onboarding skill that conducts a structured interview and writes four profile files:
- Core Profile — who I am, what I build, what environment I ship to
- Style Profile — pulled from 2–3 recent code snippets and design docs so the model picks up my actual patterns instead of inventing a generic developer voice
- Priority Profile — current 90-day milestones and sprint constraints
- Domain Profile — stack, deployment pipeline, environment rules
The output of the interview gets committed to context/. From that point forward, every skill that runs has access to those four files. The agent stops acting like a stranger who just met me five seconds ago.
Connections: Markdown Over MCP
The obvious move for connecting an AI to third-party tools is to install MCP servers. I tried that. The token cost is brutal — every prompt turn loads the full tool schema, even when you only need one endpoint.
So I inverted the pattern. I had the agent crawl the API docs once, extract the four or five endpoints I actually use (create task, fetch work logs, update status, log time), and store them as a flat markdown file at references/tracker_api_endpoints.md. Now when a skill needs to hit the tracker, it reads a 400-token reference instead of loading a 4,000-token MCP schema.
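To make the idea concrete, here is a minimal sketch of how a skill script might read such a reference file. The file layout (one `## name` heading per endpoint followed by a `METHOD /path` line), the example paths, and the helper names `parse_endpoints` and `build_request` are all assumptions for illustration; the real tracker's routes will differ.

```python
import re

# Hypothetical layout for references/tracker_api_endpoints.md: one "## name"
# heading per endpoint, followed by a "METHOD /path" line. Paths are made up.
SAMPLE_REFERENCE = """\
## create_task
POST /tasks

## fetch_work_logs
GET /tasks/{task_id}/worklogs

## update_status
PATCH /tasks/{task_id}

## log_time
POST /tasks/{task_id}/worklogs
"""

ENDPOINT_LINE = re.compile(r"^(GET|POST|PUT|PATCH|DELETE)\s+(\S+)$")


def parse_endpoints(markdown: str) -> dict[str, tuple[str, str]]:
    """Map each endpoint name to its (method, path) pair."""
    endpoints = {}
    current = None
    for line in markdown.splitlines():
        line = line.strip()
        if line.startswith("## "):
            current = line[3:].strip()
            continue
        match = ENDPOINT_LINE.match(line)
        if match and current:
            endpoints[current] = (match.group(1), match.group(2))
            current = None
    return endpoints


def build_request(endpoints, name, **params):
    """Fill path parameters for one endpoint; raises KeyError if unknown."""
    method, path = endpoints[name]
    return method, path.format(**params)
```

The point of the design is that the whole catalogue is a few hundred tokens of plain text, and an unknown endpoint name fails loudly instead of triggering tool discovery.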
For tools where the agent does need write access, I created scoped bot accounts — separate from my user account, restricted to specific workspaces, with explicit read/write caps. If the agent ever loses its mind, the worst it can do is mess up one project board. It can't drop a database or delete a repo.
Secrets live in .env. Nothing sensitive ever goes into a skill file, and the master prompt explicitly forbids printing or quoting the contents of .env in any output.
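A prompt rule is a soft guard, so a script-level backstop can help. The sketch below assumes skill scripts are written in Python and that their output passes through a wrapper before it reaches the model; `load_env` and `redact` are hypothetical helpers, and the parser only handles plain `KEY=VALUE` lines.

```python
from pathlib import Path


def load_env(path: Path) -> dict[str, str]:
    """Parse simple KEY=VALUE lines; skips blanks and # comments."""
    values = {}
    for raw in path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def redact(text: str, secrets: dict[str, str]) -> str:
    """Replace any secret value appearing in text with a placeholder."""
    for key, value in secrets.items():
        if value:  # an empty value would match everywhere, so skip it
            text = text.replace(value, f"<{key}:redacted>")
    return text
```

Running every script's stdout through `redact` before it is logged or returned means a leaked token shows up as a placeholder rather than as the value itself.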
Skills: The Repeatable Recipes
A skill is a folder containing a skill.md file and any scripts it needs. The file has YAML frontmatter declaring boundaries:
---
name: team-pulse-check
description: Evaluates outstanding tickets across the tracking board.
allowed_tools: [fetch_api_data, write_local_file]
disable_model_invocation: true
---
Below the frontmatter sits the step-by-step workflow. Hyper-explicit, deterministic, no room for the model to improvise. Then an Anti-Goals block spells out what the skill must never do: don't overwrite tracking locks, don't prompt mid-run, don't touch closed tickets.
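For scripts that need to enforce those boundaries outside the model, the frontmatter is simple enough to parse by hand. This is a minimal sketch that assumes the flat subset shown above (plain strings, booleans, one-line lists); `parse_frontmatter` and `tool_allowed` are hypothetical names, and anything richer would call for a real YAML parser.

```python
def parse_frontmatter(text: str) -> tuple[dict, str]:
    """Split a skill.md into (frontmatter dict, body).

    Handles only flat keys: strings, booleans, and one-line [a, b] lists.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    # Find the closing delimiter of the frontmatter block.
    end = next((i for i in range(1, len(lines)) if lines[i].strip() == "---"), None)
    if end is None:
        return {}, text
    meta = {}
    for line in lines[1:end]:
        if ":" not in line:
            continue
        key, _, raw = line.partition(":")
        raw = raw.strip()
        if raw.startswith("[") and raw.endswith("]"):
            value = [item.strip() for item in raw[1:-1].split(",") if item.strip()]
        elif raw.lower() in ("true", "false"):
            value = raw.lower() == "true"
        else:
            value = raw
        meta[key.strip()] = value
    body = "\n".join(lines[end + 1:])
    return meta, body


def tool_allowed(meta: dict, tool: str) -> bool:
    """A tool is usable only if the skill lists it explicitly."""
    return tool in meta.get("allowed_tools", [])
```

A wrapper can call `tool_allowed` before dispatching any tool, so a skill that never declared write access cannot acquire it mid-run.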
The interesting part is how skills get loaded. The agent reads in three tiers:
- Level 1 — Global Scan (~100 tokens): Only the names and YAML descriptions of every skill. The model matches the user's intent to a specific tool.
- Level 2 — Ingestion (~1,000–2,000 tokens): The body of the matched skill.md gets loaded.
- Level 3 — Payload Execution: Heavy scripts, raw data tables, and templates only open when the skill instructions explicitly call them.
Most prompt turns stop at Level 1. The token bill stays small because nothing irrelevant ever enters the context window.
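The first two tiers can be sketched in Python as follows. This is an illustration under assumptions, not the agent's actual loader: it assumes each skill lives at `<skills_dir>/<name>/skill.md` with the frontmatter shown earlier, and the function names are hypothetical. Matching intent to a skill is left to the model and is not shown.

```python
from pathlib import Path


def read_header(skill_file: Path) -> dict[str, str]:
    """Level 1: read only name/description from the frontmatter, not the body."""
    header = {}
    with skill_file.open(encoding="utf-8") as handle:
        if handle.readline().strip() != "---":
            return header
        for line in handle:
            if line.strip() == "---":
                break  # stop before the body is ever read
            key, sep, value = line.partition(":")
            if sep and key.strip() in ("name", "description"):
                header[key.strip()] = value.strip()
    return header


def global_scan(skills_dir: Path) -> dict[str, dict[str, str]]:
    """Build the small index of every skill: folder name -> header."""
    return {
        path.parent.name: read_header(path)
        for path in sorted(skills_dir.glob("*/skill.md"))
    }


def ingest(skills_dir: Path, skill: str) -> str:
    """Level 2: load the full body of the one matched skill."""
    text = (skills_dir / skill / "skill.md").read_text(encoding="utf-8")
    parts = text.split("---", 2)
    if text.startswith("---") and len(parts) == 3:
        return parts[2].lstrip("\n")
    return text
```

Only the output of `global_scan` needs to sit in every prompt; `ingest` runs once a skill has been chosen, and Level 3 payloads stay on disk until the skill body asks for them.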
Where skills repeatedly burn tokens looking up the same IDs (project IDs, workspace IDs, tag IDs), I hardcode the IDs into the skill file or its companion reference doc. Tool discovery overhead disappears. The skill goes straight to the right endpoint.
For skills that need to chew through large file searches, I delegate to a lightweight sub-agent inside the skill. It does the search, returns the result, and the master context stays clean.
The Local LLM Wiki
Vector databases are massive overkill for the amount of personal knowledge most developers actually accumulate. Below roughly 100,000 words, a flat folder of linked markdown notes has, in my experience, beaten a RAG pipeline on both cost and recall accuracy.
So I built a wiki inside the repo:
- raw/ — the ingestion bin. Articles, meeting transcripts, project briefs land here via browser clippers or terminal scripts.
- wiki/ — the parsed index. The agent reads raw files and produces clean concept nodes, technical definitions, and dependency profiles.
- wiki/index.md — a master index of every major hub with cross-links and comparison tables.
- wiki/log.md — every modification the agent makes to the wiki, time-stamped.
- wiki/hot.md — a rolling ~500-word cache of the immediate context state. Always loaded.
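The rolling cache in wiki/hot.md can be kept within budget with a small helper. This is a sketch under an assumed eviction policy (drop the oldest paragraph first); the source only states the ~500-word size, and `update_hot_cache` is a hypothetical name.

```python
HOT_WORD_LIMIT = 500  # the approximate budget mentioned above


def update_hot_cache(existing: str, new_entry: str, limit: int = HOT_WORD_LIMIT) -> str:
    """Append an entry, then drop the oldest paragraphs until under the limit."""
    paragraphs = [p.strip() for p in existing.split("\n\n") if p.strip()]
    paragraphs.append(new_entry.strip())
    # Assumed policy: oldest context falls out first; the newest entry always stays.
    while len(paragraphs) > 1 and sum(len(p.split()) for p in paragraphs) > limit:
        paragraphs.pop(0)
    return "\n\n".join(paragraphs) + "\n"
```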
A lint script runs over the wiki on a schedule. It looks for broken links, orphaned notes, structural inconsistencies, and gaps where a topic exists but never got fleshed out. If a gap is fillable from public sources, it issues a targeted web fetch and writes the result back.
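The two mechanical checks, broken links and orphaned notes, fit in a short script. The sketch below assumes notes link to each other with standard relative markdown links such as `[label](note.md)`; the real wiki may use a different link syntax, and `lint_wiki` is a hypothetical name. Gap detection and web fetches are out of scope here.

```python
import re
from pathlib import Path

# Matches [label](target.md) and [label](target.md#anchor).
LINK = re.compile(r"\[[^\]]*\]\(([^)#\s]+\.md)(?:#[^)]*)?\)")


def lint_wiki(wiki_dir: Path) -> dict[str, list[str]]:
    """Report broken relative links and notes that nothing links to."""
    root = wiki_dir.resolve()
    notes = {p.resolve() for p in root.rglob("*.md")}
    linked = set()
    broken = []
    for note in sorted(notes):
        for target in LINK.findall(note.read_text(encoding="utf-8")):
            if "://" in target:
                continue  # external URL, not a wiki note
            resolved = (note.parent / target).resolve()
            if resolved in notes:
                if resolved != note:  # a self-link does not rescue an orphan
                    linked.add(resolved)
            else:
                broken.append(f"{note.relative_to(root).as_posix()} -> {target}")
    # index.md, log.md, and hot.md are entry points, so they are never orphans.
    entry_points = {"index.md", "log.md", "hot.md"}
    orphans = sorted(
        n.relative_to(root).as_posix()
        for n in notes - linked
        if n.name not in entry_points
    )
    return {"broken": broken, "orphans": orphans}
```

The report is plain data, so a scheduled run can append it to wiki/log.md or hand it to the agent as a to-do list.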
The wiki doesn't try to remember everything. It tries to remember exactly what's likely to matter again, and the lint cycle prunes the rest.
Diagnostics: /audit and /levelup
Two skills exist purely to keep the system honest.
/audit runs an architectural review. It scans the folder configs, checks the health of every connection, counts and rates the user-defined skills, and outputs a grade out of 100 alongside the top structural vulnerabilities. I run it every Friday.
/levelup is a five-question diagnostic loop designed to surface friction I haven't noticed:
- Frequency Check: Which development step did you do manually 3+ times this week?
- Drudgery Audit: Which step felt like copy-paste boredom?
- Smart Intern Test: Which workflow could a smart intern handle, but you did because explaining it would have taken longer?
- Scale Constraint: If your infrastructure load 10x'd Monday, what breaks first?
- Growth Leverage: Which process would double your output if it were fully autonomous?
The answers feed directly into the next sprint of skill development. The system improves itself by interrogating me about where I'm still being inefficient.
Cadence: Local Hooks vs Cloud Routines
The last layer is automation cadence — getting skills to run without me asking.
Local scheduling uses /loop. I can pin a one-time prompt to fire at a specific time (/loop at 16:30 run compliance check) or set a recurring interval (every ten minutes, poll the project tracker). Loops live in the terminal session's memory and auto-expire after three days, which prevents zombie crons from eating CPU after I've closed the window.
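The expiry behaviour is easy to model. The following is a sketch of the idea, not how /loop is actually implemented: `Loop` and `LoopRegistry` are hypothetical names, only recurring intervals are covered, and the clock is passed in explicitly so the logic can be exercised without waiting.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Callable, Optional

MAX_LIFESPAN = timedelta(days=3)  # the hard expiry described above


@dataclass
class Loop:
    prompt: str
    interval: timedelta
    created: datetime
    last_run: Optional[datetime] = None

    def expired(self, now: datetime) -> bool:
        return now - self.created >= MAX_LIFESPAN

    def due(self, now: datetime) -> bool:
        if self.expired(now):
            return False
        return self.last_run is None or now - self.last_run >= self.interval


@dataclass
class LoopRegistry:
    """Lives in process memory only, so closing the session drops every loop."""
    loops: list[Loop] = field(default_factory=list)

    def add(self, prompt: str, interval: timedelta, now: datetime) -> Loop:
        loop = Loop(prompt, interval, created=now)
        self.loops.append(loop)
        return loop

    def tick(self, now: datetime, run: Callable[[str], None]) -> None:
        """Drop expired loops, then run whichever ones are due."""
        self.loops = [loop for loop in self.loops if not loop.expired(now)]
        for loop in self.loops:
            if loop.due(now):
                run(loop.prompt)
                loop.last_run = now
```

Because nothing is written to disk, there is no crontab entry to clean up: a loop ends when the process ends or when it passes the three-day mark, whichever comes first.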
For anything that needs to outlive the terminal, I sync the repo to a private GitHub repository and trigger it remotely. A cloud routine spins up a sandboxed container (≥4 vCPUs, 16GB RAM), pulls the repo, injects env vars from a secure store, runs the skill, commits or notifies, and destroys the instance. Completely stateless. Nothing about the local machine is involved.
| Metric | Local Hooks (/loop) | Remote Cloud Routines |
|---|---|---|
| Runtime infrastructure | Local developer machine | Stateless cloud container |
| State persistence | Tied to terminal lifecycle | Destroyed post-run |
| Minimum run interval | 1 minute | 1 hour |
| File access | Full local system | Cloned repo + cloud env keys |
| Max lifespan | 3-day hard expiry | Single-shot cycle |
Local loops handle the polling work that needs to happen while I'm working. Cloud routines handle the long-cycle work that needs to happen whether or not I'm at my desk.
The Results
| Metric | Before | After |
|---|---|---|
| Browser tabs during a working session | 12–18 | 1–3 |
| Re-explaining project context to the AI | Multiple times daily | Zero |
| Manual workflow execution per week | 6–10 hours | Under 30 minutes (review only) |
| Where credentials, templates, schedules live | Scattered (post-its, password manager, browser bookmarks, memory) | The repo |
| Loading project context into a new chat | 30–60 minutes per chat | Loaded automatically |
| Friday architecture review | Never happened | Automated, 5 minutes |
The three KPIs I actually track aren't quantitative. They're directional, and they say more about whether the system is working than any token count could.
Tab Isolation. I stopped opening endless browser windows. Roughly 90% of operational and communicative work now happens natively in the terminal. The browser is for reading and for the few apps that genuinely don't have an API.
Context Offloading. Internal frameworks, credentials, templates, scheduling — all of it lives in the repo, not in my head and not on scattered post-its. When I need to recall a stack decision from three months ago, I open the architecture log instead of trying to remember.
Team Proxy Scalability. A teammate can query my AIOS profile via version-control hooks and pull project architectures, look up priorities, or schedule tasks without ever needing to distract me. The repo is interview-ready documentation that happens to also be runtime configuration.
Why It Works
The whole design rests on three bets that turned out to be right.
Bet 1: Markdown beats infrastructure. Every time I considered building something heavier — a vector DB, an MCP server, a custom UI — the markdown version performed better. Cheaper to read. Easier to version. Trivial to edit by hand when the AI gets something wrong. Markdown is what the model is best at reading anyway. Building on top of it is like building a website on top of HTML instead of inventing a new document format first.
Bet 2: Progressive loading beats a stuffed context window. The instinct with a 200K-token context window is to fill it. That's the wrong move. The right move is to load the minimum necessary at each tier and let the model pull more only when it's needed. Skills with three-tier loading consistently outperform skills that dump everything up front, both in cost and in answer quality. A focused 2,000-token context produces better output than a sprawling 50,000-token context every time.
Bet 3: Deterministic recipes beat agentic autonomy. I tried letting skills run open-ended at first. The model would improvise, take detours, and occasionally produce something brilliant, surrounded by three hours of useless wandering. Writing skills as hyper-explicit step-by-step recipes with anti-goals cut variance to near zero. The "Boring is Beautiful" rule applies: if a skill keeps hitting edge-case loops, break it down, strip out AI autonomy on the brittle piece, and rewrite that chunk as a deterministic Python or JavaScript script. The agent calls the script. The script does the work. Predictability returns.
Human review stays in the loop where it matters. Drafts go to me before they go anywhere external. CRM rows get written but are reviewable. The system does the structural work; the judgment calls stay mine.
The Daily Cycle
A normal day inside the AIOS looks like this:
Morning. I run the day-planning skill. It pulls from my calendar, evaluates current sprints, parses the tracker, and generates a daily timeline inside the terminal. By the time my coffee's ready, the day is mapped.
Working hours. I stay in the terminal. If a skill misfires or the agent acts like a generic chatbot, I patch the skill.md immediately. Vulnerabilities don't survive past the day I notice them.
Evening. A sync skill reviews execution histories, archives expired logs, and updates the priority chart for tomorrow.
Friday. /audit runs. I read the grade and the top vulnerabilities. If a skill is consistently grading poorly, I refactor it. If a workflow has been manual three weeks in a row, /levelup catches it and turns it into a skill.
The work that used to happen around the work — context-pasting, tab-switching, manual logging, re-explaining the project, re-finding the password, re-remembering the decision — quietly disappeared. What's left is the actual building. That's not a productivity improvement. It's a structural change to how a workday feels.
