Module 0: Foundations of Agentic Engineering
Estimated time: ~1 hour
What you will learn in this module:
- What a coding agent is and how it differs from autocomplete and chat assistants
- The current landscape of tools, models, and harnesses
- How tokens, context windows, prompt caching, and context compaction affect both cost and output quality
- How the agent sees your codebase, navigates files, and takes real actions through tool use
- Working safely with git, the common problems every team hits early, and how to recover
- Data privacy, where your code is sent, and real-world cost breakdowns
- Rules files, skills, hooks, and a session workflow that turn a generic agent into one that knows your team’s conventions
The Core Mental Model
In traditional engineering, you write the code, but in agentic engineering, you describe what you want, the agent writes it, and your job becomes reviewing, correcting, and steering the output toward the right result. You build up rules, conventions, and reusable skills over time so that every future session is faster and more accurate than the last.
Your Role Is Director
This is a genuine change in how you work. You are no longer the one who types the solution. You are the one who defines the problem clearly, points the agent at the right files and context, reviews everything it produces, and corrects it when it goes wrong. The agent is fast and capable but has no judgment about what you actually want. That judgment is yours to provide.
The biggest mistake teams make early on is treating the agent like a better Stack Overflow, pasting in a question and copying out the answer. A coding agent can read your entire codebase, run tests, execute commands, and modify files across your project. It can take a task from start to finish without you intervening at every step, but only if you learn to direct it well. That takes deliberate practice.
Choosing the Right Tasks
Knowing what to delegate matters as much as how you delegate it. The best tasks are well-scoped, easy to verify, and do not require judgment calls that go beyond the code itself. Product decisions, trade-offs that depend on context only your team has, or anything where “done” is hard to define upfront should stay with you. A useful test is whether you can write a short spec for what you want. If you cannot, the task is not ready to delegate. Clarify first, then hand it off.
Review Everything
Treat every piece of output the agent produces as a draft, not a commit. The agent does not flag its own limitations. It will write code that compiles, passes tests, and looks right while quietly misunderstanding what you actually needed. Your job after every task is to read the diff, run the tests, and verify the behavior rather than accepting output just because it looks plausible. Think of it like reviewing a pull request from someone who is fast, confident, and occasionally wrong in ways that are hard to spot.
What Is a Coding Agent?
A coding agent is an LLM that can read your codebase, plan changes across multiple files, execute code and run tests through the terminal, and iterate on its own mistakes when something goes wrong. Unlike autocomplete tools that predict the next line, and unlike chat interfaces where you ask a question and get a text answer, a coding agent operates in a loop where it thinks about what to do, takes an action, observes the result, and adjusts its approach until the task is complete or it needs your input.
The key distinction is between a coding assistant and a coding agent. An assistant answers when asked. An agent takes action autonomously within a defined scope. It reads files, writes code, runs terminal commands, searches your codebase, and chains multiple steps together without you intervening at each step.
The spectrum of autonomy:
| Level | Behavior |
|---|---|
| Manual Coding | You write everything |
| Autocomplete | It suggests the next line |
| Chat Assistant | It answers when asked |
| Coding Agent | It does what you define |
| Fully Autonomous | It decides and does (not yet reliable) |
Most teams today sit somewhere between “autocomplete” and “chat assistant.” This guide moves you to “coding agent.”
Harnesses, Models, and How to Choose
When teams first start working with coding agents, they often treat the harness and the model as interchangeable. They are not. The harness is the environment your agent runs inside, covering the shell, the editor, and file system access, while the model is the AI doing the actual reasoning. Choosing them independently, and knowing when to swap either one, is a core part of using agents well.
Harnesses
A harness is the environment the agent runs inside, encompassing the shell, editor integration, file system access, terminal, and UI.
| Type | Examples | Description |
|---|---|---|
| Terminal native | Claude Code, Aider, Codex CLI, Gemini CLI | Run directly in your terminal with full filesystem and shell access |
| AI native IDEs | Cursor, Windsurf, Zed | Full code editors rebuilt around AI |
| IDE extensions | Cline, Roo Code, GitHub Copilot, Continue | Plug into existing editors like VS Code or JetBrains |
| Cloud/autonomous | Devin, OpenHands, Manus, Codegen | Run in their own cloud environment |
| Spec driven | Kiro (AWS) | Structure work around specifications first |
| Automated pipeline | Factory | Automated code generation pipelines |
| IDE extension | Kilo Code | Lightweight agent for rapid prototyping |
Models
A model is the AI doing the actual reasoning and code generation inside the harness. It reads your prompt and context, decides what actions to take, and produces the output you review. Different models have meaningfully different strengths. Some are better at complex multi-step reasoning, others are faster and cheaper for routine work, and some can be self-hosted if your security posture requires it.
| Model | Strengths | Approx. API Pricing (input / output, USD per 1M tokens) |
|---|---|---|
| Claude Opus 4.6 | Best reasoning, ~80% SWE-bench, 1M context | $5 / $25 |
| Claude Sonnet 4.6 | Strong coding at lower cost | $3 / $15 |
| GPT 5.3 Codex | Fast, polyglot, strong terminal scores | $1.25 / $10 |
| Gemini 2.5 Pro | Large context, strong at code understanding | Varies |
| DeepSeek R1 | Open weight reasoning, self-hostable | Self hosted or API |
| Qwen 3.5 | Open weight, cost efficient | Self hosted or API |
| Llama 4 Code | Open weight, runs locally | Self hosted |
| MiniMax Text-01 | 1M context, strong at long-context tasks | API |
| GLM-4 (Zhipu) | Open weight, strong multilingual coding | Self hosted or API |
The choice of harness depends on how your team works today (terminal people vs IDE people), your security requirements, and how much autonomy you want to give the agent. The choice of model depends on the complexity of your tasks, your budget, and whether you need self-hosted deployment. In practice, many developers in 2026 use more than one harness. They might use an IDE like Cursor for daily coding alongside a terminal agent like Claude Code for complex refactors and architectural work.
Understanding Model Numbers
When a new model is announced, the numbers you will see most often are parameter count and training token count. Parameters are the internal numerical weights the model learned during training, representing billions of tiny adjustments made while processing vast amounts of text, code, and data. More parameters generally mean more capacity to handle complexity. Training tokens are how much text the model read before it was released. For example, when a model is described as “70 billion parameters trained on 15 trillion tokens,” it means a medium-large model trained on a very large corpus, though neither number tells you how well it codes.
A well-optimised smaller model can outperform a larger one trained carelessly. What actually matters for your work is benchmark performance on coding tasks, context window size, and cost per token; the table above lists these where they are available.
Reasoning vs Standard Models
Models differ in capability and in how they arrive at an answer. Some generate output directly and quickly; others work through a structured thinking process before responding. This difference is significant when choosing a model for a given task.
Standard models like Claude Sonnet 4.6, GPT 5.3 Codex, and open-weight models like Qwen 3.5 generate the answer directly, token by token, without a separate thinking phase. They are fast, good for straightforward tasks, and cost efficient.
Reasoning models like Claude Opus 4.6 with extended thinking, DeepSeek R1, and o3 have an explicit “thinking” phase where they plan, consider alternatives, backtrack, and self-correct before producing output. This takes more tokens and more time, but the results on complex tasks are noticeably better.
When to Use Which
| Task Type | Model Type | Why |
|---|---|---|
| Routine code generation, simple fixes, boilerplate | Standard | Faster, cheaper, sufficient quality |
| Complex logic, multi-step reasoning, debugging | Reasoning | Upfront thinking cost pays off in fewer iterations |
Most harnesses let you switch models mid-session. A good practice is to start planning with a reasoning model since it handles ambiguity and architectural decisions better, then switch to a faster standard model like Sonnet to execute the concrete implementation once the plan is clear.
Tokens: The Currency You’re Spending
Every interaction with an agent costs tokens.
For English prose, roughly 4 characters equal 1 token, but code tokenizes less efficiently because special characters ({, }, =>, //), indentation, and camelCase variable names all consume more tokens per line than plain text. A line of English runs about 7–10 tokens; a line of Python or JavaScript runs about 10–15 tokens.
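The rule of thumb above can be turned into a rough estimator. This is a minimal sketch, assuming the ~4 characters per token figure for prose and an illustrative 3 characters per token for code (the text only says code is less efficient, so that number is a guess). Real tokenizers vary by model, so treat the result as an order-of-magnitude guide.

```python
import math

# Assumed ratios: ~4 chars/token for prose comes from the rule of thumb above;
# 3 chars/token for code is an illustrative guess, not a measured value.
CHARS_PER_TOKEN = {"prose": 4.0, "code": 3.0}


def estimate_tokens(text: str, kind: str = "prose") -> int:
    """Return a rough token estimate for text; this is not a real tokenizer."""
    if not text:
        return 0
    return math.ceil(len(text) / CHARS_PER_TOKEN[kind])


sample = "def add(a, b):\n    return a + b\n"
print(estimate_tokens(sample, "prose"), estimate_tokens(sample, "code"))
```

For budgeting a session, an estimate like this is usually enough; for billing-accurate counts, use the token counter your provider supplies.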
The Token Economy
- Context window is how much the agent can “see” at once, ranging from 128K to 1M tokens. A bigger window means more awareness but also more cost per request.
- Input tokens are what you send (code, instructions, file contents) and make up the bulk of cost.
- Output tokens are what the agent generates. The volume is smaller but output is several times more expensive per token than input (5x to 8x for the priced models in the table above).
- Reasoning tokens are generated during the “thinking” phase and you pay for them even if you never see them. This is the hidden cost that catches teams off guard when they first switch to reasoning models.
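To see how these token types combine, the sketch below computes the cost of a single request. It uses the $3 / $15 per-million pricing listed for Claude Sonnet 4.6 in the model table and assumes reasoning tokens are billed at the output rate, which you should confirm against your provider's pricing page. The token counts are made-up illustrative numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    input_per_million: float   # USD per 1M input tokens
    output_per_million: float  # USD per 1M output tokens


def request_cost(pricing: Pricing, input_tokens: int, output_tokens: int,
                 reasoning_tokens: int = 0) -> float:
    """Cost in USD; reasoning tokens are assumed to bill at the output rate."""
    billed_output = output_tokens + reasoning_tokens
    return (input_tokens * pricing.input_per_million
            + billed_output * pricing.output_per_million) / 1_000_000


sonnet = Pricing(input_per_million=3.0, output_per_million=15.0)
plain = request_cost(sonnet, input_tokens=50_000, output_tokens=2_000)
with_thinking = request_cost(sonnet, 50_000, 2_000, reasoning_tokens=8_000)
print(f"${plain:.2f} without thinking, ${with_thinking:.2f} with")
```

In this example the visible output is identical in both cases, yet the request with a thinking phase costs noticeably more, which is the hidden cost described above.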
Context Compaction
When a session approaches the model’s context limit, the harness automatically summarizes and compresses the conversation history. In Claude Code and OpenCode, you can also trigger this manually with a /compact command when you want to start a cleaner session without losing your progress.
Compaction is lossy by design. The summary preserves the broad shape of what was discussed but drops fine-grained details like exact file paths, specific error messages, and verbatim snippets. This is why shorter, focused sessions produce better output than long sprawling ones. If the agent has already compacted once and output quality still seems off, starting a fresh session with a clean context and a precise summary is almost always the faster path.
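A toy sketch of the idea, not how any particular harness implements it: once the history exceeds a token budget, older messages are replaced with a single summary entry and only the most recent ones are kept verbatim. The summarize function is a hypothetical stand-in (a real harness would ask the model to write the summary), and token counts use the rough 4-characters-per-token estimate.

```python
def rough_tokens(text: str) -> int:
    # Rough estimate using the ~4 characters per token rule of thumb.
    return max(1, len(text) // 4)


def summarize(messages: list[str]) -> str:
    # Hypothetical stand-in: keeps only the first 40 characters of each message.
    return "SUMMARY: " + " | ".join(m[:40] for m in messages)


def compact(history: list[str], budget: int, keep_recent: int = 2) -> list[str]:
    """Return history unchanged if under budget, else summarize older messages."""
    total = sum(rough_tokens(m) for m in history)
    if total <= budget or len(history) <= keep_recent:
        return history
    older, recent = history[:-keep_recent], history[-keep_recent:]
    return [summarize(older)] + recent


history = [
    "Error at src/api/auth.ts line 42: token expired " + "x" * 200,
    "Tried refreshing the token in middleware " + "y" * 200,
    "Now update the tests",
    "Tests updated, two failing",
]
compacted = compact(history, budget=50)
print(len(history), "->", len(compacted), "messages")
```

The truncation inside summarize is deliberately crude, but it shows the same effect as real compaction: the detail at the end of each older message is gone once the summary replaces it.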
Prompt Caching
When you send a request, a large chunk of the context is identical every time. Your skills, rules file, repo structure, and working files don’t change between requests within a session.
Prompt caching means the provider stores this static prefix so you don’t pay full price on every request.
How it works
- First request sends everything at full price
- Subsequent requests with the same prefix hit the cache (10–90% discount)
- Cache has a TTL, so long pauses between requests may invalidate it
Practical implications
- Front-load static context by putting skills and system instructions at the beginning and variable task instructions at the end. This maximizes cache hits across sequential requests.
- Keep sessions warm. Rapid back-and-forth keeps the cache alive, while long pauses may invalidate it and cost you more.
- Skills and rules files pay for themselves. They sit in the cached prefix and are nearly free after the first request, so the quality improvement comes at almost no marginal cost.
- Most modern harnesses handle caching automatically (Claude Code, Cursor, and others), but understanding the mechanic helps you structure sessions for maximum efficiency.
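To make the saving concrete, here is a small worked calculation. It assumes a 20,000-token static prefix, 1,000 tokens of new input per request, $3 per million input tokens, and a 90% discount on cached tokens (the top of the range quoted above). All four numbers are illustrative assumptions. The sketch also ignores any surcharge for writing to the cache and the way conversation history grows between requests.

```python
def session_input_cost(requests: int, prefix_tokens: int, new_tokens: int,
                       price_per_million: float, cache_discount: float) -> float:
    """Input cost in USD when the static prefix is cached after request 1."""
    if requests <= 0:
        return 0.0
    rate = price_per_million / 1_000_000
    first = (prefix_tokens + new_tokens) * rate          # full price once
    cached_prefix = prefix_tokens * rate * (1.0 - cache_discount)
    later = (requests - 1) * (cached_prefix + new_tokens * rate)
    return first + later


uncached = session_input_cost(20, 20_000, 1_000, 3.0, cache_discount=0.0)
cached = session_input_cost(20, 20_000, 1_000, 3.0, cache_discount=0.9)
print(f"uncached ${uncached:.2f}, cached ${cached:.2f}")
```

Under these assumptions a 20-request session costs roughly a fifth as much on input once the prefix is cached, which is why a stable, front-loaded prefix matters.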
How the Agent Sees Your Codebase
A coding agent does not have your entire repo loaded in memory. The context window is large but finite, so the agent must be selective.
What happens when you give a task
- The harness indexes your repo, building a file tree and sometimes generating embeddings so the agent can search efficiently
- The agent decides which files are relevant based on filename, import chains, codebase search, or your explicit guidance in the prompt
- Selected files get loaded into the context window alongside your prompt, rules, and skills
- The agent works with only what is in the window. Everything outside effectively does not exist
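The selection step can be pictured as packing files into a fixed budget. The sketch below is a deliberately naive illustration, not how any specific harness ranks files: it assumes each candidate file already has a relevance score (the values here are hypothetical, as is the big_module.ts path) and loads the highest-scoring files until the token budget runs out.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    path: str
    tokens: int       # estimated size of the file in tokens
    relevance: float  # hypothetical score from search or import analysis


def select_files(candidates: list[Candidate], budget: int) -> list[str]:
    """Greedily pick the most relevant files that still fit in the budget."""
    chosen, used = [], 0
    for c in sorted(candidates, key=lambda c: c.relevance, reverse=True):
        if used + c.tokens <= budget:
            chosen.append(c.path)
            used += c.tokens
    return chosen


files = [
    Candidate("src/api/auth.ts", 3_000, 0.9),
    Candidate("src/types/user.ts", 1_000, 0.8),
    Candidate("src/legacy/big_module.ts", 40_000, 0.7),
    Candidate("README.md", 2_000, 0.2),
]
print(select_files(files, budget=8_000))
```

Here the large file never enters the window even though it is relevant, so anything the agent writes is produced without seeing it.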
Consequences
- If the agent does not load the right file, it will hallucinate plausible but incorrect code, or produce something that works in isolation but conflicts with the rest of your codebase
- Explicit file references in your prompts like “read src/api/auth.ts and src/types/user.ts before making changes” dramatically improve accuracy
- Large monorepos are inherently harder for agents to navigate than well-structured smaller codebases, and your project organization directly affects how effective the agent can be
- Catching conflicts with unloaded files is your responsibility during review
Tool Use: How Agents Do Things
An agent is acting on your system: reading files, running commands, writing code, and modifying your project. Knowing the tool use loop takes the mystery out of what the agent is doing when you watch it work.
The agentic loop
graph TD
A[You give a task] --> B[Agent THINKS]
B --> C[Agent calls a TOOL]
C --> D[Tool returns a RESULT]
D --> E[Agent OBSERVES the result]
E --> F{Done?}
F -->|Yes| G[OUTPUT]
F -->|No| B
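In code, the loop in the diagram looks roughly like this. It is a toy sketch: fake_model is a scripted stand-in for the LLM and the tool registry holds two made-up tools, so none of these names correspond to a real harness API.

```python
from typing import Callable

# Made-up tools; a real harness would read files and run commands here.
TOOLS: dict[str, Callable[[str], str]] = {
    "read_file": lambda path: f"<contents of {path}>",
    "run_tests": lambda _: "2 passed, 0 failed",
}


def fake_model(task: str, observations: list[str]) -> dict:
    """Scripted stand-in for the LLM: read a file, run tests, then finish."""
    if len(observations) == 0:
        return {"tool": "read_file", "arg": "src/api/auth.ts"}
    if len(observations) == 1:
        return {"tool": "run_tests", "arg": ""}
    return {"done": True, "output": f"Finished: {task}"}


def run_agent(task: str, max_steps: int = 10) -> tuple[str, list[str]]:
    observations: list[str] = []
    for _ in range(max_steps):
        decision = fake_model(task, observations)          # THINK
        if decision.get("done"):                           # Done?
            return decision["output"], observations        # OUTPUT
        result = TOOLS[decision["tool"]](decision["arg"])  # call a TOOL
        observations.append(result)                        # OBSERVE the result
    return "Stopped: step limit reached", observations


output, trace = run_agent("fix login bug")
print(output, trace)
```

The step limit is a design choice worth noting: without one, a model that never decides it is done would loop indefinitely.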
Common tools
| Tool | Purpose |
|---|---|
| File read | Load file contents into context |
| File write/edit | Create or modify files |
| Terminal/bash | Run commands, tests, git operations |
| Search/grep | Find patterns and definitions across the codebase |
| Web fetch | Pull documentation or external information |
| MCP tools | Connect to external services via MCP (Model Context Protocol) |
In practice
- Each tool call costs tokens because both the request and the response consume context window space. An agent that makes 20 file reads before writing a single line of code is spending your context budget
- You can steer efficiency directly through your prompts. Asking the agent to “read auth.ts and types.ts, then implement the change” is both cheaper and faster than “figure out which files are relevant”
- Once you understand the tool use loop, you can read what the agent is doing as it works, which helps you decide when to let it run and when to interrupt
Working with Git
Before running your first agent session, establish one non-negotiable habit. Treat git as your undo button. An agent can modify dozens of files in the time it takes you to read its plan. Without checkpoints, recovering from a bad run means manually untangling changes across your entire codebase.
The three rules
- Always work on a branch. Never run an agent session directly on main. Create a branch before every session by running git checkout -b agent/feature-name before the agent touches anything. If the output is wrong, commit or discard the agent’s changes on that branch, then git checkout main and you are back to a clean state.
- Commit before you start. Make sure your working tree is clean before the first task. If the agent produces good output halfway through and then goes off the rails, a clean starting commit means you can cherry-pick the good changes and discard the rest.
- Use git diff as your primary review tool. Review the diff before running tests. The diff shows what the agent actually did; treat its prose summary as a starting point, not a substitute.
Commit checkpoints during long sessions
For tasks that span many files, ask the agent to commit at logical milestones:
“Implement the auth handler, commit, then move on to the tests”
This creates granular history you can step through with git log --oneline and roll back to any checkpoint if the next step goes wrong.
Git worktrees for parallel work
Once you are comfortable with single-agent sessions, git worktrees let you run multiple agents in parallel on the same codebase without conflicts. Each worktree is a separate directory checked out on its own branch, but all sharing the same .git history. You can have one agent refactoring the auth module while another writes tests, with no risk of overwriting each other.
git worktree add ../project-feature-name feature/feature-name
As of early 2026, git worktrees have emerged as the standard isolation mechanism for teams running parallel agent workflows.
When Things Go Wrong
Every team runs into the same problems early. Recognising them quickly is what separates an hour lost from five minutes lost.
The agent modifies things you didn’t ask it to
The agent begins with your task and somewhere mid-session decides to “improve” something adjacent. It reformats files you didn’t mention, refactors a utility function, or installs a dependency without asking. The output compiles, tests pass, but the diff is three times larger than it should be.
Most harnesses show you the agent’s plan before it executes. If the file list includes things you didn’t ask about, interrupt and tighten the scope. Add explicit constraints telling the agent “Only modify files in src/api/. Do not change shared utilities or package.json.”
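A scope constraint like this can also be checked mechanically after the fact. The sketch below assumes you already have the list of changed paths (for example from git diff --name-only) and flags anything outside an allowed directory. The function name and sample paths are illustrative.

```python
from pathlib import PurePosixPath


def out_of_scope(changed: list[str], allowed_dirs: list[str]) -> list[str]:
    """Return the changed paths that are not inside any allowed directory."""
    allowed = [PurePosixPath(d) for d in allowed_dirs]
    flagged = []
    for path in changed:
        p = PurePosixPath(path)
        # Compare whole path components so "src/apiary" does not match "src/api".
        if not any(a == p or a in p.parents for a in allowed):
            flagged.append(path)
    return flagged


changed_files = ["src/api/login.ts", "src/utils/format.ts", "package.json"]
print(out_of_scope(changed_files, ["src/api/"]))
```

A non-empty result is a signal to read those parts of the diff especially carefully, or to reset and rerun with a tighter prompt.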
The output looks right but isn’t
Code can compile, pass tests, and look correct while still being logically broken. This happens most often when the agent didn’t load the right files into context, made assumptions about an API it couldn’t see, or referenced a function signature that doesn’t exist.
The agent keeps trying the same thing
The agent gets stuck cycling through attempts. It tries something, gets an error, tries a variation, then gets the same error. This almost always means it is missing context it cannot find by searching your codebase.
If the agent has tried the same fix three or more times, stop it. Paste in the relevant file directly, the exact error message verbatim, and restart with a focused prompt. The agent cannot ask you for what it doesn’t know it needs.
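The three-attempts rule is simple enough to express as a check over the error messages seen so far. This is a minimal sketch under the assumption that you can collect those messages as strings; it normalizes whitespace and case so trivially different copies of the same error count together.

```python
from collections import Counter


def should_interrupt(errors: list[str], threshold: int = 3) -> bool:
    """True once any single error message has appeared `threshold` times."""
    normalized = (" ".join(e.split()).lower() for e in errors)
    counts = Counter(normalized)
    return any(n >= threshold for n in counts.values())


seen = [
    "TypeError: user is undefined",
    "TypeError:  user is undefined",
    "typeerror: user is undefined",
]
print(should_interrupt(seen))
```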
Output quality drops over a long session
As a session grows long, the context window fills with back-and-forth conversation, intermediate results, and abandoned approaches. The agent starts ignoring earlier constraints or writing code that contradicts decisions from 30 messages ago.
If you’ve been in the same session for 45–60+ minutes and quality has dropped, start a fresh session with a clean context and a precise summary. It almost always produces better output than pushing through.
When anything goes wrong
git reset or git checkout is faster than trying to surgically undo agent changes. Know your last clean commit, reset to it, and restart with a tighter prompt. The session history is not worth preserving if the codebase state is wrong.
Where Your Code Goes
Your engineering leads and security team will ask about this first. Have the answer ready.
| Setup | Code leaves your machine? | Who sees it? |
|---|---|---|
| API-based (Claude API, OpenAI API) | Yes | Provider processes it; most do NOT train on API data |
| IDE tools (Cursor, Windsurf, Cline) | Yes | Some route through their own servers first |
| Self-hosted open weight models | No | Only your team |
| Cloud agents (Devin, Manus) | Yes | Provider has access within their sandbox |
Questions to answer before adopting any tool
- Does the provider train on API inputs, and is this different for free tier vs paid plans?
- Where are the provider’s servers located, and does data residency matter for your organization?
- Is the data encrypted in transit and at rest?
- Can you use a self-hosted or VPC-deployed model if your security posture requires it?
- What happens to your code after the session ends? Is it logged, cached, or retained?
- Does the tool support BYOK (bring your own key)?
For most teams building non-classified software, API-based tools with a no-training-on-input policy are sufficient to get started. For regulated industries or teams with sensitive IP, self-hosted open weight models like DeepSeek R1, Qwen 3.5, or Llama 4 Code give you full control over data boundaries, though you take on infrastructure management and lower model quality.
What This Actually Costs
Rough cost ranges (March 2026)
| Usage Pattern | Estimated Daily Cost per Developer |
|---|---|
| Light use (a few tasks/day) | $2–$5 |
| Moderate use (agent-assisted for most work) | $5–$15 |
| Heavy use (continuous with large context) | $15–$40+ |
These assume frontier model API pricing. If you are on a subscription plan like Cursor ($20/month) or Claude Pro ($20/month), the cost structure is different but comes with usage limits and rate caps that heavy users hit regularly.
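For budgeting, the daily ranges in the table can be scaled to a team and a month. The sketch below assumes 21 working days per month, which is an assumption rather than a figure from the table, and treats the heavy-use upper bound of $40+ as a floor rather than a cap.

```python
# Daily cost ranges in USD per developer, copied from the table above.
DAILY_COST_USD = {"light": (2, 5), "moderate": (5, 15), "heavy": (15, 40)}


def monthly_range(pattern: str, developers: int,
                  working_days: int = 21) -> tuple[int, int]:
    """Return (low, high) estimated monthly API spend in USD for a team."""
    low, high = DAILY_COST_USD[pattern]
    return low * developers * working_days, high * developers * working_days


low, high = monthly_range("moderate", developers=10)
print(f"${low:,} to ${high:,} per month")
```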
What drives cost up
- Loading large files repeatedly instead of leveraging the cache
- Using reasoning models on simple tasks where a standard model would produce the same result
- Running long, unfocused sessions where the context grows without reaching resolution
- Multi-agent workflows multiplying token consumption
What keeps cost down
- Writing precise prompts that minimize unnecessary exploration
- Using skills and rules files that frontload cached context
- Breaking large tasks into focused sessions
- Defaulting to standard models, escalating only when needed
- Knowing when to stop the agent and write the code yourself, because sometimes that is the most efficient path
Rules Files, Skills, and Codebase Readiness
Rules files, skills, and a ready codebase are what separate teams who get consistent output from teams who feel like the agent produces something different every time.
Rules files
A rules file is a markdown document the agent reads automatically at the start of every session. It contains your project’s conventions, coding standards, testing expectations, build commands, and anything else the agent needs to know to work effectively in your codebase.
| Tool | Rules File |
|---|---|
| Claude Code | CLAUDE.md |
| Codex | AGENTS.md |
| Cursor | .cursor/rules/ |
| JetBrains Junie | .junie/guidelines.md |
As of late 2025, AGENTS.md has emerged as a universal standard under the Linux Foundation’s Agentic AI Foundation, and most major tools now support it. If your team uses multiple coding agents, you can maintain one AGENTS.md as your source of truth and symlink it to the tool-specific filenames.
A good rules file should be concise (under 200 lines), contain only information the agent cannot infer from the code itself, and be treated as a living document that you update whenever the agent makes a mistake you don’t want repeated. It is onboarding documentation for a team member who joins fresh every session.
Skills
A skill is a more specific, task-scoped instruction set (SKILL.md) that the agent loads only when relevant. Rules files apply to every task. Skills activate selectively based on what you are doing. A skill for creating a new API endpoint would specify the folder structure, error handling pattern, and testing style. A skill for generating a PDF would specify which library to use and what steps to follow.
A prompt is a one-time instruction for a specific task. A skill is a persistent, reusable instruction that applies across many tasks. Prompts are directions to a specific destination. Skills are the driving rules the driver follows regardless of where they are going.
As of March 2026, there are community-maintained libraries with over a thousand portable skills that work across Claude Code, Cursor, Gemini CLI, Codex CLI, and other agents.
Hooks
Rules files and skills are probabilistic. The agent follows them most of the time, but under pressure from long context or complex tasks it may skip instructions. Hooks provide deterministic enforcement through shell scripts that run automatically at specific points in the agent’s workflow and always execute without exception. If you need a rule that is never violated, like “always run the linter before committing” or “never execute rm -rf”, a hook is more reliable than a rules file entry.
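As an illustration of what deterministic enforcement means, the sketch below implements the “never execute rm -rf” rule as a plain function. It assumes the harness passes the proposed shell command to the hook as a string before running it; each harness defines its own hook interface, so the function name and calling convention are hypothetical. The check is also narrow on purpose: it does not catch variants such as sudo rm -rf or commands buried in a pipeline.

```python
import shlex


def is_blocked(command: str) -> bool:
    """Return True if the command is an rm call with both recursive and force."""
    try:
        tokens = shlex.split(command)
    except ValueError:
        return True  # unparseable input: fail closed
    if not tokens or tokens[0] != "rm":
        return False
    args = tokens[1:]
    long_flags = {a for a in args if a.startswith("--")}
    short_flags = "".join(a[1:] for a in args
                          if a.startswith("-") and not a.startswith("--"))
    recursive = "r" in short_flags or "R" in short_flags \
        or "--recursive" in long_flags
    force = "f" in short_flags or "--force" in long_flags
    return recursive and force


for cmd in ["rm -rf build", "rm -r -f build", "rm notes.txt", "git status"]:
    print(cmd, "->", "BLOCK" if is_blocked(cmd) else "allow")
```

Unlike a line in a rules file, this check gives the same answer every time for the same input, regardless of how long the session has been running.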
Your codebase affects agent performance
An important insight that teams often overlook is that agent quality is shaped by your prompting and by the state of your codebase in equal measure. Missing documentation, undocumented environment variables, build processes that require tribal knowledge, and absent linter configuration all make the agent’s job harder.
A codebase that a new engineer could set up and run in 30 minutes is also a codebase where an agent performs reliably. Investing in your AGENTS.md, test coverage, and build tooling is both engineering hygiene and a direct multiplier on the leverage you get from every agent session.
The Session Workflow
Every working session with an agent follows this rhythm:
| Step | Action |
|---|---|
| 1. FRAME | Define the task in plain language, be specific about scope |
| 2. ORIENT | Point the agent at the right files, context, and skills |
| 3. EXECUTE | Let the agent work; observe its plan before it acts |
| 4. VERIFY | Review the output, run tests, check edge cases |
| 5. STEER | Correct course if needed, add constraints, iterate |
This loop applies whether you are fixing a one-line bug or building a full feature across multiple files and services.
Your first session
Do not start with a large or ambiguous task. Pick something small, well-understood, and easy to verify. Try a bug you already know the fix for, a test file for a module you wrote yourself, or a small refactor in a section of the codebase you understand well.
The goal of the first session is to learn how the agent reads your codebase, how much context it loads before acting, and how to read its plan before it executes. A task where you already know the right answer is the best learning environment because you can evaluate its output against your own knowledge.
Glossary
| Term | Definition |
|---|---|
| Agent | An LLM that can plan, execute, and iterate on coding tasks autonomously within a defined scope |
| Harness | The tool or environment the agent runs inside (Cursor, Claude Code, Aider, Devin) |
| Rules file | A markdown document (CLAUDE.md, AGENTS.md, .cursor/rules/) the agent reads automatically to learn your project’s conventions |
| AGENTS.md | A universal, tool-agnostic rules file format supported by most major coding agents, stewarded by the Agentic AI Foundation |
| Skill | A reusable, task-scoped instruction set (SKILL.md) for a specific class of work |
| Hook | A deterministic shell script that runs at a specific point in the agent workflow, providing enforcement that rules files cannot guarantee |
| Context compaction | Automatic summarization when the context window approaches its limit, replacing detailed history with a compressed version |
| Token | The basic unit of text the model processes (~4 characters for prose, less efficient for code) |
| Prompt caching | Provider-side storage of repeated context prefixes that reduces cost on sequential requests |
| Reasoning model | A model with an explicit thinking phase (Claude Opus 4.6, DeepSeek R1, o3) |
| Standard model | A model that generates output directly (Claude Sonnet 4.6, GPT 5.3 Codex) |
| Tool use | How agents take real actions (reading files, running commands, searching code) |
| MCP | Model Context Protocol, an open standard (now under the Linux Foundation) for connecting agents to external tools |
| Subagent | A lightweight agent instance spawned to handle a subtask in parallel |
| Agent team | Multiple independent agent instances working together with inter-agent communication and coordinated execution |
| Hallucination | When the agent generates plausible but incorrect code or references nonexistent APIs |
| Context degradation | Gradual decline in output quality as sessions grow long |
| Git worktree | A linked working directory enabling multiple agents to work in parallel on separate branches |
| Scope drift | When an agent exceeds its defined task boundary, modifying files it was not asked to change |