Testing FCLI as a Real Agent Kernel
Today I pressure-tested FCLI, a local-first coding agent runtime I have been building.
The goal was not to see whether a model can write a cute poem or generate a tiny Python script. That part is already easy.
The real goal was to test the software underneath the model:
- planner loop
- typed capabilities
- file writes
- shell execution
- verification
- git inspection
- execution traces
- recovery behavior
In other words:
Can this thing behave like the bones of a real worker?
Not a chatbot. Not a demo wrapper. A small agent kernel that can plan, act, observe, recover, and leave behind enough evidence to understand what happened.
What Changed in the Agent Loop
The most important architectural direction in FCLI right now is the replan loop.
Instead of treating every failure as terminal, FCLI now runs a bounded loop:
plan -> approve -> execute -> observe -> replan
That matters because real agent work is rarely one perfect plan followed by one perfect execution.
A file might be missing.
A command might fail.
A tool might return output that changes the next step.
The agent might realize it needs more information.
The runtime needs to observe what happened and then decide whether to continue, stop, ask a question, or report a real failure.
One important improvement is the question path.
If the agent does not have enough information, it can produce a QuestionAction.
In interactive mode, that question can be answered and fed into the next planning iteration.
In non-interactive mode, the run stops as:
AWAITING_USER_INPUT
instead of pretending it succeeded or crashing with a vague error.
That is a big difference between a brittle script runner and an actual agent runtime.
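To make the loop concrete, here is a minimal sketch of a bounded plan -> approve -> execute -> observe -> replan cycle with a question path. It is not FCLI's actual code: Step, Status, run_loop, and the callback signatures are hypothetical names invented for illustration, and the real runtime may structure this differently.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Status(Enum):
    COMPLETED = "completed"
    FAILED = "failed"
    AWAITING_USER_INPUT = "awaiting_user_input"


@dataclass
class Step:
    """One planner decision (hypothetical shape)."""
    kind: str          # "question", "action", or "done"
    payload: str = ""


def run_loop(
    plan: Callable[[list], Step],
    execute: Callable[[str], str],
    approve: Callable[[str], bool] = lambda action: True,
    ask_user: Optional[Callable[[str], str]] = None,
    max_iterations: int = 5,
) -> Status:
    observations = []
    for _ in range(max_iterations):
        step = plan(observations)          # plan (or replan) from what was observed
        if step.kind == "done":
            return Status.COMPLETED
        if step.kind == "question":
            if ask_user is None:
                # Non-interactive mode: stop with an explicit status.
                return Status.AWAITING_USER_INPUT
            observations.append("answer: " + ask_user(step.payload))
            continue
        if not approve(step.payload):
            observations.append("blocked: " + step.payload)
            continue
        try:
            observations.append("ok: " + execute(step.payload))
        except Exception as exc:
            # A failed action becomes an observation, not a crash.
            observations.append(f"failed: {exc}")
    return Status.FAILED                   # iteration budget exhausted
```

The key design choice in this sketch is that failures and answers both flow back into the planner as observations, and the iteration cap guarantees the loop terminates.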
Test 1: Simple File Creation
The first test was intentionally small.
I asked FCLI to create a file named oceasn with three ocean-themed haikus.
It succeeded.
The agent:
- created the requested file
- changed only the requested file
- did not stage anything
- did not commit anything
- saved a trace
This proved the basic path:
natural language request -> plan -> typed file write -> final response
Small test, but important.
Before testing harder behavior, the simple path has to work cleanly.
Test 2: Harder Constrained File Generation
Then I made the prompt harder.
I asked FCLI to create five ocean-themed haikus with strict constraints:
- exactly five haikus
- three non-empty lines each
- one blank line between haikus
- first lines spelling OCEAN
- standalone word tide exactly twice
- banned words excluded
This run was more interesting.
I had deleted the file before the test, so the agent first tried to read the existing file and failed.
Instead of collapsing, it moved into a clarification path and asked whether it should create the file.
After I answered, it recovered and produced a valid final file.
The trace showed:
status: completed
executed actions: 3
failed actions: 1
iterations: 5
That is not “perfect.”
But honestly, this is exactly the kind of behavior I want an agent runtime to expose.
The failure did not disappear.
The failed read was preserved in the trace.
The agent recovered.
The final artifact passed validation.
That is much more useful than a system that silently hides its mistakes.
Test 3: Capability-Level Python Execution
Next, I wanted to separate model quality from software capability.
So I tested FCLI’s direct command capability with foundation run.
The run created a small Python script, executed it, wrote an output file, and produced the expected SHA value.
That confirmed the lower-level capability layer was working.
The test proved:
- shell execution worked
- file outputs were created
- history was recorded
- rerunning the generated script worked
This was an important distinction.
The capability layer was not the weak point.
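For illustration, the same kind of check can be reproduced outside FCLI with the standard library alone. This is only a sketch of the idea, not the foundation run command: the script contents and file names are made up, and it assumes SHA-256, since the exact hash is not stated here.

```python
import hashlib
import subprocess
import sys
import tempfile
from pathlib import Path


def run_and_hash(workdir: Path) -> str:
    """Write a tiny script, run it, and return the SHA-256 of its output file."""
    script = workdir / "make_output.py"
    output = workdir / "output.txt"
    script.write_text(
        "from pathlib import Path\n"
        "Path('output.txt').write_text('hello from the capability layer\\n')\n"
    )
    # check=True turns a non-zero exit code into an exception.
    subprocess.run([sys.executable, script.name], cwd=workdir, check=True)
    return hashlib.sha256(output.read_bytes()).hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    first = run_and_hash(Path(tmp))
    second = run_and_hash(Path(tmp))  # rerunning the script should be stable
    print(first == second)
```

Comparing digests across reruns is a cheap way to confirm that the generated script is deterministic.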
Test 4: Weak Provider Failure
Then I tried a harder natural-language task through the previous provider setup.
It failed twice, both times before any tool action ran.
The failure was:
Provider returned an empty chat response
This was useful signal.
It showed that the underlying file and shell capability layer could work, but the planner/provider path was brittle.
The model did not reliably produce a usable structured plan for the harder task.
This is one of the most important distinctions when evaluating an agent system:
Did the tool fail?
Did the planner fail?
Did the model fail?
Did the runtime recover?
In this case, the planner path failed before tools were even invoked.
That means the problem was not file writing.
It was not shell execution.
It was not git inspection.
The failure happened earlier, at the provider/planner boundary.
Test 5: Stronger Provider With GPT-5.5
After adding the Codex provider path, I tested with:
--provider codex --model gpt-5.5
This was the cleanest run of the day.
The hard smoke test asked FCLI to:
- create a new scoped directory
- write a CSV file
- write a Python analyzer
- write an expected JSON file
- run the analyzer
- verify result.json against expected.json
- verify a Markdown report line exactly
- inspect git status
- avoid staging or committing anything
The result:
HARD_SMOKE_OK
The trace summary was also clean:
status: completed
executed actions: 6
failed actions: 0
iterations: 2
capabilities used:
- foundation.file.write
- foundation.shell.command
- foundation.git
I also reran local verification outside the agent and got:
LOCAL_VERIFY_OK
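A local verification step of this kind can be very small. The sketch below is an assumption about its shape, not the script I ran: verify, the report file name, and the LOCAL_VERIFY_FAILED strings are hypothetical, while result.json, expected.json, and LOCAL_VERIFY_OK come from the test itself.

```python
import json
from pathlib import Path


def verify(run_dir: Path, report_name: str, required_line: str) -> str:
    """Compare result.json to expected.json, then check one exact report line."""
    result = json.loads((run_dir / "result.json").read_text())
    expected = json.loads((run_dir / "expected.json").read_text())
    # Comparing parsed JSON ignores key order and whitespace differences.
    if result != expected:
        return "LOCAL_VERIFY_FAILED: result.json differs from expected.json"
    report_lines = (run_dir / report_name).read_text().splitlines()
    # Exact match on a whole line, not a substring search.
    if required_line not in report_lines:
        return "LOCAL_VERIFY_FAILED: report line missing"
    return "LOCAL_VERIFY_OK"
```

Running this outside the agent matters because it does not depend on the model's own claim that the task succeeded.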
That was the clearest evidence from the day.
The stronger model did not just produce better text. It improved the planner path.
The capabilities were already there.
The stronger model selected and sequenced them better.
What Performed Better
With GPT-5.5, FCLI did not just “sound smarter.”
It behaved better as an agent planner.
The improvement was visible in the trace:
- valid structured plan on the first planning pass
- correct capability selection
- no empty provider response
- no failed actions
- deterministic verification passed
- final git status inspection was scoped correctly
- nothing was staged or committed
This moved the system from:
interesting but brittle
to:
usable for scoped local coding-agent tasks
That is a meaningful step.
What Still Needs Work
The tests also exposed real gaps.
Provider Configuration Needs Cleaner Boundaries
Switching provider/model can leave provider-specific settings behind, such as base URL or credential source.
That can accidentally mix an OpenAI provider name with Ollama credentials or endpoint config.
Provider configuration needs sharper isolation so each provider has a clean and predictable boundary.
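One possible shape for that isolation is a complete, immutable profile per provider, so switching replaces the whole config instead of patching individual fields. This is a sketch, not FCLI's current config code. ProviderConfig, PROFILES, and select_provider are hypothetical, and every value in the profiles is a placeholder.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class ProviderConfig:
    name: str
    model: str
    base_url: str
    credential_source: str


# Placeholder values for illustration only.
PROFILES = {
    "openai": ProviderConfig(
        "openai", "model-a", "https://openai.example.invalid/v1", "env:OPENAI_API_KEY"
    ),
    "ollama": ProviderConfig(
        "ollama", "model-b", "http://localhost:11434", "none"
    ),
}


def select_provider(name: str, model: Optional[str] = None) -> ProviderConfig:
    """Return a full config for `name`; nothing carries over from the last provider."""
    if name not in PROFILES:
        raise KeyError(f"unknown provider: {name}")
    profile = PROFILES[name]
    # Only the model may be overridden; endpoint and credentials stay paired.
    return replace(profile, model=model) if model else profile
```

Because the dataclass is frozen, a stale base URL or credential source cannot be mutated into another provider's profile by accident.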
FCLI Needs Repeated Evals
One clean hard smoke test is encouraging, but reliability does not come from one good run.
Reliability comes from repeated deterministic tasks and measured success rates.
The next stage should include a proper eval suite with many scoped tasks and clear expected outputs.
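The core of such a suite is just repetition and counting. Here is a minimal sketch, assuming each task is a callable that returns True on success. success_rate and run_suite are hypothetical names, not existing FCLI functions.

```python
from typing import Callable


def success_rate(task: Callable[[], bool], runs: int) -> float:
    """Run one task repeatedly and return the fraction of passing runs."""
    passed = 0
    for _ in range(runs):
        try:
            if task():
                passed += 1
        except Exception:
            pass  # a crash counts as a failed run
    return passed / runs


def run_suite(tasks: dict, runs: int = 10) -> dict:
    """Map each task name to its measured success rate."""
    return {name: success_rate(task, runs) for name, task in tasks.items()}
```

A per-task rate is more useful than a single pass/fail because it separates tasks that always work from tasks that only sometimes work.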
Failure Modes Should Become More Structured
The runtime should keep making failure states more machine-readable.
Examples:
- provider empty responses
- unsupported model errors
- missing files
- policy blocks
- awaiting-user-input states
- verification failures
These should not just be strings in logs.
They should become structured outcomes that a supervisor, UI, or future worker runtime can consume.
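As a sketch of what a structured outcome could look like, the failure kinds listed above map naturally onto an enum plus a small record. The names OutcomeKind and Outcome, and the stage and retryable fields, are assumptions for illustration, not FCLI's existing types.

```python
import json
from dataclasses import dataclass
from enum import Enum


class OutcomeKind(Enum):
    PROVIDER_EMPTY_RESPONSE = "provider_empty_response"
    UNSUPPORTED_MODEL = "unsupported_model"
    MISSING_FILE = "missing_file"
    POLICY_BLOCK = "policy_block"
    AWAITING_USER_INPUT = "awaiting_user_input"
    VERIFICATION_FAILURE = "verification_failure"


@dataclass
class Outcome:
    kind: OutcomeKind
    stage: str        # hypothetical: "provider", "planner", "tool", "verification"
    message: str      # human-readable detail, kept for logs
    retryable: bool   # hypothetical hint for a supervisor

    def to_json(self) -> str:
        return json.dumps(
            {
                "kind": self.kind.value,
                "stage": self.stage,
                "message": self.message,
                "retryable": self.retryable,
            },
            sort_keys=True,
        )
```

With a record like this, the empty provider response from Test 4 would carry a kind and a stage, so a supervisor could branch on it without matching log strings.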
The Result Contract Should Be Smaller
The trace is rich, but if FCLI becomes a worker kernel, another system should not need to parse the entire trace to understand what happened.
A cleaner result contract should include:
- status
- changed files
- commands executed
- verification result
- artifacts created
- failed actions
- risk flags
- final git state
The trace can remain detailed, but the summary should be small and predictable.
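A small contract along these lines could be a single dataclass derived from the trace. The sketch below is a guess at the shape, not an existing FCLI type. RunResult, summarize, and the event dictionary format are hypothetical; only the capability names come from the trace output shown earlier.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class RunResult:
    status: str
    changed_files: list = field(default_factory=list)
    commands_executed: list = field(default_factory=list)
    verification_passed: Optional[bool] = None
    artifacts_created: list = field(default_factory=list)
    failed_actions: int = 0
    risk_flags: list = field(default_factory=list)
    final_git_state: str = ""


def summarize(status: str, trace: list) -> RunResult:
    """Reduce a detailed trace (hypothetical event dicts) to the small contract."""
    result = RunResult(status=status)
    for event in trace:
        if not event.get("ok", True):
            result.failed_actions += 1
            continue  # failed actions change nothing, so skip the rest
        if event["capability"] == "foundation.file.write":
            result.changed_files.append(event["target"])
        elif event["capability"] == "foundation.shell.command":
            result.commands_executed.append(event["target"])
    return result
```

A consumer can then read json.dumps(asdict(result)) and never touch the full trace unless it needs to debug.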
The Main Takeaway
The most important observation from today is that FCLI is no longer just a command wrapper.
It now has the core shape of an agent kernel:
- bounded planning loop
- typed capabilities
- approval and policy checks
- execution traces
- replan behavior
- question/awaiting-input path
- deterministic verification
- structured history
The weaker model exposed brittleness in planning.
The stronger model showed that the runtime can do real work when the planner produces good actions.
That is a useful place to be.
FCLI is not production-ready.
It is not fully reliable yet.
But it is definitely past the toy demo phase.
The next step is not adding more magic.
The next step is making the loop boring:
- more evals
- cleaner provider config
- better structured failure results
- repeatable worker-style task contracts
That is how FCLI becomes a dependable local agent runtime.