Using Arbor to measure component reuse
Arbor is an AI framework that runs iterative coding experiments, measuring each change against a baseline metric before deciding whether to keep it — here we set it to work on one of development's oldest debates: when does extracting reusable components actually improve your code?
Refactor, refactor, refactor. But how do we know the changes we're making actually improve the codebase? Often these changes are made wholesale, so it's hard to know which individual commit may have moved the needle.
Background
This is where Arbor comes in. Using its "hypothesis tree refinement", Arbor breaks work down into small experiments, measures each one against a baseline metric, and then carries what it learns into the next experiment.
In short, this lets you iteratively improve against a specified metric while preserving knowledge and insights in the tree. Long-running experiments are prone to drift; Arbor coordinates its agents and is ruthless about terminating experiments that don't lead to success (which I saw for myself).
In reported tests, Arbor outperformed Claude Code and Codex used on their own by 2.5x on the same compute budget.
How does it do this? It runs a six-step loop:
- Observe. Analyse current results and failure modes.
- Ideate. Propose 1–3 new ideas from the analysis and tree insights.
- Select. Pick the highest-priority idea to test.
- Dispatch. Run an Executor on it in an isolated git worktree.
- Back-propagate. Record the result; abstract the insight up to ancestor nodes.
- Decide. Continue / merge into trunk / prune / stop.
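The six steps can be sketched as a loop over a tree of hypotheses. This is a minimal illustration and not Arbor's implementation: the `HypothesisNode` type, the `runLoop` function, and the injected `ideate` and `execute` callbacks are all hypothetical names, and a real Executor would work in an isolated git worktree rather than as an in-process function. It also assumes a metric where higher is better.

```typescript
// Hypothetical sketch of an observe / ideate / select / dispatch /
// back-propagate / decide loop. None of these names come from Arbor.

interface HypothesisNode {
  idea: string;
  priority: number;              // higher = test sooner
  score: number | null;          // metric after the experiment, null until run
  insights: string[];            // lessons abstracted up from descendants
  parent: HypothesisNode | null;
  status: "pending" | "merged" | "pruned";
}

interface Idea { idea: string; priority: number }

type Ideate = (bestScore: number, insights: string[]) => Idea[];
type Execute = (idea: string) => { score: number; insight: string };

function runLoop(
  baseline: number,
  ideate: Ideate,
  execute: Execute,
  maxCycles: number,
): { best: number; root: HypothesisNode } {
  const root: HypothesisNode = {
    idea: "baseline", priority: 0, score: baseline,
    insights: [], parent: null, status: "merged",
  };
  let trunk = root;              // node whose changes are currently merged
  let best = baseline;

  for (let cycle = 0; cycle < maxCycles; cycle++) {
    // Observe + Ideate: propose up to three ideas from results and insights.
    const ideas = ideate(best, root.insights).slice(0, 3);
    if (ideas.length === 0) break;   // Decide: stop, nothing left to try

    // Select: take the highest-priority idea.
    const pick = ideas.reduce((a, b) => (b.priority > a.priority ? b : a));
    const node: HypothesisNode = {
      ...pick, score: null, insights: [], parent: trunk, status: "pending",
    };

    // Dispatch: run the experiment (in isolation, in a real system).
    const result = execute(node.idea);
    node.score = result.score;

    // Back-propagate: record the insight on every ancestor up to the root.
    for (let n: HypothesisNode | null = node.parent; n !== null; n = n.parent) {
      n.insights.push(result.insight);
    }

    // Decide: merge improvements into the trunk, prune everything else.
    if (result.score > best) {
      node.status = "merged";
      best = result.score;
      trunk = node;
    } else {
      node.status = "pruned";
    }
  }
  return { best, root };
}
```

The point of the sketch is the memory: a pruned experiment still leaves an insight on its ancestors, so the next round of ideation can avoid repeating it.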
This probably feels familiar from your own workflows, but what makes Arbor distinct is that an agent receives an initial artifact, an objective, a development evaluator, and a held-out evaluator, then improves the artifact through iterative experimentation without step-level supervision.
The result is a process that makes research cumulative — not more attempts, but less repetitive, more memory-aware search.
So you can confidently run longer loops knowing the Arbor agent will be selective about what it implements. You're not going to leave your computer and come back to something unrecognisable from what you originally intended.
Method: The Setup
You can use Arbor with your existing Claude subscription:
claude plugin marketplace add RUC-NLPIR/Arbor
claude plugin install arbor # installs the skills + registers `arbor mcp`
Then run it inside your coding agent:
arbor-research-agent optimize this repo for <metric>. Ask before training, package installs, or B_test.
Method: Application
Choosing the right problem and metric is worth deliberating over. I wanted to settle the debate over when to reuse components, balancing the benefits of reuse against the perils of abstraction. The general guidance is that you should reuse a component once it appears three times in your code. In reality, teams tend to favour either highly domain-specific components or a design system that dictates when and how reuse happens. Walking that line gets harder as the team grows.
At first I thought I could measure the First Load JS metric, but it turns out that extracting code into a reusable component or shared module is byte-neutral, and the real gains come from maintainability.
Fortunately there exists a metric known as the Maintainability Index, a composite metric that quantifies how easy and cost-effective a section of code is to understand, modify, and maintain. It calculates a single score on a rebased scale from 0 to 100.
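For reference, here is a minimal sketch of the commonly cited formula, in the variant that rebases the original 0 to 171 range onto 0 to 100. Whether the eval Arbor generated uses exactly these coefficients is an assumption on my part; the function name and parameter names are illustrative, and the three inputs would come from a separate static-analysis step.

```typescript
// Maintainability Index, rebased to 0-100.
// Assumption: the widely published coefficients; a given eval may differ.
function maintainabilityIndex(
  halsteadVolume: number,        // V: Halstead volume of the code
  cyclomaticComplexity: number,  // G: number of independent paths
  linesOfCode: number,           // LOC: source lines
): number {
  const raw =
    171 -
    5.2 * Math.log(halsteadVolume) -
    0.23 * cyclomaticComplexity -
    16.2 * Math.log(linesOfCode);
  // Rebase from the 0-171 range and clamp to [0, 100].
  return Math.min(100, Math.max(0, (raw * 100) / 171));
}

// Example: a mid-sized file with moderate branching.
console.log(maintainabilityIndex(1000, 5, 50).toFixed(2));
```

Both size terms are logarithmic, so halving a long file moves the score far less than you might expect. That is worth keeping in mind when reading the small deltas later on.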
Arbor created an eval for me as part of the first step in its process; if you want more control, it's suggested you create this eval yourself.
Observation & Analysis
Arbor produces plenty of documentation to help you see the outcomes of its experiments. It writes a summary report at the end of each cycle, provides feedback in the console, and runs a live dashboard, either within the console or at a separate URL, so you can view the experiments in real time. I can imagine the HTML dashboard on a boardroom screen; execs do love a live-streamed metric.
I did four runs in total across two metrics. I used Sonnet 4.6 for the first three runs and switched to Opus 4.8 for the final run. Each of the first three runs had 3-4 cycles (see below for individual hypotheses), while the final run with Opus 4.8 rejected my original premise.
With Opus 4.8 as the Executor agent, I noticed it front-loaded some questions for me and suggested a change to the evals that were used. It also stopped cycles sooner, predicting that subsequent cycles would bring no further value:
**Stopped after 1 cycle** (not 3): static analysis + one empirical node gave a
conclusive null result; additional cycles of the same change class would
reproduce ~0 KB. Continuing would burn budget without new information.
Here's the cost breakdown:
- Run 1 - API calls: 258, duration: 32m, cost: $8.5082
- Run 2 - API calls: 142, duration: 28m, cost: $5.4759
- Run 3 - API calls: 161, duration: 1h 14m, cost: $6.8734
- Run 4 - API calls: 141, duration: 2h 49m, cost: $13.0429
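To compare the runs on an equal footing, the figures above can be totalled and turned into a cost per API call. This is only arithmetic on the numbers already listed; the `Run` type and its field names are illustrative.

```typescript
interface Run { calls: number; minutes: number; costUsd: number }

// Figures copied from the cost breakdown above (durations in minutes).
const runs: Run[] = [
  { calls: 258, minutes: 32, costUsd: 8.5082 },
  { calls: 142, minutes: 28, costUsd: 5.4759 },
  { calls: 161, minutes: 74, costUsd: 6.8734 },
  { calls: 141, minutes: 169, costUsd: 13.0429 },
];

const totalCalls = runs.reduce((sum, r) => sum + r.calls, 0);
const totalMinutes = runs.reduce((sum, r) => sum + r.minutes, 0);
const totalCost = runs.reduce((sum, r) => sum + r.costUsd, 0);

console.log(`calls: ${totalCalls}`);                                          // calls: 702
console.log(`time: ${Math.floor(totalMinutes / 60)}h ${totalMinutes % 60}m`); // time: 5h 3m
console.log(`cost: $${totalCost.toFixed(2)}`);                                // cost: $33.90

// Cost per API call for each run.
runs.forEach((r, i) =>
  console.log(`run ${i + 1}: $${(r.costUsd / r.calls).toFixed(3)} per call`));
```

Per call, the three Sonnet runs land at roughly 3 to 4 cents, while the Opus run is closer to 9 cents despite making the fewest calls.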
Raw observation log

| Run and objective | Cycles and mechanisms | Result |
|---|---|---|
| Run 1: Optimize app for component reuse with shared styles and modules. Metric: First Load JS (KB) from `pnpm build` output, minimize. 3 cycles. | 3 cycles, using 3 mechanisms: 1. Extract `slugify` and `parseTags` function bodies from component files | 1. Bundle metric unmoved; correctness and maintainability improved. |
| Run 2: Maximize Maintainability Index (MI) of `src/**/*.{ts,tsx}` in the app | 3 cycles, using 3 mechanisms: | 1. Under canonical AST-based MI, page.tsx improved 6.8→13.7 and 9 new focused files added to mid-range. Combined trunk moved 44.96→45.58. |
| Run 3 (2nd pass): Maximize Maintainability Index (MI) of `src/**/*.{ts,tsx}` in the app | 4 cycles, using 4 mechanisms: 1. Extract ExperimentRecord and OutcomeSection fieldsets into their own components; wire both back via controlled props. | 1. FieldNoteForm improved (MI 12.0→17.9); however, ExperimentSection (MI 34.2) and OutcomeSection (MI 38.5) score below the trunk average (45.63). |
| Run 4 (2nd pass, more capable model): Minimize total First Load client JS (KB) for the ren-ai Next.js 16 blog via component reuse, shared styles, and shared modules. | 1 cycle: Extract a shared client sub-component for the reference/artefact list editing UI that is structurally duplicated between | Extracting code into a reusable component/shared module is byte-neutral for total client JS (+0.12 KB module-boundary overhead). Turbopack includes the same code whether inline or imported, and hoists shared imports into common chunks automatically. |
Scoring Matrix
The baseline Maintainability Index (MI) was 44.96 before experimentation started and 46.60 at the end, a gain of 1.64 points (about 3.6%). The index was recorded for each separate hypothesis, and changes were fed back into the trunk where the MI increased.
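One subtlety behind these numbers: if the trunk score is an average across files, an extraction can improve the file being refactored and still pull the average down when the new files score below it. The sketch below uses the Run 3 figures (FieldNoteForm 12.0 to 17.9, new components at 34.2 and 38.5, trunk average 45.63). The unweighted mean and the file count of 20 are assumptions made purely for illustration; the real eval may aggregate differently.

```typescript
// Mean MI across files. An unweighted mean is an assumption about how the
// trunk score is aggregated; the real eval may weight files differently.
function meanMi(scores: number[]): number {
  return scores.reduce((sum, s) => sum + s, 0) / scores.length;
}

// Replace one file's score and append the scores of newly extracted files.
function afterExtraction(
  scores: number[], index: number, newScore: number, extracted: number[],
): number[] {
  const next = [...scores];
  next[index] = newScore;
  return [...next, ...extracted];
}

// Hypothetical trunk of 20 files. File 0 is FieldNoteForm at 12.0; the other
// 19 are set so the mean equals the reported trunk average of 45.63.
const fileCount = 20;
const otherScore = (45.63 * fileCount - 12.0) / (fileCount - 1);
const scoresBefore = [12.0, ...Array.from({ length: fileCount - 1 }, () => otherScore)];

// Run 3 figures: FieldNoteForm 12.0 -> 17.9, plus two new components.
const scoresAfter = afterExtraction(scoresBefore, 0, 17.9, [34.2, 38.5]);

console.log(meanMi(scoresBefore).toFixed(2)); // 45.63
console.log(meanMi(scoresAfter).toFixed(2));  // 45.05
```

Under these assumptions the refactored file gains almost six points while the trunk average slips by about half a point, which is the kind of result that would argue for pruning a hypothesis instead of merging it.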
Conclusions
Returning to our original hypothesis: the clearest gains come from small utils, and medium-sized components are best handled at a per-domain level.
Beyond this experiment, Arbor has obvious value as an ongoing check against the bloat that quietly accumulates.
Imagine what it might uncover when applied to testing pipelines, where flakiness can be difficult to diagnose, or when used to determine which doc structure best improves the searchability of your codebase for AI agents.
Whether or not you believe the Maintainability Index is a good measure of code quality, Arbor has shown the value of taking metrics out of the static dashboard and embedding them into a workflow to inform the architecture and structure of our codebases in an active and ongoing way.
- REPORT01.md (2.1 KB): Arbor Session Report, component-reuse-001
- REPORT02.md (4.0 KB): Arbor Session Report, component-reuse-002
- REPORT03.md (3.7 KB): Arbor Session Report, component-reuse-003
- REPORT04.md (4.4 KB): Arbor Session Report, component-reuse-004
- eval-mi.js (3.7 KB): Maintainability Index eval
References
Citations follow Chicago Manual of Style — Notes & Bibliography (web) format.
- Dou, Zhicheng. “Arbor.” RUC-NLPIR Lab. https://ruc-nlpir.github.io/Arbor/.
- “Arbor.” GitHub repository. June 29, 2026. https://github.com/RUC-NLPIR/Arbor.
- Gilboy, Tim. “Maintainability Index - What is it and where does it fall short?” Sourcery. March 7, 2022. https://www.sourcery.ai/blog/maintainability-index.
- Dickson, Ben. “New AI optimization framework beats Claude Code and Codex by 2.5x on the same compute budget.” VentureBeat. June 18, 2026. https://venturebeat.com/orchestration/new-ai-optimization-framework-beats-claude-code-and-codex-by-2-5x-on-the-same-compute-budget.