Vol. I · No. 1 · THE FIRST, TOKEN ISSUE · AD MMXXVI

Using Arbor to measure component reuse

Arbor is an AI framework that runs iterative coding experiments, measuring each change against a baseline metric before deciding whether to keep it. Here we set it to work on one of development's oldest debates: when does extracting reusable components actually improve your code?

Experimentation Record (FIG. 2-B)
Hypothesis: Applying an objective metric to incremental component extraction in a Next.js app will reveal whether reuse improves code quality, independent of team convention or subjective judgment.
Model: claude-sonnet-4.6, claude-opus-4.8
Trials: 4
Duration: 2 days
Scored by: Evals - first load, maintainability index
Method: Use the arbor-research-agent to create hypotheses that test small, measurable changes against a baseline. Implement changes from the working tree only when they pass the threshold. Repeated 4 times against two separate metrics; each pass contained 1-4 separate hypotheses.
Outcome: First metric (page load) was a mismatch, maintainability index

Refactor, refactor, refactor. But how do we know the changes we're making actually improve the codebase? Often these changes are made wholesale, so it's hard to know which individual commit moved the needle.

Background

This is where Arbor comes in. Using its "hypothesis tree refinement", Arbor breaks the work down into small experiments, measures their effectiveness against a baseline metric, and then carries its discoveries over to the next experiment.

In short, this lets you iteratively improve against a specified metric while preserving knowledge and insights in the tree. Long-running experiments are prone to drift; Arbor coordinates its agents and is ruthless about terminating experiments that don't lead to success (which I saw for myself).

In reported tests, Arbor outperformed standalone Claude Code and Codex by 2.5x on the same compute budget.

How does it do this? It runs a six-step loop:

  1. Observe. Analyse current results and failure modes.
  2. Ideate. Propose 1–3 new ideas from the analysis and tree insights.
  3. Select. Pick the highest-priority idea to test.
  4. Dispatch. Run an Executor on it in an isolated git worktree.
  5. Back-propagate. Record the result; abstract the insight up to ancestor nodes.
  6. Decide. Continue, merge into trunk, prune, or stop.
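
The six steps can be read as a small search loop over a tree of hypotheses. The TypeScript below is only a sketch of that shape under stated assumptions: `HypothesisNode`, `Idea`, `runLoop`, and the `ideate` and `evaluate` callbacks are hypothetical names invented for illustration and are not Arbor's API, and the Executor in its isolated git worktree is stood in for by a plain function that returns a score.

```typescript
// Hypothetical sketch of an observe / ideate / select / dispatch /
// back-propagate / decide loop. None of these names come from Arbor.

interface HypothesisNode {
  idea: string;
  score: number | null; // metric after the change, null until evaluated
  insights: string[]; // lessons recorded from this node's descendants
  parent: HypothesisNode | null;
  children: HypothesisNode[];
  status: "open" | "merged" | "pruned";
}

interface Idea {
  text: string;
  priority: number;
}

type Ideate = (node: HypothesisNode) => Idea[];
type Evaluate = (idea: string) => number; // stand-in for an Executor run

function runLoop(
  root: HypothesisNode,
  ideate: Ideate,
  evaluate: Evaluate,
  maxCycles: number,
  threshold: number, // minimum gain over the trunk score required to merge
): number {
  let trunkScore = root.score ?? 0;
  let current = root;

  for (let cycle = 0; cycle < maxCycles; cycle++) {
    // 1-2. Observe the current node and propose up to three ideas.
    const ideas = ideate(current).slice(0, 3);
    if (ideas.length === 0) break; // 6. Decide: stop, nothing left to try.

    // 3. Select the highest-priority idea.
    const best = ideas.reduce((a, b) => (b.priority > a.priority ? b : a));

    // 4. Dispatch: evaluate the idea in isolation.
    const score = evaluate(best.text);
    const child: HypothesisNode = {
      idea: best.text,
      score,
      insights: [],
      parent: current,
      children: [],
      status: "open",
    };
    current.children.push(child);

    // 5. Back-propagate: record the outcome on every ancestor.
    const gain = score - trunkScore;
    for (let n: HypothesisNode | null = current; n !== null; n = n.parent) {
      n.insights.push(`${best.text}: ${gain >= 0 ? "+" : ""}${gain.toFixed(2)}`);
    }

    // 6. Decide: merge into trunk if the gain clears the threshold, else prune.
    if (gain >= threshold) {
      child.status = "merged";
      trunkScore = score;
      current = child;
    } else {
      child.status = "pruned";
    }
  }
  return trunkScore;
}
```

The point of the sketch is the bookkeeping: a pruned branch still leaves an insight on its ancestors, so a later `ideate` call can avoid repeating it.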

This probably feels familiar from your own workflows, but what makes Arbor distinct is that

"an agent receives an initial artifact, an objective, a development evaluator, and a held-out evaluator, then improves the artifact through iterative experimentation without step-level supervision."

The result is a process that makes research cumulative — not more attempts, but less repetitive, more memory-aware search.

So you can confidently run longer loops knowing the Arbor agent will be selective about what it implements. You're not going to leave your computer and come back to something unrecognisable from what you originally intended.

Method: The Setup

You can use Arbor with your existing Claude subscription:

claude plugin marketplace add RUC-NLPIR/Arbor
claude plugin install arbor         # installs the skills + registers `arbor mcp`

Then run it inside your coding agent:

arbor-research-agent optimize this repo for <metric>. Ask before training, package installs, or B_test.

Method: Application

Choosing the right problem and metric is worth deliberating over. I wanted to settle the debate over when to reuse components, balancing the benefits of reuse against the perils of abstraction. The general guidance is that you should reuse a component when it appears 3 times in your code. In reality teams either tend to favour highly domain-specific components or a design system that dictates when and how reuse happens. Walking that line gets harder as the team grows.

At first I thought I could measure the First Load metric, but it turns out that extracting code into a reusable component or shared module is byte-neutral, and the real gains come from maintainability.

Fortunately there exists a metric known as the Maintainability Index, a composite metric that quantifies how easy and cost-effective a section of code is to understand, modify, and maintain. It calculates a single score on a rebased scale from 0 to 100.
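
For a sense of what goes into the score, here is a minimal sketch of the widely cited rebased formula, the variant popularised by Visual Studio's code metrics. It is an assumption that the eval used in this experiment computes something similar, since eval-mi.js is not reproduced here; `FileMetrics` and `maintainabilityIndex` are hypothetical names, and the sample inputs are made up.

```typescript
// Rebased Maintainability Index (Visual Studio variant):
//   MI = max(0, (171 - 5.2*ln(V) - 0.23*G - 16.2*ln(LOC)) * 100 / 171)
// V = Halstead volume, G = cyclomatic complexity, LOC = lines of code.

interface FileMetrics {
  halsteadVolume: number;
  cyclomaticComplexity: number;
  linesOfCode: number;
}

function maintainabilityIndex(m: FileMetrics): number {
  const raw =
    171 -
    5.2 * Math.log(m.halsteadVolume) -
    0.23 * m.cyclomaticComplexity -
    16.2 * Math.log(m.linesOfCode);
  // Rebase from the original 0-171 range to 0-100 and clamp at zero.
  return Math.max(0, (raw * 100) / 171);
}

// Illustrative inputs only: a dense 400-line form versus a 20-line utility.
const denseForm = maintainabilityIndex({
  halsteadVolume: 20000,
  cyclomaticComplexity: 40,
  linesOfCode: 400,
});
const tinyUtil = maintainabilityIndex({
  halsteadVolume: 150,
  cyclomaticComplexity: 2,
  linesOfCode: 20,
});
console.log(denseForm.toFixed(1), tinyUtil.toFixed(1));
```

Because both volume and line count enter through a logarithm, a large file scores far lower than a tiny one, which is worth keeping in mind when reading per-file scores.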

Arbor created an eval for me as part of the first step in its process; if you want more control, it's suggested you create this eval yourself.

Observation & Analysis

Arbor produces lots of documentation to help you see the outcomes of its experiments. It produces a summary report at the end of each cycle, provides feedback in the console, and will run a live dashboard either within the console or at a separate URL so you can view the experiments in real time. I can imagine the HTML dashboard on a boardroom screen; execs do love a live-streamed metric.

I did four runs in total with two separate metrics. I used Sonnet 4.6 for the first three runs and changed to Opus 4.8 for the final run. Each of the first three runs had 3-4 cycles (see below for individual hypotheses), with the final run from Opus 4.8 rejecting my original premise.

Using Opus 4.8 as the Executor agent I noticed it front-loaded some questions for me, and suggested a change in the evals that were used. It also stopped cycles sooner, predicting that subsequent cycles would bring no further value.

**Stopped after 1 cycle** (not 3): static analysis + one empirical node gave a
conclusive null result; additional cycles of the same change class would
reproduce ~0 KB. Continuing would burn budget without new information.

Here's the cost breakdown:

  • Run 1 - API calls: 258, duration: 32m, cost: $8.5082
  • Run 2 - API calls: 142, duration: 28m, cost: $5.4759
  • Run 3 - API calls: 161, duration: 1h 14m, cost: $6.8734
  • Run 4 - API calls: 141, duration: 2h 49m, cost: $13.0429
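
To compare the runs on a like-for-like basis, the arithmetic can be sketched as below. The figures are copied from the breakdown above, with durations converted to minutes and the model per run taken from the description of the runs; the `Run` shape and `costPerCall` helper are hypothetical names used only for this illustration.

```typescript
// Figures copied from the cost breakdown; durations converted to minutes.
interface Run {
  name: string;
  model: string;
  apiCalls: number;
  minutes: number;
  costUsd: number;
}

const runs: Run[] = [
  { name: "Run 1", model: "Sonnet 4.6", apiCalls: 258, minutes: 32, costUsd: 8.5082 },
  { name: "Run 2", model: "Sonnet 4.6", apiCalls: 142, minutes: 28, costUsd: 5.4759 },
  { name: "Run 3", model: "Sonnet 4.6", apiCalls: 161, minutes: 74, costUsd: 6.8734 },
  { name: "Run 4", model: "Opus 4.8", apiCalls: 141, minutes: 169, costUsd: 13.0429 },
];

// Average spend per API call for a single run.
function costPerCall(run: Run): number {
  return run.costUsd / run.apiCalls;
}

const totalCost = runs.reduce((sum, r) => sum + r.costUsd, 0);
const totalCalls = runs.reduce((sum, r) => sum + r.apiCalls, 0);

for (const r of runs) {
  console.log(`${r.name} (${r.model}): $${costPerCall(r).toFixed(4)} per call`);
}
console.log(`Total: $${totalCost.toFixed(2)} over ${totalCalls} calls`);
```

On these figures the four runs total about $33.90 across 702 API calls, and the Opus run works out at roughly 9 cents per call against 3 to 4 cents for the Sonnet runs.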

Raw observation log

Run 1

Optimize app for component reuse with shared styles and modules. Metric: First Load JS (KB) from pnpm build output, minimize. 3 cycles.

3 cycles, using 3 mechanisms:

1. Extract slugify and parseTags function bodies from component files
2. Create src/lib/styles.ts with named constants for the five most-repeated Tailwind className strings
3. Extract shared FormField compound component


1. Bundle metric unmoved; correctness and maintainability improved.
2. The bundler deduplicates module-level string constants. Gains from string extraction are modest but measurable.
3. React's compiled JSX produces equivalent createElement calls regardless of component abstraction depth.


Run 2

Maximize Maintainability Index (MI) of src/**/*.{ts,tsx} in the app

3 cycles, using 3 mechanisms:

1. Extract inline sub-components from page.tsx to dedicated components; move utility functions and placeholder constants
2. Extract ArtefactList and TagInput from FieldNoteForm to dedicated components
3. Replace locally-redefined utility fns and locally-redefined components in page.tsx with imports from the shared modules


1. Under canonical AST-based MI, page.tsx improved 6.8→13.7 and 9 new focused files were added to the mid-range. Combined trunk moved 44.96→45.58.
2. Distributing ~105 LOC of dense interactive UI logic (artefact list with upload, tag chip input with keyboard handling) across two focused single-responsibility components raised avg MI by ~27%.
3. Removing ~90 LOC of duplicate utility functions and components from issues/[number]/page.tsx gave MI 16.0→21.9; extracting 3-tool SVG placeholder from page.tsx gave MI 13.7→17.1. Net trunk gain +0.05.

Run 3

2nd pass

Maximize Maintainability Index (MI) of src/**/*.{ts,tsx} in the app

4 cycles, using 4 mechanisms:

1. Extract ExperimentRecord and OutcomeSection fieldsets into their own components; wire both back via controlled props.
2. Extract references section into own component; wire back via references+onChange props.
3. Extract three tiny lib/ modules from FieldNoteForm.tsx into shared utils. Also consolidate hintText and sectionClass style tokens.
4. Create /lib/parse.ts and /lib/file-utils.ts with a generic parseJson utility and formatSize and fileExt helpers

1. FieldNoteForm improved (MI 12.0→17.9); however, ExperimentSection (MI 34.2) and OutcomeSection (MI 38.5) score below the trunk average (45.63).
2. PostForm MI improved 10.8→15.0 and lost ~89 LOC, but the new ReferencesManager.tsx scored only 29.1 MI, below the current trunk average, so adding it to the file pool pulled the overall average down by 0.16 from trunk.
3. Tiny lib files scored well above the trunk average of 45.63. Adding 3 such files while removing inline constants raised the overall avg by +0.34. Confirms the pivot: tiny constant/utility extractions clear the math threshold that medium-component extractions cannot.
4. Two tiny files added high-MI entries to the pool and removed 5 local definitions; 16.3→18.4 MI.
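
The "math threshold" in these findings follows from averaging per-file scores. The sketch below assumes the eval reports an unweighted mean of per-file MI, which is how the file-pool wording reads but is not confirmed by the reports; the function names are hypothetical, and the file count of 76 is an assumed value chosen for illustration, not a figure from the experiment.

```typescript
// Mean of per-file MI after extracting one new file out of an existing file.
// Assumes an unweighted average over all files in the pool.
function averageAfterExtraction(
  trunkAverage: number,
  fileCount: number,
  sourceBefore: number, // MI of the source file before extraction
  sourceAfter: number, // MI of the source file after extraction
  extractedMi: number, // MI of the newly created file
): number {
  const total = trunkAverage * fileCount;
  return (total - sourceBefore + sourceAfter + extractedMi) / (fileCount + 1);
}

// Rearranging the inequality newAverage > trunkAverage gives the break-even
// rule: the source file's gain plus the new file's MI must exceed the average.
function raisesAverage(
  trunkAverage: number,
  sourceBefore: number,
  sourceAfter: number,
  extractedMi: number,
): boolean {
  return sourceAfter - sourceBefore + extractedMi > trunkAverage;
}

// Run 3, cycle 2: PostForm 10.8 -> 15.0, new ReferencesManager.tsx at 29.1,
// trunk average 45.63. The file count (76) is a hypothetical assumption.
const assumedFileCount = 76;
const newAverage = averageAfterExtraction(45.63, assumedFileCount, 10.8, 15.0, 29.1);
console.log((newAverage - 45.63).toFixed(2));
console.log(raisesAverage(45.63, 10.8, 15.0, 29.1));
```

Under this reading, a medium component scoring 29.1 cannot lift a 45.63 average unless its source file gains more than 16.5 points, whereas a tiny utility file that already scores above the average lifts it on its own.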

Run 4

2nd pass, more capable model

Minimize total First Load client JS (KB) for the ren-ai Next.js 16 blog via component reuse, shared styles, and shared modules.

1 cycle

Extract a shared client sub-component for the reference/artefact list editing UI that is structurally duplicated between PostForm.tsx (references) and FieldNoteForm.tsx (artefacts)

Extracting code into a reusable component/shared module is byte-neutral for total client JS (+0.12 KB module-boundary overhead). Turbopack includes the same code whether inline or imported, and hoists shared imports into common chunks automatically.

Scoring Matrix

The baseline Maintainability Index (MI) before starting experimentation was 44.96 and it finished at 46.60, a gain of 1.64 points (about 3.6%). The index was recorded for each separate hypothesis, and changes were fed back into the trunk where the MI increased.

Conclusions

Returning to our original hypothesis: the clearest gains come from small utils, and medium-sized components are best handled at a per-domain level.

Beyond this experiment, Arbor has obvious value as an ongoing check against the bloat that quietly accumulates.

Imagine what it might uncover when turned on testing pipelines, where flakiness can be difficult to diagnose. Or when asked to determine which doc structure best improves the searchability of your codebase for AI agents.

Whether or not you believe the Maintainability Index is a good measure of code quality, Arbor has shown the value of taking metrics out of the static dashboard and embedding them into a workflow to inform the architecture and structure of our codebases in an active and ongoing way.

Outcome Record (FIG. 2-A)
Status: Success
Date closed: 28 June 2026
Runs: 4
Artefacts (FIG. 2-C)
  • MD
    REPORT01.md

    Arbor Session Report — component-reuse-001

    2.1 KB
  • MD
    REPORT02.md

    Arbor Session Report — component-reuse-002

    4.0 KB
  • MD
    REPORT03.md

    Arbor Session Report — component-reuse-003

    3.7 KB
  • MD
    REPORT04.md

    Arbor Session Report — component-reuse-004

    4.4 KB
  • JS
    eval-mi.js

    Maintainability Index eval

    3.7 KB

References

Citations follow Chicago Manual of Style — Notes & Bibliography (web) format.

  1. Dou, Zhicheng. “Arbor.” RUC-NLPIR Lab. https://ruc-nlpir.github.io/Arbor/.
  2. “Arbor.” GitHub repository. June 29, 2026. https://github.com/RUC-NLPIR/Arbor.
  3. Gilboy, Tim. “Maintainability Index - What is it and where does it fall short?” Sourcery. March 7, 2022. https://www.sourcery.ai/blog/maintainability-index.
  4. Dickson, Ben. “New AI optimization framework beats Claude Code and Codex by 2.5x on the same compute budget.” VentureBeat. June 18, 2026. https://venturebeat.com/orchestration/new-ai-optimization-framework-beats-claude-code-and-codex-by-2-5x-on-the-same-compute-budget.
