Using the Autoresearch Project to Build the Fastest Java Decompiler
Most people first see Andrej's Autoresearch project as an ML autotuning setup: an agent edits one file (`train.py`), runs short experiments, and keeps only measurable improvements. Under the hood, though, the real value is not "LLM training." The real value is the architecture: a closed-loop research system with explicit goals, constrained change scope, objective evaluation, and hard keep/revert rules.
That pattern transfers cleanly to systems work, including decompiler optimization, which is the pivotal part of Jar.Tools. I called my decompilation engine IPND, and I wanted it to be the fastest way to decompile a Java class into readable Java source code.
The Core Architecture (Domain-Agnostic)
At a high level, this project separates policy from execution:
- Policy lives in `program.md` (what to optimize, what constraints matter, what constitutes a win).
- Execution lives in code and tooling (`train.py`, test harnesses, benchmarks, profiling tools).
- Decisions are recorded as structured experiment outcomes (baseline vs current).
Three design choices make this robust across domains:
- Fixed evaluation protocol: same benchmark shape each iteration, so comparisons stay valid.
- Explicit baseline: every candidate is judged relative to a known reference, not gut feeling.
- Tight loop latency: faster iteration means more hypotheses tested per hour.
In ML, the metric is validation bits-per-byte. In decompiler work, the metric can be latency, memory, correctness parity, or all three.
Mapping the Architecture to the Decompiler Project
For the decompiler, I used the same loop but swapped in system-level targets:
- Performance targets: class and jar decompilation latency.
- Memory targets: RSS/HWM behavior under realistic workloads.
- Correctness guardrails: API tests, decompiler tests, parity checks.
- Output contract: stable API behavior and artifact format.
The implementation already has natural seams for this:
- `crates/core`: parsing, decompilation, emit logic.
- `crates/api`: HTTP contract, job orchestration, artifact generation.
- `crates/cli` and `crates/ffi`: integration surfaces.
This is exactly what makes the architecture reusable: once a system has deterministic entry points and measurable outputs, it can be optimized by the same research loop regardless of domain.
How I Used It for Decompiler Logic Improvements
The practical cycle looked like this:
- Establish baseline with fixed corpus and repeat count.
- Profile CPU and memory hotspots (`perf`, `heaptrack`, runtime summaries).
- Hypothesize a change (for example: zip writer mode, decompile path behavior, branch coverage for edge cases).
- Patch and validate with tests and coverage gates.
- Re-benchmark and compare against baseline.
- Keep only measurable wins.
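The "keep only measurable wins" step can be sketched as a small gate function. This is a minimal illustration, not the project's actual rule; the noise margin and the function name are assumptions for the example:

```rust
// Illustrative keep/revert rule: keep a candidate only when it beats the
// baseline by more than a noise margin AND all guardrail checks passed.
fn keep_candidate(baseline_mean_ms: f64, candidate_mean_ms: f64, tests_passed: bool) -> bool {
    const MIN_IMPROVEMENT_PCT: f64 = 2.0; // below this, treat the delta as noise

    if !tests_passed {
        return false; // correctness guardrails are non-negotiable
    }
    let improvement_pct =
        (baseline_mean_ms - candidate_mean_ms) / baseline_mean_ms * 100.0;
    improvement_pct > MIN_IMPROVEMENT_PCT
}

fn main() {
    assert!(keep_candidate(409.1, 352.1, true)); // clear win: keep
    assert!(!keep_candidate(409.1, 405.0, true)); // within noise: revert
    assert!(!keep_candidate(409.1, 300.0, false)); // fast but broken: revert
}
```

The point of encoding the rule as code is that it runs identically on every iteration, so "keep" decisions never depend on who is looking at the numbers.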
This gave us concrete, decision-ready metrics instead of anecdotal "feels faster" claims. Example outcomes from the class decompilation track:
- Mean latency improved versus baseline in repeated top-class comparisons.
- Memory improved in the same benchmark family, with process-level peak RSS/HWM reductions.
- Coverage was raised and enforced with `cargo llvm-cov --fail-under-lines 90` on the API package, so optimization work did not erode reliability.
Notable Speed Changes With Code Examples
Below are concrete code-level changes that helped performance in the decompiler path.
1) Parallelize method decompilation only when class size justifies it
In `crates/core/src/emit/mod.rs`, method bodies are decompiled in parallel only for sufficiently large classes. Small classes stay serial to avoid scheduler overhead.
```rust
fn should_parallelize_method_decompile(coded_method_count: usize, total_code_bytes: usize) -> bool {
    coded_method_count >= 24 && total_code_bytes >= 12_000 && method_decompile_parallelism() > 1
}

if should_parallelize_method_decompile(coded_methods.len(), total_code_bytes) {
    let results = coded_methods
        .par_iter()
        .map(|(method_index, method)| {
            (*method_index, crate::decompile::decompile_method_v1(class, method, *method_index, decompile_opts))
        })
        .collect::<Vec<_>>();
    // write back results...
}
```
Why it matters:
- The threshold gate prevents regressions on tiny classes.
- For big classes (many coded methods), this unlocks multicore throughput where most time is spent.
- Thread pool reuse (`OnceLock<ThreadPool>`) avoids rebuilding pools per class.
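The real implementation caches a rayon `ThreadPool` in a `OnceLock`; the same once-per-process pattern can be shown with a std-only sketch that caches the parallelism decision instead (names here are illustrative):

```rust
use std::sync::OnceLock;
use std::thread;

// Compute the parallelism level once per process instead of re-deriving it
// (or rebuilding a thread pool) for every class that gets decompiled.
static METHOD_DECOMPILE_PARALLELISM: OnceLock<usize> = OnceLock::new();

fn method_decompile_parallelism() -> usize {
    *METHOD_DECOMPILE_PARALLELISM.get_or_init(|| {
        // Falls back to 1 (serial) if the platform cannot report parallelism.
        thread::available_parallelism().map(|n| n.get()).unwrap_or(1)
    })
}

fn main() {
    // Every call after the first is a cheap atomic read, not a recomputation.
    assert!(method_decompile_parallelism() >= 1);
    assert_eq!(method_decompile_parallelism(), method_decompile_parallelism());
}
```

With rayon, the same `OnceLock` would hold a `ThreadPool` built via `ThreadPoolBuilder`, so the pool's worker threads survive across classes.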
2) Replace map-heavy method body storage with indexed slots
The emitter path uses vector-indexed storage for method bodies and moves values out with `take()`, reducing lookup and clone overhead.
```rust
let mut method_bodies: Vec<Option<crate::decompile::MethodBody>> = vec![None; class.methods.len()];
// fill method_bodies[method_index] = Some(body)
let body = method_bodies
    .get_mut(method_index)
    .and_then(|slot| slot.take());
```
Why it matters:
- Indexing by method index is O(1) and cache-friendly.
- `Option::take()` avoids repeated cloning of large body structures.
- This directly improves the hot emit path for classes with many methods.
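The slot pattern is small enough to show end to end. Below is a self-contained sketch with a stand-in `MethodBody` type (the real type lives in `crates/core`):

```rust
// Stand-in for the real decompiled-body type.
#[derive(Debug, PartialEq)]
struct MethodBody {
    source: String,
}

fn take_body(slots: &mut [Option<MethodBody>], method_index: usize) -> Option<MethodBody> {
    // O(1) index lookup; `take()` moves the body out and leaves `None` behind,
    // so the (potentially large) body is never cloned.
    slots.get_mut(method_index).and_then(|slot| slot.take())
}

fn main() {
    let mut slots: Vec<Option<MethodBody>> = (0..3).map(|_| None).collect();
    slots[1] = Some(MethodBody { source: "return 1;".into() });

    let body = take_body(&mut slots, 1);
    assert_eq!(body.map(|b| b.source).as_deref(), Some("return 1;"));
    assert!(slots[1].is_none()); // slot is empty after the move
    assert!(take_body(&mut slots, 99).is_none()); // out-of-range is safe
}
```

Compared with a `HashMap<usize, MethodBody>`, this trades a hash and probe per access for a direct slice index, which matters when the emit loop touches every method of a large class.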
3) Add no-allocation fast paths in identifier rewriting
String-rewrite utilities now bail out immediately when there is nothing to replace, instead of always allocating an output string.
```rust
fn replace_identifier_all_if_needed(source: &str, from: &str, to: &str) -> Option<String> {
    if from.is_empty() || from == to { return None; }
    if !source.contains(from) { return None; }
    // rewrite only if needed...
    Some(out)
}
```
Why it matters:
- Large decompiled method bodies often do not need renaming rewrites.
- Avoiding unnecessary allocations cuts CPU and memory churn in post-processing.
- This is a classic hot-path optimization with low risk and high frequency payoff.
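For a runnable version of the same idea: the real IPND rewrite is identifier-boundary-aware, but a simplified sketch that just delegates to `str::replace` after the cheap guards still shows where the savings come from:

```rust
/// Returns `None` when no rewrite is needed, so callers keep the original
/// `&str` and skip allocating an output `String` entirely.
fn replace_all_if_needed(source: &str, from: &str, to: &str) -> Option<String> {
    // Cheap guards first: degenerate patterns and no-op rewrites.
    if from.is_empty() || from == to {
        return None;
    }
    // `contains` scans without allocating; most method bodies take this exit.
    if !source.contains(from) {
        return None;
    }
    // Only now pay for the allocation and the rewrite.
    Some(source.replace(from, to))
}

fn main() {
    // Fast path: nothing to rename, no allocation happens.
    assert_eq!(replace_all_if_needed("int x = 1;", "y", "z"), None);
    // Slow path: the rewrite actually runs.
    assert_eq!(
        replace_all_if_needed("int x = x + 1;", "x", "acc"),
        Some("int acc = acc + 1;".to_string())
    );
}
```

The `Option<String>` return type is what makes the fast path free: a `None` tells the caller to keep using the input slice as-is.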
4) Optimize artifact ZIP write path for throughput
For output packaging, I moved to low-cost compression by default and made "stored" mode configurable for memory-sensitive runs.
```rust
let file_options = if use_stored_artifact_entries() {
    SimpleFileOptions::default().compression_method(CompressionMethod::Stored)
} else {
    SimpleFileOptions::default()
        .compression_method(CompressionMethod::Deflated)
        .compression_level(Some(1))
};
```
Why it matters:
- Deflate level 1 significantly reduces compression CPU cost versus higher levels.
- Stored mode can further reduce transient memory pressure when compression is not needed.
- This affects jar decompile end-to-end latency because artifact creation is on the critical path.
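The `use_stored_artifact_entries()` helper is not shown above; a plausible sketch is an env-gated toggle. The variable name `IPND_ARTIFACT_STORED` below is purely illustrative, not the project's actual flag:

```rust
use std::env;

// Pure parsing helper so the toggle logic is testable without mutating
// process environment state.
fn stored_flag_enabled(raw: Option<&str>) -> bool {
    matches!(raw, Some(v) if v == "1" || v.eq_ignore_ascii_case("true"))
}

// Hypothetical shape of the toggle behind `use_stored_artifact_entries()`;
// the env var name is an assumption for this sketch.
fn use_stored_artifact_entries() -> bool {
    stored_flag_enabled(env::var("IPND_ARTIFACT_STORED").ok().as_deref())
}

fn main() {
    // Default (unset) keeps the deflate-level-1 path.
    assert!(!stored_flag_enabled(None));
    // Opting in switches artifact entries to Stored (no compression CPU cost).
    assert!(stored_flag_enabled(Some("true")));
    println!("stored entries enabled: {}", use_stored_artifact_entries());
}
```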
What Was Measured
On our class benchmark track (largest-class/top-N comparisons), current vs baseline showed sustained improvements:
- Top-10 class retest: mean latency delta `-18.44%`, weighted delta `-24.67%`.
- Randomized top-10 sample from top-50 classes (two-run aggregate): mean delta about `-16%`, weighted delta about `-27%`.
These changes were only kept when they held against baseline under the same harness and passed the regression tests.
Current Full-Jar Decompiler vs CFR Numbers (Current Checkout)
To compare full jar decompilation (not per-class microbenchmarks), I used the same input jar for both tools:
- Jar: `external_jars/commons-lang3-3.14.0.jar` (`404` classes)
- Passes: `3`
- Host CPU: `Intel(R) Core(TM) i7-10850H CPU @ 2.70GHz`
- Java: `openjdk version "21.0.9" 2025-10-21`
Commands used:
```sh
# IPND full-jar decompile (API worker path)
IPND_PERF_PASSES=3 IPND_PERF_BUDGET_MS=50 \
cargo test -p ipdn perf_run_decompile_job_common_jar -- --ignored --nocapture

# CFR full-jar decompile (whole-jar invocation, 3 passes)
java -jar dist/tools/cfr-0.152.jar external_jars/commons-lang3-3.14.0.jar --outputdir <tmp> --silent true
```
Aggregate latency results (full jar)
| Slice | IPND mean (ms) | CFR mean (ms) | CFR/IPND ratio |
|---|---|---|---|
| Overall (all passes) | 369.440 | 5998.479 | 16.237x |
| Cold pass only (pass 1) | 409.150 | 6397.316 | 15.636x |
| Warm passes only (pass 2-3) | 349.585 | 5799.060 | 16.588x |
Supporting percentiles from the same run set:
- Overall p50: IPND `352.106 ms`, CFR `5913.400 ms`
- Overall p95: IPND `409.150 ms`, CFR `6397.316 ms`
Output artifact context:
- IPND artifact zip size: `416735` bytes
- CFR extracted source size: `1173989` bytes across `248` files
Interpretation:
- On this full-jar workload, current IPND is about `16x` faster than CFR on the same machine.
- Unlike class microbenchmarks, this comparison includes full pipeline cost (jar indexing, decompile loop, output materialization).
- The per-class CFR benchmark remains useful for instruction-level tuning, but jar-level numbers are the right KPI for end-user jar decompilation throughput.
Why This Architecture Scales Beyond ML and Decompilers
The pattern works anywhere you can define:
- a controllable change surface,
- a repeatable run harness,
- and a strict scoring function.
That includes compilers, API backends, data pipelines, search ranking services, and frontend rendering performance.
The transferable blueprint is:
- Define objective as a metric, not a story.
- Lock evaluation protocol.
- Automate measurement and diffing.
- Require objective keep/revert decisions.
- Track baseline drift explicitly.
If you do just these five things, "autonomous research" stops being an ML novelty and becomes a general engineering operating model.
Under-the-Hood Components That Matter Most
A lot of teams underestimate this part. The architecture only works when each component is explicit and stable:
- Objective layer: a single optimization target plus guardrail targets.
- Mutation layer: where code changes are allowed (for ML here it was primarily `train.py`; for decompiler work I intentionally touched scoped Rust modules).
- Execution layer: deterministic commands for tests, perf runs, and memory runs.
- Evaluation layer: scripts and logs that produce machine-readable deltas (`summary|...`, `memory|...`).
- Decision layer: keep/drop rules that run the same way every iteration.
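The evaluation layer's value is that its output is trivially parseable. The exact `summary|...` format is not spelled out in this post, so the field layout below (metric name, baseline, current) is a hypothetical sketch of what such a line could carry:

```rust
// Hypothetical pipe-delimited shape: "summary|<metric>|<baseline>|<current>".
// The real format is not shown in the post; field names are illustrative.
fn parse_summary_line(line: &str) -> Option<(String, f64, f64)> {
    let mut parts = line.split('|');
    if parts.next()? != "summary" {
        return None; // e.g. "memory|..." lines are handled elsewhere
    }
    let metric = parts.next()?.to_string();
    let baseline: f64 = parts.next()?.parse().ok()?;
    let current: f64 = parts.next()?.parse().ok()?;
    Some((metric, baseline, current))
}

fn main() {
    let (metric, baseline, current) =
        parse_summary_line("summary|mean_latency_ms|409.150|352.106").unwrap();
    // The delta feeds the decision layer directly; no human interpretation step.
    let delta_pct = (current - baseline) / baseline * 100.0;
    assert_eq!(metric, "mean_latency_ms");
    assert!(delta_pct < 0.0); // negative delta means an improvement
    assert!(parse_summary_line("memory|peak_rss|123").is_none());
}
```

Once deltas are structured data rather than log prose, the decision layer can be a pure function over them.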
Without this separation, optimization efforts drift into ad-hoc debugging. With it, every iteration contributes to a cumulative research trajectory.
A Practical Template for Other Projects
If you want to adapt this architecture to a new project, start with a minimal contract:
- Pick one benchmark corpus that matches production pressure.
- Record one immutable baseline run.
- Define pass/fail thresholds for regressions.
- Automate one command that prints current vs baseline.
- Gate merges on those numbers.
Once that is in place, you can scale out to multi-objective optimization (speed, memory, reliability, quality) without losing control of experiment integrity.
Closing
What started as an ML experiment loop is really a system for disciplined optimization under uncertainty. In our decompiler work, that architecture let us improve speed and memory without sacrificing correctness or API stability. The key was not domain-specific tricks; it was the loop design itself: baseline, profile, patch, verify, compare, repeat.