
22x Faster Builds: Inside GALA's Compilation Performance Journey

How profiling, batch analysis, and content-addressed caching took multi-file compilation from 44.7s to 2.0s


In GALA 0.24.0, we reduced compilation time for multi-file packages from 44.7 seconds to 2.0 seconds -- a 22x improvement. This post covers what was slow, how we found it, and what we did about it.

The Problem

GALA transpiles .gala files to .go files. For single-file packages, this is fast -- parse, analyze, transform, emit. But real projects have multi-file packages. A server might have 7 files in the same package, each file needing to know about types, methods, and sealed types defined in the other 6.

Before 0.24.0, each file in a package was transpiled as a separate process invocation. The gala transpile command was called once per file, and each invocation:

  1. Parsed the target file
  2. Scanned all sibling files to extract type information
  3. Analyzed the full dependency graph (imports, Go type inference, sealed type metadata)
  4. Transformed the AST to Go
  5. Emitted the output

For a 7-file package, steps 2 and 3 happened 7 times, each time re-parsing the same siblings and re-resolving the same imports. The Go type inference step -- which shells out to go list to resolve return types and method signatures from Go packages -- was particularly expensive, adding hundreds of milliseconds per invocation.
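To make the redundancy concrete, here is a back-of-the-envelope sketch (illustrative arithmetic, not actual GALA code): with file-at-a-time invocation, every file in an n-file package gets parsed once as a target and n-1 more times as someone else's sibling.

```go
package main

import "fmt"

func main() {
	const n = 7 // files in the gala-server package
	// File-at-a-time: each of the n invocations parses the target
	// plus its n-1 siblings, so every file is parsed n times.
	perFile := n * n
	// Batch mode: each file is parsed exactly once per package build.
	batch := n
	fmt.Printf("file-at-a-time: %d parses, batch: %d parses\n", perFile, batch)
}
```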

The gala-server package (7 files) took 44.7 seconds. Most of that was redundant work.

Profiling First

GALA 0.24.0 introduced GALA_PROFILE=1, an environment variable that enables compilation profiling. When set, the transpiler emits timing breakdowns for every phase:

[PROFILE] Parse:     12ms
[PROFILE] Analyze:   340ms
[PROFILE] Transform: 85ms
[PROFILE] Emit:      3ms
[PROFILE] Total:     440ms
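The mechanism behind this kind of output is simple to sketch. The GALA_PROFILE variable is from the post; everything else in this snippet (the helper name, the exact format string) is an assumption, not the actual transpiler internals:

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// profiled runs a compilation phase and, when GALA_PROFILE=1 is set,
// prints how long it took. Hypothetical helper for illustration only.
func profiled(name string, phase func()) {
	start := time.Now()
	phase()
	if os.Getenv("GALA_PROFILE") == "1" {
		fmt.Printf("[PROFILE] %-10s %dms\n", name+":", time.Since(start).Milliseconds())
	}
}

func main() {
	// Stub phases standing in for parse/analyze/transform/emit.
	profiled("Parse", func() { time.Sleep(10 * time.Millisecond) })
	profiled("Analyze", func() { time.Sleep(30 * time.Millisecond) })
}
```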

Running this across all 7 files revealed the pattern immediately: analysis dominated. Each file spent ~340ms in analysis, and the analysis for each file was doing nearly identical work -- re-parsing sibling files, re-resolving Go types, re-building sealed type metadata.

The profiling data made the optimization path obvious. Without it, we might have guessed wrong and optimized the transform phase (which was already fast) or the parser (which was negligible).

Lesson: always profile before optimizing. The GALA project enforces this as policy -- the memory file literally says "always profile with GALA_PROFILE=1 before optimizing."

Solution 1: BatchAnalyzer

The first optimization was architectural: share analysis state across files in the same package.

The BatchAnalyzer takes all files in a package at once. It parses each file, then runs a single unified analysis pass that:

  • Extracts type declarations, method signatures, and sealed type definitions from all files
  • Runs Go type inference once for the entire package's import set
  • Builds a shared metadata cache that every file can read from

Instead of 7 analysis passes (one per file), there is now 1 analysis pass for the whole package. The analysis result is then distributed to each file's transformer.

The implementation required refactoring the analyzer's entry point. Previously, NewGalaAnalyzer accepted a single file and optional sibling file paths. The new NewBatchAnalyzer accepts all files upfront and returns a shared analysis context:

Before:  7 files x 1 analyzer each  = 7 full analyses
After:   7 files x 1 shared analyzer = 1 full analysis + 7 lookups
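In code, the API shift looks roughly like the following. This is a hypothetical sketch of the shape described above; the real NewBatchAnalyzer signature and context fields may differ:

```go
package main

import "fmt"

// SharedContext is the package-wide analysis result, built once
// and read by every file's transformer.
type SharedContext struct {
	TypeDecls map[string]string // type name -> declaring file
}

// BatchAnalyzer receives all files in the package upfront
// (here stubbed as file name -> declared type names).
type BatchAnalyzer struct {
	files map[string][]string
}

func NewBatchAnalyzer(files map[string][]string) *BatchAnalyzer {
	return &BatchAnalyzer{files: files}
}

// Analyze runs one unified pass over every file and returns the
// shared context, instead of one full analysis per file.
func (b *BatchAnalyzer) Analyze() *SharedContext {
	ctx := &SharedContext{TypeDecls: map[string]string{}}
	for file, decls := range b.files {
		for _, d := range decls {
			ctx.TypeDecls[d] = file
		}
	}
	return ctx
}

func main() {
	ba := NewBatchAnalyzer(map[string][]string{
		"server.gala":  {"Server"},
		"handler.gala": {"Handler"},
	})
	ctx := ba.Analyze()
	fmt.Println(ctx.TypeDecls["Handler"]) // handler.gala
}
```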

Solution 2: Disk Cache

Even with batch analysis, the Go type inference step (resolving return types, struct fields, and method signatures from Go standard library and third-party packages) was still expensive for the first compilation of a package.

GALA 0.24.0 introduced a disk cache at .gala/cache/. The cache stores analysis results keyed by content hash -- a SHA-256 of the file contents plus the contents of all its imports. If the file and its dependencies haven't changed, the cached analysis is reused.

The cache format is simple: JSON files named by their content hash. On a warm cache, the analysis phase drops from ~340ms to ~5ms per file, because no Go type inference or sibling parsing is needed.

Cache invalidation is content-addressed, so it is always correct: if any source file changes, its hash changes, and the cache misses. No timestamps, no file watchers, no stale cache bugs.
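A content-addressed key can be sketched in a few lines. This follows the scheme described above (SHA-256 over the file contents plus the contents of its imports), though the hashing details in GALA's actual cache may differ:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// cacheKey hashes a file's contents together with the contents of
// everything it imports, so any upstream change yields a new key.
func cacheKey(file []byte, imports [][]byte) string {
	h := sha256.New()
	h.Write(file)
	for _, imp := range imports {
		h.Write(imp)
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	k1 := cacheKey([]byte("type User struct{}"), nil)
	k2 := cacheKey([]byte("type User struct{ ID int }"), nil)
	fmt.Println(k1 != k2) // different contents -> different keys
}
```

Because the key is derived purely from content, a cache entry can never be stale: it is either an exact match for the current inputs or it simply isn't found.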

Solution 3: Batch Transpilation

The final piece was eliminating the process-per-file overhead. Before 0.24.0, the build system (Bazel or gala build) invoked the transpiler binary once per .gala file. Each invocation paid the cost of process startup, Go runtime initialization, and flag parsing.

Batch transpilation runs a single transpiler process for all files in a package. The process:

  1. Parses all input files
  2. Runs the BatchAnalyzer (one shared analysis pass)
  3. Transforms each file using the shared analysis context
  4. Emits all .go files

This eliminated 6 process startups (for a 7-file package) and enabled the shared analysis to happen in-memory rather than being serialized to disk and re-read.
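The four steps above can be sketched as a single in-process driver. The phase functions here are stand-ins, not GALA internals:

```go
package main

import "fmt"

// transpilePackage runs the whole package through one process:
// parse everything, analyze once, then transform and emit each file
// against the shared result (all phases stubbed for illustration).
func transpilePackage(files []string) []string {
	asts := files // 1. parse all inputs (stubbed as identity)
	// 2. one shared analysis pass for the entire package
	shared := fmt.Sprintf("analysis of %d files", len(asts))
	// 3-4. transform and emit each file using the shared context
	out := make([]string, 0, len(asts))
	for _, a := range asts {
		out = append(out, a+".go ("+shared+")")
	}
	return out
}

func main() {
	for _, f := range transpilePackage([]string{"server", "router"}) {
		fmt.Println(f)
	}
}
```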

Results

Benchmarked on the gala-server package (7 .gala files):

Configuration                          Time    Speedup
0.23.x (file-at-a-time, no cache)      44.7s   baseline
Batch analysis, no cache               6.1s    7.3x
Batch analysis + disk cache (cold)     5.8s    7.7x
Batch analysis + disk cache (warm)     2.7s    16.6x
Batch transpilation + warm cache       2.0s    22x

The warm-cache number is the steady-state experience during development: edit a file, rebuild, and the unchanged files hit the cache while only the modified file gets re-analyzed.

For single-file packages (like most examples), compilation time is unchanged at ~200-400ms. The optimization specifically targets the multi-file case where redundant work was the bottleneck.

Lessons Learned

1. Profile before optimizing. The profiling data pointed directly at analysis as the bottleneck. Without it, we might have spent time optimizing the wrong phase.

2. Shared state beats repeated computation. The batch analyzer is conceptually simple -- do the work once, share the result -- but it required restructuring the analyzer's API from "single file in, analysis out" to "all files in, shared context out."

3. Content-addressed caching is worth the investment. It is simpler than timestamp-based caching (no clock skew issues, no "did the file actually change?" ambiguity) and it is always correct. The overhead of computing SHA-256 hashes is negligible compared to the analysis it replaces.

4. Process startup costs add up. For a language tool that gets invoked hundreds of times during a build, the cost of spawning a new process each time is real. Batch mode amortizes this to one process per package.

5. The optimization was straightforward once the data was clear. There was no clever algorithm, no sophisticated caching strategy. The 22x improvement came from eliminating obviously redundant work that the profiler made visible.

Try It

GALA 0.24.0+ is available now. To see compilation timing for your own packages:

GALA_PROFILE=1 gala build ./...

The playground at https://gala-playground.fly.dev transpiles instantly (single-file), but for multi-file projects, the batch transpilation makes a meaningful difference in the edit-compile-run cycle.

GALA From the Ground Up

Part 4 of 9

A nine-part technical blog series exploring GALA — a modern programming language that brings sealed types, pattern matching, monadic error handling, and immutable-by-default semantics to the Go ecosystem. Each post takes a single concept, shows the problem it solves with real code, compares it to idiomatic Go, and is honest about trade-offs. Written for Go developers who want more expressiveness and functional programmers who want Go's runtime.

Up next

Building a Reliable Transpiler: Lessons from 80+ Bug Fixes

What breaks when you map functional programming onto Go's type system, and how testing infrastructure keeps it fixed