
Adding Observability to a 500K-Line Monolith Without a Rewrite

Adding structured logging, distributed tracing, and metrics to a legacy monolith without a full rewrite. Real code examples and a war story from a 500K-line codebase.

observability, legacy code, structured logging, distributed tracing, metrics, monolith

Why I Care About Observability in a Monolith

I spent three years maintaining a 500K-line Java monolith that had been running since 2009. The only observability was a log file with messages like "Processing request 1234" and the occasional stack trace. When something broke, we'd grep the logs, guess, and deploy a fix. It worked, but it was slow and painful.

Adding observability to that codebase wasn't about rewriting everything. It was about finding the smallest changes that gave us the most signal. This post covers the techniques I used: structured logging without changing every log call, distributed tracing via decorators, and metrics extraction from existing timers.

Start With Structured Logging: The Least Invasive Change

The first thing I did was add a structured logging layer. The codebase used a custom logging facade that wrapped log4j 1.2. I couldn't change every call site, but I could change the underlying implementation.

I created a bridge that parsed the log message for key=value pairs and emitted them as JSON. If the message had no structured data, it would just add the thread name, timestamp, and log level. The trick was to preserve the original log method signatures so the rest of the code didn't need to change.

Structured logging output from unchanged legacy log calls.
// Original call site: unchanged
logger.info("User login userId={} from ip={}", userId, ip);

// Underlying implementation: emits JSON
{
  "level": "INFO",
  "message": "User login",
  "userId": "abc123",
  "ip": "10.0.0.1",
  "thread": "http-nio-8080-exec-4",
  "timestamp": "2025-04-01T10:30:00Z"
}

If your legacy logger uses string concatenation, you can still add structure by wrapping the logger in a facade that parses the final string for key=value patterns. It's not perfect, but it's better than nothing.
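To make the key=value idea concrete, here is a minimal sketch of the parsing step, assuming the facade hands over the already-rendered message string. The class name `KeyValueLogFormatter` and its `toJson` signature are hypothetical, not the bridge from the original codebase, and the hand-rolled JSON writer only stands in for whatever JSON library you already have.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: renders one legacy log line as a single-line JSON object. */
final class KeyValueLogFormatter {
    // key=value where the value runs until the next whitespace or comma
    private static final Pattern PAIR = Pattern.compile("(\\w+)=([^\\s,]+)");

    static String toJson(String level, String thread, String timestamp, String rendered) {
        Matcher m = PAIR.matcher(rendered);
        // Text before the first pair is kept as the human-readable message
        int firstPair = m.find() ? m.start() : rendered.length();

        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("level", level);
        fields.put("message", rendered.substring(0, firstPair).trim());
        fields.put("thread", thread);
        fields.put("timestamp", timestamp);

        m.reset();
        while (m.find()) {
            // putIfAbsent: a stray "level=..." in the text must not clobber the real level
            fields.putIfAbsent(m.group(1), m.group(2));
        }

        StringBuilder json = new StringBuilder("{");
        for (Map.Entry<String, String> e : fields.entrySet()) {
            if (json.length() > 1) json.append(',');
            json.append('"').append(escape(e.getKey())).append("\":\"")
                .append(escape(e.getValue())).append('"');
        }
        return json.append('}').toString();
    }

    // Minimal JSON string escaping: quotes, backslashes, and control characters
    private static String escape(String s) {
        StringBuilder out = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"': out.append("\\\""); break;
                case '\\': out.append("\\\\"); break;
                case '\n': out.append("\\n"); break;
                case '\r': out.append("\\r"); break;
                case '\t': out.append("\\t"); break;
                default:
                    if (c < 0x20) out.append(String.format("\\u%04x", (int) c));
                    else out.append(c);
            }
        }
        return out.toString();
    }
}
```

A message with no pairs passes through with only the level, message, thread, and timestamp fields. The sketch has known limits: every value is emitted as a string, words between pairs (such as "from") are dropped, and a value followed directly by a period keeps that period.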

Tracing Without Touching Every Method

Distributed tracing was harder because the code had no concept of trace context. Every request created new threads, and there was no correlation ID. I introduced a thread-local trace ID that was set at the HTTP request boundary (using a servlet filter) and then propagated through a custom executor service that copied the context to new threads.
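The propagation mechanism can be sketched roughly as follows. `TraceContext` and `ContextPropagatingExecutor` are hypothetical names, and for brevity the wrapper implements the plain `Executor` interface rather than a full `ExecutorService`; the capture-and-restore idea is the same either way.

```java
import java.util.UUID;
import java.util.concurrent.Executor;

/** Hypothetical thread-local holder for the current request's trace ID. */
final class TraceContext {
    private static final ThreadLocal<String> TRACE_ID = new ThreadLocal<>();

    /** Called at the request boundary, e.g. from a servlet filter. */
    static String start() {
        String id = UUID.randomUUID().toString();
        TRACE_ID.set(id);
        return id;
    }

    static String get() { return TRACE_ID.get(); }
    static void set(String id) { TRACE_ID.set(id); }
    static void clear() { TRACE_ID.remove(); }
}

/** Copies the submitting thread's trace ID onto the worker thread for the task's duration. */
final class ContextPropagatingExecutor implements Executor {
    private final Executor delegate;

    ContextPropagatingExecutor(Executor delegate) {
        this.delegate = delegate;
    }

    @Override
    public void execute(Runnable task) {
        // Captured here, on the submitting thread, before the hand-off
        String callerTraceId = TraceContext.get();
        delegate.execute(() -> {
            String previous = TraceContext.get();
            TraceContext.set(callerTraceId);
            try {
                task.run();
            } finally {
                // Restore the worker's prior state so pooled threads do not leak IDs
                if (previous == null) TraceContext.clear();
                else TraceContext.set(previous);
            }
        });
    }
}
```

The filter would call `TraceContext.start()` on entry and `TraceContext.clear()` in a `finally` block on exit. Restoring the worker's previous value matters with thread pools: without it, a reused thread would carry a stale trace ID into unrelated work.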

For the actual spans, I used a decorator pattern. I created a @Traced annotation that could be placed on any public method. The annotation was processed by an AOP aspect that started a span, delegated to the method, and closed the span. This let me add tracing to critical paths (like the payment processing pipeline) without modifying the business logic.

AOP-based tracing annotation for legacy methods.
@Target(ElementType.METHOD)
@Retention(RetentionPolicy.RUNTIME)
public @interface Traced {
    String name() default "";
    String[] tags() default {};
}

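@Aspect
public class TracingAspect {
    // Assumes an OpenTracing tracer was registered with GlobalTracer at startup
    private final Tracer tracer = GlobalTracer.get();

    @Around("@annotation(traced)")
    public Object trace(ProceedingJoinPoint pjp, Traced traced) throws Throwable {
        String spanName = traced.name().isEmpty() ? pjp.getSignature().toShortString() : traced.name();
        Span span = tracer.buildSpan(spanName).start();
        try (Scope scope = tracer.activateSpan(span)) {
            return pjp.proceed();
        } catch (Throwable t) {
            // proceed() can throw Errors as well, so catch Throwable rather than Exception
            span.setTag("error", true);
            span.log(String.valueOf(t.getMessage())); // getMessage() may be null
            throw t;
        } finally {
            span.finish();
        }
    }
}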

Extracting Metrics From Existing Timers

The monolith had a homegrown metrics class that tracked counts and durations using static methods. Calls like `Metrics.recordTiming("checkout.total", duration)` were scattered everywhere. Instead of replacing them, I added a Micrometer meter registry that hooked into these static calls. I wrapped the static `Metrics` class with a delegate that forwarded to both the old system (for backward compatibility) and Micrometer.

This gave me instant access to Prometheus metrics for all existing instrumentation. I didn't have to add a single new metric call to get dashboards for checkout latency, login failures, and database query times.

Bridging legacy static metrics to Micrometer without changing call sites.
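public class LegacyMetricsBridge {
    // A SimpleMeterRegistry only holds values in memory; a Prometheus-backed
    // registry (from micrometer-registry-prometheus) is needed for scraping.
    private static final MeterRegistry registry =
        new PrometheusMeterRegistry(PrometheusConfig.DEFAULT);

    public static void recordTiming(String name, long durationMs) {
        // Forward to the legacy system (if still needed).
        // LegacyMetrics stands for the original homegrown metrics class.
        LegacyMetrics.recordTiming(name, durationMs);
        // Record to Micrometer; register() returns the existing timer if one
        // with this name was already created
        Timer.builder(name)
            .register(registry)
            .record(Duration.ofMillis(durationMs));
    }

    public static MeterRegistry getRegistry() {
        return registry;
    }
}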

The Silent Database Bloat

  1. 14:02 PagerDuty alerts on increased response latency for the order API.
  2. 14:05 Grepped the logs: found no errors, but noticed many 'Processing order' messages with no timestamps.
  3. 14:10 Checked the new structured logs: saw that a single SQL query was taking 12 seconds, whereas the old logs only showed total endpoint time.
  4. 14:15 Used trace spans to identify the exact repository method causing the slow query.
  5. 14:20 Found that a missing index on a join table was causing full table scans, which got slower as data accumulated.
  6. 14:30 Added the index. Latency dropped from 12s to 200ms.

Lesson

Without tracing, we would have spent hours grepping logs and guessing. The structured logging and tracing we added earlier that month paid for themselves in one incident.

Instrument the Database Layer Once

One of the highest-impact changes was instrumenting the database access layer. The codebase used a custom DAO layer that directly executed SQL via JDBC. I created a wrapper around `DataSource.getConnection()` that returned a proxied connection. Every statement execution was captured as a span with the SQL text (parameterized), duration, and any error.

This single change gave us visibility into every database query across the entire application. We could now see which queries were slow, which were called most frequently, and which were failing silently. The overhead was negligible: the wrapper added only a few microseconds per query.

Wrapping DataSource to add tracing to every SQL statement.
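public class TracingDataSource implements DataSource {
    private final DataSource delegate;
    private final Tracer tracer;

    public TracingDataSource(DataSource delegate, Tracer tracer) {
        this.delegate = delegate;
        this.tracer = tracer;
    }

    @Override
    public Connection getConnection() throws SQLException {
        return new TracingConnection(delegate.getConnection(), tracer);
    }

    // getConnection(username, password) must be wrapped the same way.
    // Other methods delegate directly...
}

public class TracingConnection implements Connection {
    private final Connection delegate;
    private final Tracer tracer;

    public TracingConnection(Connection delegate, Tracer tracer) {
        this.delegate = delegate;
        this.tracer = tracer;
    }

    @Override
    public PreparedStatement prepareStatement(String sql) throws SQLException {
        // Pass the parameterized SQL text along so the statement wrapper can tag its span
        return new TracingPreparedStatement(delegate.prepareStatement(sql), sql, tracer);
    }
    // ...
}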
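The statement wrapper itself is not shown above. One way to build it without hand-writing every JDBC method is a dynamic proxy, sketched here under some assumptions: `TimingProxy` and `QueryListener` are hypothetical names, and the listener stands in for the code that would start and finish a span. The sketch times any interface method whose name starts with `execute` and passes everything else straight through.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;

/** Hypothetical callback; a real version would create and finish a span here. */
interface QueryListener {
    void onQuery(String sql, String method, long nanos, Throwable error);
}

final class TimingProxy {
    /** Wraps target so that every execute* call is timed and reported to the listener. */
    @SuppressWarnings("unchecked")
    static <T> T wrap(Class<T> iface, T target, String sql, QueryListener listener) {
        InvocationHandler handler = (proxy, method, args) -> {
            if (!method.getName().startsWith("execute")) {
                return call(method, target, args); // setters, close(), etc. pass through
            }
            long start = System.nanoTime();
            Throwable error = null;
            try {
                return call(method, target, args);
            } catch (Throwable t) {
                error = t;
                throw t;
            } finally {
                listener.onQuery(sql, method.getName(), System.nanoTime() - start, error);
            }
        };
        return (T) Proxy.newProxyInstance(
            target.getClass().getClassLoader(), new Class<?>[] {iface}, handler);
    }

    private static Object call(Method method, Object target, Object[] args) throws Throwable {
        try {
            return method.invoke(target, args);
        } catch (InvocationTargetException e) {
            throw e.getCause(); // surface the original exception, e.g. SQLException
        }
    }
}
```

With this in place, `prepareStatement` could return `TimingProxy.wrap(PreparedStatement.class, delegate.prepareStatement(sql), sql, listener)`. Reflection adds some per-call cost, so it is worth measuring before relying on it for very hot paths.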

Testing Your Observability Changes

Adding observability to legacy code is useless if it breaks silently. I always deployed a canary instance first and verified that spans were being exported, metrics were appearing in Prometheus, and logs were structured. I wrote a simple integration test that sent a request and checked that at least one span was created and that the log output was valid JSON.

One time, the tracing library was misconfigured and wasn't exporting any spans. Without the test, I wouldn't have noticed until the next incident. The test caught it immediately.

Smoke test commands to verify observability is working.
# Quick smoke test for observability after deploy
curl -s http://localhost:8080/actuator/health | jq '.status'
# Should print "UP"

curl -s http://localhost:8080/actuator/metrics | jq '.names | length'
# Should be > 0

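# Check logs for structured output: trigger a request, then inspect the newest log line.
# APP_LOG is a placeholder for wherever your application writes its JSON logs.
curl -s -o /dev/null http://localhost:8080/api/orders
tail -n 1 "$APP_LOG" | jq -e '.message'
# Should print a non-null message; jq -e exits non-zero if it is null or the line is not JSON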

What I'd Do Differently

I started with structured logging because it was the least invasive change, but in hindsight I should have brought tracing in sooner. Tracing gave us the fastest debugging wins. I also wish I had added a correlation ID to every log line from day one; it would have made grepping much easier.

Another mistake: I tried to instrument everything at once. Instead, pick one critical user flow (e.g., checkout) and instrument it end-to-end. Once that works, expand. You'll get more value faster and avoid overwhelming the team with changes.

The best observability change is the one that doesn't require a code review of every file. Wrappers, decorators, and AOP are your friends.

Wrapping Up

Adding observability to legacy code doesn't require a rewrite. Start with structured logging via a facade, add tracing with AOP, extract metrics from existing instrumentation, and wrap the database layer. Each change is small, reversible, and immediately valuable.

The 500K-line monolith is still running today, but now it has dashboards, traces, and structured logs. And I sleep better at night.

Frequently asked questions

What if the legacy codebase has no dependency injection (everything is static)?

Use a thread-local context to propagate trace IDs. In Java, use a static `TraceContext` holder backed by a `ThreadLocal` variable that you set at request entry and clear at request exit. In Python, use `contextvars`. This avoids changing function signatures.

How do I instrument a function that has no hooks and is called thousands of times?

Wrap the function at runtime. In Java, use bytecode manipulation with a library such as Byte Buddy; in Python, use monkey-patching. For example, wrap all methods matching a pattern to add tracing spans. For a function called thousands of times, measure the per-call overhead and test thoroughly to avoid performance regressions.

Should I add observability to error handling code that is rarely executed?

Yes, especially for error paths. Instrumenting them first often reveals silent failures that have been happening for years. Add a span around error-handling blocks and tag them with the error type. You'll be surprised what you find.

How do I convince the team to spend time on observability instead of new features?

Pick a single recurring production issue (e.g., a slow API that no one can explain). Show that with basic tracing you can pinpoint the cause in minutes. Once the team sees the value, they'll demand more instrumentation.