Property-Based Testing Finds Elusive Bugs: A Real-World Example

I've been writing unit tests for over a decade. I've used TDD, BDD, and every mocking framework that's come down the pike. And I've still shipped bugs that should have been caught. The worst one lived in a rate limiter I wrote for a financial API — it silently dropped requests under specific timing conditions. It survived code review, integration tests, and 18 months of production traffic.

That bug was an off-by-one in a sliding window counter. The window boundary was inclusive on one side and exclusive on the other, but I'd mixed them up. Every example-based test I wrote passed because I happened to pick timestamps that fell cleanly into one window or another. It wasn't until I switched to property-based testing that the bug surfaced.

What Property-Based Testing Actually Does

Instead of writing one test with specific inputs and expected outputs, you write a property — a statement that should be true for all valid inputs. The testing framework then generates hundreds or thousands of random inputs and checks the property. When it finds a failing case, it shrinks the input to the smallest example that still fails.

For my rate limiter, the property was simple: given any sequence of request timestamps, the number of requests counted in any window should never exceed the limit. That's it. No specific timestamps, no hand-crafted edge cases.

A Hypothesis-based property test for the rate limiter. The framework generates random lists of timestamps and checks the invariant.

from hypothesis import given, strategies as st
from my_rate_limiter import RateLimiter

# Generate lists of up to 1000 finite timestamps between 0 and 1e6 seconds.
@given(st.lists(st.floats(min_value=0, max_value=1e6), max_size=1000))
def test_rate_limiter_invariant(timestamps):
    limiter = RateLimiter(max_requests=10, window_seconds=60)
    for ts in timestamps:
        # record() returns the number of requests counted in the current window
        count = limiter.record(ts)
        # Property: the count never exceeds the configured limit of 10
        assert count <= 10, f"count {count} exceeded the limit at ts={ts}"
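
The test imports RateLimiter from a module that isn't shown here. For readers who want something to run it against, here is a minimal sketch of a sliding-window limiter with the interface the test assumes: record(ts) returns the number of requests counted in the current window. The class body, the deque, and the expiry rule are illustrative assumptions, not the production code from this story.

```python
from collections import deque


class RateLimiter:
    """Hypothetical sliding-window limiter (illustration only)."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._accepted = deque()  # timestamps of accepted requests

    def record(self, ts):
        """Record a request at time ts; return the count in the current window."""
        # Assumed window: (ts - window_seconds, ts], so a request that is
        # exactly window_seconds old has already expired.
        cutoff = ts - self.window_seconds
        while self._accepted and self._accepted[0] <= cutoff:
            self._accepted.popleft()
        # Only accept the request if the window still has room.
        if len(self._accepted) < self.max_requests:
            self._accepted.append(ts)
        return len(self._accepted)
```

Saved as my_rate_limiter.py, a version like this should satisfy the invariant, because it never appends a timestamp once the window is full.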

When writing property tests, start with the simplest invariant you can think of. Don't try to model the entire system at once. For a rate limiter, the invariant is trivial: no window should ever exceed the limit. That one line caught the bug.

The Shrinking Moment

When Hypothesis ran my test, it quickly found a failure. But instead of handing me a random list of 1000 timestamps, it shrank the input down to three timestamps: [0.0, 1.0, 59.999]. With a 60-second window and a limit of 10, those three timestamps produced a count of 11 in some window.

The problem was clear: the window [0, 60) counted timestamps at 0.0 and 1.0, but the next window [60, 120) should have started counting at 60.0. However, my implementation treated the boundary as inclusive on both sides, so timestamp 59.999 fell into both windows. It was an off-by-one that only appeared when a timestamp landed within 1 millisecond of a window boundary.

I had written unit tests for timestamps exactly at 0 and exactly at 60, but never at 59.999. The shrinking told me exactly where to look.
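
The general failure mode, one timestamp being counted by two adjacent windows, can be sketched in a few lines. The original comparison logic isn't shown in this post, so the closed-interval check below is an assumption about what a boundary bug of this kind looks like, and the helper names are hypothetical.

```python
def in_window_closed(ts, start, size):
    """Buggy variant (assumed): both boundaries are inclusive."""
    return start <= ts <= start + size


def in_window_half_open(ts, start, size):
    """Half-open variant: [start, start + size)."""
    return start <= ts < start + size


def windows_containing(ts, size, contains, n_windows=3):
    """Return the indexes of the fixed windows that claim timestamp ts."""
    return [k for k in range(n_windows) if contains(ts, k * size, size)]


# A timestamp on the shared boundary is claimed by two windows
# when both ends are inclusive, and by exactly one when half-open.
print(windows_containing(60.0, 60, in_window_closed))     # [0, 1]
print(windows_containing(60.0, 60, in_window_half_open))  # [1]
```

With half-open intervals the windows partition the timeline, so every timestamp belongs to exactly one of them. That partition is itself a property worth testing.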

The Production Outage That Could Have Been Prevented

T-18 months: Initial deployment of rate limiter with the off-by-one bug.
T-12 months: First customer report of 'request throttling' under heavy load. Team attributes to network issues.
T-6 months: Second report during a trading surge. No one investigates deeply.
T-0: Property-based test catches the bug during a routine test refactoring.

Lesson

The bug had been causing sporadic request drops for 18 months, but because it only triggered on specific timing conditions, it was dismissed as transient network issues. A property-based test caught it in under 2 seconds of test execution.

Why Example-Based Tests Miss This

Example-based tests are inherently limited by the imagination of the person writing them. You test what you think of. For edge cases, you'll probably test the obvious ones: empty input, max integer, null values. But you won't test all possible combinations of timestamps near window boundaries.

Property-based tests don't have that limitation. They explore the input space randomly, so they find the corner cases you didn't think to write. The key is writing a good property — one that captures the essence of correctness without being too vague.

Common properties that work well across domains:

- Idempotency: applying an operation twice produces the same result as applying it once.
- Round-tripping: serializing then deserializing yields the original object.
- Model-based: compare against a simpler but slower reference implementation.
- Invariant: some condition holds before and after the operation.
- Monotonicity: an operation only increases (or decreases) a value.
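
To make three of these concrete, here is a short Hypothesis sketch covering idempotency, round-tripping, and model-based checks. The dedupe helper is hypothetical and exists only for illustration; the JSON round trip uses Python's standard json module.

```python
import json

from hypothesis import given, strategies as st


def dedupe(items):
    """Hypothetical helper: drop duplicates, keeping first occurrences."""
    return list(dict.fromkeys(items))


@given(st.lists(st.integers()))
def test_dedupe_is_idempotent(items):
    # Idempotency: a second pass changes nothing.
    once = dedupe(items)
    assert dedupe(once) == once


@given(st.dictionaries(st.text(), st.integers()))
def test_json_round_trip(payload):
    # Round-tripping: serialize, deserialize, get the same object back.
    assert json.loads(json.dumps(payload)) == payload


@given(st.lists(st.integers()))
def test_dedupe_matches_model(items):
    # Model-based: compare against an obviously correct, slower version.
    model = []
    for item in items:
        if item not in model:
            model.append(item)
    assert dedupe(items) == model
```

Each test states one rule and leaves the choice of inputs to the framework, which is what keeps the properties short.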

When Not to Use Property-Based Testing

Property-based testing isn't a silver bullet. It's not great for testing UI interactions or complex workflows with many dependent steps. It also requires clear, testable properties — if you can't articulate what correctness means, you can't write a property test.

But for algorithmic code, data processing pipelines, serialization, and state machines, it's incredibly effective. I now use it for any function that takes structured input and produces structured output.

The rate limiter bug taught me that even simple code can harbor subtle bugs that survive for years. Property-based testing doesn't guarantee bug-free code, but it does explore far more of the input space than you could cover by hand.

Getting Started with Property-Based Testing

If you've never used property-based testing before, start small. Pick a function you know well and write a simple invariant. For example, if you have a sorting function, the invariant is that the output is sorted and has the same elements as the input. That's it.

Here's a minimal example using Hypothesis in Python:

A simple property test for a sorting function. Two invariants: sorted order and same elements.

from hypothesis import given, strategies as st

def my_sort(arr):
    return sorted(arr)

@given(st.lists(st.integers()))
def test_sorted_invariant(arr):
    result = my_sort(arr)
    # Property 1: result is sorted
    assert all(result[i] <= result[i+1] for i in range(len(result)-1))
    # Property 2: result contains same elements
    assert sorted(result) == sorted(arr)

Run that with pytest, and Hypothesis will generate 100 random lists by default (the max_examples setting raises that number). The inputs include empty lists, large lists, lists with duplicates, and lists with negative numbers. It can't try every combination, but it covers far more variety than a hand-written suite. And if it finds a failure, it shrinks the list to the smallest example that fails.

That last part is the real magic. Shrinking turns a random failure into a minimal, human-readable test case. Without it, property-based testing would be much less useful.
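
To watch shrinking in isolation, Hypothesis exposes a find function that searches for an input satisfying a predicate and returns the smallest one it can reach. The predicate below stands in for a failing test and is purely illustrative.

```python
from hypothesis import find, strategies as st

# Search for any list of integers whose sum is at least 10,
# then let Hypothesis shrink it toward a minimal example.
minimal = find(st.lists(st.integers()), lambda xs: sum(xs) >= 10)

print(minimal)
```

The result should be far shorter and simpler than the random lists Hypothesis tries first, which is the same effect you see in a shrunk test failure.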

Final Thoughts

I still write example-based tests for the obvious cases. But I now pair them with property-based tests for the non-obvious ones. The combination is powerful: example tests document the expected behavior, and property tests explore the unexpected.

If you've never tried property-based testing, do it on your next bug fix. Write a property that captures the invariant you think you're preserving. Run it. You might be surprised at what it finds.

And if you're using a rate limiter in production, double-check your window boundaries.

Frequently asked questions

How is property-based testing different from fuzz testing?

Fuzz testing throws random input at a program to see if it crashes. Property-based testing also generates random input, but it checks that the program satisfies specific properties (invariants) for every generated input. The key difference is that property-based tests know what correct behavior looks like, so they can catch logical errors, not just crashes.
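
A small standard-library sketch of that difference, with no Hypothesis involved. The buggy_clamp function and both checkers are hypothetical; the point is only that a function can survive a crash-only check while still returning wrong answers.

```python
import random


def buggy_clamp(x, lo, hi):
    """Hypothetical clamp with min and max swapped; never raises."""
    return min(lo, max(x, hi))


def fuzz_only(fn, rng, trials=1000):
    """Fuzz-style check: passes as long as nothing raises."""
    for _ in range(trials):
        lo = rng.randint(-100, 100)
        hi = rng.randint(lo, 100)
        fn(rng.randint(-200, 200), lo, hi)
    return True


def property_check(fn, rng, trials=1000):
    """Property-style check: in-range values must come back unchanged."""
    for _ in range(trials):
        lo = rng.randint(-100, 100)
        hi = rng.randint(lo, 100)
        x = rng.randint(lo, hi)
        if fn(x, lo, hi) != x:
            return (x, lo, hi)  # counterexample
    return None


print(fuzz_only(buggy_clamp, random.Random(0)))       # no crash, so it passes
print(property_check(buggy_clamp, random.Random(0)))  # a counterexample tuple
```

The fuzz loop has no notion of a correct answer, so the swapped min and max go unnoticed. The property check knows what the output should be and reports the first input that violates it.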

What kinds of bugs are property-based tests best at finding?

They excel at finding edge-case bugs that involve complex state, concurrency, or input that falls into unusual combinations. Common categories: off-by-one errors in algorithms, mishandling of empty or null values, violations of invariants under concurrent access, and regressions in serialization/deserialization.

Do I need a special language or framework?

No special language, just a library. Most popular languages have one: Hypothesis (Python), QuickTheories (Java), fast-check (JavaScript), and Hedgehog (Haskell). They all work with your existing test runner.

How do I come up with good properties?

Start with the simplest invariants: idempotency (applying an operation twice gives the same result), round-tripping (serialize then deserialize yields the original), and model-based testing (compare against a simpler but slower reference implementation). As you gain experience, you can write more domain-specific properties.
