Unit-testing SQLAlchemy code against with Alembic migrations, and doing it fast

Tue 19 May 202620 min read

Forewords

Testing code that talks to a real database is one of those topics where everybody has an opinion and almost nobody is happy with the result. The two extremes are well-known: at one end, the test suite mocks out the ORM entirely and ends up validating that the mocks behave like the mocks; at the other end, every test spins up a full PostgreSQL container, applies all the migrations, and the suite takes thirty minutes to run on CI.

The middle ground is, in my experience, the only one that pays off in the long run: tests run against a real SQLAlchemy session, against a real schema produced by the project's own Alembic migrations, but the database itself is fast, ephemeral, and cleanly reset between tests. This article describes a pattern that does exactly that. The pattern itself is dead simple once you have seen it; the article is mostly about why the obvious approaches do not work and what you need to know to make this one work.

The code samples are written in a generic, illustrative style — the goal is to make the pattern reusable on any SQLAlchemy + Alembic project rather than to document a specific codebase. The examples assume Python 3.10+, SQLAlchemy 2.x (the future=True flag) and Alembic in its standard configuration.

What we want from our tests

Before writing any test infrastructure, it is worth being explicit about the properties we want.

Property 1 — schema parity. The schema used in the tests must be the same schema that production will eventually run. The only way to guarantee that is to apply the project's own Alembic migrations. If we let SQLAlchemy create the tables from the ORM metadata (the famous Base.metadata.create_all), we will miss everything the ORM cannot express: server-side defaults, check constraints, indexes defined directly in migrations, partial uniqueness, computed columns, data migrations, etc. We also silently miss the order in which migrations apply, which is exactly the thing we want the suite to exercise. Finally, we want to test for a new migration step M(N+1) while we are developing.

while you are developing your branch and assuming you have a schema migration, you need to know if the migration M(N+1) plays well with the previous ones M1..N ones. You will certainly overwrite M(N+1) while developing, or squash all of them before merge, or simply start with some mock data at any step \(N \in [[1..N]]\) to remove any risk coming from the migration at deployment time.

Property 2 — test isolation. Each test must see an empty database or a database that is in a known, controlled state. A test that leaks rows into the next one is a test that will eventually fail intermittently for reasons nobody can reproduce, making de facto the tests less trustworthy (a problem that occurs way too often in dev teams).

Property 3 — speed. The whole point of a unit test suite is that you can run it dozens of times a day. Re-running thirty Alembic migrations before each test will not survive contact with reality. Besides that, every CI second costs money and is limited on platforms such as GitHub, Bitbucket or Gitlab. The faster the better (of course within the constraints of correctness), for the devs and the pocket.

Property 4 — production-faithful API surface. The tests should use the same machinery for manipulating the ORM as the production code uses. In particular, we do not want a separate "testing-only" data access code: if the test calls MyService.do_thing(...) and the service grabs its session from a SQLEngine.session() helper, then the test must influence what that helper returns.

These four properties pull in different directions, which is why the SQLAlchemy documentation devotes a whole section (as well as helpers/support code) to the topic and why most projects end up with their own homegrown infrastructure.

Property 1 was initially the main motivation for the design in this article, but we can have all the rest for almost free. Migrations are a difficult beast once we consider the deployment to production: they often require an exclusive lock on the DB making it at best read-only, they cannot be too long to run, they should be as safe as possible from disaster and be able to roll-back, etc. This is why any important migration is done is several steps, both in the schema and the code: in order to reach the target schema, one has to migrate through several intermediate schemas together with several intermediate code that go along. Since migrations and code are tightly coupled, the testing of this coupled setup becomes important.

The three approaches that almost work

Approach A — `Base.metadata.create_all` per test

This is what every SQLAlchemy tutorial shows you. You build an in-memory SQLite engine and ask SQLAlchemy to create every table the ORM knows about.

from sqlalchemy import create_engine
from myapp.db.common import Base

engine = create_engine("sqlite:///:memory:", future=True)
Base.metadata.create_all(engine)

This is not satisfactory: it works ok for a five-minute demo but it violates all but property 3. Most importantly, you are not applying any of the migrations: the schema you get is whatever the ORM can derive from your model classes at the time the test executes, not what Alembic produced after several months of migrations. Anything that lives only in migrations (data backfills, check constraints written by hand, columns added by raw SQL) silently disappears from the test schema.

Approach B — the SQLAlchemy "joining a session into an external transaction" recipe

The official SQLAlchemy documentation propose a recipe based on nested transactions: you open a connection, start a transaction on it, bind the session to that connection, install after_transaction_end listeners to restart savepoints, and roll back the outer transaction at the end of the test (eg. in the tearDown). The advertised benefit is that you apply your migrations only once and the rollback gives every test a clean state, without ever truncating tables.

In theory this is elegant. In practice, it is one of those things that mostly works until it suddenly does not. The behaviour depends on various configuration variables such as the exact session flags (autocommit, autoflush, expire_on_commit), on whether the code under test calls session.commit() (real production code does), on whether nested savepoints are emulated correctly by the dialect, and on the precise SQLAlchemy version.

Most importantly: this design is not robust to whatever you do within the test. For instance, you may access the existing engine object in part of the code and establish transactions there, and this would ultimately propagate up to the setup/tearDown and to the next test down the line. In order to have full test isolation with this scheme, you then need to prevent the code from using various constructs that implies direct engine or connection accesses: basically one should work only with Session (and hope for the best): this discards for instance the migrations because they are somehow tinkering with the engine.

Approach C — spin a "real" DB for every test

Because of the variability of SQLAlchemy with respect to the engine backend, one may want to spin a real DB (for instance in a containerized manner) for the sake of the tests. The DB will obviously be empty when started, and migrations have to be applied to it to make it usable with the current version of the code.

This is of course the closest solution to a production environment, but it is:

dead slow: you should spin a DB and tear it down as an external process with which you need to synchronize (wait for it to be ready etc),
the problem of running the migrations has not been addressed,
the problem of leakage of one test to another has been solved only partially: do you spin a new DB at every test? or you amortize this cost by using the same DB for several tests? The former will bring perfect isolation but of course very slow with a total overhead as a function of the number of tests, the latter will enable potential leaks,
you would need to synchronize the setup/teardown of the DB in your test and/or your CI, which will come with its own set of problems as well.

To summarize: this approach is an important one for checking complex scenarios as a pre-deployment and integration sanity check, with a nice property of being playable on the real data - from eg. a sandbox instance - and is perfect for checking the migrations, but hardly usable for simple, unitary operations of the code.

The pattern: migrate once on disk, snapshot per test in memory

The idea is to introduce a two-level cache.

At high level (for instance the class level through setUpClass or even higher: module, global etc), we create a temporary SQLite file, point our application's engine at it, and run the full Alembic migrations. After that, the file on disk is the empty-but-migrated schema we want every test to see. We never touch this file again from the application — it is purely a snapshot.

At the lowest level, which is the test level (setUp), we create a brand-new in-memory SQLite database for the test, copy the contents of the snapshot file into it using SQLite's built-in .backup() API, and rebind our engine to that fresh in-memory database. SQLite's .backup() copies the database byte-for-byte at the storage level, which is essentially free for an empty schema. The test then runs against an in-memory database that already contains every table, index, constraint and seed row Alembic produced, but no test-specific data.

In tearDown, the in-memory database is simply forgotten (no garbage to clean up — that is the nice side of an in-memory engine), and the engine is reset.

The teardown at the highest level (eg. tearDownClass) just deletes the temporary snapshot file.

This gives us:

schema parity, because the snapshot was produced by the real Alembic migrations,
test isolation because each test gets a fresh database and there is absolutely no possible leakage from the test to the setup/teardown or next test,
speed, because migrations time is amortized by moving them at higher level (for instance run once per test class) and not once per test, and the DB actually used for testing is in memory,
production-faithful API, because we are just swapping the singleton engine in place; the code under test calls Session() exactly as it does in production. Plus, there is nothing limits the code one can write within the test, because everything from the engine, transactions, sessions etc are accessible as in real production code.

Of course, one limitation is that it uses the SQLite dialect and we'll come back to this at the end of the article.

The whole thing is roughly fifty lines of code.

A minimal SQLAlchemy engine singleton

To set the scene, here is the kind of engine accessor I assume in the rest of the article. The important property is that the engine and session factory are reachable through a single, replaceable entry point.

# myapp/db/common.py
import os
from sqlalchemy import create_engine, MetaData
from sqlalchemy.orm import declarative_base, sessionmaker

meta = MetaData(naming_convention={
    "ix": "ix_%(column_0_label)s",
    "uq": "uq_%(table_name)s_%(column_0_name)s",
    "fk": "fk_%(table_name)s_%(column_0_name)s_%(referred_table_name)s",
    "pk": "pk_%(table_name)s",
})

Base = declarative_base(metadata=meta)


class SQLEngine:
    engine = None
    session_factory = None

    @staticmethod
    def sql_engine():
        if SQLEngine.engine is None:
            url = os.environ["APP_DB_URL"]
            SQLEngine.engine = create_engine(url, future=True)
        return SQLEngine.engine

    @staticmethod
    def session():
        if SQLEngine.session_factory is None:
            SQLEngine.session_factory = sessionmaker(future=True)
        return SQLEngine.session_factory(bind=SQLEngine.sql_engine())

There is nothing surprising here. In production, APP_DB_URL points at PostgreSQL or MySQL or any backend; in tests, we will replace SQLEngine.engine with a SQLite engine and let everything else flow through the same accessors. The only requirement on the production code is that it does not cache the engine in random modules — it always goes through SQLEngine.sql_engine() or SQLEngine.session(). If your codebase already enforces that, the pattern below is a drop-in. If it does not, the very first refactor to do is to introduce that indirection, because it pays off in many places beyond tests.

Driving Alembic from Python

The other building block is calling Alembic programmatically. The migrations are normally driven from the command line via alembic upgrade head, but Alembic exposes the same operations as a Python API. We need two pieces: the path to the alembic.ini configuration file, and the path to the migrations script directory.

# tests/alembic_helpers.py
from pathlib import Path
from alembic import command, config

def _alembic_root() -> Path:
    # adjust to where alembic.ini lives in your project
    return Path(__file__).parent.parent

def upgrade_to_head(engine):
    # this is the correct folder by definition of _alembic_root
    cfg = config.Config(str(_alembic_root() / "alembic.ini"))

    # this may require adjustment depending on your setup
    cfg.set_main_option("script_location",
                        str(_alembic_root() / "migrations"))

    with engine.begin() as connection:
        cfg.attributes["connection"] = connection
        command.upgrade(cfg, "head")

Two details worth pointing out. First, the migrations are applied inside an explicit transaction opened by engine.begin(). We pass the connection to Alembic via cfg.attributes["connection"], which is the documented hook for sharing a connection with migrations/env.py. Inside the project's migrations/env.py, the standard boilerplate looks for that attribute and uses it instead of opening a new connection. If you are reusing an existing project, that boilerplate is almost certainly already in place — Alembic's init template adds it by default.

Second, we never spawn a subprocess for running the migrations. Running Alembic in-process means we share the working directory, the Python path, the logging configuration, and the model metadata. It also means we can capture exceptions cleanly when a migration fails.

The `DatabaseTest` base class

We are now ready to put it together: the approach we took below is to amortize the migration cost per test class, but feel free to move this amortization higher up (module or global).

The class below extends unittest.TestCase and implements the two-level cache.

# tests/helpers.py
# needs also an empty __init__.py in the same folder
import random
import sqlite3
import string
import tempfile
import unittest
from contextlib import ExitStack
from sqlalchemy import create_engine

# our singleton
from myapp.db.common import SQLEngine

# our test helper
from .alembic_helpers import upgrade_to_head


class DatabaseTest(unittest.TestCase):
    """Base class for tests that need a real migrated database.

    Every test sees an empty database whose schema is exactly the one
    produced by running ``alembic upgrade head`` against an empty file.
    """

    @classmethod
    def setUpClass(cls) -> None:
        # remember the previous engine so we can restore it after the suite
        cls._previous_engine = SQLEngine.engine

        # the snapshot lives in a real temporary file so SQLite can ``.backup()``
        # from it later; tempfile gives us automatic cleanup linked to the ExitStack
        cls._snapshot_file = tempfile.NamedTemporaryFile(suffix=".db")
        cls._exit_stack = ExitStack()
        cls._exit_stack.enter_context(cls._snapshot_file)

        # build the snapshot engine, apply all the migrations once
        snapshot_engine = create_engine(
            "sqlite:///" + cls._snapshot_file.name, future=True
        )

        # here used for running the migration, will be replaced before the test
        SQLEngine.engine = snapshot_engine
        upgrade_to_head(snapshot_engine)

    @classmethod
    def tearDownClass(cls) -> None:
        SQLEngine.engine = cls._previous_engine
        cls._exit_stack.close()

    def setUp(self) -> None:
        # use a randomly-named SQLite in-memory database with the URI form
        # and a shared cache, so that two independent ``sqlite3.connect``
        # calls can see the same in-memory storage
        shared_name = "test_db_" + "".join(random.sample(string.ascii_letters, 8))
        memory_url = (
            f"file:{shared_name}?mode=memory&cache=shared&uri=true"
        )

        # rebind the application's engine to the fresh memory database
        SQLEngine.engine = create_engine("sqlite:///" + memory_url, future=True)
        self._held_connection = SQLEngine.engine.connect()  # keep the cache alive

        # copy the migrated snapshot into the memory database, byte-for-byte
        source = sqlite3.connect(self._snapshot_file.name)
        dest = sqlite3.connect(
            f"file:{shared_name}?mode=memory&cache=shared", uri=True
        )
        source.backup(dest)
        source.close()
        dest.close()

    def tearDown(self) -> None:
        self._held_connection.close()

That is the whole pattern. Let me walk through the parts that are not obvious.

Why a file on disk for the snapshot?

Simple answer: for debugging purposes. I believe at the time I was developing this, I wanted to know what is the state of the clean DB and inspect the migrations were doing the job. For this: put a breakpoint after the migration, start your preferred SQLite inspection tool, and point it to the file. Done: you know what is going on with the DB.

Since the DB is empty, usually the DB file itself remains very small and the overhead for having it on disk is unnoticeable (OS page cache and so on). If this is a problem for your setup, just try moving this as an in memory DB.

With every temporary file, make sure you clean your mess up (always do this, this is about hygiene): tempfile.NamedTemporaryFile together with ExitStack lets us use the NamedTemporaryFile without the usual with construct and permits the delaying the cleanup to the tearDownClass part (plus it makes it easy to extend with extra resources later).

Why the URI-based shared cache for the per-test database?

The reason is that two sqlite3.connect(":memory:") calls in the same process produce two independent databases — they are not aliases of the same storage. We need two connections that share storage: one is the application's engine connection (created by SQLAlchemy), the other is the raw sqlite3 connection we hand to source.backup(dest). The file:<name>?mode=memory&cache=shared&uri=true form is the documented way to obtain that. The name is randomized per test to avoid cross-test interference if two test methods accidentally overlap (in particular when using parallel test runners).

Why keep a connection open in `_held_connection`?

A shared in-memory SQLite database persists only as long as at least one connection to it is open. If the last connection closes between source.backup(dest) and the first query of the test, the database disappears, and holding a dedicated connection for the duration of the test is the simplest way to keep the storage alive. The connection is released in tearDown, which also lets the database be reclaimed whenever the garbage collection kicks in.

Why swap `SQLEngine.engine` rather than passing the engine to the test?

As said before, the code under test reaches for its engine via the singleton, exactly as it does in production. The test does not call any application function with a magical extra engine argument. This is a strong property: the test exercises the production code path, including the part that acquires a session. The price to pay is global mutable state, which is why we carefully save the previous value in setUpClass and restore it in tearDownClass.

Using the base class

A test inheriting from DatabaseTest reads exactly like production code. There is no engine argument to thread through, no special session factory, no manual create_all. The session is opened with the same helper the application uses, and commit() actually commits.

# tests/test_example.py
import datetime
import uuid

from sqlalchemy import select, func

from myapp.db.common import SQLEngine
from myapp.db.models import Organization, Order

from .helpers import DatabaseTest


class TestOrderQueries(DatabaseTest):

    def setUp(self):
        super().setUp()

        with SQLEngine.session() as session:
            self.orgs = [
                Organization(
                    name=f"test_org_{i}",
                    uuid=str(uuid.uuid4()).replace("-", ""),
                )
                for i in range(3)
            ]
            session.add_all(self.orgs)
            session.commit()

    def test_orders_are_counted_per_organization(self):
        org = self.orgs[0]

        with SQLEngine.session() as session:
            # objects created in another session are "detached"; merge re-attaches them
            org = session.merge(org)

            session.add_all([
                Order(
                    organization=org,
                    product_id="p_1",
                    location="loc",
                    placed_time=datetime.datetime.utcnow(),
                    placed_time_offset="Z",
                    demand=1.0,
                )
                for _ in range(5)
            ])
            session.commit()

        with SQLEngine.session() as session:
            org = session.merge(org)
            count = session.execute(
                select(func.count(Order.uuid)).where(Order.organization == org)
            ).scalar()

        self.assertEqual(count, 5)

A few notes on the test itself, because they are part of the cost of this approach and worth being aware of:

Detached objects. Each with SQLEngine.session() as session: block opens a fresh session. Once the block exits, every ORM object created inside it is detached — it is not tracked by any session anymore. To use it in another session, you call session.merge(obj). This is not a quirk of the testing pattern, it is how SQLAlchemy works in general; it just becomes very visible in tests that span several sessions in sequence.

This allows to prepare a state of the DB in the setUp, and transfer that state to the underlying tests.

commit() really commits. Because we use a real, separate database per test, the test can freely commit. There is no outer transaction to roll back, no savepoint to restart. If a piece of production code commits halfway through, the test sees exactly what production would see.
Inspecting the database from the outside. During development it is sometimes useful to look at the actual rows the test produced. The pattern makes this easy: replace the in-memory database with a temporary file (just change the URL in setUp), run the failing test once, and open the file with sqlite3 or a GUI tool. The migrated schema is already there, exactly as production has it.

Caveats and known limitations

This is not a silver bullet, and pretending otherwise would be misleading.

SQLite is not your production database. The most important limitation is also the most obvious: the tests run against SQLite, but the production system runs against PostgreSQL, MySQL or similar. SQLite is permissive — it does not enforce some constraints the same way, its ALTER TABLE story is different, and its concurrency model is unique. Anything that depends on database-specific behaviour (JSON operators, partial indexes with predicates, advisory locks, listen/notify, real-write isolation levels) cannot be exercised here. Any extensions to the DB (GIS for instance) might be tricky to get into this approach. The pattern works for the bulk of your tests — the ORM calls, the query construction, the application-level transactions — but a small set of integration tests against the real DBMS is still necessary. Treating them as two separate test suites in CI is a healthy split: the SQLite suite stays fast and runs on every commit, the real-DBMS suite runs on merges to main.

Migrations must be runnable on SQLite. If a migration uses a Postgres-only construct (e.g. USING clauses in ALTER COLUMN, custom types, CREATE EXTENSION), the snapshot build will fail. The pragmatic workaround is to guard those operations on the dialect inside the migration, which is what every project doing dual-target work ends up doing anyway. The fact that the test suite fails loudly when you forget is a feature, not a bug.

Test parallelism. If you run your test bed with any parallel runner, you should make sure that each parallel instance gets it own engine singleton. The rest is already handled as the file returned by NamedTemporaryFile is unique and thread safe (usually, contract delegated to the OS calls), and the random name for the in-memory DB has little probability of collision with existing ones, as soon as you use a different seed per thread/instance. In our case, the snapshot file is opened once per test class, so it is fine as long as each class lives on a single worker.

Class-level state. The pattern caches the snapshot per class. If you mutate SQLEngine.engine or other singletons inside a test in ways that survive tearDown, you will leak state into the next test. The base class restores what it set, but it cannot know about every side effect your production code introduces. The usual rule applies: keep global state to a minimum, and when you have to mutate it, restore it in a finally clause.

Closing thoughts

The pattern is not clever. It applies migrations on a temporary file once, copies that file into a fresh in-memory database for each test, and lets the application's normal engine accessor do its job. It may require changes in your code (singleton, migration, etc) but in a very small amount. The cleverness is in what it does not do: it does not try to be smart with nested transactions, it does not introduce a parallel "testing schema", and it does not ask the test author to remember a special session-acquisition idiom.

The result, after a few thousand tests, is a suite that runs in seconds, exercises the project's real migrations, and almost never produces a "works on my machine" surprise. The test code itself reads like production code with a few extra session.merge calls. When something fails, it fails for the reason you would expect: a constraint violation, a missing index, a query that returns the wrong rows. The infrastructure stays out of the way, which is, in the end, the only thing one wants from test infrastructure.

If you take only one thing from this article, take this: when in doubt about how to isolate database state across tests, the simplest correct answer is to give each test its own database. Modern SQLite makes that essentially free as described in the article. Everything else — savepoints, nested transactions, fixture factories — is an optimization on top, and one you very rarely need.

Send me an email if you find this article useful!

yayi C++, python, image processing, hacking, etc