
Codebase structure

Let's take a look at how the codebase is structured and how to work within it.

Quick tour

Top-level

When you open the repository, the primary files and folders you should keep in mind are the following:

monolith
├── apps/
├── libs/
├── mono.toml
├── pyproject.toml
└── uv.lock
  • apps/ and libs/ contain most of our code. Each folder within is a separate "project". Apps are user-oriented projects such as a web application, a server, or a complex script. Libraries contain shared code that other projects can reuse.
  • mono.toml is the configuration file automatically created by the Mono CLI.
  • pyproject.toml defines the monorepository configuration. Each project also has its own pyproject.toml to define its dependencies.
  • uv.lock locks the exact versions of Python packages to be used across all the projects in everyone's development environments and deployments. This ensures everyone uses the same package versions.
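
To make the layout concrete, here is a minimal sketch that lists the projects in a checkout. It assumes a folder under apps/ or libs/ counts as a project when it carries its own pyproject.toml, so simple or non-Python apps would be missed. The helper name list_projects and its repo_root argument are illustrative and not part of the Mono CLI.

```python
from pathlib import Path


def list_projects(repo_root: Path) -> dict[str, list[str]]:
    """Return the project folder names found under apps/ and libs/.

    Assumption: a folder is a project when it contains a pyproject.toml.
    """
    projects: dict[str, list[str]] = {}
    for group in ("apps", "libs"):
        group_dir = repo_root / group
        if not group_dir.is_dir():
            # The group folder is absent in this checkout.
            projects[group] = []
            continue
        projects[group] = sorted(
            child.name
            for child in group_dir.iterdir()
            if (child / "pyproject.toml").is_file()
        )
    return projects
```

Calling list_projects(Path.cwd()) from the repository root would return the project names grouped under "apps" and "libs".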

Project structure

Let's dive into a specific project, for example stoneware (in libs).

monolith
├── apps
├── libs
│   ├── shared
│   └── stoneware
│       ├── src
│       │   └── stoneware
│       │       └── code.py
│       ├── scripts
│       └── pyproject.toml
├── pyproject.toml
└── uv.lock

The core structure for a project folder is:

  1. A pyproject.toml that defines the dependencies of the project
  2. A src/{project_name} folder containing the main code of the project

In this example, you can import from the code.py visible above using from stoneware.code import ...

Note that the project name is defined in pyproject.toml, but the top-level folder (here libs/stoneware) generally uses the same name. The only conversion is that dashes (-) become underscores (_).
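
The naming rule can be sketched as two small helpers. This assumes the dash-to-underscore conversion applies to the importable package folder under src/. The function names and the example project name my-lib are hypothetical.

```python
from pathlib import Path


def import_name(project_name: str) -> str:
    """Convert a project name to its importable name (dashes become underscores)."""
    return project_name.replace("-", "_")


def source_dir(project_dir: Path, project_name: str) -> Path:
    """Return the src/{project_name} folder expected inside a project folder."""
    return project_dir / "src" / import_name(project_name)
```

Under that assumption, a hypothetical project named my-lib would be imported as my_lib, with its code in libs/my-lib/src/my_lib.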

Certain projects, notably in apps/, may have a different structure if they are very simple or not using Python.

Beyond this core structure, you can add any other files or folders to a project. It's common to have scripts/ and notebooks/ folders, for example.

Deep dive

Here's the full structure you will likely have in your file explorer:

monolith
├── .dagster/ <-- Dagster configuration
├── .devcontainer/ <-- Dev containers configuration
├── .git/ <-- Git folder (automatic)
├── .github/ <-- GitHub workflows (= deployment logic)
├── .venv/ <-- Python virtual environment (automatic)
├── apps/ <-- Deliverables (apps, servers...)
├── infra/ <-- Special project for cloud & database infrastructure
├── libs/ <-- Shared code libraries
├── scripts/ <-- Scripts that don't fit neatly within one project
├── .dockerignore <-- Docker configuration
├── .gitattributes <-- Git configuration
├── .gitignore <-- What should be ignored by Git
├── .python-version <-- Exact Python version for uv
├── README.md <-- README for GitHub
├── mono.toml <-- Mono configuration file (see below)
├── pyproject.toml <-- Monorepository configuration
├── uv.lock <-- Exact dependency versions for all Python packages
└── workspace.yaml <-- Dagster configuration

Some additional notes:

  • The top-level pyproject.toml contains configuration for the monorepository, development dependencies that should be installed for all projects, and dependency groups that are used for Mono profiles.
  • Ideally put your scripts in a project, not the top-level scripts/.
  • mono.toml is produced by the Mono CLI.
  • .gitattributes is used to make Git ignore the cell outputs in Jupyter notebooks.

Tips and tricks

  • You can create a data folder in the root of the repository; Git ignores it automatically. This is useful for caching downloaded or large data locally when you rerun a script multiple times.
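
As an illustration of that caching pattern, here is a minimal sketch. It assumes the script runs from the repository root so that data/ resolves to the ignored folder. The helper cached_bytes and its parameters are hypothetical and not an existing utility in the codebase.

```python
from pathlib import Path
from typing import Callable

# Assumption: the script runs from the repository root.
DATA_DIR = Path("data")


def cached_bytes(
    name: str,
    produce: Callable[[], bytes],
    data_dir: Path = DATA_DIR,
) -> bytes:
    """Return the cached file content, calling produce() only when no cache exists."""
    path = data_dir / name
    if path.exists():
        return path.read_bytes()
    data_dir.mkdir(parents=True, exist_ok=True)
    content = produce()
    path.write_bytes(content)
    return content
```

The first call pays for the download or computation inside produce(); later runs read the file back from data/ instead.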

Full view dive

  • .dagster/ contains the local Dagster configuration and is used as the DAGSTER_HOME directory in GitHub Codespaces.
  • .devcontainer/ contains configuration to create a dev container for this repository. Dev containers can be used in GitHub Codespaces or locally.
  • .github/ contains our deployment pipelines (often called CI/CD pipelines).
  • infra/ is a unique project containing database and cloud infrastructure code.
  • scripts/ is a mix of various scripts kept here for historical reasons. Ideally, scripts should be placed within the relevant project.

database/, datasmart/, and shared/ are our primary code folders. For any file or subfolder containing Python code in there, you can import them from anywhere: from shared.xxx import aaa

More specifically, database/ and datasmart/ are our two Dagster code locations, while shared/ acts as a project containing common code. See the Dagster guide for more information on how code locations are structured.

pyproject.toml and poetry.lock define our Python dependencies. pyproject.toml is the main file and poetry.lock is automatically generated from it, containing exact package versions to ensure everyone is on the same page. See Managing packages for more information.

Deep dive

Let's take a complete look now:

database
├── .dagster/ <- Dagster configuration
├── .devcontainer/ <- Dev Containers configuration
├── .github/ <- GitHub Actions configuration
├── database/ <- Dagster code location
├── datasmart/ <- Dagster code location
├── db/ <- Database migration system, unused
├── docs/ <- Docs site
├── examples/ <- Quick examples, will be replaced by these docs
├── automation/ <- Automation services for Microsoft Graph sync, scales, and Felt
├── notebooks/ <- Jupyter notebooks
├── scripts/ <- Various scripts
├── shared/ <- Common code
├── tests/ <- Very limited number of tests
├── wheels/ <- Pre-built Python packages
├── .env <- (optional) Environment variables
├── .gitignore <- Git ignore patterns
├── .gitpod.Dockerfile <- Dockerfile for the Gitpod workspace
├── .gitpod.yml <- Gitpod workspace configuration
├── poetry.lock <- Auto-generated
├── pyproject.toml <- Declares Python dependencies
├── README.md <- Good ol' README.md
└── workspace.yaml <- Configures Dagster code locations

Some additional notes about the more confusing ones:

  • automation/: contains automation services for SharePoint sync, scale polling, Microsoft Graph proxying, and Felt
    • Note: Felt is a service we use to build and share maps. This server makes it possible to calculate multimodal routes (across trucking, rail, and barging) from within certain Felt maps.
  • wheels/: some Python packages can't be installed in a regular manner in our environments, or require super heavy dependencies during installation. To avoid bloating our environments, they are pre-packaged through a script and put here.
  • .env: optional file to define environment variables. Mostly used to define a DOPPLER_TOKEN, see Setting up a dev environment.

Managing packages

We use Poetry as our package manager. I highly recommend reading the "Basic usage" section of the Poetry docs in addition to the current documentation.

High-level overview

Here's a quick review of the main concepts:

  • pyproject.toml is where dependencies are defined
  • poetry.lock is a lockfile, which specifies the exact dependencies that should be installed to match the requirements defined in pyproject.toml. It is auto-generated by Poetry when adding packages
  • poetry add <package> adds a package
  • poetry remove <package> removes a package
  • poetry install installs the packages specified in pyproject.toml and poetry.lock into the virtual environment, creating it if necessary.
  • poetry lock [--no-update] refreshes the lockfile. The --no-update flag tells Poetry not to update already-locked package versions while refreshing the lockfile. This flag is generally recommended.

Dependency groups and extras

If you look at the pyproject.toml, you'll notice a layout similar to this:

[tool.poetry]
...

# Default dependency group, shared across all projects
[tool.poetry.dependencies]
numpy = "^1.25.2"
...

# SFA dependencies, used in `database`
[tool.poetry.group.sfa.dependencies]
pvlib = "^0.10.1"
...

# Exploration model dependencies, used in `database`
[tool.poetry.group.exploration_model.dependencies]
rasterio = "^1.3.8"
...

# Dev dependencies, only used during development
[tool.poetry.group.dev.dependencies]
matplotlib = "3.8.0"
...

# Optional PyTorch dependency
[tool.poetry.group.torch]
optional = true

[tool.poetry.group.torch.dependencies]
torch = "^2.5.1"
...

The Poetry feature used here is called dependency groups.

It allows us to have a single pyproject.toml file and a single virtual environment in development, while being able to deploy our different services with different sets of dependencies. For example, the datasmart code location does not need some of the heavyweight dependencies in the sfa or exploration_model groups, allowing it to deploy faster.

By default:

  • poetry add <package> installs the dependency in [tool.poetry.dependencies], the main dependency group shared by all projects.
  • poetry add --group <group> <package> (for example, --group dev) installs the dependency in the specified dependency group.

In general, when you install dependencies, especially heavy ones, it's worth thinking about which dependency group they belong in.

You can also move dependencies between dependency groups in pyproject.toml manually and run poetry lock --no-update to update the lockfile. This is useful, for example, if you install a dependency in the main group by default and want to reorganize things afterwards.

Troubleshooting

Common scenario: merge conflict on pyproject.toml and poetry.lock

This can happen if both you and someone else installed dependencies separately before rebasing or merging. Here's how to fix it:

  1. Open pyproject.toml and resolve the conflict there by manually updating the dependency requirements to what you want.
  2. Delete poetry.lock.
  3. Run poetry lock --no-update to regenerate the lockfile.
  4. Run poetry install to make sure you have everything installed.

Environment variables

Environment variables are values like database URLs, API tokens, and AWS credentials that vary between development and production. They are generally sensitive, kept secret, and never committed to the repository.

Importing

Apps expose a local env object from their own env.py:

from datasmart.env import env

pg_url = env.PGURL

For notebooks and ad-hoc scripts, shared.env remains available as a compatibility facade:

from shared.env import PGURL, ASSETS_BUCKET

Defining

Shared capabilities declare their own EnvVars class near the code that owns the variables:

from shared.utils.environment_variables import EnvVars, env_var


class DatabaseEnv(EnvVars):
    PGURL: str
    PROD_PGURL: str = env_var(fallback="PGURL")

Applications compose the capabilities they need into one app-local ProjectEnv:

from shared.db.env import DatabaseEnv
from shared.s3.env import S3Env
from shared.sharepoint.env import SharePointEnv
from shared.utils.environment_variables import ProjectEnv, RuntimeEnv


class DatabaseProjectEnv(ProjectEnv, RuntimeEnv, DatabaseEnv, S3Env, SharePointEnv):
    _doppler_project = "database"


env = DatabaseProjectEnv.current()

What's happening:

  • EnvVars classes describe a set of typed variables owned by a module or capability.
  • ProjectEnv.current() loads the app's Doppler project and development fallback sources.
  • Local development falls back through monolith-dev, which aggregates variables needed across the repository.
  • Doppler also has a shared project for cross-deployment credentials such as SharePoint and Microsoft values.
  • Required variables fail early when the env object is created.
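
The fallback and fail-early behaviour can be illustrated with a small standalone sketch. This is not the implementation in shared.utils.environment_variables: resolve_env and MissingEnvVarError are hypothetical names, and the sketch assumes that a fallback means "use the value of the named variable when this one is unset".

```python
import os
from typing import Mapping, Optional


class MissingEnvVarError(RuntimeError):
    """Raised when a required variable and its fallback are both unset."""


def resolve_env(
    required: list[str],
    fallbacks: Mapping[str, str],
    source: Optional[Mapping[str, str]] = None,
) -> dict[str, str]:
    """Resolve every required variable up front so missing ones fail early."""
    source = os.environ if source is None else source
    resolved: dict[str, str] = {}
    missing: list[str] = []
    for name in required:
        value = source.get(name)
        if value is None and name in fallbacks:
            # Assumed semantics: fall back to the value of another variable.
            value = source.get(fallbacks[name])
        if value is None:
            missing.append(name)
        else:
            resolved[name] = value
    if missing:
        raise MissingEnvVarError(
            f"Missing environment variables: {', '.join(missing)}"
        )
    return resolved
```

With only PGURL set, resolve_env(["PGURL", "PROD_PGURL"], {"PROD_PGURL": "PGURL"}) would give both names the same value; with neither set, it raises before any other code runs.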

See Secrets & environments for the full policy and loading order.