Codebase structure
Let's take a look at how the codebase is structured and how to work within it.
Quick tour
Top-level
When you open the repository, the primary files and folders you should keep in mind are the following:
```
monolith
├── apps/
├── libs/
├── mono.toml
├── pyproject.toml
└── uv.lock
```
- `apps/` and `libs/` contain most of our code. Each folder within is a separate "project". Apps are user-oriented projects: a web application, a server, a complex script... Libraries contain shared code reusable across other projects.
- `mono.toml` is the configuration file automatically created by the Mono CLI.
- `pyproject.toml` defines the monorepository configuration (a sketch follows below). Each project has its own `pyproject.toml` as well to define its dependencies.
- `uv.lock` locks the exact versions of Python packages to be used across all the projects, in everyone's development environments and deployments. This ensures everyone uses the same package versions.
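As an illustration, a uv-based monorepo's top-level `pyproject.toml` typically declares a workspace that includes every project. This is a minimal sketch; our actual file certainly contains more:

```toml
# Top-level pyproject.toml (sketch)
[project]
name = "monolith"
version = "0.1.0"
requires-python = ">=3.11"

# uv workspace: every folder under apps/ and libs/ is a member project
[tool.uv.workspace]
members = ["apps/*", "libs/*"]
```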
Project structure
Let's dive into a specific project, say `stoneware` (in `libs/`).
```
monolith
├── apps
├── libs
│   ├── shared
│   └── stoneware
│       ├── src
│       │   └── stoneware
│       │       └── code.py
│       ├── scripts
│       └── pyproject.toml
├── pyproject.toml
└── uv.lock
```
The core structure for a project folder is:
- A `pyproject.toml` that defines the dependencies of the project
- A `src/{project_name}` folder containing the main code of the project
In this example, you can import from the `code.py` visible above using `from stoneware.code import ...`.
Note that the project name is defined in `pyproject.toml`, but generally the same name is used for the top-level folder (here, `libs/stoneware`). The only conversion is that dashes (`-`) get converted to underscores (`_`), as in the hypothetical example below.
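For instance, a hypothetical library named `data-tools` would be laid out and imported like this:

```
libs/data-tools/
├── pyproject.toml        # name = "data-tools"
└── src/
    └── data_tools/       # dashes become underscores
        └── loaders.py    # from data_tools.loaders import ...
```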
Certain projects, notably in `apps/`, may have a different structure if they are very simple or not using Python.
After that, you can define any other files or folders in there. It's common to have `scripts/` and `notebooks/` folders, for example.
Deep dive
Here's the full structure you will likely have in your file explorer:
```
monolith
├── .dagster/         <-- Dagster configuration
├── .devcontainer/    <-- Dev containers configuration
├── .git/             <-- Git folder (automatic)
├── .github/          <-- GitHub workflows (= deployment logic)
├── .venv/            <-- Python virtual environment (automatic)
├── apps/             <-- Deliverables (apps, servers...)
├── infra/            <-- Special project for cloud & database infrastructure
├── libs/             <-- Shared code libraries
├── scripts/          <-- Scripts that don't fit neatly within one project
├── .dockerignore     <-- Docker configuration
├── .gitattributes    <-- Git configuration
├── .gitignore        <-- What should be ignored by Git
├── .python-version   <-- Exact Python version for uv
├── README.md         <-- README for GitHub
├── mono.toml         <-- Mono configuration file (see below)
├── pyproject.toml    <-- Monorepository configuration
├── uv.lock           <-- Exact dependency versions for all Python packages
└── workspace.yaml    <-- Dagster configuration
```
Some additional notes:
- The top-level `pyproject.toml` contains configuration for the monorepository, development dependencies that should be installed for all projects, and dependency groups that are used for Mono profiles.
- Ideally, put your scripts in a project, not the top-level `scripts/`.
- `mono.toml` is produced by the Mono CLI.
- `.gitattributes` is used to make Git ignore the cell outputs in Jupyter notebooks (one common setup is sketched below).
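The exact mechanism lives in the file itself, but a common approach is a clean filter such as nbstripout. A sketch of the idea; our actual configuration may differ:

```
# .gitattributes (sketch): strip notebook outputs before Git stores them
*.ipynb filter=nbstripout
```

The filter itself has to be registered in your Git config, which nbstripout does via `nbstripout --install`.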
Tips and tricks
- You can create a `data` folder in the root of the repository; it will be automatically ignored by Git. This is useful to cache downloaded or large data locally if you're rerunning a script multiple times (see the sketch below).
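For instance, a script that needs a large file on every run might cache it like this. A minimal sketch; the URL, file name, and helper are placeholders:

```python
from pathlib import Path
import urllib.request

# Hypothetical example: cache a large download under data/ so reruns are fast.
# The data/ folder sits at the repository root and is ignored by Git.
DATA_DIR = Path("data")
DATA_DIR.mkdir(exist_ok=True)

def fetch(url: str, filename: str) -> Path:
    """Download url to data/filename once; reuse the cached copy afterwards."""
    target = DATA_DIR / filename
    if not target.exists():
        urllib.request.urlretrieve(url, target)
    return target
```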
Full view
- `.dagster/` contains the local Dagster configuration and is used as the `DAGSTER_HOME` directory in GitHub Codespaces.
- `.devcontainer/` contains configuration to create a dev container for this repository. Dev containers can be used in GitHub Codespaces or locally.
- `.github/` contains our deployment pipelines (= CI/CD pipelines, as they are often called).
- `infra/` is a unique project containing database and cloud infrastructure code.
- `scripts/` is a mix of various scripts kept here for historical reasons. Ideally, scripts should be placed within the relevant project.
- `database/`, `datasmart/`, and `shared/` are our primary code folders. Any file or subfolder containing Python code in there can be imported from anywhere: `from shared.xxx import aaa`.

More specifically, `database/` and `datasmart/` are our two Dagster code locations, while `shared/` acts as a project containing common code. See the Dagster guide for more information on how code locations are structured.
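For orientation, a `workspace.yaml` declaring two code locations typically looks something like this. A sketch only; the real file at the repository root is authoritative:

```yaml
# workspace.yaml (sketch): one entry per Dagster code location
load_from:
  - python_package: database
  - python_package: datasmart
```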
`pyproject.toml` and `poetry.lock` define our Python dependencies. `pyproject.toml` is the main file and `poetry.lock` is automatically generated from it, containing exact package versions to ensure everyone is on the same page. See Managing packages for more information.
Deep dive
Let's take a complete look now:
```
database
├── .dagster/            <- Dagster configuration
├── .devcontainer/       <- Dev Containers configuration
├── .github/             <- GitHub Actions configuration
├── database/            <- Dagster code location
├── datasmart/           <- Dagster code location
├── db/                  <- Database migration system, unused
├── docs/                <- Docs site
├── examples/            <- Quick examples, will be replaced by these docs
├── automation-server/   <- Automation endpoints for Felt and SharePoint
├── notebooks/           <- Jupyter notebooks
├── scripts/             <- Various scripts
├── shared/              <- Common code
├── tests/               <- Very limited number of tests
├── wheels/              <- Pre-built Python packages
├── .env                 <- (optional) Environment variables
├── .gitignore           <- Git ignore patterns
├── .gitpod.Dockerfile   <- Dockerfile for the Gitpod workspace
├── .gitpod.yml          <- Gitpod workspace configuration
├── poetry.lock          <- Auto-generated
├── pyproject.toml       <- Declares Python dependencies
├── README.md            <- Good ol' README.md
└── workspace.yaml       <- Configures Dagster code locations
```
Some additional notes about the more confusing ones:
- `automation-server/` contains automation endpoints for SharePoint and Felt. Note: Felt is a service we use to build and share maps; this server provides the ability to calculate multimodal routes, across trucking, rail, and barging, from within certain Felt maps.
- `wheels/`: some Python packages can't be installed in a regular manner in our environments, or require super heavy dependencies during installation. To avoid bloating our environments, they are pre-packaged through a script and put here.
- `.env`: optional file to define environment variables, mostly used to define a `DOPPLER_TOKEN` (a sketch follows below); see Setting up a dev environment.
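A minimal `.env` might contain only the Doppler token; the value below is a placeholder:

```
# .env (sketch) - never commit real values
DOPPLER_TOKEN=dp.st.dev.xxxxxxxxxxxx
```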
Managing packages
We use Poetry as our package manager. I highly recommend reading the "Basic usage" section of the Poetry docs in addition to the current documentation.
High-level overview
Here's a quick review of the main concepts:
- `pyproject.toml` is where dependencies are defined.
- `poetry.lock` is a lockfile, which specifies the exact dependencies that should be installed to match the requirements defined in `pyproject.toml`. It is auto-generated by Poetry when adding packages.
- `poetry add <package>` adds a package.
- `poetry remove <package>` removes a package.
- `poetry install` installs the packages specified in `pyproject.toml` and `poetry.lock` into the virtual environment, creating it if necessary.
- `poetry lock [--no-update]` refreshes the lockfile. The `--no-update` flag tells Poetry not to update installed packages while refreshing the lockfile; it is generally recommended.
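Put together, a typical sequence looks like this (a sketch; `requests` is just an example package):

```
# Add a runtime dependency (updates pyproject.toml and poetry.lock)
poetry add requests

# Sync your virtual environment with the lockfile
poetry install

# After hand-editing pyproject.toml, refresh the lockfile
# without upgrading already-installed packages
poetry lock --no-update
```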
Dependency groups and extras
If you look at the `pyproject.toml`, you'll notice a layout similar to this:
```toml
[tool.poetry]
...

# Default dependency group, shared across all projects
[tool.poetry.dependencies]
numpy = "^1.25.2"
...

# SFA dependencies, used in `database`
[tool.poetry.group.sfa.dependencies]
pvlib = "^0.10.1"
...

# Exploration model dependencies, used in `database`
[tool.poetry.group.exploration_model.dependencies]
rasterio = "^1.3.8"
...

# Dev dependencies, only used during development
[tool.poetry.group.dev.dependencies]
matplotlib = "3.8.0"
...

# Optional PyTorch dependency
[tool.poetry.group.torch]
optional = true

[tool.poetry.group.torch.dependencies]
torch = "^2.5.1"
...
```
The Poetry feature used here is called dependency groups. It allows us to have a single `pyproject.toml` file and a single virtual environment in development, while being able to deploy our different services with different sets of dependencies. For example, the `datasmart` code location does not need some of the heavyweight dependencies in the `sfa` or `exploration_model` groups, allowing it to deploy faster.
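As an illustration (assuming Poetry 1.2+; the exact flags our deploy scripts use may differ), a deployment can exclude groups it doesn't need, while the optional `torch` group must be requested explicitly:

```
# Install without the heavyweight groups (e.g. for datasmart)
poetry install --without sfa,exploration_model

# Opt in to the optional torch group where it's needed
poetry install --with torch
```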
By default:
- `poetry add` installs dependencies in `[tool.poetry.dependencies]`, the main dependency group shared by all projects.
- `poetry add --group dev` installs dependencies in the specified dependency group (here, `dev`).
In general, when you install dependencies, especially heavy ones, it's worth thinking about which dependency group they should go into.
You can also move dependencies between dependency groups in `pyproject.toml` manually and run `poetry lock --no-update` to update the lockfile. This is useful, for example, if you installed a dependency in the main group by default and want to reorganize things afterwards.
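For example, moving a hypothetical `somepackage` from the main group to `dev` is just an edit followed by a relock:

```toml
# Before: somepackage lives in the main group
[tool.poetry.dependencies]
somepackage = "^1.0"

# After: cut from above and pasted under dev,
# then run `poetry lock --no-update`
[tool.poetry.group.dev.dependencies]
somepackage = "^1.0"
```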
Troubleshooting
Common scenario: merge conflict on `pyproject.toml` and `poetry.lock`
This can happen if both you and someone else installed dependencies separately before rebasing or merging. Here's how to fix it:
- Open `pyproject.toml` and resolve the conflict there by manually updating the dependency requirements to what you want.
- Delete `poetry.lock`.
- Run `poetry lock --no-update` to regenerate the lockfile.
- Run `poetry install` to make sure you have everything installed (see the commands below).
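In commands, after resolving `pyproject.toml` by hand, the recovery is:

```
rm poetry.lock
poetry lock --no-update
poetry install
```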
Environment variables
We care more about environment variables than most people. Both when developing and deploying code, it's critical to clearly define which ones need to exist and make sure they are available.
For this reason, we have a dedicated file, `shared/env.py`, that defines all our environment variables.
As a reminder, environment variables are values like the database URL, or AWS credentials, that vary based on the environment: development, production... They are generally sensitive, kept secret, and never written down explicitly in the codebase, whose history is kept forever.
Importing
Accessing environment variables in your code is super simple:
```python
from shared.env import PGURL, ASSETS_BUCKET
```
If this completes without error, you are guaranteed to have access to all necessary environment variables.
Defining
In `env.py`, environment variables are defined like this:
```python
import os

from shared.utils.env_utils import load_env

env = load_env()

IS_PROD = (
    os.getenv("DAGSTER_CLOUD_DEPLOYMENT_NAME") is not None
    or os.getenv("RAILWAY_PROJECT_ID") is not None
)
IS_DEV = not IS_PROD

# PostgreSQL
PGURL = env.require("PGURL")

# Allow the dev environment to read from the database (this is a read-only connection string)
if IS_DEV:
    PROD_PGURL = env.require("PROD_PGURL")
else:
    PROD_PGURL = PGURL

# AWS
AWS_ACCESS_KEY_ID = env.require("AWS_ACCESS_KEY_ID", set_in_process=True)
AWS_SECRET_ACCESS_KEY = env.require("AWS_SECRET_ACCESS_KEY", set_in_process=True)

# more environment variables...
```
What's happening here:
- `load_env` loads environment variables and returns an `EnvironmentVariables` object.
- Environment variables are loaded from the following locations, in order:
  - A `.env` file at the root of the repository
  - The Doppler CLI, if it's available and authenticated (either through `doppler login` or a `DOPPLER_TOKEN` in the `.env` file)
  - AWS Secrets Manager, for the production environment
- Required environment variables are declared using `env.require()`.
- Passing in `set_in_process=True` means the variable will be set within the process environment, so that anything else that runs in the same process as the current program will have access to it (sketched below). This can be useful if external programs or Python libraries need to access a given environment variable. However, this adds security risk and should be used only when necessary.
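To make `set_in_process` concrete, here is a minimal, hypothetical sketch of what an `EnvironmentVariables.require()` method could look like. The real implementation lives in `shared/utils/env_utils.py` and certainly differs; the names and structure here are assumptions:

```python
import os

class EnvironmentVariables:
    """Hypothetical sketch, loosely modeled on the behavior described above."""

    def __init__(self, values: dict[str, str]):
        # values merged from .env, Doppler, and AWS Secrets Manager
        self._values = values

    def require(self, name: str, set_in_process: bool = False) -> str:
        if name not in self._values:
            raise RuntimeError(f"Missing required environment variable: {name}")
        value = self._values[name]
        if set_in_process:
            # Expose the variable to anything running in this process,
            # e.g. libraries that read os.environ directly (like boto3).
            os.environ[name] = value
        return value
```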
Adding new environment variables is done by defining them in Doppler and adding an `env.require()` call in `shared/env.py`. If you do not have access to Doppler, reach out to Erwin.
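For example, after creating a hypothetical `MY_SERVICE_TOKEN` secret in Doppler, the matching line in `shared/env.py` would be:

```python
# Hypothetical variable; the name is an example only
MY_SERVICE_TOKEN = env.require("MY_SERVICE_TOKEN")
```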