Codebase structure
Let's take a look at how the codebase is structured and how to work within it.
Quick tour
Top-level
When you open the repository, the primary files and folders you should keep in mind are the following:
```
monolith
├── apps/
├── libs/
├── mono.toml
├── pyproject.toml
└── uv.lock
```
- `apps/` and `libs/` contain most of our code. Each folder within is a separate "project". Apps are user-oriented projects: a web application, a server, a complex script... Libraries contain shared code reusable across other projects.
- `mono.toml` is the configuration file automatically created by the Mono CLI.
- `pyproject.toml` defines the monorepository configuration (a sketch follows below). Each project has its own `pyproject.toml` as well to define its dependencies.
- `uv.lock` locks the exact versions of Python packages to be used across all the projects, in everyone's development environments and deployments. This ensures everyone uses the same package versions.
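As an illustration, a uv-based monorepo's top-level `pyproject.toml` typically declares a workspace that includes every project. This is a minimal sketch; our actual file certainly contains more:

```toml
# Top-level pyproject.toml (sketch)
[project]
name = "monolith"
version = "0.1.0"
requires-python = ">=3.11"

# uv workspace: every folder under apps/ and libs/ is a member project
[tool.uv.workspace]
members = ["apps/*", "libs/*"]
```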
Project structure
Let's dive into a specific project, say `stoneware` (in `libs/`).
```
monolith
├── apps
├── libs
│   ├── shared
│   └── stoneware
│       ├── src
│       │   └── stoneware
│       │       └── code.py
│       ├── scripts
│       └── pyproject.toml
├── pyproject.toml
└── uv.lock
```
The core structure for a project folder is:
- A `pyproject.toml` that defines the dependencies of the project
- A `src/{project_name}` folder containing the main code of the project
In this example, you can import from the `code.py` visible above using `from stoneware.code import ...`.
Note that the project name is defined in `pyproject.toml`, but generally the same name is used for the top-level folder (here, `libs/stoneware`). The only conversion is that dashes (`-`) get converted to underscores (`_`), as in the hypothetical example below.
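For instance, a hypothetical library named `data-tools` would be laid out and imported like this:

```
libs/data-tools/
├── pyproject.toml        # name = "data-tools"
└── src/
    └── data_tools/       # dashes become underscores
        └── loaders.py    # from data_tools.loaders import ...
```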
Certain projects, notably in `apps/`, may have a different structure if they are very simple or not using Python.
After that, you can define any other files or folders in there. It's common to have `scripts/` and `notebooks/` folders, for example.
Deep dive
Here's the full structure you will likely have in your file explorer:
```
monolith
├── .dagster/         <-- Dagster configuration
├── .devcontainer/    <-- Dev containers configuration
├── .git/             <-- Git folder (automatic)
├── .github/          <-- GitHub workflows (= deployment logic)
├── .venv/            <-- Python virtual environment (automatic)
├── apps/             <-- Deliverables (apps, servers...)
├── infra/            <-- Special project for cloud & database infrastructure
├── libs/             <-- Shared code libraries
├── scripts/          <-- Scripts that don't fit neatly within one project
├── .dockerignore     <-- Docker configuration
├── .gitattributes    <-- Git configuration
├── .gitignore        <-- What should be ignored by Git
├── .python-version   <-- Exact Python version for uv
├── README.md         <-- README for GitHub
├── mono.toml         <-- Mono configuration file (see below)
├── pyproject.toml    <-- Monorepository configuration
├── uv.lock           <-- Exact dependency versions for all Python packages
└── workspace.yaml    <-- Dagster configuration
```
Some additional notes:
- The top-level `pyproject.toml` contains configuration for the monorepository, development dependencies that should be installed for all projects, and dependency groups that are used for Mono profiles.
- Ideally, put your scripts in a project, not the top-level `scripts/`.
- `mono.toml` is produced by the Mono CLI.
- `.gitattributes` is used to make Git ignore the cell outputs in Jupyter notebooks (one common setup is sketched below).
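The exact mechanism lives in the file itself, but a common approach is a clean filter such as nbstripout. A sketch of the idea; our actual configuration may differ:

```
# .gitattributes (sketch): strip notebook outputs before Git stores them
*.ipynb filter=nbstripout
```

The filter itself has to be registered in your Git config, which nbstripout does via `nbstripout --install`.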
Tips and tricks
- You can create a `data` folder in the root of the repository; it will be automatically ignored by Git. This is useful to cache downloaded or large data locally if you're rerunning a script multiple times (see the sketch below).
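For instance, a script that needs a large file on every run might cache it like this. A minimal sketch; the URL, file name, and helper are placeholders:

```python
from pathlib import Path
import urllib.request

# Hypothetical example: cache a large download under data/ so reruns are fast.
# The data/ folder sits at the repository root and is ignored by Git.
DATA_DIR = Path("data")
DATA_DIR.mkdir(exist_ok=True)

def fetch(url: str, filename: str) -> Path:
    """Download url to data/filename once; reuse the cached copy afterwards."""
    target = DATA_DIR / filename
    if not target.exists():
        urllib.request.urlretrieve(url, target)
    return target
```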
Full view
- `.dagster/` contains the local Dagster configuration and is used as the `DAGSTER_HOME` directory in GitHub Codespaces.
- `.devcontainer/` contains configuration to create a dev container for this repository. Dev containers can be used in GitHub Codespaces or locally.
- `.github/` contains our deployment pipelines (= CI/CD pipelines, as they are often called).
- `infra/` is a unique project containing database and cloud infrastructure code.
- `scripts/` is a mix of various scripts kept here for historical reasons. Ideally, scripts should be placed within the relevant project.
- `database/`, `datasmart/`, and `shared/` are our primary code folders. Any file or subfolder containing Python code in there can be imported from anywhere: `from shared.xxx import aaa`.

More specifically, `database/` and `datasmart/` are our two Dagster code locations, while `shared/` acts as a project containing common code. See the Dagster guide for more information on how code locations are structured.
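For orientation, a `workspace.yaml` declaring two code locations typically looks something like this. A sketch only; the real file at the repository root is authoritative:

```yaml
# workspace.yaml (sketch): one entry per Dagster code location
load_from:
  - python_package: database
  - python_package: datasmart
```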
`pyproject.toml` and `poetry.lock` define our Python dependencies. `pyproject.toml` is the main file and `poetry.lock` is automatically generated from it, containing exact package versions to ensure everyone is on the same page. See Managing packages for more information.
Deep dive
Let's take a complete look now:
```
database
├── .dagster/            <- Dagster configuration
├── .devcontainer/       <- Dev Containers configuration
├── .github/             <- GitHub Actions configuration
├── database/            <- Dagster code location
├── datasmart/           <- Dagster code location
├── db/                  <- Database migration system, unused
├── docs/                <- Docs site
├── examples/            <- Quick examples, will be replaced by these docs
├── automation-server/   <- Automation endpoints for Felt and SharePoint
├── notebooks/           <- Jupyter notebooks
├── scripts/             <- Various scripts
├── shared/              <- Common code
├── tests/               <- Very limited number of tests
├── wheels/              <- Pre-built Python packages
├── .env                 <- (optional) Environment variables
├── .gitignore           <- Git ignore patterns
├── .gitpod.Dockerfile   <- Dockerfile for the Gitpod workspace
├── .gitpod.yml          <- Gitpod workspace configuration
├── poetry.lock          <- Auto-generated
├── pyproject.toml       <- Declares Python dependencies
├── README.md            <- Good ol' README.md
└── workspace.yaml       <- Configures Dagster code locations
```
Some additional notes about the more confusing ones:
- `automation-server/` contains automation endpoints for SharePoint and Felt. Note: Felt is a service we use to build and share maps; this server provides the ability to calculate multimodal routes, across trucking, rail, and barging, from within certain Felt maps.
- `wheels/`: some Python packages can't be installed in a regular manner in our environments, or require super heavy dependencies during installation. To avoid bloating our environments, they are pre-packaged through a script and put here.
- `.env`: optional file to define environment variables, mostly used to define a `DOPPLER_TOKEN` (a sketch follows below); see Setting up a dev environment.
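A minimal `.env` might contain only the Doppler token; the value below is a placeholder:

```
# .env (sketch) - never commit real values
DOPPLER_TOKEN=dp.st.dev.xxxxxxxxxxxx
```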
Managing packages
We use Poetry as our package manager. I highly recommend reading the "Basic usage" section of the Poetry docs in addition to the current documentation.
High-level overview
Here's a quick review of the main concepts:
- `pyproject.toml` is where dependencies are defined.
- `poetry.lock` is a lockfile, which specifies the exact dependencies that should be installed to match the requirements defined in `pyproject.toml`. It is auto-generated by Poetry when adding packages.
- `poetry add <package>` adds a package.
- `poetry remove <package>` removes a package.
- `poetry install` installs the packages specified in `pyproject.toml` and `poetry.lock` into the virtual environment, creating it if necessary.
- `poetry lock [--no-update]` refreshes the lockfile. The `--no-update` flag tells Poetry not to update installed packages while refreshing the lockfile; it is generally recommended.
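Put together, a typical sequence looks like this (a sketch; `requests` is just an example package):

```
# Add a runtime dependency (updates pyproject.toml and poetry.lock)
poetry add requests

# Sync your virtual environment with the lockfile
poetry install

# After hand-editing pyproject.toml, refresh the lockfile
# without upgrading already-installed packages
poetry lock --no-update
```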
Dependency groups and extras
If you look at the `pyproject.toml`, you'll notice a layout similar to this:
```toml
[tool.poetry]
...

# Default dependency group, shared across all projects
[tool.poetry.dependencies]
numpy = "^1.25.2"
...

# SFA dependencies, used in `database`
[tool.poetry.group.sfa.dependencies]
pvlib = "^0.10.1"
...

# Exploration model dependencies, used in `database`
[tool.poetry.group.exploration_model.dependencies]
rasterio = "^1.3.8"
...

# Dev dependencies, only used during development
[tool.poetry.group.dev.dependencies]
matplotlib = "3.8.0"
...

# Optional PyTorch dependency
[tool.poetry.group.torch]
optional = true

[tool.poetry.group.torch.dependencies]
torch = "^2.5.1"
...
```
The Poetry feature used here is called dependency groups. It allows us to have a single `pyproject.toml` file and a single virtual environment in development, while being able to deploy our different services with different sets of dependencies. For example, the `datasmart` code location does not need some of the heavyweight dependencies in the `sfa` or `exploration_model` groups, allowing it to deploy faster.
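As an illustration (assuming Poetry 1.2+; the exact flags our deploy scripts use may differ), a deployment can exclude groups it doesn't need, while the optional `torch` group must be requested explicitly:

```
# Install without the heavyweight groups (e.g. for datasmart)
poetry install --without sfa,exploration_model

# Opt in to the optional torch group where it's needed
poetry install --with torch
```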
By default:
- `poetry add` installs dependencies in `[tool.poetry.dependencies]`, the main dependency group shared by all projects.
- `poetry add --group dev` installs dependencies in the specified dependency group (here, `dev`).
In general, when you install dependencies, especially heavy ones, it's worth thinking about which dependency group they should go into.
You can also move dependencies between dependency groups in `pyproject.toml` manually and run `poetry lock --no-update` to update the lockfile. This is useful, for example, if you installed a dependency in the main group by default and want to reorganize things afterwards.
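For example, moving a hypothetical `somepackage` from the main group to `dev` is just an edit followed by a relock:

```toml
# Before: somepackage lives in the main group
[tool.poetry.dependencies]
somepackage = "^1.0"

# After: cut from above and pasted under dev,
# then run `poetry lock --no-update`
[tool.poetry.group.dev.dependencies]
somepackage = "^1.0"
```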
Troubleshooting
Common scenario: merge conflict on `pyproject.toml` and `poetry.lock`
This can happen if both you and someone else installed dependencies separately before rebasing or merging. Here's how to fix it:
- Open `pyproject.toml` and resolve the conflict there by manually updating the dependency requirements to what you want.
- Delete `poetry.lock`.
- Run `poetry lock --no-update` to regenerate the lockfile.
- Run `poetry install` to make sure you have everything installed (see the commands below).
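In commands, after resolving `pyproject.toml` by hand, the recovery is:

```
rm poetry.lock
poetry lock --no-update
poetry install
```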
Environment variables
We care more about environment variables than most people. Both when developing and deploying code, it's critical to clearly define which ones need to exist and make sure they are available.
For this reason, we have a dedicated file, `shared/env.py`, that defines all our environment variables.
As a reminder, environment variables are values like the database URL, or AWS credentials, that vary based on the environment: development, production... They are generally sensitive, kept secret, and never written down explicitly in the codebase, whose history is kept forever.
Importing
Accessing environment variables in your code is super simple:
```python
from shared.env import PGURL, ASSETS_BUCKET
```
If this completes without error, you are guaranteed to have access to all necessary environment variables.
Defining
In `env.py`, environment variables are defined like this:
```python
import os

from shared.utils.env_utils import load_env

env = load_env()

IS_PROD = (
    os.getenv("DAGSTER_CLOUD_DEPLOYMENT_NAME") is not None
    or os.getenv("RAILWAY_PROJECT_ID") is not None
)
IS_DEV = not IS_PROD

# PostgreSQL
PGURL = env.require("PGURL")

# Allow the dev environment to read from the database (this is a read-only connection string)
if IS_DEV:
    PROD_PGURL = env.require("PROD_PGURL")
else:
    PROD_PGURL = PGURL

# AWS
AWS_ACCESS_KEY_ID = env.require("AWS_ACCESS_KEY_ID", set_in_process=True)
AWS_SECRET_ACCESS_KEY = env.require("AWS_SECRET_ACCESS_KEY", set_in_process=True)

# more environment variables...
```
What's happening here:
- `load_env` loads environment variables and returns an `EnvironmentVariables` object.
- Environment variables are loaded from the following locations, in order:
  - A `.env` file at the root of the repository
  - The Doppler CLI, if it's available and authenticated (either through `doppler login` or a `DOPPLER_TOKEN` in the `.env` file)
  - AWS Secrets Manager, for the production environment
- Required environment variables are declared using `env.require()`.
- Passing in `set_in_process=True` means the variable will be set within the process environment, so that anything else that runs in the same process as the current program will have access to it (sketched below). This can be useful if external programs or Python libraries need to access a given environment variable. However, this adds security risk and should be used only when necessary.
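To make `set_in_process` concrete, here is a minimal, hypothetical sketch of what an `EnvironmentVariables.require()` method could look like. The real implementation lives in `shared/utils/env_utils.py` and certainly differs; the names and structure here are assumptions:

```python
import os

class EnvironmentVariables:
    """Hypothetical sketch, loosely modeled on the behavior described above."""

    def __init__(self, values: dict[str, str]):
        # values merged from .env, Doppler, and AWS Secrets Manager
        self._values = values

    def require(self, name: str, set_in_process: bool = False) -> str:
        if name not in self._values:
            raise RuntimeError(f"Missing required environment variable: {name}")
        value = self._values[name]
        if set_in_process:
            # Expose the variable to anything running in this process,
            # e.g. libraries that read os.environ directly (like boto3).
            os.environ[name] = value
        return value
```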
Adding new environment variables is done by defining them in Doppler and adding an `env.require()` call in `shared/env.py`. If you do not have access to Doppler, reach out to Erwin.
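For example, after creating a hypothetical `MY_SERVICE_TOKEN` secret in Doppler, the matching line in `shared/env.py` would be:

```python
# Hypothetical variable; the name is an example only
MY_SERVICE_TOKEN = env.require("MY_SERVICE_TOKEN")
```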