
SharePoint automation (v2)

Context

SharePointClientV2 is a redesign of our SharePoint automation from scratch.

It aims to address the following shortcomings of V1:

  • Poor performance of the underlying Office 365 library
  • Lack of clarity around the number of network requests being made
  • Limited reliability, addressed here by moving to the more modern Microsoft Graph API
  • Missing features, such as updating spreadsheets collaboratively while other users have them open

It enables you to read, write, navigate, and manage files and folders, as well as interact with Excel spreadsheets. The main differences from V1 are:

  • Written from scratch, providing us with full control and using the modern Microsoft Graph API
  • Async interface, enabling parallelization of network requests. This adds a little complexity, but unlocks significantly better performance.
  • Read & update spreadsheets directly in SharePoint, even if other users have them open.
  • Single client for all SharePoint sites
  • Continuous synchronization of changes to database tables with the new SharePointDB
  • Fast recursive listing of files and folders with the new SharePointDB
  • More robust, thanks to better retry mechanisms, proper throttling to avoid hitting rate limits, and respect for "Retry-After" delays returned by the API.
  • Batch queries (advanced, mostly used internally)

General considerations

  • A single client can be used across all SharePoint sites. All methods on the SharePoint client have an optional site keyword argument. A default site can be passed when initializing the client, to avoid having to pass it explicitly every time.
  • Initializing multiple clients is also not a problem: they share the same HTTP session and rate limiter, to maximize performance and avoid throttling.
  • If you are processing multiple files in parallel, it's worth defining a process_file async function and running it concurrently on all files with asyncio.gather.
  • Try to work in safe locations when testing out your code during development. The recommended way to do that is to use the IS_PROD and IS_DEV flags (from shared.env import IS_PROD, IS_DEV) to write either to the real folder or to a test folder. The recommended location for testing is the 99. Tests/Outputs folder in the Stoneware SharePoint site.
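The asyncio.gather pattern mentioned above can be sketched like this. The process_file body here is a stand-in (it just uppercases a name) so the snippet is self-contained and runnable; in real code it would await client methods such as file.read():

```python
import asyncio

# Sketch of the recommended fan-out pattern. `process_file` is a stand-in:
# in real code it would await client methods such as `file.read()`.
async def process_file(name: str) -> str:
    await asyncio.sleep(0)  # placeholder for real network I/O
    return name.upper()

async def process_all(names: list[str]) -> list[str]:
    # Run process_file concurrently on all files; results keep input order.
    return await asyncio.gather(*(process_file(n) for n in names))

results = asyncio.run(process_all(["report.pdf", "summary.xlsx"]))
print(results)  # ['REPORT.PDF', 'SUMMARY.XLSX']
```

Because the client shares one HTTP session and rate limiter, fanning out like this is safe: the client throttles the concurrent requests for you.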

Working with files and folders

Full example

sp = SharePointClientV2("PSA Data")

# Navigate folders
folder = await sp.get_folder("/Data/Reports")
file = await sp.get_file("/Data/Reports/report.pdf")

# Get file/folder metadata
print(file.name, file.size, file.created_at, file.modified_at)
print(folder.name, folder.children, folder.created_at)


new_folder = await sp.create_folder("/Data/Reports/2024")
# `ensure_folder` creates the folder if it doesn't exist and doesn't error otherwise
new_folder = await sp.ensure_folder("/Data/Reports/2024")

file = await file.move("/Data/Reports/2024/report.pdf")

copy_file = await file.copy("/Data/Reports/2024/report_copy.pdf")
copy_file = await copy_file.rename("new_name.pdf")
await copy_file.delete()

folder = await folder.rename("Reports 2024")

# `recursive=True` is available, but not always fast. For large folder
# and file structures, using the `SharePointDB` is recommended (see below).
subfolders = await folder.list_folders()
files = await folder.list_files()

await folder.delete()

SharePointClientV2

This is the main entry point for SharePoint interactions.

Key Responsibilities:

  • Providing methods to retrieve, list, and create sites, files, folders
  • Handling uploads and downloads (chunked for large files)
  • Managing workbook sessions
  • Managing authentication tokens
  • Managing concurrency through built-in rate limits, retries, and backoff

Constructor:

def __init__(
    self,
    default_site: str | None = None,
    monitoring: Monitoring | None = None,
    logger: logging.Logger | PrintLogger | None = None,
) -> None

  • default_site: Name of the default SharePoint site to use. If provided, methods won't need an explicit site parameter.
  • monitoring: Optional configuration for long-running operations (upload/copy). Defaults to Monitoring().
  • logger: Optional logger for debug/warning messages. Defaults to PrintLogger(level=logging.INFO).

Methods:

  1. get_file(path: str | list[str], site: str | None = None, allow_not_found=False) -> SharePointFile | None

    • Retrieves a file by path, returning a SharePointFile.
    • Accepts an optional site to choose which SharePoint Site to use, otherwise uses the default.
    • If allow_not_found=True, returns None if the file is not found.
  2. get_folder(path: str | list[str], site: str | None = None, allow_not_found=False) -> SharePointFolder | None

    • Retrieves a folder by path, returning a SharePointFolder.
    • If allow_not_found=True, returns None if the folder is not found.
  3. ensure_folder(path: str | list[str], site: str | None = None) -> SharePointFolder

    • If the folder does not exist, creates it. Otherwise returns the existing folder.
  4. item_exists(path: str | list[str], site: str | None = None) -> bool

    • Checks if a file or folder exists at the given path.
  5. list_sites()

    • Lists all available SharePoint sites.
  6. _create_link(...)

    • Internal method to create a sharing link. Typically used through SharePointFile.create_link(...) or SharePointFolder.create_link(...).
  7. _read_file(file_id: str, site: str) -> bytes

    • Internal method that fetches the file’s binary contents.
  8. _upload_file(folder_id: str, name: str, file: Path|bytes, ...) -> None

    • Internal method to handle small or chunked file uploads.
  9. close()

    • Closes the underlying async session (good practice for long-running scripts).

SharePointFile

Represents a file in SharePoint.

Attributes:

  • id: Unique ID of the file in SharePoint
  • name: Name of the file (filename)
  • path: Path of the file, e.g. /Some/Folder/filename.docx
  • url: SharePoint web URL for the file
  • size: File size in bytes
  • deleted: Boolean indicating if the file has been marked deleted in SharePoint

Methods:

  1. read() -> bytes
    • Reads the file content into memory.
  2. read_dataframe(sheet_name: int | str = 0, **kwargs) -> pd.DataFrame | dict[str, pd.DataFrame]
    • Downloads the file and reads it with pandas.read_excel.
    • Great when you want the sheet contents as a DataFrame and do not need workbook-session features.
  3. read_workbook() -> openpyxl.Workbook
    • Downloads the file and opens it with openpyxl.load_workbook.
    • This is the easiest path when you want the full workbook object locally.
  4. download(out: str | Path) -> None
    • Downloads the file locally to the specified path.
  5. rename(name: str | None, description: str | None = None) -> SharePointFile
    • Renames the file (and optionally sets a description).
  6. move(path: str | list[str], new_name: str | None = None)
    • Moves the file to a different parent folder and optionally renames it.
  7. copy(path: str | list[str], new_name: str | None = None)
    • Copies the file to a different parent folder and optionally renames it.
  8. delete()
    • Permanently deletes the file.
  9. create_workbook_session(persist_changes=True, auto_refresh=False) -> WorkbookSession
    • Initiates a workbook session if the file is an Excel workbook.
    • persist_changes means changes persist once the session closes.
    • auto_refresh automatically refreshes the session token.

SharePointFolder

Represents a folder in SharePoint.

Attributes:

  • id: Unique folder ID
  • name: Folder name
  • path: Full path for the folder, e.g. /Documents/Reports
  • children: Number of immediate children in the folder
  • deleted: Boolean indicating if the folder is marked as deleted

Methods:

  1. list_children() -> (List[SharePointFile], List[SharePointFolder])
    • Lists all child items in the folder.
  2. list_files() -> List[SharePointFile] and list_folders() -> List[SharePointFolder]
    • Helper methods to list files only or folders only.
  3. upload_file(name: str, file: str | Path | bytes, chunksize=50_000_000)
    • Uploads a new file to the folder; large files are chunked by default.
  4. delete_child(name: str) -> None
    • Deletes a file or folder by name.
  5. ensure_folder(path: str | list[str]) -> SharePointFolder
    • Creates subfolders as needed.
  6. rename(name: str | None, description: str | None = None)
    • Renames the folder.
  7. move(path: str | list[str], new_name: str | None = None)
    • Moves the folder to a different parent folder and optionally renames it.
  8. copy(path: str | list[str], new_name: str | None = None)
    • Copies the folder to a different parent folder and optionally renames it.
  9. delete()
    • Permanently deletes the folder.

Working with spreadsheets

SharePoint v2 gives you two different ways to work with Excel files, and it helps to pick the right one up front.

  • Use read_dataframe() when you just want sheet data as pandas and do not need to edit the workbook in place.
  • Use read_workbook() when you want a local openpyxl workbook object and are happy to download the file.
  • Use create_workbook_session() when you want to edit the workbook directly in SharePoint without downloading and re-uploading the whole file.

That last option is the interesting one. It lets your code behave more like another live Excel client. In practice, that means you can update values, formatting, tables, and sheet structure while the workbook still lives in SharePoint.

A good mental model

If you are not changing anything, start simple:

sp = SharePointClientV2("Analytical Data")
file = await sp.get_file("/Reports/weekly_summary.xlsx")

# Quick read into pandas
df = await file.read_dataframe("Summary")

# Full local workbook if you want openpyxl features
wb = await file.read_workbook()
summary_sheet = wb["Summary"]

When you do want to edit the workbook in place, open a workbook session:

sp = SharePointClientV2("Analytical Data")
file = await sp.get_file("/Reports/weekly_summary.xlsx")

async with file.create_workbook_session(persist_changes=True) as workbook:
    sheet = await workbook.get_worksheet("Summary")
    await sheet.write_cells("A1", values=[["Updated by automation"]])

That session is the entry point for everything that follows.

A full workbook session example

This example touches most of the workbook features people actually reach for: reading values, writing values, formatting, table creation, and worksheet protection.

from datetime import datetime

import pandas as pd

from shared.sharepoint.client_v2 import SharePointClientV2

sp = SharePointClientV2(default_site="Analytical Data")
file = await sp.get_file("/Reports/weekly_summary.xlsx")

async with file.create_workbook_session(
    persist_changes=True,
    auto_refresh=True,
) as workbook:
    sheet = await workbook.get_worksheet("Summary")

    existing = await sheet.read_cells("A1:D8")
    print(existing.values)

    await sheet.write_cells(
        "A1:D3",
        values=[
            ["Region", "Sales", "Updated At", "Status"],
            ["North", 50000, datetime.now(), "Ready"],
            ["South", 75000, datetime.now(), "Review"],
        ],
        number_format=[
            [None, "$#,##0.00", "yyyy-mm-dd hh:mm:ss", None],
            [None, "$#,##0.00", "yyyy-mm-dd hh:mm:ss", None],
            [None, "$#,##0.00", "yyyy-mm-dd hh:mm:ss", None],
        ],
    )

    await sheet.set_font("A1:D1", bold=True, size=12, color="#1F2937")
    await sheet.fill_cells("A1:D1", color="#E5E7EB")
    await sheet.draw_borders(
        "A1:D3",
        color="#9CA3AF",
        style="Continuous",
        weight="Thin",
        sides=["EdgeTop", "EdgeBottom", "EdgeLeft", "EdgeRight", "InsideHorizontal"],
    )
    await sheet.autofit_columns("A:D")

    table_df = pd.DataFrame(
        {
            "Sample": ["alpha", "beta", "gamma"],
            "Result": ["pass", "retry", "pass"],
            "Score": [91.2, 74.5, 88.0],
        }
    )
    table = await sheet.create_table_from_dataframe(
        table_df,
        start="F1",
        number_formats=["@", "@", "0.0"],
        alignments=["Left", "Left", "Right"],
        column_widths=["auto", "auto", 80],
        style="TableStyleMedium2",
        table_name="ResultsTable",
    )

    print(table.name)

    await sheet.protect(
        allow_format_cells=True,
        allow_format_columns=True,
        allow_sort=True,
        allow_auto_filter=True,
    )

Workbook sessions in practice

A workbook session is a live connection to the Excel workbook through Microsoft Graph.

There are a couple of parameters worth knowing well:

  • persist_changes=True: changes are saved back to the workbook when the session closes.
  • persist_changes=False: useful when you want to experiment, inspect behavior, or do a dry run without keeping the edits.
  • auto_refresh=True: useful for longer jobs; the client will keep the session alive for you.

If you leave auto_refresh=False, short tasks are still fine. The session logic will reopen a stale session before the next workbook operation when needed. For quick scripts, that is often enough. For longer multi-step jobs, I would still use auto_refresh=True and stop thinking about it.
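To make the stale-session behavior concrete, here is a conceptual model of lazy renewal. This is an illustration only, not the client's actual implementation, and the TTL is an arbitrary stand-in:

```python
import time

class SessionSketch:
    """Conceptual model of lazy session renewal (not the real client code)."""

    TTL = 5 * 60  # stand-in session lifetime, in seconds

    def __init__(self):
        self.opened_at = time.monotonic()
        self.reopens = 0

    def _ensure_fresh(self):
        # Before each operation, reopen the session if it has gone stale.
        if time.monotonic() - self.opened_at > self.TTL:
            self.opened_at = time.monotonic()
            self.reopens += 1

    def operation(self) -> str:
        self._ensure_fresh()
        return "ok"

session = SessionSketch()
session.opened_at -= 10 * 60  # simulate a long idle gap
print(session.operation(), session.reopens)  # ok 1
```

The point is that a stale session costs one extra round-trip at the next operation, whereas auto_refresh=True keeps it warm in the background.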

WorkbookSession API

Once you have a session, these are the main things you can do with it:

  1. list_worksheets()
    • Returns the worksheets in the workbook.
  2. get_worksheet(name_or_id: str)
    • Fetches a worksheet by name or ID.
  3. add_worksheet(name: str)
    • Creates a new worksheet.
  4. delete_worksheet(name_or_id: str)
    • Deletes a worksheet by name or ID.
  5. recalculate(calculation_type="Recalculate")
    • Triggers workbook recalculation.
    • Useful after formula-heavy edits.
  6. refresh_session()
    • Explicitly refreshes the session.
    • Most callers will not need this if auto_refresh=True.
  7. batch_request(requests, retry=True)
    • Advanced escape hatch for batching workbook API calls.
    • Mostly useful when you already understand the underlying Graph endpoints.

Here is a small example that creates a new sheet and recalculates formulas afterward:

async with file.create_workbook_session(auto_refresh=True) as workbook:
    report_sheet = await workbook.add_worksheet("Fresh Output")
    await report_sheet.write_cells("A1:B2", values=[["Metric", "Value"], ["Count", 42]])
    await workbook.recalculate("Full")

Working with worksheets

The Worksheet object is where most day-to-day workbook work happens.

Read cells and inspect ranges

read_cells() returns a WorksheetRange, not just raw values. That object includes dimensions, values, formulas, text, and number formats when the API provides them.

async with file.create_workbook_session() as workbook:
    sheet = await workbook.get_worksheet("Summary")
    cell_range = await sheet.read_cells("B2:D6")

    print(cell_range.address)
    print(cell_range.row_count, cell_range.column_count)
    print(cell_range.values)
    print(cell_range.number_format)

If you only need to know how much of a sheet is being used, get_bounds() is the most direct option:

bounds = await sheet.get_bounds()
print(bounds.row_index, bounds.row_count, bounds.column_index, bounds.column_count)

Write cells, formulas, and datetimes

write_cells() accepts a 2D array of values. Formula strings beginning with = are sent as formulas, and datetime values are formatted with the datetime_format argument before they go over the wire.

await sheet.write_cells(
    "A1:C3",
    values=[
        ["Date", "Amount", "Formula"],
        [datetime.now(), 1250, "=B2*0.2"],
        [datetime.now(), 1800, "=B3*0.2"],
    ],
    number_format=[
        ["yyyy-mm-dd hh:mm:ss", "$#,##0.00", None],
        ["yyyy-mm-dd hh:mm:ss", "$#,##0.00", None],
        ["yyyy-mm-dd hh:mm:ss", "$#,##0.00", None],
    ],
)

Two practical notes:

  • Shapes matter. values and number_format need to line up.
  • The API wants rectangular writes. If you need sparse edits, split them into separate calls.
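If your input rows can be ragged, a small helper can pad them into the rectangle the API expects. This helper is hypothetical, not part of the client; write_cells itself does no padding:

```python
def as_rectangle(rows: list[list], fill=None) -> list[list]:
    """Pad ragged rows to a rectangular 2D array.

    Hypothetical helper, not part of the client API.
    """
    width = max(len(row) for row in rows)
    return [list(row) + [fill] * (width - len(row)) for row in rows]

print(as_rectangle([["Region", "Sales"], ["North"]]))
# [['Region', 'Sales'], ['North', None]]
```

The same trick works for number_format arrays, which must match the shape of the values they format.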

Insert, delete, clear, merge

These methods are handy when you are laying out a report rather than just dropping data into a fixed template.

await sheet.insert_cells("A2:B4", shift="Down")
await sheet.delete_cells("D5:E5", shift="Left")
await sheet.clear_cells("F1:H20", apply_to="Contents")
await sheet.merge_cells("A1:D1")
await sheet.unmerge_cells("A1:D1")

Formatting and presentation

The formatting helpers are straightforward, and together they cover most report-building needs:

  • format_cells() for alignment, row height, column width, and wrapping
  • set_font() for bold, italic, size, color, font name, and underline
  • fill_cells() for background fills
  • draw_borders() for borders
  • autofit_columns() and autofit_rows() for cleanup after writing data

await sheet.format_cells(
    "A1:D20",
    horizontal_alignment="Center",
    vertical_alignment="Center",
    wrap_text=True,
)
await sheet.set_font("A1:D1", bold=True, size=11)
await sheet.fill_cells("A1:D1", color="#F3F4F6")
await sheet.autofit_rows("A1:D20")
await sheet.autofit_columns("A:D")

One small implementation detail that is worth knowing: draw_borders() is internally batched because Excel border updates are naturally multi-call operations. You get the nicer API without having to think about the request fan-out.
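Conceptually, the fan-out amounts to one logical request per border side, which the client then batches into a single round-trip. The sketch below only illustrates the shape of that fan-out; it is not the client's actual internals:

```python
def border_requests(address: str, sides: list[str], color: str,
                    style: str, weight: str) -> list[dict]:
    # Illustration only: one logical request per border side, which the
    # client batches together rather than sending one at a time.
    return [
        {"range": address, "side": side, "color": color,
         "style": style, "weight": weight}
        for side in sides
    ]

requests = border_requests("A1:D3", ["EdgeTop", "EdgeBottom"], "#9CA3AF",
                           "Continuous", "Thin")
print(len(requests))  # 2
```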

Sheet structure and visibility

You can also rename and reposition sheets, or hide them:

await sheet.update(name="Summary Output", position=0, visibility="Visible")

Accepted visibility values are:

  • "Visible"
  • "Hidden"
  • "VeryHidden"

Worksheet protection

Protection is built in, and it is granular enough to be practical.

await sheet.protect(
    allow_format_cells=True,
    allow_format_columns=True,
    allow_insert_rows=False,
    allow_delete_rows=False,
    allow_sort=True,
    allow_auto_filter=True,
    allow_pivot_tables=True,
)

# Later, if needed
await sheet.unprotect()

This is especially useful for generated reports where you want people to filter and sort, but not accidentally rewrite the layout.

Tables

If you already have data on the sheet, you can turn a range into an Excel table:

table = await sheet.create_table("A1:C10", has_headers=True)

If your data starts as a pandas DataFrame, create_table_from_dataframe() is usually the better choice because it writes the values, applies optional formatting, and creates the Excel table in one go:

table = await sheet.create_table_from_dataframe(
    df,
    start="A1",
    header=True,
    columns=["sample_id", "status", "score"],
    number_formats=["@", "@", "0.00"],
    column_widths=["auto", "auto", 90],
    alignments=["Left", "Left", "Right"],
    table_name="SampleResults",
    style="TableStyleLight1",
)

This method is one of the nicest parts of the workbook API. If your end goal is “put this DataFrame into Excel and make it look respectable,” it does a lot of the fiddly work for you.

Table API

Once you have a Table, you can manage its shape and contents directly.

Inspect a table

table = await sheet.get_table("SampleResults")

columns = await table.list_columns()
rows = await table.list_rows()
table_range = await table.get_range()

print(columns)
print(rows)
print(table_range.address)

Add, update, and delete rows

await table.add_row(["sample-004", "pass", 93.1])
await table.add_rows(
    [
        ["sample-005", "retry", 70.4],
        ["sample-006", "pass", 89.8],
    ]
)
await table.update_row(0, ["sample-001", "pass", 95.0])
await table.delete_row(1)

Add, update, and delete columns

new_column = await table.add_column(name="Reviewer", values=[["EA"], ["MB"], ["TT"]])
await table.update_column(new_column.id, name="Reviewed By")
await table.delete_column(new_column.id)

Update table properties and filters

await table.update(name="FinalResults", style="TableStyleMedium2", show_totals=True)
await table.clear_filters()
await table.reapply_filters()

And if you need to remove the table entirely:

await table.delete()

Choosing between workbook sessions and local workbook reads

This is the part that usually saves the most time:

  • Reach for create_workbook_session() when you want in-place edits in SharePoint.
  • Reach for read_workbook() when you need features from openpyxl that are outside the Graph workbook API.
  • Reach for read_dataframe() when you just need sheet data and want the shortest path to pandas.

It is completely normal to use both styles in the same project. A lot of workflows read source data with pandas or openpyxl, then use workbook sessions only for the final polished output workbook.

A few gotchas worth knowing

  • create_workbook_session(...) is the current API. Older examples may refer to open_workbook(...); that is stale.
  • write_cells() expects a rectangular 2D array.
  • number_format should match the shape of the written values.
  • For longer jobs, prefer auto_refresh=True.
  • If you are testing formatting or output logic, use a safe destination workbook in a test folder rather than a team workbook people rely on every day.
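For that last point, the IS_PROD / IS_DEV pattern from the general considerations works well. In this sketch a plain boolean stands in for the real shared.env flags, and the paths are illustrative:

```python
# Sketch: pick a safe destination while developing. In real code the flag
# would come from `shared.env` (IS_PROD / IS_DEV); these paths are examples.
def output_path(is_prod: bool) -> str:
    if is_prod:
        return "/Reports/weekly_summary.xlsx"
    return "/99. Tests/Outputs/weekly_summary.xlsx"

print(output_path(False))  # /99. Tests/Outputs/weekly_summary.xlsx
```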

Usage in Streamlit

Using the SharePointClientV2 in Streamlit requires one specific trick:

from data_tools.async_streamlit import run_async_main

async def main():
    # actual logic
    ...

if __name__ == "__main__":
    # you can pass any positional and keyword arguments for `main`
    # to `run_async_main` itself
    run_async_main(main)

The reason is a bit technical: it prevents multiple versions of the page from executing code at the same time when Streamlit reruns the page while an async task (such as a file upload) is still running. I invite you to read the comment on run_async_entrypoint for more detail.

In addition, two utilities are available to replace st.cache_resource and st.cache_data for async functions:

from data_tools.async_streamlit import st_cache_resource_async, st_cache_data_async

Gotchas

  • The SharePointDB connects to the production database by default, where the automation server syncs changes every 30 seconds. This means you may not see changes immediately with its methods. If the automation server has an issue, the files and folders table may not sync at all (should be rare, but worth knowing).

Advanced: SharePointDB

SharePointDB provides a PostgreSQL-cached view of SharePoint file and folder structures. It solves several key challenges:

  • Fast recursive listing: Query thousands of files in milliseconds instead of making hundreds of API calls
  • Change tracking: Continuously synchronize the latest SharePoint changes using Microsoft's delta API
  • Filtering: Fast filtering by modification date, path patterns, and other criteria
  • Deletion tracking: Keep track of deleted files (the API doesn't return deleted items after some time)

Why Use SharePointDB?

The Microsoft Graph API has limitations:

  • Rate limits (throttling after ~6 requests/second)
  • No recursive listing endpoint (must query each folder individually)
  • Paginated responses require multiple round-trips
  • Delta tokens expire, requiring full re-sync

SharePointDB solves these by syncing changes to PostgreSQL tables, enabling fast SQL queries.

Getting Started

from shared.sharepoint.client_v2 import SharePointClientV2

sp = SharePointClientV2(default_site="Analytical Data")

# Use as context manager for automatic connection handling
async with sp.db_client() as db:
    # All queries go here
    files = await db.list_files("/Reports", recursive=True)

Listing Files and Folders

from datetime import datetime, timedelta

async with sp.db_client() as db:
    # List files in a folder (non-recursive)
    files = await db.list_files("/Reports")

    # List files recursively
    files = await db.list_files("/Reports", recursive=True)

    # Filter by modification date
    recent = await db.list_files(
        "/Reports",
        recursive=True,
        modified_since=datetime.now() - timedelta(days=7),
    )

    # Include deleted files (useful for tracking removals)
    all_files = await db.list_files(
        "/Reports",
        recursive=True,
        include_deleted=True,
    )

    # List folders
    folders = await db.list_folders("/Reports", recursive=True)

Querying Multiple Paths

Use MultiPath to query multiple folders efficiently in a single query:

from shared.sharepoint.db_client import MultiPath

async with sp.db_client() as db:
    files = await db.list_files(
        MultiPath("/Folder1", "/Folder2", "/Folder3"),
        recursive=True,
    )

Getting Specific Items

async with sp.db_client() as db:
    # Get file by path
    file = await db.get_file("/Reports/Q4/summary.xlsx")

    # Get folder by path
    folder = await db.get_folder("/Reports/Q4")

    # Check existence
    exists = await db.file_exists("/Reports/Q4/summary.xlsx")
    exists = await db.folder_exists("/Reports/Q4")

    # Get multiple files by ID (useful in sync callbacks)
    files = await db.get_files_by_ids(["id1", "id2", "id3"])

Syncing Changes

The automation server syncs changes every 30 seconds automatically. For manual sync or custom processing:

async with sp.db_client() as db:
    # Set up the schema (only needed once per site)
    await db.setup_sync_schema("Analytical Data")

    # Incremental sync (continues from the last token)
    has_more = await db.sync_changes("Analytical Data")

    # Full resync from scratch
    has_more = await db.sync_changes("Analytical Data", from_scratch=True)

    # Limit iterations (useful for testing)
    has_more = await db.sync_changes("Analytical Data", max_iters=10)

Processing Changes with Callbacks

Process each batch of changes as they're synced:

from shared.sharepoint.db_client import SharePointDB, SyncBatch

async def on_batch(db: SharePointDB, batch: SyncBatch):
    # batch.changes contains a ChangeSplit with:
    # - folders: list[FolderUpdate]
    # - files: list[FileUpdate]
    # - folder_deletes: list[str] (IDs)
    # - file_deletes: list[str] (IDs)

    for file_update in batch.changes.files:
        print(f"File updated: {file_update.name}")
        print(f"  In folder: {file_update.parent_id}")
        print(f"  Modified: {file_update.last_modified}")
        print(f"  Size: {file_update.size}")

    for file_id in batch.changes.file_deletes:
        print(f"File deleted: {file_id}")

    # Fetch the full file objects if needed
    files = await db.get_files_by_ids([f.id for f in batch.changes.files])

async with sp.db_client() as db:
    await db.sync_changes("Analytical Data", on_batch=on_batch)

Change Data Types

from shared.sharepoint.db_client import (
    FileUpdate,    # File created or modified
    FileDelete,    # File deleted (contains only the ID)
    FolderUpdate,  # Folder created or modified
    FolderDelete,  # Folder deleted (contains only the ID)
    ChangeSplit,   # Container for all change types
    SyncBatch,     # Batch info passed to callbacks
)

# FileUpdate fields:
# - id, name, url, created_at, created_by, last_modified, last_modified_by, size, parent_id

# FolderUpdate fields:
# - id, name, url, created_at, created_by, last_modified, last_modified_by, size, children, parent_id

Database Schema

SharePointDB creates tables in the sharepoint schema:

  • sharepoint.deltas - Stores delta sync tokens per site
  • sharepoint.{site}_folders - Folder metadata with path hierarchy
  • sharepoint.{site}_files - File metadata linked to parent folders

The folder table uses a recursive CTE to maintain full paths, enabling fast path-based queries.

Best Practices

  1. Use context manager: Always use async with sp.db_client() as db: for proper connection handling
  2. Rely on automation sync: The server syncs every 30 seconds; manual sync is rarely needed
  3. Use modified_since for incremental processing: Filter by date to process only new/changed files
  4. Batch operations: Use MultiPath when querying multiple folders
  5. Handle deletions: Use include_deleted=True when tracking file removals

Connection Configuration

By default, SharePointDB connects to the production database. For custom connections:

# Custom PostgreSQL URL
async with sp.db_client(pg_url="postgresql://...") as db:
    ...