Excel automation

ExcelReader

The ExcelReader class provides a convenient way to extract data from Excel worksheets, with support for fuzzy matching labels and various data types. This documentation covers the main functionality and usage patterns.

Basic Usage

First, create an ExcelReader instance by passing an openpyxl Worksheet object:

from openpyxl import load_workbook
from shared.excel.reader import ExcelReader

wb = load_workbook("my_file.xlsx")
ws = wb["Sheet1"]
reader = ExcelReader(ws)

The ExcelReader constructor accepts the following parameters:

ws: The openpyxl Worksheet object
row: Starting row (default: 1)
col: Starting column (default: 1)
auto_evaluate: Whether to automatically evaluate Excel formulas (default: False)

Reading Values

The ExcelReader provides several methods for reading values relative to labeled cells. The different methods automatically convert to the specified data type.

Basic Read Methods

# Read any value
value = reader.read(label="Temperature", position="bottom")

# Read with type conversion
str_value = reader.read_str("Operator", position="bottom")
float_value = reader.read_float("pH", position="right")
int_value = reader.read_int("Count", position="left")
datetime_value = reader.read_datetime("Date", position="top")

Each read method accepts:

label: The text label to search for
position: Where to read the value relative to the label ("top", "bottom", "left", "right")
similarity: Fuzzy matching threshold (0-1, default varies by method)
row: Optional specific row to search in
col: Optional specific column to search in
skip: Number of cells to skip between label and value (default: 0)

Optional Reading

Each read method has an _optional variant that returns None instead of raising errors:

value = reader.read_str_optional("Comments", position="bottom",
                               label_optional=True, safe=True)

Additional parameters for optional reads:

label_optional: If True, return None when label isn't found
safe: If True, return None on conversion errors

Reading Tables

The ExcelReader can extract tabular data:

# Read a table as nested lists
data = reader.read_table(
    row=5,
    col=2,
    end_row_label="Total",
    end_col_label="Comments"
)

# Read a table as a pandas DataFrame
df = reader.read_df(
    row=5,
    col=2,
    end_row_label="Total",
    end_col_label="Comments",
    header=True,
    columns=["Date", "Value", "Notes"]
)

Table reading parameters:

row: Starting row
col: Starting column (can be int or Excel column letter)
end_row_label: Label that marks the end row (empty string for first empty cell)
end_col_label: Label that marks the end column
end_row_index: Alternative to end_row_label - specific row number
end_col_index: Alternative to end_col_label - specific column number
header: Whether first row contains column names
columns: Optional list of column names

Finding Cell Locations

To find the location of a labeled cell:

# Get row, col tuple of cell containing label
row, col = reader.find_cell("Temperature", similarity=0.9)

# Get optional location (returns None if not found)
location = reader.find_cell_optional(
    "Temperature",
    similarity=0.9,
    row=None,  # optional row constraint
    col=None,  # optional column constraint
    safe=False,
    ignore_spaces=False
)

Real World Example

Here's an example from aluminum precipitation data processing:

def exp_info(r: ExcelReader, sample_id: str, original_sample_id: str, stoneware_link: str):
    return {
        "operator": r.read_str_optional(
            "operator", "bottom", safe=True, label_optional=True
        ),
        "date": r.read_str_optional(
            "Date", "bottom", safe=True, label_optional=True, row=1
        ),
        "test_designation": r.read_str_optional(
            "test designation", "bottom", safe=True, label_optional=True
        ),
        "temperature": r.read_float_optional(
            "Temperature (°C)", "bottom", safe=True, label_optional=True
        ),
        "stir_speed": r.read_float_optional(
            "Stir Speed (rpm)", "bottom", safe=True, label_optional=True
        )
    }

Error Handling

Non-optional read methods raise ValueError when labels aren't found or values can't be converted
Optional methods return None in these cases when safe=True
similarity threshold controls how strict the label matching is (1.0 = exact match)
Setting label_optional=True prevents errors when labels aren't found

Best Practices

Use optional reads with safe=True when data might be missing or malformed
Set appropriate similarity thresholds based on your label consistency
Use type-specific methods (read_float, read_int, etc.) when you know the expected type
Consider ignore_spaces=True when dealing with inconsistent spacing in labels
Use find_cell() when you need to know the location of data for more complex processing

ExcelReader​

Basic Usage​

Reading Values​

Basic Read Methods​

Optional Reading​

Reading Tables​

Finding Cell Locations​

Real World Example​

Error Handling​

Best Practices​