Skip to main content

Excel automation

ExcelReader

The ExcelReader class provides a convenient way to extract data from Excel worksheets, with support for fuzzy matching labels and various data types. This documentation covers the main functionality and usage patterns.

Basic Usage

First, create an ExcelReader instance by passing an openpyxl Worksheet object:

from openpyxl import load_workbook
from shared.excel.reader import ExcelReader

wb = load_workbook("my_file.xlsx")
ws = wb["Sheet1"]
reader = ExcelReader(ws)

The ExcelReader constructor accepts the following parameters:

  • ws: The openpyxl Worksheet object
  • row: Starting row (default: 1)
  • col: Starting column (default: 1)
  • auto_evaluate: Whether to automatically evaluate Excel formulas (default: False)

Reading Values

The ExcelReader provides several methods for reading values relative to labeled cells. The different methods automatically convert to the specified data type.

Basic Read Methods

# Read any value
value = reader.read(label="Temperature", position="bottom")

# Read with type conversion
str_value = reader.read_str("Operator", position="bottom")
float_value = reader.read_float("pH", position="right")
int_value = reader.read_int("Count", position="left")
datetime_value = reader.read_datetime("Date", position="top")

Each read method accepts:

  • label: The text label to search for
  • position: Where to read the value relative to the label ("top", "bottom", "left", "right")
  • similarity: Fuzzy matching threshold (0-1, default varies by method)
  • row: Optional specific row to search in
  • col: Optional specific column to search in
  • skip: Number of cells to skip between label and value (default: 0)

Optional Reading

Each read method has an _optional variant that returns None instead of raising errors:

value = reader.read_str_optional("Comments", position="bottom",
label_optional=True, safe=True)

Additional parameters for optional reads:

  • label_optional: If True, return None when label isn't found
  • safe: If True, return None on conversion errors

Reading Tables

The ExcelReader can extract tabular data:

# Read a table as nested lists
data = reader.read_table(
row=5,
col=2,
end_row_label="Total",
end_col_label="Comments"
)

# Read a table as a pandas DataFrame
df = reader.read_df(
row=5,
col=2,
end_row_label="Total",
end_col_label="Comments",
header=True,
columns=["Date", "Value", "Notes"]
)

Table reading parameters:

  • row: Starting row
  • col: Starting column (can be int or Excel column letter)
  • end_row_label: Label that marks the end row (empty string for first empty cell)
  • end_col_label: Label that marks the end column
  • end_row_index: Alternative to end_row_label - specific row number
  • end_col_index: Alternative to end_col_label - specific column number
  • header: Whether first row contains column names
  • columns: Optional list of column names

Finding Cell Locations

To find the location of a labeled cell:

# Get row, col tuple of cell containing label
row, col = reader.find_cell("Temperature", similarity=0.9)

# Get optional location (returns None if not found)
location = reader.find_cell_optional(
"Temperature",
similarity=0.9,
row=None, # optional row constraint
col=None, # optional column constraint
safe=False,
ignore_spaces=False
)

Real World Example

Here's an example from aluminum precipitation data processing:

def exp_info(r: ExcelReader, sample_id: str, original_sample_id: str, stoneware_link: str):
return {
"operator": r.read_str_optional(
"operator", "bottom", safe=True, label_optional=True
),
"date": r.read_str_optional(
"Date", "bottom", safe=True, label_optional=True, row=1
),
"test_designation": r.read_str_optional(
"test designation", "bottom", safe=True, label_optional=True
),
"temperature": r.read_float_optional(
"Temperature (°C)", "bottom", safe=True, label_optional=True
),
"stir_speed": r.read_float_optional(
"Stir Speed (rpm)", "bottom", safe=True, label_optional=True
)
}

Error Handling

  • Non-optional read methods raise ValueError when labels aren't found or values can't be converted
  • Optional methods return None in these cases when safe=True
  • similarity threshold controls how strict the label matching is (1.0 = exact match)
  • Setting label_optional=True prevents errors when labels aren't found

Best Practices

  • Use optional reads with safe=True when data might be missing or malformed
  • Set appropriate similarity thresholds based on your label consistency
  • Use type-specific methods (read_float, read_int, etc.) when you know the expected type
  • Consider ignore_spaces=True when dealing with inconsistent spacing in labels
  • Use find_cell() when you need to know the location of data for more complex processing