Excel automation
ExcelReader
The ExcelReader
class provides a convenient way to extract data from Excel worksheets, with support for fuzzy matching labels and various data types. This documentation covers the main functionality and usage patterns.
Basic Usage
First, create an ExcelReader
instance by passing an openpyxl Worksheet object:
from openpyxl import load_workbook
from shared.excel.reader import ExcelReader
wb = load_workbook("my_file.xlsx")
ws = wb["Sheet1"]
reader = ExcelReader(ws)
The ExcelReader constructor accepts the following parameters:
ws
: The openpyxl Worksheet objectrow
: Starting row (default: 1)col
: Starting column (default: 1)auto_evaluate
: Whether to automatically evaluate Excel formulas (default: False)
Reading Values
The ExcelReader provides several methods for reading values relative to labeled cells. The different methods automatically convert to the specified data type.
Basic Read Methods
# Read any value
value = reader.read(label="Temperature", position="bottom")
# Read with type conversion
str_value = reader.read_str("Operator", position="bottom")
float_value = reader.read_float("pH", position="right")
int_value = reader.read_int("Count", position="left")
datetime_value = reader.read_datetime("Date", position="top")
Each read method accepts:
label
: The text label to search forposition
: Where to read the value relative to the label ("top", "bottom", "left", "right")similarity
: Fuzzy matching threshold (0-1, default varies by method)row
: Optional specific row to search incol
: Optional specific column to search inskip
: Number of cells to skip between label and value (default: 0)
Optional Reading
Each read method has an _optional
variant that returns None instead of raising errors:
value = reader.read_str_optional("Comments", position="bottom",
label_optional=True, safe=True)
Additional parameters for optional reads:
label_optional
: If True, return None when label isn't foundsafe
: If True, return None on conversion errors
Reading Tables
The ExcelReader can extract tabular data:
# Read a table as nested lists
data = reader.read_table(
row=5,
col=2,
end_row_label="Total",
end_col_label="Comments"
)
# Read a table as a pandas DataFrame
df = reader.read_df(
row=5,
col=2,
end_row_label="Total",
end_col_label="Comments",
header=True,
columns=["Date", "Value", "Notes"]
)
Table reading parameters:
row
: Starting rowcol
: Starting column (can be int or Excel column letter)end_row_label
: Label that marks the end row (empty string for first empty cell)end_col_label
: Label that marks the end columnend_row_index
: Alternative to end_row_label - specific row numberend_col_index
: Alternative to end_col_label - specific column numberheader
: Whether first row contains column namescolumns
: Optional list of column names
Finding Cell Locations
To find the location of a labeled cell:
# Get row, col tuple of cell containing label
row, col = reader.find_cell("Temperature", similarity=0.9)
# Get optional location (returns None if not found)
location = reader.find_cell_optional(
"Temperature",
similarity=0.9,
row=None, # optional row constraint
col=None, # optional column constraint
safe=False,
ignore_spaces=False
)
Real World Example
Here's an example from aluminum precipitation data processing:
def exp_info(r: ExcelReader, sample_id: str, original_sample_id: str, stoneware_link: str):
return {
"operator": r.read_str_optional(
"operator", "bottom", safe=True, label_optional=True
),
"date": r.read_str_optional(
"Date", "bottom", safe=True, label_optional=True, row=1
),
"test_designation": r.read_str_optional(
"test designation", "bottom", safe=True, label_optional=True
),
"temperature": r.read_float_optional(
"Temperature (°C)", "bottom", safe=True, label_optional=True
),
"stir_speed": r.read_float_optional(
"Stir Speed (rpm)", "bottom", safe=True, label_optional=True
)
}
Error Handling
- Non-optional read methods raise
ValueError
when labels aren't found or values can't be converted - Optional methods return None in these cases when
safe=True
similarity
threshold controls how strict the label matching is (1.0 = exact match)- Setting
label_optional=True
prevents errors when labels aren't found
Best Practices
- Use optional reads with
safe=True
when data might be missing or malformed - Set appropriate similarity thresholds based on your label consistency
- Use type-specific methods (
read_float
,read_int
, etc.) when you know the expected type - Consider
ignore_spaces=True
when dealing with inconsistent spacing in labels - Use
find_cell()
when you need to know the location of data for more complex processing