
PyArrow Dataset filtering not working with partitioned parquet files


I have parquet files stored in a partitioned directory structure like this:

bfl/pnr_group=0/319a1fb5557a342c1b55356ce5123123-0.parquet

When I read an individual Parquet file directly with pq.read_table(), I can see the data. However, when I read the same directory through PyArrow’s Dataset API and apply a filter, the result is empty:

import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# This works - has data
file_path = "data/bfl/pnr_group=0/319a1fb5557a342c1b55356ce5123123-0.parquet"
table = pq.read_table(file_path)
print(len(table))  # Shows rows

# This finds the correct files but returns empty data
dataset = ds.dataset(
    'data/bfl',
    format="parquet",
    partitioning=ds.DirectoryPartitioning.discover(['pnr_group'])
)

filter_expr = ds.field('pnr_group') == '0'
filtered_dataset = dataset.filter(filter_expr)
df = filtered_dataset.to_table().to_pandas()  # Returns empty dataframe

The dataset schema shows ‘pnr_group’ as a string type, and dataset.files correctly lists all the parquet files. However, after filtering and converting to pandas, the resulting dataframe is empty.
How can I correctly read and filter partitioned parquet files using PyArrow’s Dataset API?


