I have parquet files stored in a partitioned directory structure like this:
bfl/pnr_group=0/319a1fb5557a342c1b55356ce5123123-0.parquet
When I read an individual parquet file directly using pq.read_table(), I can see the data. However, when trying to read it using PyArrow’s Dataset API with filtering, I get empty results:
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# This works - has data
file_path = "data/bfl/pnr_group=0/319a1fb5557a342c1b55356ce5123123-0.parquet"
table = pq.read_table(file_path)
print(len(table))  # Shows rows

# This finds the correct files but returns empty data
dataset = ds.dataset(
    'data/bfl',
    format="parquet",
    partitioning=ds.DirectoryPartitioning.discover(['pnr_group'])
)
filter_expr = ds.field('pnr_group') == '0'
filtered_dataset = dataset.filter(filter_expr)
df = filtered_dataset.to_table().to_pandas()  # Returns empty dataframe
The dataset schema shows 'pnr_group' as a string type, and dataset.files correctly lists all the parquet files. However, after filtering and converting to pandas, the resulting dataframe is empty.
How can I correctly read and filter partitioned parquet files using PyArrow’s Dataset API?