
delta-rs writing with Pandas gets stuck for a big dataset


I am using delta-rs to read and process some Delta tables with Pandas.

I ran several experiments with the following fairly simple code:

from deltalake import write_deltalake, DeltaTable

# Read the whole table into memory as a pandas DataFrame
df = DeltaTable(s3_table_URI, storage_options={"AWS_REGION": "eu-west-1"}).to_pandas()

# ...transformations happen here (omitted)...

# Write it back, overwriting both data and schema
write_deltalake(s3_table_URI, df, mode="overwrite", schema_mode="overwrite")

I tried with two different tables:

  1. A fairly large table: 500 million rows and 12 columns.
  2. A smaller table: 145,000 rows and 5 columns.

For the smaller table, the code works fine with no problems at all.

However, for the bigger one, the code successfully reads the data into memory and I am able to process it with some transformations (omitted here), but it gets stuck during the write. Which transformations I apply does not seem to matter: the code also gets stuck with a plain read/write and no intermediate transformations at all.

I am using an m4.10xlarge (160 GB of RAM, 40 cores) as a single node on Databricks. I thought it could be an OOM issue, but there is still plenty of memory available (more than 50 GB) when the write gets stuck. After 3 hours, the cluster is still there, apparently doing nothing, while the write command runs.
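One variation I plan to try (a sketch, not yet verified to fix the hang; it assumes deltalake accepts a PyArrow RecordBatchReader, which recent versions do) is handing the writer a streaming Arrow reader instead of the DataFrame itself, so it consumes the data batch by batch:

import pyarrow as pa
from deltalake import write_deltalake

# Convert the DataFrame to Arrow once, then stream it to the writer
# as record batches instead of one monolithic table.
table = pa.Table.from_pandas(df, preserve_index=False)
reader = table.to_reader(max_chunksize=1_000_000)  # batch size is a guess

write_deltalake(
    s3_table_URI,
    reader,
    mode="overwrite",
    schema_mode="overwrite",
    storage_options={"AWS_REGION": "eu-west-1"},
)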

The same code executed with Polars (which uses delta-rs under the hood) works perfectly, with no problems:

import polars as pl
df = pl.read_delta(s3_table_URI, storage_options={"AWS_REGION": "eu-west-1"})
df.write_delta(s3_table_URI, mode="overwrite", storage_options={"AWS_REGION": "eu-west-1"})

Has anyone experienced a similar problem? It looks like some sort of bug in the interaction between Pandas and delta-rs, since Polars works fine.
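In the meantime, the only mitigation I can think of on the Pandas side is chunking the write myself, overwriting with the first slice and appending the rest (a sketch; the chunk size is arbitrary):

from deltalake import write_deltalake

chunk_size = 10_000_000  # arbitrary; tune to the table

for i, start in enumerate(range(0, len(df), chunk_size)):
    chunk = df.iloc[start:start + chunk_size]
    write_deltalake(
        s3_table_URI,
        chunk,
        # Replace the table with the first chunk, then append the rest
        mode="overwrite" if i == 0 else "append",
        schema_mode="overwrite" if i == 0 else None,
        storage_options={"AWS_REGION": "eu-west-1"},
    )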

I am forced to use Pandas because of other dependencies that I cannot update at the company I work for.
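Given that constraint, the pragmatic option I am leaning towards is keeping the reading and the transformations in Pandas and delegating only the write step to Polars (a sketch, assuming the converted DataFrame fits in memory alongside the original):

import polars as pl

# Read and transform with pandas as before, then convert the result
# to a Polars DataFrame only for the write step.
pl.from_pandas(df).write_delta(
    s3_table_URI,
    mode="overwrite",
    storage_options={"AWS_REGION": "eu-west-1"},
)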



