OiO.lk Community platform!


What are the advantages and disadvantages of using PyArrow RecordBatchReader vs BufferReader?

Thread starter: František Hluchník (Guest)
Currently, I'm using a PyArrow RecordBatchReader for processing potentially large datasets. What I need is to create an appropriate reader for the PyArrow dataset module's write_dataset function. I'm considering using a PyArrow BufferReader instead, so I can skip the step of creating batches from my dataset. However, according to the write_dataset function's docstring, it seems I can't pass a BufferReader as the data parameter:

Code:
data : Dataset, Table/RecordBatch, RecordBatchReader, list of Table/RecordBatch, or iterable of RecordBatch

Do you have any experience with these two alternative approaches? Could you please share your opinion on this matter?

I have tried the RecordBatchReader, but it requires batching the dataset first. I have also tried the BufferReader, but it seems I would have to load the whole dataset into memory to create the stream that the BufferReader requires.