Polars: How to map records one to one in a dataframe? Any way to enforce 1:1 mapping in dataframes join operation?

Asked by barrelquentin997
TLDR: I need to map duplicated rows 1:1 within a dataframe. Is there any way to make a join operation result in a 1:1 mapping (enforce the 'validate' parameter)?

Example:

Code:
schema = {
    "id": pl.Utf8,
    "partition_number": pl.Utf8,
    "type": pl.Utf8,
    "state": pl.Int64,
    "price": pl.Float64,
    "quantity": pl.Int64,
    "source": pl.Int64,
    "some_other_random_column": pl.Int64,
}

df = pl.read_csv("data.csv", schema=schema)

source1 = df.filter(pl.col('source') == 1)
source2 = df.filter(pl.col('source') == 2)

overlap = source1.join(source2, on=['partition_number', 'type', 'state', 'price', 
    'quantity'], how='inner', suffix='_other')

If there are duplicated rows in either of these, the join results in a one_to_many / many_to_many mapping. So, for example, shapes (50,7) and (55,7) can produce a (78,7) dataframe after the join. I want the result to be (50,7) at most. I can then check which rows are not present in the other dataframe by doing the following (though I don't strictly need to):

Code:
all_rows = source1.join(source2, on=['partition_number', 'type', 'state', 'price',
   'quantity'], how='full', suffix='_other')
only_in_source1 = all_rows.filter(pl.col('quantity_other').is_null())  # (0,7)
only_in_source2 = all_rows.filter(pl.col('quantity').is_null())  # (5,7)

I think that, in theory, the second code block lets me calculate how many unique 1:1 mappings there are, but I want the rows actually mapped, not just counted, regardless of the order. I can then assign each pair a "mapped_id" value.

It doesn't have to be done this way; I thought a similar approach would probably be the fastest, but any suggestion that helps me map rows 1:1 in an optimized way would be great.

EDIT: Added an input example and also added the 'id' column to the schema; sorry, I forgot to include it at the start.

Example Input:

Code:
id,partition_number,type,state,price,quantity,source,some_other_random_column
s1_1,p1,A,1,100.0,10,1,1001
s1_2,p1,A,1,100.0,10,1,1001
s1_3,p2,B,2,150.0,20,1,1002
s1_4,p3,C,1,200.0,30,1,1003
s2_1,p1,A,1,100.0,10,2,1004
s2_2,p1,A,1,100.0,10,2,1004
s2_3,p1,A,1,100.0,10,2,1004
s2_4,p2,B,2,150.0,20,2,1005
s2_5,p4,D,1,250.0,40,2,1006

And example output:

Code:
shape: (3, 9)
┌──────┬──────────────────┬──────┬───────┬───────┬──────────┬────────┬──────────────────┬──────────┐
│ id   ┆ partition_number ┆ type ┆ state ┆ price ┆ quantity ┆ source ┆ some_other_rando ┆ id_other │
│ ---  ┆ ---              ┆ ---  ┆ ---   ┆ ---   ┆ ---      ┆ ---    ┆ m_column         ┆ ---      │
│ str  ┆ str              ┆ str  ┆ i64   ┆ f64   ┆ i64      ┆ i64    ┆ ---              ┆ str      │
│      ┆                  ┆      ┆       ┆       ┆          ┆        ┆ i64              ┆          │
╞══════╪══════════════════╪══════╪═══════╪═══════╪══════════╪════════╪══════════════════╪══════════╡
│ s1_1 ┆ p1               ┆ A    ┆ 1     ┆ 100.0 ┆ 10       ┆ 1      ┆ 1001             ┆ s2_1     │
│ s1_2 ┆ p1               ┆ A    ┆ 1     ┆ 100.0 ┆ 10       ┆ 1      ┆ 1001             ┆ s2_2     │
│ s1_3 ┆ p2               ┆ B    ┆ 2     ┆ 150.0 ┆ 20       ┆ 1      ┆ 1002             ┆ s2_4     │
└──────┴──────────────────┴──────┴───────┴───────┴──────────┴────────┴──────────────────┴──────────┘

Or an even shorter output is sufficient (this can simply be filtered from the output above):

Code:
shape: (3, 2)
┌──────┬──────────┐
│ id   ┆ id_other │
│ ---  ┆ ---      │
│ str  ┆ str      │
╞══════╪══════════╡
│ s1_1 ┆ s2_1     │
│ s1_2 ┆ s2_2     │
│ s1_3 ┆ s2_4     │
└──────┴──────────┘