OiO.lk Community platform!

Oio.lk is an excellent forum for developers, providing a wide range of resources, discussions, and support for those in the developer community. Join oio.lk today to connect with like-minded professionals, share insights, and stay updated on the latest trends and technologies in the development field.
  You need to log in or register to access the solved answers to this problem.
  • You have reached the maximum number of guest views allowed
  • Please register below to remove this limitation

Python Pandas: Join on case-insensitive and stripped key

  • Thread starter Thread starter J. N.
  • Start date Start date
J

J. N.

Guest
Left Dataframe MASTER =

FnameLnameAmount
JohnSmith-Richards
DavidO'Brien

Right Dataframe ORDERS =

FnameLnameAmount
DAVIDOBRIEN36
johnsmith richards11

Expected output Dataframe =

FnameLnameAmount
JohnSmith-Richards11
DavidO'Brien36

The tables MASTER and ORDERS are LEFT-OUTER joined with KEY = [Fname, Lname]

Both tables have the same columns but the rows are not in the same order. I want the join operation to ignore case and ignore characters like dashes, spaces and apostrophes (e.g. the tables would be joined on the values "johnsmithrichards" and "davidobrien")

(In practice, the actual tables have many more columns and the resulting output table must pull some values from the MASTER table and others from the ORDERS table, which is why I used mdfcols and odfcols respectively to only include those specific columns.)

My attempt at removing just spaces doesn't work:

Code:
key = ["Fname", "Lname"]

mdfkey = (mdf["First Name"].str.lower() + mdf["Last Name"].str.lower()).replace(' ', '')
mdfcols = ["Fname", "Lname"]

odfkey = (odf["First Name"].str.lower() + odf["Last Name"].str.lower()).replace(' ', '')
odfCols = ["Amount"]

outputdf = pd.merge(
    mdf.loc[:, mdfcols],
    odf.loc[:, odfCols],
    how='left', left_on=mdfkey, right_on=odfkey)
<p>Left Dataframe MASTER =</p>
<div class="s-table-container"><table class="s-table">
<thead>
<tr>
<th>Fname</th>
<th>Lname</th>
<th>Amount</th>
</tr>
</thead>
<tbody>
<tr>
<td>John</td>
<td>Smith-Richards</td>
<td></td>
</tr>
<tr>
<td>David</td>
<td>O'Brien</td>
<td></td>
</tr>
</tbody>
</table></div>
<p>Right Dataframe ORDERS =</p>
<div class="s-table-container"><table class="s-table">
<thead>
<tr>
<th>Fname</th>
<th>Lname</th>
<th>Amount</th>
</tr>
</thead>
<tbody>
<tr>
<td>DAVID</td>
<td>OBRIEN</td>
<td>36</td>
</tr>
<tr>
<td>john</td>
<td>smith richards</td>
<td>11</td>
</tr>
</tbody>
</table></div>
<p>Expected output Dataframe =</p>
<div class="s-table-container"><table class="s-table">
<thead>
<tr>
<th>Fname</th>
<th>Lname</th>
<th>Amount</th>
</tr>
</thead>
<tbody>
<tr>
<td>John</td>
<td>Smith-Richards</td>
<td>11</td>
</tr>
<tr>
<td>David</td>
<td>O'Brien</td>
<td>36</td>
</tr>
</tbody>
</table></div>
<p>The tables MASTER and ORDERS are LEFT-OUTER joined with KEY = [Fname, Lname]</p>
<p>Both tables have the same columns but the rows are not in the same order. I want the join operation to ignore case and ignore characters like dashes, spaces and apostrophes (e.g. the tables would be joined on the values "johnsmithrichards" and "davidobrien")</p>
<p>(In practice, the actual tables have many more columns and the resulting output table must pull some values from the MASTER table and others from the ORDERS table, which is why I used <code>mdfcols</code> and <code>odfcols</code> respectively to only include those specific columns.)</p>
<p>My attempt at removing just spaces doesn't work:</p>
<pre><code>key = ["Fname", "Lname"]

mdfkey = (mdf["First Name"].str.lower() + mdf["Last Name"].str.lower()).replace(' ', '')
mdfcols = ["Fname", "Lname"]

odfkey = (odf["First Name"].str.lower() + odf["Last Name"].str.lower()).replace(' ', '')
odfCols = ["Amount"]

outputdf = pd.merge(
mdf.loc[:, mdfcols],
odf.loc[:, odfCols],
how='left', left_on=mdfkey, right_on=odfkey)
</code></pre>
 
Top