OiO.lk Community platform!

Oio.lk is an excellent forum for developers, providing a wide range of resources, discussions, and support for those in the developer community. Join oio.lk today to connect with like-minded professionals, share insights, and stay updated on the latest trends and technologies in the development field.
  You need to log in or register to access the solved answers to this problem.
  • You have reached the maximum number of guest views allowed
  • Please register below to remove this limitation

Is there a better way to generate my dataframe?

  • Thread starter Thread starter MPJ567
  • Start date Start date
M

MPJ567

Guest
I have a set of data that I need to perform some transformations on. The raw form of the data is as follows (actual dataset will have many more columns, reduced set here for simplicity):

lVoterUniqueIDsElectionAbbr1sElectionAbbr2sElectionAbbr3sElectionAbbr4sElectionAbbr5
3715272024PR2022Gen2022PR2020GEN2020PR
18439492024PRnullnullnullnull
28133982024PR2022Gennull2020GEN2020PR

The output I'm looking for is:


  • lVoterUniqueID relabeled as ID


  • Each sElectionAbbr[X] column relabeled to Election_Code
IDElection_Code
3715272024PR
18439492024PR

I can accomplish this via sql:

Code:
select distinct 
lVoterUniqueID,
sElectionAbbr1 as Election_Code
from data_set
where sElectionAbbr1 != ''
union 
select distinct 
lVoterUniqueID,
sElectionAbbr2
from data_set
where sElectionAbbr2 != ''
union ...

However, I wanted to do this within Python and I've accomplished that goal, but it feels like the process is a bit heavy handed and I'm wondering if there is a more efficient way to do this.

Working Python below:

Code:
import pandas as pd

df = pd.read_csv([path to data file])
#find number of election columns. 
selected_columns = [column for column in df.columns if column.startswith("sElection")]
#add voter id column
selected_columns.insert(0,'lVoterUniqueID')

#create dataframe schema
rcdp_voters_prod = pd.DataFrame(columns=['ID','Election_Code'])

def access_elements(list_var, list_index):
    # Use list comprehension to create a new list containing elements from 'list_var' at the specified indices in 'list_index'
    result = [list_var[i] for i in list_index]
    return result

#indices of list values that hold the election column
python_indices  = [index for (index, item) in enumerate(selected_columns) if item != "lVoterUniqueID"]

union_dict = {}

#Create dictionary of lists containing unique voterid and each election abbreviation column
# ['lVoterUniqueID', 'sElectionAbbr1']
# ['lVoterUniqueID', 'sElectionAbbr2']
# ['lVoterUniqueID', 'sElectionAbbr3']
# ['lVoterUniqueID', 'sElectionAbbr4']
# ['lVoterUniqueID', 'sElectionAbbr5']
for index in python_indices:
    union_dict.update({index:access_elements(selected_columns,[0,index])})

dataframe_list = []

#create list of dataframes containing only distinct columns from column list
#rename columns for concating
for key in union_dict:
    loop_df = df.filter(items=union_dict[key]).drop_duplicates()
    loop_df.rename(columns={'lVoterUniqueID':'ID',union_dict[key][1]:'Election_Code'}, inplace= True)
    dataframe_list.append(loop_df)


final_frame_form = pd.concat(dataframe_list)
#drop rows where election code is null
final_frame_form=final_frame_form[final_frame_form['Election_Code'].notnull()]
<p>I have a set of data that I need to perform some transformations on. The raw form of the data is as follows (actual dataset will have many more columns, reduced set here for simplicity):</p>
<div class="s-table-container"><table class="s-table">
<thead>
<tr>
<th>lVoterUniqueID</th>
<th>sElectionAbbr1</th>
<th>sElectionAbbr2</th>
<th>sElectionAbbr3</th>
<th>sElectionAbbr4</th>
<th>sElectionAbbr5</th>
</tr>
</thead>
<tbody>
<tr>
<td>371527</td>
<td>2024PR</td>
<td>2022Gen</td>
<td>2022PR</td>
<td>2020GEN</td>
<td>2020PR</td>
</tr>
<tr>
<td>1843949</td>
<td>2024PR</td>
<td>null</td>
<td>null</td>
<td>null</td>
<td>null</td>
</tr>
<tr>
<td>2813398</td>
<td>2024PR</td>
<td>2022Gen</td>
<td>null</td>
<td>2020GEN</td>
<td>2020PR</td>
</tr>
</tbody>
</table></div>
<p>The output I'm looking for is:</p>
<ul>
<li><p>lVoterUniqueID relabeled as ID</p>
</li>
<li><p>Each sElectionAbbr[X] column relabeled to Election_Code</p>
</li>
</ul>
<div class="s-table-container"><table class="s-table">
<thead>
<tr>
<th>ID</th>
<th>Election_Code</th>
</tr>
</thead>
<tbody>
<tr>
<td>371527</td>
<td>2024PR</td>
</tr>
<tr>
<td>1843949</td>
<td>2024PR</td>
</tr>
</tbody>
</table></div>
<p>I can accomplish this via sql:</p>
<pre><code>select distinct
lVoterUniqueID,
sElectionAbbr1 as Election_Code
from data_set
where sElectionAbbr1 != ''
union
select distinct
lVoterUniqueID,
sElectionAbbr2
from data_set
where sElectionAbbr2 != ''
union ...
</code></pre>
<p>However, I wanted to do this within Python and I've accomplished that goal, but it feels like the process is a bit heavy handed and I'm wondering if there is a more efficient way to do this.</p>
<p>Working Python below:</p>
<pre><code>import pandas as pd

df = pd.read_csv([path to data file])
#find number of election columns.
selected_columns = [column for column in df.columns if column.startswith("sElection")]
#add voter id column
selected_columns.insert(0,'lVoterUniqueID')

#create dataframe schema
rcdp_voters_prod = pd.DataFrame(columns=['ID','Election_Code'])

def access_elements(list_var, list_index):
# Use list comprehension to create a new list containing elements from 'list_var' at the specified indices in 'list_index'
result = [list_var for i in list_index]
return result

#indices of list values that hold the election column
python_indices = [index for (index, item) in enumerate(selected_columns) if item != "lVoterUniqueID"]

union_dict = {}

#Create dictionary of lists containing unique voterid and each election abbreviation column
# ['lVoterUniqueID', 'sElectionAbbr1']
# ['lVoterUniqueID', 'sElectionAbbr2']
# ['lVoterUniqueID', 'sElectionAbbr3']
# ['lVoterUniqueID', 'sElectionAbbr4']
# ['lVoterUniqueID', 'sElectionAbbr5']
for index in python_indices:
union_dict.update({index:access_elements(selected_columns,[0,index])})

dataframe_list = []

#create list of dataframes containing only distinct columns from column list
#rename columns for concating
for key in union_dict:
loop_df = df.filter(items=union_dict[key]).drop_duplicates()
loop_df.rename(columns={'lVoterUniqueID':'ID',union_dict[key][1]:'Election_Code'}, inplace= True)
dataframe_list.append(loop_df)


final_frame_form = pd.concat(dataframe_list)
#drop rows where election code is null
final_frame_form=final_frame_form[final_frame_form['Election_Code'].notnull()]
</code></pre>
 

Latest posts

M
Replies
0
Views
1
Meliodas Dragon
M
Top