OiO.lk Community platform!


Finding probabilities of each value in all categorical columns across a dataframe

  • Thread starter: DeltaIV (Guest)
My question is nearly identical to

Finding frequency of each value in all categorical columns across a dataframe (https://stackoverflow.com/q/70811240)

but I need the probabilities instead of the frequencies. We can use the same example dataframe:

Code:
import pandas as pd

df = pd.DataFrame(
    {'sub_code' : ['CSE01', 'CSE01', 'CSE01',
                   'CSE02', 'CSE03', 'CSE04',
                   'CSE05', 'CSE06'],
     'stud_level' : [101, 101, 101, 101,
                     101, 101, 101, 101],
     'grade' : ['STA','STA','PSA','STA','STA','SSA','PSA','QSA']})

I tried adapting this answer

https://stackoverflow.com/a/70811258

in the following way:

Code:
out = (df.select_dtypes(object)
       .melt(var_name="Variable", value_name="Class")
       .value_counts(dropna=False, normalize=True)
       .reset_index(name="Probability")
       .sort_values(by=['Variable','Class'], ascending=[True,True])
       .reset_index(drop=True))

However, the code doesn't work, because the sum of the class probabilities for each variable is not 1. What am I doing wrong?
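
A possible explanation, offered as a sketch rather than the accepted answer: `melt` stacks the two categorical columns into one long frame of 16 rows, and `value_counts(normalize=True)` on that frame divides every (Variable, Class) count by all 16 rows. The probabilities therefore sum to 1 over the whole frame, and to 0.5 within each variable (for example, `grade == 'STA'` comes out as 4/16 instead of 4/8). Normalizing within each variable, for instance by grouping on `Variable` first, should give per-variable probabilities. In the sketch below, `melted` is a name introduced only for illustration, and `select_dtypes(exclude="number")` is my substitution for `select_dtypes(object)`, assuming every non-numeric column should count as categorical.

```python
import pandas as pd

df = pd.DataFrame(
    {'sub_code': ['CSE01', 'CSE01', 'CSE01',
                  'CSE02', 'CSE03', 'CSE04',
                  'CSE05', 'CSE06'],
     'stud_level': [101, 101, 101, 101,
                    101, 101, 101, 101],
     'grade': ['STA', 'STA', 'PSA', 'STA', 'STA', 'SSA', 'PSA', 'QSA']})

# Assumption: "categorical" means "not numeric" here; this replaces
# select_dtypes(object) from the question.
melted = (df.select_dtypes(exclude="number")
            .melt(var_name="Variable", value_name="Class"))

# Group by Variable first, so normalize=True divides each count by the
# number of rows of that variable rather than by the whole melted frame.
out = (melted
       .groupby("Variable")["Class"]
       .value_counts(dropna=False, normalize=True)
       .reset_index(name="Probability")
       .sort_values(by=["Variable", "Class"], ascending=[True, True])
       .reset_index(drop=True))

print(out)
print(out.groupby("Variable")["Probability"].sum())
```

The last line is a quick check that the probabilities of each variable add up to 1.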