Using PySpark, what is the fastest way of finding the most frequent combinations that appear in a list of lists?

Thread starter: joelion2 (Guest)
I am using a Databricks PySpark notebook. I am trying to find the most efficient way of finding the most frequent combinations in a list of lists. There are 3.8 million combinations and 170,000 lists. I wrote an algorithm to do this in standard Python, but when I fed it 10 lists it took 1 minute to process, so extrapolating, it would take about 280 hours to process all the lists. Since I have PySpark, I think there may be a more efficient way of handling this, or a more efficient way of using standard Python code - whichever works. I'd like to keep the processing time under 30 minutes, ideally 10.

Here is my current algorithm, which is inefficient at handling a large number of lists:

Code:
from collections import defaultdict

combinations_of_n = [ ('a','b','c'), ('e','f','g'), ('h','i','j') ] # 3.8 million combinations
df_list = [ ['b','v','e','a','b','c'], ['g','g','a','b','c','f','b'], ['i','k','l','a','i','k'] ] # 170000 lists

# Count occurrences of each combination in df_list
combination_count = defaultdict(int)

for sublist in df_list:
    for comb in combinations_of_n:
        if all(elem in sublist for elem in comb):
            combination_count[comb] += 1

# Find the top 5 most frequent combinations
top_combinations = sorted(combination_count.items(), key=lambda x: x[1], reverse=True)[:5]

# Print the results
print("Top 5 most frequent combinations:")
for comb, count in top_combinations:
    print(f"{comb}: {count} occurrences")

Any help appreciated
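
And the same idea sketched in PySpark, in case the combination generation itself needs to be spread across executors. Again this is just a sketch: it reuses combinations_of_n and df_list from above, assumes the candidate tuples are stored sorted, and assumes the candidate set fits in executor memory so it can be broadcast.

Code:
from itertools import combinations
from operator import add
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

n = 3  # combination size

# Broadcast the candidate combinations so every executor can filter locally.
# Assumes the whole candidate set fits in executor memory.
candidates = sc.broadcast(set(combinations_of_n))

top_combinations = (
    sc.parallelize(df_list)
      # each sublist emits every size-n combination of its distinct elements once
      .flatMap(lambda sub: combinations(sorted(set(sub)), n))
      # keep only the combinations we are actually looking for
      .filter(lambda comb: comb in candidates.value)
      .map(lambda comb: (comb, 1))
      .reduceByKey(add)
      .takeOrdered(5, key=lambda kv: -kv[1])
)

print("Top 5 most frequent combinations:")
for comb, count in top_combinations:
    print(f"{comb}: {count} occurrences")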
<p>I am using a Databricks PySpark notebook. I am trying to find the most efficient way of finding the most frequent combinations in a list of lists. The number of combinations is 3.8 million and number of list of lists is 170000. I wrote an algorithm to do is this in stanard python, but when I fed in 10 lists this took 1 minute to process, so expanding that it would take 280 hours to process all lists. But since I have PySpark I think there may be a more efficient way of handling this, or a more efficient way of using standard python code - whichever works. I'd like to keep the processing time under 30 minutes, ideally 10.</p>
<p>Here is my current algorithm but inefficient at handling large number of lists:</p>
<pre><code>from collections import defaultdict

combinations_of_n = [ ('a','b','c'), ('e','f','g'), ('h','i','j') ] # 3.8 million combinations
df_list = [ ['b','v','e','a','b','c'], ['g','g','a','b','c','f','b'], ['i','k','l','a','i','k'] ] # 170000 lists

# Count occurrences of each combination in lst_of_lsts
combination_count = defaultdict(int)

for sublist in df_list:
for comb in combinations_of_n:
if all(elem in sublist for elem in comb):
combination_count[comb] += 1

# Find the top 5 most frequent combinations
top_combinations = sorted(combination_count.items(), key=lambda x: x[1], reverse=True)[:5]

# Print the results
print("Top 5 most frequent combinations:")
for comb, count in top_combinations:
print(f"{comb}: {count} occurrences")
</code></pre>
<p>Any help appreciated</p>
 
