Convert a python UDF to pandas UDF to improve performance in PySpark

  • Thread starter: Zafar Waris (Guest)
I have multiple Python functions that I use as UDFs in PySpark, but my data is very large and applying all of these UDFs makes the transformation take a long time. All of the UDFs share the same syntax. I have heard that Pandas UDFs are much faster than Python UDFs. I tried to use them in my code, but since I don't have much experience with Pandas, I have been getting a lot of errors.

The Python UDF is as follows:

Code:
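import json
import yaml  # PyYAML

def python_udf(row):
    # Null input is mapped to an empty JSON list.
    if row is None:
        return '[]'
    # Drop the doubled quotes, then round-trip through YAML and JSON.
    x = json.dumps(yaml.safe_load("".join(row.split('""'))))
    return json.dumps(yaml.safe_load(x.replace('"', '')))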

The Pandas UDF I tried to write:

Code:
@pandas_udf(returnType=StringType())
def python_udf(row):
    if row is None:
        return '[]'
    x = json.dumps(yaml.safe_load("".join(row.str.split('""'))))
    return json.dumps(yaml.safe_load(x.replace('"', '')))
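
A note on why this attempt is likely to fail: a Series-to-Series pandas UDF is called once per batch with a whole pandas.Series, not once per value. The small sketch below uses plain pandas (no Spark) and a made-up two-row batch to show the two consequences: the `is None` guard never fires, and `.str.split` yields a list per row, which `"".join` cannot concatenate.

```python
import pandas as pd

# Hypothetical batch, shaped like what a pandas UDF would receive.
batch = pd.Series(['"[{""abc"": ""abc""}]"', None])

# The Series object itself is never None, even if it contains nulls,
# so the `if row is None` guard is skipped for every batch.
guard_fires = batch is None

# .str.split returns a Series whose non-null elements are Python lists.
parts = batch.str.split('""')
first_part_type = type(parts.iloc[0])

# "".join iterates over the Series and finds a list where it expects a str.
error_message = None
try:
    "".join(parts)
except TypeError as exc:
    error_message = str(exc)

print(guard_fires, first_part_type, error_message)
```

The TypeError raised here is a plain Python error about joining lists, which matches the "expected str ... list" wording of a type mismatch.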

The input to this function looks like this:

Code:
"[{""abc"": ""abc"", ""def"": ""18"", ""ghi"": 3, ""jkl"": 0, ""mno"": []}]"

I would really appreciate it if someone could help me convert my current UDFs into Pandas UDFs.

Using the same function with pandas_udf gave me a type mismatch error: 'expected str got list'.