I have a column in a table with strings of variable length:
|value |
|-------------|
|abcdefgh |
|1234567891011|
I need to split each string into an array of strings, where every element is two characters long (except possibly the last one, when the number of characters is odd). Like so (a plain-Python sketch of this chunking follows the table below):
|value |split_value |
|-------------|---------------------------|
|abcdefgh |[ab, cd, ef, gh, ] |
|1234567891011|[12, 34, 56, 78, 91, 01, 1]|
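For clarity, here is the same chunking in plain Python (just an illustration, not part of the Spark job); note that, unlike the regex output below, this version produces no trailing empty element:

```python
# Plain-Python sketch of the intended chunking (illustration only)
def chunk2(s):
    # take consecutive 2-character slices; an odd trailing character stays alone
    return [s[i:i + 2] for i in range(0, len(s), 2)]

print(chunk2("abcdefgh"))       # ['ab', 'cd', 'ef', 'gh']
print(chunk2("1234567891011"))  # ['12', '34', '56', '78', '91', '01', '1']
```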
This works in PySpark:
```python
# Sample data
data = [("abcdefgh",), ("1234567891011",)]
df = spark.createDataFrame(data, ["value"])

# Register the DataFrame as a temporary view
df.createOrReplaceTempView("strings")

# Use Spark SQL to add a delimiter every 2 characters, then split on it
result = spark.sql("""
    SELECT
        value,
        split(regexp_replace(value, '(.{2})', '$1,'), ',') AS split_value
    FROM strings
""")

# Show the result
result.show(truncate=False)
```
… giving the resulting table above as expected.
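Note that the trailing comma which `regexp_replace` appends after an even-length string is what produces the empty last element in `[ab, cd, ef, gh, ]`. If that element is unwanted, a possible alternative (a sketch; `regexp_extract_all` requires Spark 3.1+) matches the chunks directly instead of inserting delimiters:

```python
# Possible alternative (sketch): match runs of 1-2 characters directly,
# so no delimiter is inserted and no trailing empty element appears.
result_alt = spark.sql("""
    SELECT
        value,
        regexp_extract_all(value, '.{1,2}', 0) AS split_value
    FROM strings
""")
result_alt.show(truncate=False)
```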
However, when I execute the exact same SQL statement in a SQL cell in a Databricks notebook, I get an array of empty strings:
```sql
%sql
SELECT
    value,
    split(regexp_replace(value, '(.{2})', '$1,'), ',') AS split_value
FROM strings
```
|value |split_value |
|-------------|----------------------------|
|abcdefgh |["", "", "", "", ] |
|1234567891011|["", "", "", "", "", "", ""]|
It also gives me a warning.
How can I achieve the desired result in SQL on Databricks?
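For what it's worth, a likely explanation: in a Databricks `%sql` cell the `$1` in the replacement string seems to be consumed by the notebook's `$param` parameter substitution before the query ever reaches Spark, so `regexp_replace` effectively receives an empty replacement, which would account for both the warning and the arrays of empty strings. One possible workaround (a sketch, assuming a runtime where `regexp_extract_all` is available, i.e. Spark 3.1+) is to avoid the `$` reference altogether:

```sql
%sql
-- Sketch: extract runs of 1-2 characters directly, so the statement
-- contains no $1 for the notebook to treat as a parameter reference.
SELECT
  value,
  regexp_extract_all(value, '.{1,2}', 0) AS split_value
FROM strings
```

Because the pattern is greedy, it matches two characters wherever possible and leaves a lone trailing character as its own element, e.g. `[12, 34, 56, 78, 91, 01, 1]`, without the trailing empty element of the `regexp_replace` approach.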