I’m trying to clean and organize my data from a CSV file using Python and Pandas. Specifically, I want to extract structured information (like Social Security Numbers, Date of Birth, and Relationships) from the ‘Notes’ column of my DataFrame. However, I keep encountering this error:
PS C:\Users\hokop\Documents\GitHub\Tina-Agency-of-Texas-Data> python test2.py
Traceback (most recent call last):
File "C:\Users\hokop\Documents\GitHub\Tina-Agency-of-Texas-Data\test2.py", line 80, in <module>
df['SSN'],df['DOB'],df['Relationship'] = zip(*df['Notes'].apply(extract_info))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: not enough values to unpack (expected 3, got 2)
I’m confident that my extract_info function is returning three values (SSN, DOB, Relationship). When I print the output within the function, all three variables are there. Here’s a simplified version of my code:
import re
import pandas as pd
# Sample input data
df = pd.read_csv('contacts.csv')
# Define regex patterns for DOB and SSN
dob_pattern = r'\b(?:DOB:|DOB;|DOB: |DOB;)\s*:? ?([0-9]{2}/[0-9]{2}/[0-9]{4})\b'
ssn_pattern = r'\b(?:SS|SS |SS#|SS:|SS: |SS;|SS; |SS# |SS#:|SS#: )\s*:? ?([0-9]{3}-[0-9]{2}-[0-9]{4}|[0-9]{9})\b'
name_pattern3 = r'(?P<first>[A-Za-z]+)(?:\s+(?P<middle>[A-Za-z]+))?\s+(?P<last>[A-Za-z]+)'
name_pattern2 = r'(?P<first>[A-Za-z\'-]+)\s+(?P<last>[A-Za-z\'-]+)'
# Define a list of relationship keywords
relationship_keywords = [
    "father",
    "mother",
    "brother",
    "sister",
    "friend",
    "spouse",
    "partner",
    "child",
    "aunt",
    "uncle",
    "cousin"
]
# Compile a regex pattern for the relationships
relationship_pattern = r'\b(?:' + '|'.join(relationship_keywords) + r')\b'
# Function to extract structured information
def extract_info(entry):
    if not isinstance(entry, str): # Check if the entry is a string
        return '','' # Return empty values for non-strings
    # Initialize variables
    name = ""
    dob = ""
    ssn = ""
    relationship = "asd"
    # Split entry into lines
    lines = entry.splitlines()
    for line in lines:
        line = line.strip()
        # if re.match(relationship_pattern, line):
        #     relationship = re.search(relationship_pattern, line).group(1)
        # if re.match(name_pattern3, line):
        #     name = re.search(name_pattern3, line).group(1)
        # if re.match(name_pattern2, line):
        #     name = re.search(name_pattern2, line).group(1)
        # elif not relationship:
        #     relationship = 'asd'
        if re.match(name_pattern3, line):
            name = re.search(name_pattern3, line).group(1)
        elif re.match(name_pattern2, line):
            name = re.search(name_pattern2, line).group(1)
        elif re.match(ssn_pattern, line):
            # Extract SSN
            ssn = re.search(ssn_pattern, line).group(1)
        elif re.match(dob_pattern, line):
            # Extract DOB
            dob = re.search(dob_pattern, line).group(1)
        else:
            # Assume the remaining line is the name
            if line.strip() != '':
                name = line
            else:
                name=""
        relationship = "asd"
    return ssn, dob, relationship
# Apply extract_info to each note and unpack the results into three new columns
df['SSN'],df['DOB'],df['Relationship'] = zip(*df['Notes'].apply(extract_info))
# Save the DataFrame, including the new columns, to a CSV file
df.to_csv('ssn.csv', index=False)
# Display the DataFrame
print(df)
I’m expecting the extract_info function to return a tuple of three values, which should be unpacked into three new columns (SSN, DOB, Relationship). But the error suggests that sometimes only two values are returned.
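To illustrate how this exact message can come out of the zip(*...) line, here is a minimal sketch. The rows list is a hypothetical stand-in for what df['Notes'].apply(extract_info) produces, assuming at least one entry takes the early non-string return and so yields two values instead of three. zip stops at the shortest tuple, so a single short row is enough to leave only two columns to unpack.

```python
# Hypothetical stand-in for the tuples produced by df['Notes'].apply(extract_info)
rows = [
    ("123-45-6789", "01/02/1990", "asd"),  # string entry: three values
    ("", ""),                              # non-string entry: two values
]

# zip(*rows) transposes rows into columns, truncating to the shortest row
columns = list(zip(*rows))
print(len(columns))

message = ""
try:
    ssn, dob, relationship = zip(*rows)
except ValueError as exc:
    message = str(exc)
    print(message)
```

If this matches what happens in the real data, the function is returning three values for most rows and two for the rest, which would explain why printing inside the function looks fine.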
Here are a few details about my setup:
I’m using regex to extract specific patterns.
If an entry doesn’t match the expected patterns, I want the corresponding values to default to empty strings.
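The defaulting behaviour described above could look roughly like the sketch below. It is not the code from the question: SSN_RE, DOB_RE, and extract_info_sketch are hypothetical names, and the patterns are deliberately simplified versions of the ones above. The point is only that every path, including the non-string one, returns the same number of values.

```python
import re

# Simplified, hypothetical patterns (not the exact ones from the question)
SSN_RE = re.compile(r'SS#?[:;]?\s*(\d{3}-\d{2}-\d{4}|\d{9})\b')
DOB_RE = re.compile(r'DOB[:;]?\s*(\d{2}/\d{2}/\d{4})\b')

def extract_info_sketch(entry):
    """Return (ssn, dob, relationship), using '' for anything not found."""
    if not isinstance(entry, str):
        # Non-string entries (e.g. NaN) still get three values
        return '', '', ''
    ssn = SSN_RE.search(entry)
    dob = DOB_RE.search(entry)
    # Relationship extraction is left out of this sketch, so it defaults to ''
    return (ssn.group(1) if ssn else '', dob.group(1) if dob else '', '')

print(extract_info_sketch("John Doe\nSS#: 123-45-6789\nDOB: 01/02/1990"))
print(extract_info_sketch(float('nan')))
```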
What could be causing the function to return only two values instead of three in some cases? Any advice on how to debug or fix this issue would be greatly appreciated!
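One way to narrow this down, sketched under assumptions: apply the function first, then check the length of each returned tuple before unpacking. The extract_info below is a stripped-down stand-in that only mirrors the two return statements from the code above, and the three-row DataFrame is made-up sample data, not the real contacts.csv.

```python
import pandas as pd

def extract_info(entry):
    # Stand-in mirroring the two return statements in the question
    if not isinstance(entry, str):
        return '', ''
    return '', '', 'asd'

# Made-up sample data; the second row has no note at all
df = pd.DataFrame({'Notes': ['SS#: 123-45-6789', None, 'DOB: 01/02/1990']})

results = df['Notes'].apply(extract_info)  # Series of tuples
lengths = results.apply(len)               # number of values returned per row

# Rows whose result does not have exactly three values
bad_rows = df.loc[lengths != 3, 'Notes']
print(bad_rows)
```

Any row that shows up in bad_rows points at a code path returning the wrong number of values. In this sketch it is the row with the missing note.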