October 26, 2024
python

ValueError: not enough values to unpack (expected 3, got 2) when extracting data using zip() in Pandas


I’m trying to clean and organize my data from a CSV file using Python and Pandas. Specifically, I want to extract structured information (like Social Security Numbers, Date of Birth, and Relationships) from the ‘Notes’ column of my DataFrame. However, I keep encountering this error:

PS C:\Users\hokop\Documents\GitHub\Tina-Agency-of-Texas-Data> python test2.py
Traceback (most recent call last):
  File "C:\Users\hokop\Documents\GitHub\Tina-Agency-of-Texas-Data\test2.py", line 80, in <module>
    df['SSN'],df['DOB'],df['Relationship'] = zip(*df['Notes'].apply(extract_info))
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: not enough values to unpack (expected 3, got 2)

I’m confident that my extract_info function is returning three values (SSN, DOB, Relationship). When I print the output within the function, all three variables are there. Here’s a simplified version of my code:

import re
import pandas as pd

# Sample input data
df = pd.read_csv('contacts.csv')

# Define regex patterns for DOB and SSN
dob_pattern = r'\b(?:DOB:|DOB;|DOB: |DOB;)\s*:? ?([0-9]{2}/[0-9]{2}/[0-9]{4})\b'
ssn_pattern = r'\b(?:SS|SS |SS#|SS:|SS: |SS;|SS; |SS# |SS#:|SS#: )\s*:? ?([0-9]{3}-[0-9]{2}-[0-9]{4}|[0-9]{9})\b'
name_pattern3 = r'(?P<first>[A-Za-z]+)(?:\s+(?P<middle>[A-Za-z]+))?\s+(?P<last>[A-Za-z]+)'
name_pattern2 = r'(?P<first>[A-Za-z\'-]+)\s+(?P<last>[A-Za-z\'-]+)'

# Define a list of relationship keywords
relationship_keywords = [
    "father",
    "mother",
    "brother",
    "sister",
    "friend",
    "spouse",
    "partner",
    "child",
    "aunt",
    "uncle",
    "cousin"
]

# Build a regex pattern string that matches any of the relationship keywords
relationship_pattern = r'\b(?:' + '|'.join(relationship_keywords) + r')\b'

# Function to extract structured information
def extract_info(entry):
    if not isinstance(entry, str):  # Check if the entry is a string
        return '',''  # Return empty values for non-strings

    
    # Initialize variables
    name = ""
    dob = ""
    ssn = ""
    relationship = "asd"
    
    # Split entry into lines
    lines = entry.splitlines()
    for line in lines:
        line = line.strip()
        
        # if re.match(relationship_pattern, line): 
        #     relationship = re.search(relationship_pattern, line).group(1)
            
        #     if re.match(name_pattern3, line): 
        #         name = re.search(name_pattern3, line).group(1)
        #     if re.match(name_pattern2, line):
        #         name = re.search(name_pattern2, line).group(1)
        # elif not relationship:
        #     relationship = 'asd'
        if re.match(name_pattern3, line): 
            
            name = re.search(name_pattern3, line).group(1)
        elif re.match(name_pattern2, line):
            name = re.search(name_pattern2, line).group(1)
        elif re.match(ssn_pattern, line):
            # Extract SSN
            ssn = re.search(ssn_pattern, line).group(1)
        elif re.match(dob_pattern, line):
            # Extract DOB
            dob = re.search(dob_pattern, line).group(1)
        else:
            # Assume the remaining line is the name
            if line.strip() != '':
                name = line
            else:
                name=""
    relationship = "asd"


    return ssn, dob, relationship
# Apply extract_info to each note and unpack the results into three new columns

df['SSN'],df['DOB'],df['Relationship'] = zip(*df['Notes'].apply(extract_info))

# Save the DataFrame, including the new columns, to a CSV file
df.to_csv('ssn.csv', index=False)

# Display the DataFrame
print(df)

I’m expecting the extract_info function to return a tuple of three values, which should be unpacked into three new columns (SSN, DOB, Relationship). But the error suggests that sometimes only two values are returned.
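
For reference, the same error can be reproduced without Pandas. The following is only a minimal sketch using made-up tuples in place of the real Notes data: zip(*rows) stops at the shortest tuple, so a single two-element result among the three-element ones appears to be enough to leave only two columns to unpack.

```python
# Hypothetical results, standing in for what extract_info returns per row
rows = [
    ("123-45-6789", "01/02/1990", "asd"),  # normal three-value result
    ("", ""),                              # two-value result from an early return
]

# zip(*rows) transposes rows into columns and truncates to the shortest row
columns = list(zip(*rows))
print(len(columns))

# Unpacking into three names then fails, because only two columns exist
try:
    ssn, dob, relationship = zip(*rows)
except ValueError as exc:
    message = str(exc)
    print(message)
```

So the error does not require every row to be short; one short tuple anywhere in the column would be enough.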

Here are a few details about my setup:

  • I’m using regex to extract specific patterns.
  • If an entry doesn’t match the expected patterns, I want the corresponding values to default to empty strings.

What could be causing the function to return only two values instead of three in some cases? Any advice on how to debug or fix this issue would be greatly appreciated!
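
One likely cause, judging from the code above, is the early return for non-string entries: it returns '','' (two values), while the end of the function returns three. Empty cells in the Notes column are typically read as NaN (a float), so they would take that branch. Below is a minimal sketch of the fix plus a length check for debugging. The regex patterns are simplified stand-ins, the sample notes are invented, and float('nan') is assumed to represent an empty CSV cell.

```python
import re

# Simplified patterns for illustration only; not the patterns from the question
DOB_RE = re.compile(r'DOB[:;]?\s*(\d{2}/\d{2}/\d{4})')
SSN_RE = re.compile(r'SS#?[:;]?\s*(\d{3}-\d{2}-\d{4}|\d{9})')

def extract_info(entry):
    # Non-string cells (e.g. NaN for empty Notes) get three empty values,
    # matching the three-value return at the end of the function
    if not isinstance(entry, str):
        return '', '', ''

    ssn = dob = ''
    relationship = 'asd'  # placeholder, as in the question
    for line in entry.splitlines():
        ssn_match = SSN_RE.search(line)
        dob_match = DOB_RE.search(line)
        if ssn_match:
            ssn = ssn_match.group(1)
        if dob_match:
            dob = dob_match.group(1)
    return ssn, dob, relationship

# Hypothetical Notes values; float('nan') stands in for an empty CSV cell
notes = ["John Smith\nSS#: 123-45-6789\nDOB: 01/02/1990", float('nan'), "no data here"]

results = [extract_info(n) for n in notes]

# Debugging check: list the positions whose result is not length 3
bad = [i for i, r in enumerate(results) if len(r) != 3]
print(bad)

# With every result at length 3, the unpacking succeeds
ssn_col, dob_col, rel_col = zip(*results)
print(ssn_col)
```

With the real DataFrame, the equivalent change would be the one-line edit to the early return in extract_info; the zip(*df['Notes'].apply(extract_info)) assignment itself should not need to change.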


