#### Kacper

##### Guest


<p>Dataset #1 (ds1) has a particular distribution of a variable "mass". I would like to draw a sample from dataset #2 (ds2) such that the distribution of "mass" in the sample matches its distribution in ds1. This works when "mass" in ds2 is uniformly distributed and I use the KDE of ds1, evaluated at each ds2 row, as weights when sampling ds2 (using pandas: <code>ds2.sample(...,weights = ds1_kde)</code>). However, when the mass variable in ds2 is not uniformly distributed, the skew is always reflected in the sample KDE, and the distribution of masses in the sample does not match the ds1 distribution.</p>

<p>This is a simplified version of a problem I face when drawing a sample from a population and trying to control for the distribution of a variable, so that it matches the distribution of that same variable in a subpopulation I'm investigating.</p>

<p>The approach works (kde of sample is similar to kde of ds1) when the distribution of mass in ds2 is uniform:</p>

<pre><code>import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from scipy import stats

# Dataset 1 (goal distribution)
ds1 = pd.DataFrame({'mass':[0,1,1,2,2,2,2,3,3,3,4,4,5,6,7,7,8,8,8,8,9,9,9,9,9,9,9,10,10,10,10,10,10,11,11,11,12,13,14,15]})
# Find kde of dataset 1
ds1_kde = stats.gaussian_kde(ds1.mass.to_list())
ds1_kde.set_bandwidth(bw_method=0.2)
# Dataset 2 with a uniform distribution
ds2 = pd.DataFrame({'mass':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]})
# Add dataset 1 kde to dataset 2 to use as weights when sampling
ds2['ds1_kde'] = ds2.mass.apply(lambda x: ds1_kde(x)[0])
smpl = ds2.sample(200, weights='ds1_kde',replace=True)
# Find kde of sample
smpl_kde = stats.gaussian_kde(smpl.mass.to_list())
smpl_kde.set_bandwidth(bw_method=0.2)
# Plot
bins = np.linspace(0, 16, 15)
fig = plt.figure()
sns.histplot(ds2.mass.to_list(),stat='density',bins=bins,color='blue',label='ds2',alpha=0.5)
sns.histplot(ds1.mass.to_list(),stat='density',bins=bins,color='orange',label='ds1 (goal distribution)',alpha=0.5)
sns.histplot(smpl.mass.to_list(),stat='density',bins=bins,color='green',label='sample from ds2',alpha=0.5)
sns.lineplot(x=np.linspace(0,15,100),y=ds1_kde(np.linspace(0,15,100)),color='orange',label='ds1_kde')
sns.lineplot(x=np.linspace(0,15,100),y=smpl_kde(np.linspace(0,15,100)),color='green',label='sample_kde')
plt.xlabel("mass")
plt.title("Success when mass distribution in ds2 is uniform")
plt.legend()
plt.show()
</code></pre>

<p><a href="https://i.sstatic.net/I8IVszWk.png" rel="nofollow noreferrer"><img src="https://i.sstatic.net/I8IVszWk.png" alt="Successful example plot." /></a></p>

<p>However, when mass is skewed in ds2, the approach fails:</p>

<pre><code># Reuses ds1 and ds1_kde from the first example
# Dataset 2 with a non-uniform distribution
ds2 = pd.DataFrame({'mass':[0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,2,2,2,2,3,3,3,4,4,5,6,7,8,9,10,11,12,13,14,15]})
# Add dataset 1 kde to dataset 2 to use as weights when sampling
ds2['ds1_kde'] = ds2.mass.apply(lambda x: ds1_kde(x)[0])
smpl = ds2.sample(200, weights='ds1_kde',replace=True)
# Find kde of sample
smpl_kde = stats.gaussian_kde(smpl.mass.to_list())
smpl_kde.set_bandwidth(bw_method=0.2)
# Plot
bins = np.linspace(0, 16, 15)
fig = plt.figure()
sns.histplot(ds2.mass.to_list(),stat='density',bins=bins,color='blue',label='ds2',alpha=0.5)
sns.histplot(ds1.mass.to_list(),stat='density',bins=bins,color='orange',label='ds1 (goal distribution)',alpha=0.5)
sns.histplot(smpl.mass.to_list(),stat='density',bins=bins,color='green',label='sample from ds2',alpha=0.5)
sns.lineplot(x=np.linspace(0,15,100),y=ds1_kde(np.linspace(0,15,100)),color='orange',label='ds1_kde')
sns.lineplot(x=np.linspace(0,15,100),y=smpl_kde(np.linspace(0,15,100)),color='green',label='sample_kde')
plt.xlabel("mass")
plt.title("Failure when mass distribution in ds2 is skewed")
plt.legend()
plt.show()
</code></pre>

<p><a href="https://i.sstatic.net/19SLIBu3.png" rel="nofollow noreferrer"><img src="https://i.sstatic.net/19SLIBu3.png" alt="Failed example plot." /></a></p>
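
<p>A possible way to look at the failure, offered as a sketch rather than a definitive fix: with row weights, the chance of drawing a given mass value is proportional to (number of ds2 rows with that mass) × (weight), so the ds2 distribution is multiplied into the result. The block below compares the weights from the question with density-ratio weights, i.e. the ds1 KDE divided by a KDE of ds2 itself. The names <code>ds2_kde</code>, <code>w_naive</code>, <code>w_ratio</code>, <code>draw_probs</code> and <code>tv_distance</code> are illustrative and not part of the original code, and reusing the bandwidth factor 0.2 for ds2 is an assumption.</p>

```python
import numpy as np
import pandas as pd
from scipy import stats

# Same toy data as in the question.
ds1 = pd.DataFrame({'mass': [0,1,1,2,2,2,2,3,3,3,4,4,5,6,7,7,8,8,8,8,9,9,9,9,9,9,9,
                             10,10,10,10,10,10,11,11,11,12,13,14,15]})
ds2 = pd.DataFrame({'mass': [0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,2,2,2,2,3,3,3,4,4,
                             5,6,7,8,9,10,11,12,13,14,15]})

# Target density (ds1) and, as an added assumption, a KDE of ds2 itself.
ds1_kde = stats.gaussian_kde(ds1.mass.to_numpy(dtype=float), bw_method=0.2)
ds2_kde = stats.gaussian_kde(ds2.mass.to_numpy(dtype=float), bw_method=0.2)

mass = ds2.mass.to_numpy(dtype=float)
ds2['w_naive'] = ds1_kde(mass)                   # weights used in the question
ds2['w_ratio'] = ds1_kde(mass) / ds2_kde(mass)   # hypothetical density-ratio weights


def draw_probs(df, weight_col):
    """Probability that a single weighted draw returns each mass value."""
    totals = df.groupby('mass')[weight_col].sum()
    return totals / totals.sum()


def tv_distance(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * (p - q).abs().sum()


# Reference: ds1 density at the mass values present in ds2, normalised to sum to 1.
values = np.sort(ds2.mass.unique())
target = pd.Series(ds1_kde(values.astype(float)), index=values)
target /= target.sum()

naive_probs = draw_probs(ds2, 'w_naive')
ratio_probs = draw_probs(ds2, 'w_ratio')
tv_naive = tv_distance(naive_probs, target)
tv_ratio = tv_distance(ratio_probs, target)
print(f"TV distance to target, ds1 KDE weights:   {tv_naive:.3f}")
print(f"TV distance to target, density-ratio weights: {tv_ratio:.3f}")

# Drawing the actual sample with the ratio weights.
smpl = ds2.sample(200, weights='w_ratio', replace=True, random_state=0)
```

<p>The ratio is only as good as the two density estimates: near the ends of the mass range the KDE of ds2 is biased low, so the edge values can still be somewhat over-sampled, and values that are rare in ds2 will be drawn repeatedly when sampling with replacement.</p>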