Encoder-decoder transformer: good training performance, poor autoregressive performance

  • Thread starter: Michał Kalbarczyk (Guest)
I am working on a full encoder-decoder transformer model to synthesize speech from EEG signals. Specifically, from a window of EEG activity of length x=100, I predict a mel-spectrogram window of the same length x=100. The EEG and mel spectrograms are aligned in time, with total dataset dimensions (43265, 107) for EEG and (43264, 80) for mel spectrograms.

I divided the dataset into training and testing sets with an 80/20 split. This results in 6902 training sequences, each with dimensions (100, 107) for EEG and (100, 80) for mel spectrograms.
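
For concreteness, here is a minimal sketch of how such windowing and splitting could be done. The stride is an assumption on my part (a stride of roughly 5 samples would reproduce about 6902 training windows), and the array names eeg_all and mel_all are placeholders:

Code:
import numpy as np

def make_windows(eeg, mel, win_len=100, stride=5):
    # Slice the time-aligned EEG (T, 107) and mel (T, 80) arrays into
    # overlapping windows of length win_len (the stride is an assumed value).
    n = min(len(eeg), len(mel))
    starts = range(0, n - win_len + 1, stride)
    eeg_w = np.stack([eeg[s:s + win_len] for s in starts])  # (N, 100, 107)
    mel_w = np.stack([mel[s:s + win_len] for s in starts])  # (N, 100, 80)
    return eeg_w, mel_w

eeg_w, mel_w = make_windows(eeg_all, mel_all)

# 80/20 split along the window axis
split = int(0.8 * len(eeg_w))
eeg_train, eeg_val = eeg_w[:split], eeg_w[split:]
mel_train, mel_val = mel_w[:split], mel_w[split:]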

My model architecture includes:

  • Two prenets (one for the encoder and one for the decoder) to extract features from the EEG and mel spectrograms, projecting them into embeddings.
  • A postnet to refine the predicted mel spectrograms.

Model overview: https://i.sstatic.net/8uK2ltTK.png
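
To illustrate the prenet idea (following the Transformer-TTS style of two linear layers with ReLU and dropout), here is a rough sketch; the layer sizes and names are assumptions, not my exact modules:

Code:
import torch.nn as nn

class Prenet(nn.Module):
    # Projects raw frames (107 EEG channels or 80 mel bins) into d_model-dim embeddings.
    def __init__(self, in_dim, d_model=256, hidden=256, dropout=0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(hidden, d_model), nn.ReLU(), nn.Dropout(dropout),
        )

    def forward(self, x):  # x: (batch, seq_len, in_dim)
        return self.net(x)

# encoder_prenet = Prenet(in_dim=107)  # EEG
# decoder_prenet = Prenet(in_dim=80)   # mel spectrograms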

The issue I'm facing is that while the training loss decreases, the model performs poorly during inference. The predictions on the validation set are very poor, and the model also underperforms on the training set during inference.

During inference I generate the predictions as follows:

Code:
eeg_val = eeg_val.to(device)
mel_val = mel_val.to(device)

# Start autoregressive decoding from a single all-zero mel frame ("go" frame)
mel_input = torch.zeros([modelArgs.batch_size, 1, 80]).to(device)
pos_eeg = torch.arange(1, eeg_context_length + 1).repeat(modelArgs.batch_size, 1).to(device)

pbar = tqdm(range(config["TR"]["context_length"]), desc="Validating...", position=0, leave=False)
with torch.no_grad():
    for _ in pbar:
        # Positional indices grow with the number of frames generated so far
        pos_mel = torch.arange(1, mel_input.size(1) + 1).repeat(modelArgs.batch_size, 1).to(device)
        mel_out, postnet_pred, attn, _, attn_dec = model.forward(eeg_val, mel_input, pos_eeg, pos_mel)
        # Append the newest predicted frame and feed the whole sequence back in
        mel_input = torch.cat([mel_input, mel_out[:, -1:, :]], dim=1)

batch_loss = criterion(postnet_pred, mel_val)

where:

  • config["TR"]["context_length"] is the length of the window, i.e. 100
  • pos_eeg and pos_mel are used to create masks for attention
  • mel_out is the output of the decoder, postnet_pred is the output of the postnet

Training history: https://i.sstatic.net/GsgWVe5Q.png

The loss is calculated with nn.L1Loss() on the output of the decoder and the output of the postnet: batch_loss = mel_loss + post_mel_loss
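
A minimal sketch of how that combined loss could be computed in the training step, assuming teacher forcing as in Transformer-TTS (the variable names here are placeholders, not my exact training code):

Code:
import torch.nn as nn

criterion = nn.L1Loss()

# Teacher-forced step (sketch): the decoder sees ground-truth mel frames as input,
# and both the decoder output and the postnet output are compared to the target.
mel_out, postnet_pred, attn, _, attn_dec = model(eeg_batch, mel_decoder_input, pos_eeg, pos_mel)
mel_loss = criterion(mel_out, mel_target)
post_mel_loss = criterion(postnet_pred, mel_target)
batch_loss = mel_loss + post_mel_loss
batch_loss.backward()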

Some of the predictions that the model makes:

Model prediction 1: https://i.sstatic.net/JQbNH82C.png

Model prediction 2: https://i.sstatic.net/Hul76jOy.png

My model is based on Neural Speech Synthesis with Transformer Network (https://arxiv.org/abs/1809.08895) and I use this implementation: https://github.com/soobinseo/Transformer-TTS

The only differences between my setup and the Transformer-TTS one are:

  • I use EEG instead of text
  • I don't have a stop token as I predict for a fixed time window.
  • I create positional embeddings using the following class instead of the nn.Embedding module:

Code:
import numpy as np
import torch as t
import torch.nn as nn

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, dropout=0.1, max_len=200):
        super(PositionalEncoding, self).__init__()

        self.dropout = nn.Dropout(p=dropout)
        # Learnable scale for the positional encoding, as in Transformer-TTS
        self.alpha = nn.Parameter(t.ones(1))
        # Standard sinusoidal encoding table of shape (max_len, d_model)
        pe = t.zeros(max_len, d_model)
        position = t.arange(0, max_len, dtype=t.float).unsqueeze(1)
        div_term = t.exp(t.arange(0, d_model, 2).float() * (-np.log(10000.0) / d_model))
        pe[:, 0::2] = t.sin(position * div_term)
        pe[:, 1::2] = t.cos(position * div_term)
        self.register_buffer('pe', pe)

    def forward(self, x):
        # Take the first seq_len rows of the table and broadcast over the batch
        pos = self.pe[:x.shape[1]]
        pos = t.stack([pos] * x.shape[0], 0)  # [bs x seq_len(x) x d_model]
        x = pos * self.alpha + x
        return self.dropout(x)

I also use the NoamOpt learning-rate scheduler from this tutorial: https://nlp.seas.harvard.edu/2018/04/03/attention.html
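
For context, the NoamOpt schedule from that tutorial warms the learning rate up linearly and then decays it with the inverse square root of the step; a minimal sketch of the rate rule (the factor and warmup values here are placeholders, not my actual settings):

Code:
def noam_rate(step, d_model, factor=1.0, warmup=4000):
    # Linear warm-up for the first `warmup` steps, then ~1/sqrt(step) decay
    step = max(step, 1)
    return factor * d_model ** (-0.5) * min(step ** (-0.5), step * warmup ** (-1.5))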

Question: What surprises me the most is that despite a rapidly decreasing loss and very good correlation scores (between prediction and ground truth) during training, the network performs very poorly on those same training sequences during inference. What could be the reason for the model's poor autoregressive performance?

I tried different variations of the architecture and I right-shifted the inputs to the decoder (as is done in the original transformer paper). The architecture changes didn't really result in better predictions, but shifting the decoder inputs by one slightly improved them (from around 0% correlation to less than 20%), as sketched below.
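
For reference, the right-shifting applied during training looks roughly like this (variable names are placeholders): a zero "go" frame is prepended and the last ground-truth frame is dropped, so the decoder predicts frame t from frames before t.

Code:
# mel_target: (batch, 100, 80) ground-truth mel frames
go_frame = torch.zeros(mel_target.size(0), 1, 80, device=mel_target.device)
mel_decoder_input = torch.cat([go_frame, mel_target[:, :-1, :]], dim=1)  # still (batch, 100, 80)

# The decoder is then trained to predict mel_target from mel_decoder_input,
# together with a causal (look-ahead) mask on the decoder self-attention.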

Any insights or suggestions?