How Can I Optimize Machine Translation Model Training to Overcome GPU Memory Overflow Issues?

Thread starter: dsb (Guest)

I'm trying to train a fairly standard machine translation transformer model using PyTorch. It's based on the "Attention is All You Need" paper. When I ran it on my PC with standard hyperparameters and a batch size of 128 segments (pairs of source and target language sentences), it worked fine but was slow, as expected.

Now, I'm running it on an AWS p2.xlarge instance with a Tesla K80 GPU, and the program crashes quickly due to GPU memory overflow. I've tried everything to free up GPU memory, but I've had to reduce the batch size to 8, which is obviously inefficient for learning.

Even with a batch size of 8, I occasionally get this error message:

File "C:\Projects\MT004.venv\Lib\site-packages\torch\autograd\graph.py", line 744, in _engine_run_backward return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.95 GiB. GPU

I've tried both spaCy's tokenizer and the XLM-R tokenizer. With the XLM-R tokenizer, I can only use a batch size of 2, and even then, it sometimes crashes.
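Since subword tokenization tends to produce longer sequences than word-level tokenization, and attention memory grows roughly with the square of sequence length, I suspect a few very long sentence pairs might be part of the problem. Below is a sketch of the kind of length cap I could apply before batching; MAX_LEN and tokenize_pair are placeholders, not names from my actual pipeline.

Code:
MAX_LEN = 128  # placeholder cap, not a value from my current pipeline

def filter_by_length(src_sents, tgt_sents, tokenize_pair, max_len=MAX_LEN):
    # tokenize_pair is a hypothetical helper returning (src_ids, tgt_ids)
    kept_src, kept_tgt = [], []
    for s, t in zip(src_sents, tgt_sents):
        src_ids, tgt_ids = tokenize_pair(s, t)
        # Keep only pairs whose tokenized length stays under the cap
        if len(src_ids) <= max_len and len(tgt_ids) <= max_len:
            kept_src.append(s)
            kept_tgt.append(t)
    return kept_src, kept_tgt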

Here is the code where things crash:

Code:
def train_epoch(src_train_sent, tgt_train_sent, model, optimizer):
    model.train()
    losses = 0

    torch.cuda.empty_cache()  # Clear cache before forward pass

    train_dataloader = SrcTgtIterable(src_train_sent, tgt_train_sent, batch_size=BATCH_SIZE, collate_fn=collate_fn)

    for src, tgt in train_dataloader:
        src = src.to(DEVICE)
        tgt = tgt.to(DEVICE)

        tgt_input = tgt[:-1, :]

        src_mask, tgt_mask, src_padding_mask, tgt_padding_mask = create_mask(src, tgt_input)

        logits = model(src, tgt_input, src_mask, tgt_mask, src_padding_mask, tgt_padding_mask, src_padding_mask)

        optimizer.zero_grad()

        tgt_out = tgt[1:, :].long()
        loss = loss_fn(logits.reshape(-1, logits.shape[-1]), tgt_out.reshape(-1))

        # Delete unnecessary variables before backward pass
        del src, tgt_input, src_mask, tgt_mask, src_padding_mask, tgt_padding_mask, logits, tgt_out
        torch.cuda.empty_cache()  # Clear cache after deleting variables

        loss.backward()

        optimizer.step()
        losses += loss.item()

        # Free GPU memory
        del loss
        torch.cuda.empty_cache()  # Clear cache after each batch

The crash happens on loss.backward().
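If it helps with diagnosis, this is the kind of instrumentation I can drop in right before loss.backward() to see what the allocator is holding (a minimal sketch using the standard torch.cuda memory queries):

Code:
import torch

def log_cuda_memory(tag):
    # Report current, reserved, and peak allocator usage in MiB on the active CUDA device
    allocated = torch.cuda.memory_allocated() / 2**20
    reserved = torch.cuda.memory_reserved() / 2**20
    peak = torch.cuda.max_memory_allocated() / 2**20
    print(f"[{tag}] allocated={allocated:.0f} MiB, reserved={reserved:.0f} MiB, peak={peak:.0f} MiB")

# e.g. call log_cuda_memory("before backward") right above loss.backward()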

Unfortunately, I cannot use a bigger server since I don't have enough quota on EC2.

Any idea what I might be doing wrong? Any suggestions on how to optimize things?
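For completeness, the workaround I'm considering next is gradient accumulation, so that micro-batches of 8 still add up to the effective batch size of 128 I used on my PC. This is only a rough sketch, not code I'm running yet; it reuses the names (model, optimizer, loss_fn, create_mask, DEVICE, train_dataloader) from the loop above.

Code:
ACCUM_STEPS = 16  # 16 micro-batches of 8 approximate an effective batch size of 128

optimizer.zero_grad()
for i, (src, tgt) in enumerate(train_dataloader):
    src, tgt = src.to(DEVICE), tgt.to(DEVICE)
    tgt_input = tgt[:-1, :]
    src_mask, tgt_mask, src_padding_mask, tgt_padding_mask = create_mask(src, tgt_input)

    logits = model(src, tgt_input, src_mask, tgt_mask,
                   src_padding_mask, tgt_padding_mask, src_padding_mask)
    loss = loss_fn(logits.reshape(-1, logits.shape[-1]), tgt[1:, :].long().reshape(-1))

    # Scale the loss so the accumulated gradient matches one large-batch gradient
    (loss / ACCUM_STEPS).backward()

    # Step and reset gradients only once every ACCUM_STEPS micro-batches
    if (i + 1) % ACCUM_STEPS == 0:
        optimizer.step()
        optimizer.zero_grad()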