
Parameter tuning with Slurm, Optuna, PyTorch Lightning, and KFold

  • Thread starter: zhihao_li (Guest)
With the following toy script, I am trying to tune the learning-rate hyper-parameter of a perceptron with Optuna and 5-fold cross-validation. My cluster has multiple GPUs per node, so I am using Slurm as the workload manager and DDP for distributed training. However, the same script behaves differently when launched through Slurm than when run directly on a compute node.

Code:
import torch
import pytorch_lightning as pl
import numpy as np
from optuna import Trial, create_study
from sklearn.model_selection import KFold

class DummyModel(pl.LightningModule):
    """A single linear layer ("perceptron") trained with MSE loss."""

    def __init__(self, lr):
        super().__init__()
        self.linear = torch.nn.Linear(5000, 1)
        self.lr = lr

    def forward(self, x):
        return self.linear(x)

    def calculate_loss(self, batch, mode):
        x, y = batch
        y_hat = self(x)
        loss = torch.nn.functional.mse_loss(y_hat, y)
        # sync_dist=True averages the epoch-level metric across DDP ranks
        self.log(mode + '_loss', loss, sync_dist=True, on_step=False, on_epoch=True)
        return loss

    def training_step(self, batch, batch_idx):
        return self.calculate_loss(batch, mode='train')

    def validation_step(self, batch, batch_idx):
        self.calculate_loss(batch, mode='val')

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=self.lr)
        return optimizer

def objective(trial: Trial, x, y):
    # Optuna samples the learning rate on a log scale
    learning_rate = trial.suggest_float("learning_rate", 1e-5, 1e-1, log=True)
    loss = []
    for i, (trn_ids, val_ids) in enumerate(KFold(n_splits=5).split(X=x, y=y)):
        print(f"train and val ids are {trn_ids}, {val_ids}")
        trn_dt = torch.utils.data.DataLoader(
            torch.utils.data.TensorDataset(x[trn_ids], y[trn_ids]), batch_size=1000)
        val_dt = torch.utils.data.DataLoader(
            torch.utils.data.TensorDataset(x[val_ids], y[val_ids]), batch_size=1000)

        # A fresh model and Trainer per fold, using 2 GPUs with the DDP strategy
        model = DummyModel(lr=learning_rate)
        trainer = pl.Trainer(devices=2, accelerator="gpu", strategy="ddp",
                             max_epochs=10, log_every_n_steps=5)
        trainer.fit(model, trn_dt, val_dt)
        loss.append(trainer.callback_metrics["val_loss"].item())
    return np.mean(loss)

if __name__ == "__main__":
    print("===Script Start===")
    batch_size = 1000
    x = torch.randn(batch_size * 100, 5000)
    y = torch.randn(batch_size * 100, 1)
    study = create_study(direction="minimize")
    study.optimize(lambda trial: objective(trial, x, y), n_trials=5)

If launched through Slurm with srun --gres=gpu:2 python ./ttt.py:
"===Script Start===" is printed once, but the script uses only one GPU (verified with nvidia-smi).

If launched directly on a compute node, simply with python ./ttt.py:
"===Script Start===" is printed twice, and the script uses both GPUs.

What causes these different behaviors, and how can I launch the script through Slurm while still using both GPUs?
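
For reference, my understanding is that Lightning auto-detects the Slurm environment variables and then expects Slurm itself to start one task per GPU rather than spawning the DDP processes on its own. A minimal sketch of the kind of submission script I could try is below (the job name and time limit are placeholders); is requesting --ntasks-per-node=2 to match devices=2 the intended way to launch this?

Code:
#!/bin/bash
#SBATCH --job-name=ttt              # placeholder job name
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2         # one Slurm task per GPU, matching devices=2
#SBATCH --gres=gpu:2
#SBATCH --time=01:00:00             # placeholder time limit

# srun starts 2 tasks; Lightning reads SLURM_NTASKS / SLURM_PROCID to set up DDP
srun python ./ttt.py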
