Unable to reproduce PyTorch model training performance

  • Thread starter: Matthias (Guest)
I have trained a RegNet model on a custom dataset for an image classification task. That was in August 2023. Now I want to train exactly the same model again, using the same dataset. I would expect this new model to achieve about the same performance as the previous one from August 2023, since nothing has changed:

  • I use exactly the same PyTorch and Torchvision versions (1.13 and 0.14)
  • I use exactly the same image dataset for training/validation/test
  • I use exactly the same script to train the model via torch
  • And I use exactly the same training hyperparams as before
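
As a sanity check against silent environment drift, a minimal sketch like the following (not part of the original training script) prints the runtime versions that can affect training numerics; the NVIDIA driver version itself would come from `nvidia-smi`:

```python
import numpy as np
import torch
import torchvision

# Library versions pinned by pip/conda.
print("torch      :", torch.__version__)
print("torchvision:", torchvision.__version__)
print("numpy      :", np.__version__)

# CUDA/cuDNN versions the installed torch build ships with. Note that
# the system NVIDIA driver is separate from these and can be updated
# independently, which is one way an "identical" setup can drift.
print("CUDA build :", torch.version.cuda)
print("cuDNN      :", torch.backends.cudnn.version())
if torch.cuda.is_available():
    print("GPU        :", torch.cuda.get_device_name(0))
```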

However, even though nothing has changed, the newly trained model performs significantly worse than the original model from last year. Where the first model from August 2023 achieves a test accuracy of 0.97, the new model only achieves 0.94 on the very same test dataset. During training, though, the train and validation accuracies are about the same as before.

I understand that two models will not achieve exactly the same performance, but a 3% difference seems too much. Whatever I do, I cannot get close to that 0.97 test accuracy from last year; about 0.94 is all I get, even though everything is exactly the same, as described. Even the machine with its four GPUs and the Ubuntu version running on it are exactly the same as in 2023.

I know there is a random seed involved, but I doubt that could lead to such a large test accuracy difference of 3%. I also know that the NVIDIA/CUDA driver may have been updated on that machine, along with some dependencies and packages (e.g. numpy). But can that lead to such a huge difference?
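
For reference on the seed question, the standard recipe for pinning down run-to-run randomness in PyTorch looks roughly like this (a sketch following the official reproducibility notes; the commented-out DataLoader wiring is illustrative):

```python
import os
import random
import numpy as np
import torch

def seed_everything(seed: int = 42) -> None:
    # Seed every RNG that can influence training.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)           # seeds CPU and all CUDA devices
    torch.cuda.manual_seed_all(seed)  # explicit, for older versions

# Force deterministic kernels where cuBLAS/cuDNN offer a choice.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"  # required by cuBLAS for determinism
torch.backends.cudnn.benchmark = False   # autotuner picks the fastest, not a fixed, kernel
torch.use_deterministic_algorithms(True) # raise an error on nondeterministic ops

# DataLoader workers need their own seeding, plus a fixed shuffle generator.
def worker_init_fn(worker_id: int) -> None:
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

g = torch.Generator()
g.manual_seed(42)
# loader = DataLoader(dataset, shuffle=True, num_workers=4,
#                     worker_init_fn=worker_init_fn, generator=g)
```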