Re: [AMBER] cudaMemcpy GpuBuffer ERROR

From: Ross Walker <ross.rosswalker.co.uk>
Date: Tue, 20 Jul 2021 19:09:35 -0400

Hi Gera,

An underpowered power supply is often the culprit, but another thing that can cause the behavior you see is a bad PCI-E riser, either on the motherboard or as an adapter that provides multiple PCI-E slots. You may therefore want to try a different PCI-E slot if you can, although I would rule out the power supply first.

All the best
Ross

> On Jul 20, 2021, at 12:15, Gerardo Zerbetto De Palma <g.zerbetto.gmail.com> wrote:
>
> Plot twist & update:
> We managed to perform the GPU swap and, for now, it appears that the
> problem is the power supply. The TITAN V GPU is now in a computer with
> a better (80+ Bronze) power supply and has been running for the last 24
> hours without any problem. In contrast, the RTX2080 GPU, which is
> now in the former TITAN V computer, crashed after running for 12 hours.
> Notably, no cudaMemcpy error appeared in the output, but it seemed that
> the hard drive couldn't be accessed at all. We are going to run some
> additional tests to confirm this, but it is clear that the problem is
> not in the code, the GPU, or the simulated system.
> Thank you all for all the help!
> Regards
> Gera!
>
> On Mon, Jul 19, 2021 at 12:13, Gerardo Zerbetto De Palma (<g.zerbetto.gmail.com>) wrote:
>
>> We tried older simulations that we had previously run on a GTX1080 and an
>> RTX2080, and they all show the same problem. Another remarkable thing is
>> that we had run similar simulations on the TITAN V without any trouble, and
>> this type of crash then started to appear more often. Additionally, when the
>> sim crashes, the whole computer crashes too and we have to force a reboot.
>> That is why we will try to swap the GPUs.
>> Thanks a lot, Carlos!
>> Regards
>>
>> On Mon, Jul 19, 2021 at 12:04, Carlos Simmerling (<carlos.simmerling.gmail.com>) wrote:
>>
>>> That's interesting; it sounds like it is the GPU and not the system
>>> setup itself. Do other simulations of similar size work OK on the Titan V?
>>>
>>> On Mon, Jul 19, 2021 at 11:00 AM Gerardo Zerbetto De Palma <
>>> g.zerbetto.gmail.com> wrote:
>>>
>>>> Hi Carlos. Thanks for the help. The system we are trying to simulate is a
>>>> membrane-embedded protein tetramer in the NPT ensemble. We are just running
>>>> a plain MD sim, so no additional parameters are set. The initial coordinates
>>>> were obtained from a previous simulation that we had run with the same
>>>> parameters, so those coordinates are from a very well thermalized system.
>>>> We just made a single point mutation and ran a minimization to relax any
>>>> clashes. It is quite remarkable that the same system is being simulated on
>>>> an RTX2080 (in another computer) without any trouble, but when we run it on
>>>> the TITAN V it randomly crashes. Now we will try swapping the GPUs to rule
>>>> out a problem with any other computer component.
>>>> Thanks a lot, again.
>>>> Regards
>>>>
>>>> The original post was this one:
>>>> *Hi everyone.*
>>>> *We were trying to run some simulations of a membrane protein on an NVIDIA
>>>> TITAN V and got stuck on some cudaMemcpy errors that came in different
>>>> flavors:*
>>>>
>>>> *cudaMemcpy GpuBuffer::Upload failed unspecified launch failure*
>>>> *cudaMemcpy GpuBuffer::Download failed unspecified launch failure*
>>>> *cudaMemcpy GpuBuffer::Download failed an illegal instruction was encountered*
>>>>
>>>> *We first ran the sim using Amber 18, restarting it every 5 nanoseconds to
>>>> get consecutive 5 ns trajectories. After simulating 25 nanoseconds, the
>>>> program stopped randomly. Then we tried to repeat the simulation that had
>>>> failed (using the same random seed and initial coordinates) and the
>>>> simulation succeeded, but the same error came up in a subsequent
>>>> simulation. These errors kept coming at a random timestep when we
>>>> restarted the simulations. Energies in the output seemed to be OK and
>>>> simulations sometimes proceeded without errors when restarted. Hoping that
>>>> this was a bug, we compiled Amber 20 and ran the same simulations and had
>>>> the same random cudaMemcpy errors. Just to check if the simulated system
>>>> was fine, we are also running it on an RTX2080 with Amber 18 without
>>>> problems, so far.*
>>>>
>>>> *We are running out of ideas, so we are reaching out to the community for
>>>> some help in this matter. We will appreciate every idea or question that
>>>> can help us solve this puzzle.*
>>>>
>>>> On Mon, Jul 19, 2021 at 11:08, Carlos Simmerling (<carlos.simmerling.gmail.com>) wrote:
>>>>
>>>>> For ff19SB problems, make sure your Amber version is completely updated
>>>>> with current patches. There was a fix a while back that corrected an error
>>>>> that could lead to failures with some force fields including ff19SB.
>>>>> Information on applying patches is found here:
>>>>> http://ambermd.org/AmberPatches.php
>>>>>
>>>>> For the problem with ff14SB, I did not see the original post. More details
>>>>> would be helpful, especially about the system you are simulating (is it
>>>>> only protein, or more? Did you use any other parameters except ff14SB?
>>>>> Where were the initial coordinates obtained?).
>>>>>
>>>>> On Mon, Jul 19, 2021 at 9:46 AM Gerardo Zerbetto De Palma <
>>>>> g.zerbetto.gmail.com> wrote:
>>>>>
>>>>>> Hi, we are using the ff14SB force field and the errors still appear.
>>>>>> Thanks for the help!
>>>>>> Regards
>>>>>>
>>>>>> Gerardo Zerbetto
>>>>>>
>>>>>> On Fri, Jul 16, 2021 at 18:29, Rafał Madaj (<rmadaj.cbmm.lodz.pl>) wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Which force field are you using? I had exactly the same problem with
>>>>>>> ff19SB. After switching to ff14SB the problem disappeared.
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Rafal
>>>>>>>
>>>>>>> On 16.07.2021 18:01, Gerardo Zerbetto De Palma wrote:
>>>>>>>> Hi everyone.
>>>>>>>> We were trying to run some simulations of a membrane protein on an
>>>>>>>> NVIDIA TITAN V and got stuck on some cudaMemcpy errors that came in
>>>>>>>> different flavors:
>>>>>>>>
>>>>>>>> cudaMemcpy GpuBuffer::Upload failed unspecified launch failure
>>>>>>>> cudaMemcpy GpuBuffer::Download failed unspecified launch failure
>>>>>>>> cudaMemcpy GpuBuffer::Download failed an illegal instruction was encountered
>>>>>>>>
>>>>>>>> We first ran the sim using Amber 18, restarting it every 5 nanoseconds
>>>>>>>> to get consecutive 5 ns trajectories. After simulating 25 nanoseconds,
>>>>>>>> the program stopped randomly. Then we tried to repeat the simulation
>>>>>>>> that had failed (using the same random seed and initial coordinates)
>>>>>>>> and the simulation succeeded, but the same error came up in a
>>>>>>>> subsequent simulation. These errors kept coming at a random timestep
>>>>>>>> when we restarted the simulations. Energies in the output seemed to be
>>>>>>>> OK and simulations sometimes proceeded without errors when restarted.
>>>>>>>> Hoping that this was a bug, we compiled Amber 20 and ran the same
>>>>>>>> simulations and had the same random cudaMemcpy errors. Just to check
>>>>>>>> if the simulated system was fine, we are also running it on an RTX2080
>>>>>>>> with Amber 18 without problems, so far.
>>>>>>>>
>>>>>>>> We are running out of ideas, so we are reaching out to the community
>>>>>>>> for some help in this matter. We will appreciate every idea or
>>>>>>>> question that can help us solve this puzzle.
>>>>>>>>
>>>>>>>> Regards!
>>>>>>>> Gerardo Zerbetto De Palma
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> AMBER mailing list
>>>>>>>> AMBER.ambermd.org
>>>>>>>> http://lists.ambermd.org/mailman/listinfo/amber


Received on Tue Jul 20 2021 - 16:30:02 PDT