Plot twist & update:
We managed to swap the GPUs and, for now, it appears that the problem is
the power supply. The TITAN V is now in a computer with a better (80 PLUS
Bronze) power supply and has been running for the last 24 hours without any
problem. In contrast, the RTX2080, which is now in the former TITAN V
computer, crashed after running for 12 hours. Remarkably, no cudaMemcpy
error appeared in the output, but it seemed that the hard drive couldn't be
accessed at all. We are going to run some additional tests to confirm this,
but it is clear that the problem is not in the code, the GPU, or the
simulated system.
Thank you all for all the help!
Regards
Gera!
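
A minimal sketch of how one could check whether a crashed segment logged one of
the cudaMemcpy errors quoted in this thread (as on the TITAN V) or stopped
without any such message (as with the RTX2080 above). It assumes, as an
illustration only, that the program output of each segment was saved as plain
text; the function name count_cuda_errors and the sample lines are hypothetical
and are not part of Amber.

```python
import re
from collections import Counter

# Matches messages of the form quoted in this thread, for example:
# "cudaMemcpy GpuBuffer::Download failed unspecified launch failure"
ERROR_RE = re.compile(r"cudaMemcpy GpuBuffer::(Upload|Download) failed (.+)")


def count_cuda_errors(lines):
    """Count cudaMemcpy failures by (direction, reason) in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = ERROR_RE.search(line)
        if match:
            direction = match.group(1)
            reason = match.group(2).strip()
            counts[(direction, reason)] += 1
    return counts


if __name__ == "__main__":
    # Illustrative input only: the three error strings reported in this thread.
    sample = [
        "cudaMemcpy GpuBuffer::Upload failed unspecified launch failure",
        "cudaMemcpy GpuBuffer::Download failed unspecified launch failure",
        "cudaMemcpy GpuBuffer::Download failed an illegal instruction was encountered",
    ]
    for (direction, reason), n in sorted(count_cuda_errors(sample).items()):
        print(f"{direction}: {reason} (x{n})")
```

An empty result for a segment that still died would point away from the CUDA
error path and toward something else on the machine, such as the power supply
or disk, which is the pattern described above.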
On Mon, Jul 19, 2021 at 12:13, Gerardo Zerbetto De Palma (<
g.zerbetto.gmail.com>) wrote:
> We tried older simulations that we had run on the GTX1080 and RTX2080 and
> they all show the same problem. Another remarkable thing is that we had run
> similar simulations on the TITAN V without any trouble, and this type of
> crash started to appear more often. Additionally, when the sim crashes, the
> whole computer crashes too and we have to force a reboot. That is why we
> will try to swap the GPUs.
> Thanks a lot, Carlos!
> Regards
>
> On Mon, Jul 19, 2021 at 12:04, Carlos Simmerling (<
> carlos.simmerling.gmail.com>) wrote:
>
>> That's interesting; it sounds like the problem is the GPU and not the
>> system setup itself. Do other simulations of similar size work OK on the
>> Titan V?
>>
>> On Mon, Jul 19, 2021 at 11:00 AM Gerardo Zerbetto De Palma <
>> g.zerbetto.gmail.com> wrote:
>>
>> > Hi Carlos. Thanks for the help. The system we are trying to simulate is
>> > an NPT membrane-embedded protein tetramer. We are just running a plain MD
>> > sim, so no additional parameters are set. The initial coordinates were
>> > obtained from a previous simulation that we had run with the same
>> > parameters, so those coordinates are from a very well thermalized system.
>> > We just made a single point mutation and ran a minimization to relax any
>> > clashes. It is quite remarkable that the same system is being simulated
>> > on an RTX2080 (in another computer) without any trouble, but when we run
>> > it on the TITAN V it randomly crashes. Now we will try swapping the GPUs
>> > to rule out a problem with any other computer component.
>> > Thanks a lot, again.
>> > Regards
>> >
>> > The original post was this one:
>> > *Hi everyone.*
>> > *We were trying to run some simulations of a membrane protein on an
>> > NVIDIA TITAN V and got stuck on some cudaMemcpy errors that came in
>> > different flavors:*
>> >
>> > *cudaMemcpy GpuBuffer::Upload failed unspecified launch failure*
>> > *cudaMemcpy GpuBuffer::Download failed unspecified launch failure*
>> > *cudaMemcpy GpuBuffer::Download failed an illegal instruction was
>> > encountered*
>> >
>> > *First we ran the sim using Amber 18, restarting it every 5 nanoseconds
>> > to get consecutive 5 ns trajectories. After simulating 25 nanoseconds,
>> > the program stopped randomly. Then we tried to repeat the simulation
>> > that had failed (using the same random seed and initial coordinates) and
>> > the simulation succeeded, but the same error came up in a subsequent
>> > simulation. These errors kept coming at a random timestep when we
>> > restarted the simulations. Energies in the output seemed to be OK and
>> > simulations sometimes proceeded without errors when restarted. Hoping
>> > that this was a bug, we compiled Amber 20 and ran the same simulations
>> > and had the same random cudaMemcpy errors. Just to check if the
>> > simulated system was fine, we are also running it on an RTX2080 with
>> > Amber 18 without problems, so far.*
>> >
>> > *We are running out of ideas, so we are reaching out to the community
>> > for some help in this matter. We will appreciate every idea or question
>> > that can help us solve this puzzle.*
>> >
>> > On Mon, Jul 19, 2021 at 11:08, Carlos Simmerling (<
>> > carlos.simmerling.gmail.com>) wrote:
>> >
>> > > For ff19SB problems, make sure your Amber version is completely
>> > > updated with current patches. There was a fix a while back that
>> > > corrected an error that could lead to failures with some force fields
>> > > including ff19SB. Information on applying patches is found here:
>> > > http://ambermd.org/AmberPatches.php
>> > >
>> > > For the problem with ff14SB, I did not see the original post. More
>> > > details would be helpful, especially about the system you are
>> > > simulating (is it only protein, or more? Did you use any other
>> > > parameters except ff14SB? Where were the initial coordinates
>> > > obtained?).
>> > >
>> > > On Mon, Jul 19, 2021 at 9:46 AM Gerardo Zerbetto De Palma <
>> > > g.zerbetto.gmail.com> wrote:
>> > >
>> > > > Hi, we are using the ff14SB force field and the errors still appear.
>> > > > Thanks for the help!
>> > > > Regards
>> > > >
>> > > > Gerardo Zerbetto
>> > > >
>> > > > On Fri, Jul 16, 2021 at 18:29, Rafał Madaj (<rmadaj.cbmm.lodz.pl>)
>> > > > wrote:
>> > > >
>> > > > > Hi,
>> > > > >
>> > > > > Which force field are you using? I had exactly the same problem
>> > > > > with ff19SB. After changing to ff14SB the problem disappeared.
>> > > > >
>> > > > > Regards,
>> > > > >
>> > > > > Rafal
>> > > > >
>> > > > > On 16.07.2021 18:01, Gerardo Zerbetto De Palma wrote:
>> > > > > > Hi everyone.
>> > > > > > We were trying to run some simulations of a membrane protein on
>> > > > > > an NVIDIA TITAN V and got stuck on some cudaMemcpy errors that
>> > > > > > came in different flavors:
>> > > > > >
>> > > > > > cudaMemcpy GpuBuffer::Upload failed unspecified launch failure
>> > > > > > cudaMemcpy GpuBuffer::Download failed unspecified launch failure
>> > > > > > cudaMemcpy GpuBuffer::Download failed an illegal instruction was
>> > > > > > encountered
>> > > > > >
>> > > > > > First we ran the sim using Amber 18, restarting it every 5
>> > > > > > nanoseconds to get consecutive 5 ns trajectories. After
>> > > > > > simulating 25 nanoseconds, the program stopped randomly. Then we
>> > > > > > tried to repeat the simulation that had failed (using the same
>> > > > > > random seed and initial coordinates) and the simulation
>> > > > > > succeeded, but the same error came up in a subsequent
>> > > > > > simulation. These errors kept coming at a random timestep when
>> > > > > > we restarted the simulations. Energies in the output seemed to
>> > > > > > be OK and simulations sometimes proceeded without errors when
>> > > > > > restarted. Hoping that this was a bug, we compiled Amber 20 and
>> > > > > > ran the same simulations and had the same random cudaMemcpy
>> > > > > > errors. Just to check if the simulated system was fine, we are
>> > > > > > also running it on an RTX2080 with Amber 18 without problems, so
>> > > > > > far.
>> > > > > >
>> > > > > > We are running out of ideas, so we are reaching out to the
>> > > > > > community for some help in this matter. We will appreciate every
>> > > > > > idea or question that can help us solve this puzzle.
>> > > > > >
>> > > > > > Regards!
>> > > > > > Gerardo Zerbetto De Palma
>> > > > > >
>> > > > > > _______________________________________________
>> > > > > > AMBER mailing list
>> > > > > > AMBER.ambermd.org
>> > > > > > http://lists.ambermd.org/mailman/listinfo/amber
Received on Tue Jul 20 2021 - 09:30:03 PDT