Re: [AMBER] Cpu busy looping?

From: Jake Smith <amberonejake.aol.com>
Date: Tue, 20 Aug 2013 20:38:37 -0400 (EDT)

Hi Ross,


No, I couldn't find information on how to slow down PCIe... is there any
way to do that on Linux?
As for memory, I hadn't thought about it... I didn't expect that (CPU)
memory could be a bottleneck...


My thought was that the core might be put in a non-schedulable state
while waiting for something from the GPU, e.g. busy-looping in an
interrupt handler or the like. That would explain why CPU speed is
irrelevant and yet the core still can't be used for anything else. That
might not be an Amber problem; it could well be coming from CUDA.
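
Looking at the CUDA docs, the runtime does seem to expose a choice
between spinning and blocking while the host waits on the GPU. A minimal
sketch of the two modes (my own illustration, nothing taken from the
Amber source):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    // Must be called before the CUDA context is created.
    // cudaDeviceScheduleSpin: the host thread busy-waits on the GPU
    //   (one core at 100%, but the lowest launch/sync latency).
    // cudaDeviceScheduleBlockingSync: the host thread sleeps until the
    //   GPU finishes (frees the core, adds wake-up latency per sync).
    cudaError_t err = cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaSetDeviceFlags: %s\n",
                cudaGetErrorString(err));
        return 1;
    }
    // ... kernels launched after this point; cudaDeviceSynchronize()
    // will now block instead of spinning.
    return 0;
}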


Thank you
J






-----Original Message-----
From: Ross Walker <ross.rosswalker.co.uk>
To: AMBER Mailing List <amber.ambermd.org>
Sent: Wed, Aug 21, 2013 12:37 am
Subject: Re: [AMBER] Cpu busy looping?


Hi J,

Are you also halving the memory clock and PCI-E clock when you halve the
CPU clock?

You are also making the assumption that placing two threads on the same
CPU core is the same as running 1 thread at half the clock speed. This
is not true. There is substantial overhead in task switching that
complicates matters considerably. You are also dividing up the PCI-E
bandwidth and the memory bandwidth. There may be more to this as well,
and it might be possible to tweak things so it doesn't cause a slowdown
- most likely you would need to modify OS scheduler parameters to make
the effective time slice (the time between switching tasks on a core)
shorter than it currently is. By default it is probably longer than the
typical kernel run length, which leaves the GPU waiting for the CPU to
switch the right thread back in. This is mostly speculation - given that
(as far as I am aware) there isn't a single-core chip still in
production (even my cell phone has 4 cores!), I haven't seen the need to
investigate it too much.
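
If you want to experiment, one thing worth trying (my suggestion only,
not something AMBER does for you) is to pin the thread driving each GPU
to its own core so the scheduler never migrates it or time-slices it
against the other simulation; the granularity itself is also tunable on
CFS kernels via /proc/sys/kernel/sched_min_granularity_ns. A minimal
sketch of the pinning:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Pin the calling thread (e.g. the one driving a given GPU) to a
   single core so the scheduler never migrates it. */
static int pin_to_core(int core)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(core, &mask);
    return sched_setaffinity(0, sizeof(mask), &mask); /* 0 = this thread */
}

int main(void)
{
    if (pin_to_core(2) != 0) {  /* core index 2 chosen arbitrarily */
        perror("sched_setaffinity");
        return 1;
    }
    printf("pinned to core 2\n");
    return 0;
}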

Only the 'Tesla' cards support Dynamic Parallelism, so it has not been
worth our effort to develop code to take advantage of it, since doing so
would essentially lock out the GeForce cards. In addition, I don't see
any real advantage to dynamic parallelism - as far as I can tell it
really just allows one to be lazy in dividing up work, although it could
make the underlying code easier to read. For AMBER I predict that using
it would most likely just result in a performance drop due to the added
complexity and all the extra bookkeeping that goes with it.
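
For anyone unfamiliar, dynamic parallelism just means a kernel can
launch child kernels itself (compute capability 3.5, i.e. sm_35 hardware
like the K20). A toy sketch of the idea - purely illustrative, nothing
like this exists in the AMBER code:

#include <cstdio>

__global__ void child(int parent_block)
{
    // Launched from the device, not from the host.
    printf("child of block %d, thread %d\n", parent_block, threadIdx.x);
}

__global__ void parent(void)
{
    // With dynamic parallelism the device subdivides its own work
    // instead of the host launching every kernel in the sequence.
    child<<<1, 4>>>(blockIdx.x);
}

int main(void)
{
    // Build with: nvcc -arch=sm_35 -rdc=true dp.cu -lcudadevrt
    parent<<<2, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}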

I hope that helps.

All the best
Ross


On 8/20/13 3:19 PM, "Jake Smith" <amberonejake.aol.com> wrote:

>
>Hi Ross,
>thanks for your reply
>I understand what you say, but only partially...
>How come, if I slow down the CPU clock by a factor of 2, the speed of 1
>simulation does not change compared to normal, yet if I return the CPU
>to full speed and constrain one core to serve 2 simulations, the speed
>of each simulation halves? The number of CPU cycles dedicated to Amber
>is the same in the two cases...
>
>
>Also, to reply on another topic:
>You say that the CPU is used to launch the kernels... so Amber is not
>using the K20's Dynamic Parallelism yet?
>Intuitively speaking, such a feature might provide a tremendous speed
>boost.
>
>
>(None of this is meant to take anything away from the Amber team's
>excellent work.)
>
>
>Thank you
>J
>
>
>
>-----Original Message-----
>From: Ross Walker <ross.rosswalker.co.uk>
>To: AMBER Mailing List <amber.ambermd.org>
>Sent: Tue, Aug 20, 2013 10:10 pm
>Subject: Re: [AMBER] Cpu busy looping?
>
>
>Hi Jake,
>
>The CPU is polling the GPU for kernel completion in order to:
>
>1) Launch the next kernel in the sequence - since kernel launches are
>controlled by the CPU.
>
>2) Download / upload memory as needed to perform I/O.
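>
>In code the pattern is roughly the following - a hand-wavy sketch of the
>idea only, not the actual pmemd.cuda source:
>
>#include <cuda_runtime.h>
>
>__global__ void step_kernel(float *x)   // stand-in for a force kernel
>{
>    x[threadIdx.x] += 1.0f;
>}
>
>int main(void)
>{
>    float h_x[256];
>    float *d_x;
>    cudaMalloc(&d_x, sizeof(h_x));
>    cudaStream_t stream;
>    cudaStreamCreate(&stream);
>    for (int step = 0; step < 1000; ++step) {
>        step_kernel<<<1, 256, 0, stream>>>(d_x);
>        // This poll is where one host core sits at 100%:
>        while (cudaStreamQuery(stream) == cudaErrorNotReady) { /* spin */ }
>        if (step % 100 == 0)  // cf. ntwx: occasional download for I/O
>            cudaMemcpy(h_x, d_x, sizeof(h_x), cudaMemcpyDeviceToHost);
>    }
>    cudaFree(d_x);
>    cudaStreamDestroy(stream);
>    return 0;
>}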
>
>Unfortunately there is no free lunch - until there is a full operating
>system on the GPU, such that you can plug the disk directly into the GPU
>and throw away the rest of the machine, this is the way it has to run.
>Be grateful that you are not using NAMD or Gromacs, which would swallow
>ALL of your cores and ALL of your GPUs for a single calculation in order
>to get close to the speed AMBER gets using just 1 GPU and 1 CPU core -
>so 4 GPUs + 4 CPU cores in a single node give you the equivalent of 4
>full nodes running 4 Gromacs simulations.
>
>The CPU speed itself is irrelevant in that it just needs to meet a
>minimum - essentially enough to monitor interrupts at a fast enough
>frequency. We've not tested the bottom end; certainly sharing a single
>core between multiple GPUs slows things down, but a single 1.8GHz core
>is easily enough to keep up with 99.999% of GPU calculations - the only
>place it would really fall down is if you do a crazy amount of I/O, say
>with ntwx=1.
>
>Anyway, that's an aside. Note it is not the entire CPU being used 100% -
>it is merely 1 core being used 100%. Ultimately the rule is:
>
>You need 1 CPU core for each GPU in your system. These can be low-end,
>cost-effective CPUs (the MAJOR advantage of AMBER over Gromacs and NAMD,
>which need expensive CPUs for performance) - this is what we tend to
>recommend with the custom-designed Exxact machines (see the AMBER
>website). Thus a cheap 4-core CPU can easily handle running 4 GPUs flat
>out. Ideally you want 6 cores to leave some free for the OS. For home
>builds I tend to recommend the 8-core AMD chips since they are very
>cheap (<$150 each) and can easily handle 4 GPUs plus the operating
>system, I/O, interactive use etc.
>
>Attached are a couple of potential machine configs that work well.
>
>Note the other thing you can do, with say a dual 8-core machine, is to
>run 4 single-GPU jobs and then use the remaining 12 cores for a CPU MPI
>run.
>
>All the best
>Ross
>
>
>
>
>
>
>On 8/20/13 12:32 PM, "Jake Smith" <amberonejake.aol.com> wrote:
>
>>
>>Hello Amberers
>>While doing serial simulations on a GPU the CPU speed does indeed seem
>>irrelevant, but then why is one CPU core always stuck at 100% busy when
>>a GPU is performing a computation? This is not good in terms of how
>>many GPUs can be driven by a low-end CPU. Can I ask what exactly that
>>core is doing? It is especially strange given that the CPU speed seems
>>irrelevant.
>>Thank you
>>J
>
>
>




_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Tue Aug 20 2013 - 18:00:03 PDT