Re: [AMBER] Amber16 Parallel CUDA Tests

From: Steven Ford <sford123.ibbr.umd.edu>
Date: Thu, 11 Aug 2016 22:54:15 -0400

Hello,

I'm still trying to figure out why the MPI CUDA tests are failing.

If I limit CUDA_VISIBLE_DEVICES to a single device (0 or 1) and run the
tests with DO_PARALLEL="mpirun -np 4", all tests pass. I see the same
behavior with OpenMPI 1.8, 1.10, and 2.0, and with MPICH 3.1.
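
Concretely, the passing and failing configurations can be cycled with a small
script. This is just a sketch: the install path is an assumption and the exact
test target may differ, so the real test invocation is left commented out.

```shell
#!/bin/sh
# Cycle the parallel CUDA tests over each K80 die alone, then both together.
# Only the 0,1 case fails for me; 0 alone and 1 alone both pass.
AMBERHOME="${AMBERHOME:-$HOME/amber16}"   # assumed install location
export DO_PARALLEL="mpirun -np 4"
for devs in 0 1 0,1; do
    export CUDA_VISIBLE_DEVICES="$devs"
    echo "CUDA parallel tests on device(s): $CUDA_VISIBLE_DEVICES"
    # (cd "$AMBERHOME" && make test)   # run against the -cuda -mpi build
done
```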

I ran gpuP2PCheck just in case communication between the GPUs was the
problem. It confirms that communication is working:

CUDA-capable device count: 2
   GPU0 " Tesla K80"
   GPU1 " Tesla K80"

Two way peer access between:
   GPU0 and GPU1: YES
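
(For anyone reproducing this: the PCIe topology can also be read straight from
the driver. A guarded one-liner, on the assumption that this driver generation
supports the `topo` subcommand:)

```shell
#!/bin/sh
# Print the GPU interconnect topology (shows the PLX switch sitting between
# the two K80 dies). Guarded so the script exits cleanly on a host without
# the NVIDIA driver installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi topo -m
else
    echo "nvidia-smi not available on this host"
fi
```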

If it's of any use, here is the output of nvidia-smi -q:

==============NVSMI LOG==============

Timestamp : Thu Aug 11 22:42:34 2016
Driver Version : 352.93

Attached GPUs : 2
GPU 0000:05:00.0
    Product Name : Tesla K80
    Product Brand : Tesla
    Display Mode : Disabled
    Display Active : Disabled
    Persistence Mode : Disabled
    Accounting Mode : Disabled
    Accounting Mode Buffer Size : 1920
    Driver Model
        Current : N/A
        Pending : N/A
    Serial Number : 0325015055313
    GPU UUID : GPU-a65eaa77-8871-ded5-b6ee-5268404192f1
    Minor Number : 0
    VBIOS Version : 80.21.1B.00.01
    MultiGPU Board : Yes
    Board ID : 0x300
    Inforom Version
        Image Version : 2080.0200.00.04
        OEM Object : 1.1
        ECC Object : 3.0
        Power Management Object : N/A
    GPU Operation Mode
        Current : N/A
        Pending : N/A
    PCI
        Bus : 0x05
        Device : 0x00
        Domain : 0x0000
        Device Id : 0x102D10DE
        Bus Id : 0000:05:00.0
        Sub System Id : 0x106C10DE
        GPU Link Info
            PCIe Generation
                Max : 3
                Current : 3
            Link Width
                Max : 16x
                Current : 16x
        Bridge Chip
            Type : PLX
            Firmware : 0xF0472900
        Replays since reset : 0
        Tx Throughput : N/A
        Rx Throughput : N/A
    Fan Speed : N/A
    Performance State : P0
    Clocks Throttle Reasons
        Idle : Not Active
        Applications Clocks Setting : Active
        SW Power Cap : Not Active
        HW Slowdown : Not Active
        Unknown : Not Active
    FB Memory Usage
        Total : 12287 MiB
        Used : 56 MiB
        Free : 12231 MiB
    BAR1 Memory Usage
        Total : 16384 MiB
        Used : 2 MiB
        Free : 16382 MiB
    Compute Mode : Default
    Utilization
        Gpu : 0 %
        Memory : 0 %
        Encoder : 0 %
        Decoder : 0 %
    Ecc Mode
        Current : Disabled
        Pending : Disabled
    ECC Errors
        Volatile
            Single Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Texture Memory : N/A
                Total : N/A
            Double Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Texture Memory : N/A
                Total : N/A
        Aggregate
            Single Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Texture Memory : N/A
                Total : N/A
            Double Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Texture Memory : N/A
                Total : N/A
    Retired Pages
        Single Bit ECC : 0
        Double Bit ECC : 0
        Pending : No
    Temperature
        GPU Current Temp : 31 C
        GPU Shutdown Temp : 93 C
        GPU Slowdown Temp : 88 C
    Power Readings
        Power Management : Supported
        Power Draw : 59.20 W
        Power Limit : 149.00 W
        Default Power Limit : 149.00 W
        Enforced Power Limit : 149.00 W
        Min Power Limit : 100.00 W
        Max Power Limit : 175.00 W
    Clocks
        Graphics : 562 MHz
        SM : 562 MHz
        Memory : 2505 MHz
    Applications Clocks
        Graphics : 562 MHz
        Memory : 2505 MHz
    Default Applications Clocks
        Graphics : 562 MHz
        Memory : 2505 MHz
    Max Clocks
        Graphics : 875 MHz
        SM : 875 MHz
        Memory : 2505 MHz
    Clock Policy
        Auto Boost : On
        Auto Boost Default : On
    Processes : None

GPU 0000:06:00.0
    Product Name : Tesla K80
    Product Brand : Tesla
    Display Mode : Disabled
    Display Active : Disabled
    Persistence Mode : Disabled
    Accounting Mode : Disabled
    Accounting Mode Buffer Size : 1920
    Driver Model
        Current : N/A
        Pending : N/A
    Serial Number : 0325015055313
    GPU UUID : GPU-21c2be1c-72a9-1b68-adab-459d05dd7adc
    Minor Number : 1
    VBIOS Version : 80.21.1B.00.02
    MultiGPU Board : Yes
    Board ID : 0x300
    Inforom Version
        Image Version : 2080.0200.00.04
        OEM Object : 1.1
        ECC Object : 3.0
        Power Management Object : N/A
    GPU Operation Mode
        Current : N/A
        Pending : N/A
    PCI
        Bus : 0x06
        Device : 0x00
        Domain : 0x0000
        Device Id : 0x102D10DE
        Bus Id : 0000:06:00.0
        Sub System Id : 0x106C10DE
        GPU Link Info
            PCIe Generation
                Max : 3
                Current : 3
            Link Width
                Max : 16x
                Current : 16x
        Bridge Chip
            Type : PLX
            Firmware : 0xF0472900
        Replays since reset : 0
        Tx Throughput : N/A
        Rx Throughput : N/A
    Fan Speed : N/A
    Performance State : P0
    Clocks Throttle Reasons
        Idle : Not Active
        Applications Clocks Setting : Active
        SW Power Cap : Not Active
        HW Slowdown : Not Active
        Unknown : Not Active
    FB Memory Usage
        Total : 12287 MiB
        Used : 56 MiB
        Free : 12231 MiB
    BAR1 Memory Usage
        Total : 16384 MiB
        Used : 2 MiB
        Free : 16382 MiB
    Compute Mode : Default
    Utilization
        Gpu : 0 %
        Memory : 0 %
        Encoder : 0 %
        Decoder : 0 %
    Ecc Mode
        Current : Disabled
        Pending : Disabled
    ECC Errors
        Volatile
            Single Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Texture Memory : N/A
                Total : N/A
            Double Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Texture Memory : N/A
                Total : N/A
        Aggregate
            Single Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Texture Memory : N/A
                Total : N/A
            Double Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Texture Memory : N/A
                Total : N/A
    Retired Pages
        Single Bit ECC : 0
        Double Bit ECC : 0
        Pending : No
    Temperature
        GPU Current Temp : 24 C
        GPU Shutdown Temp : 93 C
        GPU Slowdown Temp : 88 C
    Power Readings
        Power Management : Supported
        Power Draw : 70.89 W
        Power Limit : 149.00 W
        Default Power Limit : 149.00 W
        Enforced Power Limit : 149.00 W
        Min Power Limit : 100.00 W
        Max Power Limit : 175.00 W
    Clocks
        Graphics : 562 MHz
        SM : 562 MHz
        Memory : 2505 MHz
    Applications Clocks
        Graphics : 562 MHz
        Memory : 2505 MHz
    Default Applications Clocks
        Graphics : 562 MHz
        Memory : 2505 MHz
    Max Clocks
        Graphics : 875 MHz
        SM : 875 MHz
        Memory : 2505 MHz
    Clock Policy
        Auto Boost : On
        Auto Boost Default : On
    Processes : None


If it matters: when I run the tests with DO_PARALLEL="mpirun -np 4", I see
that each MPI process opens a context on both GPUs. For example:

# gpu        pid  type    sm   mem   enc   dec   command
# Idx          #   C/G     %     %     %     %   name
    0      30599     C    24     0     0     0   pmemd.cuda_DPFP
    0      30600     C     0     0     0     0   pmemd.cuda_DPFP
    0      30601     C    11     0     0     0   pmemd.cuda_DPFP
    0      30602     C     0     0     0     0   pmemd.cuda_DPFP
    1      30599     C     0     0     0     0   pmemd.cuda_DPFP
    1      30600     C    36     0     0     0   pmemd.cuda_DPFP
    1      30601     C     0     0     0     0   pmemd.cuda_DPFP
    1      30602     C     6     0     0     0   pmemd.cuda_DPFP

Is that expected behavior?
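
For reference, a snapshot like the one above can be taken with a standard
`nvidia-smi` query while the tests run (the field names are the stock
query-compute-apps ones; guarded for hosts without the driver):

```shell
#!/bin/sh
# List every process holding a compute context on any GPU, with the
# memory it has allocated. Run this while the parallel tests are going.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
else
    echo "no NVIDIA driver on this host"
fi
```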

Has anybody else had any problems using K80s with MPI and CUDA? Or using
CentOS/RHEL 6?

This machine does have dual CPUs; could that be a factor?
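
One thing I can check on the dual-socket side is the NUMA layout and rank
placement. A sketch (the binding flags are OpenMPI 1.8-era syntax, so treat
them as an assumption, and the trailing arguments are just illustrative):

```shell
#!/bin/sh
# Show the NUMA topology of the host; on a dual-socket box each GPU's
# PCIe root complex hangs off one socket. Guarded for hosts without numactl.
if command -v numactl >/dev/null 2>&1; then
    numactl --hardware
else
    echo "numactl not installed on this host"
fi
# Example of pinning ranks and reporting where they land (OpenMPI 1.8+):
# mpirun -np 4 --bind-to socket --report-bindings pmemd.cuda.MPI ...
```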

I'm currently using AmberTools version 16.12 and Amber version 16.05.

Any insight would be greatly appreciated.

Thanks,

Steve



On Mon, Jul 25, 2016 at 3:06 PM, Steven Ford <sford123.ibbr.umd.edu> wrote:

> Ross,
>
> This is CentOS version 6.7 with kernel version 2.6.32-573.22.1.el6.x86_64.
>
> The output of nvidia-smi is:
>
> +------------------------------------------------------+
> | NVIDIA-SMI 352.79     Driver Version: 352.79         |
> |-------------------------------+----------------------+----------------------+
> | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
> | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
> |===============================+======================+======================|
> |   0  Tesla K80           Off  | 0000:05:00.0     Off |                  Off |
> | N/A   34C    P0    59W / 149W |     56MiB / 12287MiB |      0%      Default |
> +-------------------------------+----------------------+----------------------+
> |   1  Tesla K80           Off  | 0000:06:00.0     Off |                  Off |
> | N/A   27C    P0    48W / 149W |     56MiB / 12287MiB |      0%      Default |
> +-------------------------------+----------------------+----------------------+
>
> +-----------------------------------------------------------------------------+
> | Processes:                                                       GPU Memory |
> |  GPU       PID  Type  Process name                               Usage      |
> |=============================================================================|
> |  No running processes found                                                 |
> +-----------------------------------------------------------------------------+
>
> The version of nvcc:
>
> nvcc: NVIDIA (R) Cuda compiler driver
> Copyright (c) 2005-2015 NVIDIA Corporation
> Built on Tue_Aug_11_14:27:32_CDT_2015
> Cuda compilation tools, release 7.5, V7.5.17
>
> I used the GNU compilers, version 4.4.7.
>
> I am using OpenMPI version 1.8.1-5.el6 from the CentOS repository. I have
> not tried any other MPI installation.
>
> Output of mpif90 --showme:
>
> gfortran -I/usr/include/openmpi-x86_64 -pthread -I/usr/lib64/openmpi/lib
> -Wl,-rpath -Wl,/usr/lib64/openmpi/lib -Wl,--enable-new-dtags
> -L/usr/lib64/openmpi/lib -lmpi_usempi -lmpi_mpifh -lmpi
>
>
> I set DO_PARALLEL to "mpirun -np 2"
>
> The parallel tests for the CPU were all successful.
>
> I had not run 'make clean' in between each step. I tried the tests again
> this morning after running 'make clean' and got the same result.
>
> I applied all patches this morning before testing again. I am using
> AmberTools 16.10 and Amber 16.04
>
>
> Thanks,
>
> Steve
>
> On Sat, Jul 23, 2016 at 6:32 PM, Ross Walker <ross.rosswalker.co.uk>
> wrote:
>
>> Hi Steven,
>>
>> This is a large number of very worrying failures. Something is definitely
>> very wrong here and I'd like to investigate further. Can you give me some
>> more details about your system please. This includes:
>>
>> The specifics of what version of Linux you are using.
>>
>> The output of nvidia-smi
>>
>> nvcc -V (might be lower case v to get version info).
>>
>> Did you use the GNU compilers or the Intel compilers and in either case
>> which version?
>>
>> OpenMPI - can you confirm the version again and also send me the output
>> of mpif90 --showme (it might be --show or -show or something similar) -
>> essentially I want to see what the underlying compilation line is.
>>
>> Can you confirm what you had $DO_PARALLEL set to when you ran make test
>> for the parallel GPU build. Also can you confirm if the regular (CPU)
>> parallel build passed the tests please?
>>
>> Also did you run 'make clean' before each build step? E.g.
>>
>> ./configure -cuda gnu
>> make -j8 install
>> make test
>> *make clean*
>>
>> ./configure -cuda -mpi gnu
>> make -j8 install
>> make test
>>
>> Have you tried any other MPI installations? - E.g. MPICH?
>>
>> And finally can you please confirm which version of Amber (and
>> AmberTools) this is and which patches have been applied?
>>
>> Thanks.
>>
>> All the best
>> Ross
>>
>> On Jul 21, 2016, at 14:20, Steven Ford <sford123.ibbr.umd.edu> wrote:
>>
>> Ross,
>>
>> Attached are the log and diff files. Thank you for taking a look.
>>
>> Regards,
>>
>> Steve
>>
>> On Thu, Jul 21, 2016 at 5:34 AM, Ross Walker <ross.rosswalker.co.uk>
>> wrote:
>>
>>> Hi Steve,
>>>
>>> Indeed that is too big a difference to just be rounding error - although
>>> if those tests are using Langevin or Anderson for the thermostat that would
>>> explain it (different random number streams) - although those tests are
>>> supposed to be skipped in parallel.
>>>
>>> Can you send me a copy directly of your .log and .dif files for the 2
>>> GPU run and I'll take a closer look at it.
>>>
>>> All the best
>>> Ross
>>>
>>> > On Jul 20, 2016, at 21:19, Steven Ford <sford123.ibbr.umd.edu> wrote:
>>> >
>>> > Hello All,
>>> >
>>> > I am currently trying to get Amber16 installed and running on our
>>> > computing cluster. Our researchers are primarily interested in running
>>> > the GPU-accelerated programs. For GPU computing jobs, we have one
>>> > CentOS 6.7 node with a Tesla K80.
>>> >
>>> > I was able to build Amber16 and run the serial and parallel CPU tests
>>> > plus the serial GPU tests with all file comparisons passing. However,
>>> > only 5 parallel GPU tests succeeded, while the other 100 comparisons
>>> > failed.
>>> >
>>> > Examining the diff file shows that some of the numbers are off by more
>>> > than the small rounding differences the documentation said could
>>> > happen. For example:
>>> >
>>> > 66c66
>>> > < NSTEP = 1  TIME(PS) = 50.002  TEMP(K) = 351.27  PRESS = 0.
>>> > > NSTEP = 1  TIME(PS) = 50.002  TEMP(K) = 353.29  PRESS = 0.
>>> >
>>> > This may also be too large to attribute to a rounding error, but it is
>>> > a small difference compared to others:
>>> >
>>> > 85c85
>>> > < Etot =  -217.1552  EKtot = 238.6655  EPtot =  -455.8207
>>> > > Etot = -1014.2562  EKtot = 244.6242  EPtot = -1258.8804
>>> >
>>> > This was built with CUDA 7.5 and OpenMPI 1.8, and run with
>>> > DO_PARALLEL="mpirun -np 2".
>>> >
>>> > Any idea what else could be affecting the output?
>>> >
>>> > Thanks,
>>> >
>>> > Steve
>>> >
>>> > --
>>> > Steven Ford
>>> > IT Infrastructure Specialist
>>> > Institute for Bioscience and Biotechnology Research
>>> > University of Maryland
>>> > (240)314-6405
>>> > _______________________________________________
>>> > AMBER mailing list
>>> > AMBER.ambermd.org
>>> > http://lists.ambermd.org/mailman/listinfo/amber
>>>
>>>
>>> _______________________________________________
>>> AMBER mailing list
>>> AMBER.ambermd.org
>>> http://lists.ambermd.org/mailman/listinfo/amber
>>>
>>
>>
>>
>> --
>> Steven Ford
>> IT Infrastructure Specialist
>> Institute for Bioscience and Biotechnology Research
>> University of Maryland
>> (240)314-6405
>> <2016-07-20_11-17-52.diff><2016-07-20_11-17-52.log>
>>
>>
>>
>
>
> --
> Steven Ford
> IT Infrastructure Specialist
> Institute for Bioscience and Biotechnology Research
> University of Maryland
> (240)314-6405
>



-- 
Steven Ford
IT Infrastructure Specialist
Institute for Bioscience and Biotechnology Research
University of Maryland
(240)314-6405
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Thu Aug 11 2016 - 20:00:03 PDT