Wednesday, October 26, 2011

Performance Analysis of NAMD on NVIDIA GPU

Abstract

As GPUs become readily available resources in high performance computing environments, more and more applications are being considered as targets for exploiting the highly parallel computing capability of GPUs. To evaluate the performance improvement at the application level, the molecular dynamics application NAMD is benchmarked in this article and its performance is analyzed from three perspectives: scalability, speedup, and GPU utilization.

1. Experimental environment

GPU nodes are a relatively new addition to the existing EOS system at Texas A&M University. Table 1 shows the details of the node tested in this experiment. It has two six-core 2.8GHz Intel Xeon X5660 processors based on the Westmere architecture and 24GB of DDR3 DRAM with a 1,333MHz clock. Additionally, it is equipped with two NVIDIA M2050 GPU devices. Each M2050 runs at 1.15GHz and has 2,687MB of GDDR5 memory with a 1.546GHz clock. The M2050 is a high-performance computing device based on the Fermi architecture, with 14 streaming multiprocessors of 32 cores each. It also provides ECC memory protection and L1/L2 caches, for accuracy as well as high computational performance. The node runs 64-bit Red Hat Enterprise Linux Server release 5.4.

NAMD has been tested in three configurations: first with CPUs only, second with one NVIDIA M2070 GPU, and finally with two NVIDIA M2050 GPUs. Since these tests were submitted through the batch system, there is no guarantee that each test ran exclusively on a compute node; a node could be shared by several other jobs. Therefore, the performance presented in this article could differ from that of a dedicated testing environment.


                          NVIDIA M2050   NVIDIA M2070   Intel X5660
Clock speed (GHz)         1.15           1.15           2.8
Memory (GB)               2.687          5.375          24.081
Memory clock (GHz)        1.546          1.566          1.333
# of CUDA cores           448            448            N/A
CUDA driver ver.          4.0            4.0            N/A
CUDA compute capability   2.0            2.0            N/A

Table 1: Specification of experimental environment

To benchmark NAMD, the apoa1 dataset is used in this experiment. The dataset contains 92,000 atoms and the simulation runs for 500 steps. Among the parameters in the input file, outputEnergies is set to 100, as suggested in the user manual, to remove unnecessary CPU involvement in generating additional file output.
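For reference, the two parameters mentioned above correspond to lines like the following in the apoa1 input file. This is only a sketch: parameter names follow the NAMD user guide, and any values not discussed here are the benchmark's defaults and may differ.

# number of simulation steps
numsteps        500
# write energies every 100 steps, as recommended in the user guide,
# to limit CPU-side output work
outputEnergies  100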

2. Results

In this section, the performance of NAMD is analyzed in three aspects: scalability, speedup, and GPU utilization.

A. Scalability

The ratio of CPU cores to GPUs is 12:1 on the single GPU node and 12:2 on the multi GPU node. Each NAMD process occupies one CPU core and one GPU; however, a GPU can be shared by multiple processes. Therefore, a GPU can be oversubscribed by too many processes and become a bottleneck in certain configurations.
Figure 1

Since each NAMD process maps to a single CPU core, the number of cores in the graph is the same as the number of processes. As Figure 1 shows, the CPU version of NAMD scales well up to 12 cores on the Westmere node. However, the GPU version shows its best performance around 4-6 cores. On the single GPU node, all processes invoke kernels on the same GPU concurrently, which breaks scalability at 12 processes.
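For reference, the number of processes, and therefore how heavily each GPU is shared, is chosen at launch time. A typical invocation of the CUDA build of NAMD looks roughly like this (a sketch only; binary names, paths, and input file names may differ on EOS):

# 4 NAMD processes sharing whatever GPUs are visible on the node
charmrun +p4 namd2 +idlepoll apoa1.namd
# the same run, but explicitly pinning the processes to GPU devices 0 and 1
charmrun +p4 namd2 +idlepoll +devices 0,1 apoa1.namd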

B. Performance

Figure 2 shows the relative speedup of the GPU version compared to the CPU only version, taking the performance of the CPU only version with 1 core as 1x. With 1 CPU core and 1 GPU, NAMD runs 7 times faster than the CPU only version; with 2 CPU cores and 1 GPU, it runs 13 times faster. While the CPU version scales well up to 12 cores, the GPU version peaks around 6 cores and then starts falling.
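Assuming speedup here is the usual ratio of wall-clock times against the single-core CPU run, it is computed as

speedup(config) = T(1 CPU core, CPU only) / T(config)

For example, with purely illustrative numbers: if the 1-core CPU run took 700 seconds and the 1 CPU core + 1 GPU run took 100 seconds, the speedup would be 700 / 100 = 7x.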
Figure 2
C. GPU Utilization

1) Single GPU environment

Figure 3 shows how the utilization of a single GPU changes as the number of CPU cores increases. With 1 CPU core, the GPU uses only 30% of its capability; in other words, 70% of the GPU sits idle. With 2 CPU cores sharing the GPU, overall utilization doubles to around 60%. Likewise, utilization goes over 90% when 4 CPU cores share the GPU.
Figure 3
It turns out that higher GPU utilization translates directly into application performance, as shown in Figure 2: with 2 CPU cores, NAMD runs twice as fast as with 1 CPU core.
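For anyone who wants to reproduce this kind of measurement, GPU utilization can be sampled with the nvidia-smi tool while the job is running, for example as below (the exact output format depends on the driver version):

# refresh the utilization report once per second during the NAMD run
watch -n 1 nvidia-smi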

2) Multi GPU environment

Of the two graphs in Figure 4, the top one shows the utilization of the first GPU and the bottom one shows the utilization of the second GPU. The fact that the two graphs look very similar indicates that NAMD distributes the workload evenly across multiple GPUs. With 2 CPU cores, the utilization of each GPU stays around 40%; with 4 CPU cores, it goes over 60%.
Figure 4
3. Conclusions

GPU applications show different performance characteristics from CPU applications. Scalability can be affected by the ratio of CPU cores to GPUs: oversubscribing a GPU can limit the scalability of the application. On a multi GPU node, the overall speedup depends on how the application distributes work to each GPU device. At the same time, undersubscribing the GPU devices, that is, when the workload is not big enough to keep all of them busy, also limits the performance of a GPU application.


Performance of Parallel Migrate-n on Linux cluster


Introduction

Migrate estimates effective population sizes and past migration rates between n populations, assuming a migration matrix model with asymmetric migration rates and different subpopulation sizes [1].

Experimental environment

In this experiment, the performance of the serial version, which runs on a single core, is measured as the basis of comparison. Then the number of cores is increased to 8, 16, 32, and 64. Each compute node has two quad core 2.8GHz Intel Xeon X5560 processors based on the Nehalem architecture and 24GB of DDR3 DRAM with a 1,333MHz clock [2].

Since each compute node has 8 cores, 8 compute nodes are required for a 64 core job. parmfile.testml is used as the input file for the experiment. The only change made to it is that the value of the menu parameter is set to 'NO' so that it can run in batch mode on EOS and HYDRA [3][4][5].
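For reference, the change is a single line in the parmfile, and the parallel run is then launched through MPI, roughly as follows (a sketch; the actual binary name and MPI launcher depend on how Migrate was built on each cluster):

# in parmfile.testml: disable the interactive menu so the run works in batch mode
menu=NO

# launch the MPI-parallel build on 64 cores (binary name may differ, e.g. migrate-n-mpi)
mpirun -np 64 migrate-n-mpi parmfile.testml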

Results

Migrate keeps improving its performance up to 32 cores on both HYDRA and EOS [Fig. 1].
[Figure 1]

However, with 64 cores, performance starts falling on HYDRA [Fig. 2] and does not improve over 32 cores on EOS [Fig. 3].
[Figure 2]

[Figure 3]

On EOS, 8 cores are 5 times faster than 1 core, and there is almost no speedup between 32 and 64 cores. On HYDRA, 8 cores bring a 4.5 times speedup compared to 1 core; 32 cores show the best performance, and 64 cores are slightly slower than 32 cores.
[Figure 4]

In general, Migrate runs more than twice as fast on EOS as on HYDRA. With 8 cores, EOS is about 3 times faster than HYDRA [Fig. 4].

References

1. http://popgen.sc.fsu.edu/Migrate/Info.html
2. http://sc.tamu.edu/systems/eos/hardware.php
3. http://sc.tamu.edu/help/eos/batch/
4. http://sc.tamu.edu/help/hydra/batch.php
5. http://sc.tamu.edu/systems/hydra/hardware.php

Monday, September 5, 2011

2011 Fall Short Course, Introduction to Unix/Linux

The Fall 2011 class runs from Sep. 5 through Sep. 8 at Teague B013.

Date : Sep. 5(Monday) ~ Sep. 8(Thursday).
Time : 3PM ~ 5PM
Location : Teague B013

This short course will be held in a computer room, and each attendee will have access to a computer for the hands-on lab sessions. All attendees are strongly encouraged to have their EOS login ID ready before the first day of class; an EOS login ID is required for the hands-on lab.

Introduction to Linux is a short course specifically designed for beginners to Linux/Unix systems. It covers basic Linux concepts and frequently used commands.

Basic
  • What is Linux/Unix?
  • File and Directory
  • Edit text file
  • Setup environment
  • Remote access
Advanced
  • Process, Signal
  • I/O redirection, Pipe
  • Alias
  • Permission
  • Kernel & Shell

For details, see

https://sites.google.com/site/tamulinux/introduction-to-linux

Please post a comment or feedback about this class; it will help improve the class.

Thank you.


Sunday, August 28, 2011

Setting up dual monitor on Ubuntu 11.04 with NVIDIA

Several users have been complaining that the global menu bar disappears after setting up dual monitors on Ubuntu 11.04. It happened to me as well. Finally, I found an alternative setting that is not exactly what I wanted, but it gives me the minimum functionality I can accept.

Run 'NVIDIA X Server Settings' and make it match the settings shown in the two screenshots below, then restart gdm.

[Screenshot: Primary monitor settings]
[Screenshot: Secondary monitor settings]

With this setting, you get the following features:

(1) dual monitors
(2) move windows across dual monitors
(3) global menu bar on each monitor*

I don't like (3). What I wanted was dual monitors with the global menu bar on the primary monitor only, but I couldn't find a solution for that.

After using this setup for several days, I found that having the global menu bar on the secondary monitor is actually very useful, even a must-have feature. The global menu bar becomes the application menu bar once an application is launched on the secondary monitor, so the menu bar resides close to the application itself. Otherwise, you end up with the application menu bar on the primary monitor and the application on the secondary monitor.

Let me know if you have a better solution for this.

Thanks.

Monday, May 2, 2011

Compile libpng + pngwriter on IBM AIX 5.3


Compiling pngwriter, or generally speaking, compiling any open source code with the IBM compiler on an IBM AIX machine, is challenging. Sometimes it takes only a few minutes to build, but sometimes it takes several days to figure out that you could not use xlc in the first place because of compiler compatibility issues. Then you end up with gcc and, most likely, it's a snap.

pngwriter-0.5.4 requires libpng, but avoid version 1.5.2 (the latest version at the time of this writing) if at all possible. Instead, pick libpng-1.2.44 to be able to build it and still go home tonight and have dinner on time. :)

You will see only one error when you try to build it. Only one! What a day!

The error message is

"pngwriter.cc", line 1533.25: 1540-0217 (S) "__jmpbuf" is not a member of "struct png_struct_def".

pngwriter.h includes png.h for struct png_struct_def; then, along with many other header files, /usr/include/sys/context.h is included. This is why the error occurs: context.h defines a macro that renames jmpbuf to __jmpbuf, and the macro rewrites the reference at line 1533 of pngwriter.cc, whereas the struct definition itself, processed before the macro existed, remains unmodified.

That is why the compiler thinks '__jmpbuf' is not a member of the struct. One easy solution is to move the struct definition down so it is seen after context.h; that way, jmpbuf in the struct definition is also renamed to __jmpbuf.

Moving the #include of png.h down in pngwriter.h, so that it comes after the header that pulls in sys/context.h, solves the problem and you should be able to build it without any error.
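To see why the include order matters, here is a minimal, self-contained C++ sketch of the mechanism; the macro and struct below merely stand in for the real definitions in sys/context.h and png.h, with the member type simplified:

// stand-in for the macro that sys/context.h defines on AIX
#define jmpbuf __jmpbuf

// stand-in for the struct from png.h; because the macro is already visible,
// the member declared here is renamed to __jmpbuf by the preprocessor
struct png_struct_def {
    int jmpbuf;
};

int main() {
    png_struct_def p;
    // this use is renamed to __jmpbuf as well, so the definition and the use
    // agree and the code compiles; with the reverse include order only the
    // use is renamed, producing the "not a member" error shown above
    p.jmpbuf = 0;
    return 0;
}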

Good luck.

Brian