Tuesday, November 9, 2010

Matlab DCS + Torque : What's missing?

Matlab Distributed Computing Server directly supports the Torque scheduler with minimal configuration. However, when it passes parameters such as the number of CPUs to Torque, it only allows a single value via the '^N^' keyword, which is defined in 'ResourceTemplate' in the 'Scheduler Configuration Properties' window.

Unfortunately, Torque generally needs two values to specify the CPU count for a multi-node job, in this format:

nodes=X:ppn=Y

where X denotes the number of nodes and Y the number of processors per node.

Torque sets ppn to 1 by default if it is not explicitly specified. In other words, a 4-way job results in

nodes=4:ppn=1

instead of

nodes=1:ppn=4

which would be ideal for a system whose nodes have 4 cores each.
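
For reference, here is what the two layouts look like as plain qsub requests (the script name job.sh is just a placeholder):

qsub -l nodes=4:ppn=1,walltime=24:00:00 job.sh
qsub -l nodes=1:ppn=4,walltime=24:00:00 job.sh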

I needed a Matlab function that handles this conversion between the Matlab user and Matlab Distributed Computing Server.

TASKLAYOUT(X,Y) takes the number of tasks X and the number of cores per node Y, calculates an optimal node/ppn layout, and constructs the right string, which is used internally by the submission routine (pSubmitParallelJob.m; see step 3 below).

For example, assuming each node has 8 cores, TASKLAYOUT(X,8) produces the following resource specifications (the function itself returns only the part after 'nodes='; that prefix comes from the ResourceTemplate):

4 -> nodes=1:ppn=4
8 -> nodes=1:ppn=8
12 -> nodes=2:ppn=6
13 -> nodes=2:ppn=7
16 -> nodes=2:ppn=8
18 -> nodes=3:ppn=6
23 -> nodes=3:ppn=8
24 -> nodes=3:ppn=8
32 -> nodes=4:ppn=8

For 12-, 13-, and 18-way jobs, it does not request the full number of cores per node, which minimizes the number of unused cores. There is still some internal fragmentation in the 13-way case, which requests 14 cores (2 nodes x 7 cores per node) even though only 13 are needed. In other words, a 13-way job simply cannot be packed efficiently onto 8-core nodes.
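
To make the 13-way case concrete, this is the arithmetic the function performs (a minimal Matlab sketch of the same ceil-based rounding):

numTasks = 13; numPpn = 8;
numNodes = ceil(numTasks/numPpn)     % 2 nodes
ppn      = ceil(numTasks/numNodes)   % 7 cores per node
numNodes*ppn                         % 14 cores requested for only 13 tasks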

Now it's time to install and test it.

Follow these steps to install tasklayout.m.

1. Define 'ResourceTemplate' in TORQUE Scheduler Configuration Properties

Click 'Parallel/Manage Configurations' in the Matlab window.
Click 'File/Import' in the 'Configurations Manager' window.
Double-click the configuration you want to modify.
Set 'ResourceTemplate' to '-l nodes=^N^,walltime=24:00:00'.
*Most of you have probably done this already.

2. Copy tasklayout.m to the Matlab directory:

cp tasklayout.m $MATLAB/toolbox/local


3. Modify $MATLAB/toolbox/distcomp/@distcomp/@pbsscheduler/pSubmitParallelJob.m (here assuming each node has 8 cores). Change

selectStr=strrep(pbs.ResourceTemplate,'^N^',num2str(length(job.Tasks)))

to

selectStr=strrep(pbs.ResourceTemplate,'^N^',tasklayout(length(job.Tasks),8))
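
With the ResourceTemplate from step 1, a 16-task job on 8-core nodes would then produce a resource string like this (illustrative):

-l nodes=2:ppn=8,walltime=24:00:00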

4. Done

Please let me know if this function is not compatible with your environment, has bugs, or is missing anything.

Thanks




function taskStr = tasklayout( numTasks, numPpn )
%TASKLAYOUT Construct CPU requirement string for the Torque scheduler
%   TASKLAYOUT(X,Y) constructs the CPU requirement string for X tasks
%   on nodes with Y cores each.
%
%   Notes:
%
%   Matlab Distributed Computing Server directly supports the Torque
%   scheduler with minimal configuration. However, when it passes
%   parameters such as the number of CPUs to Torque, it only allows a
%   single value via the '^N^' keyword, which is defined in
%   'ResourceTemplate' in the 'Scheduler Configuration Properties' window.
%
%   Unfortunately, Torque generally needs two values to submit a
%   multi-node job, in this format:
%
%       nodes=X:ppn=Y
%
%   where X denotes the number of nodes and Y the number of processors
%   per node.
%
%   TASKLAYOUT calculates an optimal way to split the tasks across the
%   nodes and constructs the string that replaces '^N^' when the job
%   is submitted.
%
%   Example 1:
%       % Run a 16-way job on nodes with 8 cores each.
%       reqStr = tasklayout(16,8)
%
%       reqStr =
%
%       2:ppn=8
%
%   (The 'nodes=' prefix comes from the ResourceTemplate.)
%
%   See also matlabpool

%   For Matlab Distributed Computing Server
%   (c) Brian Kim @ Texas A&M University

if ( numTasks <= numPpn )
    % Everything fits on a single node.
    taskStr = strcat('1:ppn=', num2str(numTasks));
else
    numNodes = ceil(numTasks/numPpn);
    if ( rem(numTasks,numPpn) == 0 )
        % Exact multiple of the node size: fill every node completely.
        taskStr = strcat(num2str(numNodes), ':ppn=', num2str(numPpn));
    else
        % Otherwise spread the tasks as evenly as possible across the nodes.
        taskStr = strcat(num2str(numNodes), ':ppn=', ...
            num2str(ceil(numTasks/numNodes)));
    end
end


Tuesday, May 18, 2010

Scalability study of StarCCM+ on cluster

Introduction

The latest version of StarCCM+ supports multi-node jobs in the POE/IBM/AIX environment. Previously, it was not possible to run StarCCM+ on more than one node in parallel, so the maximum number of cores was limited to 16 on hydra[1]. With StarCCM+ ver 5.02, it became possible to spawn parallel processes on remote compute nodes through the POE parallel environment on an IBM cluster. Now, the question is: how scalable is StarCCM+?

Experimental environment

For the experiment, the lemans_poly_17m.sim file was tested, which is one of the simulation input files frequently used for benchmarking. The file is 6.5GB and contains 17 million cells and 95 million vertices.



The primary purpose of the experiment is to determine how scalable StarCCM+ is on hydra; however, the relative performance of the two machines is also evaluated. Each compute node on eos[2] has two quad-core Intel X5560 processors and nodes are connected via InfiniBand, whereas each compute node on hydra has 8 dual-core Power5+ processors, with HPS as the interconnect.



Results

The benchmark results are evaluated from two different perspectives: scalability and relative performance.

Scalability

The scalability on eos and hydra does not differ much. The performance with 16 cores is used as the baseline for the speedup comparison.
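
In other words, the speedup reported below is simply the elapsed-time ratio against the 16-core run:

speedup(N) = T(16 cores) / T(N cores)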

On eos, the speedup roughly doubles as the number of cores doubles, and it stays very scalable up to 64 cores. With 128 cores (ideally 8x), the scaling slows down and shows only a 6.2x speedup.



On hydra, the result is very similar: it stays very scalable up to 64 cores and starts to slow down around 128 cores.



Based on this result, there is no significant difference in the scalability of StarCCM+ between eos and hydra. However, this could change with different input files, I/O patterns, and data characteristics.

Performance

The definition of 'performance' here is simply which machine finishes the simulation faster: with the same input file, how long does it take on each machine? The difference between eos and hydra varies as the number of cores increases: eos is about 2.51x faster than hydra on 16 cores and 2.41x faster on 128 cores.



Conclusion


Benchmark results show that eos performs better than hydra when running StarCCM+ with the given input file. However, differences in hardware specifications between the two machines, such as L3 cache size, interconnect, and number of cores per node, could change the result for input files with different characteristics.

References

1. http://sc.tamu.edu/help/hydra/
2. http://sc.tamu.edu/systems/eos/hardware.php

Thursday, May 13, 2010

Quick install guide for LAMMPS on Linux cluster

This is a quick installation guide for the LAMMPS 10 May 2010 version on a Linux cluster; the installation can be done within an hour if you follow it carefully.

SYSTEM SPECIFICATION

Intel(R) Xeon(R) CPU X5560 @ 2.80GHz
Red Hat Enterprise Linux Server release 5.4 (Tikanga) 2.6.18-164.11.1.el5
Intel compiler and MPI stack
mpiicc for the Intel(R) MPI Library 4.0 for Linux*
Copyright(C) 2003-2010, Intel Corporation. All rights reserved.
Version 11.1

Prerequisites

Download these two files and untar them:

- fftw-2.1.5.tar.gz
http://www.fftw.org/fftw-2.1.5.tar.gz
- lammps.tar.gz (lib for LAMMPS, LAMMPS itself)
http://lammps.sandia.gov/download.html

Instruction

In a nutshell, the installation procedure consists of 3 steps: (1) FFTW, (2) libs for LAMMPS, (3) LAMMPS itself. If you use mpiicc and mpiifort, you don't have to worry about the path to the MPI package; it will be taken care of automatically.

(1)FFTW

Even though the latest version of FFTW is 3.2.2, LAMMPS unfortunately cannot work with it, so stick to 2.1.5 for now. I assume you're installing it in the /usr/local/fftw-2.1.5 directory.

After uncompressing it,

cd fftw-2.1.5
./configure CC=mpiicc F77=mpiifort --prefix=/usr/local/fftw-2.1.5
make
make check
make install

(2)libs for LAMMPS

After uncompressing lammps.tar.gz (the libs for LAMMPS are actually part of lammps.tar.gz), build the three libraries. You can change the compilers by hand as described below, or use the sed one-liners shown after this list.

cd lib/reax
Change ifort to mpiifort in Makefile.ifort
make -f Makefile.ifort

cd ../meam
Change ifort to mpiifort in Makefile.ifort
make -f Makefile.ifort

cd ../poems
Change icc to mpiicc in Makefile.icc
make -f Makefile.icc
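
If you would rather script the compiler swap than edit the three Makefiles by hand, something like this should work from the top of the LAMMPS tree (a sketch, assuming GNU sed; it rewrites every standalone 'ifort'/'icc' token in place):

sed -i 's/\bifort\b/mpiifort/g' lib/reax/Makefile.ifort lib/meam/Makefile.ifort
sed -i 's/\bicc\b/mpiicc/g' lib/poems/Makefile.icc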

(3)LAMMPS
cd ../../src

Check which packages are included
make package-status

Choose all standard packages to be included
make yes-standard

If you don't have a GPU, exclude the gpu package; otherwise you will see tons of error messages when you compile.
make no-gpu

Edit MAKE/Makefile.linux

Comment out MPI_PATH and MPI_LIB; they will be taken care of by mpiicc and mpiifort.
#MPI_PATH =
#MPI_LIB = -lmpich -lpthread

CC=mpiicc
LINK=mpiicc
FFT_INC = -I/usr/local/fftw-2.1.5/include -DFFT_FFTW
FFT_PATH = -L/usr/local/fftw-2.1.5/lib
FFT_LIB = -lfftw

Build LAMMPS as an executable and as a library:

make linux
make makelib
make -f Makefile.lib linux
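
Before copying anything into place, it may be worth a quick sanity run against one of the bundled benchmark inputs (a sketch; in.lj ships in the bench/ directory, and the exact mpirun invocation depends on your MPI stack):

cd ../bench
mpirun -np 4 ../src/lmp_linux < in.lj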

INSTALLATION

I assume you're installing LAMMPS in the /usr/local/lammps directory.

copy bench/ doc/ examples/ potentials/ README tools/ to /usr/local/lammps
copy all *.a to /usr/local/lammps/lib
copy lmp_linux to /usr/local/lammps/bin

Done!!!

Once you have gotten this far, just let your users know that LAMMPS is available at /usr/local/lammps, and they will know how to play around with it.
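
For convenience, users will probably also want the binary on their PATH, e.g.:

export PATH=/usr/local/lammps/bin:$PATH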

Wednesday, March 3, 2010

OpenMP vs. MPI: Who is the winner?

Recently, I analyzed all the email communication between supercomputer users and the helpdesk from 08/2006 to 03/2010. Only OpenMP- and MPI-related emails were counted.

[Chart: counts of OpenMP- vs. MPI-related emails]

This is just for fun and not intended to deliver any technical explanation. Occasionally, some users ask out of curiosity, and this is the answer to that. That's it.

Wednesday, February 10, 2010

Short course : Introduction to Unix/Linux

For the spring semester, we will be offering a short course, 'Introduction to Unix'.

Date: Feb 8 (Monday) - 10 (Wednesday).
Location: Annex Library 417C

For details:

http://groups.google.com/group/tamulinux

Please leave a comment or feedback about this class; it would be helpful for improving it.
Thank you.