Mandatory performance optimizations, from small clusters to HPC

2011-04-26 zzggbb
Category: LINUX

Here are the areas that one might need to optimize to achieve the best performance:

A. Optimize Applications
1. Compilers (Intel: Intel Fortran compiler; AMD: AMD’s gnutools)
2. Math libraries (Intel: Intel Math Kernel Library (MKL); AMD: AMD Core Math Library (ACML))
3. Use optimization flags when compiling all the software components that go into the production application binaries
4. For MPI applications: preferably MPICH2 or MPICH1, launched with OSC’s mpiexec. I have had bad experiences with OpenMPI (openmpi-1.1.1-8 on Rocks), but the later version 1.2.5 is OK.
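
One way to read item 3 is that every component should be built with the same flag set. The sketch below assumes hypothetical source file names and two commonly used flag sets (`-O3 -xHost` for the Intel compiler, `-O3 -march=native` for GNU); neither comes from the original post. It only prints the compile commands so they can be reviewed before use.

```shell
#!/bin/sh
# Print one compile command per source file, using a single flag set
# for the chosen compiler family so all components are built the same way.
# File names are hypothetical; the flags are common choices, not the author's.
compile_cmds() {
    family=$1; shift
    case "$family" in
        intel) fc=ifort;    fflags="-O3 -xHost" ;;
        gnu)   fc=gfortran; fflags="-O3 -march=native" ;;
        *)     echo "unknown compiler family: $family" >&2; return 1 ;;
    esac
    for src in "$@"; do
        echo "$fc $fflags -c $src"
    done
}

compile_cmds intel solver.f90 io.f90
```

The same function can be called for each library and for the application itself, which makes it harder to end up with one component built without optimization.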

B. Optimize Linux Operating System on nodes
1. NFS (rsize, wsize, noatime, async in /etc/exports, adjust the number of nfsd daemons)
2. Ethernet card devices (MTU = 9000 (jumbo frames), adaptive interrupt coalescing enabled, TCP segmentation offload enabled, IRQ affinity on SMP (multi-core/multi-processor), TCP Offload Engine if supported)
3. Linux kernel (correct processor family, preemption model "No Forced Preemption (Server)", "Preempt the Big Kernel Lock" disabled, timer frequency 100 Hz, CPU frequency scaling governor "performance")
4. Turn off all unnecessary services
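
Some of the NFS and Ethernet settings in this list can be applied from the command line. The following is a minimal dry-run sketch: the interface name, NFS server, export path, mount point, and the 32768-byte transfer sizes are placeholders, not values from the original post. By default it only prints the commands.

```shell
#!/bin/sh
# Dry-run by default: print each command instead of executing it.
# Set DRY_RUN=0 and run as root to apply. Interface, server, paths,
# and transfer sizes below are placeholders, not the author's values.
DRY_RUN=${DRY_RUN:-1}
IFACE=${IFACE:-eth0}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

tune_node() {
    # NFS client mount with explicit transfer sizes and no atime updates
    run mount -t nfs -o rsize=32768,wsize=32768,noatime nfsserver:/export /mnt/export
    # Jumbo frames (the switch and every peer must also use MTU 9000)
    run ip link set dev "$IFACE" mtu 9000
    # TCP segmentation offload, if the driver supports it
    run ethtool -K "$IFACE" tso on
    # Adaptive interrupt coalescing, if the driver supports it
    run ethtool -C "$IFACE" adaptive-rx on
}

tune_node
```

Printing first is deliberate: a wrong MTU or mount option on a production node is easier to catch in the printed command than after the fact.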

C. Optimize /proc related entries or /etc/sysctl.conf
….
####################################
#global read and write socket buffers (256 K)
net.core.rmem_default = 262144
net.core.wmem_default = 262144
#max size of read and write socket buffers (8 MB, new: 16 MB)
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
#read and write socket buffers specific to TCP
#min, default, max size (4 KB, 85 KB read / 64 KB write, 16 MB) - default <= rmem_max
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.ipv4.tcp_mem = 786432 1048576 1572864

#turn off sack,dsack and timestamping, since this is local and HPC - not really needed
net.ipv4.tcp_sack=0
net.ipv4.tcp_dsack = 0
net.ipv4.tcp_timestamps=0

#turn on low_latency alg.
net.ipv4.tcp_low_latency=1

net.ipv4.ipfrag_high_thresh = 4194303
net.ipv4.ipfrag_low_thresh = 1048575
# don't cache ssthresh from previous connection
net.ipv4.tcp_no_metrics_save = 1

#max number of incoming packets queued (350) for delivery to device queue
net.core.netdev_max_backlog = 2500

#max accept queue backlog (default: 128)
#net.core.somaxconn = 256
#default 1024, could increase if connections from clients dropped
#net.ipv4.tcp_max_syn_backlog = 1024

#shmmax: max shared mem segment size in bytes
kernel.shmmax = 4294967296
#shmmin: 1byte-2GB: min shared memory segment size in bytes

#shmmni: max number of shared mem segments
#shmall <= shmmax / shmmni(4096)
#shmall: total shared memory system wide, counted in pages (not bytes)
kernel.shmall = 4294967296

#max sem per array, max sem sys wide, max ops per semop call, max # of arrays
#semmsl(8000), semmns, semopm(8000), semmni(32767)
#semmns <= semmsl * semmni
#kernel.semms <= Total system physical memory

#MPI apps use a lot of semaphores
kernel.sem = 1000 51200 128 1024
############################################
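
Two relationships mentioned in the comments of this listing can be checked with a little shell arithmetic: kernel.shmall is counted in pages, and semmns should not exceed semmsl * semmni. The helper names below are hypothetical and the sketch is only a sanity check under those two assumptions, not part of the original configuration.

```shell
#!/bin/sh
# Helpers for checking two relationships noted in the sysctl comments.

# kernel.shmall is counted in pages: pages needed to cover a size in bytes.
# The page size defaults to the running system's value.
bytes_to_pages() {
    bytes=$1
    page_size=${2:-$(getconf PAGE_SIZE)}
    echo $(( (bytes + page_size - 1) / page_size ))
}

# kernel.sem = semmsl semmns semopm semmni
# Succeeds when semmns <= semmsl * semmni.
sem_ok() {
    semmsl=$1; semmns=$2; semmni=$4
    [ "$semmns" -le $(( semmsl * semmni )) ]
}

# Pages needed to cover the 4 GB shmmax above, with 4 KB pages
bytes_to_pages 4294967296 4096
# The kernel.sem line above
sem_ok 1000 51200 128 1024 && echo "sem limits consistent"
```

After editing /etc/sysctl.conf, `sysctl -p` loads the file and `sysctl net.core.rmem_max` (or any other key) shows the value currently in effect.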

D. Do not go cheap on network switches: in my experience, everybody should get a non-blocking switch (for example Nortel BayStack 5510 or Extreme Networks Summit X450e)

E. Minimize glibc’s malloc/free overhead for MPI applications, especially on NUMA architectures (set in .bashrc)
export MALLOC_MMAP_MAX_=0
export MALLOC_TRIM_THRESHOLD_=-1
#export MALLOC_TOP_PAD_=2097152
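
As an alternative to exporting these variables for every login shell, they can be set for a single command. This is a minimal sketch with a hypothetical wrapper name. As I understand glibc, `MALLOC_MMAP_MAX_=0` stops malloc from using mmap for large allocations and `MALLOC_TRIM_THRESHOLD_=-1` stops it from returning freed heap memory to the kernel.

```shell
#!/bin/sh
# Run one command with the malloc settings from this section, leaving
# the rest of the login environment untouched.
# The wrapper name is hypothetical.
with_malloc_tuning() {
    env MALLOC_MMAP_MAX_=0 MALLOC_TRIM_THRESHOLD_=-1 "$@"
}

# Show that the child process sees the settings.
with_malloc_tuning sh -c 'echo "$MALLOC_MMAP_MAX_ $MALLOC_TRIM_THRESHOLD_"'
```

A job would then be started as `with_malloc_tuning mpiexec -n 16 ./app`, where `./app` is a placeholder. Whether the variables reach ranks on remote nodes depends on the MPI launcher, which is one reason the original puts them in .bashrc.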

http://blogold.chinaunix.net/u1/50058/showart_1300144.html
