James Lin


James Lin now serves as the vice director of the HPC Center at Shanghai Jiao Tong University. Before joining the HPC Center early this May, he had worked as an assistant professor in the Dept. of Computer Science of the same university since 2005. His major research interests include HPC and many-core programming. He is the PI of the NVIDIA Center of Excellence at SJTU (http://ccoe.sjtu.edu.cn), and a co-founder and steering committee member of the HMPP Competence Center for APAC (http://competencecenter.hmpp.org/hmpp-coc-for-asia/). In his HPC research, CFD/CAE has always been his favorite target area. He has led or participated in several CFD/CAE projects in HPC or Grid computing since 2003, and worked closely with the CFD team at COMAC before China announced its plan to build its own civil jet in 2007. He is also quite active in HPC education. As the founder of the SJTU HPC Seminar, he has organized 20+ sessions for students in SJTU and Shanghai since 2010.

Presentation Abstract

CFD (Computational Fluid Dynamics) is fundamental to the design of large civil aircraft. Working closely with the CFD team at COMAC (Commercial Aircraft Corporation of China), SJTU (Shanghai Jiao Tong University) has developed an in-house code called SJTU-NS3D, which solves the 3D RANS (Reynolds-Averaged Navier-Stokes) equations on structured grids using the FVM (finite volume method) and has been widely used in wing model design.

To meet the rising demand for computational resources at higher precision, we have designed a CUDA version of SJTU-NS3D. As the hot spot, the Runge-Kutta iteration has been offloaded to the GPU as kernel functions. We then optimized the code further with four approaches: 1) for better locality, transformed the 1D matrices storing intermediate variables into 3D matrices; 2) to update data at neighboring vertices, used multiple kernels to synchronize globally; 3) to eliminate data transfer between host and device memory, parallelized one direction at a time in the Implicit Residual Smoothing algorithm; 4) for better access to global memory after the matrix transformation, allocated the matrices with the CUDA runtime function cudaMalloc() so that accesses are coalesced.
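As an illustration of approach 3, the following is a minimal CPU-side C++ sketch, not the SJTU-NS3D code. It assumes the common factored form of implicit residual smoothing, in which each direction requires solving a tridiagonal system (1 + 2*eps)*x[m] - eps*x[m-1] - eps*x[m+1] = r[m] along every grid line; the abstract does not state which form the code uses. The names Field3D, smooth_line, smooth_direction and smooth_residual, the constant coefficient eps, and the zero treatment beyond the line ends are all hypothetical choices made for this example. The point it shows is that, within one direction, every grid line is an independent system, so on a GPU each line could be assigned to its own thread while the data stays in device memory.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical 3D field stored in one contiguous buffer, with i varying fastest.
struct Field3D {
    int ni, nj, nk;
    std::vector<double> data;
    Field3D(int ni_, int nj_, int nk_, double v = 0.0)
        : ni(ni_), nj(nj_), nk(nk_),
          data(static_cast<std::size_t>(ni_) * nj_ * nk_, v) {}
    double& at(int i, int j, int k) {
        return data[(static_cast<std::size_t>(k) * nj + j) * ni + i];
    }
};

// Solve (1 + 2*eps)*x[m] - eps*x[m-1] - eps*x[m+1] = r[m] along one grid line
// with the Thomas algorithm. `line` holds r on entry and x on exit.
// Assumption: values beyond both ends of the line are treated as zero.
void smooth_line(std::vector<double>& line, double eps) {
    const int n = static_cast<int>(line.size());
    if (n == 0) return;
    const double a = -eps;             // sub- and super-diagonal
    const double b = 1.0 + 2.0 * eps;  // main diagonal
    std::vector<double> c(n);          // modified super-diagonal
    c[0] = a / b;
    line[0] /= b;
    for (int m = 1; m < n; ++m) {      // forward elimination
        const double denom = b - a * c[m - 1];
        c[m] = a / denom;
        line[m] = (line[m] - a * line[m - 1]) / denom;
    }
    for (int m = n - 2; m >= 0; --m)   // back substitution
        line[m] -= c[m] * line[m + 1];
}

// Smooth along one direction (0 = i, 1 = j, 2 = k). Each (p, q) pair selects
// one grid line; the lines are independent, so this double loop is the part
// that could run in parallel.
void smooth_direction(Field3D& r, int dir, double eps) {
    assert(dir >= 0 && dir < 3);
    const int n[3] = {r.ni, r.nj, r.nk};
    const int d1 = (dir + 1) % 3, d2 = (dir + 2) % 3;
    std::vector<double> line(n[dir]);
    int idx[3];
    for (int p = 0; p < n[d1]; ++p) {
        for (int q = 0; q < n[d2]; ++q) {
            idx[d1] = p;
            idx[d2] = q;
            for (int m = 0; m < n[dir]; ++m) {  // gather the line
                idx[dir] = m;
                line[m] = r.at(idx[0], idx[1], idx[2]);
            }
            smooth_line(line, eps);
            for (int m = 0; m < n[dir]; ++m) {  // scatter the result
                idx[dir] = m;
                r.at(idx[0], idx[1], idx[2]) = line[m];
            }
        }
    }
}

// Apply the three directional sweeps one after another.
void smooth_residual(Field3D& r, double eps) {
    for (int dir = 0; dir < 3; ++dir) smooth_direction(r, dir, eps);
}
```

Calling smooth_residual(field, eps) with eps = 0 leaves the field unchanged, while a positive eps spreads each residual value to its neighbors along all three directions. In a GPU port, one natural mapping would be one kernel launch per direction, which also provides the global synchronization between sweeps mentioned in approach 2.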

The optimized CUDA version has achieved a 20-fold speedup for the ONERA M6 wing model compared to a single thread on an Intel i7 920 @ 2.67GHz, without sacrificing any accuracy, and a 37-fold speedup for a real wing model candidate from COMAC on a single Fermi C2050. After testing with other real cases, the CFD team at COMAC has adopted the code in its daily work. Future work will 1) adopt higher-accuracy schemes for the convective terms besides the Jameson central difference scheme; 2) expand to multiple Fermi cards to handle larger grid sizes.