Question about using GPU acceleration in Fluent (insufficient GPU memory?)



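  • I recently tried using GPU acceleration for FLUENT calculations. The workstation has 2x Xeon 6138, 256 GB of RAM, and a Quadro GP100 graphics card with 16 GB of VRAM; Task Manager shows 128 GB of shared GPU memory. After loading a model with roughly 27 million cells, I got the following errors: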

    Error 1 (driver version 376.84):
    AMGX ERROR: file amg\amg_gpu.c line 1041
    AMGX ERROR: Thrust failure.
    Thrust failure: bad allocation: out of memory
    File and line number are not available for this exception.
    Error 2 (driver version 391.74):
    AMGX ERROR: file amg\amg_gpu.c line 1050
    AMGX ERROR: CUDA kernel launch error.
    Caught amgx exception: Cuda failure: 'out of memory'
    at: D:/users/ajarolia/AMGX/amg/base/src/blas.cu:665
    Stack trace:
    

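    This looks like it is caused by insufficient GPU memory (does the shared memory not come into play?). With smaller models, the computation is actually slower than on the CPU alone. According to NVIDIA's website, GPU acceleration brings little benefit for problems below a few million cells (https://www.nvidia.com/zh-cn/data-center/gpu-accelerated-applications/ansys-fluent/). But given the GPU memory limit, even four compute cards with 16 GB of VRAM each can only handle problems of fewer than 30 million cells. A Tesla P100 with 16 GB of VRAM is quoted at over 50,000 yuan, which puts a lot of pressure on the hardware budget, and the workstation cannot hold that many cards anyway.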

    So I am here to ask whether anyone knows a way to accelerate the computation with a single GPU.
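
The arithmetic behind the poster's estimate (four 16 GB cards for fewer than 30 million cells) can be sketched as follows. This is only a back-of-the-envelope illustration: it assumes memory use scales linearly with cell count, and `cards_needed` is a hypothetical helper, not anything provided by Fluent.

```python
import math

# Figures quoted in the post above.
CARD_MEMORY_GB = 16.0
CARDS = 4
CELLS_MILLION_LIMIT = 30.0  # "fewer than 30 million cells" on four cards

# Implied lower bound on GPU memory per million cells
# (assumption: memory grows linearly with the number of cells).
gb_per_million_cells = CARDS * CARD_MEMORY_GB / CELLS_MILLION_LIMIT


def cards_needed(cells_million, card_memory_gb=CARD_MEMORY_GB):
    """Cards required so that their combined memory holds the model."""
    required_gb = cells_million * gb_per_million_cells
    return math.ceil(required_gb / card_memory_gb)


print(f"{gb_per_million_cells:.2f} GB per million cells")
print(f"27 million cells: {cards_needed(27.0)} cards of 16 GB")
```

Under these assumptions the 27-million-cell model is far beyond what a single 16 GB card can hold, which is consistent with the out-of-memory errors above.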


  • CORE Fluent instructor

    I have also been exploring how to use GPU acceleration.
    You may want to look at these two threads first:

    https://www.cfd-online.com/Forums/fluent/148654-resolved-gpu-fluent.html
    https://www.cfd-online.com/Forums/hardware/187098-gpu-acceleration-ansys-fluent.html

    I did not paste the images here.

    The topic of GPU acceleration for Ansys Fluent sometimes seems to be shrouded in mystery. So I ran a few benchmarks to answer some frequently asked questions and get a snapshot of the capability of this feature in 2017.

    Flow Setup:
    Benchmark case: 3D lid driven cavity in a cubical domain
    Grid resolution: 64x64x64 -> 262144 cells
    Reynolds number: 10000
    solver type: pressure-based, steady
    Turbulence model: standard k-epsilon
    Number of iterations: 100, reporting interval 10
    default settings whenever possible

    Software/Hardware:
    Operating system: Opensuse Leap 42.1
    Fluent version: Ansys Fluent 18.0
    CPU: Intel Xeon W3670, 6 cores, 3.2GHz, HT disabled
    Memory: 24 GB DDR3-1333 ECC triple-channel
    GPU: Quadro 5000 (theoretical compute performance: 722 GFLOPS single, 361 GFLOPS double, memory bandwidth: 126 GB/s, memory size: 2.5 GB GDDR5)

    1. Coupled algorithm

    As stated in this guide, GPU acceleration works best if the linear solver fraction is high, which is usually the case when using the coupled solver. Fluent reported it to be around 60% or higher in all cases shown here. Without further ado:

    So obviously GPU acceleration works under the right circumstances.
    Using only one CPU core, adding the GPU results in a speed-up of roughly 50% in single precision (SP) and 60% in double precision (DP). But you can already see the diminishing returns with higher CPU core counts.

    2. SIMPLE algorithm

    Using the SIMPLE algorithm, the picture is completely different. The linear solver fraction without a GPU is just below 30% for all cases, so GPU acceleration as it is currently implemented in Ansys Fluent cannot be as effective. This is a caveat that Ansys is aware of and that is clearly stated in the more in-depth reviews of this feature.

    As expected, solution times are much higher with GPU "acceleration".
    To be clear: this is not new information, Ansys never claimed that GPU acceleration was worth it with the SIMPLE algorithm.
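
One way to see why the linear solver fraction matters so much is Amdahl's law. This is a framing added here, not something the benchmark author states: if only the linear solver is accelerated, the rest of the iteration caps the overall gain. The sketch below compares the two fractions reported above (about 60% for coupled, just below 30% for SIMPLE); the 5x solver speed-up is a purely hypothetical number.

```python
def amdahl_speedup(solver_fraction, solver_speedup):
    """Overall speed-up when only the linear solver runs solver_speedup times faster."""
    if not 0.0 <= solver_fraction <= 1.0:
        raise ValueError("solver_fraction must be between 0 and 1")
    if solver_speedup <= 0.0:
        raise ValueError("solver_speedup must be positive")
    return 1.0 / ((1.0 - solver_fraction) + solver_fraction / solver_speedup)


# Linear solver fractions taken from the two benchmark sections.
for label, fraction in (("coupled", 0.60), ("SIMPLE", 0.30)):
    ceiling = 1.0 / (1.0 - fraction)  # limit for an infinitely fast solver
    overall = amdahl_speedup(fraction, 5.0)  # 5x is a hypothetical value
    print(f"{label}: {overall:.2f}x overall, ceiling {ceiling:.2f}x")
```

This idealized model ignores any overhead from moving data to and from the GPU, so it cannot by itself produce the slowdown measured with SIMPLE; it only shows that the achievable gain there is small to begin with.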

    3. Pairing "high-end" CPUs with slow GPUs

    You might expect to be on the safe side as long as you are using the coupled solver. But we could already see the diminishing returns in case 1 with higher CPU core counts. We increase the discrepancy with different hardware: 2x Xeon E5-2687W, 128GB (16x8GB) DDR3-1600 reg ECC, Quadro 4000 (theoretical compute performance: 486 GFLOPS SP, 243 GFLOPS DP, memory bandwidth: 89.9 GB/s, memory size: 2 GB GDDR5)

    While solution times with a GPU and one CPU core are slightly lower than without a GPU, there is a huge performance penalty when using the GPU along with 14 CPU cores. This is despite the fact that the linear solver fraction is 60% without a GPU. So clearly, a low-end GPU will slow down fast CPUs even if the other criteria for using GPU acceleration are met.

    4. Consumer-grade graphics cards

    Let's see what a cheap consumer-grade graphics card can do for GPU acceleration. The hardware in this test: 2x Xeon E5-2650v4, 128GB (8x16GB) DDR4-2400 reg ECC, Geforce GTX 1060 6GB (theoretical compute performance: 4372 GFLOPS SP, 137 GFLOPS DP, memory bandwidth: 192 GB/s, memory size: 6 GB GDDR5). Note that there was a suspended computation residing in memory, so the numbers might not be representative of the absolute performance of this processor type.

    The conclusion: GPU acceleration in Ansys Fluent definitely works with cheap gaming graphics cards. Even in DP the performance gains from the GPU are quite remarkable given its low DP performance. This might indicate that the workload in this benchmark is not entirely compute bound. Memory- and PCIe-transfers might also be important. However, the GPU is still a huge bottleneck as soon as we are using more CPU cores.

    5. Q&A

    Question
    When can I use GPU acceleration?
    Answer

    1. You need to use the right solver in the first place. For example the coupled flow solver or the DO radiation model. Switching from SIMPLE or its variants to coupled just to use GPU acceleration is probably not the best idea.
    2. Your model must fit into the GPU memory. You can estimate the amount of memory needed with the formulas in section 4 of the guide mentioned earlier. The benchmark I ran used ~0.5 GB of VRAM in single precision and ~1 GB in double precision. Again: if your model does not fit in the GPU memory, you currently cannot use GPU acceleration. GPU memory from dual-cards or more than one card does stack, so you can use this to simulate larger models.
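
As a rough illustration of the memory criterion, the VRAM figures measured in this benchmark (~0.5 GB SP and ~1 GB DP for 64x64x64 cells) can be extrapolated to other mesh sizes. This is a crude sketch that assumes memory scales linearly with cell count and that other cases behave like this one; it is not the formula from the guide, which should be preferred. The function names are hypothetical.

```python
BENCHMARK_CELLS = 64 ** 3  # 262144 cells
BENCHMARK_VRAM_GB = {"SP": 0.5, "DP": 1.0}  # approximate values from the benchmark


def estimate_vram_gb(cells, precision):
    """Linear extrapolation from the benchmark (assumption, not a Fluent formula)."""
    return BENCHMARK_VRAM_GB[precision] * cells / BENCHMARK_CELLS


def fits(cells, precision, total_gpu_memory_gb):
    """True if the estimated footprint fits into the combined GPU memory."""
    return estimate_vram_gb(cells, precision) <= total_gpu_memory_gb


cells = 27_000_000  # mesh size from the original question
for precision in ("SP", "DP"):
    need = estimate_vram_gb(cells, precision)
    verdict = "fits" if fits(cells, precision, 16.0) else "does not fit"
    print(f"{precision}: about {need:.0f} GB, {verdict} in 16 GB")
```

By this estimate a 27-million-cell case with comparable settings needs several times more memory than a single 16 GB card offers, in either precision.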

    Question
    Which GPUs can I use for GPU acceleration in Ansys Fluent?
    Answer
    Ansys only recommends Tesla compute cards for this purpose. However, you can use virtually any recent Nvidia GPU. Yes, even Geforce cards; I verified this with a GTX 1060.
    That being said, not all GPUs are created equal. The main differentiation lies in the DP compute performance. Nearly all modern Geforce and Quadro GPUs have a DP/SP performance ratio of 1/32. A Quadro P6000, one of the most expensive GPUs you can buy right now, has a theoretical peak performance of 11758 GFLOPS SP but only 367 GFLOPS DP, just about the same as the seriously outdated Quadro 5000 I used in this test. This is not an issue if you want to compute in SP, but a colossal waste of money if you want to perform simulations in DP. In this case you will have to buy a Tesla card. Be careful though: even some of the Tesla cards now have reduced DP capabilities because their target application is deep learning.
    One of the last exceptions from this rule that is still somewhat relevant today is the first generation of Titan GPUs "Kepler" released in 2013 and 2014 (Titan, Titan Black, Titan Z). They have a DP/SP ratio of 1/3 and can be bought used for a reasonable price.
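
The DP/SP ratios discussed above can be checked directly from the theoretical GFLOPS figures quoted in this post. The short sketch below uses only those numbers; the helper name `sp_per_dp` is hypothetical.

```python
# (SP GFLOPS, DP GFLOPS) as quoted in this post.
GPUS = {
    "Quadro 5000": (722, 361),
    "Quadro 4000": (486, 243),
    "Geforce GTX 1060 6GB": (4372, 137),
    "Quadro P6000": (11758, 367),
}


def sp_per_dp(sp_gflops, dp_gflops):
    """How many times faster the card is in SP than in DP (theoretical peak)."""
    return sp_gflops / dp_gflops


for name, (sp, dp) in GPUS.items():
    print(f"{name}: DP/SP is about 1/{round(sp_per_dp(sp, dp))}, DP peak {dp} GFLOPS")
```

The two older Quadro cards come out at a ratio of 1/2, while the GTX 1060 and the Quadro P6000 land at roughly 1/32, which is why the P6000's DP peak is barely above that of the Quadro 5000.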

    Question
    Should I spend extra money on a compute GPU when buying a new Fluent workstation?
    Answer
    For a "general purpose" Workstation with a limited budget the answer is probably no. You are better off spending excess money on more CPU performance in most cases. Only when you have maxed out CPU performance or if you are sure that you mostly use the solvers that benefit from GPU acceleration and your models are small enough you might consider it.

    Edit: here is a nearly exhaustive list of Nvidia GPUs with high DP capabilities:



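  • Thanks for the reply! The post is long and I still need to digest it. Overall, though, it seems the budget I spent on the GPU would probably have been better spent on the CPU.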

