并行效率疑问


  • 版主

    简单说一下case的情况:

    三维圆柱绕流,pisoFoam, incompressible solver, LES dynamic k equation, Re 10^5, 3千万结构网格。

    为了加速运算,又使用了576核试算,infinband, 24cores/node, 32 Gb mem /node。预计比288核那个case要快,结果挺意外的:

    288 core 运行到0.014 second 的时候 clockTime 631 s

    Time = 0.014 
    
    Courant Number mean: 0.0209991 max: 0.860686
    DILUPBiCG:  Solving for Ux, Initial residual = 0.000686338, Final residual = 5.79235e-09, No Iterations 3
    DILUPBiCG:  Solving for Uy, Initial residual = 0.000616596, Final residual = 2.39462e-07, No Iterations 2
    DILUPBiCG:  Solving for Uz, Initial residual = 0.25073, Final residual = 2.22608e-08, No Iterations 3
    GAMG:  Solving for p, Initial residual = 0.0951268, Final residual = 0.00450444, No Iterations 2
    time step continuity errors : sum local = 3.43222e-08, global = 3.79731e-10, cumulative = -1.1601e-09
    GAMG:  Solving for p, Initial residual = 0.00459438, Final residual = 0.000221683, No Iterations 11
    time step continuity errors : sum local = 1.67411e-09, global = -1.41906e-10, cumulative = -1.302e-09
    GAMG:  Solving for p, Initial residual = 0.000220326, Final residual = 9.99438e-08, No Iterations 59
    time step continuity errors : sum local = 7.78239e-13, global = -1.40881e-16, cumulative = -1.302e-09
    DILUPBiCG:  Solving for k, Initial residual = 0.0347117, Final residual = 7.93887e-08, No Iterations 2
    bounding k, min: -1.42701e-10 max: 6.66159e-05 average: 7.69531e-09
    ExecutionTime = 627.61 s  ClockTime = 631 s
    

    576 核 ClockTime = 2914 s

    Time = 0.014
    
    Courant Number mean: 0.0209908 max: 0.853335
    DILUPBiCG:  Solving for Ux, Initial residual = 0.000742593, Final residual = 7.78633e-07, No Iterations 2
    DILUPBiCG:  Solving for Uy, Initial residual = 0.000648146, Final residual = 2.82185e-07, No Iterations 2
    DILUPBiCG:  Solving for Uz, Initial residual = 0.244847, Final residual = 6.20964e-07, No Iterations 2
    GAMG:  Solving for p, Initial residual = 0.0907169, Final residual = 0.00453577, No Iterations 31
    time step continuity errors : sum local = 3.71049e-08, global = 5.32494e-10, cumulative = -4.51107e-09
    GAMG:  Solving for p, Initial residual = 0.00511943, Final residual = 0.000253917, No Iterations 334
    time step continuity errors : sum local = 1.90337e-09, global = 1.51918e-11, cumulative = -4.49588e-09
    GAMG:  Solving for p, Initial residual = 0.000284638, Final residual = 9.81376e-08, No Iterations 385
    time step continuity errors : sum local = 6.5267e-13, global = 7.25236e-15, cumulative = -4.49587e-09
    DILUPBiCG:  Solving for k, Initial residual = 0.0346825, Final residual = 2.17503e-08, No Iterations 2
    bounding k, min: -9.64626e-16 max: 6.49695e-05 average: 7.68376e-09
    ExecutionTime = 2903.75 s  ClockTime = 2914 s
    

    除了decompose的方式不一样之外:

    hierarchicalCoeffs // 576 core
    {
        n               (4 4 36);
        delta           0.00001;
        order           xyz;
    }
    
     hierarchicalCoeffs		 // 288 core 
      {		  
          n               (16 9 2);		
          delta           0.001;		
          order           xyz;		 
      }
    

    其他的设置完全一样。

    想问问论坛里有这方面经验的人朋友,愿意分享一下你的经验吗?是从求解器的选择入手?从数值scheme方面,还是decompose的方式? 虽然烧的是国家的机器,但是还是能节省点。LOL。


  • 版主

    @random_ran
    是 GAMG 在拖你的后腿,这么多核的情况下,你可以试试用 PCG + DIC 来求解 p 方程,这个方法并行效率会高一些。


  • 版主

    @xpqiu

    谢谢你的建议,确实PCG+DIC 解p方程的效率有所提高,目前新的case还在run。

    Time = 0.048 //576 cores
    
    Courant Number mean: 0.0211288 max: 0.859123
    DILUPBiCG:  Solving for Ux, Initial residual = 0.000456629, Final residual = 3.65897e-07, No Iterations 2
    DILUPBiCG:  Solving for Uy, Initial residual = 0.000477652, Final residual = 4.92182e-07, No Iterations 2
    DILUPBiCG:  Solving for Uz, Initial residual = 0.235746, Final residual = 6.0203e-08, No Iterations 3
    DICPCG:  Solving for p, Initial residual = 0.21036, Final residual = 0.0103945, No Iterations 676
    time step continuity errors : sum local = 3.27749e-08, global = 2.74055e-11, cumulative = 2.44545e-08
    DICPCG:  Solving for p, Initial residual = 0.0069283, Final residual = 0.000342288, No Iterations 700
    time step continuity errors : sum local = 1.71314e-09, global = -5.01474e-11, cumulative = 2.44043e-08
    DICPCG:  Solving for p, Initial residual = 0.000423809, Final residual = 2.08942e-05, No Iterations 723
    time step continuity errors : sum local = 1.06217e-10, global = 5.3039e-13, cumulative = 2.44049e-08
    DILUPBiCG:  Solving for k, Initial residual = 0.0159443, Final residual = 3.974e-07, No Iterations 3
    bounding k, min: -6.91417e-13 max: 0.0010178 average: 3.66373e-08
    ExecutionTime = 1327.45 s  ClockTime = 1333 s
    
    Time = 0.048 //288 core
    
    Courant Number mean: 0.0211368 max: 0.860415
    DILUPBiCG:  Solving for Ux, Initial residual = 0.000472671, Final residual = 1.86206e-07, No Iterations 2
    DILUPBiCG:  Solving for Uy, Initial residual = 0.000470667, Final residual = 1.50369e-07, No Iterations 2
    DILUPBiCG:  Solving for Uz, Initial residual = 0.227442, Final residual = 1.43158e-08, No Iterations 3
    GAMG:  Solving for p, Initial residual = 0.14952, Final residual = 0.00727048, No Iterations 3
    time step continuity errors : sum local = 3.21523e-08, global = 1.32072e-09, cumulative = -2.68714e-11
    GAMG:  Solving for p, Initial residual = 0.0105374, Final residual = 0.000436311, No Iterations 17
    time step continuity errors : sum local = 1.38004e-09, global = 2.02486e-11, cumulative = -6.62288e-12
    GAMG:  Solving for p, Initial residual = 0.000588406, Final residual = 9.97916e-08, No Iterations 102
    time step continuity errors : sum local = 2.47533e-13, global = 3.66548e-17, cumulative = -6.62284e-12
    DILUPBiCG:  Solving for k, Initial residual = 0.0159342, Final residual = 3.67721e-08, No Iterations 3
    bounding k, min: -1.80015e-12 max: 0.00101339 average: 3.66396e-08
    ExecutionTime = 1824.48 s  ClockTime = 1831 s
    
    scalarTransport write:
    DILUPBiCG:  Solving for s, Initial residual = 0.00981188, Final residual = 7.3898e-09, No Iterations 2
    

    我把decompose的方式用得和288 cores 那个case 类似

    hierarchicalCoeffs //576 case
    {
        n               (32 9 2);
        delta           0.001;
        order           xyz;
    }
    
     hierarchicalCoeffs	 //288 case
      {		  
    	       n               (16 9 2);
    	      delta           0.001;
    	      order           xyz;
      }
    


  • 这么多核心为什么采用hierarchical而不用scotch?你这个核数很多,处理器边界很重要。对了n (16 9 2);这个数字的分布也会有影响,CPU太多了毕竟。如果又有兴趣可以随便调节试试,不过看样子很费时间。3千万结构网格。


  • 版主

    @cfd-china

    你说得很对。从hierarchical到scotch是我要考虑的一个方向。不过由于decompose的时间花费,现在我在处理一些solver上的选择,这样可以不decompse。这个表格是我现在做过的一些case。

    我的目标计算时间是1个小时算出0.5second。

    #case p-solver #核心数 decompose 方法 数值精度 求解器 1小时后,模拟时间 湍流模型 #PISO 矫正次数
    1 GAMG+symGaussSeidel 576 hiera (32 9 2) -6 pisoFoam - LES-kqE 3
    2 - - - - - - - -
    3 PCG+DIC 576 hiera (32 9 2) -6 pisoFoam 0.133 LES-WALE 3
    4 PCG+DIC 576 hiera (32 9 2) -6 pisoFoam 0.129 LES-kqE 3
    5 PCG+DIC 576 hiera (32 9 2) -6 pisoFoam 0.125 LES-TKE 3
    6 PCG+DIC 576 hiera (32 9 2) -7 pisoFoam 0.125 LES-TKE 3
    7
    8
    9 GAMG+smootherGaussSeidel 576 hiera (4 4 36) -7 pisoFoam 0.015(2914s) LES-TKE 3
    10 GAMG+smootherGaussSeidel 288 hiera (16 9 2) -6 pisoFoam 0.171 LES-TKE 2
    11 GAMG+smootherGaussSeidel 288 hiera (16 9 2) -6 pisoFoam 0.132 LES-TKE 3


  • 跪了。。。。你这可以发文章了,看起来算例10,288核GAMG最快?


  • 版主


  • 版主

    我只是在正式运算前进行一下,并行效率的探究。实际上,目前的计算速度,还是太慢。能接受的计算速度是初始的一个小时里能计算到0.5秒。 我的时间步长是 0.001s 保证Max.Co < 1。

    我发现DIC+PCG对压力方程的迭代次数特别多,计算两个小时之后,单个时间步长的迭代次数就已经快到1000次了。这只是计算的开始阶段,往后算的话,我估计发散的可能性非常大。

    除了现在对solver的选择之外,我打算再试试可压求解器,用pimple的可控时间步长来进行初始化计算。


  • 版主

    $ checkMesh

    Create time
    
    Create polyMesh for time = 0
    
    Time = 0
    
    Mesh stats
        points:           34617600
        faces:            102296400
        internal faces:   100743600
        cells:            33840000
        faces per cell:   6
        boundary patches: 5
        point zones:      0
        face zones:       0
        cell zones:       0
    
    Overall number of cells of each type:
        hexahedra:     33840000
        prisms:        0
        wedges:        0
        pyramids:      0
        tet wedges:    0
        tetrahedra:    0
        polyhedra:     0
    
    Checking topology...
        Boundary definition OK.
        Cell to face addressing OK.
        Point usage OK.
        Upper triangular ordering OK.
        Face vertices OK.
        Number of regions: 1 (OK).
    
    Checking patch topology for multiply connected surfaces...
        Patch               Faces    Points   Surface topology                  
        INLET               28200    28848    ok (non-closed singly connected)  
        OUTLET              28200    28848    ok (non-closed singly connected)  
        CYLINDER            56400    57600    ok (non-closed singly connected)  
        FRONT_CYC           720000   721200   ok (non-closed singly connected)  
        BACK_CYC            720000   721200   ok (non-closed singly connected)  
    
    Checking geometry...
        Overall domain bounding box (-20 -20 0) (20 20 3.07458)
        Mesh has 3 geometric (non-empty/wedge) directions (1 1 1)
        Mesh has 3 solution (non-empty) directions (1 1 1)
        Boundary openness (9.81794e-16 9.03981e-18 9.17862e-15) OK.
        Max cell openness = 3.5214e-16 OK.
        Max aspect ratio = 229.976 OK.
        Minimum face area = 7.49334e-07. Maximum face area = 0.022353.  Face area magnitudes OK.
        Min volume = 4.90189e-08. Max volume = 0.00146226.  Total volume = 3861.2.  Cell volumes OK.
        Mesh non-orthogonality Max: 4.97612e-05 average: 0
        Non-orthogonality check OK.
        Face pyramids OK.
        Max skewness = 0.006519 OK.
        Coupled point location match (average 4.44139e-16) OK.
    
    Mesh OK.
    
    End
    

  • 版主

    @random_ran 不知道你运行之前是否 renumberMesh?用了这个之后能减少求解方程时的迭代次数。

    另外,还可以尝试 PCG + GAMG,用 GAMG当 PCG 的 smoother。


  • 版主

    @xpqiu

    非常感谢你的建议, renumberMesh 在一个800万网格 96核的模拟中,时间缩短了40%。

    以后多核计算前,我都会renumberMesh 的,也建议大家用上这个好用的功能,加速你的算例。:D



  • @random_ran 你好,我想问一下,我运用scotch方法剖分了网格,并行运算时提示网格出现下述问题是咋回事:
    0_1492227908686_QQ图片20170415114415.jpg

    大概提示是找不到法向量,我对应到程序中是这个地方:

       103     if (mag(pn) < VSMALL)
       104     {
       105         FatalErrorInFunction
       106             << "Cannot determine normal vector from patches."
       107             << abort(FatalError);
       108     }
       109     else
       110     {
       111         pn /= mag(pn);
       112     }
       113
    

    大概意思就是分块的面积太小了,找不到法向量?

    用simple方法提示的错误不一样,大意几个地方网格的交界面有问题。

    可是不管哪个方法,我都检查了网格,checkMesh没问题,单核也可以计算。

    我怀疑是不是加密的关系,一般感觉加密的level到3就容易出问题,我这个就到了3,请问你遇到过类似的情况吗?


  • 版主

    @CFDngu

    你好,CFDngu:

    你这个问题我没有遇到过,decompose的时候没有遇到过报错的问题。可能是我的几何体比较简单的原因?

    建议你把:

    decomposeParDict 内容
    $ checkMesh 结果
    O.F. 版本

    贴出来,或许有碰到过类似问题的朋友能帮助到你。



  • @random_ran 多谢建议,不过确实都没有任何问题。。。

    等过一段我贴上来:)



  • @xpqiu renumber的功能是什么呢?我试着用了一下,发现结果完全不对啊。。。
    不renumber的结果就对了



  • 好贴子,马住。另外推荐一个连接:
    http://nscc-gz.cn/newsdetail.html?6151


  • 版主

    @CFDngu renumberMesh 的作用是减小矩阵的 bandwidth,让非零元素更集中到主对角附近。
    https://en.wikipedia.org/wiki/Cuthill–McKee_algorithm



  • @xpqiu 可是我的算例为什么renumber之后就结果就不对了呢?只有初始场是正确的,第二步开始就完全错了。和算例是2D的有关系吗?


  • 版主

    @CFDngu 2D 也可以renumberMesh,具体是怎么不对,log贴上来看看。不太可能是renumberMesh 导致的,可能是其他原因。



  • 0_1492507825969_85246950.jpg

    好像真是有问题,最下面的。。。。

    facesets出问题了吗?


登录后回复
 

与 CFD中文网 的连接断开,我们正在尝试重连,请耐心等待