并行效率疑问

random_ran

简单说一下case的情况：

三维圆柱绕流，pisoFoam, incompressible solver, LES dynamic k equation, Re 10^5， 3千万结构网格。

为了加速运算，又使用了576核试算，infinband, 24cores/node, 32 Gb mem /node。预计比288核那个case要快，结果挺意外的：

288 core 运行到0.014 second 的时候 clockTime 631 s

Time = 0.014 

Courant Number mean: 0.0209991 max: 0.860686
DILUPBiCG:  Solving for Ux, Initial residual = 0.000686338, Final residual = 5.79235e-09, No Iterations 3
DILUPBiCG:  Solving for Uy, Initial residual = 0.000616596, Final residual = 2.39462e-07, No Iterations 2
DILUPBiCG:  Solving for Uz, Initial residual = 0.25073, Final residual = 2.22608e-08, No Iterations 3
GAMG:  Solving for p, Initial residual = 0.0951268, Final residual = 0.00450444, No Iterations 2
time step continuity errors : sum local = 3.43222e-08, global = 3.79731e-10, cumulative = -1.1601e-09
GAMG:  Solving for p, Initial residual = 0.00459438, Final residual = 0.000221683, No Iterations 11
time step continuity errors : sum local = 1.67411e-09, global = -1.41906e-10, cumulative = -1.302e-09
GAMG:  Solving for p, Initial residual = 0.000220326, Final residual = 9.99438e-08, No Iterations 59
time step continuity errors : sum local = 7.78239e-13, global = -1.40881e-16, cumulative = -1.302e-09
DILUPBiCG:  Solving for k, Initial residual = 0.0347117, Final residual = 7.93887e-08, No Iterations 2
bounding k, min: -1.42701e-10 max: 6.66159e-05 average: 7.69531e-09
ExecutionTime = 627.61 s  ClockTime = 631 s

576 核 ClockTime = 2914 s

Time = 0.014

Courant Number mean: 0.0209908 max: 0.853335
DILUPBiCG:  Solving for Ux, Initial residual = 0.000742593, Final residual = 7.78633e-07, No Iterations 2
DILUPBiCG:  Solving for Uy, Initial residual = 0.000648146, Final residual = 2.82185e-07, No Iterations 2
DILUPBiCG:  Solving for Uz, Initial residual = 0.244847, Final residual = 6.20964e-07, No Iterations 2
GAMG:  Solving for p, Initial residual = 0.0907169, Final residual = 0.00453577, No Iterations 31
time step continuity errors : sum local = 3.71049e-08, global = 5.32494e-10, cumulative = -4.51107e-09
GAMG:  Solving for p, Initial residual = 0.00511943, Final residual = 0.000253917, No Iterations 334
time step continuity errors : sum local = 1.90337e-09, global = 1.51918e-11, cumulative = -4.49588e-09
GAMG:  Solving for p, Initial residual = 0.000284638, Final residual = 9.81376e-08, No Iterations 385
time step continuity errors : sum local = 6.5267e-13, global = 7.25236e-15, cumulative = -4.49587e-09
DILUPBiCG:  Solving for k, Initial residual = 0.0346825, Final residual = 2.17503e-08, No Iterations 2
bounding k, min: -9.64626e-16 max: 6.49695e-05 average: 7.68376e-09
ExecutionTime = 2903.75 s  ClockTime = 2914 s

除了decompose的方式不一样之外：

hierarchicalCoeffs // 576 core
{
    n               (4 4 36);
    delta           0.00001;
    order           xyz;
}

 hierarchicalCoeffs		 // 288 core 
  {		  
      n               (16 9 2);		
      delta           0.001;		
      order           xyz;		 
  }

其他的设置完全一样。

想问问论坛里有这方面经验的人朋友，愿意分享一下你的经验吗？是从求解器的选择入手？从数值scheme方面，还是decompose的方式？虽然烧的是国家的机器，但是还是能节省点。LOL。

xpqiu

@random_ran
是 GAMG 在拖你的后腿，这么多核的情况下，你可以试试用 PCG + DIC 来求解 p 方程，这个方法并行效率会高一些。

random_ran

@xpqiu

谢谢你的建议，确实PCG+DIC 解p方程的效率有所提高，目前新的case还在run。

Time = 0.048 //576 cores

Courant Number mean: 0.0211288 max: 0.859123
DILUPBiCG:  Solving for Ux, Initial residual = 0.000456629, Final residual = 3.65897e-07, No Iterations 2
DILUPBiCG:  Solving for Uy, Initial residual = 0.000477652, Final residual = 4.92182e-07, No Iterations 2
DILUPBiCG:  Solving for Uz, Initial residual = 0.235746, Final residual = 6.0203e-08, No Iterations 3
DICPCG:  Solving for p, Initial residual = 0.21036, Final residual = 0.0103945, No Iterations 676
time step continuity errors : sum local = 3.27749e-08, global = 2.74055e-11, cumulative = 2.44545e-08
DICPCG:  Solving for p, Initial residual = 0.0069283, Final residual = 0.000342288, No Iterations 700
time step continuity errors : sum local = 1.71314e-09, global = -5.01474e-11, cumulative = 2.44043e-08
DICPCG:  Solving for p, Initial residual = 0.000423809, Final residual = 2.08942e-05, No Iterations 723
time step continuity errors : sum local = 1.06217e-10, global = 5.3039e-13, cumulative = 2.44049e-08
DILUPBiCG:  Solving for k, Initial residual = 0.0159443, Final residual = 3.974e-07, No Iterations 3
bounding k, min: -6.91417e-13 max: 0.0010178 average: 3.66373e-08
ExecutionTime = 1327.45 s  ClockTime = 1333 s

Time = 0.048 //288 core

Courant Number mean: 0.0211368 max: 0.860415
DILUPBiCG:  Solving for Ux, Initial residual = 0.000472671, Final residual = 1.86206e-07, No Iterations 2
DILUPBiCG:  Solving for Uy, Initial residual = 0.000470667, Final residual = 1.50369e-07, No Iterations 2
DILUPBiCG:  Solving for Uz, Initial residual = 0.227442, Final residual = 1.43158e-08, No Iterations 3
GAMG:  Solving for p, Initial residual = 0.14952, Final residual = 0.00727048, No Iterations 3
time step continuity errors : sum local = 3.21523e-08, global = 1.32072e-09, cumulative = -2.68714e-11
GAMG:  Solving for p, Initial residual = 0.0105374, Final residual = 0.000436311, No Iterations 17
time step continuity errors : sum local = 1.38004e-09, global = 2.02486e-11, cumulative = -6.62288e-12
GAMG:  Solving for p, Initial residual = 0.000588406, Final residual = 9.97916e-08, No Iterations 102
time step continuity errors : sum local = 2.47533e-13, global = 3.66548e-17, cumulative = -6.62284e-12
DILUPBiCG:  Solving for k, Initial residual = 0.0159342, Final residual = 3.67721e-08, No Iterations 3
bounding k, min: -1.80015e-12 max: 0.00101339 average: 3.66396e-08
ExecutionTime = 1824.48 s  ClockTime = 1831 s

scalarTransport write:
DILUPBiCG:  Solving for s, Initial residual = 0.00981188, Final residual = 7.3898e-09, No Iterations 2

我把decompose的方式用得和288 cores 那个case 类似

hierarchicalCoeffs //576 case
{
    n               (32 9 2);
    delta           0.001;
    order           xyz;
}

 hierarchicalCoeffs	 //288 case
  {		  
	       n               (16 9 2);
	      delta           0.001;
	      order           xyz;
  }

CFD中文网

这么多核心为什么采用hierarchical而不用scotch？你这个核数很多，处理器边界很重要。对了n (16 9 2); 这个数字的分布也会有影响，CPU太多了毕竟。如果又有兴趣可以随便调节试试，不过看样子很费时间。3千万结构网格。

random_ran

@cfd-china

你说得很对。从hierarchical到scotch是我要考虑的一个方向。不过由于decompose的时间花费，现在我在处理一些solver上的选择，这样可以不decompse。这个表格是我现在做过的一些case。

我的目标计算时间是1个小时算出0.5second。

#case	p-solver	#核心数	decompose 方法	数值精度	求解器	1小时后，模拟时间	湍流模型	#PISO 矫正次数
1	GAMG+symGaussSeidel	576	hiera (32 9 2)	-6	pisoFoam	-	LES-kqE	3
2	-	-	-	-	-	-	-	-
3	PCG+DIC	576	hiera (32 9 2)	-6	pisoFoam	0.133	LES-WALE	3
4	PCG+DIC	576	hiera (32 9 2)	-6	pisoFoam	0.129	LES-kqE	3
5	PCG+DIC	576	hiera (32 9 2)	-6	pisoFoam	0.125	LES-TKE	3
6	PCG+DIC	576	hiera (32 9 2)	-7	pisoFoam	0.125	LES-TKE	3
7
8
9	GAMG+smootherGaussSeidel	576	hiera (4 4 36)	-7	pisoFoam	0.015(2914s)	LES-TKE	3
10	GAMG+smootherGaussSeidel	288	hiera (16 9 2)	-6	pisoFoam	0.171	LES-TKE	2
11	GAMG+smootherGaussSeidel	288	hiera (16 9 2)	-6	pisoFoam	0.132	LES-TKE	3

CFD中文网

你这可以发文章了，看起来算例10，288核GAMG最快？

random_ran

我只是在正式运算前进行一下，并行效率的探究。实际上，目前的计算速度，还是太慢。能接受的计算速度是初始的一个小时里能计算到0.5秒。我的时间步长是 0.001s 保证Max.Co < 1。

我发现DIC+PCG对压力方程的迭代次数特别多，计算两个小时之后，单个时间步长的迭代次数就已经快到1000次了。这只是计算的开始阶段，往后算的话，我估计发散的可能性非常大。

除了现在对solver的选择之外，我打算再试试可压求解器，用pimple的可控时间步长来进行初始化计算。

random_ran

$ checkMesh

Create time

Create polyMesh for time = 0

Time = 0

Mesh stats
    points:           34617600
    faces:            102296400
    internal faces:   100743600
    cells:            33840000
    faces per cell:   6
    boundary patches: 5
    point zones:      0
    face zones:       0
    cell zones:       0

Overall number of cells of each type:
    hexahedra:     33840000
    prisms:        0
    wedges:        0
    pyramids:      0
    tet wedges:    0
    tetrahedra:    0
    polyhedra:     0

Checking topology...
    Boundary definition OK.
    Cell to face addressing OK.
    Point usage OK.
    Upper triangular ordering OK.
    Face vertices OK.
    Number of regions: 1 (OK).

Checking patch topology for multiply connected surfaces...
    Patch               Faces    Points   Surface topology                  
    INLET               28200    28848    ok (non-closed singly connected)  
    OUTLET              28200    28848    ok (non-closed singly connected)  
    CYLINDER            56400    57600    ok (non-closed singly connected)  
    FRONT_CYC           720000   721200   ok (non-closed singly connected)  
    BACK_CYC            720000   721200   ok (non-closed singly connected)  

Checking geometry...
    Overall domain bounding box (-20 -20 0) (20 20 3.07458)
    Mesh has 3 geometric (non-empty/wedge) directions (1 1 1)
    Mesh has 3 solution (non-empty) directions (1 1 1)
    Boundary openness (9.81794e-16 9.03981e-18 9.17862e-15) OK.
    Max cell openness = 3.5214e-16 OK.
    Max aspect ratio = 229.976 OK.
    Minimum face area = 7.49334e-07. Maximum face area = 0.022353.  Face area magnitudes OK.
    Min volume = 4.90189e-08. Max volume = 0.00146226.  Total volume = 3861.2.  Cell volumes OK.
    Mesh non-orthogonality Max: 4.97612e-05 average: 0
    Non-orthogonality check OK.
    Face pyramids OK.
    Max skewness = 0.006519 OK.
    Coupled point location match (average 4.44139e-16) OK.

Mesh OK.

End

xpqiu

@random_ran 不知道你运行之前是否 renumberMesh？用了这个之后能减少求解方程时的迭代次数。

另外，还可以尝试 PCG + GAMG，用 GAMG当 PCG 的 smoother。

random_ran

@xpqiu

非常感谢你的建议， renumberMesh 在一个800万网格 96核的模拟中，时间缩短了40%。

以后多核计算前，我都会renumberMesh 的，也建议大家用上这个好用的功能，加速你的算例。:D

CFDngu

@random_ran 你好，我想问一下，我运用scotch方法剖分了网格，并行运算时提示网格出现下述问题是咋回事：
0_1492227908686_QQ图片20170415114415.jpg

大概提示是找不到法向量，我对应到程序中是这个地方：

   103     if (mag(pn) < VSMALL)
   104     {
   105         FatalErrorInFunction
   106             << "Cannot determine normal vector from patches."
   107             << abort(FatalError);
   108     }
   109     else
   110     {
   111         pn /= mag(pn);
   112     }
   113

大概意思就是分块的面积太小了，找不到法向量？

用simple方法提示的错误不一样，大意几个地方网格的交界面有问题。

可是不管哪个方法，我都检查了网格，checkMesh没问题，单核也可以计算。

我怀疑是不是加密的关系，一般感觉加密的level到3就容易出问题，我这个就到了3，请问你遇到过类似的情况吗？

random_ran

@CFDngu

你好,CFDngu:

你这个问题我没有遇到过，decompose的时候没有遇到过报错的问题。可能是我的几何体比较简单的原因？

建议你把：

decomposeParDict 内容
$ checkMesh 结果
O.F. 版本

贴出来，或许有碰到过类似问题的朋友能帮助到你。

CFDngu

@random_ran 多谢建议，不过确实都没有任何问题。。。

等过一段我贴上来：）

CFDngu

@xpqiu renumber的功能是什么呢？我试着用了一下，发现结果完全不对啊。。。
不renumber的结果就对了

qjh888

好贴子，马住。另外推荐一个连接：
http://nscc-gz.cn/newsdetail.html?6151

xpqiu

@CFDngu renumberMesh 的作用是减小矩阵的 bandwidth，让非零元素更集中到主对角附近。
https://en.wikipedia.org/wiki/Cuthill–McKee_algorithm

CFDngu

@xpqiu 可是我的算例为什么renumber之后就结果就不对了呢？只有初始场是正确的，第二步开始就完全错了。和算例是2D的有关系吗？

xpqiu

@CFDngu 2D 也可以renumberMesh，具体是怎么不对，log贴上来看看。不太可能是renumberMesh 导致的，可能是其他原因。

CFDngu

好像真是有问题，最下面的。。。。

facesets出问题了吗？

xpqiu

@CFDngu renumberMesh运行完以后应该重新提取一下faceSet

CFD中文网

并行效率疑问