@mithraLa Right. Even a 14th-gen i9 has only 8 P-cores, and beyond that more cores really don't help. The 7302 is a gem of a CPU: 8-channel DDR4-3200 with 16 cores, i.e. 2 cores per channel. Scaling for memory frequency, DDR5-6400 works out to about 4 cores per channel, which means a 12-channel DDR5-4800 EPYC 9004 comes to 3 cores per channel, so roughly 36 cores is just right.
The problem is that these days memory channels cost more than cores: buying a few extra cores lowers efficiency, but it still raises peak performance.
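Written out, that scaling is just the following back-of-the-envelope sketch, calibrated on the 7302 numbers above:

    # cores per channel assumed to scale with per-channel bandwidth,
    # calibrated on the 7302: DDR4-3200, 8 channels, 16 cores = 2 cores/channel
    ref_freq=3200; ref_cores_per_channel=2
    freq=4800; channels=12                   # EPYC 9004 with DDR5-4800
    echo $(( channels * ref_cores_per_channel * freq / ref_freq ))   # prints 36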
-
@李东岳 Is the gap really that big? My chart-topping numbers in your 2-million-cell benchmark were run under Hyper-V, and in theory WSL 2 and Hyper-V should perform the same.
-
foamToVTK
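For anyone landing here later, a minimal sketch of how I'd use it (options are illustrative; the lagrangian cloud data ends up under the VTK output directory, which Windows ParaView can then open directly):

    # convert the latest time of the case, including lagrangian/ cloud data, to VTK
    foamToVTK -latestTime
    # then open the files written under VTK/ in ParaView on Windows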
-
Obviously, serious production runs all happen on supercomputers, which use neither WSL nor virtual machines. For development I use both: WSL on my personal office machine because it's convenient, and Hyper-V on the group's shared server because it supports multiple users. A 2x performance gap is entirely possible depending on the test case and the exact configuration, but if you don't split hairs the gap between the two isn't that large; in other words, WSL 2 and Hyper-V should perform about the same.
-
@2019201300 The i7 has only two memory channels, so just put two sticks in the correct slots. Whether the E-cores are enabled barely matters, meaning that turning them on or off in the BIOS has little effect on peak performance. In your case, peak performance is clearly reached at 8 ranks, and setting the parallel count to 16 is a misuse: enabling E-cores in the BIOS and actually using E-cores in the parallel run are two different things. I suspect you also have hyper-threading on; a parallel count of 24 would be even slower. On these hybrid CPUs, performance is best when the number of ranks equals the number of P-cores; the E-cores mainly exist to cut power draw at low load. Your results already look good, just a touch slower because of the VM, and quite close to mine.
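If you'd rather not touch the BIOS at all, here is a sketch of keeping the 8 ranks on the P-cores, assuming Open MPI and that the P-cores are enumerated before the E-cores:

    # 8 ranks, one per physical core, pinned so the scheduler cannot migrate them
    # onto E-cores mid-run (Open MPI syntax; other MPI implementations differ)
    mpirun -np 8 --map-by core --bind-to core simpleFoam -parallel > log.simpleFoam 2>&1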
-
@liuxin It's the hybrid big/little cores: if P-cores and E-cores are mixed in one run, the P-cores get dragged down to E-core speed.
-
I got a request like this five years ago already, and after solving it I've kept receiving new versions of the same plea. Today someone asks why this version won't install; then that version installs but another one won't; it installs on the laptop but not on the supercomputer. All these years later the software's usability hasn't improved, users cycle through one batch after another, it's always beginners using it, and once they've learned it and collected their degree they switch fields and leave.
-
@tidedrinker The price/performance of the 7002 series really is unbeatable.
-
@Gxfile Looks like the first Intel W-series test here? This result is worse than a 13700F; is something misconfigured? The Intel W platform supports 8 channels of 3200 MHz memory while the 13700 has 2 channels at 6400 MHz; even with its lower clocks, the W-series shouldn't be slower than a 13700.
-
@李东岳 There are plenty on the second-hand market. Hygon should work quite well; I've never played with Loongson.
-
Three frogs jump into the water: no idea, no idea, no idea (the old Chinese pun: 不懂 sounds like the splash, 扑通).
-
I don't quite follow: the maximum velocity in the literature is 0.4 m/s but you computed 0.6 m/s? That's quite a discrepancy. If the water level and the flume width are both fixed, and the flow rate is fixed, then the velocity is determined too, so what is there left to compute?
-
As long as the convergence criteria are the same, under-relaxation has no effect on the converged result; there are plenty of other reasons why lift and drag coefficients come out wrong.
-
@江山不如凉竹 I've used supercomputers too. Service quality varies a lot between centers, and some really are short on storage; they're awkward in all kinds of ways. If nothing else works, write scripts that post-process automatically and delete the raw data you no longer need.
-
@江山不如凉竹 On a supercomputer you can buy extra storage, post-process there, delete the raw data you don't need, and transfer back only the post-processed results. Post-processing tools load compressed data more slowly, and compressing takes time too, so either way you're trading CPU time for disk space.
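A minimal sketch of that workflow, assuming an OpenFOAM case and that only a field or two are needed for the report (function and field names are just placeholders):

    # post-process on the cluster, keep only what the report needs, drop the raw fields
    postProcess -func 'mag(U)' -latestTime        # run a function object after the fact
    foamToVTK -latestTime -fields '(p U)'         # export just the fields to be plotted
    rm -rf processor*                             # delete the decomposed raw data
    tar czf results.tgz VTK postProcessing        # this small archive is what gets copied home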
-
30 TB is only two hard drives' worth of capacity; a mechanical drive with a cache, or an array of mechanical drives, keeps the cost acceptable. If you're computing on a remote server, though, data transfer may be the real problem. The end products of CFD are reports, papers and videos, all small; the field data OpenFOAM writes out doesn't need to be kept long-term.
Also, the larger the data, the less I'd recommend compressing it. ParaView decompresses automatically during post-processing with no extra steps, but performance drops sharply.
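For reference, OpenFOAM's own gzip output compression is a single controlDict switch; a sketch of toggling it from the shell with foamDictionary:

    # turn gzip compression of written fields on or off for subsequent time steps
    foamDictionary -entry writeCompression -set on  system/controlDict
    foamDictionary -entry writeCompression -set off system/controlDict
-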
It looks like a simple problem, but more information is needed to judge. The error message means exactly what it says: the keyword version is undefined. A likely cause is an incomplete file, e.g. the run exited unexpectedly while the first mesh refinement was being written to disk.
-
I tried comparing the effect of disabling hyper-threading versus disabling the E-cores. Disabling hyper-threading has no effect on the 8-rank parallel speed, still around 157 s; with both hyper-threading and the E-cores disabled, the 8-rank run gets slower, over 200 s.
-
I bought a 13700F, the non-K model, because it sits in my study and I want less heat, lower fan speed and quiet operation. Memory is dual-channel DDR5-6400. I tested several configurations and recorded the clock time; in some environments ExecutionTime differs wildly from expectation, much shorter than ClockTime, so I used the more reliable ClockTime.
Processor at default settings (E-cores on, hyper-threading on), OpenFOAM v2112 installed from apt:
# cores    Wall time (s)
8          157
4          232
2          331
1          746
====================================
Processor at default settings, OpenFOAM v2112 compiled myself with Intel oneAPI:
# cores    Wall time (s)
8          278
6          187
4          210
2          345
1          798
As you can see, the officially built package reaches 157 s on the 13700F. The bottleneck is still memory bandwidth, but given that desktop CPUs are cheap and their memory runs at higher frequencies than server platforms, they remain well worth buying.
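For anyone repeating the test: both timers are printed every time step in the solver log, so the wall time used in the tables above can be pulled out like this (log file name assumed):

    # OpenFOAM prints "ExecutionTime = ... s  ClockTime = ... s" each step;
    # the last occurrence is the figure reported above
    grep ClockTime log.simpleFoam | tail -n 1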
-
@2019201300 Did you disable the E-cores? What's your memory frequency?
-
Fine. Give me more details about performance. Removing the switch is expected to reduce communication latency, which benefits MPI. To what extent will the smaller delay improve performance when running a CFD application? Is this another advantage of a ring-topology InfiniBand network?
The performance improvement due to the reduced communication delay in a ring topology compared to a star topology depends largely on the specifics of your CFD application.
With the InfiniBand protocols, the primary factor that contributes to the delay is not the time it takes for the packet to traverse the physical network (which is typically less than a few microseconds), but the time it takes for the packet to get through the software stack. The overhead of software in sending and receiving messages can be a significant component of the total latency.
MPI-based CFD applications often require large scale, synchronized computing. If your application often needs to exchange small amounts of data between all nodes (such as boundary values or synchronization barriers), then minor latency improvements might accumulate and contribute to significant overall performance improvement.
However, if your CFD application is structured so that most of the time is spent in computation, and communication is relatively infrequent (typical for methods where each node can work on a discrete part of the problem for much of the time), then the improvement might be marginal.
Furthermore, if your CFD application is using collective communication, like MPI_Bcast or MPI_Reduce, or if it's using non-blocking sends/receives, then it may be intelligent enough to take advantage of the network topology to minimise communication delays.
So, it can potentially be an advantage of ring topology, but the magnitude of this advantage is highly dependent on the specifics of your CFD code. Thorough testing will be necessary to calculate the potential benefit in your own case.
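If you want numbers rather than reasoning, a common way to measure point-to-point latency and bandwidth between two nodes is the OSU micro-benchmarks; a sketch, assuming they are built and both hosts sit on the IB fabric:

    # point-to-point latency and bandwidth between two nodes over InfiniBand
    mpirun -np 2 -host node1,node2 ./osu_latency
    mpirun -np 2 -host node1,node2 ./osu_bw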
-
Great! Here are more details. I run CFD over an InfiniBand network and wonder whether a ring topology can achieve better performance and save the cost of an InfiniBand switch when there are 4 compute nodes in the network.
CFD, or Computational Fluid Dynamics, is a complex set of computations that often require high-speed, low-latency inter-node communication. This is particularly the case when using MPI (Message Passing Interface), which is a common parallel programming model used for scientific computation.
The InfiniBand network, being a high-speed, low-latency network, is well suited to this task. However, the topology can play a significant role in the performance.
For a small size of 4 nodes, the ring topology could potentially achieve somewhat similar performance to the star topology given that the path length between most nodes would be relatively small. An additional benefit would indeed be the cost savings from not purchasing an InfiniBand switch, which can be expensive.
However, there are a few considerations to keep in mind:
Performance: While a ring topology might be able to handle the communication needs, it might not always offer the best performance or scalability. A switch in a star network provides intelligent routing, which ensures that data packets find the most efficient path.
Failure Tolerance: If a node in the ring fails, it's harder to bypass it. In contrast, in a star network, because all nodes are connected to the switch, a single node failure won't bring down the communication of all other nodes.
Ease of Troubleshooting: Troubleshooting is also easier in a star network setup as you only need to check connectivity from each node to the switch. In a ring, you would need to check the connectivity between each pair of nodes.
In summary, while you might find that a ring topology could work in a 4-node InfiniBand network, you'd certainly want to perform some tests or simulations with your specific CFD workload to make sure the performance measures up to your expectations, and that you're comfortable with the trade-offs.
-
A further question: is there any research that quantifies the performance difference, for MPI applications, between a star-topology InfiniBand network with a switch and a ring-topology InfiniBand network without one?
There has definitely been a significant amount of research aimed at comparing different network topologies, including star and ring, for MPI applications. However, finding specific studies that compare star-topology InfiniBand network with switch and the ring-topology InfiniBand network without switch may be challenging because the actual comparison can depend largely on specific use cases, type of MPI applications, and other factors such as network size, traffic volume, and fault tolerance requirements.
That said, both topologies have their unique advantages.
Star Topology: The central switch can effectively manage and control network traffic, ensuring optimal data routing and reducing chances of network congestion. It also makes the network easier to debug and manage. However, it introduces a single point of failure i.e., if the central switch fails, the entire network may collapse.
Ring Topology: It might offer better performance for smaller networks due to shorter travel paths. However, the absence of a switch could create potential data routing and bottleneck issues, especially for larger networks or heavy network traffic. On the other hand, it does not have a single point of failure and offers better fault tolerance compared to star topology.
While it’s not precisely what you asked for, a good reference to start might be “Performance Analysis of InfiniBand FDR and EDR on Scientific User Cases” by Hakkarinen and Knuutila (2015). Although it does not compare star and ring topologies, it does provide a valuable analysis of InfiniBand performance for scientific applications.
For a detailed and specific comparison, you might need to conduct your own research or benchmark testing based on your specific MPI applications and network environment.
-
What if I ran an MPI program over a ring-topology InfiniBand network?
Running an MPI (Message Passing Interface) program over a ring-topology InfiniBand network could potentially offer some benefits and challenges as well.
Reduced Latency: InfiniBand is known for its low-latency and high-bandwidth capabilities. This would result in a faster communication between the nodes compared to Ethernet based networks.
Enhanced Performance: The high speed offered by InfiniBand could potentially enhance the performance of the MPI program, especially if it involves significant inter-node communication.
Scalability: The ring topology would allow easy scalability. As the nodes increase, the topology would remain unchanged. All nodes would be equally distant from each other without the need to add more switches or links.
Network Efficiency: The ring topology would ensure efficient data transmission. The data packets could travel in either direction on the ring to reach their destination, making the network more efficient.
Fault Tolerance: The ring topology provides a certain level of fault tolerance. If one node or connection fails, the data can travel in the opposite direction to reach its destination.
However, the main challenge would be potential bottleneck situations. While InfiniBand is designed to minimize latency, the ring topology could introduce a bottleneck situation if one node on the ring becomes overwhelmed with the data. This could potentially slow down the entire network, affecting the performance of the MPI program.
Another challenge could be configuring and managing the ring topology network. It might require significant expertise and effort to set up and optimally configure the network, especially if the nodes are physically distributed.
In conclusion, running an MPI program over a ring-topology Infiniband network could provide benefits in terms of speed, efficiency, and flexibility. However, it might also bring potential challenges related to network bottleneck and complexity of network configuration and management.
-
于教授 is seriously impressive; he's basically set up the ring and is daring us to step in.
-
@李东岳 That can happen when the network bandwidth is less than a quarter of the memory bandwidth; only when the network-bandwidth constraint is that pronounced do you see compute performance scale up linearly with node count.
-
@尚善若水 I also think the second-hand market is unreliable; looking forward to your results.
-
@hurricane007 Buddy, drop me your contact info.
-
@cccrrryyy Containers are a fine thing, but from a practical standpoint a queueing model suits HPC workloads better than a spot-pricing model, so containers seem left with only the deployment-convenience advantage and can't exploit their strengths in hardware pooling and elastic computing.
-
@李东岳 I think mesh decomposition is still worth digging into, and the current optimization target of minimizing the number of interface faces doesn't necessarily suit today's multi-socket architectures. On an IB-based cluster the connections between sockets are very diverse: a direct CPU-to-CPU link (two CPUs on the same board), an IB link between the CPUs that own the NIC's PCIe lanes, an IB link routed through one extra CPU's PCIe lanes, or through two. Latency and bandwidth differ across all of these, and in some cases they even compete for bandwidth. Using a dual-socket multi-host NIC (MCX style) balances the traffic between sockets and eases this somewhat, but if mesh decomposition could be made hardware-aware, taking the HPC architecture into account at the software level, there would still be plenty of performance left on the table. CFD decomposition has a key property: the communication traffic is spatially clustered, i.e. each subdomain only exchanges data with its neighbors, so processor interfaces exist only between a handful of neighboring ranks. That traffic pattern suits a ring-topology network; in theory, with a suitable decomposition algorithm, even a very large ring should see no congestion.
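As a sketch of what hardware-aware could mean at the user level today: pick a decomposition whose subdomain numbering roughly follows the machine layout, e.g. hierarchical rather than plain scotch. The 8 × 6 × 4 split below is purely illustrative for 192 ranks:

    # hypothetical decomposeParDict for 192 ranks; hierarchical keeps neighbouring
    # subdomains numerically adjacent, which mpirun's mapping can then keep physically close
    cat > system/decomposeParDict <<'EOF'
    FoamFile { version 2.0; format ascii; class dictionary; object decomposeParDict; }

    numberOfSubdomains 192;
    method             hierarchical;
    coeffs { n (8 6 4); order xyz; }
    EOF
    decomposePar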
-
@leilei Go with the AI. Man thinks, AI laughs.
-
@leilei A job that needs parameters typed in at run time clearly has to use a Slurm interactive job.
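A minimal sketch (partition, core count and wall time are placeholders for whatever your cluster wants):

    # reserve the resources, then open an interactive shell on the allocation;
    # run the program that asks for input there, and `exit` when done
    salloc -N 1 -n 8 -t 01:00:00
    srun --pty bash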
-
To be fair, dynamic linking isn't infringement and isn't "infected" by the GPL; OpenFOAM itself links external solvers such as PETSc and Scotch.
-
There are lots of ES 9684X chips on the second-hand market for under 10,000 RMB. I've got the motherboard; anyone willing to sponsor me two?
-
@李东岳 Just refreshed my knowledge: the 128-core Zen 4 EPYC 9754 was released on June 13, and anyone who can get stock right now is a genuine big shot. The 9654 is no longer the strongest on the planet; it barely landed before falling behind. Being a hardware scavenger really is the happier way to live.
That said, the 97x4 parts released on June 13 only have 256 MB of cache, so there's hardly any point testing them; for OpenFOAM they'd land around 9554 level. The 9x84X series, on the other hand, are all high-clock, big-cache parts. The 9684X in particular, with up to 1152 MB of L3, is very much worth looking forward to; I wonder how much it would beat the 9654 ES I have on hand.
-
@李东岳 I tested the 20-million-cell case first, and I wouldn't dare sit on the results:
I ran the CFD-China 20-million-cell case, the 2000-time-step version, and did some performance tuning; the tuned results are also satisfying. Before tuning, with the Windows Hyper-V integration services disabled, the simpleFoam solver took 4119 s (Clock Time, likewise below); Windows Task Manager showed about 70% CPU utilization, meaning the CPU wasn't fully used and was limited by Hyper-V's resource allocation. After enabling the Hyper-V integration services, the host (Windows) and the guest (Ubuntu) could communicate and Hyper-V allocated resources better; under those conditions simpleFoam took 4024 s, a 2.3% improvement. Next, assigning all 384 logical cores to the VM while still running with 192 ranks cut the run time to 3358 s, a further 16.5% improvement. After the server was racked in the machine room, I tuned the CPU settings in the BIOS: with hyper-threading off and the VM's core count matching the physical machine at 192 cores, a 192-rank run dropped to 3244 s, another 3.4%. According to the AMD EPYC 9004 architecture white paper, multiple Zen 4 cores share an L3 cache and each CPU has 12 L3 caches, so the MPI bottleneck may be several ranks fighting over one L3 cache. Reducing the rank count slightly and adding the -map-by L3cache option gave 2824 s at 180 ranks and 2909 s at 168 ranks. After the MPI tuning, the best result was 2824 s, a 12.9% improvement.
The tuning above only changed how the hardware and software are used and hasn't touched any development-level optimization, yet the run time dropped from 4119 s to 2824 s, a 31.4% improvement. Further compiler optimization is in progress.
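For the record, the launch line behind that L3-cache experiment looks roughly like this with Open MPI (rank count from the 2824 s run):

    # spread ranks across the CCDs so each L3 cache serves fewer ranks, and pin them
    mpirun -np 180 --map-by l3cache --bind-to core simpleFoam -parallel > log.simpleFoam 2>&1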
-
@李东岳 @CFDngu After so many throwaway posts, here's a serious one.
Dual EPYC ES 100-0000000894-04 (commonly known as the 9654 ES, stepping B0), DDR5-4800 16 GB × 24, Samsung 980 Pro SSD.
OpenFOAM v2112 with a great deal of fine tuning, but the case files were never modified, just a mindless Allrun. Host OS is Windows Server 2022, the hypervisor is Hyper-V, guest OS is Ubuntu 20.04. Results:
cores    Wall time (s)
192      57.82
190      48.26
128      35.09
64       45.29
32       72.56
16       134.86
8        167.85
4        259.53
2        569.34
1        928.2
Peak performance 35 s; from now on please call me the leaderboard champion.
This record shouldn't be hard to break: copy this configuration, install Ubuntu 20.04 on bare metal, and you'll beat it.
-
@李东岳 Ethernet always feels like it's missing a little something.
-
As everyone knows, 100G and 10G IB cards cost about the same on the second-hand market, but a 100G IB switch is another story, and I can't afford a whole rack anyway, so a switch would never be filled. So the question: with dual-port 100G cards like the MCX516A or MCX566A, how well would three nodes wired as a ring work? In theory each node still has 200G of total bandwidth, so I'm not sure whether it becomes the bottleneck. And once you're buying IB cards it makes no sense to pair them with a 700-RMB CPU like the 7302; go for the 7642 instead, supply is plentiful and picking six of them off the second-hand market is no problem. My current workstation gets a quad-port 10GbE card and acts as the head node. That gives a 4-node cluster for at most 50,000 RMB. How would this setup compare with three nodes connected through a 200G IB switch at a supercomputing center? And with a 100G switch? Any wealthy sponsor want to fund the experiment?
-
@李东岳 Fourth-gen EPYC has no 256-core model yet; the top part is the 9654, so a dual-socket system tops out at 192 physical cores, and QS versions go for about 30% of launch price. The part that looks best suited to OpenFOAM this generation, the 9354, has few QS units on the market, and the retail version costs more than a 9654 QS. I have to say DDR5 really is great, but the motherboards are still too expensive, mostly because PCIe 5.0 costs so much, and OpenFOAM can't even use it.
-
@CFDngu How much did the new one cost?
-
Attenuation isn't a problem; just add a compensation factor.
-
@李东岳 Professor Li, when are you rolling out the DDR5 Gatling build?
-
@CFDngu Did you buy it second-hand? The 9004 series with 24 channels of DDR5 can feed roughly 128 cores.
-
@leilei Shanghai Supercomputer Center Magic Cube III (魔方III) OpenFOAM parallel efficiency test.
Two nodes hold up reasonably well too; for reference.
-
@DoubleTrump goodbountyhunter
-
I saw that this paper was done with Basilisk and found this thread while searching. OP, any progress? Is the software easy to compile? I'd like to follow up on this work.
When does an impacting drop stop bouncing?
Journal of Fluid Mechanics 958, March 2023
-
@2019201300 Just go for the 7302, no need to think twice; single or dual socket depends on your budget. Scavenge the rest of the parts, add IB cards, and you can even run it as a two-node setup.
-
./Allrun > /dev/null 2>&1