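I need to solve a large problem on a large graph instance. To do that, I partition the input space among threads so that each thread independently solves the same function on its own set of inputs. When I measured the scalability of my software, I noticed that as I increase the number of threads, the run time goes up beyond 4 threads. I wrote a very small example to see why this happens: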
#include <algorithm>
#include <random>
#include <thread>
#include <iostream>
#include <chrono>
#include <vector>   // std::vector
#include <cstdio>   // printf

// Despite its name, returns the elapsed time in seconds (millisecond resolution).
template<typename T>
inline double getMs(T start, T end) {
    return double(
        std::chrono::duration_cast<std::chrono::milliseconds>(end - start)
            .count()) /
        1000;
}

int main() {
    std::random_device rd;
    std::mt19937 g(rd());
    unsigned int n = std::thread::hardware_concurrency();
    std::cout << n << " concurrent threads are supported.\n";
    for (size_t np = 2; np < 17; np++) {
        auto start = std::chrono::high_resolution_clock::now();
        std::cout << np << " threads: ";
        std::vector<std::thread> threads(np);
        int number_stops = 50; // memory 39420
        int number_transfers = 1; // memory
        int number_structures = 1; // memory
        int number_iterations = 1000000; // time
        auto dimension = number_stops * (number_transfers + 1) * number_structures;
        // Every thread runs the full loop: allocate, fill, and release a vector.
        auto paraTask = [&]() {
            for (int b = 0; b < number_iterations; b++) {
                //std::srand(unsigned(std::time(nullptr)));
                std::vector<int> v(dimension, 1586);
                //std::generate(v.begin(), v.end(), std::rand);
                v.clear();
            }
        };
        for (size_t i = 0; i < np; i++) {
            threads[i] = std::thread(paraTask);
        }
        // Join the threads
        for (auto&& thread : threads) thread.join();
        double elapsed = getMs(start, std::chrono::high_resolution_clock::now());
        printf("parallel completed: %.3f sec.\n", elapsed);
    }
    return 0;
}
Just a short description. To mimic the actual software I'm working on, I use these variables:
int number_stops = 50; // memory 39420
int number_transfers = 1; // memory
int number_structures = 1; // memory
int number_iterations = 1000000; // time
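Without going into too much detail, the first three simulate memory consumption (how many vector entries are filled on each call), while the fourth simulates the number of iterations. The goal is to see what causes the time increase as threads are added: the memory consumption, more computation time in each thread, or both.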
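To make the memory side concrete, here is a small sketch of the arithmetic behind `dimension` for the two parameter sets used in this question. The helper names `vector_entries` and `vector_bytes` are made up for illustration, and the byte count covers only the `int` payload, not allocator bookkeeping.

```cpp
#include <cstddef>

// Number of ints held by each std::vector<int>, using the same formula
// as `dimension` in the question's code.
constexpr std::size_t vector_entries(std::size_t stops,
                                     std::size_t transfers,
                                     std::size_t structures) {
    return stops * (transfers + 1) * structures;
}

// Payload size in bytes of one such vector (allocator overhead not included).
constexpr std::size_t vector_bytes(std::size_t stops,
                                   std::size_t transfers,
                                   std::size_t structures) {
    return vector_entries(stops, transfers, structures) * sizeof(int);
}

// Small configuration: 50 * (1 + 1) * 1 = 100 entries per vector.
static_assert(vector_entries(50, 1, 1) == 100, "small configuration");
// Large configuration: 50 * (100 + 1) * 100 = 505000 entries per vector.
static_assert(vector_entries(50, 100, 100) == 505000, "large configuration");
```

Assuming a 4-byte `int`, that is about 400 bytes per vector in the small configuration and about 2 MB per vector in the large one.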
Here are the results with the settings above:
16 concurrent threads are supported.
2 threads: parallel completed: 0.995 sec.
3 threads: parallel completed: 1.017 sec.
4 threads: parallel completed: 1.028 sec.
5 threads: parallel completed: 1.081 sec.
6 threads: parallel completed: 1.131 sec.
7 threads: parallel completed: 1.122 sec.
8 threads: parallel completed: 1.216 sec.
9 threads: parallel completed: 1.445 sec.
10 threads: parallel completed: 1.603 sec.
11 threads: parallel completed: 1.596 sec.
12 threads: parallel completed: 1.626 sec.
13 threads: parallel completed: 1.634 sec.
14 threads: parallel completed: 1.611 sec.
15 threads: parallel completed: 1.648 sec.
16 threads: parallel completed: 1.688 sec.
So, as you can see, the time increases. Why? I also tried the other way around (fewer iterations, but more memory):
int number_stops = 50; // memory 39420
int number_transfers = 100; // memory
int number_structures = 100; // memory
int number_iterations = 50; // time
The same thing happens; the time increases:
16 concurrent threads are supported.
2 threads: parallel completed: 0.275 sec.
3 threads: parallel completed: 0.267 sec.
4 threads: parallel completed: 0.278 sec.
5 threads: parallel completed: 0.282 sec.
6 threads: parallel completed: 0.303 sec.
7 threads: parallel completed: 0.314 sec.
8 threads: parallel completed: 0.345 sec.
9 threads: parallel completed: 0.370 sec.
10 threads: parallel completed: 0.368 sec.
11 threads: parallel completed: 0.395 sec.
12 threads: parallel completed: 0.407 sec.
13 threads: parallel completed: 0.431 sec.
14 threads: parallel completed: 0.444 sec.
15 threads: parallel completed: 0.448 sec.
16 threads: parallel completed: 0.455 sec.
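One detail worth keeping in mind when reading both tables: in the code above every thread runs the full `number_iterations` loop, so the total amount of work grows with the thread count, and perfect scaling would show up as constant time, not decreasing time. The sketch below converts a reported time into aggregate throughput; the helper name `allocations_per_second` is hypothetical and not part of the original program.

```cpp
#include <cassert>

// Total vector constructions per second, summed over all threads.
// Each of the `threads` threads performs `iterations_per_thread` iterations,
// so the total work is threads * iterations_per_thread.
inline double allocations_per_second(unsigned threads,
                                     double iterations_per_thread,
                                     double elapsed_seconds) {
    assert(elapsed_seconds > 0.0);
    return threads * iterations_per_thread / elapsed_seconds;
}
```

Applied to the first table (1,000,000 iterations per thread), this gives roughly 2.0 million constructions per second at 2 threads (0.995 s), 6.6 million at 8 threads (1.216 s), and 9.5 million at 16 threads (1.688 s). Throughput still rises with the thread count, just less than proportionally.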
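To give more context, here are my computer's specs:
- CPU: 11th Gen Intel(R) Core(TM) i7-11700KF @ 3.60GHz
- RAM: 16 GB DDR4
- OS: Windows 11
- Compiler: MS_VS 2022
Also, here is the hardware report from CPU-Z.
My CPU has 8 physical cores and 16 logical cores.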
2 Answers
vnjpjtjt #1
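Using threads is not a golden hammer for improving performance. If your program is badly designed, adding threads will make it slower, not faster.
Your threads don't do much. They just allocate and deallocate memory. The heap is shared state and access to it is synchronized. That means that, on top of the overhead of spawning threads, you also pay a synchronization penalty.
So you have several overheads from threading, and the only part that runs in parallel without being hindered by other threads (initializing the elements to a default value) is too quick for you to see any gain from multiple threads. It is fast compared to those overheads.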
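As a rough way to test this explanation, the sketch below compares the question's pattern (a fresh vector on every iteration) with a variant where each thread allocates its buffer once and only refills it. `run_benchmark` and its parameters are hypothetical names introduced here, and an optimizing compiler may remove part of the work, so treat any timings as indicative only.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper: runs `np` threads, each making `iterations` passes over
// a buffer of `dimension` ints, and returns the elapsed wall-clock seconds.
// reuse_buffer == true : one heap allocation per thread, refilled in place.
// reuse_buffer == false: one heap allocation per iteration, as in the question.
double run_benchmark(std::size_t np, std::size_t dimension,
                     int iterations, bool reuse_buffer) {
    assert(np > 0 && dimension > 0);
    auto task = [=]() {
        if (reuse_buffer) {
            std::vector<int> v(dimension);            // allocate once per thread
            for (int b = 0; b < iterations; ++b) {
                v.assign(dimension, 1586);            // refill; capacity is reused
            }
        } else {
            for (int b = 0; b < iterations; ++b) {
                std::vector<int> v(dimension, 1586);  // allocate and free each pass
            }
        }
    };
    auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> threads;
    threads.reserve(np);
    for (std::size_t i = 0; i < np; ++i) threads.emplace_back(task);
    for (auto& t : threads) t.join();
    std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}
```

Calling `run_benchmark(np, 100, 1000000, false)` and `run_benchmark(np, 100, 1000000, true)` for `np` from 2 to 16 mirrors the loop in the question. If heap contention is the main cost, the reuse variant should stay much flatter as `np` grows.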
a5g8bdjr #2
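The logical CPUs (16) are, as you say, only logical. The number of physical CPUs is the number of calculation engines, branch processors, and so on. The *larger* number of *logical* CPUs just lets a multi-core CPU use those "sub-engines" more efficiently.
So what you should really be wondering about is the time increase from 1 to 8 threads. Because all threads do the same thing, they use (most of the time) the same parts of the 8 physical cores, so the logical CPUs only reduce the waiting time a little.
Next: the work your threads do is mostly memory-related, and the I/O channel to memory is a very limited resource that cannot be used 8 times at once.
To get the result you expect, create a work loop that, for example, uses only integer arithmetic with a very small memory footprint (e.g. always summing the same integers) and discards all results, to further avoid contention on the memory channel. Your threads should then run almost fully in parallel and give a nearly constant run time.
Even then, whenever a CPU is needed for some (minimal) time by another process, a context switch is required. Your program is not alone, and the operating system should get its turn too :-) The more cores you occupy, the higher the probability that you have to share computing resources with the system.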
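A minimal sketch of such a work loop, assuming nothing beyond the standard library: `integer_work` and `run_compute_benchmark` are hypothetical names, and the `volatile` accumulator is only there to stop the compiler from collapsing the loop.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <thread>
#include <vector>

// Pure integer work with no heap traffic: repeatedly sums the values 0..7.
std::uint64_t integer_work(std::uint64_t iterations) {
    volatile std::uint64_t sum = 0;  // volatile keeps the loop from being optimized away
    for (std::uint64_t i = 0; i < iterations; ++i) {
        sum = sum + (i & 7u);
    }
    return sum;
}

// Runs `np` threads that each execute integer_work and discard the result.
// Returns the elapsed wall-clock time in seconds.
double run_compute_benchmark(unsigned np, std::uint64_t iterations) {
    assert(np > 0);
    auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> threads;
    threads.reserve(np);
    for (unsigned i = 0; i < np; ++i) {
        threads.emplace_back([iterations]() { (void)integer_work(iterations); });
    }
    for (auto& t : threads) t.join();
    std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}
```

Timing `run_compute_benchmark(np, iterations)` for `np` from 1 to 16 with a large `iterations` value should, if the reasoning above holds, stay close to flat up to the number of physical cores and rise beyond that.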