In my last post, "Parallel Algorithms of the STL with the GCC Compiler", I presented the necessary theory about the C++17 parallel algorithms. Today, I made a performance test using the Microsoft compiler and the GCC compiler to answer the simple question: Does the execution policy pay off?

The reason for the short detour about my template post is the following. I recognized that GCC supports my favorite C++17 feature: the parallel algorithms of the Standard Template Library. I present in this post the brand-new GCC 11.1, but a GCC 9 should also be acceptable.

You have to install a few additional libraries to use the parallel STL algorithms with the GCC. The GCC uses the Intel Threading Building Blocks (TBB) under the hood. The TBB is a C++ template library developed by Intel for parallel programming on multi-core processors. To be precise, you need TBB version 2018 or higher. When I installed the developer package of the TBB on my Linux desktop (SUSE), the package manager also chose the TBB memory allocator. You have to link against the TBB using the flag `-ltbb`.

The test program `parallelSTLPerformance.cpp` first fills the vector `randValues` with 500 million numbers (`constexpr long long size = 500'000'000`) from the half-open interval [0, pi/2). The function template `getExecutionTime` (4), declared as `template <typename Func> void getExecutionTime(const std::string& title, Func func)`, gets the name of the execution policy and a lambda function, executes the lambda function (5), and shows the execution time. There is one particular point about the three lambda functions ((6), (7), and (8)) used in this program. Lambda functions are, per default, constant. If a lambda function wants to change its values, it has to be declared mutable. This is necessary because the lambda functions modify their argument `workVec`.

Let me start with the Windows performance numbers. But before I do that, I have to make a short disclaimer. I don't want to compare Windows and Linux, because the two computers have different capabilities. These performance numbers should only give you a gut feeling. If you want the numbers for your system, you have to repeat the test. I use maximum optimization on Windows and Linux. This means the flag `/O2` on Windows and the flag `-O3` on Linux. My main focus is the relative performance of sequential and parallel execution. I'm keen to know if the parallel execution of the STL algorithms pays off, and to what extent.

My Windows laptop has eight logical cores, but the parallel execution is more than ten times faster than the sequential one. The numbers for the parallel and the parallel and vectorized execution are in the same ballpark. Here is the explanation from the Visual C++ Team Blog post "Using C++17 Parallel Algorithms for Better Performance":

> Note that the Visual C++ implementation implements the parallel and parallel unsequenced policies the same way, so you should not expect better performance for using par_unseq on our implementation, but implementations may exist that can use that additional freedom someday.

On Linux, I have four cores, and the parallel execution is about four times faster than the sequential execution. The performance numbers of the parallel version and the parallel and vectorized version are again in the same ballpark. Therefore, I assume that the GCC compiler uses the same strategy as the Windows compiler: when I ask for parallel and vectorized execution by using the execution policy `std::execution::par_unseq`, I get the parallel execution policy (`std::execution::par`). This behavior conforms to the C++17 standard, because the execution policies are only a hint for the compiler. To my knowledge, neither the Windows compiler nor the GCC compiler currently supports the parallel and vectorized execution of the parallel STL algorithms. When you want to see the parallel and vectorized algorithms in action, Nvidia's STL implementation Thrust may be an ideal candidate.
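Assuming a `g++` with parallel-algorithm support and an installed TBB, the build described above (maximum optimization, linking with `-ltbb`) might look like this; the output file names are placeholders:

```shell
# GCC on Linux: C++17, maximum optimization, link against the TBB
g++ -std=c++17 -O3 parallelSTLPerformance.cpp -ltbb -o parallelSTLPerformance
./parallelSTLPerformance

# MSVC (from a developer command prompt): /O2 as described in the post
cl /EHsc /std:c++17 /O2 parallelSTLPerformance.cpp
```

On GCC, omitting `-ltbb` typically causes link errors as soon as a parallel execution policy is actually used.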