Parallel C++

C++

Parallel Computing

Published

December 28, 2025

In this post, I explore several ways to achieve parallelism in C++. I will keep updating it as I learn about more methods.

Introduction

Let’s start with a very simple example that adds all the numbers in a vector. I will use the C++ standard library and start with a serial implementation, measuring the program’s execution time with the time command.

I create a vector of ones with \(2^{30}=1,073,741,824\) (approximately a billion) elements. For context, each int takes 4 bytes on typical platforms, so this vector occupies \(2^{32}\) bytes, or 4 GiB, of memory. I add the elements serially, as follows:

%%writefile source/1_serial_sum.cpp
#include <vector>
#include <numeric>

int main() {
    std::vector<int> my_vector (1<<30, 1);
    int sum = std::reduce(my_vector.begin(), my_vector.end(), 0);
    return 0;
}

Writing source/1_serial_sum.cpp

Let’s break down the code:

Lines 2-3: I include the necessary headers. <vector> provides the vector container and <numeric> provides the std::reduce function.
Line 6: I define a vector of size \(2^{30}\) filled with ones.
Line 7: I use std::reduce to sum the elements of the vector. I pass to the function the beginning and end iterators of the vector, along with an init value of 0.

I compile and run the program below, measuring the execution time with the time command:

%%sh
g++ source/1_serial_sum.cpp -o build/1_serial_sum
time -p ./build/1_serial_sum

real 8.36
user 7.14
sys 1.93

Here, real is the actual elapsed (wall-clock) time, user is the CPU time spent in user mode, and sys is the CPU time spent in kernel mode ¹.

¹ For more details, see https://stackoverflow.com/questions/556405/what-do-real-user-and-sys-mean-in-the-output-of-time1

Next, I will introduce parallelism by passing an execution policy to the std::reduce function, as follows:

%%writefile source/1_parallel_sum.cpp
#include <vector>
#include <numeric>
#include <execution>

int main() {
    std::vector<int> my_vector (1<<30, 1);
    int sum = std::reduce(std::execution::par, my_vector.begin(), my_vector.end(), 0);
    return 0;
}

Writing source/1_parallel_sum.cpp

I have changed two lines in the code:

Line 4: I added the <execution> header.
Line 8: I added the std::execution::par policy to the std::reduce function.

Note that <execution> requires a C++17-compliant compiler (GCC 11 and later default to C++17; older versions need the -std=c++17 flag). With GCC, I also need to link with the -ltbb flag, because its standard library implements the parallel algorithms on top of Intel’s Threading Building Blocks (TBB). To check which compilers support parallel algorithms and execution policies, I refer to the webpage Compiler support for C++17, specifically the GCC entry for <execution>.

On a Debian-based Linux distribution such as Ubuntu, I install TBB using the package manager, with the following command:

sudo apt-get install libtbb-dev

Now, I compile and run the parallel version of the program:

%%sh
g++ source/1_parallel_sum.cpp -o build/1_parallel_sum -ltbb
time -p ./build/1_parallel_sum

real 2.32
user 12.27
sys 1.27

The elapsed (real) time drops from 8.36 s to 2.32 s, roughly a 3.6x speedup over the serial version. At the same time, user time rises from 7.14 s to 12.27 s, because the work is now spread across several cores and their CPU times add up. This demonstrates the benefit of parallelism for large datasets. Note that parallelism has overhead costs, so for small datasets the serial version may perform better. It also has pitfalls, such as race conditions, which can lead to incorrect results if not handled properly. I will explore these issues in future updates.