Exploitation of the concept of data parallelism started in the 1960s with the development of the Solomon machine.1 The Solomon machine, also called a vector processor, was developed to speed up mathematical operations by working on a large data array, operating on multiple data elements in consecutive time steps. Concurrency of data operations was also exploited by operating on multiple data elements at the same time using a single instruction; such processors were called 'array processors'.2 In the 1980s, the term 'data parallelism' was introduced3 to describe this programming style, which was widely used to program Connection Machines in data parallel languages like C*. Today, data parallelism is best exemplified in graphics processing units (GPUs), which use both techniques: operating on multiple data elements in space and in time using a single instruction.
Most data parallel hardware supports only a fixed number of parallel levels, often only one. This means that within a parallel operation it is not possible to launch further parallel operations recursively, so programmers cannot make direct use of nested hardware parallelism. The programming language NESL was an early effort at implementing a nested data-parallel programming model on flat parallel machines; in particular, it introduced the flattening transformation, which transforms nested data parallelism into flat data parallelism. This work was continued by later languages such as Data Parallel Haskell and Futhark, although arbitrary nested data parallelism is not widely available in current data-parallel programming languages.
In a multiprocessor system executing a single set of instructions (SIMD), data parallelism is achieved when each processor performs the same task on different distributed data. In some situations, a single execution thread controls operations on all the data. In others, different threads control the operation, but they execute the same code.
For instance, consider matrix multiplication and array addition performed sequentially, as discussed in the example below.
Below is the sequential pseudo-code for the multiplication and the addition, where the multiplication result is stored in the matrix C. The multiplication pseudo-code computes each element of C as the dot product of a row of A and a column of B.
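A minimal sketch of these sequential routines in C (the fixed dimension N, the statically allocated arrays, and the variable names are assumptions made here for illustration; the addition operates element-wise on arrays of length n):

#define N 100  /* assumed matrix/array dimension, for illustration only */

double A[N][N], B[N][N], C[N][N];

/* Sequential matrix multiplication: C = A * B.
 * Each element C[i][j] is the dot product of row i of A and column j of B. */
void matmul_seq(void) {
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++) {
                sum += A[i][k] * B[k][j];
            }
            C[i][j] = sum;
        }
    }
}

/* Sequential element-wise addition: c[i] = a[i] + b[i]. */
void array_add_seq(const double *a, const double *b, double *c, int n) {
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}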
If these programs were executed sequentially, the time taken to calculate the result would be O(n³) for the multiplication and O(n) for the addition (assuming the row and column lengths of both matrices are n).
We can exploit data parallelism in the preceding code to execute it faster, since the loop iterations are independent of one another. Parallelization of the matrix multiplication code is achieved using OpenMP: the directive "omp parallel for" instructs the compiler to execute the iterations of the for loop in parallel. For multiplication, we can divide matrices A and B into blocks along rows and columns respectively. This allows us to calculate every element of matrix C individually, making the task parallel. For example, A[m × n] · B[n × k] can be finished in O(n) instead of O(m·n·k) when executed in parallel using m·k processors.
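A hedged sketch of this OpenMP parallelization in C, reusing the arrays and dimension N from the sequential sketch above (the outer-loop pragma is standard OpenMP usage; everything else is an illustrative assumption):

#include <omp.h>

/* Parallel matrix multiplication: the iterations of the outer loop are
 * independent, so "omp parallel for" distributes the rows of C across threads. */
void matmul_omp(void) {
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++) {
                sum += A[i][k] * B[k][j];
            }
            C[i][j] = sum;
        }
    }
}

/* Parallel element-wise addition: each iteration writes a distinct element. */
void array_add_omp(const double *a, const double *b, double *c, int n) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}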
It can be observed from the example that many processors would be required as the matrix sizes keep increasing. Keeping the execution time low is the priority, but as the matrix size grows, we face other constraints such as the complexity of such a system and its associated costs. Therefore, if we constrain the number of processors in the system, we can still apply the same principle and divide the data into bigger chunks, with each processor computing a larger portion of the product.4
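One possible realization of that idea, again reusing the arrays from the sketches above (the thread count p and the contiguous row blocks are assumptions; any other decomposition into chunks would illustrate the same principle): each of the p threads computes a block of roughly N/p rows of C, so the processor count stays fixed while the chunk handled by each processor grows with the matrix.

#include <omp.h>

/* Block-row decomposition with a fixed number of threads p:
 * each thread owns a contiguous chunk of rows of C. */
void matmul_chunked(int p) {
    #pragma omp parallel num_threads(p)
    {
        int tid = omp_get_thread_num();
        int nthreads = omp_get_num_threads();
        int lo = (int)((long long)N * tid / nthreads);
        int hi = (int)((long long)N * (tid + 1) / nthreads);
        for (int i = lo; i < hi; i++) {
            for (int j = 0; j < N; j++) {
                double sum = 0.0;
                for (int k = 0; k < N; k++) {
                    sum += A[i][k] * B[k][j];
                }
                C[i][j] = sum;
            }
        }
    }
}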
For addition of arrays in a data-parallel implementation, let's assume a more modest system with two central processing units (CPUs), A and B. CPU A could add all elements from the top half of the arrays, while CPU B could add all elements from the bottom half. Since the two processors work in parallel, the array addition would take half the time of performing the same operation in serial on one CPU alone.
The program sketched below—which applies some arbitrary operation, foo, to every element in the array d—illustrates data parallelism:5
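A minimal sketch of that program in C (the cpu_id parameter, 0-based indexing, and the body of foo are assumptions made here for illustration; footnote 5 below discusses the rounding behaviour of the original 1-based pseudo-code, which this sketch only approximates):

#include <stddef.h>

/* Placeholder for the arbitrary per-element operation. */
static void foo(double *x) {
    *x *= 2.0;  /* illustrative only */
}

/* SPMD kernel: every CPU runs the same code, but cpu_id (0 or 1) selects
 * which half of d it processes. If the computed range is empty, for
 * example on a very short array, the loop simply performs no iterations. */
void process_half(double d[], size_t length, int cpu_id) {
    size_t lower_limit = (cpu_id == 0) ? 0 : length / 2;
    size_t upper_limit = (cpu_id == 0) ? length / 2 : length;
    for (size_t i = lower_limit; i < upper_limit; i++) {
        foo(&d[i]);
    }
}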
When this SPMD program is executed on a two-processor system, both CPUs execute the same code, but each operates on its own half of the data.
Data parallelism emphasizes the distributed (parallel) nature of the data, as opposed to the processing (task parallelism). Most real programs fall somewhere on a continuum between task parallelism and data parallelism.
The process of parallelizing a sequential program can be broken down into four discrete steps: decomposition of the computation into tasks; assignment of tasks to processes; orchestration of data access, communication and synchronization among the processes; and mapping of processes to processors.6
Data and task parallelism can be implemented simultaneously by combining them in the same application; this is called mixed data and task parallelism. Mixed parallelism requires sophisticated scheduling algorithms and software support. It is the best kind of parallelism when communication is slow and the number of processors is large.8
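A hedged sketch of the combination in OpenMP (the two element-wise kernels and the array length are invented here purely for illustration): two independent tasks run concurrently, and each task is itself a data-parallel loop.

#include <omp.h>

#define M 1000  /* assumed array length, for illustration only */

double a[M], b[M], c[M], x[M], y[M], z[M];

/* Task parallelism: the two sections may run concurrently.
 * Data parallelism: inside each section, a nested parallel loop
 * divides the iterations among further threads. */
void mixed_parallel(void) {
    omp_set_max_active_levels(2);  /* allow nested parallel regions */
    #pragma omp parallel sections num_threads(2)
    {
        #pragma omp section
        {
            #pragma omp parallel for
            for (int i = 0; i < M; i++)
                c[i] = a[i] + b[i];      /* task 1: element-wise add */
        }
        #pragma omp section
        {
            #pragma omp parallel for
            for (int i = 0; i < M; i++)
                z[i] = x[i] * y[i];      /* task 2: element-wise multiply */
        }
    }
}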
Mixed data and task parallelism has many applications. It is used particularly in global climate modeling, where large data-parallel grid computations are coordinated as concurrent tasks, and in timing-based circuit simulation, where the data is divided among different sub-circuits and parallelism is orchestrated across the tasks.
A variety of data parallel programming environments are available today; the most widely used include Message Passing Interface (MPI), Open Multi-Processing (OpenMP), CUDA and OpenACC, and Threading Building Blocks (TBB).
Data parallelism finds applications in a variety of fields, ranging from physics, chemistry, biology and materials science to signal processing. The sciences employ data parallelism for simulating models such as molecular dynamics,10 for sequence analysis of genome data,11 and for other physical phenomena. Driving forces for data parallelism in signal processing include video encoding, image and graphics processing, and wireless communications,12 to name a few.
This section is an excerpt from Data-intensive computing.
1. "The Solomon Computer". https://www.computer.org/csdl/pds/api/csdl/proceedings/download-article/50610097/pdf
2. "SIMD/Vector/GPU" (PDF). Retrieved 2016-09-07. https://www.ece.cmu.edu/~ece740/f13/lib/exe/fetch.php%3Fmedia%3Dseth-740-fall13-module5.1-simd-vector-gpu.pdf
3. Hillis, W. Daniel; Steele, Guy L. (December 1986). "Data Parallel Algorithms". Communications of the ACM.
4. Barney, Blaise. "Introduction to Parallel Computing". computing.llnl.gov. Archived from the original on 2013-06-10. Retrieved 2016-09-07. https://web.archive.org/web/20130610122229/https://computing.llnl.gov/tutorials/parallel_comp/
5. Some input data (e.g. when d.length evaluates to 1 and round rounds towards zero [this is just an example, there are no requirements on what type of rounding is used]) will lead to lower_limit being greater than upper_limit; it is assumed that the loop exits immediately (i.e. zero iterations occur) when this happens.
6. Solihin, Yan (2016). Fundamentals of Parallel Architecture. Boca Raton, FL: CRC Press. ISBN 978-1-4822-1118-4.
7. "How to Parallelize Deep Learning on GPUs Part 2/2: Model Parallelism". Tim Dettmers. 2014-11-09. Retrieved 2016-09-13. http://timdettmers.com/2014/11/09/model-parallelism-deep-learning/
8. "The Netlib" (PDF). http://www.netlib.org/lapack/lawnspdf/lawn97.pdf
9. "OpenMP.org". openmp.org. Archived from the original on 2016-09-05. Retrieved 2016-09-07. https://web.archive.org/web/20160905232633/http://openmp.org/wp/
10. Boyer, L. L.; Pawley, G. S. (1988-10-01). "Molecular dynamics of clusters of particles interacting with pairwise forces using a massively parallel computer". Journal of Computational Physics. 78 (2): 405–423. Bibcode:1988JCoPh..78..405B. doi:10.1016/0021-9991(88)90057-5.
11. Yap, T. K.; Frieder, O.; Martino, R. L. (1998). "Parallel computation in biological sequence analysis". IEEE Transactions on Parallel and Distributed Systems. 9 (3): 283–294. CiteSeerX 10.1.1.30.2819. doi:10.1109/71.674320.
12. Singh, H.; Lee, Ming-Hau; Lu, Guangming; Kurdahi, F. J.; Bagherzadeh, N.; Filho, E. M. Chaves (2000-06-01). "MorphoSys: an integrated reconfigurable system for data-parallel and computation-intensive applications". IEEE Transactions on Computers. 49 (5): 465–481. doi:10.1109/12.859540. ISSN 0018-9340. https://www.researchgate.net/publication/3044209
13. Middleton, A. M. (2010). "Data-Intensive Technologies for Cloud Computing". In Handbook of Cloud Computing. Springer. http://www.cse.fau.edu/~borko/HandbookofCloudComputing.html