Fundamentals of Parallelism on Intel Architecture
Cluster computers (distributed memory)
Multithreading (in shared memory) - each computer having multiple cores.
Vectorization (of floating-point math) - each core being able to process multiple data elements in a single instruction (SIMD).
Shorter time to solution.
Being able to process a lot more data.
Power efficiency.
Cost efficiency.
The number of transistors keeps increasing, but clock frequency and power consumption have stalled: raising the clock speed requires more power, and beyond a certain point no cooling solution can keep up - we hit the "power wall".
So we can't increase the clock speed to make our applications faster, but we do have plenty of transistors, so we want to use them to make our cores smarter. (Pipelining allows only so many stages, superscalar execution needs extra resources, and out-of-order execution isn't scalable.)
The chosen approach should be future-proof, optimized, and standards-compliant.
Memory (caches, hierarchy optimizations)
Threading (multiple cores)
Scalar tuning (pipelining)
Vectorizing
You can vectorize code either manually or automatically. "Intrinsic" functions allow manual vectorization, but they are not future-proof.
Compilers can auto-vectorize: they transform your scalar code into vector instructions by looking for loops.
The loop's trip count must be known in advance (countable loops, e.g. for loops).
No vector dependence between iterations.
The compiler applies auto-vectorization to innermost loops by default.
If you're calling functions inside the loop, they must be SIMD-enabled. (A minimal auto-vectorizable loop is sketched below.)
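A minimal sketch of such a loop (the function and array names are made up for illustration):
void scale(float* a, const float* b, int n) {   // hypothetical example
    for (int i = 0; i < n; i++)                 // countable trip count, innermost loop
        a[i] = 2.0f * b[i];                     // no cross-iteration dependence, unit stride
}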
#pragma omp simd enforces vectorization of loops.
Use it in cases where:
You want to override the innermost-loop default (e.g. vectorize an outer loop).
The compiler's own decision fails (it refuses to vectorize a loop that is actually safe).
You want to guide the vector length, declare a vector reduction (computing many partial results and then combining them, e.g. sum += each of them), etc.
The loop calls SIMD-enabled functions. (A reduction example is sketched below.)
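A sketch of enforcing vectorization with a reduction (a[], n, and sum are placeholders):
float sum = 0.0f;
#pragma omp simd reduction(+: sum)   // enforce vectorization; sum is a vector reduction
for (int i = 0; i < n; i++)
    sum += a[i];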
For conditionals (e.g. an if inside a loop you want to vectorize), the hardware performs the operation on all lanes first and then uses a mask to apply the condition.
The mask is just a bit pattern: if your result is 11001000 and the condition keeps, say, every third lane, the hardware simply ANDs 11001000 with the mask 00100100.
Functions with #pragma omp declare simd on top:
The compiler produces three versions of the function:
Scalar
Vector
Masked
So you can call this function from a loop in any of the three ways. (A sketch follows.)
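A sketch under these assumptions (my_func, a, b, and n are made-up names):
#pragma omp declare simd
float my_func(float x) {         // compiler emits scalar, vector, and masked versions
    return x * x + 1.0f;
}

#pragma omp simd
for (int i = 0; i < n; i++)
    a[i] = my_func(b[i]);        // the vector version is called from the vectorized loop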
Safe vs unsafe to vectorize.
If you're iterating with i++ and a[i] depends on the previous element a[i-1] - not safe (true dependence).
If you're iterating with i++ and a[i] depends on a[i+1] - safe, because no iteration needs a result computed by an earlier iteration.
If two arrays (passed as pointers) can access the same memory locations (aliasing) - not safe, unless you tell the compiler otherwise. (Both dependence cases are sketched below.)
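A sketch of the two dependence cases (a[] and n are placeholders):
for (int i = 1; i < n; i++)
    a[i] = a[i-1] + 1.0f;    // not safe: each iteration reads the previous iteration's result
for (int i = 0; i < n - 1; i++)
    a[i] = a[i+1] + 1.0f;    // safe: reads only values the loop has not yet overwritten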
Strip mining makes it easy for the compiler to recognize loops that can be vectorized.
You split the loop into two nested loops: the outer one runs from 0 to the largest multiple of STRIP not exceeding n (i.e. n - n%STRIP), in strides of ii += STRIP.
The inner loop runs from the outer variable ii to ii + STRIP, doing the work of the original loop.
const int STRIP = 16;                       // strip length, e.g. the vector width
const int nStrip = n - n % STRIP;           // largest multiple of STRIP not exceeding n
for (int ii = 0; ii < nStrip; ii += STRIP) {
    for (int i = ii; i < ii + STRIP; i++) {
        /* do stuff: vectorizable inner loop of fixed length STRIP */
    }
}
for (int i = nStrip; i < n; i++) {
    /* do stuff: remainder loop */
}
At the end you write a remainder loop to handle whatever is left (the main loop only covers nStrip = n - n%STRIP iterations, so up to STRIP-1 iterations of n may remain).
Strip mining is often used to allow vectorization to coexist with threads.
Multi-processing: independent streams of instructions, each process with its own memory address space.
Multi-threading: independent streams of instructions sharing one memory address space.
Include pthread.h; thread functions take a void* argument (e.g. a thread ID) and return void*.
Declare a variable of type pthread_t, e.g. pthread_t thr;
Start the thread with the call pthread_create(&thr, NULL, func, NULL);
For a clean exit - pthread_exit(NULL);
The code will not link without the link-line argument -lpthread.
If one thread modifies shared memory, the other threads see the change. (A minimal sketch follows.)
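A minimal sketch putting the pthread pieces together (the worker function and its message are made up):
#include <cstdio>
#include <pthread.h>

void* worker(void* threadID) {              // thread functions take and return void*
    printf("Hello from thread %ld\n", (long)threadID);
    return NULL;
}

int main() {
    pthread_t thr;
    pthread_create(&thr, NULL, worker, (void*)1L);  // start the worker thread
    pthread_join(thr, NULL);                        // wait for it to finish
    pthread_exit(NULL);                             // clean exit; link with -lpthread
}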
int status;
pid_t pid = fork();
After fork we end up with two copies of the process: fork() returns 0 in the child and the child's PID (nonzero) in the parent.
if (pid == 0) we are in the child process, else we are in the parent.
The parent calls wait(&status) for a clean exit.
Memory is not shared, so changes in one process do not affect the other. (A sketch follows.)
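A minimal sketch of the fork/wait pattern described above:
#include <cstdio>
#include <sys/wait.h>
#include <unistd.h>

int main() {
    int status;
    pid_t pid = fork();                        // from here on there are two copies of the process
    if (pid == 0) {
        printf("child\n");                     // fork() returned 0: this is the child
    } else {
        printf("parent of child %d\n", (int)pid);  // the parent gets the child's PID
        wait(&status);                             // wait for the child, for a clean exit
    }
    return 0;
}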
OpenMP (Open Multi-Processing):
#include <omp.h>
The initial thread can query const int n = omp_get_max_threads();
#pragma omp parallel
{
    // this code runs in parallel; omp_get_thread_num() returns the calling thread's number
}
Compiler arg -qopenmp
Setting the number of threads: export OMP_NUM_THREADS=5 (the default is the number of logical CPUs in your machine). A complete hello-world sketch follows.
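A minimal hello-world sketch (compile with -qopenmp, or -fopenmp for g++):
#include <cstdio>
#include <omp.h>

int main() {
    const int n = omp_get_max_threads();    // queried by the initial thread
    printf("Up to %d threads\n", n);
#pragma omp parallel
    {
        // each thread executes this block once
        printf("Hello from thread %d\n", omp_get_thread_num());
    }
    return 0;
}
Run with, e.g., OMP_NUM_THREADS=5 ./a.out.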
Threads are streams of instructions sharing memory.
Method 1 - C, C++, fortran
int A, B;
#pragma omp parallel private(A) shared(B)
{
// each thread has its own copy of A; all threads share B
}
Method 2 - C, C++
int B;
#pragma omp parallel
{
int A;   // declared inside the parallel region, so each thread gets its own A
}
#pragma omp parallel for
- Iterations distributed across available threads.
#pragma omp parallel
{
    // equivalent split form: a parallel region containing a work-sharing for
    #pragma omp for
    for (int j = 0; j < n; j++) { /* ... */ }
}
#pragma omp parallel for schedule(<mode>)
Scheduling mode is the algorithm with which iterations are assigned to threads.
Static - iterations are split into fixed chunks assigned to threads in advance; low overhead, but the load may end up unbalanced.
Dynamic - whichever thread finishes its chunk first gets the next one; balances the load, but has higher scheduling overhead.
Guided - like dynamic, but hands out larger chunks first and progressively smaller ones. (A sketch follows.)
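A sketch of requesting a schedule (work() and n are placeholders for an imbalanced loop):
#pragma omp parallel for schedule(dynamic, 4)   // hand out chunks of 4 iterations on demand
for (int i = 0; i < n; i++)
    work(i);                                    // iteration costs vary, so dynamic balances the load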
A data race: two or more threads access the same memory address and at least one of them writes.
Mutual exclusion constructs protect against data races by serializing access; we don't want them in the innermost loop because serialization slows it down.
Critical:
#pragma omp parallel
{
#pragma omp critical
{
// large pieces of code can be protected here; only one thread at a time executes this block
}
}
Atomic:
#pragma omp parallel
{
#pragma omp atomic
sum += i;   // atomic protects exactly one simple statement
}
#pragma omp parallel for reduction(+: total)
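For example, the sum computed by hand in the "behind the scenes" version below can be written with the clause directly (n is a placeholder):
int total = 0;
#pragma omp parallel for reduction(+: total)   // each thread gets a private copy of total
for (int i = 0; i < n; i++)
    total += i;                                // the private copies are combined at the end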
Reduction behind the scenes (what the compiler generates):
int total = 0;
#pragma omp parallel
{
int total_thr = 0;
#pragma omp for
for(int i = 0; i < n; i++)
total_thr += i;
#pragma omp atomic
total += total_thr;
}
Data reuse in caches.
Vector arithmetic is cheap, memory access is expensive.
More than ~50 arithmetic operations per memory access - compute-bound application.
Fewer than ~50 - bandwidth-bound application.
Knights landing - KNL:
MCDRAM on package
DDR4 on platform
Core <> regs <> L1 <> L2 <> MCDRAM (flat, cache, or hybrid mode) <> DDR4
Xeon:
Core <> regs <> L1 <> L2 <> LLC (last level cache, shared between cores) <> DDR4 RAM (main memory).
QPI - quick path interconnect (NUMA - non uniform mem access) (transparent to the programmer).
Coprocessor
Core <> regs <> L1 <> L2 <> GDDR5 (not shared with host system).
The coprocessor connects to the host via PCIe; this is not transparent to the programmer.
Flat mode
MCDRAM as a separate NUMA node.
You control what goes into the MCDRAM.
Cache mode
MCDRAM is used as the LLC.
It's used automatically by the hardware.
Hybrid mode
Half as a NUMA node and half as cache.
Does the working set fit in the 16 GB of MCDRAM?
If yes, numactl lets you run the application out of MCDRAM directly.
No code modification required.
Can you partition the bandwidth-critical data down to <=16 GB?
If yes, manually allocate the bandwidth-critical memory in MCDRAM.
memkind calls need to be added to the code.
Otherwise, or if you prefer, run in cache mode.
The chip figures out how to use the MCDRAM.
No code modification required.
numactl -H   (list the NUMA nodes and their memory sizes)
numactl --membind=<node>   (bind the application's memory to a given node)
#include <hbwmalloc.h>
const int n = 1<<10;
double *A = (double*) hbw_malloc(sizeof(double)*n);   // allocated in MCDRAM
Aligned allocator
double *B;
int ret = hbw_posix_memalign((void**) &B, 64, sizeof(double)*n);   // 64-byte (cache line) alignment
Free
hbw_free(A);
hbw_free(B);
Compilation
g++/icpc progname.cc -lmemkind -o runme
numactl -m 1 ./application   (in flat mode, the MCDRAM typically appears as node 1)
For when you don't want to pollute the cache with data that will not be reused: streaming (nontemporal) stores.
#pragma vector nontemporal
Compiler option: -qopt-streaming-stores=always
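#pragma vector nontemporal is an Intel compiler hint; a minimal sketch (a[] and n are placeholders):
#pragma vector nontemporal      // stream the stores past the caches
for (int i = 0; i < n; i++)
    a[i] = 0.0;                 // a[] is only written here and not reused soon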
Interactions with the cache are not performance-transparent to the programmer.
You always get a whole cache line: the minimum amount you can access at a time is fixed, and on Intel architecture it is 64 bytes.
The parallelism built into the memory subsystem.
Locality means nearby memory elements are used as much as possible instead of accessed at random; it's good to maintain unit-stride memory access.
An array of structures (AoS) is wasteful; a structure of arrays (SoA) is better for unit-stride access.
Array of structures: {x, y, z, x, y, z, ...} - you want only the x's but stride over y and z, vs.
Structure of arrays: st.x[0], st.x[1], ... are contiguous. (A sketch of both layouts follows.)
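A sketch of the two layouts (names and sizes are made up):
struct PointAoS { float x, y, z; };   // array of structures: x, y, z interleaved in memory
PointAoS aos[1024];                   // reading only aos[i].x strides over y and z as well

struct PointsSoA {                    // structure of arrays: each coordinate is contiguous
    float x[1024], y[1024], z[1024];
};
PointsSoA st;                         // st.x[0], st.x[1], ... is unit-stride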
Loop tiling - how do you optimize these loops so that the b[j] values are reused while they are still in cache?
for(int i=0; i<m; i++)
for(int j=0; j<n; j++)
compute(a[i], b[j]);
Step 1: Strip mining the inner loop.
for(int i=0; i<m; i++)
for(int jj=0; jj<n; jj+=TILE)
for(int j=jj; j<jj+TILE; j++)
compute(a[i], b[j]);
Step 2: Permute the loops so that the tile b[jj..jj+TILE) is reused for every i while it's still in cache.
for(int jj=0; jj<n; jj+=TILE)
for(int i=0; i<m; i++)
for(int j=jj; j<jj+TILE; j++)
compute(a[i], b[j]);
Or tile the outer loop as well and unroll it - "unroll and jam", a.k.a. register blocking.
MPI (message passing interface) is a framework for distributed memory programming. It allows you to run your application on multiple processors in a cluster.
One MPI rank per core isn't practical these days given how many cores modern processors have (e.g. 72 cores supporting up to 4-way hyper-threading).
So instead, use MPI to scale across compute nodes and OpenMP to scale across cores.
Good MPI implementations are aware of OpenMP and work with it.
#include "mpi.h"
First call: MPI_Init(&argc, &argv);  /* argc and argv are the arguments of main: int main(int argc, char* argv[]) */
Exit with: MPI_Finalize();
Two processes participate in point-to-point communication: one is the sender and the other is the receiver. All other processes do not have to wait for this communication to finish.
MPI_Send(&outMsg, msgLen, MPI_CHAR, receiver, tag, MPI_COMM_WORLD);    // char outMsg[msgLen]; "receiver" is a rank number
MPI_Recv(&inMsg, msgLen, MPI_CHAR, sender, tag, MPI_COMM_WORLD, &stat);   // char inMsg[msgLen]; "sender" is a rank number, stat is an MPI_Status
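A minimal two-rank sketch of the calls above (message length and contents are made up; run with mpirun -np 2):
#include <cstdio>
#include <cstring>
#include "mpi.h"

int main(int argc, char* argv[]) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // this process's rank number

    const int msgLen = 16, tag = 0;
    char outMsg[msgLen], inMsg[msgLen];
    MPI_Status stat;

    if (rank == 0) {                        // rank 0 is the sender
        strcpy(outMsg, "hello");
        MPI_Send(&outMsg, msgLen, MPI_CHAR, 1, tag, MPI_COMM_WORLD);
    } else if (rank == 1) {                 // rank 1 is the receiver
        MPI_Recv(&inMsg, msgLen, MPI_CHAR, 0, tag, MPI_COMM_WORLD, &stat);
        printf("Rank 1 received: %s\n", inMsg);
    }
    MPI_Finalize();
    return 0;
}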