Fundamentals of Parallelism on Intel Architecture
Cluster computers (distributed memory)
Multithreading (in shared memory) - each computer having multiple cores.
Vectorization (of floating-point math) - each core being able to process multiple data elements in a single instruction (SIMD).
Shorter time to solution.
Being able to process a lot more data.
Power efficiency.
Cost efficiency.
The number of transistors keeps increasing, but clock frequency and power consumption have stalled: raising the clock speed requires more power, and beyond a certain point no cooling solution can keep up - we hit the "power wall".
So we can't increase the clock speed to make our applications faster, but we do have plenty of transistors, so we want to use them to make our cores smarter. (Pipelining allows only so many stages, superscalar execution needs extra resources, and out-of-order execution isn't scalable.)
The chosen approach should be future-proof, optimized, and standards-compliant.
Memory (caches, hierarchy optimizations)
Threading (multiple cores)
Scalar tuning (pipelining)
Vectorizing
You can vectorize code either manually or automatically. "Intrinsic" functions allow manual vectorization, but they are not future-proof.
Compilers can auto-vectorize: they transform your scalar code into vector instructions by looking for loops.
The loop's trip count must be known in advance (countable loops, e.g. for loops).
No vector dependence between iterations.
The compiler applies auto-vectorization to innermost loops by default.
If you're calling functions inside the loop, they must be SIMD-enabled. (A minimal auto-vectorizable loop is sketched below.)
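A minimal sketch of such a loop (the function and array names are made up for illustration):
void scale(float* a, const float* b, int n) {   // hypothetical example
    for (int i = 0; i < n; i++)                 // countable trip count, innermost loop
        a[i] = 2.0f * b[i];                     // no cross-iteration dependence, unit stride
}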
#pragma omp simd enforces vectorization of loops.
Use it in cases where:
You want to override the innermost-loop default (e.g. vectorize an outer loop).
The compiler's own decision fails (it refuses to vectorize a loop that is actually safe).
You want to guide the vector length, declare a vector reduction (computing many partial results and then combining them, e.g. sum += each of them), etc.
The loop calls SIMD-enabled functions. (A reduction example is sketched below.)
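A sketch of enforcing vectorization with a reduction (a[], n, and sum are placeholders):
float sum = 0.0f;
#pragma omp simd reduction(+: sum)   // enforce vectorization; sum is a vector reduction
for (int i = 0; i < n; i++)
    sum += a[i];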
For conditionals (e.g. an if inside a loop you want to vectorize), the hardware performs the operation on all lanes first and then uses a mask to apply the condition.
The mask is just a bit pattern: if your result is 11001000 and the condition keeps, say, every third lane, the hardware simply ANDs 11001000 with the mask 00100100.
Functions with #pragma omp declare simd on top:
The compiler produces three versions of the function:
Scalar
Vector
Masked
So you can call this function from a loop in any of the three ways. (A sketch follows.)
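A sketch under these assumptions (my_func, a, b, and n are made-up names):
#pragma omp declare simd
float my_func(float x) {         // compiler emits scalar, vector, and masked versions
    return x * x + 1.0f;
}

#pragma omp simd
for (int i = 0; i < n; i++)
    a[i] = my_func(b[i]);        // the vector version is called from the vectorized loop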
Safe vs unsafe to vectorize.
If you're iterating with i++ and a[i] depends on the previous element a[i-1] - not safe (true dependence).
If you're iterating with i++ and a[i] depends on a[i+1] - safe, because no iteration needs a result computed by an earlier iteration.
If two arrays (passed as pointers) can access the same memory locations (aliasing) - not safe, unless you tell the compiler otherwise. (Both dependence cases are sketched below.)
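A sketch of the two dependence cases (a[] and n are placeholders):
for (int i = 1; i < n; i++)
    a[i] = a[i-1] + 1.0f;    // not safe: each iteration reads the previous iteration's result
for (int i = 0; i < n - 1; i++)
    a[i] = a[i+1] + 1.0f;    // safe: reads only values the loop has not yet overwritten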
Strip mining makes it easy for the compiler to recognize loops that can be vectorized.
You split the loop into two nested loops: the outer one runs from 0 to the largest multiple of STRIP not exceeding n (i.e. n - n%STRIP), in strides of ii += STRIP.
The inner loop runs from the outer variable ii to ii + STRIP, doing the work of the original loop.
const int STRIP = 16;                       // strip length, e.g. the vector width
const int nStrip = n - n % STRIP;           // largest multiple of STRIP not exceeding n
for (int ii = 0; ii < nStrip; ii += STRIP) {
    for (int i = ii; i < ii + STRIP; i++) {
        /* do stuff: vectorizable inner loop of fixed length STRIP */
    }
}
for (int i = nStrip; i < n; i++) {
    /* do stuff: remainder loop */
}
At the end you write a remainder loop to handle whatever is left (the main loop only covers nStrip = n - n%STRIP iterations, so up to STRIP-1 iterations of n may remain).
Strip mining is often used to allow vectorization to coexist with threads.
Multi-processing: independent streams of instructions, each process with its own memory address space.
Multi-threading: independent streams of instructions sharing one memory address space.
Include pthread.h; thread functions take a void* argument (e.g. a thread ID) and return void*.
Declare a variable of type pthread_t, e.g. pthread_t thr;
Start the thread with the call pthread_create(&thr, NULL, func, NULL);
For a clean exit - pthread_exit(NULL);
The code will not link without the link-line argument -lpthread.
If one thread modifies shared memory, the other threads see the change. (A minimal sketch follows.)
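A minimal sketch putting the pthread pieces together (the worker function and its message are made up):
#include <cstdio>
#include <pthread.h>

void* worker(void* threadID) {              // thread functions take and return void*
    printf("Hello from thread %ld\n", (long)threadID);
    return NULL;
}

int main() {
    pthread_t thr;
    pthread_create(&thr, NULL, worker, (void*)1L);  // start the worker thread
    pthread_join(thr, NULL);                        // wait for it to finish
    pthread_exit(NULL);                             // clean exit; link with -lpthread
}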
int status;
pid_t pid = fork();
After fork we end up with two copies of the process: fork() returns 0 in the child and the child's PID (nonzero) in the parent.
if (pid == 0) we are in the child process, else we are in the parent.
The parent calls wait(&status) for a clean exit.
Memory is not shared, so changes in one process do not affect the other. (A sketch follows.)
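A minimal sketch of the fork/wait pattern described above:
#include <cstdio>
#include <sys/wait.h>
#include <unistd.h>

int main() {
    int status;
    pid_t pid = fork();                        // from here on there are two copies of the process
    if (pid == 0) {
        printf("child\n");                     // fork() returned 0: this is the child
    } else {
        printf("parent of child %d\n", (int)pid);  // the parent gets the child's PID
        wait(&status);                             // wait for the child, for a clean exit
    }
    return 0;
}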
OpenMP (Open Multi-Processing):
#include <omp.h>
The initial thread can query const int n = omp_get_max_threads();
#pragma omp parallel
{
    // this code runs in parallel; omp_get_thread_num() returns the calling thread's number
}
Compiler arg -qopenmp
Setting the number of threads: export OMP_NUM_THREADS=5 (the default is the number of logical CPUs in your machine). A complete hello-world sketch follows.
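A minimal hello-world sketch (compile with -qopenmp, or -fopenmp for g++):
#include <cstdio>
#include <omp.h>

int main() {
    const int n = omp_get_max_threads();    // queried by the initial thread
    printf("Up to %d threads\n", n);
#pragma omp parallel
    {
        // each thread executes this block once
        printf("Hello from thread %d\n", omp_get_thread_num());
    }
    return 0;
}
Run with, e.g., OMP_NUM_THREADS=5 ./a.out.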
Threads are streams of instructions sharing memory.
Method 1 - C, C++, fortran
int A, B;
#pragma omp parallel private(A) shared(B)
{
// each thread has its own copy of A; all threads share B
}
Method 2 - C, C++
int B;
#pragma omp parallel
{
int A;   // declared inside the parallel region, so each thread gets its own A
}
#pragma omp parallel for
- Iterations distributed across available threads.
#pragma omp parallel
{
    // equivalent split form: a parallel region containing a work-sharing for
    #pragma omp for
    for (int j = 0; j < n; j++) { /* ... */ }
}
#pragma omp parallel for schedule(<mode>)
Scheduling mode is the algorithm with which iterations are assigned to threads.
Static - iterations are split into fixed chunks assigned to threads in advance; low overhead, but the load may end up unbalanced.
Dynamic - whichever thread finishes its chunk first gets the next one; balances the load, but has higher scheduling overhead.
Guided - like dynamic, but hands out larger chunks first and progressively smaller ones. (A sketch follows.)
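A sketch of requesting a schedule (work() and n are placeholders for an imbalanced loop):
#pragma omp parallel for schedule(dynamic, 4)   // hand out chunks of 4 iterations on demand
for (int i = 0; i < n; i++)
    work(i);                                    // iteration costs vary, so dynamic balances the load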
A data race: two or more threads access the same memory address and at least one of them writes.
Mutual exclusion constructs protect against data races by serializing access; we don't want them in the innermost loop because serialization slows it down.
Critical:
#pragma omp parallel
{
#pragma omp critical
{
// large pieces of code can be protected here; only one thread at a time executes this block
}
}
Atomic:
#pragma omp parallel
{
#pragma omp atomic
sum += i;   // atomic protects exactly one simple statement
}
#pragma omp parallel for reduction(+: total)
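For example, the sum computed by hand in the "behind the scenes" version below can be written with the clause directly (n is a placeholder):
int total = 0;
#pragma omp parallel for reduction(+: total)   // each thread gets a private copy of total
for (int i = 0; i < n; i++)
    total += i;                                // the private copies are combined at the end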
Reduction behind the scenes (what the compiler generates):
int total = 0;
#pragma omp parallel
{
int total_thr = 0;
#pragma omp for
for(int i = 0; i < n; i++)
total_thr += i;
#pragma omp atomic
total += total_thr;
}
Data reuse in caches.
Vector arithmetic is cheap, memory access is expensive.
More than ~50 arithmetic operations per memory access - compute-bound application.
Fewer than ~50 - bandwidth-bound application.
Knights landing - KNL:
MCDRAM on package
DDR4 on platform
Core <> regs <> L1 <> L2 <> MCDRAM (flat, cache, or hybrid mode) <> DDR4
Xeon:
Core <> regs <> L1 <> L2 <> LLC (last level cache, shared between cores) <> DDR4 RAM (main memory).
QPI - quick path interconnect (NUMA - non uniform mem access) (transparent to the programmer).
Coprocessor
Core <> regs <> L1 <> L2 <> GDDR5 (not shared with host system).
The coprocessor connects to the host via PCIe; this is not transparent to the programmer.
Flat mode
MCDRAM as a separate NUMA node.
You control what goes into the MCDRAM.
Cache mode
MCDRAM is used as the LLC.
It's used automatically by the hardware.
Hybrid mode
Half as a NUMA node and half as cache.
Does the working set fit in the 16 GB of MCDRAM?
If yes, numactl lets you run the application out of MCDRAM directly.
No code modification required.
Can you partition the bandwidth-critical data down to <=16 GB?
If yes, manually allocate the bandwidth-critical memory in MCDRAM.
memkind calls need to be added to the code.
Otherwise, or if you prefer, run in cache mode.
The chip figures out how to use the MCDRAM.
No code modification required.
numactl -H   (list the NUMA nodes and their memory sizes)
numactl --membind=<node>   (bind the application's memory to a given node)
#include <hbwmalloc.h>
const int n = 1<<10;
double *A = (double*) hbw_malloc(sizeof(double)*n);   // allocated in MCDRAM
Aligned allocator
double *B;
int ret = hbw_posix_memalign((void**) &B, 64, sizeof(double)*n);   // 64-byte (cache line) alignment
Free
hbw_free(A);
hbw_free(B);
Compilation
g++/icpc progname.cc -lmemkind -o runme
numactl -m 1 ./application   (in flat mode, the MCDRAM typically appears as node 1)
For when you don't want to pollute the cache with data that will not be reused: streaming (nontemporal) stores.
#pragma vector nontemporal
Compiler option: -qopt-streaming-stores=always
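#pragma vector nontemporal is an Intel compiler hint; a minimal sketch (a[] and n are placeholders):
#pragma vector nontemporal      // stream the stores past the caches
for (int i = 0; i < n; i++)
    a[i] = 0.0;                 // a[] is only written here and not reused soon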
Interactions with the cache are not performance-transparent to the programmer.
You always get a whole cache line: the minimum amount you can access at a time is fixed, and on Intel architecture it is 64 bytes.
The parallelism built into the memory subsystem.
Locality means nearby memory elements are used as much as possible instead of accessed at random; it's good to maintain unit-stride memory access.
An array of structures (AoS) is wasteful; a structure of arrays (SoA) is better for unit-stride access.
Array of structures: {x, y, z, x, y, z, ...} - you want only the x's but stride over y and z, vs.
Structure of arrays: st.x[0], st.x[1], ... are contiguous. (A sketch of both layouts follows.)
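A sketch of the two layouts (names and sizes are made up):
struct PointAoS { float x, y, z; };   // array of structures: x, y, z interleaved in memory
PointAoS aos[1024];                   // reading only aos[i].x strides over y and z as well

struct PointsSoA {                    // structure of arrays: each coordinate is contiguous
    float x[1024], y[1024], z[1024];
};
PointsSoA st;                         // st.x[0], st.x[1], ... is unit-stride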
Loop tiling - how do you optimize these loops so that the b[j] values are reused while they are still in cache?
for(int i=0; i<m; i++)
for(int j=0; j<n; j++)
compute(a[i], b[j]);
Step 1: Strip mining the inner loop.
for(int i=0; i<m; i++)
for(int jj=0; jj<n; jj+=TILE)
for(int j=jj; j<jj+TILE; j++)
compute(a[i], b[j]);
Step 2: Permute the loops so that the tile b[jj..jj+TILE) is reused for every i while it's still in cache.
for(int jj=0; jj<n; jj+=TILE)
for(int i=0; i<m; i++)
for(int j=jj; j<jj+TILE; j++)
compute(a[i], b[j]);
Or tile the outer loop as well and unroll it - "unroll and jam", a.k.a. register blocking.
MPI (message passing interface) is a framework for distributed memory programming. It allows you to run your application on multiple processors in a cluster.
One MPI rank per core isn't practical these days given how many cores modern processors have (e.g. 72 cores supporting up to 4-way hyper-threading).
So instead, use MPI to scale across compute nodes and OpenMP to scale across cores.
Good MPI implementations are aware of OpenMP and work with it.
#include "mpi.h"
First call: MPI_Init(&argc, &argv);  /* argc and argv are the arguments of main: int main(int argc, char* argv[]) */
Exit with: MPI_Finalize();
Two processes participate in point-to-point communication: one is the sender and the other is the receiver. All other processes do not have to wait for this communication to finish.
MPI_Send(&outMsg, msgLen, MPI_CHAR, receiver, tag, MPI_COMM_WORLD);    // char outMsg[msgLen]; "receiver" is a rank number
MPI_Recv(&inMsg, msgLen, MPI_CHAR, sender, tag, MPI_COMM_WORLD, &stat);   // char inMsg[msgLen]; "sender" is a rank number, stat is an MPI_Status
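A minimal two-rank sketch of the calls above (message length and contents are made up; run with mpirun -np 2):
#include <cstdio>
#include <cstring>
#include "mpi.h"

int main(int argc, char* argv[]) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // this process's rank number

    const int msgLen = 16, tag = 0;
    char outMsg[msgLen], inMsg[msgLen];
    MPI_Status stat;

    if (rank == 0) {                        // rank 0 is the sender
        strcpy(outMsg, "hello");
        MPI_Send(&outMsg, msgLen, MPI_CHAR, 1, tag, MPI_COMM_WORLD);
    } else if (rank == 1) {                 // rank 1 is the receiver
        MPI_Recv(&inMsg, msgLen, MPI_CHAR, 0, tag, MPI_COMM_WORLD, &stat);
        printf("Rank 1 received: %s\n", inMsg);
    }
    MPI_Finalize();
    return 0;
}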