Fundamentals of Parallelism on Intel Architecture

Questions

  1. Fundamentals of parallelism
    1. What are the three layers of parallelism?
    2. What are the four different things "increased performance" could mean?
    3. What are some usual "speed increase" go-tos, and why aren't they good enough?
    4. What are some features of "modern code"?
    5. What are the different areas optimization can be done?
  2. Vectorization
    1. What are the two ways to vectorize your code?
    2. Limitations of automatic vectorization
    3. What does #pragma omp simd do?
    4. What is the "masked" version of the function used for?
    5. What are SIMD-enabled functions?
    6. What is vector dependence, and what is safe vectorization?
    7. What is strip mining?
  3. Multi-processing and multi-threading
    1. Two ways to run multiple streams of instructions
    2. What's the syntax for creating threads?
    3. What's the syntax for creating processes?
    4. Syntax - OpenMP
    5. What are the two methods for variable sharing in OpenMP parallel regions?
    6. What's the syntax of FOR loops in OpenMP?
    7. What are race conditions?
      1. What are mutexes?
    8. What is reduction? (sum += i type cases)
  4. Memory organization and caches
    1. What else should you worry about if you want data vectorization to matter? And why?
    2. What's the difference between a FLOPs-per-memory-access ratio above and below 50?
    3. Memory hierarchy in KNL, Xeon, and KNC
    4. What are the high bandwidth memory modes (for MCDRAM)?
    5. How do bandwidth-bound applications use high-bandwidth memory?
    6. What's the syntax for finding which mode your MCDRAM is in?
    7. What's the syntax for the memkind library?
    8. Syntax for using numactl?
    9. When are streaming stores used?
    10. Locality in space
      1. Is the cache programmable?
      2. Even when you want only one floating-point number from memory, what does the cache line look like?
    11. What parallelism do we need to take advantage of?
    12. What is unit stride memory access?
    13. Array of structs vs struct of arrays
    14. Locality in time
  5. Distributed Memory Programming
    1. What's MPI?
    2. Why hybrid MPI + OpenMP?
    3. Syntax for MPI?
    4. What is peer-to-peer messaging?
    5. MPI send command's syntax
    6. MPI receive command's syntax

Fundamentals of parallelism

What are the three layers of parallelism?

What are the four different things "increased performance" could mean?

What are some usual "speed increase" go-tos, and why aren't they good enough?

What are some features of "modern code"?

What are the different areas optimization can be done?

Vectorization

What are the two ways to vectorize your code?

Limitations of automatic vectorization

What does #pragma omp simd do?
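
A minimal sketch (assuming a, b, c are non-overlapping float arrays): the pragma tells the compiler to vectorize the loop that follows, overriding its own dependence analysis.

#pragma omp simd
for (int i = 0; i < n; i++)
    c[i] = a[i] + b[i];    // iterations execute in SIMD lanes instead of one at a time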

What is the "masked" version of the function used for?

What are SIMD-enabled functions?
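
A sketch of a SIMD-enabled function (my_scale is a hypothetical name): #pragma omp declare simd makes the compiler emit a vector version, plus a masked version for call sites under a condition, so only the active lanes do the work.

#pragma omp declare simd
float my_scale(float x) {        // compiler also generates vector and masked variants
    return 2.0f * x;
}

#pragma omp simd
for (int i = 0; i < n; i++)
    if (a[i] > 0.0f)             // the condition becomes the mask for the vector call
        b[i] = my_scale(a[i]);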

What is vector dependence, and what is safe vectorization?
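
For example (a sketch assuming a dependence distance of 16): the loop below reads a value written 16 iterations earlier, so vectors up to 16 elements wide are safe, and safelen communicates exactly that to the compiler.

#pragma omp simd safelen(16)
for (int i = 16; i < n; i++)
    a[i] = 2.0f * a[i - 16];     // backward reference with distance 16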

What is strip mining?

const int STRIP = 16;                 // example value; typically the SIMD vector length
const int nPrime = n - n % STRIP;     // largest multiple of STRIP not exceeding n
for (int ii = 0; ii < nPrime; ii += STRIP)   // outer loop walks over strips
    for (int i = ii; i < ii + STRIP; i++)    // inner loop has a fixed trip count: vectorizable
        ;  // do stuff
for (int i = nPrime; i < n; i++)             // remainder loop for the leftover iterations
    ;  // do stuff

Multi-processing and multi-threading

Two ways to run multiple streams of instructions

What's the syntax for creating threads?
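
A minimal POSIX threads sketch (worker is a hypothetical entry point; compile with -pthread):

#include <pthread.h>

void *worker(void *arg) {     // thread entry point; arg carries per-thread data
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, worker, NULL);   // start a second stream of instructions
    pthread_join(t, NULL);                    // wait for it to finish
    return 0;
}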

What's the syntax for creating processes?
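
A minimal fork() sketch: the child gets its own copy of the address space, so nothing is shared by default.

#include <unistd.h>
#include <sys/wait.h>

int main(void) {
    pid_t pid = fork();       // duplicate the current process
    if (pid == 0) {
        // child: runs in a separate address space from the parent
        _exit(0);
    }
    wait(NULL);               // parent waits for the child to exit
    return 0;
}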

Syntax - OpenMP

#pragma omp parallel
{
    // this block is executed by every thread
    int tid = omp_get_thread_num();   // needs #include <omp.h>
}

What are the two methods for variable sharing in OpenMP parallel regions?

int A, B;
#pragma omp parallel private(A) shared(B)
{
    // each thread gets its own uninitialized copy of A; B is shared by all
}

int B;                       // declared outside the region: shared by default
#pragma omp parallel
{
    int A;                   // declared inside the region: automatically private to each thread
}

What's the syntax of FOR loops in OpenMP?

#pragma omp parallel
{
    // the loop's iterations are divided among the threads of the region
    #pragma omp for
    for (int j = 0; j < n; j++)
        do_work(j);          // do_work is a hypothetical per-iteration task
}

What are race conditions?
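
A sketch of the classic race: several threads do an unsynchronized read-modify-write on the same shared variable, so updates are lost and the result varies from run to run.

int sum = 0;
#pragma omp parallel for
for (int i = 0; i < n; i++)
    sum += i;     // read-modify-write on shared sum with no protection: a race condition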

What are mutexes?

#pragma omp parallel
{
    #pragma omp critical
    {
        // only one thread at a time executes this block:
        // use for protecting larger pieces of code
    }
}

#pragma omp parallel
{
    #pragma omp atomic       // cheaper: protects a single one-line memory update
    sum += i;
}

What is reduction? (sum += i type cases)

int total = 0;
#pragma omp parallel
{
    int total_thr = 0;               // private partial sum per thread
    #pragma omp for
    for (int i = 0; i < n; i++)
        total_thr += i;
    #pragma omp atomic               // combine the partial sums safely
    total += total_thr;
}
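
OpenMP can do the same in one line with the reduction clause, which creates the private copies and the final combination automatically:

int total = 0;
#pragma omp parallel for reduction(+: total)
for (int i = 0; i < n; i++)
    total += i;   // each thread accumulates a private copy; OpenMP sums them at the end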

Memory organization and caches

What else should you worry about if you want data vectorization to matter? And why?

What's the difference between a FLOPs-per-memory-access ratio above and below 50?

Memory hierarchy in KNL, Xeon, and KNC

What are the high bandwidth memory modes (for MCDRAM)?

Using "high bandwidth memory" bandwidth bound applications?

What's the syntax for finding which mode your MCDRAM is in?
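
One way (on KNL): list the NUMA nodes. In flat mode the MCDRAM appears as a separate memory-only node (memory but no CPUs); in cache mode it is invisible to the OS.

numactl --hardware    # flat mode shows an extra NUMA node with memory but no CPUs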

What's the syntax for the memkind library?

#include <hbwmalloc.h>     // memkind's high-bandwidth-memory interface; link with -lmemkind
const int n = 1<<10;
double *A = hbw_malloc(sizeof(double)*n);      // allocate n doubles in MCDRAM
double *B;
int ret = hbw_posix_memalign((void**)&B, 64, sizeof(double)*n);  // 64-byte-aligned HBW allocation
hbw_free(A);               // HBW allocations are released with hbw_free
hbw_free(B);

Syntax for using numactl?
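
A sketch of common numactl invocations (the node number 1 is an assumption; check --hardware first):

numactl --hardware               # list NUMA nodes and their memory sizes
numactl --membind=1 ./app        # bind all allocations to node 1 (e.g. MCDRAM in flat mode)
numactl --preferred=1 ./app      # prefer node 1, fall back to DDR when it fills up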

When are streaming stores used?
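
They bypass the cache and skip the read-for-ownership of the target line, which pays off when writing large arrays that won't be re-read soon. With the Intel compiler they can be requested per loop (a sketch):

#pragma vector nontemporal       // Intel compiler hint: use streaming stores for this loop
for (int i = 0; i < n; i++)
    a[i] = 0.0;                  // written once and not re-read: caching it would only pollute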

Locality in space

Is the cache programmable?

Even when you want only one floating-point number from memory, what does the cache line look like?

What parallelism do we need to take advantage of?

What is unit stride memory access?
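
A sketch of the contrast (assuming float a[n]): unit stride touches consecutive addresses, so every byte of each fetched cache line is used; a large stride wastes most of each line.

for (int i = 0; i < n; i++)      // unit stride: a[0], a[1], a[2], ...
    sum += a[i];

for (int i = 0; i < n; i += 16)  // stride 16: only one float per 64-byte cache line is used
    sum += a[i];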

Array of structs vs struct of arrays
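
A sketch of the two layouts: sweeping over all x coordinates is strided in the AoS layout but unit-stride (and therefore vectorizable) in the SoA layout.

enum { N = 1024 };                            // hypothetical element count

struct PointAoS  { float x, y, z; };
struct PointAoS  aos[N];                      // AoS: aos[i].x values are 12 bytes apart (strided)

struct PointsSoA { float x[N], y[N], z[N]; };
struct PointsSoA soa;                         // SoA: soa.x[i] values are contiguous (unit stride)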

Locality in time

// baseline: for each i, the whole of b[] streams through the cache
for (int i = 0; i < m; i++)
    for (int j = 0; j < n; j++)
        compute(a[i], b[j]);

// strip-mined in j, but jj still sits inside i: no reuse gained yet
for (int i = 0; i < m; i++)
    for (int jj = 0; jj < n; jj += TILE)
        for (int j = jj; j < jj + TILE; j++)
            compute(a[i], b[j]);

// loop tiling: interchange i and jj so one TILE-sized block of b[]
// stays in cache while every a[i] reuses it (locality in time)
for (int jj = 0; jj < n; jj += TILE)
    for (int i = 0; i < m; i++)
        for (int j = jj; j < jj + TILE; j++)
            compute(a[i], b[j]);

Distributed Memory Programming

What's MPI?

Why hybrid MPI + OpenMP?

Syntax for MPI?
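
A minimal sketch of the boilerplate every MPI program needs (compile with mpicc, launch with mpirun):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);                  // start the MPI runtime
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    // this process's rank (0 .. size-1)
    MPI_Comm_size(MPI_COMM_WORLD, &size);    // total number of processes
    printf("rank %d of %d\n", rank, size);
    MPI_Finalize();                          // shut down MPI
    return 0;
}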

What is peer-to-peer messaging?

MPI send command's syntax

// here char outMsg[msgLen]; "receiver" is the destination rank
MPI_Send(outMsg, msgLen, MPI_CHAR, receiver, tag, MPI_COMM_WORLD);

MPI receive command's syntax

// here char inMsg[msgLen]; "sender" is the source rank; stat is an MPI_Status
MPI_Recv(inMsg, msgLen, MPI_CHAR, sender, tag, MPI_COMM_WORLD, &stat);