ATAL_FDP_HPC_S6_OpenMP

Created by Pushpendra Pateriya

49 Slides • 5 Questions

1

Parallel Programming with OpenMP

media

Pushpendra Kumar Pateriya
Assistant Professor, Head of System Programming
School of Computer Science and Technology,
Lovely Professional University

2

Multiple Choice

Why is parallel programming needed in modern computing?

1. To improve performance and efficiency in multi-core and large-scale computing.
2. To eliminate the need for data synchronization.
3. To simplify programming for single-core processors.
4. To replace all sequential programs.

3

Why do we need parallel architectures and parallel programming?

media

4

Moore's Law

The chart tracks the progress of transistor integration for Intel's devices from 1960 to 2010.

media

5

Computer Architecture and the Power Wall

media

6

Partial solution: simple low power cores

media

7

Power calculation for analysis

media

8

Power Consumption Analysis

media

9

Microprocessor trends

media

10

Concurrency vs. Parallelism

  • Concurrency: A condition of a system in which multiple tasks are logically active at one time.

  • Parallelism: A condition of a system in which multiple tasks are actually active at one time.

media

11

Concurrency vs. Parallelism

media

12

Multiple Choice

What is the key difference between parallel and concurrent execution?

1. Parallel execution happens simultaneously, while concurrent execution involves task switching.
2. Concurrent execution runs tasks in a sequence without overlap.
3. Concurrent execution requires multiple processors.
4. Parallel execution is only possible on single-core processors.

13

The Parallel programming process:

media

14

OpenMP Overview

  • Definition: OpenMP (Open Multi-Processing) is an API for parallel programming in C, C++, and Fortran.

  • Purpose: Enables shared-memory multiprocessing on multi-core and multi-threaded processors.

  • Components:

    • Compiler directives (#pragma omp ...)

    • Runtime library routines

    • Environment variables

Advantages:

  • Simplicity: Uses directives instead of rewriting code

  • Scalability: Works on multi-core architectures

  • Portability: Supported by most modern compilers

15

OpenMP Limitations

  • Works only for shared-memory architectures

  • Manual tuning required for performance optimization

  • Not suitable for all types of parallelism (e.g., distributed memory systems)

16

OpenMP Solution Stack

media

17

Single and Multithreaded Process

media

18

Programming shared memory computers

media

19

Programming shared memory computers

media

20

Multiple Choice

Which of the following statements is true regarding the stack, heap, text, and data segments in the context of threads?

1. The stack is shared among all threads, while the heap is unique to each thread.
2. Both the stack and heap are unique to each thread.
3. The text segment is unique to each thread, while the data segment is shared among all threads.
4. The stack is unique to each thread, while the heap is shared among all threads.

21

A shared memory program

An instance of a program:

  • One process, multiple threads sharing a common address space.

  • Threads interact via shared memory using reads/writes.

  • OS scheduler interleaves thread execution for fairness.

  • Synchronization ensures correctness across all execution orders.

media

22

Exercise 1, Part A: Hello world

What will be the output of the following code?

#include <stdio.h>

int main()
{
    int ID = 0;
    printf(" hello(%d) ", ID);
    printf(" world(%d) \n", ID);
    return 0;
}

23

Open Ended

What will be the output of the following C program?

#include <stdio.h>

int main()
{
    int ID = 0;
    printf(" hello(%d) ", ID);
    printf(" world(%d) \n", ID);
    return 0;
}

24

Modified Code

media

25

A multi-threaded “Hello world” program

media

26

Shared Memory Computers

Symmetric Multiprocessor (SMP)

  • All processors have equal access time to the shared memory.

  • The operating system treats all processors identically.

  • Example: Intel Xeon-based servers with multiple CPUs sharing the same memory.

Non-Uniform Memory Access (NUMA)

  • Memory access time varies depending on the location of the data.

  • Memory is divided into "Near" and "Far" regions, where access to local (near) memory is faster than remote (far) memory.

  • Example: AMD EPYC processors with multiple interconnected memory regions.

27

OpenMP Programming Model

Fork-Join Parallelism:

  • Master thread spawns a team of threads as needed.

  • Parallelism added incrementally until performance goals are met: i.e. the sequential program evolves into a parallel program.

media

28

Thread Creation: Parallel Regions

For example, to create a 4-thread parallel region:

media

29

Thread Creation: Parallel Regions

media

30

Exercise

media

31

Serial pi program

media

32

Write a parallel program

  • Create a parallel version of the pi program using a parallel construct.

  • Pay close attention to shared versus private variables.

  • In addition to a parallel construct, you will need the runtime library routines:

  • int omp_get_num_threads(); // Number of threads in the team

  • int omp_get_thread_num(); // Thread ID or rank

  • double omp_get_wtime(); // Time in seconds since a fixed point in the past

33

A simple Parallel pi program

media

34

Results

media

35

False sharing

media

36

Eliminate False sharing by padding the sum array

media

37

Results: pi program padded accumulator

media

38

Do we really need to pad our arrays?

Padding arrays requires a good understanding of how the cache works. If you switch to a computer with a different cache size, your program's performance may drop significantly.

39

How do threads interact?

  • OpenMP is a multi-threading, shared address model.

    • Threads communicate by sharing variables.

  • Unintended sharing of data causes race conditions:

    • race condition: when the program’s outcome changes as the threads are scheduled differently.

  • To control race conditions:

    • Use synchronization to protect data conflicts.

  • Synchronization is expensive so:

    • Change how data is accessed to minimize the need for synchronization.

40

Synchronization

  • Synchronization: bringing one or more threads to a well-defined and known point in their execution.

  • The two most common forms of synchronization are:

media
media

41

Synchronization

  • High-level synchronization:

    • critical

    • atomic

    • barrier

    • ordered

  • Low-level synchronization:

    • flush

    • locks (both simple and nested)

42

Synchronization: Barrier

  • Barrier: Each thread waits until all threads arrive.

media

43

Synchronization: critical

  • Mutual exclusion: Only one thread at a time can enter a critical region.

media

44

Synchronization: Atomic

  • Atomic provides mutual exclusion but only applies to the update of a memory location (the update of X in the following example)

media

45

Pi program with false sharing

  • The original serial pi program with 100,000,000 steps ran in 1.83 seconds.

media

46

Using a critical section to remove impact of false sharing

media

47

Using a critical section to remove impact of false sharing

media

48

Open Ended

Any Question till now?

49

The loop worksharing Constructs

  • The loop worksharing construct splits up loop iterations among the threads in a team

media

50

Loop worksharing Constructs

media

51

Both are equivalent

Put the “parallel” and the worksharing directive on the same line

media

52

Nested loops

collapse clause

  • Will form a single loop of length NxM and then parallelize that.

  • Useful if N is O(number of threads), where parallelizing only the outer loop makes load balancing difficult.

media

53

Reduction

  • OpenMP reduction clause: reduction (op : list)

  • Inside a parallel or a work-sharing construct:

    • A local copy of each list variable is made and initialized depending on the “op” (e.g. 0 for “+”).

    • Updates occur on the local copy.

    • Local copies are reduced into a single value and combined with the original global value.

  • The variables in “list” must be shared in the enclosing parallel region.

media
media

54

-- Pushpendra Kumar Pateriya

"That brings us to the end of today's session. We explored OpenMP"
"Thank you everyone" 

media
