HPC Xeon Phi Exercise: Hands-On Lab


Introduction

There are two kinds of hardware accelerators in common use: GPUs and coprocessors. Coprocessors generally offer fewer cores than a GPU, but a coprocessor's cores:

  1. Are usually faster (in terms of clock speed) than the cores on a GPU.
  2. Offer greater functionality (in terms of instruction set) than the cores on a GPU.
  3. Let you write programs using the same languages (e.g., C, C++, etc) and libraries (e.g., MPI, OpenMP, etc) as a normal CPU.
Intel has a coprocessor line they call the Xeon Phi. Our supercomputer has a special "accelerator" node acc.calvin.edu with a Xeon Phi 3120 coprocessor that we can use to explore the Phi's capabilities.

Using ssh, log in to acc.calvin.edu the same way you would log in to dahl.calvin.edu. Make certain this works before proceeding further.

One-Time Setup

On acc.calvin.edu, generate a new public-private keypair by entering:

   ssh-keygen -t rsa
When prompted for a passphrase, just press 'Enter', as we want to be able to log in to the Phi without entering a password. This will generate public and private keys in your ~/.ssh folder on acc.

The Xeon Phi is running its own version of Linux, and has its own file system separate from that of its host computer. Because of this, we need to sync the keys we just generated on the host's file system to the file system on the Phi. To sync these keys, enter:

   /sbin/micctrl --sshkeys
Then enter:
   ssh mic0
to verify that you can login to the Phi without a password.

(If this doesn't work and the Phi prompts you for a password, try entering the password changeme to login using a password. Then on the Phi, enter these commands, which do manually what the micctrl command should have done for you:

   mkdir .ssh
   chmod 700 .ssh
   cp /homex/yourUserName/.ssh/id_rsa.pub .ssh/authorized_keys
   chmod 600 .ssh/authorized_keys
Then logout from the Phi and try to ssh to mic0 again.)

Make certain this works before proceeding, as that will make the rest of the exercise go more smoothly. When you have successfully logged into the Phi, log out and proceed.

Today's exercise contains a number of questions for you to answer, so you may want to open a blank document and take notes as you go.

Background

The cores on the Phi coprocessor use a special architecture Intel calls the Many Integrated Core (MIC) architecture. Each MIC core has four hardware threads and two 512-bit vector units. The instruction set is based on the original Pentium (586) instructions, with additional SIMD instructions for using the vector units, plus other extensions.
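
To see what those SIMD instructions are used for, consider the short example below. It is purely illustrative (the saxpy function and the omp simd pragma are our own choices, not part of today's exercise files): when icc vectorizes the loop, each vector instruction can operate on 16 floats (16 x 32 bits = 512 bits) at once.

   #include <stdio.h>

   /* Illustrative only: icc can compile the loop in saxpy() into MIC vector
      instructions that process 16 floats (512 bits) per instruction. */
   void saxpy(int n, float a, const float* x, float* y) {
       int i;
       #pragma omp simd                      /* hint: vectorize this loop */
       for (i = 0; i < n; i++) {
           y[i] = a * x[i] + y[i];
       }
   }

   int main(void) {
       float x[16], y[16];
       int i;
       for (i = 0; i < 16; i++) { x[i] = (float) i; y[i] = 1.0f; }
       saxpy(16, 2.0f, x, y);
       printf("y[15] = %f\n", y[15]);        /* expect 31.0 */
       return 0;
   }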

Intel sometimes calls the Phi a "cluster on a chip", in which the "cluster" consists of cores connected via a ring network, as follows:

[Figure: Intel's MIC architecture]

Since the Phi's cores are connected via a network AND they share memory, we can write programs for it using:

  1. OpenMP, like a shared-memory manycore CPU;
  2. MPI, like a distributed-memory cluster; or
  3. both MPI+OpenMP, using a heterogeneous approach.
In today's exercise, we will see how to use all three options.

The MIC architecture is an extension of Intel's standard x86 architecture. (For example, it contains instruction-set support to use the vector units each Xeon Phi core has.) To provide maximum flexibility, Intel supports two basic approaches for using the Phi:

  1. "Native" applications are normal parallel applications that are cross-compiled for and run on the Xeon Phi. A "native" application can be run in any of several ways:
    1. By using an Intel program called micnativeloadex, we can launch a native program on the Phi coprocessor from the host computer.
    2. By (a) using scp to copy the program from the host computer to the Phi; (b) using ssh to login to the Phi; and (c) running the native program directly on the Phi.
    3. By (a) setting up the Network File System (NFS) on the Phi to mount the host computer's home directory; (b) using ssh to login to the Phi; and (c) running the native program on the Phi from the host's file system, via NFS.
  2. "Offload" applications are parallel applications that are compiled and run on the host, but one or more parts of the computation are "offloaded" to the Xeon Phi, using a special Intel directive.

We will see how to build and run both kinds of application in this exercise. Since the Phi is an Intel product, it is easiest to use Intel's compilers: icc for C programs using OpenMP, and mpiicc for C programs using MPI.

Getting Started

On acc, copy the files for today's exercise into your home directory, as follows:

   cp -r /home/cs/374/exercises/11 phi
Then cd to your new phi directory.

Part 1. OpenMP

Environment Setup for OpenMP

There are a few environment variables we need to set up to use Intel's C compiler and the Xeon Phi's MIC architecture. Intel has provided a shell-script that we can use to set up our environment for using Intel's C compiler icc. To run this script, enter the following command:

   source /opt/intel/composerxe/bin/compilervars.sh intel64
Next, we need to set our environment so that icc can find the libraries needed to use the Xeon Phi (MIC) architecture. This can be done by entering:
   export SINK_LD_LIBRARY_PATH=/opt/intel/composer_xe_2015.5.223/compiler/lib/mic/:/opt/mpss/3.6/sysroots/k1om-mpss-linux/lib64

Note that the scope of these environment variables is your current shell. Each time you create a new shell, you will need to re-enter these commands. If you think you might be using the Phi frequently, you may want to put these commands in your .bash_profile file, so that you don't have to repeat them each session.

1.A. "Native" OpenMP Applications

Use the cd command to change your working directory to the 01.nativeOpenMP directory. Then use the ls command to view its contents.

Each directory for today's exercise will contain similar files.

The provided Makefile will build two applications: one for the regular x86 architecture on the host's Xeon CPU, and one for the MIC architecture on the Xeon Phi. (Feel free to view the contents of the Makefile to see how it does this.)

To build both applications, enter:

   make
and you should see two compile commands performed:
   icc -openmp spmdOpenMP.c -o spmdOpenMP
   icc -openmp -mmic spmdOpenMP.c -o mic-spmdOpenMP
Both commands use Intel's C compiler, icc, and the -openmp switch to process OpenMP directives. The first command builds spmdOpenMP for the x86 architecture; the second command includes the -mmic switch to build mic-spmdOpenMP for the MIC architecture.

You can run the x86 application the usual way. Try:

   ./spmdOpenMP
and
   ./spmdOpenMP 4
to get a sense of how it works.

Take a few minutes to examine its source code. There, you'll see that the program uses the OpenMP call omp_get_num_procs() to determine how many processors (cores) are available in the system.
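
As a point of reference while you read, a minimal SPMD-style OpenMP program has roughly the following shape. This is only a sketch; the variable names and output format are our own assumptions, not the actual contents of the provided source file.

   #include <stdio.h>
   #include <stdlib.h>
   #include <omp.h>

   int main(int argc, char** argv) {
       int numThreads = omp_get_num_procs();      /* default: one thread per core */
       if (argc > 1) {
           numThreads = atoi(argv[1]);            /* optional command-line override */
       }
       omp_set_num_threads(numThreads);

       #pragma omp parallel
       {
           int id = omp_get_thread_num();
           printf("Hello from thread %d of %d\n", id, omp_get_num_threads());
       }
       return 0;
   }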

Question 1: How many cores does OpenMP report our host computer having?

Next, let's try to run the MIC version the same way. Try:

   ./mic-spmdOpenMP
You should see an error message, because the binary instruction set for this program is for the MIC architecture, not the x86 architecture.

As mentioned previously, there are different ways to run a "native" application on the Xeon Phi, so let's examine those next.

1.A.i. Running a "Native" Application From the Host

One way to run a "native" application is to use Intel's micnativeloadex program, which lets you launch the "native" application from the host. To use this approach, enter:

   micnativeloadex ./mic-spmdOpenMP 
(If that doesn't work for some reason, enter the full path to the command: /opt/intel/mic/bin/micnativeloadex ./mic-spmdOpenMP.)

Question 2: How many cores does OpenMP report our Xeon Phi having?

We may want to control the number of threads on the Phi, like we did on the CPU. However, if we try to use a command line argument the same way:

   micnativeloadex ./mic-spmdOpenMP 4
that argument gets taken as an argument for micnativeloadex, not our program, generating an error. To tell micnativeloadex to pass the 4 on to our program as an argument, we can use the -a switch:
   micnativeloadex ./mic-spmdOpenMP -a 4
Congratulations, you've just run your first "native" application on the Xeon Phi coprocessor!

1.A.ii. Running a "Native" Application From the Phi

The Xeon Phi is a fully functional computer, running its own (Linux) operating system, but it is visible only from its host. On our host computer, the Xeon Phi is known as mic0. (If we had multiple Xeon Phis, they would be mic0, mic1, mic2, and so on.)

If everything is configured properly, you should be able to ssh to the Phi, so take a moment to verify that by entering:

   ssh mic0
Use the ls command to examine the contents of your home directory.

Question 3: How do the contents of your home directory on the Xeon Phi compare to the contents of your home directory on the host?

Open a different terminal tab or window and use it to ssh to the host computer, so that in one tab/window you are logged into our host, and in the other you are logged into the Xeon Phi on that host. In your host window, cd to your 01.nativeOpenMP directory (the directory containing your "native" application) and enter:

   scp mic-spmdOpenMP mic0:~/
If all is well, this should copy the "native" application from the host to your home directory on the Phi.

Back in your Phi window, use the ls command to verify that the "native" application is now present.

Since you are on the Phi in that window, you can run the "native" application like a normal program:

   ./mic-spmdOpenMP 
or
   ./mic-spmdOpenMP 4
Verify that this is working correctly for you before continuing.

The Xeon Phi has no hard disk; its file system actually resides in the accelerator's RAM, and since RAM is volatile, user programs like mic-spmdOpenMP will not survive a reboot.

Space in RAM is also very limited -- our Phi has just 4GB of RAM for the operating system, user space, etc. Because the file space is limited, it is important to remove application files on the Phi when you are done using them, to avoid exhausting its file space prematurely. When you are done experimenting with mic-spmdOpenMP, enter:

   rm ./mic-spmdOpenMP
to remove the application.

This approach -- where you use scp to copy your application from the host to the Phi -- has obvious disadvantages: (a) you have two copies of the application taking up space, and (b) the copy on the Phi consumes its limited file space.

If you have a wonderful system administrator (which we do), a better approach is to have your sysadmin configure the Phi to mount the host's /home directory using the Network File System (NFS). Our sysadmin has done so using the name /homex, so in the terminal window where you are logged into the Phi, enter:

   cd /homex/your-user-name
Then use the ls command to view the directory's contents.

Question 4: How do the contents of your homex directory on the Xeon Phi compare to the contents of your home directory on the host?

Use the cd command to change your working directory to your 01.nativeOpenMP directory. Use the ls command to verify that mic-spmdOpenMP is present. Then enter:

   ./mic-spmdOpenMP
and/or
   ./mic-spmdOpenMP 4

Question 5: How are you able to run mic-spmdOpenMP on the Xeon Phi, when we just deleted mic-spmdOpenMP from our home directory on the Phi? What advantage(s) does this approach have, compared to copying the application from the host to the Phi?

1.B. "Offload" OpenMP Applications

On the host, use the cd command to change your working directory to the 02.offloadOpenMP directory. Then use the ls command to examine its contents. As before, you should see a Makefile, a README.txt, and a source file spmdOpenMP-offload.c. Take a few minutes to explore these files, especially spmdOpenMP-offload.c.

Question 6: What #pragma directive does this program contain that we have not seen before?

Use make to build the program. Note that the Makefile only builds a single program (spmdOpenMP-offload) this time.

This program can be run on the host in the usual way, for example:

   ./spmdOpenMP-offload 4
If you compare the behavior against the source code, you'll see that some of the code is running on the host and some of it is running on the Phi.

This is a key difference between the "Offload" approach and the "Native" approach: where the "Native" approach requires a separate version of the program that is cross-compiled for the MIC architecture, the "Offload" approach uses one or more #pragma directives to selectively build parts of a program for the MIC architecture, and causes those parts to be run on the Phi coprocessor, if one is available.
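
To make the idea concrete, here is a hedged sketch of the offload pattern. It is our own illustration, not the provided spmdOpenMP-offload.c: the block following Intel's #pragma offload directive is compiled for the MIC architecture and runs on the Phi, while the rest of the program runs on the host.

   #include <stdio.h>
   #include <omp.h>

   int main(void) {
       printf("Host sees %d cores\n", omp_get_num_procs());

       #pragma offload target(mic)                /* run this block on the Phi */
       {
           printf("Coprocessor sees %d cores\n", omp_get_num_procs());
           #pragma omp parallel
           {
               printf("  Phi thread %d of %d\n",
                      omp_get_thread_num(), omp_get_num_threads());
           }
       }
       return 0;
   }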

Question 7: What is the effect of Intel's #pragma offload directive?

1.C. Hybrid (CPU+Offload) OpenMP Applications

You may have noticed that spmdOpenMP-offload.c lets us control the number of threads on the CPU, but not on the Xeon Phi. This raises an obvious question: How can we also control the number of threads on the Phi?

You may have also noticed that in spmdOpenMP-offload.c, the code on the CPU runs, then the code on the Phi runs. This raises another question: How can we get code to run on both the CPU and the Phi simultaneously?

To answer these questions, change your working directory to 03.hybridOpenMP. There, you should see our usual 3 files: Makefile, README.txt, and a source file spmdOpenMP-hybrid.c. Intel uses the term "hybrid" to describe the situation where a computation simultaneously uses both the host CPU and the MIC. The program in spmdOpenMP-hybrid.c illustrates how to do this using OpenMP directives. This is different from what we have seen before, so take a few moments to view its contents.

The key feature in this program is its use of the OpenMP sections directive:

   #pragma omp sections
   {
      #pragma omp section
      {
          // section 1
      }
      #pragma omp section
      {
          // section 2
      }
      ...
      #pragma omp section
      {
          // section n
      }
   }
When performed, the sections directive causes the code in each section to be performed in parallel. The program in spmdOpenMP-hybrid.c uses this approach, placing the code to be run on the CPU's cores in one section, and placing the code to be run on the Phi's cores in a different section that contains Intel's #pragma offload directive.
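
Putting the two pieces together, the hybrid pattern looks roughly like the sketch below. The variable names and thread counts here are illustrative assumptions, not the ones used in spmdOpenMP-hybrid.c.

   #include <stdio.h>
   #include <omp.h>

   int main(void) {
       int hostThreads = 2, phiThreads = 4;       /* illustrative values */

       omp_set_nested(1);                         /* allow parallel regions inside the sections */

       #pragma omp parallel sections num_threads(2)
       {
           #pragma omp section                    /* section 1: runs on the host CPU */
           {
               #pragma omp parallel num_threads(hostThreads)
               printf("Host thread %d of %d\n",
                      omp_get_thread_num(), omp_get_num_threads());
           }
           #pragma omp section                    /* section 2: offloaded to the Phi */
           {
               #pragma offload target(mic)
               #pragma omp parallel num_threads(phiThreads)
               printf("Phi thread %d of %d\n",
                      omp_get_thread_num(), omp_get_num_threads());
           }
       }
       return 0;
   }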

Use make to build the program, then run it:

   ./spmdOpenMP-hybrid
By default, the program uses 1 thread on the host and 1 thread on the Phi, but you can supply command-line arguments for each:
   ./spmdOpenMP-hybrid 3 4
Experiment with this using increasing numbers until you see some interleaving between the output of the CPU and that of the Phi.

Question 8: What does Intel mean by "hybrid" computing? How can "hybrid" computing be accomplished using OpenMP?

Within spmdOpenMP-hybrid.c, there are two commented-out barrier pragmas, one in each section. Using these barrier directives, see if you can answer the following questions before continuing.

Question 9: What is the effect of a barrier within a parallel block that is within a section? Does it affect all of the threads in the computation, or are its effects confined to the threads in that parallel block?

Part 2. MPI

Now that we have seen some different ways to use OpenMP on the Xeon Phi, let's look at MPI.

Environment Setup for MPI

As was the case for OpenMP, we need to set up our environment before we can use MPI on the Xeon Phi. As before, this takes two steps. The first step is the same as before -- setting up your environment to use Intel's C compiler -- so you only need to do this again if you are starting a new session:

   source /opt/intel/composerxe/bin/compilervars.sh intel64
The second step is to set up the environment as needed to use Intel's version of MPI:
   source /opt/intel/impi/5.0.3.049/intel64/bin/mpivars.sh
As before, the scope of these environment variables is your current shell, so any time you launch a new shell, you will need to re-enter these commands. Or you can put them in your ~/.bashrc file, so that they are automatically performed whenever a new shell is launched.

With our environment set up for MPI, we are ready to begin.

2.A. "Native" MPI Applications

Use the cd command to change your working directory to the 04.nativeMPI directory; then use the ls command to view the directory's contents. You should see the usual 3 files: Makefile, README.txt and a source file spmdMPI.c.

2.A.i. Building the Program

Use make to build the spmdMPI program. You should see two separate build actions:

   mpiicc  spmdMPI.c -o spmdMPI
and
   mpiicc -mmic spmdMPI.c -o mic-spmdMPI
By now, you should be able to recognize that these steps are building two different versions of the program: (i) a regular version for the host's x86 Xeon CPU (spmdMPI), and (ii) a "native" version for the Xeon Phi accelerator (mic-spmdMPI).

Before we see how to run these programs, take a few minutes to look over spmdMPI.c to get a sense of what the program is doing.
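
For reference while you read, a minimal SPMD MPI program typically looks something like the sketch below. This is our own illustration; the exact contents and output format of spmdMPI.c may differ.

   #include <stdio.h>
   #include <mpi.h>

   int main(int argc, char** argv) {
       int rank, size, len;
       char hostName[MPI_MAX_PROCESSOR_NAME];

       MPI_Init(&argc, &argv);
       MPI_Comm_rank(MPI_COMM_WORLD, &rank);     /* which process am I? */
       MPI_Comm_size(MPI_COMM_WORLD, &size);     /* how many processes are there? */
       MPI_Get_processor_name(hostName, &len);   /* where am I running? */

       printf("Process %d of %d is running on %s\n", rank, size, hostName);

       MPI_Finalize();
       return 0;
   }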

2.A.ii. Running the x86 Version

To run the x86 version of the program, enter:

   mpirun -np 4 ./spmdMPI

Question 10: When we run spmdMPI, where are the four processes launched and run?

2.A.iii. Running the "Native" Version

Now, let's try the same thing with the MIC version:

   mpirun -np 4 ./mic-spmdMPI

Question 11: What happens when we try to run mic-spmdMPI on the host computer? Why?

We could run the "native" version by using scp to copy it over to the Xeon Phi, as we did with spmdOpenMP. However, since our sysadmin has set up the Network File System (NFS) on the Xeon Phi to mount our host's /home directory under the name /homex, we can run mic-spmdMPI without copying it, as we saw earlier.

  1. To run the "native" version, first enter:
       pwd
    
    to list the path to the working directory, where mic-spmdMPI resides.
  2. If necessary, use ssh to login to the Xeon Phi as before:
       ssh mic0
    
  3. On the Phi, use the cd command to change your working directory to the path produced by the pwd command in step 1, but with /home replaced by /homex.
       cd /homex/rest-of-path-to/04.nativeMPI
    
    Now you are logged into the Xeon Phi, and in the directory containing mic-spmdMPI !
  4. To run mic-spmdMPI, enter:
       mpirun -np 4 ./mic-spmdMPI
    
    You should see output indicating that the processes have successfully launched and run on the Xeon Phi. Congratulations -- you've run an MPI program on the Phi!

    Note that while the Xeon Phi may be a "cluster on a chip", the "nodes" in that cluster do not have distinct hostnames, as they are just cores residing on the same accelerator chip.

  5. While we are here on the Phi, try running spmdMPI the same way:
       mpirun -np 4 ./spmdMPI
    

    Question 12: What is the result of running a program on the Xeon Phi that was compiled for the host? Why?

Note that this same approach (i.e., using /homex) can be used to run OpenMP programs on the Phi, but it can only be used if a sysadmin has set up NFS to mount the host's home directory on the Phi. If not, you can always use the scp approach we used back in Section 1.A.ii.

2.B. MPI and "Offload" Mode

According to Intel, "Calling MPI functions within an offload region is not supported." However, "The offload programming model is supported by the Intel MPI Library."

This means that we cannot use the offload mechanism to offload an MPI computation onto the Phi, but an MPI computation on the host can offload some of its work onto the Phi. We will see how to do this in Part 3.

2.C. Hybrid MPI Applications

In Section 1.C, we saw how to run "hybrid" applications in which some of the threads run on the host's Xeon cores and others run on the Phi's MIC cores. We can do a similar thing with MPI processes, albeit in a different manner.

Use the cd command to change your working directory to 05.hybridMPI. There, use the ls command to view the directory's contents: Makefile, README.txt, spmdMPI-hybrid.c, and hosts.

It is worth mentioning that the program in spmdMPI-hybrid.c is identical to the one in spmdMPI.c; to make it a hybrid program, we just build and run it differently.

Since hosts is new, take a moment to view its contents. You should see localhost (Linux's generic local name for a host) and mic0 (the name of our Xeon Phi).

2.C.i. Building the Hybrid Application

Since the Makefile controls the build process and we are going to build this program differently than before, take a moment to examine the Makefile. Unlike the "hybrid" Makefile in Section 1.C, this one creates two binaries: (i) an x86 binary to run on the host's CPU, and (ii) a MIC binary to run on the Xeon Phi.

Run make to build the two programs. Note the names of the two binaries it creates: spmdMPI-hybrid and spmdMPI-hybrid.mic. The difference in these names is important, as we shall see shortly.

2.C.ii. Running the Hybrid Application

Before we can run a hybrid MPI program (i.e., on both the host and the Phi), Intel has us define three environment variables, as follows:

   export I_MPI_MIC=enable
   export I_MPI_MIC_POSTFIX=.mic
   export I_MPI_FABRICS=tcp
The first command tells MPI on the host that it will need to interact with MPI on the Phi coprocessor. The second command tells MPI on the host that binaries to be run on the coprocessor will have the .mic suffix. The third command tells MPI on the host that we will be using TCP to communicate with the Phi coprocessor. More details and options for this mechanism are available in the Intel MPI Reference Manual.

Recall that our Makefile produces a "native" MPI application for the Phi, spmdMPI-hybrid.mic. This naming, in combination with the second export command above, will let MPI uniquely identify the program we want to run on the Phi.

With those environment variables defined, we are ready to run our program. Once again, enter:

   pwd
and note the full path to 05.hybridMPI. Then enter:
   mpirun -np 4 -machinefile ./hosts /homex/rest-of-path-to/05.hybridMPI/spmdMPI-hybrid
replacing rest-of-path-to with the relevant path information from the pwd command, and making certain you are beginning at /homex instead of /home. You should see output from spmdMPI-hybrid indicating that some of the MPI processes ran on the local host CPU while others ran on the Xeon Phi (MIC) coprocessor.

Note that our command is telling mpirun to run the program spmdMPI-hybrid on the machines listed in hosts. Because of the environment variables we set up previously, the MPI runtime runs spmdMPI-hybrid on our host CPU but runs spmdMPI-hybrid.mic on our Phi coprocessor.

Note also that mpirun is successfully finding /homex on our host. Setting up the /homex mechanism on a host equipped with a Xeon Phi thus involves two steps: (i) configuring NFS so that the Phi mounts the host's /home directory under the name /homex, and (ii) defining /homex on the host itself as a symbolic link to /home, so that the same /homex path is valid on both the host and the Phi.

Question 13: Which specific environment variable (that we set up previously) tells the MPI runtime to generate the name of the program we want to run on the Phi by appending the suffix .mic to the name of the program we want to run on the host?

Repeat the mpirun command, varying the number of processes.

Question 14: In our "hybrid" MPI program, how is mpirun deciding which processes to run on the host vs the Phi? Why? (Hint: Consider all the command line arguments to mpirun.)

Congratulations, you've run your first hybrid MPI program!

Part 3. MPI+OpenMP

Last but not least, we can combine MPI and OpenMP within the same program, and use the Phi's "offload" mechanism for the OpenMP part.

Use the cd command to change your working directory to the 06.offload-MPI+OpenMP directory. Use the ls command to view its contents, then use make to build the program. Note that the Makefile only produces a single program (spmd-MPI+OpenMP-offload).

Take a moment to examine the contents of spmd-MPI+OpenMP-offload.c. As you can see, it contains MPI commands, OpenMP #pragma directives, an Intel #pragma offload directive, and code to read a number of threads from the command line.
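
For orientation, here is a hedged sketch of how such a program might be structured. It is our own illustration (names and output will differ from the provided file): each MPI process runs on the host, and its offload region launches OpenMP threads on the Phi.

   #include <stdio.h>
   #include <stdlib.h>
   #include <mpi.h>
   #include <omp.h>

   int main(int argc, char** argv) {
       int rank, size;
       int numThreads = 1;                        /* default: 1 thread on the Phi */

       MPI_Init(&argc, &argv);
       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
       MPI_Comm_size(MPI_COMM_WORLD, &size);
       if (argc > 1) {
           numThreads = atoi(argv[1]);            /* threads per process, from the command line */
       }

       printf("Host process %d of %d starting...\n", rank, size);

       #pragma offload target(mic)                /* this block runs on the Phi */
       #pragma omp parallel num_threads(numThreads)
       {
           printf("  process %d, Phi thread %d of %d\n",
                  rank, omp_get_thread_num(), omp_get_num_threads());
       }

       MPI_Finalize();
       return 0;
   }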

To run the code, try this initially:

   mpirun -np 2 ./spmd-MPI+OpenMP-offload 
You should see the program start up with multiple processes running on the host, and each process launching a single thread on the Phi.

Then add a command line argument:

   mpirun -np 2 ./spmd-MPI+OpenMP-offload 4
This should let you control the number of threads each process launches on the Phi. Congratulations -- you've just used MPI processes on the host to launch OpenMP threads on the Xeon Phi!

Question 15: We have seen six different mechanisms for taking advantage of the parallel capabilities of a host equipped with a Xeon Phi accelerator. Which seems preferable to you, and why? What tradeoffs are involved in using it?

To Explore Further

Intel maintains a catalog of software for the Phi, containing programs for a variety of problems, organized by discipline. Many of these programs can be freely downloaded. Feel free to explore this catalog on your own.

Summary

Now that you've completed this exercise, you should be able to:

  1. Create a "native" OpenMP application for Intel's MIC coprocessor.
  2. Create a normal OpenMP application that uses Intel's offload pragma to offload work to the MIC coprocessor.
  3. Create a "hybrid" OpenMP application that runs some OpenMP threads on the host's CPU and other OpenMP threads on the MIC coprocessor, with the ability to control how many threads are running each place.
  4. Create a "native" MPI application for Intel's MIC coprocessor.
  5. Create a "hybrid" MPI application that runs some MPI processes on the host's CPU and other MPI processes on the MIC coprocessor.
  6. Create a program that combines MPI+OpenMP, using MPI to launch multiple processes on the host CPU, each of which offloads OpenMP threads to the MIC coprocessor.

Submit.

A one-page writeup containing:
  1. a summary of what you have learned in this exercise, and
  2. answers to the questions in the exercise.



