Dense matrix multiplication using a cluster of GPUs

What does this application do?

In this tutorial we show how to set up, run and understand a JCL example that uses a cluster of GPUs to solve the problem of dense matrix multiplication. We assume that the reader knows how to edit environment variables and .bat files and has already read the JCL installation and programming guides.

For GPU programming in Java we use aparapi, a simple and powerful framework. It provides a Kernel class that your code extends, overriding its run method with the code that will run inside the GPU. aparapi converts the bytecode of this kernel into OpenCL and then executes it in GPU mode (if possible). The full documentation and description of the project can be found here and here.


How do I run it?

First we show what you need to set up before running this example, and then how to run the code itself.

GPU driver download and installation:

Before running anything with aparapi, you must install some dependencies and prepare your execution environment. First, get the OpenCL drivers and runtime for your GPU. With those, aparapi will be able to run the converted code in GPU mode. The drivers should be available on your GPU vendor's website. This tutorial focuses on NVIDIA GPUs (on Windows). For NVIDIA GPUs, installing the CUDA drivers should be enough: the installer already includes the OpenCL drivers and runtime, which makes the configuration process easier. The installer can be found here. Just download it, double-click on it and follow the instructions presented.


Figure 1: GPU drivers installation step by step

Figure 2.1: Just click OK here to let the installer use the default path.

Figure 2.2: Accept the EULA and hit the green button.

Figure 2.3: Wait for the installation to proceed.

Figure 2.4: Install the video driver.

Figure 2.5: Wait for the last installation section.

Figure 2.6: Here you'll see a summary of the installation. Just hit next.

Figure 2.7: Uncheck all the boxes and close the window.


Aparapi download and installation:

After the CUDA installation finishes, reboot your computer as requested (reboot anyway even if you are not asked to). Now download the aparapi release for your Windows version (x86 or x64) here. Choose one of the ZIPs, download it and extract it. The extracted folder contains the aparapi documentation, its jar and its DLL. Now, in the root of your C: drive, create a folder called aparapi and move the extracted aparapi_<DIST>.dll into it (DIST is either x86 or x86_64, depending on what you downloaded).


Figure 3: Download the highlighted aparapi ZIP from the GitHub page.

Figure 4: Here we show the extracted folder. Copy the highlighted file and paste it into a folder named "aparapi" that you should create at the root of your C: drive.


Environment variables:

The last step in the configuration process is to set up some environment variables. aparapi uses them to locate the OpenCL and aparapi DLLs we just installed. First locate the OpenCL.dll file; it is normally inside your CUDA installation path. The default is C:\Program Files\NVIDIA Corporation\OpenCL\. Append this path to your system's PATH environment variable. Finally, add the folder that holds aparapi_<DIST>.dll (c:\aparapi) to PATH the same way you did for OpenCL.dll. PATH entries are directories, so list the folder rather than the DLL file itself.
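As a quick sanity check after editing PATH, the small Java program below lists which PATH directories contain the two DLLs. It is only a sketch: PathCheck and findOnPath are hypothetical names that belong to neither aparapi nor JCL, and the file name aparapi_x86_64.dll assumes you downloaded the 64-bit release.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of aparapi or JCL): reports which
// directories listed in PATH contain a given file, e.g. OpenCL.dll.
class PathCheck {

    // Returns every directory in pathValue that contains fileName.
    static List<String> findOnPath(String pathValue, String fileName) {
        List<String> hits = new ArrayList<>();
        if (pathValue == null) {
            return hits;
        }
        // File.pathSeparator is ";" on Windows and ":" on Unix-like systems.
        for (String dir : pathValue.split(File.pathSeparator)) {
            if (!dir.isEmpty() && new File(dir, fileName).isFile()) {
                hits.add(dir);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String path = System.getenv("PATH");
        // File names follow the tutorial; adjust the aparapi one to your download.
        for (String name : new String[] {"OpenCL.dll", "aparapi_x86_64.dll"}) {
            System.out.println(name + " found in: " + findOnPath(path, name));
        }
    }
}
```

An empty list for either file suggests that the corresponding folder is still missing from PATH, or that the console was opened before the variable was changed.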


Figure 5: How to access and edit the environment variables


Figure 6: How to edit the environment variables - part 2


Figure 7: How to edit the environment variables - part 3


Figure 8: How to edit the environment variables - part 4

Now repeat this process on every other machine in your cluster that has a GPU. With this done, let's run the code.

Running the code:

First of all, you must modify the scripts that are used to start the JCL-Hosts. To avoid registering a large number of jar dependencies programmatically, we can start the JCL-Hosts with these dependencies already available and thus avoid the registration overhead inside the code. In our example, we have two dependencies: Matrix.jar and aparapi.jar. The first contains a simple class that wraps a matrix with some useful information. The second is the aparapi framework itself, which provides the Kernel class and the GPU code execution. Both are located in the lib folder. Copy them to the jcl_binaries folder of the JCL release on every computer you are going to use. Now open the Host.bat file with a text editor and edit it as follows.



Figure 9: Customizing the Host.bat file

Now you can run the code itself. Start the JCL cluster, import the Eclipse project and run the Main class on one of the machines. After the execution finishes, you should see the resulting matrix in the console of the Eclipse instance where you started the code.


How do I use it?

Here we go through all the classes that are part of the project, explaining their logic and the decisions behind them.

Starting with the Main class, it is pretty straightforward: its main method creates two integer matrices and constructs two objects of the Matrix type from them (lines 12 and 13). After that, a SplitMatrices object is created from the two Matrix objects and, at line 17, the resulting matrix is displayed.

As mentioned above, Matrix is just a support class that wraps the integer matrices with some additional information and functionality, such as the number of lines and columns and methods that return entire lines and columns of values.
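To make the role of this support class concrete, here is a minimal sketch of what such a wrapper could look like. MatrixSketch and its method names are assumptions made for illustration; the actual class shipped in Matrix.jar may differ.

```java
// Hypothetical sketch of a Matrix-like wrapper around an int[][].
// The real class in Matrix.jar may use different names and fields.
class MatrixSketch {
    private final int[][] values;

    MatrixSketch(int[][] values) {
        this.values = values;
    }

    // Number of lines (rows) in the matrix.
    int getNumLines() {
        return values.length;
    }

    // Number of columns; an empty matrix has zero columns.
    int getNumColumns() {
        return values.length == 0 ? 0 : values[0].length;
    }

    // Returns a copy of line i.
    int[] getLine(int i) {
        return values[i].clone();
    }

    // Returns column j as a new array, one entry per line.
    int[] getColumn(int j) {
        int[] column = new int[values.length];
        for (int i = 0; i < values.length; i++) {
            column[i] = values[i][j];
        }
        return column;
    }
}
```

Returning whole lines and columns as flat int arrays is convenient here because each line/column pair can then be handed to a kernel as two plain arrays.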



The next class is MatrixKernel, one of the most important classes of the project (if not the most important). Its constructor receives two integer arrays: line and column. This lets us optimize the matrix multiplication: the run method is overridden from Kernel and holds the code that is translated to OpenCL and executed on the GPU. The four lines of the run method are executed many times in parallel inside the GPU, each execution multiplying one pair of elements from the line and the column. Each execution can obtain its own ID (getGlobalId), which uniquely identifies it. With this much concurrency, it is not safe to accumulate all the products in a single variable, so each product is stored in a different position of an array (element). The getElement method then sums the positions of the element array and returns one computed element of the final matrix.
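The idea of giving each parallel execution its own array slot can be tried without a GPU. The sketch below is a plain-Java emulation, not aparapi code: DotProductSketch is a hypothetical name, the gid parameter stands in for the value that getGlobalId would return, and a parallel stream stands in for the GPU work items.

```java
import java.util.stream.IntStream;

// Plain-Java emulation of the kernel logic described above.
// Assumption: no aparapi on the classpath; a parallel stream plays the GPU.
class DotProductSketch {
    private final int[] line;
    private final int[] column;
    private final int[] element; // one slot per work item, so no shared accumulator

    DotProductSketch(int[] line, int[] column) {
        if (line.length != column.length) {
            throw new IllegalArgumentException("line and column must have the same length");
        }
        this.line = line;
        this.column = column;
        this.element = new int[line.length];
    }

    // Body of one work item; gid plays the role of the global ID.
    void run(int gid) {
        element[gid] = line[gid] * column[gid];
    }

    // Stand-in for the kernel's execute(n): runs n work items in parallel
    // and returns only after all of them have finished.
    void execute(int n) {
        IntStream.range(0, n).parallel().forEach(this::run);
    }

    // Sums the partial products into one element of the result matrix.
    int getElement() {
        int sum = 0;
        for (int product : element) {
            sum += product;
        }
        return sum;
    }
}
```

For example, with line {1, 2, 3} and column {4, 5, 6}, calling execute(3) and then getElement() yields 1*4 + 2*5 + 3*6 = 32. Because every work item writes to a different index, no locking is needed during the parallel phase.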



The JCLMatrixBroker class is called for each pair of line and column from the matrices being multiplied. Its multiply method is invoked through the execute method of the JCL API (as we shall see later on) and receives as arguments two integer arrays, line and column, and a String ij that identifies the element of the result matrix. At line 8 we see one of the most useful structures of the JCL API, the JCL-HashMap. It implements the standard Map interface from java.util as a distributed map that stores its entries across the cluster of JCL-Hosts and lets any JCL application running on the same cluster access and modify the data collaboratively. At line 10 we instantiate a new MatrixKernel with the line and column arguments. The execute method called at line 11 comes from the Kernel class, which MatrixKernel extends, and invokes the implemented run method n times, where n is the integer passed as argument. Note that the execute call blocks until the last run execution finishes. After that we simply retrieve the computed element and insert it into the JCL-HashMap (lines 14 and 16 respectively).


The last class is SplitMatrices, which splits the matrices you want to multiply into lines (from A) and columns (from B). For each pair of line and column we call sendToJCL (line 35). This method just uses the execute method of the JCL API to invoke the multiply method on a remote JCL-Host.

Line 63 calls the execute method and stores the execution identifier in a ticket so that the result can be collected later. The generated ticket is stored in a ticket list. Going back to the SplitMatrices constructor, at line 39 we block until every execute call finishes; this way we are sure that all the JCL-Hosts have finished their jobs. The last method to cover is getResult. It simply goes through the JCL-HashMap that contains each computed element and constructs a new Matrix to be presented as the resulting matrix.
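The split, execute, wait and collect flow can be tried on a single machine. The sketch below is a local stand-in and does not use the JCL API: a thread pool replaces the JCL-Hosts, a Future replaces a JCL ticket, a ConcurrentHashMap replaces the JCL-HashMap, and the "i:j" key format is an assumption, since the tutorial only says that ij identifies the element. SplitSketch and multiplyMatrices are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Local stand-in for the SplitMatrices / JCLMatrixBroker flow (no JCL involved).
class SplitSketch {

    // Plays the role of the remote multiply(line, column, ij) call.
    static void multiply(int[] line, int[] column, String ij, Map<String, Integer> results) {
        int sum = 0;
        for (int k = 0; k < line.length; k++) {
            sum += line[k] * column[k];
        }
        results.put(ij, sum);
    }

    static int[][] multiplyMatrices(int[][] a, int[][] b) {
        int rows = a.length;
        int inner = b.length;
        int cols = b[0].length;
        Map<String, Integer> results = new ConcurrentHashMap<>(); // shared result map
        List<Future<?>> tickets = new ArrayList<>();              // one "ticket" per task
        ExecutorService hosts = Executors.newFixedThreadPool(4);  // stands in for the cluster
        try {
            for (int i = 0; i < rows; i++) {
                for (int j = 0; j < cols; j++) {
                    int[] line = a[i];
                    int[] column = new int[inner];
                    for (int k = 0; k < inner; k++) {
                        column[k] = b[k][j];
                    }
                    String ij = i + ":" + j; // key format is an assumption
                    tickets.add(hosts.submit(() -> multiply(line, column, ij, results)));
                }
            }
            for (Future<?> ticket : tickets) {
                ticket.get(); // block until this task has finished
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("a multiplication task failed", e);
        } finally {
            hosts.shutdown();
        }
        // Rebuild the result matrix from the map, as getResult does.
        int[][] c = new int[rows][cols];
        for (Map.Entry<String, Integer> entry : results.entrySet()) {
            String[] ij = entry.getKey().split(":");
            c[Integer.parseInt(ij[0])][Integer.parseInt(ij[1])] = entry.getValue();
        }
        return c;
    }
}
```

Waiting on every ticket before reading the map mirrors the blocking step in the constructor: the result is assembled only once all the tasks have stored their elements.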


I still have some questions and comments, where can I go?

Questions about the API or the code of this application? Check out our Installation Guide and our Programming Guide for the Lambari and Pacu versions. If any questions remain, contact us.