JCL Big Sorting

 

What does the sample do?

JCL_Big_Sorting is a distributed sort application that uses the frequency count idea to avoid very large data packets traveling in the network. If you do not understand do not worry, we will explain in more detail.

When we have to sort a collection of data, most of the time the domain of values to be sorted is small, meaning most values repeat several times. Because of this, JCL_Big_Sorting first counts how many times each value appears and then sorts only one representative of each number, all distributed.

How do I run it?

It is necessary, first, to start a JCL cluster. Then the Eclipse project needs to be imported into your IDE. Once this is done, just run the Main class.

How do I use it?

Let's look at the code.

In our Main.java file, inside the constructor, the first 22 lines refer to generation of the input files, which are generated randomly. An input file is created for each cluster core.

Once the files have already been generated, our Sorting.java class is registered, which is responsible for sorting, which is divided into 4 phases. The arguments for the first phase are created and phase 1 is executed on each core.

Phase 1 is responsible for reading the input file and counting the frequency. Each number is inserted into a map called values, which contains the number as key and its frequency as value. The map values ​​keys are inserted into an AVL tree to sort the keys. Then a list of strings is created that will be the return of phase 1. In this list are entered the values ​​that are chosen as pivots and the frequency accumulated until a pivot is reached. To be clear, all numbers are traversed one by one making a sum of their frequencies and when the sum of the frequencies of some numbers is greater than the sum of all frequencies divided by the number of cores in the cluster the last number read is chosen as pivot and inserted in the list of strings together with the frequency accumulated until it. The number of pivots will be more or less balanced with the number of cores in the cluster. A global variable is instantiated in the cluster with the id of the core in question and the map values. This completes phase 1.

Returning to Main, the builInputDataPartitionSchema function is called by receiving all the list of strings generated in phase 1 and the number of cores in the cluster as parameters. In the builInputDataPartitionSchema function all the pivots chosen in phase 1 are inserted, with their accumulated frequencies, in a map. Note that we are now joining the pivots of all cores in a single map. The pivots chosen by each core are not necessarily the same, so when the same number is chosen as pivot in more than one core its accumulated frequencies are summed. Then new pivots are chosen from the group of pivots of the first phase. The new pivots are inserted into a string that will be passed as a parameter in phase 2.

In phase 2 a map called sorted is created and receives the map values ​​of phase 1 that was saved in a global variable. Making it clear that each core has its own map values. A vector of maps called finais is created, the vector size is the number of cores in the cluster. Then each number of the map sorted is read and inserted, with its frequency, in one of the maps of finais according to the pivots. Since there is more or less a pivot for each core, each map of finais will have the numbers that fit the range from one pivot to another pivot. Next, a JCL_map, which contains maps, is created for each map of the vector finais. Phase 2 is finished.

Returning to Main, the arguments for the third phase are created and phase 3 is executed on each core. In phase 3, we will merge the maps of each JCL_map. Since a JCL_map is created for each map of vector finais, and the vector finais size is the number of cores in the cluster, each core will be responsible for joining the maps of a JCL_map, and each JCL_map is responsible for a range of pivots, that is, each of the maps in a JCL_map contains the numbers, with their frequencies, of all the cores in that range. The numbers read from JCL_map maps are inserted into a new map called result. If there are equal numbers in different maps their frequencies are summed. Phase 3 is finished.

Returning to Main, the arguments for the fourth phase are created and phase 4 is executed on each host. Phase 4 only checks if all input file numbers exist in JCL_maps, if any number of input files are not in the maps, an error is counted. A function called removeDirs is called. This function deletes all input files created.

Returning to Main, the variables are terminated and JCL_Big_Sorting is terminated.

Questions or comments, where can I go?

Questions about the API or about the codes of this application? See our Programming Guide and Installation Guide.

If you have any questions, please contact the HPC team.