Our main goal was to develop a novel approach to resource management in supercomputers that allows the resources assigned to an application to change while it is running. We call this approach dynamic resource management; it is also known as malleability (i.e. the ability to adapt or adjust). Supercomputers use job scheduler software to match requests to execute applications (i.e. jobs) to the resources available on the supercomputer (i.e. the nodes, each with many resources to be managed, such as CPUs, GPUs, main memory, secondary storage, …). To avoid interference, job schedulers typically do not allow different jobs to share the same node. This ensures maximum performance for a single job, but at the price of lower overall utilisation if the job cannot fully exploit all the resources reserved for and assigned to it.
In addition, supercomputer architectures now tend to pack more and more resources into a single node. Current trends in computer architecture show that nodes are commonly composed of chips with many computing cores (e.g. up to 72 cores in Intel’s KNL), special-purpose computing units such as GPUs or FPGAs, and a new level in the memory hierarchy (i.e. Non-Volatile Memory). As the resources and complexity inside a node grow, it becomes unlikely that a single application can exploit all of a node's resources at the same time. If we want to increase the overall utilisation of the supercomputer, which is very important given the economic cost of the infrastructure, sharing resources between applications becomes a must, and that is where dynamic resource management makes a difference.
The dynamic resource management Key Result is composed of a set of results dealing with different topics but all working towards the same objective. In particular, the list of results we have achieved is:
- CoreNeuron with enabled malleability
- NEST with enabled malleability
- Definition of APIs for job schedulers
- Job scheduler with dynamic resource management capabilities
- Job scheduler with new scheduling policies
Our strategy has been to use two of the most important brain simulators in the HBP, CoreNeuron and NEST (The Neural Simulation Tool), both as representative use cases to show what can be achieved with dynamic resource management and as real applications from which to gather requirements. Both codes have been adapted so that they can change the number of computing cores they use at runtime. We have made them publicly available as a proof of concept, intended to show the kinds of adaptation that existing codes need when malleability is targeted, rather than as production-ready versions. These modified codes correspond to the components CoreNeuron with enabled malleability and NEST with enabled malleability.
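The kind of adaptation a code needs in order to change its core count at runtime can be illustrated with a minimal sketch. It is not taken from CoreNeuron or NEST: the function names, the `cores_at_step` callback (which stands in for whatever the resource manager would report) and the even partitioning of work are all assumptions made for illustration.

```python
def partition(n_items, n_workers):
    """Split item indices 0..n_items-1 into n_workers contiguous, near-equal chunks."""
    base, extra = divmod(n_items, n_workers)
    chunks, start = [], 0
    for worker in range(n_workers):
        size = base + (1 if worker < extra else 0)
        chunks.append(range(start, start + size))
        start += size
    return chunks


def run_malleable(n_items, n_steps, cores_at_step):
    """Run n_steps iterations, re-partitioning the work when the core count changes.

    cores_at_step is a hypothetical callback that returns the number of cores
    assigned to the application at a given step. The function returns, for
    each step, the chunk sizes that were in use.
    """
    history = []
    current_cores = None
    chunks = []
    for step in range(n_steps):
        assigned = cores_at_step(step)
        if assigned != current_cores:
            # The assignment changed: redistribute the work items.
            chunks = partition(n_items, assigned)
            current_cores = assigned
        # Placeholder for the real per-chunk computation.
        history.append([len(chunk) for chunk in chunks])
    return history
```

For example, `run_malleable(10, 3, lambda step: 4 if step < 2 else 2)` runs two steps on four cores and then continues on two cores after half of them have been taken away. The essential point is that the application checks its assignment at well-defined points and can redistribute its data accordingly.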
Moreover, several tools have been used in studying the performance of the original versions of the codes compared with the malleable ones: Paraver and Extrae, which enable the performance analysis, as well as OmpSs, a parallel programming model.
To enable malleability in current systems, information needs to flow between the different layers of the software stack, so that scheduling can happen at a finer-grained level and the different schedulers of the system can interact. With that objective in mind, we defined a simple architecture of the schedulers in a supercomputer, which act at different levels. We identified four levels of schedulers (job level, node level, application level and kernel level), as depicted in the figure, and with the effort available in the project we addressed three of them, i.e. all except the kernel level. We then defined APIs to enable communication between these scheduling levels, to achieve a holistic view of the scheduling and to make all layers work towards the same objective. The interaction between these layers is achieved with the Dynamic Load Balancing Library (see section 126.96.36.199), a software library that enables resource sharing between applications. These results are summarised in the component “Definition of APIs for job schedulers”. We produced an internal report describing these APIs, available in the HBP Collaboratory.
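The flow of information between the three levels addressed can be sketched roughly as follows. This is only an illustration of the layering and does not reproduce the APIs defined in the project or the interface of the Dynamic Load Balancing Library: every name below is hypothetical, and the even split of cores at node level is an assumption.

```python
from enum import Enum


class Level(Enum):
    JOB = "job"
    NODE = "node"
    APPLICATION = "application"
    KERNEL = "kernel"


# Levels addressed in the project; the kernel level was left out.
ADDRESSED = (Level.JOB, Level.NODE, Level.APPLICATION)


def node_level_split(node_cores, jobs_on_node):
    """Node level: divide the cores of a node evenly among co-located jobs."""
    base, extra = divmod(node_cores, len(jobs_on_node))
    return {
        job: base + (1 if index < extra else 0)
        for index, job in enumerate(jobs_on_node)
    }


def application_level_threads(assigned_cores, max_useful_threads):
    """Application level: use no more threads than the application can exploit."""
    return min(assigned_cores, max_useful_threads)


def propagate(node_cores, placement, max_useful_threads):
    """Pass a job-level placement decision down through the node and application levels.

    placement is the list of jobs the job-level scheduler put on this node;
    max_useful_threads maps each job to the most threads it can use.
    """
    shares = node_level_split(node_cores, placement)
    return {
        job: application_level_threads(shares[job], max_useful_threads[job])
        for job in placement
    }
```

With a 48-core node shared by two jobs, `propagate(48, ["coreneuron", "nest"], {"coreneuron": 48, "nest": 16})` gives each job a share of 24 cores, of which the second job uses only 16. A real implementation would also pass information upwards, so that unused cores can be reassigned.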
The central requirement for enabling malleability is that the main system in charge of scheduling jobs on the supercomputer supports it. We selected SLURM as a representative job scheduler in which to implement the modifications needed to share resources between jobs. SLURM is in use at many supercomputing facilities around the world and is distributed as open source, which made it easier for us to develop a prototype on top of it. We have not yet distributed this modified version of SLURM because it is still an early prototype. However, we are considering contributing it to the SLURM distribution in the future, and we keep a careful record of the modifications needed to implement malleability, so that any other job scheduling system could be adapted by following similar steps. This is the component result named Job scheduler with dynamic resource management capabilities.
Another important aspect of dynamic resource management is that it makes it possible to implement new scheduling policies in the job manager (i.e. SLURM) that exploit malleability. This is the component referred to as Job scheduler with new scheduling policies. Although it is listed in section 3.4 as a separate component, we have implemented our algorithms in the modified version of SLURM described above. We have implemented two generic policies (which can be applied to any application) that enable malleability with the objective of reducing the response time of jobs and increasing the utilisation of the supercomputer. The first generic policy takes into account a maximum slowdown for a job when it shares its resources, to avoid excessive sharing of the resources of a single application. The second generic policy uses the runtime of the application that is sharing its resources as a bound, to ensure that sharing does not make the application exceed a certain runtime. In addition, a specific policy has been tailored for CoreNeuron. This policy takes into account the phase in which the code is currently running (input reading or computation), and thus enables successful resource sharing with other processes that have compatible requirements. Because CoreNeuron has high memory utilisation, the policy also considers the memory demands of each process, so that new processes will not overload the node’s memory. By parsing CoreNeuron input files we have been able to determine their size, and therefore to predict memory utilisation and initialisation time without any user intervention.
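The checks behind these policies can be sketched as simple admission tests. The code below is an illustration under stated assumptions, not the algorithms implemented in the modified SLURM: the `RunningJob` fields, the function names and, in particular, the perfect-scaling runtime model are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RunningJob:
    """Hypothetical description of a job that may lend cores to another job."""
    cores: int             # cores currently assigned to the job
    elapsed_time: float    # seconds already spent running
    remaining_time: float  # estimated seconds left if it keeps all its cores


def estimated_runtime(job, cores_lent):
    """Estimated total runtime if cores_lent cores are given away from now on.

    Assumption: the remaining work scales perfectly with the core count.
    """
    kept = job.cores - cores_lent
    if kept <= 0:
        raise ValueError("a job must keep at least one core")
    return job.elapsed_time + job.remaining_time * job.cores / kept


def slowdown(job, cores_lent):
    """Ratio between the runtime with sharing and the runtime without it."""
    baseline = job.elapsed_time + job.remaining_time
    return estimated_runtime(job, cores_lent) / baseline


def allowed_by_max_slowdown(job, cores_lent, max_slowdown):
    """First generic policy: sharing must not exceed a maximum slowdown."""
    return slowdown(job, cores_lent) <= max_slowdown


def allowed_by_runtime_bound(job, cores_lent, runtime_limit):
    """Second generic policy: sharing must not push the job past a runtime limit."""
    return estimated_runtime(job, cores_lent) <= runtime_limit


def fits_in_node_memory(node_memory_gb, resident_demands_gb, new_demand_gb):
    """Memory check in the spirit of the CoreNeuron policy: do not overload the node."""
    return sum(resident_demands_gb) + new_demand_gb <= node_memory_gb
```

As a usage example, take a job with 16 cores that has run for 100 s and has 100 s left. Lending 8 cores doubles the estimated remaining time and gives a slowdown of 1.5, so `allowed_by_max_slowdown(job, 8, 1.2)` rejects the request while a limit of 1.5 accepts it. In practice the runtime model would come from measurements or, as for CoreNeuron, from predictions based on the input files.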
The impact of our work
Our work plan towards the achievement of this Key Result also includes investigating the sustainability of our solutions and a possible technology transfer to vendors. In particular, we worked with two of the vendors involved in the HBP Pre-Commercial Procurement (PCP), which took place during the HBP Ramp-up Phase, i.e. IBM-NVIDIA and Cray. IBM has clearly acknowledged the future need for dynamic resource management, noting that malleability has been discussed with some of their customers. By contrast, Cray's experience with their clients does not yet indicate a need for malleability. To gain insight into this need, Cray requested more real use cases from neuroscience, which we successfully delivered to them. For us, the contact with Cray has shown that dynamic resource management technology is still in an early research phase, but that it has a lot of potential.
Apart from the PCP vendors, we have been in contact with Lenovo (Luigi Brochar) and Intel (Hans-Christian Hoppe), since BSC has specific, ongoing research projects with these two companies, and we presented to them the ideas and results achieved so far in dynamic resource management. Their feedback has been very valuable to us, and they have shown clear interest in the future of the topic, as something that may be included in their products. This strategy was defined in the first year of SGA1, and the contacts with the vendors continued during the second year.