
Materiomics Chronicles: week 11 & 12

After the exam period in weeks nine and ten, the eleventh and twelfth weeks of the academic year bring the second quarter of our materiomics program at UHasselt for the first-year master students. Although I’m not coordinating any courses in this quarter, I do have some teaching duties, including involvement in two of the hands-on projects.

As in the past 10 weeks, the bachelor students in chemistry had lectures for the courses introduction to quantum chemistry and quantum and computational chemistry. For the second bachelor this meant they finally came into contact with the H atom, the first and only system that can be solved exactly with pen-and-paper quantum chemistry (anything beyond it can only be solved with additional approximations). During the exercise class we investigated the concept of aromatic stabilization in more detail, in addition to the usual exercises with simple Schrödinger equations and wave functions. For the third bachelor, the travels into the world of computational chemistry continued with the introduction of post-Hartree-Fock methods, which also include the missing correlation energy. This is the failure of Hartree-Fock theory, making it a nice framework but of little practical use for any but the most trivial molecules (e.g. H2 is already out of scope). We also started looking into molecular systems, starting with simple diatomic molecules like H2+.

SnV split vacancy defect in diamond.


In the master materiomics, the course Machine learning and artificial intelligence in modern materials science hosted a guest lecture on Large Language Models and their use in materials research, as well as an exercise session during which the overarching ML study of the QM9 dataset was extended. During the course on Density Functional Theory there was a second lab, this time on conceptual DFT. For the first master students, the hands-on project kept them busy: one group combining AI and experiments, and a second group combining DFT modeling of SnV0 defects in diamond with their actual lab growth. It was interesting to see the enthusiasm of the students. With only some mild debugging, I was able to get them up and running relatively smoothly on the HPC. I am also truly grateful to our experimental colleagues of the diamond growth group, who bravely set up these experiments and had backup plans for the backup plans.

At the end of week 12, we added another 12h of classes, ~1h of video lecture, ~2h of HPC support for the hands-on project and 6h of guest lectures, putting our semester total at 118h of live lectures. Upward and onward to weeks 13 & 14.

Parallel Python?

As part of my machine learning research at AMIBM, I recently ran into the following challenge: “Is it possible to do parallel computation using python?” It sent me on a rather long and arduous journey, with the final answer being something like: “very reluctantly”.

Python was designed with one specific goal in mind: make it easy to implement small test programs to see if an idea is worth pursuing. This gave rise to a scripting language with a lot of flexibility, but also with significant limitations, most of which the “intended” user would never run into. However, as a consequence of its success, many now use it far beyond this original scope (yours truly as well 🙂 ).

Python offers various libraries to parallelize your scripts…most of them wrappers adding minor additional functionality. However, digging down to the bottom, one generally ends up at one of the following two standard modules: the threading module and the multiprocessing module.

Of course, as with many things python, there is a huge number of tutorials available, many of great quality.

import threading

Programmers experienced in a language such as C/C++, Pascal, or Fortran may be familiar with the concept of multi-threading. With multi-threading, a program distributes its work over multiple threads, which can be executed in parallel by the different cores of the CPU (or while a core is otherwise idle, e.g. because a thread is waiting for data to be fetched). One of the most famous APIs for writing multi-threaded applications is OpenMP. In the past I used it to parallelize my Hirshfeld-I implementation and the phonon-module of HIVE.

For Python, there is no implementation of the OpenMP API; instead there is the threading module. This provides access to the creation of multiple threads, each able to perform its own tasks while sharing data-objects. Unfortunately, Python also has the Global Interpreter Lock (GIL), which allows only a single thread to access the interpreter at a time. For CPU-bound work, this effectively reduces thread-based parallelization to a complicated way of running your code serially.
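
A minimal sketch that makes this visible for a purely CPU-bound task under standard CPython (the function and the timings are just an illustration; for I/O-bound work, threads do help):

import threading
import time

def burn_cpu(n):  # purely CPU-bound work: the GIL prevents two threads from executing this at the same time
    total = 0
    for i in range(n):
        total += i * i
    return total

N = 5_000_000

t0 = time.perf_counter()  # serial: two calls, one after the other
burn_cpu(N)
burn_cpu(N)
print("serial :", round(time.perf_counter() - t0, 2), "s")

t0 = time.perf_counter()  # "parallel": two threads, yet roughly the same wall time
threads = [threading.Thread(target=burn_cpu, args=(N,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("threads:", round(time.perf_counter() - t0, 2), "s")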

For more information on “multi-threading” in python, you can look into this tutorial.

import multiprocessing

In addition to the threading module, there is also the multiprocessing module. This module side-steps the GIL by creating multiple processes, each with its own interpreter. This, however, comes at a cost. Firstly, there is a significant overhead in starting the different processes. Secondly, objects are not shared between processes, so additional work is needed to collect and share data.

Using the “Pool” class, things are somewhat simplified, as can be seen in the code-fragment below. With the Pool class one creates a set of worker processes available to your program. Through the apply_async function it is then possible to run tasks in parallel. (Note that you need to use the “async” version of the function, as otherwise you end up running things serially…again.)

import multiprocessing as mp

def doOneRun(id: int):  # trivial function to run in parallel
    return id**3

if __name__ == "__main__":  # guard so freshly spawned worker processes do not re-execute this block
    num_workers = 10  # number of processes
    NRuns = 1000      # number of runs of the function doOneRun

    pool = mp.Pool(processes=num_workers)  # create a pool of processes
    drones = [pool.apply_async(doOneRun, args=(nr,)) for nr in range(NRuns)]  # and run things in parallel

    results = []
    for drone in drones:  # and collect the data
        results.append(drone.get())  # recombine the separate results into a single result (replace with your own collection routine)

    pool.close()  # close the pool...no new tasks can be submitted to any of the processes
    pool.join()   # wait for the worker processes, then collapse back into the main process

 

how many cores does my computer have?

If you are used to HPC applications, you always want to get as much out of your machine as possible. With regard to parallelization this often means making sure no CPU cycle is left unused. In the example above we manually selected the number of processes to spawn. However, would it not be nice if the program itself could just set this value to be equal to the number of physical cores accessible?

Python has a large number of functions claiming to do just that. A few of them are given below.

  • multiprocessing.cpu_count(): returns the number of logical cores it can find. So if you have a modern machine with hyper-threading technology, this will return a multiple of the number of physical cores (and you will be over-subscribing your CPU).
  • os.cpu_count(): same as multiprocessing.cpu_count().
  • psutil.cpu_count(logical=False): By default this gives the same behavior; however, the logical parameter allows this function to return the correct number of physical cores of a single CPU. Indeed, a single CPU: HPC architectures which contain multiple CPUs per node will again return an incorrect number, as the implementation makes use of a python “set”, and as such doesn’t count the same core index on a different CPU.
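
For reference, a small sketch comparing the calls listed above (psutil is a third-party package, and on multi-socket HPC nodes even its “physical” count suffers from the problem just described):

import os
import multiprocessing

print("multiprocessing.cpu_count():", multiprocessing.cpu_count())  # logical cores
print("os.cpu_count():             ", os.cpu_count())               # same number

try:
    import psutil  # third-party: pip install psutil
    print("psutil logical cores :", psutil.cpu_count(logical=True))
    print("psutil physical cores:", psutil.cpu_count(logical=False))
except ImportError:
    print("psutil is not installed")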

In conclusion, there seems to be no simple way to obtain the correct number of physical cores using python, and one is forced to provide this number manually. (If you do have knowledge of such a function which works in both windows and unix environments and both desktop and HPC architectures feel free to let me know in the comments.)

All in all, it is technically possible to run code in parallel using python, but you have to deal with quite a few python quirks, such as the GIL.

VSC-users day 2019

It is becoming an interesting yearly occurrence: the VSC user day. During this 5th edition, HPC users of the various Flemish universities gather at the Belgian Royal Academy of Science (KVAB) to present their state-of-the-art work using the Flemish Tier-1 and Tier-2 supercomputers. This is done during a poster-presentation session. This year, I presented my work on vibrational spectra in solids and periodic systems. In contrast to molecules, vibrational spectra in solids are rarely investigated at the quantum mechanical level due to their high computational cost. I show that imaginary modes are not necessarily the result of structural instabilities, and I present a method for identifying the vibrational spectrum of a defect.

Poster for the VSC user day 2019.

In addition, international speakers discuss recent (r)evolutions in High Performance Computing, and during workshops the participants are introduced to new topics such as GPU-computing, parallelization, and the VSC Cloud and data platform. The possibilities of GPUs were presented by Ehsan of the VSC, showing extreme speedups of 10x to 100x, strongly depending on the application and the graphics card. It is interesting to see that simple CUDA pragmas can be used to obtain such effects…maybe I should have a go at them for the Hirshfeld and phonon parts of my HIVE code…if they can deal with quadruple precision and very large arrays. During the presentation of Joost Vandevondele (ETH Zürich) we learned what the future holds with regard to next-generation HPC machines. As increasing speed becomes harder and harder to obtain, people are again looking into dedicated hardware systems, a situation akin to the founding days of HPC. Whether this is a situation we should applaud remains to be seen, as it means that we are moving back to codes written for specific machines. This decrease in portability will probably be alleviated by high-level scripting languages (such as python), which at the same time give away a significant part of the initial gain. (Think of the framework approach to modern programming, which leads to trivial applications requiring HPC resources just to start.)

In addition, this year the HPC-team of the TIER-1 machine is present for a panel discussion, presenting the future of the infrastructure. The machine nearly doubled in size, which is great news. Let us hope that, in addition to financing hardware, a significant budget is also considered for a serious extension of the dedicated HPC support team. Running a Tier-1 machine is not something one does as a side-project; it requires the constant vigilance of a dedicated team to deal with software updates, the resulting compatibility issues, conflicting scripts, and hardware and software simply running haywire because they can.

With this hope, I look toward the future. A future where computational research is steadily, and ever more quickly, becoming commonplace in the fabric of academic endeavors.

VSC User Day 2018

Today, I am attending the 4th VSC User Day at the “Paleis der Academiën” in Brussels. Flemish researchers for whom the lifeblood of their research flows through the chips of a supercomputer are gathered here to discuss their experiences and present their research.

Some History

About 10 years ago, at the end of 2007 and beginning of 2008, the 5 Flemish universities founded the Flemish Supercomputer Center (VSC), a virtual organisation with one central goal: combine their strengths and know-how with regard to High Performance Computing (HPC) centers to make sure they were competitive with comparable HPC centers elsewhere.

By installing a super-fast network between the various university compute centers, each Flemish researcher nowadays has access to state-of-the-art computer infrastructure, independent of his or her physical location. A researcher at the University of Hasselt, like myself, can easily run calculations on the supercomputers installed at the universities of Ghent or Leuven. In October 2012 the existing university supercomputers, the so-called Tier-2 supercomputers, were joined by the first Flemish Tier-1 supercomputer, housed at the brand new data-centre of Ghent University. This machine is significantly larger than the existing Tier-2 machines, and allowed Belgium to become the 25th member of the PRACE network, a European network which provides computational researchers access to the best and largest computer facilities in Europe. The fast development of computational research in Flanders and the explosive growth in the number of computational researchers, combined with the first shared Flemish supercomputer (in contrast to the university Tier-2 supercomputers, which some still consider private property rather than part of the VSC), show the impact of the virtual organisation that is the VSC. As a result, on January 16th 2014, the first VSC User Day was organised, bringing together HPC users from all 5 universities and industry to share their experiences and discuss possible improvements and changes. Since then, the first Tier-1 supercomputer has been decommissioned and replaced by a brand new Tier-1 machine, this time located at KU Leuven. Furthermore, the Flemish government has put 30M€ aside for supercomputing in Flanders, making sure that Flemish computational research also stays competitive in the future. The future of computational research in Flanders looks bright.

Today is User Day 2018

During the 4th VSC User Day, researchers of all 5 Flemish universities will be presenting the work they are performing on the supercomputers of the VSC network. The range of topics is very broad: from first-principles materials modelling to chip design, climate modelling and space weather. In addition there will also be several workshops, introducing new users to the VSC and teaching advanced users the finer details of GPU-code and code optimization and parallelization. This latter aspect is hugely important when using supercomputers in an academic context. Much of the software used is developed or modified by the researchers themselves. And even though this software can show impressive behavior, it does not speed up automatically if you give it access to more CPUs. That is a very non-trivial task the researcher has to take care of, by carefully optimizing and parallelizing his or her code.

To support the researchers in their work, the VSC came up with ingenious poster-prizes. The three best posters will share 2018 node days of calculation time (about 155 years of calculations on a normal simple computer).

Wish me luck!

 

Single-slide presentation of my poster @VSC User Day 2018.


A Spectre and Meltdown victim: VASP

Over the last weekend, two serious cyber security issues were hot news: Meltdown and Spectre (not to be mistaken for the title of a Bond movie). As a result, academic HPC centers also went into overdrive installing patches as fast as possible. The news of the two security issues went hand-in-hand with quite a few belittling comments toward the chip-designers, ignoring the fact that no-one (including those complaining now) discovered the problem for over a decade. Of course there was also the usual scare-mongering (cyber-criminals will hack our devices by next Monday, because hacks using these bugs immediately become their default tools, etc.) typical since the beginning of the 21st century…but now it is time to return to reality.

One of the big users on scientific HPC installations is the VASP program, aimed at the quantum mechanical simulation of materials, and a program central to my own work. Due to a serendipitous coincidence (an annoyingly hard-to-converge job) I had the opportunity to see the impact of the Meltdown and Spectre patches on the performance of VASP: 16% performance loss (within the range of the expected 10-50% performance loss for high performance applications [1][2][3]).

The case:

  • large HSE06 calculation of a 71 atom defective ZnO supercell.
  • 14 irreducible k-points (no reduction of the Hartree-Fock k-points)
  • 14 nodes of 24 cores, with KPAR=14, and NPAR=1 (I know NPAR=24 is the recommended option)

The calculation took several runs of about 10 electronic steps (each about 5-6 h wall-time, about 2.54 years of CPU-time per run). The relative average time is shown below (error-bars are the standard deviation of the times within a single run). As the final step takes about 50% longer, it is treated separately. As you can see, the variation in time between different electronic steps is rather small (even running on a different cluster only changes the time by a few %). The Meltdown/Spectre patch, on the other hand, has a significant impact.


Impact of Meltdown/Spectre patch on VASP performance for a 336 core MPI job.

 

The HPC team is currently looking into possible workarounds that could (partially) alleviate the problem. VASP itself is not very I/O intensive, and a first check by the HPC team points toward MPI (the parallelisation framework required for multi-node jobs) being ‘a’ if not ‘the’ culprit. This means that an impact on other multi-node programs is to be expected as well. On the bright side, finding a workaround for MPI would be beneficial for all of them too.

So far, the tests I performed with the HPC team have not shown any improvements (recompiling VASP didn’t help, nor did an MPI-related fix). Let’s keep our fingers crossed, and hope the future brings insight and a solution.

 

Exa-scale computing future in Europe?

As a computational materials scientist with a main research interest in the ab initio simulation of materials, computational resources are the life-blood of my research. Over the last decade, I have seen my resource usage grow from less than 100.000 CPU hours per year to several million CPU-hours per year. To satisfy this need for computational resources I have to make use of HPC facilities, like the TIER-2 machines available at the Flemish universities and the Flemish TIER-1 supercomputer, currently hosted at KU Leuven. At the international level, computational scientists have access to so called TIER-0 machines, something I no doubt will make use of in the future. Before I continue, let me first explain a little what this TIER-X business actually means.

The TIER-X notation is used to give an indication of the size of a (super)computer. There are 4 sizes:

  • TIER-3: This is your personal computer (laptop/desktop) or even a small local cluster of a research group. It can contain from one (desktop) up to a few hundred CPUs (local cluster). Within materials research, this is sufficient for quite a few tasks: post-processing of data, simple force-field based calculations, or even small quantum chemical or solid state calculations. A large fraction of the work during my first Ph.D. was performed on the local cluster of the CMS.
  • TIER-2: This is a supercomputer hosted by an institute or university. It generally contains over 1000 CPUs and has a peak performance of >10 TFLOPS (10^12 Floating Point Operations Per Second; compare this to the 1-25 GFLOPS, i.e. 1-25×10^9 FLOPS, of an average personal computer). The TIER-2 facilities of the VUB and UAntwerp both have a peak performance of about 75 TFLOPS, while the machines at Ghent University and the KU Leuven/UHasselt facilities both have a peak performance of about 230 TFLOPS. Using these machines I was able to perform the calculations necessary for my study of dopant elements in cerates (and obtain my second Ph.D.).
  • TIER-1: Moving up one more step, there are the national/regional supercomputers. These generally contain over 10.000 CPUs and have a peak performance of over 100 TFLOPS. In Flanders the Flemish Supercomputer Center (VSC) manages the TIER-1 machine (which is funded by the 5 Flemish universities). The first TIER-1 machine was hosted at Ghent University, while the second and current one is hosted at KU Leuven, and has a peak performance of 623 TFLOPS (more than all TIER-2 machines combined); it cost about 5.5 Million € (one of the reasons it is a regional machine). Over the last 5 years, I was granted over 10 Million hours of CPU time, sufficient for my study of Metal-Organic Frameworks and defects in diamond.
  • TIER-0: These are international-level supercomputers. These machines contain over 100.000 CPUs, and have a peak performance in excess of 1 PFLOP (1 PetaFLOP = 1000 TFLOPS). In Europe the TIER-0 facilities are available to researchers via the PRACE network (access to 7 TIER-0 machines, with an accumulated 43.49 PFLOPS).

This is roughly the status of what is available today for Flemish scientists at the various levels. With the constantly growing demand for more processing power, the European Union, in the name of EuroHPC, decided in March of this year that Europe will host two exa-scale computers. These machines will have a peak performance of at least 1 EFLOPS, or 1000 PFLOPS, and are expected to be built by 2024-2025. In June, Belgium signed up to EuroHPC as the eighth participating country, in addition to the initial 7 countries (Germany, France, Spain, Portugal, Italy, Luxemburg and The Netherlands).

This is very good news for all involved in computational research in Flanders. There is the plan to build these machines, there is a deadline, …there just isn’t an idea yet of what these machines should look like (except: they will be big, massively power-consuming, and have a target peak performance). To get an idea of what users expect of such a machine, Tier-1 and HPC users have been asked to put forward requests/suggestions of what they want.

From my personal experience as a user, and extrapolating from my own usage, I see myself easily using 20 million hours of CPU time each year by the time these exa-scale machines are built. Leading a computational group would multiply this value. And then we are talking about simple production calculations for “standard” problems.

The claim that an exa-scale machine runs 1000x faster than a peta-scale machine is not entirely justified, at least not for the software I generally encounter. As software seldom scales linearly, the speed-gain from exa-scale machinery mainly comes from the ability to perform many more calculations in parallel. (There are some exceptions which will gain within a single job, but this type of job is limited.) Within my own field, the quantum mechanical calculation of the electronic structure of periodic atomic systems, all required resources tend to grow with the problem size. As such, a larger system (=more atoms) requires more CPU-time, but also more memory. This means that compute nodes with many cores are welcome and desired, but these cores need the associated memory: doubling the cores would require the memory on a node to be doubled as well. Communication between the nodes should be fast as well, as this will be the main limiting factor on the scaling performance. If all this is implemented well, then the time to solution of a project (not of a single calculation) will improve significantly with access to exa-scale resources. The factor will not be 100x compared to a peta-scale system, but it could be much better than 10x. This factor 10 also takes into account that projects will have access to much more demanding calculations as a default (hybrid functional structure optimization instead of simple density functional theory structure optimization, the latter being ~1000x cheaper for plane wave methods but less accurate).

At this scale, parallelism is very important, and implementing it in a program is far from a trivial task. While most physicists/mathematicians/chemists/engineers may have the skills for writing scientifically sound software, we are not computer scientists, and our available time and skills are limited in this regard. For this reason, it will become more important for the HPC facility to provide parallelization of software as a service, i.e. have a group of highly skilled computer scientists available to assist with or even perform this task.

Next to having the best implementation of software available, it should also be possible to get access to these machines. This should not be limited to a happy few through a peer review process which just wastes human research potential. Instead, access should be a mix of guaranteed access and peer review.

  • Guaranteed access: For standard production projects (5-25 million CPU hours/year) university researchers should have a guaranteed access model. This would allow them to perform state-of-the-art research without too much overhead. To prevent access by people without the proven need/skills, a user database could be created and appended upon each application. Upon first application, a local HPC team (country/region/university Tier-1 infrastructure) would have to provide a recommendation with regard to the user, including a statement of the applicant’s resource usage at that facility. Getting resources in a guaranteed-access project would also require a limited project proposal (max 2 pages, including user credentials, requested resources, and a small description of the project).
  • Peer review access: This would be for special projects, in which the researcher requires a huge chunk of resources to perform highly specialized calculations or large high-throughput exercises (order of 250-1000 million CPU hours, e.g. Nature Communications 8, 15959 (2017)). In this case a full project proposal with serious peer review (including a rebuttal stage, or the possibility to resubmit after considering the indicated problems) would be required. The goal of this peer review system should not be to limit the number of accepted projects, but to make sure the accepted projects run successfully.
  • Pay per use: This should be the option for industrial/commercial users.

What could an HPC user such as myself do to contribute to the success of EuroHPC? This is rather simple: run the machine as a pilot user (I have experience on most of the TIER-2 clusters of Ghent University and both Flemish Tier-1 machines. I successfully crashed the programs I am using by pushing them beyond their limits during pilot testing, and ran into rather unfortunate issues. 🙂 That is the job of a pilot user: use the machine/software in unexpected ways, such that problems can be resolved/fixed by the time the bulk of the users get access.) and perform peer review of the larger specialized projects.

Now the only thing left to do is wait. Wait for the exa-scale supercomputers to be built…7 years to go…about 92 node-days on Breniac…a starting grant…one long weekend of calculations.

Appendix

For simplicity I use the term CPU to indicate a single compute core, even though technically a single CPU nowadays contains multiple cores (desktop/laptop: 2-8 cores, HPC compute node: 2-20 cores per CPU, or more). This is to make comparisons a bit easier.

Furthermore, modern computer systems start more and more to rely on GPU performance as well, which is also a possible road toward Exa-scale computing.

Orders of magnitude:

  • G = Giga = 10^9
  • T = Tera = 10^12
  • P = Peta = 10^15
  • E = Exa = 10^18

Resource management on HPC infrastructures.

Computation, as a third pillar of science (next to experiment and theory), is steadily developing in many fields of science, even in some fields where you would expect it less, such as sociology or psychology. In other fields, such as physics, chemistry or biology, it is much more widespread, with people pushing the boundaries of what is possible. Larger facilities provide access to larger problems to tackle. If a computational physicist is asked whether larger infrastructures would not become too big, he’ll just shrug and reply: “Don’t worry, we will easily fill it up, even a machine 1000x larger than that.” An example is given by a pair of physicists who recently published their atomic-scale study of the HIV-1 virus. Their simulation of a model containing more than 64 million atoms used force fields, making the simulation orders of magnitude cheaper than quantum mechanical calculations. Despite this enormous speedup, their simulation of 1.2 µs in the life of an HIV-1 virus (actually only the outer skin of the virus; the inside was left empty) still took about 150 days on 3880 nodes of 16 cores on the Titan supercomputer of Oak Ridge National Laboratory (think about 25 512 years on your own computer).

In Flanders, scientists can make use of the TIER-1 facilities provided by the Flemish Supercomputer Center (VSC). The first Tier-1 machine was installed and hosted at Ghent University. At the end of its life cycle the new Tier-1 machine (Breniac) was installed, hosted at KU Leuven. Although our Tier-1 supercomputer is rather modest compared to the Oak Ridge supercomputer (the HIV-1 calculation I mentioned earlier would require 1.5 years of full-time calculations on the entire machine!), it allows Flemish scientists (including myself) to do things which are not possible on personal desktops or local clusters. I have been lucky, as all my applications for calculation time were successful (granting me between 1.5 and 2.5 million hours of CPU time every year). With the installation of the new supercomputer, accounting of the requested resources has become fully integrated and automated. Several commands are available which provide accounting information, of which mam-balance is the most important one, as it tells you how many credits are still available. However, if you are running many calculations you may want to know how many resources you are actually requesting and using in real time. For this reason, I wrote a small bash script that collects the number of requested and used resources for running jobs:

Output of the Bash Script.

Currently, the last part, on the completed jobs, only provides data based on the most recent jobs. Apparently the full qstat information of older jobs is erased. However, it still provides an educated guess of what you will be using for the still queued jobs.
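
The script itself is plain bash built around the scheduler’s qstat command. As a rough illustration of the idea, a minimal Python sketch along the same lines could look as follows (the qstat -f field names are assumptions based on a Torque/PBS-style scheduler and may differ on your system):

import subprocess

# ask the scheduler for the full description of the jobs (Torque/PBS-style qstat)
out = subprocess.run(["qstat", "-f"], capture_output=True, text=True).stdout

requested, used, state = {}, {}, {}
jobid = None
for line in out.splitlines():
    line = line.strip()
    if line.startswith("Job Id:"):
        jobid = line.split(":", 1)[1].strip()
    elif "=" in line and jobid:
        key, value = (s.strip() for s in line.split("=", 1))
        if key == "job_state":
            state[jobid] = value
        elif key == "Resource_List.walltime":      # requested wall-time
            requested[jobid] = value
        elif key == "resources_used.walltime":     # wall-time used so far
            used[jobid] = value

for job in state:
    print(job, state[job], "requested:", requested.get(job, "?"), "used:", used.get(job, "?"))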

 

VSC-user day 2017: The Poster Edition

Last Friday, the HPC infrastructure in Flanders was celebrated with the VSC user day. Being one of the Tier-1 supercomputer users at UHasselt, I was asked if I could present a poster at the meeting, showcasing the things I do here. Although I was very interested in this event, educational obligations (the presentations of the bachelor projects, on which I will post later) prevented me from attending the meeting.

As a compromise, I created a poster for the meeting, which Geert Jan Bex, of our local VSC/HPC support team, was so kind to put up at the event. The poster session was preceded by a set of 1-minute presentations of the posters, for which a slide had to be made. As I could not be physically present, I provided the organizers with a slide which contained a short description that could be used as the 1-minute presentation. Unfortunately, things got a little mixed up, as Geert Jan accidentally printed this slide as the poster (which gave rise to some difficulties in the printing process 🙄 ). So for those who might have had an interest in the actual poster, let me put it up here:

This poster presents my work on linker functionalisation of MIL-47, which was recently published in the Journal of Physical Chemistry C, and the diamond work on the C-vacancy, which is currently submitted. Clicking on the poster above will provide you with the full-size image. The 1-minute slide presentation, which erroneously got printed as the poster:

Scaling of VASP 5.4.1 on TIER-1b BrENIAC

When running programs on HPC infrastructure, one of the first questions to ask yourself is: “How well does this program scale?”

In applications for HPC resources, this question plays a central role, often with the additional remark: “But for your specific system!” For some software packages this is an important remark; for other packages it has little relevance, as the package performs similarly for all input (or for given classes of input). The VASP package is one of the latter. For my current resource application at the Flemish TIER-1 I set out to do a more extensive scaling test of the VASP package. This is for 2 reasons: the first being the fact that I will be using a newer version of VASP, vasp 5.4.1 (I am currently using my own multiply-patched version 5.3.3), and the second being the fact that I will be using a brand new TIER-1 machine (the second Flemish TIER-1, as our beloved muk retired at the end of 2016).

Why should I put in the effort to get access to resources on such a TIER-1 supercomputer? Because such machines are the life blood of the computational materials scientist; they are our sidekick in the quest for understanding materials. Over the past 4 years, I was granted (and used) 20900 node-days of calculation time (i.e. over 8 million hours of CPU time, or 916 years of calculation time) on the first TIER-1 machine.

Now back to the topic. How well does VASP 5.4.1 behave? That depends on the system at hand, and on how well you choose the parallelization settings.

1. Parallelization in VASP

VASP provides several parameters which allow for straightforward parallelization of the simulation:

  • NPAR : This parameter can be set to parallelize over the electronic bands. As a consequence, the number of bands included in a calculation by VASP will be a multiple of NPAR. (Note : Hybrid calculations are an exception as they require NPAR to be set to the number of cores used per k-point.)
  • NCORE : The NCORE parameter is related to NPAR via NCORE=#cores/NPAR, so only one of these can be set.
  • KPAR : This parameter can be set to parallelize over the set of irreducible k-points used to integrate over the first Brillouin zone. KPAR should therefore best be a divisor of the number of irreducible k-points.
  • LPLANE : This boolean parameter allows one to switch on parallelization over plane waves. In general this will give rise to a small but consistent speedup (observed in previous scaling tests). As such we have chosen to set this parameter = .TRUE. for all calculations.
  • NSIM : Sets up a blocked mode for the RMM-DIIS algorithm. (cf. manual, and further tests by Peter Larsson). As our tests do not involve the RMM-DIIS algorithm, this parameter was not set.

In addition, one needs to keep the architecture of the HPC system in mind: NPAR, KPAR and their product should be divisors of the number of nodes (and cores) used.
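
As a small illustration of these constraints (a hypothetical helper, not part of VASP itself), the following sketch lists the KPAR/NPAR combinations that respect the divisor rules described above for a given job size:

# hypothetical helper: enumerate KPAR/NPAR combinations compatible with the
# divisor rules described above (KPAR divides the number of irreducible k-points,
# and KPAR, NPAR and their product divide the total core count)
def divisors(n):
    return [d for d in range(1, n + 1) if n % d == 0]

def parallel_options(nodes, cores_per_node, nkpoints):
    ncores = nodes * cores_per_node
    options = []
    for kpar in divisors(ncores):
        if nkpoints % kpar:          # KPAR is best a divisor of the number of irreducible k-points
            continue
        for npar in divisors(ncores // kpar):   # KPAR*NPAR should still divide the core count
            options.append((kpar, npar))
    return options

# e.g. a 12 k-point job on 4 nodes of 28 cores
for kpar, npar in parallel_options(nodes=4, cores_per_node=28, nkpoints=12):
    print(f"KPAR={kpar:3d}  NPAR={npar:3d}")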

2. Results

Both NPAR and KPAR parameters can be set simultaneously and will have a serious influence on the speed of your calculations. However, not all possible combinations give the same speedup. Even worse, not all combinations are beneficial with regard to speed. This is best seen for a small 2 atom diamond primitive cell.


Timing results for various combinations of the NPAR and KPAR parameters for a 2 atom primitive unit cell of diamond.

Two things are clear. First of all, switching on a parallelization parameter does not necessarily mean the calculation will speed up; in some cases it may actually slow you down. Secondly, the best and worst performance is consistently obtained using the same settings. Best: KPAR = maximal and NPAR = 1; worst: KPAR = 1 and NPAR = maximal.

This small system shows us what you can expect for systems with a lot of k-points and very few electronic bands. (Actually, any real calculation on this system would only require 8 electronic bands, not the 56 used here to be able to assess the performance of the NPAR parameter.)

In a medium sized system (20-100 atoms), the situation will be different. There, the number of k-points will be small (5-50) while the natural number of electronic bands will be large (>100). As a test-case I looked at my favorite Metal-Organic Framework: MIL-47(V).


Timing results for several NPAR/KPAR combinations for the MIL-47(V) system.

This system has only 12 k-points to parallelize over, and 224 electronic bands. The spread per number of nodes is more limited than for the small system. In contrast, the general trend remains the same: KPAR high, NPAR low, with optimum performance when KPAR = #nodes. Going beyond standard DFT, using hybrid functionals, also retains the same picture, although in some cases about 10% performance can be gained when using half a node per k-point. Unfortunately, as we have very few k-points to start from, this will only be an advantage if the limiting factor is the number of nodes available.

An interesting behaviour is seen when one keeps the k-points/#nodes ratio constant:


Scaling behavior for either a constant number of k-points (for dense k-point grid in medium sized system) or constant k-point/#nodes ratio.

As you can see, VASP performs really well up to KPAR=#k-points (>80% efficiency). More interestingly, if the k-point/#node ratio is kept constant, the efficiency (now calculated as T1/(T2*NPAR) with T1 timing for a single node, and T2 for multiple nodes) is roughly constant. I.e. if you know the walltime for a 2-k-point/2-nodes job, you can use/expect the same for the same system but now considering 20-k-points/20-nodes (think Density of States and Band-structure calculations, or just change of #k-points due to symmetry reduction or change in k-point grid.)  😆

3. Conclusions and Guidelines

If one thing is clear from the current set of tests, it is the fact that good scaling is possible. How it is attained, however, depends greatly on the system at hand. More importantly, making a poor choice of parallelization settings can be very detrimental to the obtained speed-up and efficiency. Unfortunately when performing calculations on an HPC system, one has externally imposed limitations to work with:

  • Memory available per core
  • Number of cores per node[1]
  • Size of your system: #atoms, #k-points, & #bands

Here are some guidelines (some open doors as well):

  • Wherever possible k-point parallelization should be driven to a maximum (KPAR as large as possible). The limiting factor here is the actual number of k-points and the amount of memory available. The latter due to the fact that a higher value of KPAR leads to a higher memory requirement.[2]
  • Use the Γ-version of VASP for Γ-point only calculations. It reduces memory usage significantly (3.7Gb→ 2.8Gb/core for a 512 atom diamond system) and increases the computational efficiency, sometimes even by a factor of 2.
  • NPAR parallelization can be used to reduce the memory load for high KPAR calculations, but increasing KPAR will always outperform the same increase of NPAR.
  • In case only NPAR parallelization is available, due to too few k-points, and working with large systems, NPAR parallelization is your last resort, and will perform reasonably well, up to a point.
  • Electronic steps show very consistent timing, so scaling tests can be performed with only a 5-10 electronic steps, with standard deviations comparable in absolute size for PBE and HSE06.

 

In short:

K-point parallelism will save you, wherever possible !

 

[1] 28 is a lousy number in that regard as its prime-decomposition is 2x2x7, leaving little overlap with prime-decompositions of the number of k-points, which more often than you wish end up being prime numbers themselves 😥
[2] The small system’s memory requirements varied from 0.15 to 1.09 Gb/core for the different combinations.

Simple Parallelization in Fortran: OpenMP

The first PC we got at our home was a Pentium II. My dad got it because I was going to university, and I would be able to do something “useful” with it. (Yup, I survived my entire high school career searching stuff in the library and the home encyclopedia. Even more, Google didn’t even exist before we got our computer, as the company was only founded in 1998 🙂 ). The machine was advertised as state of the art with a clock rate of a whopping 233 MHz! During the decade that followed, the evolution of the clock rates kept going at a steady pace, until it saturated at about 3-4 GHz (15 times faster than the 233 MHz) around 2005. Since then, the clock rate has not increased a bit. If anything, the average clock rate has even decreased to the range of 2-3 GHz. As power consumption grows quadratically with the clock rate, this means that (1) much more heat is produced, which needs to be transported away from your CPU (otherwise it gets destroyed), and (2) reducing the clock rate by a factor of 2 allows you to power 4 CPUs at half the clock rate, effectively doubling your calculation power. (There are even more tricks involved in modern CPUs which crank up performance such that the clock rate isn’t a real measure of performance any longer, and sales people need to learn more new buzzwords to sell your computer/laptop 👿 )

Where in 2005 you bought a single CPU with a high clock rate, you now get a machine with multiple cores. Most machines you can get these days have a minimum of 2 cores, with quad-core machines becoming more and more common. But, there is always a but: even though you now have access to multiple times the processing power of 2005, this does not mean that your own code will be able to use it. Unfortunately there is no simple compiler switch which makes your code parallel (like the -m64 switch which makes your code 64-bit); you have to do this yourself (the free lunch is over). Two commonly used frameworks for this task are OpenMP and MPI. The former mainly focuses on shared-memory configurations (laptops, desktops, single nodes in a cluster), while the latter focuses on large distributed-memory setups (multi-node clusters) and is thus well-suited for creating codes that need to run on hundreds or even thousands of CPUs. The two frameworks differ significantly in their complexity; fortunately for us, OpenMP is both the easier one and the one most suited for a modern multi-core computer. The OpenMP framework consists of pragmas (or directives) which can be added to an existing code as comment lines, and which tell a compiler knowledgeable of OpenMP how to parallelize the code. (It is interesting to note that MPI and OpenMP are intended for parallel programming in either C, C++ or fortran … a hint at what the important programming languages are.)

OpenMP in Fortran: Basics

A. Compiler-options and such

As most modern fortran compilers are also well aware of OpenMP (you can check which version of OpenMP is supported here), you generally will not need to install a new compiler to write parallel fortran code. You only need to add a single compiler flag: -fopenmp (gcc/gfortran), -openmp (intel compiler), or -mp (Portland Group). In Code::Blocks you will find this option under Settings > Compiler > Compiler Settings tab > Compiler Flags tab (if the option isn’t present, try adding it to “other compiler options” and hope your compiler recognizes one of the flags).

Secondly, you need to link in the OpenMP library. In Code::Blocks go to Settings > Compiler > Linker Settings tab > Link Libraries: add. There you add the libgomp.dll.a library (generally found in the folder of your compiler…in case of 64-bit compilers, make sure you get the 64-bit version).

Finally, you may want to get access to OpenMP functions inside your code. This can be achieved by a use statement: use omp_lib.

B. Machine properties

OpenMP contains several functions which allow you to query and set several environment variables (check out these cheat-sheets for OpenMP v3.0 and v4.0).

  • omp_get_num_procs() : returns the number of processors your code sees (on hyper-threaded CPUs this will be double the actual number of processor cores).
  • omp_get_num_threads() : returns the number of threads available in a specific section of the code.
  • omp_set_num_threads(I): Sets the number of threads for the openMP parallel section to I
  • omp_get_thread_num() : returns the index of the specific thread you are in [0..I[

 

  1. subroutine OpenMPTest1()
  2.         use omp_lib;
  3.  
  4.         write(*,*) "Running OpenMP Test 1: Environment variables"
  5.         write(*,*) "Number of threads :",omp_get_num_threads()
  6.         write(*,*) "Number of CPU's available:",omp_get_num_procs()
  7.         call omp_set_num_threads(8) ! set the number of threads to 8
  8.         write(*,*) "#Threads outside the parallel section:",omp_get_num_threads()
  9.         !below we start a parallel section
  10.         !$OMP PARALLEL
  11.         write(*,*) "Number of threads in a parallel section :",omp_get_num_threads()
  12.         write(*,*) "Currently in thread with ID = ",omp_get_thread_num()
  13.         !$OMP END PARALLEL
  14.  
  15. end subroutine OpenMPTest1

 

Notice in the example code above that outside the parallel section indicated with the directives $OMP PARALLEL and $OMP END PARALLEL, the program only sees a single thread, while inside the parallel section 8 threads will run (independent of the number of cores available). 

C. Simple parallelization

The OpenMP framework consists of a set of directives which can be used to manage the parallelization of your code (cheat-sheets for OpenMP v3.0 and v4.0). I will not describe them in detail, as several very well written and complete tutorials on the subject exist; we’ll just have a look at a quick and easy parallelization of a big loop. As said, OpenMP makes use of directives (or pragmas) which are placed as comments inside the code. As such they will not interfere with your code when it is compiled as a serial code (i.e. without the -fopenmp compiler flag). The directives are preceded by what is called a sentinel ( $OMP ). In the above example code, we already saw a first directive: PARALLEL. Only inside blocks delimited by this directive can your code run in parallel.

  1. subroutine OMPTest2()
  2.         use omp_lib;
  3.  
  4.         integer :: IDT, NT,nrx,nry,nrz
  5.         doubleprecision, allocatable :: A(:,:,:)
  6.         doubleprecision :: RD(1:1000)
  7.         doubleprecision :: startT, TTime, stt
  8.  
  9.         call random_seed()
  10.         call random_number(RD(1:1000))
  11.         IDT=500 ! we will make a 500x500x500 matrix
  12.         allocate(A(1:IDT,1:IDT,1:IDT))
  13.  
  14.         write(*,'(A)') "Number of preferred threads:"
  15.         read(*,*) NT
  16.         call omp_set_num_threads(NT)
  17.         startT=omp_get_wtime()
  18.         !$OMP PARALLEL PRIVATE(stt)
  19.         stt=omp_get_wtime()
  20.        
  21.         !$OMP DO
  22.         do nrz=1,IDT
  23.            do nry=1,IDT
  24.               do nrx=1,IDT
  25.               A(nrx,nry,nrz)=RD(modulo(nrx+nry+nrz,1000)+1)
  26.               end do
  27.            end do
  28.         end do
  29.         !$OMP END DO
  30.         write(*,*) "time=",(omp_get_wtime()-stt)/omp_get_wtick()," ticks for thread ",omp_get_thread_num()
  31.         !$OMP END PARALLEL
  32.         TTime=(omp_get_wtime()-startT)/omp_get_wtick()
  33.         write(*,*)" CPU-resources:",Ttime," ticks."
  34.  
  35.         deallocate(A)
  36.     end subroutine OMPTest2

The program above fills a large 3D array with random values taken from a predetermined list. The user is asked to set the number of threads (lines 14-16), and the function omp_get_wtime() is used to obtain the number of seconds elapsed since a fixed point in the past, while the function omp_get_wtick() gives the number of seconds between ticks. These functions can be used to get some timing data for each thread, but also for the entire program. For each thread, the starting time is stored in the variable stt. To protect this variable from being overwritten by each separate thread, it is declared as private to the thread (line 18: PRIVATE(stt) ). As a result, each thread will have its own private copy of the stt variable.

The DO directive on line 21, tells the compiler that the following loop needs to be parallelized. Putting the !$OMP DO pragma around the outer do-loop has the advantage that it minimizes the overhead produced by the parallelization (i.e. resources required to make local copies of variables, calculating the distribution of the workload over the different threads at the start of the loop, and combining the results at the end of the loop).

As you can see, parallelizing a loop is rather simple. It takes only 4 additional comment lines (!$OMP PARALLEL , !$OMP DO, !$OMP END DO and !$OMP END PARALLEL) and some time figuring out which variables should be private for each thread, i.e. which are the variables that get updated during each cycle of a loop. Loop counters you can even ignore as these are by default considered private. In addition, the number of threads is set on another line giving us 5 new lines of code in total. It is of course possible to go much further, but this is the basis of what you generally need.

Unfortunately, the presented example is not that computationally demanding, so it will be hard to see the full effect of  the parallelization. Simply increasing the array size will not resolve this as you will quickly run out of memory. Only with more complex operations in the loop will you clearly see the parallelization. An example of a more complex piece of code is given below (it is part of the phonon-subroutine in HIVE):

  1. !setup work space for lapack
  2.         N = this%DimDynMat
  3.         LWORK = 2*N - 1
  4.         call omp_set_num_threads(this%nthreads)
  5.         chunk=(this%nkz)/(this%nthreads*5)
  6.         chunk=max(chunk,1)
  7.         !$OMP PARALLEL PRIVATE(WORK, RWORK, DM, W, RPart,IO)
  8.         allocate(DM(N,N))
  9.         allocate( WORK(2*LWORK), RWORK(3*N-2), W(N) )
  10.         !the write statement only needs to be done by a single thread, and the other threads do not need to wait for it
  11.         !$OMP SINGLE
  12.         write(uni,'(A,I0,A)') " Loop over all ",this%nkpt," q-points."
  13.         !$OMP END SINGLE NOWAIT
  14.         !we have to loop over all q-points
  15.         !$OMP DO SCHEDULE(DYNAMIC,chunk)
  16.         do nrz=1,this%nkz
  17.             do nry=1,this%nky
  18.                 do nrx=1,this%nkx
  19.                     if (this%kpointListBZ1(nrx,nry,nrz)) then
  20.                         !do nrk=1,this%nkpt
  21.                         WORK = 0.0_R_double
  22.                         RWORK = 0.0_R_double                                                
  23.                         DM(1:this%DimDynMat,1:this%DimDynMat)=this%dynmatFIpart(1:this%DimDynMat,1:this%DimDynMat) ! make a local copy
  24.                         do nri=1,this%poscar%nrions
  25.                             do nrj=1,this%poscar%nrions
  26.                                 Rpart=cmplx(0.0_R_double,0.0_R_double)
  27.                                 do ns=this%vilst(1,nri,nrj),this%vilst(2,nri,nrj)
  28.                                     Rpart=Rpart + exp(i*(dot_product(this%rvlst(1:3,ns),this%kpointList(:,nrx,nry,nrz))))
  29.                                 end do
  30.                                 Rpart=Rpart/this%mult(nri,nrj)
  31.                                 DM(((nri-1)*3)+1:((nri-1)*3)+3,((nrj-1)*3)+1:((nrj-1)*3)+3) = &
  32.                                     & DM(((nri-1)*3)+1:((nri-1)*3)+3,((nrj-1)*3)+1:((nrj-1)*3)+3)*Rpart
  33.                             end do
  34.                         end do
  35.                         call MatrixHermitianize(DM,IOS=IO)
  36.                         call ZHEEV( 'V', 'U', N, DM, N, W, WORK, LWORK, RWORK, IO )
  37.                         this%FullPhonFreqList(:,nrx,nry,nrz)=sign(sqrt(abs(W)),W)*fac
  38.                     end if
  39.                 end do
  40.             end do
  41.         end do
  42.         !$OMP END DO
  43.         !$OMP SINGLE
  44.         write(uni,'(A)') " Freeing lapack workspace."
  45.         !$OMP END SINGLE NOWAIT
  46.         deallocate( WORK, RWORK,DM,W )
  47.         !$OMP END PARALLEL

In the above code, a set of equations is solved using the LAPACK eigenvalue solver ZHEEV to obtain the energies of the phonon-modes in each point of the Brillouin zone. As the calculation of the eigenvalue spectrum for each point is independent of all other points, this is extremely well-suited for parallelization, so we can add !$OMP PARALLEL and !$OMP END PARALLEL on lines 7 and 47. Inside this parallel section there are several variables which are recycled for every grid point, so we will make them PRIVATE (cf. line 7, most of them are work-arrays for the ZHEEV subroutine).

Lines 12 and 44 both contain a write-statement. Without further action, each thread will perform this write action, and we’ll end up with multiple copies of the same line (Although this will not break your code it will look very sloppy to any user of the code). To circumvent this problem we make use of the !$OMP SINGLE directive. This directive makes sure only 1 thread (the first to arrive) will perform the write action. Unfortunately, the SINGLE block will create an implicit barrier at which all other threads will wait. To prevent this from happening, the NOWAIT clause is added at the end of the block. In this specific case, the NOWAIT clause will have only very limited impact due to the location of the write-statements. But this need not always to be the case.

On line 15 the !$OMP DO pragma indicates a loop will follow that should be parallelized. Again we choose for the outer loop, as to reduce the overhead due to the parallelization procedure. We also tell the compiler how the work should be distributed using the SCHEDULE(TYPE,CHUNK) clause. There are three types of scheduling:

  1. STATIC: which is best suited for homogeneous workloads. The loop is split in equal pieces (size given by the optional parameter CHUNK, else equal pieces with size=total size/#threads)
  2. DYNAMIC: which is better suited if the workload is not homogeneous (in this case the central if-clause on line 19 complicates things). CHUNK can again be used to define the sizes of the workload blocks.
  3. GUIDED: which is a bit like dynamic but with decreasing block-sizes.

From this real life example, it is again clear that OpenMP parallelization in fortran can be very simple.

D. Speedup?

On my loyal sidekick (with a hyper-threaded quad-core Core i7) I was able to get the following speedups for the phonon code (the run was limited to performing only a phonon-DOS calculation):


Speedup of the entire phonon-subroutine due to parallelization of the main-phonon-DOS loop.

The above graph shows the speed-up results for the two different modes for calculating the phonon-DOS. The reduced mode (DM red) uses a spectrum reduced to that of a unit-cell, but needs a much denser sampling of the Brillouin zone (second approach), and is shown by the black line. The serial calculation in this specific case only took 96 seconds, and the maximum speedup obtained was about x1.84. The red and green curves give the speedup of the calculation mode which makes use of the super-cell spectrum (DM nored, i.e. a much larger matrix to solve), and show for increasing grid sizes a maximum speedup of x2.74 (serial time: 45 seconds) and x3.43 (serial time: 395 seconds), respectively. The reason none of the setups reaches a speedup of 4 (or more) is twofold:

  1. Amdahl’s law puts an upper limit on the global speedup of a calculation by taking into account that only part of the code is parallelized (e.g. write sections to a single file cannot be parallelized); a small numerical illustration is given below this list.
  2. There needs to be sufficient blocks of work for all threads (indicated by nkz in the plot)
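
To give a feeling for the first point, here is a small sketch of Amdahl's law (the parallel fractions below are made-up illustration values, not measured for HIVE):

# Amdahl's law: if a fraction p of the runtime is parallelized over n threads,
# the best possible overall speedup is S(n) = 1 / ((1 - p) + p / n)
def amdahl_speedup(p, n):
    return 1.0 / ((1.0 - p) + p / n)

for p in (0.75, 0.90, 0.95):  # hypothetical parallel fractions
    print(f"p = {p:.2f}:", "  ".join(f"S({n}) = {amdahl_speedup(p, n):.2f}" for n in (2, 4, 8)))
# even with 95% of the runtime parallelized, 4 threads give at most ~3.5x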

In the case of the DM nored calculations, the parallelized loop clearly takes the biggest part of the calculation time, while for the DM red calculation the section generating the q-point grid also takes a large fraction of the calculation time, limiting the effect of parallelization. An improvement here would be to also parallelize the subroutine generating the grid, but that will be for future work. For now, the expensive DM nored calculations show an acceptable speedup.