Tag: machine learning

Parallel Python in classes…now you are in a pickle

In the past, I discussed how to create a python script which runs your calculations in parallel.  Using the multiprocessing library, you can circumvent the GIL and employing the async version of the multiprocessing functions, calculations are even performed in parallel. This works quite well, however, when using this within a python class you may run into some unexpected behaviour and errors due to the pickling performed by the multiprocessing library.

For example, if the doOneRun function is a class function defined as

class MyClass:
...
    def doOneRun(self, id:int):
       return id**3
...

and you perform some parallel calculation in another function of your class as

class MyClass:
...
    def ParallelF(self, NRuns:int):
       import multiprocessing as mp

       nproc=10
       pool=mp.Pool(processes=nprocs) 
       drones=[pool.apply_async(self.doOneRun, args=(nr,)) for nr in range(NRuns)] 

       for drone in drones: 
           Results.collectData(drone.get()) 
       pool.close() 
       pool.join() 
       
...

you may run into a runtime error complaining that a function totally unrelated to the parallel work (or even to the class itself) can not be pickled. 😯

So what is going on? In the above setup, you would expect the pool.apply_async function to take just a function pointer to the doOneRun function. However, as it is provided by a the call self.doOneRun, the pool-function grabs the entire class and everything it contains, and tries to pickle it to distribute it to all the processes.  In addition to the fact that such an approach is hugely inefficient, it has the side-effect that any part associated to your class needs to be pickleable, even if it is a class-function of a class used to generate an object which is just a property of the MyClass Class above.

So both for reasons of efficiency and to avoid such side-effects, it is best to make the doOneRun function independent of a class, and even placing it outside the class.

def doOneRun(id:int):
    return id**3
  
class MyClass:
...
    def ParallelF(self, NRuns:int):
       import multiprocessing as mp

       nproc=10
       pool=mp.Pool(processes=nprocs) 
       drones=[pool.apply_async(doOneRun, args=nr) for nr in range(NRuns)] 

       for drone in drones: 
           Results.collectData(drone.get()) 
       pool.close() 
       pool.join() 
       
...

This way you avoid pickling the entire class, reducing initialization times of the processes and the  unnecessary communication-overhead between processes. As a bonus, you also reduce the risk of unexpected crashes unrelated to the calculation performed.

Permanent link to this article: https://dannyvanpoucke.be/parallel-python-classes-pickle/

Practical Machine-Learning for the Materials Scientist

Scilight graphic

Individual model realizations may not perform that well, but the average model realization always performs very well.

Machine-Learning  is up and trending. You can’t open a paper, magazine or website without someone trying to convince you their new AI-improved app/service will radically change your life. It will make the production of your company more efficient and cheaper, make costumers flock to your shop and possibly cure cancer on the side. Also in science, a lot of impressive claims are being made. General promises entail that it makes the research of interest faster, better, more efficient,… There is, however, a bit of fine print which is never explicitly mentioned: you need a LOT of data. This data is used to teach your Machine-Learning algorithm whatever it is intended to learn.

In some cases, you can get lucky, and this data is already available while in other, you still need to create it yourself. In case of computational materials science this often means performing millions upon millions of calculations to create a data set on which to train the Machine-Learning algorithm.[1] The resulting Machine-Learning model may be a thousand times faster in direct comparison, but only if you ignore the compute-time deficit you start from.

In materials science, this is not only a problem for those performing first principles modeling, but also for experimental researchers. When designing a new material, you generally do not have the resources to generate thousands or millions of samples while varying the parameters involved. Quite often you are happy if you can create even a few dozen samples. So, can this research still benefit from Machine-Learning if only very small data sets are available?

In my recent work on materials design using Machine-Learning combined with small data sets, I discuss the limitations of small data sets in the context of Machine-Learning and present a natural approach for obtaining the best possible model.[2] [3]

The Good, the Bad and the Average.

(a) Simplified representation of modeling small data sets. (b) Data set size dependence of the distribution of model coefficients. (c) Evolution of model-coefficients with data set size. (d) correlation between model coefficient value and model quality.

In Machine-Learning a data set is generally split in two parts. One part to train the model, and a second part to test the quality of the model. One of the underlying assumptions to this approach is that each subset of the data set provides an accurate representation of the “true” data/model. As a result, taking a different subset to train your data should give rise to “the same model” (ignoring small numerical fluctuations). Although this is generally true for large (and huge) data sets, for  small data sets this is seldomly the case (cf. figure (a) on the side). There, the individual data points considered will have a significant impact on the final model, and different subsets give rise to very different models. Luckily the coefficients of these models still present a peaked distribution. (cf. figure (b)).

On the down side, however, if one isn’t careful in preprocessing the data set correctly, these distributions will not converge upon increasing the data set size, giving rise to erratic model behaviour.[2]

Not only the model coefficients give rise to a distribution, the same is true for the model quality. Using the same data set, but making a different split between training and test data can give rise to large differences in  quality for the model instances. Interestingly, the model quality presents a strong correlation with the model coefficients, with the best quality model instances being closer to the “true” model instance. This gives rise to a simple approach: just take many train-test splittings, and select the best model. There are quite some problems with such an approach, which are discussed in the manuscript [2]. The most important one being the fact that the quality measure on a very small data set is very volatile itself. Another is the question of how many such splittings should be considered? Should it be an exhaustive search, or are any 10 random splits good enough (obviously not)? These problems are alleviated by the nice observation that “the average” model shows not the average quality or the average model coefficients, but instead it presents the quality of the best model (as well as the best model coefficients). (cf. figure (c) and (d))

This behaviour is caused by the fact that the best model instances have model coefficients which are also the average of the coefficient distributions. This observation hold for simple and complex model classes making it widely applicable. Furthermore, for model classes for which it is possible to define a single average model instance, it gives access to a very efficient predictive model as it only requires to store model coefficients for a single instance, and predictions only require a single evaluation. For models where this is not the case one can still make use of an ensemble average to benefit from the superior model quality, but at a higher computational cost. 

References and footnotes

[1] For example, take “ANI-1: an extensible neural network potential with DFT accuracy at force field computational cost“, one of the most downloaded papers of the journal of Chemical Science. The data set the authors generated to train their neural network required them to optimize 58.000 molecules using DFT calculations. Furthermore, for these molecules a total of about 17.200.000 single-point energies were calculated (again at the DFT level). I leave it to the reader to estimate the amount of calculation time this requires.

[2] “Small Data Materials Design with Machine Learning: When the Average Model Knows Best“, Danny E. P. Vanpoucke, Onno S. J. van Knippenberg, Ko Hermans, Katrien V. Bernaerts, and Siamak Mehrkanoon, J. Appl. Phys. 128, 054901  (2020)

[3] “When the average model knows best“, Savannah Mandel, AIP SciLight 7 August (2020)

Permanent link to this article: https://dannyvanpoucke.be/practical-machine-learning/

Small Data Materials Design with Machine Learning: When the Average Model Knows Best

Authors:  Danny E. P. Vanpoucke, Onno S. J. van Knippenberg, Ko Hermans, Katrien V. Bernaerts, and Siamak Mehrkanoon
Journal: Journal of Applied Physics 128, 054901 (2020)
doi: 10.1063/5.0012285
IF(2019): 2.286
export: bibtex
pdf: <JApplPhys>   (Open Access)
github: <Amadeus>

 

Vulcanoplot
Graphical Abstract: Correlation plot of the RMSE of the validation set and the intercept value for linear model instances trained on 1000 subsets of a 25 point data set. The distribution of the correlation data is indicated by the black curve.

Abstract

Machine Learning is quickly becoming an important tool in modern materials design. Where many of its successes are rooted in huge data sets, the most common applications in academic and industrial materials design deal with data sets of at best a few tens of data points. Harnessing the power of Machine Learning in this context is therefore of considerable importance. In this work, we investigate the intricacies introduced by these small data sets. We show that individual data points introduce a significant chance factor in both model training and quality measurement. This chance factor can be mitigated by the introduction of an ensemble-averaged model. This model presents the highest accuracy while at the same time it is robust with regard to changing data set size. Furthermore, as only a single model instance needs to be stored and evaluated, it provides a highly efficient model for prediction purposes, ideally suited for the practical materials scientist.

Permanent link to this article: https://dannyvanpoucke.be/paper_averagemodel-en/

Parallel Python?

As part of my machine learning research at AMIBM, I recently ran into the following challenge: “Is it possible to do parallel computation using python.” It sent me on a rather long and arduous journey, with the final answer being something like: “very reluctantly“.

Python was designed with one specific goal in mind; make it easy to implement small test programs to see if an idea is worth pursuing. This gave rise to a scripting language with a lot of flexibility, but also with significant limitations, most of which the “intended” user would never meet. However, as a consequence of its success, many are using it going far beyond this original scope (yours truly as well 🙂 ).

Python offers various libraries to parallelize your scripts…most of them wrappers adding minor additional functionality. However, digging down to the bottom one generally ends up at one of the following two libraries: the threading module and the multiprocessing module.

Of course, as with many things python, there is a huge amount of tutorials available with many of great quality.

import threading

Programmers experienced in a programming language such as C/C++, Pascal, or Fortran, may be familiar with the concept of multi-threading. With multi-threading, a CPU allows a program to distribute its work over multiple program-threads which can be performed in parallel by the different cores of the CPU (or while a core is idle, e.g., since a thread is waiting for data to be fetched).  One of the most famous API’s for writing multi-threaded applications is OpenMP. In the past I used it to parallelize my Hirshfeld-I implementation and the phonon-module of HIVE.

For Python, there is no implementation of the OpenMP API, instead there is the threading module. This provides access to the creation of multiple threads, each able to perform their own tasks while sharing data-objects. Unfortunately, python has also the Global Interpreter Lock, GIL for short, which allows only a single thread to access the interpreter at a time. This effectively reduces thread-based parallelization to a complex way of running a code in a serial way.

For more information on “multi-threading” in python, you can look into this tutorial.

import multiprocessing

In addition to the threading module, there is also the multiprocessing module. This module side-steps the GIL by creating multiple processes, each having its own interpreter. This however comes at a cost. Firstly, there is a significant computational cost starting the different processes. Secondly, objects are not shared between processes, so additional work is needed to collect and share data.

Using the “Pool” class, things are somewhat simplified, as can be seen in the code-fragment below.  With the pool class one creates a set of threads/processes available for your program. Then through the function apply_async function it is possible to run processes in parallel. (Note that you need to use the “async” version of the function, as otherwise you end up with running things serial …again)

[codesyntax lang=”python” lines=”normal” container=”pre” capitalize=”no” title=”multiprocessing backbone”]

import multiprocessing as mp

def doOneRun(id:int): #trivial function to run in parallel
	return id**3



num_workers=10  #number of processes
NRuns=1000      #number of runs of the function doOneRun

pool=mp.Pool(processes=num_workers)   # create a pool of processes
drones=[pool.apply_async(doOneRun, args=nr) for nr in range(NRuns)] #and run things in parallel

for drone in drones: #and collect the data
	Results.collectData(drone.get()) #Results.collectData is a function you write to recombine the separate results into a single result and is not given here.
    
pool.close() #close the pool...no new tasks can be run on any of the processes
pool.join()  #collapse all threads back into the main thread

[/codesyntax]

 

how many cores does my computer have?

If you are used to HPC applications, you always want to get as much out of your machine as possible. With regard to parallelization this often means making sure no CPU cycle is left unused. In the example above we manually selected the number of processes to spawn. However, would it not be nice if the program itself could just set this value to be equal to the number of physical cores accessible?

Python has a large number of functions claiming to do just that. A few of them are given below.

  •  multiprocessing.cpu_count(): returns the number of logical cores it can find. So if you have a modern machine with hyper-threading technology, this will return a multiple of the number of physical cores (and you will be over-subscribing your CPU.
  • os.cpu_count(): same as multiprocessing.cpu_count().
  • psutil.cpu_count(logical=False): This implementation gives the same default behavior, however, the parameter logical allows for this function to return the correct number of cores in a single CPU. Indeed a single CPU. HPC architectures which contain multiples CPUs per node will again return an incorrect number, as the implementation makes use of a python “set”, and as such doesn’t increment for the same index core on a different CPU.

In conclusion, there seems to be no simple way to obtain the correct number of physical cores using python, and one is forced to provide this number manually. (If you do have knowledge of such a function which works in both windows and unix environments and both desktop and HPC architectures feel free to let me know in the comments.)

All in all, it is technically possible to run code in parallel using python, but you have to deal with a lot of python quirks such as GIL.

Permanent link to this article: https://dannyvanpoucke.be/parallel-python-en/

Workshop Machine Learning for Coatings: ML in the Lab (day 5)

On the fifth and final day of the workshop we return to the lab. Our task as a group: optimize our raspberry pink lacquer with regard to hardness, glossiness and chemical resistance.

The four cans of base material made during day 1 of the workshop were mixed to make sure we were all using the same base material (there are already sufficient noise introducing variables present, so any that can be eliminated should be.). Next, each team got a set of recipes generated with the ML algorithm to create. The idea was to parallelise the human part of the process. This would actually also have made for a very interesting exercise to perform in a computer science program. It showed perfectly how bottlenecks are formed and what impact is of serial sections and access/distribution of resources (or is this just in my mind? 😎  ).  After a first round of samples, we already tried to improve the performance of our unit by starting the preparation of the next batch (prefetching 😉 ) while the results of the previous samples were entered into the ML algorithm, and that was run.

At the end of two update rounds, we discussed the results, there were already some clear improvements visible, but a few more rounds would have been needed to get to the best situation. A very interesting aspect to notice during such an exercise, is the difference in the concept of accuracy for the experimental side and the computational side of the story. While the computer easily spits out values in grams with 10 significant digits, at the experimental side of the story it was already extremely hard to get the same amounts with an accuracy of 0.02 gram (the present air currents give larger changes on the scale).

This workshop was a very satisfying experience. I believe I learned most with regard to Machine Learning from the unintentional observation in the lab. Thank you Christian and Kevin!  

 

Permanent link to this article: https://dannyvanpoucke.be/dnlhit_mlworkshop_day5-en/

Workshop Machine Learning for Coatings: Stochos and DGCN (day 4)

Day 4 of the workshop is again a machine learning centered day. Today we were introduced into the world of Gaussian Processes, and ML approach which is rooted in statistics and models data by looking at the averages of a distribution of functions…it is a function of functions. In contrast to most other ML approaches it is also very well suited for small data sets, which is why I had my eye on them already for quite some time. However, Gaussian Processes are not perfect and interestingly enough, their drawbacks and benefits seem quite complementary with the benefits and drawbacks to neural networks. Deep Gaussian Covariance Networks (DGCN) find their origin in this observation, and were designed with the idea of compensating the drawbacks of both approaches by combining them. The resulting approach is rather powerful and in contrast to any other ML approach: it does not have any hyper-parameter!!

Tomorrow, during the last day of the workshop, we will be using this DGCN to optimize our raspberry pink lacquers.

Permanent link to this article: https://dannyvanpoucke.be/dnlhit_mlworkshop_day4-en/

Workshop Machine Learning for Coatings: First Machine Learning (day 3)

Gartner hype cycle. Courtesy of Kevin Cremanns.

Gartner hype cycle. Courtesy of Kevin Cremanns.

Today the workshop shifted gears a bit. We left the experimental side of the story and moved fully into the world of machine learning. This change went hand-in-hand with a doubling of the number of participants, showing how a hot-topic machine learning really is.  Kevin Cremanns, who is presenting this part of the workshop, started by putting things into perspective a bit, and warned everyone not to hope for magical solutions (ML and AI have their problems), while at the same time presenting some very powerful examples of what is possible. A fun example is the robotic arm learning to flip pancakes:



During the introduction, all the usual suspects of machine learning passed the stage. And although you can read about them in every ML-book, it is nice to hear them discussed by someone who uses them on a daily basis. This mainly because practical details (often omitted in text-books) are also mentioned, helping one to avoid the same mistakes many have made before you. Furthermore, the example codes provided are extremely well documented, making them an interesting source of teaching material (the online manuals for big libraries like sci-kit learn or pandas tend to be too abstract, too big, and too intertwined for new users).

All-in-all a very interesting day. I look forward to tomorrow, as then we will be introduced into the closed source machine learning library developed at the University Hochschule Niederrhein.

Permanent link to this article: https://dannyvanpoucke.be/dnlhit_mlworkshop_day3-en/

Workshop Machine Learning for Coatings: Magical Humans (day 2)

Today was the second day of the machine learning workshop on coatings. After having focused on the components of coatings, today our focus went to characterization and deposition. The set of available characterization techniques is as extensive as the possible components to use. There was, however, one thing which grabbed my attention: “The magical human observer”. Several characterization techniques were presented to heavily rely on the human observer’s opinion and Fingerspitzengefühl.  Sometimes this even came with the suggestion that such a human observer outperforms the numerical results of characterization machinery. This makes me wonder if this isn’t an indication of a poor translation of the human concept to the experiment intended to perform the same characterization. Another important factor to keep in mind when building automation frameworks and machine learning models.

In the afternoon, we again put on our lab coats and goggles. The task of the day: put our raspberry pink lacquer on different substrates and characterize the glossiness (visually) and the pendulum hardness.

Tomorrow the machine learning will kick in.

Permanent link to this article: https://dannyvanpoucke.be/dnlhit_mlworkshop_day2-en/

Workshop Machine Learning for Coatings (day 1)

Today was the first day of school…not only for my son, but for me as well. While he bravely headed for the second grade of primary school, I was en route to the first day of a week-long workshop on Machine Learning and Coatings technology at the Hochschule Niederrhein in Krefeld. A workshop combining both the practical art of creating coating formulations and the magic of simulation, more specifically machine learning.

During my career as a computational materials researcher, I have worked with almost every type of material imaginable (from solids to molecules, including the highly porous things in between called MOFs), and looked into every aspect available, be it configuration (defects , surfaces, mixtures,…) or materials properties (electronic structure, charge transfer, mechanical behavior and spin configurations). But each and every time, I did this from a purely theoretical perspective*. As a result, I have not set foot in a lab (except when looking for a colleague) since 2002 or 2003, so you can imagine my trepidation at the prospect of having to do “real” lab-work during this workshop.

Participating in such a practical session— even such a ridiculously simple and safe one— is a rather interesting experience. The safety-goggles, white-coat and gloves are cool to wear, true, but from my perspective as a computational researcher who wants to automate things, this gives me a better picture of what is going on. For example, we** carefully weigh 225.3 grams of a liquid compound and add 2.2 grams of another (each with an accuracy of about 0.01 gram). In another cup, we collect two dye compounds (powders), again trying our best to perfectly match the prescribed quantities. But when the two are combined in the mixer it is clear that a significant quantity (multiple grams) are lost, just sticking to the edge of the container and spatula. So much for carefully weighing (of course a pro has tricks and skills to deal with this better than we did, but still). Conclusion: (1)Error bars are important, but hard to define. (2) Mixtures made by hand or by a robot should be quite different in this regard.

For the theoretical part of my brain, mixing 10 compounds is just putting them in the same box and stir, mix or shake. Practice can be quite different, especially if you need 225 grams of compound A, and 2.2 grams of compound B. This means that for the experimentalist there is a “natural order” for doing things. This order does not exist at the theoretical side of the spectrum***, where I build my automation and machine learning. This, in addition to the implicit interdependence of combined compounds, gives the high-dimensional space of possible mixtures a rather contorted shape. This gives rise to several questions begging for answers, such as: how important is this order, and can we (ab)use all this to make our search space smaller (but still efficient to sample).

At the end of the day, I learned a lot of interesting things and our team of three ended up with a nice raspberry pink varnish.

Next, day two, where we will characterize our raspberry pink varnish.

 

Footnotes

* Yes, I do see how strange this may appear for someone whose main research focus is aimed at explaining and predicting experiments. 🙂
** We were divided in teams of 2-3 people, so there were people with actual lab skills nearby to keep me safe. However, if this makes you think I was just idly present in the background, I have to disappoint you. I am brave enough to weigh inanimate powders and slow flowing resins 😉 .
*** Computational research in its practice uses aspects of both the experimental and theoretical branches of research. We think as theoreticians when building models and frameworks, and coax our algorithms to a solution with a gut-feeling and Fingerspitzengefühl only experimentalists can appreciate.

Permanent link to this article: https://dannyvanpoucke.be/dnlhit_mlworkshop_day1-en/

New year’s resolution

A new year, a new beginning.

For most people this is a time of making promises, starting new habits or stopping old ones. In general, I forgo making such promises, as I know they turn out idle in a mere few weeks without external stimulus or any real driving force.

In spite of this, I do have a new years resolution for this year: I am going to study machine learning and use it for any suitable application I can get my hands on (which will mainly be materials science, but one never knows).  I already have a few projects in mind, which should help me stay focused and on track. With some luck, you will be reading about them here on this blog. With some more luck, they may even end up being part of an actual scientific publication.

But first things first, learn the basics (beyond hear-say messages of how excellent and world improving AI is/will be). What are the different types of machine learning available, is it all black box or do you actually have some control over things. Is it a kind of magic? What’s up with all these frameworks (isn’t there anyone left who can program?), and why the devil seem they all to be written in a script langue (python) instead of a proper programming language? A lot of questions I hope to see answered. A lot of things to learn. Lets start by building some foundations…the old fashioned way: By studying using a book, with real paper pages!

Happy New Year, and best wishes to you all!

Permanent link to this article: https://dannyvanpoucke.be/new-years-resolution-en/