# Tag: AI

## Creating online forms and catching spam-bots

Recently, I decided to add a custom registration form  to my website, as part of an effort to improve and streamline the “HIVE-STM tool experience” 😉 . Up until now, potential users had to directly send me an e-mail, telling me a bit more about themselves and their work. I would then e-mail them the program, and add their information to a user list for future reference (i.e., support and some statistics for my personal entertainment).

This has the drawback that any future user needs to wait until I find the time to reply. To improve on the user-friendliness, I thought it would be nice to automate this a bit. A first step in this process entails making the application a bit more uniform: using an online registration form.

### The art of learning something new: Do it from scratch

What started out with the intention of being an almost trivial exercise in building a web-form, turned into a steep learning curve about web-development and cyber-security. I am aware there exists many tools which generate forms for websites or even provide you a platform which hosts the form (e.g., google-forms, which I used in the past), but I wanted to do implement it myself (…something to do with pride 😉 ). Having build websites using HTML and CSS in the past, and having some basic experience with Javascript, this looked like a fun afternoon project. The HTML for the form was easily created using the tutorials found on w3schools.com and an old second edition “Handboek HTML5 en CSS3“, I picked up a few years ago browsing a second hand bookshop. Trouble, however, started rearing its ugly head as soon as I wanted to integrate this form in this WordPress website. Just pasting this into a page or post doesn’t really work, as WordPress wants to “help” you, and prevent you from hurting yourself. This is a fantastic feature if you have no clue about HTML/CSS/… or don’t want to care about it. Unfortunately, if you want to do something slightly more  advanced you are in for a hell of a ride, as you find out the relevant bits get redacted or disabled.

Searching for specific solutions with regard to creating a custom form in WordPress I was astounded at how often the default suggestion is: “use plugin XXX” or “use tool YYY”. Are we loosing the ability to want to craft something ourselves? Yes of course, there are professional tools available which can be better than anything you yourself can build in a short amount of time…but should it discourage you of trying, and feeling the satisfaction of having created something? I digress.

In the end, I discovered a good quality tutorial (once you get past the reasons why not to do it) and I started a long uphill battle trying to bend WordPress to my will:

1. Paste form-code in postWP countermove: remove relevant tags essentially killing the form.
2. Solution: put the form in a dedicated template ⇒ WP countermove: hard to integrate in existing theme, will be removed upon update of the theme
3. Solution: create a child-theme ⇒ WP countermove: interesting exercise is getting the CSS style-sheet to work together with that of the parent theme. (wp_enqueue_style, wp_enqueue_scripts, get_template_directory_uri() and get_stylesheet_directory_uri() saved the day.)
4. Add PHP back-end to the form…and deal with the idiosyncrasies of this scripting language. Crashed the website a few time due to missing “;”… error messages would be nice, instead of the blank web-page.

### Trying not to torture future users

At this point, the form accepted input, and collected it via the PHP \$_POST global variable. En route to this point, I read quite a few warnings about Cross-Site Request Forgery (CSRF) and that one should protect against it. Luckily, the tutorial practically showed how to do this in WordPress using nonces…in contrast to WordPress theme handbook which gives in formation, but not easy to understand if you are new to the subject.

With a basic sense of security, I was aiming at making things user-friendly, i.e., if something goes wrong it would be nice if you do not need to again fill out the form entirely. Searching for ways to keep this information I came across a lot of options, none of which seemed to work (cookies, PHP variables, global variables, etc). The problem appeared to originate from the fact that the information was not persistent. Once the web page started reloading, everything got erased. It was only at this point that I learned about “transients” in WordPress, and using get_transient() and set_transient() resolved all the issues instantaneously. There is only one caveat at this point: If two potential users submit their registration at almost the same time one may end up seeing the registration information of the other. (However, at this time the program is far from famous enough to present any issues, so statistics will save us from this).

Only one thing remained to be done: put all relevant information into two e-mail messages, one to be sent to myself, and one to be sent to the potential user. For this, I made use of the PHP mail() function. It works quite nicely, and after playing around with it for a bit (and convincing myself a nice HTML formatted layout will not work for example in gmail) the setup was complete. That evening, I went to bed, happy with the accomplishment: I had created something.

### Too popular for comfort

Bot Activity on the HIVE registration form during February and March of 2021.

The next morning, I was amazed to find already several applications for the HIVE-STM program in my mailbox (that is, in addition to my own test runs). These were not sent by real humans, but appeared to be the work of bots just filling out the form and sending it off. This left me a bit puzzled, and I have been looking for the reason why anyone would actually bother writing a bot for this purpose. So far I’ve seen the suggestion that this is to improve the SEO of websites, generate spam-email (to yourself or with you as middleman), DOS-attacks, get access to your SQL database via code injection,…and after all my searches, I start to get the impression this may also be a means of promoting all the plugins, tools, frameworks that block these bots? In roughly each discussion you find, there will be at least one person promoting such a foolproof perfect tool 😯 🙄 …but might just be me.

So how do we deal with these bots, preferably without driving potential users crazy? Reading all the suggestions (which unfortunately provide extremely little information on the actual working and logic of spam-bots themselves) I added, in several rounds, some tricks to block/catch the bots, and have been tracking the submits since the form went live. As you can see there is a steady stream of some 50 bots weekly trying to fill out the form. The higher number in the first week is due to any submission being redirected to the original form page, as such the same bots performed multiple attempts within the time-range of a few minutes. In about two months, I collected the results of 400 registration attempts by bots (and 4 by humans).

Analyzing the results, I learned learned some interesting things.

How to catch a bot? I track 4 different signals which may be indicative of bot behavior.

One of the first things to add, from a human perspective is “a captcha”. The captcha is manually implemented simple random sum/product/subtraction. It should be easy for humans, but it is annoying as they need to fill out an extra field (and may fill it out incorrectly). Interestingly, 56% of the bots fill out the Captcha correctly. Of course more complicated versions could be implemented or used…but the bottom line is simple: it generally does not do the job, and annoys the actual human being.

#### 2. Bot Trapping for furs?

Going beyond captcha’s, a lot of tutorials suggest the use of a honeypot. One can either make use of automated options of existing frameworks, plugins or …implement these oneself. This option appears to be very successful in targeting bots. The 1% successful cases coincided with the only human submissions. At this point we appear to have a “fool-proof” method for distinguishing between humans and bots.

#### 3. Dropping the bot down the box?

Interestingly, drop-down menu’s with not generally used topics seem to throw off bots as well. The seniority drop-down menu shows failure rates even better than the captcha.

### Conclusion

Writing your own form from scratch is a very interesting exercise, and well worth the time if you want to learn more about web-security as well as the inner workings of the framework used for your website. Bots are an interesting nuisance, and captcha’s just bother your user as most bots can easily deal with them. Logging the inputs of the bots does show a wide range in quality of these bots. Some just fill out garbage, while others appear to be quite smart, filling out reasonable answers. Other bots clearly have malignant purposes, which becomes clear from the code they try to plug into the form fields.

For now, the registration form seems to be able to distinguish between human-and bot-users. As such, we have successfully completed another step in upgrading the STM-program

## Building your own scikit-learn Regressor-Class: LS-SVM as an example

The world of Machine-Learning (ML) and Artificial Intelligence (AI) is governed by libraries, as the implementation of a full framework from scratch requires a lot of work. ML and data-science engineers and researchers, therefore don’t generally build their own libraries. Instead they use and extend existing libraries written in python or R. One of the most popular current python ML libraries is scikit-learn. This library provides access to scores of ML-models and methods which can be combined at will via the use of a consistent global API.

However, no matter how many models there are included in such a library, chances are that a model you wish to use (or the extension you envision for an existing model) is not implemented.  In such a case, you do not want to write an entire ML framework from scratch, but just create your own model and fit it into the existing framework.  Within the scikit-learn framework this can be done with relative ease, as is explained in this short tutorial. As an example, I will be building a regressor class for the LS-SVM model.

## 1. The ML-model: LS-SVM?

Least-Squares Support Vector Machines is a type of support vector machines (SVM) initially developed some 20 years ago by researchers at the KULeuven (and is still being further developed, funded via several ERC grants). It’s a supervised learning machine learning approach in which a system of linear equations is solved using the kernel-trick.

So how does it work in practice? Assume, we have a data set of data points (xi,yi), with xi the feature vector and yi the target of the data point (or sample) i. Depending on whether you want to perform classification or regression, training the model corresponds to solving the following system of equations (represented in their matrix form as):

Classification:

$\begin{bmatrix} 0 & Y^T \\ Y & \Omega + \gamma^{-1}\mathbb{I} \end{bmatrix} \left[ \begin{array}{c} b \\ \alpha \end{array} \right] = \left[ \begin{array}{c} 0 \\ 1 \end{array} \right]$

Regression:

$\begin{bmatrix} 0 & 1^T \\ 1 & \Omega + \gamma^{-1}\mathbb{I} \end{bmatrix} \left[ \begin{array}{c} b \\ \alpha \end{array} \right] = \left[ \begin{array}{c} 0 \\ Y \end{array} \right]$

with $Y$ the vector containing all targets yi, $\gamma$ a hyperparameter, and $\Omega_{k,l}$ a kernel function $K(\mathbf{x_k,x_l})$.

Once trained, results are predicted (in case of regression) by solving the following equation:

$y(\mathbf{x})=\sum_{k=1}^{N}{\alpha_k K(\mathbf{x_k,x}) + b}$

More details on these can be found in the book of Suykens, or (if you prefer a shorter read) this paper by Dilmen.

The above model is available through the Matlab library developed by the Suykens group, and has been translated to R, but no implementation in the python scikit-learn library is available, therefore we set out to create such an implementation following the scikit-learn API. Our choice to follow the scikit-learn API is twofold: (1) we want our new class to smoothly integrate with the functionalities of the scikit-learn library (I’m building a framework for automated machine learning on this library, hence all my models need to show the same behavior and functionality) and (2) we want to be lazy and implement as little as possible.

## 2. Creating a Simple Regressor Class.

### 2.1. Initialization

Designing this Class, we will make full use of OOP (Similar ideas as in my fortran tutorials), inheriting behavior from scikit-learn base classes. All estimators in scikit-learn are derived from the BaseEstimator Class. The use of this class requires you to define all parameters of your class as keyword arguments in the __init__ function of your class. In return, you get the get_params and set_params methods for free.

As our goal is to create a regressor class, the class also needs to inherit from the  RegressorMixin Class which provides access to the score method used by all scikit-learn regressors. With this, the initial implementation of our LS-SVM regressor class quickly takes shape:

class LSSVMRegression(BaseEstimator, RegressorMixin):
"""
An Least Squared Support Vector Machine (LS-SVM) regression class

Attributes:
- gamma : the hyper-parameter (float)
- kernel: the kernel used (string: rbf, poly, lin)
- kernel_: the actual kernel function
- x : the data on which the LSSVM is trained (call it support vectors)
- y : the targets for the training data
- coef_ : coefficents of the support vectors
- intercept_ : intercept term
"""

def __init__(self, gamma:float=1.0, kernel:str=None, c:float=1.0,
d:float=2, sigma:float=1.0):
self.gamma=gamma
self.c=c
self.d=d
self.sigma=sigma
if (kernel is None):
self.kernel='rbf'
else:
self.kernel=kernel

params=dict()
if (kernel=='poly'):
params['c']=c
params['d']=d
elif (kernel=='rbf'):
params['sigma']=sigma

self.kernel_=LSSVMRegression.__set_kernel(self.kernel,**params)

self.x=None
self.y=None
self.coef_=None
self.intercept_=None

All parameters have a default value in the __init__ method (and with a background in Fortran, I find it very useful to explicitly define the intended type of the parameters). Additionally, the same name is used for the attributes to which they are assigned. The kernel function is provided as a string (here we have 3 possible kernel functions: the linear (lin), the polynomial (poly), and the radial basis function (rbf) ) and linked to a function pointer via the command:

self.kernel_=LSSVMRegression.__set_kernel(self.kernel,**params)

The static private __set_kernel method returns a pointer to the correct kernel-function, which is later-on used during training and fitting.  The get_params, set_params, and score methods, we get for free so no implementation is needed, but you could override them if you wish. (Note that some tutorials recommend against overriding the get_params and set_params methods.)

### 2.2. Fitting and predicting

As our regressor class should be interchangeable with any regressor class available by scikit-learn, we look at some examples to see which method-names are being used for which purpose. Checking the LinearRegression model and the SVR model, we learn that the following methods are provided for both classes:

__init__ Initialize object of the class. Implemented above (ourselves)
get_params Get a dictionary of class parameters. Inherited from BaseEstimator
set_params Set the class parameters via a dictionary. Inherited from BaseEstimator
score Return the R2 value of the prediction. Inherited from RegressorMixin
fit Fit the model. to do
predict Predict using the fitted model. to do

Only the fit and predict methods are still needed to complete our LS-SVM regressor class. The implementation of the equations presented in the previous section can be done in a rather straight forward way using the numpy library.

import numpy as np

def fit(self,X:np.ndarray,y:np.ndarray):
self.x=X
self.y=y
Omega=self.kernel_(self.x,self.x)
Ones=np.array([[1]]*len(self.y))

A_dag = np.linalg.pinv(np.block([
[0, Ones.T ],
[Ones, Omega + self.gamma**-1 * np.identity(len(self.y))]
]))
B = np.concatenate((np.array([0]),self.y), axis=None)

solution = np.dot(A_dag, B)
self.intercept_ = solution[0]
self.coef_ = solution[1:]

def predict(self,X:np.ndarray)->np.ndarray:
Ker = self.kernel_(X,self.x)
Y=np.dot(self.coef_,Ker.T) +self.intercept_
return Y

Et voilà, all done. With this minimal amount of work, a new regression model is implemented and capable of interacting with the entire scikit-learn library.

## 3. Getting the API right: Running the Model using Scikit-learn Methods.

The LS-SVM model has at least 1 hyperparameter: the $\gamma$ factor and all hyperparameters present in the kernel function (0 for the linear, 2 for a polynomial, and 1 for the rbf kernel). To optimize the hyperparameters, the GridsearchCV Class of scikit-learn can be used, with our own class as estimator.

For the LS-SVM model, which is slightly more complex than the trivial examples found in most tutorials, you will encounter some unexpected behavior. Assume you are optimizing the hyperparameters of an LS-SVM with an rbf kernel: $\gamma$ and $\sigma$.

from sklearn.model_selection import GridSearchCV
...
parameters = {'kernel':('rbf'),
'gamma':[0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0],
'sigma':[0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]}
lssvm = LSSVMRegression()
clf = GridSearchCV(lssvm, parameters)
clf.fit(X, y)
...

When you plot the quality results as a function of $\gamma$, you’ll notice there is very little (or no) variation with regard to $\sigma$. Some deeper investigation shows that the instances of the LSSVMRegression model use different values of the $\gamma$ attribute, however, the $\sigma$ attribute does not change in the kernel function. This behavior is quite odd if you expect the GridsearchCV class to create a new class instance (or object) using the __init__ method for each grid point (a natural assumption within the context of parallelization). In contrast, the GridsearchCV class appears to be modifying the attributes of a set of instances via the set_params method, as can be found in the 2000+ page manual of scikit-learn, or here in the online manual:

Scikit-learn manual section of parameter initialization of classes

In programming languages like C/C++ or Fortran, some may consider this as bad practice as it entirely negates the use of your constructor and splits the initialization section. For now, we will consider this a feature of the Python scripting language. This also means that getting a static class function linked to the kernel_ attribute requires us to override the get_params method (initializing attributes in a fit function is just a bridge too far 😉 ).

def set_params(self, **parameters):
for parameter, value in parameters.items():
setattr(self, parameter, value)

params=dict()
if (self.kernel=='poly'):
params['c']=self.c
params['d']=self.d
elif (self.kernel=='rbf'):
params['sigma']=self.sigma
self.kernel_=LSSVMRegression.__set_kernel(self.kernel,**params)

return self

For consistency the get_params method is also overridden. The resulting class is now suitable for use in combination with the rest of the scikit-learn library.

## 4. The LS-SVM Regressor on Github

At the moment of witting no LS-SVM regressor class compatible with the scikit-learn library was available. There are some online references available to Python libraries which claim to have the LS-SVM model included, but these tend to be closed source.  So instead of trying to morph these to fit my framework, I decided to use this situation as an opportunity to learn some more on the implementation of an ML model and the integration of this model in the scikit-learn framework. The resulting model is extended further to deal with the intricacies of my own framework aimed at small datasets, which is beyond the scope of the current tutorial. Since I believe the LS-SVM regressor may be of interest to other users of the scikit-learn library, you can download it from my github-page:

## 5. References

• J.A.K. Suykens et al., “Least Squares Support Vector Machines“, World Scientific Pub. Co., Singapore, 2002 (ISBN 981-238-151-1)
• E. Dilmen and S. Beyhan, “A Novel Online LS-SVM Approach for Regression and Classification”, IFAC-PapersOnLine Volume 50(1), 8642-8647 (2017)
• D. Hnyk, “Creating your own estimator in scikit-learn“, webpage
• T. Book, “Building a custom model in scikit-learn“, webpage
• User guide: create your own scikit-learn estimator“, webpage

DISCLAIMER: Since Python codes depreciate as fast as they are written, links to the scikit-learn library documentation may be indicated as outdated by the time you read this tutorial. Check out the most recent version in that case. Normally, the changes should be sufficiently limited not to impact the conclusions drawn here. However, if you discover a code-breaking update, feel free to mention it here in the comments section.

## Parallel Python in classes…now you are in a pickle

In the past, I discussed how to create a python script which runs your calculations in parallel.  Using the multiprocessing library, you can circumvent the GIL and employing the async version of the multiprocessing functions, calculations are even performed in parallel. This works quite well, however, when using this within a python class you may run into some unexpected behaviour and errors due to the pickling performed by the multiprocessing library.

For example, if the doOneRun function is a class function defined as

class MyClass:
...
def doOneRun(self, id:int):
return id**3
...

and you perform some parallel calculation in another function of your class as

class MyClass:
...
def ParallelF(self, NRuns:int):
import multiprocessing as mp

nproc=10
pool=mp.Pool(processes=nprocs)
drones=[pool.apply_async(self.doOneRun, args=(nr,)) for nr in range(NRuns)]

for drone in drones:
Results.collectData(drone.get())
pool.close()
pool.join()

...

you may run into a runtime error complaining that a function totally unrelated to the parallel work (or even to the class itself) can not be pickled. 😯

So what is going on? In the above setup, you would expect the pool.apply_async function to take just a function pointer to the doOneRun function. However, as it is provided by a the call self.doOneRun, the pool-function grabs the entire class and everything it contains, and tries to pickle it to distribute it to all the processes.  In addition to the fact that such an approach is hugely inefficient, it has the side-effect that any part associated to your class needs to be pickleable, even if it is a class-function of a class used to generate an object which is just a property of the MyClass Class above.

So both for reasons of efficiency and to avoid such side-effects, it is best to make the doOneRun function independent of a class, and even placing it outside the class.

def doOneRun(id:int):
return id**3

class MyClass:
...
def ParallelF(self, NRuns:int):
import multiprocessing as mp

nproc=10
pool=mp.Pool(processes=nprocs)
drones=[pool.apply_async(doOneRun, args=nr) for nr in range(NRuns)]

for drone in drones:
Results.collectData(drone.get())
pool.close()
pool.join()

...

This way you avoid pickling the entire class, reducing initialization times of the processes and the  unnecessary communication-overhead between processes. As a bonus, you also reduce the risk of unexpected crashes unrelated to the calculation performed.

## Parallel Python?

As part of my machine learning research at AMIBM, I recently ran into the following challenge: “Is it possible to do parallel computation using python.” It sent me on a rather long and arduous journey, with the final answer being something like: “very reluctantly“.

Python was designed with one specific goal in mind; make it easy to implement small test programs to see if an idea is worth pursuing. This gave rise to a scripting language with a lot of flexibility, but also with significant limitations, most of which the “intended” user would never meet. However, as a consequence of its success, many are using it going far beyond this original scope (yours truly as well 🙂 ).

Python offers various libraries to parallelize your scripts…most of them wrappers adding minor additional functionality. However, digging down to the bottom one generally ends up at one of the following two libraries: the threading module and the multiprocessing module.

Of course, as with many things python, there is a huge amount of tutorials available with many of great quality.

Programmers experienced in a programming language such as C/C++, Pascal, or Fortran, may be familiar with the concept of multi-threading. With multi-threading, a CPU allows a program to distribute its work over multiple program-threads which can be performed in parallel by the different cores of the CPU (or while a core is idle, e.g., since a thread is waiting for data to be fetched).  One of the most famous API’s for writing multi-threaded applications is OpenMP. In the past I used it to parallelize my Hirshfeld-I implementation and the phonon-module of HIVE.

For Python, there is no implementation of the OpenMP API, instead there is the threading module. This provides access to the creation of multiple threads, each able to perform their own tasks while sharing data-objects. Unfortunately, python has also the Global Interpreter Lock, GIL for short, which allows only a single thread to access the interpreter at a time. This effectively reduces thread-based parallelization to a complex way of running a code in a serial way.

## import multiprocessing

In addition to the threading module, there is also the multiprocessing module. This module side-steps the GIL by creating multiple processes, each having its own interpreter. This however comes at a cost. Firstly, there is a significant computational cost starting the different processes. Secondly, objects are not shared between processes, so additional work is needed to collect and share data.

Using the “Pool” class, things are somewhat simplified, as can be seen in the code-fragment below.  With the pool class one creates a set of threads/processes available for your program. Then through the function apply_async function it is possible to run processes in parallel. (Note that you need to use the “async” version of the function, as otherwise you end up with running things serial …again)

 multiprocessing backbone
import multiprocessing as mp def doOneRun(id:int): #trivial function to run in parallel	return id**3   num_workers=10  #number of processesNRuns=1000      #number of runs of the function doOneRun pool=mp.Pool(processes=num_workers)   # create a pool of processesdrones=[pool.apply_async(doOneRun, args=nr) for nr in range(NRuns)] #and run things in parallel for drone in drones: #and collect the data	Results.collectData(drone.get()) #Results.collectData is a function you write to recombine the separate results into a single result and is not given here. pool.close() #close the pool...no new tasks can be run on any of the processespool.join()  #collapse all threads back into the main thread

## how many cores does my computer have?

If you are used to HPC applications, you always want to get as much out of your machine as possible. With regard to parallelization this often means making sure no CPU cycle is left unused. In the example above we manually selected the number of processes to spawn. However, would it not be nice if the program itself could just set this value to be equal to the number of physical cores accessible?

Python has a large number of functions claiming to do just that. A few of them are given below.

•  multiprocessing.cpu_count(): returns the number of logical cores it can find. So if you have a modern machine with hyper-threading technology, this will return a multiple of the number of physical cores (and you will be over-subscribing your CPU.
• os.cpu_count(): same as multiprocessing.cpu_count().
• psutil.cpu_count(logical=False): This implementation gives the same default behavior, however, the parameter logical allows for this function to return the correct number of cores in a single CPU. Indeed a single CPU. HPC architectures which contain multiples CPUs per node will again return an incorrect number, as the implementation makes use of a python “set”, and as such doesn’t increment for the same index core on a different CPU.

In conclusion, there seems to be no simple way to obtain the correct number of physical cores using python, and one is forced to provide this number manually. (If you do have knowledge of such a function which works in both windows and unix environments and both desktop and HPC architectures feel free to let me know in the comments.)

All in all, it is technically possible to run code in parallel using python, but you have to deal with a lot of python quirks such as GIL.

## New year’s resolution

A new year, a new beginning.

For most people this is a time of making promises, starting new habits or stopping old ones. In general, I forgo making such promises, as I know they turn out idle in a mere few weeks without external stimulus or any real driving force.

In spite of this, I do have a new years resolution for this year: I am going to study machine learning and use it for any suitable application I can get my hands on (which will mainly be materials science, but one never knows).  I already have a few projects in mind, which should help me stay focused and on track. With some luck, you will be reading about them here on this blog. With some more luck, they may even end up being part of an actual scientific publication.

But first things first, learn the basics (beyond hear-say messages of how excellent and world improving AI is/will be). What are the different types of machine learning available, is it all black box or do you actually have some control over things. Is it a kind of magic? What’s up with all these frameworks (isn’t there anyone left who can program?), and why the devil seem they all to be written in a script langue (python) instead of a proper programming language? A lot of questions I hope to see answered. A lot of things to learn. Lets start by building some foundations…the old fashioned way: By studying using a book, with real paper pages!