pyDarwin machine learning algorithms application and comparison in nonlinear mixed-effect model selection and optimization

The basic concept of explicitly defining the search space for machine learning model selection, or, for defining it implicitly for traditional model selection has been described previously by Bies [1]. As described above, a multi-dimensional discrete search space is defined, where each dimension represents a set of mutually exclusive model features. In general, these values are strictly categorical (i.e., not ordered categorical). While the options in each dimension will be coded as integers (and therefore ordered categorical), in the algorithms, they are, in all cases, treated as strictly categorical. In the case of GA, the values are recoded as a bit string, reducing them to strictly categorical (values of 0 or 1). Other algorithms have the option to treat the values as strictly categorical and this option is used in all cases.

The options in each dimension are then concatenated into a string of integers. Different algorithms may require different representations of this integer string, such as conversion to a bit string (for GA) or to a “minimal binary” (a binary code without redundancy) string for the one- and two-bit downhill search.

Algorithms

The algorithms examined include:

GA – Genetic algorithm attempts to reproduce the mathematics of evolution and survival of the fittest. The algorithm begins by defining a random population (randomly selected bit strings), typically 50–100 models. These bitstrings are translated into NONMEM control files. The resulting NONMEM output is used to calculate the fitness. Models are then randomly selected from the population as candidate “parents”, 2 pairs of 2 models are selected. Each pair undergoes “tournament selection” wherein the model with the better fitness of the pair “wins” and is selected as a parent for the subsequent generation. The two bitstrings of the two parents then undergo crossover and mutation to generate new options. Mutations serve to generate features that have not yet appeared in the population. The Python package DEAP (https://deap.readthedocs.io/en/master/) is used to implement the GA method.

GP – Gaussian process is implemented with the scikit-learn python package (https://scikit-learn.org/stable/modules/gaussian_process.html). Briefly GP represents the fitness as a generalization of a Gaussian probability distribution. The distribution begins with an uninformative prior. The initial sample is again essentially random, as no information is available to inform it. With the results of that first iteration, the probability distribution is updated. Samples are taken from that updated distribution selected to inform the parameters of the distribution. The models from those samples are then run, and the distribution is updated again. An excellent review of GP can be found here (https://gaussianprocess.org/gpml/chapters/RW.pdf).

RF and GBRT – Random forest and gradient boosted random tree are implemented using the scikit-learn package (https://scikit-learn.org/stable/modules/ensemble.html). Random forest implements a set of decision trees, classifying the search space based on the fitness values. The multiple trees from random samples (samples, in this case, being the set of models and corresponding fitness) are used to prevent overfitting. Gradient boosted random tree uses a different approach to prevent overfitting. In RF, each tree is independent and created from a random sample of the data. In GBRT, the tree is built additively, with each tree taking the previous, and “boosting” the gradient with respect to fitness by adding some selected low performance models. The addition of low performance models reduces the chances of overfitting as correct classification of poor models is also required.

PSO – Particle swarm optimization is another attempt to reproduce a natural process, in this case of a flock of birds or a school of fish. The “particles” represent candidate models. As for other algorithms, an initial random population is first created. The movement in the search space is then defined as:

Equation 1. Equation for updating velocity in PSO

$$_=_+ _\centerdot _\centerdot (pbest-_) + _\centerdot _\centerdot (gbest-_)$$

(1)

where Vt+1 is the velocity at time t + 1, Vt is the velocity at time t, \(_\) and \(_\) are random numbers between 0 and 1, and w is inertia weight (a particle continuing in the same direction), cognitive constant (\(_\)) and social constant (\(_\)) are PSO algorithm parameters, \(pbes_\) is the best position that particle I has explored, xt is the current particle position, and gbest is the best position explored by the entire (global) particle swam. PSO is implemented with the package pySwarms (https://pyswarms.readthedocs.io/en/latest/).

pyDarwin package

The python package pyDarwin was used for the model selection. Details of the implementation can be found on a public repository on Git Hub (https://github.com/certara/pyDarwin). Details of generation of syntactically correct NONMEM control files from the template and tokens file have been described previously by Bies [1] as well as the pyDarwin documentation (https://certara.github.io/pyDarwin/html/index.html). Briefly, a control file template is created by the user. The control file template is a text file, similar to a NONMEM control file, but with “tokens” that may be replaced to specify features from the search space. Next, a tokens file is created. Both the template and tokens file are identical for all algorithms examined in this analysis. The tokens file is a JSON (Java script object notation) formatted file where each group of sets of key-text pairs represents a dimension of the search space. The sets of key-text pairs correspond to the options in that dimension. The selected element of the key-text pair is substituted into the control file template, creating a syntactically correct NONMEM control file. The control file is then executed with the NMFE.bat command (where is the version number of NONMEM). After execution, the user-defined metric of “model goodness” is calculated. The “model goodness” metric is, for simplicity, called “fitness” for all algorithms, the term used in GA. Other algorithms conventionally use different terms, but we will use “fitness” throughout. Initially, a “population” (again a term from GA) is created with perhaps 50–100 models. When execution of all of these models is complete, the fitness is calculated and the resulting values are used to select the next generation of models. The term “generation” is used throughout to avoid confusion with the term “iteration” used in NONMEM.

Local downhill search

pyDarwin includes an option for alternating between the ML algorithm and the local downhill search. Including the local downhill search in ML algorithms is from the recognition that while ML algorithms are able to robustly find the area containing global minima, they may struggle in locating the optimal solution within that region. Conversely, local downhill search is expected to find the local minima efficiently. Therefore, a combination of global search and local search strategies will enhance both robustness and efficiency in finding the true optimal solution [10].

For the downhill search, first a “one bit” search is performed. One or more models from the current ML results are used to start the local downhill search. If multiple models are used as the starting point for the one-bit local downhill search, a minimal difference between the models (Hamming distance [7] or niche radius) is defined, so that multiple downhill searches are not started from nearly identical models which would likely lead to the same (potentially local) minimum. Rather, the multiple models are in different “niches”. Each bit in the binary representation of the models is “flipped”, 0 to 1 or 1 to 0 and the resulting models are executed. If any model in the generated set of models is better (has a lower fitness), the process is repeated with the best model. This iteration is continued until no further improvement is seen.

If requested by the user, the one-bit search is followed by a two-bit search. For the two-bit search, the single best model from the one-bit search is used as the base model. All possible combinations of two-bit changes are generated and the resulting models are executed. For illustration, the one and two bit changes for a [0,1;0,0] genome are given in Table 1. Note that the 1 bit changes are on the diagonal and the 2 bit on the off diagonal elements. The first bit change is in columns, and the 2nd in rows. That is, the 2 bit change for the 1st and 2nd bit is given in row 2, column 1 ([1,0;0,0]).

Table 1 Table of 1 and 2 bit search genomes for a [0,1;0,0] genome

An illustration of the two-bit search is given in Fig. 2. In this figure, assume that the dark gray genome (0,1 for dimension 1 and 0,0 for dimension 2, i.e., [0,1;0,0]) is the current “best” model after the ML step. The one bit changes from this are shown with horizontal hatching ([0,0;0,0],[0,1;0,1], and [0,1;1,0]. (N.B, [1,0;0,0] is a two bit change despite being adjacent in the figure, with both the 1st and 2nd bit of dimension 1 changed). All of these are worse (higher fitness) than the current best model [0,1;0,0]. Essentially the current “best” [0,1;0,0] is surrounded by a 1 bit wide ridge of worse models. This is a local minimum. Two bit changes (off diagonals Table 1) are shown with vertical hatching. Of these, [1,0;0,1] is has a lower fitness. This model would not be evaluated in a one bit search. Experience suggests this is common in model search spaces. This sort of interaction between model features was first described by Wade et. al [5].

Fig. 2figure 2

The 2 bit search can be computationally expensive, as the number of models for each iteration is:

Equation 2. Number of models in each two-bit search step

$$Number\;of\;models= \frac$$

(2)

where n is the number of bits. In principle the search could be extended to 3 bit changes, but the number of models to be evaluated increases rapidly. These local downhill search steps can be alternated with the ML algorithm at an interval determined by the user. Typically, the downhill period is set as 5, which means after 5 generations of ML search, the local downhill search will begin to run.

留言 (0)

沒有登入
gif