# Uncovering equations

When we train a machine learning (ML) model on data, we are essentially learning a rule that maps X to y, where X is the input data and y is the labels (classes in classification, continuous values in regression). Numerous ML techniques exist, spanning from simple decision trees all the way to complex and powerful artificial neural networks (ANNs). Despite their success, one thing most of these techniques have in common is their "black-box" nature: once an ML algorithm is trained, it can be difficult to understand why it gives a particular response to a set of inputs [1]. In other words, we do not necessarily end up with a formula that defines the mapping. This is a disadvantage when such models are used in mission-critical applications, such as predictive maintenance: if the model makes an erroneous decision, discovering the behavior that caused it can be a daunting task.

Thankfully, there are alternatives that can shed some light. One of these goes by the name of symbolic regression (SR). SR is a form of regression analysis that searches the space of mathematical expressions (formulas) to find the model that best fits a given dataset, both in terms of accuracy and simplicity (complexity). What we get in the end is therefore not a learned black-box model, but an actual, interpretable algebraic expression that describes the relationship between input and output. SR differs from traditional regression techniques in that it does not rely on a specific, *a priori* determined model structure (e.g., linear). The only assumption SR makes is that the response surface can be described by an algebraic expression (formula) [4]; in particular, the expression need not be linear. Instead of the traditional approach, where the model structure is fixed and the remaining free parameters are optimized, SR reformulates the regression problem as a search for the optimal model structure. Once a model structure of sufficient quality is determined, traditional techniques can be employed to find the optimal model coefficients [4].
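To make the accuracy-versus-simplicity trade-off concrete, here is a minimal, self-contained sketch (not the method used in CIMPLO): each candidate formula is scored by its mean squared error plus a complexity penalty, and the best-scoring one wins. The hand-picked candidate pool, the toy dataset, and the penalty weight `alpha` are all hypothetical; real SR generates and searches the candidate space automatically.

```python
import math

# Toy dataset generated from a hidden ground truth: y = 2*x + x**2
xs = [i / 10 for i in range(-20, 21)]
ys = [2 * x + x**2 for x in xs]

# A tiny, hand-picked pool of candidate expressions (string form + callable).
# Illustrative only: real SR explores this space automatically.
candidates = [
    ("x", lambda x: x),
    ("2*x", lambda x: 2 * x),
    ("x**2", lambda x: x**2),
    ("2*x + x**2", lambda x: 2 * x + x**2),
    ("sin(x)", lambda x: math.sin(x)),
]

def score(expr, fn, alpha=0.01):
    """Mean squared error plus a complexity penalty (here: expression length)."""
    mse = sum((fn(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
    return mse + alpha * len(expr)

best = min(candidates, key=lambda c: score(*c))
print(best[0])  # -> "2*x + x**2": the formula balancing accuracy and simplicity
```

With a larger `alpha`, a simpler but less accurate formula would be preferred; that single knob encodes the accuracy/complexity balance the text describes.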

SR builds on evolutionary algorithms (EA), a class of algorithms inspired by the Darwinian theory of evolution: survival of the fittest, or in this case, survival of the fittest solutions. There are various ways of performing SR in the context of EA, among them genetic programming (GP), grammatical evolution (GE), and analytic programming (AP). Here we focus on SR in the framework of GP. Genetic programming [2], [3], developed by J. R. Koza, is a method for the automatic estimation of "programs" (mathematical formulas, computer programs, logical expressions, etc.) by means of genetic algorithms (GA), which themselves belong to the class of EA. Whereas in GA solutions (individuals) are represented as bit-string chromosomes, in GP programs are represented as syntax trees. Based on the GA principles of crossover, mutation, and selection, the programs are evolved to best fit the data.
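The tree representation and the three GA operators can be sketched in a few dozen lines. The following is a minimal, self-contained illustration (not the CIMPLO implementation, and far simpler than Koza-style GP systems): trees are nested lists whose internal nodes are operators and whose leaves are terminals, and the population is evolved by crossover (swapping subtrees), mutation (regrowing a subtree), and selection (keeping the fittest). The operator set, terminals, population size, and the toy target y = x² + x are all assumptions for the example.

```python
import copy
import operator
import random

random.seed(0)

OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}
TERMINALS = ["x", 1.0, 2.0]

def random_tree(max_depth=3):
    """Grow a random syntax tree: internal nodes are operators, leaves are terminals."""
    if max_depth == 0 or random.random() < 0.3:
        return random.choice(TERMINALS)
    return [random.choice(list(OPS)), random_tree(max_depth - 1), random_tree(max_depth - 1)]

def evaluate(tree, x):
    if not isinstance(tree, list):
        return x if tree == "x" else tree
    return OPS[tree[0]](evaluate(tree[1], x), evaluate(tree[2], x))

def paths(tree, path=()):
    """Paths to every subtree; used to pick crossover and mutation points."""
    yield path
    if isinstance(tree, list):
        yield from paths(tree[1], path + (1,))
        yield from paths(tree[2], path + (2,))

def get_subtree(tree, path):
    for i in path:
        tree = tree[i]
    return tree

def replace_subtree(tree, path, sub):
    if not path:
        return sub
    new = copy.deepcopy(tree)
    node = new
    for i in path[:-1]:
        node = node[i]
    node[path[-1]] = sub
    return new

def crossover(a, b):
    """Replace a random subtree of `a` with a copy of a random subtree of `b`."""
    pa = random.choice(list(paths(a)))
    pb = random.choice(list(paths(b)))
    return replace_subtree(a, pa, copy.deepcopy(get_subtree(b, pb)))

def mutate(a):
    """Replace a random subtree of `a` with a freshly grown one."""
    return replace_subtree(a, random.choice(list(paths(a))), random_tree(2))

def depth(tree):
    return 0 if not isinstance(tree, list) else 1 + max(depth(tree[1]), depth(tree[2]))

# Toy target relationship the GP should rediscover: y = x**2 + x
xs = [i / 4 for i in range(-8, 9)]

def fitness(tree):
    """Sum of squared errors; numeric overflow and NaN count as infinitely bad."""
    try:
        err = sum((evaluate(tree, x) - (x * x + x)) ** 2 for x in xs)
    except OverflowError:
        return float("inf")
    return err if err == err else float("inf")  # map NaN to inf

population = [random_tree() for _ in range(100)]
initial_best = min(fitness(t) for t in population)

for generation in range(20):
    population.sort(key=fitness)
    survivors = population[:25]  # selection: keep the fittest quarter
    children = []
    while len(children) < 75:
        child = mutate(crossover(random.choice(survivors), random.choice(survivors)))
        if depth(child) <= 10:   # cap tree depth to limit bloat
            children.append(child)
    population = survivors + children

final_best = min(fitness(t) for t in population)
print(initial_best, "->", final_best)
```

Because the fittest individuals always survive into the next generation (elitism), the best fitness can only improve or stay the same. Production GP systems add much more, for example tournament selection, parsimony pressure, and constant optimization.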

In CIMPLO, in the context of predictive maintenance, we performed SR to estimate the exhaust gas temperature (EGT) of turbofan engines from continuous engine operational data (CEOD). These are data recorded in flight, typically once per second. After pre-processing the data, we want to identify the algebraic expression that relates EGT to the other parameters measured in flight. Figure 1 shows the equation returned by SR, and Figure 2 shows the actual versus the estimated (via the equation) data. Finally, Table 1 shows the performance metrics of the model on the training set, and Table 2 the performance metrics on the test set.

Evidently, this method returns an interpretable model that can be used in numerous ways. For example, one can use this model as a proxy to estimate the EGT if other parameters (not included in the model) are malfunctioning. Moreover, if SR is trained on healthy engines only, one can derive a "healthy" model for the EGT, which can serve as a ground-truth model against which to check for engine degradation. Of course, this list is by no means exhaustive; there are certainly various ways such an approach can be valuable in industrial applications.
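As a sketch of the degradation-check idea: suppose SR trained on healthy engines returned some healthy-EGT formula; degradation can then be flagged whenever the measured EGT deviates from the model's prediction by more than a tolerance. The formula, its inputs, and the 15 °C threshold below are entirely hypothetical, chosen only to illustrate the residual check.

```python
# Hypothetical "healthy" model returned by SR (illustrative formula,
# NOT the actual CIMPLO result): EGT ~ 0.8*x1 + 0.1*x2**2
def egt_healthy(x1, x2):
    return 0.8 * x1 + 0.1 * x2 ** 2

def degradation_flags(records, threshold=15.0):
    """Flag records whose measured EGT deviates from the healthy-model
    prediction by more than `threshold` degrees Celsius."""
    return [abs(measured - egt_healthy(x1, x2)) > threshold
            for x1, x2, measured in records]

records = [
    (500.0, 30.0, 490.0),   # matches the healthy prediction -> healthy
    (500.0, 30.0, 540.0),   # large positive residual -> possible degradation
]
print(degradation_flags(records))  # -> [False, True]
```

In practice the threshold would be calibrated from the residual distribution on held-out healthy data, and persistent (rather than one-off) exceedances would trigger an alert.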

**Figure 1:** Equation returned from SR describing the relationship between EGT (Y_{1}) and other CEOD parameters (here shown as X_{i})

**Figure 2:** Actual vs predicted on test set. Note the data are scaled on the y-axis. The x-axis represents the data point numbers.

**Table 1:** Performance metrics on the training set. RMSE is measured in degrees Celsius.

**Table 2:** Performance metrics on the testing set. RMSE is measured in degrees Celsius.

**References:**

[1]: Mittelstadt, Brent Daniel; Allo, Patrick; Taddeo, Mariarosaria; Wachter, Sandra; and Floridi, Luciano (2016, in press). The Ethics of Algorithms: Mapping the Debate. Big Data & Society.

[2]: Koza, J. R., Genetic Programming, MIT Press, ISBN 0-262-11189-6, 1998

[3]: Koza, J. R., Bennett, F. H., Andre, D., and Keane, M., Genetic Programming III, Morgan Kaufmann, ISBN 1-55860-543-6, 1999

[4]: Wouter Minnebo; Sean Stijven (2011). “Chapter 4: Symbolic Regression” (PDF). Empowering Knowledge Computing with Variable Selection (M.Sc. thesis). University of Antwerp.