Data defines the model by dint of genetic programming, producing the best decile table.

A Genetic Programming Model: 
Data-defined, Data Mining, Variable Selection, and Decile Optimization
Bruce Ratner, Ph.D.

The purpose of this article is to demonstrate the predictive power and features of a new genetic programming (GP) model, the GenIQ Model©, an alternative to the statistical ordinary least squares and logistic regression models. The GenIQ Model, which is based on the assumption-free, nonparametric GP paradigm inspired by Darwin’s Principle of Survival of the Fittest, offers theoretical and usable advantages over the two statistical regression models, which are long-standing and widely used. [1, 2, 3] GenIQ automatically evolves a model by letting the data define it (no programming is required, despite the suggestive term programming in GP): The GenIQ Model is data-defined. [4, 5] As well, GenIQ has four usable features, which are unique in the way they automatically and simultaneously begin and carry through to completion: 1) data mining, 2) variable selection, and 3) specification of the model equation itself, such that 4) the decile table, a measure of model performance, is optimized. [6, 7] The open-worked GenIQ Model and its wordbook are both generally regarded as not demanding on newcomers to GP modeling.
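For readers unfamiliar with the decile table, the following is a minimal sketch of how one is commonly built, not GenIQ's own procedure. The function name `decile_table`, the column names, and the toy data are assumptions made for illustration: subjects are ranked by model score, split into ten equal groups, and the cumulative response rate of the top groups is compared with the overall response rate (cumulative lift, where 100 means no better than random).

```python
def decile_table(scores, responses):
    """Rank subjects by score (highest first), cut into 10 groups, and
    report responses per decile together with cumulative lift.

    scores    -- model scores, one per subject
    responses -- 0/1 outcomes, one per subject
    """
    n = len(scores)
    total_responses = sum(responses)
    if n < 10 or total_responses == 0:
        raise ValueError("need at least 10 subjects and at least 1 response")

    # Sort (score, response) pairs so the best-scored subjects come first.
    ranked = sorted(zip(scores, responses), key=lambda pair: pair[0], reverse=True)
    overall_rate = total_responses / n

    rows = []
    cum_n = 0
    cum_responses = 0
    for d in range(10):
        group = ranked[d * n // 10:(d + 1) * n // 10]
        group_responses = sum(r for _, r in group)
        cum_n += len(group)
        cum_responses += group_responses
        # Cumulative lift: response rate so far relative to the overall rate.
        cum_lift = 100 * (cum_responses / cum_n) / overall_rate
        rows.append({
            "decile": d + 1,
            "n": len(group),
            "responses": group_responses,
            "cum_lift": round(cum_lift),
        })
    return rows


if __name__ == "__main__":
    # Toy data (invented): the 10 highest-scored of 20 subjects all respond.
    toy_scores = [i / 20 for i in range(20)]
    toy_responses = [1 if i >= 10 else 0 for i in range(20)]
    for row in decile_table(toy_scores, toy_responses):
        print(row)
```

A model that concentrates more responses in the top deciles yields larger cumulative lifts there, which is the sense in which a decile table can be better or worse.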

I use a small, real study with three variables (body fat percentage, age, and gender) to make GenIQ modeling tractable and attractive for the everyday modeler, with the hope that GenIQ enjoys widespread and constant use. To address the study’s first objective, which requires a classification model, I build three classification models: a logistic regression model (LRM), a GenIQ Model, and a hybrid statistics-GenIQ Model. Each model predicts the likelihood that a subject is male. I illustrate GenIQ’s theoretical advantage via the GP paradigm (GPP) over the antithetical statistical paradigm: fitting data to a pre-specified model, which is defined by the rigorous methodology of significance testing, recently the focus of an ideological debate over whether to abandon it. [8] Likewise, I illustrate that GenIQ variable selection provides more information and generates a better subset of predictor variables than statistical variable selection methods, which are viewed by many statisticians as suboptimal. [9] As for data mining, the two statistical regression models have no counterpart to the GenIQ data-mining capability. I also explain GenIQ output, which consists of two parts that account for the limited use of GenIQ: the equation, which is actually a computer program, and a visual display of the program, often likened to a mathematical tree with Picasso-like abstractness. Admittedly, GenIQ is distinguished by its singular weakness: the difficulty of interpreting the GP-based model.
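To make the idea of a model that is "actually a computer program" with a tree display more concrete, here is a small sketch of how GP models are typically represented in general. It is not GenIQ output: the tree, its constants, the operator set, and the names `evaluate` and `to_infix` are all invented for illustration; only the variable names AGE and PERCENT_FAT come from the study.

```python
# A node is either a tuple (operator, left, right), a variable name (str),
# or a numeric constant.
OPS = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
    # "Protected" division, a common GP convention to avoid dividing by zero.
    "div": lambda a, b: a / b if b != 0 else 1.0,
}
SYMBOLS = {"add": "+", "sub": "-", "mul": "*", "div": "/"}


def evaluate(node, row):
    """Compute the tree's score for one subject (row maps names to values)."""
    if isinstance(node, tuple):
        op, left, right = node
        return OPS[op](evaluate(left, row), evaluate(right, row))
    if isinstance(node, str):
        return row[node]
    return float(node)


def to_infix(node):
    """Render the tree as a conventional equation string."""
    if isinstance(node, tuple):
        op, left, right = node
        return f"({to_infix(left)} {SYMBOLS[op]} {to_infix(right)})"
    return str(node)


# Hypothetical tree, made up for this sketch (not an evolved GenIQ model).
tree = ("sub", ("mul", "PERCENT_FAT", 0.5), ("div", "AGE", 10.0))

if __name__ == "__main__":
    subject = {"AGE": 40, "PERCENT_FAT": 20}
    print(to_infix(tree))
    print(evaluate(tree, subject))
```

Even this tiny tree hints at the interpretability issue: unlike a regression coefficient, no single node states how much the score changes per unit of a predictor.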

There is a second study objective that I address with GenIQ in an unparalleled way: How are AGE and PERCENT_FAT related? This second objective points to nine extra-GenIQ applications. [10] I arbitrarily choose an LRM scenario for presenting GenIQ in an orderly, detailed way, but all that is presented and implied holds true for an ordinary least-squares (OLS) regression model scenario.

Presenting the GenIQ Model as a viable alternative to LRM or OLS, I in effect put forward a trinity of regression paradigms for modelers to consider:

1) The GPP/GenIQ with an explicit fitness function for decile optimization; four features with unique execution; and GenIQ’s ungainly interpretability.

2) The Statistics paradigm/LRM/OLS with an unknowingly implied decile optimization achieved with LRM/OLS fitness functions (i.e., joint probability likelihood function, and mean squared error, respectively) serving as surrogates for explicit fitness functions for decile optimization; and the salient feature of model interpretability that is made possible by the regression coefficients.

3) The Hybrid Statistics-GP paradigm, which integrates the best characteristics of the two paradigms, yields a utile alternative to LRM/OLS or the GenIQ Model. In the hybrid paradigm, the modeler fits the data to LRM/OLS with the modeler's preferred variable selection method to determine the best subset among the original and GenIQ genetically data-mined variables. Of primary import, the hybrid LRM/OLS-GenIQ Model includes the regression coefficients, which provide the necessary comfort level of model interpretability for model acceptance.
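The two surrogate fitness functions named in item 2 can be written out in a few lines. This is a minimal sketch under standard textbook definitions, with function names chosen for illustration: logistic regression is fit by maximizing the likelihood of the observed 0/1 outcomes (computed here as a log-likelihood, which has the same maximizer), and OLS is fit by minimizing mean squared error. Neither function refers to deciles, which is why any decile optimization they achieve is only implied.

```python
import math


def log_likelihood(y_true, p_pred):
    """Log of the joint probability of the observed 0/1 outcomes, given
    predicted probabilities strictly between 0 and 1 (LRM maximizes this)."""
    return sum(
        math.log(p) if y == 1 else math.log(1 - p)
        for y, p in zip(y_true, p_pred)
    )


def mean_squared_error(y_true, y_pred):
    """Average squared gap between observed and predicted values
    (OLS minimizes this)."""
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)) / len(y_true)


if __name__ == "__main__":
    # Toy values, invented for illustration.
    print(log_likelihood([1, 0, 1], [0.8, 0.3, 0.6]))
    print(mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```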

For more information about this article, call Bruce Ratner at 516.791.3544 or 1 800 DM STAT-1; or e-mail at