Data defines the model by dint of genetic programming, producing the best decile table.

Statistical and Machine-Learning Data Mining:
Techniques for Better Predictive Modeling and Analysis of Big Data, Third Edition

June 7, 2017 by Chapman and Hall/CRC/Taylor & Francis
Reference - 662 Pages - 200 B/W Illustrations
ISBN 9781498797603 - CAT# K30454

• One of only two books on big data on Intel's prestigious recommended reading list.
• Provides step-by-step solutions to common problems facing data scientists, modelers, and marketers; other books typically provide only outlined solutions.
• Illustrations involve real problems, real data, and better solutions.
• Uniquely introduces two new machine-learning methods specifically tailored to database assessment of optimal model performance.
• Includes many new methodologies for unsolved, real problems along with corresponding SAS programs, easily converted into R and Python scripts.

The third edition of a bestseller, Statistical and Machine-Learning Data Mining: Techniques for Better Predictive Modeling and Analysis of Big Data (SM-DM) is still the only book, to date, to distinguish between statistical data mining and machine-learning data mining. It is a compilation of new and creative data mining techniques that address the scaling-up of the framework of classical and modern statistical methodology for predictive modeling and analysis of big data. SM-DM provides proper solutions to common problems facing the newly minted data scientist in the data mining discipline. Its presentation focuses on the needs of data scientists (commonly known as statisticians, data miners, and data analysts), delivering practical yet powerful, simple yet insightful quantitative techniques, most of which use the "old" statistical methodologies improved upon by the new machine-learning influence.


Preface to Third Edition
Predictive analytics of big data has maintained a steady presence in the four years since the publication of the second edition. My decision to write this third edition is not a result of the success (units sold) of the second edition but is due to the abundant positive feedback (personal correspondence from the readership) I have received. And, importantly, I have the need to share my work on problems that do not have widely accepted, reliable, or known solutions. As in the previous editions, John Tukey’s tenets necessary to advance statistics (flexibility, practicality, innovation, and universality) are the touchstones of each chapter’s new analytic and modeling methodology.

My main objectives in preparing the third edition are to:
1. Extend the content of the core material by including strategies and methods for problems that I have observed to be at the forefront of statistics, based on my review of predictive analytics conference proceedings and statistical modeling workshop outlines.
2. Re-edit current chapters for improved writing and tighter endings.
3. Provide the statistical subroutines used in the proposed methods of analysis and modeling. I use Base SAS© and SAS/STAT. The subroutines are also available for downloading from my website. The code is easy to convert for users who prefer other languages.

I have added 13 new chapters (for a total of 44 chapters, with over 40% new material), inserted between the chapters of the second edition to give the material the best flow and continuity. The titles of the new chapters are:

Chapter 2, Science Dealing with Data: Statistics and Data Science.
Chapter 8, Market Share Estimation: Data Mining for an Exceptional Case.
Chapter 11, Predicting Share of Wallet without Survey Data.
Chapter 19, Market Segmentation Based on Time-Series Data Using Latent Class Analysis.
Chapter 20, Market Segmentation: An Easy Way to Understand the Segments.
Chapter 21, The Statistical Regression Model: An Easy Way to Understand the Model.
Chapter 23, Model Building with Big Complete and Incomplete Data.
Chapter 24, Art, Science, Numbers, and Poetry, is a high-order blend of artwork, science,
numbers, and poetry, all inspired by the Egyptian pyramids, da Vinci, and Einstein. Love
it or hate it, this chapter makes you think.
Chapter 27, Decile Analysis: Perspective and Performance.
Chapter 28, Net T-C Lift Model: Assessing the Net Effects of Test and Control Campaigns,
extends the practice of assessing response models to the proper use of a control group
by offering a simple, straightforward, reliable model that is easy to implement and understand.
Chapter 34, Opening the Dataset: A Twelve-Step Program for Dataholics, has valuable content
for statisticians as they embark on the first step of any journey with data. Written in prose,
the chapter offers light reading on the expected steps to take when cracking open the
dataset. Enjoy.
Chapter 43, Text Mining: Primer, Illustration, and TXTDM Software, has three objectives:
first, to serve as a primer, readable and brief though detailed, on what text mining encompasses
and how to conduct basic text mining; second, to illustrate text mining with a small yet
interesting body of text; and third, to make text mining available to interested readers.
Chapter 44, Some of My Favorite Statistical Subroutines, includes subroutines referenced throughout
the book and generic subroutines for some second-edition chapters for which I no longer have the data.


For more information about this article, call Bruce Ratner at 516.791.3544 or 1 800 DM STAT-1; or e-mail at