Deep Feature Synthesis, the algorithm that will automate machine learning

Development / 26 January 2016

BBVA API Market

Years ago, the obstacle to extracting knowledge about the present or the future (predictions) with data science was the lack of systems for storing and processing large amounts of information. That is no longer the case. Efforts are now focused on how to analyze the data in order to extract real value. Deep Feature Synthesis helps in this task: it is an algorithm that automates machine learning.

Machine learning is the set of processes whereby an algorithm makes predictions from data, and the result of each projection feeds back into the machine's own learning to improve future predictions: machines learning from their mistakes and successes. It is a specific branch of artificial intelligence applied in fields as diverse as banking (fraud detection), healthcare (hospital management) and retail (optimized pricing).

The algorithm's two creators, James Max Kanter and Kalyan Veeramachaneni, are prominent members of MIT's Computer Science and Artificial Intelligence Laboratory in Cambridge. They presented the project in a paper entitled 'Deep Feature Synthesis: Towards Automating Data Science Endeavors' (PDF), in which they summarize the characteristics of their creation.

Deep Feature Synthesis does exactly what its name announces: it is an algorithm capable of automatically creating features from sets of relational data, synthesizing the feature-engineering stage of machine learning. The algorithm applies mathematical functions to the source datasets and transforms them into new groups of deeper features.

In this evolution of the source data, one can begin with variables such as gender or age and, at the end of the process applied by Deep Feature Synthesis, end up with features that make deeper calculations possible, such as percentages.
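
To make the idea concrete, here is a minimal Python sketch of that kind of feature synthesis over a hypothetical customers/orders relationship. The data and field names are invented for illustration; this is not the authors' implementation.

```python
# Minimal sketch of the Deep Feature Synthesis idea: starting from raw
# relational data (customers and their orders), apply aggregation
# functions across the relationship to derive new, "deeper" features.
# Table and column names are hypothetical.

customers = [
    {"customer_id": 1, "gender": "F", "age": 34},
    {"customer_id": 2, "gender": "M", "age": 51},
]

orders = [
    {"customer_id": 1, "total": 20.0},
    {"customer_id": 1, "total": 35.0},
    {"customer_id": 2, "total": 10.0},
]

def synthesize_features(customer):
    """Derive new features by aggregating a customer's orders."""
    own = [o["total"] for o in orders
           if o["customer_id"] == customer["customer_id"]]
    spent = sum(own)
    overall = sum(o["total"] for o in orders)
    return {
        **customer,
        "order_count": len(own),                              # COUNT()
        "avg_order_total": spent / len(own) if own else 0.0,  # AVG()
        "pct_of_all_sales": 100.0 * spent / overall,          # a deeper, derived percentage
    }

for c in customers:
    print(synthesize_features(c))
```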

Deep Feature Synthesis and the Gaussian copula

According to its creators, this automated machine learning process is refined thanks to a probabilistic model based on the Gaussian copula. Many of the stages in a machine learning pipeline have parameters that require tuning to reach an appropriate result; the less tuning is done, the less predictive the model will be.

Small variations in those parameters can result in chaos: the enormous number of parameter combinations turns any minimal deviation at the beginning into a huge error at the end, and in economic predictive models that can mean billions of dollars. The Gaussian copula lets the algorithm's creators model the relationship between parameter choices and the performance of the whole model, and from there decide which parameters best optimize the result.
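
Kanter and Veeramachaneni describe a Gaussian copula process for this autotuning; as a rough stand-in, the sketch below uses a plain Gaussian process from scikit-learn to model the parameter-performance relationship and pick a promising next parameter. All numbers are toy values.

```python
# Sketch of surrogate-based parameter tuning: fit a Gaussian process to
# (parameter, score) pairs already observed, then query the surrogate
# for many candidate parameters instead of retraining the real model.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

# Parameter values already tried, and the scores they achieved.
tried = np.array([[2], [5], [10], [20]])     # e.g. tree depth
scores = np.array([0.71, 0.78, 0.80, 0.74])  # e.g. validation accuracy

# Fit the surrogate: parameters -> expected performance.
surrogate = GaussianProcessRegressor().fit(tried, scores)

# Evaluate candidate parameters cheaply through the surrogate.
candidates = np.arange(2, 31).reshape(-1, 1)
mean, std = surrogate.predict(candidates, return_std=True)

best = candidates[np.argmax(mean)]
print(f"next depth to try: {best[0]} (predicted score {mean.max():.3f})")
```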

The Gaussian copula is also the statistical model that was supposed to help avoid a major credit crisis like the one in 2008, and it obviously did not succeed. The method was used in VaR (Value at Risk) analyses, which measure the losses a portfolio could sustain under normal market conditions at a 95% confidence level. In other words, an investor with a one-million-euro portfolio and a daily VaR of 25,000 euros should lose more than that amount on only one trading day in twenty (the 5% left over from the confidence level). When the great international crisis broke out in 2008, losses soared far beyond those margins.
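
For illustration, the arithmetic behind such a 95% VaR figure can be reproduced on made-up returns; this toy calculation is not tied to the Data Science Machine.

```python
# Worked illustration of one-day 95% historical VaR on made-up returns:
# the loss threshold expected to be exceeded on only ~1 trading day in 20.
import numpy as np

rng = np.random.default_rng(0)
daily_returns = rng.normal(0.0, 0.015, 500)  # 500 days of toy returns

portfolio = 1_000_000                        # one million euros
var_95 = -np.percentile(daily_returns, 5) * portfolio

print(f"1-day 95% VaR: {var_95:,.0f} euros")  # losses exceed this ~5% of days
```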

Implementation of Deep Feature Synthesis

The Deep Feature Synthesis algorithm and its Data Science Machine are implemented on top of a MySQL database using InnoDB as the table engine, an open-source storage solution for this type of relational database. InnoDB replaces MyISAM, MySQL's previous table technology; it is more reliable, more consistent, more scalable and therefore performs better.

All the datasets Deep Feature Synthesis works with are converted into the schema used by MySQL. The calculation logic and the management and handling of the features of all this information are done in Python, the most widely used language for designing and configuring data science processes.

Why did the creators of Deep Feature Synthesis use a relational database like MySQL? Because the algorithm's requirements match the way data is organized in this type of database. The Data Science Machine implements functions of the AVG(), MAX(), MIN(), SUM(), STD() and COUNT() type, and adds others for different kinds of operations on the data, such as LENGTH(), or WEEKDAY() and MONTH() to convert dates into the day of the week and the month when they occurred.
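
The kind of MySQL query those functions map to might look like the following; the table and column names are hypothetical, not taken from the paper.

```python
# A hypothetical MySQL query combining the aggregate functions mentioned
# above with the WEEKDAY() and MONTH() date conversions. Run it with any
# MySQL client, e.g. mysql-connector-python.
query = """
SELECT c.customer_id,
       AVG(o.total)       AS avg_order_total,  -- AVG()
       MAX(o.total)       AS max_order_total,  -- MAX()
       COUNT(*)           AS order_count,      -- COUNT()
       WEEKDAY(o.created) AS order_weekday,    -- day of week of the order
       MONTH(o.created)   AS order_month       -- month of the order
FROM customers AS c
JOIN orders    AS o ON o.customer_id = c.customer_id
GROUP BY c.customer_id, WEEKDAY(o.created), MONTH(o.created);
"""
print(query)
```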

The use of these functions, plus the creation of filters, enables the algorithm to address two really important matters in predictive models:

●      Applying functions only to the cases in which a given condition holds true, which is impossible without data filtering.

●      Constructing time-interval values, bounded both above and below, relative to a reference date.

This makes it much easier to optimize database queries; a sketch of such a filtered, time-bounded query follows.
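
A hypothetical query combining both ideas, again with invented table and column names:

```python
# Sketch of the two capabilities above: a WHERE filter restricts the
# aggregation to rows where a condition holds, and the date predicate
# bounds the feature to a time interval around a reference date.
reference_date = "2016-01-26"

query = """
SELECT c.customer_id,
       SUM(o.total) AS spend_last_30_days
FROM customers AS c
JOIN orders    AS o ON o.customer_id = c.customer_id
WHERE o.status = 'completed'                          -- condition filter
  AND o.created BETWEEN DATE_SUB(%s, INTERVAL 30 DAY)
                    AND %s                            -- bounded time interval
GROUP BY c.customer_id;
"""
# With mysql-connector-python:
#   cursor.execute(query, (reference_date, reference_date))
print(query)
```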

The three processes of Deep Feature Synthesis

A data science pipeline based on Deep Feature Synthesis follows the three usual processes for preparing predictive models (a minimal sketch follows the list):

●      Data pre-processing: preliminary work with the data is essential before any machine learning can start. The data must be reviewed in order to reject, for example, null values.

●      Feature selection and dimensionality reduction: the algorithm generates a large number of features for each entity, so a preliminary selection and reduction step is necessary.

●      Modeling: decision trees are used for data modeling.
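
As an illustration of those three steps, here is a minimal scikit-learn sketch on toy data; the feature matrix stands in for the output of Deep Feature Synthesis and this is not the authors' actual pipeline.

```python
# Minimal sketch of the three steps: pre-processing (impute nulls),
# feature selection / dimensionality reduction, and modeling with a
# decision tree. X is toy stand-in data for a synthesized feature matrix.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))          # 50 synthesized features
X[rng.random(X.shape) < 0.05] = np.nan  # sprinkle in null values
y = (X[:, 0] > 0).astype(int)           # toy target

model = make_pipeline(
    SimpleImputer(strategy="mean"),      # 1. pre-processing
    SelectKBest(f_classif, k=10),        # 2. feature selection / reduction
    DecisionTreeClassifier(max_depth=5), # 3. modeling with a decision tree
)
model.fit(X, y)
print(f"training accuracy: {model.score(X, y):.2f}")
```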
