Many developers try to automate some or all of the feature engineering steps in a machine learning workflow, preferably without losing predictive accuracy. In the ideal case, any user can take raw data, build a model on it, and obtain predictions that are as accurate as the available sample allows.
Expert Python developers recommend many frameworks and libraries that help with this, and this article covers several of them.
Frameworks
The frameworks described below are just some of the Python frameworks you can find, but they have proved their effectiveness in many use cases.
-
MLBox
The MLBox framework has proved itself over the past few years. For example, in the Kaggle competition Two Sigma Connect: Rental Listing Inquiries, participants using MLBox placed in the top 5% of the leaderboard. Overall, it can solve the following tasks:
- Data preparation (the most advanced part of the library);
- Model selection;
- Searching for hyperparameters.
Among the disadvantages, installing MLBox is much easier on Linux than on Mac or Windows.
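To make the three tasks above concrete, here is a minimal hand-written sketch of what such a framework automates. It uses plain scikit-learn rather than MLBox's own API, and the dataset, the pipeline step names, and the parameter grid are illustrative assumptions, not anything MLBox prescribes.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy dataset (an assumption for illustration only).
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# 1. Data preparation: fill missing values, then scale the features.
pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# 2. Model selection and 3. hyperparameter search: each dict swaps in
# a different estimator for the "model" step and lists values to try.
param_grid = [
    {"model": [LogisticRegression(max_iter=1000)],
     "model__C": [0.1, 1.0, 10.0]},
    {"model": [RandomForestClassifier(random_state=0)],
     "model__n_estimators": [50, 100]},
]

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

test_accuracy = search.score(X_test, y_test)
print(search.best_params_)
print(f"held-out accuracy: {test_accuracy:.3f}")
```

An automated framework hides the grid and the choice of candidate models behind a single call, but the underlying loop is of this kind: prepare the data, try several models and settings, and keep the best one by cross-validated score.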
-
Auto Sklearn
As the name implies, the Auto Sklearn framework is based on the popular scikit-learn machine learning library. It can perform the following:
- Feature generation;
- Model selection;
- Hyperparameter tuning.
Auto Sklearn copes well with small datasets but sometimes doesn’t work correctly with large datasets.
-
TPOT
TPOT is positioned as a framework in which the entire machine-learning pipeline is automated. It builds many different models and then chooses the best one in terms of predictive accuracy. Like Auto Sklearn, this framework is an add-on to scikit-learn, but TPOT has its own regression and classification algorithms.
The disadvantages include TPOT’s inability to work directly with natural-language text and categorical string values.
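A common workaround for the categorical-string limitation is to encode such columns as numbers before handing the data to the framework. The sketch below uses pandas one-hot encoding; the column names and values are hypothetical, and whether a given TPOT version needs this step is an assumption based on the limitation described above.

```python
import pandas as pd

# Hypothetical toy data: one numeric column and one categorical string column.
df = pd.DataFrame({
    "rooms": [1, 2, 3, 2],
    "city": ["Paris", "Berlin", "Paris", "Rome"],
})

# One-hot encode the string column so that every feature is numeric.
encoded = pd.get_dummies(df, columns=["city"], dtype=int)

print(encoded.columns.tolist())
print(encoded.dtypes)
```

One-hot encoding suits columns with a small number of distinct values; for columns with many categories, the number of generated features grows accordingly, so another encoding may be a better fit.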
-
H2O
The H2O framework supports both traditional machine learning models and neural networks. It is especially suitable for those looking for a way to automate deep learning.
-
Auto Keras
Auto Keras follows the design of the classic scikit-learn API but uses neural architecture search, built on Keras, to find the model structure and parameters.
Libraries
All these Python libraries and packages are used in one way or another in almost any machine-learning task. Often they are enough to build a complete model, at least as a first approximation.
-
NumPy
NumPy is an open-source library for performing linear algebra operations and numerical transformations. Typically, such operations are needed to transform datasets that can be represented as a matrix.
The library implements numerous operations for working with multidimensional arrays, Fourier transforms, and random number generators. NumPy arrays are the de facto standard for storing numerical data in many other libraries (such as Pandas, scikit-learn, and SciPy).
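The operations mentioned above can be sketched in a few lines. The matrix, the signal, and the seed below are arbitrary values chosen for illustration.

```python
import numpy as np

# Linear algebra: solve the system A @ x = b.
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])
x = np.linalg.solve(A, b)

# Fourier transform: find the dominant frequency of a sampled sine wave.
# 64 samples over one second, so bin k of the spectrum corresponds to k Hz.
t = np.arange(64) / 64.0
signal = np.sin(2 * np.pi * 5 * t)
spectrum = np.abs(np.fft.rfft(signal))
dominant_bin = int(np.argmax(spectrum))

# Random numbers: a seeded generator gives reproducible samples.
rng = np.random.default_rng(seed=42)
noise = rng.normal(size=(3, 4))

print(x, dominant_bin, noise.shape)
```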
-
Pandas
Pandas is a library for data processing. It’s used to load data from almost any source (it integrates with the primary data storage formats for machine learning), compute various functions and create new features, and build queries on the data using aggregate functions similar to those implemented in SQL.
In addition, it provides functions for reshaping tables, sliding-window calculations, and other methods for extracting information from data.
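As a small illustration of loading, SQL-like aggregation, and a sliding window, consider the sketch below. The CSV content and the column names (day, store, sales) are made up for the example.

```python
import io

import pandas as pd

# Hypothetical data; in practice this would come from a file or a database.
csv_text = """day,store,sales
2024-01-01,A,10
2024-01-02,A,12
2024-01-03,A,14
2024-01-01,B,20
2024-01-02,B,18
2024-01-03,B,22
"""
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["day"])

# Roughly: SELECT store, SUM(sales), AVG(sales) FROM df GROUP BY store
summary = df.groupby("store")["sales"].agg(total="sum", mean="mean")

# New feature: a 2-day rolling mean of sales, computed within each store.
df = df.sort_values(["store", "day"])
df["sales_roll2"] = df.groupby("store")["sales"].transform(
    lambda s: s.rolling(window=2).mean()
)

print(summary)
print(df)
```

The first row of each store has no rolling value because the window is not yet full; such gaps usually need to be filled or dropped before training a model.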
-
TensorFlow
The TensorFlow library was developed by Google and is used for building neural networks. It has a C++ core that supports GPU computing. Higher-level libraries are built on top of it to work with neural networks at the level of entire layers. For example, some time ago the popular Keras library started to use TensorFlow as its main computation backend instead of the similar Theano library.
This library is most likely a good fit if you work with images (for example, with convolutional neural networks).
-
Keras
Keras is a library for building neural networks that supports basic layers and building blocks. It supports both recurrent and convolutional neural networks and includes implementations of well-known neural network architectures (for example, VGG16). Some time ago, layers from this library became available inside the TensorFlow library.
There are ready-to-use functions for working with images and text (word embeddings, etc.). It can also be used with Apache Spark through the dist-keras project.
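To show what building a network from layers looks like, here is a minimal sketch of a small convolutional model defined with the Keras API bundled in TensorFlow. The input shape, layer sizes, and number of classes are arbitrary assumptions, and the model is run on an all-zero batch only to show the shape of its output, not trained.

```python
import numpy as np
from tensorflow import keras

# A tiny convolutional classifier for 28x28 grayscale images (assumed shape).
model = keras.Sequential([
    keras.layers.Input(shape=(28, 28, 1)),
    keras.layers.Conv2D(8, kernel_size=3, activation="relu"),
    keras.layers.MaxPooling2D(pool_size=2),
    keras.layers.Flatten(),
    keras.layers.Dense(10, activation="softmax"),  # 10 hypothetical classes
])
model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

# Run an untrained forward pass on a dummy batch of two images.
batch = np.zeros((2, 28, 28, 1), dtype="float32")
probabilities = model.predict(batch, verbose=0)
print(probabilities.shape)
```

Each row of the output holds one probability per class. Training would follow with model.fit on real labeled images.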
Conclusion
Like many other topics in machine learning, automated feature engineering using frameworks and libraries is a complex concept built on simple ideas. The tools described in this article are publicly available and save valuable time. Just remember: models are only as good as the data you provide them, and automated feature engineering can help make the feature creation process more efficient.