The data scientist’s toolbox

Innovation / 16 April 2015
BBVA API Market

Data science today is a multidisciplinary profession in which knowledge from various areas overlaps, in a profile more typical of the Renaissance than of this super-specialized 21st century.

Given the scarcity of formal training in this field, data scientists are forced to collect dispersed knowledge and tools to optimally develop their skills.

The following is intended to be a basic guide, by no means exhaustive, to some useful resources available for each facet of these professionals’ work.

Data management

Part of the data scientist’s work is to capture, clean and store information in a format suitable for processing and analysis.

The most usual scenario is to access a copy of the data source for a one-time or periodic capture.

You will need to know SQL to access data stored in relational databases. Each database has a console for executing SQL queries, although most people prefer a graphical environment with information about tables, fields and indexes. Some of the most popular data management tools are Toad, proprietary software for Microsoft’s platform, and Tora, which is open-source and cross-platform.

Once the data is extracted, we can store it in plain text files that we upload to our working environment for machine learning, or load it into a tool such as SQLite.

SQLite is a lightweight relational database with no external dependencies that does not need to be installed on a server. Moving a database is as easy as copying a single file. In our case, we can process the information without concurrent or multiple access to the source data, which perfectly suits the characteristics of SQLite.

The languages we use for our algorithms can connect to SQLite (Python through sqlite3, and R through RSQLite), so we can choose to import the data before preprocessing or to do part of the preprocessing in the database itself, which will spare us more than one problem once the number of records grows.
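As a rough illustration of doing part of the preprocessing in the database itself, the following sketch uses Python’s standard sqlite3 module. The table name, the column names and the sample rows are invented for the example, and an in-memory database stands in for the single-file database described above.

```python
import csv
import io
import sqlite3

# Hypothetical extract: in practice this would be a plain text file on disk.
RAW_CSV = """customer_id,amount,country
1,120.50,ES
2,,ES
3,87.00,MX
4,15.25,MX
"""


def load_csv_into_sqlite(conn, csv_text):
    """Create a table and bulk-insert the rows of a CSV extract."""
    conn.execute(
        "CREATE TABLE purchases (customer_id INTEGER, amount REAL, country TEXT)"
    )
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [
        # Empty strings become NULL so that SQL can filter them out later.
        (
            int(r["customer_id"]),
            float(r["amount"]) if r["amount"] else None,
            r["country"],
        )
        for r in reader
    ]
    conn.executemany("INSERT INTO purchases VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


def clean_rows(conn):
    """Do part of the preprocessing in SQL: drop rows with a missing amount."""
    cursor = conn.execute(
        "SELECT customer_id, amount, country FROM purchases "
        "WHERE amount IS NOT NULL ORDER BY customer_id"
    )
    return cursor.fetchall()


# ":memory:" keeps the sketch self-contained; passing a file path instead
# would create the single-file database that can be copied around.
conn = sqlite3.connect(":memory:")
n_loaded = load_csv_into_sqlite(conn, RAW_CSV)
cleaned = clean_rows(conn)
```

Only the rows that survive the SQL filter are pulled into Python, which is the point of pushing the cleaning step into the database when the number of records is large.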

An alternative for bulk data capture is to use a tool that covers the full ETL cycle (Extraction, Transformation and Load), such as RapidMiner, Knime or Pentaho. With them, we can graphically define the data acquisition and cleaning cycles using connectors.

If we have guaranteed access to the data source during preprocessing, we can use a database connection (RODBC and RJDBC in R, and pyODBC, mxODBC and SQLAlchemy in Python) and benefit from performing joins (JOIN) and aggregations (GROUP BY) in the database engine, importing the results afterwards.
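The idea of letting the engine do the join and the grouping, and importing only the result, might look like the sketch below. To keep it runnable without a database server, the standard sqlite3 module stands in for the live connection; the tables, columns and values are hypothetical.

```python
import sqlite3


def build_demo_database():
    """Create two small related tables (invented data for illustration)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, segment TEXT);
        CREATE TABLE orders (
            order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL
        );
        INSERT INTO customers VALUES (1, 'retail'), (2, 'retail'), (3, 'corporate');
        INSERT INTO orders VALUES
            (10, 1, 20.0), (11, 1, 30.0), (12, 2, 50.0), (13, 3, 400.0);
        """
    )
    return conn


def spend_by_segment(conn):
    """Run the JOIN and GROUP BY in the engine and fetch only the summary."""
    query = """
        SELECT c.segment, COUNT(*) AS n_orders, SUM(o.amount) AS total
        FROM orders AS o
        JOIN customers AS c ON c.customer_id = o.customer_id
        GROUP BY c.segment
        ORDER BY c.segment
    """
    return conn.execute(query).fetchall()


conn = build_demo_database()
summary = spend_by_segment(conn)
```

With a real source, the same query text would be sent through the chosen connector, and only the handful of summary rows would travel to the analysis environment.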

For external processing, pandas (a Python library) and data.table (an R package) are our first choice. data.table makes it possible to circumvent one of R’s weaknesses, memory management, by performing vectorized operations and grouping by reference without having to duplicate objects temporarily.

A third scenario is to access information generated in real time and transmitted in formats such as XML or JSON. These are called incremental learning projects, and among them we find recommendation systems, online advertising and high-frequency trading.

For this we will use tools like XML or jsonlite (R packages), or xml and json (Python modules). With them we capture the stream, make our predictions, send them back in the same format, and update our model once the source system later provides us with the outcomes actually observed.
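A minimal sketch of that capture, predict, reply and update loop follows, using only Python’s json and math modules. The message fields (id, features, outcome) and the choice of an online logistic regression trained by gradient descent are assumptions made for the example, not something the tools above prescribe.

```python
import json
import math


class OnlineLogisticModel:
    """Tiny logistic regression updated one observation at a time."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, features):
        z = self.bias + sum(w * x for w, x in zip(self.weights, features))
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, features, outcome):
        # One gradient descent step on the log-loss for a single example.
        error = self.predict(features) - outcome
        self.weights = [
            w - self.learning_rate * error * x
            for w, x in zip(self.weights, features)
        ]
        self.bias -= self.learning_rate * error


def handle_request(model, message):
    """Parse an incoming JSON event and answer with a JSON prediction."""
    event = json.loads(message)
    score = model.predict(event["features"])
    return json.dumps({"id": event["id"], "score": score})


def handle_feedback(model, message):
    """Update the model once the observed outcome arrives."""
    event = json.loads(message)
    model.update(event["features"], event["outcome"])


model = OnlineLogisticModel(n_features=2)
reply = handle_request(model, '{"id": 7, "features": [1.0, 0.0]}')
handle_feedback(model, '{"id": 7, "features": [1.0, 0.0], "outcome": 1}')
reply_after = handle_request(model, '{"id": 8, "features": [1.0, 0.0]}')
```

The untrained model answers with a score of 0.5; after the positive outcome is fed back, the same features receive a higher score, which is the incremental behaviour these projects rely on.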

Analysis

Even though Business Intelligence, Data Warehousing and Machine Learning are all part of Data Science, the last is the one that requires the greatest number of specific utilities.

Hence, our toolbox will need to include R and Python, the programming languages most widely used in machine learning.

For Python we highlight the scikit-learn suite, which covers almost all techniques, except perhaps neural networks. For these we have several interesting alternatives, such as Caffe and Pylearn2. The latter is based on Theano, a Python library that allows symbolic definitions and transparent use of GPUs.

Some of the most used packages for R:

– Gradient boosting: gbm and xgboost.

– Random forests for classification and regression: randomForest and randomForestSRC.

– Support vector machines: e1071, LiblineaR and kernlab.

– Regularized regression (Ridge, Lasso and ElasticNet): glmnet.

– Generalized additive models: gam.

– Clustering: cluster.
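For readers unfamiliar with the regularized regression entry in the list above, the least-squares objective that glmnet documents can be written as follows. The notation (N observations, coefficients β, penalty strength λ and mixing parameter α) is introduced here for illustration and does not appear in the original text.

```latex
\min_{\beta_0,\,\beta}\;
\frac{1}{2N}\sum_{i=1}^{N}\bigl(y_i-\beta_0-x_i^{\top}\beta\bigr)^2
+\lambda\Bigl[\tfrac{1-\alpha}{2}\,\lVert\beta\rVert_2^2
+\alpha\,\lVert\beta\rVert_1\Bigr]
```

Setting α = 1 gives the Lasso, α = 0 gives Ridge, and intermediate values give the ElasticNet mix of both penalties.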

If we need to modify an R package, we will need C++ and some utilities that allow us to rebuild it: Rtools, an environment for creating R packages under Windows, and devtools, which facilitates all processes related to development.

There are also some general purpose tools that will make our life easier in R:

– Data.table: Fast reading of text files; creation, modification and deletion of columns by reference; joins by a common key or group, and summary of data.

– Foreach: Execution of parallel processes against a previously defined backend with utilities such as doMC or doParallel.

– Bigmemory: Manage massive matrices in R and share information across multiple sessions or parallel analyses.

– Caret: Compare models, control data partitions (splitting, bootstrapping, subsampling) and tune parameters (grid search).

– Matrix: Manage sparse matrices and transform categorical variables to binary (one-hot encoding) using the sparse.model.matrix function.
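The one-hot encoding mentioned in the last item can be illustrated in a few lines of plain Python. This is only a sketch of the idea and not a reproduction of sparse.model.matrix: the function names and the sample values are hypothetical, and the sparse matrix is represented as a list of (row, column, value) triplets.

```python
def one_hot_sparse(values):
    """Encode a categorical column as sparse (row, column, 1) triplets.

    Returns the triplets and the sorted list of categories, whose order
    defines the column index of each category.
    """
    categories = sorted(set(values))
    column_of = {category: j for j, category in enumerate(categories)}
    triplets = [(i, column_of[v], 1) for i, v in enumerate(values)]
    return triplets, categories


def to_dense(triplets, n_rows, n_cols):
    """Expand the triplets into a dense matrix, for inspection only."""
    dense = [[0] * n_cols for _ in range(n_rows)]
    for i, j, value in triplets:
        dense[i][j] = value
    return dense


colors = ["red", "green", "red", "blue"]
triplets, categories = one_hot_sparse(colors)
dense = to_dense(triplets, len(colors), len(categories))
```

The sparse form stores one entry per row regardless of how many categories exist, which is why it matters when a variable has thousands of levels.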

Distributed environments deserve a special mention. If we have dealt with data from a large institution or company, we will probably have experience working with the so-called Hadoop ecosystem. Hadoop combines a distributed file system (HDFS) with a programming model (MapReduce) that allows information to be processed in parallel.
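To give a feel for the MapReduce model, here is a single-machine sketch in plain Python using the classic word count, which is an example chosen for illustration. It does not use Hadoop at all: the three functions simply mimic the map, shuffle and reduce phases that a cluster would distribute across nodes.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(mapped_pairs):
    """Group all emitted values by key, as the framework would do."""
    grouped = defaultdict(list)
    for key, value in mapped_pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine the values of one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # On a cluster, each document (or block) would be mapped on a
    # different node; here the calls simply run one after another.
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


counts = word_count(["spark and hadoop", "hadoop and hdfs"])
```

Because each map call looks at one document only and each reduce call at one key only, both phases can be spread over many machines without coordination beyond the shuffle.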

Among the machine learning tools compatible with Hadoop we find:

– Vowpal Wabbit: Online learning methods based on gradient descent.

– Mahout: A suite of algorithms, including recommendation systems, clustering, logistic regression and random forests.

– h2o: Perhaps the tool currently growing fastest, with a large number of parallelizable algorithms. It can be run from a graphical environment or from R or Python.

The data scientist should also keep abreast of the generational shift from Hadoop to Spark.

Spark has several advantages over Hadoop for processing information and running algorithms. The main one is speed: it is reported to be up to 100 times faster for in-memory workloads because, unlike Hadoop MapReduce, it keeps data in memory and only writes to disk when necessary.

Spark can run independently or may coexist as a component of Hadoop, allowing migration to be planned in a non-traumatic way. You can, for example, use HBase as a database, even though Cassandra is emerging as a storage solution thanks to its redundancy and scalability.

As a sign of the change of scenery, Mahout has been working since last year on its integration with Spark, distancing itself from MapReduce and Hadoop, while H2O.ai has launched Sparkling Water, a version of its h2o suite on Spark.

Visualization

Finally, a brief reference to the presentation of results.

The most popular tools are clearly lattice and ggplot2 for R, and Matplotlib for Python. But if we need professional presentations embedded in web environments, the best choice is certainly D3.js.

Among the integrated Business Intelligence environments with a clear approach to presentations we should highlight the well known Tableau, and as alternatives for graphical exploration of data, Birst and Necto.
