Three solid real-time Big Data alternatives: Spark, Storm and DataTorrent RTS

4 min read

Development / 07 August 2015

Data, data, data. Value, value, value. And if possible, in real time. The concept of real-time business intelligence has been on the market for some time, but until very recently only a limited number of companies used it. Today, Hadoop's stability makes it the most commonly used platform for analyzing large volumes of data, but when streaming calculations are needed, solutions such as Spark, Storm or DataTorrent RTS are a great choice.

Real-time analysis used to have little market penetration, for two main reasons: first, there were few real-time business intelligence tools; second, the existing solutions were geared only to batch data analysis and were expensive. Spark, Storm and DataTorrent RTS address both problems.

1. Apache Spark

Apache Spark is undoubtedly the great new star of Big Data analytics. It is an open-source platform for processing data in real time, and it can be programmed in four languages: Scala, the language in which the platform is written; Python; R; and Java. The idea of Spark is to handle constant data input at speeds far above those offered by Hadoop MapReduce.

Some of its key features are:

– Speed of computation in memory and on disk: the Spark project claims processing up to 100 times faster than Hadoop MapReduce in memory and up to 10 times faster on disk.

– Execution on all types of platforms: Spark can run on Hadoop, on Apache Mesos, in standalone cluster mode or in the cloud (for example, on Amazon EC2). In addition, Spark can access numerous data sources such as HDFS, Cassandra, HBase or S3, Amazon's object storage service.

– It incorporates a package of very useful tools for developers: the MLlib library for implementing machine learning solutions, and GraphX, Spark's API for graph computation.

– It has other interesting tools: Spark Streaming, which processes live data streams across the cluster, and Spark SQL, which makes it easier to query the data using SQL.
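
To give a feel for how stream processing of this kind works, here is a minimal sketch in plain Python. It does not use the Spark API: it only imitates the micro-batch idea behind Spark Streaming, where the incoming stream is cut into small batches and each batch updates a running result. Real Spark Streaming cuts batches by time interval; the fixed batch size, the function names and the sample lines below are assumptions made for illustration.

```python
from collections import Counter
from itertools import islice
from typing import Iterable, Iterator, List


def micro_batches(stream: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Cut a (possibly endless) stream into small fixed-size batches."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # the source ran dry; a live feed would keep going
        yield batch


def running_word_count(stream: Iterable[str], batch_size: int = 2) -> Iterator[Counter]:
    """Yield a snapshot of the running word counts after each micro-batch."""
    totals: Counter = Counter()
    for batch in micro_batches(stream, batch_size):
        for line in batch:
            totals.update(line.lower().split())
        yield Counter(totals)  # copy, so earlier snapshots stay unchanged


# Hypothetical sample input standing in for a live feed.
sample_lines = ["spark storm", "spark", "storm storm", "datatorrent"]
snapshots = list(running_word_count(sample_lines, batch_size=2))
```

Each snapshot is available as soon as its batch has been processed, which is what lets results be read while data is still arriving instead of waiting for a whole job to finish.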

2. Apache Storm

Apache Storm is an open-source distributed real-time computation system. It processes large volumes of data simply and reliably, and is used for real-time analytics (for example, the continuous study of information from social networks), distributed RPC, ETL processes…

While Hadoop carries out batch data processing, Storm does it in real time. In Hadoop, the data is loaded into a file system (HDFS) and then distributed across the nodes to be processed. When the task is complete, the results return from the nodes to HDFS to be used. In Storm there is no job with a beginning and an end: the system is built around topologies, graphs of processing steps that transform and analyze the data continuously as it arrives.
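
The idea of a topology can be sketched in a few lines of plain Python. This is a toy model, not the Storm API: Storm calls its stream sources "spouts" and its processing steps "bolts", and the generator functions below only borrow those names. The function names and the sample feed are assumptions made for illustration.

```python
from typing import Dict, Iterable, Iterator, Tuple


def sentence_spout(sentences: Iterable[str]) -> Iterator[str]:
    """Source of the stream; a real feed would never run dry."""
    for sentence in sentences:
        yield sentence


def split_bolt(sentences: Iterable[str]) -> Iterator[str]:
    """Processing step: break each sentence into lowercase words."""
    for sentence in sentences:
        for word in sentence.lower().split():
            yield word


def count_bolt(words: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Processing step: emit the updated count every time a word arrives."""
    counts: Dict[str, int] = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
        yield word, counts[word]


# Hypothetical sample feed; wiring the steps together forms the "topology".
feed = ["storm is fast", "storm is reliable"]
topology = count_bolt(split_bolt(sentence_spout(feed)))
emitted = list(topology)
```

Because every step is a generator, each word flows through the whole chain as soon as it is read, and an updated count is emitted without waiting for the input to end. That is the contrast with the batch model described above.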

That is why Storm is something more than a system of Big Data analytics: it is a system for Complex Event Processing (CEP). This type of solution allows companies to respond to the arrival of sudden and continuous data (information collected in real time by sensors, millions of comments generated on social networks such as Twitter, WhatsApp and Facebook, bank transfers…).

It is also of particular interest for developers for a number of reasons:

– It can be used with various programming languages. Storm has been developed in Clojure, a dialect of Lisp that runs on the Java Virtual Machine (JVM). Its great strength is that it is compatible with components and applications written in languages such as Java, C#, Python, Scala, Perl and PHP.

– It is scalable.

– It is fault-tolerant.

– It is easy to install and operate.

3. DataTorrent RTS

DataTorrent RTS is an open-source solution for the batch or real-time processing and analysis of big data. It is an all-in-one tool that aims to outdo not only what can be done with Hadoop MapReduce, but also the performance already offered by Spark and Storm. According to the company, the platform can process billions of events per second and recover from node outages with no data loss and no human intervention.

Some of its key features include:

– Guaranteed event processing.

– High in-memory performance.

– It is scalable.

– Fault-tolerance at platform level.

– Easy to execute.

– Applications programmed in Java.
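
How a platform can recover from an outage without losing events can be illustrated with a generic checkpoint-and-replay pattern. The sketch below is plain Python and is not DataTorrent's implementation or API: it assumes a source that can be re-read from a saved offset, and the function name, the checkpoint interval and the simulated crash are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

Checkpoint = Tuple[int, int]  # (next offset to read, running total)


def process(events: List[int], checkpoint: Checkpoint = (0, 0),
            checkpoint_every: int = 2,
            crash_at: Optional[int] = None) -> Tuple[Checkpoint, bool]:
    """Sum events starting from the checkpointed offset.

    Returns the last saved checkpoint and whether the run finished.
    """
    offset, total = checkpoint
    saved = checkpoint
    while offset < len(events):
        if crash_at is not None and offset == crash_at:
            return saved, False  # simulated outage: work since the last save is lost
        total += events[offset]
        offset += 1
        if offset % checkpoint_every == 0 or offset == len(events):
            saved = (offset, total)  # persist position and state together
    return saved, True


# Hypothetical event values.
events = [5, 1, 4, 2, 3]
saved, finished = process(events, crash_at=3)        # fails before the 4th event
recovered, done = process(events, checkpoint=saved)  # restart from the checkpoint
```

The restarted run replays the events read after the last checkpoint, so the final total matches an uninterrupted run. Saving the offset and the state together is what keeps any event from being skipped or counted twice in this sketch.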

This Big Data solution provides mechanisms for ingesting data from many different sources, directly from external databases or through their integration with native corporate applications. DataTorrent RTS provides technical teams with a group of connectors previously developed for SQL and NoSQL databases, Apache Sqoop, Apache Kafka, Apache Flume and social networks such as Twitter… Anything that generates data.

At the end of the day, these Big Data tools allow companies to discover where their real business opportunities lie, cutting study and analysis times and reducing costs. It is a battle fought with real-time and predictive models to gain competitiveness and beat the competition.


