Apache Spark Turbo Charges Big Data Analysis

This post has already been read 1459 times!
0 Flares Twitter 0 Facebook 0 0 Flares ×

Apache Spark Turbo Charges Big Data analysisApache Projects[1] “Spark” provides a faster and more general data processing platform. Spark lets you run programs up to 100x faster in memory, or 10x faster on disk, than Hadoop. Last year, Spark took over Hadoop by completing the 100 TB Daytona GraySort contest 3x faster on one tenth the number of machines and it also became the fastest open source engine for sorting a petabyte.

Spark also makes it possible to write code more quickly as you have over 80 high-level operators at your disposal. To demonstrate this, let’s have a look at the “Hello World!” of BigData: the Word Count example. Written in Java for MapReduce it has around 50 lines of code, whereas in Spark (and Scala) you can do it as simply as inputting some simple code.

The Spark core is complemented by a set of powerful, higher-level libraries which can be seamlessly used in the same application. These libraries currently include SparkSQL, Spark Streaming, MLlib (for machine learning), and GraphX, each of which is further detailed in this article. Additional Spark libraries and extensions are currently under development as well.

Transformations in Spark are “lazy”, meaning that they do not compute their results right away. Instead, they just “remember” the operation to be performed and the dataset (e.g., file) to which the operation is to be performed. The transformations are only actually computed when an action is called and the result is returned to the driver program. This design enables Spark to run more efficiently. For example, if a big file was transformed in various ways and passed to first action, Spark would only process and return the result for the first line, rather than do the work for the entire file.


Spark Streaming

Spark Streaming supports real time processing of streaming data, such as production web server log files (e.g. Apache Flume and HDFS/S3), social media like Twitter, and various messaging queues like Kafka. Under the hood, Spark Streaming receives the input data streams and divides the data into batches. Next, they get processed by the Spark engine and generate final stream of results in batches, as depicted below.


Example cases using Spark

Now that we have answered the question “What is Apache Spark?”, let’s think of what kind of problems or challenges it could be used for most effectively.

I came across an article recently about an experiment to detect an earthquake by analyzing a Twitter stream. Interestingly, it was shown that this technique was likely to inform you of an earthquake in Japan quicker than the Japan Meteorological Agency. Even though they used different technology in their article, I think it is a great example to see how we could put Spark to use with simplified code snippets and without the glue code.

First, we would have to filter tweets which seem relevant like “earthquake” or “shaking”. We could easily use Spark Streaming for that purpose. Then, we would have to run some semantic analysis on the tweets to determine if they appear to be referencing a current earthquake occurrence. Tweets like ”Earthquake!” or ”Now it is shaking”, for example, would be consider positive matches, whereas tweets like “Attending an Earthquake Conference” or ”The earthquake yesterday was scary” would not. The authors of the paper used a support vector machine (SVM) for this purpose. We’ll do the same here, but can also try a streaming version.

If we are happy with the prediction rate of the model, we could move onto the next stage and react whenever we discover an earthquake. To detect one we need a certain number (i.e., density) of positive tweets in a defined time window (as described in the article). Note that, for tweets with Twitter location services enabled, we would also extract the location of the earthquake. Armed with this knowledge, we could use SparkSQL and query an existing Hive table (storing users interested in receiving earthquake notifications) to retrieve their email addresses and send them a personalized warning email.

Another example use of Spark is in the e-commerce industry. Real-time transaction information could be passed to a streaming clustering algorithm like k-means or collaborative filtering like ALS. Results could then even be combined with other unstructured data sources, such as customer comments or product reviews, and used to constantly improve and adapt recommendations over time with new trends.

In the finance or security industry, the Spark stack could be applied to a fraud or intrusion detection system or risk-based authentication. It could achieve top-notch results by harvesting huge amounts of archived logs, combining it with external data sources like information about data breaches and compromised accounts.



According to Radek Ostowski of Toptal.com, “Spark helps to simplify the challenging and compute-intensive task of processing high volumes of real-time or archived data, both structured and unstructured, seamlessly integrating relevant complex capabilities such as machine learning and graph algorithms. Spark brings Big Data processing to the masses.”



[1] The Apache Software Foundation is a decentralized open source community of developers. The software they produce is distributed under the terms of the Apache License and is free and open source software (FOSS). The Apache projects are characterized by a collaborative, consensus-based development process and an open and pragmatic software license. Each project is managed by a self-selected team of technical experts who are active.


Additional Reading

Big Data Adds Real Value for the Oil and Gas Industry

Big Data, CEM and Customer Loyalty- A Key To Profitability


If you liked this article, we'll be happy to send you one email a month to let you know the newest edition of the MetaOps/MetaExperts MegEzine has been published. Just fill the form below.