A Data Processing Guide in the Big Data Jungle

Too many choices? Don’t get lost!

We are deep in the Big Data jungle. According to Gartner’s Hype Cycle for Emerging Technologies, Big Data has officially passed the “peak of inflated expectations” and is now on a one-way trip to the “trough of disillusionment”. Gartner says it has done so rather fast, because we already have consistency in the way we approach this technology and because most new advances are additive rather than revolutionary.

Pig, Hive, Impala, Tez and Spark: which one suits which use case?

With so much hype and so many new advances, it’s easy to get lost. This little guide gives you an overview of data processing technologies in the Big Data jungle and tries to identify the best use cases for each.

  • Pig: Pig is often useful for pulling apart unstructured and nested data like text or JSON. Since Pig Latin is a procedural language, it is a very good choice for developing data pipelines on Hadoop. Pig is based on MapReduce and has tools for data storage, data execution and data manipulation.
  • Hive: Hive was the original “relational on Hadoop” and is the first Hadoop SQL (HiveQL to be precise) query engine. Hive is still the most mature of all the engines in this guide, as well as the slowest one. Hive is also based on MapReduce and is a very good choice for heavy ETL tasks where reliability is important, e.g. daily aggregation jobs.
  • Impala: Impala is the only native open-source SQL query engine in the Hadoop world. It skips MapReduce entirely and is best used for SQL queries over big volumes. Impala can also deliver results interactively over bigger volumes and much faster than other Hadoop query engines.
  • Tez: Tez may be considered a better and faster base for query engines like Pig and Hive. Tez gets around limitations imposed by MapReduce and enables use cases such as near-real-time querying and Machine Learning, which do not fit well into the MapReduce paradigm.
  • Spark: Spark is an in-memory query engine that also skips MapReduce. Perfect use cases for Spark are streaming, interactive data processing and ad-hoc analysis of moderate-sized data sets (as big as the cluster’s RAM). The ability of Spark to reuse data in-memory is the real highlight for these use cases. Spark SQL offers relational connectivity.

7 Big Data Analytics Use Cases for Financial Institutions

Big Data Analytics

Lately we have heard a lot about the ability of Big Data Analytics to deliver usable insight, but what exactly does this mean for the financial services industry?

While much of the Big Data activity in the market up to now has consisted of experimenting with Big Data technologies and proof-of-concept projects, in this post I would like to show seven issues that banks and insurers can address with Big Data Analytics:

1. Dynamic 360° View of the Customer:
Extend your existing customer views by incorporating dynamic internal and external information sources. Gain a full understanding of customers – what makes them tick, why they buy, how they prefer to shop, why they switch, what they’ll buy next, and what factors lead them to recommend a company to others.

2. Enhanced Commercial Scorecard Design and Implementation:
Financial institutions use Big Data solutions to analyze commercial loan origination, develop scorecards and scoring, and ultimately improve accuracy while optimizing pricing and risk management.

3. Risk Concentration Identification and Management:
Identify risk concentration hotspots by decomposing risk into customized insights. Clearly see factor contribution to risks and gain allocation consensus through downside risk budgeting.

4. Next Best Action Recommendations:
Make “next best action” an integral part of your marketing strategy and proactive customer care. With analytical insight from Big Data, you can answer such questions as: What approach will get the most out of the customer relationship? Is selling more important than retention?

5. Fraud Detection Optimization:
Preventing fraud is a major priority for all financial services organizations. But to deal with the escalating volumes of financial transaction data, statisticians need better ways to mine data for insight. Optimizing your current fraud detection techniques helps you leverage your existing fraud detection assets.

6. Data and Insights Monetization:
Use your customer transaction data to improve targeting of cross-sell offers. Partners are increasingly promoting merchant-based reward programs which leverage a bank’s or credit card issuer’s data and provide discounts to customers at the same time.

7. Regulatory and Data Retention Requirements:
More robust regulatory and data retention management is a legal requirement for financial services organizations across the globe, which must comply with a myriad of local, federal, and international laws (such as Basel III) that mandate the retention of certain types of data.

What is the Best Onshore/Offshore Ratio for Consulting Firms?

Challenges and potential of the extended usage of offshore resources for consulting firms
Screenshot of Onshore/Offshore Ratio survey

At the start of a new engagement, managers take many activities into consideration, such as project planning, effort estimation, defining goals and metrics, cost, outcome, etc. One of the most important factors for any project to succeed is choosing the right onshore/offshore staffing ratio to execute the project. This factor is often not given adequate weight in recent delivery models. To meet project profit margins, managers try to limit the cost of project resources and execution. With a limited resourcing budget, it is not feasible to have a default onshore/offshore ratio that fits all projects.

After gathering some experience working offshore (2007-2008 in Bangalore, India) and onshore (in Germany and Switzerland), I started to wonder if there is an optimal onshore/offshore ratio. Quite soon I concluded that this question is not easy to answer. So I broke it down into several aspects and, instead of answering them myself, I set up a survey and hope to get your support!

Start the survey: http://bit.ly/offshoreratio
[Update 15 Nov 2014]: After collecting data over four weeks (18 Oct – 14 Nov), the survey is closed. Results will follow soon.

I’d like to invite three groups in particular to answer this survey:

  • Employees of traditional consulting firms
  • Employees of Indian pure players (such as Infosys, TCS, HCL, Wipro, etc.)
  • Employees of clients of consulting firms

Of course, I’m going to share the results after evaluation. Thank you for participating and sharing the link with your colleagues! Also retweets are highly appreciated…

India still the Top Destination for Outsourcing

SAP Labs India Pvt. Ltd. in Bangalore

Asian countries, especially those in South Asia and Southeast Asia, remain favored picks among organizations interested in contracting out business processes offshore. India remains the top outsourcing destination, with its unrivaled advantages in scale and people skills, according to the 2014 Global Services Location Index (GSLI) released by A.T. Kearney. China and Malaysia are second and third respectively.

The GSLI, which tracks offshoring patterns to lower-cost developing countries and the ascent of new locations, measures the underlying fundamentals of 51 nations across three general categories: financial attractiveness, people skills and availability, and business environment.

Published since 2004, the GSLI revealed that leading IT-services companies in India, to whom IT-related functions were outsourced, are extending their traditional offerings to incorporate research and development, product development and other niche services. The line between IT and business-process outsourcing there is blurring, as players offer packages and specialized services to their customers and are developing skills in niche domains.

Furthermore, the GSLI identified a trend of multinationals reassessing their outsourcing strategies after having aggressively outsourced back-office operations in the mid-2000s: some companies are starting to reclaim some of these functions and undertake them in-house again.


Data Science Toolbox: How to use R with Tableau

Recently, Tableau released an exciting feature that enhances the capabilities of data analytics: R integration via Rserve. By bringing together Tableau and R, data scientists and analysts can now enjoy a more comprehensive and powerful data science toolbox. Whether you’re an experienced data scientist or just starting your journey in data analytics, this tutorial will guide you through the process of integrating R with Tableau.

Step by Step: Integrating R in Tableau

1. Install and start R and Rserve

You can download base R from r-project.org. Next, start R from the terminal, then install and run the Rserve package:

> install.packages("Rserve")
> library(Rserve)
> Rserve()

To check that Rserve is running, you can try connecting to it with Telnet:

Telnet
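
If Telnet is not installed, the same check can be scripted. The following is a minimal Python sketch, assuming Rserve runs on the same machine and listens on its default port 6311; the function names `looks_like_rserve` and `rserve_is_up` are made up for this illustration. It relies on the fact that Rserve greets every new connection with a 32-byte ID string that begins with `Rsrv`.

```python
import socket

RSERVE_DEFAULT_PORT = 6311  # Rserve's default TCP port


def looks_like_rserve(banner: bytes) -> bool:
    """Return True if the greeting starts with the 'Rsrv' signature."""
    return banner[:4] == b"Rsrv"


def rserve_is_up(host: str = "localhost",
                 port: int = RSERVE_DEFAULT_PORT,
                 timeout: float = 2.0) -> bool:
    """Connect to host:port and check for the Rserve greeting.

    Returns False if nothing is listening, the connection times out,
    or the server does not identify itself as Rserve.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            banner = conn.recv(32)  # Rserve sends its ID string first
    except OSError:  # covers refused connections and timeouts
        return False
    return looks_like_rserve(banner)
```

Calling `rserve_is_up()` without arguments probes localhost on port 6311. A result of `False` means either nothing is listening there or whatever answered did not identify itself as Rserve.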

Protip: If you prefer an IDE for R, I highly recommend installing RStudio.

2. Connecting Tableau to Rserve

Now let’s open Tableau and set up the connection:

Tableau 10 Help menu
Tableau 10 External Service Connection

3. Adding R code to a Calculated Field

You can invoke R scripts in Tableau’s Calculated Fields, such as k-means clustering controlled by an interactive parameter slider:

SCRIPT_INT('
kmeans(data.frame(.arg1,.arg2,.arg3),' + STR([Cluster Amount]) + ')$cluster;
',
SUM([Sales]), SUM([Profit]), SUM([Quantity]))
Calculated Field in Tableau 10
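
To make clearer what the calculated field above hands back to Tableau: the `.arg1`, `.arg2` and `.arg3` placeholders arrive in R as equal-length vectors, and `$cluster` returns one integer label per row. The sketch below mimics that shape in plain Python. It is only an illustration: it uses the basic Lloyd iteration rather than the Hartigan-Wong algorithm that R's `kmeans` uses by default, and the function name `kmeans_labels` and the sample numbers are invented.

```python
import random


def kmeans_labels(rows, k, iterations=20, seed=0):
    """Return one 1-based cluster label per row (Lloyd's algorithm)."""
    rng = random.Random(seed)
    centers = rng.sample(rows, k)  # pick k rows as starting centers
    labels = [0] * len(rows)
    for _ in range(iterations):
        # Assignment step: attach each row to its nearest center
        # (squared Euclidean distance).
        labels = [
            min(range(k),
                key=lambda c: sum((x - m) ** 2
                                  for x, m in zip(row, centers[c])))
            for row in rows
        ]
        # Update step: move each center to the mean of its rows.
        for c in range(k):
            members = [row for row, label in zip(rows, labels) if label == c]
            if members:  # keep the old center if a cluster went empty
                centers[c] = tuple(sum(col) / len(members)
                                   for col in zip(*members))
    return [label + 1 for label in labels]  # 1-based, like R's $cluster


# Hypothetical (sales, profit, quantity) triples, one per mark.
marks = [
    (100.0, 20.0, 2.0), (110.0, 25.0, 3.0), (95.0, 18.0, 2.0),
    (900.0, 300.0, 12.0), (950.0, 320.0, 14.0), (880.0, 290.0, 11.0),
]
print(kmeans_labels(marks, k=2))
```

The `k` argument plays the role of the [Cluster Amount] parameter: moving the slider in Tableau changes the number that is spliced into the R script, and the list of labels changes with it.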

4. Use Calculated Field in Tableau

You can now use your R calculation as an alternate Calculated Field in your Tableau worksheet:

Tableau 10 showing k-means clustering

Feel free to download the Tableau Packaged Workbook (twbx) here.

Connect and Stay Updated

Stay on top of the latest in data science and analytics by following me on Twitter and LinkedIn. I frequently share tips, tricks, and insights into the world of data analytics, machine learning, and beyond. Join the conversation, and let’s explore the possibilities together!
