This YouTube tutorial shows you a handy way to load your Excel data to Cloudera Hadoop with Alteryx, and how to see and understand your data even faster with Tableau connected to Impala.
The same tool chain to load and access data can be used with Hive (e.g. on Hortonworks) or Spark SQL (e.g. on MapR). An overview of common data processing technologies can be found in the Big Data jungle guide.
We are deep in the Big Data jungle. According to Gartner’s Hype Cycle for Emerging Technologies, Big Data has now officially passed the “peak of inflated expectations”, and is now on a one-way trip to the “trough of disillusionment”. Gartner says it’s done so rather fast, because we already have consistency in the way we approach this technology, and because most new advances are additive rather than revolutionary.
Pig, Hive, Impala, Tez and Spark: which one suits which use case?
With so much hype and so many new advances, it’s easy to get lost. This little guide gives you an overview of data processing technologies in the Big Data jungle and tries to identify the best use cases for each.
Pig: Pig is often useful for pulling apart unstructured and nested data like text or JSON. Since Pig Latin is a procedural language, it is a very good choice for developing data pipelines on Hadoop. Pig is based on MapReduce and has tools for data storage, data execution and data manipulation.
Hive: Hive was the original “relational on Hadoop” and is the first Hadoop SQL (HiveQL to be precise) query engine. Hive is still the most mature of all the engines in this guide, as well as the slowest one. Hive is also based on MapReduce and is a very good choice for heavy ETL tasks where reliability is important, e.g. daily aggregation jobs.
Impala: Impala is the only native open-source SQL query engine in the Hadoop world. It skips MapReduce entirely and is best used for SQL queries over big volumes. Impala is also capable of delivering results interactively over bigger volumes, and much faster than other Hadoop query engines.
Tez: Tez may be considered a better and faster base for query engines like Pig and Hive. Tez gets around limitations imposed by MapReduce and enables use cases with near-real-time performance and Machine Learning, which do not fit well into the MapReduce paradigm.
Spark: Spark is an in-memory query engine that also skips MapReduce. Perfect use cases for Spark are streaming, interactive data processing and ad-hoc analysis of moderate-sized data sets (as big as the cluster’s RAM). The ability of Spark to reuse data in-memory is the real highlight for these use cases. Spark SQL offers relational connectivity.
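To make "based on MapReduce" a little more tangible, here is a minimal, single-machine sketch of the three phases that engines like Pig and Hive build on: map, shuffle and reduce. It is plain Python, not Hadoop code. The function names and the sample lines are made up for illustration, and a real cluster would spread each phase over many nodes and write intermediate results to disk.

```python
from collections import defaultdict


def map_phase(records):
    """Emit a (word, 1) pair for every word in every input line."""
    for line in records:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Group all emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(records):
    """Chain the three phases into one job."""
    return reduce_phase(shuffle_phase(map_phase(records)))


# Hypothetical input lines, invented for this example.
sample_lines = ["pig hive impala", "hive tez spark", "spark hive"]
counts = word_count(sample_lines)
print(counts)
```

Every job has to pass through all three phases, which is one way to picture why the engines that skip MapReduce (Impala, Spark) or work around its limitations (Tez) can answer interactive queries faster.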
Recently we hear a lot about Big Data Analytics’ ability to deliver usable insight – but what does this mean exactly for the financial services industry?
While much of the Big Data activity in the market up to now has consisted of experimenting with Big Data technologies and proof-of-concept projects, I would like to show in this post seven issues banks and insurers can address with Big Data Analytics:
1. Dynamic 360º View of the Customer:
Extend your existing customer views by incorporating dynamic internal and external information sources. Gain a full understanding of customers – what makes them tick, why they buy, how they prefer to shop, why they switch, what they’ll buy next, and what factors lead them to recommend a company to others.
2. Enhanced Commercial Scorecard Design and Implementation:
Financial institutions use Big Data solutions to analyze commercial loan origination, develop scorecards and scoring, and ultimately improve accuracy as well as optimize price and risk management.
3. Risk Concentration Identification and Management:
Identify risk concentration hotspots by decomposing risk into customized insights. Clearly see factor contribution to risks and gain allocation consensus through downside risk budgeting.
4. Next Best Action Recommendations:
Make “next best action” an integral part of your marketing strategy and proactive customer care. With analytical insight from Big Data, you can answer such questions as: What approach will get the most out of the customer relationship? Is selling more important than retention?
5. Fraud Detection Optimization:
Preventing fraud is a major priority for all financial services organizations. But to deal with the escalating volumes of financial transaction data, statisticians need better ways to mine data for insight. Optimizing your current fraud detection techniques helps you leverage your existing fraud detection assets.
6. Data and Insights Monetization:
Use your customer transaction data to improve targeting of cross-sell offers. Partners are increasingly promoting merchant-based reward programs which leverage a bank’s or credit card issuer’s data and provide discounts to customers at the same time.
7. Regulatory and Data Retention Requirements:
More robust regulatory and data retention management is a legal requirement for financial services organizations across the globe, which must comply with the myriad of local, federal, and international laws (such as Basel III) that mandate the retention of certain types of data.
There are four key issues to overcome if you want to tame Big Data: volume (quantity of data), variety (different forms of data), velocity (how fast the data is generated and processed) and veracity (variation in quality of data). You have to be able to deal with lots and lots of all kinds of data, moving really quickly.
That is why Big Data Analytics has a huge impact on how we plan CERN’s overall technology strategy as well as specific strategies for High-Energy Physics analysis. We want to profit from our data investment and extract the knowledge. This has to be done in a proactive, predictive and intelligent way.
The following presentation shows you how we use Big Data Analytics to improve the operation of the Large Hadron Collider.
The raw data from the experiments is stored in structured files (using CERN’s ROOT Framework), which are better suited to physics analysis. Transactional relational databases (Oracle 11g with Real Application Clusters) store metadata information that is used to manage that raw data. For metadata residing on the Oracle Database, Oracle TimesTen serves as an in-memory cache database. The raw data is analysed on PROOF (Parallel ROOT Facility) clusters. Hadoop Distributed File System (HDFS), however, is used to store the monitoring data.
Just as in the CERN example, there are some significant trends in Big Data Analytics:
Descriptive Analytics, such as standard business reports, dashboards and data visualization, have been widely used for some time, and are the core applications of traditional Business Intelligence. This ad hoc analysis looks at the static past and reveals what has occurred. One recent trend, however, is to include the findings from Predictive Analytics, such as forecasts of sales, on the dashboard.
Predictive Analytics identifies trends, spots weaknesses or determines conditions for making decisions about the future. The methods for Predictive Analytics such as machine learning, predictive modeling, text mining, neural networks and statistical analysis have existed for some time. Software products such as SAS Enterprise Miner have made these methods much easier to use.
Discovery Analytics is the ability to analyse new data sources. This creates additional opportunities for insights and is especially important for organizations with massive amounts of varied data.
Prescriptive Analytics suggests what to do and can identify optimal solutions, often for the allocation of scarce resources. Prescriptive Analytics has been researched at CERN for a long time but is now finding wider use in practice.
Semantic Analytics suggests what you are looking for and provides a richer response, bringing a more human level of understanding into Analytics that we have not necessarily been getting out of raw data streams before.
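To make the difference between the first two trends concrete, here is a minimal sketch that puts a descriptive summary and a predictive forecast side by side, as a dashboard might. The monthly sales figures are invented for this example, and the straight-line least-squares fit is only an assumed stand-in for the much richer methods named above (machine learning, predictive modeling and so on). The helper names `describe`, `fit_line` and `forecast` are hypothetical.

```python
from statistics import mean

# Hypothetical monthly sales figures, invented for this example.
months = [1, 2, 3, 4, 5, 6]
sales = [100.0, 110.0, 120.0, 130.0, 140.0, 150.0]


def describe(values):
    """Descriptive view: summarize what has already happened."""
    return {"total": sum(values), "average": mean(values), "best": max(values)}


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    variance = sum((x - x_bar) ** 2 for x in xs)
    slope = covariance / variance
    intercept = y_bar - slope * x_bar
    return slope, intercept


def forecast(xs, ys, next_x):
    """Predictive view: extend the fitted trend to a future point."""
    slope, intercept = fit_line(xs, ys)
    return slope * next_x + intercept


summary = describe(sales)                   # what has occurred
next_month = forecast(months, sales, 7)     # what the trend suggests for month 7
print(summary)
print(next_month)
```

The summary only looks backwards, while the forecast extrapolates the past trend one step ahead. Showing both on the same dashboard is the combination described in the Descriptive Analytics paragraph.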
As these trends bear fruit, new ecosystems and markets are being created for broad cross-enterprise Big Data Analytics. Use cases like CERN’s LHC experiments provide us with greater insight into how important Big Data Analytics is in the scientific community as well as to businesses.