This YouTube tutorial shows you a handy way to load your Excel data to Cloudera Hadoop with Alteryx, and how to see and understand your data even faster with Tableau connected to Impala.
The same tool chain to load and access data can be used with Hive (eg. on Hortonworks) or Spark SQL (eg. on MapR). A overview on common data process technologies can be found in the Big Data jungle guide.
We are deep in the Big Data jungle. According to Gartner’s Hype Cycle for Emerging Technologies, Big Data has now officially passed the “peak of inflated expectations”, and is now on a one-way trip to the “trough of disillusionment”. Gartner says it’s done so rather fast, because we already have consistency in the way we approach this technology, and because most new advances are additive rather than revolutionary.
Pig, Hive, Impala, Tez and Spark: which one suits for which use case?
With so much hype and so many new advances, it’s easy to get lost. This little guide gives you an overview on data processing technologies in the Big Data jungle and tries to identify the best use cases for each.
Pig: Pig is often useful for pulling apart unstructured and nested data like text or JSON. Since Pig Latin is a procedural language, it is a very good choice for developing data pipelines on Hadoop. Pig is based on MapReduce and has tools for data storage, data execution and data manipulation.
Hive: Hive was original “relational on Hadoop” and is the first Hadoop SQL (HiveQL to be precise) query engine. Hive is still the most mature engine from all in this guide, as well as the slowest one. Hive is also based on MapReduce and is a very good choice for heavy ETL tasks where reliability is important, eg. daily aggregation jobs.
Impala: Impala is the only native open-source SQL query engine in the Hadoop world. It skips MapReduce entirely and is best used for SQL queries over big volumes. Impala is also capable of delivering results interactively over bigger volumes and with a much faster speed than other Hadoop query engines.
Tez: Tez may be considered as a better and faster base for query engines like Pig and Hive. Tez gets around limitations imposed by MapReduce and enables use cases with near-real-time performance and Machine Learning, which do not fit well into the MapReduce paradigm.
Spark: Spark is an in-memory query engine that also skips MapReduce. Perfect use cases for Spark are streaming, interactive data processing and ad-hoc analysis of moderate-sized data sets (as big as the cluster’s RAM). The ability of Spark to reuse data in-memory is the real highlight for these use cases. Spark SQL offers relational connectivity.
Kürzlich stand ich vor der Herausforderung einen Datenbestand von einem Datenbanksystem (SAP MaxDB) in ein anderes (Microsoft SQL Server) zu überführen. Das Unterfangen war manuell jedoch kaum zu realisieren, da die Datenbank mehrere hundert Tabellen und unzählige Datensätze umfasst.
Abhilfe schaffte der Microsoft SQL Server Enterprise Manager. Dort finden sich die Data Transformation Services wieder, Hilfsprogramme, die es erlaubt, ETL-Prozesse (Extract, Transform, Load) beim Import in oder Export aus einer Datenbank zu automatisieren. Dabei werden verschiedene Datenbanksysteme unterstützt, sofern diese über eine ODBC– oder eine OLE DB-Schnittstelle verfügen, was auch bei SAP MaxDB der Fall ist.
Konkret bestehen die Data Transformation Services (DTS) aus folgenden Komponenten:
DTS Import/Export Wizard: Assistenten, die es erlauben Daten von oder zu einem MS SQL Server zu übertragen, sowie Map Transformations ermöglichen.
DTS Designer: Ermöglicht das erstellen von komplexen ETL-Workflows einschließlich event-basierter Logik.
DTS Run Utility: Planung und Ausführung von DTS-Packages; auch via Kommandozeile möglich.
DTS Query Designer: Eine GUI für das Erstellen von SQL-Abfragen für DTS.
Manage Cookie Consent
The technical storage or access is strictly necessary for the legitimate purpose of enabling the use of a specific service explicitly requested by the subscriber or user, or for the sole purpose of carrying out the transmission of a communication over an electronic communications network.
The technical storage or access is necessary for the legitimate purpose of storing preferences that are not requested by the subscriber or user.
The technical storage or access that is used exclusively for statistical purposes.The technical storage or access that is used exclusively for anonymous statistical purposes. Without a subpoena, voluntary compliance on the part of your Internet Service Provider, or additional records from a third party, information stored or retrieved for this purpose alone cannot usually be used to identify you.
The technical storage or access is required to create user profiles to send advertising, or to track the user on a website or across several websites for similar marketing purposes.