Data Science: Enabling Research at CERN with Big Data

Wow, time flies. One year has passed since I started working at CERN as a data scientist. CERN, surrounded by snow-capped mountains and Lake Geneva, is known for its particle accelerator, the Large Hadron Collider (LHC), and for its search for the Higgs boson. Underneath the research lies a tremendous amount of data that is analysed by data scientists.

Filters, known as High Level Triggers, reduce the flow of data from a petabyte (PB) per second to a gigabyte (GB) per second, which is then transferred from the detectors to the LHC Computing Grid. Once there, the data is stored on about 50 PB of tape storage and 20 PB of disk storage. The disks are managed as a cloud service (Hadoop), on which up to two million tasks are performed every day.
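To get a feeling for these numbers, here is a small back-of-the-envelope calculation in Python. It is only a sketch: it assumes decimal (SI) units, and the idea of writing a sustained 1 GB/s to disk alone is my own simplification for illustration, not how the Grid actually distributes data.

```python
# Decimal (SI) units are assumed: 1 PB = 10**15 bytes, 1 GB = 10**9 bytes.
PB = 10**15
GB = 10**9

raw_rate = 1 * PB        # bytes per second before the High Level Triggers
filtered_rate = 1 * GB   # bytes per second after the High Level Triggers

# How much the triggers shrink the data flow.
reduction_factor = raw_rate // filtered_rate

# Hypothetical scenario: a sustained 1 GB/s written to the 20 PB of disk only.
disk_capacity = 20 * PB
seconds_to_fill_disk = disk_capacity // filtered_rate
days_to_fill_disk = seconds_to_fill_disk / 86_400

print(f"reduction factor: {reduction_factor:,}")
print(f"days to fill 20 PB at 1 GB/s: {days_to_fill_disk:.0f}")
```

Under these assumptions the triggers cut the data flow by a factor of one million, and even the filtered stream would fill the disk storage in roughly 231 days.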

High Level Trigger data flow, as applied in the ALICE experiment

CERN relies on software engineers and data scientists to streamline the management and operation of its particle accelerator. Real-time analysis is crucial for the research, and data extractions need to remain scalable and predictive. Machine learning is applied to identify new correlations between variables (LHC data and external data) that were not previously connected.

So what is coming up next? Scalability remains a very important area, as CERN’s data will continue to grow exponentially. However, the role of data scientists goes much further. We need to transfer knowledge throughout the organisation and enable a data-driven culture. In addition, we need to evaluate and incorporate new technologies for data analysis that are appropriate for our use cases.

Analyzing High Energy Physics Data with Tableau at CERN

Screenshot of Tableau 4.0 analyzing High Energy Physics Data at CERN

About a year ago, I first tried Tableau with some survey data for a university project. Last week, I finally found time to test Tableau with High Energy Physics (HEP) data from CERN’s Proton Synchrotron (PS). Tableau enjoys a stellar reputation in the data visualization community, while the HEP community relies heavily on Gnuplot and Python.

Tableau 4.0: Connect to Data

I used an ordinary CSV file as the data source for this quick visualization. Tableau can also connect to other file types such as Excel, as well as to databases like Microsoft SQL Server, Oracle, and Postgres.
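For comparison, this is roughly what a first look at such a CSV file could look like in a script-based toolchain. It is a minimal sketch using only the Python standard library. The sample rows and the column names (momentum_gev, intensity) are made up for illustration and are not the actual PS data, and mean_by_group is a hypothetical helper.

```python
import csv
import io
import statistics

# Made-up sample rows standing in for a real export; the column names
# ("momentum_gev", "intensity") are hypothetical.
SAMPLE_CSV = """momentum_gev,intensity
14.0,1.2
14.0,1.4
26.0,2.1
26.0,2.3
"""

def mean_by_group(csv_file, group_col, value_col):
    """Return {group value: mean of value_col} for the rows in csv_file."""
    groups = {}
    for row in csv.DictReader(csv_file):
        groups.setdefault(row[group_col], []).append(float(row[value_col]))
    return {key: statistics.mean(values) for key, values in groups.items()}

# io.StringIO lets the sample text be read like an opened file.
result = mean_by_group(io.StringIO(SAMPLE_CSV), "momentum_gev", "intensity")
for momentum, mean_intensity in result.items():
    print(momentum, round(mean_intensity, 2))
```

With a real file, the io.StringIO object would be replaced by open("your_file.csv", newline="").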

I’m also quite impressed by the ease and speed with which insightful analysis emerges from bland data. Even if your analysis toolchain is script-based (as is usual at CERN, where batch processing is mandatory), I highly recommend Tableau for prototyping and for ad-hoc data exploration.

MS SQL Server: ETL with Data Transformation Services

Screenshot of SQL Server Enterprise Manager with SAP MaxDB

Recently I faced the challenge of migrating a data set from one database system (SAP MaxDB) to another (Microsoft SQL Server). Doing this by hand was hardly feasible, however, since the database comprises several hundred tables and countless records.

Microsoft SQL Server Enterprise Manager came to the rescue. It includes the Data Transformation Services, a set of utilities that make it possible to automate ETL (Extract, Transform, Load) processes when importing into or exporting from a database. Various database systems are supported, provided they offer an ODBC or OLE DB interface, which is the case for SAP MaxDB.

Specifically, the Data Transformation Services (DTS) consist of the following components:

  • DTS Import/Export Wizard: Wizards that transfer data to or from an MS SQL Server and also support map transformations.
  • DTS Designer: Enables the creation of complex ETL workflows, including event-based logic.
  • DTS Run Utility: Scheduling and execution of DTS packages; also possible from the command line.
  • DTS Query Designer: A GUI for building SQL queries for DTS.
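
To make the three ETL steps concrete, here is a minimal sketch in Python. It does not use DTS. Two in-memory SQLite databases stand in for SAP MaxDB and Microsoft SQL Server, and the table, the columns, and the cleaning rule in transform are hypothetical, chosen only to show the extract, transform, and load steps one after the other.

```python
import sqlite3

# Two in-memory SQLite databases stand in for the source (SAP MaxDB) and the
# target (Microsoft SQL Server). Table and column names are hypothetical.
source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")

source.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
source.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "  alice "), (2, "BOB"), (3, None)],
)
target.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
)

def transform(row):
    """Trim and title-case the name; return None for rows without a name."""
    row_id, name = row
    if name is None:
        return None
    return row_id, name.strip().title()

# Extract: read all rows from the source system.
rows = source.execute("SELECT id, name FROM customers ORDER BY id").fetchall()

# Transform: clean each row and drop the ones the target would reject.
cleaned = [t for t in map(transform, rows) if t is not None]

# Load: write the cleaned rows to the target in a single transaction.
with target:
    target.executemany("INSERT INTO customers VALUES (?, ?)", cleaned)

loaded = target.execute("SELECT id, name FROM customers ORDER BY id").fetchall()
print(loaded)
```

In a real migration the two connections would go through an ODBC or OLE DB driver instead, and the same loop would be repeated for each of the several hundred tables, which is exactly the kind of repetition DTS packages automate.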