Data Science: Enabling Research at CERN with Big Data

Wow, time flies. One year has passed since I started working at CERN as a data scientist. CERN, surrounded by snow-capped mountains and Lake Geneva, is known for its particle accelerator, the Large Hadron Collider (LHC), and its quest for the Higgs boson. Underneath the research lies a tremendous amount of data that is analysed by data scientists.

Filters, known as High Level Triggers, reduce the flow of data from a petabyte (PB) per second to a gigabyte per second, which is then transferred from the detectors to the LHC Computing Grid. Once there, the data is stored on about 50 PB of tape storage and 20 PB of disk storage. The disks are managed as a cloud service (Hadoop), on which up to two million tasks are performed every day.
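The core idea — discard the overwhelming majority of events close to the detector, then ship only the survivors to grid storage — can be sketched in a few lines of Python. The event structure and the selection threshold below are invented for illustration; real trigger logic runs on dedicated hardware and software farms.

```python
import random

random.seed(42)

# Toy events: each carries a single "energy" value; real events are far richer.
events = [{"id": i, "energy": random.expovariate(1.0)} for i in range(100_000)]

def high_level_trigger(event, threshold=7.0):
    """Keep only rare, interesting events (hypothetical selection rule)."""
    return event["energy"] > threshold

accepted = [e for e in events if high_level_trigger(e)]

# The trigger discards almost everything, which is what shrinks a
# PB/s detector stream down to GB/s for the Computing Grid.
print(f"accepted {len(accepted)} of {len(events)} events "
      f"(reduction factor ~{len(events) / max(len(accepted), 1):.0f}x)")
```

With this toy threshold only a fraction of a percent of events survive, mirroring (in miniature) the orders-of-magnitude reduction the real triggers achieve.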

High Level Trigger data flow
High Level Trigger data flow, as applied in the ALICE experiment

CERN relies on software engineers and data scientists to streamline the management and operation of its particle accelerator. Real-time analysis is crucial for research, and data extraction needs to remain scalable and predictive. Machine learning is applied to identify new correlations between variables (LHC data and external data) that were not previously connected.
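As an illustration of that last point, this kind of correlation screening can be prototyped with NumPy. The variable names and synthetic data below are invented stand-ins; in practice the inputs would be accelerator monitoring data joined with external sources.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical variables: two accelerator readings and one external signal.
beam_intensity = rng.normal(100.0, 5.0, size=1_000)
magnet_temperature = 0.3 * beam_intensity + rng.normal(0.0, 2.0, size=1_000)
ambient_humidity = rng.uniform(30.0, 70.0, size=1_000)  # unrelated by design

data = np.vstack([beam_intensity, magnet_temperature, ambient_humidity])
corr = np.corrcoef(data)  # pairwise Pearson correlation matrix

# A simple screen: flag variable pairs whose correlation passes a threshold.
names = ["beam_intensity", "magnet_temperature", "ambient_humidity"]
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        if abs(corr[i, j]) > 0.5:
            print(f"candidate correlation: {names[i]} <-> {names[j]} "
                  f"(r = {corr[i, j]:.2f})")
```

A real pipeline would of course go well beyond pairwise Pearson correlations, but the pattern — join heterogeneous sources, screen for unexpected relationships, then investigate — is the same.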

So what is coming up next? Scalability remains a very important area, as CERN’s data will continue to grow exponentially. However, the role of data scientists goes much further: we need to transfer knowledge throughout the organisation and enable a data-driven culture. In addition, we need to evaluate and incorporate innovative new technologies for data analysis that suit our use cases.

Analyzing High Energy Physics Data with Tableau at CERN

Screenshot of Tableau 4.0 analyzing High Energy Physics Data at CERN
Screenshot of Tableau 4.0 analyzing High Energy Physics Data at CERN

About a year ago, I first tried Tableau on some survey data for a university project. Last week, I finally found time to test Tableau with High Energy Physics (HEP) data from CERN’s Proton Synchrotron (PS). Tableau enjoys a stellar reputation in the data visualization community, while the HEP community relies heavily on Gnuplot and Python.

Tableau 4.0: Connect to Data
Tableau 4.0: Connect to Data

I used an ordinary CSV file as the data source for this quick visualization. Tableau can also connect to other file types such as Excel, as well as to databases like Microsoft SQL Server, Oracle, and Postgres.

I’m also quite impressed by the ease and speed with which insightful analysis seems to emerge from bland data. Even if your analysis toolchain is script-based (as is usual at CERN, where batch processing is mandatory), I highly recommend Tableau for prototyping and for ad-hoc data exploration.
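For comparison, here is roughly what the same kind of quick look takes in the script-based world, using pandas on a small inline CSV. The column names are invented stand-ins, not the actual PS data:

```python
import io
import pandas as pd

# A tiny inline stand-in for the CSV file loaded into Tableau above.
csv_data = io.StringIO("""run,detector,hits
1,A,120
1,B,95
2,A,134
2,B,101
3,A,128
3,B,99
""")

df = pd.read_csv(csv_data)

# One line yields the per-detector summary a Tableau drag-and-drop would show.
summary = df.groupby("detector")["hits"].agg(["mean", "min", "max"])
print(summary)
```

Perfectly doable, but the drag-and-drop feedback loop in Tableau is noticeably faster when you don't yet know which view of the data you want.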

Reflecting on my Internship in Software Engineering and Project Management at SAP

I recently completed an internship in the software engineering department of SAP, a large international software manufacturer, where I had the opportunity to work as a software engineer and project manager. Looking back on my experience, I am proud of the exceptional performance I was able to achieve in both of these roles and the great success I had in leading a team of 12 developers.

One of the main responsibilities of my internship was to lead the development of a mobile BI infrastructure. This was a complex and challenging project, but I was able to effectively manage it by using my project management skills to ensure that everything was completed on time and within budget. I was also able to contribute to the development of the infrastructure by using my software development skills to create high-quality code.

One of the things that I enjoyed most about my internship was the opportunity to work with such a diverse group of developers. Each person brought their own unique skills and perspectives to the table, which made the experience all the more enriching. By fostering a collaborative and inclusive work environment, I was able to create a positive team dynamic that made it easier for everyone to work together effectively.

These are some of my top learnings in software project management:

  1. Setting clear goals and objectives: It is important to have a clear understanding of what the project aims to achieve, as well as specific goals and objectives that need to be met. This will help to guide the project and ensure that it stays on track.
  2. Managing resources: A software project manager must be able to effectively allocate and manage resources, including budget, staff, and equipment, to ensure that the project is completed on time and within budget.
  3. Communication: Effective communication is crucial in software project management. The project manager must be able to communicate clearly and effectively with team members and stakeholders to ensure that everyone is on the same page and that any issues or concerns are addressed in a timely manner.
  4. Risk management: It is important to anticipate and mitigate potential risks to the project, as well as have contingency plans in place in case something does go wrong.
  5. Adaptability: A successful software project manager must be able to adapt to changes in the project and the industry, and be able to pivot as needed to ensure the project’s success.
  6. Leadership: A software project manager must be able to effectively lead and motivate the team to ensure that everyone is working towards the common goal.
  7. Attention to detail: A software project manager must have strong attention to detail to ensure that all aspects of the project are properly planned and executed.
  8. Time management: Managing a project requires effective time management skills to ensure that tasks are completed on schedule and that the project stays on track.

In conclusion, my internship at SAP was a valuable and rewarding experience that has helped me to develop my skills in software development and project management. I am grateful for the opportunity to have worked with such a talented team and am confident that the skills and knowledge I gained during my time at SAP will be invaluable as I pursue a career in the software industry.

This blog post is an excerpt from the Personal Development section of my internship report written for my university.

MS SQL Server: ETL with Data Transformation Services

Screenshot of SQL Server Enterprise Manager with SAP MaxDB
Screenshot of SQL Server Enterprise Manager with SAP MaxDB

Recently I faced the challenge of migrating data from one database system (SAP MaxDB) to another (Microsoft SQL Server). Doing this manually was hardly feasible, as the database comprises several hundred tables and countless records.

The Microsoft SQL Server Enterprise Manager came to the rescue. It includes the Data Transformation Services, a set of utilities that automate ETL processes (Extract, Transform, Load) when importing data into or exporting data from a database. Various database systems are supported, provided they offer an ODBC or OLE DB interface, which is the case for SAP MaxDB.

Specifically, the Data Transformation Services (DTS) consist of the following components:

  • DTS Import/Export Wizard: Wizards for transferring data to or from an MS SQL Server, including mapping transformations.
  • DTS Designer: Enables building complex ETL workflows, including event-based logic.
  • DTS Run Utility: Schedules and executes DTS packages; also available from the command line.
  • DTS Query Designer: A GUI for building SQL queries for DTS.
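The extract-transform-load pattern that DTS automates can be sketched in plain Python. The sketch below uses two in-memory SQLite databases as stand-ins for SAP MaxDB and SQL Server, so the table names, the currency conversion, and the exchange rate are purely illustrative:

```python
import sqlite3

# Stand-in for the source system (SAP MaxDB in the scenario above).
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER, amount REAL, currency TEXT)")
source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 19.99, "EUR"), (2, 5.00, "USD"), (3, 42.50, "EUR")],
)

# Stand-in for the target system (Microsoft SQL Server).
target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE orders_eur (id INTEGER, amount_eur REAL)")

# Extract: read every row from the source table.
rows = source.execute("SELECT id, amount, currency FROM orders").fetchall()

# Transform: convert all amounts to EUR (hypothetical fixed rate).
USD_TO_EUR = 0.9
transformed = [
    (row_id, amount if currency == "EUR" else amount * USD_TO_EUR)
    for row_id, amount, currency in rows
]

# Load: write the transformed rows into the target table.
target.executemany("INSERT INTO orders_eur VALUES (?, ?)", transformed)
target.commit()

print(target.execute("SELECT COUNT(*) FROM orders_eur").fetchone()[0])
```

DTS packages express exactly these three steps graphically — which is what makes them practical at the scale of several hundred tables, where a hand-written script per table would not be.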