I have been a Python geek and GnuPlot maniac ever since I joined CERN around three years ago. I have to admit, however, that I really enjoy the flexibility of D3.js and its ability to render histograms directly in the web browser.
The following example loads a CSV file, which includes 10,000 dimuon events (i.e. events containing two muons) from the CMS detector, and displays the distribution of the invariant mass M (in GeV, in bins of size 0.1 GeV):
Feel free to download the sample CSV dataset here.
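For readers who would rather check the binning outside the browser, here is a minimal Python sketch of the same idea. It is not the D3.js code behind the chart. It assumes the CSV has a header row with a column named "M" holding the invariant mass in GeV (the column name is my assumption), and it uses a tiny made-up sample instead of the real file.

```python
import csv
import io
import math
from collections import Counter

BIN_WIDTH = 0.1  # GeV, the bin size mentioned in the text


def read_masses(csv_file, column="M"):
    """Yield the invariant mass of each row; the column name "M" is an assumption."""
    for row in csv.DictReader(csv_file):
        yield float(row[column])


def histogram(masses, bin_width=BIN_WIDTH):
    """Count events per bin; bin i covers [i * bin_width, (i + 1) * bin_width)."""
    counts = Counter()
    for m in masses:
        counts[math.floor(m / bin_width)] += 1
    return counts


# Made-up sample values standing in for the real 10,000-event file.
sample = io.StringIO("M\n3.05\n3.09\n3.12\n91.18\n")
counts = histogram(read_masses(sample))

# Print the lower edge of each non-empty bin and its event count.
for index in sorted(counts):
    print(f"{index * BIN_WIDTH:6.1f} GeV  {counts[index]}")
```

To run it on the real dataset, replace the `io.StringIO` sample with `open(path, newline="")`, where `path` points to the downloaded CSV file.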
The raw data from the experiments is stored in structured files (using CERN’s ROOT Framework), which are better suited to physics analysis. Transactional relational databases (Oracle 11g with Real Application Clusters) store the metadata used to manage that raw data. For metadata residing in the Oracle database, Oracle TimesTen serves as an in-memory cache database. The raw data is analysed on PROOF (Parallel ROOT Facility) clusters. The monitoring data, however, is stored in the Hadoop Distributed File System (HDFS).
Just as in the CERN example, there are some significant trends in Big Data Analytics:
Descriptive Analytics, such as standard business reports, dashboards and data visualization, has been widely used for some time and is the core application of traditional Business Intelligence. This ad hoc analysis looks at the static past and reveals what has occurred. One recent trend, however, is to include findings from Predictive Analytics, such as sales forecasts, on the dashboard.
Predictive Analytics identifies trends, spots weaknesses and determines conditions for making decisions about the future. The methods of Predictive Analytics, such as machine learning, predictive modeling, text mining, neural networks and statistical analysis, have existed for some time. Software products such as SAS Enterprise Miner have made these methods much easier to use.
Discovery Analytics is the ability to analyse new data sources. This creates additional opportunities for insights and is especially important for organizations with massive amounts of varied data.
Prescriptive Analytics suggests what to do and can identify optimal solutions, often for the allocation of scarce resources. Prescriptive Analytics has been researched at CERN for a long time but is now finding wider use in practice.
Semantic Analytics suggests what you are looking for and provides a richer response, bringing a human dimension into Analytics that we have not necessarily been getting out of raw data streams before.
As these trends bear fruit, new ecosystems and markets are being created for broad cross-enterprise Big Data Analytics. Use cases like CERN’s LHC experiments give us greater insight into how important Big Data Analytics is to the scientific community as well as to businesses.
Wow, time flies. One year has passed since I started working at CERN as a data scientist. CERN, surrounded by snow-capped mountains and Lake Geneva, is known for its particle accelerator, the Large Hadron Collider (LHC), and its adventure in search of the Higgs boson. Underneath the research there is a tremendous amount of data that is analysed by data scientists.
Filters, known as High Level Triggers, reduce the flow of data from a petabyte (PB) per second to a gigabyte (GB) per second, which is then transferred from the detectors to the LHC Computing Grid. Once there, the data is stored on about 50 PB of tape storage and 20 PB of disk storage. The disks are managed as a cloud service (Hadoop), on which up to two million tasks are performed every day.
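To get a feeling for these numbers, here is a small back-of-the-envelope calculation in Python. It assumes decimal (SI) units and, purely for illustration, a sustained rate of one gigabyte per second around the clock; neither assumption comes from the figures above, so treat the results as rough orders of magnitude.

```python
# Assumption: decimal (SI) units, 1 PB = 10**15 bytes and 1 GB = 10**9 bytes.
PB = 10**15
GB = 10**9
SECONDS_PER_DAY = 86_400

raw_rate = 1 * PB       # bytes per second before the High Level Triggers
filtered_rate = 1 * GB  # bytes per second after the High Level Triggers

# How strongly the triggers reduce the data flow.
reduction_factor = raw_rate // filtered_rate

# Assumption: the filtered rate is sustained for a full day without pauses.
bytes_per_day = filtered_rate * SECONDS_PER_DAY

# Days it would take to fill the roughly 50 PB of tape at that rate.
tape_capacity = 50 * PB
days_to_fill_tape = tape_capacity / bytes_per_day

print(f"reduction factor: {reduction_factor:,}")
print(f"data per day: {bytes_per_day / 10**12:.1f} TB")
print(f"days to fill the tape: {days_to_fill_tape:.0f}")
```

Under these assumptions the triggers keep roughly one byte in a million.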
CERN relies on software engineers and data scientists to streamline the management and operation of its particle accelerator. Real-time analysis is crucial for the research. Data extractions need to remain scalable and predictive. Machine learning is applied to identify new correlations between variables (LHC data and external data) that were not previously connected.
So what is coming up next? Scalability remains a very important area, as CERN’s data will continue to grow exponentially. However, the role of data scientists goes much further. We need to transfer knowledge throughout the organisation and enable a data-driven culture. In addition, we need to evaluate and incorporate innovative technologies for data analysis that are appropriate for our use cases.
Just realized that today is my first anniversary of working at CERN. Thanks for all the memories made within the year!
I recently faced the challenge of migrating a dataset from one database system (SAP MaxDB) to another (Microsoft SQL Server). Doing this manually was hardly feasible, as the database comprises several hundred tables and countless records.
The Microsoft SQL Server Enterprise Manager came to the rescue. It includes the Data Transformation Services, a set of utilities that automate ETL processes (Extract, Transform, Load) when importing into or exporting from a database. Various database systems are supported, provided they offer an ODBC or OLE DB interface, which is the case for SAP MaxDB.
Specifically, the Data Transformation Services (DTS) consist of the following components:
DTS Import/Export Wizard: Wizards that transfer data to or from an MS SQL Server and support map transformations.
DTS Designer: Enables the creation of complex ETL workflows, including event-based logic.
DTS Run Utility: Scheduling and execution of DTS packages; also possible via the command line.
DTS Query Designer: A GUI for building SQL queries for DTS.
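DTS itself is driven through the GUI and its packages, so I cannot show it as code here. To illustrate what an extract, transform and load step does, here is a minimal Python sketch instead. It uses two in-memory SQLite databases as stand-ins for the source and target systems; the table `customers`, its columns and the helper `copy_table` are all hypothetical and have nothing to do with the actual migration.

```python
import sqlite3


def copy_table(source, target, table, transform=lambda row: row):
    """Extract all rows of `table` from source, transform each, load into target.

    The table name is interpolated into the SQL, so it must come from a
    trusted source. The target table is assumed to exist already.
    """
    cursor = source.execute(f'SELECT * FROM "{table}"')
    columns = [description[0] for description in cursor.description]
    column_list = ", ".join(f'"{name}"' for name in columns)
    placeholders = ", ".join("?" for _ in columns)
    rows = [transform(row) for row in cursor]
    with target:  # commits on success, rolls back on error
        target.executemany(
            f'INSERT INTO "{table}" ({column_list}) VALUES ({placeholders})',
            rows,
        )
    return len(rows)


# Two in-memory databases stand in for the source and target systems.
source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")
source.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
source.executemany(
    "INSERT INTO customers VALUES (?, ?)", [(1, " Ada "), (2, "Bob")]
)
target.execute("CREATE TABLE customers (id INTEGER, name TEXT)")

# Transform step: trim surrounding whitespace from the name column.
copied = copy_table(
    source, target, "customers", lambda row: (row[0], row[1].strip())
)
print(copied, target.execute("SELECT * FROM customers").fetchall())
```

With several hundred tables, the same function could be called in a loop over the table names, which is roughly the kind of repetitive work such tooling takes off your hands.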