Physics projects don’t get any bigger than this. The active European Organization for Nuclear Research, aka CERN, formed in 1954 and is headquartered in Geneva, Switzerland, employs thousands of world-class scientists on the forefront of breakthrough research. Its claim to fame is unmatched as the origin of the World Wide Web and creator of underground 17-mile-long particle accelerator called the Large Hadron Collider. Here, see photos of the many aspects of an international institution that may discover a way to move faster than the speed of light and how our universe was pieced together.
There are four key issues to overcome if you want to tame Big Data: volume (quantity of data), variety (different forms of data), velocity (how fast the data is generated and processed) and veracity (variation in quality of data). You have to be able to deal with lots and lots, of all kinds of data, moving really quickly.
That is why Big Data Analytics has a huge impact on how we plan CERN’s overall technology strategy as well as specific strategies for High-Energy Physics analysis. We want to profit from our data investment and extract the knowledge. This has to be done in a proactive, predictive and intelligent way.
The following presentation shows you how we use Big Data Analytics to improve the operation of the Large Hardron Collider.
I became a Python geek and GnuPlot maniac since I joined CERN around three years ago. I have to admit, however, that I really enjoy the flexibility of D3.js, and its capability to render histograms directly in the web browser.
The following example loads a CSV file, which includes 10,000 dimuon events (i.e. events containing two muons) from the CMS detector, and displays the distribution of the invariant mass M (in GeV, in bins of size 0.1 GeV):
Feel free to download the sample CSV dataset here.
The raw data from the experiments is stored in structured files (using CERN’s ROOT Framework), which are better suited to physics analysis. Transactional relational databases (Oracle 11g with Real Application Clusters) store metadata information that is used to manage that raw data. For metadata residing on the Oracle Database, Oracle TimesTen serves as an in-memory cache database. The raw data is analysed on PROOF (Parallel ROOT Facility) clusters. Hadoop Distributed File System (HDFS), however, is used to store the monitoring data.
Just as in the CERN example, there are some significant trends in Big Data Analytics:
Descriptive Analytics, such as standard business reports, dashboards and data visualization, have been widely used for some time, and are the core applications of traditional Business Intelligence. This ad hoc analysis looks at the static past and reveal what has occurred. One recent trend, however, is to include the findings from Predictive Analytics, such as forecasts of sales on the dashboard.
Predictive Analytics identify trends, spot weaknesses or determine conditions for making decisions about the future. The methods for Predictive Analytics such as machine learning, predictive modeling, text mining, neural networks and statistical analysis have existed for some time. Software products such as SAS Enterprise Miner have made these methods much easier to use.
Discovery Analytics is the ability to analyse new data sources. This creates additional opportunities for insights and is especially important for organizations with massive amounts of various data.
Prescriptive Analytics suggests what to do and can identify optimal solutions, often for the allocation of scarce resources. Prescriptive Analytics has been researched at CERN for a long time but is now finding wider use in practice.
Semantic Analytics suggests what you are looking for and provides a richer response, bringing some human level into Analytics that we have not necessarily been getting out of raw data streams before.
As these trends bear fruit, new ecosystems and markets are being created for broad cross-enterprise Big Data Analytics. Use cases like the CERN’s LHC experiments provide us with greater insight into how important Big Data Analytics is in the scientific community as well as to businesses.
Time really flies when you immerse yourself in the world of data science research and unravel the mysteries of the universe! It’s been an incredible journey over the past year as I’ve immersed myself in the world of data science at CERN. For those unfamiliar, CERN — set against a stunning backdrop of snow-capped mountains and tranquil Lake Geneva — is home to the Large Hadron Collider (LHC), the world’s most powerful particle accelerator. But what often goes unnoticed is the critical role that data science plays in powering this colossal machine and its quest for groundbreaking discoveries like the elusive Higgs boson.
The Data Tsunami: A Behind-The-Scenes Look
Imagine having to sift through one petabyte (PB) of data every second — yes, you read that right. That’s the amount of data generated by the LHC’s detectors. To make it manageable, high-level triggers act as an advanced filtering system, reducing this torrent of data to a more digestible gigabyte per second. This filtered data then finds its way to the LHC Computing Grid.
About 50PB of this data is stored on tape, and another 20PB is stored on disk, managed by a Hadoop-based cloud service. This platform runs up to two million tasks per day, making it a beehive of computational activity.
The Role of Data Science Research at CERN
Data scientists and software engineers are the unsung heroes at CERN, ensuring the smooth operation of the LHC and subsequent data analysis. Machine learning algorithms are used to discover new correlations between variables, including both LHC data and external data sets. This is critical for real-time analysis, where speed and accuracy are of the essence.
While managing the exponential growth of data is an ongoing challenge, the role of data scientists at CERN goes far beyond that. We are at the forefront of fostering a data-driven culture within the organization, transferring knowledge, and implementing best practices. In addition, as technology continues to evolve, part of our role is to identify and integrate new, cutting-edge tools that meet our specific data analysis needs.
The Road Ahead: A Data-Driven Journey
Looking ahead, scalability will remain a key focus as CERN’s data continues to grow. But the horizon of possibilities is vast. From exploring quantum computing to implementing advanced AI models, the role of data science in accelerating CERN’s research goals will only grow.
As I celebrate my one-year anniversary at CERN, I’m filled with gratitude and awe for what has been an incredible journey. From delving into petabytes of data to pushing the boundaries of machine learning in research, it’s been a year of immense learning and contribution.
The technical storage or access is strictly necessary for the legitimate purpose of enabling the use of a specific service explicitly requested by the subscriber or user, or for the sole purpose of carrying out the transmission of a communication over an electronic communications network.
The technical storage or access is necessary for the legitimate purpose of storing preferences that are not requested by the subscriber or user.
The technical storage or access that is used exclusively for statistical purposes.The technical storage or access that is used exclusively for anonymous statistical purposes. Without a subpoena, voluntary compliance on the part of your Internet Service Provider, or additional records from a third party, information stored or retrieved for this purpose alone cannot usually be used to identify you.
The technical storage or access is required to create user profiles to send advertising, or to track the user on a website or across several websites for similar marketing purposes.