Data Science Toolbox: How to use R with Tableau

Recently Tableau released an exciting new feature: R integration via Rserve. Combining Tableau with R brings my data science toolbox to the next level! In this tutorial I’m going to walk you through installing Rserve and connecting Tableau to it. I will also show an example of calling an R function with a parameter from Tableau and visualizing the results in Tableau.

1. Install and start R and Rserve

You can download base R from r-project.org. Next, start R from the terminal, then install and run the Rserve package:

> install.packages("Rserve")
> library(Rserve)
> Rserve()

To check that Rserve is running, you can try connecting to it with Telnet (Rserve listens on port 6311 by default):

Telnet
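
If Telnet is not installed, the same check can be sketched in R itself. This is a minimal sketch, not part of the original setup: the helper name rserve_is_listening is hypothetical, and the host, port and timeout defaults assume a local Rserve on its default port 6311. It only tells you that something accepts connections on that port, not that the listener is definitely Rserve.

```r
# Minimal sketch: check whether something accepts TCP connections on the
# Rserve port. Host, port and timeout defaults are assumptions for a local setup.
rserve_is_listening <- function(host = "localhost", port = 6311, timeout = 2) {
  con <- tryCatch(
    suppressWarnings(
      socketConnection(host = host, port = port, open = "r+b",
                       blocking = TRUE, timeout = timeout)
    ),
    error = function(e) NULL  # connection refused or timed out
  )
  if (is.null(con)) {
    return(FALSE)
  }
  close(con)
  TRUE
}

if (rserve_is_listening()) {
  message("Something is listening on port 6311; Rserve appears to be up.")
} else {
  message("No listener on port 6311; start Rserve with Rserve() first.")
}
```

Run this from a second R session, since the session that started Rserve may be busy.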

Protip: If you prefer an IDE for R, I highly recommend installing RStudio.

2. Connecting Tableau to Rserve

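Now let’s open Tableau and set up the connection. In Tableau 10 the dialog is under Help > Settings and Performance > Manage External Service Connection; for a local Rserve, the server is localhost and the default port is 6311: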

Tableau 10 Help menu

Tableau 10 External Service Connection

3. Adding R code to a Calculated Field

You can invoke R scripts in Tableau’s Calculated Fields, such as k-means clustering controlled by an interactive parameter slider:

Calculated Field in Tableau 10
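
To make the R side of such a Calculated Field concrete, here is a minimal sketch that can be run in plain R without Tableau. It is an illustration under stated assumptions, not the exact calculation from the screenshot: the helper name cluster_for_tableau, the built-in iris data and the choice of three clusters are all stand-ins. Tableau hands its arguments to R as vectors named .arg1, .arg2 and so on, and a parameter typically arrives as a vector with the same value repeated, which is why only the first element is used for the number of clusters.

```r
# In Tableau, a Calculated Field along these lines would wrap the R expression
# (field and parameter names below are hypothetical):
#   SCRIPT_INT("kmeans(data.frame(.arg1, .arg2), .arg3[1])$cluster",
#              SUM([Sepal Length]), SUM([Sepal Width]), [Number of Clusters])

# Hypothetical helper mirroring that expression; x and y play the role of
# .arg1 and .arg2, and k_vec plays the role of .arg3.
cluster_for_tableau <- function(x, y, k_vec) {
  k <- k_vec[1]  # the parameter is repeated per row, so take the first value
  fit <- kmeans(data.frame(x, y), centers = k)
  as.integer(fit$cluster)  # one cluster label per input row
}

# Stand-in data: the iris dataset that ships with R.
set.seed(42)  # k-means starts from random centres; fix the seed for repeatability
clusters <- cluster_for_tableau(iris$Sepal.Length,
                                iris$Sepal.Width,
                                rep(3L, nrow(iris)))
table(clusters)
```

Because k-means starts from random centres, the cluster labels can change between refreshes unless a seed is set inside the script.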

4. Use Calculated Field in Tableau

You can now use your R calculation just like any other Calculated Field in your Tableau worksheet:

Tableau 10 showing k-means clustering

Feel free to download the Tableau Packaged Workbook (twbx) here.

Further reading: Hands-On with R

[Update 26 Jun 2016]: The Tableau 8.1 screenshots were replaced with Tableau 10.0 (Beta) screenshots ahead of my upcoming Advanced Analytics session at TC16, which will refer back to this blog post.

Data Science: Enabling Research at CERN with Big Data

Wow, time flies. One year has passed since I started to work at CERN as a data scientist. CERN, surrounded by snow-capped mountains and Lake Geneva, is known for its particle accelerator, the Large Hadron Collider (LHC), and its search for the Higgs boson. Underneath the research there is a tremendous amount of data that is analysed by data scientists.

Filters, known as High Level Triggers, reduce the flow of data from a petabyte (PB) per second to a gigabyte (GB) per second, which is then transferred from the detectors to the LHC Computing Grid. Once there, the data is stored on about 50 PB of tape storage and 20 PB of disk storage. The disks are managed as a cloud service (Hadoop), on which up to two million tasks are performed every day.

High Level Trigger data flow, as applied in the ALICE experiment

CERN relies on software engineers and data scientists to streamline the management and operation of its particle accelerator. It is crucial for research to allow real-time analysis. Data extractions need to remain scalable and predictive. Machine learning is applied to identify new correlations between variables (LHC data and external data) that were not previously connected.

So what is coming up next? Scalability remains a very important area, as CERN’s data will continue to grow exponentially. However, the role of data scientists goes much further. We need to transfer knowledge throughout the organisation and enable a data-driven culture. In addition, we need to evaluate and incorporate new data analysis technologies that are appropriate for our use cases.