Quantitative Finance Applications in R

Do you want to do some quick, in-depth technical analysis of stock prices?

After I left CERN to work as a consultant and to earn an MBA, I was engaged in many exciting projects in the finance sector, analyzing financial data such as stock prices and exchange rates. There are plenty of models available to fit, analyze and predict these kinds of data: basic time series models like ARIMA(p, d, q), GARCH models for volatility, and multivariate approaches such as VARX and state space models.

Although it is hard to propose a new and effective model in a short time, I believe it is still worthwhile to apply existing models and methods to financial data; valuable conclusions may well emerge. For those of you who want data to experiment with financial models, I put together a web application written in R:

Quantitative Finance Analysis in R (click image to open application)
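
If you’d rather experiment locally, here is a minimal sketch, not the app’s actual code, that fits two of the models mentioned above to TSLA daily log returns. It assumes the quantmod and tseries packages are installed and pulls prices from Yahoo Finance:

# Minimal sketch, not the app's code: ARIMA and GARCH(1,1) on TSLA log returns
library(quantmod)  # getSymbols() downloads prices, Cl() extracts the close
library(tseries)   # garch()

getSymbols("TSLA", src = "yahoo")                    # creates an xts object named TSLA
returns <- as.numeric(na.omit(diff(log(Cl(TSLA)))))  # daily log returns

fit.arima <- arima(returns, order = c(1, 0, 1))      # ARIMA(1,0,1) for the mean
fit.garch <- garch(returns, order = c(1, 1))         # GARCH(1,1) for the volatility

predict(fit.arima, n.ahead = 5)                      # five-day-ahead forecast

For financial returns the GARCH fit is usually the more interesting part, since it models the time-varying volatility that plain ARIMA ignores.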

How to Log your Twitter Follower Stats with IFTTT to a Google Spreadsheet

The tstats script (on GitHub) logs your Twitter Follower Stats with IFTTT to a Google Spreadsheet

How can we log the follower statistics for a Twitter account?

To store these stats, I use IFTTT’s new Maker channel, which was introduced last month. I have created a simple Bash script (tstats.sh) that logs this data to a spreadsheet in my Google Drive, and I run it as a cron job every 24 hours.
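
The core idea fits in a few lines. Here is a minimal sketch (the full script is on GitHub), with SECRET_KEY, EVENT and SCREEN_NAME as placeholders for your own values, and assuming the Twitter CLI’s whois output includes a Followers line:

#!/usr/bin/env bash
# Minimal sketch of the idea behind tstats.sh; placeholders, not real values.
SECRET_KEY="your-ifttt-maker-key"   # from the Maker channel settings
EVENT="tstats"                      # the trigger event name you chose in IFTTT
SCREEN_NAME="your_screen_name"      # your Twitter handle, without the @

# Read the follower count from the Twitter CLI's whois output
FOLLOWERS=$(t whois "@$SCREEN_NAME" | awk '/^Followers/ {print $2}' | tr -d ',')

# Fire the IFTTT Maker trigger; value1 becomes a cell in the new spreadsheet row
curl -s -X POST -H "Content-Type: application/json" \
     -d "{\"value1\":\"$FOLLOWERS\"}" \
     "https://maker.ifttt.com/trigger/$EVENT/with/key/$SECRET_KEY"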

Prerequisites

Ruby:

sudo apt-get install ruby-dev

Twitter CLI:

gem install t

Authorize your Twitter account:

t authorize

A Google account, as the log is saved to a spreadsheet in your Google Drive.

An IFTTT account.

Connect the Maker and Google Drive channels to your IFTTT account.

Usage

cd into the tstats directory and edit the script with your IFTTT secret key, your IFTTT trigger event name and your Twitter screen name (the three placeholders in the sketch above). Make the script executable with:

chmod +x tstats.sh

Then simply run it with:

./tstats.sh

If you receive a “Congratulations” message and an entry is added to your spreadsheet, you can go ahead and add the script to your crontab to run at a predetermined time.

To have this script run once every 24 hours, add a line like this to your crontab (you may need to change the path; this example runs daily at 09:42):

42 9 * * * /home/user/tstats/tstats.sh >/dev/null 2>&1

[Update 26 Jul 2018] Now on GitHub: Yes, three years later this script is still hot! However, WordPress is not the perfect place to host code. As part of my preparation for my TC18 session on Social Media in New Orleans, I moved the code to a GitHub repository: https://github.com/aloth/tstats

How to unleash Data Science with an MBA?

Servers record a copy of LHC data and distribute it around the world for Analytics

My Data Science journey started at CERN, where I finished my master’s thesis in 2009. CERN, the European Organization for Nuclear Research, is the home of the Large Hadron Collider (LHC) and has some big questions to answer, like how the universe works and what it is made of. CERN collects nearly unbelievable amounts of data – 35 petabytes per year – that need analysis. After submitting my thesis, I continued my Data Science research at CERN.

I began to wonder: Which insights are waiting to be discovered beyond Particle Physics? How can traditional companies benefit from Data Science? After almost four exciting years at CERN with plenty of Hadoop and Map/Reduce, I decided to join Capgemini to develop business in Big Data Analytics and to boost their engagements in Business Intelligence. In order to leverage my data-driven background, I enrolled in the Executive MBA program at Frankfurt School of Finance & Management, including an Emerging Markets module at CEIBS in Shanghai.

Today, companies have realized that Business Analytics needs to be an essential part of their competitive strategy, and the demand for Data Scientists is growing rapidly. To me, Data Science is more about asking the right questions than about the actual data. The MBA enabled me to understand that data does not provide insights unless appropriately questioned. Delivering excellent Big Data projects requires a full understanding of the business: developing the questions, distilling the adequate amount of data to answer those questions, and communicating the proposed solution to the target audience.

“The task of leaders is to simplify. You should be able to explain where you have to go in two minutes.” – Jeroen van der Veer, former CEO of Royal Dutch Shell

IMF Global Data Explorer

How about some visual takeaways from the IMF’s World Economic Outlook? I recently prepared two nifty data visualizations with Tableau that I’d like to share with you.

These visualizations allow you to explore plenty of economic data, including IMF staff estimates through 2020. Don’t forget to choose “Units” after switching “Subject” in the sidebar on the right; a detailed description of each subject is displayed below.

(Embedded Tableau visualization)

A Data Processing Guide in the Big Data Jungle

Too many choices? Don’t get lost!

We are deep in the Big Data jungle. According to Gartner’s Hype Cycle for Emerging Technologies, Big Data has officially passed the “peak of inflated expectations” and is now on a one-way trip to the “trough of disillusionment”. Gartner says it has done so rather fast, because we already approach this technology consistently, and because most new advances are additive rather than revolutionary.

Pig, Hive, Impala, Tez and Spark: which one suits which use case?

With so much hype and so many new advances, it’s easy to get lost. This little guide gives you an overview of the data processing technologies in the Big Data jungle and tries to identify the best use cases for each.

  • Pig: Pig is often useful for pulling apart unstructured and nested data like text or JSON. Since Pig Latin is a procedural language, it is a very good choice for developing data pipelines on Hadoop (a toy example follows this list). Pig compiles to MapReduce and provides operators for loading, transforming and storing data.
  • Hive: Hive was the original “relational engine on Hadoop” and the first SQL query engine for Hadoop (HiveQL, to be precise). Hive is still the most mature engine in this guide, as well as the slowest one. Hive also compiles to MapReduce and is a very good choice for heavy ETL tasks where reliability is important, e.g. daily aggregation jobs.
  • Impala: Impala is a native open-source SQL query engine for the Hadoop world. It skips MapReduce entirely and is best used for SQL queries over big volumes. Impala can deliver results interactively over large volumes at much higher speed than the other Hadoop query engines.
  • Tez: Tez may be considered a better and faster base for query engines like Pig and Hive. Tez gets around limitations imposed by MapReduce and enables use cases with near-real-time performance, such as interactive queries and Machine Learning, which do not fit well into the MapReduce paradigm.
  • Spark: Spark is an in-memory query engine that also skips MapReduce. Perfect use cases for Spark are streaming, interactive data processing and ad-hoc analysis of moderate-sized data sets (up to the cluster’s RAM). Spark’s ability to reuse data in memory is the real highlight for these use cases. Spark SQL offers relational connectivity.
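
To make the procedural-versus-declarative contrast concrete, here is a toy Pig Latin pipeline over hypothetical tab-separated event data (the file name and columns are made up for illustration), computing a daily event count; the HiveQL equivalent is a single declarative statement, shown in the comment:

-- Toy sketch over hypothetical 'events.tsv' data (columns: day, user).
-- In HiveQL the same aggregation is one declarative statement:
--   SELECT day, COUNT(*) AS n FROM events GROUP BY day;
events = LOAD 'events.tsv' USING PigStorage('\t') AS (day:chararray, user:chararray);
by_day = GROUP events BY day;
daily  = FOREACH by_day GENERATE group AS day, COUNT(events) AS n;
STORE daily INTO 'daily_counts';

The Pig version spells out each step of the pipeline, which is exactly why it shines when the pipeline gets long and the data gets messy.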