Tick Data Storage, Treatment (NaNs, Spikes, Missing Data) for Quant Analysis

(Last Updated On: December 5, 2011)

Anyone out here use MATLAB and kdb+ for investigating algo-trading? I’d like some tips about treating tick data.
What are the dangers of downsampling, smoothing and handling missing data, are there any

I have done a lot of work in tick level analysis. One of the major things you will need to deal with is the massive amount of data you will be analyzing. I am actually a big fan of MATLAB, but it is terrible at handling the massive amounts of data you are typically dealing with when analysing tick data. If you are just analysing small subsets of the data this may not be an issue, but any significant amount of analysis will bog down very quickly in MATLAB.

One approach is to use MATLAB to prototype the analysis on a small batch of test data, then convert to a lower level language for bulk analysis of the data. I know this sounds like an extra step, but I have literally gone from an analysis that took 4-5 days in MATLAB to 20-30 minutes in F#. I am sure you could get further improvements if you used C++ or another lower level language. Before anyone asks the obvious questions: yes, I am very familiar with optimizing matrix operations in MATLAB, and I have been using MATLAB successfully since 2001.

I have also found that using a database, even a specialized one like kdb+, isn’t as efficient as just using raw data files. There are very few use cases that require random lookup of tick data. You are almost always going through the data sequentially in a single pass. A well organized binary data file stored on an SSD will easily beat any form of database for this type of data access. With a little thought about your use case it is easy to create a super fast way of organizing the data. The index almost always ends up just being the structure of the storage.
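A minimal Python sketch of that idea, not the poster's actual format: fixed-width binary records, one file per symbol per day, read back in a single sequential pass. The record layout (`ts_ns`, `price`, `size`), the directory scheme, and the names `tick_path`, `write_ticks` and `read_ticks` are all assumptions made for illustration.

```python
import struct
from pathlib import Path
from typing import Iterable, Iterator, NamedTuple

# Assumed record layout: int64 timestamp (ns), float64 price, uint32 size.
# "<" means little-endian with no padding, so each record is exactly 20 bytes.
TICK = struct.Struct("<qdI")


class Tick(NamedTuple):
    ts_ns: int
    price: float
    size: int


def tick_path(root, symbol: str, day: str) -> Path:
    # The directory structure is the index: root/SYMBOL/YYYY-MM-DD.bin
    return Path(root) / symbol / f"{day}.bin"


def write_ticks(path: Path, ticks: Iterable[Tick]) -> None:
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "wb") as f:
        for t in ticks:
            f.write(TICK.pack(t.ts_ns, t.price, t.size))


def read_ticks(path: Path, chunk_records: int = 65536) -> Iterator[Tick]:
    # Stream the file front to back in large chunks; no random lookups.
    chunk_bytes = TICK.size * chunk_records
    with open(path, "rb") as f:
        while True:
            buf = f.read(chunk_bytes)
            if not buf:
                break
            # Ignore a trailing partial record (e.g. a truncated file).
            usable = len(buf) - len(buf) % TICK.size
            for rec in TICK.iter_unpack(buf[:usable]):
                yield Tick(*rec)
```

Finding one symbol on one day is then a path lookup rather than a query, and everything else is a sequential read. The same layout carries over directly to a compiled language for the bulk runs.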

As for the questions of downsampling, smoothing and missing data: the answers will almost always depend directly on your analysis, so general rules of thumb are almost useless. The one thing I would strongly recommend is to always give yourself the ability to analyse the analysis process itself. Don’t make it a black box where raw data comes in at one end and an answer comes out the other. Algo trading development is an iterative process. Things change quickly and it is easy to bury bugs in the process. You need to be able to see what is going on to isolate issues and verify that the analysis is working properly. Visibility is a key part of this process. Good documentation and versioning of the process are a huge help also.
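One way to keep a downsampling step from becoming a black box, sketched in Python under assumed inputs: ticks arrive as time-sorted `(ts_seconds, price)` pairs, each bar keeps the last price in its bucket, and empty buckets are forward-filled but marked with `n_ticks == 0`. The `Bar` type and the `downsample` function are hypothetical names, and last-price bars with forward fill are only one of several possible choices.

```python
from dataclasses import dataclass


@dataclass
class Bar:
    start: int      # bucket start time, in seconds
    close: float    # last price seen in (or carried into) the bucket
    n_ticks: int    # 0 means the bar was forward-filled, not observed


def downsample(ticks, interval_s: int):
    """Bucket time-sorted (ts_seconds, price) ticks into fixed-width bars."""
    bars = []
    for ts, price in ticks:
        start = int(ts // interval_s) * interval_s
        if bars and bars[-1].start == start:
            # Same bucket: update the close and count the tick.
            bars[-1].close = price
            bars[-1].n_ticks += 1
            continue
        if bars:
            # Fill any empty buckets with the last observed close,
            # but keep n_ticks at 0 so the fill stays visible.
            prev = bars[-1]
            gap = prev.start + interval_s
            while gap < start:
                bars.append(Bar(gap, prev.close, 0))
                gap += interval_s
        bars.append(Bar(start, price, 1))
    return bars
```

Because the tick count travels with every bar, a later stage can tell a quiet market from a stale feed, and a quick count of bars with `n_ticks == 0` shows how much of the series is filled rather than observed.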

How important is it to obtain your research tick data from primary sources, i.e. directly from exchange feeds? What are the drawbacks to using tick data supplied by vendors such as Reuters, Morningstar etc?

Agreed: Matlab grinds to a halt when you can’t work around a massive for loop (> 10k iterations). One quick stopgap I’ve found very useful is to write the tick processing part in Java and use Matlab’s plug & play support for jar files to speed up the bottlenecked portions. Oftentimes the mundane loop code can be easily implemented in Java, so you’re not looking at any increase in development time. Finally, Matlab passes data to its JVM by value, so you need to increase the Java heap space in Preferences if you want to batch process data.

Dough, I am especially interested in your comment about kdb+ not being as efficient as using raw files.
How about if I am trying to write an algo that has to perform some pricing in real time? Is MATLAB still up to the task?
I’ve seriously noted the importance of being able to analyse the analysis process itself, along with versioning and documentation.
All my data come from providers like Morningstar and Interactive Data (also some banks that provide currency data). These come in with some rogue ticks and missing data for currency pairs.
I should have made this clear: I am currently using MATLAB to perform analysis and some back testing, and I am not particularly concerned about processing time here. I am only concerned about MATLAB’s ‘real time’ performance.
Please excuse the vagueness of some of my questions and answers; they reflect the vagueness of my grasp of the issues. It’s a steep learning curve for me.
Thanks for the feedback.
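For the rogue ticks and missing values mentioned in the post above, a rough Python sketch of one common approach (not something prescribed in this thread): compare each price with the median of the last few accepted prices and label it, rather than silently deleting it. The function name `label_ticks`, the window of 5 and the 1% threshold are illustrative assumptions that would need tuning per instrument.

```python
import math
from collections import deque
from statistics import median


def label_ticks(prices, window: int = 5, max_rel_dev: float = 0.01):
    """Label each price as "ok", "spike" or "missing".

    A price is a "spike" if it deviates from the median of the last
    `window` accepted prices by more than `max_rel_dev` (relative).
    None and NaN are labelled "missing". Nothing is dropped here.
    """
    recent = deque(maxlen=window)   # accepted prices only
    labels = []
    for p in prices:
        if p is None or math.isnan(p):
            labels.append("missing")
            continue
        if len(recent) == window:
            ref = median(recent)
            if abs(p - ref) / ref > max_rel_dev:
                labels.append("spike")
                continue
        # Accepted (or not enough history yet to judge).
        labels.append("ok")
        recent.append(p)
    return labels
```

Keeping the labels next to the raw series makes it possible to audit what the filter did. One caveat of this particular sketch: since flagged prices never enter the window, a genuine jump in price level would keep being flagged, so the labels should be reviewed rather than trusted blindly.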

Regrettably MATLAB is terrible for real time analysis. It just doesn’t handle high frequency streaming data well. Its core processes seem to hang up and stop at times, and the memory management internal to MATLAB can go into cleanup mode whenever it wants. MATLAB is great for many things, but real time analysis is NOT one of them. Don’t even try it; in the end you won’t be happy.

MATLAB is great for prototyping and back testing an algo, but once you get the concept working the real time code needs to be rewritten in a compiled language. C#, F#, Java or a few other languages are fine if the system doesn’t need absolute top performance. C/C++ written by a programmer with lots of experience in real time processing is about the only way to go if you need absolute top speed.

Sorry that MATLAB won’t do it for you, but it is better to know now than later when you are trying to run live and it is failing miserably. BTW, this advice comes from painful experience trying to do what you are talking about.

Agreed, MATLAB (or any other package like R) is not good for real time if you need top speed.
If millisecond response does not matter, you can use it effectively, and it bridges the gap between research and production.
Other languages like Python (with Cython) or OCaml may be helpful because they can be compiled and are very fast, although not as fast as C. At the same time you do not have to be a top programmer to make them work, which is a big plus.

There is a big gap between the theoretically best solution and the practice of getting it done.
The time needed to develop something robust in C can be frustrating and costly.

People here mentioned for loops, which you should not be leaning on in vectorised languages in the first place.

Obviously if you are competing with other HFT firms that is another story, but in that case you need very serious resources.

For research I suggest R on Unix, which is more than enough to handle gigabytes of data on machines with a lot of RAM.
Also you have packages to handle a lot of statistical models and a big finance user base.

The suggestions above are very good. I just wanted to toss F# into the ring. I have found it to be extremely fast both for development and the execution of the analysis. There are a large number of open source numerical libraries that can easily be used and it is certainly fast enough to be used in a moderate HF environment.

F# can’t compete with low level development in C for true ultra HFT systems, but for basic work it gets you as close as possible and still gives you a Functional focus and rapid development.

Honestly there are lots of possibilities these days. Keep exploring and you should find some tools that fit.

The massive historical data should be processed and summarized after hours, so you are left with only the current day’s data to process live. I am building a system with R and Python. R will grind through my historical data the night before. The daily tick data I gather is roughly 2GB, and there is no way I would run a program live over historical data that measures in the 100s of GB.
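The split described above can be sketched in Python as follows. This is an illustration of the pattern, not the poster's system: the statistics chosen (mean and standard deviation of daily returns), the function names `summarize_history` and `live_zscore`, and the dictionary keys are all assumptions.

```python
from statistics import mean, pstdev


def summarize_history(daily_closes):
    """Nightly batch step: reduce the history to a few numbers."""
    returns = [b / a - 1.0 for a, b in zip(daily_closes, daily_closes[1:])]
    return {
        "mean_ret": mean(returns),
        "std_ret": pstdev(returns),
        "last_close": daily_closes[-1],
    }


def live_zscore(summary, price):
    """Live step: score today's price using only the stored summary."""
    if summary["std_ret"] == 0:
        return 0.0  # flat history, no meaningful scale
    ret = price / summary["last_close"] - 1.0
    return (ret - summary["mean_ret"]) / summary["std_ret"]
```

The summary is a small dictionary of plain numbers, so it can be written to disk overnight (for example as JSON) and loaded at the open. The live process then never touches the historical tick files.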

Any experience on solutions that provide Tick specialized repositories integrated with stream processing engines (OneTick, VA, Delta)?
