Tick Data Storage, Treatment (NaNs, Spikes, Missing Data) for Quant Development and Analysis

(Last Updated On: October 11, 2011)

Anyone out here use MATLAB and kdb+ for investigating algo trading? I’d like some tips about treating tick data.
What are the dangers of downsampling, smoothing and handling missing data? Are there any best practices to follow?

I have done a lot of work in tick-level analysis. One of the major things you will need to deal with is the massive amount of data you will be analyzing. I am actually a big fan of MATLAB, but it is terrible at handling the volumes of data you typically face with tick data. If you are just analyzing small subsets this may not be an issue, but any significant amount of analysis will bog down very quickly in MATLAB.

One approach is to use MATLAB to prototype the analysis on a small batch of test data, then convert to a lower-level language for bulk analysis of the data. I know this sounds like an extra step, but I have literally gone from an analysis that takes 4-5 days in MATLAB to 20-30 minutes in F#. I am sure you could improve on that further with C++ or lower-level languages. Before anyone asks the obvious question: yes, I am very familiar with optimizing matrix operations in MATLAB, and I have been using MATLAB successfully since 2001.

I have also found that using a database, even a specialized one like kdb+, isn’t as efficient as just using raw data files. There are very few use cases that require random lookup of tick data; you are almost always going through the data sequentially in a single pass. A well-organized binary data file stored on an SSD will easily beat any form of database for this type of access. With a little thought about your use case it is easy to create a super-fast way of organizing the data, and the index almost always ends up just being the structure of the storage.
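To make the flat-file idea concrete, here is a minimal Python sketch. The record layout, field names and `TICK_DTYPE` are my own illustrative assumptions, not a standard: one fixed-width binary record per tick, written back to back in time order, so a sequential scan or memory-map is all the "index" you need.

```python
import numpy as np

# Hypothetical fixed-width record: int64 epoch-nanosecond timestamp,
# float64 price, int32 size. The file is just these records back to
# back, sorted by time -- the file order *is* the index.
TICK_DTYPE = np.dtype([
    ("ts", "<i8"),      # epoch nanoseconds
    ("price", "<f8"),
    ("size", "<i4"),
])

def write_ticks(path, ticks):
    """Write an array of tick records to a flat binary file."""
    np.asarray(ticks, dtype=TICK_DTYPE).tofile(path)

def read_ticks(path):
    """Memory-map the file; sequential scans stream straight off disk."""
    return np.memmap(path, dtype=TICK_DTYPE, mode="r")
```

Because the records are fixed-width and time-ordered, a record number maps directly to a byte offset, which is the "index is the structure of the storage" point above; splitting files per symbol and per day makes the filesystem do the rest of the indexing.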

As for the questions of downsampling, smoothing and missing data: the right treatment will almost always depend directly on your analysis, so general rules of thumb are nearly useless. The one thing I would strongly recommend is to always give yourself the ability to analyze the analysis process itself. Don’t make it a black box where raw data comes in at one end and an answer comes out the other. Algo trading development is an iterative process: things change quickly, and it is easy to bury bugs in the process. You need to be able to see what is going on in order to isolate issues and verify that the analysis is working properly. Visibility is a key part of this process, and good documentation and versioning of the process are a huge help as well.
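To illustrate why "it depends on your analysis" is the honest answer, here is a small pandas sketch (the timestamps and prices are invented) showing how even a routine downsampling step, ticks to one-minute bars, forces modelling decisions about empty intervals:

```python
import pandas as pd

# Invented tick series: trade prices indexed by timestamp.
ticks = pd.Series(
    [1.10, 1.11, 1.09, 1.12],
    index=pd.to_datetime([
        "2011-10-11 09:30:05",
        "2011-10-11 09:30:40",
        "2011-10-11 09:31:10",
        "2011-10-11 09:33:55",
    ]),
)

# Downsample to one-minute OHLC bars; the 09:32 minute had no trades,
# so its row comes out as NaN.
bars = ticks.resample("1min").ohlc()

# Forward-filling the close carries the last trade into quiet minutes --
# convenient, but it manufactures prices the market never printed.
filled_close = bars["close"].ffill()
```

Whether that forward-fill is acceptable depends entirely on what the downstream analysis does with quiet minutes, which is exactly why blanket rules of thumb fail.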

How important is it to obtain your research tick data from primary sources, i.e. directly from exchange feeds? What are the drawbacks to using tick data supplied by vendors such as Reuters, Morningstar etc?


I agree with Doug, as Matlab grinds to a halt when you can’t work around a massive for loop (> 10k iterations). One quick stopgap I’ve found very useful is to write the tick-processing part in Java and use Matlab’s plug-and-play functionality with jar files to speed up the bottlenecked portions. Oftentimes the mundane loop code can be easily implemented in Java, so you’re not looking at any increase in development time. Finally, Matlab passes data to its JVM by value, so you need to increase the Java heap space in preferences if you want to batch-process more than 1 GB of data.

I am especially interested in your comment about kdb+ not being as efficient as raw files.
How about if I am trying to write an algo that has to perform some pricing in real time? Is MATLAB still up to the task?
I’ve seriously noted the importance of being able to analyse the analysis process itself, along with versioning and documentation.
All my data come from providers like Morningstar and Interactive Data (plus some banks that provide currency data); they arrive with some rogue ticks and missing data for currency pairs.
I should have made this clear: I am currently using MATLAB to perform analysis and some backtesting, and I am not particularly concerned about processing time there. I am only concerned about MATLAB’s ‘real time’ performance.
Please excuse the vagueness of some of my questions and answers; they reflect the vagueness of my grasp of the issues. It’s a steep learning curve for me.
Thanks for the feedback.
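For the rogue ticks mentioned above, one common heuristic (not the only one, and the window and threshold here are illustrative assumptions) is to flag prices that sit too far from a rolling median, using the median absolute deviation as a robust scale so the spike itself doesn’t inflate the estimate:

```python
import numpy as np

def flag_rogue_ticks(prices, window=21, threshold=5.0):
    """Flag ticks more than `threshold` robust sigmas from the median
    of their neighbours. Window and threshold are illustrative."""
    prices = np.asarray(prices, dtype=float)
    n = len(prices)
    flags = np.zeros(n, dtype=bool)
    half = window // 2
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        # Exclude the tick under test so a spike cannot mask itself.
        neighbours = np.delete(prices[lo:hi], i - lo)
        med = np.median(neighbours)
        mad = np.median(np.abs(neighbours - med))
        scale = 1.4826 * mad if mad > 0 else 1e-12  # MAD -> approx sigma
        flags[i] = abs(prices[i] - med) > threshold * scale
    return flags
```

Flagged ticks can then be dropped or replaced (for example by the rolling median), and gaps left as NaN or filled, again depending on what the downstream analysis can tolerate.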


Regrettably, MATLAB is terrible for real-time analysis. It just doesn’t handle high-frequency streaming data well: its core processes seem to hang and stall at times, and its internal memory management can go into cleanup mode whenever it wants. MATLAB is great for many things, but real-time analysis is NOT one of them. Don’t even try it; in the end you won’t be happy.

MATLAB is great for prototyping and backtesting an algo, but once you get the concept working, the real-time code needs to be rewritten in a compiled language. C#, F#, Java or a few other languages are fine if the system doesn’t need absolute top performance. C/C++ written by a programmer with lots of experience in real-time processing is about the only way to go if you need absolute top speed.

Sorry that MATLAB won’t do it for you, but it is better to know now than later when you are trying to run live and it is failing miserably. BTW, this advice comes from painful experience trying to do what you are talking about.



