TOP 10 articles for questions and answers for quant research, high frequency aka HFT, and hedge fund uses
TOP 10 articles for C++ for quant research, high frequency aka HFT, and hedge fund uses
TOP 10 articles for technical development for quant research, high frequency aka HFT, and hedge fund uses
|Hadoop architecture for dummies: What is Hadoop for quant development?|
|Introducing our new Yahoo Finance Database Builder for MySQL, Microsoft SQL Server, and SQLite|
|Open Source API for trading algorithms? Best way to get into it [Algorithmic Trading]? Book? Examples? Tutorial?|
|How to use Metatrader forex or stock data streaming for FREE! Export tick data to CSV text file to work with Matlab!|
|FPGA technology into the ultra low latency trading|
|Excellent Tradestation Easy Language tutorial videos on YouTube|
|Quant development: Choosing and configuring linux for low latency?|
|Zero Garbage Collection (GC) in Java?|
|Hadoop and GPUs in quant development?|
|Quant development: Which programming languages are the most widely used in trading systems for conventional algorithmic trading?|
Will overfitting occur in this quant research strategy scenario?
I’ve implemented the J48 algorithm on my dataset and the accuracy was 87.91%. When I identified and removed the misclassified instances, the accuracy improved to 98.79%. The original dataset had 10,000 instances; after removing the misclassified instances, 7,100 remain.
Will it overfit or not?
After using J48 (or any other pruning algorithm) you should see the performance on your original tree decrease, and that decrease is due to your original tree overfitting the data.
The key metric you want to use to evaluate the ‘goodness of fit’ is how the model performs on a holdout sample. With n = 10,000 you should have definitely pulled out a random group of at least n = 2,000 before you built your original tree.
Then, once you are confident in the tree you build on the remaining n = 8,000, you apply those rules to your holdout sample and see if the model is robust, and not overfitting the data.
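The holdout procedure described above can be sketched as follows. This uses scikit-learn's DecisionTreeClassifier as a stand-in for Weka's J48 (J48 implements C4.5 while scikit-learn uses CART, but the validation logic is identical), and synthetic data standing in for the original 10,000 instances:

```python
# Holdout evaluation sketch: pull out ~20% of the data BEFORE building the
# tree, then score the fitted model on that untouched sample.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the original n = 10,000 dataset.
X, y = make_classification(n_samples=10_000, n_features=20, random_state=0)

# The holdout is set aside before any model building or instance filtering.
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.2, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = accuracy_score(y_train, tree.predict(X_train))
hold_acc = accuracy_score(y_hold, tree.predict(X_hold))
print(f"train accuracy:   {train_acc:.3f}")   # near 1.0 for an unpruned tree
print(f"holdout accuracy: {hold_acc:.3f}")    # the honest estimate
```

The gap between the two numbers is a direct, if rough, measure of how much the tree overfits.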
(and I wasn’t sure what you meant when you said you “removed misclassified instance”… sounds odd)
I just answered a similar question, but I agree: if you first classify the data and then remove the instances that are misclassified, you artificially increase performance, and this is equivalent to overfitting.
Regarding J48, in particular, I would advise to pay attention to tree stability. In other words, if you randomly split your data into training and test data multiple times and build the tree with the different training data, your tree shouldn’t change a lot from one training set to another training set. If this is not the case, then high accuracy is irrelevant, as it was accidentally attained. In this case, one can’t be sure that such a tree will well generalize to unseen data.
In other words, I would prefer a stable tree with a lower accuracy to an unstable tree with a higher accuracy.
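The stability check suggested above can be sketched like this (again using scikit-learn and synthetic data as stand-ins, since the original Weka setup isn't available): refit the tree on several random train/test splits and look at the spread of holdout accuracy and, as a cheap structural proxy, tree size.

```python
# Stability check: a wide spread across random splits suggests an unstable
# tree whose high accuracy on any one split was accidental.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2_000, n_features=20, random_state=0)

accs, sizes = [], []
for seed in range(10):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=seed)
    tree = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_tr, y_tr)
    accs.append(tree.score(X_te, y_te))
    sizes.append(tree.tree_.node_count)   # structure proxy: node count

print(f"holdout accuracy: {np.mean(accs):.3f} +/- {np.std(accs):.3f}")
print(f"tree size (nodes): min {min(sizes)}, max {max(sizes)}")
```

If the node counts (or the top splits, which you can inspect with `sklearn.tree.export_text`) vary wildly from split to split, that is the instability being warned about.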
• There’s always overfitting. The question is, how much overfitting? You need to have a way of accurately measuring that.
By coincidence, just last night I was reading a paper that describes a method of quantifying overfitting … see Harrell, Lee & Mark, (Tutorial in Biostatistics) Multivariate Prognostic Model Issues in Developing Models, Evaluating Assumptions and Adequacy, and Measuring and Reducing Errors (Statistics in Medicine, Vol. 15, 361-387, 1996).
Given Dr. Harrell’s generous nature in terms of sharing, I’m thinking/hoping he’s built a function to do this in one of the R packages he’s authored/co-authored, but I haven’t looked yet.
(edited: I originally said the calc used a bootstrap method, but that is incorrect. He uses bootstrapping to look for smaller models that may have less overfitting.)
URL to a PDF of the paper mentioned:
Pls can anyone help? Just got this assignment: “Screening of cell tissue is extremely important to detect tumors. This dataset contains measurements of cell properties derived from a digitised image of cell nuclei of breast cells for 569 samples. Each sample also contains a diagnosis as benign or malignant. Try to develop a good screening method based on these data.”
Diagnosis: M = malignant, B = benign
radius: mean of distances from center to points on the perimeter
texture: standard deviation of gray-scale values
perimeter: average perimeter of nuclei
area: average area of nuclei
smoothness: local variation in radius lengths
compactness: average of perimeter² / area − 1.0
concavity: average severity of concave portions of the contour
concave points: average number of concave portions of the contour
symmetry: average symmetry measure of nuclei
fractal dimension: “coastline approximation” − 1
Pls, I need your help ASAP with R code for this task. Thank you in advance.
I apologize if I’m wrong, but it sounds like you’re trying to outsource your homework/learning exercise.
I’m kind of new to classification trees; in a few studies I have looked at, pruning seems inadequate to cope with overfitting. However, Breiman’s random forest algorithm, available for R in the randomForest package, is more robust, as it averages across many trees built on random samples of cases and variables.
And Collins, yes it is a good candidate for your problem and you can adapt the examples provided to your data.
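For what it's worth, the dataset described above appears to be the Wisconsin Diagnostic breast cancer data (569 samples, those exact nucleus features), which ships with scikit-learn. The thread asks for R's randomForest, but a Python sketch of the same random forest idea carries over directly:

```python
# Random forest screening sketch on the 569-sample Wisconsin Diagnostic
# breast cancer dataset bundled with scikit-learn. Cross-validation scores
# every sample while it is held out, so no accuracy is measured in-sample.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)   # 569 samples, 30 features
rf = RandomForestClassifier(n_estimators=200, random_state=0)

scores = cross_val_score(rf, X, y, cv=5)     # 5-fold cross-validation
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Averaging over many trees built on bootstrap samples and random feature subsets is exactly the robustness-to-overfitting argument made above for randomForest in R.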
“Overfit” means that the classifier models characteristics of the sample so accurately that the decision border becomes more complex than it should optimally be for the population (i.e., cases not yet seen). One usually tries to avoid this with some sort of regularization (in the case of trees, pruning). What one should (ideally) see is that the pruned model works better on data in general (cases not yet seen). This is usually emulated by keeping a hold-out sample for verification, as pointed out earlier.
If you remove the misclassified data, the classifier will evidently perform much better, because you make the sample distribution favorable for the classifier. The model might be less complex, but the data no longer come from the original distribution. In my opinion, this is a biased model: it does not represent the original data and classification problem.
I agree with many of the comments on the board. We typically use an out-of-time sample to validate the splits obtained on the tree variables, or, if we built a scorecard, we’ll also validate using the PSI (population stability index) of the variables as well as of the overall model. A closer look at the distribution of the variables, or of the bins within them, can reveal shifting and issues with fit.
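For readers unfamiliar with PSI, a minimal hand-rolled sketch (my own simplified version, not any particular vendor's implementation) looks like this; the conventional reading is PSI < 0.1 stable, 0.1–0.25 shifting, > 0.25 a significant shift.

```python
# Population Stability Index: bin a variable on the development sample,
# then compare the bin distribution on a later (out-of-time) sample.
import numpy as np

def psi(expected, actual, bins=10):
    """PSI = sum((a% - e%) * ln(a% / e%)) over quantile bins of `expected`."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf       # catch out-of-range values
    e_pct = np.histogram(expected, edges)[0] / len(expected)
    a_pct = np.histogram(actual, edges)[0] / len(actual)
    e_pct = np.clip(e_pct, 1e-6, None)          # avoid log(0)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(0)
dev = rng.normal(0.0, 1.0, 50_000)       # development-time scores
same = rng.normal(0.0, 1.0, 50_000)      # same population later
shifted = rng.normal(1.0, 1.0, 50_000)   # population has drifted

print(f"PSI (stable):  {psi(dev, same):.3f}")
print(f"PSI (shifted): {psi(dev, shifted):.3f}")
```

Run per scorecard variable and on the final score itself, exactly as the comment above describes.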
This extremely interesting and important discussion is but one manifestation of the more general issue of HOW GOOD cross-validation is, especially w.r.t. future objects/samples, i.e. “cases not yet seen”. The entire issue is examined thoroughly here:
http://dl.dropbox.com/u/13542010/Mythbusters-1.pdf – which I hope can be of use.
is your “assignment” an educational exercise for a routine class style event? Or is your “assignment” a challenging clinical case awaiting quick treatment where you are overburdened with day to day customary and usual work? I am out the door to visit a patient. But if your need is for clinical care services and not a textbook, educational exercise, I will try to think about your needs ASAP. Thanks for the contact.
Always reserve part of the data for testing. Build on a sample, then test it.
In geobotany, up to 50% of cases are discarded. It is normal. But …
As for “‘removed misclassified instance’… sounds odd”: in geobotany it is a common technique.
NOTE I now post my TRADING ALERTS into my personal FACEBOOK ACCOUNT and TWITTER. Don't worry as I don't post stupid cat videos or what I eat!
How QuickFIX and FPGA can speed up quant research and greatly improve quant analysis even further!!
November 17, 2010 – FOR IMMEDIATE RELEASE – New York-based boutique financial technology firm Wall Street FPGA announced today the ability to significantly accelerate trading software. Speed makes trading strategies more profitable and helps identify new trading opportunities.
Using Field Programmable Gate Array (FPGA) technology, Wall Street FPGA accelerates the industry standard Financial Information eXchange (FIX) protocol. “We have a working prototype where we significantly accelerate QuickFIX, the open source FIX Engine, using an FPGA,” said John Stratoudakis, Founder and Director of Research at Wall Street FPGA.
The prototype creates a scenario in which a broker/dealer wishes to quickly cancel open/unexecuted orders at a trading venue such as an exchange. Two PCs, each running QuickFIX, are directly connected to each other.
- The broker/dealer PC sends several trade orders to the exchange PC where the orders remain open and do not execute.
- The network card of the broker/dealer PC has a customized FPGA which retains open order info.
- Upon receipt of a trigger, the FPGA generates FIX Order Cancel messages and sends them to the exchange PC.
Ordinarily, the QuickFIX software on the broker/dealer PC would generate the FIX Order Cancel messages and then send them to the network card for transmission. With Wall Street FPGA’s groundbreaking prototype, generating the FIX Order Cancel messages within the FPGA-based network card dramatically accelerates parts of the QuickFIX software.
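For context, a FIX Order Cancel Request is MsgType 35=F. A minimal software sketch of building one (hypothetical field values; this is not Wall Street FPGA's implementation, just the message format their FPGA generates in hardware) looks like:

```python
# Minimal FIX 4.2 OrderCancelRequest (35=F) builder. Fields are separated by
# SOH (0x01); tag 9 carries the body length and tag 10 the checksum.
SOH = "\x01"

def fix_checksum(msg: str) -> str:
    """FIX checksum: sum of all bytes modulo 256, zero-padded to 3 digits."""
    return f"{sum(msg.encode('ascii')) % 256:03d}"

def build_cancel_request(sender, target, seq, cl_ord_id,
                         orig_cl_ord_id, symbol, side):
    """Assemble a cancel request for a previously sent order (tag 41)."""
    body = SOH.join([
        "35=F",                                  # MsgType: OrderCancelRequest
        f"49={sender}", f"56={target}",          # sender / target comp IDs
        f"34={seq}", "52=20101117-14:30:00",     # sequence no., sending time
        f"41={orig_cl_ord_id}", f"11={cl_ord_id}",  # order being canceled
        f"55={symbol}", f"54={side}",            # instrument and side
        "60=20101117-14:30:00",                  # transact time
    ]) + SOH
    header = f"8=FIX.4.2{SOH}9={len(body)}{SOH}"
    msg = header + body
    return msg + f"10={fix_checksum(msg)}{SOH}"

msg = build_cancel_request("BRKR", "EXCH", 2, "C1", "O1", "IBM", "1")
print(msg.replace(SOH, "|"))   # pipes for readability
```

The point of the prototype is that this assembly and checksum step, trivial in software, is moved into the network card's FPGA so the cancel hits the wire without a round trip through the host CPU.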
The Order Cancel System protects a client’s money during uncertain market conditions, acting as a personal circuit breaker. For example, a single-day drop of 100 points in the Dow Jones Industrial Average could trigger an order cancel, safely and quickly exiting a volatile market. The prototype could be expanded for use by market makers, wherein open/unexecuted orders are canceled and replaced to adjust to rapidly changing market conditions.
The prototype shows the exceptional ability of Wall Street FPGA to quickly develop and deliver hardware-accelerated, low-latency trading solutions that adhere to industry-standard communication protocols such as FIX.
Quant Research: High Frequency Trading HFT
and Its Impact on Market Quality
In this paper I examine the impact of high frequency trading (HFT) on the U.S. equities market. I analyze a unique dataset to study the strategies utilized by high frequency traders (HFTs), their profitability, and their relationship with characteristics of the overall market, including liquidity, price discovery, and volatility. The 26 HFT firms in the dataset participate in 68.5% of the dollar-volume traded. I find the following key results: (1) HFTs tend to follow a price reversal strategy driven by order imbalances, (2) HFTs earn gross trading profits of approximately $2.8 billion annually, (3) HFTs do not seem to systematically engage in a non-HFTr anticipatory trading strategy, (4) HFTs’ strategies are more correlated with each other than are non-HFTs’, (5) HFTs’ trading levels change only moderately as volatility increases, (6) HFTs add substantially to the price discovery process, (7) HFTs provide the best bid and offer quotes for a significant portion of the trading day and do so strategically so as to avoid informed traders, but provide only one-fourth as much book depth as non-HFTs, and (8) HFTs may dampen intraday volatility. These findings suggest that HFTs’ activities are not detrimental to non-HFTs and that HFT tends to improve market quality.
5th Annual Conference on Empirical Legal Studies Paper
This paper examines the role of high frequency trading (HFT; HFTs refers to multiple high frequency traders and HFTr to a single trader) in the U.S. equities market. HFT is a type of investment strategy whereby stocks are rapidly bought and sold by a computer algorithm and held for a very short period, usually seconds or milliseconds. The advancement of technology over the last two decades has altered how markets operate. No longer are equity markets dominated by humans on an exchange floor conducting trades. Instead, many firms employ computer algorithms that receive electronic data, analyze it, and publish quotes and initiate trades. Firms that use computers to automate the trading process are referred to as algorithmic traders; HFTs are the subset of algorithmic traders that most rapidly turn over their stock positions. Today HFT makes up a significant portion of U.S. equities market activity, yet the academic analysis of its activity in the financial markets has been limited. This paper aims to start filling the gap.
The rise of HFT introduces several natural questions. The most fundamental is how much market activity is attributable to HFTs. In my sample, 68.5% of the dollar-volume traded involves HFT. A second question that arises is what HFTs are doing. Within this question lie many concerns regarding HFT, including whether HFTs systematically anticipate and trade in front of non-HFTs, flee in volatile times, and earn exorbitant profits. My findings do not validate these concerns. The third integral question is how HFT is impacting asset pricing characteristics. This paper is an initial attempt to answer this difficult question. The key characteristics I analyze are price discovery, liquidity, and volatility. I find that HFTs add substantially to the price discovery process, frequently provide inside quotes while providing only some additional liquidity depth, and may dampen intraday volatility.