Tag Archives: data set

Quant analytics: how to check that data reduction is correct after applying PCA to a data set


You should check that the cumulative proportion of variance explained by the number of dimensions you decide to keep is high enough (around 80%, though the threshold depends on the field). In R, you can see this clearly with the summary() command, which shows the proportion of variance due to each component. In a few words, you can reduce the data if you do not lose too much information: so if you decide to keep the two principal components, their cumulative proportion of variance should be high enough to represent the original data set well. Of course, the cumulative proportion reaches 100% only if you keep all the dimensions, but very often only a couple of them are needed to explain a large part of the original data. Hope this helps!
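The check described above can be sketched in Python with plain numpy (the post mentions R's summary(); the SVD of the centered data gives the same per-component variance proportions). The data matrix here is made up purely for illustration:

```python
# Sketch of the cumulative-proportion-of-variance check from the post,
# using numpy instead of R's summary() on a PCA fit.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# make one column strongly correlated so a few components dominate
X[:, 1] = 0.9 * X[:, 0] + rng.normal(scale=0.1, size=200)

# Center the data, then get the variance explained by each component via SVD
Xc = X - X.mean(axis=0)
svals = np.linalg.svd(Xc, compute_uv=False)
explained = svals**2 / np.sum(svals**2)
cumulative = np.cumsum(explained)

# Keep the smallest number of components whose cumulative proportion of
# variance reaches the chosen threshold (about 80%, as in the post)
k = int(np.searchsorted(cumulative, 0.80)) + 1
print(cumulative)
print("components to keep:", k)
```

As the post says, the cumulative proportion only reaches 100% when all dimensions are kept; the point of the check is that `k` is usually much smaller than the original dimensionality.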

 

 

NOTE I now post my TRADING ALERTS into my personal FACEBOOK ACCOUNT and TWITTER. Don't worry as I don't post stupid cat videos or what I eat!

Help this guy out with his big data set query for quant purposes

Can you give this reader a hand in finding an answer for his query:

Hi Nagendra,

I think a chunk is a piece of data in the overall file. For example, a 100 MB file can be split into 10 chunks of data, with each chunk containing 10 MB. As for the second part of your question, even I am not sure and have the same question as well.

Looking forward to the Hadoop experts helping us understand this.

Regards,
Phani
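Phani's description of chunking can be sketched in a few lines of Python. The sizes below are scaled down from the 100 MB / 10 MB example in the letter (real HDFS splits files at the block level, with a much larger configurable block size):

```python
# Toy illustration of splitting one file into fixed-size chunks,
# mirroring the 100 MB -> 10 x 10 MB example from the letter.
# Sizes are in bytes here just to keep the example small.
import io

def split_into_chunks(stream, chunk_size):
    """Yield successive chunks of at most chunk_size bytes from a stream."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield chunk

# a 100-byte "file" split into 10-byte chunks
data = io.BytesIO(b"x" * 100)
chunks = list(split_into_chunks(data, 10))
print(len(chunks))  # 10 chunks, each 10 bytes
```

This only illustrates the splitting idea; how a framework like Hadoop assigns those chunks to nodes is a separate question, which is the part the letter leaves open.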


Matlab's Parallel Computing Toolbox lets you batch and break up large data set processing

We all know about parallel computing capabilities. C++ is quite powerful at it, but you need to hand-code everything to take advantage of it. It seems Matlab includes a Parallel Computing Toolbox which makes life easier.
You can break up complicated, long-running processes into small worker groups. This lets you run Matlab more efficiently, and your tasks will take less time to complete. Matlab lets you run as many remote workers as the license you get from Mathworks allows. Out of the box, it seems Matlab limits you to eight workers running in parallel.
Some uses for this parallel computing feature are when you have long-running iterations, or too many of them. You can use the parfor construct instead of the standard for loop to take full advantage of the parallel computing capabilities. You can also batch jobs or break up long data sets using this Matlab toolbox.
With the parallel computing toolkit, you can get:

* Different Array Types
* Working with Codistributed Arrays
* Using a for-Loop Over a Distributed Range (for-drange)
* Using MATLAB Functions on Codistributed Arrays
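The codistributed-array idea in that list can be sketched very loosely in Python with numpy: an array is split into per-worker pieces, each piece is processed on its own, and the results are gathered back. (In real codistributed arrays the pieces live in separate worker processes; here the "workers" are just slices in one process, to show the data layout only.)

```python
# Loose single-process sketch of the codistributed-array layout:
# split, operate per piece, gather.
import numpy as np

def distribute(arr, n_workers):
    """Split arr into n_workers roughly equal pieces (the 'distribution')."""
    return np.array_split(arr, n_workers)

def gather(pieces):
    """Reassemble the per-worker pieces into one array."""
    return np.concatenate(pieces)

arr = np.arange(12)
pieces = distribute(arr, 4)          # each 'worker' holds one piece
processed = [p * 2 for p in pieces]  # each piece is processed independently
result = gather(processed)
print(result)
```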
Thankfully, Matlab's help files give you a rundown of this very powerful feature, which many applications cannot offer. I am very, very impressed with this capability, which includes a profiler that displays each worker simultaneously. It can also talk to most scheduling software programs, which makes Matlab's parallel computing features even better. Wow! You need to check out some of the powerful function calls that make this happen.
Also, this parallel computing toolkit gives Matlab a very modern architecture that you may want to consider using instead of pure home-built C++ applications.
