Tag Archives: Cassandra

Best alternative development stack for R with Hadoop? Forget MYSQL? Cassandra and Java listeners for market tick data


After reading about the limitations of MySQL and how expensive it can get, I decided to take a chance on Cassandra. One big reason is that there is an RCassandra package on CRAN, so yippee for that. The install does not look hard and, better yet, you can integrate it with Hadoop. Yippee for that too! Cassandra may also be faster for writes than HBase, which is part of the RHadoop offering, so boo to that. I also plan to have some Java listeners on my market data to populate the Cassandra database. This stack may work, so let’s cross our fingers. Here are the links that got me thinking this way:

http://stackoverflow.com/questions/4884967/hadoop-hbase-hdfs-vs-mysql-or-postgres-loads-of-independent-structured-d

http://ria101.wordpress.com/2010/02/24/hbase-vs-cassandra-why-we-moved/

http://blog.milford.io/2010/06/installing-apache-cassandra-on-centos/

The Cassandra install described above appears to work. It needs a few tricks, such as running as root, but it installed fine.

http://www.datastax.com/docs/0.7/map_reduce/hadoop_mr

http://code.google.com/p/cassandra-java-client/

I have doubts about the last two links. The Cassandra install may be worth doing, but integrating with Hadoop could be a challenge. I also hope RCassandra works too.

 

 

HOW DO YOU START A PROFITABLE TRADING BUSINESS? Read more NOW >>>

NOTE I now post my TRADING ALERTS into my personal FACEBOOK ACCOUNT and TWITTER. Don't worry as I don't post stupid cat videos or what I eat!

Sept 5. Cassandra vs HBase, with Nick Telford. Free talk.


This month the London Cassandra User Group continues the “Cassandra vs” theme by taking a look at HBase.

 

Nick Telford has used both Cassandra and HBase in a real “big data” production environment at Datasift, and will be giving an expert insight into the two NoSQL solutions.

 

Sign up for free here: http://skillsmatter.com/podcast/nosql/cassandra-hbase/js-433

 


—-

It would be interesting to see whether the slides get posted. Sorry, I can’t travel to the UK on a whim, but I’d like to hear about his experience.

 

 


July 18. Comparing distributed databases with the London Cassandra User Group. Free talk.

It’s an interesting month for anyone who is currently evaluating NoSQL solutions — come along to the London Cassandra User Group where this month we’re focusing on analysing other distributed databases to see how they compare to Cassandra.

Get more info and sign up here: http://skillsmatter.com/podcast/nosql/mongodb/js-433


Quant development: low latency read, replication across different data centers, hbase or cassandra?


Hi. Suppose we have terabytes of log data generated every day that must be piped into a NoSQL database (so low write latency is required). We also need replication across different data centers, low-latency queries on the data in the NoSQL datastore, no single point of failure, schema updates without restarting the data store, and the ability to run MapReduce jobs. Which would be the right NoSQL implementation: HBase or Cassandra? What do you think is more appropriate for this use case?

Both offer a “way” to do replication. HBase offers two-way asynchronous replication. Cassandra has NetworkTopologyStrategy, which replicates across N data centers with a replication factor per data center (3 copies in this DC, 2 copies in that DC, 4 copies here). You also have special quorum consistency levels called LOCAL_QUORUM and EACH_QUORUM, which control how writes are acked in these multi-DC scenarios. They both support MapReduce. Both have online schema updates now. As for your call for no SPOF: yes, you can drbd+lha all the Hadoop SPOFs, but Cassandra is truly shared-nothing. Cassandra seems to match your use case best.
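To make the quorum levels concrete, here is a minimal sketch of the arithmetic, assuming the 3/2/4 replication factors from the reply above. The data center names (dc1, dc2, dc3) and the helper functions are made up for illustration and are not part of any Cassandra client API. A quorum is taken as a simple majority of the replicas, floor(RF / 2) + 1.

```python
def quorum(replicas: int) -> int:
    """Smallest majority of `replicas` copies: floor(replicas / 2) + 1."""
    return replicas // 2 + 1


# Hypothetical data center names; the factors are the 3/2/4 from the reply.
replication = {"dc1": 3, "dc2": 2, "dc3": 4}


def acks_required(level: str, local_dc: str, rf: dict) -> dict:
    """Replica acknowledgements needed per data center for one write."""
    if level == "LOCAL_QUORUM":
        # Only a majority of the coordinator's own data center must ack.
        return {local_dc: quorum(rf[local_dc])}
    if level == "EACH_QUORUM":
        # A majority in every data center must ack.
        return {dc: quorum(n) for dc, n in rf.items()}
    raise ValueError("level not covered by this sketch: " + level)


local = acks_required("LOCAL_QUORUM", "dc1", replication)
each = acks_required("EACH_QUORUM", "dc1", replication)
print(local)  # {'dc1': 2}
print(each)   # {'dc1': 2, 'dc2': 2, 'dc3': 3}
```

The practical difference: a LOCAL_QUORUM write never waits on a cross-data-center round trip, while an EACH_QUORUM write is only as fast as the slowest data center.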

CouchDB is great at replication, MapReduce, and writing to all nodes, though sharding the data to grow horizontally still comes from third-party add-ons. BigCouch offers a version of CouchDB with built-in sharding that follows the Dynamo paper, and I’m going to look at that next. The Large Hadron Collider project uses CouchDB to store petabytes of distributed data with a custom sharding solution.

Replication in HBase is still pretty new, yet there are a couple of ways to handle replication, including at the application level.

 

I would suggest that you determine your use case and what you want to do. Cassandra has its issues, as do HBase and any other option.

 

One thing you also need to consider, whichever solution you choose, is its ‘critical mass’. Meaning: in 5 years’ time, will it still be around, or will it be just a dying fad?

Replication can be done at the application level, and that method has merit, but if you are doing the replication yourself, every data store would now be in the solution space. That is out of scope for this topic.

Replication across data centers is pretty new in HBase: does that imply that there are some well-documented issues? Isn’t replication across different data centers taken care of by the underlying HDFS?

Also, how about migration across different versions of HBase and Cassandra? Can that happen online with no outage?

How about the API that a Java application uses to interact with HBase and Cassandra? What would you suggest one should use for either?


My understanding is that HDFS shouldn’t be spanned across data centers, due to performance and the SPOF on the NameNode (NN). When I spoke with Todd Lipcon (HBase guy) at the Chicago Data Summit recently, I asked how his architecture picture changed when you have multiple data centers. His answer was to have HDFS/HBase in each data center and use the replication as Mike suggested (which is pretty new but, like most things, gets better each release).

The other thing to remember is that HBase favors consistency while Cassandra is more eventually consistent. So a big part of the question is: when you look at the logs, do you need a perfectly correct answer, or can a few records be trailing? Even if you say you want a perfect answer, all bets are off with more than one data center, as you still have the replication lag.
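One way to see the “eventually consistent” trade-off is the usual overlap rule for Dynamo-style stores: with N replicas of a row, a read that consults R replicas is guaranteed to touch at least one replica that acknowledged the latest write of W replicas only if R + W > N. The sketch below just checks that inequality. The function name is made up for illustration, and it ignores everything else that matters in practice (failed nodes, repair, cross-data-center lag).

```python
def read_overlaps_latest_write(n: int, r: int, w: int) -> bool:
    """True if any R replicas read must include one of the W replicas written.

    n: replicas holding the row
    r: replicas consulted per read
    w: replicas that must acknowledge a write
    """
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("r and w must be between 1 and n")
    return r + w > n


# Three replicas, majority writes and majority reads: the sets must overlap.
quorum_both = read_overlaps_latest_write(n=3, r=2, w=2)

# Three replicas, single-replica writes and reads: a read can hit a stale copy.
one_and_one = read_overlaps_latest_write(n=3, r=1, w=1)

print(quorum_both)  # True
print(one_and_one)  # False
```

So “can a few records be trailing?” translates into how small you are willing to make R and W in exchange for latency.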

WRT ingestion, it seems that something like Flume (from Cloudera) deals with it, either directly via the Java API or by tailing a file, etc. on the source side, and using an HBase or Cassandra sink. It too is pretty new (like most of this stuff), so YMMV.

Haven’t tried any of this, but plan to in the near term.

 


NOSQL Cassandra Query Language


Did you miss last night’s London Cassandra User Group at Skills Matter? Watch the videos here — including:
Lorenzo Alberton on Algorithms and Data Structures
Andrew Byde on the Acunu storage core
Courtney Robinson on Intro to CQL

