
Right now, around 100 gigabytes of data is generated by user activity across our company's different applications and products. The product I work on consumes all of that data, gives insights into user behavior, and helps segment users so that we can send them relevant data and recommendations to help them in their quest.
When we built this application, we didn't think much about data backup or a data eviction policy at the database level. Usually, when we build an application, we analyze the current requirements, predict future requirements based on them, and design the application architecture accordingly. We also postpone a couple of things and make some assumptions.
We realized we were unnecessarily paying extra money for redundant data sitting on our application database's hard disks (EBS volumes). So we planned a proper data backup strategy and decided to set a TTL on the new data being inserted into our DB. We also decided to iterate over some Cassandra tables and set a TTL on the old data, so that old data gets removed at compaction time.
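To make the first part concrete, here is a minimal sketch of a TTL-carrying write with the Node.js cassandra-driver; the keyspace, table, and column names are hypothetical, and the 90-day TTL is just an example.

```javascript
const cassandra = require('cassandra-driver');

// Hypothetical connection details and schema, for illustration only.
const client = new cassandra.Client({
  contactPoints: ['127.0.0.1'],
  localDataCenter: 'datacenter1',
  keyspace: 'analytics'
});

// New rows carry a TTL (here 90 days, in seconds), so expired data is
// purged during compaction instead of piling up on disk.
async function recordEvent(userId, eventTime, payload) {
  const query =
    'INSERT INTO user_events (user_id, event_time, payload) ' +
    'VALUES (?, ?, ?) USING TTL 7776000';
  await client.execute(query, [userId, eventTime, payload], { prepare: true });
}
```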
For that, my boss Amit Jain gave me the link to this great article. After reading it, we decided to write code for a table scan using the token concept. Then an interesting problem popped up: when my team member Rahul Deewan started writing the code, he realized that the safe Node.js Number/integer range is -(2**53 - 1) to (2**53 - 1), but the Cassandra token range is -(2**63) to (2**63 - 1).
Then we got this super reference. We chose the bignum package because it's easy to understand, and we had faced npm installation issues with node-bigint.
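To see the mismatch concretely, here is a small demonstration; the exact digits printed for the overflowing literal come from IEEE-754 double rounding.

```javascript
const bignum = require('bignum');

// JavaScript Numbers are IEEE-754 doubles: integers are only exact
// up to 2**53 - 1.
console.log(Number.MAX_SAFE_INTEGER); // 9007199254740991

// Cassandra's maximum token, 2**63 - 1, silently loses precision:
console.log(9223372036854775807);                         // 9223372036854776000
console.log(9223372036854775807 === 9223372036854775806); // true!

// bignum keeps 64-bit token arithmetic exact.
const maxToken = bignum(2).pow(63).sub(1);
const minToken = bignum(2).pow(63).neg();
console.log(maxToken.toString()); // 9223372036854775807
console.log(minToken.toString()); // -9223372036854775808
```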
Below is sample code for finding the minimum and maximum token in a Cassandra table and printing the data that falls within a given token range.
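This is a sketch of that script, assuming a MovieLens-style movies table (columns movie_id, genres, title, as implied by the output below); the contact point, data center, and keyspace name are placeholders.

```javascript
const cassandra = require('cassandra-driver');
const bignum = require('bignum');

// Connection details and keyspace name are assumptions for this sketch.
const client = new cassandra.Client({
  contactPoints: ['127.0.0.1'],
  localDataCenter: 'datacenter1',
  keyspace: 'movielens'
});

async function main() {
  // Stream every token in the table and track min/max with bignum,
  // since plain Numbers cannot compare 64-bit tokens exactly.
  let minToken = null;
  let maxToken = null;
  for await (const row of client.stream('SELECT token(movie_id) AS t FROM movies', [])) {
    const t = bignum(row['t'].toString()); // the driver returns bigint as a Long
    if (minToken === null || t.lt(minToken)) minToken = t;
    if (maxToken === null || t.gt(maxToken)) maxToken = t;
  }
  console.log('Minimum token:', minToken.toString());
  console.log('Maximum token:', maxToken.toString());

  // Print the rows whose tokens fall in a 5000-token window at the start.
  const from = minToken;
  const to = minToken.add(5000);
  console.log(`Getting data between token range from ${from} to ${to}`);
  const result = await client.execute(
    'SELECT movie_id, genres, title FROM movies ' +
    `WHERE token(movie_id) >= ${from} AND token(movie_id) <= ${to}`
  );
  result.rows.forEach((row) => console.log(row));
}

main()
  .catch(console.error)
  .finally(() => client.shutdown());
```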
You will get output like this:

```
Minimum token: -9223297786983086897
Maximum token: 9223328311826220865
Getting data between token range from -9223297786983086897 to -9223297786983081897
Row { movie_id: 4317, genres: 'Comedy|Romance', title: 'Love Potion #9 (1992)' }
```
Tip (given by my friend Abhinav Faujdar): when scanning a Cassandra table, use token ranges instead of LIMIT.

With LIMIT, the coordinator node fetches data from all the nodes and then trims the result down to the requested limit, so we extract more data from the nodes than we actually need. With a token range, the query goes only to the specific node(s), because Cassandra assigns each node a specific token range, and it fetches only the data belonging to that range. Basically, it's much less memory- and CPU-intensive. A sketch of such a windowed range scan is shown below.
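Here is a rough sketch of how such a scan might walk from the observed minimum token to the maximum in fixed windows; it reuses client and bignum from the earlier snippet, and the window size and the per-row processing step are illustrative.

```javascript
// Walk from the observed minimum token to the maximum in fixed-size
// windows, so each query touches only the replicas owning that slice.
// Reuses `client` and `bignum` from the earlier snippet.
async function scanRange(minToken, maxToken, windowSize) {
  let from = minToken;
  while (from.le(maxToken)) {
    let upper = from.add(windowSize);
    if (upper.gt(maxToken)) upper = maxToken;
    const result = await client.execute(
      'SELECT movie_id, genres, title FROM movies ' +
      `WHERE token(movie_id) >= ${from} AND token(movie_id) <= ${upper}`
    );
    result.rows.forEach((row) => {
      // ...process each row: re-write it with a TTL, aggregate, etc.
    });
    from = upper.add(1); // the next window starts just past this one
  }
}

// e.g. scanRange(minToken, maxToken, 5000) after computing the bounds above
```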
Meanwhile, my friends who love Java are laughing at me, because if we had written this script in Java there would have been no need for any extra package like bignum: Java has a 64-bit long data type 😉.
Peace. Happy Coding.