A few days ago I was performing blogosphere analysis using my crawler "VisionerBot", which I recently presented at ICISA 2010 in Seoul. I had quite a tough time, including a number of sleepless nights, because the amount of data was extremely huge :). My task was to process blogs on the blogspot domain for users' interests. In this post I present the problem I faced and the technique I used to overcome it:
Input: 69 Blogs
1) Find 5,000 Blogs
2) Process at most 2,000 Blog posts per blog
Total Blogs found: 5,067
Total Alive Blogs: 4,552
Total Number of Posts: 1,704,587
Size of data that needs to be processed: 19 GB+
Size of available RAM: 2 GB
Now what to do from here... When I started working on this, I never expected to find 1,704,587 posts totalling over 19 GB, and I worked day and night to make this data fit on my desktop machine. If I had used a database for this experiment, it would have cost me months to download everything, whereas my deadline was 10 days. So I came up with a new algorithm that I call the "Rack Algorithm": it downloads data into RAM until RAM fills up, then flushes the data to disk and clears RAM for the remaining downloads, and this cycle continues until the data is downloaded completely. After the download comes the process of extracting the meaningful data out of those 19 GB and bringing it back into RAM for processing; there I managed to shrink its size enough to fit it in RAM. In this process I used (key, value) pairs along with lists.
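The buffer-and-flush idea behind the Rack Algorithm can be sketched as below. This is a minimal illustration, not VisionerBot's actual code: the class name `RackBuffer`, the file naming scheme, and the byte-counting details are my choices for the sketch; the only thing taken from the post is the cycle of filling RAM, flushing to disk, and clearing RAM.

```python
import os
import tempfile


class RackBuffer:
    """Accumulate downloaded items in RAM; once the buffer passes a size
    limit, flush it to a numbered "rack" file on disk and clear the RAM
    so downloading can continue."""

    def __init__(self, spill_dir, limit_bytes):
        self.spill_dir = spill_dir
        self.limit_bytes = limit_bytes
        self.buffer = []           # items currently held in RAM
        self.buffered_bytes = 0
        self.spill_count = 0       # number of rack files written so far

    def add(self, item):
        self.buffer.append(item)
        self.buffered_bytes += len(item.encode("utf-8"))
        if self.buffered_bytes >= self.limit_bytes:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        path = os.path.join(self.spill_dir, f"rack_{self.spill_count}.txt")
        with open(path, "w", encoding="utf-8") as f:
            f.write("\n".join(self.buffer))
        self.spill_count += 1
        self.buffer = []           # free the RAM for the next batch
        self.buffered_bytes = 0


# Usage: feed downloaded posts in, flush the remainder at the end.
rack = RackBuffer(tempfile.mkdtemp(), limit_bytes=64 * 1024)
for post in ("first post text", "second post text"):
    rack.add(post)
rack.flush()   # write whatever is still buffered
```

In a real crawl the `limit_bytes` threshold would be set from the machine's free memory (about 2 GB in my case), and the rack files would later be read back one at a time for processing.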
Finally, a sigh of relief: I accomplished what seemed nearly impossible within 10 days. I am happy I managed to find opinion clusters across 1,704,586 posts with my upcoming algorithm, which I call TDR (Topic Discussion Rank).
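To give a flavour of the "(key, value) pairs along with lists" technique mentioned above: instead of holding every post's full text in RAM, you can keep only a token-to-post-IDs index, which is far smaller than the raw data. The snippet below is a hypothetical miniature of that reduction, not the actual TDR code; the post IDs and texts are invented for illustration.

```python
from collections import defaultdict

# Invented sample data: post ID -> post text.
posts = {
    101: "python crawler design",
    102: "crawler performance tips",
    103: "python tips",
}

# Key = token, value = list of post IDs containing it.
index = defaultdict(list)
for post_id, text in posts.items():
    for token in sorted(set(text.split())):
        index[token].append(post_id)
```

After this pass, `index["crawler"]` holds `[101, 102]`, and the original 19 GB of text can stay on disk while only the compact index lives in RAM.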