Tuesday, July 27, 2010

[Paper]: Revisiting Crawlers’ Role in a Search Engine

The following are the slides from my paper presentation at ICISA 2010 in Seoul, Korea.

This paper considers tradeoffs in web crawler design, especially from the perspective of events versus threads [1,2]. The paper also makes some recommendations for better OS support for web crawling. It points out that the two principal problems with web crawling are:
  • Choosing the right pages to crawl
  • Basic architecture for performing the crawl
The work focuses on the second problem. Our proposition is that events are the ideal way to implement web crawlers, because events give better throughput while crawling the web. Furthermore, we argue that the growing usage of search engines calls for a careful redesign of their constituent parts from the perspective of systems software, and we conclude that the exokernel [3] is the right answer for removing some of the limitations of today's search engines. We recommend that a future operating system be dedicated to search engines.
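To make the events-versus-threads distinction concrete, here is a minimal sketch of an event-driven crawl loop. It is not the implementation from the paper. It assumes Python's asyncio as the event framework, and the names FAKE_WEB, fetch, crawl, and run_crawl are hypothetical. The "web" is a small in-memory link graph and the download is simulated with a short sleep, so the sketch runs without network access.

```python
import asyncio

# Hypothetical in-memory "web": page -> outgoing links (stands in for real HTTP).
FAKE_WEB = {
    "a": ["b", "c"],
    "b": ["c", "d"],
    "c": ["a"],
    "d": [],
}


async def fetch(url):
    """Simulated download that yields to the event loop instead of blocking."""
    await asyncio.sleep(0.01)  # placeholder for network latency
    return FAKE_WEB.get(url, [])


async def crawl(seed, max_in_flight=3):
    """Crawl from seed on a single thread with several fetches in flight."""
    seen = {seed}
    frontier = asyncio.Queue()
    frontier.put_nowait(seed)

    async def worker():
        while True:
            url = await frontier.get()
            try:
                for link in await fetch(url):
                    if link not in seen:
                        seen.add(link)
                        frontier.put_nowait(link)
            finally:
                # Mark this URL done only after its links were queued.
                frontier.task_done()

    workers = [asyncio.create_task(worker()) for _ in range(max_in_flight)]
    await frontier.join()  # returns once every queued URL has been processed
    for w in workers:
        w.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return seen


def run_crawl(seed):
    """Convenience wrapper: run the crawl and return the visited pages, sorted."""
    return sorted(asyncio.run(crawl(seed)))


if __name__ == "__main__":
    print(run_crawl("a"))
```

While one fetch waits on the (simulated) network, the event loop runs the others, so no thread sits blocked per connection. A thread-based crawler would instead dedicate one blocking thread to each in-flight download.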

If you are interested in more details, please contact me by email at atifms@kaist.ac.kr or matifq@yahoo.com. You can also request a copy of the paper by email.

[1] von Behren, R., Condit, J., and Brewer, E. Why Events are a Bad Idea (for High-concurrency Servers). In 9th Workshop on Hot Topics in Operating Systems (HotOS IX), Lihue, Hawaii, May 2003.
[2] Ousterhout, J. Why threads are a bad idea (for most purposes). Invited talk at the 1996 USENIX Annual Technical Conference, San Diego, CA, October 1996.
[3] Engler, D. R., Kaashoek, M. F., and O'Toole, J. 1995. Exokernel: an operating system architecture for application-level resource management. In Proceedings of the Fifteenth ACM Symposium on Operating Systems Principles (Copper Mountain, Colorado, United States, December 03 - 06, 1995). M. B. Jones, Ed. SOSP '95. ACM, New York, NY, 251-266.