Scaling Document Clustering on the Cloud

About the Talk

June 3, 2011 9:30 AM

Microsoft Conference Center Building 33

Cloud computing has gained significant popularity over the past few years and introduces a number of new and compelling capabilities as a computational platform. Beyond well-established benefits such as massively scalable on-demand compute and storage, cloud computing lets us think differently about the problems we are trying to solve. Rather than being constrained by the fixed limits of a target computational platform, we can develop codes and algorithms that make intelligent use of the resources available. A particular code may begin to solve a problem on a modest set of hardware and then, as the incoming data set grows or the solution evolves, issue calls to the underlying cloud infrastructure to allocate an appropriate increase in hardware. The software can then reconfigure its control structures to accommodate the newly acquired hardware and resume solving the problem at hand.
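
The grow-then-reconfigure cycle described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the code discussed in the talk: `SimulatedCloud`, `ElasticJob`, `request_nodes`, and the `docs_per_node` capacity figure are all hypothetical names invented for illustration, and a real vendor's provisioning API would replace the in-memory stand-in.

```python
from dataclasses import dataclass, field


@dataclass
class SimulatedCloud:
    """Hypothetical stand-in for a cloud vendor's provisioning API."""
    max_nodes: int = 16
    allocated: int = 1  # the job starts on one modest node

    def request_nodes(self, count: int) -> int:
        """Grant up to `count` extra nodes and return how many were granted."""
        granted = max(0, min(count, self.max_nodes - self.allocated))
        self.allocated += granted
        return granted


@dataclass
class ElasticJob:
    """Toy job that holds documents in one partition per node."""
    cloud: SimulatedCloud
    docs_per_node: int = 1000  # assumed capacity of a single node
    partitions: list = field(default_factory=lambda: [[]])

    def capacity(self) -> int:
        return len(self.partitions) * self.docs_per_node

    def total_docs(self) -> int:
        return sum(len(p) for p in self.partitions)

    def ingest(self, docs) -> None:
        docs = list(docs)
        needed = self.total_docs() + len(docs)
        if needed > self.capacity():
            # Ceiling division: number of extra nodes needed to fit the data.
            shortfall = -(-(needed - self.capacity()) // self.docs_per_node)
            granted = self.cloud.request_nodes(shortfall)
            self._reconfigure(granted)
        if needed > self.capacity():
            raise RuntimeError("cloud could not supply enough nodes")
        for doc in docs:
            # Place each document on the least-loaded partition.
            min(self.partitions, key=len).append(doc)

    def _reconfigure(self, new_nodes: int) -> None:
        """Rebuild the partition layout so it spans the newly acquired nodes."""
        existing = [d for p in self.partitions for d in p]
        self.partitions = [[] for _ in range(len(self.partitions) + new_nodes)]
        for i, doc in enumerate(existing):
            self.partitions[i % len(self.partitions)].append(doc)


job = ElasticJob(SimulatedCloud(max_nodes=4), docs_per_node=10)
job.ingest(range(8))      # fits on the initial node
job.ingest(range(8, 25))  # triggers a request for two more nodes
```

In this toy version, reconfiguration simply redistributes documents round-robin; in a real system this step is where the program's control structures would be rebuilt to use the new machines.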

This talk will discuss early progress in applying the above techniques to a document clustering algorithm developed at Oak Ridge National Laboratory. The original codes used to solve this problem rely on a memory-resident non-binary tree, which limits the problem size to the amount of physical RAM in the machine. The application of the cloud and the techniques described in this talk allows the algorithm to span multiple machines in a self-scaling, fault-tolerant manner. This talk will present initial results, lessons learned, and future work, as well as brief comments on using various cloud vendors with the same code base.
