Distributed and Scalable PCA in the Cloud

Arun Kumar; Vijay Narayanan; Nikos Karampatziakis; Paul Mineiro; Markus Weimer

MSR-TR-2014-165 | January 2014

Published by Microsoft

Apache REEF Research Paper

Principal Component Analysis (PCA) is a popular technique with many applications. Recent randomized PCA algorithms scale to datasets with many examples but hit a bottleneck when the number of features is also large. We propose to mitigate this issue by composing structured and unstructured randomness within a randomized PCA algorithm. Initial experiments on a large graph dataset from Twitter show promising results. We demonstrate the scalability of our algorithm by implementing it on both Hadoop and a more flexible platform named REEF.
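
The abstract does not spell out the algorithm, so the following is only a minimal single-machine sketch of the general idea, not the report's implementation. It assumes that the "structured" randomness is a count-sketch style feature hashing and the "unstructured" randomness is a small dense Gaussian matrix, and it plugs their composition into a standard randomized range finder. The names `count_sketch`, `randomized_pca`, `sketch_dim`, and `oversample` are hypothetical and chosen for illustration.

```python
import numpy as np


def count_sketch(X, k, rng):
    """Structured randomness (assumed here): hash each of the d feature
    columns into one of k buckets with a random sign and sum them."""
    n, d = X.shape
    buckets = rng.integers(0, k, size=d)       # bucket index per feature
    signs = rng.choice([-1.0, 1.0], size=d)    # random sign per feature
    sketched_t = np.zeros((k, n))
    # Unbuffered scatter-add: row j of (X * signs).T goes into buckets[j].
    np.add.at(sketched_t, buckets, (X * signs).T)
    return sketched_t.T                        # shape (n, k)


def randomized_pca(X, n_components, sketch_dim=None, oversample=10, seed=0):
    """Randomized PCA whose test matrix is (count sketch) x (Gaussian).

    The dense Gaussian matrix is only k x r rather than d x r, which is the
    point of composing the two kinds of randomness when d is very large.
    """
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=0)                    # PCA works on centered data
    r = n_components + oversample              # oversampled target rank
    k = sketch_dim or 4 * r                    # hypothetical default
    G = rng.standard_normal((k, r))            # unstructured randomness
    Y = count_sketch(Xc, k, rng) @ G           # (n, r) sample of the range
    Q, _ = np.linalg.qr(Y)                     # orthonormal basis for Y
    B = Q.T @ Xc                               # small (r, d) projection
    _, s, Vt = np.linalg.svd(B, full_matrices=False)
    return Vt[:n_components], s[:n_components]


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Synthetic rank-3 data: 200 examples, 500 features.
    X = rng.standard_normal((200, 3)) @ rng.standard_normal((3, 500))
    components, singular_values = randomized_pca(X, n_components=3)
    print(components.shape, singular_values)
```

In a distributed setting, the products `count_sketch(Xc, k, rng) @ G` and `Q.T @ Xc` are the steps that would be computed over partitions of the rows of `X`; how the report maps them onto Hadoop or REEF is not stated in the abstract.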