Distributed and Scalable PCA in the Cloud
- Arun Kumar
- Vijay Narayanan
- Nikos Karampatziakis
- Paul Mineiro
- Markus Weimer
MSR-TR-2014-165
Published by Microsoft
Apache REEF Research Paper
Principal Component Analysis (PCA) is a popular technique with many applications. Recent randomized PCA algorithms scale to large datasets but face a bottleneck when the number of features is also large. We propose to mitigate this issue by composing structured and unstructured randomness within a randomized PCA algorithm. Initial experiments on a large graph dataset from Twitter show promising results. We demonstrate the scalability of our algorithm by implementing it on both Hadoop and REEF, a more flexible platform.
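
The abstract does not spell out how the two kinds of randomness are composed, so the following is only an illustrative sketch under stated assumptions, not the authors' algorithm. It assumes the structured step is a CountSketch-style feature hashing (each feature is mapped to one of `m` buckets with a random sign) and the unstructured step is a dense Gaussian projection, plugged into a standard single-pass randomized range finder. The function names `count_sketch` and `randomized_pca` and the parameters `m` and `oversample` are hypothetical. The point of the composition in this sketch is that the dense random matrix has `m` rows rather than one row per feature.

```python
import numpy as np


def count_sketch(X, m, rng):
    """Structured randomness (assumed here to be CountSketch):
    hash each of the d columns of X into one of m buckets with a random sign."""
    n, d = X.shape
    buckets = rng.integers(0, m, size=d)      # bucket index per feature
    signs = rng.choice([-1.0, 1.0], size=d)   # random sign per feature
    sketched_t = np.zeros((m, n))
    # np.add.at accumulates correctly when several features share a bucket.
    np.add.at(sketched_t, buckets, (X * signs).T)
    return sketched_t.T                       # shape (n, m)


def randomized_pca(X, k, m, oversample=10, seed=0):
    """Approximate the top-k principal directions of X (n samples x d features).

    Assumes n >= k + oversample and m >= k + oversample.
    """
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=0)                   # center the features
    l = k + oversample
    # Unstructured randomness: dense Gaussian matrix of size m x l, not d x l.
    G = rng.standard_normal((m, l))
    Y = count_sketch(Xc, m, rng) @ G          # (n, l) sample of the column space
    Q, _ = np.linalg.qr(Y)                    # orthonormal basis for that sample
    B = Q.T @ Xc                              # (l, d) projected data
    _, s, Vt = np.linalg.svd(B, full_matrices=False)
    return Vt[:k], s[:k]                      # principal directions, singular values
```

In this sketch, `m` trades accuracy against the size of the dense projection: a larger `m` means fewer hash collisions but a larger Gaussian matrix. A distributed version would compute `Y` and `B` as sums over row partitions of `X`, which is the kind of aggregation that platforms such as Hadoop and REEF support.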