Deep Learning on Massively Parallel System


Parallelization of Convolutional Neural Networks (CNNs) has been studied extensively in recent years. A case study of parallelized CNNs using general-purpose computing on GPUs (GPGPU) and the Message Passing Interface (MPI) has been published. By contrast, little effort has gone into studying the scalability of parallelized CNNs on multi-core CPUs. We explore how the performance of the CNN training process changes as the number of computing cores and threads increases. Detailed experiments on state-of-the-art multi-core processors using the OpenMP and MPI frameworks demonstrate that Caffe-based CNNs are successfully accelerated by well-designed multi-threaded programs. We also discuss a better way to exhibit the performance of multi-threaded CNNs by comparing three different implementations.