Abstract. We present several novel methods for improving the memory locality of the Sequentially Truncated Higher-Order Singular Value Decomposition (ST-HOSVD) algorithm for computing the Tucker decomposition. We show how the two primary computational bottlenecks of the ST-HOSVD can be fused into a single kernel to improve memory locality. We then extend matrix tiling techniques to tensors to further improve cache utilization. This block-based approach is in turn leveraged to drastically reduce the auxiliary memory requirements of the algorithm. We demonstrate the effectiveness of our approach by benchmarking the traditional ST-HOSVD kernels against our single fused kernel, and then compare single-node CPU runtimes of an ST-HOSVD implementation using our optimizations with those of TuckerMPI. We observe a $\sim2\times$ speedup over the existing state of the art for dense, high-rank tensors, while increasing the problem size that can be computed for a given memory allocation by $\sim3\times$. Finally, we apply our approach to a motivating example: compressing Homogeneous Charge Compression Ignition (HCCI) and Stat Planar (SP) simulation datasets.
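For orientation, the following is a minimal NumPy sketch of the textbook, unfused and unblocked ST-HOSVD, not the fused kernel described in the abstract. It assumes the common Gram-matrix formulation, in which each mode requires a Gram matrix computation followed by a tensor-times-matrix (TTM) product; that these are the two bottlenecks the abstract refers to is an assumption. The function names `unfold`, `st_hosvd`, and `reconstruct` are illustrative only.

```python
import numpy as np


def unfold(tensor, mode):
    """Mode-n unfolding: rows index `mode`, columns index all other modes."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)


def mode_product(tensor, matrix, mode):
    """TTM: multiply `tensor` along `mode` by `matrix` (shape: new_dim x old_dim)."""
    contracted = np.tensordot(matrix, tensor, axes=(1, mode))
    return np.moveaxis(contracted, 0, mode)


def st_hosvd(tensor, ranks):
    """Reference ST-HOSVD (assumed Gram-based variant).

    Returns the core tensor and one orthonormal factor matrix per mode.
    """
    core = tensor
    factors = []
    for mode, rank in enumerate(ranks):
        unfolded = unfold(core, mode)
        # Step 1: Gram matrix of the current (already truncated) core.
        gram = unfolded @ unfolded.T
        # eigh returns eigenvalues in ascending order; keep the `rank` largest.
        _, eigvecs = np.linalg.eigh(gram)
        factor = eigvecs[:, ::-1][:, :rank]
        factors.append(factor)
        # Step 2: TTM with factor^T shrinks the core along this mode
        # before the next mode is processed (the "sequential truncation").
        core = mode_product(core, factor.T, mode)
    return core, factors


def reconstruct(core, factors):
    """Expand a Tucker decomposition back into a full tensor."""
    out = core
    for mode, factor in enumerate(factors):
        out = mode_product(out, factor, mode)
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((20, 30, 25))
    core, factors = st_hosvd(x, ranks=(5, 6, 4))
    approx = reconstruct(core, factors)
    print("core shape:", core.shape)
    print("relative error:", np.linalg.norm(x - approx) / np.linalg.norm(x))
```

In this sketch the Gram step and the TTM step each make a full pass over the current core tensor, which is the kind of repeated memory traffic that fusing the two steps would be expected to reduce.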