Data augmentation is a widely used technique and an essential component of recent advances in self-supervised representation learning. By preserving similarity between augmented views of the data, the resulting representations can improve a variety of downstream analytics and deliver state-of-the-art performance in many applications. To demystify the role of data augmentation, we develop a statistical framework on low-dimensional product manifolds to understand theoretically why unlabeled augmented data lead to useful representations. Within this framework, we propose a new representation learning method, extended invariant manifold learning, and develop a corresponding loss function that can be paired with deep neural networks to learn data representations. Compared with existing methods, the new representation simultaneously exploits the geometric structure of the manifold and the invariance properties of the augmented data. Our theoretical analysis precisely characterizes how representations learned from augmented data improve k-nearest-neighbor classifiers in downstream analysis, and shows that more complex data augmentation leads to further improvement. Finally, numerical experiments on simulated and real datasets support the theoretical results of this paper.
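To make the augmentation-invariance idea concrete, here is a minimal sketch of the general recipe the abstract describes: encode two independently augmented views of each sample and penalize the distance between their representations. The `augment` function, the linear `encode` stand-in for a deep network, and all parameter names are illustrative assumptions, not the loss or architecture from the paper itself.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(x, noise_scale=0.05):
    # Illustrative augmentation (assumption): a small Gaussian perturbation.
    # Real pipelines use crops, flips, color jitter, etc.
    return x + noise_scale * rng.normal(size=x.shape)

def encode(x, W):
    # Toy linear encoder standing in for a deep neural network.
    return x @ W

def invariance_loss(z1, z2):
    # Mean squared distance between representations of two augmented views;
    # minimizing it encourages invariance to the augmentation.
    return float(np.mean(np.sum((z1 - z2) ** 2, axis=1)))

# Unlabeled data: 128 samples in 10 dimensions, mapped to 4 dimensions.
X = rng.normal(size=(128, 10))
W = rng.normal(size=(10, 4))
z1 = encode(augment(X), W)
z2 = encode(augment(X), W)
loss = invariance_loss(z1, z2)
```

In practice `W` would be the weights of a deep network trained by gradient descent on this loss (typically combined with a term that prevents representation collapse), and the learned representation would then feed a simple downstream model such as a k-nearest-neighbor classifier.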
