In recent years, deep network pruning has attracted significant attention as a way to enable the rapid deployment of AI on small devices with computation and memory constraints. Pruning is often achieved by dropping redundant weights, neurons, or layers of a deep network while attempting to retain comparable test performance. Many deep pruning algorithms have been proposed with impressive empirical success. However, existing approaches lack a quantifiable measure to estimate the compressibility of a sub-network during each pruning iteration and thus may under-prune or over-prune the model. In this work, we propose the PQ Index (PQI) to measure the potential compressibility of deep neural networks and use it to develop a Sparsity-informed Adaptive Pruning (SAP) algorithm. Our extensive experiments corroborate the hypothesis that for a generic pruning procedure, PQI decreases first when a large model is being effectively regularized, then increases when its compressibility reaches a limit that appears to correspond to the onset of underfitting, and decreases again when model collapse sets in and the model's performance deteriorates significantly. Additionally, our experiments demonstrate that the proposed adaptive pruning algorithm, with a proper choice of hyper-parameters, is superior to iterative pruning algorithms such as lottery-ticket-based methods, in terms of both compression efficiency and robustness.
Deep model pruning
Sparsity index
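A minimal sketch of the sparsity measure, assuming the PQ Index takes the form I_{p,q}(w) = 1 − d^(1/q − 1/p) ‖w‖_p / ‖w‖_q for 0 < p < q, where d is the number of weights; the hyper-parameters used in the paper may differ:

```python
import numpy as np

def pq_index(w, p=1.0, q=2.0):
    # 1 - d^(1/q - 1/p) * ||w||_p / ||w||_q, with 0 < p < q:
    # close to 1 for a very sparse w, exactly 0 for a uniform w.
    w = np.abs(np.ravel(w)).astype(float)
    d = w.size
    norm_p = (w ** p).sum() ** (1.0 / p)
    norm_q = (w ** q).sum() ** (1.0 / q)
    return 1.0 - d ** (1.0 / q - 1.0 / p) * norm_p / norm_q

one_hot = np.zeros(1000); one_hot[0] = 1.0
uniform = np.ones(1000)
print(pq_index(one_hot))  # ~0.97: highly compressible
print(pq_index(uniform))  # 0.0: no redundancy to prune
```

In an adaptive pruning loop, such an index could set the per-iteration pruning ratio in place of a fixed schedule.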
Current biotechnologies can simultaneously measure multiple high-dimensional modalities (e.g., RNA, DNA accessibility, and protein) from the same cells. A combination of different analytical tasks (e.g., multi-modal integration and cross-modal analysis) is required to comprehensively understand such data and to infer how gene regulation drives biological diversity and function. However, current analytical methods are designed to perform a single task, providing only a partial picture of the multi-modal data. Here, we present UnitedNet, an explainable multi-task deep neural network capable of integrating different tasks to analyze single-cell multi-modality data. Applied to various multi-modality datasets (e.g., Patch-seq, multiome ATAC + gene expression, and spatial transcriptomics), UnitedNet demonstrates similar or better accuracy in multi-modal integration and cross-modal prediction compared with state-of-the-art methods. Moreover, by dissecting the trained UnitedNet with explainable machine learning algorithms, we can directly quantify the relationship between gene expression and other modalities with cell-type specificity. UnitedNet is a comprehensive end-to-end framework that could be broadly applicable to single-cell multi-modality biology. This framework has the potential to facilitate the discovery of cell-type-specific regulation kinetics across transcriptomics and other modalities.
AI for healthcare
Deep learning
Single-cell biology
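A hedged PyTorch sketch of the multi-task structure described above: two modality-specific encoders feed a shared latent space that serves both an integration (cell-typing) head and cross-modal prediction decoders. All layer sizes, the fusion rule, and the loss weighting are illustrative assumptions, not the published architecture:

```python
import torch
import torch.nn as nn

class MultiTaskNet(nn.Module):
    # Hypothetical two-modality network in the spirit of UnitedNet.
    def __init__(self, dim_a, dim_b, latent=32, n_types=10):
        super().__init__()
        self.enc_a = nn.Sequential(nn.Linear(dim_a, 128), nn.ReLU(), nn.Linear(128, latent))
        self.enc_b = nn.Sequential(nn.Linear(dim_b, 128), nn.ReLU(), nn.Linear(128, latent))
        self.cls = nn.Linear(latent, n_types)    # integration / cell-typing task
        self.dec_ab = nn.Linear(latent, dim_b)   # predict modality B from A
        self.dec_ba = nn.Linear(latent, dim_a)   # predict modality A from B

    def forward(self, xa, xb):
        za, zb = self.enc_a(xa), self.enc_b(xb)
        logits = self.cls((za + zb) / 2)         # fused representation
        return logits, self.dec_ab(za), self.dec_ba(zb)

# joint training objective over the tasks (weights are illustrative):
# loss = ce(logits, y) + mse(xb_hat, xb) + mse(xa_hat, xa)
```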
Continual learning (CL) is an emerging research area aiming to emulate human learning throughout a lifetime. Most existing CL approaches primarily focus on mitigating catastrophic forgetting, a phenomenon where performance on old tasks declines while learning new ones. However, human learning involves not only re-learning knowledge but also quickly recognizing the current environment, recalling related knowledge, and refining it for improved performance. In this work, we introduce a new problem setting, Adaptive CL, which captures these aspects in an online, recurring task environment without explicit task boundaries or identities. We propose the LEARN algorithm to efficiently explore, recall, and refine knowledge in such environments. We provide theoretical guarantees from two perspectives: online prediction with tight regret bounds and asymptotic consistency of knowledge. Additionally, we present a scalable implementation that requires only first-order gradients for training deep learning models. Our experiments demonstrate that the LEARN algorithm is highly effective in exploring, recalling, and refining knowledge in adaptive CL environments, resulting in superior performance compared to competing methods.
Continual learning
Online streaming data
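As a loose illustration only (the paper's actual procedure and guarantees are more involved), one explore/recall/refine step over a pool of candidate models might look like the following, where the pool, the spawn threshold, and the SGD refinement are all assumptions:

```python
import copy
import torch

def learn_step(pool, batch, loss_fn, spawn_threshold=2.0, lr=1e-2):
    # Hypothetical explore/recall/refine loop inspired by the abstract;
    # not the paper's exact algorithm.
    x, y = batch
    with torch.no_grad():
        losses = [loss_fn(m(x), y).item() for m in pool]  # recall: score stored knowledge
    best = min(range(len(pool)), key=losses.__getitem__)
    if losses[best] > spawn_threshold:                    # explore: unfamiliar environment?
        pool.append(copy.deepcopy(pool[best]))
        best = len(pool) - 1
    model = pool[best]
    opt = torch.optim.SGD(model.parameters(), lr=lr)      # refine: first-order update only
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    opt.step()
    return best
```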
Collaboration among multiple organizations, such as financial institutions, medical centers, and retail markets, in decentralized settings is crucial to providing improved service and performance. However, the underlying organizations may have little interest in sharing their local data, models, and objective functions. These requirements have created new challenges for multi-organization collaboration. In this work, we propose Gradient Assisted Learning (GAL), a new method for multiple organizations to assist each other in supervised learning tasks without sharing local data, models, and objective functions. In this framework, all participants collaboratively optimize the aggregate of local loss functions, and each participant autonomously builds its own model by iteratively fitting the gradients of the overarching objective function. We also provide an asymptotic convergence analysis and practical case studies of GAL. Experimental studies demonstrate that GAL can achieve performance close to centralized learning, in which all data, models, and objective functions are fully disclosed.
Assisted learning
Privacy
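A minimal sketch of the gradient-fitting idea, specialized to squared loss so the gradient of the overarching objective reduces to a residual; the tree learners, the learning rate, and the averaging rule are illustrative choices, and each organization touches only its own feature block:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gal_fit(feature_blocks, y, rounds=20, lr=0.5):
    # Each organization fits only its own feature block to the gradient of the
    # shared objective; raw features are never exchanged between organizations.
    f = np.full(len(y), y.mean())   # current aggregate prediction
    models = []
    for _ in range(rounds):
        grad = y - f                # negative gradient of 0.5 * (y - f)^2
        round_models = [DecisionTreeRegressor(max_depth=3).fit(X, grad)
                        for X in feature_blocks]   # each block stays with its owner
        # aggregate the organizations' gradient fits into the shared prediction
        f = f + lr * np.mean([m.predict(X) for m, X in
                              zip(round_models, feature_blocks)], axis=0)
        models.append(round_models)
    return models, f
```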
We propose a new architecture for distributed image compression from a group of distributed data sources. The work is motivated by practical needs of data-driven codec design, low power consumption, robustness, and data privacy. The proposed architecture, which we refer to as Distributed Recurrent Autoencoder for Scalable Image Compression (DRASIC), is able to train distributed encoders and one joint decoder on correlated data sources. Its compression capability is much better than that of training codecs separately. Meanwhile, the performance of our distributed system with 10 distributed sources is within only 2 dB peak signal-to-noise ratio (PSNR) of the performance of a single codec trained with all data sources. We experiment with distributed sources of varying degrees of correlation and show how well our data-driven methodology matches the Slepian-Wolf theorem of Distributed Source Coding (DSC). To the best of our knowledge, this is the first data-driven DSC framework for general distributed code design with deep learning.
Codecs
Data compression
Image coding
Recurrent neural nets
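A hypothetical sketch of the distributed-encoder / joint-decoder layout (the actual DRASIC uses recurrent convolutional codecs with binarized codes; plain convolutions and an MSE loss are used here purely for brevity):

```python
import torch
import torch.nn as nn

def make_encoder():
    # one lightweight encoder per data source
    return nn.Sequential(nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
                         nn.Conv2d(32, 8, 3, stride=2, padding=1))

# one shared decoder reconstructs codes from every source
decoder = nn.Sequential(nn.ConvTranspose2d(8, 32, 4, stride=2, padding=1), nn.ReLU(),
                        nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1))
encoders = [make_encoder() for _ in range(10)]  # 10 distributed sources

def reconstruction_loss(batches):
    # each source encodes its own data; the joint decoder serves them all
    return sum(nn.functional.mse_loss(decoder(enc(x)), x)
               for enc, x in zip(encoders, batches))
```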
Motivated by the ever-increasing demands for limited communication bandwidth and low power consumption, we propose a new methodology, named joint Variational Autoencoders with Bernoulli mixture models (VAB), for performing clustering in the compressed data domain. The idea is to reduce the data dimension with Variational Autoencoders (VAEs) and group the data representations with Bernoulli mixture models. Once jointly trained for compression and clustering, the model can be decomposed into two parts: a data vendor that encodes the raw data into compressed data, and a data consumer that classifies the received (compressed) data. In this way, the data vendor benefits from data security and communication bandwidth, while the data consumer benefits from low computational complexity. To enable training with the gradient descent algorithm, we propose to use the Gumbel-Softmax distribution to resolve the infeasibility of backpropagating through categorical samples.
Unsupervised learning
Variational autoencoder
Bernoulli Mixture Model
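The Gumbel-Softmax step is the concrete mechanism here: it gives a differentiable relaxation of sampling a mixture component, so the clustering head can be trained end-to-end by gradient descent. A minimal PyTorch illustration (sizes and the stand-in loss are arbitrary):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(16, 5, requires_grad=True)  # 16 latent codes, 5 mixture components
soft_assign = F.gumbel_softmax(logits, tau=0.5)              # relaxed one-hot, differentiable
hard_assign = F.gumbel_softmax(logits, tau=0.5, hard=True)   # straight-through one-hot
loss = soft_assign.pow(2).sum()   # stand-in for the clustering objective
loss.backward()                   # gradients flow despite the categorical draw
```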
Recurrent Neural Network (RNN) and its variations, such as Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU), have become standard building blocks for learning online data of sequential nature in many research areas, including natural language processing and speech data analysis. In this paper, we present a new methodology to significantly reduce the number of parameters in RNNs while maintaining performance that is comparable or even better than classical RNNs.
The new proposal, referred to as Restricted Recurrent Neural Network (RRNN), restricts the weight matrices corresponding to the input data and hidden states at each time step to share a large proportion of parameters. The new architecture can be regarded as a compression of its classical counterpart, but it does not require pre-training or sophisticated parameter fine-tuning, both of which are major issues in most existing compression techniques. Experiments on natural language modeling show that, compared with its classical counterpart, the restricted recurrent architecture generally produces comparable results at about a 50% compression rate. In particular, the Restricted LSTM can outperform the classical RNN with even fewer parameters.
Recurrent neural network
Long short-term memory
Gated recurrent unit
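One hypothetical way to realize such parameter sharing in a vanilla RNN cell, assuming the input and hidden dimensions match so a single matrix W can serve both maps, with only a diagonal correction left free; the paper's exact tying scheme for LSTM/GRU gates may differ:

```python
import torch
import torch.nn as nn

class RestrictedRNNCell(nn.Module):
    # Illustrative restriction: input-to-hidden and hidden-to-hidden maps
    # share one weight matrix W; only a per-unit residual r remains free,
    # so nearly half of the cell's parameters are eliminated.
    def __init__(self, dim):
        super().__init__()
        self.W = nn.Parameter(torch.randn(dim, dim) * 0.1)  # shared parameters
        self.r = nn.Parameter(torch.zeros(dim))             # small free residual
        self.b = nn.Parameter(torch.zeros(dim))

    def forward(self, x, h):
        # hidden weights = shared W plus a diagonal correction diag(r)
        return torch.tanh(x @ self.W.T + h @ (self.W + torch.diag(self.r)).T + self.b)
```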
Motivated by Fisher divergence, we present a new set of information quantities, which we refer to as gradient information. These measures serve as surrogates for classical information measures, such as those based on logarithmic loss, Kullback-Leibler divergence, and directed Shannon information, in many data-processing scenarios of interest, and often provide a significant computational advantage, improved stability, and robustness. As an example, we apply these measures to the Chow-Liu tree algorithm and demonstrate the resulting performance on both synthetic and real data.
Capacity
Fisher divergence
Information
Stability
Chow-Liu tree approximation
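For reference, a standard form of the Fisher divergence motivating these quantities (conventions differ by a factor of 1/2) is

```latex
D_{\mathrm{F}}(p \,\|\, q)
  = \mathbb{E}_{x \sim p}\!\left[\,\bigl\lVert \nabla_x \log p(x) - \nabla_x \log q(x) \bigr\rVert_2^2\,\right].
```

Because it depends on p and q only through their score functions, normalizing constants cancel, which underlies the computational advantage noted above.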
The Bayes factor is a widely used criterion in model comparison, and its logarithm is a difference of out-of-sample predictive scores under the logarithmic scoring rule. However, when some of the candidate models involve vague priors on their parameters, the log-Bayes factor features an arbitrary additive constant that hinders its interpretation. As an alternative, we consider model comparison using the Hyvärinen score. We propose a method to consistently estimate this score for parametric models, using sequential Monte Carlo methods. We show that this score can be estimated for models with tractable likelihoods as well as nonlinear non-Gaussian state-space models with intractable likelihoods. We prove the asymptotic consistency of this new model selection criterion under strong regularity assumptions in the case of non-nested models, and we provide qualitative insights for the nested case. We also use existing characterizations of proper scoring rules on discrete spaces to extend the Hyvärinen score to discrete observations. Our numerical illustrations include Lévy-driven stochastic volatility models and diffusion models for population dynamics. Supplementary materials for this article are available online.
Bayes factor
Noninformative prior
Model selection
Sequential Monte Carlo
State-space model
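For concreteness, the Hyvärinen score of an observation y under a density p with twice-differentiable log-density is

```latex
\mathcal{H}(y; p) = 2\,\Delta_y \log p(y) + \bigl\lVert \nabla_y \log p(y) \bigr\rVert_2^2.
```

Since this involves p only through derivatives of log p(y), any additive constant in the log-density cancels, which is precisely why the arbitrary constant introduced by a vague prior no longer hinders interpretation.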
We develop a methodology (referred to as kinetic prediction) for predicting time series undergoing unknown changes in their data generating distributions. Based on Kolmogorov-Tikhomirov’s ε-entropy, we propose a concept called ε-predictability that quantifies the size of a model class (which can be parametric or nonparametric) and the maximal number of abrupt structural changes that guarantee the achievability of asymptotically optimal prediction. Moreover, for parametric distribution families, we extend the aforementioned kinetic prediction with discretized function spaces to its counterpart with continuous function spaces and propose a sequential Monte Carlo based implementation.
We also extend our methodology for predicting smoothly varying data generating distributions. Under reasonable assumptions, we prove that the average predictive performance converges almost surely to the oracle bound, which corresponds to the case that the data generating distributions are known in advance. The results also shed some light on the so-called “prediction-inference dilemma.” Various examples and numerical results are provided to demonstrate the wide applicability of our methodology.
Change points
Kinetic prediction
ε-entropy
Optimal prediction
Sequential Monte-Carlo
Smooth variations
Online tracking
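For reference, the Kolmogorov-Tikhomirov ε-entropy of a model class F under a given metric is the logarithm of its minimal covering number,

```latex
\mathcal{H}_\varepsilon(\mathcal{F}) = \log N_\varepsilon(\mathcal{F}),
```

where N_ε(F) is the smallest number of balls of radius ε needed to cover F; the ε-predictability introduced above quantifies model-class size through this notion.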
We consider online multimodal data fusion, where the goal is to combine information from multiple modes to identify an element in a large dictionary. We address this problem in object recognition by focusing on tactile sensing as one of the modes. Using a tactile glove with seven sensors, various individuals grasp different objects to obtain 7-D time series, where each component represents the pressure sequence applied to one sensor. The pressure data of all objects is stored in a dictionary as a reference. The objective is to match a streaming vector time series from grasping an unknown object to a dictionary object. We propose an algorithm that may start with prior knowledge provided by other modes. Receiving pressure data sequentially, the algorithm uses a dissimilarity metric to modify the prior and form a probability distribution over the dictionary. When the dictionary objects are dissimilar in shape, we empirically show that our algorithm recognizes the unknown object even with a uniform prior. If the dictionary contains an object similar to the unknown object, our algorithm needs the prior from other modes to detect the unknown object. Notably, our algorithm maintains performance similar to standard offline classification techniques, such as support vector machines, with significantly lower computational time.
Object recognition
Online learning
Tactile sensing
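A self-contained toy sketch of the sequential matching rule: dissimilarities between the streaming grasp readings and each dictionary object reweight a prior into a posterior over the dictionary. The exponential weighting, the constant beta, and the synthetic data are all assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
dictionary = rng.normal(size=(20, 7))   # 20 reference objects, 7 sensors (toy stand-in)
truth = 3
stream = dictionary[truth] + 0.3 * rng.normal(size=(50, 7))  # noisy grasp readings

def update_posterior(prior, dissimilarity, beta=1.0):
    # Convert per-object dissimilarities into a reweighted posterior.
    w = prior * np.exp(-beta * np.asarray(dissimilarity))
    return w / w.sum()

posterior = np.full(len(dictionary), 1.0 / len(dictionary))  # uniform prior
for reading in stream:                      # pressure data arriving sequentially
    d = np.linalg.norm(dictionary - reading, axis=1)
    posterior = update_posterior(posterior, d)
print(posterior.argmax())                   # recovers object 3 in this toy run
```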
We propose a method for adaptive nonlinear sequential modeling of time series data. Data are modeled as a nonlinear function of past values corrupted by noise, and the underlying nonlinear function is assumed to be approximately expandable on a spline basis. We cast the modeling of data as finding a good-fit representation in the linear span of a multidimensional spline basis and use a variant of ℓ1-penalty regularization to reduce the dimensionality of the representation. Using adaptive filtering techniques, we design our online algorithm to automatically tune the underlying parameters based on the minimization of the regularized sequential prediction error. We demonstrate the generality and flexibility of the proposed approach on both synthetic and real-world datasets. Moreover, we analytically investigate the performance of our algorithm by obtaining both bounds on prediction errors and consistency in variable selection.
Adaptive filtering
Data prediction
Nonlinearity
Sequential modeling
Spline
Time series
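A minimal sketch of one sequential update, framed as online proximal gradient descent on the spline coefficients: a gradient step on the squared prediction error followed by soft-thresholding for the ℓ1 penalty. The fixed step size and penalty weight stand in for the paper's adaptive tuning:

```python
import numpy as np

def soft_threshold(v, t):
    # proximal operator of the l1 penalty
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def online_l1_step(theta, phi, y, step=0.05, lam=0.01):
    # phi: spline-basis expansion of past values; y: new observation.
    err = y - phi @ theta                      # sequential prediction error
    theta = theta + step * err * phi           # gradient step on squared error
    return soft_threshold(theta, step * lam)   # l1 shrinkage prunes basis terms
```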
One of the main challenges in identifying structural changes in stochastic processes is to carry out the analysis of time series with dependency structure in a computationally tractable way. Another challenge is that the number of true change points is usually unknown, requiring a suitable model selection criterion to arrive at informative conclusions.
To address the first challenge, we model the data generating process as a segment-wise autoregression, which is composed of several segments (time epochs), each of which is modeled by an autoregressive model. We propose a multi-window method that is both effective and efficient for discovering the structural changes. The proposed approach was motivated by transforming a segment-wise autoregression into a multivariate time series that is asymptotically segment-wise independent and identically distributed. To address the second challenge, we derive theoretical guarantees for (almost surely) selecting the true number of change points of segment-wise independent multivariate time series. Specifically, under mild assumptions, we show that a Bayesian information criterion (BIC)-like criterion gives a strongly consistent selection of the optimal number of change points, while an Akaike information criterion (AIC)-like criterion cannot.
Finally, we demonstrate the theory and strength of the proposed algorithms through experiments on both synthetic and real-world data, including the Eastern U.S. temperature data and the El Niño data. The experiments lead to some interesting discoveries about the temporal variability of summer-time temperature over the Eastern U.S. and about the most dominant factor of ocean influence on climate, which was also discovered by environmental scientists.
Change detection
Information criteria
Large deviation analysis
Strong consistency
Time series
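A simplified sketch of the BIC-style selection: score a candidate segmentation by fitting one least-squares AR model per segment and penalizing the total parameter count; the paper's multi-window search and exact penalty form are not reproduced here:

```python
import numpy as np

def ar_segment_cost(x, order=2):
    # Least-squares AR(order) fit on one segment; returns RSS and sample count.
    # (Segments must be longer than `order` for the fit to be defined.)
    X = np.column_stack([x[order - k - 1: len(x) - k - 1] for k in range(order)])
    y = x[order:]
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    return float(resid @ resid), len(y)

def bic_for_segmentation(x, breakpoints, order=2):
    # BIC-like score of a candidate segmentation of the series x.
    bounds = [0] + list(breakpoints) + [len(x)]
    rss, n = 0.0, 0
    for a, b in zip(bounds[:-1], bounds[1:]):
        r, m = ar_segment_cost(x[a:b], order)
        rss += r; n += m
    k = (len(bounds) - 1) * (order + 1)   # parameters across all segments
    return n * np.log(rss / n) + k * np.log(n)
```

Minimizing this score over the number (and locations) of breakpoints mirrors the strongly consistent BIC-like selection; replacing the log(n) penalty with the AIC constant 2 would give the kind of criterion shown to be inconsistent.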