Towards Understanding Variation-Constrained Deep Neural Networks

AI Foundations, Journal paper
Gen Li, Jie Ding
IEEE Transactions on Signal Processing
Publication year: 2023

Abstract:

Multi-layer feedforward networks have been used to approximate a wide range of nonlinear functions. A fundamental problem is understanding the generalizability of a neural network model through its statistical risk, or the expected test error. In particular, it is important to understand the phenomenon that overparameterized deep neural networks may not suffer from overfitting even when the number of neurons and learning parameters rapidly grows with n or even surpasses n. In this paper, we show that a class of variation-constrained regression neural networks, with arbitrary width, can achieve a near-parametric rate n^{-1/2+δ} for an arbitrarily small positive constant δ, which is equivalent to n^{-1+2δ} under the mean squared error. This rate is also observed in numerical experiments. The result provides insight into the benign overparameterization phenomenon and indicates that the number of trainable parameters may not be a suitable complexity measure, as is often perceived for classical regression models. We also discuss the convergence rate with respect to other network parameters, including the input dimension, the number of layers, and the coefficient norm.
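
For readers who prefer the claimed rate in display form, it can be restated as follows (a restatement of the abstract with constant factors suppressed; here \hat f_n denotes the fitted network and f^* the target regression function):

```latex
% Near-parametric statistical risk of variation-constrained regression
% networks, restated from the abstract (constant factors suppressed):
\big(\mathbb{E}\,\|\hat f_n - f^*\|^2\big)^{1/2} \;\lesssim\; n^{-1/2+\delta}
\quad\Longleftrightarrow\quad
\mathbb{E}\,\|\hat f_n - f^*\|^2 \;\lesssim\; n^{-1+2\delta},
\qquad \text{for an arbitrarily small constant } \delta>0 .
```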

Keywords:

Complexity theory

Deep learning theory

Statistical convergence analysis

Pruning Deep Neural Networks from a Sparsity Perspective

AI Foundations, AI Scalability, Conference paper
E. Diao, G. Wang, J. Zhang, Y. Yang, J. Ding, V. Tarokh
International Conference on Learning Representations (ICLR)
Publication year: 2023

Abstract:

In recent years, deep network pruning has attracted significant attention as a way to enable the rapid deployment of AI on small devices with computation and memory constraints. Pruning is often achieved by dropping redundant weights, neurons, or layers of a deep network while attempting to retain comparable test performance. Many deep pruning algorithms have been proposed with impressive empirical success. However, existing approaches lack a quantifiable measure to estimate the compressibility of a sub-network during each pruning iteration and thus may under-prune or over-prune the model. In this work, we propose the PQ Index (PQI) to measure the potential compressibility of deep neural networks and use it to develop a Sparsity-informed Adaptive Pruning (SAP) algorithm. Our extensive experiments corroborate the hypothesis that, for a generic pruning procedure, PQI first decreases when a large model is being effectively regularized, then increases when its compressibility reaches a limit that appears to correspond to the beginning of underfitting, and finally decreases again when model collapse sets in and the performance of the model deteriorates significantly. Additionally, our experiments demonstrate that the proposed adaptive pruning algorithm, with a proper choice of hyper-parameters, is superior to iterative pruning algorithms such as lottery ticket-based pruning methods, in terms of both compression efficiency and robustness.
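
The sketch below illustrates the general workflow of monitoring a norm-ratio sparsity measure during iterative magnitude pruning. The pq_index function uses one common ℓp/ℓq norm ratio and the pruning loop is a generic stand-in; this is only an illustration of the idea and is not claimed to be the paper's exact PQI definition or the SAP algorithm.

```python
import numpy as np

def pq_index(w, p=1.0, q=2.0):
    """Norm-ratio sparsity measure on a flattened weight vector.

    Illustrative only: 1 - d^(1/q - 1/p) * ||w||_p / ||w||_q lies in [0, 1)
    and grows as the weights become more concentrated (sparser).
    """
    w = np.abs(np.ravel(w))
    d = w.size
    return 1.0 - d ** (1.0 / q - 1.0 / p) * (
        np.linalg.norm(w, ord=p) / np.linalg.norm(w, ord=q)
    )

def iterative_magnitude_prune(weights, rounds=5, frac_per_round=0.2):
    """Generic magnitude-pruning loop that records the sparsity index each
    round, mimicking how one would track compressibility over iterations."""
    w = np.ravel(weights).copy()
    history = []
    for _ in range(rounds):
        nonzero = w[w != 0.0]
        history.append(pq_index(nonzero))
        thresh = np.quantile(np.abs(nonzero), frac_per_round)
        w[np.abs(w) < thresh] = 0.0        # drop the smallest surviving weights
        # (in practice the surviving weights would be retrained here)
    return w, history

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w0 = rng.laplace(size=10_000)          # toy "trained" weights
    _, hist = iterative_magnitude_prune(w0)
    print([round(h, 3) for h in hist])
```

In a real pipeline the per-round pruning fraction would be adapted to the measured index rather than fixed, which is the adaptive element the abstract refers to.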

Keywords:

Deep model pruning

Sparsity index

Provable Identifiability of Two-Layer ReLU Neural Networks via LASSO Regularization

AI Foundations, Journal paper
Gen Li, Ganghua Wang, Jie Ding
IEEE Transactions on Information Theory
Publication year: 2023

Abstract:

LASSO regularization is a popular regression tool to enhance the prediction accuracy of statistical models by performing variable selection through the ℓ1 penalty, initially formulated for the linear model and its variants. In this paper, the territory of LASSO is extended to the neural network model, a fashionable and powerful nonlinear regression model. Specifically, given a neural network whose output y depends only on a small subset of input x, denoted by S, we prove that the LASSO estimator can stably reconstruct the neural network and identify S when the number of samples scales logarithmically with the input dimension. This challenging regime has been well understood for linear models while barely studied for neural networks. Our theory lies in an extended Restricted Isometry Property (RIP)-based analysis framework for two-layer ReLU neural networks, which may be of independent interest to other LASSO or neural network settings. Based on the result, we advocate a neural network-based variable selection method. Experiments on simulated and real-world datasets show promising performance of the variable selection approach compared with existing techniques.
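
Below is a minimal, purely illustrative sketch of the kind of procedure the abstract advocates: fit a two-layer ReLU network with a LASSO-type (ℓ1) penalty on the input-layer weights, then keep the input variables whose weight columns survive the penalty. The synthetic data, penalty level lam, network width, and selection threshold are hypothetical choices for illustration, not the paper's settings.

```python
import torch

# Synthetic data: y depends only on the first 3 of 50 inputs.
torch.manual_seed(0)
n, d, s = 200, 50, 3
X = torch.randn(n, d)
y = torch.relu(X[:, :s]).sum(dim=1, keepdim=True) + 0.1 * torch.randn(n, 1)

# Two-layer ReLU network.
width = 64
net = torch.nn.Sequential(
    torch.nn.Linear(d, width),
    torch.nn.ReLU(),
    torch.nn.Linear(width, 1),
)

lam = 1e-2                                   # hypothetical penalty level
opt = torch.optim.Adam(net.parameters(), lr=1e-2)
for _ in range(2000):
    opt.zero_grad()
    pred = net(X)
    l1 = net[0].weight.abs().sum()           # LASSO-type penalty on input-layer weights
    loss = torch.mean((pred - y) ** 2) + lam * l1
    loss.backward()
    opt.step()

# Select variables whose input-weight column norm survives the penalty.
col_norm = net[0].weight.abs().sum(dim=0)    # one score per input variable
selected = torch.nonzero(col_norm > 0.05 * col_norm.max()).flatten()
print(selected.tolist())                     # ideally recovers {0, 1, 2}
```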

Keywords:

Lasso

Identifiability

Neural network

Nonlinear regression

Variable selection

Information Criteria for Model Selection

AI Foundations, Journal paper
Jiawei Zhang, Yuhong Yang, Jie Ding
WIREs Computational Statistics (invited article)
Publication year: 2023

Abstract:

The rapid development of modeling techniques has brought many opportunities for data-driven discovery and prediction. However, this also leads to the challenge of selecting the most appropriate model for any particular data task. Information criteria, such as the Akaike information criterion (AIC) and Bayesian information criterion (BIC), have been developed as a general class of model selection methods with profound connections with foundational thoughts in statistics and information theory. Many perspectives and theoretical justifications have been developed to understand when and how to use information criteria, which often depend on particular data circumstances. This review article will revisit information criteria by summarizing their key concepts, evaluation metrics, fundamental properties, interconnections, recent advancements, and common misconceptions to enrich the understanding of model selection in general.
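
For concreteness, the two criteria named above take their standard forms AIC = 2k − 2 log L̂ and BIC = k log n − 2 log L̂ for a model with k estimated parameters, maximized likelihood L̂, and sample size n. The snippet below applies them to a toy Gaussian polynomial-regression comparison; the example itself is illustrative and not taken from the article.

```python
import numpy as np

def gaussian_loglik(resid, n):
    """Maximized Gaussian log-likelihood given residuals (variance profiled out)."""
    sigma2 = np.mean(resid ** 2)
    return -0.5 * n * (np.log(2 * np.pi * sigma2) + 1.0)

def aic(loglik, k):
    return 2 * k - 2 * loglik              # Akaike information criterion

def bic(loglik, k, n):
    return k * np.log(n) - 2 * loglik      # Bayesian information criterion

# Example: compare polynomial regression orders by AIC and BIC.
rng = np.random.default_rng(1)
n = 200
x = rng.uniform(-1, 1, n)
y = 1.0 + 2.0 * x - 1.5 * x**2 + 0.3 * rng.standard_normal(n)
for order in range(1, 6):
    coef = np.polyfit(x, y, order)
    resid = y - np.polyval(coef, x)
    ll = gaussian_loglik(resid, n)
    k = order + 2                          # polynomial coefficients + noise variance
    print(order, round(aic(ll, k), 1), round(bic(ll, k, n), 1))
```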

Keywords:

Modeling Methods

Model Selection

Information Theoretic Methods

Exploring Gradient Oscillation in Deep Neural Network Training

AI Foundations, Conference paper
Chedi Morchdi, Yi Zhou, Jie Ding and Bei Wang
59th Annual Allerton Conference on Communication, Control, and Computing (Allerton)
Publication year: 2023

Abstract:

Understanding optimization in deep learning is a fundamental problem. Many previous works argued that gradient descent stably trains deep networks, yet recent work empirically discovered that the training is actually at the edge of stability. In this work, we take one step further toward exploring the instability of gradient descent in training deep networks. Specifically, through training various modern deep networks using gradient descent, we empirically show that most of the optimization progress is achieved with oscillating gradients – gradients that are highly negatively correlated in adjacent iterations. Moreover, we observe that such gradient oscillation (GO) has several fundamental properties: (i) GO appears in different training stages for networks with different architectures; (ii) under a large learning rate, GO is consistently observed across all the layers of the networks; and (iii) under a small learning rate, GO is more substantial in the input layers than in the output layers. Our discoveries suggest that GO is an essential and invariant feature in training different types of neural networks, and may inspire new optimizer designs.
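
A simple way to reproduce this kind of diagnostic is to record the cosine similarity between full-batch gradients at adjacent gradient-descent iterations; persistently negative values correspond to the gradient oscillation (GO) described above. The toy model, data, and learning rate in the sketch are placeholders rather than the paper's experimental setup.

```python
import torch

torch.manual_seed(0)
X = torch.randn(256, 20)
y = torch.randn(256, 1)
model = torch.nn.Sequential(torch.nn.Linear(20, 64), torch.nn.ReLU(),
                            torch.nn.Linear(64, 1))
lr = 0.2                                      # placeholder (largish) step size
prev_grad, cosines = None, []
for step in range(200):
    loss = torch.mean((model(X) - y) ** 2)
    model.zero_grad()
    loss.backward()
    grad = torch.cat([p.grad.flatten() for p in model.parameters()])
    if prev_grad is not None:
        cosines.append(torch.nn.functional.cosine_similarity(
            grad, prev_grad, dim=0).item())
    prev_grad = grad.clone()
    with torch.no_grad():                     # plain full-batch gradient descent
        for p in model.parameters():
            p -= lr * p.grad
print(sum(c < 0 for c in cosines), "of", len(cosines),
      "adjacent-step gradient pairs are negatively correlated")
```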

Keywords:

Deep learning theory

Gradient descent

Learning rate

Assisted Learning for Organizations with Limited Imbalanced Data

AI Foundations, Decentralized AI, Journal paper
Cheng Chen, Jiaying Zhou, Jie Ding, Yi Zhou
Transactions on Machine Learning Research
Publication year: 2023

Abstract:

In the era of big data, many big organizations are integrating machine learning into their work pipelines to facilitate data analysis. However, the performance of their trained models is often restricted by the limited and imbalanced data available to them. In this work, we develop an assisted learning framework that helps organizations improve their learning performance. The organizations have sufficient computation resources but are subject to stringent data-sharing and collaboration policies. Their limited, imbalanced data often cause biased inference and sub-optimal decision-making. In assisted learning, an organizational learner purchases assistance service from an external service provider and aims to enhance its model performance within only a few assistance rounds. We develop effective stochastic training algorithms for both assisted deep learning and assisted reinforcement learning. Different from existing distributed algorithms that need to transmit gradients or models frequently, our framework allows the learner to share information with the service provider only occasionally, yet still obtain a model that achieves near-oracle performance as if all the data were centralized.
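
As a caricature of round-based assistance, the sketch below has a learner and a provider that each hold different features of the same samples and alternately fit the current residuals, exchanging only residual vectors once per round. This is a heavily simplified stand-in under those assumptions; it is not the assisted deep learning or assisted reinforcement learning algorithms developed in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
X_learner = rng.normal(size=(n, 3))            # features the learner owns
X_provider = rng.normal(size=(n, 3))           # features only the provider owns
y = X_learner @ [1.0, -2.0, 0.5] + X_provider @ [0.8, 0.0, -1.2] \
    + 0.1 * rng.normal(size=n)

def least_squares_fit(X, t):
    return np.linalg.lstsq(X, t, rcond=None)[0]

prediction = np.zeros(n)
for round_ in range(3):                        # a few assistance rounds
    # The learner fits its own features to the current residual, then shares
    # only the updated residual (no raw features, gradients, or models).
    resid = y - prediction
    prediction += X_learner @ least_squares_fit(X_learner, resid)
    resid = y - prediction
    prediction += X_provider @ least_squares_fit(X_provider, resid)
    print(f"round {round_}: MSE = {np.mean((y - prediction) ** 2):.4f}")
```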

Keywords:

Assisted Learning

Imbalanced Data

Targeted Cross-Validation

AI Foundations, Journal paper
Jiawei Zhang, Jie Ding, Yuhong Yang
Bernoulli
Publication year: 2022

Abstract:

In many applications, we have access to the complete dataset but are only interested in the prediction of a particular region of predictor variables. A standard approach is to find the globally best modeling method from a set of candidate methods. However, it is perhaps rare in reality that one candidate method is uniformly better than the others. A natural approach for this scenario is to apply a weighted L2 loss in performance assessment to reflect the region-specific interest. We propose targeted cross-validation (TCV) to select models or procedures based on a general weighted L2 loss. We show that the TCV is consistent in selecting the best performing candidate under the weighted L2 loss. Experimental studies are used to demonstrate the use of TCV and its potential advantage over the global CV or the approach of using only local data for modeling a local region.

Previous investigations on CV have relied on the condition that when the sample size is large enough, the ranking of two candidates stays the same. However, in many applications with the setup of changing data-generating processes or highly adaptive modeling methods, the relative performance of the methods is not static as the sample size varies. Even with a fixed data-generating process, it is possible that the ranking of two methods switches infinitely many times. In this work, we broaden the concept of selection consistency by allowing the best candidate to switch as the sample size varies, and then establish the consistency of the TCV. This flexible framework can be applied to high-dimensional and complex machine learning scenarios where the relative performances of modeling procedures are dynamic.
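
A small sketch of the selection rule described above: score each candidate procedure by cross-validated squared error weighted toward the region of interest and pick the candidate with the smallest score. The candidates, weight function, and fold scheme below are illustrative choices only.

```python
import numpy as np

def targeted_cv_score(fit_predict, X, y, weight_fn, n_folds=5, seed=0):
    """Cross-validated weighted-L2 loss of one candidate procedure.

    fit_predict(X_train, y_train, X_test) -> predictions on X_test.
    weight_fn(X) -> nonnegative weights emphasizing the region of interest.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, n_folds)
    total, weight_sum = 0.0, 0.0
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        pred = fit_predict(X[train], y[train], X[test])
        w = weight_fn(X[test])
        total += np.sum(w * (y[test] - pred) ** 2)
        weight_sum += np.sum(w)
    return total / weight_sum

# Two toy candidates: a global linear fit versus a quadratic fit.
def linear(Xtr, ytr, Xte):
    c = np.polyfit(Xtr[:, 0], ytr, 1)
    return np.polyval(c, Xte[:, 0])

def quadratic(Xtr, ytr, Xte):
    c = np.polyfit(Xtr[:, 0], ytr, 2)
    return np.polyval(c, Xte[:, 0])

rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(400, 1))
y = np.sin(X[:, 0]) + 0.2 * rng.standard_normal(400)

def region_weight(X):
    return (np.abs(X[:, 0]) < 0.5).astype(float)   # emphasize the region near 0

scores = {f.__name__: targeted_cv_score(f, X, y, region_weight)
          for f in (linear, quadratic)}
print(scores)   # TCV picks whichever candidate is better near x = 0
```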

Keywords:

Consistency

Cross-validation

Model selection

Regression

SemiFL: Communication Efficient Semi-Supervised Federated Learning with Unlabeled Clients

AI Foundations, Conference paper, Decentralized AI
Enmao Diao, Jie Ding, Vahid Tarokh
Conference on Neural Information Processing Systems (NeurIPS)
Publication year: 2022

Abstract:

Federated Learning allows training machine learning models by using the computation and private data resources of many distributed clients such as smartphones and IoT devices. Most existing works on Federated Learning (FL) assume the clients have ground-truth labels. However, in many practical scenarios, clients may be unable to label task-specific data, e.g., due to a lack of expertise. This work considers a server that hosts a labeled dataset and wishes to leverage clients with unlabeled data for supervised learning. We propose a new Federated Learning framework referred to as SemiFL to address Semi-Supervised Federated Learning (SSFL). In SemiFL, clients have completely unlabeled data, while the server has a small amount of labeled data. SemiFL is communication efficient since it separates the training of server-side supervised data and client-side unsupervised data. We demonstrate several strategies of SemiFL that enhance efficiency and prediction and develop intuitions for why they work. In particular, we provide a theoretical understanding of the use of strong data augmentation for Semi-Supervised Learning (SSL), which can be interesting in its own right. Extensive empirical evaluations demonstrate that our communication efficient method can significantly improve the performance of a labeled server with unlabeled clients. Moreover, we demonstrate that SemiFL can outperform many existing SSFL methods and perform competitively with state-of-the-art FL and centralized SSL results. For instance, in standard communication efficient scenarios, our method achieves 93% accuracy on the CIFAR10 dataset with only 4,000 labeled samples at the server. Such accuracy is only 2% away from the result trained with 50,000 fully labeled samples, and it improves by about 30% upon existing SSFL methods in the communication efficient setting.
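
The strong-augmentation ingredient highlighted above is illustrated below with a FixMatch-style, confidence-thresholded pseudo-labeling step that a client holding only unlabeled data could run. It is a sketch of that single ingredient under toy assumptions (random inputs, noise-based augmentations, a linear model); the full SemiFL pipeline, including server-side supervised training and aggregation, is not shown.

```python
import torch
import torch.nn.functional as F

def client_unsupervised_step(model, x_unlabeled, weak_aug, strong_aug,
                             optimizer, threshold=0.95):
    """One FixMatch-style update on a client with only unlabeled data.

    Pseudo-labels come from weakly augmented inputs; the loss is applied to
    strongly augmented inputs, but only where the model is confident.
    """
    with torch.no_grad():
        probs = F.softmax(model(weak_aug(x_unlabeled)), dim=1)
        conf, pseudo = probs.max(dim=1)
        mask = (conf >= threshold).float()
    logits = model(strong_aug(x_unlabeled))
    loss = (F.cross_entropy(logits, pseudo, reduction="none") * mask).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item(), mask.mean().item()

# Toy usage with random "images" and simple noise-based augmentations.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 8 * 8, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x = torch.randn(32, 3, 8, 8)
weak = lambda t: t + 0.01 * torch.randn_like(t)
strong = lambda t: t + 0.20 * torch.randn_like(t)
print(client_unsupervised_step(model, x, weak, strong, opt))
```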

Keywords:

Federated Learning

Semi-Supervised Learning

Data augmentation theory

Unlabeled data

Interval Privacy: A Framework for Privacy-Preserving Data Collection

AI Foundations, AI Safety, Journal paper
Jie Ding, Bangjun Ding
IEEE Transactions on Signal Processing
Publication year: 2022

Abstract:

The emerging public awareness and government regulations of data privacy motivate new paradigms of collecting and analyzing data that are transparent and acceptable to data owners. We present a new concept of privacy and the corresponding data formats, mechanisms, and theories for privatizing data during data collection. The privacy notion, named interval privacy, requires the conditional distribution of the raw data given the privatized data to be the same as its unconditional distribution over a nontrivial support set. Correspondingly, the proposed privacy mechanism records each data value as a random interval (or, more generally, a range) containing it. The proposed interval privacy mechanisms can be easily deployed through survey-based data collection interfaces, e.g., by asking a respondent whether its data value is within a randomly generated range. Another unique feature of interval mechanisms is that they obfuscate the truth but do not perturb it. Using a narrowed range to convey information is complementary to the popular paradigm of perturbing data. Also, the interval mechanisms can generate progressively refined information at the discretion of individuals, naturally leading to privacy-adaptive data collection. We develop several aspects of the theory, including composition, robustness, distribution estimation, and regression learning from interval-valued data. Interval privacy provides a new perspective on human-centric data privacy in which individuals have a perceptible, transparent, and simple way of sharing sensitive data.
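
A minimal sketch of the survey-style mechanism described above: each response reports only a randomly generated range that contains the true value, and additional questions progressively narrow the range. The bounds, number of questions, and uniform cut distribution are illustrative assumptions, not prescriptions from the paper.

```python
import numpy as np

def interval_response(x, lower=0.0, upper=100.0, rng=None):
    """Record a random interval that contains the private value x.

    The mechanism draws a random cut point and reports only which side of it
    the value falls on, so the respondent answers a single yes/no question
    and the truth is coarsened rather than perturbed.
    """
    rng = np.random.default_rng() if rng is None else rng
    cut = rng.uniform(lower, upper)
    return (lower, cut) if x <= cut else (cut, upper)

def refined_response(x, lower=0.0, upper=100.0, questions=3, rng=None):
    """Progressively refined intervals: each extra question narrows the range,
    mirroring the privacy-adaptive data collection described above."""
    rng = np.random.default_rng() if rng is None else rng
    for _ in range(questions):
        lower, upper = interval_response(x, lower, upper, rng)
    return lower, upper

rng = np.random.default_rng(0)
print(interval_response(37.2, rng=rng))           # a random range containing 37.2
print(refined_response(37.2, questions=3, rng=rng))
```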

Keywords:

Data collection
Data privacy
Interval mechanism
Local privacy


L1 Regularization in Two-Layer Neural Networks

AI Foundations, Journal paper
Gen Li, Yuantao Gu, Jie Ding
IEEE Signal Processing Letters
Publication year: 2021

Abstract:

A crucial problem of neural networks is to select an architecture that strikes appropriate tradeoffs between underfitting and overfitting. This work shows that ℓ1 regularization for two-layer neural networks can control the generalization error and sparsify the input dimension. In particular, with an appropriate ℓ1 regularization on the output layer, the network can achieve a tight statistical risk. Moreover, an appropriate ℓ1 regularization on the input layer leads to a risk bound that does not involve the input data dimension. The results also indicate that training a wide neural network with suitable regularization provides an alternative bias-variance tradeoff to selecting from a candidate set of neural networks. Our analysis is based on a new integration of dimension-based and norm-based complexity analysis to bound the generalization error.

Keywords:

Generalization error
Regularization
Statistical risk
Two-layer neural network

On Statistical Efficiency in Learning

AI Foundations, Journal paper
Jie Ding, Enmao Diao, Jiawei Zhou, Vahid Tarokh
IEEE Transactions on Information Theory, 67(4): 2488—2506
Publication year: 2021

Abstract:

A central issue of many statistical learning problems is to select an appropriate model from a set of candidate models. Large models tend to inflate the variance (e.g., overfitting), while small models tend to cause biases (e.g., underfitting) for a given fixed dataset. In this work, we address the critical challenge of model selection to strike a balance between model fitting and model complexity, thus gaining reliable predictive power. We consider the task of approaching the theoretical limit of statistical learning, meaning that the selected model has the predictive performance that is as good as the best possible model given a class of potentially misspecified candidate models.

We propose a generalized notion of Takeuchi's information criterion and prove that the proposed method can asymptotically achieve the optimal out-sample prediction loss under reasonable assumptions. To the best of our knowledge, this is the first proof of this asymptotic property of Takeuchi's information criterion. Our proof applies to a wide variety of nonlinear models, loss functions, and high dimensionality (in the sense that the models' complexity can grow with the sample size). The proposed method can be used as a computationally efficient surrogate for leave-one-out cross-validation. Moreover, for modeling streaming data, we propose an online algorithm that sequentially expands the model complexity to enhance selection stability and reduce computation cost. Experimental studies show that the proposed method has desirable predictive power and much lower computational cost than some popular methods.
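
For context, the classical form of Takeuchi's information criterion, whose generalization the abstract refers to, is commonly written as follows (a standard statement, not a quotation from the paper):

```latex
% Takeuchi's information criterion (TIC).  J is the (average) Hessian of the
% negative log-likelihood and V the (average) outer product of score vectors,
% both evaluated at the maximum likelihood estimate:
\mathrm{TIC} \;=\; -2\log \hat L \;+\; 2\,\mathrm{tr}\!\big(\hat J^{-1}\hat V\big),
\qquad
\hat J = -\frac{1}{n}\sum_{i=1}^n \nabla^2_\theta \log f(x_i;\hat\theta),
\quad
\hat V = \frac{1}{n}\sum_{i=1}^n \nabla_\theta \log f(x_i;\hat\theta)\,
          \nabla_\theta \log f(x_i;\hat\theta)^{\!\top}.
```

When the model is correctly specified, tr(Ĵ⁻¹V̂) reduces to the number of parameters, so TIC recovers AIC; the misspecified case is where the trace penalty matters.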

Keywords:

Cross-validation
Expert learning
Feature selection
Limit of learning
Model expansion

Is a Classification Procedure Good Enough? A Goodness-of-Fit Assessment Tool for Classification Learning

AI Foundations, Journal paper
Jiawei Zhang, Jie Ding, Yuhong Yang
Journal of the American Statistical Association
Publication year: 2021

Abstract:

In recent years, many non-traditional classification methods, such as random forests, boosting, and neural networks, have been widely used in applications. Their performance is typically measured in terms of classification accuracy. While the classification error rate and the like are important, they do not address a fundamental question: is the classification method underfitted? To the best of our knowledge, there is no existing method that can assess the goodness of fit of a general classification procedure. Indeed, the lack of a parametric assumption makes it challenging to construct proper tests. To overcome this difficulty, we propose a methodology called BAGofT that splits the data into a training set and a validation set. First, the classification procedure to assess is applied to the training set, which is also used to adaptively find a data grouping that reveals the most severe regions of underfitting. Then, based on this grouping, we calculate a test statistic by comparing the estimated success probabilities with the actual observed responses from the validation set. The data splitting guarantees that the size of the test is controlled under the null hypothesis, and the power of the test goes to one as the sample size increases under the alternative hypothesis. For testing parametric classification models, the BAGofT has a broader scope than existing methods since it is not restricted to specific parametric models (e.g., logistic regression). Extensive simulation studies show the utility of the BAGofT when assessing general classification procedures and its strengths over some existing methods when testing parametric classification models.
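
The schematic below mirrors the structure of the test described above: split the data, fit the classifier under assessment on the training half, group the validation half, and compare predicted success probabilities with observed outcomes through a Pearson-type statistic. The grouping here is a simple quantile binning of predicted probabilities and the chi-square calibration is a rough stand-in; BAGofT's adaptively learned partition and its exact statistic differ.

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 2))
p_true = 1 / (1 + np.exp(-(X[:, 0] + X[:, 0] * X[:, 1])))   # has an interaction
y = rng.binomial(1, p_true)

# Split: fit the procedure under assessment on the training half.
train, val = np.arange(n // 2), np.arange(n // 2, n)
clf = LogisticRegression().fit(X[train], y[train])           # no interaction term

# Group the validation set (here: quantile bins of predicted probability).
p_hat = clf.predict_proba(X[val])[:, 1]
bins = np.quantile(p_hat, np.linspace(0, 1, 11))
groups = np.clip(np.digitize(p_hat, bins[1:-1]), 0, 9)

# Pearson-type statistic: observed vs. expected successes per group.
stat = 0.0
for g in range(10):
    m = groups == g
    exp_g, obs_g = p_hat[m].sum(), y[val][m].sum()
    var_g = np.sum(p_hat[m] * (1 - p_hat[m]))
    stat += (obs_g - exp_g) ** 2 / max(var_g, 1e-8)
p_value = 1 - stats.chi2.cdf(stat, df=10)
print(f"test statistic = {stat:.2f}, approximate p-value = {p_value:.4f}")
```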

Keywords:

Goodness-of-fit test

Classification procedure

Adaptive partition

Fisher Auto-Encoders

AI Foundations, Conference paper
Khalil Elkhalil, Ali Hasan, Jie Ding, Sina Farsiu, Vahid Tarokh
International Conference on Artificial Intelligence and Statistics (AISTATS)
Publication year: 2021

Abstract:

It has been conjectured that the Fisher divergence is more robust to model uncertainty than the conventional Kullback-Leibler (KL) divergence. This motivates the design of a new class of robust generative auto-encoders (AEs) referred to as Fisher auto-encoders. Our approach is to design Fisher AEs by minimizing the Fisher divergence between the intractable joint distribution of observed data and latent variables and that of the postulated/modeled joint distribution. In contrast to KL-based variational AEs (VAEs), the Fisher AE can exactly quantify the distance between the true and the model-based posterior distributions. Qualitative and quantitative results are provided on both the MNIST and CelebA datasets, demonstrating the competitive performance of Fisher AEs in terms of robustness compared with other AEs such as VAEs and Wasserstein AEs.
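
For reference, the two divergences contrasted above have the following standard definitions, stated here for completeness (up to normalization conventions) and not quoted from the paper:

```latex
% Fisher divergence between the data distribution p and the model q:
D_{\mathrm{F}}(p \,\|\, q) \;=\;
\mathbb{E}_{x\sim p}\Big[\,\big\|\nabla_x \log p(x) - \nabla_x \log q(x)\big\|_2^2\Big],
% compared with the Kullback-Leibler divergence used by variational AEs:
D_{\mathrm{KL}}(p \,\|\, q) \;=\; \mathbb{E}_{x\sim p}\big[\log p(x) - \log q(x)\big].
```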

Evolutionary Spectra Based on the Multitaper Method with Application to Stationarity Test

AI Foundations, Journal paper
Yu Xiang, Jie Ding, Vahid Tarokh
IEEE Transactions on Signal Processing, 67(5): 1353—1365, 2019
Publication year: 2019

Abstract:

In this work, we propose a new inference procedure for understanding non-stationary processes under the framework of evolutionary spectra developed by Priestley. Among various frameworks of non-stationary modeling, the distinguishing feature of the evolutionary spectra is its focus on the physical meaning of frequency. The classical estimate of the evolutionary spectral density is based on a double-window technique consisting of a short-time Fourier transform and a smoothing step. However, smoothing is known to suffer from the so-called bias leakage problem. By incorporating Thomson's multitaper method, which was originally designed for stationary processes, we propose an improved estimate of the evolutionary spectral density and analyze its bias/variance/resolution tradeoff. As an application of the new estimate, we further propose a non-parametric rank-based stationarity test and provide various experimental studies.
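
A minimal sketch of the double-window idea with Thomson tapers: slide a short window along the series and, within each window, average the periodograms of several DPSS-tapered copies instead of smoothing a single short-time spectrum. The window length, time-bandwidth product NW, taper count, and the toy drifting-frequency signal are illustrative choices, not the settings analyzed in the paper.

```python
import numpy as np
from scipy.signal import windows

def multitaper_spectrogram(x, win_len=256, step=64, nw=3.0, n_tapers=5):
    """Sliding-window multitaper estimate of a time-varying spectrum:
    within each window, average the periodograms of several DPSS-tapered
    copies of the segment instead of smoothing a single tapered spectrum."""
    tapers = windows.dpss(win_len, NW=nw, Kmax=n_tapers)   # (n_tapers, win_len)
    starts = list(range(0, len(x) - win_len + 1, step))
    spec = np.empty((len(starts), win_len // 2 + 1))
    for i, s in enumerate(starts):
        seg = x[s:s + win_len]
        eigenspectra = np.abs(np.fft.rfft(tapers * seg, axis=1)) ** 2
        spec[i] = eigenspectra.mean(axis=0)                # average over tapers
    return spec                                            # (time, frequency)

# Toy non-stationary signal: a sinusoid whose frequency drifts, plus noise.
rng = np.random.default_rng(0)
t = np.arange(20_000)
x = np.sin(2 * np.pi * (0.05 + 0.05 * t / t[-1]) * t) + 0.5 * rng.standard_normal(t.size)
S = multitaper_spectrogram(x)
print(S.shape)   # (number of windows, number of frequency bins)
```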

Keywords:

Non-stationary processes
Evolutionary spectra
Spectral analysis
Multitaper method
Stationarity test

Model Selection Techniques: An Overview

AI Foundations, Journal paper
Jie Ding, Vahid Tarokh, Yuhong Yang
IEEE Signal Processing Magazine (featured article), 35(6): 16—34, 2018
Publication year: 2018

Abstract:

In the era of big data, analysts usually explore various statistical models or machine-learning methods for observed data to facilitate scientific discoveries or gain predictive power. Whatever data and fitting procedures are employed, a crucial step is to select the most appropriate model or method from a set of candidates. Model selection is the central ingredient in data analysis for reliable and reproducible statistical inference or prediction, and thus it is central to scientific studies in such fields as ecology, economics, engineering, finance, political science, biology, and epidemiology.

There has been a long history of model selection techniques that arise from research in statistics, information theory, and signal processing. A considerable number of methods have been proposed, following different philosophies and exhibiting varying performance. The purpose of this article is to provide a comprehensive overview of them in terms of their motivation, large-sample performance, and applicability. We provide integrated and practically relevant discussions on the theoretical properties of state-of-the-art model selection approaches. We also share our thoughts on some controversial views on the practice of model selection.

Keywords:

Akaike information criterion
Bayesian information criterion
Bridge criterion
Cross-validation
Model preference

Bridging AIC and BIC: A New Criterion for Autoregression

AI Foundations, Journal paper
Jie Ding, Vahid Tarokh, Yuhong Yang
IEEE Transactions on Information Theory, 64(6): 4024—4043, 2018
Publication year: 2018

Abstract:

To address order selection for an autoregressive model fitted to time series data, we propose a new information criterion. It has the benefits of the two well-known model selection techniques, the Akaike information criterion and the Bayesian information criterion. When the data are generated from a finite-order autoregression, the Bayesian information criterion is known to be consistent, and so is the new criterion. When the true order is infinity or suitably high with respect to the sample size, the Akaike information criterion is known to be efficient in the sense that its predictive performance is asymptotically equivalent to the best offered by the candidate models; in this case, the new criterion behaves in a similar manner. Different from the two classical criteria, the proposed criterion adaptively achieves either consistency or efficiency, depending on the underlying data-generating process. In practice, where the observed time series is given without any prior information about the model specification, the proposed order selection criterion is more flexible and reliable than classical approaches. Numerical results are presented to demonstrate the adaptivity of the proposed technique when applied to various datasets.

Keywords:

Adaptivity
Akaike information criterion
Asymptotic efficiency
Bayesian information criterion
Bridge criterion
Consistency
Information criterion
Model selection

Analysis of Multistate Autoregressive Models

AI Foundations, Journal paper
Jie Ding, Shahin Shahrampour, Kathryn Heal, Vahid Tarokh
IEEE Transactions on Signal Processing, 66(9): 2429—2440, 2018
Publication year: 2018

Abstract:

In this work, we consider the inference problem for a wide class of time-series models, referred to as multistate autoregressive models. The time series that we consider are composed of multiple epochs, each modeled by an autoregressive process. The number of epochs is unknown, and the transitions of states follow a Markov process of unknown order. We propose an inference strategy that enables reliable and efficient offline analysis of this class of time series. The inference is carried out through a three-step approach: detecting the structural changes of the time series using a recently proposed multiwindow algorithm, identifying each segment as a state and selecting the most appropriate number of states, and estimating the Markov source based upon the symbolic sequence obtained from previous steps. We provide theoretical results and algorithms in order to facilitate the inference procedure described above. We demonstrate the accuracy, efficiency, and wide applicability of the proposed algorithms via an array of experiments using synthetic and real-world data.

Keywords:

Multi-regime models
Prediction
Recurring patterns
Time series

Optimal Variable Selection in Regression Models

AI Foundations, Manuscript
Jie Ding, Vahid Tarokh, Yuhong Yang
Manuscript, 2016
Publication year: 2016

Abstract:

We introduce a new criterion for variable selection in regression models and show its optimality in terms of both loss and risk under appropriate assumptions. The key idea is to impose a penalty that is nonlinear in the model dimension. In contrast to state-of-the-art model selection criteria such as the Cp method, delete-1 or delete-k cross-validation, the Akaike information criterion, and the Bayesian information criterion, the proposed method is able to achieve asymptotic loss and risk efficiency in both parametric and nonparametric regression settings, giving new insights into the reconciliation of two types of classical criteria with different asymptotic behaviors. The adaptivity and wide applicability of the new criterion are demonstrated by several numerical experiments. Unless the signal-to-noise ratio is very low, it performs better than some popular methods in our experimental study. An R package, 'bc', is released as a supplement to this work.

Keywords:

Regression
Subset selection
Feature selection

Data-Driven Learning of the Number of States in Multi-State Autoregressive Models

AI Foundations, Conference paper
Jie Ding, Mohammad Noshad, Vahid Tarokh
Allerton Conference on Communication, Control, and Computing, 2015
Publication year: 2015

Abstract:

In this work, we consider the class of multi-state autoregressive processes that can be used to model non-stationary time-series of interest. In order to capture different autoregressive (AR) states underlying an observed time series, it is crucial to select the appropriate number of states. We propose a new model selection technique based on the Gap statistics, which uses a null reference distribution on the stable AR filters to check whether adding a new AR state will significantly improve the performance of the model. To that end, we define a new distance measure between AR filters and propose an efficient method to generate random stable filters that are uniformly distributed in the coefficient space. Numerical results are provided to evaluate the performance of the proposed approach.
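
One standard way to generate random stable AR filters, consistent with the Levinson-Durbin keyword below, is to draw reflection (partial autocorrelation) coefficients in (-1, 1) and map them to AR coefficients with the step-up recursion, as sketched here. Note that uniform reflection coefficients do not generally induce a uniform distribution in the coefficient space, so this illustrates the machinery rather than the paper's exact sampling scheme.

```python
import numpy as np

def reflection_to_ar(k):
    """Levinson-Durbin step-up recursion: map reflection coefficients
    k_1,...,k_p in (-1, 1) to AR coefficients phi of a stable AR(p) model
    x_t = phi_1 x_{t-1} + ... + phi_p x_{t-p} + e_t."""
    a = np.array([], dtype=float)           # coefficients of the monic error filter
    for k_m in k:
        a = np.concatenate([a + k_m * a[::-1], [k_m]])
    return -a                               # AR coefficients phi = -a

def random_stable_ar(order, rng):
    """Draw a random stable AR filter by sampling reflection coefficients."""
    return reflection_to_ar(rng.uniform(-0.99, 0.99, size=order))

rng = np.random.default_rng(0)
phi = random_stable_ar(4, rng)
# Stability check: all roots of z^p - phi_1 z^{p-1} - ... - phi_p lie
# strictly inside the unit circle.
roots = np.roots(np.concatenate([[1.0], -phi]))
print(phi, bool(np.max(np.abs(roots)) < 1.0))
```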

Keywords:

Levinson-Durbin recursion
Polyhedra
Uniform distribution over Autoregressions