#### BSc at Moscow Institute of Physics and Technology

Department of Control and Applied Mathematics

Thesis: Accelerated Directional Searchs and Gradient-Free Methods with non-Euclidean prox-structure

I am a first year student of MS program at Moscow Institute of Physics and Technology, Department of Control and Applied Mathematics. I work under supervision of professor Alexander Gasnikov. My research interests include Stochastic Optimization and its applications to Machine Learning.

ResearchGate

arXiv

Google scholar

Download my CV

My e-mail: ed-gorbunov at yandex dot ru

- Operating systems: Microsoft Windows, Linux, Mac OSX
- Programming languages: Python, LaTeX, C, C++

- Russian (native)
- English (advanced)

- Football: 9 years in football school in Rybinsk, Russia. Now I am playing for an amateur team and a student team.
- Table tennis, fitness

Sep 2014 - Jul 2018

Department of Control and Applied Mathematics

Thesis: Accelerated Directional Searchs and Gradient-Free Methods with non-Euclidean prox-structure

Sep 2018 - Present

Department of Control and Applied Mathematics

**[23 June, 2019]**(in Russian):

New book out**"Lecture Notes on Stochastic Processes"**- joint work with Alexander Gasnikov, Sergey Guz, Elena Chernousova, Maksim Shirobokov, Egor Shulgin.

---------------------------------------------------------**[30 May, 2019]****New paper out: "A Stochastic Derivative Free Optimization Method with Momentum" -**joint work with Adel Bibi, Ozan Sener, El Houcine Bergou and Peter Richtárik.**Abstract:***We consider the problem of unconstrained minimization of a smooth objective function in R^d in setting where only function evaluations are possible. We propose and analyze stochastic zeroth-order method with heavy ball momentum. In particular, we propose, SMTP, a momentum version of the stochastic three-point method (STP). We show new complexity results for non-convex, convex and strongly convex functions. We test our method on a collection of learning to continuous control tasks on several MuJoCo environments with varying difficulty andcompare against STP, other state-of-the-art derivative-free optimization algorithms and against policy gradient methods. SMTP significantly outperforms STP and all other methods that we considered in our numerical experiments. Our second contribution is SMTP with importance sampling which we call SMTP_IS. We provide convergence analysis of this method for non-convex, convex and strongly convex objectives.***[27 May, 2019]****New paper out: "A Unified Theory of SGD: Variance Reduction,****Sampling, Quantization and Coordinate Descent"**- joint work with Filip Hanzely and Peter Richtárik.**Abstract:***In this paper we introduce a unified analysis of a large family of variants of proximal stochastic gradient descent (SGD) which so far have required different intuitions, convergence analyses, have different applications, and which have been developed separately in various communities. We show that our framework includes methods with and without the following tricks, and their combinations: variance reduction, importance sampling, mini-batch sampling, quantization, and coordinate sub-sampling. As a by-product, we obtain the first unified theory of SGD and randomized coordinate descent (RCD) methods, the first unified theory of variance reduced and non-variance-reduced SGD methods, and the first unified theory of quantized and non-quantized methods. A key to our approach is a parametric assumption on the iterates and stochastic gradients. In a single theorem we establish a linear convergence result under this assumption and strong-quasi convexity of the loss function. Whenever we recover an existing method as a special case, our theorem gives the best known complexity result. Our approach can be used to motivate the development of new useful methods, and offers pre-proved convergence guarantees. To illustrate the strength of our approach, we develop five new variants of SGD, and through numerical experiments demonstrate some of their properties.*

---------------------------------------------------------**[10 April, 2019]**The award includes scholarship (total amount - 350,000 Russian rubles), insternship offer at Yandex.Research, travel grant to attend one international conference and personal mentor from Yandex.

I got the Ilya Segalovich Award - Yandex scientific scholarship!

[video] [about award] [about winners]

---------------------------------------------------------**[23 March, 2019]**joint work with Darina Dvinskikh, Alexander Gasnikov, Pavel Dvurechensky and Cesar A. Uribe.

New paper out: "On Dual Approach for Distributed Stochastic Convex Optimization over Networks" -

Abstract:*We introduce dual stochastic gradient oracle methods for distributed stochastic convex optimization problems over networks. We estimate the complexity of the proposed method in terms of probability of large deviations. This analysis is based on a new technique that allows to bound the distance between the iteration sequence and the solution point. By the proper choice of batch size, we can guarantee that this distance equals (up to a constant) to the distance between the starting point and the solution.*

---------------------------------------------------------**[10 February, 2019]****New paper out:**"Stochastic Three Points Method for Unconstrained Smooth Minimization" - joint work with El Houcine Bergou and Peter Richtárik.

Abstract:*In this paper we consider the unconstrained minimization problem of a smooth function in R^n in a setting where only function evaluations are possible. We design a novel randomized derivative-free algorithm --- the stochastic three points (STP) method --- and analyze its iteration complexity. At each iteration, STP generates a random search direction according to a certain fixed probability law. Our assumptions on this law are very mild: roughly speaking, all laws which do not concentrate all measure on any halfspace passing through the origin will work. For instance, we allow for the uniform distribution on the sphere and also distributions that concentrate all measure on a positive spanning set.**Given a current iterate x, STP compares the objective function at three points: x, x+αs and x−αs, where α>0 is a stepsize parameter and s is the random search direction. The best of these three points is the next iterate. We analyze the method STP under several stepsize selection schemes (fixed, decreasing, estimated through finite differences, etc). We study non-convex, convex and strongly convex cases.**We also propose a parallel version for STP, with iteration complexity bounds which do not depend on the dimension n.*arXiv: 1901.09269

---------------------------------------------------------**[08 February, 2019]**

I am a reviewer for ICML 2019! This is my first time being a reviewer for such a conference.

---------------------------------------------------------**[26 January, 2019]**"Distributed learning with compressed gradient differences" - joint work with Konstantin Mishchenko, Martin Takáč and Peter Richtárik.

New paper out:

Abstract:*Training very large machine learning models requires a distributed computing approach, with communication of the model updates often being the bottleneck. For this reason, several methods based on the compression (e.g., sparsification and/or quantization) of the updates were recently proposed, including QSGD (Alistarh et al., 2017), TernGrad (Wen et al., 2017), SignSGD (Bernstein et al., 2018), and DQGD (Khirirat et al., 2018). However, none of these methods are able to learn the gradients, which means that they necessarily suffer from several issues, such as the inability to converge to the true optimum in the batch mode, inability to work with a nonsmooth regularizer, and slow convergence rates. In this work we propose a new distributed learning method - DIANA - which resolves these issues via compression of gradient differences. We perform a theoretical analysis in the strongly convex and nonconvex settings and show that our rates are vastly superior to existing rates. Our analysis of block quantization and differences between l2 and l∞ quantization closes the gaps in theory and practice. Finally, by applying our analysis technique to TernGrad, we establish the first convergence rate for this method.*

arXiv: 1901.09269

---------------------------------------------------------**[13 January - 24 February, 2019]**I am visiting Peter Richtárik at KAUST.

---------------------------------------------------------**[05 September, 2018]**

The paper "Stochastic spectral and conjugate descent methods" - joint work with Dmitry Kovalev, Elnur Gasanov and Peter Richtárik - accepted to The Thirty-second Annual Conference on Neural Information Processing Systems (NIPS).

Abstract:*The state-of-the-art methods for solving optimization problems in big dimensions are variants of randomized coordinate descent (RCD). In this paper we introduce a fundamentally new type of acceleration strategy for RCD based on the augmentation of the set of coordinate directions by a few spectral or conjugate directions. As we increase the number of extra directions to be sampled from, the rate of the method improves, and interpolates between the linear rate of RCD and a linear rate independent of the condition number. We develop and analyze also inexact variants of these methods where the spectral and conjugate directions are allowed to be approximate only. We motivate the above development by proving several negative results which highlight the limitations of RCD with importance sampling.*

The conference will be in Montreal, Canada during December 3-8, 2018.

---------------------------------------------------------**[01 July - 06 July, 2018]**

I am attending the 23rd International Symposium on Mathematical Programming in Bordeaux, France. My talk is scheduled on Friday, 06 July, in the section*"New methods for stochastic optimization and variational inequalities".*

The Talk:*”An Accelerated Directional Derivative Method for Smooth Stochastic Convex Optimization”*

[slides]

---------------------------------------------------------**[27 June - 28 June, 2018]**

I am in Grenoble, attending the Grenoble Optimization Days 2018. By the way I have the new most favorite photo!

Here I read the excellent book by my advisor Alexander Gasnikov.

------------------------------------------------------------

------------------------------------------------------------

**[25 June, 2018]**

Today I successfully defended my Bachelor diploma.

Thesis: "Accelerated Directional Searchs and Gradient-Free Methods with non-Euclidean prox-structure"

---------------------------------------------------------**[10 June - 15 June, 2018]**

I am in Traditional Youth School ”Control, Information and Optimization” (Voronovo, Russia) organized by Boris Polyak and Elena Gryazina.

I presented a poster*"An Accelerated Directional Derivative Method for SmoothStochastic Convex Optimization".*My work was chosen and I gave a talk there. I won third prize for this talk in competitions of best talks among participants.

[slides of the talk]

---------------------------------------------------------**[14 April, 2018]**

I am attending workshop*”Optimization at Work”*at MIPT (Moscow, Russia) with a talk*”An Accelerated Method for Derivative-Free Smooth Stochastic Convex Optimization”*.

[slides] [video]

---------------------------------------------------------**[10 April, 2018]****New paper out:**"On the upper bound for the mathematical expectation of the norm of a vector uniformly distributed on the sphere and the phenomenon of concentration of uniform measure on the sphere" - joint work with Evgeniya Vorontsova and Alexander Gasnikov.

Abstract:*We considered the problem of obtaining upper bounds for the mathematical expectation of the q-norm (2⩽q⩽∞) of the vector which is uniformly distributed on the unit Euclidean sphere.*

---------------------------------------------------------**[8 April, 2018]**An Accelerated Directional Derivative Method for Smooth Stochastic Convex Optimization" - joint work with Pavel Dvurechensky and Alexander Gasnikov.

New paper out: "

Abstract:*We consider smooth stochastic convex optimization problems in the context of algorithms which are based on directional derivatives of the objective function. This context can be considered as an intermediate one between derivative-free optimization and gradient-based optimization. We assume that at any given point and for any given direction, a stochastic approximation for the directional derivative of the objective function at this point and in this direction is available with some additive noise. The noise is assumed to be of an unknown nature, but bounded in the absolute value. We underline that we consider directional derivatives in any direction, as opposed to coordinate descent methods which use only derivatives in coordinate directions. For this setting, we propose a non-accelerated and an accelerated directional derivative method and provide their complexity bounds. Despite that our algorithms do not use gradient information, our non-accelerated algorithm has a complexity bound which is, up to a factor logarithmic in problem dimension, similar to the complexity bound of gradient-based algorithms. Our accelerated algorithm has a complexity bound which coincides with the complexity bound of the accelerated gradient-based algorithm up to a factor of square root of the problem dimension, whereas for existing directional derivative methods this factor is of the order of problem dimension. We also extend these results to strongly convex problems. Finally, we consider derivative-free optimization as a particular case of directional derivative optimization with noise in the directional derivative and obtain complexity bounds for non-accelerated and accelerated derivative-free methods. Complexity bounds for these algorithms inherit the gain in the dimension dependent factors from our directional derivative methods.*

---------------------------------------------------------**[25 February, 2018]****New paper out:**"An Accelerated Method for Derivative-Free Smooth Stochastic Convex Optimization" - joint work with Pavel Dvurechensky and Alexander Gasnikov.

Abstract:*We consider an unconstrained problem of minimization of a smooth convex function which is only available through noisy observations of its values, the noise consisting of two parts. Similar to stochastic optimization problems, the first part is of a stochastic nature. On the opposite, the second part is an additive noise of an unknown nature, but bounded in the absolute value. In the two-point feedback setting, i.e. when pairs of function values are available, we propose an accelerated derivative-free algorithm together with its complexity analysis. The complexity bound of our derivative-free algorithm is only by a factor of \sqrt{n} larger than the bound for accelerated gradient-based algorithms, where n is the dimension of the decision variable. We also propose a non-accelerated derivative-free algorithm with a complexity bound similar to the stochastic-gradient-based algorithm, that is, our bound does not have any dimension-dependent factor. Interestingly, if the solution of the problem is sparse, for both our algorithms, we obtain better complexity bound if the algorithm uses a 1-norm proximal setup, rather than the Euclidean proximal setup, which is a standard choice for unconstrained problems.*

---------------------------------------------------------**[10 February, 2018]****New paper out**: "Stochastic spectral and conjugate descent methods" - joint work with Dmitry Kovalev, Elnur Gasanov and Peter Richtárik.

Abstract:*The state-of-the-art methods for solving optimization problems in big dimensions are variants of randomized coordinate descent (RCD). In this paper we introduce a fundamentally new type of acceleration strategy for RCD based on the augmentation of the set of coordinate directions by a few spectral or conjugate directions. As we increase the number of extra directions to be sampled from, the rate of the method improves, and interpolates between the linear rate of RCD and a linear rate independent of the condition number. We develop and analyze also inexact variants of these methods where the spectral and conjugate directions are allowed to be approximate only. We motivate the above development by proving several negative results which highlight the limitations of RCD with importance sampling.*

---------------------------------------------------------**[9 February, 2018]**

I started my final semester as a Bachelor student today. This spring I will be a tutor for the course "Algorithms and Models of Computation".

---------------------------------------------------------**[5 February - 7 February, 2018]**

I am attending KAUST Research Workshop on Optimization and Big Data with poster*”Stochastic Spectral Descent Methods”*based on our joint work with Dmitry Kovalev, Elnur Gasanov and Peter Richtárik.

---------------------------------------------------------**[14 January - 08 February, 2018]**

I am visiting Peter Richtárik at KAUST.

---------------------------------------------------------**[25 November, 2017]**Today I got a prize for the best talk in Section of Information Transmission Problems, Data Analysis and Optimization, 60th Scientific Conference of MIPT.

The talk:*”About accelerated Directional Search with non-Euclidean prox-structure”*

[slides]

---------------------------------------------------------**[27 October, 2017]**

I am attending workshop ”Optimization at Work” at MIPT (Moscow, Russia) with a talk*”Accelerated Directional Search with non-Euclidean prox-structure”.*[slides]*---------------------------------------------------------***[30 September, 2017]****New paper out:**"Accelerated Directional Search with non-Euclidean prox-structure" - joint work with Evgeniya Vorontsova and Alexander Gasnikov.

Abstract:*In the paper we propose an accelerated directional search method with non-euclidian prox-structure. We consider convex unconstrained optimization problem in R^n. For simplicity we start from the zero point. We expect in advance that 1-norm of the solution is close enough to its 2-norm. In this case the standard accelerated Nesterov's directional search method can be improved. In the paper we show how to make Nesterov's method n-times faster (up to a log n-factor) in this case. The basic idea is to use linear coupling, proposed by Allen-Zhu & Orecchia in 2014, and to make Grad-step in 2-norm, but Mirr-step in 1-norm. We show that for constrained optimization problem this approach stable upon an obstacle.*

---------------------------------------------------------**[29 July, 2017]**I am happy to announce that I started working with Peter Richtárik as a member of his research group at MIPT.