#### BSc at Moscow Institute of Physics and Technology

Department of Control and Applied Mathematics

Thesis: Accelerated Directional Searchs and Gradient-Free Methods with non-Euclidean prox-structure

I am a second year student of MS program at Moscow Institute of Physics and Technology, Phystech School of Applied Mathematics and Informatics. I work under supervision of professor Alexander Gasnikov. Also, from August 2017 to October 2019 I was a member of the research group at MIPT led by Peter Richtárik. I still actively work with Peter and informally Peter is also my scientific advisor. My research interests include Stochastic Optimization and its applications to Machine Learning. I am also a co-organizer of the research seminar on Optimization at MIPT.**Selected awards: **

-** **The Ilya Segalovich Award 2019 (**highly selective**; 350 000 russian rubles, internship offer at Yandex.Research, travel grant to attend one international conference)

- Huawei scholarship for bachelor and master students at MIPT (125 000 russian rubles)

ResearchGate

arXiv

Google scholar

Download my CV

My main e-mail: eduard.gorbunov at phystech dot edu

My second e-mail: ed-gorbunov at yandex dot ru

- Operating systems: Microsoft Windows, Linux, Mac OSX
- Programming languages: Python, LaTeX, C, C++

- Russian (native)
- English (advanced)

- Football: 9 years in football school in Rybinsk, Russia. Now I am playing for an amateur team and a student team.
- Table tennis, fitness

Sep 2014 - Jul 2018

Department of Control and Applied Mathematics

Thesis: Accelerated Directional Searchs and Gradient-Free Methods with non-Euclidean prox-structure

Sep 2018 - Present

Phystech School of Applied Mathematics and Informatics

Aug 2017 - Oct 2019

Researcher at Peter Richtárik's group

Feb 2018 - Jun 2019

Teaching assistant for the following courses: Algorithms and Models of Computation, Probability Theory

May 2019 - Aug 2019

Intern in Media Laboratory (research, C++ programming)

Aug 2019 - Now

Researcher

Now 2019 - Now

Junior researcher at Department of Mathematical Methods of Predictive Modelling

Feb 2020 - Now

Junior researcher at Laboratory of Numerical Methods of Applied Structural Optimization

**[22 May, 2020]****New paper out: "**Stochastic Optimization with Heavy-Tailed Noise via Accelerated Gradient Clipping" - joint work with Marina Danilova and Alexander Gasnikov.**Abstract:***In this paper, we propose a new accelerated stochastic first-order method called clipped-SSTM for smooth convex stochastic optimization with heavy-tailed distributed noise in stochastic gradients and derive the first high-probability complexity bounds for this method closing the gap in the theory of stochastic optimization with heavy-tailed noise. Our method is based on a special variant of accelerated Stochastic Gradient Descent (SGD) and clipping of stochastic gradients. We extend our method to the strongly convex case and prove new complexity bounds that outperform state-of-the-art results in this case. Finally, we extend our proof technique and derive the first non-trivial high-probability complexity bounds for SGD with clipping without light-tails assumption on the noise.*

---------------------------------------------------------**[29 April, 2020]**

I got accepted to The Machine Learning Summer School this year which will be fully virtual. There were more than 1300 applications and only 180 available slots, so, I am incredibly pleased to be one of those 180 students and looking forward to taking part in this school.

---------------------------------------------------------**[27-30 April, 2020]**

I took part in the first fully virtual conference - ICLR 2020!

I have presented our paper "A Stochastic Derivative Free Optimization Method with Momentum" - joint work with Adel Bibi, Ozan Sener, El Houcine Bergou and Peter Richtárik.

I enjoyed the online format and met a lot of amazing people there. Great thanks to the organizers!

[video]

---------------------------------------------------------**[2 February, 2020]**I have arrived at KAUST. I will be here till the end of March to work with Peter Richtárik as a visiting student.due to the situation with COVID-19 I have decided to come back to Moscow earlier (no, there is no confirmed cases at KAUST). I have arrived in Moscow on 14 March.

Update:

---------------------------------------------------------**[15 January, 2020]****I got Huawei scholarship for MIPT bachelor and master students for academic and reseacrh achievements!**

The prize is 125 000 Russian rubles.

---------------------------------------------------------**[07 January, 2020]****Paper accepted to AISTATS 2020!**

Our paper "A Unified Theory of SGD: Variance Reduction, Sampling, Quantization and Coordinate Descent" - joint work with Filip Hanzely and Peter Richtárik - got accepted to AISTATS 2020. The conference will take place in Palermo, Sicily, Italy during June 3-5.

---------------------------------------------------------**[20 December, 2019]****Paper accepted to ICLR 2020!**

Our paper "A Stochastic Derivative Free Optimization Method with Momentum" - joint work with Adel Bibi, Ozan Sener, El Houcine Bergou and Peter Richtárik - got accepted to ICLR 2020. The conference will take place in Addis Ababa, Ethiopia during April 26-30.

The paper was presented last week at the NeurIPS 2019 Optimization Foundations of Reinforcement Learning Workshop, here is our poster.

---------------------------------------------------------**[26 November, 2019]**-

New paper out: "Derivative-Free Method For Decentralized Distributed Non-Smooth Optimization"**Abstract:***In this paper, we propose new derivative-free method which is based on the Sliding Algorithm from Lan (2016, 2019) for the convex composite optimization problem that includes two terms: smooth one and non-smooth one. We prove the convergence rate for the new method that matches the corresponding rate for the first-order method up to a factor proportional to the dimension of the space. We apply this method for the decentralized distributed optimization and prove the bounds for the number of communication rounds for this method that matches the lower bounds. We prove the bound for the number of zeroth-order oracle calls per node that matches the similar state-of-the-art bound for the first-order decentralized distributed optimization up to to the factor proportional to the dimension of the space.*

---------------------------------------------------------**[20 November, 2019]**"Optimal Decentralized Distributed Algorithms for Stochastic Convex Optimization" - joint work with Darina Dvinskikh and Alexander Gasnikov.

New paper out:**Abstract:***We consider stochastic convex optimization problems with affine constraints and develop several methods using either primal or dual approach to solve it. In the primal case, we use special penalization technique to make the initial problem more convenient for using optimization methods. We propose algorithms to solve it based on Similar Triangles Method with Inexact Proximal Step for the convex smooth and strongly convex smooth objective functions and methods based on Gradient Sliding algorithm to solve the same problems in the non-smooth case. We prove the convergence guarantees in the smooth convex case with deterministic first-order oracle.*

We propose and analyze three novel methods to handle stochastic convex optimization problems with affine constraints: SPDSTM, R-RRMA-AC-SA2 and SSTM_sc. All methods use stochastic dual oracle. SPDSTM is the stochastic primal-dual modification of STM and it is applied for the dual problem when the primal functional is strongly convex and Lipschitz continuous on some ball. We extend the result from Dvinskikh & Gasnikov (2019) for this method to the case when only biased stochastic oracle is available. R-RRMA-AC-SA2 is an accelerated stochastic method based on the restarts of RRMA-AC-SA2 from Foster et al. (2019) and SSTM_sc is just stochastic STM for strongly convex problems. Both methods are applied to the dual problem when the primal functional is strongly convex, smooth and Lipschitz continuous on some ball and use stochastic dual first-order oracle. We develop convergence analysis for these methods for the unbiased and biased oracles respectively.

Finally, we apply all aforementioned results and approaches to solve the decentralized distributed optimization problem and discuss the optimality of the obtained results in terms of communication rounds and the number of oracle calls per node.

---------------------------------------------------------**[18 October, 2019]**Today I gave a talk based on our recent paper with Filip Hanzely and Peter Richtárik at SIERRA, INRIA.

Peter, thanks for your slides that you prepared for ICCOPT 2019 Summer School.

[slides]

---------------------------------------------------------**[06 October - 26 October, 2019]**

I am visiting Adrien Taylor and Francis Bach at SIERRA, INRIA.

---------------------------------------------------------**[02 October, 2019]****2 papers got accepted to NeurIPS workshops:**

1)*"An Accelerated Method for Derivative-Free Smooth Stochastic Convex Optimization"*- joint work with Pavel Dvurechensky and Alexander Gasnikov - got accepted to the workshop "Beyond First Order Methods in ML"

2) "A Stochastic Derivative Free Optimization Method with Momentum" - joint work with Adel Bibi, Ozan Sener, El Houcine Bergou and Peter Richtárik - got accepted to the workshop "Optimization Foundations for Reinforcement Learning"

---------------------------------------------------------**[23 June, 2019]**(in Russian):

New book out**"Lecture Notes on Stochastic Processes"**- joint work with Alexander Gasnikov, Sergey Guz, Elena Chernousova, Maksim Shirobokov, Egor Shulgin.

---------------------------------------------------------**[30 May, 2019]****New paper out: "A Stochastic Derivative Free Optimization Method with Momentum" -**joint work with Adel Bibi, Ozan Sener, El Houcine Bergou and Peter Richtárik.**Abstract:***We consider the problem of unconstrained minimization of a smooth objective function in R^d in setting where only function evaluations are possible. We propose and analyze stochastic zeroth-order method with heavy ball momentum. In particular, we propose, SMTP, a momentum version of the stochastic three-point method (STP). We show new complexity results for non-convex, convex and strongly convex functions. We test our method on a collection of learning to continuous control tasks on several MuJoCo environments with varying difficulty andcompare against STP, other state-of-the-art derivative-free optimization algorithms and against policy gradient methods. SMTP significantly outperforms STP and all other methods that we considered in our numerical experiments. Our second contribution is SMTP with importance sampling which we call SMTP_IS. We provide convergence analysis of this method for non-convex, convex and strongly convex objectives.***[27 May, 2019]****New paper out: "A Unified Theory of SGD: Variance Reduction,****Sampling, Quantization and Coordinate Descent"**- joint work with Filip Hanzely and Peter Richtárik.**Abstract:***In this paper we introduce a unified analysis of a large family of variants of proximal stochastic gradient descent (SGD) which so far have required different intuitions, convergence analyses, have different applications, and which have been developed separately in various communities. We show that our framework includes methods with and without the following tricks, and their combinations: variance reduction, importance sampling, mini-batch sampling, quantization, and coordinate sub-sampling. As a by-product, we obtain the first unified theory of SGD and randomized coordinate descent (RCD) methods, the first unified theory of variance reduced and non-variance-reduced SGD methods, and the first unified theory of quantized and non-quantized methods. A key to our approach is a parametric assumption on the iterates and stochastic gradients. In a single theorem we establish a linear convergence result under this assumption and strong-quasi convexity of the loss function. Whenever we recover an existing method as a special case, our theorem gives the best known complexity result. Our approach can be used to motivate the development of new useful methods, and offers pre-proved convergence guarantees. To illustrate the strength of our approach, we develop five new variants of SGD, and through numerical experiments demonstrate some of their properties.*

---------------------------------------------------------**[10 April, 2019]**The award includes scholarship (total amount - 350,000 Russian rubles), insternship offer at Yandex.Research, travel grant to attend one international conference and personal mentor from Yandex.

I got the Ilya Segalovich Award - Yandex scientific scholarship!

[video] [about award] [about winners]

---------------------------------------------------------**[23 March, 2019]**joint work with Darina Dvinskikh, Alexander Gasnikov, Pavel Dvurechensky and Cesar A. Uribe.

New paper out: "On Dual Approach for Distributed Stochastic Convex Optimization over Networks" -

Abstract:*We introduce dual stochastic gradient oracle methods for distributed stochastic convex optimization problems over networks. We estimate the complexity of the proposed method in terms of probability of large deviations. This analysis is based on a new technique that allows to bound the distance between the iteration sequence and the solution point. By the proper choice of batch size, we can guarantee that this distance equals (up to a constant) to the distance between the starting point and the solution.*

---------------------------------------------------------**[10 February, 2019]****New paper out:**"Stochastic Three Points Method for Unconstrained Smooth Minimization" - joint work with El Houcine Bergou and Peter Richtárik.

Abstract:*In this paper we consider the unconstrained minimization problem of a smooth function in R^n in a setting where only function evaluations are possible. We design a novel randomized derivative-free algorithm --- the stochastic three points (STP) method --- and analyze its iteration complexity. At each iteration, STP generates a random search direction according to a certain fixed probability law. Our assumptions on this law are very mild: roughly speaking, all laws which do not concentrate all measure on any halfspace passing through the origin will work. For instance, we allow for the uniform distribution on the sphere and also distributions that concentrate all measure on a positive spanning set.**Given a current iterate x, STP compares the objective function at three points: x, x+αs and x−αs, where α>0 is a stepsize parameter and s is the random search direction. The best of these three points is the next iterate. We analyze the method STP under several stepsize selection schemes (fixed, decreasing, estimated through finite differences, etc). We study non-convex, convex and strongly convex cases.**We also propose a parallel version for STP, with iteration complexity bounds which do not depend on the dimension n.*arXiv: 1901.09269

---------------------------------------------------------**[08 February, 2019]**

I am a reviewer for ICML 2019! This is my first time being a reviewer for such a conference.

---------------------------------------------------------**[26 January, 2019]**"Distributed learning with compressed gradient differences" - joint work with Konstantin Mishchenko, Martin Takáč and Peter Richtárik.

New paper out:

Abstract:*Training very large machine learning models requires a distributed computing approach, with communication of the model updates often being the bottleneck. For this reason, several methods based on the compression (e.g., sparsification and/or quantization) of the updates were recently proposed, including QSGD (Alistarh et al., 2017), TernGrad (Wen et al., 2017), SignSGD (Bernstein et al., 2018), and DQGD (Khirirat et al., 2018). However, none of these methods are able to learn the gradients, which means that they necessarily suffer from several issues, such as the inability to converge to the true optimum in the batch mode, inability to work with a nonsmooth regularizer, and slow convergence rates. In this work we propose a new distributed learning method - DIANA - which resolves these issues via compression of gradient differences. We perform a theoretical analysis in the strongly convex and nonconvex settings and show that our rates are vastly superior to existing rates. Our analysis of block quantization and differences between l2 and l∞ quantization closes the gaps in theory and practice. Finally, by applying our analysis technique to TernGrad, we establish the first convergence rate for this method.*

arXiv: 1901.09269

---------------------------------------------------------**[13 January - 24 February, 2019]**I am visiting Peter Richtárik at KAUST.

---------------------------------------------------------**[05 September, 2018]**

The paper "Stochastic spectral and conjugate descent methods" - joint work with Dmitry Kovalev, Elnur Gasanov and Peter Richtárik - accepted to The Thirty-second Annual Conference on Neural Information Processing Systems (NIPS).

Abstract:*The state-of-the-art methods for solving optimization problems in big dimensions are variants of randomized coordinate descent (RCD). In this paper we introduce a fundamentally new type of acceleration strategy for RCD based on the augmentation of the set of coordinate directions by a few spectral or conjugate directions. As we increase the number of extra directions to be sampled from, the rate of the method improves, and interpolates between the linear rate of RCD and a linear rate independent of the condition number. We develop and analyze also inexact variants of these methods where the spectral and conjugate directions are allowed to be approximate only. We motivate the above development by proving several negative results which highlight the limitations of RCD with importance sampling.*

The conference will be in Montreal, Canada during December 3-8, 2018.

---------------------------------------------------------**[01 July - 06 July, 2018]**

I am attending the 23rd International Symposium on Mathematical Programming in Bordeaux, France. My talk is scheduled on Friday, 06 July, in the section*"New methods for stochastic optimization and variational inequalities".*

The Talk:*”An Accelerated Directional Derivative Method for Smooth Stochastic Convex Optimization”*

[slides]

---------------------------------------------------------**[27 June - 28 June, 2018]**

I am in Grenoble, attending the Grenoble Optimization Days 2018. By the way I have the new most favorite photo!

Here I read the excellent book by my advisor Alexander Gasnikov.

------------------------------------------------------------

------------------------------------------------------------

**[25 June, 2018]**

Today I successfully defended my Bachelor diploma.

Thesis: "Accelerated Directional Searchs and Gradient-Free Methods with non-Euclidean prox-structure"

---------------------------------------------------------**[10 June - 15 June, 2018]**

I am in Traditional Youth School ”Control, Information and Optimization” (Voronovo, Russia) organized by Boris Polyak and Elena Gryazina.

I presented a poster*"An Accelerated Directional Derivative Method for SmoothStochastic Convex Optimization".*My work was chosen and I gave a talk there. I won third prize for this talk in competitions of best talks among participants.

[slides of the talk]

---------------------------------------------------------**[14 April, 2018]**

I am attending workshop*”Optimization at Work”*at MIPT (Moscow, Russia) with a talk*”An Accelerated Method for Derivative-Free Smooth Stochastic Convex Optimization”*.

[slides] [video]

---------------------------------------------------------**[10 April, 2018]****New paper out:**"On the upper bound for the mathematical expectation of the norm of a vector uniformly distributed on the sphere and the phenomenon of concentration of uniform measure on the sphere" - joint work with Evgeniya Vorontsova and Alexander Gasnikov.

Abstract:*We considered the problem of obtaining upper bounds for the mathematical expectation of the q-norm (2⩽q⩽∞) of the vector which is uniformly distributed on the unit Euclidean sphere.*

---------------------------------------------------------**[8 April, 2018]**An Accelerated Directional Derivative Method for Smooth Stochastic Convex Optimization" - joint work with Pavel Dvurechensky and Alexander Gasnikov.

New paper out: "

Abstract:*We consider smooth stochastic convex optimization problems in the context of algorithms which are based on directional derivatives of the objective function. This context can be considered as an intermediate one between derivative-free optimization and gradient-based optimization. We assume that at any given point and for any given direction, a stochastic approximation for the directional derivative of the objective function at this point and in this direction is available with some additive noise. The noise is assumed to be of an unknown nature, but bounded in the absolute value. We underline that we consider directional derivatives in any direction, as opposed to coordinate descent methods which use only derivatives in coordinate directions. For this setting, we propose a non-accelerated and an accelerated directional derivative method and provide their complexity bounds. Despite that our algorithms do not use gradient information, our non-accelerated algorithm has a complexity bound which is, up to a factor logarithmic in problem dimension, similar to the complexity bound of gradient-based algorithms. Our accelerated algorithm has a complexity bound which coincides with the complexity bound of the accelerated gradient-based algorithm up to a factor of square root of the problem dimension, whereas for existing directional derivative methods this factor is of the order of problem dimension. We also extend these results to strongly convex problems. Finally, we consider derivative-free optimization as a particular case of directional derivative optimization with noise in the directional derivative and obtain complexity bounds for non-accelerated and accelerated derivative-free methods. Complexity bounds for these algorithms inherit the gain in the dimension dependent factors from our directional derivative methods.*

---------------------------------------------------------**[25 February, 2018]****New paper out:**"An Accelerated Method for Derivative-Free Smooth Stochastic Convex Optimization" - joint work with Pavel Dvurechensky and Alexander Gasnikov.

Abstract:*We consider an unconstrained problem of minimization of a smooth convex function which is only available through noisy observations of its values, the noise consisting of two parts. Similar to stochastic optimization problems, the first part is of a stochastic nature. On the opposite, the second part is an additive noise of an unknown nature, but bounded in the absolute value. In the two-point feedback setting, i.e. when pairs of function values are available, we propose an accelerated derivative-free algorithm together with its complexity analysis. The complexity bound of our derivative-free algorithm is only by a factor of \sqrt{n} larger than the bound for accelerated gradient-based algorithms, where n is the dimension of the decision variable. We also propose a non-accelerated derivative-free algorithm with a complexity bound similar to the stochastic-gradient-based algorithm, that is, our bound does not have any dimension-dependent factor. Interestingly, if the solution of the problem is sparse, for both our algorithms, we obtain better complexity bound if the algorithm uses a 1-norm proximal setup, rather than the Euclidean proximal setup, which is a standard choice for unconstrained problems.*

---------------------------------------------------------**[10 February, 2018]****New paper out**: "Stochastic spectral and conjugate descent methods" - joint work with Dmitry Kovalev, Elnur Gasanov and Peter Richtárik.

Abstract:*The state-of-the-art methods for solving optimization problems in big dimensions are variants of randomized coordinate descent (RCD). In this paper we introduce a fundamentally new type of acceleration strategy for RCD based on the augmentation of the set of coordinate directions by a few spectral or conjugate directions. As we increase the number of extra directions to be sampled from, the rate of the method improves, and interpolates between the linear rate of RCD and a linear rate independent of the condition number. We develop and analyze also inexact variants of these methods where the spectral and conjugate directions are allowed to be approximate only. We motivate the above development by proving several negative results which highlight the limitations of RCD with importance sampling.*

---------------------------------------------------------**[9 February, 2018]**

I started my final semester as a Bachelor student today. This spring I will be a tutor for the course "Algorithms and Models of Computation".

---------------------------------------------------------**[5 February - 7 February, 2018]**

I am attending KAUST Research Workshop on Optimization and Big Data with poster*”Stochastic Spectral Descent Methods”*based on our joint work with Dmitry Kovalev, Elnur Gasanov and Peter Richtárik.

---------------------------------------------------------**[14 January - 08 February, 2018]**

I am visiting Peter Richtárik at KAUST.

---------------------------------------------------------**[25 November, 2017]**Today I got a prize for the best talk in Section of Information Transmission Problems, Data Analysis and Optimization, 60th Scientific Conference of MIPT.

The talk:*”About accelerated Directional Search with non-Euclidean prox-structure”*

[slides]

---------------------------------------------------------**[27 October, 2017]**

I am attending workshop ”Optimization at Work” at MIPT (Moscow, Russia) with a talk*”Accelerated Directional Search with non-Euclidean prox-structure”.*[slides]*---------------------------------------------------------***[30 September, 2017]****New paper out:**"Accelerated Directional Search with non-Euclidean prox-structure" - joint work with Evgeniya Vorontsova and Alexander Gasnikov.

Abstract:*In the paper we propose an accelerated directional search method with non-euclidian prox-structure. We consider convex unconstrained optimization problem in R^n. For simplicity we start from the zero point. We expect in advance that 1-norm of the solution is close enough to its 2-norm. In this case the standard accelerated Nesterov's directional search method can be improved. In the paper we show how to make Nesterov's method n-times faster (up to a log n-factor) in this case. The basic idea is to use linear coupling, proposed by Allen-Zhu & Orecchia in 2014, and to make Grad-step in 2-norm, but Mirr-step in 1-norm. We show that for constrained optimization problem this approach stable upon an obstacle.*

---------------------------------------------------------**[29 July, 2017]**I am happy to announce that I started working with Peter Richtárik as a member of his research group at MIPT.