90年代初，我和Vapnik一起在贝尔实验室共事，在此期间相继提出了一些后来有影响力的算法：卷积神经网络，支持向量机，切线距离等。1995年，AT&T从朗讯科技公司（LUCENT）独立出来，我则出任了AT&T实验室图像处理研究组的负责人，组内机器学习相关的研究员包括：Yoshua Bengio, Leon Bottou, and Patrick Haffner, and Vladimir Vapnik，访问学者和实习生主要包括：Bernhard SchÃ¶lkopf, Jason Weston, Olivier Chapelle。
Yann LeCun: Vapnik and I were in nearby office at Bell Labs in the early 1990s, in Larry Jackel’s Adaptive Systems Research Department. Convolutional nets, Support Vector Machines, Tangent Distance, and several other influential methods were invented within a few meters of each other, and within a few years of each other. When AT&T spun off Lucent In 1995, I became the head of that department which became the Image Processing Research Department at AT&T Labs – Research. Machine Learning members included Yoshua Bengio, Leon Bottou, and Patrick Haffner, and Vladimir Vapnik. Visitors and interns included Bernhard SchÃ¶lkopf, Jason Weston, Olivier Chapelle, and others.
Vapnik and I often had lively discussions about the relative merits of (deep) neural nets and kernel machines. Basically, I have always been interested in solving the problem of learning features or learning representations. I had only a moderate interest in kernel methods because they did nothing to address this problem. Naturally, SVMs are wonderful as a generic classification method with beautiful math behind them. But in the end, they are nothing more than simple two-layer systems. The first layer can be seen as a set of units (one per support vector) that measure a kind of similarity between the input vector and each support vector using the kernel function. The second layer linearly combines these similarities.
It’s a two-layer system in which the first layer is trained with the simplest of all unsupervised learning method: simply store the training samples as prototypes in the units. Basically, varying the smoothness of the kernel function allows us to interpolate between two simple methods: linear classification, and template matching. I got in trouble about 10 years ago by saying that kernel methods were a form of glorified template matching. Vapnik, on the other hand, argued that SVMs had a very clear way of doing capacity control. An SVM with a “narrow” kernel function can always learn the training set perfectly, but its generalization error is controlled by the width of the kernel and the sparsity of the dual coefficients. Vapnik really believes in his bounds. He worried that neural nets didn’t have similarly good ways to do capacity control (although neural nets do have generalization bounds, since they have finite VC dimension).
My counter argument was that the ability to do capacity control was somewhat secondary to the ability to compute highly complex function with a limited amount of computation. Performing image recognition with invariance to shifts, scale, rotation, lighting conditions, and background clutter was impossible (or extremely inefficient) for a kernel machine operating at the pixel level. But it was quite easy for deep architectures such as convolutional nets.