In a new automotive application, we have used convolutional neural networks (CNNs) to map the raw pixels from a front-facing camera to the steering commands for a self-driving car. This powerful end-to-end approach means that with minimum training data from humans, the system learns to steer, with or without lane markings, on both local roads and highways. The system can also operate in areas with unclear visual guidance such as parking lots or unpaved roads.
Figure 1: NVIDIA’s self-driving car in action.
We designed the end-to-end learning system using an NVIDIA DevBox running Torch 7 for training. An NVIDIA DRIVE™ PX self-driving car computer, also with Torch 7, was used to determine where to drive, while operating at 30 frames per second (FPS). The system is trained to automatically learn the internal representations of necessary processing steps, such as detecting useful road features, with only the human steering angle as the training signal. We never explicitly trained it to detect, for example, the outline of roads. In contrast to methods using explicit decomposition of the problem, such as lane marking detection, path planning, and control, our end-to-end system optimizes all processing steps simultaneously.
We believe that end-to-end learning leads to better performance and smaller systems. Better performance results because the internal components self-optimize to maximize overall system performance, instead of optimizing human-selected intermediate criteria, e.g., lane detection. Such criteria understandably are selected for ease of human interpretation, which doesn’t automatically guarantee maximum system performance. Smaller networks are possible because the system learns to solve the problem with the minimal number of processing steps.
Convolutional Neural Networks to Process Visual Data
CNNs have revolutionized the computational pattern recognition process. Prior to the widespread adoption of CNNs, most pattern recognition tasks were performed using an initial stage of hand-crafted feature extraction followed by a classifier. The important breakthrough of CNNs is that features are now learned automatically from training examples. The CNN approach is especially powerful when applied to image recognition tasks because the convolution operation captures the 2D nature of images. By using the convolution kernels to scan an entire image, relatively few parameters need to be learned compared to the total number of operations.
While CNNs with learned features have been used commercially for over twenty years, their adoption has exploded in recent years because of two important developments.
First, large, labeled data sets such as the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) are now widely available for training and validation.
Second, CNN learning algorithms are now implemented on massively parallel graphics processing units (GPUs), tremendously accelerating learning and inference ability.
The CNNs that we describe here go beyond basic pattern recognition. We developed a system that learns the entire processing pipeline needed to steer an automobile. The groundwork for this project was actually done over 10 years ago in a Defense Advanced Research Projects Agency (DARPA) seedling project known as DARPA Autonomous Vehicle (DAVE), in which a sub-scale radio control (RC) car drove through a junk-filled alleyway. DAVE was trained on hours of human driving in similar, but not identical, environments. The training data included video from two cameras and the steering commands sent by a human operator.
In many ways, DAVE was inspired by the pioneering work of Pomerleau, who in 1989 built the Autonomous Land Vehicle in a Neural Network (ALVINN) system. ALVINN is a precursor to DAVE, and it provided the initial proof of concept that an end-to-end trained neural network might one day be capable of steering a car on public roads. DAVE demonstrated the potential of end-to-end learning, and indeed was used to justify starting the DARPA Learning Applied to Ground Robots (LAGR) program, but DAVE’s performance was not sufficiently reliable to provide a full alternative to the more modular approaches to off-road driving. (DAVE’s mean distance between crashes was about 20 meters in complex environments.)
About a year ago we started a new effort to improve on the original DAVE and create a robust system for driving on public roads. The primary motivation for this work is to avoid the need to recognize specific human-designated features, such as lane markings, guard rails, or other cars, and to avoid having to create a collection of “if, then, else” rules based on observation of these features. We are excited to share the preliminary results of this new effort, which is aptly named DAVE-2.
The DAVE-2 System
Figure 2: High-level view of the data collection system.
Figure 2 shows a simplified block diagram of the collection system for training data of DAVE-2. Three cameras are mounted behind the windshield of the data-acquisition car, and timestamped video from the cameras is captured simultaneously with the steering angle applied by the human driver. The steering command is obtained by tapping into the vehicle’s Controller Area Network (CAN) bus. In order to make our system independent of the car geometry, we represent the steering command as 1/r, where r is the turning radius in meters. We use 1/r instead of r to prevent a singularity when driving straight (the turning radius for driving straight is infinity). 1/r smoothly transitions through zero from left turns (negative values) to right turns (positive values).
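As a minimal sketch of the 1/r representation described above (the function name `inverse_turning_radius` and its `turn_right` flag are illustrative and not part of the original system):

```python
import math


def inverse_turning_radius(radius_m: float, turn_right: bool = True) -> float:
    """Return the steering command 1/r for a turning radius given in meters.

    Driving straight corresponds to an infinite radius, which maps to 0.0.
    Left turns are negative and right turns are positive, as in the text.
    """
    if radius_m <= 0:
        raise ValueError("radius must be positive")
    magnitude = 1.0 / radius_m  # 1.0 / math.inf evaluates to 0.0
    return magnitude if turn_right else -magnitude


straight = inverse_turning_radius(math.inf)
gentle_right = inverse_turning_radius(50.0)
gentle_left = inverse_turning_radius(50.0, turn_right=False)
```

Because the command passes through zero rather than through infinity, a regression target built this way stays bounded for every ordinary driving situation.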
Training data contains single images sampled from the video, paired with the corresponding steering command (1/r). Training with data from only the human driver is not sufficient; the network must also learn how to recover from any mistakes, or the car will slowly drift off the road. The training data is therefore augmented with additional images that show the car in different shifts from the center of the lane and rotations from the direction of the road.
The images for two specific off-center shifts can be obtained from the left and the right cameras. Additional shifts between the cameras and all rotations are simulated through viewpoint transformation of the image from the nearest camera. Precise viewpoint transformation requires 3D scene knowledge which we don’t have, so we approximate the transformation by assuming all points below the horizon are on flat ground, and all points above the horizon are infinitely far away. This works fine for flat terrain, but it introduces distortions for objects that stick above the ground, such as cars, poles, trees, and buildings. Fortunately these distortions don’t pose a significant problem for network training. The steering label for each transformed image is adjusted to one that would steer the vehicle back to the desired location and orientation in two seconds.
Figure 3: Training the neural network.
Figure 3 shows a block diagram of our training system. Images are fed into a CNN that then computes a proposed steering command. The proposed command is compared to the desired command for that image, and the weights of the CNN are adjusted to bring the CNN output closer to the desired output. The weight adjustment is accomplished using back propagation as implemented in the Torch 7 machine learning package.
Once trained, the network is able to generate steering commands from the video images of a single center camera. Figure 4 shows this configuration.
Figure 4: The trained network is used to generate steering commands from a single front-facing center camera.
Training data was collected by driving on a wide variety of roads and in a diverse set of lighting and weather conditions. We gathered surface street data in central New Jersey and highway data from Illinois, Michigan, Pennsylvania, and New York. Other road types include two-lane roads (with and without lane markings), residential roads with parked cars, tunnels, and unpaved roads. Data was collected in clear, cloudy, foggy, snowy, and rainy weather, both day and night. In some instances, the sun was low in the sky, resulting in glare reflecting from the road surface and scattering from the windshield.
The data was acquired using either our drive-by-wire test vehicle, which is a 2016 Lincoln MKZ, or using a 2013 Ford Focus with cameras placed in similar positions to those in the Lincoln. Our system has no dependencies on any particular vehicle make or model. Drivers were encouraged to maintain full attentiveness, but otherwise drive as they usually do. As of March 28, 2016, about 72 hours of driving data was collected.
Figure 5: CNN architecture. The network has about 27 million connections and 250 thousand parameters.
We train the weights of our network to minimize the mean-squared error between the steering command output by the network, and either the command of the human driver or the adjusted steering command for off-center and rotated images (see “Augmentation”, later). Figure 5 shows the network architecture, which consists of 9 layers, including a normalization layer, 5 convolutional layers, and 3 fully connected layers. The input image is split into YUV planes and passed to the network.
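The training objective described above can be sketched in a few lines. This is only an illustration of the mean-squared error itself, not the Torch 7 code used for training; the function name and the sample values are hypothetical.

```python
def mean_squared_error(predicted, desired):
    """Mean-squared error between predicted and desired steering commands (1/r)."""
    if len(predicted) != len(desired) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - d) ** 2 for p, d in zip(predicted, desired)]
    return sum(squared) / len(squared)


# Hypothetical values: network outputs and their targets. A target is either
# the human driver's command or the adjusted command for an augmented image.
predicted = [0.0, 0.02, -0.01]
desired = [0.0, 0.01, -0.03]
loss = mean_squared_error(predicted, desired)
```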
The first layer of the network performs image normalization. The normalizer is hard-coded and is not adjusted in the learning process. Performing normalization in the network allows the normalization scheme to be altered with the network architecture, and to be accelerated via GPU processing.
The convolutional layers are designed to perform feature extraction, and are chosen empirically through a series of experiments that vary layer configurations. We then use strided convolutions in the first three convolutional layers with a 2×2 stride and a 5×5 kernel, and a non-strided convolution with a 3×3 kernel size in the final two convolutional layers.
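To see how the strides and kernel sizes above shrink the feature maps, the following sketch applies the standard output-size formula for an unpadded convolution. The absence of padding and the 66×200 input size are assumptions made for illustration; neither is stated in this section.

```python
def conv_output_size(size: int, kernel: int, stride: int) -> int:
    """Output length of an unpadded convolution along one dimension."""
    return (size - kernel) // stride + 1


# (kernel, stride) for the five convolutional layers described in the text:
# three 5x5 layers with stride 2, then two 3x3 layers with stride 1.
LAYERS = [(5, 2), (5, 2), (5, 2), (3, 1), (3, 1)]


def feature_map_sizes(height: int, width: int):
    """Return the (height, width) of the feature map after each layer."""
    sizes = []
    for kernel, stride in LAYERS:
        height = conv_output_size(height, kernel, stride)
        width = conv_output_size(width, kernel, stride)
        sizes.append((height, width))
    return sizes


# Assumed input size (height 66, width 200); adjust to the real camera crop.
sizes = feature_map_sizes(66, 200)
```

Under these assumptions the three strided layers do most of the downsampling, and the two non-strided layers trim only two pixels per dimension each.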
We follow the five convolutional layers with three fully connected layers, leading to a final output control value which is the inverse-turning-radius. The fully connected layers are designed to function as a controller for steering, but we noted that by training the system end-to-end, it is not possible to make a clean break between which parts of the network function primarily as feature extractor, and which serve as controller.
The first step to training a neural network is selecting the frames to use. Our collected data is labeled with road type, weather condition, and the driver’s activity (staying in a lane, switching lanes, turning, and so forth). To train a CNN to do lane following, we simply select data where the driver is staying in a lane, and discard the rest. We then sample that video at 10 FPS because a higher sampling rate would include images that are highly similar, and thus not provide much additional useful information. To remove a bias towards driving straight the training data includes a higher proportion of frames that represent road curves.
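A minimal sketch of the frame selection step follows. The `Frame` record, the `"lane_following"` label, and the 30 FPS source rate are assumptions made for illustration; the rebalancing toward road curves is not shown.

```python
from dataclasses import dataclass


@dataclass
class Frame:
    index: int      # position of the frame in the recorded video
    activity: str   # hypothetical label, e.g. "lane_following" or "turning"


def select_training_frames(frames, source_fps=30, target_fps=10):
    """Keep lane-following frames, subsampled to the target frame rate."""
    if source_fps % target_fps != 0:
        raise ValueError("source_fps must be a multiple of target_fps")
    step = source_fps // target_fps
    return [
        f for f in frames
        if f.activity == "lane_following" and f.index % step == 0
    ]


# Nine frames: the first six follow the lane, the last three are a turn.
frames = [Frame(i, "lane_following" if i < 6 else "turning") for i in range(9)]
selected = select_training_frames(frames)
```

Subsampling on the original frame index, rather than on the filtered list, keeps the spacing between kept frames uniform even when discarded segments interrupt the video.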
After selecting the final set of frames, we augment the data by adding artificial shifts and rotations to teach the network how to recover from a poor position or orientation. The magnitude of these perturbations is chosen randomly from a normal distribution. The distribution has zero mean, and the standard deviation is twice the standard deviation that we measured with human drivers. Artificially augmenting the data does add undesirable artifacts as the magnitude increases (as mentioned previously).
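The sampling rule above can be sketched as follows. The value used for the human standard deviation is hypothetical, since the measured figure is not given here, and the function name is illustrative.

```python
import random
import statistics


def sample_perturbation(human_std: float, rng: random.Random) -> float:
    """Draw a shift or rotation magnitude from a zero-mean normal distribution
    whose standard deviation is twice the one measured for human drivers."""
    return rng.gauss(0.0, 2.0 * human_std)


rng = random.Random(0)      # fixed seed so the sketch is repeatable
human_shift_std_m = 0.1     # hypothetical value, in meters
shifts = [sample_perturbation(human_shift_std_m, rng) for _ in range(10000)]

observed_mean = statistics.fmean(shifts)
observed_std = statistics.pstdev(shifts)
```

With enough samples, `observed_std` should land close to twice the assumed human value, which is the property the augmentation relies on.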
Before road-testing a trained CNN, we first evaluate the network’s performance in simulation. Figure 6 shows a simplified block diagram of the simulation system, and Figure 7 shows a screenshot of the simulator in interactive mode.
Figure 6: Block-diagram of the drive simulator.
The simulator takes prerecorded videos from a forward-facing on-board camera connected to a human-driven data-collection vehicle, and generates images that approximate what would appear if the CNN were instead steering the vehicle. These test videos are time-synchronized with the recorded steering commands generated by the human driver.
Since human drivers don’t drive in the center of the lane all the time, we must manually calibrate the lane’s center as it is associated with each frame in the video used by the simulator. We call this position the “ground truth”.
The simulator transforms the original images to account for departures from the ground truth. Note that this transformation also includes any discrepancy between the human driven path and the ground truth. The transformation is accomplished by the same methods as described previously.
The simulator accesses the recorded test video along with the synchronized steering commands that occurred when the video was captured. The simulator sends the first frame of the chosen test video, adjusted for any departures from the ground truth, to the input of the trained CNN, which then returns a steering command for that frame. The CNN steering commands as well as the recorded human-driver commands are fed into the dynamic model of the vehicle to update the position and orientation of the simulated vehicle.
Figure 7: Screenshot of the simulator in interactive mode. See text for explanation of the performance metrics. The green area on the left is unknown because of the viewpoint transformation. The highlighted wide rectangle below the horizon is the area which is sent to the CNN.
The simulator then modifies the next frame in the test video so that the image appears as if the vehicle were at the position that resulted by following steering commands from the CNN. This new image is then fed to the CNN and the process repeats.
The simulator records the off-center distance (distance from the car to the lane center), the yaw, and the distance traveled by the virtual car. When the off-center distance exceeds one meter, a virtual human intervention is triggered, and the virtual vehicle position and orientation is reset to match the ground truth of the corresponding frame of the original test video.
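The bookkeeping in the simulation loop can be sketched as below. This is a simplified stand-in, not the simulator described above: it assumes a straight lane, a constant speed, and a basic kinematic update in place of the vehicle's dynamic model, and it omits the image transformation and the CNN entirely. All names and default values are hypothetical.

```python
import math


def simulate(steering_commands, speed_mps=10.0, dt=0.1, limit_m=1.0):
    """Roll a simple kinematic model forward and count virtual interventions.

    steering_commands: iterable of 1/r values (1/m), one per frame.
    Returns (interventions, distance_traveled_m).
    """
    offset = 0.0        # lateral distance from the lane center, meters
    yaw = 0.0           # heading relative to the lane direction, radians
    distance = 0.0
    interventions = 0
    for curvature in steering_commands:
        step = speed_mps * dt
        yaw += curvature * step           # heading change over this step
        offset += math.sin(yaw) * step    # lateral drift over this step
        distance += step
        if abs(offset) > limit_m:
            interventions += 1
            offset, yaw = 0.0, 0.0        # reset to the ground truth
    return interventions, distance


straight_run = simulate([0.0] * 100)
drifting_run = simulate([0.05] * 100)
```

A run with zero curvature never leaves the lane center, while a constant nonzero curvature on a straight lane repeatedly crosses the one-meter limit and triggers resets.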
We evaluate our networks in two steps: first in simulation, and then in on-road tests.
In simulation we have the networks provide steering commands in our simulator to an ensemble of prerecorded test routes that correspond to about a total of three hours and 100 miles of driving in Monmouth County, NJ. The test data was taken in diverse lighting and weather conditions and includes highways, local roads, and residential streets.
We estimate what percentage of the time the network could drive the car (autonomy) by counting the simulated human interventions that occur when the simulated vehicle departs from the center line by more than one meter. We assume that in real life an actual intervention would require a total of six seconds: this is the time required for a human to retake control of the vehicle, re-center it, and then restart the self-steering mode. We calculate the percentage autonomy by counting the number of interventions, multiplying by 6 seconds, dividing by the elapsed time of the simulated test, and then subtracting the result from 1:

$$\text{autonomy} = \left(1 - \frac{(\text{number of interventions}) \cdot 6\ \text{seconds}}{\text{elapsed time [seconds]}}\right) \cdot 100$$

Thus, if we had 10 interventions in 600 seconds, we would have an autonomy value of

$$\left(1 - \frac{10 \cdot 6}{600}\right) \cdot 100 = 90\%$$
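The autonomy calculation described above can be sketched as a small helper. The function and parameter names are illustrative; the six-second cost per intervention comes from the text.

```python
def autonomy_percent(interventions: int, elapsed_s: float,
                     penalty_s: float = 6.0) -> float:
    """Percentage of time the network steers without human intervention."""
    if elapsed_s <= 0:
        raise ValueError("elapsed time must be positive")
    return (1.0 - interventions * penalty_s / elapsed_s) * 100.0


# The example from the text: 10 interventions in 600 seconds.
example = autonomy_percent(10, 600.0)
```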
After a trained network has demonstrated good performance in the simulator, the network is loaded on the DRIVE PX in our test car and taken out for a road test. For these tests we measure performance as the fraction of time during which the car performs autonomous steering. This time excludes lane changes and turns from one road to another. For a typical drive in Monmouth County, NJ, from our office in Holmdel to Atlantic Highlands, we are autonomous approximately 98% of the time. We also drove 10 miles on the Garden State Parkway (a multi-lane divided highway with on and off ramps) with zero intercepts.
Here is a video of our test car driving in diverse conditions.
Visualization of Internal CNN State
Figure 8: How the CNN “sees” an unpaved road. Top: subset of the camera image sent to the CNN. Bottom left: Activation of the first layer feature maps. Bottom right: Activation of the second layer feature maps. This demonstrates that the CNN learned to detect useful road features on its own, i.e., with only the human steering angle as training signal. We never explicitly trained it to detect the outlines of roads.
Figures 8 and 9 show the activations of the first two feature map layers for two different example inputs, an unpaved road and a forest. In case of the unpaved road, the feature map activations clearly show the outline of the road while in case of the forest the feature maps contain mostly noise, i.e., the CNN finds no useful information in this image.
This demonstrates that the CNN learned to detect useful road features on its own, i.e., with only the human steering angle as training signal. We never explicitly trained it to detect the outlines of roads, for example.
Figure 9: Example image with no road. The activations of the first two feature maps appear to contain mostly noise, i.e., the CNN doesn’t recognize any useful features in this image.
We have empirically demonstrated that CNNs are able to learn the entire task of lane and road following without manual decomposition into road or lane marking detection, semantic abstraction, path planning, and control. A small amount of training data from less than a hundred hours of driving was sufficient to train the car to operate in diverse conditions, on highways, local and residential roads in sunny, cloudy, and rainy conditions.
The CNN is able to learn meaningful road features from a very sparse training signal (steering alone).
The system learns for example to detect the outline of a road without the need of explicit labels during training.
More work is needed to improve the robustness of the network, to find methods to verify the robustness, and to improve visualization of the network-internal processing steps.
Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, Winter 1989. URL: http://yann.lecun.org/exdb/publis/pdf/lecun-89e.pdf.
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012. URL: http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf.
L. D. Jackel, D. Sharman, C. E. Stenard, B. I. Strom, and D. Zuckert. Optical character recognition for self-service banking. AT&T Technical Journal, 74(1):16–24, 1995.
Large scale visual recognition challenge (ILSVRC). URL: http://www.image-net.org/challenges/LSVRC/.
Net-Scale Technologies, Inc. Autonomous off-road vehicle control using end-to-end learning, July 2004. Final technical report. URL: http://net-scale.com/doc/net-scale-dave-report.pdf.
Dean A. Pomerleau. ALVINN, an autonomous land vehicle in a neural network. Technical report, Carnegie Mellon University, 1989. URL: http://repository.cmu.edu/cgi/viewcontent.cgi?article=2874&context=compsci.
Danwei Wang and Feng Qi. Trajectory planning for a four-wheel-steering vehicle. In Proceedings of the 2001 IEEE International Conference on Robotics & Automation, May 21–26 2001. URL: http://www.ntu.edu.sg/home/edwwang/confpapers/wdwicar01.pdf.
Olli hits the road in the Washington, D.C. area and later this year in Miami-Dade County and Las Vegas.
Local Motors CEO and co-founder John B. Rogers, Jr. with “Olli” & IBM, June 15, 2016. Rich Riggins/Feature Photo Service for IBM
IBM, along with the Arizona-based manufacturer Local Motors, debuted the first-ever driverless vehicle to use the Watson cognitive computing platform. Dubbed “Olli,” the electric vehicle was unveiled at Local Motors’ new facility in National Harbor, Maryland, just outside of Washington, D.C.
Olli, which can carry up to 12 passengers, taps into four Watson APIs (Speech to Text, Natural Language Classifier, Entity Extraction and Text to Speech) to interact with its riders. It can answer questions like “Can I bring my children on board?” and respond to basic operational commands like, “Take me to the closest Mexican restaurant.” Olli can also give vehicle diagnostics, answering questions like, “Why are you stopping?”
Olli learns from data produced by more than 30 sensors embedded throughout the vehicle, which will be added and adjusted to meet passenger needs and local preferences.
While Olli is the first self-driving vehicle to use IBM Watson Internet of Things (IoT), this isn’t Watson’s first foray into the automotive industry. IBM launched its IoT for Automotive unit in September of last year, and in March, IBM and Honda announced a deal for Watson technology and analytics to be used in the automaker’s Formula One (F1) cars and pits.
IBM demonstrated its commitment to IoT in March of last year, when it announced it was spending $3B over four years to establish a separate IoT business unit, which later became the Watson IoT business unit.
IBM says that starting Thursday, Olli will be used on public roads locally in Washington, D.C. and will be used in Miami-Dade County and Las Vegas later this year. Miami-Dade County is exploring a pilot program that would deploy several autonomous vehicles to shuttle people around Miami.
NVIDIA today shifted its autonomous-driving leadership into high gear.
At a press event kicking off CES 2016, we unveiled artificial-intelligence technology that will let cars sense the world around them and pilot a safe route forward.
Dressed in his trademark black leather jacket, speaking to a crowd of some 400 automakers, media and analysts, NVIDIA CEO Jen-Hsun Huang revealed DRIVE PX 2, an automotive supercomputing platform that processes 24 trillion deep learning operations a second. That’s 10 times the performance of the first-generation DRIVE PX, now being used by more than 50 companies in the automotive world.
The new DRIVE PX 2 delivers 8 teraflops of processing power. It has the processing power of 150 MacBook Pros. And it’s the size of a lunchbox in contrast to earlier autonomous-driving technology being used today, which takes up the entire trunk of a mid-sized sedan.
“Self-driving cars will revolutionize society,” Huang said at the beginning of his talk. “And NVIDIA’s vision is to enable them.”
Volvo to Deploy DRIVE PX in Self-Driving SUVs
As part of its quest to eliminate traffic fatalities, Volvo will be the first automaker to deploy DRIVE PX 2.
Huang announced that Volvo – known worldwide for safety and reliability – will be the first automaker to deploy DRIVE PX 2.
In the world’s first public trial of autonomous driving, the Swedish automaker next year will lease 100 XC90 luxury SUVs outfitted with DRIVE PX 2 technology. The technology will help the vehicles drive autonomously around Volvo’s hometown of Gothenburg, and semi-autonomously elsewhere.
DRIVE PX 2 has the power to harness a host of sensors to get a 360 degree view of the environment around the car.
“The rear-view mirror is history,” Jen-Hsun said.
Drive Safely, by Not Driving at All
Not so long ago, pundits had questioned the safety of technology in cars. Now, with Volvo incorporating autonomous vehicles into its plan to end traffic fatalities, that script has been flipped. Autonomous cars may be vastly safer than human-piloted vehicles.
Car crashes – an estimated 93 percent of them caused by human error – kill 1.3 million drivers each year. More American teenagers die from texting while driving than any other cause, including drunk driving.
There’s also a productivity issue. Americans waste some 5.5 billion hours of time each year in traffic, costing the U.S. about $121 billion, according to an Urban Mobility Report from Texas A&M. And inefficient use of roads by cars wastes even vaster sums spent on infrastructure.
Deep Learning Hits the Road
Self-driving solutions based on computer vision can provide some answers. But tackling the infinite permutations that a driver needs to react to – stray pets, swerving cars, slashing rain, steady road construction crews – is far too complex a programming challenge.
Deep learning enabled by NVIDIA technology can address these challenges. A highly trained deep neural network – residing on supercomputers in the cloud – captures the experience of many tens of thousands of hours of road time.
Huang noted that a number of automotive companies are already using NVIDIA’s deep learning technology to power their efforts, getting speedup of 30-40X in training their networks compared with other technology. BMW, Daimler and Ford are among them, along with innovative Japanese startups like Preferred Networks and ZMP. And Audi said it was able in four hours to do training that took it two years with a competing solution.
NVIDIA DRIVE PX 2 is part of an end-to-end platform that brings deep learning to the road.
NVIDIA’s end-to-end solution for deep learning starts with NVIDIA DIGITS, a supercomputer that can be used to train digital neural networks by exposing them to data collected during that time on the road. On the other end is DRIVE PX 2, which draws on this training to make inferences to enable the car to progress safely down the road. In the middle is NVIDIA DriveWorks, a suite of software tools, libraries and modules that accelerates development and testing of autonomous vehicles.
DriveWorks enables sensor calibration, acquisition of surround data, synchronization, recording and then processing streams of sensor data through a complex pipeline of algorithms running on all of the DRIVE PX 2’s specialized and general-purpose processors.
During the event, Huang reminded the audience that machines are already beating humans at tasks once considered impossible for computers, such as image recognition. Systems trained with deep learning can now correctly classify images more than 96 percent of the time, exceeding what humans can do on similar tasks.
He used the event to show what deep learning can do for autonomous vehicles.
A series of demos drove this home, showing in three steps how DRIVE PX 2 harnesses a host of sensors – lidar, radar, cameras and ultrasonic – to understand the world around it, in real time, and plan a safe and efficient path forward.
The World’s Biggest Infotainment System
The highlight of the demos was what Huang called the world’s largest car infotainment system — an elegant block the size of a medium-sized bedroom wall mounted with a long horizontal screen and a long vertical one.
While a third larger screen showed the scene that a driver would take in, the wide demo screen showed how the car — using deep learning and sensor fusion — “viewed” the very same scene in real-time, stitched together from its array of sensors. On its right, the huge portrait-oriented screen shows a highly precise map that marked the car’s progress.
It’s a demo that will leave an impression on an audience that’s going to hear a lot about the future of driving in the week ahead.
With the unemployment rate falling to 5.3 percent, the lowest in seven years, policy makers are heaving a sigh of relief. Indeed, with the technology boom in progress, there is a lot to be optimistic about.
Manufacturing will be returning to U.S. shores with robots doing the job of Chinese workers;
American carmakers will be mass-producing self-driving electric vehicles;
technology companies will develop medical devices that greatly improve health and longevity;
we will have unlimited clean energy and 3D print our daily needs.
The cost of all of these things will plummet and make it possible to provide for the basic needs of every human being.
I am talking about technology advances that are happening now, which will bear fruit in the 2020s.
But policy makers will have a big new problem to deal with: the disappearance of human jobs. Not only will there be fewer jobs for people doing manual work, the jobs of knowledge workers will also be replaced by computers. Almost every industry and profession will be impacted and this will create a new set of social problems — because most people can’t adapt to such dramatic change.
If we can develop the economic structures necessary to distribute the prosperity we are creating, most people will no longer have to work to sustain themselves. They will be free to pursue other creative endeavors. The problem, however, is that without jobs, they will not have the dignity, social engagement, and sense of fulfillment that comes from work. The life, liberty and pursuit of happiness that the constitution entitles us to won’t be through labor, it will have to be through other means.
It is imperative that we understand the changes that are happening and find ways to cushion the impacts.
The technology elite who are leading this revolution will reassure you that there is nothing to worry about because we will create new jobs just as we did in previous centuries when the economy transitioned from agrarian to industrial to knowledge-based. Tech mogul Marc Andreessen has called the notion of a jobless future a “Luddite fallacy,” referring to past fears that machines would take human jobs away. Those fears turned out to be unfounded because we created newer and better jobs and were much better off.
True, we are living better lives. But what is missing from these arguments is the timeframe over which the transitions occurred. The industrial revolution unfolded over centuries. Today’s technology revolutions are happening within years. We will surely create a few intellectually-challenging jobs, but we won’t be able to retrain the workers who lose today’s jobs. They will experience the same unemployment and despair that their forefathers did. It is they who we need to worry about.
The first large wave of unemployment will be caused by self-driving cars. These will provide tremendous benefit by eliminating traffic accidents and congestion, making commuting time more productive, and reducing energy usage. But they will eliminate the jobs of millions of taxi and truck drivers and delivery people. Fully-automated robotic cars are no longer in the realm of science fiction; you can see Google’s cars on the streets of Mountain View, Calif. There are also self-driving trucks on our highways and self-driving tractors on farms. Uber just hired away dozens of engineers from Carnegie Mellon University to build its own robotic cars. It will surely start replacing its human drivers as soon as its technology is ready, later in this decade. As Uber CEO Travis Kalanick reportedly said in an interview, “The reason Uber could be expensive is you’re paying for the other dude in the car. When there is no other dude in the car, the cost of taking an Uber anywhere is cheaper. Even on a road trip.”
The dude in the driver’s seat will go away.
Manufacturing will be the next industry to be transformed. Robots have, for many years, been able to perform surgery, milk cows, do military reconnaissance and combat, and assemble goods. But they weren’t dexterous enough to do the type of work that humans do in installing circuit boards. The latest generation of industrial robots by ABB of Switzerland and Rethink Robotics of Boston can do this, however. ABB’s robot, Yumi, can even thread a needle. It costs only $40,000.
China, fearing the demise of its industry, is setting up fully-automated robotic factories in the hope that by becoming more price-competitive, it can continue to be the manufacturing capital of the world. But its advantage only holds up as long as the supply chains are in China and shipping raw materials and finished goods over the oceans remains cost-effective. Don’t forget that our robots are as productive as theirs are; they too don’t join labor unions (yet) and will work around the clock without complaining. Supply chains will surely shift and the trickle of returning manufacturing will become a flood.
But there will be few jobs for humans once the new, local factories are built.
With advances in artificial intelligence, any job that requires the analysis of information can be done better by computers. This includes the jobs of physicians, lawyers, accountants, and stock brokers. We will still need some humans to interact with the ones who prefer human contact, but the grunt work will disappear. The machines will need very few humans to help them.
This jobless future will surely create social problems — but it may be an opportunity for humanity to uplift itself. Why do we need to work 40, 50, or 60 hours a week, after all? Just as we were better off leaving the long and hard agrarian and factory jobs behind, we may be better off without the mindless work at the office. What if we could be working 10 or 15 hours per week from anywhere we want and have the remaining time for leisure, social work, or attainment of knowledge?
Yes, there will be a booming tourism and recreation industry and new jobs will be created in these — for some people.
There are as many things to be excited about as to fear. If we are smart enough to develop technologies that solve the problems of disease, hunger, energy, and education, we can — and surely will — develop solutions to our social problems. But we need to start by understanding where we are headed and prepare for the changes. We need to get beyond the claims of a Luddite fallacy — to a discussion about the new future.
Wadhwa is a fellow at Rock Center for Corporate Governance at Stanford
University, director of research at Center for Entrepreneurship and
Research Commercialization at Duke, and distinguished fellow at
past appointments include Harvard Law School, University of California
Berkeley, and Emory University. Follow him on Twitter @wadhwa.
Many gadgets will be able to understand images and video thanks to chips designed to run powerful artificial-intelligence algorithms.
WHY IT MATTERS
Many applications for mobile computers could be more powerful with advanced image recognition.
Many of the devices around us may soon acquire powerful new abilities to understand images and video, thanks to hardware designed for the machine-learning technique called deep learning.
Companies like Google have made breakthroughs in image and face recognition through deep learning, using giant data sets and powerful computers (see “10 Breakthrough Technologies 2013: Deep Learning”). Now two leading chip companies and the Chinese search giant Baidu say hardware is coming that will bring the technique to phones, cars, and more.
Chip manufacturers don’t typically disclose their new features in advance. But at a conference on computer vision Tuesday, Synopsys, a company that licenses software and intellectual property to the biggest names in chip making, showed off a new image-processor core tailored for deep learning. It is expected to be added to chips that power smartphones, cameras, and cars. The core would occupy about one square millimeter of space on a chip made with one of the most commonly used manufacturing technologies.
Pierre Paulin, a director of R&D at Synopsys, told MIT Technology Review that the new processor design will be made available to his company’s customers this summer. Many have expressed strong interest in getting hold of hardware to help deploy deep learning, he said.
Synopsys showed a demo in which the new design recognized speed-limit signs in footage from a car. Paulin also presented results from using the chip to run a deep-learning network trained to recognize faces. It didn’t hit the accuracy levels of the best research results, which have been achieved on powerful computers, but it came pretty close, he said. “For applications like video surveillance it performs very well,” he said. The specialized core uses significantly less power than a conventional chip would need to do the same task.
The new core could add a degree of visual intelligence to many kinds of devices, from phones to cheap security cameras. It wouldn’t allow devices to recognize tens of thousands of objects on their own, but Paulin said they might be able to recognize dozens.
That might lead to novel kinds of camera or photo apps. Paulin said the technology could also enhance car, traffic, and surveillance cameras. For example, a home security camera could start sending data over the Internet only when a human entered the frame. “You can do fancier things like detecting if someone has fallen on the subway,” he said.
Jeff Gehlhaar, vice president of technology at Qualcomm Research, spoke at the event about his company’s work on getting deep learning running on apps for existing phone hardware. He declined to discuss whether the company is planning to build support for deep learning into its chips. But speaking about the industry in general, he said that such chips are surely coming. Being able to use deep learning on mobile chips will be vital to helping robots navigate and interact with the world, he said, and to efforts to develop autonomous cars.
“I think you will see custom hardware emerge to solve these problems,” he said. “Our traditional approaches to silicon are going to run out of gas, and we’ll have to roll up our sleeves and do things differently.” Gehlhaar didn’t indicate how soon that might be. Qualcomm has said that its coming generation of mobile chips will include software designed to bring deep learning to camera and other apps (see “Smartphones Will Soon Learn to Recognize Faces and More”).
Ren Wu, a researcher at Chinese search company Baidu, also said chips that support deep learning are needed for powerful research computers in daily use. “You need to deploy that intelligence everywhere, at any place or any time,” he said.
Being able to do things like analyze images on a device without connecting to the Internet can make apps faster and more energy-efficient because it isn’t necessary to send data to and fro, said Wu. He and Qualcomm’s Gehlhaar both said that making mobile devices more intelligent could temper the privacy implications of some apps by reducing the volume of personal data such as photos transmitted off a device.
“You want the intelligence to filter out the raw data and only send the important information, the metadata, to the cloud,” said Wu.
If venture capital and research funding are any indication, artificial intelligence will play a leading role in shaping our future. And few tech innovators in the private or public sector have been as prominent in defining that role as Andrew Ng, chief scientist at China’s search giant Baidu. Ng has taught AI at Stanford, led the Google Brain project, founded online education pioneer Coursera, and just last year took his post at “China’s Google” in hopes of figuring out how to teach computers to see and hear, and to do that for the world’s most populous country.
Small wonder why China represents such a huge opportunity for machine intelligence applications.
Baidu is the world’s fifth most trafficked website. Shopping site Taobao, messaging app QQ, media company Sina, and microblogging platform Weibo, all Chinese properties, hold spots within the top 15. When Baidu designs an application, according to Ng, mobile comes first; cell phones are the primary channel of access for Chinese consumers.
Ng is soft-spoken with an undercurrent of passion when discussing his research. Today he manages a growing team at Baidu’s U.S. campus in Sunnyvale, Calif. He does not believe all the hype about the robot revolution, but says he does believe researchers are only scratching the surface of a machine’s potential. Killer robots are not his concern; he prefers to fret about a microprocessor’s run time or pushing voice recognition to a place where humans actually trust it. To him, there’s a lot of work to do. But Ng believes that there are enough good ideas and smart companies that someday soon we’ll be able to speak, rather than tap, when we want something on our smartphones.
In a recent chat on Skype (edited for brevity and clarity), Ng outlined what he thinks is within reach—and what isn’t—for machine intelligence.
What excites you most about the potential for AI and deep learning?
A number of organizations, us and others, have just amazing computer vision technology, doing things that seemed impossible even a year ago. I think the struggle is figuring out the most compelling products. I don’t know that any of us have found the killer app yet.
In Silicon Valley there are a lot of startups, using computer vision for agriculture or shopping—there are a lot for clothes shopping. At Baidu, for example, if you find a picture of a movie star, we actually use facial recognition to identify that movie star and then tell you things like their age and hobbies. If they are wearing clothing that we recognize, we can find related clothing you can buy, and we show that. That’s been pretty popular.
Could advertisers eventually bid on the placement in relation to that image?
We’re not doing that right now; we’re just finding related clothing. But there are a number of verticals like that—recognizing interesting people, recognizing a holiday destination and then showing other pictures of that same destination. There’s probably a potential for computer vision to do even bigger things, but I don’t think we’ve figured out what that is.
What’s the most valid reason that we should be worried about destructive artificial intelligence?
I think that hundreds of years from now if people invent a technology that we haven’t heard of yet, maybe a computer could turn evil. But the future is so uncertain. I don’t know what’s going to happen five years from now. The reason I say that I don’t worry about AI turning evil is the same reason I don’t worry about overpopulation on Mars. Hundreds of years from now I hope we’ve colonized Mars. But we’ve never set foot on the planet so how can we productively worry about this problem now?
What’s it like working on AI every day?
I think AI is akin to building a rocket ship. You need a huge engine and a lot of fuel. If you have a large engine and a tiny amount of fuel, you won’t make it to orbit. If you have a tiny engine and a ton of fuel, you can’t even lift off. To build a rocket you need a huge engine and a lot of fuel.
The analogy to deep learning [one of the key processes in creating artificial intelligence] is that the rocket engine is the deep learning models and the fuel is the huge amounts of data we can feed to these algorithms.
You spent time at Google—what’s your view on self-driving cars?
I sat close to that team and I’m friends with a lot of them, so I have a sense of what they’re doing. But I was not contributing directly to them.
I think self-driving cars are a little further out than most people think. There’s a debate about which one of two universes we’re in.
In the first universe it’s an incremental path to self-driving cars, meaning you have cruise control, adaptive cruise control, then self-driving cars only on the highways, and you keep adding stuff until 20 years from now you have a self-driving car.
In universe two you have one organization, maybe Carnegie Mellon or Google, that invents a self-driving car and bam! You have self-driving cars. It wasn’t available Tuesday but it’s on sale on Wednesday.
I’m in universe one. I think there’s a lot of confusion about how easy it is to do self-driving cars. There’s a big difference between being able to drive a thousand miles, versus being able to drive anywhere. And it turns out that machine-learning technology is good at pushing performance from 90 to 99 percent accuracy. But it’s challenging to get to four nines (99.99 percent). I’ll give you this: we’re firmly on our way to being safer than a drunk driver.
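To make the gap between 99 percent and "four nines" concrete, the following minimal sketch converts an accuracy figure into an expected number of errors. The function name and the workload of one million independent decisions are illustrative assumptions, not figures from the interview.

```python
def expected_errors(accuracy_percent: float, decisions: int) -> float:
    """Expected number of wrong decisions at a given accuracy."""
    error_rate = 1.0 - accuracy_percent / 100.0
    return error_rate * decisions


# Hypothetical workload: one million independent decisions.
DECISIONS = 1_000_000

for accuracy in (90.0, 99.0, 99.99):
    errors = expected_errors(accuracy, DECISIONS)
    print(f"{accuracy}% accurate: about {errors:,.0f} errors "
          f"per {DECISIONS:,} decisions")
```

Under these assumptions, going from 99 percent to 99.99 percent means cutting the error rate by a factor of 100, which is ten times the reduction needed to go from 90 to 99 percent.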
You founded Coursera and championed the value of online education programs. How do you think about the future of education?
Our education system has succeeded so far in teaching generations to do different routine tasks. So when tractors displaced farming labor we taught the next generation to work in factories. But what we’ve never really been good at is teaching a huge number of people to do non-routine creative work.
Do you buy the argument that the future of labor is less in peril because automation will lower the cost of goods so you will only need to work 10-20 hours a week?
I would have said zero hours. I see a minimum living wage as a long-term solution, but I’m not sure that’s my favorite. I think society benefits if all the human race is empowered and aspiring to do great things. Giving people the skill sets to do great things will take work.
In fact, he’s irritated by the discussion about scientists somehow building an apocalyptic super-intelligence. “I think it’s a distraction from the conversation about…serious issues,” Ng said at an AI conference in San Francisco last week.
Ng isn’t alone in thinking this way. A select group of AI luminaries met recently at a closed-door retreat in Puerto Rico to discuss ethics and AI. WIRED interviewed some of them, and the consensus was that there are short-term and long-term AI issues to worry about. But it’s the long-term questions getting all the press. Artificial intelligence is likely to start having an important effect on society over the next five to 10 years, according to Murray Shanahan, a professor of cognitive robotics at Imperial College. “It’s hard to predict exactly what’s going on,” he told WIRED a few weeks ago, “but we can be pretty sure that these technologies are going to impact society quite a bit.”
The way Ng sees it, it took the US about 200 years to switch from an agricultural economy where 90 percent of the country worked on farms, to our current economy, where the number is closer to 2 percent. The AI switchover promises to come much faster, and that could make it a bigger problem.
That’s an idea echoed by two MIT academics, Erik Brynjolfsson and Andrew McAfee, who argue that we’re entering a “second machine age,” where the accelerating rate of change brought on by digital technologies could leave millions of medium- and low-skilled workers behind.
Some AI technologies, such as the self-driving car, could be extremely disruptive, but over a much shorter period of time than the industrial revolution. There are three million truck drivers in the US, according to the American Trucking Association. What happens if self-driving vehicles put them all out of a job in a matter of years?
With recent advances in perception, the range of things that machines can do is getting a boost. Computers are better at understanding what we say and analyzing data in a way that used to be the exclusive domain of humans.
Last month, Audi’s self-driving car took WIRED’s Alex Davies for a 500-mile ride. In Cupertino, California’s Aloft Hotel, a robot butler can deliver you a toothbrush. Paralegals are now finding their work performed by data-sifting computers. And just last year, Google told us about a group of workers who were doing mundane image recognition work for the search giant: jobs like figuring out the difference between telephone numbers and street addresses on building walls. Google figured out how to do this by machine, and so they’ve now moved on to other things.
Ng, who also co-founded the online learning company Coursera, says that if AI really starts taking jobs, retraining all of those workers could present a major challenge. When it comes to retraining workers, he said, “our education system has historically found it very difficult.”
The development of artificial intelligence, thrown into the spotlight this week after Google spent hundreds of millions on new technology, could mean computers take over human jobs at a faster rate than new roles can be created, experts have warned.
Artificial intelligence could lead to mass unemployment if computers develop the capacity to take over human work, experts warned days after it emerged that Google had beaten competitors to buy a firm specialising in this kind of technology.
Dr Stuart Armstrong, from the Future of Humanity Institute at the University of Oxford, gave the stark warning after it emerged that Google had paid £400m for the British artificial intelligence firm DeepMind.
He welcomed the web giant’s decision to set up an ethics board to safely develop and use artificial intelligence, claiming that advances in the technology carried a number of risks.
Dr Armstrong said computers had the potential to take over people’s jobs at a faster rate than new roles could be created.
He cited logistics, administration and insurance underwriting as professions that were particularly vulnerable to the development of artificial intelligence.
He also warned about the implications for uncontrolled mass surveillance if computers were taught to recognise human faces.
Speaking on Radio 4’s Today programme, he said: “There’s a variety of short term risks for artificial intelligence, everyone knows about the autonomous drones.
“But there’s also the potential for mass surveillance, you don’t just have to recognise cat images, you could also recognise human faces and also mass unemployment in a variety of professions.”
He added: “We have some studies looking into which jobs are the most vulnerable and there’s quite a lot of them in logistics, administration, insurance underwriting but ultimately a huge swathe of jobs are potentially vulnerable to improved artificial intelligence.”
His concerns were backed up by Murray Shanahan, professor of cognitive robotics at Imperial College London, who said: “I think it is a very good thing that Google has set up this ethics board and I think there certainly are some short term issues that we all need to be talking about.
“It’s very difficult to predict and that is of course a concern but in the past when we’ve developed new kinds of technologies then often they have created jobs at the same time as taking them over but it certainly is something we ought to be discussing.”
DeepMind was founded two years ago by 37-year-old neuroscientist and former teenage chess prodigy Demis Hassabis, along with Shane Legg and Mustafa Suleyman.
The company specialises in algorithms and machine learning for simulation, e-commerce and games.
It is also working in an area called deep learning, in which machines are taught to see patterns in large quantities of data so that computers could start to recognise objects from daily life, such as cars or food products, and even human faces.
It is believed Google will use DeepMind’s expertise to improve the functions of its current products, such as Google Glass, and extend its current artificial intelligence work, such as the development of self-driving cars.
Mr Shanahan said: “We all know that Google have got an interest in wearable computing with their Google Glass and you can imagine them and other companies using this technology to build some kind of assistant that for example could help you to make a lasagne in your kitchen and to tell you what ingredients you needed and where to find them.
“Not necessarily a robot assistant but something wearable such as your Google Glass or some other maker might make a similar thing so you can carry it around with you.”