Deep-learning machines already have superhuman skills when it comes to tasks such as video-game playing and even the ancient Chinese game of Go. So it’s easy to think that humans are already outgunned.
But not so fast. Intelligent machines still lag behind humans in one crucial area of performance: the speed at which they learn. When it comes to mastering classic video games, for example, the best deep-learning machines take some 200 hours of play to reach the same skill levels that humans achieve in just two hours.
So computer scientists would dearly love to have some way to speed up the rate at which machines learn.
Today, Alexander Pritzel and pals at Google’s DeepMind subsidiary in London claim to have done just that. These guys have built a deep-learning machine that is capable of rapidly assimilating new experiences and then acting on them. The result is a machine that learns significantly faster than others and has the potential to match humans in the not too distant future.
First, some background.
Deep learning uses layers of neural networks to look for patterns in data. When a single layer spots a pattern it recognizes, it sends this information to the next layer, which looks for patterns in this signal, and so on.
So in face recognition, one layer might look for edges in an image, the next layer for circular patterns of edges (the kind that eyes and mouths make), and the next for triangular patterns such as those made by two eyes and a mouth. When all this happens, the final output is an indication that a face has been spotted.
Of course, the devil is in the details. There are various systems of feedback to allow the system to learn by adjusting various internal parameters such as the strength of connections between layers. These parameters must change slowly, since a big change in one layer can catastrophically affect learning in the subsequent layers. That’s why deep neural networks need so much training and why it takes so long.
Pritzel and co have tackled this problem with a technique they call Neural Episodic Control. “Neural episodic control demonstrates dramatic improvements on the speed of learning for a wide range of environments,” they say. “Critically, our agent is able to rapidly latch onto highly successful strategies as soon as they are experienced, instead of waiting for many steps of optimisation.”
The basic idea behind DeepMind’s approach is to copy the way humans and animals learn quickly. The general consensus is that humans can tackle situations in two different ways.
If the situation is familiar, our brains have already formed a model of it, which they use to work out how best to behave. This uses a part of the brain called the prefrontal cortex.
But when the situation is not familiar, our brains have to fall back on another strategy. This is thought to involve a much simpler test-and-remember approach involving the hippocampus. So we try something and remember the outcome of this episode. If it is successful, we try it again, and so on. But if it is not a successful episode, we try to avoid it in future.
This episodic approach suffices in the short term while our prefrontal brain learns. But it is soon outperformed by the prefrontal cortex and its model-based approach.
Pritzel and co have used this approach as their inspiration. Their new system has two approaches.
The first is a conventional deep-learning system that mimics the behavior of the prefrontal cortex.
The second is more like the hippocampus. When the system tries something new, it remembers the outcome.
But crucially, it doesn’t try to learn what to remember. Instead, it remembers everything. “Our architecture does not try to learn when to write to memory, as this can be slow to learn and take a significant amount of time,” say Pritzel and co. “Instead, we elect to write all experiences to the memory, and allow it to grow very large compared to existing memory architectures.”
They then use a set of strategies to read from this large memory quickly. The result is that the system can latch onto successful strategies much more quickly than conventional deep-learning systems.
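The write-everything idea can be illustrated with a very small sketch. The article does not say how the memory is read, so the nearest-neighbour lookup below is an assumption made purely for illustration, and the class name `EpisodicMemory`, its methods, and the parameter `k` are all hypothetical rather than taken from DeepMind's system.

```python
import math


class EpisodicMemory:
    """Append-only store of (state, action, observed return) records."""

    def __init__(self, k=3):
        self.k = k          # number of neighbours to average (assumed value)
        self.records = []   # every experience is written; nothing is filtered

    def write(self, state, action, observed_return):
        # No learned "should I store this?" decision: just append.
        self.records.append((tuple(state), action, float(observed_return)))

    def estimate(self, state, action):
        """Average return of the k stored states closest to `state`."""
        matches = [(math.dist(state, s), r)
                   for s, a, r in self.records if a == action]
        if not matches:
            return 0.0  # nothing remembered yet for this action
        nearest = sorted(matches)[: self.k]
        return sum(r for _, r in nearest) / len(nearest)

    def best_action(self, state, actions):
        # Pick the action whose remembered outcomes look best near this state.
        return max(actions, key=lambda a: self.estimate(state, a))


memory = EpisodicMemory(k=1)
memory.write([0.0, 0.0], "left", 1.0)
memory.write([0.0, 0.0], "right", 5.0)
memory.write([9.0, 9.0], "left", 7.0)
print(memory.best_action([0.1, 0.0], ["left", "right"]))  # right
```

A single good experience changes the agent's choice immediately here, which is the property the researchers emphasise; a gradient-trained network would need many small updates to reflect the same outcome.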
They go on to demonstrate how well all this works by training their machine to play classic Atari video games, such as Breakout, Pong, and Space Invaders. (This is a playground that DeepMind has used to train many deep-learning machines.)
The team, which includes DeepMind cofounder Demis Hassabis, shows that neural episodic control vastly outperforms other deep-learning approaches in the speed at which it learns. “Our experiments show that neural episodic control requires an order of magnitude fewer interactions with the environment,” they say.
That’s impressive work with significant potential. The researchers say that an obvious extension of this work is to test their new approach on more complex 3-D environments.
It’ll be interesting to see what environments the team chooses and the impact this will have on the real world. We’ll look forward to seeing how that works out.
Guessing the location of a randomly chosen Street View image is hard, even for well-traveled humans. But Google’s latest artificial-intelligence machine manages it with relative ease.
Here’s a tricky task. Pick a photograph from the Web at random. Now try to work out where it was taken using only the image itself. If the image shows a famous building or landmark, such as the Eiffel Tower or Niagara Falls, the task is straightforward. But the job becomes significantly harder when the image lacks specific location cues or is taken indoors or shows a pet or food or some other detail.
Nevertheless, humans are surprisingly good at this task. To help, they bring to bear all kinds of knowledge about the world such as the type and language of signs on display, the types of vegetation, architectural styles, the direction of traffic, and so on. Humans spend a lifetime picking up these kinds of geolocation cues.
So it’s easy to think that machines would struggle with this task. And indeed, they have.
Today, that changes thanks to the work of Tobias Weyand, a computer vision specialist at Google, and a couple of pals. These guys have trained a deep-learning machine to work out the location of almost any photo using only the pixels it contains.
Their new machine significantly outperforms humans and can even use a clever trick to determine the location of indoor images and pictures of specific things such as pets, food, and so on that have no location cues.
Their approach is straightforward, at least in the world of machine learning.
Weyand and co begin by dividing the world into a grid consisting of over 26,000 squares of varying size that depend on the number of images taken in that location.
So big cities, which are the subjects of many images, have a more fine-grained grid structure than more remote regions where photographs are less common. Indeed, the Google team ignored areas like oceans and the polar regions, where few photographs have been taken.
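One way to picture a grid whose cell size follows photo density is recursive subdivision. The sketch below is not the partitioning scheme Weyand and co actually used, which the article does not describe; it assumes a plain latitude/longitude quadtree, and the function name `build_grid` and the thresholds `max_photos` and `min_photos` are hypothetical.

```python
def build_grid(photos, max_photos, min_photos,
               bounds=(-90.0, 90.0, -180.0, 180.0), max_depth=12):
    """Return a list of (lat_lo, lat_hi, lon_lo, lon_hi) cells.

    A cell is split into four while it holds more than `max_photos`
    photos, and dropped entirely if it holds fewer than `min_photos`.
    Cells are half-open, so a photo at exactly 90 N or 180 E is ignored.
    """
    lat_lo, lat_hi, lon_lo, lon_hi = bounds
    inside = [(lat, lon) for lat, lon in photos
              if lat_lo <= lat < lat_hi and lon_lo <= lon < lon_hi]
    if len(inside) < min_photos:
        return []            # sparse area (ocean, poles): no cell at all
    if len(inside) <= max_photos or max_depth == 0:
        return [bounds]      # few enough photos: keep as one cell
    lat_mid = (lat_lo + lat_hi) / 2.0
    lon_mid = (lon_lo + lon_hi) / 2.0
    quadrants = [
        (lat_lo, lat_mid, lon_lo, lon_mid),
        (lat_lo, lat_mid, lon_mid, lon_hi),
        (lat_mid, lat_hi, lon_lo, lon_mid),
        (lat_mid, lat_hi, lon_mid, lon_hi),
    ]
    cells = []
    for quadrant in quadrants:
        cells.extend(build_grid(inside, max_photos, min_photos,
                                quadrant, max_depth - 1))
    return cells


# Five photos clustered in one city and one photo in a remote region.
photos = [(48.85, 2.35), (48.86, 2.35), (48.85, 2.36),
          (48.86, 2.36), (48.87, 2.34), (-50.0, -100.0)]
for cell in build_grid(photos, max_photos=2, min_photos=1):
    print(cell)
```

The dense cluster ends up in small cells while the lone remote photo sits in one very large cell, and quadrants with no photos produce no cell, mirroring the treatment of oceans and polar regions described above.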
Next, the team created a database of geolocated images from the Web and used the location data to determine the grid square in which each image was taken. This data set is huge, consisting of 126 million images along with their accompanying Exif location data.
Weyand and co used 91 million of these images to teach a powerful neural network to work out the grid location using only the image itself. Their idea is to input an image into this neural net and get as the output a particular grid location or a set of likely candidates.
They then validated the neural network using the remaining 34 million images in the data set.
Finally they tested the network—which they call PlaNet—in a number of different ways to see how well it works.
The results make for interesting reading. To measure the accuracy of their machine, they fed it 2.3 million geotagged images from Flickr to see whether it could correctly determine their location. “PlaNet is able to localize 3.6 percent of the images at street-level accuracy and 10.1 percent at city-level accuracy,” say Weyand and co. What’s more, the machine determines the country of origin in a further 28.4 percent of the photos and the continent in 48.0 percent of them.
That’s pretty good. But to show just how good, Weyand and co put PlaNet through its paces in a test against 10 well-traveled humans. For the test, they used an online game that presents a player with a random view taken from Google Street View and asks him or her to pinpoint its location on a map of the world.
Anyone can play at www.geoguessr.com. Give it a try—it’s a lot of fun and more tricky than it sounds.
Needless to say, PlaNet trounced the humans. “In total, PlaNet won 28 of the 50 rounds with a median localization error of 1131.7 km, while the median human localization error was 2320.75 km,” say Weyand and co. “[This] small-scale experiment shows that PlaNet reaches superhuman performance at the task of geolocating Street View scenes.”
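For readers who want to see how a figure like a median localization error can be computed, here is a minimal sketch. It assumes the error is the great-circle distance between the guessed and true coordinates, calculated with the haversine formula on a spherical Earth; the article does not state the exact distance measure used, and the function names are hypothetical.

```python
import math
from statistics import median

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; a spherical approximation


def great_circle_km(a, b):
    """Haversine distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2.0) ** 2
         + math.cos(lat1) * math.cos(lat2)
         * math.sin((lon2 - lon1) / 2.0) ** 2)
    return 2.0 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def median_error_km(guesses, truths):
    """Median distance between paired guesses and true locations."""
    return median(great_circle_km(g, t) for g, t in zip(guesses, truths))


# Made-up coordinates for illustration only.
truths = [(48.86, 2.35), (40.71, -74.01), (35.68, 139.69)]
guesses = [(51.51, -0.13), (41.88, -87.63), (35.68, 139.69)]
print(round(median_error_km(guesses, truths), 1))
```

The median is a natural choice for this kind of contest because a single wildly wrong guess on the other side of the planet does not dominate the score the way it would with a mean.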
An interesting question is how PlaNet performs so well without being able to use the cues that humans rely on, such as vegetation, architectural style, and so on. But Weyand and co say they know why: “We think PlaNet has an advantage over humans because it has seen many more places than any human can ever visit and has learned subtle cues of different scenes that are even hard for a well-traveled human to distinguish.”
They go further and use the machine to locate images that do not have location cues, such as those taken indoors or of specific items. This is possible when images are part of albums that have all been taken at the same place. The machine simply looks through other images in the album to work out where they were taken and assumes the more specific image was taken in the same place.
That’s impressive work that shows deep neural nets flexing their muscles once again. Perhaps more impressive still is that the model uses a relatively small amount of memory unlike other approaches that use gigabytes of the stuff. “Our model uses only 377 MB, which even fits into the memory of a smartphone,” say Weyand and co.
That’s a tantalizing idea—the power of a superhuman neural network on a smartphone. It surely won’t be long now!
New software does in seconds what took staff 360,000 hours
Bank seeking to streamline systems, avoid redundancies
At JPMorgan Chase & Co., a learning machine is parsing financial deals that once kept legal teams busy for thousands of hours.
The program, called COIN, for Contract Intelligence, does the mind-numbing job of interpreting commercial-loan agreements that, until the project went online in June, consumed 360,000 hours of work each year by lawyers and loan officers. The software reviews documents in seconds, is less error-prone and never asks for vacation.
Attendees discuss software on Feb. 27, the eve of JPMorgan’s Investor Day.
Photographer: Kholood Eid/Bloomberg
While the financial industry has long touted its technological innovations, a new era of automation is now in overdrive as cheap computing power converges with fears of losing customers to startups. Made possible by investments in machine learning and a new private cloud network, COIN is just the start for the biggest U.S. bank. The firm recently set up technology hubs for teams specializing in big data, robotics and cloud infrastructure to find new sources of revenue, while reducing expenses and risks.
The push to automate mundane tasks and create new tools for bankers and clients — a growing part of the firm’s $9.6 billion technology budget — is a core theme as the company hosts its annual investor day on Tuesday.
Behind the strategy, overseen by Chief Operating Officer Matt Zames and Chief Information Officer Dana Deasy, is an undercurrent of anxiety: Though JPMorgan emerged from the financial crisis as one of few big winners, its dominance is at risk unless it aggressively pursues new technologies, according to interviews with a half-dozen bank executives.
That was the message Zames had for Deasy when he joined the firm from BP Plc in late 2013. The New York-based bank’s internal systems, an amalgam from decades of mergers, had too many redundant software programs that didn’t work together seamlessly.
“Matt said, ‘Remember one thing above all else: We absolutely need to be the leaders in technology across financial services,’” Deasy said last week in an interview. “Everything we’ve done from that day forward stems from that meeting.”
After visiting companies including Apple Inc. and Facebook Inc. three years ago to understand how their developers worked, the bank set out to create its own computing cloud called Gaia that went online last year. Machine learning and big-data efforts now reside on the private platform, which effectively has limitless capacity to support their thirst for processing power. The system already is helping the bank automate some coding activities and making its 20,000 developers more productive, saving money, Zames said. When needed, the firm can also tap into outside cloud services from Amazon.com Inc., Microsoft Corp. and International Business Machines Corp.
Tech Spending
JPMorgan will make some of its cloud-backed technology available to institutional clients later this year, allowing firms like BlackRock Inc. to access balances, research and trading tools. The move, which lets clients bypass salespeople and support staff for routine information, is similar to one Goldman Sachs Group Inc. announced in 2015.
JPMorgan’s total technology budget for this year amounts to 9 percent of its projected revenue, double the industry average, according to Morgan Stanley analyst Betsy Graseck. The dollar figure has inched higher as JPMorgan bolsters cyber defenses after a 2014 data breach, which exposed the information of 83 million customers.
“We have invested heavily in technology and marketing — and we are seeing strong returns,” JPMorgan said in a presentation Tuesday ahead of its investor day, noting that technology spending in its consumer bank totaled about $1 billion over the past two years.
One-third of the company’s budget is for new initiatives, a figure Zames wants to take to 40 percent in a few years. He expects savings from automation and retiring old technology will let him plow even more money into new innovations.
Not all of those bets, which include several projects based on a distributed ledger, like blockchain, will pay off, which JPMorgan says is OK. One example executives are fond of mentioning: The firm built an electronic platform to help trade credit-default swaps that sits unused.
‘Can’t Wait’
“We’re willing to invest to stay ahead of the curve, even if in the final analysis some of that money will go to product or a service that wasn’t needed,” Marianne Lake, the lender’s finance chief, told a conference audience in June. That’s “because we can’t wait to know what the outcome, the endgame, really looks like, because the environment is moving so fast.”
As for COIN, the program has helped JPMorgan cut down on loan-servicing mistakes, most of which stemmed from human error in interpreting 12,000 new wholesale contracts per year, according to its designers.
JPMorgan is scouring for more ways to deploy the technology, which learns by ingesting data to identify patterns and relationships. The bank plans to use it for other types of complex legal filings like credit-default swaps and custody agreements. Someday, the firm may use it to help interpret regulations and analyze corporate communications.
Another program called X-Connect, which went into use in January, examines e-mails to help employees find colleagues who have the closest relationships with potential prospects and can arrange introductions.
Creating Bots
For simpler tasks, the bank has created bots to perform functions like granting access to software systems and responding to IT requests, such as resetting an employee’s password, Zames said. Bots are expected to handle 1.7 million access requests this year, doing the work of 140 people.
While growing numbers of people in the industry worry such advancements might someday take their jobs, many Wall Street personnel are more focused on benefits. A survey of more than 3,200 financial professionals by recruiting firm Options Group last year found a majority expect new technology will improve their careers, for example by improving workplace performance.
“Anything where you have back-office operations and humans kind of moving information from point A to point B that’s not automated is ripe for that,” Deasy said. “People always talk about this stuff as displacement. I talk about it as freeing people to work on higher-value things, which is why it’s such a terrific opportunity for the firm.”
To help spur internal disruption, the company keeps tabs on 2,000 technology ventures, using about 100 in pilot programs that will eventually join the firm’s growing ecosystem of partners. For instance, the bank’s machine-learning software was built with Cloudera Inc., a software firm that JPMorgan first encountered in 2009.
“We’re starting to see the real fruits of our labor,” Zames said. “This is not pie-in-the-sky stuff.”
In a new automotive application, we have used convolutional neural networks (CNNs) to map the raw pixels from a front-facing camera to the steering commands for a self-driving car. This powerful end-to-end approach means that with minimum training data from humans, the system learns to steer, with or without lane markings, on both local roads and highways. The system can also operate in areas with unclear visual guidance such as parking lots or unpaved roads.
Figure 1: NVIDIA’s self-driving car in action.
We designed the end-to-end learning system using an NVIDIA DevBox running Torch 7 for training. An NVIDIA DRIVE PX self-driving car computer, also with Torch 7, was used to determine where to drive while operating at 30 frames per second (FPS). The system is trained to automatically learn the internal representations of necessary processing steps, such as detecting useful road features, with only the human steering angle as the training signal. We never explicitly trained it to detect, for example, the outline of roads. In contrast to methods using explicit decomposition of the problem, such as lane marking detection, path planning, and control, our end-to-end system optimizes all processing steps simultaneously.
We believe that end-to-end learning leads to better performance and smaller systems. Better performance results because the internal components self-optimize to maximize overall system performance, instead of optimizing human-selected intermediate criteria, e.g., lane detection. Such criteria understandably are selected for ease of human interpretation, which doesn’t automatically guarantee maximum system performance. Smaller networks are possible because the system learns to solve the problem with the minimal number of processing steps.
Convolutional Neural Networks to Process Visual Data
CNNs have revolutionized the computational pattern recognition process. Prior to the widespread adoption of CNNs, most pattern recognition tasks were performed using an initial stage of hand-crafted feature extraction followed by a classifier. The important breakthrough of CNNs is that features are now learned automatically from training examples. The CNN approach is especially powerful when applied to image recognition tasks because the convolution operation captures the 2D nature of images. By using the convolution kernels to scan an entire image, relatively few parameters need to be learned compared to the total number of operations.
While CNNs with learned features have been used commercially for over twenty years, their adoption has exploded in recent years because of two important developments.
First, large, labeled data sets such as the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) are now widely available for training and validation.
Second, CNN learning algorithms are now implemented on massively parallel graphics processing units (GPUs), tremendously accelerating learning and inference ability.
The CNNs that we describe here go beyond basic pattern recognition. We developed a system that learns the entire processing pipeline needed to steer an automobile. The groundwork for this project was actually done over 10 years ago in a Defense Advanced Research Projects Agency (DARPA) seedling project known as DARPA Autonomous Vehicle (DAVE), in which a sub-scale radio control (RC) car drove through a junk-filled alley way. DAVE was trained on hours of human driving in similar, but not identical, environments. The training data included video from two cameras and the steering commands sent by a human operator.
In many ways, DAVE was inspired by the pioneering work of Pomerleau, who in 1989 built the Autonomous Land Vehicle in a Neural Network (ALVINN) system. ALVINN is a precursor to DAVE, and it provided the initial proof of concept that an end-to-end trained neural network might one day be capable of steering a car on public roads. DAVE demonstrated the potential of end-to-end learning, and indeed was used to justify starting the DARPA Learning Applied to Ground Robots (LAGR) program, but DAVE’s performance was not sufficiently reliable to provide a full alternative to the more modular approaches to off-road driving. (DAVE’s mean distance between crashes was about 20 meters in complex environments.)
About a year ago we started a new effort to improve on the original DAVE, and create a robust system for driving on public roads. The primary motivation for this work is to avoid the need to recognize specific human-designated features, such as lane markings, guard rails, or other cars, and to avoid having to create a collection of “if, then, else” rules, based on observation of these features. We are excited to share the preliminary results of this new effort, which is aptly named DAVE-2.
The DAVE-2 System
Figure 2: High-level view of the data collection system.
Figure 2 shows a simplified block diagram of the collection system for training data of DAVE-2. Three cameras are mounted behind the windshield of the data-acquisition car, and timestamped video from the cameras is captured simultaneously with the steering angle applied by the human driver. The steering command is obtained by tapping into the vehicle’s Controller Area Network (CAN) bus. In order to make our system independent of the car geometry, we represent the steering command as 1/r, where r is the turning radius in meters. We use 1/r instead of r to prevent a singularity when driving straight (the turning radius for driving straight is infinity). 1/r smoothly transitions through zero from left turns (negative values) to right turns (positive values).
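The reason for using 1/r is easiest to see with a few numbers. The following is a small illustrative sketch; the function name `steering_command` and the convention of passing a signed radius (negative for left turns) are assumptions made for the example, chosen so the sign of 1/r matches the description above.

```python
import math


def steering_command(turning_radius_m):
    """Convert a signed turning radius in meters into the 1/r command.

    Negative radius = left turn, positive radius = right turn (assumed
    convention). Driving straight corresponds to an infinite radius.
    """
    if math.isinf(turning_radius_m):
        return 0.0  # straight ahead: 1/r is exactly zero, no singularity
    return 1.0 / turning_radius_m


# Sweep from a left turn, through straight, to a right turn.
for radius in (-50.0, -1000.0, math.inf, 1000.0, 50.0):
    print(radius, steering_command(radius))
```

Expressed as a radius, the same sweep would jump from large negative values to infinity and back to large positive values, which is an awkward regression target; expressed as 1/r it passes smoothly through zero.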
Training data contains single images sampled from the video, paired with the corresponding steering command (1/r). Training with data from only the human driver is not sufficient; the network must also learn how to recover from any mistakes, or the car will slowly drift off the road. The training data is therefore augmented with additional images that show the car in different shifts from the center of the lane and rotations from the direction of the road.
The images for two specific off-center shifts can be obtained from the left and the right cameras. Additional shifts between the cameras and all rotations are simulated through viewpoint transformation of the image from the nearest camera. Precise viewpoint transformation requires 3D scene knowledge which we don’t have, so we approximate the transformation by assuming all points below the horizon are on flat ground, and all points above the horizon are infinitely far away. This works fine for flat terrain, but for a more complete rendering it introduces distortions for objects that stick above the ground, such as cars, poles, trees, and buildings. Fortunately these distortions don’t pose a significant problem for network training. The steering label for the transformed images is quickly adjusted to one that correctly steers the vehicle back to the desired location and orientation in two seconds.
Figure 3: Training the neural network.
Figure 3 shows a block diagram of our training system. Images are fed into a CNN that then computes a proposed steering command. The proposed command is compared to the desired command for that image, and the weights of the CNN are adjusted to bring the CNN output closer to the desired output. The weight adjustment is accomplished using back propagation as implemented in the Torch 7 machine learning package.
Once trained, the network is able to generate steering commands from the video images of a single center camera. Figure 4 shows this configuration.
Figure 4: The trained network is used to generate steering commands from a single front-facing center camera.
Training data was collected by driving on a wide variety of roads and in a diverse set of lighting and weather conditions. We gathered surface street data in central New Jersey and highway data from Illinois, Michigan, Pennsylvania, and New York. Other road types include two-lane roads (with and without lane markings), residential roads with parked cars, tunnels, and unpaved roads. Data was collected in clear, cloudy, foggy, snowy, and rainy weather, both day and night. In some instances, the sun was low in the sky, resulting in glare reflecting from the road surface and scattering from the windshield.
The data was acquired using either our drive-by-wire test vehicle, which is a 2016 Lincoln MKZ, or using a 2013 Ford Focus with cameras placed in similar positions to those in the Lincoln. Our system has no dependencies on any particular vehicle make or model. Drivers were encouraged to maintain full attentiveness, but otherwise drive as they usually do. As of March 28, 2016, about 72 hours of driving data was collected.
Figure 5: CNN architecture. The network has about 27 million connections and 250 thousand parameters.
We train the weights of our network to minimize the mean-squared error between the steering command output by the network, and either the command of the human driver or the adjusted steering command for off-center and rotated images (see “Augmentation”, later). Figure 5 shows the network architecture, which consists of 9 layers, including a normalization layer, 5 convolutional layers, and 3 fully connected layers. The input image is split into YUV planes and passed to the network.
The first layer of the network performs image normalization. The normalizer is hard-coded and is not adjusted in the learning process. Performing normalization in the network allows the normalization scheme to be altered with the network architecture, and to be accelerated via GPU processing.
The convolutional layers are designed to perform feature extraction, and are chosen empirically through a series of experiments that vary layer configurations. We then use strided convolutions in the first three convolutional layers with a 2×2 stride and a 5×5 kernel, and a non-strided convolution with a 3×3 kernel size in the final two convolutional layers.
We follow the five convolutional layers with three fully connected layers, leading to a final output control value which is the inverse-turning-radius. The fully connected layers are designed to function as a controller for steering, but we noted that by training the system end-to-end, it is not possible to make a clean break between which parts of the network function primarily as feature extractor, and which serve as controller.
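The layer description can be sanity-checked against the figures in the Figure 5 caption with a little arithmetic. The sketch below assumes details that are not stated in the text: a 3×66×200 input, filter counts of 24, 36, 48, 64, and 64, fully connected layers of 100, 50, and 10 units followed by a single output, and no padding. These values are assumptions used for illustration; only the kernel sizes and strides come from the description above.

```python
def conv_output(size, kernel, stride):
    """Output length of an unpadded convolution along one dimension."""
    return (size - kernel) // stride + 1


INPUT_SHAPE = (3, 66, 200)        # assumed: YUV planes, 66 rows, 200 columns
CONV_LAYERS = [                   # (filters, kernel, stride)
    (24, 5, 2), (36, 5, 2), (48, 5, 2),   # strided 5x5 layers (from the text)
    (64, 3, 1), (64, 3, 1),               # non-strided 3x3 layers (from the text)
]                                 # filter counts are assumed
FC_LAYERS = [100, 50, 10, 1]      # assumed sizes; last entry is the 1/r output


def summarize(input_shape=INPUT_SHAPE):
    """Return (shape after the conv stack, total trainable parameters)."""
    channels, height, width = input_shape
    params = 0
    for filters, kernel, stride in CONV_LAYERS:
        # Each filter has kernel*kernel*channels weights plus one bias.
        params += filters * (kernel * kernel * channels + 1)
        height = conv_output(height, kernel, stride)
        width = conv_output(width, kernel, stride)
        channels = filters
    conv_shape = (channels, height, width)
    features = channels * height * width
    for units in FC_LAYERS:
        params += units * (features + 1)   # weights plus one bias per unit
        features = units
    return conv_shape, params


shape, total = summarize()
print(shape, total)  # (64, 1, 18) 252219
```

Under these assumptions the total comes to roughly 250 thousand parameters, in line with the caption, and almost half of them sit in the first fully connected layer rather than in the convolutions.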
The first step to training a neural network is selecting the frames to use. Our collected data is labeled with road type, weather condition, and the driver’s activity (staying in a lane, switching lanes, turning, and so forth). To train a CNN to do lane following, we simply select data where the driver is staying in a lane, and discard the rest. We then sample that video at 10 FPS because a higher sampling rate would include images that are highly similar, and thus not provide much additional useful information. To remove a bias towards driving straight the training data includes a higher proportion of frames that represent road curves.
After selecting the final set of frames, we augment the data by adding artificial shifts and rotations to teach the network how to recover from a poor position or orientation. The magnitude of these perturbations is chosen randomly from a normal distribution. The distribution has zero mean, and the standard deviation is twice the standard deviation that we measured with human drivers. Artificially augmenting the data does add undesirable artifacts as the magnitude increases (as mentioned previously).
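The sampling rule for the perturbations can be written down in a few lines. In this sketch the human-driver standard deviations passed in (0.2 m of lateral shift and 1.0 degree of rotation) are placeholder numbers, since the measured values are not given, and the function names are hypothetical.

```python
import random


def sample_perturbation(human_std, rng=random):
    """Draw one perturbation: zero mean, twice the human standard deviation."""
    return rng.gauss(0.0, 2.0 * human_std)


def sample_augmentations(count, shift_std_m, rotation_std_deg, seed=0):
    """Return `count` (shift in meters, rotation in degrees) pairs."""
    rng = random.Random(seed)  # seeded so the example is repeatable
    return [(sample_perturbation(shift_std_m, rng),
             sample_perturbation(rotation_std_deg, rng))
            for _ in range(count)]


# Placeholder human-driver statistics, for illustration only.
for shift, rotation in sample_augmentations(5, shift_std_m=0.2,
                                            rotation_std_deg=1.0):
    print(f"shift {shift:+.2f} m, rotation {rotation:+.2f} deg")
```

Doubling the standard deviation means the network sees off-center and rotated views more often, and in more extreme forms, than a human driver would naturally produce, which is what gives it examples of recovery.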
Before road-testing a trained CNN, we first evaluate the network’s performance in simulation. Figure 6 shows a simplified block diagram of the simulation system, and Figure 7 shows a screenshot of the simulator in interactive mode.
Figure 6: Block-diagram of the drive simulator.
The simulator takes prerecorded videos from a forward-facing on-board camera connected to a human-driven data-collection vehicle, and generates images that approximate what would appear if the CNN were instead steering the vehicle. These test videos are time-synchronized with the recorded steering commands generated by the human driver.
Since human drivers don’t drive in the center of the lane all the time, we must manually calibrate the lane’s center as it is associated with each frame in the video used by the simulator. We call this position the “ground truth”.
The simulator transforms the original images to account for departures from the ground truth. Note that this transformation also includes any discrepancy between the human driven path and the ground truth. The transformation is accomplished by the same methods as described previously.
The simulator accesses the recorded test video along with the synchronized steering commands that occurred when the video was captured. The simulator sends the first frame of the chosen test video, adjusted for any departures from the ground truth, to the input of the trained CNN, which then returns a steering command for that frame. The CNN steering commands as well as the recorded human-driver commands are fed into the dynamic model of the vehicle to update the position and orientation of the simulated vehicle.
Figure 7: Screenshot of the simulator in interactive mode. See text for explanation of the performance metrics. The green area on the left is unknown because of the viewpoint transformation. The highlighted wide rectangle below the horizon is the area which is sent to the CNN.
The simulator then modifies the next frame in the test video so that the image appears as if the vehicle were at the position that resulted by following steering commands from the CNN. This new image is then fed to the CNN and the process repeats.
The simulator records the off-center distance (distance from the car to the lane center), the yaw, and the distance traveled by the virtual car. When the off-center distance exceeds one meter, a virtual human intervention is triggered, and the virtual vehicle position and orientation is reset to match the ground truth of the corresponding frame of the original test video.
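The bookkeeping in this loop (update the pose, check the off-center distance, reset on intervention) can be sketched without any images at all. The toy below is not the simulator described here: it assumes a straight road, a simple kinematic update in place of the vehicle's dynamic model, and a `controller` function that sees the pose directly instead of a CNN looking at transformed video frames. All names and default values are hypothetical.

```python
import math


def simulate(controller, frames, speed_mps=10.0, fps=10.0, limit_m=1.0):
    """Run a toy closed-loop test and return (interventions, distance_m).

    `controller(offset, yaw)` returns a 1/r steering command; positive
    values steer right. `offset` is meters right of the lane center and
    `yaw` is radians clockwise from the road direction.
    """
    dt = 1.0 / fps
    offset, yaw = 0.0, 0.0
    interventions = 0
    distance = 0.0
    for _ in range(frames):
        curvature = controller(offset, yaw)      # stand-in for the CNN
        yaw += speed_mps * dt * curvature        # heading change this step
        offset += speed_mps * dt * math.sin(yaw) # lateral drift this step
        distance += speed_mps * dt
        if abs(offset) > limit_m:
            # Virtual human intervention: count it and snap the vehicle
            # back to the ground-truth position and orientation.
            interventions += 1
            offset, yaw = 0.0, 0.0
    return interventions, distance


straight = simulate(lambda offset, yaw: 0.0, frames=100)
drifting = simulate(lambda offset, yaw: 0.05, frames=100)
print(straight, drifting)
```

A controller that never steers stays on the straight toy road, while one that constantly steers right keeps crossing the one-meter limit and being reset, so the intervention count separates the two.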
We evaluate our networks in two steps: first in simulation, and then in on-road tests.
In simulation we have the networks provide steering commands in our simulator to an ensemble of prerecorded test routes that correspond to a total of about three hours and 100 miles of driving in Monmouth County, NJ. The test data was taken in diverse lighting and weather conditions and includes highways, local roads, and residential streets.
We estimate what percentage of the time the network could drive the car (autonomy) by counting the simulated human interventions that occur when the simulated vehicle departs from the center line by more than one meter. We assume that in real life an actual intervention would require a total of six seconds: this is the time required for a human to retake control of the vehicle, re-center it, and then restart the self-steering mode. We calculate the percentage autonomy by counting the number of interventions, multiplying by 6 seconds, dividing by the elapsed time of the simulated test, and then subtracting the result from 1:
$$\text{autonomy} = \left(1 - \frac{(\text{number of interventions}) \cdot 6\ \text{seconds}}{\text{elapsed time [seconds]}}\right) \cdot 100$$
Thus, if we had 10 interventions in 600 seconds, we would have an autonomy value of
$$\left(1 - \frac{10 \cdot 6}{600}\right) \cdot 100 = 90\%$$
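The autonomy calculation described above is simple enough to express as a small helper. The function name `autonomy_percent` and its parameter names are illustrative assumptions; only the six-second penalty and the formula itself come from the text.

```python
def autonomy_percent(interventions: int, elapsed_s: float, penalty_s: float = 6.0) -> float:
    """Percentage of time the network is considered to be driving.

    interventions: number of simulated human interventions
    elapsed_s:     elapsed time of the simulated test, in seconds
    penalty_s:     time charged per intervention (six seconds in the text)
    """
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return (1.0 - interventions * penalty_s / elapsed_s) * 100.0


# The worked example from the text: 10 interventions in 600 seconds.
print(autonomy_percent(10, 600))
```

Note that the metric can go negative if interventions are frequent enough that the penalty time exceeds the elapsed time; the sketch does not clamp it.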
After a trained network has demonstrated good performance in the simulator, the network is loaded on the DRIVE PX in our test car and taken out for a road test. For these tests we measure performance as the fraction of time during which the car performs autonomous steering. This time excludes lane changes and turns from one road to another. For a typical drive in Monmouth County, NJ, from our office in Holmdel to Atlantic Highlands, we are autonomous approximately 98% of the time. We also drove 10 miles on the Garden State Parkway (a multi-lane divided highway with on and off ramps) with zero interventions.
Visualization of Internal CNN State
Figure 8: How the CNN “sees” an unpaved road. Top: subset of the camera image sent to the CNN. Bottom left: Activation of the first layer feature maps. Bottom right: Activation of the second layer feature maps. This demonstrates that the CNN learned to detect useful road features on its own, i.e., with only the human steering angle as training signal. We never explicitly trained it to detect the outlines of roads.
Figures 8 and 9 show the activations of the first two feature map layers for two different example inputs, an unpaved road and a forest. In the case of the unpaved road, the feature map activations clearly show the outline of the road, while in the case of the forest the feature maps contain mostly noise, i.e., the CNN finds no useful information in this image.
This demonstrates that the CNN learned to detect useful road features on its own, i.e., with only the human steering angle as training signal. We never explicitly trained it to detect the outlines of roads, for example.
Figure 9: Example image with no road. The activations of the first two feature maps appear to contain mostly noise, i.e., the CNN doesn’t recognize any useful features in this image.
We have empirically demonstrated that CNNs are able to learn the entire task of lane and road following without manual decomposition into road or lane marking detection, semantic abstraction, path planning, and control. A small amount of training data from less than a hundred hours of driving was sufficient to train the car to operate in diverse conditions, on highways, local and residential roads in sunny, cloudy, and rainy conditions.
The CNN is able to learn meaningful road features from a very sparse training signal (steering alone).
The system learns for example to detect the outline of a road without the need of explicit labels during training.
More work is needed to improve the robustness of the network, to find methods to verify the robustness, and to improve visualization of the network-internal processing steps.
Today we announced our funding of Xnor.ai. We are excited to be working with Ali Farhadi, Mohammad Rastegari and their team on this new company. We are also looking forward to working with Paul Allen’s team at the Allen Institute for AI and in particular our good friend and CEO of AI2, Dr. Oren Etzioni who is joining the board of Xnor.ai. Machine Learning and AI have been a key investment theme for us for the past several years and bringing deep learning capabilities such as image and speech recognition to small devices is a huge challenge.
Mohammad and Ali and their team have developed a platform that enables low resource devices to perform tasks that usually require large farms of GPUs in cloud environments. This, we believe, has the opportunity to change how we think about certain types of deep learning use cases as they get extended from the core to the edge. Image and voice recognition are great examples. These are broad areas of use cases out in the world – usually with a mobile device, but right now they require the device to be connected to the internet so that those large farms of GPUs can process all the information your device is capturing and sending, and the core can transmit back the answer. If you could do that on your phone (while preserving battery life), it would open up a new world of options.
It is just these kinds of inventions that put the greater Seattle area at the center of the revolution in machine learning and AI that is upon us. Xnor.ai came out of the outstanding work the team was doing at the Allen Institute for Artificial Intelligence (AI2), and Ali is a professor at the University of Washington. Between Microsoft, Amazon, the University of Washington and research institutes such as AI2, our region is leading the way as new types of intelligent applications take shape. Madrona is energized to play our role as company builder and supporter of these amazing inventors and founders.
AI acceleration startup Xnor.ai collects $2.6M in funding
I was excited by the promise of Xnor.ai and its technique that drastically reduces the computing power necessary to perform complex operations like computer vision. Seems I wasn’t the only one: the company, just officially spun off from the Allen Institute for AI (AI2), has attracted $2.6 million in seed funding from its parent company and Madrona Venture Group.
The specifics of the product and process you can learn about in detail in my previous post, but the gist is this: machine learning models for things like object and speech recognition are notoriously computation-heavy, making them difficult to implement on smaller, less powerful devices. Xnor.ai’s researchers use a bit of mathematical trickery to reduce that computing load by an order of magnitude or two — something it’s easy to see the benefit of.
“Imagine what is possible if that style of computing could be done on the device in your hand, on your wrist, or in your car,” said Madrona’s managing director, Matt McIlwain, in a press release. I’m sure they’re all imagining very hard right now. “Machine Learning and AI have been a key investment theme for us for the past several years and bringing deep learning capabilities such as image and speech recognition to small devices is a huge challenge,” he added in a company blog post.
McIlwain will join AI2 CEO Oren Etzioni on the board of Xnor.ai; Ali Farhadi, who led the original project, will be the company’s CEO, and Mohammad Rastegari is CTO.
The new company aims to facilitate commercial applications of its technology (it isn’t quite plug and play yet), but the research that led up to it is, like other AI2 work, open source.
The market for artificial intelligence (AI) technologies is flourishing. Beyond the hype and the heightened media attention, the numerous startups and the internet giants racing to acquire them, there is a significant increase in investment and adoption by enterprises. A Narrative Science survey found last year that 38% of enterprises are already using AI, growing to 62% by 2018. Forrester Research predicted a greater than 300% increase in investment in artificial intelligence in 2017 compared with 2016. IDC estimated that the AI market will grow from $8 billion in 2016 to more than $47 billion in 2020.
Based on Forrester’s analysis, here’s my list of the 10 hottest AI technologies:
Natural Language Generation: Producing text from computer data. Currently used in customer service, report generation, and summarizing business intelligence insights.
Speech Recognition: Transcribe and transform human speech into a format useful for computer applications. Currently used in interactive voice response systems and mobile applications.
Virtual Agents: “The current darling of the media,” says Forrester (I believe they refer to my evolving relationships with Alexa), from simple chatbots to advanced systems that can network with humans. Currently used in customer service and support and as a smart home manager.
Machine Learning Platforms: Providing algorithms, APIs, development and training toolkits, data, as well as computing power to design, train, and deploy models into applications, processes, and other machines. Currently used in a wide range of enterprise applications, mostly involving prediction or classification.
AI-optimized Hardware: Graphics processing units (GPU) and appliances specifically designed and architected to efficiently run AI-oriented computational jobs. Currently primarily making a difference in deep learning applications.
Decision Management: Engines that insert rules and logic into AI systems and are used for initial setup/training and ongoing maintenance and tuning. A mature technology, it is used in a wide variety of enterprise applications, assisting in or performing automated decision-making. Sample vendors include Advanced Systems Concepts.
Deep Learning Platforms: A special type of machine learning consisting of artificial neural networks with multiple abstraction layers. Currently primarily used in pattern recognition and classification applications supported by very large data sets.
Biometrics: Enable more natural interactions between humans and machines, including but not limited to image and touch recognition, speech, and body language. Currently used primarily in market research.
Robotic Process Automation: Using scripts and other methods to automate human action to support efficient business processes. Currently used where it’s too expensive or inefficient for humans to execute a task or a process. Sample vendors include Advanced Systems Concepts.
Text Analytics and NLP: Natural language processing (NLP) uses and supports text analytics by facilitating the understanding of sentence structure and meaning, sentiment, and intent through statistical and machine learning methods. Currently used in fraud detection and security, a wide range of automated assistants, and applications for mining unstructured data.
There are certainly many business benefits gained from AI technologies today, but according to a survey Forrester conducted last year, there are also obstacles to AI adoption as expressed by companies with no plans of investing in AI:
There is no defined business case
Not clear what AI can be used for
Don’t have the required skills
Need first to invest in modernizing data management platform
Don’t have the budget
Not certain what is needed for implementing an AI system
AI systems are not proven
Do not have the right processes or governance
AI is a lot of hype with little substance
Don’t own or have access to the required data
Not sure what AI means
Once enterprises overcome these obstacles, Forrester concludes, they stand to gain from AI driving accelerated transformation in customer-facing applications and developing an interconnected web of enterprise intelligence.
Google and others think software that learns to learn could take over some work done by AI experts.
Progress in artificial intelligence causes some people to worry that software will take jobs such as driving trucks away from humans. Now leading researchers are finding that they can make software that can learn to do one of the trickiest parts of their own jobs—the task of designing machine-learning software.
In one experiment, researchers at the Google Brain artificial intelligence research group had software design a machine-learning system to take a test used to benchmark software that processes language. What it came up with surpassed previously published results from software designed by humans.
In recent months several other groups have also reported progress on getting learning software to make learning software. They include researchers at Google’s other artificial intelligence research group, DeepMind.
If self-starting AI techniques become practical, they could increase the pace at which machine-learning software is implemented across the economy. Companies must currently pay a premium for machine-learning experts, who are in short supply.
Jeff Dean, who leads the Google Brain research group, mused last week that some of the work of such workers could be supplanted by software. He described what he termed “automated machine learning” as one of the most promising research avenues his team was exploring.
“Currently the way you solve problems is you have expertise and data and computation,” said Dean, at the AI Frontiers conference in Santa Clara, California. “Can we eliminate the need for a lot of machine-learning expertise?”
One set of experiments from Google’s DeepMind group suggests that what researchers are terming “learning to learn” could also help lessen the problem of machine-learning software needing to consume vast amounts of data on a specific task in order to perform it well.
The researchers challenged their software to create learning systems for collections of multiple different, but related, problems, such as navigating mazes. It came up with designs that showed an ability to generalize, and pick up new tasks with less additional training than would be usual.
The idea of creating software that learns to learn has been around for a while, but previous experiments didn’t produce results that rivaled what humans could come up with. “It’s exciting,” says Yoshua Bengio, a professor at the University of Montreal, who previously explored the idea in the 1990s.
Bengio says the more potent computing power now available, and the advent of a technique called deep learning, which has sparked recent excitement about AI, are what’s making the approach work. But he notes that so far it requires such extreme computing power that it’s not yet practical to think about lightening the load on, or partially replacing, machine-learning experts.
Google Brain’s researchers describe using 800 high-powered graphics processors to power software that came up with designs for image recognition systems that rivaled the best designed by humans.
Otkrist Gupta, a researcher at the MIT Media Lab, believes that will change. He and MIT colleagues plan to open-source the software behind their own experiments, in which learning software designed deep-learning systems that matched human-crafted ones on standard tests for object recognition.
Gupta was inspired to work on the project by frustrating hours spent designing and testing machine-learning models. He thinks companies and researchers are well motivated to find ways to make automated machine learning practical.
“Easing the burden on the data scientist is a big payoff,” he says. “It could make you more productive, make you better models, and make you free to explore higher-level ideas.”
Driving your car until it breaks down on the road is never anyone’s favorite way to learn the need for routine maintenance. But preventive or scheduled maintenance checks often miss many of the problems that can come up. An Israeli startup has come up with a better idea: Use artificial intelligence to listen for early warning signs that a car might be nearing a breakdown.
The service of 3DSignals, a startup based in Kefar Sava, Israel, relies on the artificial intelligence technique known as deep learning to understand the noise patterns of troubled machines and predict problems in advance. 3DSignals has already begun talking with leading European automakers about possibly using the deep learning service to detect possible trouble both in auto factory machinery and in the cars themselves. The startup has even chatted with companies about using their service to automatically detect problems in future taxi fleets of driverless cars.
Deep learning usually refers to software algorithms known as artificial neural networks. These neural networks can learn to become better at specific tasks by filtering relevant data through multiple (deep) layers of artificial neurons. Many companies such as Google and Facebook have used deep learning to develop AI systems that
Many tech giants have also applied deep learning to make their services become better at automatically recognizing the spoken sounds of different human languages. But few companies have bothered with using deep learning to develop AI that’s good at listening to other acoustic signals such as the sounds of machines or music. That’s where 3DSignals hopes it can become a big player with its deep learning focus on more general sound patterns, Lavi explains.
“I think most of the world is occupied with deep learning on images. This is by far the most popular application and the most recent. But part of the industry is doing deep learning on acoustics focused on speech recognition and conversation. I think we are probably in the very small group of companies doing acoustics which is more general. This is my aim, to be the world leader in general acoustics deep learning.”
For each client, 3DSignals installs ultrasonic microphones that can detect sounds ranging up to 100 kilohertz (human hearing range is between 20 hertz and 20 kilohertz). The startup’s “Internet of Things” service connects the microphones to a computing device that can process some of the data and then upload the information to an online network where the deep learning algorithms do their work. Clients can always check the status of their machines by using any Web-connected device such as a smartphone or tablet.
The first clients for 3DSignals include heavy industry companies operating machinery such as circular cutting blades in mills or hydroelectric turbines in power plants. These companies started out by purchasing the first tier of the 3DSignals service that does not use deep learning. Instead, this first tier of service uses software that relies on basic physics modeling of certain machine parts—such as circular cutting saws—to predict when some parts may start to wear out. That allows the clients to begin getting value from day one.
The second tier of the service uses a deep learning algorithm and the sounds coming from the microphones to help detect strange or unusual noises from the machines. The deep learning algorithms train on sound patterns that can signal general problems with the machines. But only the third tier of the service, also using deep learning, can classify the sounds as indicating specific types of problems. Before this can happen, though, the clients need to help train the deep learning algorithm by first labeling certain sound patterns as belonging to specific types of problems.
“After a while, we can not only say when problem type A happens, but we can say before it happens, you’re going to have problem type A in five hours,” Lavi says. “Some problems don’t happen instantly; there’s a deterioration.”
When trained, the 3DSignals deep learning algorithms are able to predict specific problems in advance with 98 percent accuracy. But the current clients using the 3DSignals system have not yet begun taking advantage of this classification capability; they are still building their training datasets by having people manually label specific sound signatures as belonging to specific problems.
The one-year-old startup has just 15 employees, but it has grown fairly fast and raised $3.3 million so far from investors such as Dov Moran, the Israeli entrepreneur credited with being one of the first to invent USB flash drives. Lavi and his fellow co-founders are already eying several big markets that include automobiles and the energy sector beyond hydroelectric power plants. A series A funding round to attract venture capital is planned for sometime in 2017.
If all goes well, 3DSignals could expand its lead in the growing market for providing “predictive maintenance” to factories, power plants, and car owners. The impending arrival of driverless cars may put even more responsibility on the metaphorical shoulders of a deep learning AI that could listen for problems while the human passengers tune out from the driving experience. On top of all this, 3DSignals has the chance to pioneer the advancement of deep learning in listening to general sounds. Not bad for a small startup.
“It’s important for us to be specialists in general acoustic deep learning, because the research literature does not cover it,” Lavi says.
An artist’s impression of the DNC. Credit: DeepMind
The DeepMind artificial intelligence (AI) being developed by Google’s parent company, Alphabet, can now intelligently build on what’s already inside its memory, the system’s programmers have announced.
Their new hybrid system – called a Differentiable Neural Computer (DNC) – pairs a neural network with the vast data storage of conventional computers, and the AI is smart enough to navigate and learn from this external data bank.
What the DNC is doing is effectively combining external memory (like the external hard drive where all your photos get stored) with the neural network approach of AI, where a massive number of interconnected nodes work dynamically to simulate a brain.
“These models… can learn from examples like neural networks, but they can also store complex data like computers,” write DeepMind researchers Alexander Graves and Greg Wayne in a blog post.
At the heart of the DNC is a controller that constantly optimises its responses, comparing its results with the desired and correct ones. Over time, it’s able to get more and more accurate, figuring out how to use its memory data banks at the same time.
Take a family tree: after being told about certain relationships, the DNC was able to figure out other family connections on its own – writing, rewriting, and optimising its memory along the way to pull out the correct information at the right time.
Another example the researchers give is a public transit system, like the London Underground. Once it’s learned the basics, the DNC can figure out more complex relationships and routes without any extra help, relying on what it’s already got in its memory banks.
In other words, it’s functioning like a human brain, taking data from memory (like tube station positions) and figuring out new information (like how many stops to stay on for).
Of course, any smartphone mapping app can tell you the quickest way from one tube station to another, but the difference is that the DNC isn’t pulling this information out of a pre-programmed timetable – it’s working out the information on its own, and juggling a lot of data in its memory all at once.
The approach means a DNC system could take what it learned about the London Underground and apply parts of its knowledge to another transport network, like the New York subway.
The system points to a future where artificial intelligence could answer questions on new topics, by deducing responses from prior experiences, without needing to have learned every possible answer beforehand.
Of course, that’s how DeepMind was able to beat human champions at Go – by studying millions of Go moves. But by adding external memory, DNCs are able to take on much more complex tasks and work out better overall strategies, its creators say.
“Like a conventional computer, [a DNC] can use its memory to represent and manipulate complex data structures, but, like a neural network, it can learn to do so from data,” the researchers explain in Nature.
In another test, the DNC was given two bits of information: “John is in the playground,” and “John picked up the football.” With those known facts, when asked “Where is the football?”, it was able to answer correctly by combining memory with deep learning. (The football is in the playground, if you’re stuck.)
Making those connections might seem like a simple task for our powerful human brains, but until now, it’s been a lot harder for virtual assistants, such as Siri, to figure out.
With the advances DeepMind is making, the researchers say we’re another step closer to producing a computer that can reason independently.
And then we can all start enjoying our robot-driven utopia – or technological dystopia – depending on your point of view.
Google’s DeepMind artificial intelligence lab does more than just develop computer programs capable of beating the world’s best human players in the ancient game of Go. The DeepMind unit has also been working on the next generation of deep learning software that combines the ability to recognize data patterns with the memory required to decipher more complex relationships within the data.
Deep learning is the latest buzzword for artificial intelligence algorithms called neural networks that can learn over time by filtering huge amounts of relevant data through many “deep” layers. The brain-inspired neural network layers consist of nodes (also known as neurons). Tech giants such as Google, Facebook, Amazon, and Microsoft have been training neural networks to learn how to better handle tasks such as recognizing images of dogs or making better Chinese-to-English translations. These AI capabilities have already benefited millions of people using Google Translate and other online services.
But neural networks face huge challenges when they try to rely solely on pattern recognition without having the external memory to store and retrieve information. To improve deep learning’s capabilities, Google DeepMind created a “differentiable neural computer” (DNC) that gives neural networks an external memory for storing information for later use.
“Neural networks are like the human brain; we humans cannot assimilate massive amounts of data and we must rely on external read-write memory all the time,” says Jay McClelland, director of the Center for Mind, Brain and Computation at Stanford University. “We once relied on our physical address books and Rolodexes; now of course we rely on the read-write storage capabilities of regular computers.”
McClelland is a cognitive scientist who served as one of several independent peer reviewers for the Google DeepMind paper that describes development of this improved deep learning system. The full paper is presented in the 12 Oct 2016 issue of the journal Nature.
The DeepMind team found that the DNC system’s combination of the neural network and external memory did much better than a neural network alone in tackling the complex relationships between data points in so-called “graph tasks.” For example, they asked their system to either simply take any path between points A and B or to find the shortest travel routes based on a symbolic map of the London Underground subway.
An unaided neural network could not even finish the first level of training, based on traveling between two subway stations without trying to find the shortest route. It achieved an average accuracy of just 37 percent after going through almost two million training examples. By comparison, the neural network with access to external memory in the DNC system successfully completed the entire training curriculum and reached an average of 98.8 percent accuracy on the final lesson.
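For contrast with the learned approach, the shortest-route task has a well-known hand-coded solution: breadth-first search over the station graph. The sketch below is that conventional baseline, not the DNC's method, and the graph is a made-up toy with placeholder station names rather than the real Underground map.

```python
from collections import deque
from typing import Dict, List, Optional


def shortest_route(graph: Dict[str, List[str]], start: str, goal: str) -> Optional[List[str]]:
    """Return a route with the fewest stops from start to goal, or None."""
    if start == goal:
        return [start]
    previous = {start: None}  # station -> station we arrived from
    queue = deque([start])
    while queue:
        station = queue.popleft()
        for neighbor in graph.get(station, []):
            if neighbor in previous:
                continue  # already reached by a route at least as short
            previous[neighbor] = station
            if neighbor == goal:
                # Walk the back-pointers to rebuild the route.
                route = [goal]
                while previous[route[-1]] is not None:
                    route.append(previous[route[-1]])
                return route[::-1]
            queue.append(neighbor)
    return None  # goal is not reachable from start


# Toy network (hypothetical stations), stored as adjacency lists.
toy_network = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}
print(shortest_route(toy_network, "A", "E"))
```

The point of the DNC experiments is that the network is never given a procedure like this; it has to learn to produce such routes from examples while using its external memory to hold the graph.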
The external memory of the DNC system also proved critical to success in performing logical planning tasks such as solving simple block puzzle challenges. Again, a neural network by itself could not even finish the first lesson of the training curriculum for the block puzzle challenge. The DNC system was able to use its memory to store information about the challenge’s goals and to effectively plan ahead by writing its decisions to memory before acting upon them.
In 2014, DeepMind’s researchers developed another system, called the neural Turing machine, that also combined neural networks with external memory. But the neural Turing machine was limited in the way it could access “memories” (information) because such memories were effectively stored and retrieved in fixed blocks or arrays. The latest DNC system can access memories in any arbitrary location, McClelland explains.
The DNC system’s memory architecture even bears a certain resemblance to how the hippocampus region of the brain supports new brain cell growth and new connections in order to store new memories. Just as the DNC system uses the equivalent of time stamps to organize the storage and retrieval of memories, human “free recall” experiments have shown that people are more likely to recall certain items in the same order as first presented.
Despite these similarities, the DNC’s design was driven by computational considerations rather than taking direct inspiration from biological brains, DeepMind’s researchers write in their paper. But McClelland says that he prefers not to think of the similarities as being purely coincidental.
“The design decisions that motivated the architects of the DNC were the same as those that structured the human memory system, although the latter (in my opinion) was designed by a gradual evolutionary process, rather than by a group of brilliant AI researchers,” McClelland says.
Human brains still have significant advantages over any brain-inspired deep learning software. For example, human memory seems much better at storing information so that it is accessible by either context or content, McClelland says. He expressed hope that future deep learning and AI research could better capture the memory advantages of biological brains.
DeepMind’s DNC system and similar neural learning systems may represent crucial steps for the ongoing development of AI. But the DNC system still falls well short of what McClelland considers the most important parts of human intelligence.
“The DNC is a sophisticated form of external memory, but ultimately it is like the papyrus on which Euclid wrote the Elements. The insights of mathematicians that Euclid codified relied (in my view) on a gradual learning process that structured the neural circuits in their brains so that they came to be able to see relationships that others had not seen, and that structured the neural circuits in Euclid’s brain so that he could formulate what to write. We have a long way to go before we understand fully the algorithms the human brain uses to support these processes.”
It’s unclear when or how Google might take advantage of the capabilities offered by the DNC system to boost its commercial products and services. The DeepMind team was “heads down in research” or too busy with travel to entertain media questions at this time, according to a Google spokesperson.
But Herbert Jaeger, professor of computational science at Jacobs University Bremen in Germany, sees the DeepMind team’s work as a “passing snapshot in a fast evolution sequence of novel neural learning architectures.” In fact, he’s confident that the DeepMind team already has something better than the DNC system described in the Nature paper. (Keep in mind that the paper was submitted back in January 2016.)
DeepMind’s work is also part of a bigger trend in deep learning, Jaeger says. The leading deep learning teams at Google and other companies are racing to build new AI architectures with many different functional modules—among them, attentional control or working memory; they then train the systems through deep learning.
“The DNC is just one among dozens of novel, highly potent, and cleverly-thought-out neural learning systems that are popping up all over the place,” Jaeger says.
So what’s new?
Our 2014 system used the Inception V1 image classification model to initialize the image encoder, which produces the encodings that are useful for recognizing different objects in the images. This was the best image model available at the time, achieving 89.6% top-5 accuracy on the benchmark ImageNet 2012 image classification task. We replaced this in 2015 with the newer Inception V2 image classification model, which achieves 91.8% accuracy on the same task. The improved vision component gave our captioning system an accuracy boost of 2 points in the BLEU-4 metric (which is commonly used in machine translation to evaluate the quality of generated sentences) and was an important factor of its success in the captioning challenge.
Today’s code release initializes the image encoder using the Inception V3 model, which achieves 93.9% accuracy on the ImageNet classification task. Initializing the image encoder with a better vision model gives the image captioning system a better ability to recognize different objects in the images, allowing it to generate more detailed and accurate descriptions. This gives an additional 2 points of improvement in the BLEU-4 metric over the system used in the captioning challenge.
Another key improvement to the vision component comes from fine-tuning the image model. This step addresses the problem that the image encoder is initialized by a model trained to classify objects in images, whereas the goal of the captioning system is to describe the objects in images using the encodings produced by the image model. For example, an image classification model will tell you that a dog, grass and a frisbee are in the image, but a natural description should also tell you the color of the grass and how the dog relates to the frisbee. In the fine-tuning phase, the captioning system is improved by jointly training its vision and language components on human generated captions. This allows the captioning system to transfer information from the image that is specifically useful for generating descriptive captions, but which was not necessary for classifying objects. In particular, after fine-tuning it becomes better at correctly describing the colors of objects. Importantly, the fine-tuning phase must occur after the language component has already learned to generate captions – otherwise, the noisiness of the randomly initialized language component causes irreversible corruption to the vision component. For more details, read the full paper here.
Left: the better image model allows the captioning model to generate more detailed and accurate descriptions. Right: after fine-tuning the image model, the image captioning system is more likely to describe the colors of objects correctly.
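The BLEU-4 metric mentioned above scores a generated sentence by how many of its one- to four-word sequences also appear in a human reference, with a penalty for outputs that are too short. What follows is a minimal single-reference sketch in plain Python, assuming whitespace tokenization and no smoothing. The function names and example sentences are illustrative, and this is not the evaluation code used in the captioning challenge.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu4(candidate, reference):
    """Simplified sentence-level BLEU-4 against one reference, in [0, 1]."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, 5):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        matched = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or matched == 0:
            return 0.0  # no smoothing: one empty precision zeroes the score
        log_precisions.append(math.log(matched / total))
    # Brevity penalty discourages candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    # Geometric mean of the four precisions, scaled by the penalty.
    return bp * math.exp(sum(log_precisions) / 4)


reference = "a dog jumps to catch a red frisbee on the grass"
print(bleu4(reference, reference))
print(bleu4("a dog jumps to catch a frisbee", reference))
```

BLEU scores are commonly reported multiplied by 100, so a gain of "2 points" would correspond to 0.02 on the scale this sketch returns. Production evaluations typically use several references per image and corpus-level statistics, which this sketch leaves out.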
Until recently our image captioning system was implemented in the DistBelief software framework. The TensorFlow implementation released today achieves the same level of accuracy with significantly faster performance: time per training step is just 0.7 seconds in TensorFlow compared to 3 seconds in DistBelief on an Nvidia K20 GPU, meaning that total training time is just 25% of the time previously required.
A natural question is whether our captioning system can generate novel descriptions of previously unseen contexts and interactions. The system is trained by showing it hundreds of thousands of images that were captioned manually by humans, and it often re-uses human captions when presented with scenes similar to what it’s seen before.
When the model is presented with scenes similar to what it’s seen before, it will often re-use human generated captions.
So does it really understand the objects and their interactions in each image? Or does it always regurgitate descriptions from the training data? Excitingly, our model does indeed develop the ability to generate accurate new captions when presented with completely new scenes, indicating a deeper understanding of the objects and context in the images. Moreover, it learns how to express that knowledge in natural-sounding English phrases despite receiving no additional language training other than reading the human captions.
Our model generates a completely new caption using concepts learned from similar scenes in the training set
We hope that sharing this model in TensorFlow will help push forward image captioning research and applications, and will also allow interested people to learn and have fun. To get started training your own image captioning system, and for more details on the neural network architecture, navigate to the model’s home-page here. While our system uses the Inception V3 image classification model, you could even try training our system with the recently released Inception-ResNet-v2 model to see if it can do even better!
AI (artificial intelligence) is a subfield of computer science that was created in the 1950s, and it is concerned with solving tasks that are easy for humans but hard for computers. In particular, a so-called Strong AI would be a system that can do anything a human can (perhaps excluding purely physical things). This is fairly generic and covers all kinds of tasks.
Machine learning is one way of approaching such tasks:
given some AI problem that can be described in discrete terms (e.g. out of a particular set of actions, which one is the right one), and
given a lot of information about the world,
figure out what is the “correct” action, without having the programmer program it in.
Typically some outside process is needed to judge whether the action was correct or not.
In mathematical terms, it’s a function: you feed in some input, and you want it to produce the right output, so the whole problem is simply to build a model of this mathematical function in some automatic way. To draw a distinction with AI, if I can write a very clever program that has human-like behavior, it can be AI, but unless its parameters are automatically learned from data, it’s not machine learning.
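To make that distinction concrete, here is a minimal sketch in plain Python that contrasts a hand-written rule with a function whose parameters are estimated from examples. The data, the names, and the choice of a straight-line model fitted by least squares are all illustrative assumptions, not something the text prescribes.

```python
def hand_coded(x):
    """A rule the programmer wrote directly: no parameters are learned."""
    return 2.0 * x + 1.0


def fit_line(xs, ys):
    """Learn slope and intercept from (input, output) examples by least squares."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training examples: inputs and the outputs judged "correct".
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)


def learned(x):
    """The same kind of function, but its parameters came from the data."""
    return slope * x + intercept


print(hand_coded(10.0), learned(10.0))
```

Both functions may end up computing the same thing. What makes the second one machine learning in the sense above is only that its slope and intercept were derived automatically from the examples.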
Deep learning is one kind of machine learning that’s very popular now. It involves a particular kind of mathematical model that can be thought of as a composition of simple blocks (function composition) of a certain type, and where some of these blocks can be adjusted to better predict the final outcome.
The word “deep” means that the composition has many of these blocks stacked on top of each other, and the tricky bit is how to adjust the blocks that are far from the output, since a small change there can have very indirect effects on the output. This is done via something called backpropagation, inside of a larger process called gradient descent, which lets you change the parameters in a way that improves your model.
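As a rough illustration of those two ideas, the sketch below stacks two adjustable blocks, uses the chain rule to carry the error from the output back to the first block (backpropagation), and then nudges every parameter against its gradient (gradient descent). The toy data, starting values, and learning rate are assumptions chosen only to keep the example small. Real systems stack many more blocks and use libraries that compute the gradients automatically.

```python
import math

# Toy data: hypothetical inputs and targets for the model to fit.
data = [(-1.0, -0.8), (-0.5, -0.5), (0.0, 0.0), (0.5, 0.5), (1.0, 0.8)]

# Two stacked blocks: h = tanh(w1*x + b1), then y = w2*h + b2.
params = {"w1": 0.5, "b1": 0.0, "w2": 0.5, "b2": 0.0}


def forward(p, x):
    """Run the composition, keeping the intermediate value for the backward pass."""
    h = math.tanh(p["w1"] * x + p["b1"])
    y = p["w2"] * h + p["b2"]
    return h, y


def mean_loss(p):
    """Mean squared error over the toy data."""
    return sum((forward(p, x)[1] - t) ** 2 for x, t in data) / len(data)


def gradients(p):
    """Backpropagation: chain rule from the output back to the first block."""
    g = {k: 0.0 for k in p}
    for x, t in data:
        h, y = forward(p, x)
        dy = 2.0 * (y - t) / len(data)  # d loss / d y
        g["w2"] += dy * h               # block nearest the output
        g["b2"] += dy
        dh = dy * p["w2"]               # pass the error signal one block back
        dz = dh * (1.0 - h * h)         # derivative of tanh
        g["w1"] += dz * x               # block far from the output
        g["b1"] += dz
    return g


loss_before = mean_loss(params)
learning_rate = 0.1
for _ in range(500):
    g = gradients(params)
    # Gradient descent: move each parameter a small step against its gradient.
    params = {k: params[k] - learning_rate * g[k] for k in params}
loss_after = mean_loss(params)

print(loss_before, loss_after)
```

Note how the gradient for `w1` depends on `w2` and on the slope of `tanh`: this is the "very indirect effect" of a block that sits far from the output.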