Category: DeepMind


Stunning AI Breakthrough Takes Us One Step Closer to the Singularity

By Hugo Angel,

As a new Nature paper points out, “There are an astonishing 10 to the
power of 170 possible board configurations in Go—more than the number of
atoms in the known universe.” (Image: DeepMind)
Remember AlphaGo, the first artificial intelligence to defeat a grandmaster at Go?
Well, the program just got a major upgrade, and it can now teach itself how to dominate the game without any human intervention. But get this: In a tournament that pitted AI against AI, this juiced-up version, called AlphaGo Zero, defeated the regular AlphaGo by a whopping 100 games to 0, signifying a major advance in the field. Hear that? It’s the technological singularity inching ever closer.A new paper published in Nature today describes how the artificially intelligent system that defeated Go grandmaster Lee Sedol in 2016 got its digital ass kicked by a new-and-improved version of itself. And it didn’t just lose by a little—it couldn’t even muster a single win after playing a hundred games. Incredibly, it took AlphaGo Zero (AGZ) just three days to train itself from scratch and acquire literally thousands of years of human Go knowledge simply by playing itself. The only input it had was what it does to the positions of the black and white pieces on the board.

  • In addition to devising completely new strategies,
  • the new system is also considerably leaner and meaner than the original AlphaGo.
Lee Sedol getting crushed by AlphaGo in 2016. (Image: AP)

Now, every once in a while the field of AI experiences a “holy shit” moment, and this would appear to be one of those moments. Looking back, other “holy shit” moments include:

This latest achievement qualifies as a “holy shit” moment for a number of reasons.

First of all, the original AlphaGo had the benefit of learning from literally thousands of previously played Go games, including those played by human amateurs and professionals. AGZ, on the other hand, received no help from its human handlers, and had access to absolutely nothing aside from the rules of the game. Using “reinforcement learning,” AGZ played itself over and over again, “starting from random play, and without any supervision or use of human data,” according to the Google-owned DeepMind researchers in their study. This allowed the system to improve and refine its digital brain, known as a neural network, as it continually learned from experience. This basically means that AlphaGo Zero was its own teacher.

This technique is more powerful than previous versions of AlphaGo because it is no longer constrained by the limits of human knowledge,” notes the DeepMind team in a release. “Instead, it is able to learn tabula rasa [from a clean slate] from the strongest player in the world: AlphaGo itself.

When playing Go, the system considers the most probable next moves (a “policy network”), and then estimates the probability of winning based on those moves (its “value network”). AGZ requires about 0.4 seconds to make these two assessments. The original AlphaGo was equipped with a pair of neural networks to make similar evaluations, but for AGZ, the Deepmind developers merged the policy and value networks into one, allowing the system to learn more efficiently. What’s more, the new system is powered by four tensor processing units (TPUS)—specialized chips for neural network training. Old AlphaGo needed 48 TPUs.

After just three days of self-play training and a total of 4.9 million games played against itself, AGZ acquired the expertise needed to trounce AlphaGo (by comparison, the original AlphaGo had 30 million games for inspiration). After 40 days of self-training, AGZ defeated another, more sophisticated version of AlphaGo called AlphaGo “Master” that defeated the world’s best Go players and the world’s top ranked Go player, Ke Jie. Earlier this year, both the original AlphaGo and AlphaGo Master won a combined 60 games against top professionals. The rise of AGZ, it would now appear, has made these previous versions obsolete.

The time when humans can have a meaningful conversation with an AI has always seemed far off and the stuff of science fiction. But for Go players, that day is here.

This is a major achievement for AI, and the subfield of reinforcement learning in particular. By teaching itself, the system matched and exceeded human knowledge by an order of magnitude in just a few days, while also developing 

  • unconventional strategies and
  • creative new moves.

For Go players, the breakthrough is as sobering as it is exciting; they’re learning things from AI that they could have never learned on their own, or would have needed an inordinate amount of time to figure out.
[AlphaGo Zero’s] games against AlphaGo Master will surely contain gems, especially because its victories seem effortless,” wrote Andy Okun and Andrew Jackson, members of the American Go Association, in a Nature News and Views article. “At each stage of the game, it seems to gain a bit here and lose a bit there, but somehow it ends up slightly ahead, as if by magic… The time when humans can have a meaningful conversation with an AI has always seemed far off and the stuff of science fiction. But for Go players, that day is here.”

No doubt, AGZ represents a disruptive advance in the world of Go, but what about its potential impact on the rest of the world? According to Nick Hynes, a grad student at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL), it’ll be a while before a specialized tool like this will have an impact on our daily lives.“So far, the algorithm described only works for problems where there are a countable number of actions you can take, so it would need modification before it could be used for continuous control problems like locomotion [for instance],” Hynes told Gizmodo. “Also, it requires that you have a really good model of the environment. In this case, it literally knows all of the rules. That would be as if you had a robot for which you could exactly predict the outcomes of actions—which is impossible for real, imperfect physical systems.

The nice part, he says, is that there are several other lines of AI research that address both of these issues (e.g. machine learning, evolutionary algorithms, etc.), so it’s really just a matter of integration. “The real key here is the technique,” says Hynes.

It’s like an alien civilization inventing its own mathematics which allows it to do things like time travel…Although we’re still far from ‘The Singularity,’ we’re definitely heading in that direction.
As expected—and desired—we’re moving farther away from the classic pattern of getting a bunch of human-labeled data and training a model to imitate it,” he said. “What we’re seeing here is a model free from human bias and presuppositions: It can learn whatever it determines is optimal, which may indeed be more nuanced that our own conceptions of the same. It’s like an alien civilization inventing its own mathematics which allows it to do things like time travel,” to which he added: “Although we’re still far from ‘The Singularity,’ we’re definitely heading in that direction.


Noam Brown, a Carnegie Mellon University computer scientist who helped to develop the first AI to defeat top humans in no-limit poker, says the DeepMind researchers have achieved an impressive result, and that it could lead to bigger, better things in AI.

While the original AlphaGo managed to defeat top humans, it did so partly by relying on expert human knowledge of the game and human training data,” Brown told Gizmodo. “That led to questions of whether the techniques could extend beyond Go. AlphaGo Zero achieves even better performance without using any expert human knowledge. It seems likely that the same approach could extend to all perfect-information games [such as chess and checkers]. This is a major step toward developing general-purpose AIs.

As both Hynes and Brown admit, this latest breakthrough doesn’t mean the technological singularity—that hypothesized time in the future when greater-than-human machine intelligence achieves explosive growth—is imminent. But it should cause pause for thought. Once 

  • we teach a system the rules of a game or 
  • the constraints of a real-world problem, 

the power of reinforcement learning makes it possible to simply press the start button and let the system do the rest. It will then figure out the best ways to succeed at the task, devising solutions and strategies that are beyond human capacities, and possibly even human comprehension.

As noted, AGZ and the game of Go represent an oversimplified, constrained, and highly predictable picture of the world, but in the future, AI will be tasked with more complex challenges. Eventually, self-teaching systems will be used to solve more pressing problems, such as protein folding to conjure up new medicines and biotechnologies, figuring out ways to reduce energy consumption, or when we need to design new materials. A highly generalized self-learning system could also be tasked with improving itself, leading to artificial general intelligence (i.e. a very human-like intelligence) and even artificial superintelligence.

As the DeepMind researchers conclude in their study, “Our results comprehensively demonstrate that a pure reinforcement learning approach is fully feasible, even in the most challenging of domains: it is possible to train to superhuman level, without human examples or guidance, given no knowledge of the domain beyond basic rules.

And indeed, now that human players are no longer dominant in games like chess and Go, it can be said that we’ve already entered into the era of superintelligence. This latest breakthrough is the tiniest hint of what’s still to come.

[Nature]

ORIGINAL: Gizmodo

By George Dvorsky
2017/10/18

Google DeepMind has built an AI machine that could learn as quickly as humans before long

By Hugo Angel,

Neural Episodic Control. Architecture of episodic memory module for a single action

Emerging Technology from the arXiv

Intelligent machines have humans in their sights.

Deep-learning machines already have superhuman skills when it comes to tasks such as

  • face recognition,
  • video-game playing, and
  • even the ancient Chinese game of Go.

So it’s easy to think that humans are already outgunned.

But not so fast. Intelligent machines still lag behind humans in one crucial area of performance: the speed at which they learn. When it comes to mastering classic video games, for example, the best deep-learning machines take some 200 hours of play to reach the same skill levels that humans achieve in just two hours.

So computer scientists would dearly love to have some way to speed up the rate at which machines learn.

Today, Alexander Pritzel and pals at Google’s DeepMind subsidiary in London claim to have done just that. These guys have built a deep-learning machine that is capable of rapidly assimilating new experiences and then acting on them. The result is a machine that learns significantly faster than others and has the potential to match humans in the not too distant future.

First, some background.

Deep learning uses layers of neural networks to look for patterns in data. When a single layer spots a pattern it recognizes, it sends this information to the next layer, which looks for patterns in this signal, and so on.

So in face recognition,

  • one layer might look for edges in an image,
  • the next layer for circular patterns of edges (the kind that eyes and mouths make), and
  • the next for triangular patterns such as those made by two eyes and a mouth.
  • When all this happens, the final output is an indication that a face has been spotted.

Of course, the devil is in the details. There are various systems of feedback to allow the system to learn by adjusting various internal parameters such as the strength of connections between layers. These parameters must change slowly, since a big change in one layer can catastrophically affect learning in the subsequent layers. That’s why deep neural networks need so much training and why it takes so long.

Pritzel and co have tackled this problem with a technique they call Neural Episodic Control. “Neural episodic control demonstrates dramatic improvements on the speed of learning for a wide range of environments,” they say. “Critically, our agent is able to rapidly latch onto highly successful strategies as soon as they are experienced, instead of waiting for many steps of optimisation.

The basic idea behind DeepMind’s approach is to copy the way humans and animals learn quickly. The general consensus is that humans can tackle situations in two different ways.

  • If the situation is familiar, our brains have already formed a model of it, which they use to work out how best to behave. This uses a part of the brain called the prefrontal cortex.
  • But when the situation is not familiar, our brains have to fall back on another strategy. This is thought to involve a much simpler test-and-remember approach involving the hippocampus. So we try something and remember the outcome of this episode. If it is successful, we try it again, and so on. But if it is not a successful episode, we try to avoid it in future.

This episodic approach suffices in the short term while our prefrontal brain learns. But it is soon outperformed by the prefrontal cortex and its model-based approach.

Pritzel and co have used this approach as their inspiration. Their new system has two approaches.

  • The first is a conventional deep-learning system that mimics the behaviur of the prefrontal cortex.
  • The second is more like the hippocampus. When the system tries something new, it remembers the outcome.

But crucially, it doesn’t try to learn what to remember. Instead, it remembers everything. “Our architecture does not try to learn when to write to memory, as this can be slow to learn and take a significant amount of time,” say Pritzel and co. “Instead, we elect to write all experiences to the memory, and allow it to grow very large compared to existing memory architectures.

They then use a set of strategies to read from this large memory quickly. The result is that the system can latch onto successful strategies much more quickly than conventional deep-learning systems.

They go on to demonstrate how well all this works by training their machine to play classic Atari video games, such as Breakout, Pong, and Space Invaders. (This is a playground that DeepMind has used to train many deep-learning machines.)

The team, which includes DeepMind cofounder Demis Hassibis, shows that neural episodic control vastly outperforms other deep-learning approaches in the speed at which it learns. “Our experiments show that neural episodic control requires an order of magnitude fewer interactions with the environment,” they say.

That’s impressive work with significant potential. The researchers say that an obvious extension of this work is to test their new approach on more complex 3-D environments.

It’ll be interesting to see what environments the team chooses and the impact this will have on the real world. We’ll look forward to seeing how that works out.

Ref: Neural Episodic Control : arxiv.org/abs/1703.01988

ORIGINAL: MIT Technology Review

Google’s AI can now learn from its own memory independently

By Hugo Angel,

An artist’s impression of the DNC. Credit: DeepMind
The DeepMind artificial intelligence (AI) being developed by Google‘s parent company, Alphabet, can now intelligently build on what’s already inside its memory, the system’s programmers have announced.
Their new hybrid system – called a Differential Neural Computer (DNC)pairs a neural network with the vast data storage of conventional computers, and the AI is smart enough to navigate and learn from this external data bank. 
What the DNC is doing is effectively combining external memory (like the external hard drive where all your photos get stored) with the neural network approach of AI, where a massive number of interconnected nodes work dynamically to simulate a brain.
These models… can learn from examples like neural networks, but they can also store complex data like computers,” write DeepMind researchers Alexander Graves and Greg Wayne in a blog post.
At the heart of the DNC is a controller that constantly optimises its responses, comparing its results with the desired and correct ones. Over time, it’s able to get more and more accurate, figuring out how to use its memory data banks at the same time.
Take a family tree: after being told about certain relationships, the DNC was able to figure out other family connections on its own – writing, rewriting, and optimising its memory along the way to pull out the correct information at the right time.
Another example the researchers give is a public transit system, like the London Underground. Once it’s learned the basics, the DNC can figure out more complex relationships and routes without any extra help, relying on what it’s already got in its memory banks.
In other words, it’s functioning like a human brain, taking data from memory (like tube station positions) and figuring out new information (like how many stops to stay on for).
Of course, any smartphone mapping app can tell you the quickest way from one tube station to another, but the difference is that the DNC isn’t pulling this information out of a pre-programmed timetable – it’s working out the information on its own, and juggling a lot of data in its memory all at once.
The approach means a DNC system could take what it learned about the London Underground and apply parts of its knowledge to another transport network, like the New York subway.
The system points to a future where artificial intelligence could answer questions on new topics, by deducing responses from prior experiences, without needing to have learned every possible answer beforehand.
Credit: DeepMind

Of course, that’s how DeepMind was able to beat human champions at Go – by studying millions of Go moves. But by adding external memory, DNCs are able to take on much more complex tasks and work out better overall strategies, its creators say.

Like a conventional computer, [a DNC] can use its memory to represent and manipulate complex data structures, but, like a neural network, it can learn to do so from data,” the researchers explain in Nature.
In another test, the DNC was given two bits of information: “John is in the playground,” and “John picked up the football.” With those known facts, when asked “Where is the football?“, it was able to answer correctly by combining memory with deep learning. (The football is in the playground, if you’re stuck.)
Making those connections might seem like a simple task for our powerful human brains, but until now, it’s been a lot harder for virtual assistants, such as Siri, to figure out.
With the advances DeepMind is making, the researchers say we’re another step forward to producing a computer that can reason independently.
And then we can all start enjoying our robot-driven utopia – or technological dystopia – depending on your point of view.
ORIGINAL: ScienceAlert
By DAVID NIELD

14 OCT 2016

Google’s Deep Mind Gives AI a Memory Boost That Lets It Navigate London’s Underground

By Hugo Angel,

Photo: iStockphoto

Google’s DeepMind artificial intelligence lab does more than just develop computer programs capable of beating the world’s best human players in the ancient game of Go. The DeepMind unit has also been working on the next generation of deep learning software that combines the ability to recognize data patterns with the memory required to decipher more complex relationships within the data.

Deep learning is the latest buzz word for artificial intelligence algorithms called neural networks that can learn over time by filtering huge amounts of relevant data through many “deep” layers. The brain-inspired neural network layers consist of nodes (also known as neurons). Tech giants such as Google, Facebook, Amazon, and Microsoft have been training neural networks to learn how to better handle tasks such as recognizing images of dogs or making better Chinese-to-English translations. These AI capabilities have already benefited millions of people using Google Translate and other online services.
But neural networks face huge challenges when they try to rely solely on pattern recognition without having the external memory to store and retrieve information. To improve deep learning’s capabilities, Google DeepMind created a “differentiable neural computer” (DNC) that gives neural networks an external memory for storing information for later use.
Neural networks are like the human brain; we humans cannot assimilate massive amounts of data and we must rely on external read-write memory all the time,” says Jay McClelland, director of the Center for Mind, Brain and Computation at Stanford University. “We once relied on our physical address books and Rolodexes; now of course we rely on the read-write storage capabilities of regular computers.
McClelland is a cognitive scientist who served as one of several independent peer reviewers for the Google DeepMind paper that describes development of this improved deep learning system. The full paper is presented in the 12 Oct 2016 issue of the journal Nature.
The DeepMind team found that the DNC system’s combination of the neural network and external memory did much better than a neural network alone in tackling the complex relationships between data points in so-called “graph tasks.” For example, they asked their system to either simply take any path between points A and B or to find the shortest travel routes based on a symbolic map of the London Underground subway.
An unaided neural network could not even finish the first level of training, based on traveling between two subway stations without trying to find the shortest route. It achieved an average accuracy of just 37 percent after going through almost two million training examples. By comparison, the neural network with access to external memory in the DNC system successfully completed the entire training curriculum and reached an average of 98.8 percent accuracy on the final lesson.
The external memory of the DNC system also proved critical to success in performing logical planning tasks such as solving simple block puzzle challenges. Again, a neural network by itself could not even finish the first lesson of the training curriculum for the block puzzle challenge. The DNC system was able to use its memory to store information about the challenge’s goals and to effectively plan ahead by writing its decisions to memory before acting upon them.
In 2014, DeepMind’s researchers developed another system, called the neural Turing machine, that also combined neural networks with external memory. But the neural Turing machine was limited in the way it could access “memories” (information) because such memories were effectively stored and retrieved in fixed blocks or arrays. The latest DNC system can access memories in any arbitrary location, McClelland explains.
The DNC system’s memory architecture even bears a certain resemblance to how the hippocampus region of the brain supports new brain cell growth and new connections in order to store new memories. Just as the DNC system uses the equivalent of time stamps to organize the storage and retrieval of memories, human “free recall” experiments have shown that people are more likely to recall certain items in the same order as first presented.
Despite these similarities, the DNC’s design was driven by computational considerations rather than taking direct inspiration from biological brains, DeepMind’s researchers write in their paper. But McClelland says that he prefers not to think of the similarities as being purely coincidental.
The design decisions that motivated the architects of the DNC were the same as those that structured the human memory system, although the latter (in my opinion) was designed by a gradual evolutionary process, rather than by a group of brilliant AI researchers,” McClelland says.
Human brains still have significant advantages over any brain-inspired deep learning software. For example, human memory seems much better at storing information so that it is accessible by both context or content, McClelland says. He expressed hope that future deep learning and AI research could better capture the memory advantages of biological brains.
 
DeepMind’s DNC system and similar neural learning systems may represent crucial steps for the ongoing development of AI. But the DNC system still falls well short of what McClelland considers the most important parts of human intelligence.
The DNC is a sophisticated form of external memory, but ultimately it is like the papyrus on which Euclid wrote the elements. The insights of mathematicians that Euclid codified relied (in my view) on a gradual learning process that structured the neural circuits in their brains so that they came to be able to see relationships that others had not seen, and that structured the neural circuits in Euclid’s brain so that he could formulate what to write. We have a long way to go before we understand fully the algorithms the human brain uses to support these processes.
It’s unclear when or how Google might take advantage of the capabilities offered by the DNC system to boost its commercial products and services. The DeepMind team was “heads down in research” or too busy with travel to entertain media questions at this time, according to a Google spokesperson.
But Herbert Jaeger, professor for computational science at Jacobs University Bremen in Germany, sees the DeepMind team’s work as a “passing snapshot in a fast evolution sequence of novel neural learning architectures.” In fact, he’s confident that the DeepMind team already has something better than the DNC system described in the Nature paper. (Keep in mind that the paper was submitted back in January 2016.)
DeepMind’s work is also part of a bigger trend in deep learning, Jaeger says. The leading deep learning teams at Google and other companies are racing to build new AI architectures with many different functional modules—among them, attentional control or working memory; they then train the systems through deep learning.
The DNC is just one among dozens of novel, highly potent, and cleverly-thought-out neural learning systems that are popping up all over the place,” Jaeger says.
ORIGINAL: IEEE Spectrum
12 Oct 2016

Partnership on Artificial Intelligence to Benefit People and Society

By Hugo Angel,

Established to study and formulate best practices on AI technologies, to advance the public’s understanding of AI, and to serve as an open platform for discussion and engagement about AI and its influences on people and society.

THE LATEST
INDUSTRY LEADERS ESTABLISH PARTNERSHIP ON AI BEST PRACTICES

Press ReleasesSeptember 28, 2016 NEW YORK —  IBM, DeepMind,/Google,  Microsoft, Amazon, and Facebook today announced that they will create a non-profit organization that will work to advance public understanding of artificial intelligence technologies (AI) and formulate best practices on the challenges and opportunities within the field. Academics, non-profits, and specialists in policy and ethics will be invited to join the Board of the organization, named the Partnership on Artificial Intelligence to Benefit People and Society (Partnership on AI).

The objective of the Partnership on AI is to address opportunities and challenges with AI technologies to benefit people and society. Together, the organization’s members will conduct research, recommend best practices, and publish research under an open license in areas such as ethics, fairness, and inclusivity; transparency, privacy, and interoperability; collaboration between people and AI systems; and the trustworthiness, reliability, and robustness of the technology. It does not intend to lobby government or other policymaking bodies.

The organization’s founding members will each contribute financial and research resources to the partnership and will share leadership with independent third-parties, including academics, user group advocates, and industry domain experts. There will be equal representation of corporate and non-corporate members on the board of this new organization. The Partnership is in discussions with professional and scientific organizations, such as the Association for the Advancement of Artificial Intelligence (AAAI), as well as non-profit research groups including the Allen Institute for Artificial Intelligence (AI2), and anticipates announcements regarding additional participants in the near future.

AI technologies hold tremendous potential to improve many aspects of life, ranging from healthcare, education, and manufacturing to home automation and transportation. Through rigorous research, the development of best practices, and an open and transparent dialogue, the founding members of the Partnership on AI hope to maximize this potential and ensure it benefits as many people as possible.

… Continue reading

WaveNet: A Generative Model for Raw Audio by Google DeepMind

By Hugo Angel,

WaveNet: A Generative Model for Raw Audio
This post presents WaveNet, a deep generative model of raw audio waveforms. We show that WaveNets are able to generate speech which mimics any human voice and which sounds more natural than the best existing Text-to-Speech systems, reducing the gap with human performance by over 50%.
We also demonstrate that the same network can be used to synthesize other audio signals such as music, and present some striking samples of automatically generated piano pieces.
Talking Machines
Allowing people to converse with machines is a long-standing dream of human-computer interaction. The ability of computers to understand natural speech has been revolutionised in the last few years by the application of deep neural networks (e.g.,Google Voice Search). However, generating speech with computers — a process usually referred to as speech synthesis or text-to-speech (TTS) — is still largely based on so-called concatenative TTS, where a very large database of short speech fragments are recorded from a single speaker and then recombined to form complete utterances. This makes it difficult to modify the voice (for example switching to a different speaker, or altering the emphasis or emotion of their speech) without recording a whole new database.
This has led to a great demand for parametric TTS, where all the information required to generate the data is stored in the parameters of the model, and the contents and characteristics of the speech can be controlled via the inputs to the model. So far, however, parametric TTS has tended to sound less natural than concatenative, at least for syllabic languages such as English. Existing parametric models typically generate audio signals by passing their outputs through signal processing algorithms known asvocoders.
WaveNet changes this paradigm by directly modelling the raw waveform of the audio signal, one sample at a time. As well as yielding more natural-sounding speech, using raw waveforms means that WaveNet can model any kind of audio, including music.
WaveNets

 

Wave animation

 

Researchers usually avoid modelling raw audio because it ticks so quickly: typically 16,000 samples per second or more, with important structure at many time-scales. Building a completely autoregressive model, in which the prediction for every one of those samples is influenced by all previous ones (in statistics-speak, each predictive distribution is conditioned on all previous observations), is clearly a challenging task.
However, our PixelRNN and PixelCNN models, published earlier this year, showed that it was possible to generate complex natural images not only one pixel at a time, but one colour-channel at a time, requiring thousands of predictions per image. This inspired us to adapt our two-dimensional PixelNets to a one-dimensional WaveNet.
Architecture animation

 

 The above animation shows how a WaveNet is structured. It is a fully convolutional neural network, where the convolutional layers have various dilation factors that allow its receptive field to grow exponentially with depth and cover thousands of timesteps.At training time, the input sequences are real waveforms recorded from human speakers. After training, we can sample the network to generate synthetic utterances. At each step during sampling a value is drawn from the probability distribution computed by the network. This value is then fed back into the input and a new prediction for the next step is made. Building up samples one step at a time like this is computationally expensive, but we have found it essential for generating complex, realistic-sounding audio.
Improving the State of the Art
We trained WaveNet using some of Google’s TTS datasets so we could evaluate its performance. The following figure shows the quality of WaveNets on a scale from 1 to 5, compared with Google’s current best TTS systems (parametric and concatenative), and with human speech using Mean Opinion Scores (MOS). MOS are a standard measure for subjective sound quality tests, and were obtained in blind tests with human subjects (from over 500 ratings on 100 test sentences). As we can see, WaveNets reduce the gap between the state of the art and human-level performance by over 50% for both US English and Mandarin Chinese.
For both Chinese and English, Google’s current TTS systems are considered among the best worldwide, so improving on both with a single model is a major achievement.

 

Here are some samples from all three systems so you can listen and compare yourself:

US English:

Mandarin Chinese:

Knowing What to Say

In order to use WaveNet to turn text into speech, we have to tell it what the text is. We do this by transforming the text into a sequence of linguistic and phonetic features (which contain information about the current phoneme, syllable, word, etc.) and by feeding it into WaveNet. This means the network’s predictions are conditioned not only on the previous audio samples, but also on the text we want it to say.
If we train the network without the text sequence, it still generates speech, but now it has to make up what to say. As you can hear from the samples below, this results in a kind of babbling, where real words are interspersed with made-up word-like sounds:

 

Notice that non-speech sounds, such as breathing and mouth movements, are also sometimes generated by WaveNet; this reflects the greater flexibility of a raw-audio model.
As you can hear from these samples, a single WaveNet is able to learn the characteristics of many different voices, male and female. To make sure it knew which voice to use for any given utterance, we conditioned the network on the identity of the speaker. Interestingly, we found that training on many speakers made it better at modelling a single speaker than training on that speaker alone, suggesting a form of transfer learning.
By changing the speaker identity, we can use WaveNet to say the same thing in different voices:

 

Similarly, we could provide additional inputs to the model, such as emotions or accents, to make the speech even more diverse and interesting.
Making Music
Since WaveNets can be used to model any audio signal, we thought it would also be fun to try to generate music. Unlike the TTS experiments, we didn’t condition the networks on an input sequence telling it what to play (such as a musical score); instead, we simply let it generate whatever it wanted to. When we trained it on a dataset of classical piano music, it produced fascinating samples like the ones below:

 

WaveNets open up a lot of possibilities for TTS, music generation and audio modelling in general. The fact that directly generating timestep per timestep with deep neural networks works at all for 16kHz audio is really surprising, let alone that it outperforms state-of-the-art TTS systems. We are excited to see what we can do with them next.
For more details, take a look at our paper.


ORIGINAL: Google DeepMind
Aäron van den Oord. Research Scientist, DeepMind
Heiga Zen. Research Scientist, Google
Sander Dieleman. Research Scientist, DeepMind
8 September 2016

© 2016 DeepMind Technologies Limited