A quick recap. If we go back to 2016, the original AlphaGo mixed four things together:

1. Rules of Go (opening books, endgame tables)
2. Policy network (top-down): trained on winning moves
3. Value network (bottom-up): trained on end states
4. Monte Carlo Tree Search
The follow-up in 2017, AlphaGo Zero, combined the policy and value networks into a single network and no longer needed to be trained on human games, learning instead by reinforcement from self-play.
AlphaGo Zero was stronger than AlphaGo, surpassing it within 3 days purely through "self-play". At the time this was huge, as it removed the need for human intervention and meant the training process could be automated further.
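To make the "self-play" idea concrete, here's a minimal sketch of what that loop looks like. Every helper here (new_game, run_mcts, sample_move, update_network) is a hypothetical stand-in, not DeepMind's actual code:

```python
def self_play(network, num_games):
    # new_game, run_mcts, sample_move and update_network are hypothetical
    # stand-ins for the environment, search and training code.
    examples = []
    for _ in range(num_games):
        game, positions = new_game(), []
        while not game.is_over():
            policy = run_mcts(game, network)            # search guided by the current network
            positions.append((game.state(), policy))
            game.play(sample_move(policy))
        outcome = game.result()                         # +1 / -1 for the finished game
        # the final result becomes the training target for every position in the game
        examples += [(s, p, outcome) for s, p in positions]
    return update_network(network, examples)            # no human games anywhere in the loop
```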
MuZero takes this even further, removing the first building block: predefined knowledge of the game. This is the most exciting thing about this work for me. It can "learn" the rules of the game and as a result the system can be extended from Go to Chess, Shogi, and Atari.
MuZero moves from board games (well-defined rules, discrete) to games where you can interact with the environment but don't necessarily have access to its rules. This is a step towards "general" gameplay.
Side note: the "Atari" referred to in the paper is a set of 57 games on the Atari 2600, sometimes called the Atari57 suite, which is a long-standing benchmark for gauging agent performance.
MuZero uses "planning" within the MCTS algorithm, via a learned model that does three things (sketched in code after this list):
1. Representation: map what the game looks like at that point in time into an abstract "state"
2. Prediction: take the abstract state we have, and predict which actions look promising (a policy) and how good the position is (a value)
3. Dynamics: take the abstract state and a candidate action, and estimate the next abstract state (and reward) we would end up in
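Here's a minimal sketch of those three functions as a plain interface; the names, signatures, and shapes are my own illustration, not the paper's pseudocode:

```python
from dataclasses import dataclass

@dataclass
class ModelOutput:
    policy: list[float]   # probability assigned to each candidate action
    value: float          # how good the current abstract state looks

class MuZeroModel:
    def representation(self, observation):
        """Map a raw observation (pixels, board position) to an abstract state s."""
        ...

    def prediction(self, state) -> ModelOutput:
        """From an abstract state, predict what to try next (policy) and how
        good things look (value)."""
        ...

    def dynamics(self, state, action):
        """From an abstract state and a chosen action, predict the next abstract
        state and the immediate reward, without ever consulting the game rules."""
        ...
```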
Here's the diagram in the paper that shows how these three things together help to "plan" a path towards a decision. s0 is the abstract state representation; the subsequent states (s1, s2, etc.) are predicted from the actions taken (a1, a2, etc.).
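In code terms, planning a few moves ahead from s0 is just repeated application of those functions. Again a sketch: model is the interface above, and pick_action is a hypothetical stand-in for the MCTS selection step:

```python
def plan(model, observation, depth=5):
    state = model.representation(observation)          # s0
    trajectory = []
    for _ in range(depth):
        out = model.prediction(state)                   # what to try, how good it looks
        action = pick_action(out.policy)                # in MuZero this choice is guided by MCTS
        state, reward = model.dynamics(state, action)   # s1, s2, ... no real game needed
        trajectory.append((action, reward, out.value))
    return trajectory
```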
The results stated in the paper are impressive. The orange (horizontal) lines show the performance of AlphaGo Zero, and the blue lines show where MuZero ends up at the end of training.
A lot has been written about how expensive it can be to train these systems. There are two paths worth investigating here: reducing the amount of training data, and reducing the wall-clock time (and power) needed for training.
The team opted to reduce the number of frames (read: images) needed to train the system from 20B down to 200M. Sadly, the system still needed 12 hours of wall-clock time to train. Why that much time was still needed isn't clearly stated in the paper.
Maybe they reduced the number of frames but increased the number of training steps? How much compute was used during this 12-hour period?
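Some back-of-the-envelope arithmetic, assuming the 200M frames and the 12 hours describe the same Atari run (the paper doesn't spell this out):

```python
# Back-of-the-envelope throughput; not a figure from the paper.
frames = 200_000_000                  # Atari frames used for training
wall_clock_seconds = 12 * 60 * 60     # 12 hours

print(f"{frames / wall_clock_seconds:,.0f} frames/second")
# ~4,630 frames/second, which only works with many actors generating
# experience in parallel; that is where the compute (and cost) goes.
```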
MuZero is now at a point where I wonder if it could be used for more abstract (but related) fields of study. Take it from discrete domains to continuous ones like wayfinding and orienteering. Could it observe mountaineers and figure out how to traverse the wilderness?
A related BBC News piece was published on the same day as the paper. It isn't great and doesn't add much, other than stating that they're trying to use MuZero to compress video for YouTube: https://www.bbc.co.uk/news/technology-55403473

Hoping for something with more substance soon 🥱
A comment on the publishing approach with @nature: the paper was received 3rd April 2020, accepted 7th October 2020, and published 23rd December 2020.
It is fair to assume that the state of the art has likely moved along quite a bit in the intervening 8 months. This is one of the things that particularly annoys me about academia. Can @DeepMind do better?
You might be wondering how you can run MuZero to play against it or understand how it works. Sadly, the best that we have is the Nature paper and the accompanying blog post.
The team have published pseudocode to arXiv, but it's not enough to reproduce their work: https://arxiv.org/abs/1911.08265 
One notable attempt at an independent open implementation of AlphaGo Zero is Leela Zero[1], but it does not include the network weights that make the system powerful. Computing the AlphaGo Zero weights would take ~1700 years on commodity hardware.

[1] https://github.com/leela-zero/leela-zero
I would like to see @deepmind publish more concrete examples of their findings when ready. It has been 5 years since AlphaGo was shown to the world. Could they use tools like @replicateai to help others to reproduce, learn from, and build on their work?
This is impressive work. I have so many (immediate) questions, but fewer on the RL side. I'd really like to understand how the team went about operationalising this system.
How did they measure their progress?
Did they write any tests (unit, integration, other) to check their working?
What did they use to monitor it?
Could they jump into strace to debug the running system and "change" its path?
Did they use distributed tracing or just images?
How easy was it to dump the state of the system to do point-in-time recovery?
How are they making it cost effective?
How much of the system is actually utilised vs. idle?
What research are they doing to make subsequent training runs and experiments cheaper to run?
Are TPUs worth it or are there better hardware paths that could be taken?
How much of Google's funding is dependent on @deepmind falling in line with certain corporate goals?
Fin thread đź§µ