Just a sketch this week, calling your attention to Is Deep Reinforcement Learning Really Superhuman on Atari?, which concludes that DRL is not only worse than the best humans on most Atari games, but worse by a wide margin.
DRL isn't superhuman on Atari yet
Wait, what? I was quite skeptical of this claim. Mnih et al. published the groundbreaking Playing Atari with Deep Reinforcement Learning in 2013, claiming superhuman performance. Surely someone would have noticed by now?
Apparently not. Most DRL algorithms published over the next six years were evaluated against either the same human scores reported in that paper, or human beginners. It's true that DQN significantly outperformed their own human player, but that player was far from the best in the world. Other recent claims of superhuman performance have been tested against the best players in the world (the paper mentions AlphaGo against Lee Sedol, OpenAI Five against OG, and AlphaStar against MaNa), but that has not happened for the Atari benchmark.
The most poignant detail to me in this paper involved the common "normalized human score", where 0% is the score of a random agent, and 100% is the score of the human baseline. On this scale, the median score achieved by the world record holders across all Atari games is 4.4k%. Clearly you can't claim superhuman performance if there are humans who beat your target by a factor of 44, unless you yourself exceed this score.
For reference, the original Rainbow algorithm achieved a median of 200% over all Atari games, and other algorithms seem to do worse. If the normalized score is instead rescaled so that 100% equals the human world record for each game, and the agents are run with different time limits, a tuned IQN variant of Rainbow receives a median score of less than 4% (there were other problems with the way benchmarks were done, and correcting for them reduces performance even further).
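To make the normalization concrete, here is a minimal sketch of the scale described above: 0% is the random agent and 100% is whichever reference player you pick. The function name and all of the scores are made up for illustration and are not taken from the paper. The point is only that the same agent score looks very different depending on whether the reference is an ordinary human baseline or a world record.

```python
def normalized_score(agent: float, random: float, reference: float) -> float:
    """Return the agent's score as a percentage of a reference player.

    0% corresponds to the random agent, 100% to the reference player.
    """
    if reference == random:
        raise ValueError("reference and random scores must differ")
    return 100.0 * (agent - random) / (reference - random)


# Hypothetical scores for a single game (illustrative only, not real data).
random_score = 100.0
human_baseline = 1_100.0   # the kind of human tester used in early papers
world_record = 44_100.0    # a much stronger human reference
agent_score = 2_100.0

vs_baseline = normalized_score(agent_score, random_score, human_baseline)
vs_record = normalized_score(agent_score, random_score, world_record)

print(f"vs. human baseline: {vs_baseline:.1f}%")
print(f"vs. world record:   {vs_record:.1f}%")
```

With these invented numbers the agent scores 200% against the human baseline but under 5% against the world record, which is the shape of the gap discussed above. A benchmark-wide figure would take the median of this quantity over all games.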
We have a long way to go then. The paper has a useful analysis, drawing on both previous and original research, of why DRL algorithms are so bad at Atari. Some of the causes, such as reward clipping, were explicitly chosen in previous research to improve performance, but (to take this particular example) clipping has been noted to make the agent prefer many small rewards over a single large reward.
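The reward clipping problem is easy to see with a toy calculation. The sketch below assumes rewards are clipped to the range [-1, 1], which is my assumption about the usual setup rather than something stated above, and it ignores discounting. The two reward sequences are invented for illustration.

```python
def clip_reward(reward: float, low: float = -1.0, high: float = 1.0) -> float:
    """Clamp a single reward into [low, high] (assumed range: [-1, 1])."""
    return max(low, min(high, reward))


# Two hypothetical ways to spend an episode (illustrative values only).
many_small = [10.0] * 5    # five small pickups worth 10 points each
one_large = [1000.0]       # a single large bonus

raw_small = sum(many_small)
raw_large = sum(one_large)

clipped_small = sum(clip_reward(r) for r in many_small)
clipped_large = sum(clip_reward(r) for r in one_large)

print(f"game score:     small={raw_small}, large={raw_large}")
print(f"clipped return: small={clipped_small}, large={clipped_large}")
```

By game score the single large bonus is worth twenty times more, but after clipping the agent sees a return of 5 for the small pickups and 1 for the bonus, so it learns to prefer the small pickups.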
I encourage anyone working with the Atari benchmark to read the paper for themselves.
- I actually find it somewhat personally encouraging that there's room for improvement on Atari. It's easy to experiment, and I have some ideas myself.
- That said, it is rather scary that we could overlook something like this for so long, as a community.
- Anyway, someone will take this as a call to arms, and make progress. Peter Drucker said, "If you can't measure it, you can't improve it." Now that we have better measurements, I predict improvements.