Tsinghua Creates a Football AI: The First to Control 10 Players Through a Full Match, with a 94.4% Win Rate

"Look at No. 4 breaking quickly behind the defense, one-on-one with the keeper, the shot... goal!"

Ladies and gentlemen, what you are watching is a Google AI football match. The team in yellow jerseys is made up of AI players from Tsinghua University.

This Tsinghua AI is no ordinary squad: it has trained hard, and it boasts not only standout star players but also some of the tightest teamwork in the world.

It has won championships in a number of international competitions.

"Oh, now receiving an assist from teammate No. 7, and the ball goes in again!"

Commentary aside, what you just saw is actually TiKick, a powerful multi-agent deep reinforcement learning AI for football games built by Tsinghua University.

Besides winning championships in a number of international events, TiKick achieves SOTA performance in both single-agent and multi-agent control, and is the first to control all ten players through an entire football match.

How was this powerful AI team trained?

Before that, let's take a quick look at the reinforcement learning environment used for training, the football game Google Research Football (GRF).

Released by Google in 2019, it provides a physics-based 3D football simulation, supports all major competition rules, and lets one or more agent-controlled players compete against the built-in AI.

In a match of 3,000 steps split into two halves, the agents must score goals using 19 actions such as moving, passing, shooting, dribbling, sliding tackles, and sprinting.
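For reference, GRF's default discrete action set contains exactly 19 actions per controlled player. A sketch of that set, with names following the gfootball documentation:

```python
# The 19 discrete actions in GRF's default action set: one idle action,
# eight movement directions, four ball actions (three passes and a shot),
# plus sprint/dribble/sliding controls and their release counterparts.
GRF_ACTIONS = [
    "idle",
    "left", "top_left", "top", "top_right",
    "right", "bottom_right", "bottom", "bottom_left",
    "long_pass", "high_pass", "short_pass", "shot",
    "sprint", "release_direction", "release_sprint",
    "sliding", "dribble", "release_dribble",
]

print(len(GRF_ACTIONS))  # 19
```

In the environment this surfaces as a `Discrete(19)` action space for each player the agent controls.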

Training in such a football environment raises two main difficulties.

First, it is a multi-agent environment: with 10 controllable players (excluding the goalkeeper), the algorithm has to search for suitable action combinations in a huge joint action space.

Second, as every fan knows, goals in a football match are extremely rare, so the algorithm seldom receives rewards from the environment, which greatly increases the difficulty of training.
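To get a feel for the first difficulty: with 10 players each choosing one of 19 actions, the joint action space at every single step is 19 to the power of 10, far too large to search exhaustively. A quick calculation:

```python
# Size of the joint action space: each of the 10 controlled players
# independently picks one of the 19 discrete actions per step.
n_actions, n_players = 19, 10
joint_actions = n_actions ** n_players
print(f"{joint_actions:,}")  # 6,131,066,257,801 (over 6 trillion combinations)
```

And that is just one timestep of a 3,000-step match, which is why sparse goal rewards make credit assignment so hard.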

And Tsinghua's goal here is to control multiple players to complete the game.

They started from tens of thousands of self-play trajectories of WeKick, the team that won the Kaggle GRF World Championship held in 2020, and learned from them with an offline reinforcement learning method.

That championship only required controlling a single player on the pitch.

So how do you learn a multi-agent strategy from a single-agent dataset?

Directly learning WeKick's single-agent policy and copying it onto every player does not work: every player would simply chase the ball and charge at the goal, with no teamwork at all.

And there is no data at all for the off-ball, non-active players. What to do?

They added a 20th action, build-in, to the action set and assigned this label to all non-active players (if a player chooses build-in in the game, that player acts according to the built-in rules).
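A minimal sketch of this relabeling step. The names `BUILD_IN` and `relabel_step` are hypothetical; the idea is simply that only the active player's action is observed in the single-agent data, and every other player gets the new 20th label:

```python
BUILD_IN = 19   # hypothetical index of the new 20th "build-in" action
N_PLAYERS = 10  # controllable players, goalkeeper excluded

def relabel_step(active_player: int, active_action: int) -> list[int]:
    """Expand one single-agent record into a 10-player action label vector."""
    labels = [BUILD_IN] * N_PLAYERS        # off-ball players follow built-in rules
    labels[active_player] = active_action  # only the active player's action is known
    return labels

# e.g. player 3 took action 12 ("shot") at this timestep
print(relabel_step(3, 12))  # [19, 19, 19, 12, 19, 19, 19, 19, 19, 19]
```

Applying this to every timestep turns the single-agent WeKick dataset into a fully labeled multi-agent dataset.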

Then the model was trained with a multi-agent behavior cloning (MABC) algorithm.

The core idea of offline reinforcement learning is to identify the high-quality actions in the data and reinforce them.

Therefore, each label must be given a different weight when computing the objective function, to prevent the players from collapsing onto a single action.

The weight assignment has two key points:

First, pick out the matches with more goals from the dataset and train only on this high-quality data; because the rewards are denser, the model converges faster and performs better.

Second, train a critic network to score all actions and use its output to compute an advantage function, then give lower weight to actions with a low advantage.

To avoid exploding and vanishing gradients, the advantage function is appropriately clipped.
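Putting the critic-based weighting and the clipping together, here is a hedged sketch of an advantage-weighted behavior-cloning loss. The clipping range and the exponential weighting are my own illustrative assumptions, not values taken from the paper:

```python
import numpy as np

def awbc_loss(log_probs, advantages, clip=(-2.0, 2.0)):
    """Advantage-weighted behavior cloning: down-weight low-advantage actions.

    log_probs  : log pi(a_t | s_t) of the dataset actions under the current policy
    advantages : critic-based advantage estimates A(s_t, a_t)
    clip       : clipping range for the advantage, to keep gradients stable
    """
    adv = np.clip(advantages, *clip)      # clip the advantage function
    weights = np.exp(adv)                 # higher advantage -> larger weight
    return -np.mean(weights * log_probs)  # weighted negative log-likelihood

# toy check: the high-advantage action dominates the loss
lp = np.array([-0.5, -0.5])
loss = awbc_loss(lp, np.array([1.0, -1.0]))
```

Compared with plain behavior cloning (uniform weights), this pushes the policy toward the actions the critic judges to be good while still staying inside the offline dataset.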

Finally, the distributed training architecture consists of one Learner and multiple Workers.

The Learner is responsible for learning and updating the policy, while the Workers are responsible for collecting data; they exchange data and share network parameters via gRPC.

A Worker can interact with multiple game environments simultaneously through multiprocessing, or read offline data via I/O.
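A toy sketch of the Learner/Worker split, using an in-process queue as a stand-in for the gRPC transport the real framework uses:

```python
import multiprocessing as mp

def worker(env_id: int, queue, n_steps: int) -> None:
    """A Worker interacts with its game environment and ships transitions out."""
    for t in range(n_steps):
        transition = (env_id, t)  # stand-in for (obs, action, reward, ...)
        queue.put(transition)
    queue.put(None)               # sentinel: this worker is done

def learner(queue, n_workers: int):
    """The Learner consumes transitions; the real one would update the policy."""
    batch, done = [], 0
    while done < n_workers:
        item = queue.get()
        if item is None:
            done += 1
        else:
            batch.append(item)    # in the real system: gradient update here
    return batch

if __name__ == "__main__":
    # single-process demo: run two workers inline, then drain the queue
    q = mp.Queue()
    worker(0, q, 5)
    worker(1, q, 5)
    data = learner(q, n_workers=2)
    print(len(data))  # 2 workers x 5 steps = 10 transitions
```

In the actual architecture each Worker runs as its own process (e.g. via `mp.Process`) and can drive several game environments at once, while the Learner sits on the other side of a gRPC channel.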

This parallelization greatly speeds up data collection and therefore training: 5 hours is enough to reach the performance that other distributed training algorithms need two days to achieve.

In addition, thanks to its modular design, the framework can switch between single-node debug mode and multi-node distributed training mode without modifying any code, reducing the difficulty of implementing and training algorithms.

In the algorithm comparison on the multi-agent GRF game, TiKick's final algorithm (+AW) achieves the best performance, with the highest win rate (94.4%) and the largest goal difference.

Its TrueSkill score (a skill ranking system for competitive games) also ranks first.

Against the built-in AI, TiKick reaches a 94.4% win rate and an average goal difference of 3 per match.

Comparing TiKick with baseline algorithms on the GRF academic scenarios shows that it achieves the best performance and the lowest sample complexity in all scenarios, by a clear margin.

Compared with the baseline MAPPO, TiKick also reaches the highest score in four out of five scenarios.

Finally, here is a video of TiKick in action for everyone to enjoy:

First author Huang Shiyu is a doctoral student at Tsinghua University whose research sits at the intersection of computer vision, reinforcement learning, and deep learning. He has worked at Huawei's Noah's Ark Lab, Tencent AI, Carnegie Mellon University, and SenseTime.

Co-first author Chen Wenze is also from Tsinghua University.

The other authors include Longfei Zhang, along with researchers from Tencent AI Lab.

The corresponding author is Professor Zhu Jun of Tsinghua University.

Paper address: https://arxiv.org/abs/2110.04507

Project address: https://github.com/tartrl/tikick

Reference link: https://zhuanlan.zhihu.com/p/421572915

This article comes from the WeChat public account "Quantum Bit" (ID: Qbitai), author: Fengqing, republished with authorization by 36Kr.