
AI conquers another game and returns to Science: six-player poker beaten for a training cost of $150, with winnings of $1,000 an hour.

Via: 博客园 (cnblogs) | 2019/7/12 12:20:33

In a six-player no-limit Texas Hold'em match, the AI Pluribus defeated five expert human players. Pluribus was developed by Facebook and Carnegie Mellon University (CMU) to finish what its predecessor Libratus started, and the work has now been published in the latest issue of Science.

Author: Synced

Source: Heart of the Machine (ID: almosthuman2014)

Six-player no-limit is the most popular format of Texas Hold'em. The Facebook and CMU result marks the first time an AI has beaten human professionals in a game with more than two players (or teams).

In January 2017, Libratus, an AI program developed by CMU researchers Noam Brown and Tuomas Sandholm, defeated four of the world's top professionals in a 20-day heads-up no-limit match at Rivers Casino in Pittsburgh, Pennsylvania. After Go, this became the next milestone of a difficult game conquered by AI. At the end of 2017, the Libratus paper was likewise published in the journal Science.

Libratus (known in Chinese as 冷扑大师, the "cold poker master") relied on massive computation and game theory to crack this imperfect-information card game. A related paper from the project, "Safe and Nested Subgame Solving for Imperfect-Information Games," won the best paper award at NIPS 2017.


Tuomas Sandholm (left), professor of computer science at Carnegie Mellon University, and his student Noam Brown, now a research scientist at Facebook.

How did AI get from beating one opponent to beating five? "Although going from two players to six may seem like a gradual step, it is actually a huge challenge," said Julian Togelius, an assistant professor at New York University who studies games and artificial intelligence. "Techniques for multiplayer games had not appeared in any of the previous game breakthroughs."

Building on Libratus, the new Pluribus algorithm proposed by Noam Brown and Tuomas Sandholm requires far less computing power. Over 12 days and more than 10,000 hands, Pluribus defeated 15 top human players. "Many AI researchers previously thought that achieving such a goal was impossible," Noam Brown said.

For decades, poker has been a difficult and important challenge for artificial intelligence. The reason is that poker involves hidden information: you cannot see your opponents' cards. Winning at poker requires bluffing and other strategies that board games generally do not, and this long made poker very hard for AI.

Today's AI has learned to bluff, and can also see through a human player's bluffs. In Noam Brown's view, though, these techniques are still just strategies determined by a mathematical process.

According to the report, the matches designed by Facebook and Carnegie Mellon University came in two modes: one AI with five human players, and five copies of the AI with one human player. Pluribus won in both modes. If each chip were worth $1, Pluribus would win an average of $5 per hand, or about $1,000 per hour against five human players. Professional poker players consider these results a decisive margin of victory.

This is the first time an AI has beaten top professionals in a major benchmark game where the number of players (or teams) is greater than two. Here are the details on Pluribus.

Paper: Superhuman AI for multiplayer poker

Paper link: https://science.sciencemag.org/content/early/2019/07/10/science.aay2400

Pluribus builds on Libratus and on other algorithms and code developed by the Carnegie Mellon University research lab led by Tuomas Sandholm. In 2017, Libratus defeated top human players at heads-up no-limit Texas Hold'em (see the Science paper on Libratus beating top heads-up players).

It is worth noting that Pluribus integrates a new online search algorithm that can effectively evaluate its decisions by searching a few moves ahead rather than only to the end of the game. Pluribus also takes advantage of new, faster self-play algorithms for imperfect-information games. Together these improvements make it possible to train Pluribus with very little processing power and memory: the total cloud-computing resources used in training cost less than $150. This efficiency stands in stark contrast to other recent AI milestones, whose training often costs millions of dollars in computing resources.

This video shows Pluribus playing against professional human poker players (cards are shown face-up to make Pluribus's strategy easier to follow).

The significance of these innovations goes far beyond poker, because two-player zero-sum interactions (one side loses, the other wins) are common in recreational games but rare in real life. Real-world scenarios such as acting on harmful content, responding to cybersecurity challenges, and managing online auctions or navigating traffic typically involve multiple participants and/or hidden information. Multiplayer interactions pose serious theoretical and practical challenges to past AI techniques. Facebook's results show that a carefully constructed AI algorithm can surpass top human performance even in zero-sum games with more than two players.

1. Winning at six-player poker

Compared with the games AI has mastered in the past, six-player poker poses two main challenges.

More than a two-player zero-sum game

All past game breakthroughs were limited to two-player or two-team zero-sum competitions (such as checkers, chess, StarCraft 2, or Dota 2). In those settings, AI succeeded by trying to approximate a Nash equilibrium strategy. In a two-player or two-team zero-sum game, playing an exact Nash equilibrium guarantees that, no matter what the opponent does, you cannot lose in expectation. (For example, the Nash equilibrium strategy for rock-paper-scissors is to choose rock, paper, or scissors at random with equal probability.)
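The rock-paper-scissors claim can be checked directly. A minimal sketch (payoff matrix written from the row player's perspective): against every possible opponent strategy, the uniform mix breaks even in expectation, which is exactly the guarantee a Nash equilibrium gives in a two-player zero-sum game.

```python
# Payoff matrix for rock-paper-scissors from the row player's view.
# Rows/columns are (rock, paper, scissors); 1 = win, -1 = loss, 0 = tie.
PAYOFF = [
    [0, -1, 1],   # rock   vs (rock, paper, scissors)
    [1, 0, -1],   # paper
    [-1, 1, 0],   # scissors
]

def expected_payoff(row_strategy, col_strategy):
    """Expected payoff for the row player given two mixed strategies."""
    return sum(
        row_strategy[i] * col_strategy[j] * PAYOFF[i][j]
        for i in range(3) for j in range(3)
    )

uniform = [1 / 3, 1 / 3, 1 / 3]

# Against every pure opponent strategy the uniform mix earns exactly 0,
# and any mixed strategy is a blend of pure ones, so the uniform mix
# can never lose in expectation.
for j in range(3):
    pure = [1.0 if k == j else 0.0 for k in range(3)]
    print(expected_payoff(uniform, pure))  # 0 (up to rounding) each time
```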

Although every finite game has a Nash equilibrium, it is generally intractable to compute one in a game with three or more players. (The same is true in some two-player games.) Moreover, in a game with more than two players, even playing an exact Nash equilibrium strategy can lose. Consider the Lemonade Stand game: each player simultaneously picks a point on a circle, trying to be as far as possible from every other player. In equilibrium, all players are equally spaced around the ring, but there are many ways to achieve this. If each player independently computes one of these equilibria, their joint strategy is unlikely to leave everyone equally spaced.
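This coordination failure is easy to simulate. The sketch below (a toy discrete version of the ring game, with assumed parameters: 12 points, 3 players) lets each player independently commit to one of the equally-spaced equilibria; the joint outcome is equally spaced only when all three happen to pick the same rotation, which is rare.

```python
import random

random.seed(1)
N = 12                   # discrete points on the ring
PLAYERS = 3
SPACING = N // PLAYERS   # gap between players in any equilibrium

def is_equally_spaced(positions):
    """True if the three positions sit exactly SPACING apart on the ring."""
    ordered = sorted(positions)
    gaps = [(ordered[(i + 1) % PLAYERS] - ordered[i]) % N for i in range(PLAYERS)]
    return all(g == SPACING for g in gaps)

# Every rotation offset o gives an equilibrium {o, o+4, o+8}.
# Each player independently commits to a random offset for their own seat.
TRIALS = 10000
matches = 0
for _ in range(TRIALS):
    positions = [(random.randrange(SPACING) + i * SPACING) % N
                 for i in range(PLAYERS)]
    matches += is_equally_spaced(positions)

# Equal spacing requires all three offsets to coincide: probability 1/16.
print(matches / TRIALS)  # roughly 0.06
```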


Beyond two-player zero-sum games, these shortcomings of Nash equilibrium have led researchers to ask: what should the correct goal for such games even be?

For six-player poker, the researchers argue that the goal should not be any particular game-theoretic solution concept, but rather an AI that empirically beats human opponents over the long run, including elite human professionals. (For an AI bot, this is commonly taken as the bar for "superhuman" performance.)

The researchers note that the algorithms they used to build Pluribus are not guaranteed to converge to a Nash equilibrium outside of two-player zero-sum games. Nonetheless, they observed that Pluribus's six-player poker strategy consistently beats professionals, so these algorithms can evidently produce superhuman strategies in a broader range of settings than two-player zero-sum games.

Hidden information in more complex environments

No other popular game features hidden information the way poker does: each player holds information the others lack (their own hole cards). A successful poker AI must reason about this hidden information and carefully balance its strategy so as to remain unpredictable while still taking strong actions.

For example, bluffing occasionally is effective, but bluffing constantly gets caught and loses a lot of money. The probability of bluffing therefore has to be carefully balanced against the probability of betting a strong hand. In other words, in an imperfect-information game, the value of an action depends on the probability with which it is chosen and on the probabilities of the other actions.
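This balancing act can be made concrete with a classic poker-theory exercise (a standard textbook calculation, not taken from the paper): on the final betting round, the bettor should bluff just often enough that the opponent is indifferent between calling and folding, which for a bet of size `bet` into a pot of size `pot` works out to a bluff frequency of `bet / (pot + 2 * bet)`.

```python
def indifference_bluff_frequency(pot, bet):
    """Bluff frequency that makes the opponent's call EV equal to folding.

    Calling risks `bet` chips: it wins `pot + bet` when the bettor is
    bluffing and loses `bet` when the bettor has a real hand. Setting
    f*(pot + bet) - (1 - f)*bet = 0 and solving gives f = bet/(pot + 2*bet).
    """
    return bet / (pot + 2 * bet)

def call_ev(pot, bet, bluff_freq):
    """Opponent's expected value of calling (folding is worth 0)."""
    return bluff_freq * (pot + bet) - (1 - bluff_freq) * bet

# With a pot-sized bet, the balanced bluff frequency is one time in three.
f = indifference_bluff_frequency(pot=100, bet=100)
print(f)                      # about 0.333
print(call_ev(100, 100, f))   # 0: the opponent is indifferent
```

Bluff any more often and the opponent profits by always calling; any less often and they profit by always folding, which is exactly the exploitability the text describes.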

By contrast, in a perfect-information game the player does not have to worry about balancing action probabilities: a good move in chess is good no matter how likely it was to be chosen.

Previous poker AIs such as Libratus handled hidden information in games like heads-up no-limit Texas Hold'em by combining a theoretically sound self-play algorithm based on Counterfactual Regret Minimization (CFR) with a carefully constructed search procedure.

However, each additional player increases the complexity of the game exponentially. Even with 10,000 times as much compute, those earlier techniques could not scale to six-player poker.

The new techniques in Pluribus handle this challenge far better than anything that came before.

2. Understanding Pluribus's blueprint strategy

The core of Pluribus's strategy is learned through self-play: the AI plays against copies of itself, without using any human game data as input. It begins by acting at random, then gradually improves as it learns which actions, and which probability distributions over actions, produce better outcomes against earlier versions of itself. The self-play algorithm in Pluribus is an improved version of Monte Carlo CFR (MCCFR).

In each iteration, MCCFR designates one player as the "traverser," whose current strategy is updated during that iteration. At the start of the iteration, MCCFR simulates a hand of poker using the current strategies of all players (which are initially completely random). Once the hand is over, the algorithm reviews each decision the traverser made and computes how much better or worse it would have done by choosing each of the other available actions instead. The AI then evaluates the hypothetical decisions that would have followed those other actions, and so on recursively.

The figure shows how the Monte Carlo Counterfactual Regret Minimization algorithm updates the traverser's strategy by evaluating real and hypothetical actions. In Pluribus, the traverser explores in a depth-first manner for optimization reasons.

Exploring these hypothetical outcomes is possible because the AI is playing against itself: to learn what would have happened after some other choice, it simply asks itself how it would have responded.

The difference between what the traverser would have received from a hypothetical action and what it actually received is added to that action's counterfactual regret. At the end of the iteration the traverser's strategy is updated, so that actions with higher counterfactual regret are chosen with higher probability. Storing a strategy for every decision point in a game the size of no-limit Texas Hold'em would require more bytes than there are atoms in the universe. To reduce this complexity, the researchers have the AI ignore some actions and use abstraction to group similar decision points together; after abstraction, each group of decision points is treated as a single point.
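The regret-update loop described above can be illustrated on a tiny game. The sketch below is a toy version of regret matching, the update rule underlying CFR (not Pluribus's actual code): two players repeatedly play rock-paper-scissors, accumulate counterfactual regret for each unplayed action, and play actions in proportion to positive regret. The time-averaged strategy drifts toward the uniform equilibrium.

```python
import random

ACTIONS = 3  # rock, paper, scissors
PAYOFF = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]  # PAYOFF[mine][theirs]

def strategy_from_regret(regret):
    """Regret matching: play each action in proportion to its positive regret."""
    positive = [max(r, 0.0) for r in regret]
    total = sum(positive)
    return [p / total for p in positive] if total > 0 else [1.0 / ACTIONS] * ACTIONS

def sample(strategy):
    r, acc = random.random(), 0.0
    for a, p in enumerate(strategy):
        acc += p
        if r < acc:
            return a
    return ACTIONS - 1

random.seed(0)
ITERS = 20000
regret = [[0.0] * ACTIONS, [0.0] * ACTIONS]  # one regret table per player
avg = [[0.0] * ACTIONS, [0.0] * ACTIONS]     # accumulated strategies

for _ in range(ITERS):
    strat = [strategy_from_regret(regret[p]) for p in (0, 1)]
    acts = [sample(strat[p]) for p in (0, 1)]
    for p in (0, 1):
        opp = acts[1 - p]
        actual = PAYOFF[acts[p]][opp]
        for a in range(ACTIONS):
            # Counterfactual regret: how much better action a would have done.
            regret[p][a] += PAYOFF[a][opp] - actual
            avg[p][a] += strat[p][a]

avg_strategy = [x / ITERS for x in avg[0]]
print(avg_strategy)  # each entry close to 1/3
```

The momentary strategies can cycle wildly; it is the *average* strategy that converges, which is why CFR-style algorithms output the average of all iterations as the blueprint.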

The result of Pluribus's self-play is called the blueprint strategy. In actual games, Pluribus augments this blueprint strategy with a search algorithm. Pluribus does not, however, adjust its strategy based on tendencies it observes in its opponents.


This chart shows how Pluribus's blueprint strategy improved over the course of training; its performance was evaluated against the final training snapshot. The researchers did not use search in these comparisons, and the levels shown for ordinary and top human play are based on discussions with human professionals. The chart also shows when Pluribus stopped "limping," a play that advanced human players usually avoid.


The researchers trained the blueprint strategy over 8 days on a 64-core server, using less than 512 GB of memory and no GPUs. At typical cloud rates this would cost only about $150, a tiny amount compared with other AI projects, including other self-play AIs. Thanks to their algorithmic improvements, the researchers achieved major performance gains in a low-cost computing environment.

3. A more efficient search strategy

Because of the size and complexity of no-limit Texas Hold'em, the blueprint strategy is necessarily coarse-grained. During actual play, Pluribus improves on the blueprint with real-time search, computing a better, finer-grained strategy for the specific situation it faces.

AI bots have long used real-time search in perfect-information games, with methods including two-ply search, alpha-beta pruning, and Monte Carlo tree search. For example, when deciding its next move, a chess AI typically considers several subsequent moves, until the lookahead reaches a leaf node or the algorithm's depth limit.

However, these search methods are unsuitable for imperfect-information games because they do not account for an opponent's ability to shift to a different strategy beyond the leaf nodes. That weakness makes the search produce brittle, unbalanced strategies that opponents quickly learn to exploit, and it is one reason previous AI bots could not scale the game to six participants.

Pluribus instead uses a new approach in which the search explicitly accounts for the reality of imperfect-information games: any player may shift to a different strategy beyond the leaf nodes of the subgame. Specifically, rather than assuming all players follow a single fixed strategy beyond the leaf nodes (which would give each leaf a single fixed value), the researchers assume that when search reaches a leaf node, each player chooses among four different continuation strategies for the rest of the game.

The four continuation strategies used in Pluribus are: the precomputed blueprint strategy; a modification of the blueprint biased toward folding; a modification biased toward calling; and a modification biased toward raising.
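The effect of multi-strategy leaf evaluation can be sketched in a few lines (a simplified illustration with invented numbers, not Facebook's implementation): instead of giving a leaf one fixed value, the searcher evaluates it under every continuation strategy and pessimistically assumes the opponent will adopt whichever one is worst for the searcher. An exploitable line then scores poorly even if it looks great against the blueprint alone.

```python
# The four continuation strategies named in the text.
CONTINUATIONS = ["blueprint", "fold-biased", "call-biased", "raise-biased"]

def leaf_value(values_by_continuation):
    """Pessimistic leaf evaluation: assume the opponent continues with
    whichever strategy is worst for the searching player."""
    return min(values_by_continuation[c] for c in CONTINUATIONS)

# Hypothetical leaf values for the searcher (assumed for illustration):
# a bluff-heavy line thrives if opponents keep folding, but a
# call-biased opponent punishes it severely.
bluff_heavy = {"blueprint": 3.0, "fold-biased": 8.0,
               "call-biased": -6.0, "raise-biased": -2.0}
balanced = {"blueprint": 2.0, "fold-biased": 3.0,
            "call-biased": 1.0, "raise-biased": 0.5}

print(leaf_value(bluff_heavy))  # -6.0: the exploitable line scores poorly
print(leaf_value(balanced))     # 0.5: the balanced line is preferred
```

Under a single-value evaluation the bluff-heavy line would look better (3.0 vs. 2.0 against the blueprint); evaluating against adaptive continuations reverses the ranking, which is the balancing pressure the next paragraph describes.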

This technique lets the searcher find a more balanced strategy that performs better overall, because choosing an unbalanced strategy would drive opponents toward the continuation strategies that punish it. For example, in rock-paper-scissors, if I only ever play rock, my opponent will quickly learn to only play paper.

As the researchers point out, another challenge of search in imperfect-information games is that the best strategy for a given situation depends on how opponents perceive the player's overall style. For example, in Texas Hold'em, if a player never bluffs, opponents will learn that they should always fold whenever that player raises.

To handle this, Pluribus tracks, for every possible hand, the probability of reaching the current situation under its own strategy. Regardless of which hand it actually holds, Pluribus first works out the action it would take with every possible hand, carefully balancing its strategy across all of them so that opponents cannot predict its next move. Only once this balanced strategy over all hands has been computed does Pluribus execute the action for the hand it actually holds.

During play, Pluribus runs on just two CPUs. By comparison, AlphaGo used 1,920 CPUs and 280 GPUs in its 2016 matches against Lee Sedol. Pluribus also uses no more than 128 GB of memory. Searching a single subgame takes between 1 and 33 seconds depending on the situation. On average, Pluribus plays twice as fast as human professionals: in six-player self-play it takes only about 20 seconds per hand.

4. How does Pluribus fare against human players?

The researchers evaluated Pluribus against a group of elite human poker players, including Chris "Jesus" Ferguson (2000 World Series of Poker champion), Greg Merson (2012 World Series of Poker champion), and Darren Elias (four-time World Poker Tour champion). The full list of human players: Jimmy Chou, Seth Davies, Michael Gagliano, Anthony Gregg, Dong Kim, Jason Les, Linus Loeliger, Daniel McAulay, Nick Petrangelo, Sean Ruane, Trevor Savage, and Jake Toole.

When AI systems have faced humans in other benchmark games, machines have sometimes performed very well at first, only to lose once human players discovered their weaknesses. For an AI to truly master a game, it must keep winning even as human players gradually adapt. Over several days, the professional poker players played thousands of hands against Pluribus, giving them ample time to search for weaknesses and adjust.

Elias said: "Pluribus is playing against the best poker players in the world."

The following is the interface Pluribus and the human players used in the experiment:


The experiment had two modes: five human players against one AI, and one human player against five copies of the AI. In each mode there were therefore six players at the table, each starting every hand with 10,000 chips, with a small blind of 50 chips and a big blind of 100.

Although poker is a game of skill, it involves a great deal of luck: with bad cards, even a top professional can lose money over a 10,000-hand session. To reduce the role of luck, the researchers used AIVAT, a variance-reduction algorithm that estimates a baseline value for each situation and subtracts it, reducing variance while keeping the sample unbiased. For example, if Pluribus is dealt a very strong hand, AIVAT subtracts a baseline from its winnings to offset the good luck.
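The principle behind this can be shown with a toy simulation (this sketches only the baseline-subtraction idea, not the actual AIVAT estimator, and all the numbers are assumed): each hand's observed result is a small skill edge buried under large zero-mean luck, and subtracting an accurate zero-mean luck estimate leaves the edge visible with far less variance.

```python
import random
import statistics

random.seed(42)

SKILL_EDGE = 0.05  # assumed true per-hand edge, in big blinds

raw, corrected = [], []
for _ in range(10000):
    luck = random.gauss(0, 10)           # large variance from the card deal
    noise = random.gauss(0, 1)           # residual play-dependent noise
    result = SKILL_EDGE + luck + noise   # observed winnings for the hand
    baseline = luck                      # idealized zero-mean luck estimate
    raw.append(result)
    corrected.append(result - baseline)  # luck removed, expectation unchanged

# Same expected value, far lower spread: fewer hands are needed to
# distinguish genuine skill from variance.
print(statistics.mean(raw), statistics.stdev(raw))
print(statistics.mean(corrected), statistics.stdev(corrected))
```

Because the subtracted baseline has zero mean, the corrected estimator measures the same skill edge; it just reaches statistical significance in far fewer hands, which is why a 10,000-hand experiment sufficed.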

5 human players + 1 AI

In this experiment, the 10,000 hands between the human players and the AI were played over 12 days, with five human players selected each day to face it. The players split $50,000 in prize money according to their performance, to motivate them to play their best. With AIVAT applied, Pluribus's win rate is estimated at about 5 big blinds per 100 hands (standard error 5 bb/100), a decisive victory over elite human poker players (p-value for profitability of 0.021). So if each chip were worth $1, Pluribus would win an average of $5 per hand, or about $1,000 per hour. This exceeds the win rate professional players typically achieve against a mix of professionals and amateurs.

After the experiment, Ferguson said: "Pluribus is so hard to play against! It's hard to pin it down on any hand. It is not only very good at making thin value bets, but also at extracting maximum value from its strong hands."

It is worth noting, however, that Pluribus is intended as a tool for AI research: the researchers use poker only as a way to measure AI progress in imperfect-information multi-agent interaction relative to top human ability.

5 AIs + 1 human player

Ferguson, Elias, and Linus Loeliger took part in this experiment. Loeliger is regarded by many as the world's best player of six-player no-limit Texas Hold'em. Each of them played 5,000 hands against five copies of Pluribus. Pluribus does not adapt its strategy to its opponents, so deliberate collusion among the bots was not a concern. Overall, the humans lost 2.3 bb per 100 hands: Elias lost 4.0 bb/100 (standard error 2.2 bb/100), Ferguson lost 2.5 bb/100 (standard error 2.2 bb/100), and Loeliger lost 0.5 bb/100 (standard error 1.0 bb/100).


This chart shows Pluribus's average win rate against the professional poker players over the course of the 10,000-hand experiment. The solid line shows the actual result; the dashed lines show one standard deviation.

"The biggest advantage of this AI is its ability to use mixed strategies," Elias said. "Humans try to do the same thing; for people, it's a problem of execution."

Because Pluribus's strategy was learned entirely through self-play, without any human data, it also offers an outside perspective on what optimal play looks like in multiplayer no-limit Texas Hold'em.

In some respects, Pluribus confirmed conventional human wisdom about good play.

For example, although Pluribus initially experimented with limping while computing its blueprint strategy offline through self-play, it gradually abandoned the play as self-play continued.

But Pluribus disagrees with the conventional wisdom that "donk betting" (ending one betting round with a call and then opening the next round with a bet) is a mistake: Pluribus donk bets far more often than professionals do.

"It was really interesting to see some of the strategies the poker AI chose," Gagliano said. "There were several plays that humans simply don't make at all, especially regarding its large bet sizes."


This chart shows Pluribus's chip winnings against the top human players. The solid line shows the actual result; the dashed lines show one standard deviation.

5. From poker to other imperfect-information games

AI has achieved many remarkable successes in two-player zero-sum games of perfect information. But most real-world strategic interactions involve hidden information and are not two-player zero-sum. Pluribus's success shows that even in large-scale, extremely complex multi-participant settings, carefully constructed self-play and search algorithms can perform well, despite the absence of strong theoretical guarantees.

Pluribus is also distinctive in that its training and inference costs are far lower than those of other recent AI systems on benchmark games. Some researchers in the field worry that future AI research will be dominated by large teams with massive computing resources; the researchers believe Pluribus is strong evidence that novel methods can drive top-tier AI research with only modest compute.

Although Pluribus was developed to play poker, the techniques it uses are not poker-specific, and its development required no poker domain expertise. The work gives us a better basic understanding of how to build general AI for multi-agent environments that contain both other AI agents and humans. Building a general multi-agent AI also lets researchers benchmark AI results against the peak of human ability.

Of course, the approach taken in Pluribus may not succeed in every multi-agent setting. In poker, participants have little opportunity to communicate with other agents, and it is possible to construct very simple coordination games in which self-play alone cannot find a good strategy.

Still, many real-world interactions, including fraud prevention, cybersecurity, and content moderation, can potentially be modeled the way Pluribus models poker: as scenarios involving hidden information and/or multiple agents with limited communication. The techniques behind its Texas Hold'em play may thus help the AI community build more effective strategies in other areas.

Finally, Tuomas Sandholm and his CMU teams have spent the past 16 years working on strategic reasoning technology. Pluribus builds on and incorporates much of that strategic-reasoning technology and code, along with poker-specific code written jointly by CMU and Facebook that will not be used in defense applications.

References:

https://ai.facebook.com/blog/pluribus-first-ai-to-beat-pros-in-6-player-poker

https://www.nature.com/articles/d41586-019-02156-9

https://science.sciencemag.org/content/early/2019/07/10/science.aay2400
