
Server hiccups in final stage


#22

@truell20, absolutely. My initial simulation was overly simplistic, so below is a slightly more rigorous one. It's in JavaScript, so those who don't like curly braces, cover your eyes please.
TLDR: in total, across all 1600 players, many tens of thousands of games are run. For two identical players, one of whom is penalised to simulate timeouts, converging to a score difference of 0.1 seems to take 200-400 games (source: ran it 4x, very scientific :smiley:).

Key points:
1) requires trueskill, lodash, probability-distributions, all installable via npm

2) An array of 1600 players with true rank 0, 1, ..., 1599 (0 is the best) is initialised. Players 2 and 3 are set to the same true rank. Every player has an initial skill array = [mu, sigma] = [25, 25/3], and score = mu - 3*sigma.

3) We pick 2-6 players according to the Halite distribution, weighted by sigma^2
(this is the way Halite does it IIRC, i.e. the variance rather than the stdev?)

4) Each player gets an in-game rank which is their true rank plus some randomness. This is to ensure a player could be beaten by someone who is on average 10 true places below them.

5) Player 2 is forced to lose 10% of the first 40 games (this is pretty tame, I can see some of you have 8-10 losses).

6) We run until players 2 and 3 have completed at least 41 games and their scores converge to within 0.1
(the 41-game minimum is there to prevent termination at 0 games, when their initial scores are already within 0.1 of each other).

For example, the final output of a run could look like:
...

65870 total games played ------
Player 2 ngames : 177
Player 2 score  : 70.65178381699933 
Player 2 skill  : 83.38413422814966,4.264651198306916

Player 3 ngames : 373
Player 3 score  : 70.72401365247441
Player 3 skill  : 89.62387460208126,4.856915870295299

So, as above: across all 1600 players many tens of thousands of games are run, and for players 2 and 3 to converge to within 0.1 it seems to take 200-400 games (source: ran it 4x, very scientific :smiley:).

And now, the code:

var trueskill = require("trueskill")
var _ = require('lodash')
var PD = require("probability-distributions");

var nplayers = 1600
var nplayersDistr = [2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 6]
var players = [],totalGames = 0.0
var gamePlayers, nplayersInGame, pickProbs, showStatus

// initialise the player array
for (var i = 0; i < nplayers; i++) {
players.push(
 {
	index: i,
	truerank: i,
	skill: [25.0, 25.0 / 3.0],
	score: 0,
	ngames: 0
})

}

// set player 2 and player 3 to have the same true rank
players[3].truerank = players[2].truerank

while (players[2].ngames <= 40 || players[3].ngames <= 40 || Math.abs(players[2].score - players[3].score) > .1) {

//the probability of getting picked = square of current trueskill sigma
pickProbs = _.map(players, function(p) {
	return Math.pow(p.skill[1], 2)
})

nplayersInGame = PD.sample(nplayersDistr, 1)[0]
gamePlayers = PD.sample(players, nplayersInGame, false, pickProbs)
showStatus = false

_.forEach(gamePlayers, function(p) 
{
	// set the players  rank for this game 
	p.rank = p.truerank + Math.random() * 20

	//force player 2 to lose 10% of the  first 40 games
	if (p.index == 2 && p.ngames <= 40) 
	{
		p.rank = Math.random()<.1?2000:p.rank
	}
	p.score = p.skill[0]-3*p.skill[1]
	p.ngames++

	// log info if either player 2 or 3 was in this game
	showStatus = showStatus || (p.index == 2 || p.index == 3)
})

//update skill mu and sigma
try
{
	trueskill.AdjustPlayers(gamePlayers)
}
catch(err)	{}

totalGames++

if (showStatus) 
{
	console.log(totalGames + " total games played ------")
	console.log("Player 2 ngames : " + players[2].ngames)
	console.log("Player 2 score  : " + players[2].score)
	console.log("Player 2 skill  : " + players[2].skill + '\n')

	console.log("Player 3 ngames : " + players[3].ngames)
	console.log("Player 3 score  : " + players[3].score)
	console.log("Player 3 skill  : " + players[3].skill + '\n')
	console.log("----------------")
}
}

#23

Too late here for me to think about how it will affect your results, but the way you are doing pairing is not the method used on the servers.

At first glance I think what you have is almost the worst-case pairing for trueskill. (EDIT: Notice that after almost 400 games your player 3's sigma is still close to 5!)

Server pairing (for the finals) goes roughly like this: pick a seed from among the players with the lowest number of games, then choose the rest of the players via a Pareto distribution over players ranked by absolute difference in mu from the seed player. See lines 113 (seeding) and 116-117 (picking the rest of the players) of ManagerAPI.php.


#24

@Janzert Right you are - changes made - I now get around 150 games to equalize.

BTW, to pre-hedge myself: I'm of course not doing this for my own rankings. I think if I was 36 (at optimistic best) pre-final, it would be a low-probability event indeed for me to even break 30.

Really just to help us all see how many extra games might be needed to mitigate the server issues. My finger-in-the-air estimate: 100-200, which is quite tractable, so that's good. :slight_smile:

73915 total games played ------
Player 2 ngames : 157
Player 2 score  : 55.50761887806812
Player 2 skill  : 60.31956152687806,1.6856951586938065

Player 3 ngames : 154
Player 3 score  : 55.45801789060513
Player 3 skill  : 59.87601841858881,1.6935072644271214

and le new code, quick fix

var trueskill = require("trueskill")
var _ = require('lodash')
var PD = require("probability-distributions");

var nplayers = 1600
var nplayersDistr = [2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 6]
var players = [],totalGames = 0.0
var gamePlayers, nplayersInGame, pickProbs, showStatus

// initialise the player array
for (var i = 0; i < nplayers; i++) {
players.push(
 {
	index: i,
	truerank: i,
	skill: [25.0, 25.0 / 3.0],
	score: 0,
	ngames: 0
})

}

// set player 2 and player 3 to have the same true rank
players[3].truerank = players[2].truerank


while (players[2].ngames <= 40 || players[3].ngames <= 40 || Math.abs(players[2].score - players[3].score) > .1) {

nplayersInGame = PD.sample(nplayersDistr, 1)[0]
// pick seed
players = _.sortBy(players,'ngames')
var seedPlayer = players[0]

_.forEach(players,function(p)
{
	p.pickMetric = (p.index==seedPlayer.index)?Infinity:Math.abs(p.skill[0]-seedPlayer.skill[0])
})

players = _.sortBy(players,'pickMetric')

gamePlayers = [seedPlayer]
for(var i = 0;i<nplayersInGame-1;i++)
{
	gamePlayers.push(players[i])
}

showStatus = false

_.forEach(gamePlayers, function(p) 
{
	// set the players  rank for this game 
	p.rank = p.truerank + Math.random() * 20

	//force player 2 to lose 10% of the  first 40 games
	if (p.index == 2 && p.ngames <= 40) 
	{
		p.rank = Math.random()<.1?2000:p.rank
	}
	p.score = p.skill[0]-3*p.skill[1]
	p.ngames++

	// log info if either player 2 or 3 was in this game
	showStatus = showStatus || (p.index == 2 || p.index == 3)
})

//update skill mu and sigma
try
{
	trueskill.AdjustPlayers(gamePlayers)
}
catch(err)	{}

totalGames++

if (showStatus) 
{
	var player2 = _.filter(players,{"index":2})[0]
	var player3 = _.filter(players,{"index":3})[0]
	console.log(totalGames + " total games played ------")
	console.log("Player 2 ngames : " + player2.ngames)
	console.log("Player 2 score  : " + player2.score)
	console.log("Player 2 skill  : " + player2.skill + '\n')
	
	console.log("Player 3 ngames : " + player3.ngames)
	console.log("Player 3 score  : " + player3.score)
	console.log("Player 3 skill  : " + player3.skill + '\n')
	console.log("----------------")
}
players = _.sortBy(players,'index')
}

#25


Thought this was a nice summary of the end of Halite. Time to chill :smile:


#26

Well, that's better, but the sigmas are still way too high for the number of games played. At 150 games they should be much closer to 0.7 than 1.7 (although the top couple of players will generally be a little higher than the general population, still only about 0.74 maybe).

One part may be that it isn't getting great mixing, since after sorting the players by mu difference it needs to choose from a Pareto distribution rather than strictly taking the most desirable. The server implements it like this: first compute n = 5/(rnd)**0.65, where rnd is a uniform number in the range (0, 1]. Then take the top n players as sorted by the pickMetric above and uniformly sample from them for the rest of the players in the game (the server does this by randomizing their order and taking the top ones after that).
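
Roughly, in the style of the simulation code above (the function name and the clamping to the pool size are illustrative, not lifted from ManagerAPI.php):

var _ = require('lodash')

// Sketch: choose `count` opponents for seedPlayer, server-style.
function pickOpponents(players, seedPlayer, count) {
	// candidates sorted by absolute mu difference from the seed (the pickMetric above)
	var pool = _.sortBy(
		_.filter(players, function(p) { return p.index !== seedPlayer.index }),
		function(p) { return Math.abs(p.skill[0] - seedPlayer.skill[0]) }
	)
	// n = 5 / rnd^0.65, with rnd uniform in (0, 1]
	var rnd = 1 - Math.random() // Math.random() is [0, 1), so 1 - it is (0, 1]
	var n = Math.min(pool.length, Math.ceil(5 / Math.pow(rnd, 0.65)))
	// randomise the order of the top n and take the first `count` of them
	return _.shuffle(pool.slice(0, n)).slice(0, count)
}

Swapping something like this in for the strict players[0..n-2] pick in the code above should give better mixing than always taking the closest players.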

The other thing that may be throwing it off is that the player rank for a game that is fed into trueskill generally refers to the position they finish in that game, not their overall leaderboard rank. So for a 6-player game the player ranks should be 1-6, with rank 1 winning against everyone else and rank 6 losing to everyone else. The easy way to do it here is probably, just after the current forEach that assigns rank and such, to sort the players by p.rank and then assign p.rank 1-n.
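
For instance, a minimal sketch (assuming the gamePlayers objects and lodash _ from the simulation above; the helper name is made up):

var _ = require('lodash')

// Convert the noisy per-game ranks into finish positions 1..n
// (1 beats everyone else, n loses to everyone else).
function assignFinishPositions(gamePlayers) {
	_.forEach(_.sortBy(gamePlayers, 'rank'), function(p, i) {
		p.rank = i + 1
	})
}

Calling assignFinishPositions(gamePlayers) right before trueskill.AdjustPlayers(gamePlayers) in the loop above would then feed finish positions rather than leaderboard-style ranks.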


#27

Ahh, that looks nice.


#28

This is a half baked idea:

Trueskill never seems to get very sure about who's better. Part of this is because it backs off its certainty (sigma) if it gets a "surprising" result. I think the structure of Halite games may result in tons of "surprise" results.

I think Trueskill assumes that game outcome ranks are based on roughly independent in-game scores. For example, if I give four guys hammers and count how many nails they drove in an hour, then I'd have independent scores. Importantly, the relationship A > B > C > D is true with independent scores.

However, in Halite multi player games, it's more like a mini elimination tournament with random seeding. If the best player in a game starts next to the second best player in a six player game, then the second best player is likely to be the first to be eliminated, even though that second best player may be better than four other players in the game. This generates a "surprise" result for five of the six players in the game. Suddenly trueskill is less certain about everything.

I wonder if the only result relationship we could really be certain about is that FIRST > LAST. And I wonder how much more certain trueskill would end up being as a result.
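
If anyone wants to try it in the simulation above, a rough sketch of what I mean (the helper name is mine; it just reuses trueskill.AdjustPlayers on only the first- and last-placed players and leaves everyone else's skill untouched for that game):

var trueskill = require("trueskill")
var _ = require("lodash")

// Update skill for only the winner and the last-placed player of a game.
function adjustFirstVsLastOnly(gamePlayers) {
	var ordered = _.sortBy(gamePlayers, 'rank') // rank = in-game finish position
	var first = ordered[0]
	var last = ordered[ordered.length - 1]
	first.rank = 1
	last.rank = 2
	trueskill.AdjustPlayers([first, last])
}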


#29

Edit: Ok, I guess in my example below, player C is still > player D. So your point above is correct, but I'm leaving the below as an interesting thought exercise.

I'd argue that even that isn't necessarily true. Imagine a 4 player game.

Let's assume player A is ranked 1 and player B is ranked 2. Player C is ranked 10, player D is ranked 100. The map then happens to be set up so that players A and B run into each other and players C and D run into each other.

Player C is going to trounce player D. Player C then gets the opportunity to expand unopposed while players A and B duke it out. Player C comes by and says hahahhaa and eats players A & B.

The "best" way to measure who would be the best is to actually have every possible combination of positions be played out so that you can get a better idea of who is really the best on a particular map and matchups. The issue is then you're playing (my combinatoric math skills sucks now) about 10 games for this 4 player match, which completely slows down how many "real" matches you could play.


#30

Yeah, the relationship C > D would still hold there - and yeah, it ends up being the only reliable thing you can infer from the match.


#31

I'm not sure about everyone else, but I think my bot's skill level is now unaffected by those early timeouts. When I took a look at lunchtime I was above 47, and now I'm at 46.5.

So thanks for letting us play lots of games. Hopefully everyone else affected early on is seeing that they are in the right place now.

@shummie your bot definitely improved on Sunday, I'm nowhere near your league anymore. It probably also didn't help me that I completely messed up my 1 on 1 strategy on Sunday by making all changes geared towards 5 and 6 player games. Now in two player games I'm ignoring all those neutral tiles to protect myself from invisible enemies :slight_smile:


#32

**What properties should a good ranking algorithm have?** IMO:
- fast convergence (Edit: fast convergence of rank, if that is the metric used - not score or skill. Rank.)
- less fluctuation as more games are played
- should handle non-transitivity, i.e. in pairwise one-on-one matches @acouette could reliably beat @breeko, @breeko could reliably beat @cdurbin, and @cdurbin could reliably beat @acouette (I chose those names for a, b, c)

My gut feel is that the sigma term in trueskill is more trouble than benefit. It enforces non-convergence if anything :smiley: - all those CPU cycles Two Sigma is paying for go up in smoke.

I'm sure we have enough brainpower here to design a better ranking algorithm! :smiley: For example, with a new bot: using binary search, fight 2-player matches until you find a place on the leaderboard, then extend to 3, 4, 5, 6 player games in a principled way.
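
Purely as a sketch of the 2-player placement idea (playMatch here is a hypothetical function that runs one game and returns true if the new bot wins; with noisy single games you'd want best-of-k per comparison, but this is the shape of it):

// Binary-search a new bot into a leaderboard sorted best-first,
// using only 2-player matches against the bot at the midpoint.
function placeNewBot(leaderboard, newBot, playMatch) {
	var lo = 0, hi = leaderboard.length
	while (lo < hi) {
		var mid = Math.floor((lo + hi) / 2)
		if (playMatch(newBot, leaderboard[mid])) {
			hi = mid      // new bot won: it belongs at or above this spot
		} else {
			lo = mid + 1  // new bot lost: it belongs below
		}
	}
	leaderboard.splice(lo, 0, newBot) // insert at the found position
	return lo // roughly log2(n) games, ~11 for 1600 players
}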

Two more low-hanging fruit:
1) In a match with n players you have n! possible orderings of initial positions. I think it would definitely be better to make sure initial positions were perfectly symmetrical (at the moment they are almost, but not quite), rather than run all orderings.

2) I think the game engine could be radically sped up for Halite 2.0!


#33

Here's my "skill" (mu) and "certainty" (sigma) plotted from game 20 onwards.

The purple line is using a lower tau, so that trueskill updates more slowly.


#34

More useful would be a chart of rank from, say, game 50 onwards. That will pitilessly expose the fluctuations of trueskill. :expressionless:


#35

Here you go, @arjunvis. Official tau above, a much slower-updating tau below.

Note that the change around game 200 was the seeding change. Most higher-ranked players gradually moved down in mu for a little while.


#36

Thanks @DanielVF - if I read it correctly, this is score, not rank? Is it easy to look at what someone's leaderboard rank was over time? That's what I think fluctuates more than score. Cheers!


#37

This is "Mu", the best guess of your actual skill, with a shaded background of +-2 "sigma", how certain the algorithm is about your skill .

Your "score" shown in game is (mu - 3 x sigma).


#38

Gotcha. So rank is still fluctuating by 5, 10, 20 places for some users, even this late in the competition.

We should measure this if possible, to help the organisers choose/design a better ranking algorithm for Halite 2.0 (and for posterity!)