
The (Unofficial, Better) Final Rankings


#1

There were two concerns with scoring in the finals.

  1. There were some slow servers at the start of the finals, which caused extra timeouts for bots that needed a lot of CPU time and memory.
  2. Ranks were very volatile, right up to the end.

I've taken the raw game data and rescored it to eliminate these two effects.

Errors

Here, you can see the high error rate at the start of the finals:

To correct this, I removed all games with errors from the scoring. I initially tried removing just the first batch of games, where the error rate was higher, but Trueskill really needs some initial games with players of different skill levels.
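
As a rough illustration of that filtering step (the field names here are my own guesses; the real code and data are in the notebook linked below):

```python
# Hypothetical sketch of dropping error games before rating. It assumes each
# game record carries an "errors" list naming any players that timed out or
# crashed; the real schema is whatever the raw game data uses.
sample_games = [
    {"id": 1, "finish_order": ["shummie", "erdman"], "errors": []},
    {"id": 2, "finish_order": ["mzotkiew", "timfoden"], "errors": ["timfoden"]},  # timeout
]

clean_games = [g for g in sample_games if not g["errors"]]
print(len(clean_games))  # 1 -- only the error-free game gets rated
```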

Volatility

Secondly, I fixed the volatility to some degree. Here is the estimated skill for the top 20 players, over the finals:

The problem is that the ranking algorithm is tuned by default to work for humans, whose skill levels change constantly. To handle this, Trueskill becomes a little less certain about the past after every new game, allowing it to respond to a person who is suddenly better or worse. For these bots, however, we need to disable this and let Trueskill build up a better picture of each bot's score. Here's the top 20 over time without the added uncertainty:

Way smoother!
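
For reference, here's roughly what that looks like with the `trueskill` Python package (a simplified sketch, not the exact code from the notebook; the players and game below are just illustrative):

```python
# Minimal sketch of rating free-for-all games with the dynamics factor
# disabled, using the trueskill package (https://trueskill.org). With tau=0 a
# bot's rating uncertainty is never re-inflated between games, so the estimate
# keeps tightening as more games come in.
import trueskill

env = trueskill.TrueSkill(tau=0.0)  # only change from the defaults
ratings = {}  # player name -> trueskill.Rating

def rate_game(finish_order):
    """Update ratings for one game; finish_order lists players, best finisher first."""
    groups = [(ratings.setdefault(p, env.create_rating()),) for p in finish_order]
    updated = env.rate(groups, ranks=list(range(len(finish_order))))
    for player, (new_rating,) in zip(finish_order, updated):
        ratings[player] = new_rating

rate_game(["shummie", "mzotkiew", "erdman", "timfoden"])

# Conservative leaderboard ordering (env.expose is mu - 3*sigma with the defaults).
leaderboard = sorted(ratings, key=lambda p: env.expose(ratings[p]), reverse=True)
print(leaderboard)
```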

Before I give the revised rankings: you can access the full data and code to run this yourself at https://github.com/DanielVF/Halite-Ranking-Viz/blob/master/Unofficial%20Scoring.ipynb . I'd also like to thank @Janzert for automating and sharing the collection of game results, and the organizers for a fantastic contest.

The (Unofficial, Better) Top 120 Final Rankings

I'd recommend picking the higher of your official or unofficial rankings, and using that number to brag to your friends.

#1 mzotkiew
#2 shummie
#3 erdman
#4 timfoden
#5 cdurbin
#6 nmalaguti
#7 PeppiKokki
#8 DexGroves
#9 KalraA v92
#10 ewirkerman

#11 moonbirth
#12 acouette
#13 MoreGames
#14 Ziemin
#15 jstaker7
#16 veden
#17 tondonia
#18 fohristiwhirl
#19 tmseiler
#20 cdmurray80

#21 Maximophone
#22 bencalderhead
#23 En3rG
#24 djma
#25 davidgratton
#26 daniel-shields
#27 JWGS1
#28 yangle
#29 frabi
#30 david-wu

#31 bouwkast
#32 varak69
#33 GaudyZircon
#34 hmate9
#35 Gullesnuffs
#36 b7500af1
#37 TheDuck314
#38 danielborowski
#39 breeko
#40 BigBallerShotCaller

#41 DanielVF
#42 Daniel-Wang
#43 Oreshnik
#44 schmit
#45 MrTwiggy
#46 other-ai
#47 jediahkatz
#48 Sydriax
#49 hyPiRion
#50 kindanoob

#51 MHeasell
#52 bengo1023
#53 alexhad6
#54 0x0L
#55 brianvanleeuwen
#56 Kaczmarczyck
#57 dbf256
#58 kragbot
#59 kbcole
#60 arjunvis

#61 jheilema-nerdery
#62 henripal
#63 Vlad-Shcherbina
#64 bveber
#65 jiatinglu99
#66 markstev
#67 happypepper
#68 hilkoc
#69 jwcdbd
#70 sam-huang1223

#71 sanjeevtewani
#72 adrienball
#73 Andrew-peng
#74 Tautvis
#75 DaanPosthuma
#76 hetong007
#77 TheoKanning
#78 NGamma
#79 bmansfie
#80 nvengal

#81 jerzhang75
#82 platatat
#83 KLFrost
#84 funrollloops
#85 agarap
#86 cyberferret44
#87 mh6283-halite
#88 zluhcs
#89 AisleCC
#90 Patricksm

#91 mbrezu
#92 Risitop
#93 byronwall
#94 NeonYazzle
#95 shubhamjain0594
#96 bireland
#97 DollarAkshay
#98 neverfox
#99 pepijno
#100 dcdulin

#101 navidmx
#102 rossmacarthur
#103 potatoes-are-salty
#104 HelgiMagg
#105 uber5001
#106 charlesxxxx
#107 simonbevan
#108 gbenedisgrab
#109 adereth
#110 HalfVoxel

#111 vagarwala
#112 matthalbersma
#113 bkchiu0
#114 Rexxy4
#115 ccapo
#116 chadz
#117 ahitsdavid
#118 revoklaw
#119 Bobbadillio
#120 jstanderfer

#2

Removing all games with errors is definitely going to unfairly give people like @KalraA a boost.


#3

I agree that it gives an unfair boost to people who occasionally time out in regular games. That's why my original plan was to just chop out the initial games, so that everything stayed representative. I guess I could combine the two: only remove error games during the high-error period.


#4

Rank vs Score stability

I think stability of scores doesn't convey how the rank actually moves around. One could imagine a situation where all scores are jumping around within a very tight range, but the ordering is also changing radically, so you would still see the leaderboard moving a lot. Stably converging scores are a necessary but not sufficient condition for a good ranking algorithm: stably converging ranks are what we need!

To test this, I took the set of 95k final games that @DanielVF kindly provided and plotted rank (leaderboard position) recalculated at every game. That is, assuming games were processed serially, one by one, every time a game completed and player scores had updated, I re-sorted the scores and calculated the tied rank of each of the 1592 players.
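
A small sketch of that bookkeeping, using one common definition of tied rank (equal scores share the best position); the scores here are invented stand-ins for the live Trueskill scores:

```python
# Re-sort everyone after each processed game and record leaderboard positions,
# with tied scores sharing the same (best) rank. Purely illustrative numbers.
def tied_ranks(scores):
    """scores: player -> rating score. Returns player -> rank (1 = best)."""
    ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    ranks, current_rank, prev_score = {}, 0, None
    for position, (player, score) in enumerate(ordered, start=1):
        if score != prev_score:
            current_rank, prev_score = position, score
        ranks[player] = current_rank
    return ranks

print(tied_ranks({"shummie": 48.1, "mzotkiew": 48.1, "erdman": 47.5}))
# {'shummie': 1, 'mzotkiew': 1, 'erdman': 3}
```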

I've shown 3 charts: the top 20 players, the top 7 players, and @JWGS1 (to pick an example of a single player).

The competition path wouldn't have followed this exactly, mainly because the games probably weren't processed in order one by one - I think the serial number refers to creation order, not completion order. This is a source of noise, since (I'm pretty sure) Trueskill is not commutative: feeding in 10 games in a different order would result in different final Trueskill scores, due to tau.

I get (yet another) similar-yet-different top 40 ranking, presented below with the implied disclaimers and blessing for bragging rights. But my point is really the marked fluctuations in leaderboard rank: 2-5 place moves for the very top players (see @nmalaguti and @PeppiKokki swap places), and 20, 50, or even 100 place moves, even at the very end, for mere mortals.

This, to my mind, is really why Trueskill is not useful.

So, if I'm so smart, what's the solution, you ask?

:slight_smile: I don't know, but I'm fairly sure it's not Trueskill. We should choose a better-suited ranking algorithm for Halite 2.0. My reasons for picking on Trueskill: the empirically observed fluctuations, plus the fact that Halite proficiency is neither normally distributed nor transitive (if a beats b, and b beats c, it's entirely possible that c reliably beats a).

What does everyone think?

And the ranking from one-by-one processing of games:

1 shummie
2 mzotkiew
3 erdman
4 timfoden
5 cdurbin
6 nmalaguti
7 PeppiKokki
8 moonbirth
9 ewirkerman
10 DexGroves
11 KalraA
12 cdmurray80
13 fohristiwhirl
14 veden
15 jstaker7
16 bencalderhead
17 Maximophone
18 djma
19 breeko
20 En3rG
21 tmseiler
22 frabi
23 vagarwala
24 cbovar
25 jediahkatz
26 JWGS1
27 ecneto
28 tondonia
29 MoreGames
30 Ziemin
31 kbcole
32 samuelfekete
33 Patricksm
34 jwcdbd
35 yangle
36 daniel-shields
37 jannengm
38 hyPiRion
39 DollarAkshay
40 ccapo

#5

Nice, @arjunvis. Have you tried running the same charts with Tau set to zero?


#6

Very interesting stuff. :slight_smile:


#7

Here is a quick look at the performance of a couple of rating systems on the games from the halite finals.

A quick rating system comparison


#8

Awesome! Science!

I'm guessing something that makes all of these algorithms look bad here is that the games in the database are almost all seeded between bots of nearly indistinguishable ability. It's a coin toss who would win between bot #700 and bot #720. But if you tested these expectations with games between rank #700 and rank #150, you'd see a much smaller error.
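
To make the coin-toss point concrete, here's the usual Trueskill win-probability formula applied to some invented ratings (none of these numbers come from the actual leaderboard):

```python
# P(a beats b) under the TrueSkill model:
#   Phi((mu_a - mu_b) / sqrt(2*beta^2 + sigma_a^2 + sigma_b^2))
# The ratings below are invented purely to contrast near vs. distant opponents.
import math
import trueskill

def win_probability(a, b, env=trueskill.global_env()):
    denom = math.sqrt(2 * env.beta ** 2 + a.sigma ** 2 + b.sigma ** 2)
    return env.cdf((a.mu - b.mu) / denom)

bot_720 = trueskill.Rating(mu=25.0, sigma=1.0)  # hypothetical mid-pack bot
bot_700 = trueskill.Rating(mu=25.3, sigma=1.0)  # hypothetical near neighbour
bot_150 = trueskill.Rating(mu=33.0, sigma=1.0)  # hypothetical much stronger bot

print(win_probability(bot_700, bot_720))  # ~0.52: essentially a coin toss
print(win_probability(bot_150, bot_720))  # ~0.91: a far more confident prediction
```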

It's amazing how close Trueskill and the global Plackett-Luce are.

I'd be concerned with a test percentage of 10%. Given how noisy the game results are, and that we need at least 200 games per bot for things to stabilize - I'm guessing you might get different accuracy results depending on the luck of what was in that 10%.


#9

Thanks.

Yes definitely. See the end section; for instance the top 10 players have a 14 percentage point better prediction rate than the overall rate with default trueskill.

Yes, the performance of both iterative systems is quite impressive.

Just to be clear, the ratings are created from 90% of the data and then predictions are run against the other 10%.

The test 10% is still 9556 games with 44163 predictions, so not too small. It probably still would have been a good idea to run cross validation, but at ~3 hours per training run for plackett-luce that would have taken much too long (see below, though).
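
In sketch form, a prediction check over a held-out set boils down to something like the following (a simplified pairwise version; the details of the actual metric differ):

```python
# For every pair of players in each held-out test game, predict that the
# higher-rated player finishes ahead, and report the fraction of pairs
# predicted the wrong way round.
from itertools import combinations

def pairwise_error(test_games, score):
    """test_games: finish orders (best finisher first); score: player -> rating."""
    wrong = total = 0
    for finish_order in test_games:
        for better, worse in combinations(finish_order, 2):
            total += 1
            if score(better) < score(worse):  # rating predicted the wrong order
                wrong += 1
    return wrong / total

# A single 4-player test game contributes 6 pairwise predictions.
print(pairwise_error([["a", "b", "c", "d"]], {"a": 3, "b": 4, "c": 2, "d": 1}.get))
```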

Just now I made another 90-10 split and reran the basic test against it. This test set has 932 (9.8%) games that are the same as in the original test set. The results are generally worse on the training error and better on the test error, but the order of performance between the various ratings stays the same.

Train 1/Test 1 are the original results, Train 2/Test 2 the new results, and Test diff is the difference in the test results between the two sets.

| Method              | Train 1| Test 1 | Train 2| Test 2 | Test diff
|---------------------|--------|--------|--------|--------|----------
| Plackett-Luce       | 41.44% | 43.87% | 41.56% | 42.99% | 0.88%
| Trueskill           | 41.70% | 44.72% | 41.84% | 44.24% | 0.48%
| Trueskill tau=0     | 41.33% | 44.24% | 41.33% | 43.37% | 0.87%
| Weng-Lin BT-FP      | 42.19% | 44.56% | 42.25% | 44.12% | 0.44%
| Weng-Lin PL         | 43.34% | 45.81% | 43.36% | 45.37% | 0.44%

About 2 hours after posting the comparison yesterday I found a new paper from 2015 with a much faster method to converge Plackett-Luce ratings. Eventually I found that the author has within the last 3 months published a python library implementing it. It generates the ratings in ~14 seconds on the full finals. Trueskill takes about 1 minute. That is absolutely amazing. I haven't had a chance to implement and test incremental updates to simulate new games being added yet. But this makes Plackett-Luce fast enough for a lot more situations now.

I'm leaving town for business this week, but possibly after I get back I can run a cross validation. I will certainly update the section on processing time.


#10

That's crazy that it's so fast.

The issue with the 10% isn't the total number of games - it's the number of games per bot. At even 40 games per bot (which is only happening for the best bots), there's still so little data that even if you were checking a perfect, God-given, infinitely accurate and precise bot ranking, you'd still see a lot of variation over that sample.

Compare the Bayesian ranking at 40 games vs at 200 games. Much noise.

Trueskill tau=0, across your three tests, has results varying from 44.24% to 41.33%. Considering that 50% error is the base result we would expect from randomness, the gain over randomness ranges from 5.76% to 8.67%, which I'll call a 40% swing in how good it's looking - just from resampling the test data.

Perhaps a more accurate test would use 25% (or even 50%) of the games, and then check errors only on the bots that got the 400 games.


#11

Of course there needs to be a representative sample of the population of results in the test set, and I may very well have failed to achieve that. But I'm completely failing to see what the convergence rate of ratings has to do with this. One of the major reasons for having a test set of games is specifically that they are never used to update the ratings. And of course the more games you add to the test set, the fewer games the ratings will have in the training set to converge on.

It can be argued that the training set has more than enough games to get an acceptable level of convergence and the test set would be better off with them. The cross validation sort of lets you have your cake and eat it too in that regard.


#12

I think the convergence of rankings is a rough proxy for how many games you need to extract a signal of a certain accuracy from the noisy halite results.

(Halite results are only slightly noisy when you compare good bots against bad bots. But when you are comparing almost adjacent bots outside of the top 10, then the game results are as noisy as a jet engine. And it's those games between close bots that are going to make up the bulk of the validation set.)


#13

Ahh, I understand what you mean now. I think measuring the prediction error is a little different from creating ratings. Instead of needing to set a rating for each individual, it's checking one metric over the whole population. If we were trying to get the prediction error for each player, then it would probably need roughly the same number of results per player as is needed to find an accurate rating, and the results for players with few games would be suspect.


#14

I think I finally understand what you are saying as well - for purposes of verifying the rankings, testing the results of 400 randomly selected games should be the same as testing the results of 400 games of the same bot. In which case testing 9,556 games should be total overkill for verification.

Even though this makes intuitive sense to me now, I think the giant swings in the verification results over multiple tests show that it doesn't work out in practice, for whatever reason.


#15

I finally finished the full 10-fold cross validation. The order stays the same and all the systems show improved performance, but Trueskill tau=0 does relatively better and gets pretty close to Plackett-Luce.

| Method            | 10% Test | Cross  | Diff
|-------------------|----------|--------|------
| Plackett-Luce     |  43.87%  | 43.64% | 0.23%
| Trueskill         |  44.72%  | 44.45% | 0.27%
| Trueskill tau=0   |  44.24%  | 43.78% | 0.46%
| Weng-Lin BT       |  44.56%  | 44.20% | 0.36%
| Weng-Lin PL       |  45.81%  | 45.49% | 0.32%
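
In sketch form, the 10-fold procedure is simply: split the games into ten folds, rate on nine of them, measure prediction error on the held-out fold, and average. Something like the following, with `fit_ratings` and `prediction_error` as placeholders for whichever rating system and error metric are being compared:

```python
# Generic k-fold cross validation over a list of games: rate on nine folds,
# measure held-out prediction error on the tenth, and average over the folds.
import random

def ten_fold_error(games, fit_ratings, prediction_error, folds=10, seed=0):
    """Average held-out prediction error across `folds` folds of the game list."""
    games = list(games)
    random.Random(seed).shuffle(games)
    fold_errors = []
    for k in range(folds):
        test = games[k::folds]                                   # held-out fold
        train = [g for i, g in enumerate(games) if i % folds != k]
        ratings = fit_ratings(train)                             # e.g. run Trueskill over the training games
        fold_errors.append(prediction_error(test, ratings))
    return sum(fold_errors) / folds
```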