Halite Home

Concrete info: achieving top 50 via ML (Policy Gradients)


#1

Ed2: Why this is moderately interesting: it does not train by emulating the best human players, e.g. @nmalaguti or @erdman, but rather by playing many thousands of games against versions of itself and thus slowly levelling up. So it's a different sort of unsupervised learning.

[Ed: changed the constants to k in the rebonato form to make it more obvious what's going on, plus numerous minor edits]
First of all, many thanks again to the creators, and to my fellow competitors for their generosity, in particular @nmalaguti (your avatar makes me miss my dog), @erdman, and @KalraA, who was super helpful on the Discord. Also thanks to non-competing friends James Tromans and Diego Colombo for general sanity checks and advice.

The below details a different approach from either a) careful behaviour design or b) all-out unsupervised CNN-based learning. My bot was briefly #49 this morning (but it pogoes around a lot; my hypothesis on this is below). I've had great fun trying out all sorts of languages and techniques on this competition (none of which I had really messed around with before), and this post is, I think, genuinely different. There's lots of concrete info in here, but not so much that it spoils your fun. In sum, I grew my bots via evolution and policy gradients. Some of the things I did:

a) Speeding up halite.exe
The very first thing I did was recompile the halite source for my CPU using gcc -O3 -march=native. My machine has 2 old Xeons (2 x 10 cores). This step made halite.exe run approx 3x faster than the ready-to-run binary off the site. YMMV, but I highly recommend doing this step; it's basically free and can't hurt. I also used Node's cluster module to run the new fast halite.exe in parallel, which was about 4x faster than the Python parallel scheduler (no idea why). So in sum, old halite + parallel Python is 12x slower than recompiled halite + Node.js + cluster (for me).

b) Rewriting the simulation engine
Inevitably, I ended up doing this.
We want a fast function F: (owners,strengths,productions,moves) -> (new_owners,new_strengths)
I first started with @nmalaguti's excellent process_frame function from his visualiser. I eventually rewrote the processor from scratch in MATLAB (hey, I like it). The underlying numerics use C, the process functions can all be recast as BLAS operations, and eventually I had something that could simulate approx 20,000 frames a second (that is, assuming you know the moves, it can do about 100 games of 200 frames each per second). Bottom line: rewrite the process_frame function if you haven't already. You can then use it directly in your simulator and disintermediate halite.exe entirely. BTW, I design my bots in MATLAB (disclaimer: I don't work for MathWorks, but I do like it; it has a pleasant IDE and makes parallelisation easy).
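For illustration, here is a minimal numpy sketch (not the MATLAB version) of the vectorised style. It implements only one piece of a process_frame-style update, the production growth for still squares; movement and combat resolution are omitted, so treat it as a starting point rather than a faithful engine.

```python
import numpy as np

STILL = 0  # illustrative move code; the real engine also has NORTH/EAST/SOUTH/WEST

def grow_still_squares(owners, strengths, productions, moves):
    # Vectorised sketch of one piece of a process_frame-style update:
    # owned squares that issue STILL gain their production, capped at 255.
    # Movement and combat resolution are omitted for brevity.
    grow = (owners > 0) & (moves == STILL)
    new_strengths = strengths.copy()
    new_strengths[grow] = np.minimum(255, strengths[grow] + productions[grow])
    return new_strengths

# Tiny 2x2 board: player 1 owns the top row, everyone stays STILL.
owners = np.array([[1, 1], [0, 0]])
strengths = np.array([[250, 10], [5, 5]])
productions = np.array([[10, 3], [1, 1]])
moves = np.zeros((2, 2), dtype=int)
print(grow_still_squares(owners, strengths, productions, moves))
```

The point is that every rule of this kind can be expressed as boolean masks and elementwise array ops, which is what makes BLAS-speed simulation possible.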

c) Functional specification of bot behaviour

The remarkable overkillbot shows how complex behavior can arise from simple blocks.
I chose the following representation:
Our emphasis on Area, Production, Damage, and Strength should vary through time. Define 4 functions A(t), P(t), D(t), and S(t), where each function has a Rebonato form f(t) = (k1 + k2*t) * exp(-k3*t) + k4 (4 parameters). This function is great: it's very fast to calculate (certainly faster than complex behaviours like shortest path), is differentiable, and covers all sorts of real-life behaviours: constant, linear, rise-then-fall, saturating, etc.

So that's just 16 parameters in all. Amazingly parsimonious.
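To make the shapes concrete, here is a small numpy sketch of the Rebonato form with a few illustrative parameter choices:

```python
import numpy as np

def rebonato(t, k1, k2, k3, k4):
    # f(t) = (k1 + k2*t) * exp(-k3*t) + k4, with t the normalised game time in [0, 1]
    return (k1 + k2 * t) * np.exp(-k3 * t) + k4

t = np.linspace(0.0, 1.0, 5)
print(rebonato(t, 0, 0, 0, 1))   # constant (the "area only" shape)
print(rebonato(t, 0, 1, 0, 0))   # linear ramp
print(rebonato(t, 0, 10, 5, 0))  # rise-then-fall hump
```

Four numbers per curve buy you all of these qualitatively different time profiles, which is exactly what makes the 16-parameter representation so cheap to search.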

Now, at every turn, make your square do the move that maximizes (A(t) + P(t) + D(t) + S(t)). Note we are not parametrising the behaviour directly (unlike, say, a CNN that converts game pixels into a move in R^5); we are parametrising the objective function.

Now simulate a few thousand games and see how you did. Tweak the parameters (this alters your functions A, P, D, S) and simulate again, until you have evolved a good bot. If you have a few, try averaging their parameters. You can grow bots for small arenas, bots for crowded arenas, etc. Essentially, policy gradient on the 16 params of the functions makes it cheap to grow a huge family of bots.
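The tweak-simulate-keep loop can be sketched as simple hill climbing over the 16 parameters. Note the fitness function below is a toy quadratic standing in for "simulate a batch of games and score the bot"; everything about it is illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

def fitness(params):
    # Toy stand-in for "simulate a batch of games and score the bot":
    # a quadratic with its optimum at all-ones, purely for illustration.
    return -np.sum((params - 1.0) ** 2)

def evolve(n_iters=200, sigma=0.1):
    # Tweak-simulate-keep loop: perturb the 16 parameters, re-score,
    # and keep the perturbation only if it scored better.
    params = np.zeros(16)
    best = fitness(params)
    for _ in range(n_iters):
        candidate = params + sigma * rng.standard_normal(16)
        score = fitness(candidate)
        if score > best:
            params, best = candidate, score
    return params, best

params, best = evolve()
print(best)  # improves over the all-zeros seed
```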

Remarkably, this simple local behaviour alone is good enough to get to around rank 80 (as of 5 Jan!) reliably. My briefly-#49 bot has a few more checks and balances.

The first time I ran this (it took about 2 hours to fully simulate a few thousand games with 6 bots), I was absolutely astonished by the complex behaviours that emerged. But equally, the bot did some very silly things. Even looking at the history of arjunvis14, you will see some really cool pincer movements, situations where the bot encloses but doesn't eat squares, to "keep them in reserve", and points where the bot takes cover behind outcrops to let others fight it out. You will also see some god-awful silliness (anthropomorphising) where it noodles around aimlessly in very random and clearly suboptimal ways. Hence the rank moves around: it will occasionally beat 4 very "strong" players, but then lose to 2 "weak" ones.

There is a temptation to specify complex behaviours (sneak to the highest production at the start) and let the algo work out when to use them, but I find that if the behaviours are too complex there is a severe speed penalty. If you go from being able to run 5000 games an hour to 500, that cuts down the amount of parameter exploration you can do. There's lots of fun here in seeing just how complex the behaviours can get. But basically, the juice in this method is in being able to simulate tons and tons of games. And on a personal level, I like the idea of low-level simple behaviours rather than very complex behaviours.

The quality and motivation of competitors is so high that I fully expect a rank 50 today to be a rank 200 in 4 weeks (cf. Brian's ML bot, which was top 10 once). I hope the above has been useful in that it details a very different approach. There is enough concrete info to get you started, but no spoilers to take the fun out of it. Thanks everyone and good luck!


#2

Thanks @tt293 for pointing out that the ML bot moves when strength >= opposing strength. Strict inequality would avoid some wastage. Maybe that will drag it to 49 from 50 (currently 50) :wink: . Cheers.


#3

Very interesting approach.

Now, at every turn make your square do the move that maximizes (A(t) + P(t) + D(t) + S(t))

So I guess you are looking at the moves of each square individually rather than iterating through combos?

Tweak the parameters

You're doing this manually, or staying quiet about the specific algo you use to tweak?

I'm surprised about the need for D(t). Wouldn't the other three functions be sufficient? I also feel like you are missing one. Perhaps strength / distance-to-border, something like that? That would encourage the bot to move strength to useful places.


#4

Also, when you find the move that maximizes objective function, what do you assume about the actions of opponents?


#5

Hi @erdman -
a) So far, square by square, no combos. You can definitely see this in the amount of cap damage my bot takes; room for improvement there.

b) Good grief, not manually :wink:. I have a family, and it feels like I can spend 15 min max a day on Halite :slight_smile: . So it's all automatic. As Agastya I think said, optimization is super important, so I did put some thought into this one. Specifically, when it was slow to run loads of games, I initially used Bayesian optimization with a thin-plate kernel function phi(x1,x2) = d(x1,x2)^2 * log d(x1,x2), where d is Euclidean distance and x1, x2 are 1x16 vectors in parameter space. This worked well.
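A minimal sketch of that thin-plate kernel, plus an interpolating surrogate fitted through observed (parameter, score) pairs; the points and scores below are synthetic stand-ins for real simulation results:

```python
import numpy as np

def thin_plate(x1, x2):
    # phi(x1, x2) = d^2 * log(d), with d the Euclidean distance (taken as 0 at d = 0)
    d = np.linalg.norm(np.asarray(x1, float) - np.asarray(x2, float))
    return 0.0 if d == 0.0 else d * d * np.log(d)

# Fit an interpolating RBF surrogate through observed (params, score) pairs.
# The scores here are synthetic; really they would come from simulated games.
rng = np.random.default_rng(0)
X = rng.standard_normal((8, 16))   # 8 observed 1x16 parameter vectors
y = rng.standard_normal(8)         # synthetic scores at those points
Phi = np.array([[thin_plate(a, b) for b in X] for a in X])
w = np.linalg.solve(Phi, y)        # interpolation weights

def surrogate(x):
    # Cheap predicted score at an unobserved parameter vector x
    return sum(wi * thin_plate(x, xi) for wi, xi in zip(w, X))

print(surrogate(X[0]))  # interpolates the observed score y[0]
```

In a Bayesian-optimization loop you would maximise this cheap surrogate (or an acquisition function built on it) to pick the next parameter vector to actually simulate.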

As I sped up simulation, it became less necessary to use "clever" methods and more OK to use dependable ones. Right now, since simulating a game is so fast, I run about 500 games to completion at a time, then estimate the gradient via finite differences, then use Adagrad with the estimated gradient. That is, it can now be treated like any old optimization problem.
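A sketch of that routine: central finite differences for the gradient, then Adagrad on the estimate. The quadratic score function is a clean stand-in for the noisy "mean score over ~500 games" objective:

```python
import numpy as np

def fd_gradient(f, x, eps=1e-2):
    # Central finite-difference gradient estimate. In the real setting,
    # f(x) is a noisy "mean score over ~500 simulated games", not a clean function.
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

def adagrad_ascent(f, x, lr=0.5, steps=100):
    # Adagrad with the estimated gradient (ascent, since we maximise score).
    cache = np.zeros_like(x)
    for _ in range(steps):
        g = fd_gradient(f, x)
        cache += g * g
        x = x + lr * g / (np.sqrt(cache) + 1e-8)
    return x

# Toy score function standing in for the simulator; optimum at all-ones.
score = lambda p: -np.sum((p - 1.0) ** 2)
p = adagrad_ascent(score, np.zeros(16))
print(np.round(p, 2))  # converges toward the all-ones optimum
```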

The parameter optimization is automatic, but it's interesting to look manually at the resulting curves. E.g., oddly (at least as the bot finds), the importance of production is not maximal at time 0 and monotonically decreasing, but rather tends to peak around 37% into the game (which is close to 1/e; could be coincidence).

c) @tt293 and I had a brief think. My feeling is the combination of A(t) and D(t) differentiates between enemy encounters (combat) and environment encounters. Dropping either results in significantly worse performance, so since it's cheap to keep both, at the moment I keep both.

Definitely lots of room for improvement: e.g., one could use the strength/production heuristic, the time-to-capture of a square, etc. Distance to border I am conflicted by; it is useful but about 20-50x slower to calculate than simpler arithmetic metrics. What would you recommend? Your bots are exceptionally efficient; all advice is very welcome.

Since your bot is only 100 LOC, I theorize that it too contains some form of policy gradient or parametrised behaviour? (I say that because hand-coding behaviours in 100 LOC seems impossible.)

Lastly:
d) I assume nothing explicitly about the actions of opponents. However, the bot can't inflict damage unless it is within L1 distance 1 of an enemy, so there must be some implicit behaviour being inferred; e.g., the bot figures out the max speed or min time-to-first-engagement with an enemy.


#6

This is amazing.

Congratulations on being the first person (so far as I'm aware) to get an RL bot working (and working well!) on Halite!!!


#7

Thanks @Sydriax, although @hmate9, @hetong007, @ccapo, and @jstanderfer all self-describe as using ML/AI, so they may be using RL too.
I can definitely say mine is all RL, with no CNN in this version (v14). As I play around with more techniques, I plan to make my language name more informative, e.g. "80% RL 20% Dijkstra" or similar.

Btw, how can I change my language name? Is it as simple as having a text file LANGUAGE (no extension) with the contents "RL"? Many thanks and have a great weekend.


#8

Yup! Include it at the root of your zip file submission.


#9

@arjunvis Thanks for this writeup! It's amazing to hear about how this approach has worked for you.

I have a few questions:

  1. What is the objective function that you're optimizing in order to produce the 'best' 16 parameters? Is it something like 'average value of (A(t) + P(t) + D(t) + S(t)) across all turns across all games simulated'? Or something that just relates directly to how many games the bot with those parameters won?
  2. What do you suggest as a good initial guess for the 16 parameters to kickstart the optimization (or maybe you're using an optimization method that doesn't need an initial state... I'm new to these ideas)?

Thanks again!


#10

Hi, re the objective function: let's say you have a square s, and doing a move will change area by dA, strength by dS, production by dP, and damage by dD.

E.g., a still move would have [dA, dS, dP, dD] = [0, production of s, 0, 0].

Now, the objective function to maximise is W_a*dA + W_s*dS + W_p*dP + W_d*dD,

where each W is a parametrised function of t, t being the normalised game time = frame/maxframes. Specifically:

W_a(t) = (k1 + k2*t) * exp(-k3*t) + k4
W_s(t) = (k5 + k6*t) * exp(-k7*t) + k8
etc.

and k1...k16 are real constants.

I seeded them with all ks = 0, except k4 = 1. That is, initially you only care about area.
(There are a couple of other postprocessing tricks, but the main info is all up here.)

That simple version probably wouldn't do better than rank 100 at the moment, but I rather like it as it is ex nihilo, quite elegant, and somewhat unique.
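Putting the pieces above together, a minimal sketch of scoring candidate moves with the seeded weights; the delta vectors are made-up illustrative numbers, not real game values:

```python
import numpy as np

def W(t, k):
    # W(t) = (k[0] + k[1]*t) * exp(-k[2]*t) + k[3], with t = frame / maxframes
    return (k[0] + k[1] * t) * np.exp(-k[2] * t) + k[3]

def move_score(deltas, ks, t):
    # deltas = [dA, dS, dP, dD] for a candidate move; ks = four 4-vectors (16 params)
    return sum(W(t, k) * d for k, d in zip(ks, deltas))

# The "all ks = 0 except k4 = 1" seed: only area is rewarded initially.
seed = [np.array([0.0, 0.0, 0.0, 1.0])] + [np.zeros(4)] * 3

# Hypothetical deltas, purely for illustration:
capture = [1, -10, 2, 5]  # an attacking move that gains a square
still = [0, 12, 0, 0]     # a still move banks production as strength
print(move_score(capture, seed, t=0.1), move_score(still, seed, t=0.1))
```

With this seed the capturing move outscores standing still regardless of the strength cost, which is exactly the "initially you only care about area" behaviour; optimization then reshapes the four curves away from that.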


#11

Sorry @jacobenget, I specified the per-square objective function; you mean the parameter-optimising function. **To decide how good a set of params is, the objective function is simply:** run, say, 1000 games using the distribution of num_players and sizes as observed in the game (https://2016.forums.halite.io/t/distribution-of-sizes-and-num-players/850 ; one of my other posts was kindly replied to by the experts on the details of this distribution). Calculate the TrueSkill score of your bot: that's the objective function. I.e., literally run many games against a combination of some control like overkillbot and your past few good bots, and see how it does. An OK proxy could be the mean rank of your parametrised bot; if that mean = 1.00, your bot wins every match.
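The mean-rank proxy can be sketched as below. Note play_game and the num_players/size distributions are hypothetical stand-ins so the sketch runs; the real version would call the fast simulator and use the measured distributions from the linked post:

```python
import random

def play_game(params, num_players, size, rng):
    # Hypothetical stand-in: the real version invokes the fast simulator
    # and returns the bot's finishing rank (1 = won the game).
    return rng.randint(1, num_players)

def mean_rank(params, n_games=1000, seed=0):
    # Fitness of a parameter vector: mean finishing rank over many games,
    # sampling num_players and map size per game (the distributions below
    # are made-up placeholders, not the measured ones).
    rng = random.Random(seed)
    total = 0
    for _ in range(n_games):
        num_players = rng.choice([2, 2, 3, 4, 5, 6])
        size = rng.choice([20, 25, 30, 35, 40, 50])
        total += play_game(params, num_players, size, rng)
    return total / n_games

print(mean_rank(None))  # 1.00 would mean winning every match
```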


#12

Fantastic. That seed of "ks = 0 except k4 = 1" makes sense. And I see how the TrueSkill score is a fine way to determine the fitness of the bot.

I agree that the simplicity of this approach is really attractive... even if it doesn't do amazingly well, it does well enough to impress with its emergent behavior.

Thanks!