This competition is unique among Kaggle contests in that there is a history of submissions from previous years. My idea was to model not only the probability of each team winning each game, but also the competitors’ submissions. Combining these models, I searched for the submission with the highest chance of finishing with a prize (top 5 on the leaderboard). A schematic of my approach is below. The three main processes are shaded in blue: (1) A model of the probability of winning each game, (2) a model of what the competitors are likely to submit, and (3) an optimization of my submission based on these two models.
[---]
Finally, I used these models to come up with an optimal submission by simulating the bracket and the competitors’ submissions 10,000 times. This essentially gave me 10,000 simulated leaderboards of the competitors, and my goal was to find the submission that most frequently showed up in the top 5 of the leaderboard. I tried to use a general-purpose optimizer, but it was very slow and gave poor results. Instead, I sampled pairs of probabilities from the posterior many times, and chose the pair that was in the top 5 the most times. If I had naively used the posterior mean as a submission, my estimated probability of being in the top 5 would have been 15%, while my estimated probability for the optimized submission (with two entries) went up to 25%.
The competitors’ submission model was trained on 2015 data. To assess the quality of the model, I have plotted the simulated distribution of the leaderboard losses for 2016 and 2017 and compared to the actual leaderboards. 2016 seems well in line, but 2017 had more submissions with lower losses than predicted. For both years, the actual 5th place loss was right in line with what was expected.
- March Machine Learning Mania, 1st Place Kaggle Winner's Interview
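The optimization step described in the interview can be illustrated with a small Python sketch. This is not the winner's code: the logit-normal "posterior", the noise model for rival submissions, the sizes (500 simulations instead of 10,000), and the names `sample_posterior`, `log_loss`, and `top5_frequency` are all assumptions made for illustration. It only shows the general idea of scoring candidate pairs against simulated leaderboards and keeping the pair that lands in the top 5 most often.

```python
import numpy as np

rng = np.random.default_rng(0)
n_games, n_sims, n_rivals, n_pairs = 63, 500, 100, 200  # illustrative sizes

# Hypothetical posterior over win probabilities: logit-normal around a mean.
mean_logit = rng.normal(0.0, 1.0, size=n_games)

def sample_posterior(size):
    """Draw `size` vectors of per-game win probabilities (assumed model)."""
    logits = mean_logit + rng.normal(0.0, 0.5, size=(size, n_games))
    return 1.0 / (1.0 + np.exp(-logits))

def log_loss(pred, outcome):
    """Mean binary log loss over the last axis (games)."""
    pred = np.clip(pred, 1e-15, 1 - 1e-15)
    return -np.mean(outcome * np.log(pred) + (1 - outcome) * np.log(1 - pred), axis=-1)

def top5_frequency(entries, outcomes, rival_losses, k=5):
    """Fraction of simulations where at least one entry ranks in the top k."""
    # entries: (n_entries, n_games), outcomes: (n_sims, n_games)
    my_loss = log_loss(entries[:, None, :], outcomes[None, :, :]).min(axis=0)
    # k-th lowest rival loss per simulation; beating it means rank <= k.
    kth = np.partition(rival_losses, k - 1, axis=1)[:, k - 1]
    return np.mean(my_loss < kth)

# Simulate tournaments: draw "true" probabilities, then game outcomes.
outcomes = (rng.random((n_sims, n_games)) < sample_posterior(n_sims)).astype(float)

# Simulate rival submissions (assumed: noisy versions of the same central estimate).
rival_logits = mean_logit + rng.normal(0.0, 0.7, size=(n_sims, n_rivals, n_games))
rival_losses = log_loss(1.0 / (1.0 + np.exp(-rival_logits)), outcomes[:, None, :])

# Baseline: submit a Monte Carlo estimate of the posterior mean as a single entry.
posterior_mean = sample_posterior(4000).mean(axis=0)
baseline_score = top5_frequency(posterior_mean[None, :], outcomes, rival_losses)

# Random search: sample pairs from the posterior, keep the most successful pair.
best_score, best_entries = -1.0, None
for _ in range(n_pairs):
    pair = sample_posterior(2)
    score = top5_frequency(pair, outcomes, rival_losses)
    if score > best_score:
        best_score, best_entries = score, pair

print(f"posterior-mean entry, estimated top-5 frequency: {baseline_score:.3f}")
print(f"best sampled pair, estimated top-5 frequency: {best_score:.3f}")
```

Because the best pair is selected on the same simulations used to score it, its estimated top-5 frequency is optimistic; a fairer estimate would re-score the chosen pair on a fresh set of simulated leaderboards.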