Super Smash Bros. Ultimate is the latest installment in Super Smash Bros., a series of competitive multiplayer games by Nintendo. In Super Smash Bros., players pick from a roster of Nintendo characters and duke it out on a 2D platformer-esque stage. In Smash Ultimate, the roster features 87 unique characters, which, in a 2-player game, makes 7569 possible matchups (counting ordered pairs, including mirror matches). Each character has different fighting stats and abilities, so each matchup is unique.
Statement of Purpose
I want to model the outcome of a match given the characters and the skill level of the players. It would be interesting to see how effective certain characters are at different skill levels. I would also like to know if the stages the matches are played on affect the outcome, or make no difference. The goal of this project is more about interpretation than about raw predictive power.
Data
Source
I sourced my data from smashdata.gg through their GitHub. The file I downloaded was ultimate_player_database.zip.
Missing Data
Fortunately, smashdata.gg provides some insight into why data is missing and how it has been handled. They say, "A lot of tournaments also don't report full data, meaning game counts and character data have been lost." Character data is our primary focus, and its missingness seems to depend only on which tournament a set comes from, not on the values themselves. It is therefore reasonable to treat the missing data as missing completely at random, meaning the missingness does not depend on the missing values themselves nor on any other values in the dataset. Therefore, I've decided to remove all the rows where the game data is completely absent. If the game data is present but the character data is absent, then I filled in the character value with -1 to indicate this absence.
Data cleaning
The data from smashdata.gg first needs to be cleaned in order to be used for modeling.
We are given a database file which contains information about sets, tournaments, players, and so on. I am only interested in the set data and, specifically, the game_data column in sets. The game data is given as a JSON string, so I will extract it into a dataframe.
Code
import sqlite3
import numpy as np
import pandas as pd
import json
import math
import random
from IPython.display import clear_output

cnx = sqlite3.connect('data/ultimate_player_database.db')
query = "SELECT game_data FROM sets WHERE game_data != '[]'"
df = pd.read_sql_query(query, cnx)

data = list()
for k in range(len(df)):
    if k % 50000 == 0:
        clear_output()
        print("{prop}% complete".format(prop = math.floor(100*k/len(df))))
    game_str = df.loc[k][0]  # raw JSON string for this set's game data
    while True:
        id1 = game_str.find("{")
        id2 = game_str.find("}")
        if (id1 != -1) and (id2 != -1):
            new_data = json.loads(game_str[id1:id2+1])
            data.append(new_data)
        else:
            break
        game_str = game_str[id2+1:]

game_df = pd.DataFrame(data).fillna(-1)
game_df.to_csv("data/game_data.csv", index=False)
clear_output()
print("100 % complete")
Now I’ll modify the character data so it is easier to read.
We need to transform our data to be impartial to the winner or loser. We currently have winner_id, loser_id, winner_char, loser_char, etc., but what we want is p1_id, p2_id, p1_char, p2_char, and p1_won. Otherwise, we wouldn't be predicting anything, because we would know who won beforehand. Let's transform our data to add our desired columns.
I want to add columns that record how many games a player has played and won, so we have a sort of "skill" metric. However, we need to be careful about when we compute this. If we calculate it across the entire dataset, then, for example, the model could deduce that someone with 0 games won has a 0% chance of winning and get 100% accuracy for those rows. Someone with 0 games previously won might have a low chance of winning, but it is certainly not zero. So, I'm going to split the ENTIRE dataset in half and calculate games played and games won from one half only. That way, these metrics are independent of both the training and test sets, and any data points lost from our training / test sets are missing by random chance.
Code
def transform_games_played(dataframe):
    p1s = dict(dataframe["p1_id"].value_counts())
    p1wins = dict(dataframe[dataframe["p1_won"] == True]["p1_id"].value_counts())
    p2s = dict(dataframe["p2_id"].value_counts())
    p2wins = dict(dataframe[dataframe["p1_won"] == False]["p2_id"].value_counts())
    players = set(p1s.keys()).union(set(p2s.keys()))
    games_played = list()
    for item in players:
        totalsum = 0
        winsum = 0
        if item in p1s.keys():
            totalsum += p1s[item]
        if item in p2s.keys():
            totalsum += p2s[item]
        if item in p1wins.keys():
            winsum += p1wins[item]
        if item in p2wins.keys():
            winsum += p2wins[item]
        games_played.append({"player_id": item, "games_played": totalsum, "games_won": winsum})
    games_played_df = pd.DataFrame(games_played)
    return games_played_df

def clean_games(given_df, skill_df):
    games_played_df = transform_games_played(skill_df)
    output_df = pd.merge(given_df, games_played_df, left_on="p1_id", right_on="player_id", how="left")
    output_df.rename(columns={"games_played": "p1_games_played", "games_won": "p1_games_won"}, inplace=True)
    output_df = pd.merge(output_df, games_played_df, left_on="p2_id", right_on="player_id", how="left")
    output_df.rename(columns={"games_played": "p2_games_played", "games_won": "p2_games_won"}, inplace=True)
    output_df["p1_games_won"].fillna(0, inplace=True)
    output_df["p1_games_played"].fillna(0, inplace=True)
    output_df["p2_games_won"].fillna(0, inplace=True)
    output_df["p2_games_played"].fillna(0, inplace=True)
    output_df.reset_index(inplace=True)
    output_df.drop(columns=["index", "player_id_x", "player_id_y"], inplace=True)
    return output_df

_skill_df = imp_game_df.sample(frac=0.5, random_state=2049)
remaining_df = imp_game_df.drop(_skill_df.index)
clean_game_df = clean_games(remaining_df, _skill_df)     # The output rows are independent of the games played and won.
skill_clean_game_df = clean_games(_skill_df, _skill_df)  # The output rows are NOT independent of the games played and won.
clean_game_df.to_csv("data/clean_game_data.csv", index=False)             # To be used for training/test.
skill_clean_game_df.to_csv("data/dependent_skill_data.csv", index=False)  # To be used in case we want to create more metrics.
We can now better use our data for fitting models.
Codebook
This codebook identifies each of the variables in clean_game_data.csv.
| Variable | Description |
|---|---|
| p1_id | A string identifying the first player. |
| p2_id | A string identifying the second player. |
| p1_games_played | An integer indicating how many games the first player has played in dependent_skill_data.csv. |
| p2_games_played | An integer indicating how many games the second player has played in dependent_skill_data.csv. |
| p1_games_won | An integer indicating how many games the first player has won in dependent_skill_data.csv. |
| p2_games_won | An integer indicating how many games the second player has won in dependent_skill_data.csv. |
| p1_char | A string indicating the character of the first player. |
| p2_char | A string indicating the character of the second player. |
| stage | A string identifying the stage the match was played on. |
| p1_won | A boolean indicating if the first player won. |
Exploratory Data Analysis
First, let’s load our data in.
Code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import random

game_df = pd.read_csv("../Data/data/clean_game_data.csv")
skill_data = pd.read_csv("../Data/data/dependent_skill_data.csv")
Correlation Matrix
Let’s start by plotting a correlation matrix between our numeric variables.
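The plotting cell itself isn't reproduced here; the following is a minimal sketch of how such a matrix could be drawn with pandas and matplotlib. The column list is taken from the codebook above; everything else (color map, layout) is my own assumption rather than the original code.

# Minimal sketch of a correlation matrix plot over the numeric columns.
numeric_cols = ["p1_games_played", "p1_games_won", "p2_games_played", "p2_games_won", "p1_won"]
corr = game_df[numeric_cols].corr()

fig, ax = plt.subplots()
im = ax.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(numeric_cols)))
ax.set_yticks(range(len(numeric_cols)))
ax.set_xticklabels(numeric_cols, rotation=45, ha="right")
ax.set_yticklabels(numeric_cols)
fig.colorbar(im)
ax.set_title("Correlation matrix")
plt.show()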
We see that p1_games_played and p2_games_played are correlated, meaning players with more games played are more likely to play against each other. We also see that p1_games_played and p1_won are correlated, so games_played might be a good indicator of skill. Let’s explore this further.
“Games Played” as Skill
I want to establish that games played may be a good indicator of skill. To do this, I’ll ask the question: How likely are you to beat someone who has played at least k times as many games as you?
Let’s write some code to figure this out.
Code
def k_times_more_games_win_rate(k, df=game_df):
    return (
        (len(df[(df["p1_won"] == True) & (df["p1_games_played"] > k*df["p2_games_played"])]) +
         len(df[(df["p1_won"] == False) & (df["p2_games_played"] > k*df["p1_games_played"])]))
        /
        (len(df[(df["p1_games_played"] > k*df["p2_games_played"])]) +
         len(df[(df["p2_games_played"] > k*df["p1_games_played"])]))
    )

graph = [k_times_more_games_win_rate(n/4) for n in range(4, 40)]
plt.plot([(n/4) for n in range(4, 40)], graph)
plt.xlabel("k")
plt.ylabel("Chance of winning")
plt.title("Chance of winning given a player has played at least k times as many games as their opponent")
plt.show()
print(k_times_more_games_win_rate(1))
0.6192630428086423
This graph justifies the belief that games played is a metric of skill level, because someone with more games played has a higher chance of winning.
Win Percent vs Skill by Character
I suspect that character win rate might change depending on how skilled a player is. For example, maybe Kirby is a really good character until you reach a certain level, at which point he becomes less viable.
Let’s define some functions that will help us graph win chance against games played.
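Those helper functions aren't reproduced here; below is a rough sketch of the idea using the column names from the codebook. The binning scheme (bin_size, max_games) and the exact character strings are my assumptions, not the original implementation.

# Rough sketch: win rate as a function of games played, for one character at a time.
def char_win_rate_by_games_played(char, bin_size=50, max_games=1000, df=game_df):
    bins, rates = [], []
    for lo in range(0, max_games, bin_size):
        hi = lo + bin_size
        as_p1 = df[(df["p1_char"] == char) & (df["p1_games_played"] >= lo) & (df["p1_games_played"] < hi)]
        as_p2 = df[(df["p2_char"] == char) & (df["p2_games_played"] >= lo) & (df["p2_games_played"] < hi)]
        games = len(as_p1) + len(as_p2)
        if games == 0:
            continue
        wins = (as_p1["p1_won"] == True).sum() + (as_p2["p1_won"] == False).sum()
        bins.append(lo)
        rates.append(wins / games)
    return bins, rates

for char in ["Ness", "Luigi", "Joker", "Samus"]:
    x, yvals = char_win_rate_by_games_played(char)
    plt.plot(x, yvals, label=char)
plt.xlabel("games played")
plt.ylabel("win rate")
plt.legend()
plt.title("Win rate vs. games played, by character")
plt.show()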
We see that the variance of these lines increases with games played, likely because there are fewer players with many games played. We can still make out some trends visually: the mean win rates of Ness and Luigi stay relatively constant, while the means for Joker and Samus seem to decrease. This supports our hypothesis that win percent is not constant with respect to skill.
Distribution of Games Played
Let's verify that there are fewer players with many games played.
Code
df1 = skill_data[["p1_id", "p1_games_played"]].rename(columns={"p1_id": "player_id", "p1_games_played": "player_games_played"})
df2 = skill_data[["p2_id", "p2_games_played"]].rename(columns={"p2_id": "player_id", "p2_games_played": "player_games_played"})
player_games_played_df = (
    pd.concat([df1, df2], ignore_index=True)  # DataFrame.append is deprecated; concat is equivalent here
      .astype({"player_id": "string", "player_games_played": "int32"})
      .drop_duplicates(ignore_index=True)
      .sort_values("player_games_played")
      .reset_index(drop=True)
)
plt.hist(player_games_played_df["player_games_played"].to_list(), bins=range(0, 200, 5))
plt.title("Histogram of games played")
plt.ylabel("count")
plt.xlabel("games played")
plt.show()
# plt.hist(player_games_played_df[player_games_played_df["player_games_played"] >= 200]["player_games_played"].to_list(), bins=range(200, 2000, 50))
# plt.show()  # has the same 1/x shape
We can see that there are far fewer players with many games played than with few games played.
Win Percent vs Skill of Specific Matchups
Finally, I want to look at how the win percents of a matchup of two characters changes with respect to games played.
I'll modify our previous functions and plot a graph of the win ratios of Pikachu versus Ganondorf.
Again, we see that the win percentage changes as a function of the skill level of the players.
Overview
We've now gathered sufficient information about the relationships between our variables and have a better idea of which models we can use to fit our data. It seems as though games played and games won will likely fit a logistic regression, but other variables such as the character data might need more flexible models.
Fitting Models
Metrics
For classification problems, the two metrics to focus on are accuracy and the area under the ROC curve. In general it is better to use roc_auc because it accounts for class imbalances; however, it is more difficult to interpret than accuracy. Because I randomized which player is listed as p1 when I transformed the data, there is no class imbalance present, so for hyperparameter tuning I'm going to score by accuracy, which is much easier to interpret. Let's verify that there is no class imbalance:
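The check itself is short; a sketch of it might look like the following (the relative path here mirrors the one used in the EDA section and is an assumption).

import pandas as pd

# Sketch of the class-balance check: the two classes should be roughly 50/50.
game_data = pd.read_csv("../Data/data/clean_game_data.csv")
print(game_data["p1_won"].value_counts(normalize=True))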
I dummy-coded some of the categorical variables right away so the data is easier to use.
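The dummy-coding cell isn't shown above; one way to do it with pandas is sketched below. Dropping the player id columns and one-hot encoding character and stage are my assumptions about what "some of the categorical variables" means here.

# Sketch of the dummy coding (assumes player ids are dropped and character/stage are one-hot encoded).
game_data = game_data.drop(columns=["p1_id", "p2_id"])
game_data = pd.get_dummies(game_data, columns=["p1_char", "p2_char", "stage"])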
We need to split our data into a testing set and a training set so we can train our models and evaluate their accuracy. I am going to use a 4:1 ratio, and stratify across our response variable p1_won, so we don’t develop any unexpected imbalances across our training and testing environments.
Code
from sklearn.model_selection import train_test_split, StratifiedKFold

game_train, game_test = train_test_split(game_data, train_size=0.8, stratify=game_data[["p1_won"]], random_state=2049)
# We stratified on our response, p1_won.
# This shouldn't make too much of a difference because we randomized which player is p1 and p2,
# however it is still good practice.
X = game_train.loc[:, game_train.columns != "p1_won"]
y = game_train["p1_won"]
game_folded = StratifiedKFold(n_splits=5).split(X, y)
Hyperparameter Tuning
The following block of code will make it easier to save and load models.
Code
import os
import joblib

def fitmodel(model, filename, df=game_train):
    # Fit and save the model if it hasn't been saved yet; otherwise load the saved version.
    if not os.path.isfile(filename):
        model.fit(X, y)
        joblib.dump(model, filename)
    else:
        modeltemp = joblib.load(filename)
        if (type(model) != type(modeltemp)) or \
           (tuple([k[0] for k in model.steps]) != tuple([k[0] for k in modeltemp.steps])):
            print("\033[93m Warning: model mismatch. Delete the file {filename} and rerun or risk faulty models.\n\033[0m".format(filename=filename))
        model = modeltemp
    return model
To tune hyperparameters, we need to create a grid of possible values for our models and select the best set of hyperparameters for each model. To select the best set, we use cross validation. This process involves "folding" our dataset into several disjoint sets, training a model on all but one fold, and evaluating it on the held-out fold. We then average these results to get a cross-validated score for each set of hyperparameters; the combination with the best cross-validated score should be the best set of hyperparameters for our model.
Elastic Net
In sklearn, there aren't recipes. Instead, there are pipelines, which work a bit like a recipe and a workflow combined. We can apply transformations to our data, including selecting predictors, normalizing variables, and adding models.
To find the best hyperparameters for our models, our first step is to set up a pipeline. Let’s begin by setting up a pipeline for an elastic net.
An elastic net is similar to a linear regression, except it can perform better when overfitting may be present. It combines the strengths of a lasso and ridge regression. For our model, we used a logistic elastic net because it is a classification problem.
I’ve only selected games played and games won for this model, because for the other predictors such as character to be useful, we would need to add interaction terms. To add interaction terms, we would use sklearn.preprocessing.PolynomialFeatures as a step in our pipeline. However, because there are so many characters, my computer runs out of memory when adding these terms. Regardless, an elastic net is probably not the best way to deal with these interactions anyway - a more flexible model like a decision tree or random forest can automatically find interactions between variables. Let’s wait to use a different model before adding these predictors.
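The pipeline definition isn't shown above. Below is a sketch that is consistent with the fitted model printed in the evaluation section (elastic-net penalty with the saga solver); the step name "elastic_net" and the scaling step are my assumptions, while the names en_pipe and en_predictors match how they are used in the grid-search cell below.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Sketch of the elastic net pipeline: scale the numeric predictors, then fit a logistic elastic net.
en_predictors = ["p1_games_played", "p1_games_won", "p2_games_played", "p2_games_won"]
en_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("elastic_net", LogisticRegression(penalty="elasticnet", solver="saga")),
])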
Our next step is to create a grid of hyperparameters we want to check. For an elastic net, I've chosen to tune l1_ratio and C. C controls the penalty strength of the elastic net (in sklearn it is the inverse of the regularization strength, so smaller values of C mean a stronger penalty), and l1_ratio controls the ratio of the l1 to the l2 penalty, in other words, how similar the elastic net is to a lasso regression versus a ridge regression. l1_ratio thus takes values between 0.0 and 1.0, while C can take any positive value.
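The exact values searched aren't listed in the text; the sketch below shows the shape of such a grid, with values chosen so that the reported best combination (C = 0.01, l1_ratio = 1.0) is included.

# Sketch of the elastic net hyperparameter grid (values are assumptions).
en_grid = {
    "elastic_net__C": [0.01, 0.1, 1.0, 10.0],
    "elastic_net__l1_ratio": [0.0, 0.25, 0.5, 0.75, 1.0],
}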
Lastly, we try each combination of hyperparameters and pick the model with the best cross-validation score. We fold our data with k = 5 and take the average cross-validated score for each combination of hyperparameters, then select the highest average score across all combinations.
Code
from sklearn.model_selection import GridSearchCV

game_folded = StratifiedKFold(n_splits=5).split(X, y)
en_grid_search = GridSearchCV(estimator=en_pipe, param_grid=en_grid, n_jobs=1, cv=game_folded, scoring='accuracy', error_score=0, verbose=4)
en_grid_result = en_grid_search.fit(X[en_predictors], y)  # We pass in X[en_predictors] because the result is the same but slightly faster.
Running this code, we find that the best combination of hyperparameters across our search is l1_ratio = 1.0 and C = 0.01. This has a cross-validated accuracy of 0.643, which is a good start.
Decision Tree
A decision tree classifier operates a bit like a game of 20 questions: at each node, the model asks a true-false question and goes down a branch of the tree. After it has asked enough questions, it assigns a class to the observation.
We perform a similar process as with the elastic net for each of our models. This time we're also going to use the character and stage data as predictors, and for hyperparameters we are going to tune max_depth and min_samples_leaf. Both help with overfitting, which is a big problem for decision trees: max_depth controls the maximum depth of the tree, and min_samples_leaf controls the minimum number of samples required at a leaf node.
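The decision tree cell isn't shown above; a sketch of the setup, under the assumption that the grid values below were the ones searched (they at least include the reported best values), could look like this:

from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Sketch of the decision tree pipeline and grid (step name and grid values are assumptions).
dt_pipe = Pipeline([("decision_tree", DecisionTreeClassifier())])
dt_grid = {"decision_tree__max_depth": [5, 10, 20, None],
           "decision_tree__min_samples_leaf": [1, 3, 10, 30]}
dt_search = GridSearchCV(dt_pipe, dt_grid, cv=StratifiedKFold(n_splits=5).split(X, y), scoring="accuracy", error_score=0)
dt_result = dt_search.fit(X, y)
print(dt_result.best_params_, dt_result.best_score_)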
We find that the best max_depth is 10, and the best min_samples_leaf is also 10. Our cross-validated accuracy is 0.647, so this model performs similarly to our elastic net. Let's fit our model and continue.
Random Forest
A random forest classifier works by creating many decision trees and combining their predictions. Each tree in the random forest uses a random subset of the available predictors; usually, the size of this subset is the square root of the total number of predictors. A random forest can perform better than a single decision tree in many cases because it doesn't tend to overfit the training data.
Again, we repeat our process of creating a pipeline, searching through hyperparameters, and fitting a model. Because a random forest fits many decision trees at once, some of the hyperparameters are the same. For the random forest I am going to tune n_estimators and min_samples_leaf, because I found that min_samples_leaf made the biggest difference in the decision tree, and n_estimators, the number of trees in the random forest, can make a big difference in performance.
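As a sketch, the setup might look like the following. The step name "random_forest", random_state=420, and n_jobs=4 match the fitted model printed in the evaluation section; the grid values are my assumptions, chosen to include the tree counts discussed below.

from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Sketch of the random forest pipeline and grid (grid values are assumptions).
rfc_pipe = Pipeline([("random_forest", RandomForestClassifier(n_jobs=4, random_state=420))])
rfc_grid = {"random_forest__n_estimators": [100, 200, 400],
            "random_forest__min_samples_leaf": [1, 3, 10]}
rfc_search = GridSearchCV(rfc_pipe, rfc_grid, cv=StratifiedKFold(n_splits=5).split(X, y), scoring="accuracy", error_score=0)
rfc_result = rfc_search.fit(X, y)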
Despite 400 trees having the highest mean test score, it beats 200 trees by only 0.00035 while being computationally twice as expensive, and 200 trees in turn beats 100 trees by only 0.0007. We can always use a higher number of trees to get a marginally better test score, so, using some discretion, I am going to use 200 trees with min_samples_leaf = 3.
So we get a cross-validation score of 0.673 for our random forest with 200 trees and min_samples_leaf = 3. This is around 3 percentage points better than our previous models!
Boosted Tree
The last model I want to fit is a boosted tree. A boosted tree builds an ensemble of small decision trees sequentially: at each step, the model computes the gradient of the loss and fits a new tree that corrects the errors of the trees built so far.
For this model I would like to tune max_depth and min_samples_leaf, as well as learning_rate and n_estimators. That is a lot of hyperparameters, and a boosted tree is one of the slowest models to fit. As it stands, if I had 4 levels for each of these parameters, GridSearchCV.fit would take around 174 hours to run, which is over a week.
To get around this, I am going to reduce the size of the dataset I am tuning over. This isn’t ideal, because all of these parameters could depend on the size of the dataset, but in our case, we don’t have many options.
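A sketch of this tuning step is below. The pipeline name gbc_pipe and the step name "boosted_tree" match the set_params call further down; the 10% subsample fraction and the grid values are my assumptions (the grid at least contains the best combination reported below).

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Sketch: tune the boosted tree on a subsample of the training data to keep the search tractable.
tune_sample = game_train.sample(frac=0.1, random_state=2049)
X_tune = tune_sample.loc[:, tune_sample.columns != "p1_won"]
y_tune = tune_sample["p1_won"]

gbc_pipe = Pipeline([("boosted_tree", GradientBoostingClassifier())])
gbc_grid = {"boosted_tree__n_estimators": [100, 200, 500],
            "boosted_tree__learning_rate": [0.05, 0.1, 0.2],
            "boosted_tree__max_depth": [2, 4, 8],
            "boosted_tree__min_samples_leaf": [3, 10]}
gbc_search = GridSearchCV(gbc_pipe, gbc_grid, cv=StratifiedKFold(n_splits=5).split(X_tune, y_tune), scoring="accuracy", error_score=0)
gbc_result = gbc_search.fit(X_tune, y_tune)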
Before fitting our chosen model, I’ll also cross-validate it separately with the full dataset, as the cross-validation from the hyperparameter tuning won’t reflect the results of our true chosen model.
We find that the best set of hyperparameters is n_estimators = 500, learning_rate = 0.2, min_samples_leaf = 3, max_depth = 2, with an accuracy of 0.66928.
To find the true cross-validated accuracy of this model, we can cross-validate with the entire dataset.
Code
from sklearn.model_selection import cross_val_score

gbc_pipe.set_params(boosted_tree__max_depth=2, boosted_tree__n_estimators=500, boosted_tree__learning_rate=0.2, boosted_tree__min_samples_leaf=3)
game_folded = StratifiedKFold(n_splits=5).split(X, y)
cross_val_score(estimator=gbc_pipe, n_jobs=1, cv=game_folded, X=X, y=y, scoring='accuracy', error_score=0, verbose=10)
And we find that the model has a cross-validation accuracy of 0.680, our best accuracy yet.
Evaluation
Let's set up and load our best models. I want to look at the random forest classifier and the elastic net. The random forest classifier did almost as well as the boosted tree in cross-validation (a 0.007 difference) and contains information about variable importance, which I want to analyze and interpret. The elastic net provides a good baseline for what we should expect from our models and lets us analyze more deeply how our numerical predictors affect the outcome.
Code
from sklearn.model_selection import train_test_split
import joblib

game_train, game_test = train_test_split(game_data, train_size=0.8, stratify=game_data[["p1_won"]], random_state=2049)
X = game_train.loc[:, game_train.columns != "p1_won"]
y = game_train["p1_won"]
X_test = game_test.loc[:, game_train.columns != "p1_won"]
y_test = game_test["p1_won"]
# We use the same seed so that we get the same testing data as in our model_fitting file. This is important for evaluation.
en = joblib.load("../Prediction/models/elastic_net.joblib")
rfc = joblib.load("../Prediction/models/random_forest.joblib")
rfc.set_params(random_forest__verbose=0)
Metrics
Let’s display our test roc_auc and accuracy.
Code
from sklearn.metrics import accuracy_score, roc_auc_score

def get_metrics(model):
    prediction = model.predict(X_test)
    actual = y_test
    print("Metrics for {model}\n".format(model=model[-1]))
    print("Accuracy: %0.4f" % accuracy_score(prediction, actual))
    print("ROC_AUC: %0.4f" % roc_auc_score(prediction, actual))
    print("\n")

get_metrics(en)
get_metrics(rfc)
Metrics for LogisticRegression(C=0.01, l1_ratio=1.0, penalty='elasticnet', solver='saga')
Accuracy: 0.6432
ROC_AUC: 0.6432
Metrics for RandomForestClassifier(min_samples_leaf=3, n_estimators=200, n_jobs=4,
random_state=420)
Accuracy: 0.6782
ROC_AUC: 0.6782
| Model | Accuracy | ROC AUC |
|---|---|---|
| Logistic Regression | 0.6432 | 0.6432 |
| Random Forest | 0.6782 | 0.6782 |
Interestingly, the roc_auc and accuracy are identical for both of our models. This is likely because there is no class imbalance present (and because roc_auc is scored here on hard class predictions, it reduces to balanced accuracy, which matches accuracy when the classes are balanced). This further justifies our decision to use accuracy as a scoring metric in our hyperparameter tuning.
The elastic net got an accuracy of 0.6432, and the random forest got an accuracy of 0.6782. We might interpret the difference in these accuracies as the effect of adding the character and stage data: roughly, it is the additional share of matches the random forest classifies correctly that the elastic net could not.
Variable Importance
I want to find the variable importance of the predictors in my random forest so I can make some statements about how certain variables affect the outcome.
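The extraction code isn't shown here; a sketch of how the importances could be pulled from the fitted pipeline is below. It assumes the final step is named "random_forest" (as suggested by the set_params call above) and that X holds the dummy-coded training predictors in the order the model was fit on.

import pandas as pd

# Sketch of extracting variable importances from the fitted random forest.
importances = pd.DataFrame({
    "variable": X.columns,
    "importance": rfc.named_steps["random_forest"].feature_importances_,
}).sort_values("importance", ascending=False).reset_index(drop=True)
print(importances.head(20))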
Here is a table of the most important variables in the RFC:
To no surprise, the most important predictors are games played and games won for both players. What is interesting, however, is that the next 8 most important predictors are stages, and only then do we see character data.
The following is speculation from my personal experience. I suspect that stage is an important predictor not because of the stage itself, but because of the kinds of tournaments associated with these stages. I would guess that Pokemon Stadium 2, Town & City, Final Destination, Battlefield, etc. come up as the stages with the most predictive power because they are often used in high-level tournaments. If you watch competitive Smash, these are the kinds of stages you see top players playing on. This is likely just preference and standardization in the Smash community, but we see that this preference shows up in our model, which is an interesting result.
Now, I want to look at how character data affects the outcome.
In most cases, we see that when p1_char and p2_char are the same character, they have a similar importance. This is a good sign, because we would expect our model to be symmetric: the outcome shouldn't depend on which player is listed first or second.
Because this table only shows how important the variables are and not their relationship to the outcome, it is not possible to tell whether showing up at the top means a character is favorable or unfavorable. However, it is interesting to see which characters yield the most predictive power.
If we want to interpret this, we might say that characters towards the top are the least balanced (for their benefit or detriment), and characters towards the bottom are the most balanced. Let’s see which characters have the least predictive power.
We see names like Bowser, Ness, Palutena, Wolf, Yoshi, and more come up first.
Characters near the bottom of the list include Olimar, Pit, Simon, and Daisy. We might interpret these as some of the most balanced characters.
Character Matchups
Now that our model is fit, we can try comparing some characters. When my friend and I first started playing Smash, I would often play Ness and he would usually play Kirby. Let’s try to predict who has the edge in this matchup.
Neither of us have competed in any tournaments, so I’m going to keep our games played and games won at zero.
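The prediction cells aren't reproduced here; the following is a rough sketch of how the matchup rows could be built. It assumes the dummy columns follow the "p1_char_<Name>" / "p2_char_<Name>" naming produced by get_dummies, which may not match the real column names.

import pandas as pd

def matchup_row(p1_char, p2_char):
    # All-zero row: 0 games played / won for both players, no stage selected.
    row = pd.DataFrame(0, index=[0], columns=X.columns)
    for col in ("p1_char_" + p1_char, "p2_char_" + p2_char):
        if col in row.columns:
            row[col] = 1
    return row

print(rfc.predict(matchup_row("Ness", "Kirby")))   # my Ness as player 1
print(rfc.predict(matchup_row("Kirby", "Ness")))   # the same matchup with the players swapped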
We now get False. This is what we would expect, because our model should be symmetric around p1 and p2. Looks like I had an advantage!
Recently, we’ve switched up our characters, and he has been beating my Villager as Ganondorf. Let’s see if Ganondorf has an edge against Villager, or if I just need to work on my skills.
False means that the second player won, so Ganondorf is favorable in this matchup. Let's swap the characters to make sure the model is symmetric around p1 and p2.
And as expected, the model is symmetric. Looks like my friend had the edge after all!
Conclusion
I would say that we adequately answered many of our initial questions. We found that stage data is (surprisingly!) an important predictor, and that certain characters are more balanced than others. Furthermore, even though the goal of this project isn't prediction, the 67.8% accuracy of our random forest model is fairly good considering the variance in games of this type. If the outcome of a match were near certain, Smash Ultimate probably wouldn't be a very fun game anyway. I'm sure we could improve this accuracy further; however, there may simply not be many more trends in the data for our models to exploit.
I wanted this project to analyze the skill level of players with respect to their characters, which is why I included data about their games played and games won. Realistically, though, my models would only work for new predictions within the same dataset, because the number of games played and games won depends on the size of the dataset itself. I could have scaled and normalized these predictors to work with datasets of varying sizes, but I did not think that was necessary given that the goal of this project is interpretation.
If I were to expand on this project, I would like to add an interactive component that lets you input match statistics and see the predicted winner of the match. I think this would better answer the question of which characters are better at certain skill levels, and help me pick the best characters to use against my friends!