Smash Ultimate Set Prediction

Author

Colton Rowe

Introduction

What is Smash Ultimate?

Super Smash Bros. Ultimate is the latest installment of Super Smash Bros., a series of competitive multiplayer games by Nintendo. In Super Smash Bros., players pick from a roster of Nintendo characters and duke it out on a 2D platformer-esque stage. In Smash Ultimate, the roster features 87 unique characters, which, in a 2-player game, makes 87 × 87 = 7569 total matchups (counting mirror matches and both player orders). Each character has different fighting stats and abilities, so each matchup is unique.

Statement of Purpose

I want to model the outcome of a match given the characters and the skill level of the players. It would be interesting to see how effective certain characters are at different skill levels. I would also like to know if the stages the matches are played on affect the outcome, or make no difference. The goal of this project is more about interpretation than about raw predictive power.

Data

Source

I sourced my data from smashdata.gg through their GitHub. The file I downloaded was ultimate_player_database.zip.

Missing Data

Fortunately, smashdata.gg provides some insight into why data is missing and how it has been handled. They say, “A lot of tournaments also don’t report full data, meaning game counts and character data have been lost.” Character data is our primary focus, and its missingness appears to depend only on which tournament a set came from, not on the values themselves. Since tournament is not one of our modeling variables, we can treat the missing data as missing completely at random: not dependent on the missing values themselves nor on any other values in the dataset. Therefore, I’ve decided to remove all rows where the game data is completely absent. If the game data is present but the character data is absent, I filled in the character value with -1 to indicate the absence.

Data cleaning

The data from smashdata.gg first needs to be cleaned in order to be used for modeling.

We are given a database file which contains information about sets, tournaments, players, and so on. I am only interested in the set data, and specifically the game_data column in sets. The game data is stored as a JSON string, so I will extract it into a dataframe.

Code
import sqlite3
import numpy as np
import pandas as pd
import json
import math
import random
from IPython.display import clear_output

cnx = sqlite3.connect('data/ultimate_player_database.db')

query = "SELECT game_data FROM sets WHERE game_data != '[]'"

df = pd.read_sql_query(query, cnx)

data = list()

for k in range(len(df)):

    # Print a progress update every 50,000 rows.
    if k % 50000 == 0:
        clear_output()
        print("{prop} % complete".format(prop = math.floor(100*k/len(df))))

    raw = df.loc[k, "game_data"]  # renamed from `str` to avoid shadowing the builtin

    # Pull out each {...} object from the string and parse it as JSON.
    while True:
        id1 = raw.find("{")
        id2 = raw.find("}")
        if (id1 != -1) and (id2 != -1):
            new_data = json.loads(raw[id1:id2+1])
            data.append(new_data)
        else:
            break
        raw = raw[id2+1:]

game_df = pd.DataFrame(data).fillna(-1)

game_df.to_csv("data/game_data.csv",index = False)

clear_output()
print("100 % complete")

Now I’ll modify the character data so it is easier to read.

Code
game_df = pd.read_csv("data/game_data.csv",
                      dtype={'winner_id' : "string", 'loser_id' : "string",
                             'winner_score' : "string", 'loser_score' : "string",
                             'winner_char' : "string", 'loser_char' : "string"})
game_df = game_df.drop(columns=["winner_score","loser_score"])
# Strip the "ultimate/" prefix from character names (vectorized; much faster than apply).
game_df["winner_char"] = game_df["winner_char"].str.replace("ultimate/", "", regex=False)
game_df["loser_char"] = game_df["loser_char"].str.replace("ultimate/", "", regex=False)

We need to transform our data to be impartial to the winner or loser. We currently have winner_id, loser_id, winner_char, loser_char, etc., but what we want is p1_id, p2_id, p1_char, p2_char, and p1_won. Otherwise, we wouldn’t be predicting anything, because the column layout itself would reveal who won. Let’s transform our data to add our desired columns.

Code
def condswap(b,tup):
    # Swap the pair if b is True.
    if b: 
        return (tup[1],tup[0])
    else:
        return tup

def row_transform(row):
    # Randomly assign the winner to p1 or p2 so that p1_won carries no bias.
    output = dict()
    p1w = bool(random.getrandbits(1))
    output["p1_id"], output["p2_id"] = condswap(not p1w,(row["winner_id"],row["loser_id"]))
    output["p1_char"], output["p2_char"] = condswap(not p1w,(row["winner_char"],row["loser_char"]))
    output["stage"] = row["stage"]
    output["p1_won"] = p1w
    return output

random.seed(1984)

game_list = game_df.apply(row_transform,axis=1)
imp_game_df = pd.DataFrame.from_records(game_list)
imp_game_df

I want to add columns recording how many games a player has played and won, so we have a rough “skill” metric. However, we need to be careful about where these counts come from. If we compute them across the entire dataset, the counts include the very games we are trying to predict: for example, the model could deduce that someone with 0 recorded wins never wins, and score perfectly on those rows. In reality, someone with 0 previous wins might have a low chance of winning, but it is certainly not zero. So, I’m going to split the ENTIRE dataset in half and calculate games played and games won from one half only. That way, these skill metrics are independent of both the training and test sets, and any data points lost from the training/test half are missing by random chance.

Code
def transform_games_played(dataframe):

    # Count appearances and wins for each player, whether they appear as p1 or p2.
    p1s = dict(dataframe["p1_id"].value_counts())
    p1wins = dict(dataframe[dataframe["p1_won"] == True]["p1_id"].value_counts())

    p2s = dict(dataframe["p2_id"].value_counts())
    p2wins = dict(dataframe[dataframe["p1_won"] == False]["p2_id"].value_counts())

    players = set(p1s.keys()).union(set(p2s.keys()))

    games_played = list()

    for item in players:
        totalsum = p1s.get(item, 0) + p2s.get(item, 0)
        winsum = p1wins.get(item, 0) + p2wins.get(item, 0)
        games_played.append({"player_id" : item, "games_played" : totalsum, "games_won" : winsum})

    games_played_df = pd.DataFrame(games_played)
    return games_played_df

def clean_games(given_df,skill_df):
    games_played_df = transform_games_played(skill_df)

    output_df = pd.merge(given_df,games_played_df, left_on = "p1_id", right_on = "player_id", how = "left")
    output_df.rename(columns = {"games_played" : "p1_games_played", "games_won" : "p1_games_won"}, inplace=True)
    output_df = pd.merge(output_df,games_played_df, left_on = "p2_id", right_on = "player_id", how = "left")
    output_df.rename(columns = {"games_played" : "p2_games_played", "games_won" : "p2_games_won"}, inplace=True)

    output_df["p1_games_won"].fillna(0,inplace=True)
    output_df["p1_games_played"].fillna(0,inplace=True)
    output_df["p2_games_won"].fillna(0,inplace=True)
    output_df["p2_games_played"].fillna(0,inplace=True)

    output_df.reset_index(inplace=True)
    output_df.drop(columns=["index","player_id_x","player_id_y"],inplace=True)
    return output_df

_skill_df = imp_game_df.sample(frac = 0.5, random_state=2049)
remaining_df = imp_game_df.drop(_skill_df.index)

clean_game_df = clean_games(remaining_df,_skill_df) # The output rows are independent of the games played and won.
skill_clean_game_df = clean_games(_skill_df,_skill_df) # The output rows are NOT independent of the games played and won.

clean_game_df.to_csv("data/clean_game_data.csv",index=False) # To be used for training/test.
skill_clean_game_df.to_csv("data/dependent_skill_data.csv",index = False) # To be used in case we want to create more metrics.

We can now better use our data for fitting models.

Codebook

This codebook identifies each of the variables in clean_game_data.csv.

Variable Description
p1_id A string identifying the first player.
p2_id A string identifying the second player.
p1_games_played An integer indicating how many games the first player has played in dependent_skill_data.csv.
p2_games_played An integer indicating how many games the second player has played in dependent_skill_data.csv.
p1_games_won An integer indicating how many games the first player has won in dependent_skill_data.csv.
p2_games_won An integer indicating how many games the second player has won in dependent_skill_data.csv.
p1_char A string indicating the character of the first player.
p2_char A string indicating the character of the second player.
stage A string identifying the stage the match was played on.
p1_won A boolean indicating if the first player won.

Exploratory Data Analysis

First, let’s load our data in.

Code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import random
game_df = pd.read_csv("../Data/data/clean_game_data.csv")
skill_data = pd.read_csv("../Data/data/dependent_skill_data.csv")

Correlation Matrix

Let’s start by plotting a correlation matrix between our numeric variables.

Code
corr = game_df.corr(numeric_only = True)  # correlations among the numeric columns only
corr.style.background_gradient(cmap='coolwarm', axis = None, vmin = -1, vmax = 1).format(precision = 2)
p1_won p1_games_played p1_games_won p2_games_played p2_games_won
p1_won 1.00 0.11 0.12 -0.11 -0.12
p1_games_played 0.11 1.00 0.99 0.18 0.19
p1_games_won 0.12 0.99 1.00 0.19 0.20
p2_games_played -0.11 0.18 0.19 1.00 0.99
p2_games_won -0.12 0.19 0.20 0.99 1.00

We see that p1_games_played and p2_games_played are positively correlated, meaning players with many games played tend to play against each other. We also see that p1_games_played is (weakly) positively correlated with p1_won, so games_played might be a good indicator of skill. Let’s explore this further.

“Games Played” as Skill

I want to establish that games played may be a good indicator of skill. To do this, I’ll ask the question: How likely are you to beat someone who has played at least k times as many games as you?

Let’s write some code to figure this out.

Code
def k_times_more_games_win_rate(k, df = game_df):
    return( (
                len(df[(df["p1_won"] == True) & (df["p1_games_played"] > k*df["p2_games_played"])]) + 
                len(df[(df["p1_won"] == False) & (df["p2_games_played"] > k*df["p1_games_played"])])
            ) /
            (
                len(df[(df["p1_games_played"] > k*df["p2_games_played"])]) + 
                len(df[(df["p2_games_played"] > k*df["p1_games_played"])])
            )
    )

graph = [k_times_more_games_win_rate(n/4) for n in range(4,40)]
plt.plot([(n/4) for n in range(4,40)],graph)
plt.xlabel("k")
plt.ylabel("Chance of winning") 
plt.title("Chance of winning given a player has played at least k times as many games as their opponent") 
plt.show()
            
print(k_times_more_games_win_rate(1))

0.6192630428086423

This graph justifies treating games played as a proxy for skill level: a player with more games played has a higher chance of winning, and simply having played more games than your opponent already gives about a 62% win rate.

Win Percent vs Skill by Character

I suspect that character win rates might change depending on how skilled a player is. For example, maybe Kirby is a really good character at low levels of play but becomes less viable past a certain skill level.

Let’s define some functions that will help us graph win chance against games played.

Code
def win_percent(df):
    return (df[df["p1_won"] == True]["p1_char"].value_counts() + df[df["p1_won"] == False]["p2_char"].value_counts()) / \
              (df["p2_char"].value_counts() + df["p1_char"].value_counts())

def level_win_percent(df,level,step):
    return win_percent(df[
        (
        (df["p1_games_played"] >= level) & 
        (df["p1_games_played"] < level + step) 
        ) | (
        (df["p2_games_played"] >= level) & 
        (df["p2_games_played"] < level + step)
        )
        ]).sort_values(ascending=False)


def chargraph(characters, start = 0, stop = 1000, step = 50, df = game_df):
    if type(characters) == str:
        characters = [characters]
    graph = {char : list() for char in characters}
    for k in range(start,stop,step):
        LWP = level_win_percent(df,k,step)
        for char in characters:
            if char in LWP.index:
                graph[char].append(LWP[char])
            else:
                graph[char].append(np.nan)  # np.nan (np.NaN was removed in NumPy 2.0)
    for key in graph.keys():
        output, = plt.plot(range(start,stop,step), graph[key], label = key)
    plt.legend()
    plt.xlabel("p1 or p2 games played")
    plt.ylabel("Chance of winning")
    return output

And now let’s plot some of these graphs.

Code
chargraph(["ness","samus","luigi","kirby","joker"],stop = 3000, step = 250)

plt.show()

We see that the variance of these lines increases with games played, likely because there are fewer players with many games played. We can still discern some trends visually: the mean of Ness’s and Luigi’s win rates stays relatively constant, while Joker’s and Samus’s seem to decrease. This supports our hypothesis that win percent is not constant with respect to skill.

Distribution of Games Played

Let’s verify that there are fewer players with many games played.

Code
df1 = skill_data[["p1_id","p1_games_played"]].rename(columns = {"p1_id" : "player_id", "p1_games_played" : "player_games_played"})
df2 = skill_data[["p2_id","p2_games_played"]].rename(columns = {"p2_id" : "player_id", "p2_games_played" : "player_games_played"})
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job.
player_games_played_df = (pd.concat([df1, df2], ignore_index = True)
                          .astype({"player_id" : "string", "player_games_played" : "int32"})
                          .drop_duplicates(ignore_index = True)
                          .sort_values("player_games_played")
                          .reset_index(drop = True))
plt.hist(player_games_played_df["player_games_played"].to_list(), bins = range(0,200,5))
plt.title("Histogram of games played")
plt.ylabel("count")
plt.xlabel("games played")
plt.show()
#plt.hist(player_games_played_df[player_games_played_df["player_games_played"] >= 200]["player_games_played"].to_list(), bins = range(200,2000,50))
#plt.show() - the tail (>= 200 games) shows the same 1/x shape

We can see that there are far fewer players with many games played than with few games played.

Win Percent vs Skill of Specific Matchups

Finally, I want to look at how the win percentages of a matchup between two specific characters change with respect to games played.

I’ll modify our previous functions and plot a graph of the win ratios of Pikachu versus Ganondorf.

Code
def vs_df(char1, char2):
    return game_df[((game_df["p1_char"] == char1) & (game_df["p2_char"] == char2)) |
                    ((game_df["p2_char"] == char1) & (game_df["p1_char"] == char2))]
def vsgraph(char1, char2, start = 0, stop = 1500, step = 250):
    chargraph(char1,df=vs_df(char1,char2),start=start, stop=stop, step=step).set_label(char1 + " win percentage vs " + char2)
    plt.legend()

vsgraph("pikachu","ganondorf")
vsgraph("ganondorf","pikachu")

Again, we see that the win percentage changes as a function of the skill level of the players.

Overview

We’ve now gathered sufficient information about the relationships between our variables, and have a better idea of which models to fit. It seems as though games played and games won will likely suit a logistic regression, but other variables such as the character data might need more flexible models.

Fitting Models

Metrics

For classification problems, the two metrics to focus on are accuracy and the area under the ROC curve. In general it is better to use roc_auc because it accounts for class imbalance; however, it is harder to interpret than accuracy. Because I randomized p1_won when I transformed the data, there is no class imbalance present, so for hyperparameter tuning I’m going to score by accuracy, which is much easier to interpret. Let’s verify that there is no class imbalance:

Code
plt.bar(x = [True, False], height = game_df["p1_won"].value_counts() / len(game_df))
plt.xticks([True, False], ["True", "False"])
plt.title("p1_won True vs False")
plt.show()

We can see that there are equal proportions of True and False in p1_won, so there isn’t any class imbalance.
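
As a quick aside, here is a toy sketch (made-up labels, not our data) of why roc_auc is the safer metric when classes are imbalanced: a classifier that always predicts the majority class looks strong by accuracy but is exposed by roc_auc.

Code
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

y_true = np.array([1]*90 + [0]*10)  # imbalanced: 90% of labels are 1
y_pred = np.ones(100)               # always predict the majority class

print(accuracy_score(y_true, y_pred))  # 0.9 -- deceptively high
print(roc_auc_score(y_true, y_pred))   # 0.5 -- no better than chance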

Train / Test Split

First, let’s load in our data:

Code
from sklearn import linear_model
from sklearn import tree
from sklearn import ensemble
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score
import statsmodels.api as sm
import matplotlib.pyplot as plt
import pandas as pd
import os
import joblib
from sklearn.compose import ColumnTransformer

game_data = pd.read_csv("../Data/data/clean_game_data.csv",
                        dtype={"p1_id" : "string", "p2_id" : "string",
                               "p1_char" : "string", "p2_char" : "string", "stage" : "string",
                               "p1_games_played" : "int32", "p1_games_won" : "int32",
                               "p2_games_played" : "int32", "p2_games_won" : "int32",
                               "p1_won" : "bool"})
game_data = pd.get_dummies(game_data, columns=["p1_char","p2_char","stage"], prefix_sep=".")
game_data
p1_id p2_id p1_won p1_games_played p1_games_won p2_games_played p2_games_won p1_char.-1 p1_char.banjokazooie p1_char.bayonetta ... stage.Town and City stage.Unova Pokemon League stage.Unova Pokémon League stage.Venom stage.WarioWare stage.Wily Castle stage.Yggdrasil's Altar stage.Yoshi's Island stage.Yoshi's Island (Melee) stage.Yoshi's Story
0 1472816 1075251 False 2 0 23 13 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 1075251 1472816 True 23 13 2 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 challonge__MrRiceman challonge__Loconotcoco False 4 1 2 1 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 Leo 1272809 True 1 1 102 49 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 1034645 1302612 True 77 40 2 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2963854 30896 4702 False 72 51 865 555 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2963855 4702 30896 True 865 555 72 51 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2963856 1263104 53481 True 186 112 8 7 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2963857 53481 1263104 False 8 7 186 112 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2963858 230918 30044 False 212 120 79 53 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

2963859 rows × 237 columns

I dummy-coded some of the categorical variables right away so the data is easier to use.

We need to split our data into a testing set and a training set so we can train our models and evaluate their accuracy. I am going to use a 4:1 ratio, and stratify across our response variable p1_won, so we don’t develop any unexpected imbalances across our training and testing environments.

Code
game_train, game_test = train_test_split(game_data, train_size = 0.8, stratify = game_data[["p1_won"]], random_state=2049)

# We stratified on our response, p1_won. 
# This shouldn't make too much of a difference because we randomized which is p1 and p2, however it is still good practice.

X = game_train.loc[:,game_train.columns != "p1_won"]
y = game_train["p1_won"]

game_folded = StratifiedKFold(n_splits=5).split(X,y)

Hyperparameter Tuning

The following block of code will make it easier to save and load models.

Code
def fitmodel(model, filename):
    # Fit on the global X and y and save to disk, unless a saved copy already exists.
    if not os.path.isfile(filename):

        model.fit(X,y)

        joblib.dump(model, filename)

    else:
        modeltemp = joblib.load(filename)
        # Warn if the saved pipeline's steps don't match the model we asked for.
        if (type(model) != type(modeltemp)) or \
            (tuple([k[0] for k in model.steps]) != tuple([k[0] for k in modeltemp.steps])):
            print ("\033[93m Warning: model mismatch. Delete the file {filename} and rerun or risk faulty models.\n \033[0m".format(filename=filename))
        model = modeltemp
    return model

To tune hyperparameters, we need to create a grid of possible values for our models, and select the best set of hyperparameters for each model. To select the best set, we use cross-validation. This process involves “folding” our dataset into several disjoint sets, training a model on all but one fold, and evaluating it on the held-out fold, with each fold taking a turn as the held-out set. We then average these results to get a cross-validated score for each set of hyperparameters; the best cross-validated score tells us which set of hyperparameters is best for our model. A by-hand sketch of this loop follows.
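
This is only a sketch of what GridSearchCV automates below, assuming a generic sklearn estimator model and our X and y from above.

Code
# A minimal by-hand version of stratified k-fold cross-validation.
import numpy as np
from sklearn.base import clone
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold

def cv_accuracy(model, X, y, n_splits = 5):
    scores = []
    for train_idx, test_idx in StratifiedKFold(n_splits = n_splits).split(X, y):
        # Train on all folds but one, then score on the held-out fold.
        fold_model = clone(model).fit(X.iloc[train_idx], y.iloc[train_idx])
        pred = fold_model.predict(X.iloc[test_idx])
        scores.append(accuracy_score(y.iloc[test_idx], pred))
    return np.mean(scores)  # the cross-validated score for this model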

Elastic Net

In sklearn, there aren’t recipes (as in R’s tidymodels). Instead, there are pipelines, which work a bit like a recipe and a workflow combined: we can chain transformations of our data, including selecting predictors and normalizing variables, and end with a model.

To find the best hyperparameters for our models, our first step is to set up a pipeline. Let’s begin by setting up a pipeline for an elastic net.

Code
en_predictors = ["p1_games_played","p2_games_played","p1_games_won","p2_games_won"]

en_pipe = Pipeline(steps = [
    ("predictors", ColumnTransformer([("predictors","passthrough",en_predictors)])),
    ("logistic", linear_model.LogisticRegression(solver='saga',penalty='elasticnet'))
    ])

An elastic net is similar to a linear regression, except it adds a regularization penalty that can help when overfitting is present. It combines the strengths of lasso (l1) and ridge (l2) regression. For our model, we use a logistic elastic net because this is a classification problem.

I’ve only selected games played and games won for this model, because for the other predictors, such as character, to be useful we would need to add interaction terms. To add interaction terms, we would use sklearn.preprocessing.PolynomialFeatures as a step in our pipeline; however, because there are so many characters, my computer runs out of memory when adding these terms. Regardless, an elastic net is probably not the best way to deal with these interactions anyway: a more flexible model like a decision tree or random forest can find interactions between variables automatically. Let’s wait for a different model before adding these predictors; for reference, a sketch of where the interaction step would slot in follows.
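
The following is hypothetical and untested at full scale (this is the step that exhausted memory); all_predictors stands in for the full predictor list.

Code
# Hypothetical sketch: adding pairwise interaction terms with PolynomialFeatures.
# `all_predictors` is a placeholder for the full list of predictor columns.
from sklearn.preprocessing import PolynomialFeatures

en_interactions_pipe = Pipeline(steps = [
    ("predictors", ColumnTransformer([("predictors", "passthrough", all_predictors)])),
    ("interactions", PolynomialFeatures(degree = 2, interaction_only = True, include_bias = False)),
    ("logistic", linear_model.LogisticRegression(solver = 'saga', penalty = 'elasticnet'))
    ])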

Our next step is to create a grid of hyperparameters we want to check. For an elastic net, I’ve chosen to tune l1_ratio and C. C controls the inverse of the penalty strength (smaller C means stronger regularization), and l1_ratio controls the mix of l1 and l2 penalties, in other words, how similar the elastic net is to a lasso versus a ridge regression. l1_ratio thus takes values between 0.0 and 1.0, while C can take on any positive value.
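
Concretely, as I read the scikit-learn documentation for LogisticRegression with the saga solver, the objective being minimized is roughly

$$\min_{w}\; C\sum_{i=1}^{n}\log\!\left(1+e^{-y_i x_i^{\top} w}\right)\;+\;\rho\,\lVert w\rVert_1\;+\;\frac{1-\rho}{2}\,\lVert w\rVert_2^2,$$

where ρ is l1_ratio. Smaller C down-weights the data-fit term relative to the fixed penalty (stronger regularization), and ρ slides the penalty between pure ridge (ρ = 0) and pure lasso (ρ = 1).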

Code
en_grid = dict(logistic__l1_ratio = [0.0, 0.10, 0.25, 0.50, 0.75, 0.90, 1.0],
               logistic__C = [1000, 100, 10, 1.0, 0.1, 0.01, 0.001])

Lastly, we try each combination of hyperparameters and pick the model with the best cross-validation score. We are going to fold our data with k = 5 and take the average cross-validated score for each combination of hyperparameters, then select the highest average score across all combinations.

Code
game_folded = StratifiedKFold(n_splits=5).split(X,y)

en_grid_search = GridSearchCV(estimator = en_pipe,
                            param_grid = en_grid,
                            n_jobs = 1,
                            cv = game_folded,
                            scoring = 'accuracy',
                            error_score = 0,
                            verbose = 4)
en_grid_result = en_grid_search.fit(X[en_predictors],y) # We pass in X[en_predictors] because the result is the same but slightly faster.

Running this code, we find that the best combination of hyperparameters across our search is l1_ratio = 1.0 (a pure lasso) and C = 0.01 (fairly strong regularization). This has a cross-validated accuracy of 0.643, which is a good start.

Code
en_pipe.set_params(logistic__C = 0.01, logistic__l1_ratio = 1.0)
en = fitmodel(model = en_pipe,
              filename = "models/elastic_net.joblib")

Decision Tree

A decision tree classifier operates a bit like a game of 20 questions: at each node, the model asks a true-false question, and goes down a branch of the tree. After it has asked enough questions, it assigns a class to an observation.

We perform a similar process as in the elastic net for each of our models. This time, we’re also going to use character and stage data as predictors, and for hyperparameters, we are going to tune max_depth and min_samples_leaf. Both max_depth and min_samples_leaf help with overfitting, which is a big problem in decision trees. max_depth controls the maximum depth of the decision tree, and min_samples_leaf controls the minimum number of samples required to be at a leaf node.

Code
dtc_predictors = list(set(game_train.columns).difference({"p1_won","p1_id","p2_id"}))  # every column except the response and player ids

dtc_pipe = Pipeline(steps = [
    ("predictors", ColumnTransformer([("predictors","passthrough",dtc_predictors)])),
    ("decision_tree", tree.DecisionTreeClassifier(random_state = 42))
    ])

dtc_grid = dict(decision_tree__max_depth = [3, 5, 10, None],
                decision_tree__min_samples_leaf = [1, 3, 5, 10])

game_folded = StratifiedKFold(n_splits=5).split(X,y)

dtc_grid_search = GridSearchCV(estimator = dtc_pipe,
                               param_grid = dtc_grid,
                               n_jobs = 4,
                               cv = game_folded,
                               scoring = 'accuracy',
                               error_score = 0,
                               verbose = 10)

We find that the best max_depth is 10, and the best min_samples_leaf is also 10. Our cross-validated accuracy is 0.647, so this model performs similarly to our elastic net. Let’s fit our model and continue.

Code
dtc_pipe.set_params(decision_tree__max_features = None, decision_tree__max_depth = 10, decision_tree__min_samples_leaf = 10)

dtc = fitmodel(model = dtc_pipe,
               filename = "models/decision_tree.joblib")

Random Forest

A random forest classifier works by creating many decision trees and combining their predictions. Each tree is trained on a bootstrap sample of the data, and at each split it considers only a random subset of the available predictors, usually of size the square root of the total number of predictors. A random forest often performs better than a single decision tree because averaging many decorrelated trees reduces overfitting.
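
Here is a toy sketch (random data, not ours) of how the forest combines its trees: sklearn’s predict_proba is the average of the individual trees’ probability estimates.

Code
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_toy = rng.normal(size = (200, 5))
y_toy = X_toy[:, 0] + rng.normal(size = 200) > 0

forest = RandomForestClassifier(n_estimators = 10, random_state = 0).fit(X_toy, y_toy)
tree_avg = np.mean([t.predict_proba(X_toy)[:, 1] for t in forest.estimators_], axis = 0)
print(np.allclose(tree_avg, forest.predict_proba(X_toy)[:, 1]))  # True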

Again, we repeat our process of creating a pipe, searching through hyperparameters, and fitting a model. Because a random forest is built from many decision trees, it shares some of the same hyperparameters. For the random forest, I am going to tune n_estimators and min_samples_leaf: min_samples_leaf because I found it made the biggest difference in the decision tree, and n_estimators, the number of trees in the forest, because it can make a big difference in performance.

Code
rfc_predictors = list(set(game_train.columns).difference({"p1_won","p1_id","p2_id"}))

rfc_pipe = Pipeline(steps = [
            ("predictors", ColumnTransformer([("predictors","passthrough",rfc_predictors)])),
            ("random_forest", ensemble.RandomForestClassifier(verbose = 3, n_jobs = 4, random_state = 420))
            ])

rfc_grid = dict(random_forest__n_estimators = [100,200,400],
                random_forest__min_samples_leaf = [1, 3, 5, 10])

game_folded = StratifiedKFold(n_splits=2).split(X,y) # 5 folds takes a LONG time to run...

rfc_grid_search = GridSearchCV(estimator = rfc_pipe,
                               param_grid = rfc_grid,
                               n_jobs = 1,
                               cv = game_folded,
                               scoring = 'accuracy',
                               error_score = 0,
                               verbose = 10)

Despite 400 trees having the highest mean test score, it wins by only .00035 while being computationally twice as expensive as 200 trees, which itself beats 100 trees by only .0007. We can always use more trees to get a marginally better score, so, with discretion, I am going to use 200 trees with min_samples_leaf = 3.

With 200 trees and min_samples_leaf = 3, we get a cross-validation score of 0.673 for our random forest. This is around 3 percent better than our previous models!

Let’s fit and store our random forest.

Code
rfc_pipe.set_params(random_forest__n_estimators = 200, random_forest__min_samples_leaf = 3)

rfc = fitmodel(model = rfc_pipe,
               filename = "models/random_forest.joblib")

Boosted Tree

The last model I want to fit is a boosted tree. A boosted tree builds an ensemble of small trees iteratively: at each step, the model computes the gradient of the loss with respect to its current predictions and fits a new small tree to that negative gradient (the residuals), nudging the ensemble in the direction that most reduces the loss. A minimal sketch of this loop follows.
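
To make that concrete, here is a minimal sketch of the boosting loop for binary log-loss, assuming each step fits a small regression tree to the negative gradient (scikit-learn’s GradientBoostingClassifier adds refinements, like a per-leaf line search).

Code
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def toy_gradient_boost(X, y, n_estimators = 100, learning_rate = 0.1, max_depth = 2):
    # y in {0, 1}; F holds the current log-odds prediction for each sample.
    F = np.zeros(len(y))
    trees = []
    for _ in range(n_estimators):
        p = 1 / (1 + np.exp(-F))           # current probability estimates
        residual = y - p                   # negative gradient of the log-loss
        t = DecisionTreeRegressor(max_depth = max_depth).fit(X, residual)
        F += learning_rate * t.predict(X)  # small step along the new tree
        trees.append(t)
    return trees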

For this model, I would like to tune max_depth and min_samples_leaf, as well as learning_rate and n_estimators. That is a lot of hyperparameters, and a boosted tree is one of the slowest models here to fit. As the grid stands, with 4 levels for each of these parameters (4^4 = 256 combinations), GridSearchCV.fit would take around 174 hours to run, which is over a week.

To get around this, I am going to reduce the size of the dataset I am tuning over. This isn’t ideal, because all of these parameters could depend on the size of the dataset, but in our case, we don’t have many options.

Before fitting our chosen model, I’ll also cross-validate it separately with the full dataset, as the cross-validation from the hyperparameter tuning won’t reflect the results of our true chosen model.

Code
train_sample = game_train.sample(50000, random_state= 49)

X_sample = train_sample.loc[:,game_train.columns != "p1_won"]
y_sample = train_sample["p1_won"]

gbc_predictors = list(set(game_train.columns).difference({"p1_won","p1_id","p2_id"}))

gbc_pipe = Pipeline(steps = [  # note: assign only gbc_pipe here, so rfc_pipe isn't clobbered
            ("predictors", ColumnTransformer([("predictors","passthrough",gbc_predictors)])),
            ("boosted_tree", ensemble.GradientBoostingClassifier(verbose = 1, n_estimators = 100, random_state = 21))
            ])

gbc_grid = dict(boosted_tree__n_estimators = [100, 250, 500, 1000],
                boosted_tree__max_depth = [2, 3, 4, 5],
                boosted_tree__min_samples_leaf = [1, 3, 5, 10],
                boosted_tree__learning_rate = [.001, .01, .1, .2]
                )

game_folded = StratifiedKFold(n_splits=2).split(X_sample,y_sample)

gbc_grid_search = GridSearchCV(estimator = gbc_pipe,
                               param_grid = gbc_grid,
                               n_jobs = 4,
                               cv = game_folded,
                               scoring = 'accuracy',
                               error_score = 0,
                               verbose = 10)

gbc_grid_result = gbc_grid_search.fit(X_sample,y_sample)

We find that the best set of hyperparameters is n_estimators = 500, learning_rate = 0.2, min_samples_leaf = 3, max_depth = 2, with an accuracy of 0.66928.

To find the true cross-validated accuracy of this model, we can cross-validate with the entire dataset.

Code
gbc_pipe.set_params(boosted_tree__max_depth = 2, boosted_tree__n_estimators = 500, boosted_tree__learning_rate = 0.2, boosted_tree__min_samples_leaf = 3)

game_folded = StratifiedKFold(n_splits=5).split(X,y)

cross_val_score(estimator = gbc_pipe,
                n_jobs = 1,
                cv = game_folded,
                X = X,
                y = y,
                scoring = 'accuracy',
                error_score = 0,
                verbose = 10)

And we find that the model has a cross-validation accuracy of 0.680, our best accuracy yet.

Evaluation

Let’s set up and load our best models. I want to look at the random forest classifier and the elastic net. The random forest classifier did almost as well as the boosted tree in cross-validation (a 0.007 difference), and contains information about variable importance, which I want to analyze and interpret. The elastic net provides a good baseline for what we should expect from our models, and lets us more deeply analyze how our numerical predictors affect the outcome.

Code
game_train, game_test = train_test_split(game_data, train_size = 0.8, stratify = game_data[["p1_won"]], random_state=2049)
X = game_train.loc[:,game_train.columns != "p1_won"]
y = game_train["p1_won"]

X_test = game_test.loc[:,game_train.columns != "p1_won"]
y_test = game_test["p1_won"]

# We use the same seed so that we get the same testing data as in our model_fitting file. This is important for evaluation.

en = joblib.load("../Prediction/models/elastic_net.joblib")
rfc = joblib.load("../Prediction/models/random_forest.joblib")
rfc.set_params(random_forest__verbose = 0)

Metrics

Let’s display our test roc_auc and accuracy.

Code
def get_metrics(model):
    prediction = model.predict(X_test)
    actual = y_test
    print("Metrics for {model}\n".format(model=model[-1]))
    # sklearn's metrics expect (y_true, y_pred) in that order; note we are
    # scoring hard True/False predictions rather than predicted probabilities.
    print("Accuracy: %0.4f" % accuracy_score(actual,prediction))
    print("ROC_AUC: %0.4f" % roc_auc_score(actual,prediction))
    print("\n")

get_metrics(en)
get_metrics(rfc)
Metrics for LogisticRegression(C=0.01, l1_ratio=1.0, penalty='elasticnet', solver='saga')

Accuracy: 0.6432
ROC_AUC: 0.6432

Metrics for RandomForestClassifier(min_samples_leaf=3, n_estimators=200, n_jobs=4,
                       random_state=420)

Accuracy: 0.6782
ROC_AUC: 0.6782

Model                Accuracy  ROC AUC
Logistic Regression  0.6432    0.6432
Random Forest        0.6782    0.6782

Interestingly, roc_auc and accuracy are the same for both our models. This is because we scored hard True/False predictions rather than probabilities: in that case, roc_auc reduces to balanced accuracy, which coincides with plain accuracy when the classes are balanced. This further justifies our decision to use accuracy as the scoring metric in hyperparameter tuning.

The elastic net got an accuracy of 0.6432, and the random forest got an accuracy of 0.6782. We might interpret the difference between these accuracies as roughly the predictive value added by the character and stage data (together with the random forest’s extra flexibility): about 3.5% more of the test matches are classified correctly.

Variable Importance

I want to find the variable importance of the predictors in my random forest so I can make some statements about how certain variables affect the outcome.

Here is a table of the most important variables in the RFC:

Code
importances = rfc[-1].feature_importances_
feature_names = rfc["predictors"].transformers_[0][2]

importanceDF = pd.DataFrame([importances,feature_names]).transpose()
importanceDF.columns = ["importance","feature name"]

importanceDF.sort_values(by="importance",ascending= False).head(20)
importance feature name
148 0.190556 p2_games_won
45 0.189577 p1_games_won
39 0.14731 p1_games_played
128 0.145371 p2_games_played
131 0.007244 stage.Pokémon Stadium 2
118 0.005352 stage.Town & City
77 0.005285 stage.-1
91 0.005256 stage.Final Destination
172 0.005113 stage.Smashville
36 0.005108 stage.Battlefield
54 0.004008 stage.Kalos Pokémon League
71 0.003919 stage.Small Battlefield
13 0.002982 p1_char.bowser
114 0.002981 p2_char.bowser
21 0.002968 p1_char.ness
56 0.002955 p2_char.ness
124 0.002873 p2_char.cloud
3 0.002843 stage.Yoshi's Story
191 0.002786 p1_char.palutena
199 0.002765 p1_char.pokemontrainer

Unsurprisingly, the most important predictors are games played and games won for both players. What is interesting is that the next 8 most important predictors are stages, and only then do we see character data.

The following is speculation from my personal experience. I suspect that stage is an important predictor not because of the stage itself, but because of the kinds of tournaments associated with these stages. I would guess that Pokémon Stadium 2, Town & City, Final Destination, Battlefield, etc. come up as the stages with the most predictive power because they are often used in high-level tournaments; if you watch competitive Smash, these are the stages you see top players playing on. This is likely just preference and standardization in the Smash community, but that preference shows up in our model, which is an interesting result.

Now, I want to look at how character data affects the outcome.

Code
importanceDF[importanceDF["feature name"].str.contains("char")].sort_values(by="importance",ascending= False).head(20)
importance feature name
13 0.002982 p1_char.bowser
114 0.002981 p2_char.bowser
21 0.002968 p1_char.ness
56 0.002955 p2_char.ness
124 0.002873 p2_char.cloud
191 0.002786 p1_char.palutena
199 0.002765 p1_char.pokemontrainer
229 0.002756 p2_char.palutena
100 0.002754 p1_char.cloud
175 0.002706 p2_char.pokemontrainer
29 0.002688 p2_char.wolf
34 0.002631 p1_char.wolf
32 0.00253 p2_char.yoshi
49 0.002528 p1_char.yoshi
143 0.002469 p1_char.inkling
6 0.002466 p2_char.inkling
8 0.002453 p2_char.mario
120 0.002423 p1_char.donkeykong
62 0.002402 p1_char.mario
65 0.002393 p2_char.donkeykong

In most cases, we see that when p1_char and p2_char are the same character, they have similar importances. This is a good sign, because we would expect our model to be symmetric: the prediction shouldn’t depend on which player is listed first.

Because this table only shows how important the variables are, not the direction of their relationship to the outcome, it is not possible to tell whether showing up at the top means a character is favorable or unfavorable. However, it is interesting to see which characters yield the most predictive power.
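
One way to recover direction, not just magnitude, is partial dependence: asking how the model’s average predicted outcome moves as we toggle a single dummy column. A sketch, assuming sklearn’s partial_dependence accepts our fitted pipeline and dataframe column names directly:

Code
from sklearn.inspection import partial_dependence

# Average predicted P(p1_won) with p1_char.bowser forced to 0 versus 1.
pd_result = partial_dependence(rfc, X_test, features = ["p1_char.bowser"], kind = "average")
print(pd_result["average"])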

If we want to interpret this, we might say that characters towards the top, like Bowser, Ness, Palutena, Wolf, and Yoshi, are the least balanced (to their benefit or detriment), while characters towards the bottom are the most balanced. Let’s see which characters have the least predictive power.

Code
importanceDF[importanceDF["feature name"].str.contains("char")].sort_values(by="importance",ascending= False).tail(20)
importance feature name
73 0.000838 p2_char.metaknight
66 0.000836 p2_char.lucario
125 0.000834 p1_char.metaknight
75 0.000825 p1_char.miigunner
161 0.000797 p2_char.miiswordfighter
115 0.000778 p2_char.ryu
139 0.00076 p1_char.ryu
233 0.00076 p1_char.miiswordfighter
18 0.000731 p1_char.sheik
53 0.000718 p2_char.sheik
41 0.000712 p2_char.rosalina
155 0.000703 p1_char.rosalina
90 0.000659 p2_char.olimar
141 0.000657 p1_char.olimar
232 0.000593 p1_char.pit
50 0.000575 p2_char.pit
103 0.000513 p1_char.simon
74 0.000481 p2_char.simon
181 0.000481 p2_char.daisy
138 0.000478 p1_char.daisy

Characters near the bottom of the list include Olimar, Pit, Simon, and Daisy. We might interpret these as some of the most balanced characters.

Character Matchups

Now that our model is fit, we can try comparing some characters. When my friend and I first started playing Smash, I would often play Ness and he would usually play Kirby. Let’s try to predict who has the edge in this matchup.

Neither of us has competed in any tournaments, so I’m going to keep our games played and games won at zero.

Code
pred_df = X.iloc[[]]  # empty frame with X's columns
pred_df.loc[1] = 0    # one all-zero row; now set the dummies for this matchup
pred_df.loc[1,"p1_char.ness"] = 1
pred_df.loc[1,"p2_char.kirby"] = 1
pred_df.loc[1,"stage.-1"] = 1
rfc.predict(pred_df)[0]
True

A prediction of True means the model picks the first player to win, so Ness is favorable in this matchup. Let’s make sure this works when we swap p1 and p2.

Code
pred_df = X.iloc[[]]
pred_df.loc[1] = 0
pred_df.loc[1,"p2_char.ness"] = 1
pred_df.loc[1,"p1_char.kirby"] = 1
pred_df.loc[1,"stage.-1"] = 1
rfc.predict(pred_df)[0]
False

We now get False. This is what we would expect, because our model should be symmetric around p1 and p2. Looks like I had an advantage!

Recently, we’ve switched up our characters, and he has been beating my Villager as Ganondorf. Let’s see if Ganondorf has an edge against Villager, or if I just need to work on my skills.

Code
pred_df = X.iloc[[]]
pred_df.loc[1] = 0
pred_df.loc[1,"p1_char.ganondorf"] = 1
pred_df.loc[1,"p2_char.villager"] = 1
pred_df.loc[1,"stage.-1"] = 1
rfc.predict(pred_df)[0]
False

A prediction of False means the model picks the second player to win, so Ganondorf is favorable in this matchup. Let’s swap the characters to make sure the model is symmetric around p1 and p2.

Code
pred_df = X.iloc[[]]
pred_df.loc[1] = 0
pred_df.loc[1,"p2_char.ganondorf"] = 1
pred_df.loc[1,"p1_char.villager"] = 1
pred_df.loc[1,"stage.-1"] = 1
rfc.predict(pred_df)[0]
True

And as expected, the model is symmetric. Looks like my friend had the edge after all!
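
One last note: the hard True/False labels hide how close these calls are. Since the random forest exposes predicted probabilities, we can also sketch a check of the margin:

Code
# Columns follow rfc.classes_, i.e. [False, True] here.
print(rfc.predict_proba(pred_df)[0])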

Conclusion

I would say that we adequately answered many of our initial questions. We found that stage data is (surprisingly!) an important predictor, and that certain characters are more balanced than others. Furthermore, even though the goal of this project isn’t prediction, our random forest’s accuracy of 67.8% is fairly good considering the variance in games of this type; if the outcome of a match were near certain, Smash Ultimate probably wouldn’t be a very fun game anyway. I’m sure we could improve this accuracy further, but there may simply not be many more trends in the data for our models to exploit.

I wanted this project to analyze the skill level of players with respect to their characters, which is why I included data about their games played and games won. Realistically, though, my models would only work for new predictions within the same dataset, because the number of games played and won depends on the size of the dataset itself. I could have scaled and normalized these predictors to work with datasets of varying sizes, but I did not think that was necessary given that the goal of this project is interpretation.

If I were to expand on this project, I would like to add an interactive component that lets you input match statistics and see the predicted winner of the match. I think this would better answer the question of which characters are better at certain skill levels, and help me pick the best characters to use against my friends!