In this post, I will implement Feature Importance in Neural Networks.
The code is at: Link
The dataset is the famous Titanic.
Measuring feature importance in neural networks is difficult, and nearly all methods are approximations. In linear regression you get a coefficient per variable, but in a neural network the relation between input and output is much more complex.
sklearn's MLPClassifier class does expose its coefficients (the neuron weights), but these weights are not very meaningful on their own: as we go deeper into a neural network, the inputs pass through many nonlinear transformations, so the exact relation between a parameter and the output becomes much more complex.
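A quick look at those weights makes the point. This is a self-contained illustration (the synthetic data and network size are arbitrary stand-ins, not the Titanic model):

```python
# Sketch: inspecting MLPClassifier weights (a sanity check, not this post's method).
# The shapes show why raw weights are hard to interpret: each input feeds many
# hidden units, which are then transformed nonlinearly.
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
clf = MLPClassifier(hidden_layer_sizes=(10,), max_iter=500, random_state=0).fit(X, y)

# coefs_ is a list of weight matrices, one per layer
for i, w in enumerate(clf.coefs_):
    print(f"layer {i} weight matrix shape: {w.shape}")
```

With 5 inputs and 10 hidden units, the first layer alone has 50 weights, so there is no single "coefficient" per feature the way there is in linear regression.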
A simple heuristic approximation is to check our features one at a time and see how each one changes the prediction accuracy.
The idea is at : https://christophm.github.io/interpretable-ml-book/feature-importance.html
- Train a model
- Calculate its accuracy on the test set
- Take the columns one by one, shuffle the values in that column, and recalculate accuracy
- Compare the change in accuracy to see which features are important
Since shuffling is random, this method is not perfect, but it will give you an idea.
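The four steps above can be sketched end to end. This is a self-contained illustration, not the post's actual Titanic code: the synthetic data and the MLPClassifier are stand-ins for the real dataset and model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

# Illustrative stand-ins for the Titanic data and model
X, y = make_classification(n_samples=500, n_features=4, n_informative=2,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Steps 1-2: train and score on the untouched test set
model = MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000, random_state=0)
model.fit(X_train, y_train)
baseline = accuracy_score(y_test, model.predict(X_test))
print(f"baseline accuracy: {baseline:.3f}")

# Steps 3-4: shuffle one column at a time, re-score, compare to the baseline
rng = np.random.default_rng(0)
for col in range(X_test.shape[1]):
    X_shuffled = X_test.copy()
    rng.shuffle(X_shuffled[:, col])  # breaks the feature-target link for this column
    acc = accuracy_score(y_test, model.predict(X_shuffled))
    print(f"feature {col}: accuracy {acc:.3f} (change {acc - baseline:+.3f})")
```

Informative features show a clear accuracy drop when shuffled; pure-noise features barely move it.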
To make things more interesting, I also added a totally random column to check how the model reacts to a pure-noise variable. The code is a normal, simple neural network trained on the Titanic dataset.
def get_train_test():
    df = load_df()
    df["random"] = np.random.random(len(df))
    X_ddf = df.drop(['Survived', 'PassengerId'], axis=1)
    X = X_ddf.values
    y = df['Survived'].values
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    return X_train, X_test, y_train, y_test
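`load_df` is not shown in this excerpt; the following toy stand-in (hypothetical hand-made data, not the real Titanic CSV) illustrates the same pattern, including the pure-noise "random" control column:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the post's load_df(): a tiny Titanic-like frame
# with features already encoded numerically.
def load_df():
    return pd.DataFrame({
        "PassengerId": [1, 2, 3, 4],
        "Pclass":      [3, 1, 3, 1],
        "Sex":         [0, 1, 1, 0],
        "Fare":        [7.25, 71.28, 7.92, 53.1],
        "Embarked":    [0, 1, 0, 0],
        "Survived":    [0, 1, 1, 1],
    })

df = load_df()
df["random"] = np.random.random(len(df))  # pure-noise control column
X = df.drop(["Survived", "PassengerId"], axis=1).values
y = df["Survived"].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)
print(X_train.shape)  # 5 feature columns, including "random"
```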
Our baseline is the original test set. After training, the model has 5 feature columns (including the random one) and about 79 percent accuracy on the untouched test set. We will measure which features cause the largest drop from this baseline.
The only code we have to write is below: copy the dataframe, shuffle the values in one column, and use the new dataframe for prediction.
# copy the original data, shuffle the values in one column
def get_df_for_column(df_test_orig, col_name):
    df_test = df_test_orig.copy()
    arr = df_test[col_name].values
    random.shuffle(arr)
    df_test[col_name] = arr
    return df_test
# predict the accuracy with the changed data
def predict_for_column(df_test_orig, col_name):
    df_col = get_df_for_column(df_test_orig, col_name)
    pred_binary = np.round(model.predict(df_col.values))
    print(col_name, accuracy_score(y_test, pred_binary))
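For reference, scikit-learn ships this exact procedure as sklearn.inspection.permutation_importance, which also repeats the shuffle several times to average out the randomness mentioned earlier. A self-contained sketch, again with synthetic stand-ins for the Titanic data and model:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.inspection import permutation_importance

# Illustrative stand-ins for the Titanic data and model
X, y = make_classification(n_samples=500, n_features=4, n_informative=2,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000, random_state=0)
model.fit(X_train, y_train)

# n_repeats shuffles each column several times and averages the accuracy drop
result = permutation_importance(model, X_test, y_test, scoring="accuracy",
                                n_repeats=10, random_state=0)
for col, (mean, std) in enumerate(zip(result.importances_mean, result.importances_std)):
    print(f"feature {col}: importance {mean:.3f} +/- {std:.3f}")
```

The importances_std values give a feel for how much a single shuffle can vary, which is exactly the caveat with doing one shuffle by hand.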
Pclass 0.7039106145251397
Sex 0.6089385474860335
Fare 0.7988826815642458
Embarked 0.770949720670391
random 0.7932960893854749
From the above results, we can see that the most important features are ‘Sex’ and ‘Pclass’. As expected, the ‘random’ column is meaningless. ‘Fare’ also seems unimportant in this run (maybe ‘Pclass’ already implies it, or our feature scaling corrupted this variable).
In this post I showed a very simple approach to testing feature importance in neural networks.