Testing Machine Learning models differs from algorithm to algorithm. For some algorithms the final model can be specified with only a few parameters (Linear Regression: weights and a bias). For Neural Networks we need to keep a much longer list of learned parameters, and we also apply nonlinear transformations throughout the network. So even if we know all intermediate parameters, they are not enough to check the model's behavior by hand.
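For example, a Linear Regression on the 4 features used in this post is fully described by 5 numbers (4 weights and a bias), while the small network shown further below already has 400 + 4050 + 510 + 11 = 4971 trainable parameters, with ReLU and sigmoid nonlinearities in between.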
There are lots of Machine Learning testing approaches for Decision Tree or Linear Regression type problems. Here I want to show some good test cases for a Neural Network.
The code is on GitHub: Link
I wrote and tried the code in Colab.
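If you want to run the tests in Colab, one way is to write them to a file and call pytest from a cell. The file name below is just a placeholder; use the names from the repository.

!pip install -q pytest
!python -m pytest test_nn_model.py -v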
Neural Networks are also highly nonlinear. Inputs are combined in the dense layers, generating new features that are combinations of all input parameters. So a Neural Network can behave very differently depending on the data and the initial initialization (randomness).
In this post we will design some tests to ensure that the behavior of the Neural Network is still similar to what it was when we chose it. Be careful: we are not tuning parameters. We have already decided on an architecture. We just want to be sure that the selected model still works as it did when we selected it in the past.
Neural Networks form very many connections, which makes measuring the effect of a single variable difficult. There are libraries like SHAP, but they are built on Game Theory; their calculation is an approximation rather than a direct parameter like in Linear Regression.
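The tests below receive two pytest fixtures, dummy_dataset and dummy_nn_model, from the project's test setup. A minimal sketch of what such a conftest.py could look like follows; the file names, feature selection and preprocessing are my assumptions, the only constraints taken from this post are 4 numeric input features and the passenger class sitting at column index 1.

#conftest.py - a sketch of the fixtures used by the tests in this post (assumed, not the repo code)
import pandas as pd
import pytest
from sklearn.model_selection import train_test_split
from model import get_model   #assumed module; get_model is shown later in the post

@pytest.fixture(scope="session")
def dummy_dataset():
    df = pd.read_csv("titanic.csv")                    #assumed file name
    features = ["Sex", "Pclass", "Age", "Fare"]        #class kept at column index 1
    df["Sex"] = (df["Sex"] == "female").astype(int)
    df = df.dropna(subset=features + ["Survived"])
    X = df[features].to_numpy(dtype=float)
    y = df["Survived"].to_numpy(dtype=float)
    return train_test_split(X, y, test_size=0.2, random_state=42)

@pytest.fixture(scope="session")
def dummy_nn_model(dummy_dataset):
    X_train, X_test, y_train, y_test = dummy_dataset
    model = get_model(X_train.shape[1])
    model.fit(X_train, y_train, batch_size=64, epochs=40, verbose=0)
    return model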
Overfit Test
The overfit test is like using a hammer when a small punch would do the job. You designed your hammer for difficult jobs, so you expect it to work perfectly even for very simple cases.
Imagine you designed a model for MNIST. You claim it can separate 10 classes of handwritten digits in a very complicated space like the one below. So you claim you can draw many decision boundaries that separate that problem.
If I delete nearly all of the points and leave only a few, the problem becomes very simple for your model.
So one test case for Machine Learning is overfitting on a small sample set. Below you can see that I take the first 10 elements from the set, train the model with them, and then predict them. I expect this to work almost perfectly. (Since my model is not that great to begin with, I also expect the overfitting not to be perfect, so my error check does not assume 100 percent accuracy.)
#If our model is going to work for the whole dataset,
#it must work almost perfectly on a tiny dataset: success on a very small set must be very high.
#Imports used by the tests in this post; get_model is the helper shown further below.
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

def test_nn_overfit(dummy_dataset):
    X_train, X_test, y_train, y_test = dummy_dataset
    #Best would be a totally random sample:
    #sampled_list = random.sample(range(0, len(X_train)), 10)
    #X_train, y_train = X_train[sampled_list], y_train[sampled_list]
    #Here we simply take the first 10 rows
    X_train, y_train = X_train[0:10], y_train[0:10]
    overfit_model = get_model(X_train.shape[1])
    overfit_model.fit(X_train, y_train, epochs=20)
    pred = np.round(overfit_model.predict(X_train))
    labels = y_train.flatten().astype(int).tolist()
    pred = pred.flatten().astype(int).tolist()
    #count the mismatches between labels and predictions
    error = sum(np.logical_xor(labels, pred))
    is_below = error <= (len(labels) / 5)
    assert is_below, "Model should fit a tiny dataset almost perfectly"
Here we get our model and subsample a very small set (how small depends on the problem and the real dataset size; here it is just an example). Then I run the test to see whether this tiny set is learned much better than the real training accuracy. I know my accuracy is around 80%. (If we use get_dummies for the categorical variables, the same model reaches 99% accuracy, but I wanted a simpler model for this post.)
Common Sense
Sometimes we also want to test our model against our domain knowledge (common sense). I do not know this Titanic dataset in depth, but everyone who examines it states that a higher passenger class increases survival probability. So we can write a test: if the same person stays in a cheaper class, their predicted survival probability should decrease.
Below I pick a person from the set who travels in 1st class and make predictions by altering only this value.
#We must put in our domain knowledge and test very simple relations.
#Here we know the "Class" field is important for this dataset, so we change it manually
#to see the effect we expect.
def test_common_sense(dummy_dataset, dummy_nn_model):
    X_train, X_test, y_train, y_test = dummy_dataset
    #a passenger travelling in 1st class
    p1 = dummy_nn_model.predict(X_train[1].reshape(1, -1)).flatten()[0]
    #copy the row and change the class to 2nd
    X_train_copy = X_train[1].copy()
    X_train_copy[1] = 2.0
    p2 = dummy_nn_model.predict(X_train_copy.reshape(1, -1)).flatten()[0]
    #change the class to 3rd
    X_train_copy[1] = 3.0
    p3 = dummy_nn_model.predict(X_train_copy.reshape(1, -1)).flatten()[0]
    print(f"1st class {p1} 2nd class {p2} 3rd class {p3}")
    assert p1 > p2, "1st class probability must be higher than 2nd class"
    assert p2 > p3, "2nd class probability must be higher than 3rd class"
The test above:
- takes a sample from 1st class and predicts p1
- keeps everything else the same, changes the class to 2nd and predicts p2
- keeps everything else the same, changes the class to 3rd and predicts p3
Very simply, we expect p1 > p2 > p3.
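The same idea can be pushed a little further. Below is a sketch, not code from the repository, that checks the monotonic relation for several passengers with pytest.mark.parametrize; the row indices are arbitrary, and column index 1 is the class field, as in the test above.

#Sketch: the common-sense check repeated for several rows
import pytest

@pytest.mark.parametrize("row", [1, 5, 10])    #arbitrary sample rows
def test_class_monotonic(dummy_dataset, dummy_nn_model, row):
    X_train, X_test, y_train, y_test = dummy_dataset
    probs = []
    for pclass in (1.0, 2.0, 3.0):
        x = X_train[row].copy()
        x[1] = pclass                          #overwrite the class feature
        probs.append(dummy_nn_model.predict(x.reshape(1, -1)).flatten()[0])
    #survival probability should not increase as the room gets cheaper
    assert probs[0] >= probs[1] >= probs[2], f"Monotonicity violated: {probs}"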
Testing Depth and Number of Neurons
During parameter tuning we tried lots of models with different neuron counts and depths. Now we have a model. WE WILL NOT TUNE AGAIN. But as a simple test, we want to check that our topology is still better than some baseline topologies we know.
Below we generate the depth and number of neurons dynamically. We can pass different arrays as levels so that it builds different topologies.
#Builds the network dynamically: "levels" holds the number of neurons per hidden layer
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

def get_model(inputdim, levels=[80, 50, 10]):
    model = Sequential()
    # hidden layers
    model.add(Dense(levels[0], activation='relu', input_dim=inputdim))
    for level in levels[1:]:
        model.add(Dense(level, activation='relu'))
    # output layer
    model.add(Dense(1, activation='sigmoid'))
    # summary
    model.summary()
    # compile
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    return model
The model we decided on has the topology below. In the test we are not looking for a better model; we are just making sure that, compared with some known very simple or very bad architectures, this architecture still makes sense.
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
dense (Dense)                (None, 80)                400
dense_1 (Dense)              (None, 50)                4050
dense_2 (Dense)              (None, 10)                510
dense_3 (Dense)              (None, 1)                 11
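The parameter counts follow from inputs × neurons + biases: with 4 input features the first layer has 4 × 80 + 80 = 400 parameters, then 80 × 50 + 50 = 4050, 50 × 10 + 10 = 510, and 10 × 1 + 1 = 11 for the sigmoid output.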
KEEP SAME DEPTH WITH DIFFERENT NEURONS
Here, with a utility method, we can create different models that should not be among the best topologies for our data. We check the accuracy and verify that our model performs best among them.
#After fine-tuning we decided on a model architecture.
#We still want to start from a very simple architecture and check that the number of
#neurons we chose makes sense against such dummy architectures.
def test_width_acc(dummy_dataset):
    X_train, X_test, y_train, y_test = dummy_dataset
    levels_list = [[4, 3, 2], [20, 10, 5], [80, 50, 10]]
    acc_list = []
    for levels in levels_list:
        model_ = get_model(X_train.shape[1], levels)
        model_.fit(X_train, y_train, batch_size=64, epochs=40, verbose=0)
        pred_binary = np.round(model_.predict(X_train))
        acc_list.append(accuracy_score(y_train, pred_binary))
    assert sorted(acc_list) == acc_list, 'Accuracy should increase as complexity increases.'
CHANGE DEPTH WITH DIFFERENT NEURONS
During tuning we determined the best depth for our problem. So at test time we can still make one pass to see whether that depth is still a sensible choice. Below we try networks with 3, 4 and 5 dense layers.
#After fine-tuning we decided on a model architecture.
#We still want to start from a very simple architecture and check that the depth
#we chose makes sense against such dummy architectures.
def test_depth_acc(dummy_dataset):
    X_train, X_test, y_train, y_test = dummy_dataset
    levels_list = [[4, 3], [20, 10, 5], [80, 50, 10, 5]]
    acc_list = []
    for levels in levels_list:
        model_ = get_model(X_train.shape[1], levels)
        model_.fit(X_train, y_train, batch_size=64, epochs=40, verbose=0)
        pred_binary = np.round(model_.predict(X_train))
        acc_list.append(accuracy_score(y_train, pred_binary))
    assert sorted(acc_list) == acc_list, 'Accuracy should increase as complexity increases.'
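Both topology tests compare accuracies of freshly trained models, so the random initialization mentioned at the beginning can occasionally flip the ordering. A small sketch to reduce that flakiness is to fix the seeds before training; exact determinism still depends on the backend and hardware.

#Optional: call this before the topology tests to reduce run-to-run randomness
import random
import numpy as np
import tensorflow as tf

def set_seeds(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)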
Evaluation Test
You know your model's performance, so add a test to check that you are still around those values.
def test_dt_evaluation(dummy_dataset, dummy_nn_model):
    X_train, X_test, y_train, y_test = dummy_dataset
    pred_test = dummy_nn_model.predict(X_test)
    pred_test_binary = np.round(pred_test)
    acc_test = accuracy_score(y_test, pred_test_binary)
    #AUC can be asserted in the same way, against the value observed during model selection
    auc_test = roc_auc_score(y_test, pred_test)
    assert acc_test > 0.78, 'Accuracy on test should be > 0.78'
At the end you will see a result like the one below.
In this post I showed an easy way to write tests for complicated Machine Learning cases. The reason to apply these tests could be new data (which in fact can cause concept shift, covariate shift, ...) or your team deciding to use a new model. In both cases these tests will pay off, because new data or a new model in Machine Learning turns the problem into a totally different space.