The Soul of a New Machine Learning System

DISCLAIMER: This blog is fully automated, unmonitored, and does not reflect the views of Garett MacGowan. The ML model may produce content that is offensive to some readers.

This blog post was generated with a (potentially) real article title as the only prompt. A link to the original article is below.

Photo by Adam Nowakowski on Unsplash

Generated: 6/20/2022

Our last post was on how to measure your AI’s performance while training your models. So far, we have built a baseline model and tuned its hyperparameters using a human-annotated validation set. The next step is to evaluate the model on our own test set and then create a test log to send to the cloud for evaluation.

There are many metrics and measures for determining your model’s performance; among them is the error rate. For example, suppose we run a linear regression on a real dataset, say, predicting which students get an academic score above a certain threshold. In that setting, recall is the number of true positives divided by the total number of actual positives.

But this means nothing if we can’t tell the difference between a true positive and a false positive. Let’s work through an example with high school students. We’ll use the same dataset as last time and look at how we might predict whether a student will score at least 90% on their college boards exam.

For the sake of simplicity, let’s work with the college boards dataset itself rather than a student’s grades or other test scores.

We’ll use linear regression in this example. As before, we’ll need a training set and a separate test set.

For the training data, the label indicates whether or not the student got 90% or above on their test; the exact threshold is up to us to decide.
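As a minimal sketch of how such a label could be constructed (assuming a hypothetical score column in the raw data; the real columns may differ):

import pandas as pd

# Hypothetical raw data with a "score" column.
raw = pd.DataFrame({"id": [1, 2, 3], "score": [95, 72, 91]})

# Label is 1 if the student scored 90% or above, 0 otherwise.
raw["label"] = (raw["score"] >= 90).astype(int)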

We’ll now load the training data into a variable called training_data, a Pandas dataframe with columns for the values we are predicting from, along with the corresponding labels indicating which examples are positive and which are negative.

import pandas as pd

training_data = pd.read_csv("training.csv")  # columns: id, label

This line reads the CSV file, which has two column headers.

The headers are: id, a numerical code identifying where the data comes from, and label, a 1 or 0 that denotes whether the example is a positive or a negative.
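For concreteness, the file might look something like this (example rows, not the real data):

id,label
1,1
2,0
3,1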

In the example dataset, the label for a positive example is 1 and the label for a negative example is 0. Remember, however, that each row of the CSV file carries its own id alongside the label for that row. We will work with the labels in this example so that we can match the data back up by id later when we make predictions.

The id column for the training table is 1-dimensional and will stay 1-dimensional for this dataset, so we won’t worry about it much.

Now we are ready to read in the test data set. Here’s how you load it in.

test_data = pd.read_csv("test.csv")  # columns: id, label

This is the test set, but remember that it also has 1/0 labels with the same meaning as in training_data: label = 0 means a negative example and label = 1 means a positive one.

We can see what they look like in the data frame here:
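For example, a quick peek at the first few rows of each frame:

print(training_data.head())
print(test_data.head())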

We can take a look at the number of students we had to predict results for, along with the accuracy.

To make this process easier, I’ve put together a function called evaluate() to measure our accuracy or recall rate.

First we need to label each row by its id. We will do this by merging the two data sets together on their id values. The result looks like this:
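A minimal sketch of that merge, assuming both frames share the id column:

# Join the two frames on id; suffixes distinguish overlapping columns.
merged = test_data.merge(training_data, on="id", suffixes=("_test", "_train"))
print(merged.head())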

Now we will use this merged data to calculate a recall rate: the percentage of actual 1’s that we predicted correctly.

We’ll now use a formula I learned from a past interview with a famous data science company: recall = (True Positives) / (True Positives + False Negatives). In other words, the number of true positives (actual 1’s that we predicted as 1) divided by the total number of actual positives, that is, the true positives plus the false negatives (actual 1’s that we predicted as 0).

For the true positives, we count any 1’s in our predictions on the test data that have a corresponding 1 in the ground-truth labels, matched up by id.

Note that both kinds of outcome on the actual positives go into the formula: the ones we got right (true positives) and the ones we missed (false negatives).

We will also use a similar formula to calculate our false positive rate, the fraction of actual negatives that we got wrong. False negatives are rows in the test data that should be 1’s but where we predicted a 0 instead; false positives are the reverse, any rows we predicted as 1 that have a 0 in the ground truth.

We can now write a function that does the arithmetic for us; we just need to pass in our predictions and the true labels from training_data. But before we do that, we’ll need to put our data into the form in which we want to use it.
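Here is a minimal sketch of what evaluate() might look like. The name comes from this post, not from any library, and it assumes two equal-length sequences of 0/1 integers:

def evaluate(y_true, y_pred):
    # True positives: actual 1's that we predicted as 1.
    true_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    # False negatives: actual 1's that we predicted as 0.
    false_negatives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Recall = TP / (TP + FN); guard against dividing by zero.
    if true_positives + false_negatives == 0:
        return 0.0
    return true_positives / (true_positives + false_negatives)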

We’ll need a couple of other helper functions too, like one that converts a string label to an integer and one that counts up the 1’s in training_data, so that we look specifically at the positives rather than at every row.

First, the imports:

from sklearn import metrics
from sklearn import linear_model

Remember the basics, especially linear regression? To make predictions on future data, we need to understand how to fit our model and generate predictions from it.

Here are the lines to fit the model:

model = linear_model.LinearRegression()
# Fit on the feature columns (everything except the label).
model.fit(training_data.drop(columns=["label"]).values, training_data["label"].values)

We created a model named model that is a linear regression. We called fit() to fit it to the training data: we passed the feature columns as inputs and training_data["label"] as the target. Our label, or target, is 1 if the student got a score of 90% or more, and 0 if they didn’t do so well.

We will then use another function called predict() to get our model’s predictions. We’ll give it the test data that we loaded earlier, test_data. We pass this to predict(), but this time we don’t pass the label values; we just hand it the test feature columns, whatever we want to predict on, and see what the model gives us.

y_predicted = model.predict(test_data.drop(columns=["label"]).values)
print(y_predicted)

We print the predicted values: the y-values of our model’s line of best fit at the test points. We used the .values attribute so the model receives plain NumPy arrays rather than dataframes. Remember that the test points won’t sit exactly on the line of best fit, so the raw predictions are continuous numbers rather than exact 0’s and 1’s.

We can then also measure the accuracy of the model, which simply means how often we were right out of the total number of predictions we made. Recall, as we said earlier, is calculated as (True Positives)/(True Positives + False Negatives). To compare the continuous predictions against the 0/1 labels, we first need to threshold them; we’ll use a list comprehension to make this easier.

# Threshold the continuous predictions at 0.5 to get 0/1 labels.
y_predicted = [1 if y >= 0.5 else 0 for y in y_predicted]

There’s a bunch of housekeeping I skipped over here. For instance, remember that we had to save the predictions? We could pickle them, but here we want to write them to a CSV file to send off into the cloud.
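As a sketch, writing the thresholded predictions out to a CSV file (the filename predictions.csv is just a placeholder):

# Pair each test id with its predicted label and write to disk.
output = pd.DataFrame({"id": test_data["id"], "label": y_predicted})
output.to_csv("predictions.csv", index=False)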

Now we can look at how many students we have to predict.

train_totals = len(training_data)

As we’ve done before, train_totals tells us how many examples our dataset has.

We’ve used two built-in functions: len(), which works out how many rows there are in our training data, and np.argmax().

We pass a list to np.argmax(), which returns the index of the largest element. For example, if we pass in [0, 0, 4, 2], the largest value is 4, so np.argmax returns its index, 2.
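A quick demonstration:

import numpy as np

print(np.argmax([0, 0, 4, 2]))  # prints 2, the index of the largest value (4)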

Our goal is to predict the label, which is the 1 or 0.

labels = [1, 0]

The values we are looking for in this list are the indices of the label values. We use a list comprehension here because we do not want to iterate through the whole list of label values by hand; NumPy’s vectorized operations can do that work for us.
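As a closing sketch, here is the same recall computation done with vectorized NumPy operations instead of an explicit loop (the arrays below are hypothetical 0/1 labels, not the real data):

import numpy as np

y_true = np.array([1, 0, 1, 1, 0])  # hypothetical ground-truth labels
y_pred = np.array([1, 0, 0, 1, 1])  # hypothetical thresholded predictions

true_positives = np.sum((y_true == 1) & (y_pred == 1))
false_negatives = np.sum((y_true == 1) & (y_pred == 0))
print(true_positives / (true_positives + false_negatives))  # recall = 2/3 here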