
How Machine Learning Works - with Code Example

Train a model to predict who survives the Titanic.

by Markus Schmitt

[Note: build the model yourself here using our fully interactive notebook. No prior coding experience required.]

If you’re like me, you need to play with something and “do it yourself” to really understand it. Here we’ll explain how machine learning really works, by example.

You’ll build your own machine learning model to predict the likelihood of passengers on the Titanic surviving. The model will learn patterns by itself, just by looking at data.

Understanding the steps for doing machine learning

Follow along to:

  1. Load the data and explore it with visualizations;
  2. Prepare the data for the machine learning algorithm;
  3. Train the model – let the algorithm learn from the data;
  4. Evaluate the model – see how well it performs on data it has not seen before;
  5. Analyze the model – see how much data it needs to perform well.

To build the machine learning model yourself, open the companion notebook. You’ll run real machine learning code without needing any set-up – it just works.

Understanding the tooling for machine learning

There are lots of options when it comes to machine learning tooling. In this guide, we use some of the most popular and powerful machine learning libraries, namely:

  • Python: a high-level programming language known for its readability, and the most popular machine learning language worldwide.
  • Pandas: a Python library that brings spreadsheet-like functionality to the language.
  • Seaborn: a library for plotting charts and other graphics.
  • Scikit-learn: a machine learning library for Python, offering simple tools for predictive data analysis.
  • DRLearn: our own DataRevenue Learn module, built for this dataset.

These are good tools to start with, since they’re used by both beginners and huge companies (like J.P. Morgan).

Exploring our dataset

We’ll use the famous “Titanic” dataset: a slightly morbid but fascinating collection of details about the passengers on the Titanic. We have a bunch of data for each passenger, including:

  • name,
  • gender,
  • age,
  • ticket class.

Our data takes a standard form of rows and columns, where each row represents a passenger and each column an attribute of that passenger. Here’s a sample:

A few of the passengers in the Titanic dataset.
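Outside the notebook, loading a table like this takes only a couple of lines of Pandas. Here’s a minimal sketch, assuming the data lives in a local CSV file (the file name is our assumption; the companion notebook loads the data for you):

```python
import pandas as pd

# Load the passenger data -- "titanic.csv" is an assumed local file name.
df = pd.read_csv("titanic.csv")

# Show the first few passengers (rows) and their attributes (columns).
print(df.head())
```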

Visualizing our dataset

Machine learning models are smart, but they can only be as smart as the data we feed them. Therefore an important first step is gaining a high-level understanding of our dataset.

When it comes to analyzing the data, a good starting point is testing a hypothesis. People with first-class tickets were probably more likely to survive, so let’s see if the data supports that.

You can see and run the code to produce this visualization in the companion notebook.

Survival rate: 1st class vs. 2nd and 3rd class passengers.
3rd class passengers had the worst survival rate, and 1st class passengers the best.

Over 60% of the people in first class survived, while less than 30% of those in third class did.
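A bar chart like the one above can be produced with a single Seaborn call. Here’s a sketch, assuming the standard Titanic column names "Pclass" and "Survived" (the notebook’s exact code may differ):

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Since "Survived" is 0 or 1, the bar height (the mean) is the survival rate.
sns.barplot(data=df, x="Pclass", y="Survived")
plt.ylabel("Survival rate")
plt.show()
```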

You might also have heard the phrase "women and children first." Let's take a look at how gender and survival rate interact.

Women were much more likely to survive than men.

Again, we see that our hypothesis was right. Over 70% of women survived, while only around 20% of men did.
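To check numbers like these yourself, a Pandas groupby works well. Again a sketch, assuming the standard "Sex" and "Survived" columns:

```python
# Average of the 0/1 "Survived" column per gender = survival rate.
print(df.groupby("Sex")["Survived"].mean())
```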

Just like that, we’ve created two basic visualizations of our dataset. We could do a lot more here (and for production machine learning projects, we certainly would). For example, multivariate analysis would show what happens when we look at more than a single variable at a time.

Preparing our data

Before we feed our data into a machine learning algorithm to train our model, we need to make it more meaningful to our algorithm. We can do this by ignoring some columns and reformatting others.

Ignoring unhelpful columns 

There’s no reason to expect a correlation between a passenger’s ticket number and their chance of survival, so we can explicitly ignore that column. We delete it before feeding the data into the model.
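In Pandas, dropping a column is one line. A sketch, using the standard column name "Ticket" (your dataset’s column names may differ):

```python
# Remove the ticket number column before training.
df = df.drop(columns=["Ticket"])
```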

Reformatting our data

Some features are useful, but not in their raw form. For example, the labels "male" and "female" are meaningful to a human but not to a machine, which prefers numbers. Therefore we can encode these markers as "0" and "1" respectively.
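One simple way to do this encoding in Pandas (assuming a "Sex" column holding the values "male" and "female"):

```python
# Encode "male" as 0 and "female" as 1, as described above.
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})
```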

Once we're done preparing our dataset, the format is more machine friendly. We’ve provided a sample below: we’ve eliminated many useless columns, and the columns that are left all use numbers.

PassengerID List
After preparing the dataset it's simpler and now ready for machine learning.


Splitting our dataset in two

Now we need to train our model and then test it. Just like school children are given examples of test questions as homework but then unseen questions under exam conditions, we’ll train the machine learning algorithm on some of the data and then see how well it performs on the remainder.

Splitting the data into a training set and a test set.
We split our dataset: one part for training the model, and one part for testing it.
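Scikit-learn provides a helper for exactly this. A sketch, assuming our prepared dataframe df with a "Survived" answer column (in the notebook, DRLearn handles this step for you):

```python
from sklearn.model_selection import train_test_split

# Separate the inputs (X) from the answers we want to predict (y).
X = df.drop(columns=["Survived"])
y = df["Survived"]

# Hold out 25% of the passengers as unseen test data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
```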


Let’s train our model!

And now for the fun part! We’ll feed the training data into our model and ask it to find patterns. In this step, we give the model both the data and the desired answers (whether or not the passenger survived).

The model learns patterns from this data.

Training a machine learning model on the training set.
Our machine learning model is trained on the training set.
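With Scikit-learn, training is a single fit call. A sketch using a random forest – one reasonable choice of algorithm, though not necessarily the one DRLearn uses under the hood:

```python
from sklearn.ensemble import RandomForestClassifier

# Give the model both the passenger data and the answers,
# and let it find patterns.
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
```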

Testing our model

Now we can test our model by giving it only the details of the passengers in the other part of our dataset, without the answer. The algorithm doesn’t know whether these passengers survived or not, but it will try to guess based on what it learned from the training set.

Testing the machine learning model on the test dataset.
Testing how well our machine learning model works by asking it to predict the results on the test data.
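In code, testing means predicting on the held-out passengers and comparing the guesses against the true outcomes. A sketch with Scikit-learn, continuing from the training step above:

```python
from sklearn.metrics import accuracy_score

# Predict survival for passengers the model has never seen.
predictions = model.predict(X_test)

# Compare the guesses against what actually happened.
print("Test accuracy:", accuracy_score(y_test, predictions))
```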

Analyzing our model

To better understand how our model works, we can:

  • Look at which features it relied on the most to make predictions;
  • See how its accuracy changes if we use less data.

The first helps us understand our data better, and the second helps us understand whether it’s worth trying to source a larger dataset.

Understanding what our model finds important

A machine learning model learns that not all data is equally informative. By weighting particular details differently, it can make better predictions. The weights below show that gender is by far the most important factor in predicting survival rate.

Our model relies mostly on gender, a bit on whether the passenger was in 3rd class or not and on the size of their family.
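If the trained model is tree-based (like the random forest sketched earlier), Scikit-learn exposes these learned weights directly:

```python
import pandas as pd

# One weight per input column; higher means more important to the model.
importances = pd.Series(model.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False))
```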

We can also look at which aspects of the data the algorithm paid attention to when predicting the survival of a specific passenger. Below we see a passenger who the algorithm thought was very likely to survive. It paid special attention to the fact that: 

  • The passenger was not in third class;
  • The passenger was female.

It lowered the chance of survival slightly because the passenger was also not in first class, resulting in a final survival prediction of 93%.

Shapley value analysis for one passenger.
How the model made a prediction for one particular passenger: she had a high predicted chance of survival because she was a woman and not in 3rd class.
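Per-passenger explanations like this are typically produced with Shapley values, for example via the shap package. A sketch under the assumption of a tree-based model; the notebook’s exact setup may differ:

```python
import shap

# Explain one prediction: how much each feature pushed the
# survival probability up or down for the first test passenger.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test.iloc[[0]])
print(shap_values)
```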


Understanding how data quantity affects our model

Let’s train the model multiple times, seeing how much it improves with more data. Here we plot both the training score and the test score. The latter is much more interesting, as it tells us how well the model performs on unseen data.

The training score can be thought of as an “open-book” test: the model has already seen these examples and their answers, so it’s much easier for it to score well on them. That’s why the training score is higher than the test score.

Machine learning model improves with size of training data.
More data makes our model better (test score). But after ~500 data points the improvement is minimal.

Here we see that the more data the model has, the better it performs. The improvement is most noticeable at the start; after that, adding more data yields only small gains.
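Scikit-learn’s learning_curve helper automates this retraining-with-more-data experiment. A sketch, reusing the model and data from the earlier steps:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

# Train on 10%..100% of the data and record both scores each time.
train_sizes, train_scores, test_scores = learning_curve(
    RandomForestClassifier(random_state=42), X, y,
    train_sizes=np.linspace(0.1, 1.0, 8), cv=5,
)
print("Training score:", train_scores.mean(axis=1).round(2))
print("Test score:    ", test_scores.mean(axis=1).round(2))
```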

Machine learning models don’t have to be “black box” algorithms. Model analysis helps us understand how they work, and how to improve them.

Conclusion

That’s it: you've built your own machine learning model. You’ll now be able to:

  • Understand the day-to-day work data science teams do;
  • Communicate better with your data science or machine learning team;
  • Know what kinds of problems machine learning is best at solving;
  • Realize that machine learning is not so intimidating after all.

The complex part of machine learning is getting into all the nitty-gritty details of building and scaling a customized solution. And that’s exactly what we specialize in. So if you need help with the next steps, let us know.
