In this short article, we'll show you a simple example of one of the classics of modelling: the Decision Tree. We'll quickly take you through the steps required to develop a simple predictive model using RapidMiner.
The decision tree is a classic modelling method for predicting future values from a set of identified attributes, and RapidMiner's straightforward model development user interface makes creating one about as simple as it can be. In this article and the next, we'll work through an example, showing you the steps required to take a training dataset, formulate a predictive model and then test this model against real data to determine its usefulness.
The question: what sorts of people survived the Titanic disaster?
We'll use a well-known training dataset that contains passenger data from the ill-fated Titanic. Thought of as unsinkable, the ship hit an iceberg on its maiden voyage and sank in the seas off Canada with the loss of over 1,500 lives. The dataset lists, for each passenger, key attributes such as fare paid, class of service and age (the independent variables). It also contains a key 'outcome' attribute (the dependent variable): did the passenger survive the sinking or not? Using a decision tree, we'll take this training dataset and let RapidMiner choose the best attributes to predict survival, and then we'll test the model to see how 'good' the predictions are by comparing what the model says with what actually happened.
Step 1. Get the Titanic 'survival' data
The first step is to get hold of the Titanic dataset. There are numerous versions available on the web, but the one we're using can be downloaded here. It's worth opening this file in Excel first to get a quick overview of its contents. The dataset is a passenger list, and for each passenger you can see the following attributes:
|name||The name of the passenger|
|age||Age in years|
|fare||The fare paid|
|pclass||An integer from 1 to 3, representing the class of travel (1=First class etc.)|
|sex||Gender of passenger (M=Male, F=Female)|
|survived||A flag representing whether the passenger survived (1) or not (0)|
Now let's start up RapidMiner to develop our predictive model. First, create a new process using the 'blank' template. You should see an empty Process panel to which we'll add the required RapidMiner operators. We'll now step through each of the operators in sequence.
Step 2. Read the data into RapidMiner
As our training dataset is in Excel format, we'll use the 'Read Excel' operator. Find the operator in the Operators panel, then drag it across to the Process panel. Each operator has a set of parameters, and the first one to fill in, in this case, is the 'excel file'. Use the browser icon to find your file (Titanic3.xls). The easiest way to set up a dataset correctly in RapidMiner is to use the Import Configuration Wizard, but before doing this click the 'first row as names' checkbox to ensure the column headings are used as attribute names. After clicking the wizard button, you'll be taken through the steps required to convert your Excel data into a RapidMiner dataset. As you step through the wizard, let RapidMiner choose the default attribute types and roles, although we'll be adjusting one of them using the 'Set Role' operator later (Step 5 below).
At this point you might like to take a look at the data in RapidMiner. You can do this by connecting the 'out' port of the 'Read Excel' operator to the 'res' (Results) port on the right-hand side of the screen. Save your model as something like 'TitanicSurvivalPredictor' and click 'Run'. You should now see a table of the Excel file's contents, with the first row used as the attribute names.
Step 3. Select the most useful attributes in the dataset
The dataset contains a number of attributes we don't need for modelling survival. It's always good practice to reduce your dataset attributes down to just what you need for your model - this makes your data and model easier to understand, and reduces the risk of 'overfitting'. It's critically important to understand the statistical concept of 'overfitting', and if you're not sure what this means, it's well worthwhile familiarising yourself, because 'overfitted' models are dangerous models! They look useful, but in fact are very poor at making future predictions.
Add a 'Select Attributes' operator to your model, and connect the 'out' port of the 'Read Excel' operator to the 'exa' (Example) port of the 'Select Attributes' operator. In the 'Select Attributes' operator, you now need to adjust the parameters so we get just the attributes we need. Select 'subset' for 'attribute filter type', then click 'Select Attributes'. In our case, we'll be using the following attributes (variables):
|age||The age of the passenger|
|fare||The value of the fare paid by the passenger - higher numbers equate to more expensive tickets|
|pclass||The travelling class, e.g. First or Third class|
|sex||The gender of the passenger|
|survived||A flag that represents whether the person survived or not|
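For readers who think more easily in code, here is a minimal Python sketch of what selecting a subset of attributes amounts to. It is not RapidMiner code: the two sample rows are made up for illustration, and `select_attributes` is a hypothetical helper name.

```python
# Made-up sample rows for illustration only; the real spreadsheet holds
# many more passengers and additional columns.
passengers = [
    {"name": "Passenger A", "age": 29.0, "fare": 211.34,
     "pclass": 1, "sex": "female", "survived": 1},
    {"name": "Passenger B", "age": 40.0, "fare": 7.90,
     "pclass": 3, "sex": "male", "survived": 0},
]

# The subset chosen in the 'Select Attributes' operator.
KEEP = ["age", "fare", "pclass", "sex", "survived"]


def select_attributes(rows, keep):
    """Return new rows containing only the attributes named in `keep`."""
    return [{attribute: row[attribute] for attribute in keep} for row in rows]


selected = select_attributes(passengers, KEEP)
print(selected[0])
```

Every attribute not listed in `KEEP` (here, just `name`) is dropped before any modelling happens.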
Step 4. Discretize the 'Survived' and 'PClass' attributes
RapidMiner's Decision Tree operator requires that your prediction attribute (survived) be nominal (i.e. non-numeric); however, our dataset stores the survival status as either 1 or 0, i.e. as a number. The Discretize operators let you adjust numeric attributes that represent classes, turning them into binominal or polynominal non-numeric values. Add a 'Discretize by User Specification' operator to your process model, and select 'single' for the 'attribute filter type' and 'survived' for the 'attribute'. This means you're going to discretize just one attribute - whether the passenger survived or not. Click the 'Edit List' button, and add two class names: 'DidSurvive' and 'DidNotSurvive'. Set the upper limit as '1' for DidSurvive and '0' for DidNotSurvive. If you join your operator to the results and run the model, you'll see the adjusted dataset containing the words 'DidSurvive' and 'DidNotSurvive' now in the 'survived' attribute. This is what we need for the decision tree, and the technique of discretizing attributes is an important one when developing predictive models.
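The upper-limit idea can be sketched in a few lines of Python. This is only an illustration of the binning logic as we understand it, not RapidMiner's implementation; `discretize_by_upper_limits` is a hypothetical helper, and the four flags are made-up values.

```python
def discretize_by_upper_limits(value, classes):
    """Return the name of the first class whose upper limit is >= value.

    `classes` is a list of (class_name, upper_limit) pairs; they are
    checked from the lowest upper limit to the highest.
    """
    for class_name, upper_limit in sorted(classes, key=lambda pair: pair[1]):
        if value <= upper_limit:
            return class_name
    raise ValueError(f"{value!r} is above every upper limit")


# The two classes entered via the 'Edit List' button.
SURVIVED_CLASSES = [("DidSurvive", 1), ("DidNotSurvive", 0)]

survived_flags = [1, 0, 0, 1]  # made-up values for illustration
labels = [discretize_by_upper_limits(flag, SURVIVED_CLASSES)
          for flag in survived_flags]
print(labels)
```

A flag of 0 falls under the upper limit of 0 and becomes 'DidNotSurvive'; a flag of 1 skips that class and lands in 'DidSurvive'.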
We also have another discretize task to perform. The 'pclass' attribute represents the cabin class for the passenger, with values from '1' (First Class) to '3' (Third Class). Because this attribute is numeric, RapidMiner will assume it's a continuous variable (like fare paid), and our decision tree could end up with nonsensical 'forks' in the tree for pclass like '<2.5' but... there's no such thing as '2 and a half' class on board ships! In fact, we want to treat pclass as a discrete nominal attribute so that the tree forks on 1st, 2nd or 3rd class only. To do this, add another 'Discretize by User Specification' operator, select a single attribute ('pclass'), and add three classes, one for each class of travel, with upper limits of '1', '2' and '3'.
Join these two discretize operators to the flow by connecting the 'exa' output ports to the 'exa' input ports. We now have a dataset where cabin class (pclass) assumes one of the three textual values, and RapidMiner's decision tree won't split on non-existent, non-integer pclass values.
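To see why the numeric version of pclass invites splits like '<2.5', the following Python sketch compares the candidate split points of a numeric attribute with the branches of a nominal one. It assumes the common approach of testing midpoints between adjacent values; RapidMiner's exact threshold selection may differ. The class names and sample values are hypothetical.

```python
def numeric_split_candidates(values):
    """Midpoints between adjacent distinct values: the thresholds a tree
    could test when it treats the attribute as a continuous number."""
    distinct = sorted(set(values))
    return [(low + high) / 2 for low, high in zip(distinct, distinct[1:])]


def nominal_branches(values):
    """One branch per distinct value when the attribute is nominal."""
    return sorted(set(values))


pclass_numeric = [1, 3, 2, 3, 1, 3]  # made-up values for illustration
thresholds = numeric_split_candidates(pclass_numeric)
print(thresholds)

# Hypothetical class names; use whatever names you entered in the operator.
PCLASS_NAMES = {1: "First", 2: "Second", 3: "Third"}
pclass_nominal = [PCLASS_NAMES[value] for value in pclass_numeric]
branches = nominal_branches(pclass_nominal)
print(branches)
```

The numeric version yields thresholds that fall between classes, while the nominal version can only branch on the three real classes.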
Step 5. Set Role
On its own, RapidMiner's decision tree doesn't know what we're actually trying to predict with this model. How do we tell RapidMiner what we'd like to predict? The answer is by giving the attribute we want to predict the 'label' role. To do this, add a 'Set Role' operator, and choose 'survived' as the 'attribute name'. Now choose a 'target role' of 'label'. The label role tells RapidMiner that this attribute is the one we want to predict - in our case whether the passenger survived or not. Note that each dataset can contain only one label attribute (all other attributes are termed '
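In code terms, setting the label role is similar to separating the target column from the remaining attributes before training. The sketch below is a rough Python analogy only; `set_label` is a hypothetical helper and the rows are made up.

```python
def set_label(rows, label_name):
    """Split each row into its remaining attributes and its label value."""
    features = [
        {name: value for name, value in row.items() if name != label_name}
        for row in rows
    ]
    labels = [row[label_name] for row in rows]
    return features, labels


# Made-up rows, already discretized as in Step 4.
rows = [
    {"age": 29.0, "fare": 211.34, "pclass": "First",
     "sex": "female", "survived": "DidSurvive"},
    {"age": 40.0, "fare": 7.90, "pclass": "Third",
     "sex": "male", "survived": "DidNotSurvive"},
]

features, labels = set_label(rows, "survived")
print(labels)
```

The learner then sees `features` as its inputs and `labels` as the values it should learn to predict.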
Step 6. Create a Decision Tree
At last we can insert our Decision Tree operator. Add this to your Process panel, and join it up to your data by connecting the output of the last operator we used ('Set Role') to the 'tra' (training) input port. For now, leave Decision Tree's parameters as they are - once you've created a decision tree, you can look at modifying parameters like 'pruning' and 'maximal depth' to fine-tune the model, balancing simplicity with accuracy. Don't forget to connect the output from Decision Tree to the results connector so you can see the results.
Step 7. Run the Process
Click 'Run', and if you've set up your operators correctly, you should see the results in a view (tab) called 'Tree (decision tree)'. You should see that the first-level predictor attribute is 'sex'. Females were more likely than males to survive the Titanic's sinking. As you go a level further down, you should see 'pclass', i.e. cabin class, as another significant predictor of survival. More advanced users can then go back to the Process panel and adjust the Decision Tree parameters to change the objectives of the algorithm, creating different 'trees'.
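To give a feel for how a tree learner decides which attribute goes at the top, here is a small Python sketch that computes information gain for 'sex' and 'pclass'. The ten rows are invented so that the pattern resembles the one described above; they are not taken from the Titanic dataset. Plain information gain is used for simplicity, and the criterion RapidMiner applies depends on the operator's settings, so treat this as an illustration rather than a reproduction of its algorithm.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, label="survived"):
    """Entropy of the label minus the weighted entropy after splitting."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[label])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[label] for row in rows]) - remainder


# Invented toy sample: (sex, pclass, survived).
toy = [
    ("female", "First", "DidSurvive"), ("female", "Second", "DidSurvive"),
    ("female", "Third", "DidSurvive"), ("female", "Third", "DidSurvive"),
    ("female", "Third", "DidNotSurvive"), ("male", "First", "DidSurvive"),
    ("male", "First", "DidNotSurvive"), ("male", "Second", "DidNotSurvive"),
    ("male", "Third", "DidNotSurvive"), ("male", "Third", "DidNotSurvive"),
]
rows = [{"sex": s, "pclass": p, "survived": y} for s, p, y in toy]

gains = {name: information_gain(rows, name) for name in ("sex", "pclass")}
best = max(gains, key=gains.get)
print(gains, best)
```

On this toy sample, splitting on 'sex' separates survivors from non-survivors more cleanly than splitting on 'pclass', so it would be chosen for the first fork.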
That's it! You've loaded a training dataset and created a predictive model that predicts survivability based on a passenger list containing a set of influencing attributes.
Here's a screenshot of the completed model for you to check your own model against:
In the next article, we'll modify the process to test the model using a test dataset, and we'll see just how well (or not) it makes its predictions. To give you a hint of what's ahead, it appears that those paying higher fares did have a slight survival advantage, and younger female travellers survived significantly more often than older males. To understand why this was so, we'll need to think about the cultural norms in place at the time: a fascinating area of study in itself!
Association vs. Causality
Lastly, it's important to note that any modelling of data reports 'association' only. It makes no claims about causality. A more basic example illustrates the point: if you looked out onto a street on a rainy day, you might note that there seem to be far more umbrellas in use than on a dry day. But we don't for a minute say that 'lots of umbrellas cause wet weather'. Rather, we say that lots of umbrellas are associated with wet weather. The same concept applies to any and all modelling you do: models generally test associations, but have little to say about causality. Understanding the causes of rainy weather has little to do with umbrellas, and it's extremely important to bear this concept in mind when doing any predictive insights work. Confusing causality with association is a common trap for young players, and avoiding it differentiates those able to successfully use tools like RapidMiner from those who can't!