RapidMiner is a popular data science tool that allows for the creation of sophisticated predictive models. A classic modelling method incorporates the decision tree concept as a way to predict future values based on a set of identified attributes, and RapidMiner's straightforward model development user interface makes creating this model about as simple as it can be. In this article and the next, we'll work through an example, showing you the steps required to take a training dataset, formulate a predictive model and then test this model against real data to determine its usefulness.
We'll use a well-known training dataset that contains passenger data from the ill-fated Titanic ship. Thought of as unsinkable, the ship hit an iceberg on its maiden voyage and sank in the seas off Canada with the loss of over 1,500 lives. The dataset we'll use lists, for each passenger, their key attributes like fare paid, class of service and age. It also contains a key 'outcome' attribute - did the passenger survive the disaster of the ship's sinking or not? Using a decision tree, we'll take this training dataset and let RapidMiner choose the best attributes to predict survival, and then we'll test the model to see how 'good' the predictions are by measuring what the model says versus what actually happened.
The first step is to get hold of the Titanic dataset. There are numerous versions available on the web, but the one we're using can be downloaded here. It's worth opening this file in Excel first to get a quick overview of its contents. The dataset is a passenger list, and for each passenger you can see the following attributes:
|name||The name of the passenger eg.|
|age||Age in years|
|fare||The fare paid|
|pclass||An integer from 1 to 3, representing the class of travel (1=First class etc.)|
|sex||Gender of passenger (M=Male, F=Female)|
|survived||A flag representing whether the passenger survived (1) or not (0)|
Now let's start up RapidMiner to develop our predictive model. First, create a new process using the 'blank' template. You should see an empty Process panel to which we'll add the required RapidMiner operators. We'll now step through each of the operators in sequence.
1. Read the data
As our training dataset is in Excel format, we'll use the 'Read Excel' operator. Find the operator in the Operators panel, then drag it across to the Process panel. Each operator has a set of parameters, and the first one to fill in, in this case, is 'excel file'. Use the browser icon to find your file (Titanic3.xls). The easiest way to set up a dataset correctly in RapidMiner is to use the Import Configuration Wizard, but before doing this, tick the 'first row as names' checkbox so that the column headings are used as attribute names. After clicking the wizard button, you'll be taken through the steps required to convert your Excel data into a RapidMiner dataset. As you step through the wizard, let RapidMiner choose the default attribute types and roles, although we'll be adjusting one of them using the 'Set Role' operator later (Step 4 below).
At this point you might like to take a look at the data in RapidMiner. You can do this by connecting the 'out' port of the 'Read Excel' operator to the 'res' (Results) port on the right-hand side of the screen. Save your process as something like 'TitanicSurvivalPredictor' and click 'Run'. You should now see a table of the Excel file's contents, with the first row used as the attribute names.
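If it helps to see the same idea outside the RapidMiner interface, here is a minimal Python sketch of what 'first row as names' amounts to. It assumes the spreadsheet has been exported to CSV. The three rows are invented placeholders rather than real passenger records, and `SAMPLE_CSV` and `read_examples` are hypothetical names used only for this illustration.

```python
import csv
import io

# Hypothetical stand-in for a CSV export of the spreadsheet.
# The rows are invented placeholders, not real passenger records.
SAMPLE_CSV = """name,age,fare,pclass,sex,survived
Passenger A,29,211.34,1,F,1
Passenger B,40,13.00,2,M,0
Passenger C,22,7.25,3,M,0
"""

def read_examples(text):
    """Parse CSV text, using the first row as the attribute names."""
    reader = csv.DictReader(io.StringIO(text))
    return list(reader)

examples = read_examples(SAMPLE_CSV)
print(list(examples[0].keys()))  # attribute names taken from the first row
print(len(examples))             # number of passengers read
```

Note that this sketch reads every value as text, so deciding which attributes are numeric is still a separate step.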
2. Select Attributes
The dataset contains a number of attributes we don't need for modelling survival. It's good practice to reduce your dataset down to just the attributes your model needs. Add a 'Select Attributes' operator, and connect the 'out' port of the 'Read Excel' operator to the 'exa' (example set) port of the 'Select Attributes' operator. You now need to adjust the operator's parameters so we get just the attributes we need. Select 'subset' for 'attribute filter type', then click 'Select Attributes'. In our case, we'll be using the following attributes (variables):
3. Discretize 'Survived' and 'PClass'
RapidMiner's Decision Tree operator requires that your prediction attribute (survived) be nominal (i.e. non-numeric). However, our dataset stores the survival status as either 1 or 0, i.e. as a number. The Discretize operators let you convert numeric attributes that really represent classes into nominal ones (binominal for two classes, polynominal for more than two, in RapidMiner's terminology). Add a 'Discretize by User Specification' operator to your process, and select 'single' for the 'attribute filter type' and 'survived' for the 'attribute'. This means you're going to discretize just one attribute: whether the passenger survived or not. Click the 'Edit List' button, and add two class names: 'DidSurvive' and 'DidNotSurvive'. Set the upper limit to '1' for DidSurvive and '0' for DidNotSurvive. If you connect your operator to the results port and run the process, you'll see the adjusted dataset with the words 'DidSurvive' and 'DidNotSurvive' now in the 'survived' attribute. This is what we need for the decision tree.
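The upper-limit idea can be sketched in a few lines of Python. This is only an illustration, not RapidMiner's implementation: it assumes a value falls into the class with the smallest upper limit that is greater than or equal to it, and `discretize` and `SURVIVED_CLASSES` are hypothetical names.

```python
import bisect

def discretize(value, classes):
    """Map a number to the class with the smallest upper limit >= value.

    `classes` is a list of (class_name, upper_limit) pairs in any order.
    """
    ordered = sorted(classes, key=lambda pair: pair[1])
    limits = [limit for _, limit in ordered]
    index = bisect.bisect_left(limits, value)
    if index == len(ordered):
        raise ValueError(f"{value} is above every upper limit")
    return ordered[index][0]

# The two classes entered in the 'Edit List' dialog (assumed semantics).
SURVIVED_CLASSES = [("DidSurvive", 1), ("DidNotSurvive", 0)]

print(discretize(0, SURVIVED_CLASSES))  # DidNotSurvive
print(discretize(1, SURVIVED_CLASSES))  # DidSurvive
```

Under that assumption, the order in which you type the classes doesn't matter, because each value is matched against the sorted limits.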
We also have another discretize task to perform. The 'pclass' attribute represents the cabin class for the passenger, with values from 1 (First Class) to 3 (Third Class). Because this attribute is numeric, RapidMiner will assume it's a continuous variable (like fare paid), and our decision tree could end up with nonsensical 'forks' for pclass like '<2.5', but there's no such thing as 'two and a half' class on board ships! In fact, we want to treat pclass as a discrete nominal attribute so that the tree forks on 1st, 2nd or 3rd class only. To do this, add another 'Discretize by User Specification' operator, select a single attribute ('pclass'), and add three classes with their appropriate values:
Join these two discretize operators to the flow by connecting the 'exa' output ports to the 'exa' input ports. We now have a dataset where cabin class (pclass) assumes one of the three textual values, and RapidMiner's decision tree won't split them into non-existent, non-integer pclass values.
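To see where a fork like '<2.5' can come from, here is a small Python sketch of one common textbook approach: a tree learner tries thresholds halfway between neighbouring distinct values of a numeric attribute. Whether RapidMiner picks its thresholds exactly this way is an assumption. The column values and the class names 'First', 'Second' and 'Third' are hypothetical.

```python
def candidate_thresholds(values):
    """Midpoints between neighbouring distinct values of a numeric attribute."""
    distinct = sorted(set(values))
    return [(low + high) / 2 for low, high in zip(distinct, distinct[1:])]

# Hypothetical pclass column while it is still numeric.
pclass_numeric = [1, 3, 2, 3, 1, 3]
print(candidate_thresholds(pclass_numeric))  # [1.5, 2.5]

# After discretizing, the attribute has named categories and no thresholds.
pclass_nominal = ["First", "Third", "Second", "Third", "First", "Third"]
print(sorted(set(pclass_nominal)))  # ['First', 'Second', 'Third']
```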
4. Set Role
On its own, RapidMiner's decision tree doesn't know what we're actually trying to predict with this model. How do we tell it? By marking the attribute to be predicted as the label. To do this, add a 'Set Role' operator, and choose 'survived' as the 'attribute name'. Now choose a 'target role' of 'label'. The label role tells RapidMiner that this attribute is the one the model should predict - in our case, whether the passenger survived or not. Note that each dataset can contain only one label attribute (all other attributes are termed 'regular').
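In code terms, setting a label simply separates the one attribute to be predicted from the regular attributes used as inputs. A minimal Python sketch follows, with a made-up example row and a hypothetical helper called `split_by_role`:

```python
def split_by_role(example, label_name):
    """Separate one example into its regular attributes and its label."""
    if label_name not in example:
        raise KeyError(f"no attribute called {label_name!r}")
    regular = {name: value for name, value in example.items() if name != label_name}
    return regular, example[label_name]

# Hypothetical example row as it might look after the earlier steps.
example = {"age": 29, "fare": 211.34, "pclass": "First", "sex": "F",
           "survived": "DidSurvive"}

regular, label = split_by_role(example, "survived")
print(sorted(regular))  # ['age', 'fare', 'pclass', 'sex']
print(label)            # DidSurvive
```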
5. Create a Decision Tree
At last we can insert our Decision Tree operator. Add this to your Process panel, and join it up to your data by connecting the output of the last operator we used ('Set Role') to the 'tra' (training) input port. For now, leave the Decision Tree's parameters as they are - once you've created a decision tree, you can look at modifying parameters like 'pruning' and 'maximal depth' to fine-tune the model, balancing simplicity with accuracy. Don't forget to connect the output from Decision Tree to the results connector so you can see the results.
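How does a decision tree choose which attribute to split on? One widely used criterion is information gain: how much a split reduces the uncertainty (entropy) of the label. The Python sketch below illustrates that general idea and is not RapidMiner's own code. The four rows are invented, and `entropy` and `information_gain` are hypothetical helper names.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, label):
    """Entropy of the label minus the weighted entropy after splitting on `attribute`."""
    base = entropy([row[label] for row in rows])
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[label])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder

# Invented toy rows: here 'sex' separates the outcomes perfectly, 'pclass' not at all.
rows = [
    {"sex": "F", "pclass": "First", "survived": "DidSurvive"},
    {"sex": "F", "pclass": "Third", "survived": "DidSurvive"},
    {"sex": "M", "pclass": "First", "survived": "DidNotSurvive"},
    {"sex": "M", "pclass": "Third", "survived": "DidNotSurvive"},
]

print(information_gain(rows, "sex", "survived"))     # 1.0
print(information_gain(rows, "pclass", "survived"))  # 0.0
```

In this toy data the tree would split on 'sex' first, because that split removes all of the uncertainty about the label.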
6. Run the Process
Click 'Run', and if you've set up your operators correctly, you should see the results in a view (tab) called 'Tree (decision tree)'. You should see that the first-level predictor attribute (the root of the tree) is 'sex': females were more likely than males to survive the Titanic's sinking. As you go a level further down, you should see 'pclass', i.e. cabin class, as another significant predictor of survival. More advanced users can then go back to the Process panel and adjust the Decision Tree parameters to change the objectives of the algorithm, creating different 'trees'.
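To get a feel for how a depth limit changes the resulting tree, here is a small Python sketch of a recursive tree builder for nominal attributes. It is a simplified, ID3-style illustration and not RapidMiner's algorithm: the rows are invented, there is no pruning, and `build_tree` and its `max_depth` argument are hypothetical names that only loosely mirror the 'maximal depth' parameter.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def majority(labels):
    """Most frequent label in the list."""
    return Counter(labels).most_common(1)[0][0]

def build_tree(rows, attributes, label, max_depth):
    """Return a leaf label, or a nested dict {attribute: {value: subtree}}."""
    labels = [row[label] for row in rows]
    # Stop when the node is pure, no attributes remain, or the depth budget is spent.
    if len(set(labels)) == 1 or not attributes or max_depth == 0:
        return majority(labels)

    def remainder(attribute):
        groups = {}
        for row in rows:
            groups.setdefault(row[attribute], []).append(row[label])
        return sum(len(g) / len(rows) * entropy(g) for g in groups.values())

    # Lowest remaining entropy is the same as highest information gain.
    best = min(attributes, key=remainder)
    rest = [a for a in attributes if a != best]
    branches = {}
    for value in sorted({row[best] for row in rows}):
        subset = [row for row in rows if row[best] == value]
        branches[value] = build_tree(subset, rest, label, max_depth - 1)
    return {best: branches}

# Invented toy rows, chosen so that 'sex' is the best first split.
rows = [
    {"sex": "F", "pclass": "First", "survived": "DidSurvive"},
    {"sex": "F", "pclass": "Second", "survived": "DidSurvive"},
    {"sex": "F", "pclass": "Third", "survived": "DidNotSurvive"},
    {"sex": "M", "pclass": "First", "survived": "DidNotSurvive"},
    {"sex": "M", "pclass": "Second", "survived": "DidNotSurvive"},
    {"sex": "M", "pclass": "Third", "survived": "DidNotSurvive"},
]

shallow = build_tree(rows, ["sex", "pclass"], "survived", max_depth=1)
deeper = build_tree(rows, ["sex", "pclass"], "survived", max_depth=2)
print(shallow)
print(deeper)
```

With a depth of 1, the toy tree only forks on 'sex'. With a depth of 2, it goes on to fork the female branch on 'pclass', trading a simpler tree for a closer fit to the training rows.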
That's it! You've loaded a training dataset and created a predictive model that predicts survivability based on a passenger list containing a set of influencing attributes.
Here's a screenshot of the completed model for you to check your own model against:
In the next article, we'll modify the process to test the model using a test dataset, and we'll see just how well (or not) it does its predictions.
Other useful articles
|http://auburnbigdata.blogspot.com.au/2013/03/decision-tree-in-rapidminer.html||Decision tree overview|
|http://www.simafore.com/blog/bid/107076/How-to-choose-optimal-decision-tree-model-parameters-in-Rapidminer||Decision Tree parameters|