RapidMiner - Introduction
RapidMiner is a well-established and popular provider of Predictive Analytics software, owned by the German company Rapid-I. It was initially developed at the Technical University of Dortmund in 2001; in 2006 its developers created an independent company dedicated to developing and distributing RapidMiner. The software is now distributed in multiple versions, ranging from a Basic (free) version up to a Professional (commercial) version, which includes support and consulting services as well as additional features. It is regarded as a pre-eminent, if not market-leading, Predictive Analytics tool, and has been downloaded 3,000,000 times, with an estimated user base of 200,000 including companies such as eBay, PepsiCo, Intel and Kraft Foods. The Professional version of RapidMiner includes extensive cloud-based access to high-performance computing resources, and integration with commercial database suppliers such as Oracle and MSSQL.
Accessing a dataset
Users of RapidMiner begin by running the application on their computer. In normal use, a user starts by importing a data source from the data store. This could be an Excel spreadsheet, a .CSV file, a MySQL database table or even a cloud-based Twitter feed (generally used for textual analysis: an exciting development that permits things like sentiment analysis over unstructured data). The next step is to define the data elements of the table, choosing data types and examining basic statistics on each column (sum, average, range, median, unique values etc.). RapidMiner provides quick summary statistics (including graphical visualisation) that give an overview of a dataset and help identify ranges and patterns. This step can assist in the next phase - cleansing, translation and sampling.
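RapidMiner performs the import and summary steps through its graphical interface, but the idea can be sketched in a few lines of Python. The example below is only an illustration, not RapidMiner's own API: the CSV content, the column names and the helper functions `load_rows` and `summarise_numeric` are all hypothetical.

```python
import csv
import io
import statistics

# Hypothetical CSV content standing in for an imported data source.
RAW_CSV = """customer_id,age,region
1,34,North
2,45,South
3,29,North
4,52,East
5,45,South
"""

def load_rows(text):
    """Parse CSV text into a list of dictionaries, one per row."""
    return list(csv.DictReader(io.StringIO(text)))

def summarise_numeric(rows, column):
    """Return the basic statistics mentioned above for one numeric column."""
    values = [float(row[column]) for row in rows]
    return {
        "sum": sum(values),
        "average": statistics.mean(values),
        "median": statistics.median(values),
        "range": (min(values), max(values)),
        "unique": len(set(values)),
    }

rows = load_rows(RAW_CSV)
age_summary = summarise_numeric(rows, "age")
print(age_summary)
```

Every value arrives from the CSV as text, which is why the data type of each column has to be chosen before any statistics can be calculated.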
Most Predictive Analytics models require the input dataset to be reformatted before statistical algorithms can be applied successfully. A process of data cleansing, translation (and often sampling) is required to make the dataset more suitable for model development. Cleansing activities include things like reformatting phone numbers or addresses, capitalising text values or removing outliers. Translation activities include creating new classification columns, recoding columns to standardise categories, or converting columns to ordinals for ranking purposes (often required for regression models). Datasets may additionally be sampled to restrict the data volume during model development.
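As a rough illustration of these pre-processing activities, the Python sketch below cleanses phone numbers and names, removes an outlier, recodes income into an ordinal column and draws a sample. The records, the field names, the income bands and the `INCOME_CAP` threshold are all assumptions made for this example; none of them come from RapidMiner.

```python
import random
import re

# Hypothetical raw records; field names and values are illustrative only.
RECORDS = [
    {"name": "alice smith", "phone": "(02) 5550 1234", "income": 48000},
    {"name": "BOB JONES", "phone": "02-5550-9876", "income": 91000},
    {"name": "carol wu", "phone": "0255504321", "income": 23000},
    {"name": "dan reed", "phone": "02 5550 1111", "income": 9900000},
    {"name": "eve hall", "phone": "(02)55502222", "income": 67000},
]

INCOME_CAP = 1_000_000  # assumed threshold above which a value counts as an outlier

def clean_phone(phone):
    """Cleansing: keep digits only so differently formatted numbers compare equal."""
    return re.sub(r"\D", "", phone)

def income_band(income):
    """Translation: recode a numeric column as an ordinal (1 = low, 3 = high)."""
    if income < 40000:
        return 1
    if income < 80000:
        return 2
    return 3

def prepare(record):
    """Apply the cleansing and translation steps to one record."""
    return {
        "name": record["name"].title(),
        "phone": clean_phone(record["phone"]),
        "income": record["income"],
        "income_band": income_band(record["income"]),
    }

# Remove outliers first, then cleanse and translate what remains.
cleaned = [prepare(r) for r in RECORDS if r["income"] <= INCOME_CAP]

# Sampling: a seeded generator keeps the sample reproducible between runs.
rng = random.Random(42)
sample = rng.sample(cleaned, k=3)
print(sample)
```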
Model Development using Processes
RapidMiner uses a 'Connected Process Flow' paradigm. This means that a user hooks up multiple processes together, with the result being a model suitable for running against datasets to gain insights or predictions. The two steps above (Accessing Datasets and Dataset Pre-processing) are themselves processes, and via the use of data inputs and outputs, processes are joined together to create the model. Once a processed dataset is available, the user begins choosing the statistical algorithms that produce the final model. It is at this point that the user decides which mathematical or statistical techniques to use, based on the objectives of the model. If clustering is the goal (categorising the data to create a new column that specifies a group) the user has over 20 different algorithms to choose from (eg. k-means). The output from one process may form the input of a new process - for example, after a clustering process, the user may wish to perform a regression analysis using the new cluster classification column as an independent variable. By connecting the data outputs of one process to the inputs of the next, the results of one process are 'fed' into the next. Finally, the model's results can be hooked up to an 'output' process eg. to display the outputs in a table or graph.
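The connected flow described above can be imitated in Python by treating each process as a function whose output becomes the next function's input. This is only a sketch of the idea, not how RapidMiner is implemented: the data, the column names, the function names and the very small one-column k-means are all assumptions made for illustration.

```python
from functools import reduce

# Hypothetical customer rows; column names are illustrative only.
ROWS = [
    {"spend": 10, "visits": 1},
    {"spend": 12, "visits": 2},
    {"spend": 11, "visits": 3},
    {"spend": 50, "visits": 8},
    {"spend": 55, "visits": 9},
    {"spend": 52, "visits": 10},
]

def kmeans_1d(values, k=2, iterations=10):
    """Minimal k-means on one numeric column (k >= 2); returns a cluster index per value."""
    ordered = sorted(values)
    # Deterministic start: spread the initial centroids evenly across the sorted values.
    centroids = [float(ordered[i * (len(ordered) - 1) // (k - 1)]) for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centroid.
        labels = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(values, labels) if label == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return labels

def cluster_process(rows):
    """Process 1: add a 'cluster' column derived from 'spend'."""
    labels = kmeans_1d([row["spend"] for row in rows], k=2)
    return [dict(row, cluster=label) for row, label in zip(rows, labels)]

def regression_process(rows):
    """Process 2: least-squares fit of 'visits' on the new 'cluster' column."""
    xs = [row["cluster"] for row in rows]
    ys = [row["visits"] for row in rows]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    covariance = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    variance = sum((x - x_mean) ** 2 for x in xs)
    slope = covariance / variance
    return {"slope": slope, "intercept": y_mean - slope * x_mean}

def output_process(model):
    """Process 3: render the fitted model as a line of text."""
    return f"visits = {model['intercept']:.2f} + {model['slope']:.2f} * cluster"

def run_flow(data, processes):
    """Feed the output of each process into the next one."""
    return reduce(lambda current, process: process(current), processes, data)

report = run_flow(ROWS, [cluster_process, regression_process, output_process])
print(report)
```

Because each process only depends on its input, a step can be swapped (for example, a different clustering algorithm) without touching the others, which is the main attraction of the process-flow approach.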
RapidMiner and Golf Play Model - a Predictive Analytics example
The venerable 'Hello World' program common to demonstrations of all programming languages has a near-equivalent in Predictive Analytics - the Golf Model. RapidMiner includes the required datasets and processes as an example model within the tool. The goal of the model is to predict the likelihood of playing golf (the dependent variable) based on a number of weather-related factors including temperature, humidity, outlook and windiness. The dataset is relatively small (14 observations) and lends itself to a number of modelling scenarios including predictive modelling, decision trees and clustering. The dataset has 'what actually happened' values, so you can test the sensitivity* and specificity** of the generated models.
Type I and Type II errors
Using standard terminology, you create a model and assess its value by testing it against the null hypothesis. The null hypothesis is 'no effect' or, in the case of the Golf dataset, might be 'weather does not influence the decision to play golf'. A Type I error occurs when you identify an effect that is not present (a 'false positive'). In this case, it means you conclude that weather influences playing golf, when in fact it doesn't. A Type II error occurs when you accept the null hypothesis when in fact there is an effect (and you, or your model and data, missed it).
In summary, a Type I error is detecting an effect that is not present, while a Type II error is failing to detect an effect that is present.
The following table illustrates these concepts, as applied to the golf dataset and predictive model:
|Concept||Meaning||Golf example|
|Null hypothesis||There is no effect ie. the model does not identify an effect||"Weather does not influence the likelihood of playing golf"|
|Type I error||A "false positive" ie. identifying an effect that is not actually there. This means rejecting the null hypothesis when it is in fact true||Identify that weather influences playing golf, when in fact it doesn't|
|Type II error||A "false negative" ie. assuming there is no effect, when in fact there is one. This means accepting the null hypothesis when it is in fact false||Identify that weather does not influence playing golf, when in fact it does|
When assessing a model, it is common to consider the risk of Type I and Type II errors. To do this you first identify your null hypothesis. You then choose a significance level (most commonly 0.05), which is the probability of a Type I error you are prepared to accept. A tool like Student's t-test produces a t-value, and from the t-value a p-value: the probability of seeing a result at least this extreme if the null hypothesis were true. You compare the p-value, not the t-value, against the significance level. If p<0.05, you can then say "I reject the null hypothesis, and a result this strong would occur less than 5% of the time if there were no effect" or, in other words, "Weather influences the playing of golf, and the association is unlikely to be just a random one". The probability of a Type II error is not set by this threshold; it depends on the sample size and on the size of the effect.
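The decision rule can be sketched in Python as follows. Two caveats apply. First, the humidity readings below are hypothetical and are not the values in RapidMiner's Golf dataset. Second, the Python standard library has no t-distribution, so this sketch swaps the t-test for an exact permutation test on the difference in means; it answers the same question (how likely is a difference this large if the null hypothesis is true?) but it is not Student's t-test.

```python
from itertools import combinations

# Hypothetical humidity readings (%), NOT the values shipped with RapidMiner.
humidity_play = [70, 65, 70, 75, 68, 72, 66, 71, 69]
humidity_no_play = [90, 95, 85, 92, 88]

ALPHA = 0.05  # significance level: the accepted probability of a Type I error

def average(values):
    return sum(values) / len(values)

def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation test on the difference in group means.

    Under the null hypothesis the group labels are interchangeable, so every
    way of choosing len(group_b) observations from the pooled data is equally
    likely. The p-value is the share of those relabellings whose difference in
    means is at least as large as the one actually observed.
    """
    pooled = group_a + group_b
    size_b = len(group_b)
    size_a = len(pooled) - size_b
    total = sum(pooled)
    observed = abs(average(group_b) - average(group_a))
    extreme = 0
    relabellings = 0
    for indices in combinations(range(len(pooled)), size_b):
        sum_b = sum(pooled[i] for i in indices)
        difference = abs(sum_b / size_b - (total - sum_b) / size_a)
        # Small tolerance so floating-point rounding cannot hide an exact tie.
        if difference >= observed - 1e-12:
            extreme += 1
        relabellings += 1
    return extreme / relabellings

p_value = permutation_p_value(humidity_play, humidity_no_play)
reject_null = p_value < ALPHA
print(f"p = {p_value:.4f}, reject null hypothesis: {reject_null}")
```

With only 14 observations there are 2,002 possible relabellings, so every one of them can be enumerated; a larger dataset would need random sampling of relabellings or a parametric test instead.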
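* Sensitivity - the proportion of actual positives that the model correctly identifies (the true positive rate) ie. on the days when golf was actually played, how often does the model predict play? A model with high sensitivity produces few false negatives.
** Specificity - the proportion of actual negatives that the model correctly identifies (the true negative rate) ie. on the days when golf was not played, how often does the model predict no play? A model with high specificity produces few false positives.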
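Both measures follow directly from comparing a model's predictions with the 'what actually happened' values. The sketch below shows the arithmetic in Python; the actual and predicted values are invented for illustration and do not come from the Golf dataset or from any RapidMiner model.

```python
# Hypothetical outcomes for 14 days (True = golf was played).
ACTUAL = [True] * 9 + [False] * 5
# Hypothetical model output: 2 play days missed, 1 no-play day predicted as play.
PREDICTED = [True] * 7 + [False] * 2 + [False] * 4 + [True]

def confusion_counts(actual, predicted):
    """Count true/false positives and negatives from paired outcomes."""
    counts = {"tp": 0, "fn": 0, "tn": 0, "fp": 0}
    for was_played, predicted_play in zip(actual, predicted):
        if was_played and predicted_play:
            counts["tp"] += 1  # played, and the model said play
        elif was_played:
            counts["fn"] += 1  # played, but the model said no play
        elif predicted_play:
            counts["fp"] += 1  # not played, but the model said play
        else:
            counts["tn"] += 1  # not played, and the model said no play
    return counts

counts = confusion_counts(ACTUAL, PREDICTED)

# Sensitivity: share of actual play days the model caught.
sensitivity = counts["tp"] / (counts["tp"] + counts["fn"])
# Specificity: share of actual no-play days the model caught.
specificity = counts["tn"] / (counts["tn"] + counts["fp"])

print(counts)
print(f"sensitivity = {sensitivity:.3f}, specificity = {specificity:.3f}")
```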