Learning From Data - Homework 6

Using LIONoso orchestration

Here we provide a complete workflow setup for exercises 2, 3, 4, and 5 using the LIONoso polynomial fit factory.

Before proceeding, make sure that Python and a recent version of LIONoso (at least version 2.1.45) are installed on your computer.

We want to analyze the performance of a linear classifier applied to a non-linear transform of two-dimensional input samples (x1, x2):

Φ(x1, x2) = (1, x1, x2, x1², x2², x1x2, |x1 - x2|, |x1 + x2|).
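
If you want to spot-check individual transformed values by hand, the following Python snippet is a minimal sketch of the transform (the function name phi is illustrative only; in the workflow below the computation is done inside LIONoso):

    # Minimal sketch of the transform Phi, for spot-checking values by hand.
    def phi(x1, x2):
        return (1.0, x1, x2,
                x1 * x1, x2 * x2, x1 * x2,
                abs(x1 - x2), abs(x1 + x2))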

We are given two data files. The first, in.dta, contains 35 training triplets (two inputs and the +/-1 output class). The second, out.dta, contains 250 test triplets.
The files are not encoded in CSV format: each line contains one triplet, with the numbers separated and justified by blanks, as in the following excerpt:

  -7.7947021e-01   8.3822138e-01   1.0000000e+00
   1.5563491e-01   8.9537743e-01   1.0000000e+00
  -5.9907703e-02  -7.1777995e-01   1.0000000e+00
   2.0759636e-01   7.5893338e-01   1.0000000e+00

Our goal is to assess the in-sample and out-of-sample success rates of the classifier with different regularization weights λ = 0, 10⁻³, 10⁻², 10⁻¹, 1, 10, 10², 10³.

Importing the data files in LIONoso

LIONoso cannot directly parse the data format of the sample files, but we can still load them into the application by using the Big data or unparsed file tool:

In order for LIONoso to work on these datasets, we need to make them readable by converting them to CSV format. We can do this with the Python script dta2csv.py, which reads each line of a data file and rewrites it in CSV format.
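
If you are curious about what the conversion involves, here is a minimal sketch of such a script; the dta2csv.py distributed with this tutorial may differ in its details and calling convention (the sketch assumes the input and output file names are passed as command-line arguments, and writes a header row with the column names x1, x2, x3):

    # dta2csv.py -- illustrative sketch only; the script provided with this
    # tutorial may use a different calling convention.
    # Assumed usage: python dta2csv.py input.dta output.csv
    import sys

    def main():
        infile, outfile = sys.argv[1], sys.argv[2]
        with open(infile) as src, open(outfile, "w") as dst:
            dst.write("x1,x2,x3\n")            # column names (assumed header)
            for line in src:
                fields = line.split()          # blank-separated numbers
                if fields:                     # skip empty lines
                    dst.write(",".join(fields) + "\n")

    if __name__ == "__main__":
        main()
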
Import the script into LIONoso twice (one instance for each file to be converted) using the "Orchestration/Table manipulation/Shell executable" tool, connect each instance to its input file, and press "Run". The resulting CSV tables will be loaded automatically. Remember that you can rename them if the names chosen by LIONoso look too complex.

The tables contain three numeric columns each, called x1, x2, and x3. You can check them by double-clicking their symbol.

Computing the non-linear function Φ

In order to apply the required non-linear transformation, we introduce a Javascript function node. Drag the Orchestration/Model/Javascript tool onto the workbench; fill in the following code:

phi1 = x1 * x1;               // x1^2
phi2 = x2 * x2;               // x2^2
phi3 = x1 * x2;               // x1*x2
phi4 = Math.abs(x1 - x2);     // |x1 - x2|
phi5 = Math.abs(x1 + x2);     // |x1 + x2|
y = x3;                       // preserve the +/-1 output class

Note that you don't need to restate the linear components, because you will obtain them in the output table anyway. Also, the constant component “1” does not appear because it will be automatically added by the linear classifier.
The last line is used to preserve the output column, which would otherwise be removed from the output table because it would be unused.
Press "Check code and proceed", then reorder the input and output variable if you wish. Press "Complete function" when you are done: the Phi transform is ready to be applied to your input tables. You can also rename the node “Phi” for clarity:

Connect the two CSV tables to the Javascript node and two new tables will appear, each containing the original values plus the outputs phi1,...,phi5,y of the function:

You can double click the new tables to inspect their content.

Applying the linear classifier

Drag the Models/Polynomials/Polynomial fit factory onto the workbench and connect it to the transformed in-Phi table. Select the following input columns: x1, x2, phi1, phi2, phi3, phi4, phi5. Select either y or x3 as output column. Finally, uncheck both normalization checkboxes (if you leave them checked, results will be slightly different) and press “Start training”:

To apply the trained classifier to the training data in order to obtain the in-sample success rate, connect the “in-Phi” table to the classifier. You'll obtain a new table; by pressing the “Perform error analysis” button in the property pane, you'll obtain various data about the classifier; we are interested in the “Success rate wrt target center line” figure:

Likewise, obtain the out-of-sample success rate by connecting the “out-Phi” table (the transformed out.dta data) to the classifier and performing the error analysis:

Retraining the classifier with a nonzero regularization factor

We are interested in checking the performance of a linear classifier with different values of a regularization factor λ (also known as weight decay).

To retrain the classifier, just click the “PolyFit1” icon, modify the “Regularization factor” value, and press “Start training”.
After a short training period, you can click on the in-sample and out-of-sample output tables and look at the error analysis area.
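
If you want to double-check these figures outside LIONoso, the following numpy sketch loads in.dta and out.dta (assumed to be in the current directory), applies the Phi transform, fits a weight-decay model for each value of λ with the standard regularized least-squares formula w = (ZᵀZ + λI)⁻¹Zᵀy, and reports the fraction of correctly classified points. This formula is an assumption of the sketch, not a description of LIONoso's internals, so the numbers may differ slightly depending on how the tool scales the regularization term and handles the intercept:

    # Cross-check sketch; assumes in.dta and out.dta are in the current directory.
    import numpy as np

    def transform(data):
        x1, x2 = data[:, 0], data[:, 1]
        # Columns: 1, x1, x2, x1^2, x2^2, x1*x2, |x1-x2|, |x1+x2|
        return np.column_stack([np.ones_like(x1), x1, x2,
                                x1 ** 2, x2 ** 2, x1 * x2,
                                np.abs(x1 - x2), np.abs(x1 + x2)])

    train, test = np.loadtxt("in.dta"), np.loadtxt("out.dta")
    Z_in, y_in = transform(train), train[:, 2]
    Z_out, y_out = transform(test), test[:, 2]

    for lam in [0, 1e-3, 1e-2, 1e-1, 1, 10, 1e2, 1e3]:
        # Weight decay: w = (Z^T Z + lambda * I)^-1 Z^T y
        w = np.linalg.solve(Z_in.T @ Z_in + lam * np.eye(Z_in.shape[1]),
                            Z_in.T @ y_in)
        rate_in = np.mean(np.sign(Z_in @ w) == y_in)
        rate_out = np.mean(np.sign(Z_out @ w) == y_out)
        print(f"lambda={lam:g}  in-sample={rate_in:.6f}  out-of-sample={rate_out:.6f}")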

If you check the in-sample and out-of-sample error analysis figures after retraining for different values of the regularization factor, you should obtain the following results:

Regularization factor    In-sample success rate    Out-of-sample success rate
0                        0.971429                  0.916
10⁻³                     0.971429                  0.92
10⁻²                     0.971429                  0.916
10⁻¹                     0.942857                  0.94
1                        1                         0.908
10                       0.942857                  0.876
10²                      0.771429                  0.772
10³                      0.628571                  0.564

Notes for Windows users

While on most UNIX-based systems (such as Linux and Mac OS X) it is possible to declare the script interpreter in the top line of the script, Windows bases the choice of the interpreter on the filename extension. There can be two types of problems:

  1. The interpreter is installed, but it did not register the file extension (as happens, for example, with R)
  2. A specialized application “stole” the file extension and is executed in place of the interpreter (as happens, for example, with Canopy, which appropriates Python's .py extension)
In these cases, it is possible to execute the script from within LIONoso by providing a “wrapper” batch script. In the Python case, use a text editor (e.g., Notepad) to create a file dta2csv.bat containing the following text:
        @echo off
        C:\Python27\python.exe dta2csv.py %*
where C:\Python27\python.exe must be replaced by the path of the python.exe executable on your system. Next, import this file using the "Orchestration/Table manipulation/Shell executable" tool.