PMSG: Predictive Modelling with Sparse Grids
--------------------------------------------

PMSG provides methods for approximating surfaces fitted to
multi-dimensional data. Such surfaces can be used to model data and to
predict values of response variables at new data points.


Elements of the system
----------------------

The system consists of three logically distinct components: Datasets,
Learners, and Models.

The interplay is as follows: given a dataset D and a learner L, L can be
trained on D, resulting in a model M:

    M = L(D)

M, in turn, can be evaluated on a dataset, yielding predicted values of
the response variables, which themselves form a dataset. Formally:

    D_pred = M(D)


Datasets
--------

A new dataset object can be created from existing files, artificial
data, Python lists, strings, etc. Each dataset is implemented as a
Python module which takes care of conversions to the internal format.

PMSG handles the following formats:

  - Text files with columns delimited by any separator.

  - Text files following the UC Irvine format: one file contains the
    raw data as above; a second contains information about column
    names and types. See Datasets/housing.data and
    Datasets/housing.info for an example.

  - Data in the form of a Numerical array.

See the examples in the directory Datasets:

  - Datasets/DATApeaks.py reads and converts the file peaks.dat, which
    is a standard three-column representation of the dataset used for
    the Matlab logo.

  - Datasets/DATAtest.py creates an artificial dataset.

  - Datasets/DATAhousing.py reads and converts the Irvine dataset
    housing, which consists of a metadata file housing.info and the
    comma-separated data file housing.data.

Limitations are that missing values are not (yet) dealt with and that
there is only limited support for categorical variables. See the
respective dataset implementations for further details.

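
As a rough illustration of the first format listed above (text with
columns delimited by any separator), the sketch below parses such text
into rows of numbers. It is not PMSG's own conversion code:
read_delimited is a hypothetical helper introduced only for this
example, and it assumes every column is numeric. In keeping with the
limitation noted above, a missing value is reported as an error rather
than handled.

```python
def read_delimited(text, separator=None):
    """Parse delimited text into a list of rows of floats.

    Hypothetical helper, not part of PMSG.
    separator=None splits on any run of whitespace; otherwise the given
    string is used as the column delimiter. Blank lines are skipped.
    """
    rows = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue
        fields = line.split(separator)
        try:
            rows.append([float(field) for field in fields])
        except ValueError:
            # An empty or non-numeric field ends up here.
            raise ValueError(
                "line %d: missing or non-numeric value" % line_number)
    # All rows must have the same number of columns.
    if rows and any(len(row) != len(rows[0]) for row in rows):
        raise ValueError("rows have differing numbers of columns")
    return rows


whitespace_rows = read_delimited("1 2 3\n4 5 6\n")
comma_rows = read_delimited("1.5, 2.5\n3.5, 4.5\n", separator=",")
```

Here whitespace_rows holds two rows of three values and comma_rows
holds two rows of two values; a line such as "1,,3" read with
separator="," raises ValueError.
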
Learners
--------

At the moment we have the following variants of the sparse grids
predictive modelling technique:

  PMsparse.py:    The standard sparse grids regression module.

  PMsparseCG.py:  Sparse grids regression module improved with a
                  conjugate-gradient iterative technique.

  PMfull.py:      A reference method which computes the full
                  high-dimensional regression. Useful for tests but
                  not feasible for high-dimensional problems.


Models
------

These are the components that are returned from a Learner when it has
been trained on a dataset. Models represent the non-parametric
(e.g. sparse grids) model of the data and can be evaluated on either
the same data or new data. To evaluate a model M on a dataset D we
write M(D), and the result is a new dataset representing the predicted
values.


EXAMPLE 1
---------

Here is a simple annotated example taken from demo.py:

    import PM                        # Import the predictive modelling
                                     # framework. Options can be set
                                     # using the method PM.set_option.
                                     # See PM.py for possible options.

    from Datasets import DATAcircle as dataset
                                     # Then import the data object for
                                     # the data that you wish to model.
    data = dataset.data

    import PMsparseCG as learner     # Import the learner to use.
                                     # Learners have options too.
                                     # See module code for a complete
                                     # example.

    PM.evaluate(learner, data)       # Take some action - in this
                                     # case evaluate the quality
                                     # of learner on the dataset.


EXAMPLE 2
---------

Here is another example, where the model is evaluated on a test
dataset:

    import PM
    from Datasets import DATAcircle as dataset
    data = dataset.data
    import PMsparseCG as learner

    training_set, test_set = data.split()    # Split into two sets
    model = PM.train(learner, training_set)  # Train
    P = model.evaluate(test_set)             # Predict function values of
                                             # test_set using model
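
The relations M = L(D) and D_pred = M(D) can also be sketched without
PMSG at all. The toy code below is only an illustration of the
Dataset / Learner / Model interplay: ToyDataset, LineModel and
line_learner are hypothetical names, not PMSG classes, and a
one-dimensional least-squares line fit stands in for the sparse grids
technique. The even/odd split is likewise an assumption made for the
example and says nothing about how data.split() behaves in PMSG.

```python
class ToyDataset:
    """Hypothetical dataset: predictor values x and response values y."""

    def __init__(self, x, y):
        self.x = list(x)
        self.y = list(y)

    def split(self):
        """Split into two sets: even-indexed and odd-indexed rows."""
        return (ToyDataset(self.x[0::2], self.y[0::2]),
                ToyDataset(self.x[1::2], self.y[1::2]))


class LineModel:
    """Model M: a fitted line y = slope * x + intercept."""

    def __init__(self, slope, intercept):
        self.slope = slope
        self.intercept = intercept

    def __call__(self, dataset):
        """D_pred = M(D): return a dataset of predicted responses."""
        predicted = [self.slope * x + self.intercept for x in dataset.x]
        return ToyDataset(dataset.x, predicted)


def line_learner(dataset):
    """Learner L: one-dimensional least squares, so that M = L(D)."""
    n = len(dataset.x)
    mean_x = sum(dataset.x) / n
    mean_y = sum(dataset.y) / n
    sxx = sum((x - mean_x) ** 2 for x in dataset.x)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(dataset.x, dataset.y))
    slope = sxy / sxx
    return LineModel(slope, mean_y - slope * mean_x)


# Artificial data lying exactly on the line y = 2x + 1.
data = ToyDataset(range(10), [2 * x + 1 for x in range(10)])
training_set, test_set = data.split()
model = line_learner(training_set)    # M = L(D)
predictions = model(test_set)         # D_pred = M(D)
```

Because the artificial data is exactly linear, the predicted responses
in predictions.y agree with test_set.y up to rounding. Making the model
callable mirrors the M(D) notation used in the Models section.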