A PMML file can be described by the following components:
Header: contains general information about the PMML document, such as copyright information for the model, its description, and the name and version of the application used to generate the model. It can also carry a timestamp that records the date of model creation.
Data Dictionary: contains definitions for all the possible fields used by the model. It is here that a field is defined as continuous, categorical, or ordinal. Depending on this definition, the appropriate value ranges are then defined as well as the data type.
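As a rough illustration of the Header and Data Dictionary components, the following Python sketch parses a small hand-written PMML fragment with the standard library. The fragment itself, including the field names `age` and `colour`, the application name, and the timestamp, is invented for this example, and the timestamp is written as a `Timestamp` child element, which is an assumption about the schema layout rather than something stated above.

```python
import xml.etree.ElementTree as ET

# Hypothetical PMML fragment: all names and values are made up for illustration.
PMML_DOC = """<PMML version="4.1" xmlns="http://www.dmg.org/PMML-4_1">
  <Header copyright="Example Corp" description="Illustrative model">
    <Application name="ExampleTool" version="1.0"/>
    <Timestamp>2012-01-01T00:00:00</Timestamp>
  </Header>
  <DataDictionary numberOfFields="2">
    <DataField name="age" optype="continuous" dataType="double">
      <Interval closure="closedClosed" leftMargin="0" rightMargin="120"/>
    </DataField>
    <DataField name="colour" optype="categorical" dataType="string">
      <Value value="red"/>
      <Value value="blue"/>
    </DataField>
  </DataDictionary>
</PMML>"""

# Namespace of the PMML 4.1 schema, bound to the prefix "p" for lookups.
NS = {"p": "http://www.dmg.org/PMML-4_1"}


def summarize(pmml_text):
    """Collect header information and field definitions from a PMML string."""
    root = ET.fromstring(pmml_text)
    header = root.find("p:Header", NS)
    app = header.find("p:Application", NS)
    fields = {}
    for field in root.findall("p:DataDictionary/p:DataField", NS):
        fields[field.get("name")] = {
            "optype": field.get("optype"),      # continuous, categorical or ordinal
            "dataType": field.get("dataType"),
            "values": [v.get("value") for v in field.findall("p:Value", NS)],
        }
    return {
        "description": header.get("description"),
        "application": (app.get("name"), app.get("version")),
        "timestamp": header.findtext("p:Timestamp", namespaces=NS),
        "fields": fields,
    }


summary = summarize(PMML_DOC)
```

Here `summary["fields"]` maps each field name to its operational type, data type, and any enumerated values, which mirrors how the data dictionary ties a field's type to its permitted value range.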
Data Transformations: transformations allow for the mapping of user data into a more desirable form to be used by the mining model. PMML defines several kinds of simple data transformations.
* Normalization: map values to numbers; the input can be continuous or discrete.
* Discretization: map continuous values to discrete values.
* Value mapping: map discrete values to discrete values.
* Functions: derive a value by applying a function to one or more parameters.
* Aggregation: used to summarize or collect groups of values.
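The five kinds of transformation listed above can be sketched in plain Python. This is only an approximation of the ideas, not a PMML engine: the function names, the half-open bin convention, the linear extrapolation outside the normalization range, and the small `BUILTINS` table are all assumptions made for this example.

```python
import operator
from bisect import bisect_right


def norm_continuous(x, points):
    """Normalization: piecewise-linear map through sorted (orig, norm) points."""
    origs = [orig for orig, _ in points]
    # Choose the segment containing x; values outside the range reuse the
    # first or last segment (linear extrapolation, an assumption of this sketch).
    i = min(max(bisect_right(origs, x) - 1, 0), len(points) - 2)
    (o0, n0), (o1, n1) = points[i], points[i + 1]
    return n0 + (x - o0) * (n1 - n0) / (o1 - o0)


def discretize(x, bins, default=None):
    """Discretization: bins are (low, high, label) with low <= x < high."""
    for low, high, label in bins:
        if low <= x < high:
            return label
    return default


def map_values(value, table, default=None):
    """Value mapping: look up a discrete value in a discrete-to-discrete table."""
    return table.get(value, default)


# Stand-ins for a function library; the selection here is illustrative only.
BUILTINS = {"+": operator.add, "*": operator.mul, "min": min, "max": max}


def apply_function(name, *args):
    """Functions: derive a value by applying a named function to parameters."""
    return BUILTINS[name](*args)


def aggregate(rows, group_key, value_key, function=sum):
    """Aggregation: summarize the values of one column per group."""
    groups = {}
    for row in rows:
        groups.setdefault(row[group_key], []).append(row[value_key])
    return {key: function(values) for key, values in groups.items()}
```

For instance, `norm_continuous(5.0, [(0.0, 0.0), (10.0, 1.0)])` scales a value from the range 0 to 10 onto 0 to 1, and `discretize(42, [(0, 18, "minor"), (18, 65, "adult")])` turns an age into a category.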
Model: contains the definition of the data mining model. For example, a multi-layered feedforward neural network is represented in PMML by a "NeuralNetwork" element, which contains attributes such as:
* Model Name
* Function Name
* Algorithm Name
* Activation Function
* Number of Layers
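To make the attribute list above concrete, the sketch below builds a bare "NeuralNetwork" element with Python's standard library. The attribute spellings (`modelName`, `functionName`, `algorithmName`, `activationFunction`, `numberOfLayers`) are assumed to be the XML names of the attributes listed above, and all values are invented. The result is only a skeleton: a usable model would also need its mining schema, inputs, neurons, and outputs, which are omitted here.

```python
import xml.etree.ElementTree as ET

# Skeleton of a NeuralNetwork element; attribute values are hypothetical.
network = ET.Element("NeuralNetwork", {
    "modelName": "ExampleNet",
    "functionName": "classification",
    "algorithmName": "backpropagation",   # free-text label, chosen for illustration
    "activationFunction": "logistic",
    "numberOfLayers": "2",
})

# Add one empty NeuralLayer placeholder per declared layer.
for _ in range(int(network.get("numberOfLayers"))):
    ET.SubElement(network, "NeuralLayer")

xml_text = ET.tostring(network, encoding="unicode")
```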
Mining Schema: a list of all fields used in the model. This can be a subset of the fields as defined in the data dictionary. It contains specific information about each field, such as:
* Name: must refer to a field in the data dictionary
* Usage type: defines the way a field is to be used in the model. Typical values are: active, predicted, and supplementary. Predicted fields are those whose values are predicted by the model.
* Outlier Treatment: defines the outlier treatment to be used. In PMML, outliers can be treated as missing values, as extreme values, or as is.
* Missing Value Replacement Policy: if this attribute is specified, a missing value is automatically replaced by the given value.
* Missing Value Treatment: indicates how the missing value replacement was derived.
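The effect of the outlier and missing-value settings on a single input can be sketched as follows. This is a simplified model under stated assumptions: the `MiningField` class and its `low_value` and `high_value` bounds are illustrative names introduced here, `None` stands for a missing value, and applying outlier treatment before missing-value replacement is an ordering chosen for this example, not one taken from the text above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MiningField:
    """Illustrative stand-in for one mining schema entry (numeric fields only)."""
    name: str
    usage_type: str = "active"              # active, predicted or supplementary
    outliers: str = "asIs"                  # asIs, asMissingValues or asExtremeValues
    low_value: Optional[float] = None       # assumed lower bound for outlier checks
    high_value: Optional[float] = None      # assumed upper bound for outlier checks
    missing_value_replacement: Optional[float] = None


def prepare(field, value):
    """Apply outlier treatment, then missing-value replacement, to one input."""
    if value is not None and field.outliers != "asIs":
        below = field.low_value is not None and value < field.low_value
        above = field.high_value is not None and value > field.high_value
        if field.outliers == "asMissingValues" and (below or above):
            value = None                    # treat the outlier as missing
        elif field.outliers == "asExtremeValues":
            if below:
                value = field.low_value     # clamp to the lower bound
            elif above:
                value = field.high_value    # clamp to the upper bound
    if value is None and field.missing_value_replacement is not None:
        value = field.missing_value_replacement
    return value
```

With `outliers="asExtremeValues"` and bounds of 0 and 100, an input of 250 is clamped to 100; with `outliers="asMissingValues"` and a replacement of 42, the same input becomes 42.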
Targets: allows for post-processing of the predicted value in the form of scaling if the output of the model is continuous. Targets can also be used for classification tasks. In this case, the attribute priorProbability specifies a default probability for the corresponding target category. It is used if the prediction logic itself did not produce a result. This can happen, for example, if an input value is missing and there is no other method for treating missing values.
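Both uses of Targets can be sketched in a few lines of Python. The function names are hypothetical, and the scaling is assumed to be a linear one (multiply by a factor, then add a constant); the fallback simply returns the category with the highest prior probability when the model produced no result, represented here by `None`.

```python
def rescale(raw_prediction, rescale_factor=1.0, rescale_constant=0.0):
    """Continuous output: scale the raw prediction linearly (assumed form)."""
    return raw_prediction * rescale_factor + rescale_constant


def classify_with_priors(predicted_category, prior_probabilities):
    """Classification: fall back to the most probable prior if nothing was predicted."""
    if predicted_category is not None:
        return predicted_category
    return max(prior_probabilities, key=prior_probabilities.get)
```

For example, `classify_with_priors(None, {"yes": 0.3, "no": 0.7})` falls back to `"no"`, while any actual prediction is passed through unchanged.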
Output: this element can be used to name all the desired output fields expected from the model. These are features of the predicted field and so are typically the predicted value itself, the probability, cluster affinity, standard error, etc. PMML 4.1 extended Output to allow for generic post-processing of model outputs: all the built-in and custom functions that were originally available only for pre-processing became available for post-processing too.
PMML 4.0, 4.1, 4.2 and 4.3
PMML 4.0 was released on June 16, 2009. Examples of new features included:
The Data Mining Group is a consortium managed by the Center for Computational Science Research, Inc., a nonprofit founded in 2008. The Data Mining Group also developed a standard called Portable Format for Analytics, or PFA, which is complementary to PMML.