PROFASI  Version 1.5
XML templates: Auto-generate complex XML nodes from tabular data

XML data often consists of a long sequence of nodes of identical format that differ only in their entries. Although it is more convenient to analyze such data as a sequence of XML nodes, writing out each node in standard XML is unnecessarily verbose. Storing the data as a table, in a file with comma- or whitespace-separated columns, is space efficient and convenient for scripting. But that approach commits us to a fixed column layout and does not accommodate unexpected new directions well. ProFASi's XML module provides a template handling mechanism to auto-generate XML nodes from tabular data.

An XML_Node with a child node named "formatted_data" can replace that child with a series of child nodes generated from its contents. Each formatted_data node must have children named "format" and "data" (see the example below). Each line of input inside the data field is interpreted as one record, represented by the XML nodes described in the format field. The format field can address the individual fields of a data line in awk style: $1, $2, etc. Example:

<some_node>
<arbitrary_property_1>
  This is some generic property.
</arbitrary_property_1>
<formatted_data>
  <format name="snapshot">
    <MC_time>$1</MC_time>
    <temperature>$2</temperature>
    <energy>$3</energy>
    <helix>$4</helix>
    <strand>$5</strand>
  </format>
  <data>
    999  0  33.391879  0.071429  0.357143
    1999  3  67.611951  0.000000  0.071429
    2999  2  53.211118  0.000000  0.000000
  </data>
</formatted_data>
</some_node>

If the function prf_xml::XML_Node::interpret_formatted_data() is called on the XML_Node some_node, the content of some_node changes to the following, with one snapshot node (named after the name attribute of the format node) per data line:

<some_node>
<arbitrary_property_1>
  This is some generic property.
</arbitrary_property_1>
<snapshot>
    <MC_time>999</MC_time>
    <temperature>0</temperature>
    <energy>33.391879</energy>
    <helix>0.071429</helix>
    <strand>0.357143</strand>
</snapshot>
<snapshot>
    <MC_time>1999</MC_time>
    <temperature>3</temperature>
    <energy>67.611951</energy>
    <helix>0.000000</helix>
    <strand>0.071429</strand>
</snapshot>
<snapshot>
    <MC_time>2999</MC_time>
    <temperature>2</temperature>
    <energy>53.211118</energy>
    <helix>0.000000</helix>
    <strand>0.000000</strand>
</snapshot>
</some_node>
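
To make the substitution rule concrete, the following is a minimal standalone C++ sketch of the awk-style expansion. It is not ProFASi's implementation and does not use prf_xml; the function names split_fields, fill_template and expand_records are hypothetical, and it works on plain strings rather than XML_Node objects. It assumes whitespace-separated fields, skips blank lines, and replaces references to missing fields with an empty string, all of which are choices made for this illustration.

```cpp
#include <cctype>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Split one data line into whitespace-separated fields.
std::vector<std::string> split_fields(const std::string &line)
{
    std::istringstream in(line);
    std::vector<std::string> fields;
    std::string f;
    while (in >> f) fields.push_back(f);
    return fields;
}

// Replace every $N in tmpl with the N-th field (1-based) of one record.
// A reference to a field that does not exist becomes an empty string
// (an assumption of this sketch, not documented ProFASi behavior).
std::string fill_template(const std::string &tmpl,
                          const std::vector<std::string> &fields)
{
    std::string out;
    for (std::size_t i = 0; i < tmpl.size();) {
        bool is_ref = tmpl[i] == '$' && i + 1 < tmpl.size() &&
                      std::isdigit(static_cast<unsigned char>(tmpl[i + 1]));
        if (!is_ref) {
            out += tmpl[i++];
            continue;
        }
        // Read the field index that follows the '$'.
        std::size_t j = i + 1, n = 0;
        while (j < tmpl.size() &&
               std::isdigit(static_cast<unsigned char>(tmpl[j]))) {
            n = 10 * n + static_cast<std::size_t>(tmpl[j] - '0');
            ++j;
        }
        if (n >= 1 && n <= fields.size()) out += fields[n - 1];
        i = j;
    }
    return out;
}

// Emit one <record_name>...</record_name> block per non-empty data line.
std::string expand_records(const std::string &record_name,
                           const std::string &tmpl,
                           const std::string &data)
{
    std::istringstream in(data);
    std::string line, out;
    while (std::getline(in, line)) {
        std::vector<std::string> fields = split_fields(line);
        if (fields.empty()) continue;  // skip blank lines
        out += "<" + record_name + ">" + fill_template(tmpl, fields) +
               "</" + record_name + ">\n";
    }
    return out;
}
```

With these helpers, expand_records("snapshot", "<MC_time>$1</MC_time>", "999 0 33.391879") would return a single snapshot record containing <MC_time>999</MC_time>. Changing the column order in the data then only requires changing the $N references in the template, which is the flexibility the format block is meant to provide.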

The advantage is that an application can process the data by accessing XML nodes by name, while the correspondence between those nodes and the columns of a tabular file is left completely open. An external application can therefore generate its data in any column layout. As long as a format block is added and the data is copied into a data field, the processing program can interpret it without any change.

Tip: Inside a formatted_data node, an import_data node can be used in place of the data node. The content of the import_data node should be a file name. During interpretation, the block

<import_data>tabular_file.dat</import_data>

is interpreted as if it were

<data>
contents of the file tabular_file.dat
</data>

This way, you can construct many small XML files of arbitrarily different node structure using the information in the same tabular data file.
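
The equivalence between import_data and data can be pictured with a small standalone C++ sketch. It is an illustration only, not ProFASi code: the helper names trim and load_data_block are hypothetical, and trimming whitespace around the file name and throwing on a missing file are assumptions of this sketch rather than documented ProFASi behavior.

```cpp
#include <cstddef>
#include <fstream>
#include <sstream>
#include <stdexcept>
#include <string>

// Remove leading and trailing whitespace, since the text content of an
// XML node is often surrounded by spaces or newlines.
std::string trim(const std::string &s)
{
    const char *ws = " \t\r\n";
    std::size_t b = s.find_first_not_of(ws);
    if (b == std::string::npos) return "";
    std::size_t e = s.find_last_not_of(ws);
    return s.substr(b, e - b + 1);
}

// Hypothetical helper: given the content of an import_data node (a file
// name), return the file's text, which then plays the role of the content
// of a data node.
std::string load_data_block(const std::string &import_data_content)
{
    const std::string filename = trim(import_data_content);
    std::ifstream in(filename);
    if (!in) throw std::runtime_error("cannot open " + filename);
    std::ostringstream buf;
    buf << in.rdbuf();  // copy the whole file into the buffer
    return buf.str();
}
```

The returned string can be processed line by line exactly like inline data, so several XML files with different format blocks can point at the same tabular file.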

Formatted data is interpreted only when interpret_formatted_data() is called on the specific XML node. All ProFASi simulation programs that read XML configurations do this by default, but only because they explicitly call interpret_formatted_data() after retrieving an XML node from the XML file. If you want to use this feature in a new program, make sure you do something like this:

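// Parse the XML file and obtain the root node of the resulting tree.
prf_xml::XML_Node * root=prf_xml::get_xml_tree("some_file.xml");
// Replace formatted_data content with the nodes it describes.
root->interpret_formatted_data();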

PROFASI: Protein Folding and Aggregation Simulator, Version 1.5
© (2005-2016) Anders Irbäck and Sandipan Mohanty
Documentation generated on Mon Jul 18 2016 using Doxygen version 1.8.2