Motivation
You want to convert a file of comma-separated values (CSV) text, or a flat file with a very similar line/field structure, into XML.
Method
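We will use the tokenize($input, '\n') function to break the input file into separate lines. We will then use the tokenize($line, '\s*,\s*') function to split each line into separate fields. The regular expression term \s* absorbs any whitespace around each comma, so it does not end up in the field values. Whitespace at the very start or end of a line is not removed.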
Basic Example
xquery version "1.0";
let $csv-input-sample :=
'John, Smith, x123
Peg, Jones , x456
Sue, Adams , x789
Dan, McCoy , x321'
let $lines := tokenize($csv-input-sample, '\n')
return
<results>{
  for $line in $lines
  let $fields := tokenize($line, '\s*,\s*')
  return
  <row>{
    for $field in $fields
    return
      <field>{$field}</field>
  }</row>
}</results>
Sample Output
<results>
   <row>
      <field>John</field>
      <field>Smith</field>
      <field>x123</field>
   </row>
   <row>
      <field>Peg</field>
      <field>Jones</field>
      <field>x456</field>
   </row>
   <row>
      <field>Sue</field>
      <field>Adams</field>
      <field>x789</field>
   </row>
   <row>
      <field>Dan</field>
      <field>McCoy</field>
      <field>x321</field>
   </row>
</results>
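Real files often arrive with Windows line endings, blank lines, or padding at the start and end of a line, none of which the basic example handles. The following is a minimal sketch of one way to cope with these cases, not a definitive implementation. The input string is made up for illustration. Note that normalize-space() also collapses runs of whitespace inside a field, which may or may not be what you want.

```xquery
xquery version "1.0";

(: Hypothetical input: CRLF line endings, a blank line and stray padding :)
let $crlf := codepoints-to-string((13, 10))
let $csv := concat('  John, Smith, x123 ', $crlf, $crlf, 'Peg, Jones , x456', $crlf)
(: '\r?\n' accepts both LF and CRLF; the predicate drops whitespace-only lines :)
let $lines := tokenize($csv, '\r?\n')[normalize-space(.) ne '']
return
<results>{
  for $line in $lines
  return
  <row>{
    for $field in tokenize($line, ',')
    return
      (: trim each field instead of relying on \s* around the commas :)
      <field>{normalize-space($field)}</field>
  }</row>
}</results>
```

With this input the query should return two rows (John, Smith, x123 and Peg, Jones, x456), with the blank line and the trailing empty line skipped.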
Example With Data Dictionary in Row 1
This second example will use the first row of the CSV file as a data dictionary of the element names for each column of each row.
xquery version "1.0";
let $csv :=
'name,faculty
alice,anthropology
bob,biology'
let $lines := tokenize($csv, '\n')
let $head := tokenize($lines[1], ',')
let $body := remove($lines, 1)
return
    <people>
        {
            for $line in $body
            let $fields := tokenize($line, ',')
            return
                <person>
                    {
                        for $key at $pos in $head
                        let $value := $fields[$pos]
                        return
                            element { $key } { $value }
                    }
                </person>
        }
    </people>
Sample Output
<people>
  <person>
    <name>alice</name>
    <faculty>anthropology</faculty>
  </person>
  <person>
    <name>bob</name>
    <faculty>biology</faculty>
  </person>
</people>
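The computed element constructor element { $key } { ... } raises an error if a header is not a valid XML name, for example when it contains a space or starts with a digit. A minimal sketch of a workaround follows. The function name local:to-element-name and the sample headers are hypothetical, and the character class is ASCII-only, so accented letters are also replaced by underscores.

```xquery
xquery version "1.0";

(: Hypothetical helper: turn an arbitrary column header into a usable element name :)
declare function local:to-element-name($header as xs:string) as xs:string {
  (: replace every character outside a conservative ASCII set with an underscore :)
  let $clean := replace(normalize-space($header), '[^A-Za-z0-9_.\-]', '_')
  return
    (: XML names must not be empty or start with a digit, dot or hyphen :)
    if (matches($clean, '^[A-Za-z_]')) then $clean else concat('_', $clean)
};

let $head := tokenize('first name,2nd-choice,faculty', ',')
return
  <person>{
    for $key in $head
    return element { local:to-element-name($key) } { () }
  }</person>
```

For these headers the element names become first_name, _2nd-choice and faculty.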
Adding Configuration File Options
Often you have a family of CSV files that share very similar import options. In this case it is useful to be able to pass a series of configuration parameters to a single XQuery function. These configuration parameters include:
- What the field delimiter is (comma is the default)
- The element name of the root node
- The element name of each line or row
<file-import-config>
   <field-separator>,</field-separator>
   <root-element-name>people</root-element-name>
   <line-element-name>person</line-element-name>
</file-import-config>
You can then use this configuration file in the CSV parser:
xquery version "1.0";
let $config := 
<file-import-config>
   <field-separator>:</field-separator>
   <root-element-name>People</root-element-name>
   <line-element-name>Person</line-element-name>
</file-import-config>
let $csv :=
'name:faculty
alice:anthropology
bob:biology'
let $lines := tokenize($csv, '\n')
let $head := tokenize($lines[1], $config/field-separator)
let $body := remove($lines, 1)
return
    element {$config/root-element-name}
        {
            for $line in $body
            let $fields := tokenize($line, $config/field-separator)
            return
                element {$config/line-element-name}
                    {
                        for $key at $pos in $head
                        let $value := $fields[$pos]
                        return
                            element { $key } { $value }
                    }
        }
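The parser shown above can be wrapped in a single function that takes the configuration as a parameter. This is a sketch under a few assumptions: the function names local:csv-to-xml and local:escape-for-regex are hypothetical, a missing or empty field-separator falls back to a comma, and the separator is escaped because tokenize() treats its second argument as a regular expression (so a separator such as | or . would otherwise not match literally).

```xquery
xquery version "1.0";

(: Hypothetical helper: escape regex metacharacters so the separator matches literally :)
declare function local:escape-for-regex($s as xs:string) as xs:string {
  replace($s, '([\\|.?*+(){}\[\]\^\$\-])', '\\$1')
};

(: Hypothetical wrapper around the configurable parser :)
declare function local:csv-to-xml(
  $csv as xs:string,
  $config as element(file-import-config)
) as element() {
  let $configured := string($config/field-separator)
  (: assumption: fall back to a comma when no separator is configured :)
  let $sep := if ($configured eq '') then ',' else $configured
  let $pattern := local:escape-for-regex($sep)
  let $lines := tokenize($csv, '\n')
  let $head := tokenize($lines[1], $pattern)
  return
    element { string($config/root-element-name) } {
      for $line in remove($lines, 1)
      let $fields := tokenize($line, $pattern)
      return
        element { string($config/line-element-name) } {
          for $key at $pos in $head
          return element { $key } { $fields[$pos] }
        }
    }
};

local:csv-to-xml(
'name|faculty
alice|anthropology
bob|biology',
<file-import-config>
   <field-separator>|</field-separator>
   <root-element-name>People</root-element-name>
   <line-element-name>Person</line-element-name>
</file-import-config>)
```

The call at the end should produce a People element containing two Person elements, each with name and faculty children. Without the escaping step, the | separator would be read as a regular expression alternation rather than a literal character.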
CSV complications
The code above assumes a simple form of CSV. In practice, a CSV parser must handle more complex cases for which a simple use of tokenize() is insufficient to parse a line. Strings containing one or more separators will be double-quoted. Quotes within quoted strings also need to be handled.
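As an illustration of how quoted fields might be handled, here is a minimal sketch that assumes an XQuery 3.0 processor, since it relies on fn:analyze-string, which is not available in XQuery 1.0. It also assumes that quotes inside a quoted field are escaped by doubling them, that the separator is a comma, and that fields do not span lines. The function name local:parse-line is hypothetical.

```xquery
xquery version "3.0";

(: Hypothetical helper: split one CSV line into fields, honouring double quotes :)
declare function local:parse-line($line as xs:string) as xs:string* {
  (: A comma is appended so that every field, including the last, ends with one.
     Each match is either a quoted field or a run of non-comma characters,
     followed by its terminating comma. :)
  for $match in analyze-string(concat($line, ','), '("([^"]|"")*"|[^,]*),')/fn:match
  let $text := string($match)
  (: drop the terminating comma :)
  let $raw := substring($text, 1, string-length($text) - 1)
  return
    if (starts-with($raw, '"'))
    (: strip the surrounding quotes and undo the doubling of inner quotes :)
    then replace(substring($raw, 2, string-length($raw) - 2), '""', '"')
    else $raw
};

let $line := '"Smith, John",42,"say ""hi"""'
return
  <row>{
    for $field in local:parse-line($line)
    return <field>{$field}</field>
  }</row>
```

For the sample line this should yield three fields: Smith, John (with its comma preserved), then 42, then say "hi". Whitespace around unquoted fields is kept as-is in this sketch.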
The following implementations handle some or all of these complications, although it is often not clear which complications each one covers.
- XSLT 2.0 Stylesheet for transforming CSV files to XML
- Matthew Royal's parser. Sadly, this is MarkLogic-only since it uses mutable maps.
- Zorba CSV importer
- BaseX CSV module
- David Cassel's parser, another MarkLogic-specific implementation