Monday, October 22, 2012

Flattening an XML Structure for MySQL Import


Recently a buddy asked me to help him convert BuddyPress XML data into a MySQL database. There are several software options that will import XML into MySQL or other databases, but they don't handle fields past the second level. So this project posed two challenges. First, the BuddyPress data is only available in single day increments in XML files. So you have to download an XML for every day for every table. That is a lot of files. The second challenge is that all of that XML data needed to be merged into a single file, with all of the fields flattened to a single level, while maintaining their identity.

To solve problem #1, I wrote a PowerShell script that would use dates and an array of table names to do a series of loops. This allowed the automatic generation of a URL for each day for each table. That URL was passed to the following function, which would download the XML file. Since each URL was limited to 1000 records, it was actually possible that a table could have more than one XML file for each day, so I added some logic to check the XML file to see if there was another page of records, and to call its self against that next page if it existed.

$global:wc=new-object system.net.webclient

function download-xml {

param ($url, $file)

write-host Downloading $file
$global:wc.downloadfile($url,$file)
[xml]$xml = get-content $file
if ($xml.ContentEnvelope.ContentHeader.NextPageUrl) { 
$nextFile = $file + "_" + $xml.ContentEnvelope.ContentHeader.CurrentPage
download-xml $xml.ContentEnvelope.ContentHeader.NextPageUrl $nextFile
}
}

That script worked great, and netted me about 3900 XML files, each with up to 1000 records. To get them into MySQL, I needed each tables XML file combined into a single file. That part is pretty easy, but while combining them, I also needed to flatten all of the fields into the second level. That meant turning this:

    <BlogPost>
      <NeedsToUpdateFullText>True</NeedsToUpdateFullText>
      <SkipLists>False</SkipLists>
      <UpdatingComments>False</UpdatingComments>
      <Title>Some Blog Title</Title>
      <UnencodedTitle>Some Blog Title</UnencodedTitle>
      <Body>&lt;p&gt;Lots of blog content :)&lt;/p&gt;</Body>
      <Tags />
      <TagsPrevious />
      <TimeStamp>2012-07-08T08:17:00-07:00</TimeStamp>
      <IsPublished>True</IsPublished>
      <ParentKey>
        <UserKey>
          <IsAnonymous>False</IsAnonymous>
          <DontParseKeySeed>False</DontParseKeySeed>
          <Key>2bda27f7-cb12-4e5c-949c-c4c36757a626</Key>
          <ObjectType>Models.Users.UserKey</ObjectType>
          <PartitionHash>
            <Byte>131</Byte>
            <Byte>4</Byte>
          </PartitionHash>
        </UserKey>
        <DontParseKeySeed>False</DontParseKeySeed>
        <Key>Blog:2bda27f7-cb12-4e5c-949c-c4c36757a626</Key>
        <ObjectType>Models.Blogs.BlogKey</ObjectType>
        <PartitionHash>
          <Byte>150</Byte>
          <Byte>135</Byte>
        </PartitionHash>
      </ParentKey>
      <LastEditedBy>
        <IsAnonymous>False</IsAnonymous>
        <DontParseKeySeed>False</DontParseKeySeed>
        <Key>2bda27f7-cb12-4e5c-949c-c4c36757a626</Key>
        <ObjectType>Models.Users.UserKey</ObjectType>
        <PartitionHash>
          <Byte>131</Byte>
          <Byte>4</Byte>
        </PartitionHash>
      </LastEditedBy>
      <LastEditTimeStamp>2012-07-08T08:17:00-07:00</LastEditTimeStamp>
      <LastCommentDate>0001-01-01T00:00:00-08:00</LastCommentDate>
      <Key>
        <BlogKey>
          <UserKey>
            <IsAnonymous>False</IsAnonymous>
            <DontParseKeySeed>False</DontParseKeySeed>
            <Key>2bda27f7-cb12-4e5c-949c-c4c36757a626</Key>
            <ObjectType>Models.Users.UserKey</ObjectType>
            <PartitionHash>
              <Byte>131</Byte>
              <Byte>4</Byte>
            </PartitionHash>
          </UserKey>
          <DontParseKeySeed>False</DontParseKeySeed>
          <Key>Blog:2bda27f7-cb12-4e5c-949c-c4c36757a626</Key>
          <ObjectType>Models.Blogs.BlogKey</ObjectType>
          <PartitionHash>
            <Byte>150</Byte>
            <Byte>135</Byte>
          </PartitionHash>
        </BlogKey>
        <DontParseKeySeed>False</DontParseKeySeed>
        <Key>Blog:2bda27f7-cb12-4e5c-949c-c4c36757a626Post:f23c33f7-932c-46d4-9dc6-7b44a4a95328</Key>
        <ObjectType>Models.Blogs.BlogPostKey</ObjectType>
        <PartitionHash>
          <Byte>55</Byte>
          <Byte>64</Byte>
        </PartitionHash>
      </Key>
      <ContentBlockingState>Unblocked</ContentBlockingState>
      <ImmuneToAbuseReportsState>NotImmune</ImmuneToAbuseReportsState>
      <Owner>
        <IsAnonymous>False</IsAnonymous>
        <DontParseKeySeed>False</DontParseKeySeed>
        <Key>2bda27f7-cb12-4e5c-949c-c4c36757a626</Key>
        <ObjectType>Models.Users.UserKey</ObjectType>
        <PartitionHash>
          <Byte>131</Byte>
          <Byte>4</Byte>
        </PartitionHash>
      </Owner>
      <LastUpdated>2012-07-08T08:12:27-07:00</LastUpdated>
      <CreatedOn>2012-07-08T08:12:27-07:00</CreatedOn>
      <SiteOfOriginKey>system</SiteOfOriginKey>
    </BlogPost>

Into this:

    <BlogPost>

<NeedsToUpdateFullText>True</NeedsToUpdateFullText>
<SkipLists>False</SkipLists>
<UpdatingComments>False</UpdatingComments>
<Title>Some Blog Title</Title>
<UnencodedTitle>Some Blog Title</UnencodedTitle>
<Body>&lt;p&gt;Lots of blog content :)&lt;/p&gt;</Body>
<Tags></Tags>
<TagsPrevious></TagsPrevious>

<BodyAbstract></BodyAbstract>
<ContentBlockingState>Unblocked</ContentBlockingState>
<CreatedOn>2012-06-19 10:15:50-07:00</CreatedOn>
<ImmuneToAbuseReportsState>NotImmune</ImmuneToAbuseReportsState>
<IsPublished>True</IsPublished>
<Key_BlogKey_DontParseKeySeed>False</Key_BlogKey_DontParseKeySeed>
<Key_BlogKey_Key>Blog:a3f689bc-f77c-4f5b-a358-9197a7c7c246@D|9;36|CommGroup5897d62d-6f6e-41fd-8cce-5c8d288dab3d|</Key_BlogKey_Key>
<Key_BlogKey_ObjectType>Models.Blogs.BlogKey</Key_BlogKey_ObjectType>
<Key_BlogKey_PartitionHash_Byte1>90</Key_BlogKey_PartitionHash_Byte1>
<Key_BlogKey_PartitionHash_Byte2>131</Key_BlogKey_PartitionHash_Byte2>
<Key_BlogKey_UserKey_DontParseKeySeed>False</Key_BlogKey_UserKey_DontParseKeySeed>
<Key_BlogKey_UserKey_IsAnonymous>False</Key_BlogKey_UserKey_IsAnonymous>
<Key_BlogKey_UserKey_Key>a3f689bc-f77c-4f5b-a358-9197a7c7c246@d|9;36|commgroup5897d62d-6f6e-41fd-8cce-5c8d288dab3d|</Key_BlogKey_UserKey_Key>
<Key_BlogKey_UserKey_ObjectType>Models.Users.UserKey</Key_BlogKey_UserKey_ObjectType>
<Key_BlogKey_UserKey_PartitionHash_Byte1>108</Key_BlogKey_UserKey_PartitionHash_Byte1>
<Key_BlogKey_UserKey_PartitionHash_Byte2>104</Key_BlogKey_UserKey_PartitionHash_Byte2>
<Key_DontParseKeySeed>False</Key_DontParseKeySeed>
<Key_Key>Blog:a3f689bc-f77c-4f5b-a358-9197a7c7c246@D|9;36|CommGroup5897d62d-6f6e-41fd-8cce-5c8d288dab3d|Post:7147d535-9a47-4987-a9c6-9c695f0c8a6c</Key_Key>
<Key_ObjectType>Models.Blogs.BlogPostKey</Key_ObjectType>
<Key_PartitionHash_Byte1>156</Key_PartitionHash_Byte1>
<Key_PartitionHash_Byte2>175</Key_PartitionHash_Byte2>
<LastCommentDate>0001-01-01T00:00:00-08:00</LastCommentDate>
<LastEditedBy_DontParseKeySeed>False</LastEditedBy_DontParseKeySeed>
<LastEditedBy_IsAnonymous>False</LastEditedBy_IsAnonymous>
<LastEditedBy_Key>511bcfae-b9f6-4831-a81b-2a0c76529c06</LastEditedBy_Key>
<LastEditedBy_ObjectType>Models.Users.UserKey</LastEditedBy_ObjectType>
<LastEditedBy_PartitionHash_Byte1>41</LastEditedBy_PartitionHash_Byte1>
<LastEditedBy_PartitionHash_Byte2>127</LastEditedBy_PartitionHash_Byte2>
<LastEditTimeStamp>2012-03-26 10:18:46-07:00</LastEditTimeStamp>
<LastUpdated>2012-06-19 10:15:50-07:00</LastUpdated>
<NeedsToUpdateFullText>True</NeedsToUpdateFullText>
<Owner_DontParseKeySeed>False</Owner_DontParseKeySeed>
<Owner_IsAnonymous>False</Owner_IsAnonymous>
<Owner_Key>511bcfae-b9f6-4831-a81b-2a0c76529c06</Owner_Key>
<Owner_ObjectType>Models.Users.UserKey</Owner_ObjectType>
<Owner_PartitionHash_Byte1>41</Owner_PartitionHash_Byte1>
<Owner_PartitionHash_Byte2>127</Owner_PartitionHash_Byte2>
<ParentKey_DontParseKeySeed>False</ParentKey_DontParseKeySeed>
<ParentKey_Key>Blog:a3f689bc-f77c-4f5b-a358-9197a7c7c246@D|9;36|CommGroup5897d62d-6f6e-41fd-8cce-5c8d288dab3d|</ParentKey_Key>
<ParentKey_ObjectType>Models.Blogs.BlogKey</ParentKey_ObjectType>
<ParentKey_PartitionHash_Byte1>90</ParentKey_PartitionHash_Byte1>
<ParentKey_PartitionHash_Byte2>131</ParentKey_PartitionHash_Byte2>
<ParentKey_UserKey_DontParseKeySeed>False</ParentKey_UserKey_DontParseKeySeed>
<ParentKey_UserKey_IsAnonymous>False</ParentKey_UserKey_IsAnonymous>
<ParentKey_UserKey_Key>a3f689bc-f77c-4f5b-a358-9197a7c7c246@d|9;36|commgroup5897d62d-6f6e-41fd-8cce-5c8d288dab3d|</ParentKey_UserKey_Key>
<ParentKey_UserKey_ObjectType>Models.Users.UserKey</ParentKey_UserKey_ObjectType>
<ParentKey_UserKey_PartitionHash_Byte1>108</ParentKey_UserKey_PartitionHash_Byte1>
<ParentKey_UserKey_PartitionHash_Byte2>104</ParentKey_UserKey_PartitionHash_Byte2>
<SiteOfOriginKey>system</SiteOfOriginKey>
<SkipLists>False</SkipLists>
<Tags>Nursing</Tags>
<TagsPrevious>Nursing</TagsPrevious>
<TimeStamp>2012-03-26 10:18:46-07:00</TimeStamp>
<Title>Keep Trying!</Title>
<UnencodedTitle>Keep Trying!</UnencodedTitle>
<UpdatingComments>False</UpdatingComments>
   </BlogPost>

Since PowerShell handles XML natively, it seemed like the perfect tool. Since I didn't want to build a custom script for each XML file, I needed the script to be agnostic to the XML fields. To accomplish that, I had the script import an XML file, loop through each property in the XML file and check the property type. If the property type is "XmlElement", the script knows it has child items and starts looping through those properties. If the type isn't "XmlElement", it knows that it has a field and writes the output to an XML file. Because each table potentially has thousands of rows, the file write is done with the .NET [System.IO.StreamWriter], which is by far the fastest way to write a file from PowerShell. Placing the XML flattening process into a function allowed me to easily loop through all of the files for a specific table, and all of the records in each file. In order to maintain the integrity of the XML paths, I pass the parent path to the function. That way we don't end up with multiple "Key" fields at the root, but instead have fields named with the full XML path of the item moved to the root. For example, this path:

      <ParentKey>
        <UserKey>
          <IsAnonymous>False</IsAnonymous>
becomes this:
      <ParentKey_UserKey_IsAnonymous>False</ParentKey_UserKey_IsAnonymous>

Here is the flatten-xml function that accomplished that:

$stream = [System.IO.StreamWriter] "C:\BlogPost.xml"

function global:flatten-field {

param (
$parent,
$element,
$fieldname
)

if ($parent -eq "") {
$label = $fieldname
} else {
$label = $parent + "_" + $fieldname
}

if ((($element.GetType()).Name) -eq "XmlElement") { 
$element | get-member | where-object { $_.MemberType -eq "Property" } | % {
global:flatten-field $label $element.($_.Name) $_.Name
}
} else {
if (($element.length -eq 25) -and ($element -like "2012-*")) { $element = $element.replace("T"," ") }
if ($fieldname -eq "Byte") { 
$row = "`t<" + $label + "1>" + $element[0] + "</" + $label + "1>" + "<" + $label + "2>" + $element[1] + "</" + $label + "2>" 
} else {
$element = $element.replace("&","&amp;")
$element = $element.replace("'","&apos;")
$element = $element.replace('"',"&quot;")
$element = $element.replace("<","&lt;")
$element = $element.replace(">","&gt;")
$row = "`t<" + $label + ">" + $element + "</" + $label + ">"
}
$stream.WriteLine($row)
}
}

Having the script loop through the hundred of files, I was able to start it processing and go to bed. When I got up the next morning, I had an XML file for each table that I was able to import right into a MySQL database for my friend.

If you would like copies of the full scripts, feel free to email me.




1 comment:

  1. Jeremy I would like to get a copy of the complete script. How well do you think it would work on GPO XML reports?

    ReplyDelete