Saturday, November 26, 2011

Sunday, November 6, 2011

Dennis Ritchie, 1941-2011


http://en.wikipedia.org/wiki/Dennis_Ritchie

printf("Goodbye, Dennis Ritchie");




Parsing large XMLs with XmlReader and XmlSerializer (C#)

Having discovered XmlSerializer the other week (I used it to save and read application settings in XML), I went back to another application I’m writing for work, one that analyses Microsoft SQL Server Profiler trace XMLs, and tried to use it there as well.

Now these XMLs are fairly large, from hundreds of MBs to GBs. I used XmlReader in the first place so that the whole XML was never loaded into memory, but the program was fairly slow on large files, and the code ran to about 2 or 3 pages of tests for node names, attributes and all that.

I decided to rewrite it using XmlSerializer, and to my surprise the performance improved quite a bit while the code became much cleaner. (True, the gain was probably mostly due to the way I wrote it in the first place: one loop that called xmlReader.Read() and then checked IsStartElement(), with all the spaghetti that goes with it.)
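
For contrast, the hand-written style I moved away from looks roughly like the sketch below. This is not my original code: the names ManualReaderSketch and CountEvents and the inline sample XML are made up for illustration, and it only counts matching events instead of collecting their columns. Even so, every node has to be inspected by hand.

```csharp
using System;
using System.IO;
using System.Xml;

public static class ManualReaderSketch
{
    // Counts <Event> elements whose "name" attribute equals eventName.
    public static int CountEvents(TextReader input, string eventName)
    {
        int count = 0;
        using (XmlReader reader = XmlReader.Create(input))
        {
            // Walk the document one node at a time.
            while (reader.Read())
            {
                // Only start tags (or empty elements) named "Event" are of interest.
                if (reader.IsStartElement() && reader.LocalName == "Event")
                {
                    if (reader.GetAttribute("name") == eventName)
                    {
                        count++;
                    }
                }
            }
        }
        return count;
    }

    public static void Main()
    {
        // Hypothetical sample data, shaped like a Profiler trace.
        string xml =
            "<TraceData xmlns=\"http://tempuri.org/TracePersistence.xsd\"><Events>" +
            "<Event id=\"45\" name=\"SP:StmtCompleted\">" +
            "<Column id=\"12\" name=\"SPID\">190</Column></Event>" +
            "<Event id=\"12\" name=\"SQL:BatchCompleted\" />" +
            "</Events></TraceData>";

        Console.WriteLine(CountEvents(new StringReader(xml), "SP:StmtCompleted"));
    }
}
```

Collecting the Column values as well means more branches inside that loop for each element and attribute, which is where my 2 or 3 pages came from.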

Microsoft SQL Server Profiler XMLs look like this, with Event entries, each with multiple Column elements:

<?xml version="1.0" encoding="utf-16"?>
<TraceData xmlns="http://tempuri.org/TracePersistence.xsd">
  <Header>
    [...]
  </Header>
  <Events>
    <Event id="45" name="SP:StmtCompleted">
      <Column id="11" name="LoginName">USERNAME</Column>
      <Column id="15" name="EndTime">2011-09-12T13:20:45.813-07:00</Column>
      <Column id="10" name="ApplicationName">Microsoft SQL Server JDBC Driver</Column>
      <Column id="12" name="SPID">190</Column>
      <Column id="14" name="StartTime">2011-09-12T13:20:45.813-07:00</Column>
      <Column id="16" name="Reads">2</Column>
      <Column id="18" name="CPU">0</Column>
      <Column id="1" name="TextData">SELECT COUNT(*) FROM "TABLENAME"</Column>
      <Column id="9" name="ClientProcessID">6896</Column>
      <Column id="13" name="Duration">105</Column>
      <Column id="17" name="Writes">0</Column>
    </Event>
    [...]
  </Events>
</TraceData>

The Event.cs class for Event nodes:
using System.Collections.Generic;
using System.Xml.Serialization;

namespace XmlSerializerTest
{
    [XmlRoot(ElementName="Event", Namespace="http://tempuri.org/TracePersistence.xsd")]
    public class Event
    {
        [XmlAttribute("id")]
        public string ID { get; set; }

        [XmlAttribute("name")]
        public string Name { get; set; }

        [XmlElement("Column")]
        public List<Column> Columns { get; set; }
    }
}

The Column.cs class for Column nodes:
using System.Xml.Serialization;

namespace XmlSerializerTest
{
    [XmlRoot(ElementName="Column", Namespace="http://tempuri.org/TracePersistence.xsd")]
    public class Column
    {
        [XmlAttribute("id")]
        public string ID { get; set; }

        [XmlAttribute("name")]
        public string Name { get; set; }

        [XmlText]
        public string Value { get; set; }
    }
}
And the main parser code (XmlParser.cs) is as simple as:
using System;
using System.Collections.Generic;
using System.Xml;
using System.Xml.Serialization;

namespace XmlSerializerTest
{
    class XmlParser
    {
        // Streams through the trace file and returns the SP:StmtCompleted events.
        public static List<Event> Parse(string fileName)
        {
            List<Event> events = new List<Event>();

            // Create the serializer once and reuse it for every Event.
            XmlSerializer eventSerializer = new XmlSerializer(typeof(Event));

            using (XmlReader xmlReader = XmlReader.Create(fileName))
            {
                // Jump from one "Event" element to the next.
                while (xmlReader.ReadToFollowing("Event"))
                {
                    // Deserialize only the current Event subtree. The subtree reader
                    // is closed before xmlReader is touched again.
                    using (XmlReader subtree = xmlReader.ReadSubtree())
                    {
                        Event eventObject = (Event) eventSerializer.Deserialize(subtree);
                        if (String.Equals(eventObject.Name, "SP:StmtCompleted"))
                        {
                            events.Add(eventObject);
                        }
                    }
                }
            } // The using block closes xmlReader; no explicit Close() is needed.

            return events;
        }
    }
}

Everything is parsed for you by XmlSerializer.Deserialize(). Yes, there may be some overhead in parsing things you were never interested in, such as columns you didn’t want in the first place, and these can add up in memory. But if code readability and maintainability matter more, I guess you could trim the unwanted columns from the list after the elements are parsed to reduce memory.
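
The trimming idea can be sketched as follows. It is only a sketch under a few assumptions: ColumnTrimmer and KeepOnly are hypothetical names, the Event and Column classes here are cut-down stand-ins for the ones shown earlier (without the XML attributes), and the choice of which columns to keep is just an example.

```csharp
using System;
using System.Collections.Generic;

// Cut-down stand-ins for the Event and Column classes from the post.
public class Column
{
    public string Name { get; set; }
    public string Value { get; set; }
}

public class Event
{
    public string Name { get; set; }
    public List<Column> Columns { get; set; }
}

public static class ColumnTrimmer
{
    // Removes every column whose name is not in 'wanted' and releases
    // the spare capacity of each list.
    public static void KeepOnly(List<Event> events, HashSet<string> wanted)
    {
        foreach (Event e in events)
        {
            if (e.Columns == null)
            {
                continue;
            }
            e.Columns.RemoveAll(c => !wanted.Contains(c.Name));
            e.Columns.TrimExcess();
        }
    }

    public static void Main()
    {
        var events = new List<Event>
        {
            new Event
            {
                Name = "SP:StmtCompleted",
                Columns = new List<Column>
                {
                    new Column { Name = "LoginName", Value = "USERNAME" },
                    new Column { Name = "Duration", Value = "105" },
                    new Column { Name = "TextData", Value = "SELECT COUNT(*) FROM \"TABLENAME\"" }
                }
            }
        };

        // Example choice: keep only the columns needed for a duration report.
        KeepOnly(events, new HashSet<string> { "Duration", "TextData" });

        foreach (Column c in events[0].Columns)
        {
            Console.WriteLine("{0} = {1}", c.Name, c.Value);
        }
    }
}
```

Trimming each Event right after it is deserialized, rather than at the end, would keep the peak memory lower, at the cost of mixing the filtering into the parse loop.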

The above code goes through a 250 MB file in about 6 seconds on my Intel P8600 laptop (measured with Stopwatch) and uses about 37 MB of RAM (private bytes).
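
For anyone who wants to repeat the measurement, a minimal sketch of the timing part is below. TimingSketch, Measure and PrivateBytes are hypothetical names, the Thread.Sleep call is only a placeholder for the real XmlParser.Parse(fileName) call, and reading Process.PrivateMemorySize64 from inside the program is one possible way to get a private bytes figure, not necessarily how the number above was obtained.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

public static class TimingSketch
{
    // Runs 'work' once and returns the elapsed wall-clock time.
    public static TimeSpan Measure(Action work)
    {
        Stopwatch stopwatch = Stopwatch.StartNew();
        work();
        stopwatch.Stop();
        return stopwatch.Elapsed;
    }

    // Returns the private memory currently allocated to this process, in bytes.
    public static long PrivateBytes()
    {
        using (Process current = Process.GetCurrentProcess())
        {
            return current.PrivateMemorySize64;
        }
    }

    public static void Main()
    {
        // Placeholder workload; replace with XmlParser.Parse(fileName).
        TimeSpan elapsed = Measure(() => Thread.Sleep(50));

        Console.WriteLine("Elapsed: {0:F2} s", elapsed.TotalSeconds);
        Console.WriteLine("Private bytes: {0:F1} MB", PrivateBytes() / (1024.0 * 1024.0));
    }
}
```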

I guess the interesting parts are XmlReader, XmlReader.ReadToFollowing(), XmlReader.ReadSubtree(), XmlSerializer.Deserialize() and the XML serialization attributes on the Event and Column classes. I’m not going to go into details on those; there are plenty of examples on the web and documentation on MSDN.
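
To see the attributes in isolation, here is a small sketch that deserializes a single Event from an inline string and looks up one column by name. The Event and Column classes are trimmed copies of the ones above, while AttributeDemo, ParseEvent and ColumnValue are hypothetical helper names and the sample XML is invented for the example.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml.Serialization;

[XmlRoot(ElementName = "Event", Namespace = "http://tempuri.org/TracePersistence.xsd")]
public class Event
{
    [XmlAttribute("name")]          // maps <Event name="..."> to Name
    public string Name { get; set; }

    [XmlElement("Column")]          // collects repeated <Column> children
    public List<Column> Columns { get; set; }
}

public class Column
{
    [XmlAttribute("name")]          // maps <Column name="..."> to Name
    public string Name { get; set; }

    [XmlText]                       // maps the element's text content to Value
    public string Value { get; set; }
}

public static class AttributeDemo
{
    // Deserializes one standalone <Event> element from a string.
    public static Event ParseEvent(string xml)
    {
        XmlSerializer serializer = new XmlSerializer(typeof(Event));
        using (StringReader reader = new StringReader(xml))
        {
            return (Event) serializer.Deserialize(reader);
        }
    }

    // Returns the value of the named column, or null if it is absent.
    public static string ColumnValue(Event e, string columnName)
    {
        Column column = e.Columns.Find(c => c.Name == columnName);
        return column == null ? null : column.Value;
    }

    public static void Main()
    {
        string xml =
            "<Event xmlns=\"http://tempuri.org/TracePersistence.xsd\" id=\"45\" name=\"SP:StmtCompleted\">" +
            "<Column id=\"13\" name=\"Duration\">105</Column>" +
            "<Column id=\"17\" name=\"Writes\">0</Column>" +
            "</Event>";

        Event e = ParseEvent(xml);
        Console.WriteLine("{0}: Duration = {1}", e.Name, ColumnValue(e, "Duration"));
    }
}
```

Attributes that have no matching property, such as id in this trimmed copy, are simply skipped by the serializer with its default settings.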

The reason I posted this in the first place is that I thought it might help someone else who, like me, is learning C# to see a working example and what makes it tick.
  

Wednesday, November 2, 2011

Eclipse – FindBugs plugin

http://findbugs.sourceforge.net
FindBugs™ - Find Bugs in Java Programs

A program which uses static analysis to look for bugs in Java code.  It is free software, distributed under the terms of the Lesser GNU Public License. The name FindBugs™ and the FindBugs logo are trademarked by The University of Maryland. As of July, 2008, FindBugs has been downloaded more than 700,000 times.