Having discovered the other week XmlSerializer (used it to save / read application settings in XML) I went back to another application I’m writing for work that analyses Microsoft SQL Server Profiler trace XMLs and try to use with that as well.
Now these XMLs are fairly large, from hundreds of MBs to GBs. I used XmlReader in the first place so it didn’t load the whole XML in memory, but the program was fairly slow with large ones and the code was about 2 or 3 pages, testing for node names, attributes and all that.
Decided to rewrite it using XmlSerializer, and to my surprise the performance increased quite a bit (true, it was maybe mostly due to the way I wrote it the first place with one loop that did xmlReader.Read() and then checking for IsStartElement() and all that spaghetti that needs to go with it), while making the code much cleaner.
Microsoft SQL Server Profiler XMLs look like this, with Event entries, each with multiple Column elements:
1: <?xml version="1.0" encoding="utf-16"?>2: <TraceData xmlns="http://tempuri.org/TracePersistence.xsd">3: <Header>4: [...]5: </Header>6: <Events>7: <Event id="45" name="SP:StmtCompleted">8: <Column id="11" name="LoginName">USERNAME</Column>9: <Column id="15" name="EndTime">2011-09-12T13:20:45.813-07:00</Column>10: <Column id="10" name="ApplicationName">Microsoft SQL Server JDBC Driver</Column>11: <Column id="12" name="SPID">190</Column>12: <Column id="14" name="StartTime">2011-09-12T13:20:45.813-07:00</Column>13: <Column id="16" name="Reads">2</Column>14: <Column id="18" name="CPU">0</Column>15: <Column id="1" name="TextData">SELECT COUNT(*) FROM "TABLENAME"</Column>16: <Column id="9" name="ClientProcessID">6896</Column>17: <Column id="13" name="Duration">105</Column>18: <Column id="17" name="Writes">0</Column>19: </Event>20: [...]21: </Events>22: </TraceData>23:
The Column.cs clas for Column nodes:1: using System.Collections.Generic;2: using System.Xml.Serialization;3:4: namespace XmlSerializerTest5: {6: [XmlRoot(ElementName="Event", Namespace="http://tempuri.org/TracePersistence.xsd")]7: public class Event8: {9: [XmlAttribute("id")]10: public string ID { get; set; }11:12: [XmlAttribute("name")]13: public string Name { get; set; }14:15: [XmlElement("Column")]16: public List<Column> Columns { get; set; }17: }18: }19:
And the main parser code (XmlParser.cs) as is simple as:1: using System.Xml.Serialization;2:3: namespace XmlSerializerTest4: {5: [XmlRoot(ElementName="Column", Namespace="http://tempuri.org/TracePersistence.xsd")]6: public class Column7: {8: [XmlAttribute("id")]9: public string ID { get; set; }10:11: [XmlAttribute("name")]12: public string Name { get; set; }13:14: [XmlText]15: public string Value { get; set; }16: }17: }
1: using System;2: using System.Collections.Generic;3: using System.Xml;4: using System.Xml.Serialization;5:6: namespace XmlSerializerTest7: {8: class XmlParser9: {10: public static List<Event> Parse(String fileName)11: {12: // Init13: List<Event> events = new List<Event>();14:15: // Parse...16: using (XmlReader xmlReader = XmlReader.Create(fileName))17: {18: // XmlSerializer...19: XmlSerializer EventSerializer = new XmlSerializer(typeof(Event));20:21: // Parse XML - "Event" nodes...22: while (xmlReader.ReadToFollowing("Event"))23: {24: Event eventObject = (Event) EventSerializer.Deserialize(xmlReader.ReadSubtree());25: if (String.Equals(eventObject.Name, "SP:StmtCompleted"))26: {27: //28: events.Add(eventObject);29: }30: }31:32: // Cleanup...33: xmlReader.Close();34: }35:36: // Return value37: return events;38: }39: }40: }
Everything is parsed for you by the XmlSerializer.Deserialize() – yes, there may be an overhead in parsing stuff that otherwise maybe you were interested in, like columns you didn’t want in the first place and could add up in memory, but if code readability and maintainability is more important, to reduce memory I guess you could go and trim them from the list after the elements are parsed.
The above code goes through a 250 MB file in about 6 seconds on my Intel P8600 laptop (measured with StopWatch) and it uses about 37 MB of RAM (private bytes).
I guess the interesting parts are the XmlReader, XmlReader.ReadToFollowing(), XmlReader.ReadSubtree(), XmlSerializer.Deserialize() and the Xml annotations in the Event and Column classes – I’m not going to go into details on those, there’s plenty examples on the web and documentation on MSDN.
The reason why I posted it in the first place was that I thought it might be of help to someone else learning C# like myself, and see a working example and what makes it tick.
Nice, just used this to do some analysis in Linqpad with ~500mb of trc. The XLinq possibilities are fun once you get the List back. E.g.,
ReplyDeleteIList events = XmlParser.Parse(@"");
var bar = events.Select(e =>
{
string textData = e.Columns.First(c => c.Name == "TextData").Value;
DateTime startTime = DateTime.Parse(e.Columns.First(c => c.Name == "StartTime").Value);
DateTime endTime = DateTime.Parse(e.Columns.First(c => c.Name == "EndTime").Value);
return new
{
TextDataPrefix = textData.Substring(0, Math.Min(100, textData.Length)),
TextData = textData,
Time = endTime.Subtract(startTime)
};
});
bar.GroupBy(e => e.TextDataPrefix)
.Select(g => new { g.Key, AvgTime = Math.Round(g.Average (ge => ge.Time.Milliseconds), 2), Count = g.LongCount() })
.OrderByDescending (g => g.Count)
.Dump();