Wednesday, February 10, 2010

Multithreaded XmlSlurper

Groovy's XmlSlurper is a nice tool for parsing XML documents, mostly because of the elegant GPath dot-notation. But how efficient is XmlSlurper when it has to parse thousands of XML documents per second? Let's run a simple test:

import org.junit.Test
// On Groovy 3 and later, XmlSlurper may also need: import groovy.xml.XmlSlurper

class XmlParserTest {

    static int iterations = 1000

    // Sample document used by every test
    def xml = """
        <root>
            <node1 aName='aValue'>
                <node1.1 aName='aValue'>1.1</node1.1>
                <node1.2 aName='aValue'>1.2</node1.2>
                <node1.3 aName='aValue'>1.3</node1.3>
            </node1>
            <node2 aName='aValue'>
                <node2.1 aName='aValue'>2.1</node2.1>
                <node2.2 aName='aValue'>2.2</node2.2>
                <node2.3 aName='aValue'>2.3</node2.3>
            </node2>
            <nodeN aName='aValue'>
                <nodeN.1 aName='aValue'>N.1</nodeN.1>
                <nodeN.2 aName='aValue'>N.2</nodeN.2>
                <nodeN.3 aName='aValue'>N.3</nodeN.3>
            </nodeN>
        </root>
    """

    // Parses the sample document 'iterations' times, creating a new XmlSlurper each time
    def parseSequential() {
        iterations.times {
            def root = new XmlSlurper().parseText(xml)
            assert 'aValue' == root.node1.@aName.toString()
        }
    }

    @Test void testSequentialXmlParsing() {
        long start = System.currentTimeMillis()
        parseSequential()
        long stop = System.currentTimeMillis()
        println "${iterations} XML documents parsed sequentially in ${stop-start} ms"
    }
}

I ran this test on my 4-core machine and got:

1000 XML documents parsed sequentially in 984 ms

Not really good (0.984 ms per document), but we didn't expect much from a single-threaded application. Let's parallelize this process:

class XmlParserTest {
    ...
    static int threadCount = 5
    ...
    @Test void testParallelXmlParsing() {
        def threads = []
        long start = System.currentTimeMillis()
        threadCount.times {
            threads << Thread.start { parseSequential() }
        }
        threads.each { it.join() }
        long stop = System.currentTimeMillis()
        println "${threadCount * iterations} XML documents parsed parallelly by ${threadCount} threads in ${stop - start} ms"
    }
}

And the result is:

5000 XML documents parsed parallelly by 5 threads in 1750 ms

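This is definitely better (0.35 ms per document), but it doesn't look like parallel processing: five threads did five times the work in almost twice the time, whereas with true parallelism on four cores the elapsed time should stay close to that of the sequential run.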

The problem here is the default constructor of XmlSlurper. It does too much: first, it initializes an XML parser factory, loading a bunch of classes; second, it creates a new XML parser, which is quite an expensive operation. Now imagine this happening a thousand times per second.
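To get a rough feel for that cost outside of Groovy, here is a minimal JDK-only sketch. It is not part of the original test: the class name ParserCreationCost and its methods are made up for illustration, and a plain SAX parser with an empty handler stands in for XmlSlurper. It parses the same small document either with a freshly created factory and parser each time, or with one parser created up front and reused.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical micro-benchmark; names are illustrative, not from the original post.
class ParserCreationCost {
    static final byte[] XML =
        "<root><node1 aName='aValue'>1</node1></root>".getBytes(StandardCharsets.UTF_8);

    // Looks up a factory and builds a new parser for every single document.
    static int parseWithFreshParsers(int documents) throws Exception {
        int parsed = 0;
        for (int i = 0; i < documents; i++) {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new ByteArrayInputStream(XML), new DefaultHandler());
            parsed++;
        }
        return parsed;
    }

    // Builds one parser up front and reuses it, one document at a time.
    static int parseWithOneParser(int documents) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        int parsed = 0;
        for (int i = 0; i < documents; i++) {
            parser.parse(new ByteArrayInputStream(XML), new DefaultHandler());
            parsed++;
        }
        return parsed;
    }

    public static void main(String[] args) throws Exception {
        int documents = 1000;
        long t0 = System.nanoTime();
        parseWithFreshParsers(documents);
        long t1 = System.nanoTime();
        parseWithOneParser(documents);
        long t2 = System.nanoTime();
        System.out.printf("fresh parsers: %d ms, one reused parser: %d ms%n",
            (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000);
    }
}
```

The absolute numbers depend on the JVM, the parser implementation, and warm-up, so treat the output as an indication only.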

Luckily, XmlSlurper has another constructor that takes an XML parser as a parameter, so we can create the parser up front and pass it to the slurper. Unfortunately, we cannot share one parser instance between several slurpers running concurrently, because an XML parser is not thread-safe: you have to finish parsing one document before you can use the same parser to parse another.
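The "one document at a time" rule can be illustrated with plain SAX. The sketch below is an assumption-laden stand-in, not the post's code: SequentialParserReuse and readNode1Name are hypothetical names, and a small handler replaces the GPath expression root.node1.@aName. A single parser handles two documents in a row, which is safe only because the first parse has finished before the second one starts.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example; names are illustrative, not from the original post.
class SequentialParserReuse {

    // Returns the aName attribute of the first <node1> element in the document.
    static String readNode1Name(SAXParser parser, String xml) throws Exception {
        final String[] found = new String[1];
        parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                if (found[0] == null && "node1".equals(qName)) {
                    found[0] = attrs.getValue("aName");
                }
            }
        });
        return found[0];
    }

    public static void main(String[] args) throws Exception {
        // One parser, created once, used for two documents strictly in sequence
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        String first = readNode1Name(parser, "<root><node1 aName='first'/></root>");
        String second = readNode1Name(parser, "<root><node1 aName='second'/></root>");
        System.out.println(first + ", " + second);
    }
}
```

Calling readNode1Name from two threads with the same parser at the same time is exactly what has to be avoided.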

The solution here is to use a preconfigured pool of parsers. Let's create one based on the Apache commons-pool library:

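import javax.xml.parsers.SAXParserFactory;
import org.apache.commons.pool.PoolableObjectFactory;

public class XmlParserPoolableObjectFactory implements PoolableObjectFactory {
    private final SAXParserFactory parserFactory;

    public XmlParserPoolableObjectFactory() {
        parserFactory = SAXParserFactory.newInstance();
    }

    // Called by the pool whenever it needs a new parser
    public Object makeObject() throws Exception {
        return parserFactory.newSAXParser();
    }

    public boolean validateObject(Object obj) {
        return true;
    }

    // Other methods left empty
    public void activateObject(Object obj) {}
    public void passivateObject(Object obj) {}
    public void destroyObject(Object obj) {}
}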

import org.apache.commons.pool.impl.GenericObjectPool;

public class XmlParserPool {
    private final GenericObjectPool pool;

    public XmlParserPool(int maxActive) {
        // commons-pool 1.x: block when the pool is exhausted;
        // a non-positive maxWait (0 here) means wait indefinitely
        pool = new GenericObjectPool(new XmlParserPoolableObjectFactory(), maxActive,
                GenericObjectPool.WHEN_EXHAUSTED_BLOCK, 0);
    }

    public Object borrowObject() throws Exception {
        return pool.borrowObject();
    }

    public void returnObject(Object obj) throws Exception {
        pool.returnObject(obj);
    }
}

Now we can change our test:

class XmlParserTest {
    static XmlParserPool parserPool = new XmlParserPool(1000)
    ...
    def parseSequential() {
        iterations.times {
            def parser = parserPool.borrowObject()
            def root
            try {
                root = new XmlSlurper(parser).parseText(xml)
            } finally {
                // Return the parser even if parsing fails, so the pool is not drained
                parserPool.returnObject(parser)
            }
            assert 'aValue' == root.node1.@aName.toString()
        }
    }
}

and run it again:

1000 XML documents parsed sequentially in 203 ms
5000 XML documents parsed parallelly by 5 threads in 172 ms

That's much better (0.034 ms per document in the parallel run), and most importantly, multithreading really works now.
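For readers who want to try the borrow/parse/return pattern without commons-pool on the classpath, here is a minimal JDK-only sketch of the same idea. It is a simplified stand-in, not the pool used above: SimpleParserPool and its methods are hypothetical names, the parsers are created eagerly rather than on demand, and a plain SAX parse with an empty handler replaces XmlSlurper.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical fixed-size pool; names are illustrative, not from commons-pool.
class SimpleParserPool {
    private final BlockingQueue<SAXParser> idle;

    // Creates all parsers up front, on the calling thread.
    SimpleParserPool(int size) throws Exception {
        idle = new ArrayBlockingQueue<>(size);
        SAXParserFactory factory = SAXParserFactory.newInstance();
        for (int i = 0; i < size; i++) {
            idle.add(factory.newSAXParser());
        }
    }

    // Blocks until a parser is free, so no parser is ever used by two threads at once.
    SAXParser borrow() throws InterruptedException {
        return idle.take();
    }

    void giveBack(SAXParser parser) {
        idle.add(parser);
    }

    // Runs threadCount threads, each parsing the document 'iterations' times;
    // returns the total number of successfully parsed documents.
    static int parseInParallel(int threadCount, int iterations, int poolSize) throws Exception {
        SimpleParserPool pool = new SimpleParserPool(poolSize);
        byte[] xml = "<root><node1 aName='aValue'>1</node1></root>".getBytes(StandardCharsets.UTF_8);
        AtomicInteger parsed = new AtomicInteger();
        Thread[] threads = new Thread[threadCount];
        for (int t = 0; t < threadCount; t++) {
            threads[t] = new Thread(() -> {
                for (int i = 0; i < iterations; i++) {
                    try {
                        SAXParser parser = pool.borrow();
                        try {
                            parser.parse(new ByteArrayInputStream(xml), new DefaultHandler());
                            parsed.incrementAndGet();
                        } finally {
                            pool.giveBack(parser); // always hand the parser back
                        }
                    } catch (Exception e) {
                        throw new IllegalStateException(e);
                    }
                }
            });
            threads[t].start();
        }
        for (Thread thread : threads) {
            thread.join();
        }
        return parsed.get();
    }

    public static void main(String[] args) throws Exception {
        long start = System.currentTimeMillis();
        int total = parseInParallel(5, 1000, 5);
        long stop = System.currentTimeMillis();
        System.out.println(total + " XML documents parsed in " + (stop - start) + " ms");
    }
}
```

A pool smaller than the thread count still works in this sketch; the extra threads simply wait in borrow() until a parser is handed back.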

Resources

• Source code for this blog

• Article "Improve performance in your XML applications"

• GPath vs XPath

• commons-pool home page
