Recently I found myself in a bad place again. I had to automate one step in a workflow that was based entirely on processing files of different types, layouts and sizes.
I love automating things by developing software, but I hate doing so whenever it is based entirely on file systems.
To make matters worse, I had to process Excel files ... again! Being a very experienced VBA programmer, which I consider a source of shame rather than pride, I knew it would give me trouble.
Since I do not want to code in VBA anymore, I decided to use Java and the Apache POI API.
I started using the API and was happy. See the following articles for reference.
How to setup an eclipse project to use POI.
Identifying the last row in Excel files can be hard. It depends on what you define as the last row ;0).
A high level wrapper class around POI for processing Excel files.
And here comes the problem ...
Soon, of course, I ran into problems. I had to process a file that was over 10 MB in size and hit the following exception during unit testing.
The root cause is obvious: out of memory, caused by reading a Row in a Sheet using the POI API.
Below you can see my machine's performance and how the process aborts after some time, dropping back down from 5 GB RAM usage and 100% CPU usage. Not satisfying.
Event API - a solution for "intermediate developers"
A bit of Google research turned up the following interesting article by Apache on the POI Event API, which apparently has to be used instead of the usermodel API for very big Excel files. That said, the people I work with consider Excel files "big" when they are above 50 MB. Excel-driven companies (I hate them because changing things for the better is painful, and love them because there is always work).
Search for the heading "XSSF and SAX (Event API)".
The event API is separated into different parts for different Excel file types. For reading big Excel files in the xlsx format introduced with Excel 2007, you can use XSSF and SAX (Event API). Basically you parse the underlying XML yourself and no longer rely on the POI usermodel API. You could therefore, in theory, develop your own API for parsing xlsx files, but I do not recommend doing so. If you do want to take that step, JAXB might be of interest to you. I will definitely not go that far.
Starting out with the simple unit tests Apache provides under the above URL for the event API, I ran into different exceptions.
On the first try I got the path to the source file I wanted to read wrong. This produced the following exception, which is not very helpful. I would have liked a FileNotFoundException.
On the second try I got the file path right, but then ran into the following exception.
Sigh! Anyhow. Removing the reference to the non-existent class did the trick for me, but might not do the trick for you. The above exception is caused by the following call to the method createXmlHandler(String className).
My best first guess at a solution was to switch to the overloaded method createXmlHandler(), which attempts to determine the XMLHandler by checking certain places for constants providing the class path of the XMLHandler to use. Just read the JavaDoc of the method XMLReaderFactory.createXMLReader() and you will understand what I mean.
So I changed the method that fetches the sheet parser to let XMLReaderFactory determine the XMLReader class itself.
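As a minimal sketch of that change, using only the JDK (the class names and surrounding method in POI's published example may differ from your wrapper): instead of passing a hard-coded driver class name to the factory, call the no-argument variant and let it discover whichever SAX driver the JVM provides. Note that XMLReaderFactory is deprecated in newer JDKs; SAXParserFactory is the modern alternative.

```java
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;

public class ReaderLookupDemo {

    static XMLReader create() throws Exception {
        // Instead of hard-coding a parser class that may not be on the
        // classpath, e.g.
        //   XMLReaderFactory.createXMLReader("org.apache.xerces.parsers.SAXParser");
        // let the factory determine a suitable SAX driver itself:
        return XMLReaderFactory.createXMLReader();
    }

    public static void main(String[] args) throws Exception {
        // Prints the concrete driver class the factory picked on this JVM
        System.out.println(create().getClass().getName());
    }
}
```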
Et voilà. We have console output, as you can see in the screenshot below. The API processed over 91,405 records with 14 columns, most of them filled (no null Rows in the referenced Excel Sheet), in 45 seconds without exhausting my machine's resources. I would say that POI passed my performance test and I am happy again.
Refactoring my high level classes using POI
Now for understanding the source code provided by Apache and the new concept behind it. Below you can find the source code as it works for me as a unit test using JUnit 4. An explanation of the source code follows below.
So, in order to understand what the above code is doing, it helps to know some Excel file format internals. Let's take the following screenshot as my sample file.
It has several worksheets, and the worksheet of interest contains only one row with data. If I open the same file with archiving software like WinRAR, I see something like this.
Open the directory "xl" and have a look at the files in the worksheets directory as well as the file "sharedStrings.xml".
The concept behind the file "sharedStrings.xml" is to save space: each String is stored only once, and instead of repeatedly saving the same value, a cell holds only a reference to the position of an XML tag in sharedStrings.xml. The worksheets directory contains one or several XML files, each representing a Sheet in a Workbook. If you open one of the Sheet XML files, you will see that its cells contain no Strings, only numbers. Pick the number of a cell whose content you know and check the tag at that position in sharedStrings.xml. You will find the value that is displayed in the cell.
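In simplified form (the cell references and the string "Hello World" are invented for illustration), the two files relate like this: a cell with the attribute t="s" stores only an index, which resolves against an entry in sharedStrings.xml.

```xml
<!-- xl/worksheets/sheet1.xml (simplified): cell A1 holds index 0, not text -->
<worksheet>
  <sheetData>
    <row r="1">
      <c r="A1" t="s"><v>0</v></c>
    </row>
  </sheetData>
</worksheet>

<!-- xl/sharedStrings.xml (simplified): index 0 resolves to the displayed text -->
<sst count="1" uniqueCount="1">
  <si><t>Hello World</t></si>
</sst>
```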
The SAXParser simply runs through the handed-in XML file representing a Sheet, top to bottom. Each XML tag is handed to the ContentHandler set on the current parser instance. The nested class SheetHandler is the ContentHandler for the above unit test. If it finds a cell, indicated by the tag name "c", it checks whether the contained value is of type String. If so, it fetches the corresponding value from the SharedStringsTable handed to the handler at construction and prints it to the console.
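To make that mechanism concrete, here is a minimal, self-contained sketch of such a ContentHandler using only the JDK's SAX parser. The inline sheet XML and the String array standing in for POI's SharedStringsTable are invented for illustration; POI's actual SheetHandler works against the real archive entries.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SheetHandlerDemo {
    // Simplified sheet XML: A1 is a shared string (t="s") storing index 0,
    // B1 is a plain number
    static final String SHEET_XML =
          "<worksheet><sheetData><row r=\"1\">"
        + "<c r=\"A1\" t=\"s\"><v>0</v></c>"
        + "<c r=\"B1\"><v>42</v></c>"
        + "</row></sheetData></worksheet>";

    // Stand-in for POI's SharedStringsTable
    static final String[] SHARED = { "Hello" };

    static class SheetHandler extends DefaultHandler {
        private boolean isString;
        private final StringBuilder value = new StringBuilder();
        final StringBuilder out = new StringBuilder();

        @Override
        public void startElement(String uri, String local, String qName, Attributes attrs) {
            if ("c".equals(qName)) {            // a cell: remember its type
                isString = "s".equals(attrs.getValue("t"));
            }
            value.setLength(0);                 // reset buffer for the next value
        }

        @Override
        public void characters(char[] ch, int start, int len) {
            value.append(ch, start, len);       // SAX may deliver text in pieces
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            if ("v".equals(qName)) {            // cell value complete
                String v = value.toString();
                // resolve shared-string index, otherwise keep the raw number
                out.append(isString ? SHARED[Integer.parseInt(v)] : v).append('\n');
            }
        }
    }

    static String parse() throws Exception {
        SheetHandler handler = new SheetHandler();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(SHEET_XML)), handler);
        return handler.out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.print(parse());
    }
}
```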
Below is a screenshot of an example sharedStrings.xml in Eclipse.
I will now have to adjust this to meet my requirements, as I also want to handle values that are not Strings. I will refactor my class ExcelFile so it can decide itself when to use the event API, when to use the usermodel API, or a mix of both.
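One simple way such a decision could be made (the class, enum and threshold below are hypothetical, not the actual ExcelFile implementation) is to branch on file size before opening the workbook:

```java
import java.io.File;

public class ApiChooser {
    // Hypothetical threshold; tune it to your heap size and typical files
    static final long STREAMING_THRESHOLD = 10L * 1024 * 1024; // 10 MB

    enum Mode { USERMODEL, EVENT }

    // Small files fit in memory comfortably; big ones get the event API
    static Mode choose(long sizeBytes) {
        return sizeBytes > STREAMING_THRESHOLD ? Mode.EVENT : Mode.USERMODEL;
    }

    static Mode choose(File f) {
        return choose(f.length());
    }

    public static void main(String[] args) {
        System.out.println(choose(50L * 1024 * 1024)); // a 50 MB file
    }
}
```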
POI usermodel and StAX vs. SAX
SAX and StAX are event-based XML streaming APIs, which means they deliver events to a parser, which can then interpret them. Event-based means a parser can ignore certain events and only "listen" to the events it is interested in. When reading Excel files with one of these APIs, you will most likely not care about the formatting of a sheet, so your parser will probably ignore everything except rows ("row") and cells ("c").
Tree-based APIs like the POI usermodel or DOM are very inefficient in terms of memory usage. They have to read the whole file into memory, and the tree representation of an XML file tends to be several times larger than the original file.
SAX and StAX are streaming APIs that differ in how the data is streamed. SAX is a push API, while StAX pulls the information from the file in chunks. The advantage of StAX over SAX is that the consuming program stays in control of when the next chunk of data is read. With SAX you really have to implement a listener concept, since you have no control over when data is pushed into your program.
Now, ... since we already have to use StAX or SAX to parse big Excel files, why would we still need POI? Above I briefly documented how an Excel file is set up internally. POI is still useful for handling specifics like the shared strings table or the archive itself. You will also most likely end up parsing single rows from a sheet into the POI usermodel, because it provides a convenient API for handling Excel objects. However, this is totally up to you and your use case. In general you could stream Excel files without POI, using just StAX or SAX.
Reading Excel 2007 files with StAX
Since the following POI web page only provides an example for streaming an Excel file using SAX, I coded a unit test using StAX. Compared to SAX, StAX is much more intuitive to code, read and debug. Enjoy.
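The core of such a StAX pull loop can be sketched like this, using only the JDK's javax.xml.stream API. The inline sheet XML and the shared-strings array are invented stand-ins for illustration; against a real xlsx you would feed the reader the sheet entry from the archive. Note how the loop, not a callback, decides when the next event is pulled.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StaxSheetDemo {
    // Simplified sheet XML: A1 is a shared string (t="s"), B1 a plain number
    static final String SHEET_XML =
          "<worksheet><sheetData><row r=\"1\">"
        + "<c r=\"A1\" t=\"s\"><v>0</v></c>"
        + "<c r=\"B1\"><v>42</v></c>"
        + "</row></sheetData></worksheet>";

    // Stand-in for the resolved sharedStrings.xml
    static final String[] SHARED = { "Hello" };

    static String parse() throws Exception {
        XMLStreamReader reader = XMLInputFactory.newFactory()
            .createXMLStreamReader(new StringReader(SHEET_XML));
        StringBuilder out = new StringBuilder();
        boolean isString = false;
        while (reader.hasNext()) {
            // We pull the next event ourselves instead of being called back
            if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                if ("c".equals(reader.getLocalName())) {
                    isString = "s".equals(reader.getAttributeValue(null, "t"));
                } else if ("v".equals(reader.getLocalName())) {
                    String v = reader.getElementText(); // consumes up to </v>
                    out.append(isString ? SHARED[Integer.parseInt(v)] : v)
                       .append('\n');
                }
            }
        }
        reader.close();
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.print(parse());
    }
}
```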
A very good but not too long explanation of SAX and StAX.