If you have to determine the last row in a Microsoft Excel file you will face some trouble.
For whatever reason there is no reliable method available in Excel VBA to retrieve the last row with data. Since Microsoft developers seem to have decided to treat formatting and data the same way when it comes to determining the so called UsedRange in a worksheet, we are officially screwed.
Here is what POI developers are stating in their API docs regarding this problem :0)."Gets the number last row on the sheet. Owing to idiosyncrasies in the excel file format, if the result of calling this method is zero, you can't tell if that means there are zero rows on the sheet, or one at position zero. For that case, additionally call getPhysicalNumberOfRows() to tell if there is a row at position zero or not."
I fully agree: Excel's core implementation and resulting behavior is nuts many times.
Apache POI developers are now facing the same issue, Excel VBA developers face when determining row counts in sheets. In Excel VBA I found following workaround. Read below article before continuing with this one, since it also describes how Excels UsedRange behaves. This is important to know when dealing with Microsoft Excel files. If you know that already you can of course skip it.
In POI? Well, let's examine and see what can be done.
First things first: Test and observe
For my test I developed a little class called ExcelFile. Since I only want a very generalized API when I have to deal with files I do not want to fix issues like the one which motivates me to write this article all the time. If you are interested in the class ExcelFile, you can find a link at the end of this article.
If we call the method getLastRowNum() on a sheet formatted as the above one, we will receive 8 as total row count. That is because Excel is regarding the formatting in sheets as well. Since this is the case and POI wants to support formats in Excel files, developers did not have a choice but to implement it in the same manner. Maybe they would have wished for the number of rows with data to be stored separately in any Excel file as a property. Well, ... nope. They do not have a choice but to live with that, since they cannot tell Microsoft what to do.
If the blue formatting is gone and all cells are empty, the method getLastRowNum() would return 6 as the total row count, which is what I want at least in this case.
Enough test, I know what to do since I've been coding Excel VBA for some years now. I hate it with a passion and had enough frustration with this kind of issues to unfortunately call myself an Excel expert.
Definition of physical and logical Rows in a Sheet
First I tried the following function, which did not give me the desired result.
According to the Apache POI API docs the call to the method getPhysicalNumberOfRows is always necessary, if the int returned by method getLastRowNum is zero. Zero could mean that a referenced Sheet contains one row with data. The call to method getPhysicalNumberOfRows returns the actual number of rows, basically the count of initialized rows from an iterator. My understanding is, that an completely empty row in a Sheet is not represented in the Rows of a sheets data model in order to save RAM. But that is just an assumption.
You have to understand the difference between count of physical and logical rows in an Excel sheet. Let me try following simple definition.
A physical Row in a Sheet contains data, formatting or other information. At least something must be in a Row for it to be considered in a Sheets underlying data model.
A logical Row in contrary could be located anywhere in between physical Rows in a Sheet, but still be empty. A logical Row can be null depending on settings in interface Sheet.
Since this does apply, neither the physical nor the count of logical rows will ever give you what you need, if you are interested in the last Row number with data in a Sheet. Now you might think this is useless since you will always have to iterate through all lines anyhow, if you are interested in the last row with data.
Believe me, I have enough experience with coding in Excel to say, you will need it.
So this is all very confusing and so on, but let's have a look at method getFirstRowNum(). API docs say the following.
the number of the first logical row on the sheet (0-based)
I think what the author actually means to say is the following.
the logical row number of the first physical row on the sheet (0-based)
Yeah! believe me I tested it. The term "physical row" in the API docs is used consistently with a methods functionality also for the class Cell. I think the definition above fits the purpose, until I'll prove myself wrong :0). The term "logical" is definitely used inconsistently in the API docs, as they say at one point a logical row can be null, but the method getFirstRowNum() returns the result I would expect from a physical rule.
The following code works for me, but it is slow and not elegant. Until I found a faster solution this is faster, I will be using my workaround this missing method in POI. It iterates through all rows and compares cells to constants from class Cell for each row.
I am not happy.
Anyhow, here is the link to my class called ExcelFile.