wood burning stoves 2.0*
The moose likes Other Open Source Projects and the fly likes Identifying in bold, italics, underline using POI Big Moose Saloon
  Search | Java FAQ | Recent Topics | Flagged Topics | Hot Topics | Zero Replies
Register / Login


JavaRanch » Java Forums » Products » Other Open Source Projects
Bookmark "Identifying in bold, italics, underline using POI" Watch "Identifying in bold, italics, underline using POI" New topic
Author

Identifying in bold, italics, underline using POI

Thomas Greene
Ranch Hand

Joined: Aug 09, 2004
Posts: 126
We are using Apache POI 3.0 for reading values from an excel sheet.
We have a requirement which requires us to identify if some word in the excel sheet is in bold or italics or underline.

Can anyone please tell me how to do this?
Thomas Greene
Ranch Hand

Joined: Aug 09, 2004
Posts: 126
Sorry for my delayed reply.

This is the code I am using

void func()
{

try{

InputStream fis = new FileInputStream("C:/bind.xls");;
POIFSFileSystem fs = new POIFSFileSystem( fis );
HSSFWorkbook wb = new HSSFWorkbook(fs);
HSSFSheet sheet = wb.getSheetAt(0);
HSSFRow row=sheet.getRow(0);
HSSFCell cell=row.getCell((short)0);


HSSFRichTextString rts=cell.getRichStringCellValue();
System.out.println("----"+rts);

}catch(Exception e)
{
e.printStackTrace();
}

}
The problem is once the value of cell is fetched in the code all formatting is lost. Please tell me how to find out the formatting.

The value in cell could be something like

"abc pqr xyz"

where pqr is in bold,
abc is in italics
Thomas Greene
Ranch Hand

Joined: Aug 09, 2004
Posts: 126
Any one please
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 42047
    
  64
The following snippet will tell you about the style at each character index:


This is another approach that enumerates the formatting runs, but the indices it returns are a bit goofy:


Ping & DNS - my free Android networking tools app
Thomas Greene
Ranch Hand

Joined: Aug 09, 2004
Posts: 126
Thanks for replying.


Originally posted by Ulf Dittmer:
The following snippet will tell you about the style at each character index:



It always returns the same result whether the character is bold or italic or underline. The result I get is always

index = 0, char 'O'
font # = 0
bold = 400
italic = false
underline = 0
strikeout = false

index = 1, char 'b'
font # = 0
bold = 400
italic = false
underline = 0
strikeout = false

index = 2, char 'j'
font # = 0
bold = 400
italic = false
underline = 0
strikeout = false
[ August 05, 2007: Message edited by: Thomas Greene ]
Thomas Greene
Ranch Hand

Joined: Aug 09, 2004
Posts: 126
Originally posted by Ulf Dittmer:

This is another approach that enumerates the formatting runs, but the indices it returns are a bit goofy:


System.out.println(rts.numFormattingRuns());
returns 0, so it never enters the loop.
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 42047
    
  64
That's odd. I tested it with a string that contains runs of all 4 formatting features, and it reported all of them.

And then I noticed that I had misinterpreted the index numbers, and that they actually do make sense. So in my test case, both approaches work. I've put the file here.
[ August 05, 2007: Message edited by: Ulf Dittmer ]
Thomas Greene
Ranch Hand

Joined: Aug 09, 2004
Posts: 126
That is really surprising. Your excel sheet works fine, while mine does not. Is there a way I can upload my excel sheet for you to have a look?
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 42047
    
  64
I've sent you a private message. You can read it by following the "My Profile" link at the top of the page.
Thomas Greene
Ranch Hand

Joined: Aug 09, 2004
Posts: 126
Originally posted by Ulf Dittmer:
I've sent you a private message. You can read it by following the "My Profile" link at the top of the page.


Please check your email

Thank You.
[ August 06, 2007: Message edited by: Thomas Greene ]
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 42047
    
  64
The difference seems to be that there are no "runs" in your example - all of the cell contents use the same styling. In that case, you can get the style information from the cell font:



That is apparently the style information used for the first run if there is more than one run in a cell.
[ August 06, 2007: Message edited by: Ulf Dittmer ]
Thomas Greene
Ranch Hand

Joined: Aug 09, 2004
Posts: 126
Thanks a lot Ulf
Can you please tell me what does "run" mean here.

Also, I tried the code you have given. Still not getting absolutely correct results.
Thomas Greene
Ranch Hand

Joined: Aug 09, 2004
Posts: 126
I have mailed the excel sheet.
If you observe, last word (uuu) is both underline, bold and italics.
But only italics is giving true (assuming bold=400 means normal formatting and not bold)

Thanks again
[ August 06, 2007: Message edited by: Thomas Greene ]
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 42047
    
  64
The code prints repeatedly the cell style, not the character style, that's why you're getting the same style for all characters.

It looks correct to me. The cell style is italics, non-bold, non-underline. That's the first run, i.e. what the string starts with. Then there's a bold, italic, non-underline run that starts at index 10, and another run that's bold, italics, underline starting at index 17.

"Run" here is a sequence of characters that use the same style -and by implication, the same font index-. The example has three runs - the first one is represented by the cell style, and the other two are accessed through numFormattingRuns and getFontOfFormattingRun.
[ August 06, 2007: Message edited by: Ulf Dittmer ]
Thomas Greene
Ranch Hand

Joined: Aug 09, 2004
Posts: 126
Thanks for replying Ulf
Originally posted by Ulf Dittmer:
and another run that's bold, italics, underline starting at index 17.


[ August 06, 2007: Message edited by: Ulf Dittmer ]


But in the output I am getting (starting index 17)


Doesn't that mean "not bold", "not underline" but only italics
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 42047
    
  64
Did you fix the bug in your code that I described in the first sentence of my last post? With that fixed, I'm getting the following, which looks OK.
Thomas Greene
Ranch Hand

Joined: Aug 09, 2004
Posts: 126
Thanks Ulf

I see what you are trying to say.

Instead of
HSSFCellStyle style = cell.getCellStyle();
I should be getting character style. But there does not seem to be any method which will get me the character style. Which method should be used instead?
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 42047
    
  64
But there does not seem to be any method which will get me the character style.


The code in my first post did exactly that.
Thomas Greene
Ranch Hand

Joined: Aug 09, 2004
Posts: 126
Are you referring to this code

String text = rts.toString();

for (int i=0; i<rts.length(); i++) {
System.out.println("index = "+i+", char '"+text.charAt(i)+"'");
System.out.println("font # = "+rts.getFontAtIndex(i));
HSSFFont font = wb.getFontAt(rts.getFontAtIndex(i));
System.out.println("bold = "+font.getBoldweight());
System.out.println("italic = "+font.getItalic());
System.out.println("underline = "+font.getUnderline());
System.out.println("strikeout = "+font.getStrikeout());
System.out.println();
}


But as mentioned above it is giving inconsistent results.
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 42047
    
  64
Maybe that's an artifact if there is only a single run. Judging by my tests, the algorithm would be something like:

- the style of the first run of text (which may be the only one) is the one in wb.getFontAt(cell.getCellStyle().getFontIndex());

- if there are more runs (determined by rts.numFormattingRuns()), then starting with the second run, the style can be determined by wb.getFontAt(rts.getFontOfFormattingRun(i)) starting at index rts.getIndexOfFormattingRun(i)

That operates at a higher level than characters (with runs), and may thus simplify the processing.
Thomas Greene
Ranch Hand

Joined: Aug 09, 2004
Posts: 126
Thanks for replying Ulf.

rts.numFormattingRuns() returns 0, so I never get into the for loop.
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 42047
    
  64
rts.numFormattingRuns() returns 0, so I never get into the for loop.


Just to clarify: it does that for a cell that has more than one run, i.e. at least two different styles in it? If so, you might want to file a bug report with the POI folks.

As stated before, I've seen it do that for cells that only have a single style, but not for cells that have more than one.
Thomas Greene
Ranch Hand

Joined: Aug 09, 2004
Posts: 126
Thanks Ulf.

Is there a way that this problem can be solved?

Actually my requirement is something like this.
In the excel sheet, some of the cells may have values like

abc xyz

This value needs to be fetched using Java code and send to some system in the form of

<b>abc</b> <i>xyz</i>

These HTML tags are required since the content will then be pulled out and displayed on HTML page.

Is there something that can be done to solve this problem?

Thank You.
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 42047
    
  64
Short of fixing the bug -if there is one- I don't think this can be solved.

But let me ask again, because I'm still not clear on this from what you write: numFormattingRuns() returns 0 for a cell that contains more than one style, e.g. "bbb iii"? That would be a bug, but I'm not seeing it. Feel free to send me an XLS file that exhibits this behavior.
Thomas Greene
Ranch Hand

Joined: Aug 09, 2004
Posts: 126
Originally posted by Ulf Dittmer:
Short of fixing the bug -if there is one- I don't think this can be solved.

But let me ask again, because I'm still not clear on this from what you write: numFormattingRuns() returns 0 for a cell that contains more than one style, e.g. "bbb iii"? That would be a bug, but I'm not seeing it. Feel free to send me an XLS file that exhibits this behavior.


You are right Ulf. Seems to be my mistake. I might be pointing to the wrong cell.
Although I am getting a value for numFormattingRuns but it is not the correct value. I'll mail you that sheet in a while.
Thomas Greane
Greenhorn

Joined: Aug 08, 2007
Posts: 5
Originally posted by Thomas Greene:


You are right Ulf. Seems to be my mistake. I might be pointing to the wrong cell.
Although I am getting a value for numFormattingRuns but it is not the correct value. I'll mail you that sheet in a while.


Seems it doesn't give the index of first word but starts from the space after the first word.
Why is it so?

If you are surprised about my new id, then please have a look at

http://www.coderanch.com/t/3240/Ranch-Office/Changed-mail-id-incorrect
[ August 08, 2007: Message edited by: Thomas Greane ]
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 42047
    
  64
It looks like the "algorithm" I gave before works fine for this case. The first run of text is obtained from the cell style, and the other runs after that by numFormattingRuns() and the associated methods. So the number of runs returned by is() is one less than the actual number of runs (because the first one is considered the cell style). A bit confusing, but not at all hard to code. As to why that is, I have no idea - that's a question for the POI project members.
Thomas Greane
Greenhorn

Joined: Aug 08, 2007
Posts: 5
Originally posted by Ulf Dittmer:
So the number of runs returned by is() is one less than the actual number of runs (because the first one is considered the cell style).


Shouldn't that make it one more.
Suppose I have one word in bold, second word in italics and third word on underline. Now shouldn't the algorithm be giving me 4 values.
One for cell style, second for bold, third for italics and fourth for underline.
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 42047
    
  64
Maybe the API is modeled after the internal structure of the XLS file format, and that this is how the file format stores it.

The important thing is that it is possible to get at all data even if it's non-obvious how to do so. I'd recommend that you submit improved documentation for the HSSFRichTextString class to the POI folks, so that the next person wondering about this knows what's going on.
Thomas Greane
Greenhorn

Joined: Aug 08, 2007
Posts: 5
Originally posted by Ulf Dittmer:
Maybe the API is modeled after the internal structure of the XLS file format, and that this is how the file format stores it.

The important thing is that it is possible to get at all data even if it's non-obvious how to do so. I'd recommend that you submit improved documentation for the HSSFRichTextString class to the POI folks, so that the next person wondering about this knows what's going on.


I would surely do that. But I am still wondering how to get the formatting of the first word in the cell.
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 42047
    
  64
But I am still wondering how to get the formatting of the first word in the cell.


By using the cell style. See my post starting with "The difference seems to be that there...".

I had described the algorithm to use -no matter how many runs of styled text there are- in the post starting "Maybe that's an artifact if there ...".

BTW, did you see this?
[ August 08, 2007: Message edited by: Ulf Dittmer ]
Thomas Greene
Ranch Hand

Joined: Aug 09, 2004
Posts: 126
Originally posted by Ulf Dittmer:



BTW, did you see this?

[ August 08, 2007: Message edited by: Ulf Dittmer ]


Thanks Ulf. I am back to my old log in.
Thomas Greene
Ranch Hand

Joined: Aug 09, 2004
Posts: 126
Originally posted by Ulf Dittmer:


By using the cell style. See my post starting with "The difference seems to be that there...".

I had described the algorithm to use -no matter how many runs of styled text there are- in the post starting "Maybe that's an artifact if there ...".

BTW, did you see this?

[ August 08, 2007: Message edited by: Ulf Dittmer ]


Thanks Ulf, it's working fine
 
GeeCON Prague 2014
 
subject: Identifying in bold, italics, underline using POI