• Post Reply Bookmark Topic Watch Topic
  • New Topic
programming forums Java Mobile Certification Databases Caching Books Engineering Micro Controllers OS Languages Paradigms IDEs Build Tools Frameworks Application Servers Open Source This Site Careers Other Pie Elite all forums
this forum made possible by our volunteer staff, including ...
Marshals:
  • Campbell Ritchie
  • Jeanne Boyarsky
  • Ron McLeod
  • Paul Clapham
  • Liutauras Vilda
Sheriffs:
  • paul wheaton
  • Rob Spoor
  • Devaka Cooray
Saloon Keepers:
  • Stephan van Hulst
  • Tim Holloway
  • Carey Brown
  • Frits Walraven
  • Tim Moores
Bartenders:
  • Mikalai Zaikin

Identifying in bold, italics, underline using POI

 
Ranch Hand
Posts: 132
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
We are using Apache POI 3.0 for reading values from an excel sheet.
We have a requirement which requires us to identify if some word in the excel sheet is in bold or italics or underline.

Can anyone please tell me how to do this?
 
Thomas Greene
Ranch Hand
Posts: 132
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
Sorry for my delayed reply.

This is the code I am using

void func()
{

try{

InputStream fis = new FileInputStream("C:/bind.xls");;
POIFSFileSystem fs = new POIFSFileSystem( fis );
HSSFWorkbook wb = new HSSFWorkbook(fs);
HSSFSheet sheet = wb.getSheetAt(0);
HSSFRow row=sheet.getRow(0);
HSSFCell cell=row.getCell((short)0);


HSSFRichTextString rts=cell.getRichStringCellValue();
System.out.println("----"+rts);

}catch(Exception e)
{
e.printStackTrace();
}

}
The problem is once the value of cell is fetched in the code all formatting is lost. Please tell me how to find out the formatting.

The value in cell could be something like

"abc pqr xyz"

where pqr is in bold,
abc is in italics
 
Thomas Greene
Ranch Hand
Posts: 132
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
Any one please
 
Rancher
Posts: 43081
77
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
The following snippet will tell you about the style at each character index:


This is another approach that enumerates the formatting runs, but the indices it returns are a bit goofy:
 
Thomas Greene
Ranch Hand
Posts: 132
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
Thanks for replying.


Originally posted by Ulf Dittmer:
The following snippet will tell you about the style at each character index:



It always returns the same result whether the character is bold or italic or underline. The result I get is always

index = 0, char 'O'
font # = 0
bold = 400
italic = false
underline = 0
strikeout = false

index = 1, char 'b'
font # = 0
bold = 400
italic = false
underline = 0
strikeout = false

index = 2, char 'j'
font # = 0
bold = 400
italic = false
underline = 0
strikeout = false
[ August 05, 2007: Message edited by: Thomas Greene ]
 
Thomas Greene
Ranch Hand
Posts: 132
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator

Originally posted by Ulf Dittmer:

This is another approach that enumerates the formatting runs, but the indices it returns are a bit goofy:



System.out.println(rts.numFormattingRuns());
returns 0, so it never enters the loop.
 
Ulf Dittmer
Rancher
Posts: 43081
77
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
That's odd. I tested it with a string that contains runs of all 4 formatting features, and it reported all of them.

And then I noticed that I had misinterpreted the index numbers, and that they actually do make sense. So in my test case, both approaches work. I've put the file here.
[ August 05, 2007: Message edited by: Ulf Dittmer ]
 
Thomas Greene
Ranch Hand
Posts: 132
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
That is really surprising. Your excel sheet works fine, while mine does not. Is there a way I can upload my excel sheet for you to have a look?
 
Ulf Dittmer
Rancher
Posts: 43081
77
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
I've sent you a private message. You can read it by following the "My Profile" link at the top of the page.
 
Thomas Greene
Ranch Hand
Posts: 132
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator

Originally posted by Ulf Dittmer:
I've sent you a private message. You can read it by following the "My Profile" link at the top of the page.



Please check your email

Thank You.
[ August 06, 2007: Message edited by: Thomas Greene ]
 
Ulf Dittmer
Rancher
Posts: 43081
77
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
The difference seems to be that there are no "runs" in your example - all of the cell contents use the same styling. In that case, you can get the style information from the cell font:



That is apparently the style information used for the first run if there is more than one run in a cell.
[ August 06, 2007: Message edited by: Ulf Dittmer ]
 
Thomas Greene
Ranch Hand
Posts: 132
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
Thanks a lot Ulf
Can you please tell me what does "run" mean here.

Also, I tried the code you have given. Still not getting absolutely correct results.
 
Thomas Greene
Ranch Hand
Posts: 132
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
I have mailed the excel sheet.
If you observe, last word (uuu) is both underline, bold and italics.
But only italics is giving true (assuming bold=400 means normal formatting and not bold)

Thanks again
[ August 06, 2007: Message edited by: Thomas Greene ]
 
Ulf Dittmer
Rancher
Posts: 43081
77
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
The code prints repeatedly the cell style, not the character style, that's why you're getting the same style for all characters.

It looks correct to me. The cell style is italics, non-bold, non-underline. That's the first run, i.e. what the string starts with. Then there's a bold, italic, non-underline run that starts at index 10, and another run that's bold, italics, underline starting at index 17.

"Run" here is a sequence of characters that use the same style -and by implication, the same font index-. The example has three runs - the first one is represented by the cell style, and the other two are accessed through numFormattingRuns and getFontOfFormattingRun.
[ August 06, 2007: Message edited by: Ulf Dittmer ]
 
Thomas Greene
Ranch Hand
Posts: 132
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
Thanks for replying Ulf

Originally posted by Ulf Dittmer:
and another run that's bold, italics, underline starting at index 17.


[ August 06, 2007: Message edited by: Ulf Dittmer ]



But in the output I am getting (starting index 17)


Doesn't that mean "not bold", "not underline" but only italics
 
Ulf Dittmer
Rancher
Posts: 43081
77
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
Did you fix the bug in your code that I described in the first sentence of my last post? With that fixed, I'm getting the following, which looks OK.
 
Thomas Greene
Ranch Hand
Posts: 132
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
Thanks Ulf

I see what you are trying to say.

Instead of
HSSFCellStyle style = cell.getCellStyle();
I should be getting character style. But there does not seem to be any method which will get me the character style. Which method should be used instead?
 
Ulf Dittmer
Rancher
Posts: 43081
77
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator

But there does not seem to be any method which will get me the character style.



The code in my first post did exactly that.
 
Thomas Greene
Ranch Hand
Posts: 132
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
Are you referring to this code

String text = rts.toString();

for (int i=0; i<rts.length(); i++) {
System.out.println("index = "+i+", char '"+text.charAt(i)+"'");
System.out.println("font # = "+rts.getFontAtIndex(i));
HSSFFont font = wb.getFontAt(rts.getFontAtIndex(i));
System.out.println("bold = "+font.getBoldweight());
System.out.println("italic = "+font.getItalic());
System.out.println("underline = "+font.getUnderline());
System.out.println("strikeout = "+font.getStrikeout());
System.out.println();
}



But as mentioned above it is giving inconsistent results.
 
Ulf Dittmer
Rancher
Posts: 43081
77
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
Maybe that's an artifact if there is only a single run. Judging by my tests, the algorithm would be something like:

- the style of the first run of text (which may be the only one) is the one in wb.getFontAt(cell.getCellStyle().getFontIndex());

- if there are more runs (determined by rts.numFormattingRuns()), then starting with the second run, the style can be determined by wb.getFontAt(rts.getFontOfFormattingRun(i)) starting at index rts.getIndexOfFormattingRun(i)

That operates at a higher level than characters (with runs), and may thus simplify the processing.
 
Thomas Greene
Ranch Hand
Posts: 132
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
Thanks for replying Ulf.

rts.numFormattingRuns() returns 0, so I never get into the for loop.
 
Ulf Dittmer
Rancher
Posts: 43081
77
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator

rts.numFormattingRuns() returns 0, so I never get into the for loop.



Just to clarify: it does that for a cell that has more than one run, i.e. at least two different styles in it? If so, you might want to file a bug report with the POI folks.

As stated before, I've seen it do that for cells that only have a single style, but not for cells that have more than one.
 
Thomas Greene
Ranch Hand
Posts: 132
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
Thanks Ulf.

Is there a way that this problem can be solved?

Actually my requirement is something like this.
In the excel sheet, some of the cells may have values like

abc xyz

This value needs to be fetched using Java code and send to some system in the form of

<b>abc</b> <i>xyz</i>

These HTML tags are required since the content will then be pulled out and displayed on HTML page.

Is there something that can be done to solve this problem?

Thank You.
 
Ulf Dittmer
Rancher
Posts: 43081
77
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
Short of fixing the bug -if there is one- I don't think this can be solved.

But let me ask again, because I'm still not clear on this from what you write: numFormattingRuns() returns 0 for a cell that contains more than one style, e.g. "bbb iii"? That would be a bug, but I'm not seeing it. Feel free to send me an XLS file that exhibits this behavior.
 
Thomas Greene
Ranch Hand
Posts: 132
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator

Originally posted by Ulf Dittmer:
Short of fixing the bug -if there is one- I don't think this can be solved.

But let me ask again, because I'm still not clear on this from what you write: numFormattingRuns() returns 0 for a cell that contains more than one style, e.g. "bbb iii"? That would be a bug, but I'm not seeing it. Feel free to send me an XLS file that exhibits this behavior.



You are right Ulf. Seems to be my mistake. I might be pointing to the wrong cell.
Although I am getting a value for numFormattingRuns but it is not the correct value. I'll mail you that sheet in a while.
 
Greenhorn
Posts: 5
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator

Originally posted by Thomas Greene:


You are right Ulf. Seems to be my mistake. I might be pointing to the wrong cell.
Although I am getting a value for numFormattingRuns but it is not the correct value. I'll mail you that sheet in a while.



Seems it doesn't give the index of first word but starts from the space after the first word.
Why is it so?

If you are surprised about my new id, then please have a look at

https://coderanch.com/t/3240/Ranch-Office/Changed-mail-id-incorrect
[ August 08, 2007: Message edited by: Thomas Greane ]
 
Ulf Dittmer
Rancher
Posts: 43081
77
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
It looks like the "algorithm" I gave before works fine for this case. The first run of text is obtained from the cell style, and the other runs after that by numFormattingRuns() and the associated methods. So the number of runs returned by is() is one less than the actual number of runs (because the first one is considered the cell style). A bit confusing, but not at all hard to code. As to why that is, I have no idea - that's a question for the POI project members.
 
Thomas Greane
Greenhorn
Posts: 5
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator

Originally posted by Ulf Dittmer:
So the number of runs returned by is() is one less than the actual number of runs (because the first one is considered the cell style).



Shouldn't that make it one more.
Suppose I have one word in bold, second word in italics and third word on underline. Now shouldn't the algorithm be giving me 4 values.
One for cell style, second for bold, third for italics and fourth for underline.
 
Ulf Dittmer
Rancher
Posts: 43081
77
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
Maybe the API is modeled after the internal structure of the XLS file format, and that this is how the file format stores it.

The important thing is that it is possible to get at all data even if it's non-obvious how to do so. I'd recommend that you submit improved documentation for the HSSFRichTextString class to the POI folks, so that the next person wondering about this knows what's going on.
 
Thomas Greane
Greenhorn
Posts: 5
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator

Originally posted by Ulf Dittmer:
Maybe the API is modeled after the internal structure of the XLS file format, and that this is how the file format stores it.

The important thing is that it is possible to get at all data even if it's non-obvious how to do so. I'd recommend that you submit improved documentation for the HSSFRichTextString class to the POI folks, so that the next person wondering about this knows what's going on.



I would surely do that. But I am still wondering how to get the formatting of the first word in the cell.
 
Ulf Dittmer
Rancher
Posts: 43081
77
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator

But I am still wondering how to get the formatting of the first word in the cell.



By using the cell style. See my post starting with "The difference seems to be that there...".

I had described the algorithm to use -no matter how many runs of styled text there are- in the post starting "Maybe that's an artifact if there ...".

BTW, did you see this?
[ August 08, 2007: Message edited by: Ulf Dittmer ]
 
Thomas Greene
Ranch Hand
Posts: 132
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator

Originally posted by Ulf Dittmer:



BTW, did you see this?

[ August 08, 2007: Message edited by: Ulf Dittmer ]



Thanks Ulf. I am back to my old log in.
 
Thomas Greene
Ranch Hand
Posts: 132
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator

Originally posted by Ulf Dittmer:


By using the cell style. See my post starting with "The difference seems to be that there...".

I had described the algorithm to use -no matter how many runs of styled text there are- in the post starting "Maybe that's an artifact if there ...".

BTW, did you see this?

[ August 08, 2007: Message edited by: Ulf Dittmer ]



Thanks Ulf, it's working fine
 
reply
    Bookmark Topic Watch Topic
  • New Topic