Special characters parsing in Java

 
N Naresh
Ranch Hand
Posts: 66
Hi, I have the following program:

When we run it, we get the following output, in which an extra character is prepended to each special character.

<html>
<head>
<meta name="generator" content="HTML Tidy, see www.w3.org" />
<title>Character Set Test</title>
</head>
<body>
<h2>Character Set Test</h2>

<table border="1" >
<caption>Character entity references in HTML 4.0</caption>

<col align="LEFT" />
<col align="CENTER" style="font-size:1.4em" />
<col align="LEFT" />
<col align="RIGHT" />
<col align="RIGHT" />
<col align="LEFT" />
<tr>
<th class="main" ><em>ISO 8859-1
characters</em><br />
</th>
</tr>

<tr>
<td>nbsp</td>
<td> </td>
<td>no-break space (non-breaking space)</td>
<td>160</td>
</tr>

<tr>
<td>iexcl</td>
<td>¡</td>
<td>inverted exclamation mark</td>
<td>161</td>
</tr>

<tr>
<td>cent</td>
<td>¢</td>
<td>cent sign</td>
<td>162</td>
</tr>

<tr>
<td>pound</td>
<td>£</td>
<td>pound sign</td>
<td>163</td>
</tr>

<tr>
<td>curren</td>
<td>¤</td>
<td>currency sign</td>
<td>164</td>
</tr>

<tr>
<td>yen</td>
<td>Â¥</td>
<td>yen sign (yuan sign)</td>
<td>165</td>
</tr>

<tr>
<td>brvbar</td>
<td>¦</td>
<td>broken bar (broken vertical bar)</td>
<td>166</td>
</tr>

<tr>
<td>sect</td>
<td>§</td>
<td>section sign</td>
<td>167</td>
</tr>

<tr>
<td>uml</td>
<td>¨</td>
<td>diaeresis (spacing diaeresis)</td>
<td>168</td>
</tr>

<tr>
<td>copy</td>
<td>©</td>
<td>copyright sign</td>
<td>169</td>
</tr>

<tr>
<td>ordf</td>
<td>ª</td>
<td>feminine ordinal indicator</td>
<td>170</td>
</tr>

<tr>
<td>laquo</td>
<td>«</td>
<td>left-pointing double angle quotation mark (left pointing
guillemet)</td>
<td>171</td>
</tr>

<tr>
<td>not</td>
<td>¬</td>
<td>not sign</td>
<td>172</td>
</tr>

<tr>
<td>shy</td>
<td>­</td>
<td>soft hyphen (discretionary hyphen)</td>
<td>173</td>
</tr>

<tr>
<td>reg</td>
<td>®</td>
<td>registered sign (registered trade mark sign)</td>
<td>174</td>
</tr>

<tr>
<td>macr</td>
<td>¯</td>
<td>macron (spacing macron, overline APL overbar)</td>
<td>175</td>
</tr>

<tr>
<td>deg</td>
<td>°</td>
<td>degree sign</td>
<td>176</td>
</tr>

<tr>
<td>plusmn</td>
<td>±</td>
<td>plus-minus sign (plus-or-minus sign)</td>
<td>177</td>
</tr>

<tr>
<td>sup2</td>
<td>²</td>
<td>superscript two (superscript digit two, squared)</td>
<td>178</td>
</tr>

<tr>
<td>sup3</td>
<td>³</td>
<td>superscript three (superscript digit three, cubed)</td>
<td>179</td>
</tr>

<tr>
<td>acute</td>
<td>´</td>
<td>acute accent (spacing acute)</td>
<td>180</td>
</tr>

<tr>
<td>micro</td>
<td>µ</td>
<td>micro sign</td>
<td>181</td>
</tr>

<tr>
<td>para</td>
<td>¶</td>
<td>pilcrow sign (paragraph sign)</td>
<td>182</td>
</tr>

<tr>
<td>middot</td>
<td>·</td>
<td>middle dot (Georgian comma, Greek middle dot)</td>
<td>183</td>
</tr>

<tr>
<td>cedil</td>
<td>¸</td>
<td>cedilla (spacing cedilla)</td>
<td>184</td>
</tr>

<tr>
<td>sup1</td>
<td>¹</td>
<td>superscript one (superscript digit one)</td>
<td>185</td>
</tr>

<tr>
<td>ordm</td>
<td>º</td>
<td>masculine ordinal indicator</td>
<td>186</td>
</tr>

<tr>
<td>raquo</td>
<td>»</td>
<td>right-pointing double angle quotation mark (right pointing
guillemet)</td>
<td>187</td>
</tr>

<tr>
<td>frac14</td>
<td>¼</td>
<td>vulgar fraction one quarter (fraction one quarter)</td>
<td>188</td>
</tr>

<tr>
<td>frac12</td>
<td>½</td>
<td>vulgar fraction one half (fraction one half)</td>
<td>189</td>
</tr>

<tr>
<td>frac34</td>
<td>¾</td>
<td>vulgar fraction three quarters (fraction three quarters)</td>
<td>190</td>
</tr>

<tr>
<td>iquest</td>
<td>¿</td>
<td>inverted question mark (turned question mark)</td>
<td>191</td>
</tr>

<tr>
<td>Agrave</td>
<td>À</td>
<td>Latin capital letter A with grave (Latin capital letter A
grave)</td>
<td>192</td>
</tr>

<tr>
<td>Aacute</td>
<td>Ã?</td>
<td>Latin capital letter A with acute</td>
<td>193</td>
</tr>

<tr>
<td>Acirc</td>
<td>Â</td>
<td>Latin capital letter A with circumflex</td>
<td>194</td>
</tr>

<tr>
<td>Atilde</td>
<td>Ã</td>
<td>Latin capital letter A with tilde</td>
<td>195</td>
</tr>

<tr>
<td>Auml</td>
<td>Ä</td>
<td>Latin capital letter A with diaeresis</td>
<td>196</td>
</tr>

<tr>
<td>Aring</td>
<td>Ã…</td>
<td>Latin capital letter A with ring above (Latin capital letter A
ring)</td>
<td>197</td>
</tr>

<tr>
<td>AElig</td>
<td>Æ</td>
<td>Latin capital letter AE (Latin capital ligature AE)</td>
<td>198</td>
</tr>

<tr>
<td>Ccedil</td>
<td>Ç</td>
<td>Latin capital letter C with cedilla</td>
<td>199</td>
</tr>

<tr>
<td>Egrave</td>
<td>È</td>
<td>Latin capital letter E with grave</td>
<td>200</td>
</tr>
</table>
</body>
</html>

whereas the characters on the original web page are different, as follows:

Character Set Test
Character entity references in HTML 4.0 ISO 8859-1 characters
nbsp no-break space (non-breaking space) 160
iexcl ¡ inverted exclamation mark 161
cent ¢ cent sign 162
pound £ pound sign 163
curren ¤ currency sign 164
yen ¥ yen sign (yuan sign) 165
brvbar ¦ broken bar (broken vertical bar) 166
sect § section sign 167
uml ¨ diaeresis (spacing diaeresis) 168
copy © copyright sign 169
ordf ª feminine ordinal indicator 170
laquo « left-pointing double angle quotation mark (left pointing guillemet) 171
not ¬ not sign 172
shy ­ soft hyphen (discretionary hyphen) 173
reg ® registered sign (registered trade mark sign) 174
macr ¯ macron (spacing macron, overline APL overbar) 175
deg ° degree sign 176
plusmn ± plus-minus sign (plus-or-minus sign) 177
sup2 ² superscript two (superscript digit two, squared) 178
sup3 ³ superscript three (superscript digit three, cubed) 179
acute ´ acute accent (spacing acute) 180
micro µ micro sign 181
para ¶ pilcrow sign (paragraph sign) 182
middot · middle dot (Georgian comma, Greek middle dot) 183
cedil ¸ cedilla (spacing cedilla) 184
sup1 ¹ superscript one (superscript digit one) 185
ordm º masculine ordinal indicator 186
raquo » right-pointing double angle quotation mark (right pointing guillemet) 187
frac14 ¼ vulgar fraction one quarter (fraction one quarter) 188
frac12 ½ vulgar fraction one half (fraction one half) 189
frac34 ¾ vulgar fraction three quarters (fraction three quarters) 190
iquest ¿ inverted question mark (turned question mark) 191
Agrave À Latin capital letter A with grave (Latin capital letter A grave) 192
Aacute Á Latin capital letter A with acute 193
Acirc Â Latin capital letter A with circumflex 194
Atilde Ã Latin capital letter A with tilde 195
Auml Ä Latin capital letter A with diaeresis 196
Aring Å Latin capital letter A with ring above (Latin capital letter A ring) 197
AElig Æ Latin capital letter AE (Latin capital ligature AE) 198
Ccedil Ç Latin capital letter C with cedilla 199
Egrave È Latin capital letter E with grave 200

Can anybody please suggest how to solve this problem? It is urgent.

 
Ulf Dittmer
Rancher
Posts: 42967
Does the console (or wherever you're printing this) support those characters?
 
N Naresh
Ranch Hand
Posts: 66
The output is the same in any browser as well as in the Eclipse console.
 
Ulf Dittmer
Rancher
Posts: 42967
N Naresh wrote:
The output is the same in any browser as well as in the Eclipse console.

What do you mean by "in any browser"? The Java code runs on the command line, not in a browser, right? And again, does the Eclipse console support those characters?

Also, while I'm not sure what "tidy.setCharEncoding(org.w3c.tidy.Configuration.UTF8)" does, if the numerical entities are replaced by their corresponding characters (is that what "tidy.setNumEntities(true)" does?), then "stream.toString()" uses the platform default encoding, which is likely neither ISO-8859-1 nor UTF-8.
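
To illustrate the kind of mismatch described above, here is a minimal sketch of how UTF-8 bytes turn into two characters when decoded with a single-byte charset. It is not the original poster's program (that code is not shown in this thread); the class name MojibakeDemo and the helper decodeUtf8BytesAs are hypothetical, and ISO-8859-1 merely stands in for whatever the platform default happens to be.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Hypothetical helper: encode the text as UTF-8, then decode those
    // bytes with some other charset, as an unintended default would.
    static String decodeUtf8BytesAs(String text, Charset decodingCharset) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, decodingCharset);
    }

    public static void main(String[] args) {
        String yen = "\u00A5"; // yen sign, code point 165

        // The yen sign is two bytes in UTF-8 (0xC2 0xA5). ISO-8859-1 maps
        // every byte to one character, so the result has two characters.
        String garbled = decodeUtf8BytesAs(yen, StandardCharsets.ISO_8859_1);

        // Print code points instead of the characters themselves, so the
        // result does not depend on what the console can display.
        for (char c : garbled.toCharArray()) {
            System.out.printf("U+%04X%n", (int) c); // prints U+00C2, then U+00A5
        }

        // Shows which charset the no-argument conversions would use here.
        System.out.println("Platform default: " + Charset.defaultCharset());
    }
}
```

U+00C2 followed by U+00A5 is exactly the "Â¥" visible in the yen row of the output above, which suggests the extra character comes from the decoding step rather than from the console.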
 
N Naresh
Ranch Hand
Posts: 66
I took that code from my original web project and tried to test it in the Eclipse console; in both cases I get the same output.
 
Rob Spoor
Sheriff
Posts: 20532
N Naresh wrote:

You are using UTF-8 when reading the contents from the InputStream and writing to the ByteArrayOutputStream, but then you're using the system default encoding to convert that byte[] into a String. Try using this:
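
The snippet that followed this advice is not preserved here, so the following is only a sketch of the general idea, not necessarily the code that was posted: name the charset explicitly when turning the bytes of a ByteArrayOutputStream into a String. The class ExplicitCharsetDemo and its helpers are hypothetical, and utf8Stream merely stands in for whatever writes UTF-8 output into the stream in the real program.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

class ExplicitCharsetDemo {
    // Hypothetical stand-in for the real producer: fills a stream with
    // the UTF-8 encoding of the given text.
    static ByteArrayOutputStream utf8Stream(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream stream = new ByteArrayOutputStream();
        stream.write(bytes, 0, bytes.length);
        return stream;
    }

    // Decode with the same charset that was used for writing, instead of
    // relying on the platform default used by the no-argument toString().
    static String decode(ByteArrayOutputStream stream) {
        return new String(stream.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        String original = "\u00A5 \u00C1 \u00C5"; // yen sign, A acute, A ring
        ByteArrayOutputStream stream = utf8Stream(original);

        // Both forms name the charset explicitly and give the same result.
        String viaConstructor = decode(stream);
        String viaToString = stream.toString("UTF-8");

        System.out.println(original.equals(viaConstructor)); // prints true
        System.out.println(original.equals(viaToString));    // prints true
    }
}
```

If the decoded String is then printed to a console or written to a response, that step has its own encoding, so it may need the same explicit treatment.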
 
N Naresh
Ranch Hand
Posts: 66
Thank you, it seems to be working fine.
 