
How to identify whether a word in English text is a number

 
Em Aiy
Ranch Hand
Posts: 226
I am parsing English sentences that may contain numbers such as 87,000 or 8.302 or 45.43e3 or 54BA3E.

How can I check whether a word is an English word or a number?
 
fred rosenberger
lowercase baba
Bartender
Posts: 12017
First, you have to define in English what determines if a set of characters is a number or not. What, EXACTLY is allowed, and what EXACTLY is NOT allowed. Just writing down 3 or 4 examples may be enough for your brain, but there are a LOT of implicit assumptions there.

Once you decide what the rules are, then you can start coding them. But until you define what the rules are, writing any code is pointless.
 
Em Aiy
Ranch Hand
Posts: 226
fred rosenberger wrote:First, you have to define in English what determines if a set of characters is a number or not. What, EXACTLY is allowed, and what EXACTLY is NOT allowed. Just writing down 3 or 4 examples may be enough for your brain, but there are a LOT of implicit assumptions there.

Once you decide what the rules are, then you can start coding them. But until you define what the rules are, writing any code is pointless.

Let's say the rules cover:
numeric values in any of these formats, e.g. 88,000 or 88000 or 88,000.00
hexadecimal numbers
floating-point numbers with a "power" (exponent) sign or suffix

I can write the code to iterate through every character of a word to determine what I want. What I wanted to ask is whether there is any built-in support in Java, e.g. some method like isNumber().
 
Garrett Rowe
Ranch Hand
Posts: 1296
There are no built-in methods that do what you're asking for. There are a few algorithms I could think of to start. Obviously, there is some ambiguity in the rules you've given thus far. Would DEAD or FADE be parsed as a number or a word?
 
Em Aiy
Ranch Hand
Posts: 226
Garrett Rowe wrote: There are no built-in methods that do what you're asking for. There are a few algorithms I could think of to start. Obviously, there is some ambiguity in the rules you've given thus far. Would DEAD or FADE be parsed as a number or a word?

Actually, I was about to write some code to check whether a word is a number, so I thought it better to search around rather than reinvent the wheel.

I was confused, since a number like 88000 is easy to detect, but then I would have to tackle these cases as well:
88,000 (comma separated)
88,000.00 (proper decimal notation)

So that's why I was asking this question.

Talking about rules: I would say again, let's say the rules are "basic". Imagine you are reading a newspaper, a few numbers may appear in the news, and you have to detect those. Now you can imagine what kinds of numbers could ever appear in newspapers, plus "the hexadecimal notation".
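For the comma-separated forms mentioned above (88,000 and 88,000.00), one built-in option worth trying is java.text.NumberFormat. The following is only a minimal sketch: the class and method names are made up for illustration, the US locale is an assumption, and it is not meant to cover the hexadecimal or exponent forms. The parser may also accept oddly placed commas, so check it against your own rules.

```java
import java.text.NumberFormat;
import java.text.ParsePosition;
import java.util.Locale;

// Hypothetical helper, not part of the JDK.
class GroupedNumberCheck {

    // Returns true only if the WHOLE token parses as a number in the US locale.
    static boolean isGroupedNumber(String token) {
        NumberFormat format = NumberFormat.getNumberInstance(Locale.US);
        ParsePosition pos = new ParsePosition(0);
        Number parsed = format.parse(token, pos);
        // parse() stops at the first character it cannot use, so compare the
        // final position with the token length to reject input like "88,000abc".
        return parsed != null && pos.getIndex() == token.length();
    }

    public static void main(String[] args) {
        for (String t : new String[] {"88000", "88,000", "88,000.00", "million"}) {
            System.out.println(t + " -> " + isGroupedNumber(t));
        }
    }
}
```

Comparing the parse position with the token length is what keeps a token with trailing letters from being counted as a number.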
 
Garrett Rowe
Ranch Hand
Posts: 1296
Well... there is Integer.parseInt(String) to convert a String to an int. If the String isn't parsable, that method throws a NumberFormatException, which you could catch and try again. You could strip the punctuation out of each token so that it doesn't cause failures.
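A minimal sketch of that idea follows. The listing from the original post is not shown here, so the class name, method name, and the choice of `\p{Punct}` as "punctuation" are all assumptions. Note the side effects of stripping everything: "8.302" would become 8302 and a leading minus sign would be lost.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper illustrating the parseInt-and-catch approach.
class IntTokenFinder {

    // Splits on whitespace, strips ASCII punctuation from each token,
    // and keeps the tokens that Integer.parseInt accepts.
    static List<Integer> findInts(String sentence) {
        List<Integer> found = new ArrayList<>();
        for (String token : sentence.split("\\s+")) {
            // Assumption: every ASCII punctuation character is removed,
            // so "87,000" becomes "87000" (and "8.302" becomes "8302").
            String cleaned = token.replaceAll("\\p{Punct}", "");
            try {
                found.add(Integer.parseInt(cleaned));
            } catch (NumberFormatException e) {
                // Not an int (a word, hex digits, an empty token, or out of
                // int range): skip it and move on to the next token.
            }
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(findInts("Sales rose to 87,000 units, up from 54BA3E."));
    }
}
```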

 
Paul Clapham
Sheriff
Posts: 20177
Okay, so if I encountered the string "two billion" then that would be a number? Or are you only interested in numbers rendered as digits? Is there a limit on the number of digits or would a string of 87 digits be a number?

And what about i (the square root of minus one)? Or e (the base of the natural logarithms)? Or pi?
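The digit-limit question matters in code, too. As a rough sketch (class and method names are hypothetical), Integer.parseInt rejects anything outside the int range, while java.math.BigDecimal accepts digit strings of arbitrary length as well as an exponent suffix. Neither accepts grouping commas or number words.

```java
import java.math.BigDecimal;

// Hypothetical helper contrasting a bounded and an unbounded parser.
class BigNumberCheck {

    // True only for tokens that Integer.parseInt accepts (int range).
    static boolean fitsInInt(String token) {
        try {
            Integer.parseInt(token);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    // True for any token the BigDecimal(String) constructor accepts,
    // regardless of how many digits it has.
    static boolean isDecimalNumber(String token) {
        try {
            new BigDecimal(token);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }
}
```

Which of the two behaviours is right depends on the answer to the question above, namely whether an 87-digit string should count as a number.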
 
Campbell Ritchie
Sheriff
Posts: 47244
When I saw "word means number" I interpreted that as "word meaning a natural number", so you would include zero, nought, naught, aught, nothing, O, cipher, nil, love, duck etc. And that is before you have even got to "one"

"Natural number" means a member of the set ℕ, i.e. the non-negative integers 0, 1, 2, ...
 
Maneesh Godbole
Saloon Keeper
Posts: 10971
How about stuff like dozen, score, pair?
 
Vidmantas Maskoliunas
Greenhorn
Posts: 22
If I understand your problem well, java.util.Scanner with its methods hasNextInt(), hasNextDouble(), nextInt(), nextDouble() and so on may help.
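A minimal sketch of the Scanner approach, under some assumptions: the class and method names are hypothetical, the US locale is chosen so that ',' is the grouping separator, and a hexadecimal token is only counted when it contains at least one digit (otherwise ordinary words such as "a" or "DEAD" would qualify). That last rule is an invented tie-breaker, not something the thread settled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Scanner;

// Hypothetical helper built on java.util.Scanner's hasNextXxx() methods.
class ScannerNumberCheck {

    // Returns the whitespace-separated tokens that Scanner recognises as numbers.
    static List<String> numericTokens(String text) {
        List<String> out = new ArrayList<>();
        try (Scanner sc = new Scanner(text).useLocale(Locale.US)) {
            while (sc.hasNext()) {
                // Ask about the upcoming token before consuming it.
                boolean decimal = sc.hasNextLong() || sc.hasNextDouble();
                boolean hex = sc.hasNextLong(16); // radix 16, digits 0-9 and A-F
                String token = sc.next();
                // Assumption: hex tokens must contain a digit to count.
                boolean hexWithDigit = hex && token.chars().anyMatch(Character::isDigit);
                if (decimal || hexWithDigit) {
                    out.add(token);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(numericTokens("Profit was 87,000 or 8.302 or 45.43e3 or 54BA3E"));
    }
}
```

Punctuation attached to a token, such as a trailing comma, is not stripped here, so a tokenizing or cleaning step may still be needed in front of it.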
 
Arjun Abhishek
Ranch Hand
Posts: 57
Hi

I have written a solution for identifying whether words containing digits are valid numbers. While this solution is simple, it can be expanded by adding more regexes to the Pattern.
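The listing from this post is not reproduced here. As a stand-in, the following is a minimal sketch of a regex-based check in the same spirit. The patterns, the class name, and the rule that a hexadecimal word must contain at least one digit are all assumptions for illustration, not the poster's code.

```java
import java.util.regex.Pattern;

// Hypothetical regex-based check; the patterns are illustrative assumptions.
class RegexNumberCheck {

    // Optional sign, then plain digits or comma-grouped digits,
    // then an optional fraction and an optional exponent.
    private static final Pattern DECIMAL = Pattern.compile(
            "[+-]?(\\d{1,3}(,\\d{3})+|\\d+)(\\.\\d+)?([eE][+-]?\\d+)?");

    // Hex digits only; the lookahead requires at least one 0-9 digit
    // so that words like DEAD or FADE are not treated as numbers.
    private static final Pattern HEX = Pattern.compile("(?=.*\\d)[0-9A-Fa-f]+");

    static boolean isNumber(String word) {
        return DECIMAL.matcher(word).matches() || HEX.matcher(word).matches();
    }

    public static void main(String[] args) {
        for (String w : new String[] {"87,000", "8.302", "45.43e3", "54BA3E", "DEAD", "one"}) {
            System.out.println(w + " -> " + isNumber(w));
        }
    }
}
```

Further formats could be covered by adding more Pattern constants and OR-ing them into isNumber.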




Please let me know in which cases this program would fail and, if possible, how that could be avoided.

cheers
K
 
Em Aiy
Ranch Hand
Posts: 226
Paul Clapham wrote:Okay, so if I encountered the string "two billion" then that would be a number? Or are you only interested in numbers rendered as digits? Is there a limit on the number of digits or would a string of 87 digits be a number?

And what about i (the square root of minus one)? Or e (the base of the natural logarithms)? Or pi?

I have to detect only digits, not words that mean a number.

one million - should not be detected
1,000,000 - should be detected
 
fred rosenberger
lowercase baba
Bartender
Posts: 12017
Again, giving examples like "THIS is good, THIS is bad" is not a way to write programs. You need to define exactly what is allowed, or what causes it to be excluded.

First, how will you get the tokens from the string? Is "100 273" the number 100,273 or is it TWO numbers, 100 and 273?

I'm trying to get you to define the rules. Once you have a well-defined set, you can code to them. Your rules may be:

1) separate tokens based on the space character.
2) remove all punctuation from each token, except a '.' between two digits
3) a '-' is optional as the first character, but nowhere else.
4) There is an optional number of digits or characters A-F (are lowercase allowed?)
5) there is an optional decimal point
6) There is an optional number of digits or characters A-F (are lowercase allowed?)
7) there is an optional character 'e'
8) if there is an 'e', then there can be an optional number of digits

Will this work? I don't know. Do you want to allow "1.738 e4"? The above rules would fail to see this as 17380, since there is a space between the 8 and the 'e'.
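To see what rules 1 to 8 actually accept, here is a literal sketch of them. The class and method names are hypothetical, and two readings are assumptions: rule 2 is taken to drop every non-alphanumeric character except a '.' between two digits and a leading '-', and only uppercase A-F is allowed since the lowercase question is left open.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical, literal encoding of the eight example rules.
class ExampleRules {

    // Rules 3-8: optional '-', digits/A-F, optional '.', digits/A-F,
    // optional 'e' followed by an optional run of digits.
    private static final Pattern RULES =
            Pattern.compile("-?[0-9A-F]*\\.?[0-9A-F]*(e[0-9]*)?");

    // Rule 2 (as read above): keep letters, digits, a '.' between two
    // digits, and a '-' in first position; drop everything else.
    static String stripPunctuation(String token) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            boolean dotBetweenDigits = c == '.' && i > 0 && i + 1 < token.length()
                    && Character.isDigit(token.charAt(i - 1))
                    && Character.isDigit(token.charAt(i + 1));
            boolean leadingMinus = c == '-' && i == 0;
            if (Character.isLetterOrDigit(c) || dotBetweenDigits || leadingMinus) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Rule 1: split on the space character, then apply rules 2-8 per token.
    static List<String> numbers(String sentence) {
        List<String> found = new ArrayList<>();
        for (String token : sentence.split(" ")) {
            String cleaned = stripPunctuation(token);
            // Every part of the pattern is optional, so empty tokens
            // have to be excluded explicitly.
            if (!cleaned.isEmpty() && RULES.matcher(cleaned).matches()) {
                found.add(cleaned);
            }
        }
        return found;
    }
}
```

Calling `ExampleRules.numbers("It cost 87,000 not 1.738 e4 or DEAD")` returns 87000, 1.738, e4 and DEAD. The stray "e4" and the word DEAD both pass because every part of the rule set is optional, which is the kind of gap that only shows up once the rules are written down precisely.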

We can't tell you what the rules should be because we don't know your requirements. You have to tell us that.


 