Meaningless Drivel is fun!*
The moose likes Performance and the fly likes Efficient algorithm for string manipulation Big Moose Saloon
  Search | Java FAQ | Recent Topics | Flagged Topics | Hot Topics | Zero Replies
Register / Login


Win a copy of Murach's Java Servlets and JSP this week in the Servlets forum!
JavaRanch » Java Forums » Java » Performance
Bookmark "Efficient algorithm for string manipulation" Watch "Efficient algorithm for string manipulation" New topic
Author

Efficient algorithm for string manipulation

Nagya Chindhi
Greenhorn

Joined: Jun 30, 2004
Posts: 3
Somebody asked me this simple question recently. I want to find the
best algorithm for this:

1. We have two strings s1 and s2. We want to subtract every character
that exists in s2 from s1.

String s1 = "Misinterpret";
String s2 = "interpret";

We want to write a method that will return string "Mis" (in any order
"siM", "iMs" etc. is acceptable);

2. The order of characters does not matter.

I know there are java String API's to do that, but we will assume in
our case that we are manipulating using basic arrays.

What is the best approach to do this and what is its efficiency in
Big(O) ?
Ilja Preuss
author
Sheriff

Joined: Jul 11, 2001
Posts: 14112
Well, I'd probably put the characters into two Sets and apply a removeAll operation.

But if you have to do it using arrays, here is one way:

Create a boolean array with one boolean value for each possible character value. Set every value to false initially.

Now, for every char in the first string, set the corresponding boolean value to true. Then for every char in the second string, set it to false.

Every char that now has a correspending boolean value of true is one that needs to be in your output.

Out of the head I'd say that this has a complexity of O(n) (where n is the sum of characters in the two strings).


The soul is dyed the color of its thoughts. Think only on those things that are in line with your principles and can bear the light of day. The content of your character is your choice. Day by day, what you do is who you become. Your integrity is your destiny - it is the light that guides your way. - Heraclitus
Nagya Chindhi
Greenhorn

Joined: Jun 30, 2004
Posts: 3
What happens if a character is repeated. Each character is not necessarily unique. Like in our case, the character "e" occurs twice.


Originally posted by Ilja Preuss:
Well, I'd probably put the characters into two Sets and apply a removeAll operation.

But if you have to do it using arrays, here is one way:

Create a boolean array with one boolean value for each possible character value. Set every value to false initially.

Now, for every char in the first string, set the corresponding boolean value to true. Then for every char in the second string, set it to false.

Every char that now has a correspending boolean value of true is one that needs to be in your output.

Out of the head I'd say that this has a complexity of O(n) (where n is the sum of characters in the two strings).
Steven Bell
Ranch Hand

Joined: Dec 29, 2004
Posts: 1071
I would make a 'map' out of the first string with a counter for each character. Either two arrays char[52] int[52] or make some object that had a char and an int counter.

grab the first String and push it into the 'map', then grab the second String and do removes. A final walk down the map would finish you off with the remainder.

That should still be O(n), well O(n + m), the two Strings, but basically the same.

My only question would be what if you are trying to eleminate a char that wasn't in the first String? I suppose you would just ignore it.
James Carman
Ranch Hand

Joined: Feb 20, 2001
Posts: 580
Something like this...


[added code tags - Ilja]
[ February 09, 2005: Message edited by: Ilja Preuss ]

James Carman, President<br />Carman Consulting, Inc.
James Carman
Ranch Hand

Joined: Feb 20, 2001
Posts: 580
Oh, that's O(m + n) where m = len(string) and n = len(invalidString)
Steven Bell
Ranch Hand

Joined: Dec 29, 2004
Posts: 1071
Yes. At a glance that looks about right. And yes also on the O(m+n). It all depends on how detailed you are getting with your performance analysis. In general terms O(n+m) is roughly equal to O(n) because constants can be ignored and you end up with, worst case, O(2n) or O(2m)
Ilja Preuss
author
Sheriff

Joined: Jul 11, 2001
Posts: 14112
Originally posted by Steven Bell:
I would make a 'map' out of the first string with a counter for each character.


Such a structure is typically called a Bag.

http://jakarta.apache.org/commons/collections/ implements it - James' algorithm would probably be simplified quite a bit by using it.
Steven Bell
Ranch Hand

Joined: Dec 29, 2004
Posts: 1071
Originally posted by Ilja Preuss:

Such a structure is typically called a Bag.


Thank you. I knew I didn't quite have the right structure name, that's why I put it in ''.
James Carman
Ranch Hand

Joined: Feb 20, 2001
Posts: 580
Originally posted by Ilja Preuss:


Such a structure is typically called a Bag.

http://jakarta.apache.org/commons/collections/ implements it - James' algorithm would probably be simplified quite a bit by using it.


Yes it would and if I actually wrote it for my use I would use Commons Collections.
Ilja Preuss
author
Sheriff

Joined: Jul 11, 2001
Posts: 14112
Originally posted by James Carman:


Yes it would and if I actually wrote it for my use I would use Commons Collections.


Kirk Pepperdine
Author
Ranch Hand

Joined: Jan 17, 2001
Posts: 71
Originally posted by James Carman:
Oh, that's O(m + n) where m = len(string) and n = len(invalidString)


which is O(n)


Author of <a href="http://www.amazon.com/exec/obidos/ASIN/0672324261/ref=jranch-20" target="_blank" rel="nofollow">Ant Developer's Handbook</a>
Ilja Preuss
author
Sheriff

Joined: Jul 11, 2001
Posts: 14112
Originally posted by Kirk Pepperdine:


which is O(n)


Only if n > m, isn't it?
Jim Yingst
Wanderer
Sheriff

Joined: Jan 30, 2000
Posts: 18671
Well, let's say m = 2n. Then the whole thing is O(3n), which is still O(n). We could invent other relationships between m and n which trasnform the equation in all sorts of other ways if we want. But with big O notation, all we really care about is the exponent - variable names are generally interchangeable. N is the number of "things" - whatever things are most significant. The listener is expected to apply some common sense to figure out which things those are.
[ February 28, 2005: Message edited by: Jim Yingst ]

"I'm not back." - Bill Harding, Twister
Ilja Preuss
author
Sheriff

Joined: Jul 11, 2001
Posts: 14112
To me, O(m+n) means that

- we have two inputs into the algorithm
- the two inputs are independent of each other, and
- the algorithm is linear over both inputs

That looks differently to me than O(n), but I might simply have stopped my studies too early, I guess...
David Harkness
Ranch Hand

Joined: Aug 07, 2003
Posts: 1646
The total input is N = n + m, and the algorithm is O(N). All this means is that it has run time linearly dependent on the inputs. Whether the actual runtime is O(2n + 5m) makes no difference as it's still O(N). Any O(xN) algorithm is considered O(N) since xN is a linear function on N. Similarly, an O(xN + y) algorithm is also O(N).

On the other hand, if it weren't linearly dependent on both inputs, I guess you could consider them seperately, for example O(n^2,m^3). Then again, one of the inputs usually overshadows the other, and the problems we were given in school always had simple inputs. In a collection we dealt with just n, the number of items.
Ilja Preuss
author
Sheriff

Joined: Jul 11, 2001
Posts: 14112
Originally posted by David Harkness:
Similarly, an O(xN + y) algorithm is also O(N).


In my understanding, that's only true if y is a constant, or if you can prove that y < xN for all N greater some value.
Glenn Murray
Ranch Hand

Joined: Dec 07, 2001
Posts: 74
Originally posted by Ilja Preuss:


In my understanding, that's only true if y is a constant, or if you can prove that y < xN for all N greater some value.


This isn't quite right, either. If we assume x is 1 and y is 2N then
y > xN, but O(xN+y) is still O(N). I think David meant x and y
to be constant, though.

I think it helps to keep in mind that the big O notation describes
categories of behavior (e.g., linear, quadratic, logarithmic, etc.).
It's pretty easy to determine which category by recalling your first
year calculus, in particular L'Hospital's Rule.

Cheers,
Glenn


Glenn Murray
Author of Yo Soy Una Vaca De Hoy
Jayesh Lalwani
Ranch Hand

Joined: Nov 05, 2004
Posts: 502
Originally posted by Ilja Preuss:


In my understanding, that's only true if y is a constant, or if you can prove that y < xN for all N greater some value.



y is constant wrt N. If y is dependent of some other factor, then that factor should be considered as an input to your functions

So, function funcA takes time T(funcA) = an + y, and y = bm (where a and b are constants) then T= an + bm, and m is an input to your function. If y is dependent on n, then you are expressing T in a wrong way. The correct way to express T is this

T = a(p)n^p + a(p-1)n^(p-1) + ......... a(1)n + a(0) + b(p)m^p + b(p-1)b^(p-1) + ......... b(1)m + b(0)
where n and m are inputs to your function and a(p), a(p-1)....a(1), a(0), b(p), b(p-1)....b(1), b(0) are constants

To simplify computing performance, we always take highest order of the equation. So, if your T is

T = 2n^2 + 10 n + 20 + 5m + 50

for a really large n, 2n^2 will be much larger than other parts of the equation. So, large that 10n + 20 + 5m + 50 becomes negligible

So, for large n, T = 2n^2

When we denote the Order, we ignore the coefficient of that equation . So,

O(funcA) = n^2
[ March 01, 2005: Message edited by: Jayesh Lalwani ]
David Harkness
Ranch Hand

Joined: Aug 07, 2003
Posts: 1646
Originally posted by Glenn Murray:
I think David meant x and y to be constant, though.
Yes, I could have been more clear. I had just finished watching Shaun of the Dead and had a difficult time getting my mind into a mathematical frame.

Jayesh, thank you for posting a much more rigorous description.
Ilja Preuss
author
Sheriff

Joined: Jul 11, 2001
Posts: 14112
So, if an algorithm has to inputs, and I want to say that it is linear regarding both inputs, how do I say that (say, in contrast to being linear to one, but constant regarding the other input)?
 
I agree. Here's the link: http://aspose.com/file-tools
 
subject: Efficient algorithm for string manipulation
 
Similar Threads
Are String objects created in System.out.println Statement
comparator problem
meaningfully equivalent variables
How many Objects?
equals() method