Extract text lying between two patterns

Abhinav Srivastava
Ranch Hand

Joined: Nov 19, 2002
Posts: 349

I have one file with content like



I need to create a new file which would look like



Basically, I need to find all occurrences of text between <?xml> and </Product>.

I tried sed -n and awk range commands but they don't seem to give the desired output.


Any ideas?
Tim Holloway
Saloon Keeper

Joined: Jun 25, 2001
Posts: 16019
    

That's not going to produce a valid XML file. The <?xml ...?> declaration can only appear once in an XML document, and only at the very start of it.

As for the rest of it, a likely reason why your match fails is that "?" is a quantifier in extended regular expressions (awk, sed -E, grep -E). So instead of matching "<?xml", it's looking for "xml" with an optional "<" in front of it. There you actually need to match "<\?xml". Be aware that GNU sed's default (basic) syntax works the other way around: a plain "?" is literal and "\?" is the quantifier.
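To see the effect of the "?" for yourself, here is a small sketch using grep -E, which uses the extended syntax. The sample strings are made up for illustration, not taken from the original file.

```shell
# Made-up sample data, for illustration only.
decl='<?xml version="1.0"?><Product>'
bare='xml only'

# Extended syntax: "<?" means "an optional <", so the bare word "xml" matches too.
# (grep -c prints the number of matching lines; "|| true" ignores its exit status.)
unescaped=$(printf '%s\n' "$bare" | grep -cE '<?xml' || true)

# With the "?" escaped it is a literal character, so the bare word no longer matches...
escaped_bare=$(printf '%s\n' "$bare" | grep -cE '<\?xml' || true)

# ...while the real declaration still does.
escaped_decl=$(printf '%s\n' "$decl" | grep -cE '<\?xml' || true)

echo "$unescaped $escaped_bare $escaped_decl"   # prints: 1 0 1
```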


Customer surveys are for companies who didn't pay proper attention to begin with.
Abhinav Srivastava
Ranch Hand

Joined: Nov 19, 2002
Posts: 349

I don't want it to be an XML doc, rather just a text file having XML fragments. Actually it's not about XML at all, just the text.
My problem is that sed is spitting out the entire line where it finds the match, not just the text lying between the two patterns.
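The behaviour described here can be reproduced with a sed range command. This is only a sketch with made-up input, and it assumes the attempt was something like the range form below; the actual command that was tried is not shown in the thread.

```shell
# Made-up two-line input with extra text around the markers.
input=$(printf '%s\n' 'noise <?xml?>' '<Product>A</Product> tail')

# A range address selects whole LINES from the first match to the second,
# so "noise" and "tail" come along for the ride.
# (In sed's default basic syntax the "?" here is a literal character.)
range_out=$(printf '%s\n' "$input" | sed -n '/<?xml/,/<\/Product>/p')

printf '%s\n' "$range_out"
```

The output is identical to the input, which shows that a range on its own never trims anything inside a line.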
Tim Holloway
Saloon Keeper

Joined: Jun 25, 2001
Posts: 16019
    

Abhinav Srivastava wrote: I don't want it to be an XML doc, rather just a text file having XML fragments. Actually it's not about XML at all, just the text.
My problem is that sed is spitting out the entire line where it finds the match, not just the text lying between the two patterns.


You can use parentheses to delimit match groups, like so:

<Product>(.*)</Product>

Then you can reference the match group by its group number. It's usually something like "$1" for the first group, "$2" for the second group (if you have multiple group patterns) and so forth. The exact form varies depending on the app/library doing the matching; sed, for example, uses "\1" and "\2".
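As a rough illustration of a match group in sed, the sketch below pulls out just the text inside the tags. The input line is made up, and -E (extended syntax) is assumed to be available, as it is in GNU and BSD sed.

```shell
# Made-up input line, for illustration only.
line='junk <Product>Widget</Product> more junk'

# -n suppresses the default output; the trailing "p" prints only lines where
# the substitution succeeded. "\1" is sed's way of naming the first group.
# "#" is used as the delimiter so the "/" in </Product> needs no escaping.
inner=$(printf '%s\n' "$line" | sed -nE 's#.*<Product>(.*)</Product>.*#\1#p')

echo "$inner"   # prints: Widget
```

This works one line at a time, and ".*" is greedy, so it is not enough on its own when a fragment spans several lines or a line holds several fragments.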

AWK is probably better for this than sed. Sed can be programmed to do it, but it requires various buffer tricks. AWK would be much simpler. Something vaguely like the following:



I'm out of practice with AWK, though, so expect to do some heavy tweaking to make it work.
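For anyone landing here later, one possible AWK approach is sketched below. It is not the snippet from the post above; the function name extract_between and the variables "from" and "to" are made up for this example. It treats both markers as literal strings (using index() rather than a regex, which sidesteps the "?" problem), keeps the markers in the output, and copes with fragments that span lines.

```shell
# extract_between START END : reads stdin, prints each fragment that runs
# from START through END, dropping everything outside the fragments.
# Assumption: the markers contain no backslashes (awk -v would interpret them).
extract_between() {
  awk -v from="$1" -v to="$2" '
    {
      rest = $0
      if (inside && rest == "") print ""     # keep blank lines inside a fragment
      while (rest != "") {
        if (!inside) {
          i = index(rest, from)
          if (i == 0) break                  # no start marker left on this line
          rest = substr(rest, i)             # drop the text before the marker
          inside = 1
        }
        j = index(rest, to)
        if (j == 0) { print rest; break }    # fragment continues on the next line
        print substr(rest, 1, j + length(to) - 1)
        rest = substr(rest, j + length(to))  # look for more fragments on this line
        inside = 0
      }
    }'
}

# Made-up sample input, for illustration only.
printf '%s\n' \
  'noise <?xml version="1.0"?>' \
  '<Product>A</Product> tail' \
  'other' \
  'x <?xml?><Product>B</Product> y' |
  extract_between '<?xml' '</Product>'
```

For the sample input this prints three lines: the declaration without "noise", the Product element without "tail", and the complete one-line fragment without "x" and "y". With real files it would be called along the lines of extract_between '<?xml' '</Product>' < input.txt > fragments.txt, where both file names are placeholders.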
rajesh thiru
Greenhorn

Joined: May 24, 2010
Posts: 2
Did you get this puzzle worked out?
 