Regex to parse simple fixed-format JSON

 
Kristina Hansen
Rancher
So, as I'm currently developing OAuth2 clients to access the Twitch and YouTube APIs, I encountered differences in what they return as a token.
To show the minor differences, here's how they come over the wire (every time \n occurs, that's an actual 0x0A):

google initial auth:

google refresh:

twitch initial auth:

twitch refresh:


So, the first difference is that Google separates each line with a single \n, whereas Twitch only sends a single \n at the end. As I create Strings from them with new String(byte-array) (is there actually a "better" way to do this?), I remove the line feeds with a simple replaceAll("\n", "") (I don't know if replaceAll with its regex matching of the \n is needed or if a normal replace() would be sufficient - subject to further testing). This way all four responses end up as a single line containing only printable characters from the first 127 ASCII code points. Although I'd like to remove the additional spaces in the Google response, they have to be kept for at least the scope array.
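
On the two side questions above: new String(byte[]) decodes with the platform default charset, so passing the charset explicitly is the more predictable option, and String.replace takes literal text, so no regex is needed to drop line feeds. A minimal sketch, assuming the responses are UTF-8 (the class and method names are made up for illustration):

```java
import java.nio.charset.StandardCharsets;

class ResponseText {
    // Decodes the raw response body with an explicit charset (UTF-8 is an assumption here)
    // and removes every line feed so the JSON ends up on a single line.
    static String toSingleLine(byte[] body) {
        String text = new String(body, StandardCharsets.UTF_8);
        // replace() works on literal text; replaceAll() would compile "\n" as a regex first.
        return text.replace("\n", "");
    }

    public static void main(String[] args) {
        // Placeholder payload, not an actual Google or Twitch response.
        byte[] sample = "{\n  \"token_type\": \"Bearer\"\n}\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(toSingleLine(sample));
    }
}
```

Both replace("\n", "") and replaceAll("\n", "") give the same result for this input; replace() simply avoids the regex step.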

Side note: Yes, I tried to use several OAuth2 libs, but only got Google's lib working with Google; it fails on Twitch while parsing the scope array. (I tried to debug it but didn't find the exact issue. I can only assume it's because the Google lib requires a space-separated string for the scope parameter, whereas Twitch sends a JSON array - which causes a strange exception: "array expected but found String" - it should be the other way around to make sense - never mind.) The other libs I tried either miss the final step (so once I got the auth code from the user there was no way to execute the final request) or failed because they didn't set the required (per RFC!) request parameters correctly. That's why I ended up rewriting it on my own. Also: aside from Google's APIs, all the others lack proper documentation about how to use them - and the Javadoc is just an empty skeleton without a single line of useful information (pretty much like the BouncyCastle doc). So, if you feel like pointing me to an existing OAuth2 lib, please also explain how to use it successfully with at least Twitch and Google. Otherwise please just don't mention it. I tried - and failed.

Back to topic:
The question part comes now: Parsing!
I know there are a few JSON libs out there, but I couldn't get my head around them - and from what I found, most of them require you to provide a skeleton class that the parser then fills (kind of like deserialization). As the responses are different, it would require a matching skeleton class for each service - not practical.
So, I tried to parse them with a regex - as I guess this should be possible. The closest I ever got was this: Pattern.compile("(.+?:.+?),"); - but that only works up to a point: it doesn't get me the token type (as it isn't followed by a comma) and it splits the array in the Twitch response in half. Every other attempt gave me even less: null, no matches, or exceptions about an illegal regex.
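
To see why that pattern falls short, it can be run against a small payload. The JSON below is a made-up placeholder in the Twitch style, not an actual response, and the class name is hypothetical:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FirstAttempt {
    // Collects capture group 1 of every match of the pattern from the post.
    static List<String> pairs(String json) {
        Matcher m = Pattern.compile("(.+?:.+?),").matcher(json);
        List<String> found = new ArrayList<>();
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        // Placeholder payload with an array-valued scope.
        String json = "{\"access_token\":\"abc\",\"scope\":[\"a\",\"b\"],\"token_type\":\"bearer\"}";
        for (String pair : pairs(json)) {
            System.out.println(pair);
        }
    }
}
```

It finds only two matches, {"access_token":"abc" and "scope":["a". The array is cut at its first comma, and token_type is never matched because no comma follows it.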
So, what I want to achieve is this:
Splitting each name:value pair separated by "," - for the Google response that's easy, as it doesn't contain a comma-separated JSON array - but in the Twitch reply I somehow have to treat the JSON array enclosed by the brackets as one unit without splitting it as well. That way I end up with a list of name:value pairs, which I can then split further on the : to get a key-value map. My main issue is that the replies have different formats. On a refresh, Google replies only with a new token and doesn't repeat the unchanged refresh token; Twitch returns both, but its scope array differs from what Google sends.

I guess this can be done with one single regex - but I just don't know enough about regex to put it together on my own. I tried many "tutorials" and "reference helps", but I struggle hard to come up with some "simple" classes like "anything but the brackets".
I'm not asking for a complete solution but rather an explanation of how I might be able to put the regex together myself.
Most help is needed with grouping and with classes like "match any character, followed by a colon, followed by any character except brackets, until either the next comma or the end" - and such.
My first step is to remove the braces with just a simple Pattern.compile("\\{(.+?)\\}"); - but when I try to combine both, so that I match the pairs inside the braces, I get nothing: Pattern.compile("\\{(?:(.+?:.+?),)+\\}");
So, I'm pretty much lost on how to "simply" parse those responses without a full-blown JSON lib but with a regex (maybe two or three if needed).
Any help appreciated.

Kris
 
Carey Brown
Saloon Keeper
I would not recommend tackling this problem with regular expressions. There are too many instances where a pattern you may want to match on as a delimiter may also happen to appear in the middle of valid data. It also can't deal with structures like arrays of name/value pairs.

If I were to do it I'd write a state machine which is verbose and not easily maintainable but I've written enough of them over the years to be able to write them without too much difficulty. Not something I'd recommend for the novice.

If I didn't know about state machines then I'd roll up my sleeves and use a JSON library. Not having used one I'm guessing that for your specific case it should be relatively compact. The learning curve will be the stumbling block but I'm sure people here can help out.
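
To illustrate the state-machine idea on the narrow problem from the first post (splitting a flat object into name:value pairs without breaking the scope array), here is a minimal sketch. It is not a JSON parser: it only tracks whether it is inside a string and how deep it is inside brackets, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class TopLevelSplitter {
    // Splits the body of a flat JSON object into "name":value pairs.
    // Commas inside strings or inside [...] / {...} are not treated as separators.
    static List<String> splitPairs(String json) {
        String body = json.trim();
        if (body.startsWith("{") && body.endsWith("}")) {
            body = body.substring(1, body.length() - 1);
        }
        List<String> pairs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inString = false; // state: currently inside a quoted string
        boolean escaped = false;  // state: previous character was a backslash
        int depth = 0;            // state: nesting depth of brackets and braces
        for (char c : body.toCharArray()) {
            if (inString) {
                if (escaped) escaped = false;
                else if (c == '\\') escaped = true;
                else if (c == '"') inString = false;
            } else if (c == '"') {
                inString = true;
            } else if (c == '[' || c == '{') {
                depth++;
            } else if (c == ']' || c == '}') {
                depth--;
            } else if (c == ',' && depth == 0) {
                // A top-level comma ends the current pair.
                pairs.add(current.toString().trim());
                current.setLength(0);
                continue;
            }
            current.append(c);
        }
        String last = current.toString().trim();
        if (!last.isEmpty()) pairs.add(last);
        return pairs;
    }

    public static void main(String[] args) {
        // Placeholder payload, not an actual response.
        System.out.println(splitPairs("{\"a\":\"x,y\", \"scope\":[\"p\",\"q\"], \"n\":1}"));
    }
}
```

Each returned pair would still need to be split at its first top-level colon, and the sketch does no validation at all, which is part of why a JSON library remains the safer choice.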
 
Ron McLeod
Marshal
If I were tackling this, I would use a JSON object mapper such as Jackson or GSON.

The structure of the Google and Twitch payloads is pretty much the same - the only difference is the way the scope is represented.  Google represents the parts of the scope as a whitespace-delimited string, and Twitch uses an array of strings.

In your application, you want to be able to use authentication information without being concerned with how that information was presented by Google or Twitch (or other sites), so you could create an interface defining how to access that authentication information.  For example:
You could create classes which the object mapper could use to map from the JSON strings received from Google and Twitch, and have them implement the interface.  For example:

Google:
Twitch:
Then you could process the JSON payloads to get the authentication information that the application needs.  For example:
Sample output:

Edit: update the code examples
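
As a rough illustration of the design described above (not the code originally posted), the following sketch uses hypothetical names (AuthInfo, GoogleAuth, TwitchAuth) and leaves out the object mapper itself. With Jackson or Gson the fields would be populated from the JSON, subject to whatever field visibility or annotations that library requires.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical interface: the view of a token response that the application works with.
interface AuthInfo {
    String accessToken();
    List<String> scopes();
}

// Google-style payload: scope is one whitespace-delimited string.
class GoogleAuth implements AuthInfo {
    String access_token;
    String scope;

    public String accessToken() { return access_token; }

    public List<String> scopes() {
        if (scope == null || scope.trim().isEmpty()) return List.of();
        return Arrays.asList(scope.trim().split("\\s+"));
    }
}

// Twitch-style payload: scope is an array of strings.
class TwitchAuth implements AuthInfo {
    String access_token;
    List<String> scope;

    public String accessToken() { return access_token; }

    public List<String> scopes() { return scope == null ? List.of() : scope; }
}

class AuthInfoDemo {
    public static void main(String[] args) {
        // The values are set by hand here; a mapper would fill them from the JSON.
        GoogleAuth google = new GoogleAuth();
        google.access_token = "google-placeholder";
        google.scope = "scope-a scope-b";

        TwitchAuth twitch = new TwitchAuth();
        twitch.access_token = "twitch-placeholder";
        twitch.scope = List.of("scope-a", "scope-b");

        List<AuthInfo> all = List.of(google, twitch);
        for (AuthInfo info : all) {
            System.out.println(info.accessToken() + " " + info.scopes());
        }
    }
}
```

The rest of the application only sees AuthInfo, so the difference between the two scope encodings stays inside the two payload classes.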
 
Ron McLeod
Marshal
Also - the naming of the access_token, expires_in, refresh_token, and token_type fields does not follow best practices.  I kept the names the same as they were specified in the JSON strings to make the example code simpler.

When using Jackson, the @JsonProperty annotations can be used to map the names between Java and JSON.  Other object mappers use different ways to accomplish this.
 
Kristina Hansen
Rancher
@Ron McLeod
I'm very impressed - unfortunately all I can reply with is a big thank you.

About the naming:
That's how the OAuth2 RFC 6749 specifies them in appendix A - so Google and Twitch just follow the RFC here. Even if I agree with you that this naming scheme does not play well with the Java conventions, we have to take what we're given here.
In addition to that, the RFC also specifies in section 5.1 that the content type of the reply from the endpoint has to be JSON:

The parameters are included in the entity-body of the HTTP response using the "application/json" media type as defined by [RFC4627].


What the RFC does not specify is the type of the scope array - so I guess it's legit that Twitch and Google use different "encodings" for what's pretty much just an array / a list of strings.

About using a lib and its mapper:
That's what I meant by "skeleton" classes magically populated by reflection. It's pretty much what kept me away from using a JSON lib. Don't get me wrong - it's not about using a lib; as said, I tried several for the whole OAuth flow itself - it's just that I'm a) not used to JSON itself, b) not used to the libs, and c) at least for me it was hard to find good information about how to "construct" classes like the ones you presented (big thanks again for your work), because of not-so-great documentation. Sure, Google's Javadoc is also a mess, but with a little help from NetBeans I got around it. Pretty much all the others, though - I don't know if you've had a look at the BouncyCastle doc - are at the same level of "emptiness".

In addition to that, whenever I'm "forced" to use external libs I'm pretty much overwhelmed by the number of dependencies they often require. Just as an example: to use the YouTube API, which basically isn't more than a convenient wrapper around all that JSON stuff, one needs a total of 24 additional jars adding up to over 11 MB (!) - and the amount of code actually used is just a few kB of that. For me that's a huge payload, an order of magnitude too big. The same goes for simple tasks like using only a few helpers from, let's say, Apache Commons. Just to run maybe a few lines of code you have to carry around a lib several hundred kB in size plus all the stuff it depends on. It's one reason I actually forced myself to use an IDE with dependency management: it just became too much to keep track of by myself - just right-click > add dependency, and the IDE does the rest. When you accumulate tens of MB for just 20 lines of code, it just becomes insane.

Sure, I also fully agree with Carey Brown that there are many pitfalls - and any of the services (not limited to Twitch and Google, but anyone using OAuth) can change their structure from one response to another - so it just makes sense to use a robust lib that can handle all of this, as long as the response is valid in whatever format it's encoded in. Just relying on "yeah, it's been the same for two years now" might not be the best idea. Am I going to use a JSON lib? Well, as someone thankfully took the time and effort to work on my problem, the least I can do is show some respect by using it, so the work wasn't done for nothing. It's not what I hoped for as a reply, and I would have liked to do it a bit more simply - but I guess this time there's no "easy way around" it ...

Thanks to everyone who had a look at it.
 
Kristina Hansen
Rancher
I tinkered around a bit more just for the sake of it (json.org really helped, as I now understand the rules a bit better) and came up with this regex:
[ ]*?(\"[[.][^\"\\\\]]+?\"[ ]*?:[ ]*?(?:(?:\"[[.][^\\[\\]]]+?\")|(?:\\d+)|(?:\\[.+?\\])))+?[ ]*?
Yes, it ain't perfect, but it works for at least Twitch and Google. Currently I don't use other services that use OAuth2, but this may change in the future. For now I'm using what Ron McLeod provided - great work.
 
Stephan van Hulst
Saloon Keeper
If you want to keep fiddling with the regex for experience, great.

Don't use it in any real app though. Regular expressions can handle regular languages, and JSON is more complex than that. It's simply not possible to write a single regular expression that will handle all cases, and if Google or Twitch decide to make a change, you will find your application broken.
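
The limitation shows up as soon as arrays nest. A small sketch (class name hypothetical) with a pattern that matches from an opening bracket to the next closing bracket:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NestedArrayDemo {
    // Returns the first "[ ... ]" span the pattern finds, or null if there is none.
    static String firstArray(String json) {
        Matcher m = Pattern.compile("\\[[^\\]]*\\]").matcher(json);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstArray("[\"a\",\"b\"]")); // flat array: matched as a whole
        System.out.println(firstArray("[[1,2],[3]]"));   // nested array: match stops too early
    }
}
```

For the nested input the match is [[1,2], which has unbalanced brackets: the pattern has no way to count how many brackets are still open.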
 
Kristina Hansen
Rancher
Yeah, no, I'm just fooling around with it a bit. json.org gave me a somewhat better understanding of JSON, but as for the regex, I just pieced it together from bits I found on the net.
 
Kristina Hansen
Rancher
So I was able to simplify it a bit further: [ ]*?\"([.[^\"]]+?)\"[ ]*?:[ ]*?(\".+?\"|\\d+|\\[.+?\\])+?[ ]*?
Now I'm processing the value (2nd capture group) and want to at least get rid of the quotes. I tried to do this with separate capture groups, but it seems I misunderstand how regex handles an OR of capture groups. This is what I tried: (?:\"(.+?)\"|(\\d+)|\\[(.+?)\\])+? but instead of getting a single 2nd capture group containing either the string, the int or the array, I get 3 capture groups (so together with the first I end up with four), of which the one that matched has the value and the other two are null.
For my fooling around I guess I could just work around it with a big if() construct - but to the regex gurus: is there a smarter way to write the regex so that I only get a single 2nd capture group instead of 3 additional ones?
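
As far as I know, java.util.regex has no way to make alternatives share one group number (some other regex flavours offer a "branch reset" group for this), so each alternative keeps its own group and the unused ones are null. One workable approach is to take whichever group took part in the match. A minimal sketch, with hypothetical class and method names and a placeholder payload:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ValueGroups {
    // Group 1: name. Groups 2-4: string, integer or array value; only one takes part per match.
    private static final Pattern PAIR = Pattern.compile(
            "\"([^\"]*)\"\\s*:\\s*(?:\"([^\"]*)\"|(\\d+)|\\[([^\\]]*)\\])");

    static Map<String, String> parse(String json) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher m = PAIR.matcher(json);
        while (m.find()) {
            result.put(m.group(1), firstNonNull(m, 2, 4));
        }
        return result;
    }

    // Returns the first group in [from, to] that participated in the match.
    static String firstNonNull(Matcher m, int from, int to) {
        for (int i = from; i <= to; i++) {
            if (m.group(i) != null) return m.group(i);
        }
        return null;
    }

    public static void main(String[] args) {
        // Placeholder payload, not an actual response.
        System.out.println(parse("{\"a\": \"x\", \"n\": 42, \"s\": [\"p\",\"q\"]}"));
    }
}
```

The loop replaces the big if() construct; the string and array values arrive without their surrounding quotes or brackets because the groups sit inside them.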
 
Stephan van Hulst
Saloon Keeper
  • I'm assuming that your 'regex' is really a Java string literal. When posting a regex, post a regex, so lose all the string escaping. When posting a Java string literal, include the starting and ending double quotes.

  • [ ]*?("[[.][^"\\]]+?"[ ]*?:[ ]*?(?:(?:"[[.][^\[\]]]+?")|(?:\d+)|(?:\[.+?\])))+?[ ]*?


  • [[.][^"\\]] means "a dot character or anything that is not a double quote or a backslash". Since the negated character class already includes the dot character, the first part is superfluous.

  • [ ]*?("[^"\\]+?"[ ]*?:[ ]*?(?:(?:"[^\[\]]+?")|(?:\d+)|(?:\[.+?\])))+?[ ]*?


  • It looks like you're performing a search inside the JSON rather than a match on the complete object (as evidenced by the fact that you're not matching commas between the properties), so matching spaces around the properties is unnecessary.

  • "[^"\\]+?"[ ]*?:[ ]*?(?:(?:"[^\[\]]+?")|(?:\d+)|(?:\[.+?\]))


  • The union operator has low precedence, so lose your non-capturing groups.

  • "[^"\\]+?"[ ]*?:[ ]*?(?:"[^\[\]]+?"|\d+|\[.+?\])


  • JSON uses all sorts of whitespace to delimit tokens, so use \s instead of [ ]. Besides, you can just use a single space without the square brackets if you really want to match just a space.

  • "[^"\\]+?"\s*?:\s*?(?:"[^\[\]]+?"|\d+|\[.+?\])


  • In "[^"\\]+?, it looks like you were trying to make a special case for backslashes in strings. It's not going to work, because now it rejects any string that contains a backslash. It also rejects empty strings. For strings and arrays, allow empty sequences.

  • "[^"]*?"\s*?:\s*?(?:"[^\[\]]*?"|\d+|\[.*?\])


  • The part of your regex that matches string values is wrong. It tries to match any character except brackets, while allowing unescaped double-quotes. Likewise, the part that matches arrays is wrong.

  • "[^"]*?"\s*?:\s*?(?:"[^"]*?"|\d+|\[[^\]]*?\])


  • Instead of using a non-capturing group for the value of the property, you may actually want to capture the value. The same goes for the name of the property.

  • (?<name>"[^"]*?")\s*?:\s*?(?<value>"[^"]*?"|\d+|\[[^\]]*?\])
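
Wrapped in Java, the final pattern above could be used like this. This is only a sketch for flat responses such as the token replies in this thread; the class name and the sample payload are placeholders, and names and string values keep their surrounding quotes because the groups include them.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenFields {
    // The regex with named groups, written as a Java string literal.
    private static final Pattern PROPERTY = Pattern.compile(
            "(?<name>\"[^\"]*?\")\\s*?:\\s*?(?<value>\"[^\"]*?\"|\\d+|\\[[^\\]]*?\\])");

    // Searches the text for name/value properties and returns them in order of appearance.
    static Map<String, String> parse(String json) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = PROPERTY.matcher(json);
        while (m.find()) {
            fields.put(m.group("name"), m.group("value"));
        }
        return fields;
    }

    public static void main(String[] args) {
        // Placeholder payload, not an actual response.
        String json = "{ \"access_token\": \"abc\", \"expires_in\": 3599, \"scope\": [\"a:b\",\"c\"] }";
        parse(json).forEach((name, value) -> System.out.println(name + " -> " + value));
    }
}
```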


    Finally, property values may not just consist of strings, integers and arrays, but also of objects, decimals, and the literals true, false and null. However, you can't use a single regular expression to parse objects and arrays, because they are recursive data structures and require at least a stack machine to parse.

    If you want to write your own JSON parser instead of relying on a JSON binding library, you can write separate regular expressions to parse strings, numbers, objects, arrays and boolean and null literals, and write a simple recursive algorithm to parse array elements and object property values. Do this for practice only. For a professional application, use JSON-B or Jackson.
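
A rough sketch of that exercise, for practice only as stated above: one small regex per token type plus a recursive method for objects and arrays. TinyJson and all of its method names are invented for this illustration; escape sequences in strings are accepted but not decoded, and numbers are returned as BigDecimal.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TinyJson {
    // One small regex per token type.
    private static final Pattern STRING = Pattern.compile("\"((?:[^\"\\\\]|\\\\.)*)\"");
    private static final Pattern NUMBER = Pattern.compile("-?\\d+(?:\\.\\d+)?(?:[eE][+-]?\\d+)?");
    private static final Pattern LITERAL = Pattern.compile("true|false|null");

    private final String src;
    private int pos;

    private TinyJson(String src) { this.src = src; }

    static Object parse(String json) {
        TinyJson p = new TinyJson(json);
        Object result = p.value();
        p.skipSpace();
        if (p.pos != json.length()) throw p.error("unexpected trailing text");
        return result;
    }

    private Object value() {
        skipSpace();
        char c = peek();
        if (c == '{') return object();
        if (c == '[') return array();
        if (c == '"') return expect(STRING, "string").group(1);
        Matcher literal = match(LITERAL);
        if (literal != null) {
            String text = literal.group();
            return text.equals("null") ? null : Boolean.valueOf(text);
        }
        return new BigDecimal(expect(NUMBER, "value").group());
    }

    private Map<String, Object> object() {
        Map<String, Object> map = new LinkedHashMap<>();
        pos++; // consume '{'
        skipSpace();
        if (peek() == '}') { pos++; return map; }
        while (true) {
            skipSpace();
            String name = expect(STRING, "property name").group(1);
            skipSpace();
            expectChar(':');
            map.put(name, value()); // recursion handles nested objects and arrays
            skipSpace();
            if (peek() == ',') { pos++; continue; }
            expectChar('}');
            return map;
        }
    }

    private List<Object> array() {
        List<Object> list = new ArrayList<>();
        pos++; // consume '['
        skipSpace();
        if (peek() == ']') { pos++; return list; }
        while (true) {
            list.add(value());
            skipSpace();
            if (peek() == ',') { pos++; continue; }
            expectChar(']');
            return list;
        }
    }

    // Tries the pattern at the current position only and advances past it on success.
    private Matcher match(Pattern p) {
        Matcher m = p.matcher(src).region(pos, src.length());
        if (!m.lookingAt()) return null;
        pos = m.end();
        return m;
    }

    private Matcher expect(Pattern p, String what) {
        Matcher m = match(p);
        if (m == null) throw error("expected " + what);
        return m;
    }

    private void expectChar(char c) {
        if (peek() != c) throw error("expected '" + c + "'");
        pos++;
    }

    private char peek() { return pos < src.length() ? src.charAt(pos) : '\0'; }
    private void skipSpace() { while (pos < src.length() && Character.isWhitespace(src.charAt(pos))) pos++; }
    private IllegalArgumentException error(String message) {
        return new IllegalArgumentException(message + " at index " + pos);
    }
}
```

TinyJson.parse(text) returns nested Map, List, String, BigDecimal, Boolean or null values. The regexes only recognise individual tokens; the nesting is handled by value(), object() and array() calling each other.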
     
    Kristina Hansen
    Rancher
    I really appreciate your effort. Yes, it's just "fooling around" - as Ron McLeod did a very good job, I'm using his solution.
    I don't plan to build my own JSON parser - I realized it's a bit more complicated than I thought once I understood the rules better after carefully reading json.org. In addition, I don't know much about regex - the idea to "just use a simple regex" only came up because the reply for getting an OAuth2 token is relatively simple. (As I played around I found that Twitch violates the OAuth RFC on a few points, which could be the reason why Google's OAuth lib fails. But as I know too little about OAuth, JSON and regex, I just reported to their forum that Google's OAuth lib fails because, as far as I understand, the reply doesn't match what the lib expects.)
     
    Carey Brown
    Saloon Keeper

    Stephan van Hulst wrote:"[^"]*?"\s*?:\s*?(?:"[^"]*?"|\d+|\[[^\]]*?\])


    Why use non-greedy for spaces? As in:
    \s*?:\s*?
     
    Kristina Hansen
    Rancher

    Carey Brown wrote:

    Stephan van Hulst wrote:"[^"]*?"\s*?:\s*?(?:"[^"]*?"|\d+|\[[^\]]*?\])


    Why use non-greedy for spaces? As in:
    \s*?:\s*?


    Can't really tell, but with any other combination the google reply with the space between the colon and the value didn't match - can't tell why ...
     
     
    Stephan van Hulst
    Saloon Keeper
    The reason is that typically the number of symbols you want to match is less than the remaining amount of input. It's a simple performance enhancement that really adds up to a lot if your pattern consists of a lot of quantifiers.

    For the last outermost group of symbols I usually use a greedy quantifier, or if I know the entire remaining input has to match exactly, a possessive quantifier.
     
    Carey Brown
    Saloon Keeper

    Stephan van Hulst wrote:The reason is that typically the number of symbols you want to match is less than the remaining amount of input. It's a simple performance enhancement that really adds up to a lot if your pattern consists of a lot of quantifiers.


    Sorry, I don't see how this applies to my specific example about spaces. You can't accidentally consume too many spaces while consuming only spaces. I also don't see where performance (in this case) enters into it. In greedy or non-greedy form, either way, the first non-space and you're done.
     
    Carey Brown
    Saloon Keeper

    Kristina Hansen wrote:So I was able to simplify it a bit further: [ ]*?\"([.[^\"]]+?)\"[ ]*?:[ ]*?(\".+?\"|\\d+|\\[.+?\\])+?[ ]*?...


    Can you share a simple self contained program with this regex that works so that we can play with it?
     
    Kristina Hansen
    Rancher
    JSON replies (Twitch initial auth, Twitch refresh, Google initial auth, Google refresh) (actual tokens replaced):
    current regex (regex - not a Java string): "(.+?)" *?: *?(".+?"|\d+|\[.+?\])
    output (same order as json replies):
     
    Stephan van Hulst
    Saloon Keeper

    Carey Brown wrote:You can't accidentally consume too many spaces while consuming only spaces. I also don't see where performance (in this case) enters into it. In greedy or non-greedy form, either way, the first non-space and you're done.


    The difference is in where quantifiers start matching.

    Greedy quantifiers first try to match the entire remaining input string (so including all the text that comes after the last space you would successfully match) and then backtrack one symbol at a time until it's found the longest string that consists of only spaces.

    Lazy quantifiers first try to match the next symbol only and then consume an extra character as long as the entire regex doesn't match.

    This difference in behavior may cause the regex to find different matches, depending on the input string and the quantifiers used. For the input string "abab abab":
  • the regex (ab)+? will find "ab" four times,
  • the regex (ab)+ will find "abab" twice.

    You already know this.
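
Both results can be reproduced with a few lines of Java (the class and helper names are hypothetical):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LazyVsGreedy {
    // Returns every match of the regex in the input, in order.
    static List<String> findAll(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> matches = new ArrayList<>();
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(findAll("(ab)+?", "abab abab")); // [ab, ab, ab, ab]
        System.out.println(findAll("(ab)+", "abab abab"));  // [abab, abab]
    }
}
```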

    However, in cases where it doesn't matter to the outcome which form you choose, there will still be an effect on performance. Consider the input string "aa bbbbbbbbbbbbbbbbbbbb". The regex a*?[^a] and the regex a*[^a] will both find "aa " once. The difference is:
  • the regex a*?[^a] will consume "aa " before it finds a match,
  • the regex a*[^a] will reject "bbbbbbbbbbbbbbbbbbbb" before it finds a match.

    In this example, the lazy quantifier consumed 3 symbols before it found a match, while the greedy quantifier rejected 20 symbols before it found a match. The lazy quantifier was almost 7 times as fast.
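
The two patterns can also be compared directly. The sketch below confirms that they find the same match and uses a hypothetical CountingSequence wrapper (not part of this thread) to count how many characters the engine reads for each. The counts depend on the JDK's regex implementation, so treat them as a rough measurement on your own machine rather than a definitive figure.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: a CharSequence that counts how often the regex engine reads a character.
class CountingSequence implements CharSequence {
    private final String text;
    int reads;

    CountingSequence(String text) { this.text = text; }

    @Override public int length() { return text.length(); }
    @Override public char charAt(int index) { reads++; return text.charAt(index); }
    @Override public CharSequence subSequence(int start, int end) { return text.substring(start, end); }
    @Override public String toString() { return text; }
}

class QuantifierCost {
    // Returns the first match of the regex in the input, or null if there is none.
    static String firstMatch(String regex, CountingSequence input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "aa bbbbbbbbbbbbbbbbbbbb";
        for (String regex : new String[] {"a*?[^a]", "a*[^a]"}) {
            CountingSequence input = new CountingSequence(text);
            String match = firstMatch(regex, input);
            System.out.println(regex + " matched \"" + match + "\" after " + input.reads + " character reads");
        }
    }
}
```

The same wrapper can be pointed at other patterns, for example \s*: versus \s*?:, to see how much the choice of quantifier matters for a given input.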

    This may not seem like a big deal, but wait until you use multiple quantifiers in the same regex. For the input string "abcccccccccccccccccccc", the regex a?b?abc+ will first reject "cccccccccccccccccccc" before it gives up the optional "b", then it will reject "cccccccccccccccccccc" again before it gives up the optional "a". Now imagine that instead of two simple optional elements you use more complex quantifiers, and imagine that instead of the input string "abcccccccccccccccccccc", you have an input that consists of several kilobytes worth of JSON.

    You should not take this to mean that greedy quantifiers are always less performant than lazy quantifiers. Greedy quantifiers are faster than lazy quantifiers when the expected match forms the greater part of the remaining input. However, a sequence of whitespace between two JSON tokens will typically only form a fraction of the entire JSON file.
     