Parse long strings example

Merkushev Kirill edited this page Jun 4, 2014 · 2 revisions

How to parse a looooong string with java-verbal-expressions?

1. Let's start! (Given)

Given lines to parse:

3    4    1    http://localhost:20001    1    28800    0    528800    1000000000    0    528800    STR1
3    5    1    http://localhost:20002    1    28800    0    528800    1000020002    0    528800    STR2
4    6    0    http://localhost:20002    1    48800    0    528800    1000000000    0    528800    STR1
4    7    0    http://localhost:20003    1    48800    0    528800    1000020003    0    528800    STR2
5    8    1    http://localhost:20003    1    68800    0    528800    1000000000    0    528800    STR1
5    9    1    http://localhost:20004    1    28800    0    528800    1000020004    0    528800    STR2

Take a line of this shape to write a test against (the columns are separated by tabs):

String logLine = "3\t4\t1\thttp://localhost:20001\t1\t63528800\t0\t63528800\t1000000000\t0\t63528800\tSTR1";

2. The straightforward solution

Simply write down everything we see, column by column (remember that we need to capture every column):

VerbalExpression regex = regex()
   .capt().digit().oneOrMore().endCapture().tab()                           // 3
   .capt().digit().oneOrMore().endCapture().tab()                           // 4
   .capt().range("0", "1").count(1).endCapture().tab()                      // 1 (or 0)
   .capt().find("http://localhost:20").digit().count(3).endCapture().tab()  // http://localhost:20001
   .capt().range("0", "1").count(1).endCapture().tab()                      // again 1 (or 0)
   .capt().digit().oneOrMore().endCapture().tab()                           // 63528800 (lots of digits)
   .capt().range("0", "1").count(1).endCapture().tab()                      // again 1 (or 0)
   .capt().digit().oneOrMore().endCapture().tab()                           // again lots of digit
   .capt().digit().oneOrMore().endCapture().tab()                           // ... and again
   .capt().range("0", "1").count(1).endCapture().tab()                      // 1 (or 0)
   .capt().digit().oneOrMore().endCapture().tab()                           // ... and again
   .capt().find("STR").range("0", "2").count(1).endCapture().build();       // at last STR1

Now we can check that it matches (with the help of a simple matcher):

assertThat(regex, matchesExactly(logLine)); // Hooray! It matches.

The resulting regexp looks like this:

((?:\d)+)(?:\t)((?:\d)+)(?:\t)([0-1]{1})(?:\t)((?:http\:\/\/localhost\:20)(?:\d){3})(?:\t)([0-1]{1})(?:\t)((?:\d)+)(?:\t)([0-1]{1})(?:\t)((?:\d)+)(?:\t)((?:\d)+)(?:\t)([0-1]{1})(?:\t)((?:\d)+)(?:\t)((?:STR)[0-2]{1})

Wow! You can test it yourself with the help of RegexPlanet. Usually, though, you don't need to see it in that form.
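To actually save every column, the captured groups have to be read back from a matcher. The sketch below does this with plain `java.util.regex` and the regexp printed above, rather than with the library itself, so that it runs without extra dependencies. The class name `LogLineGroups` and the method `columns` are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LogLineGroups {

    // The regexp shown above, copied as is (backslashes doubled for a Java string literal).
    static final Pattern LOG_LINE = Pattern.compile(
            "((?:\\d)+)(?:\\t)((?:\\d)+)(?:\\t)([0-1]{1})(?:\\t)"
            + "((?:http\\:\\/\\/localhost\\:20)(?:\\d){3})(?:\\t)([0-1]{1})(?:\\t)"
            + "((?:\\d)+)(?:\\t)([0-1]{1})(?:\\t)((?:\\d)+)(?:\\t)((?:\\d)+)(?:\\t)"
            + "([0-1]{1})(?:\\t)((?:\\d)+)(?:\\t)((?:STR)[0-2]{1})");

    // Returns the value of every capturing group, in column order.
    static List<String> columns(String line) {
        Matcher matcher = LOG_LINE.matcher(line);
        if (!matcher.matches()) {
            throw new IllegalArgumentException("Line does not match: " + line);
        }
        List<String> result = new ArrayList<>();
        // Group 0 is the whole match, so the columns start at group 1.
        for (int i = 1; i <= matcher.groupCount(); i++) {
            result.add(matcher.group(i));
        }
        return result;
    }

    public static void main(String[] args) {
        String logLine = "3\t4\t1\thttp://localhost:20001\t1\t63528800\t0\t63528800\t1000000000\t0\t63528800\tSTR1";
        System.out.println(columns(logLine));
    }
}
```

Only the outer groups capture; the inner `(?:...)` groups are non-capturing, which is why the twelve columns map to groups 1 through 12.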

3. Optimize

You have probably noticed how often the word "again" appears in the comments. Let's get rid of the duplication! Extract the repeated parts into reusable builders like this:

VerbalExpression.Builder digits = regex().capt().digit().oneOrMore().endCapt().tab();
VerbalExpression.Builder range = regex().capt().range("0", "1").count(1).endCapt().tab();
VerbalExpression.Builder host = regex().capt().find("http://localhost:20").digit().count(3).endCapt().tab();
VerbalExpression.Builder str = regex().capt().find("STR").range("0", "2").count(1).endCapt();

Now we have four building blocks to reuse:

VerbalExpression regex2 = regex()
    .add(digits).add(digits)
    .add(range).add(host).add(range).add(digits).add(range)
    .add(digits).add(digits)
    .add(range).add(digits).add(str).build();

And recheck:

assertThat(regex2, matchesExactly(logLine)); // still matches!
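The same de-duplication idea can be sketched without the library: name each repeated fragment once, then concatenate the fragments in column order. This is a plain `java.util.regex` analogue for comparison, not a description of how the library's `add()` works internally. The class `ComposedLogPattern` and its constants are hypothetical names, and the fragments are simplified equivalents of the generated regexp.

```java
import java.util.regex.Pattern;

final class ComposedLogPattern {

    // One named fragment per kind of column; each ends with the tab that follows it.
    static final String TAB = "\\t";
    static final String DIGITS = "(\\d+)" + TAB;                      // e.g. 63528800
    static final String FLAG = "([01])" + TAB;                        // 0 or 1
    static final String HOST = "(http://localhost:20\\d{3})" + TAB;   // e.g. http://localhost:20001
    static final String STR = "(STR[0-2])";                           // last column, no trailing tab

    // Same column order as the builder chain above.
    static final Pattern LOG_LINE = Pattern.compile(
            DIGITS + DIGITS
            + FLAG + HOST + FLAG + DIGITS + FLAG
            + DIGITS + DIGITS
            + FLAG + DIGITS + STR);

    // True only if the whole line matches, not just a part of it.
    static boolean matchesExactly(String line) {
        return LOG_LINE.matcher(line).matches();
    }

    public static void main(String[] args) {
        String logLine = "3\t4\t1\thttp://localhost:20001\t1\t63528800\t0\t63528800\t1000000000\t0\t63528800\tSTR1";
        System.out.println(matchesExactly(logLine));
    }
}
```

Compared with raw string concatenation, the builder version keeps each fragment readable and escapes literal text passed to `find(...)` for you, which is the main reason to prefer it.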

4. Try it yourself!

You can try a working example in the unit tests: RealWorldUnitTest