15. Regular expressions

Pattern matching:

Shell patterns

Too weak.

Regular expressions

Mastering Regular Expressions by Jeffrey Friedl (aka The Owl Book).

Regular languages are the narrowest class of formal languages in the Chomsky hierarchy.

Warning

Writing a regexp is far easier than reading someone else's regexp

  1. Atomic regexp:
    • any non-special character matches exactly that same character
    • "E" → «E»

    • a dot "." matches any one character

    • "." → «E»

    • "." → «:»

    • "." → «.»

    • a set of characters matches any one character from the set:

    • "[quack!]" → «a»

    • "[quack!]" → «!»

    • "[a-z]" → «q» (any lowercase letter)

    • "[a-z]" → «z» (any lowercase letter)

    • "[a-fA-F0-9]" → «f» (any hexadecimal digit)

    • "[a-fA-F0-9]" → «D» (any hexadecimal digit)

    • "[abcdefABCDEF0-9]" → «4» (any hexadecimal digit)

    • a negated set of characters matches any one character not in the set:

    • "[^quack!]" → «r»

    • "[^quack!]" → «#»

    • "[^quack!]" → «A»

    • any atomic regexp followed by the "*" repeater matches a continuous sequence of substrings, each matched by that regexp; the sequence may be empty

    • "a*" → «aaa»

    • "a*" → «»

    • "a*" → «a»

    • "[0-9]*" → «7»

    • "[0-9]*" → «»

    • "[0-9]*" → «1231234»

    • ".*" → any string!

    • any complex regexp enclosed in the special grouping parentheses "\(" and "\)" (see below)
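The atomic rules above can be tried out with a minimal sketch in Python, assuming its `re` module as a stand-in for the regexp tools these notes have in mind. The helper name `matches_whole` is hypothetical. One difference to keep in mind: in Python a dot does not match a newline by default, so ".*" covers any single-line string.

```python
import re


def matches_whole(pattern: str, text: str) -> bool:
    """Return True if the pattern matches the entire string."""
    return re.fullmatch(pattern, text) is not None


# A non-special character matches itself; a dot matches any one character.
print(matches_whole("E", "E"))             # True
print(matches_whole(".", ":"))             # True
print(matches_whole(".", "."))             # True

# A set matches one character from the set; a negated set matches one outside it.
print(matches_whole("[quack!]", "!"))      # True
print(matches_whole("[a-fA-F0-9]", "D"))   # True
print(matches_whole("[^quack!]", "#"))     # True
print(matches_whole("[^quack!]", "a"))     # False: «a» is in the set

# The "*" repeater accepts any number of repetitions, including none.
print(matches_whole("a*", "aaa"))          # True
print(matches_whole("a*", ""))             # True
print(matches_whole("[0-9]*", "1231234"))  # True
```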

  2. Complex regexp

    • A sequence of atomic regexps
    • Matches a continuous sequence of substrings, each matched by the corresponding atomic regexp
    • "boo" → «boo»

    • "r....e" → «riddle»

    • "r....e" → «r re e»

    • "[0-9][0-9]*" → any non-negative integer

    • "[A-Za-z_][A-Za-z0-9_]*" → C identifier (alphanumeric sequence with «_», not starting with a digit)

    • grouping parentheses can be used to repeat a complex regexp:
    • "\([A-Z][a-z]\)*" → «ReGeXp»

    • "\([A-Z][a-z]\)*" → «»

    • "\([A-Z][a-z]\)*" → «Oi»

    • Implies the leftmost-longest rule (aka «greedy»):

      • In a successful match of a complex regexp, the leftmost atomic regexp takes the longest possible match, the second leftmost takes the longest match still possible under that condition, and so on

      • ".*.*" → the first ".*" takes the whole string, the second takes the empty string

      • "[a-z]*[0-9]*[a-z0-9]*" → «123b0c0»

        • "[a-z]*" → «»

        • "[0-9]*" → «123»

        • "[a-z0-9]*" → «b0c0»

      • "[a-d]*[c-f]*[d-h]*" → «abcdefgh»

        • "[a-d]*" → «abcd»

        • "[c-f]*" → «ef»

        • "[d-h]*" → «gh»
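The way a string is divided between the parts of a complex regexp can be observed with a short Python sketch. It assumes Python's `re` module, which writes grouping as plain "(" and ")" where the notes use "\(" and "\)". The helper `submatches` is hypothetical: it wraps each atomic regexp in a group so that the substring each one took becomes visible.

```python
import re
from typing import Optional, Tuple


def submatches(pattern: str, text: str) -> Optional[Tuple[str, ...]]:
    """Match the whole string and return what each group matched, or None."""
    m = re.fullmatch(pattern, text)
    return m.groups() if m else None


# A sequence of atomic regexps matches consecutive substrings.
print(re.fullmatch("r....e", "r re e") is not None)         # True
# A repeated group: «Re», «Ge», «Xp».
print(re.fullmatch("([A-Z][a-z])*", "ReGeXp") is not None)  # True

# Each "*" takes as much as it can, from left to right.
print(submatches("(.*)(.*)", "any string"))
# ('any string', '')
print(submatches("([a-z]*)([0-9]*)([a-z0-9]*)", "123b0c0"))
# ('', '123', 'b0c0')
print(submatches("([a-d]*)([c-f]*)([d-h]*)", "abcdefgh"))
# ('abcd', 'ef', 'gh')
```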

  3. Positioning mark
    • "^regexp" matches only substrings located at the beginning of the line

    • "regexp$" matches only substrings located at the end of the line
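A small Python sketch of the two positioning marks, assuming the `re` module and a made-up sample line. The helper `match_start` is hypothetical and reports where in the line the match begins.

```python
import re
from typing import Optional


def match_start(pattern: str, line: str) -> Optional[int]:
    """Return the offset where the pattern matches in the line, or None."""
    m = re.search(pattern, line)
    return m.start() if m else None


line = "regexp tools use regexp"  # made-up sample line

print(match_start("^regexp", line))  # 0: the occurrence at the beginning
print(match_start("regexp$", line))  # 17: the occurrence at the end
print(match_start("^tools", line))   # None: «tools» is not at the beginning
print(match_start("use$", line))     # None: «use» is not at the end
```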

Regexp tools

Search and replace

sed: stream editor; if you are not sure, do not go too deep into it :)
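As a rough illustration of search and replace, here is a sketch that uses Python's `re.sub` as a stand-in for the `s/regexp/replacement/` command of sed. The helper name `replace_numbers` and the sample text are made up for the example, and the sed commands in the comments are the assumed equivalents.

```python
import re


def replace_numbers(text: str, every: bool = True) -> str:
    """Replace runs of digits with "N".

    every=True  is roughly sed 's/[0-9][0-9]*/N/g'
    every=False is roughly sed 's/[0-9][0-9]*/N/' (first match only)
    """
    count = 0 if every else 1  # in re.sub, count=0 means "replace all matches"
    return re.sub("[0-9][0-9]*", "N", text, count=count)


print(replace_numbers("port 8080, pid 42"))               # port N, pid N
print(replace_numbers("port 8080, pid 42", every=False))  # port N, pid 42
```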

Extended regexp and dialects

Disadvantages of traditional regexp: it's not easy to

Extended regexp:

Vim regexp

Superseding engines

Regexps are context-unaware. See the main() example above: how can one replace only the «'..'» and «'..…'» patterns, but not «.»?


HSE/ProgrammingOS/15_Regexp (last edited by user FrBrGeorge, 2021-11-15 12:47:19)