Regex Notes

Notes from my talk at RI LUG about regular expressions. 

Regex Notes:
** Precision == perfoamnce !!!
** KNOW YOUR TOOL !!!
** KNOW YOUR DATA !!!

Definition:
Regular expressions are  a concise description of a regular formal language with notations for concatenation, alternation and iteration of sub-expressions. They generally appear as strings of text that are used to match or search within other strings of text. 

For a formal definition / explanation in terms of computer science / computational theory, see the following wikipedia articles:
http://en.wikipedia.org/wiki/Regular_expression#Formal_language_theory
http://en.wikipedia.org/wiki/Formal_language
http://en.wikipedia.org/wiki/Regular_language
http://en.wikipedia.org/wiki/Nondeterministic_finite_state_machine
http://en.wikipedia.org/wiki/Deterministic_finite_automata

Common Simplpe Exampls:
File globbing, text search fields, text processing command line tools, etc.

Types:
  BRE v. ERE
    BRE simply uses 2 character metacharacters, like \{3\} in grep.
    ERE iis almost always an NFA w/ lots of fun metacharacters.
    These do not necessarily have any bearing on the engine type

  Text Directed / DFA
    DFAs always return the longest possible match.
    Match speed is unrelated to the regex complexity (generally)
    No lazy quanitifiers EVER (because they are incapable of back-tracking)
    DFA matching very consistent.

  Regex Directed / NFA
    NFA always return left most match (see alternation)
    Can be directed to return any portion / subexpression / position
    More optimisations can be made in regex itself
    More "advanced" features, like look-around, and capturing (requires back-tracking)
    analyses sub expressesions indepent of one another.
    These come in 2 major flavors, traditional NFA, and POSIX NFA
	POSIX NFAs are intentially crippled to make them more predictable.
	POSIX NFAs will often behave like DFAs in certain cases
	Traditional NFAs are generally the most feature rich
    NFAs can be described as eager. This applies to alternation

  Can you tell what you have?
    Lazy quantifier support: Never in DFA, not usualy in POSIX NFA.
    Do a greedy, 0 length match on a string w/ a lot of repetition
	e.g. X(.+)X  against "==XX======================================================"
	if it's FAST, it's a DFA. If it's slow or throws and error, it's an NFA.
    Do a simple test w/ alternation
	e.g. (regex|regex not) aganst "regex not"
	NFA returns "regex". DFA returns "regex not"
	Traditional NFA will return left most match, better or not. it's "eager" to return to the next task
	DFA will return best match. will try EACH alternative. POSIX NFA will do this, too. 

Common tools that implement DFA style regex engines include:
	grep / egrep *
	awk *
	lex / flex
	procmail
	some SQL servers (including MySQL)
	TCL +
		* The GNU versions of these applications implement multiple engines
 		+ TCL has a unique hybrid engine

Core Concepts
  atoms - smallest unit of uniqueness, consists of a single literal character or a metacharacter / metasequence
  literal - a literal single character, matches the same character in the language / encoding. Simplest atom
  metacharacter - way to tell the regex engine what you want in a slightly more vague way
  metasequence - an atomic series of characters  

  \Q .. \E 	Turns meta characters between them into litterals

  Alternation
    Lowest precedence of all regex operators
    Functional equivelant of an OR

  characters set - [xyz] - atom matching a single member of the set in that position in the examined string
    [^xyz] - negated set
    [0-9] - inclusive set, very dependant on ecnoding / character set
    [-^] - special cases where - and ^ are not metacharacters in a set
    *some* engines allow set operations:
	[[a-z]-[aeiou]] or [[a-z]&&[^aeiou]] (subtraction, intersection)

  character class - [:digit:] <- POSIX - \d <- short hand - equivilent to [0-9]
    [:alnum:] == [a-zA-Z0-9]
    [:alpha:] == [a-zA-Z]
    [:ascii:] == [\x00-\x7F]
    [:blank:] == [ \t]
    [:cntrl:] == [\x00-\x1F\x7F]
    [:digit:] == [0-9]					 == \d
    [:graph:] == [\x21-\x7E]
    [:lower:] == [a-z]
    [:print:] == [\x20-\x7E]
    [:punct:] == [!"#$%&'()*+,\-./:;<=>?@[\\\]^_`{|}~]
    [:space:] == [ \t\r\n\v\f]				 == \s
    [:upper:] == [A-Z]
    [:word:] == [-A-Za-z0-9_]				 == \w
    [:xdigit:] == [A-Fa-f0-9]
	POSIX also defines locale specific collating sequences like ll in spanish in tortilla -> [[.span-ll.]]
	POSIX also has equivelance characters [=e=] is e w/ ^'` over it, etc.
	We're not worrying about that.
	\X matches any unicode combining character, e.g. use inplace of . in fran.ais as fran\Xais
	\p{prop} matches a character w/ a unicode property, \P{prop} amtches w/out

  Non-printable characters
    \t		
    \r		
    \n		
    \a		
    \e		
    \f		
    \v		
    \xAA	hex character AA in current character set.
    \uAAAA	Unicode character AAAA
    \b		 ** some engines only use this inside a [] and as word bound outside
    \cA		A
    \###	Octal e.g. \000 - \377 (sometimes out of bound = unicode, sometimes wierdness ensues)
		some times it's a back reference. ** KNOW YOUR TOOL !!!
    \0		Usually a NULL byte

  The .
    Matches almost any character. Usually, in most engines it matches everything except the OS's new line.
    	unless engine supports a single line mode.
    The . character is not percise and generally makes the regex slower.
**  Precision == perfoamnce !!

  Anchors
    Anchors do not match a character. They match a location!
    ^  - Beginning of string / line
    $  - End of string / line
    \A - Beginning of string (not in POSIX) (GNU added \`)
    \Z - End of string (not in POSIX) (GNU added \')
	  ** Note! $ and \Z will not match a \n or \r\n at the end of the line unless in multiline mode
    \b - Word bound or \< and \> depending on engine
	  ** Note! not all engines support this
    \G - Matches where last match left off

  Quantifiers / Repetition
    GREEDY!!!
      Will expand the expression as far as possible before returning the next part that requires a match to succeed.
    ? 		0,1	AKA makes atom optional
    + 		1,...
    * 		0,...
    {n}		n
    {n,m}	n ... m
	Exmaple:
		<.+>  applied to "< H 1 >REGEXES ARE FUN!< / H 1 > < B R / >< B R / >" returns the whole string.
		<.*> would also match "<>"
	To make it lazy, add a ?
		<.+?> applied above matches just the < H 1 >
	Specificity is always better than laziness!!!!	

  Grouping and Referencing
    ( )		Normal capturing group (or just grouping if capturing / referencing not supported
    (?: )	Non-capturing, but still good for applying quantifiers.
    (? )	Named capturing group
    \# , $1m	Tool specific way to back-reference
    \k , (?P=name)	first is .net, second is PHP or python named back reference
    Example:
	<([A-Z][A-Z0-9]*)\b[^>]*>.*?
	 ^^^^^^^^^^^^^^^^ open tag   ^^^ close tag
    (?> )	Atomic grouping. This makes the group "greedy".
    Example:
	a(?>.*)a -> "abcda" yields no match. .* consumes the a, the (?> ) won't return the last a to the engine

  Looking around
    (?= )	Positive assertion look ahead
    (?! )	Negative assertion look ahead
    (?<= )	Positive assertion look behind
    (?]+\/?>

Comify big numbers in perl:
	s/(?<=\d)(?=(?:\d\d\d)+\(?!\d)/,/g