This is a tool I built to convert a UNIX-style regular expression to an animatable
nondeterministic finite automata. The goal is to be able to visualize the sort of
complexity necessary to represent the unix-regex constructs, and to be able to watch
an ε-NFA (specified using a language familiar to programmers) actually process a string.
Instructions
Legal Syntax
Repetition quantifiers:
| Symbol | Function | Example | Explanation |
| * |
Zero or more |
ab*c |
The letter "a" followed by zero or more "b"s followed by a "c" |
| + |
One or more |
ab+c |
The letter "a" followed by one or more "b"s followed by a "c" (this is equivalent to abb*c) |
| ? |
Zero or one (i.e., optional) |
abc? |
The letters "ab" optionally followed by one "c" |
| {n} |
Exactly n |
(abc?){4} |
Four repetitions of either "ab" or "abc" (can be mixed) |
| {n,} |
n or more |
a{4,}bc |
At least four "a"s followed by "bc" |
| {n,m} |
At least n and no greater than m |
ab*c{3,152} |
The letter "a" followed by 0 or more "b"s followed by anywhere from 3 to 152 "c"s |
Alternation with "
|"
(bcd|d*|f){3} means 3 instances of either "bcd", any number of "d"s, or an "f"
Character Classes
| \s |
Any whitespace (currently only includes the space character, since no allowed alphabets include any other whitespace
characters |
| \w |
Any letter, number, or underscore |
| \d |
Any digit |
| \S \W \D |
The complements of the above character classes |
| [a-h] |
The square brackets enclose a list of character classes, which can include letter/number ranges |
| [^abc0-5] |
Opening the square brackets with a "^" negates the class - so this class is all
characters except for "a","b","c","0","1","2","3","4", and "5" |
| a[^b]+c |
Bracket-classes (and backslash classes) can be treated as a single character - this regex matches
the letter "a" followed by one or more non-"b" characters, followed by a "c" |
Other Notes
Escaping
The following characters must be escaped with a backslash in any context:
()[]{}|\.?-*$^
with perhaps the only exception being that a dash can be used at the beginning of a character class with no special meaning.
A backslash may be used to escape any character whatsoever, if you feel so inclined.
Anchors and Backreferences
Anchors and backreferences are not allowed at all. Backreferences extend regular expressions out of the regular
languages (and so cannot be represented with an NFA). Anchors are either also that way, or they're just difficult
to implement and I don't care. You can, however, distinguish between matching the entire string or matching any
part of the string (as you might normally do by wrapping the regex in ^$) using the provided checkbox below.
Lazy vs. Eager
Lazy/eager quantifiers are not used, since the automata only give boolean responses.
Whitespace
For the moment, at least, the only whitespace allowed is regular spaces. If you include any of
\n \r \t in your regex, they will count as the letters n, r, or t. So
\s is currently identical to a simple space.
Non-ASCII characters
Just don't do it. It's not available as an alphabet choice anyhow.