Sandbox | Index | gfrlog

Regular Expressions to Nondeterministic Finite Automata

This is a tool I built to convert a UNIX-style regular expression to an animatable nondeterministic finite automata. The goal is to be able to visualize the sort of complexity necessary to represent the unix-regex constructs, and to be able to watch an ε-NFA (specified using a language familiar to programmers) actually process a string.

Instructions

Legal Syntax
Repetition quantifiers:
SymbolFunctionExampleExplanation
* Zero or more ab*c The letter "a" followed by zero or more "b"s followed by a "c"
+ One or more ab+c The letter "a" followed by one or more "b"s followed by a "c" (this is equivalent to abb*c)
? Zero or one (i.e., optional) abc? The letters "ab" optionally followed by one "c"
{n} Exactly n (abc?){4} Four repetitions of either "ab" or "abc" (can be mixed)
{n,} n or more a{4,}bc At least four "a"s followed by "bc"
{n,m} At least n and no greater than m ab*c{3,152} The letter "a" followed by 0 or more "b"s followed by anywhere from 3 to 152 "c"s

Alternation with "|"

(bcd|d*|f){3} means 3 instances of either "bcd", any number of "d"s, or an "f"

Character Classes

\s Any whitespace (currently only includes the space character, since no allowed alphabets include any other whitespace characters
\w Any letter, number, or underscore
\d Any digit
\S \W \D The complements of the above character classes
[a-h] The square brackets enclose a list of character classes, which can include letter/number ranges
[^abc0-5] Opening the square brackets with a "^" negates the class - so this class is all characters except for "a","b","c","0","1","2","3","4", and "5"
a[^b]+c Bracket-classes (and backslash classes) can be treated as a single character - this regex matches the letter "a" followed by one or more non-"b" characters, followed by a "c"

Other Notes

Escaping
The following characters must be escaped with a backslash in any context: ()[]{}|\.?-*$^
with perhaps the only exception being that a dash can be used at the beginning of a character class with no special meaning.
A backslash may be used to escape any character whatsoever, if you feel so inclined.
Anchors and Backreferences
Anchors and backreferences are not allowed at all. Backreferences extend regular expressions out of the regular languages (and so cannot be represented with an NFA). Anchors are either also that way, or they're just difficult to implement and I don't care. You can, however, distinguish between matching the entire string or matching any part of the string (as you might normally do by wrapping the regex in ^$) using the provided checkbox below.
Lazy vs. Eager
Lazy/eager quantifiers are not used, since the automata only give boolean responses.
Whitespace
For the moment, at least, the only whitespace allowed is regular spaces. If you include any of \n \r \t in your regex, they will count as the letters n, r, or t. So \s is currently identical to a simple space.
Non-ASCII characters
Just don't do it. It's not available as an alphabet choice anyhow.

Alphabet:

The larger the alphabet, the bigger the transition table is.
Digits:
binary (0,1)
decimal (0-9)
octal (0-7)
none
Lowercase:
hex (a-f)
all (a-z)
none
Uppercase:
hex (A-F)
all (A-Z)
none
Whitespace/Punctuation:
spaces
Simple punctuation (.,?!)
Quotes ("'`)
Common punctuation (:;()/\-_&%*$@)
Less common punctuation (~[]{}#^=+|><)

The Regular Expression:

Match entire string