Speeding Up Slow RegExp-based Tokenizers in ECMAScript
Andy Chu, January 2010
This is a simple proposal to enhance the regular expression API in ECMAScript, based on some problems encountered during real usage.
I ported the
JScript (IE), by removing Mozilla
specific extensions (github repo). In
The Narcissus tokenizer thus uses the following algorithm:
To work around this problem, you can write a tokenizer which avoids regular
expressions altogether and instead uses a loop with successive
Another way to avoid the blowup is to split the source string into lines. There's still a lot of wasteful work being done, but the algorithm does not blow up as the source file becomes longer.
Alternatively, some small API additions to
I propose two additions to the
Note that setting
But, importantly, these existing features make this proposal more like "exposing something that already exists" rather than "adding something new".
Update: Please see this addendum for a new API proposal.
There are a few ways to add this to ECMAScript.
1. Positional arguments:
This is not great, because boolean arguments are hard to read at the call site. In addition, both arguments are optional, so writing:
is awkward when you want to override the default
2. Using an object literal for optional arguments:
This style is becoming popular now, but not consistent with too many native APIs.
3. Using two methods:
Downside: You will also need
After some thought, I recommend option 3, for readability and
future-proofing of the API. Python also has an
Adding optional, named arguments to ECMAScript, in the style of Python, may produce a nicer API.
regular expression engine, and speed up many real programs, without complicating
its implementation. Since anchored matches are already possible with