Notes for the 'grep' tool family ================================ The members of the 'grep' family are: - grep, which uses the most traditional regular expressions, once derived from the NFA code in ed (g/re/p); - egrep, the traditional Unix DFA matching utility; - fgrep, which searches for a set of fixed strings using the algorithm by Aho and Corasick; - new grep, which understands POSIX.2 regular expressions and uses the modified 'Unix regular expression library'. This provides both a DFA and a NFA and uses either of one dependent on the pattern. The first three utilities still use the original Unix algorithms, but were extended to handle multibyte characters and expressions of arbitrary length and complexity. This includes an implementation of 'lazy transition evaluation' for egrep (which was traditionally found in System V, but not in Unix 32V and BSD Unix). All search methods are embedded into a newly developed input framework, which limits the line length by the amount of available memory only. The combination of these changes and improvements in compiler technology make these utilities up to twice as fast as the implementations found in traditional Unix systems. Performance comparison ====================== All tests were performed on a Intel Celeron 800 system running Linux 2.4.20, glibc 2.3.2, gcc 3.2.2, with a singlebyte locale, using a completely cached input file of 128 MB size generated by od /dev/urandom | cut -d' ' -f2- | nl -w7 -nrz As the lines of this file contain only digits and spaces, performance of some of the following tests is different from that encountered on searches in human or computer language texts. The results thus do not serve as a benchmark, but highlight characteristical strengths and weaknesses of the algorithms and their implementations. Measurements with GNU grep have additionally been included for reference. Search for a single character that does not occur in input: -------------------------------------------------------------------- -c x file -------------------------------------------------------------------- grep egrep new grep GNU grep 2.5 -------------------------------------------------------------------- real 1.6 1.5 1.7 0.5 user 1.3 1.1 1.4 0.2 sys 0.3 0.2 0.3 0.2 Search for a single character that does often occur in input: -------------------------------------------------------------------- -c 0 file -------------------------------------------------------------------- grep egrep new grep GNU grep 2.5 -------------------------------------------------------------------- real 1.0 1.0 1.2 1.3 user 0.7 0.7 0.9 1.0 sys 0.3 0.2 0.2 0.2 Search for three characters that do not occur in input: -------------------------------------------------------------------- -c abc file -------------------------------------------------------------------- grep egrep new grep GNU grep 2.5 -------------------------------------------------------------------- real 1.6 1.5 1.7 0.6 user 1.3 1.3 1.4 0.4 sys 0.3 0.1 0.2 0.2 Search for a three character sequence that does often occur in input: -------------------------------------------------------------------- -c 123 file -------------------------------------------------------------------- grep egrep new grep GNU grep 2.5 -------------------------------------------------------------------- real 2.9 1.5 1.7 1.0 user 2.7 1.3 1.4 0.7 sys 0.2 0.2 0.2 0.3 Search for a ten-character sequence that does not occur in input: -------------------------------------------------------------------- -c abcdefghij file -------------------------------------------------------------------- grep egrep new grep GNU grep 2.5 -------------------------------------------------------------------- real 1.6 1.5 1.7 0.4 user 1.4 1.2 1.3 0.1 sys 0.2 0.3 0.3 0.2 GNU grep will usually be faster than any of the grep utilities included here when the expression consists mostly of a fixed character substring, thanks to its optimization techniques. egrep and new grep show the characteristic performance of their DFA technique, which effectively sets an upper time limit on most searches; they can be slightly faster if the match occurs at early points of many lines. The speed of the NFA used in traditional grep depends on the characters encountered in input. The NFA in traditional grep will also run slower with metacharacters in the expression, as the following tests show: Search for an expression that starts with a dot: -------------------------------------------------------------------- -c .foo file -------------------------------------------------------------------- grep egrep new grep GNU grep 2.5 -------------------------------------------------------------------- real 12.3 1.5 1.7 0.7 user 12.0 1.2 1.3 0.4 sys 0.3 0.2 0.3 0.2 Search for bracket expressions: -------------------------------------------------------------------- -c '[01][23][45][67]' file -------------------------------------------------------------------- grep egrep new grep GNU grep 2.5 -------------------------------------------------------------------- real 7.9 1.5 1.7 2.3 user 7.5 1.2 1.3 2.0 sys 0.3 0.2 0.3 0.2 NFA stress test: -------------------------------------------------------------------- -c '0.*1.*2.*3.*4.*5.*6.*7' file -------------------------------------------------------------------- grep egrep new grep GNU grep 2.5 -------------------------------------------------------------------- real 8:56.1 1.6 1.8 3.2 user 8:54.1 1.3 1.5 2.9 sys 0.3 0.3 0.3 0.2 But DFA performance can also degrade with some patterns, although this happens less often in practice. The following sequence of tests uses patterns of increasing complexity that eventually overflow the cache used in any of the DFAs: DFA stress test sequence: -------------------------------------------------------------------- -c '.*0....' file -------------------------------------------------------------------- grep egrep new grep GNU grep 2.5 -------------------------------------------------------------------- real 3.0 1.1 1.3 1.9 user 2.7 0.8 1.0 1.6 sys 0.3 0.2 0.3 0.2 -------------------------------------------------------------------- -c '.*0.....' file -------------------------------------------------------------------- grep egrep new grep GNU grep 2.5 -------------------------------------------------------------------- real 3.1 1.2 2.7 2.0 user 2.8 0.8 2.4 1.7 sys 0.2 0.3 0.3 0.2 -------------------------------------------------------------------- Cache overflow in new grep. -c '.*0......' file -------------------------------------------------------------------- grep egrep new grep GNU grep 2.5 -------------------------------------------------------------------- real 3.6 2.1 4.1 2.0 user 3.3 1.8 3.8 1.8 sys 0.3 0.2 0.2 0.2 -------------------------------------------------------------------- Cache overflow in egrep. -c '.*0.......' file -------------------------------------------------------------------- grep egrep new grep GNU grep 2.5 -------------------------------------------------------------------- real 3.6 3.5 5.6 2.1 user 3.4 3.1 5.3 1.8 sys 0.2 0.3 0.2 0.2 -------------------------------------------------------------------- -c '.*0........' file -------------------------------------------------------------------- grep egrep new grep GNU grep 2.5 -------------------------------------------------------------------- real 3.8 5.0 7.6 2.2 user 3.4 4.6 7.2 1.9 sys 0.3 0.3 0.2 0.2 -------------------------------------------------------------------- -c '.*0.........' file -------------------------------------------------------------------- grep egrep new grep GNU grep 2.5 -------------------------------------------------------------------- real 3.9 6.8 10.1 2.2 user 3.6 6.4 9.7 1.9 sys 0.2 0.3 0.2 0.3 -------------------------------------------------------------------- -c '.*0..........' file -------------------------------------------------------------------- grep egrep new grep GNU grep 2.5 -------------------------------------------------------------------- real 4.1 8.7 12.7 6.5 user 3.8 8.4 12.4 6.1 sys 0.2 0.3 0.3 0.4 -------------------------------------------------------------------- Cache overflow in GNU grep. -c '.*0...........' file -------------------------------------------------------------------- grep egrep new grep GNU grep 2.5 -------------------------------------------------------------------- real 4.2 12.1 17.0 27.9 user 3.9 11.8 16.6 27.3 sys 0.2 0.3 0.2 0.4 -------------------------------------------------------------------- -c '.*0............' file -------------------------------------------------------------------- grep egrep new grep GNU grep 2.5 -------------------------------------------------------------------- real 4.3 15.1 20.8 2:12.8 user 4.0 14.8 20.5 2:12.1 sys 0.2 0.2 0.2 0.5 -------------------------------------------------------------------- -c '.*0.............' file -------------------------------------------------------------------- grep egrep new grep GNU grep 2.5 -------------------------------------------------------------------- real 5.0 18.7 24.4 11:47.9 user 4.6 18.3 24.0 11:46.6 sys 0.3 0.2 0.2 0.5 GNU grep refills more positions of the cache than actually needed. This eventually leads to huge performance decreases. An expression feature that DFAs cannot provide are backreferences (strictly speaking, a NFA also cannot provide it, but NFA-like search methods can easily be extended thus). While egrep, which uses a DFA only, cannot search for this construct, the new grep will use its NFA here. -------------------------------------------------------------------- -c '\(0\)1\1' m -------------------------------------------------------------------- grep egrep new grep GNU grep 2.5 -------------------------------------------------------------------- real 14.1 12.4 1:29.6 user 13.8 n/a 12.0 1:28.6 sys 0.3 0.3 0.5 Gunnar Ritter 5/26/03