Non-standard stringology: Algorithms and complexity
Abstract
Non-standard stringology concerns string matching problems, wherein a position in the "text" (of size n) matches one in the "pattern)) (of size m), based on very general relationships between the corresponding "symbols". For example, string matching with don't cares is a simple non-standard string matching prob.' lem, wherein text andjor pattern positions might have wildcard symbols rather than those drawn from the base alphabet Σ these wildcards match ever-y symbol from Σ. The main results in this paper concern the inherent complexity of a variety of non-standard string mat thing problems, characterized in terms of algebraic convolutions. Non-standard Basic String Matching: l For three problems from this family - including string mat thing with don't cares and its generalizations - we prove a lower bound of Ω(f(Σ 1)) convolutions, where the (increasing) function depends on the problem. These results are proved in the boolean convolution model that we introduce here. We also match this bound by improving or adapting the bestknown algorithms to this model. l In the RAM model we show, using reductions, that all of these problems encode a variant of truncated boolean convolution with integer parameter n. These reductions allow us to infer that any improvement to the best-known algorithms for these problems (in the RAM) will yield faster algorithms for solving parametrized truncated convolutions of the input vectors. As we show here, the fastest algorithms for the latter problem derived by extending the scheme from [K089] uses 0(min{Pi;,√ M}) convolutions. We also derive analogous results for eight other variants of non-standard string matching problems that are drawn from the following two families: nonstandard counting string matching (eg., variant of classical string mat thing, that involves counting number of mismatches at each text position), and nonstandard threshold string matching (in which the kmismatches problem is a basic example). Interestingly, all of the above results are derived using the structure of the "match graph" defined by the mat thing relation of the given inst ante of the nonstandard string matching problem, and its complement. It turns out that our lower bounds and reductions depend upon the sizes of the induced cliques in these graphs - in particular, the "dominating" cliques in the match graph, and the "clique edge covers/ partitions" in its complement. We also provide improved deterministic and randomized algorithms, as well are those with better expected running times for some non-standard string matching problems.