Commit 9dcf34d
cite Kozen's work on intersection nonemptiness
breandan committed Nov 6, 2024
1 parent d2b2e02 commit 9dcf34d
Showing 4 changed files with 45 additions and 22 deletions.
9 changes: 9 additions & 0 deletions latex/bib/acmart.bib
@@ -2232,4 +2232,13 @@ @book{flajolet2009analytic
  author={Flajolet, Philippe and Sedgewick, Robert},
year={2009},
publisher={Cambridge University Press}
}

@inproceedings{kozen1977lower,
title={Lower bounds for natural proof systems},
author={Kozen, Dexter},
booktitle={18th Annual Symposium on Foundations of Computer Science (sfcs 1977)},
pages={254--266},
year={1977},
organization={IEEE}
}
Binary file modified latex/thesis/Thesis.pdf
Binary file not shown.
2 changes: 1 addition & 1 deletion latex/thesis/content/Ch1_Introduction.tex
@@ -22,6 +22,6 @@ \chapter{\rm\bfseries Introduction}

Program repair is a highly underdetermined problem, meaning that the validity constraints do not uniquely determine a solution. A proper theory of program repair must be able to resolve this ambiguity to infer the user's intent from an incomplete specification, and incrementally refine its guess as more information becomes available from the user.

The calculus we propose has a number of desirable properties. It is highly compositional, meaning that users can manipulate constraints on programs while retaining algebraic closure properties such as union, intersection, and differentiation. It is well-suited to probabilistic reasoning, meaning we can use any probabilistic model of language to guide the repair process. It is also amenable to incremental repair, meaning we can repair programs in a streaming fashion, while the user is typing.

\clearpage
56 changes: 35 additions & 21 deletions latex/thesis/content/Ch2_Formal_Language_Theory.tex
@@ -3,50 +3,49 @@ \chapter{\rm\bfseries Formal Language Theory}

In computer science, it is common to conflate two distinct notions of a set. The first is a collection sitting on some storage device, e.g., a dataset. The second is a lazy construction: not an explicit collection of objects, but a representation that allows us to efficiently determine membership on demand. This lets us represent infinite sets without requiring an infinite amount of storage. Membership, then, instead of being a simple lookup query, becomes a decision procedure. This is the basis of formal language theory.
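
To make the distinction concrete, here is a minimal Python sketch of the second notion (an illustration, not part of the thesis artifact): the infinite language of balanced parentheses, represented in constant space by a decision procedure rather than a stored collection.

\begin{verbatim}
# The "set" of balanced parenthesis strings over the alphabet {(, )},
# represented not as a stored collection but as a membership test.
def in_dyck(word: str) -> bool:
    depth = 0
    for c in word:
        depth += 1 if c == '(' else -1
        if depth < 0:          # a prefix closed more than it opened
            return False
    return depth == 0          # every open parenthesis was closed

assert in_dyck("(()())") and not in_dyck(")(")
\end{verbatim}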

The representation we are chiefly interested in is the grammar, a common metanotation for specifying the syntactic constraints on programs, shared by nearly every programming language. Programming language grammars are overapproximations to the true language of interest, but provide a reasonably detailed specification for rejecting invalid programs and parsing valid ones.

Formal languages are arranged in a hierarchy of containment, where each language family strictly contains its predecessors. On the lowest level of the hierarchy are the finite languages. Type 3 contains the finite and infinite languages generated by a regular grammar. Type 2 contains the context-free languages, which admit parenthetical nesting. Supersets, such as the recursively enumerable sets, occupy Type 0. There are also other formalisms, such as logics and circuits, which define language families incomparable with the Chomsky hierarchy.

Most programming languages leave Type 2 after the parsing stage and enter the realm of type theory. At this point, compiler authors layer additional semantic refinements on top of syntax, but must deal with phase ordering problems related to the sequencing of such analyzers, which breaks commutativity and poses challenges for parallelization. This lack of compositionality is a major obstacle to the development of modular static analyses.

The advantage of dealing with formal language representations is that we can reason about them algebraically. Consider the context-free grammar: the arrow $\rightarrow$ becomes an $=$ sign, $\mid$ becomes $+$, and $AB$ becomes $A \times B$. The ambiguous Dyck grammar, then, can be seen as a system of equations.

\begin{equation}
S \rightarrow ( ) \mid ( S ) \mid S S \Longleftrightarrow f(x) = x^2 + x^2 f(x) + f(x)^2
\end{equation}

\noindent We will now solve for $f(x)$, giving us the generating function for the language:

\begin{equation}
0 = f(x)^2 + x^2 f(x) - f(x) + x^2
\end{equation}

\noindent Now, using the quadratic formula with $a = 1, b = x^2 - 1, c = x^2$, we have:

\begin{equation}
f(x) = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a} = \frac{-x^2 + 1 \pm \sqrt{x^4 - 6x^2 + 1}}{2}
\end{equation}

\noindent Note there are two solutions, but only one satisfies $\lim_{x\rightarrow 0} f(x) = 0$, as required, since the grammar generates no word of length zero and hence $f_0 = 0$. From the ordinary generating function (OGF), we also have that $f(x)=\sum _{n=0}^{\infty }f_nx^{n}$. Expanding $\sqrt{x^4 - 6x^2 + 1}$ via the generalized binomial theorem, we have:

\begin{align}
\sqrt{x^4 - 6x^2 + 1} = (1+u)^{\frac{1}{2}} &=\sum _{k=0}^{\infty }\;{\binom {\frac{1}{2} }{k}}\;u^{k}\\
&=\sum _{k=0}^{\infty }\;{\binom {\frac{1}{2} }{k}}\;(x^4 - 6x^2)^{k} \text{ where } u = x^4-6x^2
\end{align}

Now, to obtain the number of ambiguous Dyck trees of size $n$, we can extract the coefficient of $x^n$ using the binomial series:

\begin{align}
[x^n]f(x) &= [x^n]\frac{1 - x^2}{2} - \frac{1}{2}\,[x^n]\sum _{k=0}^{\infty }\;{\binom {\frac{1}{2} }{k}}\;(x^4 - 6x^2)^{k}\\
&= -\frac{1}{2}\sum _{k=\lceil n/4 \rceil}^{\lfloor n/2 \rfloor}{\binom {\frac{1}{2} }{k}}\;[x^n](x^4 - 6x^2)^{k} \quad \text{for } n > 2,
\end{align}

\noindent since $(x^4 - 6x^2)^k$ contributes only to powers of $x$ between $2k$ and $4k$.

We can use this technique, described by Flajolet \& Sedgewick~\cite{flajolet2009analytic}, to count the number of parse trees of a given size in an arbitrary CFG, or the number of distinct words in an unambiguous CFG. This lets us understand grammars as a kind of algebra, which is useful for enumerative combinatorics on words and syntax-guided synthesis.
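
These coefficients can be checked mechanically. Below is a short Python sketch (a verification aid under the assumptions above, including the branch with $f(0) = 0$; not part of the thesis artifact) comparing the analytic extraction against the power-series recurrence $f_n = [n = 2] + f_{n-2} + \sum_{i=0}^{n} f_i f_{n-i}$, read directly off $f(x) = x^2 + x^2 f(x) + f(x)^2$:

\begin{verbatim}
from fractions import Fraction
from math import comb

def f_rec(N):  # power-series recurrence from f = x^2 + x^2 f + f^2
    f = [Fraction(0)] * (N + 1)
    for n in range(2, N + 1):
        f[n] = (n == 2) + f[n - 2] + sum(f[i] * f[n - i] for i in range(n + 1))
    return f

def binom_half(k):  # generalized binomial coefficient C(1/2, k)
    b = Fraction(1)
    for i in range(k):
        b *= (Fraction(1, 2) - i) / (i + 1)
    return b

def coeff(n, k):  # [x^n] (x^4 - 6x^2)^k
    return sum(comb(k, j) * Fraction(-6) ** (k - j)
               for j in range(k + 1) if 4 * j + 2 * (k - j) == n)

def f_ana(n):  # analytic coefficient extraction, valid for n > 2
    return -sum(binom_half(k) * coeff(n, k)
                for k in range(-(-n // 4), n // 2 + 1)) / 2

f = f_rec(12)
assert all(f_ana(n) == f[n] for n in range(3, 13))
print([int(f[n]) for n in range(2, 13, 2)])  # [1, 2, 6, 22, 90, 394]
\end{verbatim}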

Naturally, like algebra, there is also a kind of calculus to formal languages. Janusz Brzozowski~\cite{brzozowski1964derivatives} introduced the derivative operator for regular languages, which can be used to decide membership and extract subwords from a language. This operator has been extended to CFLs by Might et al.~\cite{might2011parsing}, and is the basis for a family of elegant parsing algorithms.

The Brzozowski derivative has an extensional and intensional form. Extensionally, we have $\partial_a L = \{b \in \Sigma^* \mid ab \in L\}$. Intensionally, we have an induction over generalized regular expressions (GREs), which are a superset of regular expressions that also support intersection and negation.\vspace{-1cm}

\begin{multicols}{2}
\begin{eqnarray*}
@@ -73,9 +72,24 @@ \chapter{\rm\bfseries Formal Language Theory}
\end{eqnarray*}
\end{multicols}
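
To make the intensional form concrete, here is a minimal Python sketch of GREs and their derivatives, following the standard inductive definition (the encoding and class names are illustrative, not drawn from any particular library):

\begin{verbatim}
from dataclasses import dataclass

@dataclass(frozen=True)
class Empty: pass                       # the empty language
@dataclass(frozen=True)
class Eps: pass                         # the singleton {epsilon}
@dataclass(frozen=True)
class Sym: a: str                       # a single symbol
@dataclass(frozen=True)
class Cat: l: object; r: object         # concatenation
@dataclass(frozen=True)
class Alt: l: object; r: object         # union
@dataclass(frozen=True)
class And: l: object; r: object         # intersection
@dataclass(frozen=True)
class Not: e: object                    # complement
@dataclass(frozen=True)
class Star: e: object                   # Kleene star

def nullable(e):                        # does L(e) contain epsilon?
    match e:
        case Eps() | Star(_):       return True
        case Empty() | Sym(_):      return False
        case Cat(l, r) | And(l, r): return nullable(l) and nullable(r)
        case Alt(l, r):             return nullable(l) or nullable(r)
        case Not(f):                return not nullable(f)

def deriv(e, c):                        # the derivative of L(e) by c
    match e:
        case Empty() | Eps(): return Empty()
        case Sym(a):          return Eps() if a == c else Empty()
        case Alt(l, r):       return Alt(deriv(l, c), deriv(r, c))
        case And(l, r):       return And(deriv(l, c), deriv(r, c))
        case Not(f):          return Not(deriv(f, c))
        case Star(f):         return Cat(deriv(f, c), e)
        case Cat(l, r):
            d = Cat(deriv(l, c), r)
            return Alt(d, deriv(r, c)) if nullable(l) else d

def accepts(e, w):                      # w in L(e) iff eps in its w-derivative
    for c in w:
        e = deriv(e, c)
    return nullable(e)

# nonempty strings of a's, expressed with intersection and negation
assert accepts(And(Star(Sym('a')), Not(Eps())), "aa")
assert not accepts(And(Star(Sym('a')), Not(Eps())), "")
\end{verbatim}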

Like sets, it is possible to combine languages abstractly by manipulating their grammars, mirroring the setwise operations of union, intersection, and difference over languages. These operations are convenient for combining, for example, syntactic and semantic constraints on programs. For example, we might have two grammars, $G_a, G_b$, representing two properties that are both necessary for a program to be considered valid. %We can treat valid programs $P$ as a subset of the language intersection $P \subseteq \mathcal{L}(G_a) \cap \mathcal{L}(G_b)$.

Like all representations, grammars are a trade-off between expressiveness and efficiency. It is often possible to represent the same finite set with multiple representations of varying complexity. For example, the set of strings containing ten or fewer balanced parentheses can be expressed as a deterministic finite automaton containing millions of states, or as a simple conjunctive grammar containing a few productions, $\mathcal{L}\Big(S \rightarrow ( ) \mid (S) \mid S S \Big) \cap \Sigma^{[0,10]}$.

The choice of representation is heavily usage-dependent. For example, if we are interested in recognition, we might favor a disjoint representation, allowing properties to be checked independently without merging, whereas if we are interested in generation or deciding non-emptiness, we might prefer a unified representation which can be efficiently sampled without rejection.

Union, concatenation and repetition are all mundane in the theory of formal languages. Intersection and negation are more challenging concepts to borrow from set theory, and do not translate naturally into the Chomsky hierarchy. For example, the context-free languages are not closed under intersection, and deciding whether the intersection of two CFLs is empty is undecidable, yet the intersection of a CFL and a regular language is always context-free.

Deciding intersection non-emptiness of a finite collection of finite automata is known to be PSPACE-complete~\cite{kozen1977lower}. It is still unknown whether a faster algorithm than the product construction exists for deciding intersection non-emptiness of just two finite automata.

The textbook algorithm proceeds as follows: create an automaton over the Cartesian product of the two state sets, and simulate both automata in lockstep, adding an arc between two state pairs whenever both automata can make the corresponding transition on the same symbol. If a pair of final states is reachable in the product automaton, then the intersection is non-empty. This requires space proportional to the product of the two automata's state counts.

\begin{figure}[h]
\caption{TODO: depict product construction for finite automata here.}
\end{figure}
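
For reference, here is a minimal Python sketch of this procedure (an illustration, not the tool developed in this thesis), which builds the product lazily so that only co-reachable state pairs are ever materialized:

\begin{verbatim}
from collections import deque

def intersection_nonempty(d1, s1, f1, d2, s2, f2):
    """d_i: state -> symbol -> state; s_i: start state; f_i: final states."""
    seen, queue = {(s1, s2)}, deque([(s1, s2)])
    while queue:
        p, q = queue.popleft()
        if p in f1 and q in f2:          # a common word reaches both finals
            return True
        for a in d1.get(p, {}).keys() & d2.get(q, {}).keys():
            succ = (d1[p][a], d2[q][a])  # step both automata in lockstep
            if succ not in seen:
                seen.add(succ)
                queue.append(succ)
    return False

# L1: even number of a's over {a, b}; L2: words ending in b
d1 = {0: {'a': 1, 'b': 0}, 1: {'a': 0, 'b': 1}}
d2 = {0: {'a': 0, 'b': 1}, 1: {'a': 0, 'b': 1}}
assert intersection_nonempty(d1, 0, {0}, d2, 0, {1})  # e.g., "b"
\end{verbatim}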

The goal of this thesis is to accelerate the product construction by leveraging (1) parameterized complexity, (2) pruning, and (3) parallelization to reduce its wallclock runtime, and to generalize it to CFG-REG intersections. We show it is possible to decide intersection non-emptiness in realtime for Levenshtein automata, and build a tool to demonstrate it on real-world programming languages and grammars.
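
For intuition, a Levenshtein automaton for a pattern $\sigma$ and radius $k$ accepts exactly the words within edit distance $k$ of $\sigma$. The following Python sketch simulates its NFA states $(i, e)$, read as ``$i$ pattern characters consumed, $e$ edits spent'', on the fly; this is the standard construction, not the realtime variant developed later in this thesis:

\begin{verbatim}
def close(states, m, k):       # epsilon-closure under deletion edits
    out, stack = set(states), list(states)
    while stack:
        i, e = stack.pop()
        if i < m and e < k and (i + 1, e + 1) not in out:
            out.add((i + 1, e + 1))      # skip pattern[i] at cost 1
            stack.append((i + 1, e + 1))
    return out

def within_edits(pattern, word, k):
    m = len(pattern)
    states = close({(0, 0)}, m, k)       # (chars matched, edits used)
    for c in word:
        nxt = set()
        for i, e in states:
            if i < m and pattern[i] == c:
                nxt.add((i + 1, e))      # match
            if e < k:
                nxt.add((i, e + 1))      # insert c
                if i < m:
                    nxt.add((i + 1, e + 1))  # substitute c for pattern[i]
        states = close(nxt, m, k)
    return any(i == m for i, _ in states)

assert within_edits("(())", "(()", 1) and not within_edits("(())", "))((", 2)
\end{verbatim}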

Finally, we show a probabilistic extension of the REG-CFL product construction, which can be used to decode the top-K most probable words in the intersection of two languages. This is useful for applications in natural language processing, where we might want to find the most natural word that satisfies multiple constraints, such as being a valid repair with fewer than $k$ edits whose probability is maximized.
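
As a toy model of this idea (simplified to a DFA constraint and a unigram symbol model, in place of the full probabilistic REG-CFL product described above), best-first search over the weighted product enumerates words in order of decreasing probability, so the first $K$ accepted words popped are exactly the top-$K$:

\begin{verbatim}
import heapq
from math import exp, log

def top_k(delta, start, finals, p, K, max_len=10):
    """delta: state -> symbol -> state; p: symbol -> probability."""
    heap, out = [(0.0, "", start)], []   # entries: (-log prob, word, state)
    while heap and len(out) < K:
        cost, word, q = heapq.heappop(heap)
        if q in finals:
            out.append((word, exp(-cost)))
        if len(word) < max_len:          # extending a word only adds cost,
            for a, q2 in delta.get(q, {}).items():   # so pops are monotone
                heapq.heappush(heap, (cost - log(p[a]), word + a, q2))
    return out

# words over {a, b} ending in b, under p(a) = 0.7, p(b) = 0.3
d = {0: {'a': 0, 'b': 1}, 1: {'a': 0, 'b': 1}}
print(top_k(d, 0, {1}, {'a': 0.7, 'b': 0.3}, 3))
# [('b', 0.3), ('ab', 0.21), ('aab', 0.147)], up to float rounding
\end{verbatim}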

\clearpage
