clean up definitions
breandan committed Dec 24, 2024
1 parent ad968bb commit ba7886a
Showing 5 changed files with 38 additions and 21 deletions.
Binary file modified latex/thesis/Thesis.pdf
Binary file not shown.
6 changes: 3 additions & 3 deletions latex/thesis/content/Ch0_Literature_Review.tex
@@ -13,9 +13,9 @@ \section{Syntactic program repair}

Recent work attempts to use neural language models to generate probable fixes. For example, Yasunaga et al.~\cite{yasunaga2021break} use an unsupervised method that first learns to synthetically corrupt natural source code (simulating a typographic noise process), then trains a second model to repair the broken code, using the uncorrupted source as the ground truth. This method does not require a parallel corpus of source code errors and fixes, but can produce a misaligned noise model and fail to generalize to out-of-distribution samples. It also does not guarantee the generated fix is valid according to the grammar.

Sakkas et al.~\cite{sakkas2022seq2parse} introduce a neurosymbolic model, Seq2Parse, which adapts the Earley parser~\cite{earley1970efficient} with a learned PCFG and a transformer-classifier to predict error production rules. This approach aims to generate only sound repairs, but lacks the ability to generate every valid repair within a given edit distance. While this has the benefit of better interpretability than end-to-end neural repair models, it is not clear how to scale up this technique to handle additional test-time compute.
Sakkas et al.~\cite{sakkas2022seq2parse} introduce a neurosymbolic model, Seq2Parse, which adapts the Earley parser~\cite{earley1970efficient} with a learned PCFG and a transformer-classifier to predict error production rules. This approach aims to generate only sound repairs, but lacks the ability to generate every valid repair within a given edit distance. While this has the benefit of better interpretability than end-to-end neural repair models, it is not clear how to scale up this technique to accommodate additional test-time compute.

Neural language models are adept at learning statistical patterns, but often sacrifice validity, precision or latency. Existing neural repair models are prone to misgeneralize and hallucinate syntactically invalid repairs and do not attempt to sample from the space of all and only valid repairs. As a consequence, they have difficulty with inference scaling, where additional test time compute does not translate to a significant improvement on the target domain. Furthermore, even if sound in theory, the generated samples may not even be syntactically valid, as we observe in practice.
Neural language models are adept at learning statistical patterns, but often sacrifice validity, precision or latency. Existing neural repair models are prone to misgeneralize and hallucinate syntactically invalid repairs and do not attempt to sample from the space of all and only valid repairs. As a consequence, they have difficulty with inference scaling, where additional test-time compute does not translate to a significant improvement on the target domain. Furthermore, even if theoretically sound, the generated samples may not even be syntactically valid, as we observe in practice.

Our work aims to address all of these concerns. We try to generate every nearby valid program and prioritize the solutions by naturalness, while ensuring response time is tolerable. In other words, we attempt to satisfy soundness, completeness, naturalness and latency simultaneously.

@@ -27,6 +27,6 @@ \section{Semantic program repair}

One approach to handling more complex synthesis tasks uses \textit{angelic execution} to generate and optimistically evaluate execution paths. Shi et al.'s FrAngel~\cite{shi2019frangel} is a particular example of this approach, which uses component-based program synthesis in conjunction with angelic nondeterminism to repair a broken program. The idea of angelic execution can be traced back to Bod\'ik et al.~\cite{bodik2010programming}, who attribute the original idea to Floyd's nondeterministic \texttt{choice} operator~\cite{floyd1967nondeterministic}. In the context of semantic program repair, angelic execution has been successfully applied by Singh et al.~\cite{singh2013automated} to auto-grade and provide feedback on programming assignments.

The idea of angelic execution has also been employed to great effect to assist with automated program repair. In particular, Long \& Rinard~\cite{long2016automatic} introduce a tool called Prophet, which uses a very similar evaluation to ours: it generates candidate patches from a search space, then ranks them according to a learned probabilistic model. Chandra et al.~\cite{chandra2011angelic} also use angelic execution to guide the search for semantic repairs. Both systems bypass the syntactic checking stage and search directly for semantic repairs, using a set of test cases as guard conditions to reject invalid candidates. Their approach is closely related to fault localization and mutation testing in the software engineering literature.
The idea of angelic execution has also been employed to great effect to assist with automated program repair. In particular, Long \& Rinard~\cite{long2016automatic} introduce a tool called Prophet, which uses a very similar evaluation to ours: it generates candidate patches from a search space, then ranks them according to a learned probabilistic model. Chandra et al.~\cite{chandra2011angelic} also use angelic execution to guide the search for semantic repairs. Both systems bypass the syntax checking stage and search directly for semantic repairs, using a set of test cases as guard conditions to reject invalid candidates. Their approach is closely related to fault localization and mutation testing in the software engineering literature.

\clearpage
5 changes: 3 additions & 2 deletions latex/thesis/content/Ch1_Introduction.tex
@@ -45,7 +45,6 @@ \chapter{\rm\bfseries Introduction}

\section{Examples}


Syntax errors are a familiar nuisance for programmers, arising from a variety of factors ranging from inexperience and typographic error to cognitive load. Often the mistake itself is simple to fix, but manual correction can disrupt concentration, a developer's most precious and fickle resource. Syntax repair attempts to automate the correction process by trying to anticipate which program, out of the many possible alternatives, the developer actually intended to write.

Taking inspiration from formal and statistical language modeling alike, we adapt a construction from Bar-Hillel~\cite{bar1961formal} for formal language intersection to the problem of syntactic program repair. Our work shows this approach, while seemingly intractable, can be scaled up to handle real-world program repair tasks. We will then demonstrate how, by decoding the Bar-Hillel construction with a simple Markov model, it is possible to predict human syntax repairs with the accuracy of large language models, while retaining the correctness and interpretability of classical repair algorithms.
@@ -89,10 +88,12 @@ \section{Example}\label{sec:example}
% \end{enumerate}
%\end{multicols}
\begin{figure}[H]
\resizebox{\textwidth}{!}{
\noindent\begin{tabular}{@{}l@{\hspace{10pt}}l@{\hspace{10pt}}l@{}}
(1) \texttt{v = df.iloc(5\hlred{:}, 2\hlorange{,})} & (3) \texttt{v = df.iloc(5\hlgreen{[}:, 2:\hlgreen{]})} & (5) \texttt{v = df.iloc\hlorange{[}5:, 2:\hlorange{]}} \\
\rule{0pt}{4ex}(2) \texttt{v = df.iloc(5\hlorange{)}, 2\hlorange{(})} & (4) \texttt{v = df.iloc(5\hlred{:}, 2\hlred{:})} & (6) \texttt{v = df.iloc(5\hlgreen{[}:, 2\hlorange{]})}\\
\end{tabular}\vspace{-5pt}
\end{tabular}
}\vspace{-10pt}
\end{figure}

With some semantic constraints, we could easily narrow the results, but even in their absence, one can probably rule out (2, 3, 6) given that \texttt{5[} and \texttt{2(} are rare bigrams in Python, and knowing \texttt{df.iloc} is often followed by \texttt{[}, determine that (5) is the most likely repair. This is the key insight behind our approach: we can usually locate the intended fix by exhaustively searching small repairs. As the set of small repairs is itself often small, if only we had some procedure to distinguish valid from invalid patches, the resulting solutions could simply be ranked by likelihood.
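
To make this concrete, the following minimal sketch ranks two of the candidates above with a toy bigram model; the tokenization, probability table, and back-off constant are illustrative placeholders rather than the Markov model developed later in this thesis.

\begin{lstlisting}[language=python]
# Illustrative sketch only: a toy bigram scorer over pre-tokenized repairs.
# The probabilities and back-off constant are hypothetical placeholders,
# not the Markov model trained in this thesis.
from math import log

BIGRAM_LOGP = {
    (".iloc", "["): log(0.60),   # df.iloc is usually indexed with brackets
    (".iloc", "("): log(0.05),
    ("5", "["): log(0.01),       # rare bigram in Python
}
BACKOFF = log(0.10)              # score assigned to unseen bigrams

def score(tokens):
    """Sum bigram log-probabilities over consecutive token pairs."""
    return sum(BIGRAM_LOGP.get(p, BACKOFF) for p in zip(tokens, tokens[1:]))

candidates = {  # two of the six candidates above, pre-tokenized
    "(5)": ["v", "=", "df", ".iloc", "[", "5", ":", ",", "2", ":", "]"],
    "(3)": ["v", "=", "df", ".iloc", "(", "5", "[", ":", ",", "2", ":", "]", ")"],
}
print(max(candidates, key=lambda k: score(candidates[k])))  # -> (5)
\end{lstlisting}
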
37 changes: 21 additions & 16 deletions latex/thesis/content/Ch3_Deterministic_Repair.tex
@@ -243,44 +243,49 @@ \subsection{Parikh Refinements}

To identify superfluous $q, v, q'$ triples, we define an interval domain that soundly overapproximates the Parikh image, encoding the minimum and maximum number of terminals each nonterminal must and can generate, respectively. Since some intervals may be right-unbounded, we write $\mathbb{N}^*=\mathbb{N} \cup \{\infty\}$ to denote the upper bound, and $\Pi = \{[a, b] \in \mathbb{N} \times \mathbb{N}^* \mid a \leq b\}^{|\Sigma|}$ to denote the domain of Parikh intervals over all terminals.

\begin{definition}[Parikh mapping of a nonterminal]\label{def:parikh}
Let $p: \Sigma^*\rightarrow\mathbb{N}^{|\Sigma|}$ be the Parikh operator~\cite{parikh1966context}, which counts the frequency of terminals in a string. We define the Parikh map, $\pi: V \rightarrow \Pi$, as a function returning the smallest interval such that $\forall \sigma: \Sigma^*, \forall v: V$, $v \Rightarrow^* \sigma \vdash p(\sigma) \in \pi(v)$.
\begin{definition}{Parikh mapping of a nonterminal}{def:parikh}
Let $p: \Sigma^*\rightarrow\mathbb{N}^{|\Sigma|}$ be the Parikh operator~\cite{parikh1966context}, which counts the frequency of terminals in a string. We define the Parikh map as a function, $\pi: V \rightarrow \Pi$, returning the smallest interval such that $\forall \sigma: \Sigma^*, \forall v: V$, $v \Rightarrow^* \sigma \vdash p(\sigma) \in \pi(v)$.
\end{definition}

The Parikh mapping computes the greatest lower and least upper bound of the Parikh image over all strings in the language of a nonterminal. The infimum of a nonterminal's Parikh interval tells us how many of each terminal a nonterminal \textit{must} generate, and the supremum tells us how many it \textit{can} generate. Likewise, we define a similar relation over NFA state pairs:

\begin{definition}[Parikh mapping of NFA states]
\begin{definition}{Parikh mapping of NFA states}{}
We define $\pi: Q\times Q \rightarrow \Pi$ as returning the smallest interval such that $\forall \sigma: \Sigma^*, \forall q, q': Q$, $q \overset{\sigma}{\Longrightarrow} q' \vdash p(\sigma) \in \pi(q, q')$.
\end{definition}
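
To illustrate how such intervals can be obtained, the sketch below computes the infimum of a nonterminal's Parikh interval, i.e., the least number of occurrences of a given terminal in any derivable string, by fixpoint iteration over a CNF grammar; the grammar encoding and names are hypothetical and not the implementation used in this work. The supremum can be computed analogously, with right-unbounded intervals arising from recursive productions.

\begin{lstlisting}[language=python]
# Illustrative sketch (hypothetical grammar encoding): lower bound of a
# Parikh interval, i.e. the least number of times `terminal` occurs in
# any string derivable from each nonterminal of a CNF grammar.
from math import inf

PRODS = [   # A -> a encoded as ("A", ("a",)); A -> B C as ("A", ("B", "C"))
    ("S", ("A", "B")),
    ("A", ("a",)),
    ("B", ("S", "A")),
    ("B", ("b",)),
]
NONTERMS = {lhs for lhs, _ in PRODS}

def parikh_lower_bound(terminal):
    lo = {v: inf for v in NONTERMS}
    changed = True
    while changed:
        changed = False
        for lhs, rhs in PRODS:
            if len(rhs) == 1:                  # terminal production A -> a
                cand = 1 if rhs[0] == terminal else 0
            else:                              # binary production A -> B C
                cand = lo[rhs[0]] + lo[rhs[1]]
            if cand < lo[lhs]:
                lo[lhs], changed = cand, True
    return lo

print(parikh_lower_bound("a"))  # e.g. {'S': 1, 'A': 1, 'B': 0}
\end{lstlisting}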

Next, we will define a measure on Parikh intervals representing the minimum total edits required to transform a string in one Parikh interval to a string in another, across all such pairings.

\begin{definition}[Parikh divergence]
\begin{definition}{Parikh divergence}{}
Given two Parikh intervals $\pi, \pi': \Pi$, we define the divergence between them as $\pi \parallel \pi' = \sum_{n=1}^{|\Sigma|} \min_{(i, i') \in \pi[n]\times \pi'[n]} |i - i'|$.
\end{definition}

We know that if the Parikh divergence between two intervals is nonzero, those intervals must be incompatible as no two strings, one from each Parikh interval, can be transformed into the other with fewer than $\pi \parallel \pi'$ edits.
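
As a purely illustrative instance, suppose $|\Sigma| = 2$, $\pi(v) = \langle[2, 3], [0, \infty]\rangle$ and $\pi(q, q') = \langle[5, 7], [1, 2]\rangle$. The first components are disjoint and their closest endpoints are $3$ and $5$, contributing $2$; the second components overlap, contributing $0$. Hence $\pi(v) \parallel \pi(q, q') = 2$: no string derived from $v$ can be transformed into a string labeling a path from $q$ to $q'$ in fewer than two edits, so the triple is superfluous.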

\begin{definition}[Parikh compatibility]
\begin{definition}{Parikh compatibility}{}
Let $q, q'$ be NFA states and $v$ be a CFG nonterminal. We call $\langle q, v, q'\rangle: Q\times V\times Q$ \textit{compatible} iff their divergence is zero, i.e., $v \lhd qq' \iff \big(\pi(v) \parallel \pi(q, q')\big) = 0$.
\end{definition}

Finally, we define the modified Bar-Hillel construction for nominal Levenshtein automata as:\vspace{-2pt}

For efficiency, Parikh compatibility can be precomputed for each $Q \times V \times Q$ triple and reused for each synthetic production. Finally, we are ready to define the modified Bar-Hillel construction:

\begin{definition}{Modified Bar-Hillel construction}{}
Let $w, x, z$ be nonterminals in a CNF CFG and $p, q, r$ be states in an FSA. We modify the $\Join$ rule from the BH construction as follows:
\begin{prooftree}
\hskip -0.9em
\def\defaultHypSeparation{\hskip 0.14cm}
\AxiomC{$(A \rightarrow a) \in P$}
\AxiomC{$(q\overset{{\color{orange}[\cdot]}}{\rightarrow}r) \in \delta$}
\AxiomC{$\color{orange}a[\cdot]$}
\RightLabel{$\hat\uparrow$}
\TrinaryInfC{$\big(qAr\rightarrow a\big)\in P_\cap$}
\DisplayProof
% \hskip -0.9em
% \def\defaultHypSeparation{\hskip 0.14cm}
% \AxiomC{$(A \rightarrow a) \in P$}
% \AxiomC{$(q\overset{{\color{orange}[\cdot]}}{\rightarrow}r) \in \delta$}
% \AxiomC{$\color{orange}a[\cdot]$}
% \RightLabel{$\hat\uparrow$}
% \TrinaryInfC{$\big(qAr\rightarrow a\big)\in P_\cap$}
% \DisplayProof
\AxiomC{$\vphantom{\overset{[\cdot]}{\rightarrow}}\color{orange} w \lhd pr \phantom{\land} x \lhd pq \phantom{\land} z \lhd qr$}
\AxiomC{$(w \rightarrow xz) \in P\vphantom{\overset{a}{\rightarrow}}$}
\AxiomC{$p,q,r \in Q$}
\RightLabel{$\hat\Join$}
\TrinaryInfC{$\big(pwr\rightarrow (pxq)(qzr)\big) \in P_\cap$}
\end{prooftree}\vspace{2pt}
\end{prooftree}
\end{definition}
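
As a sketch of how the $\hat\Join$ rule might be realized (the function and variable names here are hypothetical, and \texttt{compatible} stands for the precomputed Parikh compatibility predicate), only triples passing all three checks give rise to a synthetic production:

\begin{lstlisting}[language=python]
# Illustrative sketch (hypothetical API): instantiate the synthetic
# production (p w r) -> (p x q)(q z r) only when all three triples are
# Parikh-compatible, pruning the intersection grammar up front.
from itertools import product

def intersect_binary(binary_prods, states, compatible):
    new_prods = []
    for w, x, z in binary_prods:                 # CNF productions w -> x z
        for p, q, r in product(states, repeat=3):
            if compatible(w, p, r) and compatible(x, p, q) and compatible(z, q, r):
                new_prods.append(((p, w, r), ((p, x, q), (q, z, r))))
    return new_prods

# With a trivially permissive predicate every triple survives (illustration only):
print(len(intersect_binary([("S", "A", "B")], ["q0", "q1"], lambda v, q, r: True)))  # 8
\end{lstlisting}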


%\noindent Once constructed, we normalize $G_\cap$ by removing unreachable and non-generating productions~\cite{firsov2015certified} to obtain $G_\cap'$, which is a recognizer for the admissible set, i.e., $\mathcal{L}(G_\cap') = \ell_\cap$, satisfying Def.~\ref{def:bcflr}. Note, the original BH construction and our adapted version both reduce to the same CNF, $G_\cap'$, but normalization becomes significantly more tractable for large intersections, as far fewer useless productions are instantiated to only later be removed during normalization.
Once constructed, we normalize $G_\cap$ by removing unreachable and non-generating productions~\cite{firsov2015certified} to obtain $G_\cap'$, which is a recognizer for the admissible set, i.e., $\mathcal{L}(G_\cap') = \ell_\cap$. Note, the original BH construction and our adapted version both reduce to the same CNF, $G_\cap'$, but normalization becomes significantly more tractable for large intersections, as far fewer useless productions are instantiated only to be removed later during normalization. This modified rule is not specific to Levenshtein automata and can be used to accelerate any FSA-CFG intersection.
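
The normalization step is the standard procedure; a compact sketch follows, using a hypothetical encoding in which a production is a pair (lhs, rhs) and any symbol occurring on a left-hand side is treated as a nonterminal.

\begin{lstlisting}[language=python]
# Illustrative sketch (hypothetical encoding): normalize a CFG by
# discarding non-generating productions, then unreachable ones.
# A symbol counts as a nonterminal iff it appears on some left-hand side.
def normalize(prods, start):
    nonterms = {lhs for lhs, _ in prods}

    # Pass 1: nonterminals that derive at least one terminal string.
    gen, changed = set(), True
    while changed:
        changed = False
        for lhs, rhs in prods:
            if lhs not in gen and all(s in gen or s not in nonterms for s in rhs):
                gen.add(lhs)
                changed = True
    prods = [(l, r) for l, r in prods
             if l in gen and all(s in gen or s not in nonterms for s in r)]

    # Pass 2: nonterminals reachable from the start symbol.
    reach, frontier = {start}, [start]
    while frontier:
        cur = frontier.pop()
        for lhs, rhs in prods:
            if lhs == cur:
                for s in rhs:
                    if s in nonterms and s not in reach:
                        reach.add(s)
                        frontier.append(s)
    return [(l, r) for l, r in prods if l in reach]

toy = [("S", ("A", "B")), ("A", ("a",)), ("B", ("b",)),
       ("C", ("C", "C")), ("D", ("a",))]
print(normalize(toy, "S"))  # C (non-generating) and D (unreachable) are dropped
\end{lstlisting}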
11 changes: 11 additions & 0 deletions latex/thesis/preamble.tex
@@ -100,6 +100,17 @@
]

\usepackage[skins,breakable,listings]{tcolorbox}
\tcbuselibrary{theorems}
\newtcbtheorem
[]% init options
{definition}% name
{Definition}% title
{%
colback=black!5,
colframe=black!65,
fonttitle=\bfseries,
}% options
{def}% prefix

\lstdefinelanguage{python}{
    comment=[l]{\#},
