fix minor typos
breandan committed Apr 6, 2024
1 parent e803206 commit e0bf4dc
Showing 2 changed files with 8 additions and 8 deletions.
Binary file modified latex/splash2024/splash.pdf
16 changes: 8 additions & 8 deletions latex/splash2024/splash.tex
@@ -86,7 +86,7 @@

\section{Example}

-Syntax errors are usually fixable with a small number of edits. If we assume the intended repair contains just a few edits, this imposes strongly locality constraints on space of possible edits. For example, let us consider the following Python snippet, which contains a small syntax error:\\
+Syntax errors are usually fixable with a small number of edits. If we assume the intended repair contains just a few edits, this imposes strong locality constraints on the space of possible edits. For example, let us consider the following Python snippet, which contains a small syntax error:\\

\texttt{def prepend(i, k, L=[]) n and [prepend(i - 1, k, [b] + L) for b in range(k)]}\\
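The snippet above is rejected because Python requires a colon (and typically a return) after the parameter list. As a minimal illustration, not taken from the paper, candidate repairs can be checked against Python's own parser; the two candidate edits below are illustrative guesses, not the ground-truth fix:

```python
import ast

broken = ("def prepend(i, k, L=[]) n and "
          "[prepend(i - 1, k, [b] + L) for b in range(k)]")

# Two illustrative candidate edits (hypothetical, not the paper's repair):
candidates = [
    broken.replace(") n", "): n"),         # one token edit: insert ':'
    broken.replace(") n", "): return n"),  # two edits: insert ':' and 'return'
]

for fix in candidates:
    try:
        ast.parse(fix)          # the parser accepts this candidate
        print("valid:  ", fix)
    except SyntaxError:         # the parser still rejects it
        print("invalid:", fix)
```

Both candidates happen to parse; deciding which one the author intended is the ranking problem addressed later in the paper.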

@@ -130,7 +130,7 @@

\clearpage\section{Problem statement}

-Source code in a programming language can be treated a string over a finite alphabet, $\Sigma$. We will use a lexical alphabet for convenience. The language has a syntax, $\ell \subset \Sigma^*$, containing every acceptable program. A syntax error is an unacceptable string, $\err\sigma \notin \ell$. We can model syntax repair as a language intersection between a context-free language (CFL) and a regular language. Henceforth, $\err\sigma$ will always and only be used to denote a syntactically invalid string whose target language is known.
+Source code in a programming language can be treated as a string over a finite alphabet, $\Sigma$. We use a lexical alphabet for convenience. The language has a syntax, $\ell \subset \Sigma^*$, containing every acceptable program. A syntax error is an unacceptable string, $\err\sigma \notin \ell$. We can model syntax repair as a language intersection between a context-free language (CFL) and a regular language. Henceforth, $\err\sigma$ will always and only be used to denote a syntactically invalid string whose target language is known.

\begin{definition}[Bounded Levenshtein-CFL reachability]\label{def:bcflr}
Given a CFL, $\ell$, and an invalid string, $\err{\sigma}: \ell^\complement$, find every valid string reachable within $d$ edits of $\err{\sigma}$, i.e., letting $\Delta$ be the Levenshtein metric and $L(\err\sigma, d) = \{\sigma' \mid \Delta(\err{\sigma}, \sigma') \leq d\}$ be the Levenshtein $d$-ball, we seek to find $A = L(\err\sigma, d) \cap \ell$.
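For intuition, here is a brute-force sketch of the definition above over a toy alphabet. This is emphatically not the paper's algorithm, which computes the intersection symbolically; `member` stands in for any CFL membership oracle, and the enumeration is exponential:

```python
from itertools import product

def levenshtein(a, b):
    # Textbook single-row dynamic-programming edit distance.
    dp = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(b) + 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (a[i - 1] != b[j - 1]))
    return dp[-1]

def admissible_set(sigma, d, alphabet, member):
    # Brute-force A = L(sigma, d) ∩ ℓ: try every string whose length is
    # within d of |sigma|, keep those inside the d-ball that the oracle
    # accepts. For illustration only.
    found = set()
    for n in range(max(0, len(sigma) - d), len(sigma) + d + 1):
        for cand in product(alphabet, repeat=n):
            if levenshtein(sigma, cand) <= d and member(cand):
                found.add("".join(cand))
    return found

def dyck(s):  # toy CFL oracle: balanced parentheses
    depth = 0
    for c in s:
        depth += 1 if c == "(" else -1
        if depth < 0:
            return False
    return depth == 0

print(admissible_set("(()", 1, "()", dyck))
# -> {'()', '(())', '()()'}: the three valid strings within one edit
```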
@@ -933,7 +933,7 @@

We use syntax errors and fixes from the Python language to validate our approach. Python source code fragments are abstracted as a sequence of lexical tokens using the official Python lexer, erasing numbers and identifiers, but retaining all other keywords. Precision is evaluated across a test set by checking for lexical equivalence with the ground-truth repair, following Sakkas et al. (2022)~\cite{sakkas2022seq2parse}.

-We compare our method against two separate baselines, Seq2Parse and Break-It-Fix-It (BIFI)~\cite{yasunaga2021break} on a single test set. This dataset~\cite{wong2019syntax} consists of 20k naturally-occurring pairs of Python errors and their corresponding human fixes from StackOverflow and is used compare the precision of each method at blind recovery of the ground truth repair across varying edit distances, snippet lengths and latency cutoffs. We preprocess all source code by filtering for broken-fixed snippet pairs shorter than 80 tokens and fewer than five Levenshtein edits apart, whose broken and fixed form is accepted and rejected, respectively, by the Python 3.8.11 parser. We then balance the dataset by sampling an equal number of repairs from each length and Levenshtein edit distance.
+We compare our method against two separate baselines, Seq2Parse and Break-It-Fix-It (BIFI)~\cite{yasunaga2021break}, on a single test set. This dataset~\cite{wong2019syntax} consists of 20k naturally occurring pairs of Python errors and their corresponding human fixes from StackOverflow and is used to compare the precision of each method at blind recovery of the ground-truth repair across varying edit distances, snippet lengths and latency cutoffs. We preprocess all source code by filtering for broken-fixed snippet pairs shorter than 80 tokens and fewer than five Levenshtein edits apart, whose broken and fixed forms are rejected and accepted, respectively, by the Python 3.8.11 parser. We then balance the dataset by sampling an equal number of repairs from each length and Levenshtein edit distance.

% In our synthetic experiments, we apply the pretrained BIFI breaker to synthetically corrupt Python snippets from the BIFI good code test set, using the clean source as the ground truth repair, and filter broken-fixed snippet pairs by the same criteria.
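The lexical abstraction described above can be sketched with Python's standard tokenize module. Which token classes are erased beyond numbers and identifiers (strings, comments, layout tokens) is our assumption here, not the paper's specification:

```python
import io
import keyword
import tokenize

def abstract(src: str) -> list[str]:
    # Erase identifiers and numbers to placeholder classes; keep keywords
    # and punctuation. String/comment/layout handling is an assumption.
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(src).readline):
        if tok.type == tokenize.NAME:
            out.append(tok.string if keyword.iskeyword(tok.string) else "NAME")
        elif tok.type == tokenize.NUMBER:
            out.append("NUMBER")
        elif tok.type == tokenize.STRING:
            out.append("STRING")
        elif tok.type in (tokenize.NL, tokenize.NEWLINE, tokenize.INDENT,
                          tokenize.DEDENT, tokenize.COMMENT, tokenize.ENDMARKER):
            continue  # drop layout tokens
        else:
            out.append(tok.string)
    return out

print(abstract("def prepend(i, k, L=[]) n and "
               "[prepend(i - 1, k, [b] + L) for b in range(k)]"))
# ['def', 'NAME', '(', 'NAME', ',', 'NAME', ',', 'NAME', '=', '[', ']',
#  ')', 'NAME', 'and', '[', 'NAME', '(', 'NAME', '-', 'NUMBER', ...]
```

Note that lexing succeeds even though the snippet fails to parse, which is what makes token-level repair possible.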

@@ -1237,7 +1237,7 @@

\clearpage\subsection{Subcomponent ablation}\label{sec:rq3}

-Originally, we used a adaptive rejection-based sampler, which did not sample directly from the admissible set, but the entire Levenshtein ball, and then rejected invalid samples. Although rejection sampling has a much lower minimum latency threshold to return admissible repairs, i.e., a few seconds at most, the average time required to attain a desired precision on human repairs is much higher. We present the results from the rejection-based evaluation for comparison below.
+Originally, we used an adaptive rejection-based sampler, which did not sample directly from the admissible set but from the entire Levenshtein ball, and then rejected invalid samples. Although rejection sampling has a much lower minimum latency threshold to return admissible repairs, i.e., a few seconds at most, the average time required to attain a desired precision on human repairs is much higher. We present the results from the rejection-based evaluation for comparison below.
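A minimal sketch of such a rejection-based sampler, assuming a membership oracle `member`; the proposal distribution is a simplification, not the adaptive scheme actually evaluated:

```python
import random

def rejection_sample(tokens, d, alphabet, member, trials=100_000):
    # Propose random strings inside the Levenshtein d-ball around `tokens`
    # and keep those the oracle accepts; most proposals are rejected.
    found = set()
    for _ in range(trials):
        cand = list(tokens)
        for _ in range(random.randint(1, d)):  # k <= d edits stay in-ball
            op = random.choice(("insert", "delete", "substitute"))
            if op == "insert":
                cand.insert(random.randrange(len(cand) + 1),
                            random.choice(alphabet))
            elif cand:  # delete/substitute need a nonempty string
                i = random.randrange(len(cand))
                if op == "delete":
                    del cand[i]
                else:
                    cand[i] = random.choice(alphabet)
        if member(tuple(cand)):
            found.add(tuple(cand))  # admissible repair
    return found
```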

\begin{figure}[H]
\resizebox{.24\textwidth}{!}{\input{repair1-3_10s_plot}}
@@ -1275,9 +1275,9 @@

Our primary insight leading to state-of-the-art precision is that repairs are typically concentrated near the center of a small Levenshtein ball, and by enumerating or sampling it carefully, then reranking all repairs found, we can achieve a significant improvement over one-shot neural repair. This is especially true for small-radii Levenshtein balls, where the admissible set is small enough to be enumerated completely and ranked. For larger radii, we can still achieve competitive precision by using a PCFG to sample from the admissible set and reranking by perplexity.

-Unexpectedly, we find that Precision@1 of our method is competitive with BIFI's Precision@20k, while requiring only a fraction of the data and compute. This is likely due to the fact that BIFI's training set does not cover the full space of syntactically valid repairs. As Tidyparse uses its own grammar, it can sample from the language directly, and does not require training distribution to suggest valid repairs, only to rank them by naturalness. The emphasis on completeness is especially useful for discovering small repairs, which may be overlooked by neural models.
+Unexpectedly, we find that Precision@1 of our method is competitive with BIFI's Precision@20k, while requiring only a fraction of the data and compute. This is likely because BIFI's training set does not cover the full space of syntactically valid repairs. As Tidyparse uses its own grammar, it can sample from the language directly, and does not require a training distribution to suggest valid repairs, only to rank them by naturalness. The emphasis on completeness is especially useful for discovering small repairs, which may be overlooked by neural models.

-Although latency and precision are ultimately the deciding usability factors, repair throughput is an crucial intermediate factor to consider when evaluating the performance of a repair system. Even with a perfectly accurate scoring function, if the correct repair is never retrieved, it will be for naught. By maximizing the total number of unique valid repairs, we increase the likelihood of retrieving natural repairs to give the scoring function the best chance of ranking them successfully. For this reason, we prioritize throughput heavily in our design (Def.~\ref{def:linear-convergence}) and evaluation (Fig.~\ref{fig:throughput}).
+Although latency and precision are ultimately the deciding usability factors, repair throughput is a crucial intermediate factor to consider when evaluating the performance of a repair system. Even with a perfectly accurate scoring function, if the correct repair is never retrieved, it will be for naught. By maximizing the total number of unique valid repairs, we increase the likelihood of retrieving natural repairs to give the scoring function the best chance of ranking them successfully. For this reason, we prioritize throughput heavily in our design (Def.~\ref{def:linear-convergence}) and evaluation (Fig.~\ref{fig:throughput}).

Rejection sampling can be a useful technique for quickly retrieving a subset of valid repairs, but has the disadvantage of converging very slowly, requiring far too many samples to achieve competitive precision on natural repairs. One avenue may be to use rejection sampling to find probable edit locations, then switch to an exhaustive method to retrieve all repairs in that region. This approach, however, would not offer the same completeness guarantees as language intersection.
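To illustrate the scoring side, the sketch below reranks an admissible set by perplexity. The paper ranks with a PCFG; we substitute an add-one-smoothed bigram model, which is far cruder but shows the shape of the retrieve-then-rank pipeline:

```python
import math
from collections import Counter

def fit_bigrams(corpus):
    # Count unigrams and bigrams over a corpus of token sequences.
    unigrams = Counter(t for seq in corpus for t in seq)
    bigrams = Counter(b for seq in corpus for b in zip(seq, seq[1:]))
    return unigrams, bigrams

def perplexity(seq, unigrams, bigrams, vocab):
    # Add-one-smoothed bigram perplexity; lower means more "natural".
    logp = sum(math.log((bigrams[(p, c)] + 1) / (unigrams[p] + vocab))
               for p, c in zip(seq, seq[1:]))
    return math.exp(-logp / max(1, len(seq) - 1))

def rerank(repairs, corpus):
    # Order the admissible set so the most natural repair comes first.
    unigrams, bigrams = fit_bigrams(corpus)
    return sorted(repairs,
                  key=lambda r: perplexity(r, unigrams, bigrams, len(unigrams)))
```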

Expand Down Expand Up @@ -1379,15 +1379,15 @@

% As our work shows, not only is linear algebra over finite fields an expressive language for probabilistic inference, but also an efficient framework for inference on languages themselves. Borrowing analysis techniques from multilinear algebra and tensor completion in the machine learning setting, we develop an equational theory that allows us to translate various decision problems on formal languages into a system of inequalities over finite fields. We demonstrate the effectiveness of our approach for syntax repair in context-free languages, and show that our approach is competitive with state-of-the-art methods in terms of both accuracy and efficiency. In future work, we hope to extend our method to more natural grammars like conjunctive languages, TAG, LCFRS and other mildly context-sensitive languages.

-From a usability standpoint, syntax repair tools should be as user-friendly and widely-accessible as autocorrection tools in word processors. We argue it is possible to reduce disruption from manual syntax repair and improve the efficiency of working programmers by driving down the latency needed to synthesize an acceptable repair. In contrast with program synthesizers that require intermediate editor states to be well-formed, our synthesizer does not impose any constraints on the code itself being written and is possible to use in an interactive programming setting.
+From a usability standpoint, syntax repair tools should be as user-friendly and widely accessible as autocorrection tools in word processors. We argue it is possible to reduce disruption from manual syntax repair and improve the efficiency of working programmers by driving down the latency needed to synthesize an acceptable repair. In contrast with program synthesizers that require intermediate editor states to be well-formed, our synthesizer does not impose any constraints on the code being written and can be used in an interactive programming setting.

% The design of the tool itself is relatively simple. Tidyparse accepts a context-free language and a string. If the string is valid, it returns the parse forest, otherwise, it returns a set of repairs, ordered by likelihood. This approach has many advantages, enabling us to repair broken syntax, correct typos and recover from small errors, while being provably sound and complete with respect to the grammatical specification and a Levenshtein bound. It is also compatible with neural program synthesis and repair techniques, which can be used to score and rank the generated repairs.

We have implemented our approach and demonstrated its viability as a tool for syntax assistance in real-world programming languages. Tidyparse is capable of generating repairs for invalid source code in a range of practical languages with little to no data required. We plan to continue expanding the prototype's autocorrection functionality to cover a broader range of languages and hope to conduct a more thorough user study to validate its effectiveness in practical programming scenarios.

\section*{Data-Availability Statement}

-An artifact for Tidyparse is currently available as a browser application.~\footnote{\url{https://tidyparse.github.io}} While the browser demo is current single-threaded and does not presently support ranking synthetic repairs by naturalness, it is capable of automatically repairing syntax errors in arbitrary context-free languages. The data and source code for the experiments contained in this paper will be made available upon publication.
+An artifact for Tidyparse is currently available as a browser application.\footnote{\url{https://tidyparse.github.io}} While the browser demo is single-threaded and does not support ranking synthetic repairs by naturalness, it is capable of automatically repairing syntax errors in arbitrary context-free languages. The data and source code for the experiments contained in this paper will be made available upon publication.

%\subsection{Ranking}
%
