expand on method
breandan committed Feb 26, 2024
1 parent ff21a76 commit 5bcd840
Showing 4 changed files with 24 additions and 7 deletions.
Binary file modified latex/splash2024/experiments/evaluation.pdf
Binary file not shown.
3 changes: 2 additions & 1 deletion latex/splash2024/experiments/evaluation.tex
@@ -14,6 +14,7 @@
% Document
\begin{document}

\section{Evaluation}
For our evaluation, we use the StackOverflow dataset from \cite{hindle2012naturalness}. We preprocess the dataset to lexicalize both the broken and fixed code snippets, then filter it by length and edit distance, retaining every Python snippet whose broken form is under 80 lexical tokens long and whose human fix is under four Levenshtein edits.
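Concretely, writing $\mathcal{D}$ for the preprocessed dataset, $|b|$ for the token length of a broken snippet $b$, and $\Delta$ for token-level Levenshtein distance (symbols introduced here for illustration, not notation used elsewhere in the paper), the retained subset is roughly
\[
  \mathcal{D}' \;=\; \bigl\{\, (b, f) \in \mathcal{D} \;:\; |b| < 80 \ \wedge\ \Delta(b, f) < 4 \,\bigr\}.
\]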

For our first experiment, we run the sampler until the human repair is detected, then measure the number of samples required to draw the exact human repair across varying Levenshtein radii.
@@ -23,7 +24,7 @@
\caption{Sample efficiency of LBH sampler at varying Levenshtein radii.}\label{fig:sample_efficiency}
\end{figure}

Next, measure the precision at various ranking cutoffs for varying wall-clock timeouts. Here, P@\{k=1, 5, 10, All\} indicates the percentage of syntax errors with a human repair of $\Delta=\{1, 2, 3, 4\}$ edits found in $\leq p$ seconds that were matched within the top-k results, using an ngram likelihood model.
Next, we measure the precision at various ranking cutoffs for varying wall-clock timeouts. Here, P@\{k=1, 5, 10, All\} indicates the percentage of syntax errors with a human repair of $\Delta=\{1, 2, 3, 4\}$ edits found in $\leq p$ seconds that were matched within the top-$k$ results, using an n-gram likelihood model to rank candidates.
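Stated more explicitly (an illustrative formalization; the symbols below are introduced here rather than defined elsewhere), let $E$ be the set of syntax errors with a $\Delta$-edit human repair for which sampling completes within $p$ seconds, $r_e^{\ast}$ the human repair of error $e$, and $R_p(e)$ the repairs sampled within $p$ seconds, ranked by n-gram likelihood. Then
\[
  \text{P@}k \;=\; \frac{\bigl|\{\, e \in E \;:\; \operatorname{rank}\bigl(r_e^{\ast},\, R_p(e)\bigr) \le k \,\}\bigr|}{|E|} \times 100\%.
\]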

\begin{figure}[h!]
% \resizebox{.19\textwidth}{!}{\input{bar_hillel_repair.tex}}
16 changes: 16 additions & 0 deletions latex/splash2024/method/method.tex
@@ -120,4 +120,20 @@ \section{Method}
\caption{Flowchart of our proposed method.}\label{fig:flowchart}
\end{figure}

\subsection{The Nominal Levenshtein Automaton}

Levenshtein edits are recognized by a finite automaton known as the Levenshtein automaton. Since the original construction of Schulz and Mihov contains cycles and epsilon transitions, we propose a modified variant that is epsilon-free, acyclic, and monotone. Furthermore, we use a nominal automaton, which allows for infinite alphabets. This considerably simplifies the language intersection.
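To sketch the shape of this construction (illustrative notation only; the accompanying test code builds it via \texttt{makeLevFSA}), for a token string $\sigma_1 \cdots \sigma_n$ and edit bound $k$, the automaton has states $q_{i,e}$ recording the number of tokens consumed $i$ and edits spent $e$:
\begin{align*}
  Q &= \{\, q_{i,e} : 0 \le i \le n,\ 0 \le e \le k \,\}, \qquad q_{0,0} \text{ initial}, \qquad F = \{\, q_{i,e} : n - i \le k - e \,\},\\
  \delta &\ni\ q_{i,e} \xrightarrow{=\sigma_{i+1}} q_{i+1,e}, \qquad
            q_{i,e} \xrightarrow{\neq\sigma_{i+1}} q_{i+1,e+1}, \qquad
            q_{i,e} \xrightarrow{\top} q_{i,e+1}, \qquad
            q_{i,e} \xrightarrow{=\sigma_{i+d+1}} q_{i+d+1,e+d} \ \ (1 \le d \le k - e),
\end{align*}
where arcs are only included when the subscripts are in range. The four arc families encode matching, substitution, insertion, and $d$ deletions fused with the following match; trailing deletions are absorbed by the acceptance condition, so no epsilon arcs are needed, and since $i$ and $e$ never decrease the automaton is acyclic and monotone. The labels $=\sigma$, $\neq\sigma$, and $\top$ are predicates over the token alphabet rather than individual symbols, which is what makes the automaton nominal.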

\subsection{The Bar-Hillel Construction}

The Bar-Hillel construction is a general method for obtaining a context-free grammar representing the intersection of a context-free language and a regular language. We will now present the epsilon-free version of the Bar-Hillel construction used in our work.\footnote{Clemente Pasti gives a version of the BH construction that supports epsilon transitions, but it is slightly more complicated.}
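Schematically, for a grammar in Chomsky normal form with productions $P$ and start symbol $S$, and an epsilon-free automaton with states $Q$, transition function $\delta$, initial state $q_0$, and final states $F$, the construction introduces (a textbook-style sketch; the exact formulation used in our work may differ in presentation):
\begin{align*}
  &\langle p, A, r \rangle \rightarrow \langle p, B, q \rangle\,\langle q, C, r \rangle && \text{for each } A \rightarrow B\,C \in P \text{ and } p, q, r \in Q,\\
  &\langle p, A, q \rangle \rightarrow a && \text{for each } A \rightarrow a \in P \text{ and } q \in \delta(p, a),\\
  &S' \rightarrow \langle q_0, S, q_f \rangle && \text{for each } q_f \in F,
\end{align*}
so the result can contain up to $|P|\,|Q|^3$ binary productions, which is the blowup the following reduction addresses.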

\subsection{The Levenshtein-Bar-Hillel-Parikh Reduction}

The standard BH construction applies to any context-free and regular language pair. While straightforward, the general method can generate hundreds of trillions of productions for moderately sized grammars and Levenshtein automata. Our method considerably simplifies this process by eliminating the need to materialize most of those productions, and is the key to making our approach tractable.

To achieve this, we precompute upper and lower Parikh bounds for every terminal and every integer range of string positions, which we call the Parikh map. This construction soundly over-approximates the minimum and maximum number of times each terminal can be derived from a given nonterminal in a bounded-length string, and is used to prune the search space. We will now describe this reduction in detail.
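Schematically (again in illustrative notation), write $[\underline{\Pi}_A(\sigma), \overline{\Pi}_A(\sigma)]$ for the precomputed bounds on how many occurrences of terminal $\sigma$ a bounded-length string derivable from nonterminal $A$ may contain, and $[\underline{\Pi}_{p \to q}(\sigma), \overline{\Pi}_{p \to q}(\sigma)]$ for the corresponding bounds over the words read along any path from state $p$ to state $q$ in the Levenshtein automaton. A synthetic nonterminal $\langle p, A, q \rangle$ can then be discarded, before any of its productions are materialized, whenever
\[
  \exists\, \sigma \in \Sigma.\ \
  \bigl[\underline{\Pi}_A(\sigma),\, \overline{\Pi}_A(\sigma)\bigr] \,\cap\,
  \bigl[\underline{\Pi}_{p \to q}(\sigma),\, \overline{\Pi}_{p \to q}(\sigma)\bigr] \;=\; \varnothing.
\]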



\end{document}
@@ -101,27 +101,27 @@ class ProbabilisticLBH {


/*
./gradlew jvmTest --tests "ai.hypergraph.kaliningraph.repair.ProbabilisticLBH.twoEditRepair"
./gradlew jvmTest --tests "ai.hypergraph.kaliningraph.repair.ProbabilisticLBH.threeEditRepair"
*/
@Test
fun twoEditRepair() {
fun threeEditRepair() {
val source = "NAME = { STRING = NUMBER , STRING = NUMBER , STRING = NUMBER } NEWLINE"
val repair = "NAME = { STRING : NUMBER , STRING : NUMBER , STRING : NUMBER } NEWLINE"
val gram = Grammars.seq2parsePythonCFG.noEpsilonOrNonterminalStubs
MAX_TOKENS = source.tokenizeByWhitespace().size + 5
MAX_RADIUS = 3
// MAX_TOKENS = source.tokenizeByWhitespace().size + 5
// MAX_RADIUS = 3
val levDist = 3
assertTrue(repair in gram.language && levenshtein(source, repair) <= levDist)

val clock = TimeSource.Monotonic.markNow()
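// Construct the Levenshtein automaton around the tokenized broken source, then intersect it with the CFG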
val levBall = makeLevFSA(source.tokenizeByWhitespace(), levDist)
val intGram = gram.jvmIntersectLevFSA(levBall)
println("Finished ${intGram.size}-prod ∩-grammar in ${clock.elapsedNow()}")
val lbhSet = intGram.toPTree().sampleDirectlyWOR()
.takeWhile { clock.elapsedNow().inWholeSeconds < 30 }.collect(Collectors.toSet())
println("Sampled ${lbhSet.size} repairs using Levenshtein/Bar-Hillel in ${clock.elapsedNow()}")
assertTrue(repair in intGram.language)
assertTrue(repair in lbhSet)
println(repair in lbhSet)
}

/*
