V Corpus #27

Open
jfioasd opened this issue Apr 9, 2023 · 4 comments

@jfioasd

jfioasd commented Apr 9, 2023

As in this issue, I decided to run Lynn's method on V answers.

Query used. Code:

import csv
import collections

digraphs = collections.Counter()
trigraphs = collections.Counter()
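quadgraphs = collections.Counter()

# Despite its name, this table is the 05AB1E code page (indexed by byte value), used to display each byte.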
cp1252 = "ǝʒαβγδεζηθ\nвимнтΓΔΘιΣΩ≠∊∍∞₁₂₃₄₅₆ !\"#$%&'()*+,-./0123456789" + \
                                 ":;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrst" + \
                                 "uvwxyz{|}~Ƶ€Λ‚ƒ„…†‡ˆ‰Š‹ŒĆŽƶĀ‘’“”•–—˜™š›œćžŸā¡¢£¤¥¦§¨©ª«¬λ®¯°" + \
                                 "±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëì" + \
                                 "íîïðñòóôõö÷øùúûüýþÿ"
with open("QueryResults(2).csv", newline="", encoding="utf-8") as f:
    for row in csv.reader(f):
        if row[0] == "Post Link":
            continue
        code = row[1]
        if "<pre><code>" not in code:
            continue

        # Extract the first bit of code
        vyxal = (
            code.partition("<pre><code>")[2]
            .partition("</code></pre>")[0]
            .strip()
        )
        vyxal = vyxal.replace("&quot;", '"')
        vyxal = vyxal.replace("&gt;", ">").replace("&lt;", "<")
        vyxal = vyxal.replace("&amp;", "&")

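        # Numeric character references (&#NNN;) each stand for one byte.
        for i in range(0, 256):
            vyxal = vyxal.replace("&#"+str(i)+";", cp1252[i])
        # Vim key notation: <esc> is byte 0x1b, and <C-a>..<C-z> are bytes 0x01..0x1a.
        vyxal = vyxal.replace("<esc>", cp1252[0x1b])
        alpha = "abcdefghijklmnopqrstuvwxyz"
        for idx, i in enumerate(alpha):
            vyxal = vyxal.replace("<C-"+i+">", cp1252[idx+1])

        # <M-x> is 'x' with the high bit set; other <M-...> keys are not handled here.
        vyxal = vyxal.replace("<M-x>", "ø")

        # Skip answers where any character occurs 10 or more times.
        if any(vyxal.count(c) >= 10 for c in vyxal):
            continue
        # Skip answers longer than 100 characters.
        if len(vyxal) > 100:
            continue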

        for line in vyxal.split("\n"):
            for (a, b) in zip(line, line[1:]):
                digraphs[a, b] += 1
            for (a, b, c) in zip(line, line[1:], line[2:]):
                trigraphs[a, b, c] += 1
            for (a, b, c, d) in zip(line, line[1:], line[2:], line[3:]):
                quadgraphs[a, b, c, d] += 1

with open("most-common.txt", "w", encoding="utf-8") as f:
    f.write("2-graphs:\n")
    for d, n in digraphs.most_common(30):
        f.write("%4d %s\n" % (n, "".join(d)))

    f.write("\n3-graphs:\n")
    for d, n in trigraphs.most_common(30):
        f.write("%4d %s\n" % (n, "".join(d)))

    f.write("\n4-graphs:\n")
    for d, n in quadgraphs.most_common(30):
        f.write("%4d %s\n" % (n, "".join(d)))
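
The three zip loops in the script follow the same sliding-window pattern, so they could be folded into one helper. The following is only a minimal sketch of that idea, not the script that produced the results; `count_ngrams` is a hypothetical name, and it keys the counter by substring rather than by tuple.

```python
import collections


def count_ngrams(lines, n):
    """Count every run of n consecutive characters, never crossing a line break."""
    counts = collections.Counter()
    for line in lines:
        # range() is empty when the line is shorter than n, so short lines add nothing.
        for i in range(len(line) - n + 1):
            counts[line[i:i + n]] += 1
    return counts


if __name__ == "__main__":
    sample = "dd\nxxx".split("\n")
    for n in (2, 3, 4):
        print(n, count_ngrams(sample, n).most_common(3))
```

Keying by substring avoids the `"".join(d)` step when writing the report, at the cost of differing from the tuple keys used in the script.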

Results (displayed in the 05AB1E codepage):

2-graphs:
  24 Àñ
  24 ./
  21 Àé
  21 @"
  19 $x
  17 xx
  17  /
  17   
  15 /&
  15 dd
  15 «©
  14 Íî
  14 òÍ
  14 12
  13 2i
  13 lD
  13 Gp
  13 ll
  13 é 
  12 Θ"
  12 /d
  12 / 
  12 Yp
  11 r 
  11 lx
  11 e 
  11  ₂
  11 Ó.
  11 òd
  10 kl

3-graphs:
  11 ./&
  10 Ó./
   8 [ae
   8 aei
   8 eio
   8 iou
   8 "qp
   7 YGp
   7 /&ò
   7 lxx
   7 $xh
   7 D@"
   6 Í./
   6 xx>
   6 Ä$x
   6 qpx
   6 Àé 
   5 Àé*
   5 òͨ
   5 À­ñ
   5 ou]
   5 ¨ä«
   4 ©î±
   4 ¨[a
   4 ]«©
   4 «©¨
   4 «©/
   4 òÍî
   4 /  
   4 /12

4-graphs:
   8 [aei
   8 aeio
   8 eiou
   8 Ó./&
   6 ./&ò
   6 "qpx
   5 iou]
   4 ¨[ae
   4 òÄ$x
   4 Ä$xh
   4 ~"qp
   4 :se 
   4 2i2i
   4 ¨ä«©
   3 Í./&
   3 ¨.«©
   3 lxx>
   3 iouy
   3 ouy]
   3 uy]«
   3 À|lD
   3 Ñ~"q
   3 ./& 
   3 òhYp
   3 hYpX
   3 :sor
   3 éiD@
   3 iD@"
   3 ₂"qp
   3 gÓul
@DJMcMayhem
Owner

I think there's a major oversight in your parsing. &# is not really meaningful V code (it's valid, but not exactly 'useful'), so there's absolutely no way that's the most common 2-byte sequence in V code. Look at this answer for example: https://codegolf.stackexchange.com/a/124772/31716

The markdown for that answer is

<pre><code>Í.&#147;op
</code></pre>

which renders on SE as

Í.“op

Also, <C-x> means "ctrl-x", but this parser treats it as 5 distinct bytes instead of 1, which is why <C, C-, and esc all score so high. It seems this parser isn't sophisticated enough to handle the way V answers tend to be formatted.
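
To make the byte counting concrete, here is a rough sketch of collapsing both notations into single bytes. It assumes the angle brackets have already been HTML-unescaped, that control keys are written in lowercase, and that every numeric reference is below 256; `to_byte_string` and `render_cp1252` are hypothetical helper names, not part of any existing tool.

```python
import re


def to_byte_string(readable: str) -> str:
    """Return a string in which each character stands for exactly one byte."""
    # &#NNN; is a single byte with value NNN (assumed to be below 256).
    s = re.sub(r"&#(\d{1,3});", lambda m: chr(int(m.group(1))), readable)
    # <esc> is the escape byte 0x1b.
    s = s.replace("<esc>", "\x1b")
    # <C-a>..<C-z> are the control bytes 0x01..0x1a (the letter with bits 5-6 cleared).
    s = re.sub(r"<C-([a-z])>", lambda m: chr(ord(m.group(1)) & 0x1F), s)
    return s


def render_cp1252(byte_string: str) -> str:
    """Show the bytes as CP-1252 would display them; undefined bytes become U+FFFD."""
    return byte_string.encode("latin-1").decode("cp1252", errors="replace")


if __name__ == "__main__":
    decoded = to_byte_string("Í.&#147;op")
    print(len(decoded), render_cp1252(decoded))
    print(len(to_byte_string("<C-x>")))
```

With this mapping the example answer is 5 bytes long and `<C-x>` counts once, so fragments such as `&#` or `<C` no longer show up as n-grams.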

@jfioasd
Author

jfioasd commented Apr 11, 2023

Thanks, I thought V was an SBCS. I'll try to parse <...> and &#...; in my analyzer.

Apparently, SE uses CP-1252, so I'll use the 05AB1E codepage to display it (replacing these sequences with the respective characters).

@DJMcMayhem
Owner

It is an SBCS; it's just that the answers are frequently formatted in "readable mode" with things like <C-a>, <esc>, <M-D>, etc. That's one additional thing that would need to be parsed: <M-x> means "alt-x", which is 'x' with the high bit set in latin9, or ø.
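
As a small illustration of the high-bit rule, the sketch below expands any `<M-…>` key with a single printable ASCII character. The helper names `meta_key` and `expand_meta` are made up for this example, and it assumes the resulting byte is displayed through Python's `iso8859_15` (latin9) codec.

```python
import re


def meta_key(ch: str) -> str:
    """Return the character for alt+ch: the same byte with bit 7 (0x80) set."""
    byte = ord(ch) | 0x80
    return bytes([byte]).decode("iso8859_15")


def expand_meta(readable: str) -> str:
    """Replace every <M-c> (c being one printable ASCII character) with its single-byte form."""
    return re.sub(r"<M-([\x21-\x7e])>", lambda m: meta_key(m.group(1)), readable)


if __name__ == "__main__":
    print(expand_meta("<M-x>"))    # 'x' is 0x78, so the result is byte 0xf8
    print(expand_meta("a<M-D>b"))  # 'D' is 0x44, so the middle byte is 0xc4
```

Handling every `<M-…>` key this way, rather than only `<M-x>`, would also cover forms such as `<M-D>`.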

@jfioasd
Author

jfioasd commented Apr 13, 2023

Done
