Join datasets have values that seem off #21

Open
MrPowers opened this issue Dec 6, 2024 · 2 comments

MrPowers commented Dec 6, 2024

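The id1, id2, and id4 columns in the main table and the join tables don't seem to match up: in the main table id1 and id2 are strings and id4 is an integer, while in the join tables id1 and id2 are integers and id4 is a string.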

Here's the main table:

┌───────┬─────────┬──────────────┬─────┬───┬───────┬─────┬─────┬───────────┐
│ id1   ┆ id2     ┆ id3          ┆ id4 ┆ … ┆ id6   ┆ v1  ┆ v2  ┆ v3        │
│ ---   ┆ ---     ┆ ---          ┆ --- ┆   ┆ ---   ┆ --- ┆ --- ┆ ---       │
│ str   ┆ str     ┆ str          ┆ i64 ┆   ┆ i64   ┆ i64 ┆ i64 ┆ f64       │
╞═══════╪═════════╪══════════════╪═════╪═══╪═══════╪═════╪═════╪═══════════╡
│ id038 ┆ id85082 ┆ id0000083703 ┆ 90  ┆ … ┆ 89817 ┆ 4   ┆ 15  ┆ 28.133477 │
│ id095 ┆ id7331  ┆ id0000031245 ┆ 3   ┆ … ┆ 17720 ┆ 1   ┆ 12  ┆ 91.555302 │
│ id055 ┆ id24810 ┆ id0000014164 ┆ 12  ┆ … ┆ 13241 ┆ 1   ┆ 3   ┆ 64.543029 │
│ id046 ┆ id75326 ┆ id0000061395 ┆ 2   ┆ … ┆ 25    ┆ 1   ┆ 14  ┆ 23.049223 │
│ id052 ┆ id4569  ┆ id0000011446 ┆ 3   ┆ … ┆ 96734 ┆ 1   ┆ 7   ┆ 87.987183 │
│ …     ┆ …       ┆ …            ┆ …   ┆ … ┆ …     ┆ …   ┆ …   ┆ …         │
│ id013 ┆ id66079 ┆ id0000051775 ┆ 8   ┆ … ┆ 93287 ┆ 4   ┆ 14  ┆ 87.804319 │
│ id055 ┆ id84022 ┆ id0000019517 ┆ 28  ┆ … ┆ 68045 ┆ 4   ┆ 4   ┆ 11.484207 │
│ id006 ┆ id78451 ┆ id0000052738 ┆ 66  ┆ … ┆ 29370 ┆ 5   ┆ 9   ┆ 81.052285 │
│ id064 ┆ id23530 ┆ id0000023096 ┆ 38  ┆ … ┆ 34837 ┆ 4   ┆ 11  ┆ 99.93739  │
│ id070 ┆ id51799 ┆ id0000008809 ┆ 58  ┆ … ┆ 46152 ┆ 4   ┆ 6   ┆ 62.117956 │
└───────┴─────────┴──────────────┴─────┴───┴───────┴─────┴─────┴───────────┘

Here's J1_1e7_1e1_0.parquet:

┌─────┬─────┬───────────┐
│ id1 ┆ id4 ┆ v2        │
│ --- ┆ --- ┆ ---       │
│ i64 ┆ str ┆ f64       │
╞═════╪═════╪═══════════╡
│ 4   ┆ id4 ┆ 60.635302 │
│ 9   ┆ id9 ┆ 61.462762 │
│ 3   ┆ id3 ┆ 11.638566 │
│ 5   ┆ id5 ┆ 32.557228 │
│ 9   ┆ id9 ┆ 13.04837  │
│ 4   ┆ id4 ┆ 45.650663 │
│ 8   ┆ id8 ┆ 35.343098 │
│ 4   ┆ id4 ┆ 17.648019 │
│ 2   ┆ id2 ┆ 98.806282 │
│ 1   ┆ id1 ┆ 13.350346 │
└─────┴─────┴───────────┘

Here's J1_1e7_1e4_0.parquet:

┌─────┬───────┬─────┬─────────┬───────────┐
│ id1 ┆ id2   ┆ id4 ┆ id5     ┆ v2        │
│ --- ┆ ---   ┆ --- ┆ ---     ┆ ---       │
│ i64 ┆ i64   ┆ str ┆ str     ┆ f64       │
╞═════╪═══════╪═════╪═════════╪═══════════╡
│ 4   ┆ 10548 ┆ id4 ┆ id10548 ┆ 58.047598 │
│ 9   ┆ 5478  ┆ id9 ┆ id5478  ┆ 28.344673 │
│ 3   ┆ 5478  ┆ id3 ┆ id5478  ┆ 43.711834 │
│ 9   ┆ 10463 ┆ id9 ┆ id10463 ┆ 13.04837  │
│ 4   ┆ 10463 ┆ id4 ┆ id10463 ┆ 4.459861  │
│ …   ┆ …     ┆ …   ┆ …       ┆ …         │
│ 4   ┆ 10978 ┆ id4 ┆ id10978 ┆ 85.268812 │
│ 8   ┆ 10548 ┆ id8 ┆ id10548 ┆ 12.755955 │
│ 4   ┆ 4344  ┆ id4 ┆ id4344  ┆ 96.08827  │
│ 6   ┆ 417   ┆ id6 ┆ id417   ┆ 13.815532 │
│ 5   ┆ 10463 ┆ id5 ┆ id10463 ┆ 13.843241 │
└─────┴───────┴─────┴─────────┴───────────┘

Here's J1_1e7_1e7_NA.parquet:

┌─────┬──────┬─────────┬─────┬────────┬────────┬───────────┐
│ id1 ┆ id2  ┆ id3     ┆ id4 ┆ id5    ┆ id6    ┆ v2        │
│ --- ┆ ---  ┆ ---     ┆ --- ┆ ---    ┆ ---    ┆ ---       │
│ i64 ┆ i64  ┆ i64     ┆ str ┆ str    ┆ str    ┆ f64       │
╞═════╪══════╪═════════╪═════╪════════╪════════╪═══════════╡
│ 4   ┆ 1607 ┆ 8624889 ┆ id4 ┆ id1607 ┆ id1607 ┆ 32.761295 │
│ 5   ┆ 3972 ┆ 83754   ┆ id5 ┆ id3972 ┆ id3972 ┆ 17.648019 │
│ 2   ┆ 49   ┆ 5152803 ┆ id2 ┆ id49   ┆ id49   ┆ 94.688198 │
│ 2   ┆ 4833 ┆ 7623547 ┆ id2 ┆ id4833 ┆ id4833 ┆ 77.909412 │
│ 5   ┆ 5733 ┆ 6155714 ┆ id5 ┆ id5733 ┆ id5733 ┆ 2.269674  │
│ …   ┆ …    ┆ …       ┆ …   ┆ …      ┆ …      ┆ …         │
│ 2   ┆ 1402 ┆ 5541869 ┆ id2 ┆ id1402 ┆ id1402 ┆ 78.53926  │
│ 9   ┆ 1849 ┆ 4288916 ┆ id9 ┆ id1849 ┆ id1849 ┆ 34.115661 │
│ 6   ┆ 7407 ┆ 323953  ┆ id6 ┆ id7407 ┆ id7407 ┆ 71.674646 │
│ 4   ┆ 9078 ┆ 431080  ┆ id4 ┆ id9078 ┆ id9078 ┆ 76.78765  │
│ 1   ┆ 2991 ┆ 4564333 ┆ id1 ┆ id2991 ┆ id2991 ┆ 19.238275 │
└─────┴──────┴─────────┴─────┴────────┴────────┴───────────┘
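
One way to make the mismatch explicit is to compare the dtypes of the columns the tables share. The sketch below is plain Python with the schemas copied by hand from the printed tables above; the main table's elided column ("…") is left out rather than guessed, and `dtype_mismatches` is a hypothetical helper, not something falsa provides.

```python
# Schemas copied by hand from the printed tables above.
# The main table's elided column ("…") is intentionally omitted.
MAIN = {
    "id1": "str", "id2": "str", "id3": "str", "id4": "i64",
    "id6": "i64", "v1": "i64", "v2": "i64", "v3": "f64",
}
J1_1E7_1E4_0 = {
    "id1": "i64", "id2": "i64", "id4": "str", "id5": "str", "v2": "f64",
}


def dtype_mismatches(left, right):
    """Return {column: (left_dtype, right_dtype)} for shared columns whose dtypes differ."""
    return {
        col: (left[col], right[col])
        for col in left
        if col in right and left[col] != right[col]
    }


mismatches = dtype_mismatches(MAIN, J1_1E7_1E4_0)
for col, (main_dtype, join_dtype) in mismatches.items():
    print(f"{col}: main={main_dtype}, join={join_dtype}")
```

With these schemas, id1, id2, id4, and v2 are all flagged, so the key columns would not line up as-is in a join between the two tables.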

MrPowers commented Dec 6, 2024

I just ran the original script and this is what it output:

(rscript) ~/D/c/c/d/_data ❯❯❯ Rscript join-datagen.R 1e7 0 0 0
Generate join data of 1e7 rows
Producing keys for LHS and RHS data
Producing LHS 1e7 data from keys
Writing LHS 1e7 data J1_1e7_NA_0_0
Producing RHS 1e1 data from keys
Writing RHS 1e1 data J1_1e7_1e1_0_0
Producing RHS 1e4 data from keys
Writing RHS 1e4 data J1_1e7_1e4_0_0
Producing RHS 1e7 data from keys
Writing RHS 1e7 data J1_1e7_1e7_0_0
Join datagen of 1e7 rows finished in 24s

So perhaps J1_1e7_1e7_NA.parquet is the "left" table.
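
For reference, the "Writing ..." lines in that log can be pulled out mechanically to see which side each file belongs to. This is only a sketch: the log text is pasted from the output above, and `written_files` and the regular expression are hypothetical names chosen for illustration.

```python
import re

# Log text pasted from the Rscript output above.
LOG = """\
Generate join data of 1e7 rows
Producing keys for LHS and RHS data
Producing LHS 1e7 data from keys
Writing LHS 1e7 data J1_1e7_NA_0_0
Producing RHS 1e1 data from keys
Writing RHS 1e1 data J1_1e7_1e1_0_0
Producing RHS 1e4 data from keys
Writing RHS 1e4 data J1_1e7_1e4_0_0
Producing RHS 1e7 data from keys
Writing RHS 1e7 data J1_1e7_1e7_0_0
Join datagen of 1e7 rows finished in 24s
"""

# Matches lines such as "Writing LHS 1e7 data J1_1e7_NA_0_0".
WRITE_RE = re.compile(r"^Writing (LHS|RHS) (\S+) data (\S+)$")


def written_files(log):
    """Return (side, size, file_name) for every 'Writing ...' line in the log."""
    found = []
    for line in log.splitlines():
        match = WRITE_RE.match(line.strip())
        if match:
            found.append(match.groups())
    return found


files = written_files(LOG)
for side, size, name in files:
    print(side, size, name)
```

The log lists exactly one LHS file and three RHS files.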

MrPowers commented Dec 6, 2024

So here are the files generated by the official script:

  • J1_1e7_NA_0_0
  • J1_1e7_1e1_0_0
  • J1_1e7_1e4_0_0
  • J1_1e7_1e7_0_0

Here are the files that are created when I run this command: falsa join --path-prefix=~/data --size SMALL --data-format PARQUET:

  • J1_1e7_1e1_0
  • J1_1e7_1e4_0
  • J1_1e7_1e7_NA

So I guess we're just missing one of the files: the official script writes four, and falsa writes three.
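
A rough way to see which file has no counterpart is to compare the two lists by one part of the file name. The sketch below assumes the third underscore-separated token (NA, 1e1, 1e4, 1e7) identifies the table; that is my reading of the names listed above, not documented behaviour of either tool, and `size_slot` is a hypothetical helper.

```python
# File names copied from the two lists above.
OFFICIAL = ["J1_1e7_NA_0_0", "J1_1e7_1e1_0_0", "J1_1e7_1e4_0_0", "J1_1e7_1e7_0_0"]
FALSA = ["J1_1e7_1e1_0", "J1_1e7_1e4_0", "J1_1e7_1e7_NA"]


def size_slot(name):
    # Assumption: the third underscore-separated token identifies the table.
    return name.split("_")[2]


official_slots = {size_slot(name) for name in OFFICIAL}
falsa_slots = {size_slot(name) for name in FALSA}

# Slots the official script produces that falsa does not.
missing = official_slots - falsa_slots
print(sorted(missing))
```

Under that assumption the only slot without a falsa counterpart is `NA`, which is the slot the official log uses for the LHS file (J1_1e7_NA_0_0).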
