Add support for `fsspec>=2023.9.0` #6244

mariosasko · 2023-09-15T17:58:25Z

Fix #6214

HuggingFaceDocBuilderDev · 2023-09-15T18:05:21Z

The documentation is not available anymore as the PR was closed or merged.

github-actions · 2023-09-15T18:06:27Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006410 / 0.011353 (-0.004943)	0.003995 / 0.011008 (-0.007013)	0.083585 / 0.038508 (0.045076)	0.074285 / 0.023109 (0.051176)	0.307163 / 0.275898 (0.031265)	0.344691 / 0.323480 (0.021212)	0.004277 / 0.007986 (-0.003708)	0.004192 / 0.004328 (-0.000136)	0.065156 / 0.004250 (0.060905)	0.056774 / 0.037052 (0.019721)	0.315483 / 0.258489 (0.056994)	0.361911 / 0.293841 (0.068070)	0.030454 / 0.128546 (-0.098092)	0.008600 / 0.075646 (-0.067047)	0.286692 / 0.419271 (-0.132579)	0.052354 / 0.043533 (0.008821)	0.308997 / 0.255139 (0.053858)	0.337847 / 0.283200 (0.054647)	0.022459 / 0.141683 (-0.119224)	1.482758 / 1.452155 (0.030604)	1.572853 / 1.492716 (0.080137)
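
Each cell in these benchmark rows has the form `new / old (diff)`. In the values shown here, `diff` matches `new - old` up to rounding in the last printed digit. The following is a minimal sketch, not part of the PR or of the benchmark tooling, for parsing one tab-separated table and checking that relationship. The names `parse_cell`, `parse_row`, and `is_consistent` are hypothetical, as is the `2e-6` tolerance, which is an assumption chosen to absorb the six-decimal rounding.

```python
import re

# One cell looks like "0.006410 / 0.011353 (-0.004943)".
CELL = re.compile(
    r"^\s*(-?\d+(?:\.\d+)?)\s*/\s*(-?\d+(?:\.\d+)?)\s*\((-?\d+(?:\.\d+)?)\)\s*$"
)


def parse_cell(cell):
    """Return (new, old, diff) as floats for a single 'new / old (diff)' cell."""
    match = CELL.match(cell)
    if match is None:
        raise ValueError(f"unrecognized cell: {cell!r}")
    new, old, diff = (float(group) for group in match.groups())
    return new, old, diff


def parse_row(header_line, value_line):
    """Map each metric name to its (new, old, diff) tuple.

    Both lines are tab-separated; the first column holds the row label
    ('metric' and 'new / old (diff)') and is skipped.
    """
    names = header_line.split("\t")[1:]
    cells = value_line.split("\t")[1:]
    if len(names) != len(cells):
        raise ValueError("header and value rows have different lengths")
    return {name: parse_cell(cell) for name, cell in zip(names, cells)}


def is_consistent(new, old, diff, tol=2e-6):
    """Check diff == new - old within a tolerance (assumed rounding slack)."""
    return abs((new - old) - diff) <= tol
```

For example, `parse_cell("0.288603 / 0.018006 (0.270597)")` yields the three floats for `get_batch_of_1024_random_rows`, and a positive `diff` means the new measurement is larger than the old one.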

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.288603 / 0.018006 (0.270597)	0.632903 / 0.000490 (0.632413)	0.013702 / 0.000200 (0.013502)	0.000284 / 0.000054 (0.000230)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.028448 / 0.037411 (-0.008964)	0.082441 / 0.014526 (0.067916)	0.099048 / 0.176557 (-0.077508)	0.154370 / 0.737135 (-0.582765)	0.146143 / 0.296338 (-0.150195)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.399250 / 0.215209 (0.184040)	3.986683 / 2.077655 (1.909028)	1.962606 / 1.504120 (0.458486)	1.782653 / 1.541195 (0.241459)	1.830251 / 1.468490 (0.361761)	0.492498 / 4.584777 (-4.092278)	3.549581 / 3.745712 (-0.196131)	3.200056 / 5.269862 (-2.069806)	2.028109 / 4.565676 (-2.537568)	0.058222 / 0.424275 (-0.366053)	0.007629 / 0.007607 (0.000022)	0.482083 / 0.226044 (0.256039)	4.824728 / 2.268929 (2.555800)	2.448772 / 55.444624 (-52.995852)	2.079629 / 6.876477 (-4.796848)	2.267739 / 2.142072 (0.125667)	0.586712 / 4.805227 (-4.218515)	0.134073 / 6.500664 (-6.366591)	0.060565 / 0.075469 (-0.014904)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.263244 / 1.841788 (-0.578544)	18.964498 / 8.074308 (10.890190)	14.125062 / 10.191392 (3.933670)	0.167635 / 0.680424 (-0.512789)	0.018469 / 0.534201 (-0.515732)	0.390395 / 0.579283 (-0.188888)	0.406055 / 0.434364 (-0.028309)	0.460717 / 0.540337 (-0.079620)	0.642746 / 1.386936 (-0.744190)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006637 / 0.011353 (-0.004716)	0.003972 / 0.011008 (-0.007036)	0.064569 / 0.038508 (0.026061)	0.075450 / 0.023109 (0.052341)	0.405250 / 0.275898 (0.129352)	0.433530 / 0.323480 (0.110050)	0.005625 / 0.007986 (-0.002361)	0.004118 / 0.004328 (-0.000211)	0.065092 / 0.004250 (0.060842)	0.057979 / 0.037052 (0.020927)	0.413732 / 0.258489 (0.155243)	0.451983 / 0.293841 (0.158142)	0.032170 / 0.128546 (-0.096377)	0.008690 / 0.075646 (-0.066957)	0.071792 / 0.419271 (-0.347479)	0.048560 / 0.043533 (0.005027)	0.410312 / 0.255139 (0.155173)	0.427294 / 0.283200 (0.144095)	0.023006 / 0.141683 (-0.118677)	1.496319 / 1.452155 (0.044164)	1.566744 / 1.492716 (0.074027)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.266812 / 0.018006 (0.248805)	0.540277 / 0.000490 (0.539788)	0.008998 / 0.000200 (0.008799)	0.000101 / 0.000054 (0.000047)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.032496 / 0.037411 (-0.004915)	0.091387 / 0.014526 (0.076861)	0.107516 / 0.176557 (-0.069041)	0.160019 / 0.737135 (-0.577116)	0.107686 / 0.296338 (-0.188652)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.433321 / 0.215209 (0.218111)	4.330221 / 2.077655 (2.252566)	2.367215 / 1.504120 (0.863095)	2.192464 / 1.541195 (0.651269)	2.200204 / 1.468490 (0.731714)	0.488057 / 4.584777 (-4.096720)	3.625429 / 3.745712 (-0.120283)	3.282859 / 5.269862 (-1.987003)	2.038716 / 4.565676 (-2.526960)	0.057968 / 0.424275 (-0.366307)	0.007753 / 0.007607 (0.000146)	0.509133 / 0.226044 (0.283089)	5.086445 / 2.268929 (2.817516)	2.846017 / 55.444624 (-52.598607)	2.469546 / 6.876477 (-4.406931)	2.673218 / 2.142072 (0.531145)	0.591228 / 4.805227 (-4.213999)	0.131920 / 6.500664 (-6.368744)	0.059967 / 0.075469 (-0.015502)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.375634 / 1.841788 (-0.466153)	19.506752 / 8.074308 (11.432444)	14.677876 / 10.191392 (4.486484)	0.165071 / 0.680424 (-0.515353)	0.020614 / 0.534201 (-0.513587)	0.395967 / 0.579283 (-0.183316)	0.424358 / 0.434364 (-0.010006)	0.469954 / 0.540337 (-0.070384)	0.643169 / 1.386936 (-0.743767)

github-actions · 2023-09-17T17:53:09Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006072 / 0.011353 (-0.005281)	0.003691 / 0.011008 (-0.007318)	0.081683 / 0.038508 (0.043175)	0.059114 / 0.023109 (0.036005)	0.317053 / 0.275898 (0.041155)	0.357672 / 0.323480 (0.034192)	0.003577 / 0.007986 (-0.004408)	0.003890 / 0.004328 (-0.000438)	0.063667 / 0.004250 (0.059417)	0.048233 / 0.037052 (0.011181)	0.322854 / 0.258489 (0.064365)	0.368014 / 0.293841 (0.074173)	0.027750 / 0.128546 (-0.100796)	0.008137 / 0.075646 (-0.067509)	0.263906 / 0.419271 (-0.155366)	0.045402 / 0.043533 (0.001870)	0.315414 / 0.255139 (0.060275)	0.340906 / 0.283200 (0.057707)	0.023475 / 0.141683 (-0.118208)	1.443922 / 1.452155 (-0.008233)	1.550332 / 1.492716 (0.057616)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.211914 / 0.018006 (0.193908)	0.423577 / 0.000490 (0.423088)	0.003436 / 0.000200 (0.003236)	0.000077 / 0.000054 (0.000022)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.024675 / 0.037411 (-0.012737)	0.072550 / 0.014526 (0.058024)	0.084533 / 0.176557 (-0.092024)	0.146106 / 0.737135 (-0.591029)	0.085523 / 0.296338 (-0.210816)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.403498 / 0.215209 (0.188289)	4.019000 / 2.077655 (1.941345)	1.984821 / 1.504120 (0.480701)	1.805071 / 1.541195 (0.263876)	1.860906 / 1.468490 (0.392416)	0.499570 / 4.584777 (-4.085207)	3.088424 / 3.745712 (-0.657288)	2.833693 / 5.269862 (-2.436169)	1.869731 / 4.565676 (-2.695945)	0.057606 / 0.424275 (-0.366669)	0.006960 / 0.007607 (-0.000647)	0.476085 / 0.226044 (0.250040)	4.774063 / 2.268929 (2.505134)	2.458079 / 55.444624 (-52.986545)	2.106075 / 6.876477 (-4.770402)	2.248373 / 2.142072 (0.106301)	0.589767 / 4.805227 (-4.215460)	0.124382 / 6.500664 (-6.376282)	0.060705 / 0.075469 (-0.014764)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.287031 / 1.841788 (-0.554756)	17.662455 / 8.074308 (9.588147)	14.288812 / 10.191392 (4.097420)	0.156168 / 0.680424 (-0.524256)	0.016795 / 0.534201 (-0.517406)	0.333726 / 0.579283 (-0.245557)	0.362327 / 0.434364 (-0.072037)	0.387773 / 0.540337 (-0.152564)	0.547232 / 1.386936 (-0.839704)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006494 / 0.011353 (-0.004859)	0.003762 / 0.011008 (-0.007247)	0.062373 / 0.038508 (0.023864)	0.066357 / 0.023109 (0.043247)	0.448687 / 0.275898 (0.172789)	0.482445 / 0.323480 (0.158965)	0.004990 / 0.007986 (-0.002996)	0.002945 / 0.004328 (-0.001384)	0.062444 / 0.004250 (0.058194)	0.051381 / 0.037052 (0.014329)	0.449310 / 0.258489 (0.190821)	0.483188 / 0.293841 (0.189347)	0.029078 / 0.128546 (-0.099468)	0.008146 / 0.075646 (-0.067501)	0.067369 / 0.419271 (-0.351903)	0.041732 / 0.043533 (-0.001801)	0.451675 / 0.255139 (0.196536)	0.470445 / 0.283200 (0.187246)	0.021053 / 0.141683 (-0.120630)	1.483627 / 1.452155 (0.031472)	1.541594 / 1.492716 (0.048878)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.210247 / 0.018006 (0.192240)	0.424663 / 0.000490 (0.424173)	0.005394 / 0.000200 (0.005194)	0.000076 / 0.000054 (0.000021)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.026894 / 0.037411 (-0.010517)	0.081324 / 0.014526 (0.066798)	0.091362 / 0.176557 (-0.085195)	0.145602 / 0.737135 (-0.591533)	0.091896 / 0.296338 (-0.204443)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.469662 / 0.215209 (0.254453)	4.689495 / 2.077655 (2.611840)	2.596462 / 1.504120 (1.092342)	2.422584 / 1.541195 (0.881389)	2.476710 / 1.468490 (1.008220)	0.507049 / 4.584777 (-4.077728)	3.185519 / 3.745712 (-0.560193)	2.879842 / 5.269862 (-2.390019)	1.882643 / 4.565676 (-2.683034)	0.058046 / 0.424275 (-0.366229)	0.006797 / 0.007607 (-0.000811)	0.545245 / 0.226044 (0.319201)	5.449248 / 2.268929 (3.180319)	3.057341 / 55.444624 (-52.387283)	2.728385 / 6.876477 (-4.148092)	2.898945 / 2.142072 (0.756873)	0.600035 / 4.805227 (-4.205192)	0.126337 / 6.500664 (-6.374327)	0.061333 / 0.075469 (-0.014136)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.332966 / 1.841788 (-0.508821)	17.960805 / 8.074308 (9.886497)	14.978838 / 10.191392 (4.787446)	0.148852 / 0.680424 (-0.531572)	0.018307 / 0.534201 (-0.515894)	0.335234 / 0.579283 (-0.244050)	0.389659 / 0.434364 (-0.044704)	0.393259 / 0.540337 (-0.147078)	0.549237 / 1.386936 (-0.837699)

github-actions · 2023-09-17T22:49:23Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.008808 / 0.011353 (-0.002545)	0.005001 / 0.011008 (-0.006008)	0.110022 / 0.038508 (0.071514)	0.078015 / 0.023109 (0.054906)	0.384724 / 0.275898 (0.108826)	0.441354 / 0.323480 (0.117874)	0.005116 / 0.007986 (-0.002870)	0.004308 / 0.004328 (-0.000020)	0.081679 / 0.004250 (0.077429)	0.061386 / 0.037052 (0.024333)	0.398149 / 0.258489 (0.139660)	0.464859 / 0.293841 (0.171018)	0.047443 / 0.128546 (-0.081104)	0.014693 / 0.075646 (-0.060954)	0.365438 / 0.419271 (-0.053833)	0.081689 / 0.043533 (0.038156)	0.400458 / 0.255139 (0.145319)	0.449958 / 0.283200 (0.166758)	0.038266 / 0.141683 (-0.103417)	1.795043 / 1.452155 (0.342888)	1.908819 / 1.492716 (0.416102)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.297911 / 0.018006 (0.279905)	0.601640 / 0.000490 (0.601150)	0.015406 / 0.000200 (0.015206)	0.000163 / 0.000054 (0.000108)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.034520 / 0.037411 (-0.002891)	0.092657 / 0.014526 (0.078131)	0.113992 / 0.176557 (-0.062564)	0.189075 / 0.737135 (-0.548061)	0.106602 / 0.296338 (-0.189736)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.585838 / 0.215209 (0.370629)	5.719281 / 2.077655 (3.641627)	2.525914 / 1.504120 (1.021794)	2.231908 / 1.541195 (0.690713)	2.215272 / 1.468490 (0.746782)	0.814425 / 4.584777 (-3.770352)	5.243406 / 3.745712 (1.497694)	4.476642 / 5.269862 (-0.793220)	2.929438 / 4.565676 (-1.636239)	0.092070 / 0.424275 (-0.332205)	0.009358 / 0.007607 (0.001751)	0.713975 / 0.226044 (0.487931)	6.948846 / 2.268929 (4.679918)	3.361125 / 55.444624 (-52.083500)	2.575224 / 6.876477 (-4.301253)	2.783082 / 2.142072 (0.641009)	1.016205 / 4.805227 (-3.789022)	0.202578 / 6.500664 (-6.298086)	0.076696 / 0.075469 (0.001227)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.650889 / 1.841788 (-0.190898)	23.358273 / 8.074308 (15.283965)	19.882450 / 10.191392 (9.691058)	0.228971 / 0.680424 (-0.451453)	0.027736 / 0.534201 (-0.506465)	0.472405 / 0.579283 (-0.106878)	0.581799 / 0.434364 (0.147435)	0.533000 / 0.540337 (-0.007338)	0.815588 / 1.386936 (-0.571348)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.009151 / 0.011353 (-0.002202)	0.005074 / 0.011008 (-0.005934)	0.078709 / 0.038508 (0.040201)	0.077696 / 0.023109 (0.054586)	0.522356 / 0.275898 (0.246458)	0.562345 / 0.323480 (0.238865)	0.006411 / 0.007986 (-0.001575)	0.004379 / 0.004328 (0.000051)	0.082402 / 0.004250 (0.078151)	0.064223 / 0.037052 (0.027170)	0.518184 / 0.258489 (0.259695)	0.566221 / 0.293841 (0.272380)	0.046796 / 0.128546 (-0.081750)	0.013987 / 0.075646 (-0.061659)	0.094925 / 0.419271 (-0.324346)	0.058810 / 0.043533 (0.015277)	0.520252 / 0.255139 (0.265113)	0.566403 / 0.283200 (0.283203)	0.034720 / 0.141683 (-0.106963)	1.796809 / 1.452155 (0.344654)	1.913787 / 1.492716 (0.421070)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.317449 / 0.018006 (0.299443)	0.620154 / 0.000490 (0.619664)	0.007066 / 0.000200 (0.006866)	0.000126 / 0.000054 (0.000072)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.035252 / 0.037411 (-0.002160)	0.111648 / 0.014526 (0.097122)	0.120692 / 0.176557 (-0.055864)	0.193202 / 0.737135 (-0.543933)	0.127905 / 0.296338 (-0.168434)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.661012 / 0.215209 (0.445803)	6.626680 / 2.077655 (4.549026)	3.243065 / 1.504120 (1.738945)	2.904053 / 1.541195 (1.362858)	2.880516 / 1.468490 (1.412026)	0.875650 / 4.584777 (-3.709127)	5.381993 / 3.745712 (1.636281)	4.743997 / 5.269862 (-0.525864)	3.020736 / 4.565676 (-1.544940)	0.106573 / 0.424275 (-0.317702)	0.011151 / 0.007607 (0.003544)	0.821990 / 0.226044 (0.595946)	8.225383 / 2.268929 (5.956454)	3.963232 / 55.444624 (-51.481392)	3.288916 / 6.876477 (-3.587561)	3.579435 / 2.142072 (1.437363)	1.043379 / 4.805227 (-3.761848)	0.207508 / 6.500664 (-6.293156)	0.085109 / 0.075469 (0.009640)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.723798 / 1.841788 (-0.117990)	24.709848 / 8.074308 (16.635540)	22.484674 / 10.191392 (12.293282)	0.260357 / 0.680424 (-0.420067)	0.033539 / 0.534201 (-0.500662)	0.487814 / 0.579283 (-0.091469)	0.610171 / 0.434364 (0.175807)	0.585012 / 0.540337 (0.044674)	0.803764 / 1.386936 (-0.583172)

github-actions · 2023-09-17T22:50:06Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006661 / 0.011353 (-0.004692)	0.004022 / 0.011008 (-0.006987)	0.084269 / 0.038508 (0.045760)	0.070707 / 0.023109 (0.047598)	0.315035 / 0.275898 (0.039137)	0.339830 / 0.323480 (0.016350)	0.003994 / 0.007986 (-0.003991)	0.004129 / 0.004328 (-0.000199)	0.065383 / 0.004250 (0.061133)	0.055493 / 0.037052 (0.018441)	0.320521 / 0.258489 (0.062032)	0.354301 / 0.293841 (0.060460)	0.031177 / 0.128546 (-0.097370)	0.008724 / 0.075646 (-0.066922)	0.288298 / 0.419271 (-0.130974)	0.052418 / 0.043533 (0.008885)	0.319122 / 0.255139 (0.063983)	0.335859 / 0.283200 (0.052659)	0.026260 / 0.141683 (-0.115423)	1.450039 / 1.452155 (-0.002115)	1.545172 / 1.492716 (0.052455)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.234232 / 0.018006 (0.216226)	0.454983 / 0.000490 (0.454493)	0.007590 / 0.000200 (0.007390)	0.000550 / 0.000054 (0.000495)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.028714 / 0.037411 (-0.008698)	0.083686 / 0.014526 (0.069160)	0.162986 / 0.176557 (-0.013570)	0.167574 / 0.737135 (-0.569561)	0.273158 / 0.296338 (-0.023180)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.388275 / 0.215209 (0.173066)	3.862034 / 2.077655 (1.784379)	1.843561 / 1.504120 (0.339441)	1.675224 / 1.541195 (0.134029)	1.730394 / 1.468490 (0.261904)	0.495259 / 4.584777 (-4.089518)	3.627155 / 3.745712 (-0.118557)	3.290590 / 5.269862 (-1.979272)	2.032432 / 4.565676 (-2.533245)	0.058212 / 0.424275 (-0.366063)	0.007815 / 0.007607 (0.000208)	0.460625 / 0.226044 (0.234580)	4.616845 / 2.268929 (2.347916)	2.339280 / 55.444624 (-53.105344)	1.957216 / 6.876477 (-4.919261)	2.129511 / 2.142072 (-0.012562)	0.591782 / 4.805227 (-4.213446)	0.136391 / 6.500664 (-6.364273)	0.059627 / 0.075469 (-0.015842)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.278998 / 1.841788 (-0.562789)	18.485496 / 8.074308 (10.411188)	14.161273 / 10.191392 (3.969881)	0.164346 / 0.680424 (-0.516078)	0.018144 / 0.534201 (-0.516057)	0.391601 / 0.579283 (-0.187682)	0.424391 / 0.434364 (-0.009973)	0.458209 / 0.540337 (-0.082129)	0.645124 / 1.386936 (-0.741812)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006799 / 0.011353 (-0.004554)	0.004023 / 0.011008 (-0.006985)	0.065206 / 0.038508 (0.026698)	0.074386 / 0.023109 (0.051277)	0.437399 / 0.275898 (0.161501)	0.467382 / 0.323480 (0.143903)	0.005467 / 0.007986 (-0.002519)	0.003324 / 0.004328 (-0.001005)	0.064289 / 0.004250 (0.060039)	0.057257 / 0.037052 (0.020205)	0.440035 / 0.258489 (0.181546)	0.477138 / 0.293841 (0.183298)	0.032171 / 0.128546 (-0.096375)	0.008400 / 0.075646 (-0.067247)	0.070877 / 0.419271 (-0.348395)	0.048180 / 0.043533 (0.004648)	0.441274 / 0.255139 (0.186135)	0.461386 / 0.283200 (0.178187)	0.022576 / 0.141683 (-0.119106)	1.520914 / 1.452155 (0.068759)	1.575593 / 1.492716 (0.082877)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.221551 / 0.018006 (0.203545)	0.447213 / 0.000490 (0.446723)	0.004435 / 0.000200 (0.004235)	0.000099 / 0.000054 (0.000044)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.032123 / 0.037411 (-0.005288)	0.091809 / 0.014526 (0.077283)	0.103938 / 0.176557 (-0.072618)	0.156878 / 0.737135 (-0.580258)	0.105071 / 0.296338 (-0.191268)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.430389 / 0.215209 (0.215180)	4.293496 / 2.077655 (2.215841)	2.292801 / 1.504120 (0.788681)	2.135320 / 1.541195 (0.594126)	2.195720 / 1.468490 (0.727229)	0.493277 / 4.584777 (-4.091500)	3.685617 / 3.745712 (-0.060096)	3.278897 / 5.269862 (-1.990965)	2.036939 / 4.565676 (-2.528737)	0.058766 / 0.424275 (-0.365509)	0.007783 / 0.007607 (0.000176)	0.511165 / 0.226044 (0.285120)	5.126757 / 2.268929 (2.857829)	2.756690 / 55.444624 (-52.687935)	2.421745 / 6.876477 (-4.454732)	2.597249 / 2.142072 (0.455177)	0.647206 / 4.805227 (-4.158021)	0.143392 / 6.500664 (-6.357273)	0.060110 / 0.075469 (-0.015359)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.340289 / 1.841788 (-0.501499)	19.057620 / 8.074308 (10.983312)	14.832892 / 10.191392 (4.641500)	0.167730 / 0.680424 (-0.512694)	0.020178 / 0.534201 (-0.514023)	0.394060 / 0.579283 (-0.185223)	0.433976 / 0.434364 (-0.000388)	0.474417 / 0.540337 (-0.065921)	0.640653 / 1.386936 (-0.746283)

github-actions · 2023-09-17T22:56:35Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007661 / 0.011353 (-0.003692)	0.004541 / 0.011008 (-0.006467)	0.100547 / 0.038508 (0.062039)	0.084257 / 0.023109 (0.061148)	0.377627 / 0.275898 (0.101729)	0.433764 / 0.323480 (0.110284)	0.005995 / 0.007986 (-0.001990)	0.003810 / 0.004328 (-0.000518)	0.076409 / 0.004250 (0.072158)	0.063411 / 0.037052 (0.026359)	0.382504 / 0.258489 (0.124015)	0.449721 / 0.293841 (0.155880)	0.036499 / 0.128546 (-0.092047)	0.009942 / 0.075646 (-0.065705)	0.343839 / 0.419271 (-0.075433)	0.062147 / 0.043533 (0.018614)	0.383244 / 0.255139 (0.128105)	0.415606 / 0.283200 (0.132406)	0.027475 / 0.141683 (-0.114207)	1.740413 / 1.452155 (0.288258)	1.862210 / 1.492716 (0.369493)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.260064 / 0.018006 (0.242058)	0.499001 / 0.000490 (0.498511)	0.015811 / 0.000200 (0.015611)	0.000119 / 0.000054 (0.000065)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.033599 / 0.037411 (-0.003812)	0.099354 / 0.014526 (0.084828)	0.114693 / 0.176557 (-0.061864)	0.180231 / 0.737135 (-0.556904)	0.114715 / 0.296338 (-0.181623)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.459884 / 0.215209 (0.244675)	4.580806 / 2.077655 (2.503151)	2.270770 / 1.504120 (0.766650)	2.077127 / 1.541195 (0.535932)	2.167175 / 1.468490 (0.698685)	0.570593 / 4.584777 (-4.014184)	4.120926 / 3.745712 (0.375214)	3.817595 / 5.269862 (-1.452267)	2.404782 / 4.565676 (-2.160894)	0.067972 / 0.424275 (-0.356304)	0.009378 / 0.007607 (0.001771)	0.549642 / 0.226044 (0.323597)	5.490369 / 2.268929 (3.221440)	2.905264 / 55.444624 (-52.539361)	2.452935 / 6.876477 (-4.423542)	2.700760 / 2.142072 (0.558688)	0.700407 / 4.805227 (-4.104820)	0.159349 / 6.500664 (-6.341315)	0.074605 / 0.075469 (-0.000864)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.517803 / 1.841788 (-0.323985)	22.343700 / 8.074308 (14.269392)	16.411639 / 10.191392 (6.220247)	0.169816 / 0.680424 (-0.510608)	0.021532 / 0.534201 (-0.512668)	0.470161 / 0.579283 (-0.109122)	0.473412 / 0.434364 (0.039048)	0.539690 / 0.540337 (-0.000647)	0.774011 / 1.386936 (-0.612925)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007629 / 0.011353 (-0.003724)	0.004651 / 0.011008 (-0.006357)	0.075162 / 0.038508 (0.036654)	0.085365 / 0.023109 (0.062256)	0.493272 / 0.275898 (0.217374)	0.535776 / 0.323480 (0.212296)	0.006323 / 0.007986 (-0.001663)	0.003785 / 0.004328 (-0.000544)	0.076161 / 0.004250 (0.071911)	0.065982 / 0.037052 (0.028929)	0.513355 / 0.258489 (0.254866)	0.549219 / 0.293841 (0.255378)	0.038052 / 0.128546 (-0.090494)	0.010055 / 0.075646 (-0.065592)	0.083744 / 0.419271 (-0.335527)	0.056708 / 0.043533 (0.013175)	0.496273 / 0.255139 (0.241135)	0.523709 / 0.283200 (0.240509)	0.026502 / 0.141683 (-0.115181)	1.793032 / 1.452155 (0.340877)	1.870534 / 1.492716 (0.377817)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.252288 / 0.018006 (0.234281)	0.490380 / 0.000490 (0.489890)	0.005884 / 0.000200 (0.005684)	0.000109 / 0.000054 (0.000054)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.038238 / 0.037411 (0.000827)	0.110010 / 0.014526 (0.095485)	0.125497 / 0.176557 (-0.051059)	0.188154 / 0.737135 (-0.548981)	0.126112 / 0.296338 (-0.170227)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.515837 / 0.215209 (0.300628)	5.135153 / 2.077655 (3.057498)	2.761740 / 1.504120 (1.257620)	2.552718 / 1.541195 (1.011523)	2.636425 / 1.468490 (1.167935)	0.588442 / 4.584777 (-3.996335)	4.220833 / 3.745712 (0.475120)	3.874637 / 5.269862 (-1.395225)	2.424668 / 4.565676 (-2.141009)	0.069979 / 0.424275 (-0.354296)	0.009349 / 0.007607 (0.001742)	0.608936 / 0.226044 (0.382891)	6.081209 / 2.268929 (3.812280)	3.348067 / 55.444624 (-52.096557)	2.919130 / 6.876477 (-3.957347)	3.159093 / 2.142072 (1.017020)	0.704059 / 4.805227 (-4.101169)	0.158417 / 6.500664 (-6.342247)	0.071321 / 0.075469 (-0.004148)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.595287 / 1.841788 (-0.246501)	23.096619 / 8.074308 (15.022311)	17.258041 / 10.191392 (7.066649)	0.186197 / 0.680424 (-0.494227)	0.023633 / 0.534201 (-0.510567)	0.472181 / 0.579283 (-0.107102)	0.493817 / 0.434364 (0.059453)	0.567657 / 0.540337 (0.027320)	0.793789 / 1.386936 (-0.593147)

github-actions · 2023-09-18T18:12:45Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007084 / 0.011353 (-0.004268)	0.004093 / 0.011008 (-0.006915)	0.086395 / 0.038508 (0.047887)	0.087734 / 0.023109 (0.064625)	0.356936 / 0.275898 (0.081038)	0.386413 / 0.323480 (0.062933)	0.005531 / 0.007986 (-0.002454)	0.003462 / 0.004328 (-0.000866)	0.065503 / 0.004250 (0.061252)	0.058973 / 0.037052 (0.021920)	0.354151 / 0.258489 (0.095662)	0.398812 / 0.293841 (0.104971)	0.031508 / 0.128546 (-0.097038)	0.008537 / 0.075646 (-0.067109)	0.290942 / 0.419271 (-0.128329)	0.053537 / 0.043533 (0.010004)	0.352067 / 0.255139 (0.096928)	0.375142 / 0.283200 (0.091943)	0.025658 / 0.141683 (-0.116025)	1.468496 / 1.452155 (0.016341)	1.556926 / 1.492716 (0.064210)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.238858 / 0.018006 (0.220852)	0.460018 / 0.000490 (0.459528)	0.009613 / 0.000200 (0.009414)	0.000326 / 0.000054 (0.000272)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.030333 / 0.037411 (-0.007078)	0.088431 / 0.014526 (0.073905)	0.098130 / 0.176557 (-0.078427)	0.155160 / 0.737135 (-0.581975)	0.099963 / 0.296338 (-0.196375)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.385769 / 0.215209 (0.170560)	3.836723 / 2.077655 (1.759069)	1.861065 / 1.504120 (0.356945)	1.685159 / 1.541195 (0.143965)	1.780679 / 1.468490 (0.312189)	0.491865 / 4.584777 (-4.092912)	3.581139 / 3.745712 (-0.164573)	3.366278 / 5.269862 (-1.903584)	2.093094 / 4.565676 (-2.472583)	0.058063 / 0.424275 (-0.366212)	0.007903 / 0.007607 (0.000296)	0.464866 / 0.226044 (0.238821)	4.647754 / 2.268929 (2.378825)	2.316466 / 55.444624 (-53.128158)	1.984079 / 6.876477 (-4.892398)	2.235020 / 2.142072 (0.092948)	0.592591 / 4.805227 (-4.212636)	0.135586 / 6.500664 (-6.365078)	0.061434 / 0.075469 (-0.014035)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.282940 / 1.841788 (-0.558848)	19.635975 / 8.074308 (11.561667)	14.426135 / 10.191392 (4.234743)	0.166732 / 0.680424 (-0.513692)	0.018438 / 0.534201 (-0.515763)	0.393173 / 0.579283 (-0.186110)	0.417291 / 0.434364 (-0.017073)	0.459188 / 0.540337 (-0.081149)	0.632568 / 1.386936 (-0.754368)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007166 / 0.011353 (-0.004187)	0.004254 / 0.011008 (-0.006754)	0.064667 / 0.038508 (0.026159)	0.085142 / 0.023109 (0.062033)	0.410081 / 0.275898 (0.134183)	0.445803 / 0.323480 (0.122323)	0.005600 / 0.007986 (-0.002385)	0.003520 / 0.004328 (-0.000809)	0.064148 / 0.004250 (0.059897)	0.059869 / 0.037052 (0.022817)	0.407439 / 0.258489 (0.148950)	0.451169 / 0.293841 (0.157329)	0.032619 / 0.128546 (-0.095927)	0.008706 / 0.075646 (-0.066940)	0.071230 / 0.419271 (-0.348041)	0.048499 / 0.043533 (0.004966)	0.416401 / 0.255139 (0.161262)	0.430737 / 0.283200 (0.147537)	0.022511 / 0.141683 (-0.119172)	1.517296 / 1.452155 (0.065141)	1.581704 / 1.492716 (0.088988)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.220738 / 0.018006 (0.202732)	0.454026 / 0.000490 (0.453536)	0.004695 / 0.000200 (0.004495)	0.000087 / 0.000054 (0.000033)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.033202 / 0.037411 (-0.004209)	0.097506 / 0.014526 (0.082980)	0.106661 / 0.176557 (-0.069896)	0.160554 / 0.737135 (-0.576581)	0.109203 / 0.296338 (-0.187135)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.432013 / 0.215209 (0.216804)	4.310399 / 2.077655 (2.232744)	2.296529 / 1.504120 (0.792409)	2.139929 / 1.541195 (0.598734)	2.227432 / 1.468490 (0.758942)	0.493697 / 4.584777 (-4.091080)	3.639877 / 3.745712 (-0.105835)	3.323165 / 5.269862 (-1.946697)	2.084527 / 4.565676 (-2.481150)	0.058812 / 0.424275 (-0.365463)	0.007813 / 0.007607 (0.000206)	0.512366 / 0.226044 (0.286321)	5.119660 / 2.268929 (2.850732)	2.783819 / 55.444624 (-52.660806)	2.490669 / 6.876477 (-4.385808)	2.696653 / 2.142072 (0.554581)	0.627161 / 4.805227 (-4.178066)	0.137032 / 6.500664 (-6.363632)	0.064040 / 0.075469 (-0.011429)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.369578 / 1.841788 (-0.472210)	20.421182 / 8.074308 (12.346873)	15.719347 / 10.191392 (5.527955)	0.166150 / 0.680424 (-0.514274)	0.020262 / 0.534201 (-0.513939)	0.395645 / 0.579283 (-0.183638)	0.430363 / 0.434364 (-0.004001)	0.477843 / 0.540337 (-0.062494)	0.638501 / 1.386936 (-0.748435)

github-actions · 2023-09-18T19:05:42Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006141 / 0.011353 (-0.005211)	0.003683 / 0.011008 (-0.007325)	0.081127 / 0.038508 (0.042618)	0.064285 / 0.023109 (0.041176)	0.323038 / 0.275898 (0.047140)	0.347280 / 0.323480 (0.023800)	0.003518 / 0.007986 (-0.004467)	0.002958 / 0.004328 (-0.001370)	0.063093 / 0.004250 (0.058843)	0.050682 / 0.037052 (0.013629)	0.321222 / 0.258489 (0.062733)	0.359266 / 0.293841 (0.065425)	0.027515 / 0.128546 (-0.101032)	0.007964 / 0.075646 (-0.067682)	0.261305 / 0.419271 (-0.157966)	0.044897 / 0.043533 (0.001365)	0.320684 / 0.255139 (0.065545)	0.335722 / 0.283200 (0.052522)	0.023378 / 0.141683 (-0.118305)	1.418211 / 1.452155 (-0.033943)	1.523728 / 1.492716 (0.031011)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.222316 / 0.018006 (0.204310)	0.426943 / 0.000490 (0.426454)	0.008785 / 0.000200 (0.008585)	0.000081 / 0.000054 (0.000026)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.024716 / 0.037411 (-0.012695)	0.075341 / 0.014526 (0.060816)	0.089532 / 0.176557 (-0.087024)	0.147638 / 0.737135 (-0.589498)	0.085697 / 0.296338 (-0.210641)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.396395 / 0.215209 (0.181186)	3.947280 / 2.077655 (1.869625)	1.894762 / 1.504120 (0.390642)	1.712094 / 1.541195 (0.170899)	1.779049 / 1.468490 (0.310559)	0.509206 / 4.584777 (-4.075571)	3.073951 / 3.745712 (-0.671761)	2.886826 / 5.269862 (-2.383035)	1.894444 / 4.565676 (-2.671232)	0.059519 / 0.424275 (-0.364756)	0.006951 / 0.007607 (-0.000656)	0.468213 / 0.226044 (0.242169)	4.667134 / 2.268929 (2.398206)	2.342516 / 55.444624 (-53.102108)	1.992047 / 6.876477 (-4.884430)	2.142059 / 2.142072 (-0.000014)	0.600507 / 4.805227 (-4.204720)	0.128982 / 6.500664 (-6.371682)	0.062100 / 0.075469 (-0.013369)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.234500 / 1.841788 (-0.607288)	17.951646 / 8.074308 (9.877338)	13.862763 / 10.191392 (3.671371)	0.143133 / 0.680424 (-0.537291)	0.016643 / 0.534201 (-0.517558)	0.333174 / 0.579283 (-0.246109)	0.366956 / 0.434364 (-0.067408)	0.384569 / 0.540337 (-0.155769)	0.546830 / 1.386936 (-0.840106)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006146 / 0.011353 (-0.005207)	0.003725 / 0.011008 (-0.007283)	0.062099 / 0.038508 (0.023591)	0.064117 / 0.023109 (0.041008)	0.456100 / 0.275898 (0.180202)	0.490794 / 0.323480 (0.167314)	0.005652 / 0.007986 (-0.002334)	0.002897 / 0.004328 (-0.001432)	0.061909 / 0.004250 (0.057659)	0.050634 / 0.037052 (0.013582)	0.454422 / 0.258489 (0.195933)	0.493208 / 0.293841 (0.199367)	0.028822 / 0.128546 (-0.099724)	0.008115 / 0.075646 (-0.067531)	0.067214 / 0.419271 (-0.352058)	0.041529 / 0.043533 (-0.002004)	0.458016 / 0.255139 (0.202877)	0.476059 / 0.283200 (0.192859)	0.019926 / 0.141683 (-0.121757)	1.465345 / 1.452155 (0.013190)	1.533518 / 1.492716 (0.040802)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.218830 / 0.018006 (0.200823)	0.418869 / 0.000490 (0.418380)	0.005154 / 0.000200 (0.004954)	0.000080 / 0.000054 (0.000026)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.027648 / 0.037411 (-0.009763)	0.083842 / 0.014526 (0.069316)	0.092300 / 0.176557 (-0.084257)	0.146098 / 0.737135 (-0.591037)	0.093441 / 0.296338 (-0.202898)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.464426 / 0.215209 (0.249217)	4.632705 / 2.077655 (2.555051)	2.642091 / 1.504120 (1.137971)	2.461768 / 1.541195 (0.920573)	2.535554 / 1.468490 (1.067064)	0.507506 / 4.584777 (-4.077271)	3.095485 / 3.745712 (-0.650227)	2.884261 / 5.269862 (-2.385601)	1.908943 / 4.565676 (-2.656734)	0.058622 / 0.424275 (-0.365653)	0.006892 / 0.007607 (-0.000715)	0.536045 / 0.226044 (0.310001)	5.377448 / 2.268929 (3.108519)	3.076023 / 55.444624 (-52.368602)	2.745586 / 6.876477 (-4.130890)	2.939582 / 2.142072 (0.797510)	0.595639 / 4.805227 (-4.209589)	0.125086 / 6.500664 (-6.375578)	0.061075 / 0.075469 (-0.014394)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.342820 / 1.841788 (-0.498968)	18.326240 / 8.074308 (10.251932)	15.007094 / 10.191392 (4.815702)	0.133037 / 0.680424 (-0.547387)	0.018702 / 0.534201 (-0.515499)	0.330245 / 0.579283 (-0.249038)	0.381494 / 0.434364 (-0.052870)	0.393705 / 0.540337 (-0.146633)	0.533676 / 1.386936 (-0.853260)

albertvillanova

Thanks. Some comments below.

Maybe we should update the docstring of get_data_patterns accordingly? Currently, it only gives example outputs in which ** is not a path segment on its own (i.e. it has something other than a / as prefix or suffix).

tests/test_load.py

albertvillanova · 2023-09-19T09:10:35Z

src/datasets/data_files.py

+    glob.glob, Path.glob, Path.match or fnmatch do not support ** with a prefix/suffix other than a forward slash /.
+    For instance, this means **.json is the same as *.json. On the contrary, the fsspec glob has no limits regarding the ** prefix/suffix,
+    resulting in **.json being equivalent to **/*.json.
+


I agree: the only documented usage of "**" seems to be as a whole path segment (between slashes). Previously, fsspec was more lax about this, but they now seem to be aligning with the functions listed above. So I agree we had better align as well and use the double asterisk only between slashes.
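
As a quick check of the docstring's claim about the standard library, here is a minimal sketch using only `fnmatch`. The sample paths are made up for illustration, and the fsspec side of the comparison is not exercised here. It shows that `**.json` and `*.json` accept exactly the same names in `fnmatch`:

```python
from fnmatch import fnmatchcase

# Hypothetical sample paths, chosen only for illustration.
paths = ["a.json", "dir/a.json", "dir/sub/a.json", "a.csv", "dir/a.json.bak"]


def matches(pattern):
    """Return the subset of `paths` accepted by fnmatch for `pattern`."""
    return [p for p in paths if fnmatchcase(p, pattern)]


single = matches("*.json")
double = matches("**.json")

# In fnmatch, "*" already matches any run of characters (including "/"),
# so doubling the asterisk does not change the result.
print(single == double)
```

Because `fnmatch` does not treat `/` specially, it cannot express "zero or more path segments" at all, which is why only the `/**/` form carries that meaning in the glob functions mentioned above.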

albertvillanova · 2023-09-19T09:20:12Z

src/datasets/data_files.py

+    KEYWORDS_IN_FILENAME_BASE_PATTERNS = ["**/*[{sep}/]{keyword}[{sep}]*", "{keyword}[{sep}]*"]
+    KEYWORDS_IN_DIR_NAME_BASE_PATTERNS = ["{keyword}[{sep}/]**", "**/*[{sep}/]{keyword}[{sep}/]**"]


With the current behavior (a double asterisk only between slashes, meaning zero or more path segments), I find several patterns here confusing:

Why do we add the slash to the sep characters here: [{sep}/]? I think it would be clearer to remove it, as we already prefix the pattern with **/.

as they are:

the first pattern matches dir/train.csv, my-train.csv and dir/my-train.csv,

whereas the second pattern only matches train.csv

I think it would be clearer to use these patterns: ["**/*[{sep}]{keyword}[{sep}]*", "**/{keyword}[{sep}]*"], so that

the first pattern matches: my-train.csv and dir/my-train.csv

and the second: train.csv and dir/train.csv
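
To make the comparison above concrete, here is a toy sketch. `toy_glob_to_regex` is a hypothetical helper written for this illustration; it is neither fsspec's nor datasets' implementation. It assumes that `**/` matches zero or more whole path segments, that `*` stays within one segment, and that bracket expressions are copied through unchanged. The file names are the ones from the comment, and `sep` is the value quoted later in this thread:

```python
import re


def toy_glob_to_regex(pattern):
    """Toy glob-to-regex translation (illustrative assumption, not fsspec's code).

    Assumed semantics: "**/" matches zero or more whole path segments,
    "*" matches within one segment, and "[...]" is copied through verbatim.
    """
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            parts.append(r"(?:[^/]+/)*")
            i += 3
        elif pattern[i] == "*":
            parts.append(r"[^/]*")
            i += 1
        elif pattern[i] == "[":
            end = pattern.index("]", i)
            parts.append(pattern[i : end + 1])
            i = end + 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("^" + "".join(parts) + "$")


def matched(pattern, names):
    """Return the names accepted by the toy translation of `pattern`."""
    regex = toy_glob_to_regex(pattern)
    return [n for n in names if regex.match(n)]


SEP = "-._ 0-9"
names = ["train.csv", "dir/train.csv", "my-train.csv", "dir/my-train.csv"]

current = ["**/*[{sep}/]{keyword}[{sep}]*", "{keyword}[{sep}]*"]
suggested = ["**/*[{sep}]{keyword}[{sep}]*", "**/{keyword}[{sep}]*"]

for label, patterns in [("current", current), ("suggested", suggested)]:
    for base in patterns:
        pattern = base.format(keyword="train", sep=SEP)
        print(label, pattern, matched(pattern, names))
```

Under these assumed semantics, the toy translation reproduces the split described in the comment: the suggested pair separates "keyword with a prefix" from "keyword at the start of the file name", each at any directory depth.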

The current patterns are indeed hard to read. Unfortunately, fsspec's ad-hoc conversion from a glob to a regex pattern doesn't work as expected for ["**/*[{sep}]{keyword}[{sep}]*", "**/{keyword}[{sep}]*"]. For instance, it converts "**/{keyword}[{sep}]*".format(keyword="eval", sep="-._ 0-9") to "^.*eval[-\\._ 0-9][^/]*$", which leads to a failure in the data_files tests, as this matches "data/seqeval_results.txt" when it shouldn't.

So I suggest reporting this behavior in fsspec (IMO, they should use fnmatch.translate for the conversion) and merging this PR in the current state (and improving this when/if it's fixed).

I see... they introduced new bugs... 😕
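
The false positive discussed in this thread can be reproduced with the standard `re` module alone. In the sketch below, `reported` is the regex quoted above. `anchored` is a hypothetical stricter regex, written here only as one plausible reading of the intended semantics (the keyword must start a path segment); it is not something fsspec produces:

```python
import re

# Regex quoted in the comment above, as produced by fsspec's conversion
# for "**/{keyword}[{sep}]*" with keyword="eval" and sep="-._ 0-9".
reported = re.compile("^.*eval[-\\._ 0-9][^/]*$")

# Hypothetical stricter regex (an assumption, not fsspec output): the keyword
# must start a path segment, i.e. follow the start of the string or a "/".
anchored = re.compile(r"^(?:.*/)?eval[-\._ 0-9][^/]*$")

for path in ["data/seqeval_results.txt", "data/eval_results.txt", "eval-1.csv"]:
    print(path, bool(reported.match(path)), bool(anchored.match(path)))
```

The leading `.*` in the reported regex is what lets `seqeval_results.txt` through: it can absorb `data/seq`, so `eval` no longer has to begin a file name.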

github-actions · 2023-09-19T15:09:35Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007644 / 0.011353 (-0.003709)	0.004759 / 0.011008 (-0.006249)	0.100569 / 0.038508 (0.062061)	0.089645 / 0.023109 (0.066536)	0.376679 / 0.275898 (0.100781)	0.413214 / 0.323480 (0.089735)	0.006087 / 0.007986 (-0.001899)	0.003832 / 0.004328 (-0.000496)	0.075892 / 0.004250 (0.071641)	0.064635 / 0.037052 (0.027582)	0.376874 / 0.258489 (0.118385)	0.436756 / 0.293841 (0.142915)	0.036372 / 0.128546 (-0.092174)	0.010047 / 0.075646 (-0.065599)	0.345073 / 0.419271 (-0.074198)	0.062092 / 0.043533 (0.018559)	0.380503 / 0.255139 (0.125364)	0.414800 / 0.283200 (0.131600)	0.028274 / 0.141683 (-0.113409)	1.732463 / 1.452155 (0.280308)	1.859049 / 1.492716 (0.366333)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.267129 / 0.018006 (0.249123)	0.509109 / 0.000490 (0.508619)	0.012329 / 0.000200 (0.012130)	0.000432 / 0.000054 (0.000377)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.033773 / 0.037411 (-0.003638)	0.102800 / 0.014526 (0.088274)	0.114256 / 0.176557 (-0.062300)	0.182048 / 0.737135 (-0.555087)	0.118225 / 0.296338 (-0.178113)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.457553 / 0.215209 (0.242344)	4.588212 / 2.077655 (2.510557)	2.184138 / 1.504120 (0.680018)	2.003570 / 1.541195 (0.462375)	2.093217 / 1.468490 (0.624727)	0.585679 / 4.584777 (-3.999098)	4.175319 / 3.745712 (0.429607)	3.914168 / 5.269862 (-1.355693)	2.452992 / 4.565676 (-2.112684)	0.068363 / 0.424275 (-0.355912)	0.009314 / 0.007607 (0.001707)	0.543640 / 0.226044 (0.317595)	5.440853 / 2.268929 (3.171925)	2.782415 / 55.444624 (-52.662210)	2.332359 / 6.876477 (-4.544118)	2.628520 / 2.142072 (0.486448)	0.696838 / 4.805227 (-4.108389)	0.160653 / 6.500664 (-6.340012)	0.075599 / 0.075469 (0.000130)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.545305 / 1.841788 (-0.296483)	23.073174 / 8.074308 (14.998866)	16.974977 / 10.191392 (6.783585)	0.183719 / 0.680424 (-0.496705)	0.021633 / 0.534201 (-0.512568)	0.471202 / 0.579283 (-0.108081)	0.479385 / 0.434364 (0.045021)	0.550872 / 0.540337 (0.010535)	0.766825 / 1.386936 (-0.620111)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007918 / 0.011353 (-0.003435)	0.004793 / 0.011008 (-0.006215)	0.077273 / 0.038508 (0.038765)	0.092079 / 0.023109 (0.068969)	0.483269 / 0.275898 (0.207371)	0.524919 / 0.323480 (0.201439)	0.006273 / 0.007986 (-0.001713)	0.004018 / 0.004328 (-0.000310)	0.077188 / 0.004250 (0.072937)	0.067891 / 0.037052 (0.030839)	0.478531 / 0.258489 (0.220042)	0.526956 / 0.293841 (0.233115)	0.038309 / 0.128546 (-0.090237)	0.010133 / 0.075646 (-0.065513)	0.083892 / 0.419271 (-0.335379)	0.057369 / 0.043533 (0.013836)	0.509427 / 0.255139 (0.254288)	0.506574 / 0.283200 (0.223374)	0.027987 / 0.141683 (-0.113696)	1.897469 / 1.452155 (0.445314)	1.893102 / 1.492716 (0.400385)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.243003 / 0.018006 (0.224997)	0.500267 / 0.000490 (0.499777)	0.007442 / 0.000200 (0.007242)	0.000110 / 0.000054 (0.000055)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.039266 / 0.037411 (0.001855)	0.114438 / 0.014526 (0.099912)	0.124528 / 0.176557 (-0.052029)	0.189399 / 0.737135 (-0.547736)	0.126703 / 0.296338 (-0.169635)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.518139 / 0.215209 (0.302930)	5.162058 / 2.077655 (3.084403)	2.835111 / 1.504120 (1.330991)	2.640919 / 1.541195 (1.099724)	2.736800 / 1.468490 (1.268310)	0.582813 / 4.584777 (-4.001964)	4.246269 / 3.745712 (0.500557)	3.891161 / 5.269862 (-1.378701)	2.445392 / 4.565676 (-2.120285)	0.068943 / 0.424275 (-0.355332)	0.009248 / 0.007607 (0.001641)	0.604859 / 0.226044 (0.378815)	6.030660 / 2.268929 (3.761731)	3.409778 / 55.444624 (-52.034846)	2.990488 / 6.876477 (-3.885988)	3.281317 / 2.142072 (1.139245)	0.697705 / 4.805227 (-4.107523)	0.159502 / 6.500664 (-6.341162)	0.072471 / 0.075469 (-0.002999)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.625428 / 1.841788 (-0.216360)	23.602509 / 8.074308 (15.528201)	18.091474 / 10.191392 (7.900082)	0.172816 / 0.680424 (-0.507608)	0.023708 / 0.534201 (-0.510493)	0.473768 / 0.579283 (-0.105515)	0.493713 / 0.434364 (0.059349)	0.566326 / 0.540337 (0.025989)	0.788670 / 1.386936 (-0.598266)

albertvillanova

Thanks. Any comment on my comment below?

Maybe we should update the docstring of get_data_patterns accordingly? Currently it only gives examples of outputs with ** not in a single path segment (i.e. not with a / as prefix or suffix).

lhoestq

LGTM! I think the fix is future-proof, so we should be fine.

tests/test_data_files.py

lhoestq · 2023-09-20T15:37:10Z

Thanks. Any comment on my comment below?

Maybe we should update the docstring of get_data_patterns accordingly? Currently it only gives examples of outputs with ** not in a single path segment (i.e. not with a / as prefix or suffix).

Yeah, right, we do need to update it: the outputs shown are from older versions of fsspec and from older patterns that we no longer use.

In general, I also think our docstrings should encourage users to use **/* instead of ** (whose behavior is unique to fsspec).

lhoestq · 2023-09-20T15:54:39Z

Also, I just noticed that KEYWORDS_IN_DIR_NAME_BASE_PATTERNS seems to include KEYWORDS_IN_FILENAME_BASE_PATTERNS. I guess we can try dropping the filename one in another PR to remove this redundancy.

(noticed this by checking that the data pattern is the same for both the dir name and filename examples in the get_data_patterns docstring)

Co-authored-by: Quentin Lhoest

github-actions · 2023-09-22T16:22:06Z

Show benchmarks

PyArrow==8.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006922 / 0.011353 (-0.004431)	0.004459 / 0.011008 (-0.006549)	0.084742 / 0.038508 (0.046234)	0.089002 / 0.023109 (0.065893)	0.310886 / 0.275898 (0.034988)	0.340518 / 0.323480 (0.017038)	0.007011 / 0.007986 (-0.000975)	0.004566 / 0.004328 (0.000237)	0.067260 / 0.004250 (0.063009)	0.066349 / 0.037052 (0.029297)	0.324029 / 0.258489 (0.065540)	0.373785 / 0.293841 (0.079944)	0.031780 / 0.128546 (-0.096766)	0.009208 / 0.075646 (-0.066438)	0.288871 / 0.419271 (-0.130401)	0.054548 / 0.043533 (0.011015)	0.313344 / 0.255139 (0.058205)	0.336430 / 0.283200 (0.053231)	0.029037 / 0.141683 (-0.112646)	1.483797 / 1.452155 (0.031642)	1.581884 / 1.492716 (0.089167)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.370520 / 0.018006 (0.352514)	0.796720 / 0.000490 (0.796230)	0.009329 / 0.000200 (0.009129)	0.000109 / 0.000054 (0.000055)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.033002 / 0.037411 (-0.004410)	0.083442 / 0.014526 (0.068916)	0.106468 / 0.176557 (-0.070088)	0.165315 / 0.737135 (-0.571820)	0.103048 / 0.296338 (-0.193291)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.386800 / 0.215209 (0.171591)	3.843312 / 2.077655 (1.765658)	1.848953 / 1.504120 (0.344834)	1.679508 / 1.541195 (0.138313)	1.733578 / 1.468490 (0.265088)	0.488455 / 4.584777 (-4.096322)	3.613594 / 3.745712 (-0.132118)	3.533334 / 5.269862 (-1.736528)	2.176216 / 4.565676 (-2.389460)	0.056915 / 0.424275 (-0.367360)	0.007349 / 0.007607 (-0.000258)	0.465132 / 0.226044 (0.239088)	4.638479 / 2.268929 (2.369550)	2.354741 / 55.444624 (-53.089883)	1.991777 / 6.876477 (-4.884700)	2.249823 / 2.142072 (0.107751)	0.582748 / 4.805227 (-4.222480)	0.133829 / 6.500664 (-6.366835)	0.060949 / 0.075469 (-0.014520)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.252027 / 1.841788 (-0.589760)	20.660234 / 8.074308 (12.585926)	14.328496 / 10.191392 (4.137104)	0.164872 / 0.680424 (-0.515552)	0.018867 / 0.534201 (-0.515334)	0.392850 / 0.579283 (-0.186433)	0.425684 / 0.434364 (-0.008679)	0.461776 / 0.540337 (-0.078562)	0.663688 / 1.386936 (-0.723248)

PyArrow==latest

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007010 / 0.011353 (-0.004343)	0.004791 / 0.011008 (-0.006217)	0.064738 / 0.038508 (0.026230)	0.088648 / 0.023109 (0.065539)	0.418106 / 0.275898 (0.142208)	0.446767 / 0.323480 (0.123287)	0.006761 / 0.007986 (-0.001224)	0.004649 / 0.004328 (0.000320)	0.066345 / 0.004250 (0.062094)	0.068326 / 0.037052 (0.031274)	0.423426 / 0.258489 (0.164937)	0.463160 / 0.293841 (0.169319)	0.032689 / 0.128546 (-0.095858)	0.009299 / 0.075646 (-0.066347)	0.071321 / 0.419271 (-0.347951)	0.048752 / 0.043533 (0.005219)	0.418932 / 0.255139 (0.163793)	0.440673 / 0.283200 (0.157473)	0.027898 / 0.141683 (-0.113785)	1.531860 / 1.452155 (0.079705)	1.620456 / 1.492716 (0.127739)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.354917 / 0.018006 (0.336911)	0.792432 / 0.000490 (0.791943)	0.006626 / 0.000200 (0.006426)	0.000124 / 0.000054 (0.000070)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.036190 / 0.037411 (-0.001222)	0.093052 / 0.014526 (0.078526)	0.111927 / 0.176557 (-0.064629)	0.165571 / 0.737135 (-0.571564)	0.112159 / 0.296338 (-0.184180)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.437798 / 0.215209 (0.222589)	4.367166 / 2.077655 (2.289511)	2.343292 / 1.504120 (0.839172)	2.169298 / 1.541195 (0.628103)	2.224471 / 1.468490 (0.755981)	0.487317 / 4.584777 (-4.097460)	3.627825 / 3.745712 (-0.117887)	3.500914 / 5.269862 (-1.768947)	2.175862 / 4.565676 (-2.389815)	0.057975 / 0.424275 (-0.366300)	0.007509 / 0.007607 (-0.000098)	0.517389 / 0.226044 (0.291345)	5.169694 / 2.268929 (2.900766)	2.850993 / 55.444624 (-52.593631)	2.473111 / 6.876477 (-4.403366)	2.746731 / 2.142072 (0.604659)	0.586597 / 4.805227 (-4.218630)	0.134082 / 6.500664 (-6.366582)	0.061035 / 0.075469 (-0.014434)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.375186 / 1.841788 (-0.466602)	20.960817 / 8.074308 (12.886509)	15.035071 / 10.191392 (4.843679)	0.169494 / 0.680424 (-0.510930)	0.020654 / 0.534201 (-0.513547)	0.398047 / 0.579283 (-0.181236)	0.438117 / 0.434364 (0.003753)	0.483896 / 0.540337 (-0.056441)	0.690728 / 1.386936 (-0.696208)

github-actions · 2023-09-22T17:20:45Z

Show benchmarks

PyArrow==8.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006892 / 0.011353 (-0.004461)	0.004087 / 0.011008 (-0.006921)	0.084695 / 0.038508 (0.046187)	0.078084 / 0.023109 (0.054975)	0.322976 / 0.275898 (0.047078)	0.355332 / 0.323480 (0.031852)	0.004235 / 0.007986 (-0.003750)	0.003450 / 0.004328 (-0.000879)	0.065355 / 0.004250 (0.061104)	0.058593 / 0.037052 (0.021541)	0.335761 / 0.258489 (0.077272)	0.370392 / 0.293841 (0.076551)	0.031720 / 0.128546 (-0.096827)	0.008611 / 0.075646 (-0.067036)	0.288213 / 0.419271 (-0.131059)	0.053374 / 0.043533 (0.009842)	0.321863 / 0.255139 (0.066724)	0.341587 / 0.283200 (0.058387)	0.025694 / 0.141683 (-0.115989)	1.470502 / 1.452155 (0.018348)	1.565068 / 1.492716 (0.072352)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.231063 / 0.018006 (0.213057)	0.464996 / 0.000490 (0.464506)	0.007316 / 0.000200 (0.007116)	0.000288 / 0.000054 (0.000233)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.029244 / 0.037411 (-0.008167)	0.086303 / 0.014526 (0.071777)	0.097281 / 0.176557 (-0.079276)	0.153552 / 0.737135 (-0.583583)	0.098488 / 0.296338 (-0.197850)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.382753 / 0.215209 (0.167544)	3.826503 / 2.077655 (1.748848)	1.848439 / 1.504120 (0.344319)	1.688519 / 1.541195 (0.147324)	1.787867 / 1.468490 (0.319377)	0.489708 / 4.584777 (-4.095069)	3.576780 / 3.745712 (-0.168932)	3.341536 / 5.269862 (-1.928325)	2.108787 / 4.565676 (-2.456889)	0.057409 / 0.424275 (-0.366866)	0.007325 / 0.007607 (-0.000282)	0.459536 / 0.226044 (0.233492)	4.590609 / 2.268929 (2.321681)	2.313005 / 55.444624 (-53.131620)	1.972389 / 6.876477 (-4.904087)	2.218511 / 2.142072 (0.076439)	0.613817 / 4.805227 (-4.191410)	0.133846 / 6.500664 (-6.366818)	0.062190 / 0.075469 (-0.013279)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.279860 / 1.841788 (-0.561928)	19.549777 / 8.074308 (11.475469)	14.225844 / 10.191392 (4.034452)	0.164682 / 0.680424 (-0.515741)	0.018321 / 0.534201 (-0.515880)	0.389874 / 0.579283 (-0.189409)	0.408597 / 0.434364 (-0.025767)	0.454327 / 0.540337 (-0.086011)	0.645571 / 1.386936 (-0.741365)

PyArrow==latest

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007021 / 0.011353 (-0.004332)	0.004119 / 0.011008 (-0.006889)	0.065393 / 0.038508 (0.026885)	0.085005 / 0.023109 (0.061896)	0.412221 / 0.275898 (0.136323)	0.438266 / 0.323480 (0.114786)	0.005594 / 0.007986 (-0.002392)	0.003499 / 0.004328 (-0.000829)	0.065053 / 0.004250 (0.060802)	0.060608 / 0.037052 (0.023555)	0.413938 / 0.258489 (0.155449)	0.446192 / 0.293841 (0.152351)	0.032232 / 0.128546 (-0.096314)	0.008617 / 0.075646 (-0.067029)	0.071296 / 0.419271 (-0.347976)	0.048756 / 0.043533 (0.005223)	0.404977 / 0.255139 (0.149838)	0.426801 / 0.283200 (0.143602)	0.023650 / 0.141683 (-0.118033)	1.526928 / 1.452155 (0.074773)	1.627504 / 1.492716 (0.134787)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.224318 / 0.018006 (0.206312)	0.469717 / 0.000490 (0.469227)	0.005539 / 0.000200 (0.005339)	0.000098 / 0.000054 (0.000043)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.034240 / 0.037411 (-0.003171)	0.096449 / 0.014526 (0.081923)	0.107309 / 0.176557 (-0.069247)	0.160246 / 0.737135 (-0.576889)	0.107595 / 0.296338 (-0.188743)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.434266 / 0.215209 (0.219057)	4.325571 / 2.077655 (2.247916)	2.324066 / 1.504120 (0.819946)	2.140238 / 1.541195 (0.599044)	2.244593 / 1.468490 (0.776103)	0.486259 / 4.584777 (-4.098518)	3.644120 / 3.745712 (-0.101592)	3.372330 / 5.269862 (-1.897531)	2.074779 / 4.565676 (-2.490897)	0.057154 / 0.424275 (-0.367121)	0.007304 / 0.007607 (-0.000303)	0.516944 / 0.226044 (0.290899)	5.174300 / 2.268929 (2.905372)	2.816269 / 55.444624 (-52.628356)	2.462943 / 6.876477 (-4.413534)	2.735851 / 2.142072 (0.593779)	0.589028 / 4.805227 (-4.216200)	0.131804 / 6.500664 (-6.368860)	0.060173 / 0.075469 (-0.015296)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.354540 / 1.841788 (-0.487248)	20.436511 / 8.074308 (12.362203)	15.541981 / 10.191392 (5.350589)	0.168399 / 0.680424 (-0.512025)	0.020716 / 0.534201 (-0.513485)	0.396275 / 0.579283 (-0.183008)	0.427232 / 0.434364 (-0.007132)	0.475121 / 0.540337 (-0.065216)	0.648579 / 1.386936 (-0.738357)

github-actions · 2023-09-22T17:24:03Z

Show benchmarks

PyArrow==8.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.009071 / 0.011353 (-0.002282)	0.005820 / 0.011008 (-0.005188)	0.119974 / 0.038508 (0.081466)	0.092145 / 0.023109 (0.069036)	0.445349 / 0.275898 (0.169451)	0.442488 / 0.323480 (0.119008)	0.005352 / 0.007986 (-0.002634)	0.004332 / 0.004328 (0.000003)	0.084397 / 0.004250 (0.080147)	0.064624 / 0.037052 (0.027572)	0.430938 / 0.258489 (0.172448)	0.503574 / 0.293841 (0.209733)	0.047900 / 0.128546 (-0.080647)	0.014237 / 0.075646 (-0.061409)	0.366145 / 0.419271 (-0.053127)	0.066344 / 0.043533 (0.022811)	0.424582 / 0.255139 (0.169443)	0.451845 / 0.283200 (0.168646)	0.041409 / 0.141683 (-0.100274)	1.886998 / 1.452155 (0.434843)	2.011676 / 1.492716 (0.518960)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.301008 / 0.018006 (0.283001)	0.608670 / 0.000490 (0.608180)	0.011963 / 0.000200 (0.011763)	0.000117 / 0.000054 (0.000063)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.031996 / 0.037411 (-0.005415)	0.102274 / 0.014526 (0.087748)	0.121437 / 0.176557 (-0.055120)	0.181647 / 0.737135 (-0.555489)	0.121634 / 0.296338 (-0.174704)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.597070 / 0.215209 (0.381861)	5.973808 / 2.077655 (3.896154)	2.486345 / 1.504120 (0.982225)	2.125395 / 1.541195 (0.584201)	2.270864 / 1.468490 (0.802374)	0.880031 / 4.584777 (-3.704746)	5.396522 / 3.745712 (1.650809)	4.702005 / 5.269862 (-0.567857)	3.023087 / 4.565676 (-1.542589)	0.097093 / 0.424275 (-0.327182)	0.008457 / 0.007607 (0.000850)	0.712164 / 0.226044 (0.486120)	7.112867 / 2.268929 (4.843938)	3.364509 / 55.444624 (-52.080115)	2.646953 / 6.876477 (-4.229524)	2.795967 / 2.142072 (0.653894)	1.067182 / 4.805227 (-3.738046)	0.218297 / 6.500664 (-6.282368)	0.071720 / 0.075469 (-0.003750)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.640477 / 1.841788 (-0.201311)	24.875163 / 8.074308 (16.800855)	22.125706 / 10.191392 (11.934314)	0.247267 / 0.680424 (-0.433157)	0.033717 / 0.534201 (-0.500484)	0.492422 / 0.579283 (-0.086862)	0.578323 / 0.434364 (0.143959)	0.579503 / 0.540337 (0.039165)	0.816721 / 1.386936 (-0.570215)

PyArrow==latest

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.009372 / 0.011353 (-0.001981)	0.005449 / 0.011008 (-0.005559)	0.095371 / 0.038508 (0.056863)	0.086320 / 0.023109 (0.063211)	0.539573 / 0.275898 (0.263675)	0.580338 / 0.323480 (0.256858)	0.007028 / 0.007986 (-0.000958)	0.004196 / 0.004328 (-0.000133)	0.082710 / 0.004250 (0.078460)	0.064336 / 0.037052 (0.027284)	0.521490 / 0.258489 (0.263001)	0.567942 / 0.293841 (0.274101)	0.049659 / 0.128546 (-0.078887)	0.017297 / 0.075646 (-0.058350)	0.093874 / 0.419271 (-0.325398)	0.061664 / 0.043533 (0.018131)	0.524476 / 0.255139 (0.269337)	0.563255 / 0.283200 (0.280055)	0.039990 / 0.141683 (-0.101693)	1.854438 / 1.452155 (0.402283)	1.819321 / 1.492716 (0.326605)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.298817 / 0.018006 (0.280811)	0.629381 / 0.000490 (0.628891)	0.006259 / 0.000200 (0.006059)	0.000690 / 0.000054 (0.000635)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.041009 / 0.037411 (0.003598)	0.123845 / 0.014526 (0.109319)	0.138606 / 0.176557 (-0.037951)	0.215042 / 0.737135 (-0.522093)	0.129572 / 0.296338 (-0.166767)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.668823 / 0.215209 (0.453614)	6.596762 / 2.077655 (4.519108)	3.275429 / 1.504120 (1.771309)	2.921747 / 1.541195 (1.380553)	2.963748 / 1.468490 (1.495258)	0.897588 / 4.584777 (-3.687188)	5.683618 / 3.745712 (1.937906)	5.051102 / 5.269862 (-0.218760)	3.178855 / 4.565676 (-1.386822)	0.107446 / 0.424275 (-0.316829)	0.008967 / 0.007607 (0.001360)	0.785577 / 0.226044 (0.559532)	8.236556 / 2.268929 (5.967628)	3.914725 / 55.444624 (-51.529899)	3.129068 / 6.876477 (-3.747409)	3.368383 / 2.142072 (1.226310)	1.004307 / 4.805227 (-3.800920)	0.204788 / 6.500664 (-6.295876)	0.078250 / 0.075469 (0.002780)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.778574 / 1.841788 (-0.063213)	25.583659 / 8.074308 (17.509351)	23.505866 / 10.191392 (13.314474)	0.228759 / 0.680424 (-0.451665)	0.038348 / 0.534201 (-0.495853)	0.468980 / 0.579283 (-0.110303)	0.630194 / 0.434364 (0.195830)	0.587535 / 0.540337 (0.047198)	0.831761 / 1.386936 (-0.555175)

mariosasko · 2023-09-25T13:08:10Z

I've addressed the comments. Let me know if it looks all good now :)

lhoestq · 2023-09-25T17:23:45Z

Actually, I just found out that the current pattern **/*[-._ 0-9/]train[-._ 0-9/]** doesn't match data/train.csv in bash (though it does match in fsspec right now).

So isn't there a risk that this pattern breaks in the future?
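
To make the concern concrete, here is a minimal sketch of why a `/` inside a bracket expression is fragile. In POSIX pathname expansion, a slash in a path must be matched by an explicit slash in the pattern, never by `*`, `?`, or a bracket expression. The helper `segmentwise_match` and the path `repo/data/train.csv` are hypothetical, made up for illustration. `fnmatchcase` is used only as a stand-in for a matcher whose wildcards and brackets may cross `/`; it is not fsspec's algorithm, so its results need not agree with fsspec (or with bash) on any given path.

```python
from fnmatch import fnmatchcase

# The split pattern quoted in the comment above.
PATTERN = "**/*[-._ 0-9/]train[-._ 0-9/]**"


def segmentwise_match(path: str, pattern: str) -> bool:
    """Hypothetical helper: shell-like matching, one path segment at a time.

    Both path and pattern are split on "/", so no wildcard or bracket
    expression can ever consume a path separator. This is a rough model of
    POSIX pathname expansion without globstar.
    """
    path_parts = path.split("/")
    pattern_parts = pattern.split("/")
    if len(path_parts) != len(pattern_parts):
        return False
    return all(fnmatchcase(part, pat) for part, pat in zip(path_parts, pattern_parts))


# Segment-wise, the "/" inside the brackets cuts the pattern into four pieces,
# so a two-segment path cannot match.
print(segmentwise_match("data/train.csv", PATTERN))

# A flat matcher lets "[-._ 0-9/]" match the "/" right before "train".
print(fnmatchcase("repo/data/train.csv", PATTERN))
```

The pattern only works when the matcher lets a bracket expression match a path separator, which is the fsspec-specific behavior being relied on here.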

mariosasko · 2023-09-26T14:08:53Z

@lhoestq fsspec has tests to check their specific (non-posix) behavior, so I think merging in the current state is fine. And if they make a breaking change in the future, we can align the patterns once again :)

lhoestq · 2023-09-26T15:07:56Z

Yeah, after more thought I also think it's fine. Feel free to merge!

github-actions · 2023-09-26T15:41:38Z

Show benchmarks

PyArrow==8.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006920 / 0.011353 (-0.004433)	0.004182 / 0.011008 (-0.006826)	0.084629 / 0.038508 (0.046121)	0.086052 / 0.023109 (0.062943)	0.326062 / 0.275898 (0.050164)	0.344190 / 0.323480 (0.020710)	0.005393 / 0.007986 (-0.002593)	0.003410 / 0.004328 (-0.000918)	0.064327 / 0.004250 (0.060076)	0.056556 / 0.037052 (0.019504)	0.319255 / 0.258489 (0.060766)	0.357943 / 0.293841 (0.064102)	0.032097 / 0.128546 (-0.096450)	0.008778 / 0.075646 (-0.066868)	0.291057 / 0.419271 (-0.128215)	0.053225 / 0.043533 (0.009692)	0.307713 / 0.255139 (0.052574)	0.350058 / 0.283200 (0.066858)	0.024380 / 0.141683 (-0.117303)	1.459482 / 1.452155 (0.007328)	1.555711 / 1.492716 (0.062994)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.239487 / 0.018006 (0.221480)	0.467604 / 0.000490 (0.467114)	0.010742 / 0.000200 (0.010542)	0.000285 / 0.000054 (0.000230)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.029394 / 0.037411 (-0.008018)	0.087404 / 0.014526 (0.072879)	0.098701 / 0.176557 (-0.077855)	0.154145 / 0.737135 (-0.582990)	0.099726 / 0.296338 (-0.196612)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.389008 / 0.215209 (0.173799)	3.873165 / 2.077655 (1.795510)	1.860676 / 1.504120 (0.356556)	1.679668 / 1.541195 (0.138474)	1.782347 / 1.468490 (0.313857)	0.489469 / 4.584777 (-4.095308)	3.678706 / 3.745712 (-0.067006)	3.404076 / 5.269862 (-1.865785)	2.110972 / 4.565676 (-2.454704)	0.057478 / 0.424275 (-0.366797)	0.007443 / 0.007607 (-0.000164)	0.464780 / 0.226044 (0.238736)	4.643606 / 2.268929 (2.374678)	2.355744 / 55.444624 (-53.088881)	1.993992 / 6.876477 (-4.882485)	2.245520 / 2.142072 (0.103447)	0.592773 / 4.805227 (-4.212454)	0.135369 / 6.500664 (-6.365295)	0.062478 / 0.075469 (-0.012991)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.257537 / 1.841788 (-0.584251)	19.828010 / 8.074308 (11.753702)	14.709260 / 10.191392 (4.517868)	0.168359 / 0.680424 (-0.512065)	0.018907 / 0.534201 (-0.515294)	0.397223 / 0.579283 (-0.182060)	0.421760 / 0.434364 (-0.012604)	0.464597 / 0.540337 (-0.075740)	0.665905 / 1.386936 (-0.721031)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007247 / 0.011353 (-0.004106)	0.004104 / 0.011008 (-0.006904)	0.065008 / 0.038508 (0.026500)	0.083485 / 0.023109 (0.060376)	0.399808 / 0.275898 (0.123910)	0.433374 / 0.323480 (0.109894)	0.005453 / 0.007986 (-0.002532)	0.003479 / 0.004328 (-0.000850)	0.065126 / 0.004250 (0.060876)	0.059945 / 0.037052 (0.022893)	0.402018 / 0.258489 (0.143529)	0.437927 / 0.293841 (0.144086)	0.032654 / 0.128546 (-0.095892)	0.008717 / 0.075646 (-0.066929)	0.071737 / 0.419271 (-0.347534)	0.048903 / 0.043533 (0.005370)	0.402107 / 0.255139 (0.146968)	0.417602 / 0.283200 (0.134402)	0.024821 / 0.141683 (-0.116862)	1.474471 / 1.452155 (0.022316)	1.559571 / 1.492716 (0.066855)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.232010 / 0.018006 (0.214003)	0.460768 / 0.000490 (0.460278)	0.005250 / 0.000200 (0.005050)	0.000109 / 0.000054 (0.000055)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.033839 / 0.037411 (-0.003573)	0.101617 / 0.014526 (0.087091)	0.107984 / 0.176557 (-0.068573)	0.160923 / 0.737135 (-0.576212)	0.110367 / 0.296338 (-0.185971)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.433087 / 0.215209 (0.217878)	4.324100 / 2.077655 (2.246445)	2.312937 / 1.504120 (0.808817)	2.159903 / 1.541195 (0.618708)	2.240235 / 1.468490 (0.771745)	0.500659 / 4.584777 (-4.084118)	3.743801 / 3.745712 (-0.001911)	3.441350 / 5.269862 (-1.828512)	2.141370 / 4.565676 (-2.424306)	0.059078 / 0.424275 (-0.365197)	0.007468 / 0.007607 (-0.000139)	0.508108 / 0.226044 (0.282064)	5.076738 / 2.268929 (2.807809)	2.825939 / 55.444624 (-52.618685)	2.467762 / 6.876477 (-4.408715)	2.705079 / 2.142072 (0.563006)	0.603363 / 4.805227 (-4.201864)	0.136267 / 6.500664 (-6.364397)	0.062887 / 0.075469 (-0.012582)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.359344 / 1.841788 (-0.482443)	20.581510 / 8.074308 (12.507202)	15.534489 / 10.191392 (5.343097)	0.192068 / 0.680424 (-0.488356)	0.020831 / 0.534201 (-0.513370)	0.403330 / 0.579283 (-0.175953)	0.429536 / 0.434364 (-0.004828)	0.479906 / 0.540337 (-0.060431)	0.674170 / 1.386936 (-0.712766)
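
Each benchmark cell above has the shape `new / old (diff)`, where the value in parentheses should equal `new - old` up to rounding. The following is a minimal sketch, not part of the PR, for parsing one metric/value row pair and checking that relationship. It assumes the rows are tab-separated as in the bot comment; the helper names (`parse_cell`, `parse_row`, `is_consistent`) and the tolerance are hypothetical, and the sample data is the `benchmark_getitem_100B.json` row from the PyArrow==8.0.0 run.

```python
import re

# One cell looks like "0.239487 / 0.018006 (0.221480)".
CELL_RE = re.compile(r"(-?\d+\.\d+) / (-?\d+\.\d+) \((-?\d+\.\d+)\)")

# Sample rows copied from the benchmark_getitem_100B.json table (PyArrow==8.0.0).
METRIC_LINE = (
    "metric\tget_batch_of_1024_random_rows\tget_batch_of_1024_rows"
    "\tget_first_row\tget_last_row"
)
VALUE_LINE = (
    "new / old (diff)\t0.239487 / 0.018006 (0.221480)"
    "\t0.467604 / 0.000490 (0.467114)"
    "\t0.010742 / 0.000200 (0.010542)"
    "\t0.000285 / 0.000054 (0.000230)"
)


def parse_cell(cell):
    """Return (new, old, diff) as floats for a single benchmark cell."""
    match = CELL_RE.fullmatch(cell.strip())
    if match is None:
        raise ValueError(f"unrecognised benchmark cell: {cell!r}")
    new, old, diff = (float(group) for group in match.groups())
    return new, old, diff


def parse_row(metric_line, value_line):
    """Map each metric name to its (new, old, diff) triple."""
    # The first column holds the row labels ("metric", "new / old (diff)").
    names = metric_line.rstrip("\n").split("\t")[1:]
    cells = value_line.rstrip("\n").split("\t")[1:]
    if len(names) != len(cells):
        raise ValueError("metric and value rows have different column counts")
    return {name: parse_cell(cell) for name, cell in zip(names, cells)}


def is_consistent(new, old, diff, tol=5e-6):
    """Check that the reported diff matches new - old within a rounding tolerance."""
    return abs((new - old) - diff) < tol


if __name__ == "__main__":
    for name, (new, old, diff) in parse_row(METRIC_LINE, VALUE_LINE).items():
        print(name, new, old, diff, is_consistent(new, old, diff))
```

The tolerance is needed because the bot appears to round `new`, `old`, and the difference independently, so the printed difference can be off by one unit in the last decimal place.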

* Add support for `fsspec>=2023.9.0`
* Fixes
* Style
* Fix mock fs for files in nested directories
* Nit
* More fixes
* Nit
* Remove print
* Update tests/test_data_files.py

  Co-authored-by: Quentin Lhoest <[email protected]>
* Address some more comments

---------

Co-authored-by: Quentin Lhoest <[email protected]>

Add support for fsspec>=2023.9.0

887a854

Fixes

278a567

mariosasko added 3 commits September 18, 2023 00:37

Style

f611e58

Fix mock fs for files in nested directories

d0519c6

Nit

e0bd844

mariosasko marked this pull request as draft September 18, 2023 13:09

More fixes

c89e60c

Nit

4529127

mariosasko marked this pull request as ready for review September 18, 2023 21:50

mariosasko requested a review from lhoestq September 18, 2023 21:50

albertvillanova reviewed Sep 19, 2023

Remove print

1ee2359

albertvillanova reviewed Sep 20, 2023

lhoestq approved these changes Sep 20, 2023

lhoestq reviewed Sep 20, 2023

tests/test_data_files.py (review thread resolved)

Update tests/test_data_files.py

3e7fc64

Co-authored-by: Quentin Lhoest <[email protected]>

mariosasko added 2 commits September 22, 2023 19:11

Address some more comments

4fa138f

Merge branch 'main' of github.com:huggingface/datasets into fix-6214

68f4f84

mariosasko merged commit 33ac74c into main Sep 26, 2023
11 of 13 checks passed

mariosasko deleted the fix-6214 branch September 26, 2023 15:32

mariosasko mentioned this pull request Sep 28, 2023

.glob("**/filename") returns incorrect results fsspec/filesystem_spec#1380

Closed

lhoestq mentioned this pull request Oct 1, 2023

Duplicate data_files when named <split>/<split>.parquet #6272

Closed

mariosasko mentioned this pull request Mar 1, 2024

Improve default patterns resolution #6704

Merged

		KEYWORDS_IN_FILENAME_BASE_PATTERNS = ["*/[{sep}/]{keyword}[{sep}]", "{keyword}[{sep}]"]
		KEYWORDS_IN_DIR_NAME_BASE_PATTERNS = ["{keyword}[{sep}/]", "/[{sep}/]{keyword}[{sep}/]*"]
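
The two constants above are templates with `{keyword}` and `{sep}` placeholders. As a rough sketch of how such templates can be expanded into concrete glob patterns with `str.format`, consider the following. The patterns are copied exactly as they appear above (the page rendering may have dropped some asterisks, so the exact glob syntax is unverified), and the separator set, keyword list, and `expand_patterns` helper are hypothetical values chosen for illustration, not the library's actual code.

```python
# Base patterns copied verbatim from the two lines above; treat the exact
# glob syntax as unverified because the scrape may have lost characters.
KEYWORDS_IN_FILENAME_BASE_PATTERNS = ["*/[{sep}/]{keyword}[{sep}]", "{keyword}[{sep}]"]
KEYWORDS_IN_DIR_NAME_BASE_PATTERNS = ["{keyword}[{sep}/]", "/[{sep}/]{keyword}[{sep}/]*"]

# Hypothetical inputs, for illustration only.
EXAMPLE_SEPARATORS = "-._"
EXAMPLE_KEYWORDS = ["train", "test"]


def expand_patterns(base_patterns, keywords, sep):
    """Fill the {keyword} and {sep} placeholders for every keyword/pattern pair."""
    return [
        pattern.format(keyword=keyword, sep=sep)
        for keyword in keywords
        for pattern in base_patterns
    ]


if __name__ == "__main__":
    for pattern in expand_patterns(
        KEYWORDS_IN_FILENAME_BASE_PATTERNS, EXAMPLE_KEYWORDS, EXAMPLE_SEPARATORS
    ):
        print(pattern)
```

Because the square brackets are literal characters to `str.format`, each expanded pattern ends up with a glob character class built from the separator set, for example `[-._]` around the keyword.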
