
Disable sequential prefetching for non-Parquet objects #225

Open
wants to merge 1 commit into main
Conversation

vaibhav5140

Description of change

Added support for selective sequential prefetching based on file type. This change enables sequential prefetching for Parquet files and disables it for non-Parquet objects.

Does this contribution introduce any breaking changes to the existing APIs or behaviours?

Yes. Sequential prefetching is now disabled for non-Parquet objects.

Does this contribution introduce any new public APIs or behaviours?

No

How was the contribution tested?

- Updated unit tests for BlockManager, PhysicalIOImpl and S3SeekableInputStreamFactory
- Verified behaviour with Parquet and non-Parquet files
- Modified existing tests to handle OpenFileInformation

Does this contribution need a changelog entry?

No

Collaborator

@ahmarsuhail left a comment

Looks good overall.

I'm not sure if the block store test cases are quite right, or maybe I misunderstood them. Let's discuss.

telemetry,
configuration.getLogicalIOConfiguration(),
parquetColumnPrefetchStore);

default:
OpenFileInformation effectiveInfo = openFileInformation;
Collaborator

can you move this to a private function in utils, call it something like setInputPolicy(), and then just call it

         new PhysicalIOImpl(
                s3URI, objectMetadataStore, objectBlobStore, telemetry, setInputPolicy())
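A minimal sketch of what such a helper could look like. The class name, the simplified InputPolicy enum, and the extension check below are illustrative stand-ins, not the library's real API:

```java
// Hypothetical utils helper, per the suggestion above: derive the input
// policy from the object key so the PhysicalIOImpl constructor carries no
// logic. PrefetchPolicyUtil, InputPolicy, and the extension check are
// simplified stand-ins for the library's real types and detection.
public final class PrefetchPolicyUtil {
    public enum InputPolicy { Sequential, None }

    private PrefetchPolicyUtil() {}

    // Sequential prefetching is enabled only for Parquet objects.
    public static InputPolicy setInputPolicy(String objectKey) {
        String key = objectKey.toLowerCase();
        return (key.endsWith(".parquet") || key.endsWith(".par"))
                ? InputPolicy.Sequential
                : InputPolicy.None;
    }
}
```

With something like this, the constructor call site stays a one-liner, as shown in the suggestion.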

Contributor

why do we need to create a new OpenFileInformation object again, can we not set the input policy in the existing object itself?

this.streamContext = streamContext;
this.streamContext = openFileInformation.getStreamContext();
this.isSequential =
openFileInformation.getInputPolicy() != null
Collaborator

move to a function, maybe in utils. keeps the constructor clean of any logic code
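A hedged sketch of pulling that check out of the constructor; the utils class name and the simplified InputPolicy enum are assumptions for illustration:

```java
// Hypothetical utils method so the constructor only assigns fields.
// StreamUtils and this InputPolicy enum are simplified stand-ins for
// the library's real types.
public final class StreamUtils {
    public enum InputPolicy { Sequential, Random }

    private StreamUtils() {}

    // Mirrors the inline check: a stream is sequential only when an
    // input policy is present and set to Sequential.
    public static boolean isSequential(InputPolicy inputPolicy) {
        return inputPolicy != null && inputPolicy == InputPolicy.Sequential;
    }
}
```

The constructor would then reduce to something like `this.isSequential = StreamUtils.isSequential(openFileInformation.getInputPolicy());`.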

@@ -54,6 +54,19 @@ public class S3SeekableInputStreamFactoryTest {

private static final S3URI TEST_URI = S3URI.of("test-bucket", "test-key");

private static OpenFileInformation createMockOpenFileInfo() {
Collaborator

why do you need a mock object here? you can just create OpenFileInformation object

@@ -99,7 +112,7 @@ void testCreateDefaultStream() throws IOException {

inputStream =
s3SeekableInputStreamFactory.createStream(
S3URI.of("bucket", "key"), mock(OpenFileInformation.class));
S3URI.of("bucket", "key"), createMockOpenFileInfo());
Collaborator

basically here you can just pass in OpenFileInformation.DEFAULT, instead of the mock object

}

@Test
void testNonParquetBehavior() throws IOException {
Collaborator

this is a good start, but I'm not sure if you're testing your desired behaviour here.

What I think you want to test for is when you do

csvBlockManager.makeRangeAvailable(0L, 100L, ReadMode.SYNC);
csvBlockManager.makeRangeAvailable(101L, 500L, ReadMode.SYNC);
csvBlockManager.makeRangeAvailable(501L, 500L, ReadMode.SYNC);

that is, when you request sequential blocks, no prefetching requests are made. So only 3 GET requests should ever be made, with the exact ranges you asked for
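The shape of that test can be sketched without the real BlockManager. Every class below (RecordingClient, NoPrefetchBlockManager, Range) is a simplified stand-in that only records GET ranges, not the library's actual API:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the suggested assertion: with prefetching disabled, each
// sequential makeRangeAvailable call must translate into exactly one GET
// with the exact range asked for.
public class SequentialNoPrefetchSketch {
    static final class Range {
        final long start, length;
        Range(long start, long length) { this.start = start; this.length = length; }
    }

    // Fake object client that records every GET it receives.
    static final class RecordingClient {
        final List<Range> gets = new ArrayList<>();
        void getObject(long start, long length) { gets.add(new Range(start, length)); }
    }

    // Stand-in for a BlockManager with sequential prefetching disabled:
    // it fetches exactly the requested range and nothing more.
    static final class NoPrefetchBlockManager {
        private final RecordingClient client;
        NoPrefetchBlockManager(RecordingClient client) { this.client = client; }
        void makeRangeAvailable(long pos, long length) { client.getObject(pos, length); }
    }

    public static void main(String[] args) {
        RecordingClient client = new RecordingClient();
        NoPrefetchBlockManager csvBlockManager = new NoPrefetchBlockManager(client);

        csvBlockManager.makeRangeAvailable(0L, 100L);
        csvBlockManager.makeRangeAvailable(101L, 500L);
        csvBlockManager.makeRangeAvailable(501L, 500L);

        // Only 3 GETs should ever be made, with the exact ranges requested.
        if (client.gets.size() != 3) throw new AssertionError("expected 3 GETs");
        if (client.gets.get(0).start != 0L || client.gets.get(0).length != 100L)
            throw new AssertionError("unexpected first range");
    }
}
```

In the real test this would be the fake-client half; the verification against the actual BlockManager would still go through Mockito's ArgumentCaptor, as in the existing tests.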

public Blob get(
ObjectKey objectKey,
ObjectMetadata metadata,
StreamContext streamContext,
Contributor

streamContext is no longer being passed to BlockManager, right? Should we remove it?

Contributor

+1

Comment on lines +306 to +406
// Given: Test setup for CSV file
S3URI csvUri = S3URI.of("bucket", "test.csv"); // Non-parquet extension
ObjectClient objectClient = mock(ObjectClient.class);
ObjectMetadata metadata =
ObjectMetadata.builder().contentLength(16L * ONE_MB).etag(ETAG).build();

PhysicalIOConfiguration testConfig =
PhysicalIOConfiguration.builder()
.readAheadBytes(ONE_KB)
.sequentialPrefetchBase(2.0)
.build();

// For CSV file
OpenFileInformation csvInfo = OpenFileInformation.builder().objectMetadata(metadata).build();

// Create BlockManager for CSV
BlockManager csvBlockManager =
new BlockManager(
ObjectKey.builder().s3URI(csvUri).etag(ETAG).build(),
objectClient,
metadata,
TestTelemetry.DEFAULT,
testConfig,
csvInfo);

// Setup mock response
when(objectClient.getObject(any(GetRequest.class), any()))
.thenReturn(
CompletableFuture.completedFuture(
ObjectContent.builder().stream(new ByteArrayInputStream(new byte[1024])).build()));

// Make sequential requests
ArgumentCaptor<GetRequest> requestCaptor = ArgumentCaptor.forClass(GetRequest.class);
csvBlockManager.makeRangeAvailable(0L, 100L, ReadMode.SYNC);
csvBlockManager.makeRangeAvailable(512L, 100L, ReadMode.SYNC);

verify(objectClient, atLeast(1)).getObject(requestCaptor.capture(), any());

// Verify all requests are limited to readAheadBytes
requestCaptor
.getAllValues()
.forEach(
request ->
assertTrue(
request.getRange().getLength() <= testConfig.getReadAheadBytes(),
"Non-Parquet requests should not exceed readAheadBytes"));
}

@Test
void testParquetBehavior() throws IOException {
// Given: Test setup for Parquet file
S3URI parquetUri = S3URI.of("bucket", "test.parquet"); // Parquet extension
ObjectClient objectClient = mock(ObjectClient.class);
ObjectMetadata metadata =
ObjectMetadata.builder().contentLength(16L * ONE_MB).etag(ETAG).build();

PhysicalIOConfiguration testConfig =
PhysicalIOConfiguration.builder()
.readAheadBytes(ONE_KB)
.sequentialPrefetchBase(2.0)
.build();

// For Parquet file - set Sequential policy
OpenFileInformation parquetInfo =
OpenFileInformation.builder()
.objectMetadata(metadata)
.inputPolicy(InputPolicy.Sequential) // This is the key change
.build();

// Create BlockManager for Parquet
BlockManager parquetBlockManager =
new BlockManager(
ObjectKey.builder().s3URI(parquetUri).etag(ETAG).build(),
objectClient,
metadata,
TestTelemetry.DEFAULT,
testConfig,
parquetInfo);

// Setup mock response
when(objectClient.getObject(any(GetRequest.class), any()))
.thenReturn(
CompletableFuture.completedFuture(
ObjectContent.builder().stream(new ByteArrayInputStream(new byte[1024])).build()));

// Make sequential requests
ArgumentCaptor<GetRequest> requestCaptor = ArgumentCaptor.forClass(GetRequest.class);
parquetBlockManager.makeRangeAvailable(0L, 100L, ReadMode.SYNC);
parquetBlockManager.makeRangeAvailable(512L, 100L, ReadMode.SYNC);

verify(objectClient, atLeast(1)).getObject(requestCaptor.capture(), any());

// Verify requests use sequential prefetching
requestCaptor
.getAllValues()
.forEach(
request ->
assertTrue(
request.getRange().getLength() <= testConfig.getReadAheadBytes(),
"Parquet requests should show sequential prefetching"));
Contributor

I think we should move the common code between these tests into separate functions, so that the tests are easier to follow.

@@ -25,7 +25,7 @@
* information and callbacks when opening the file.
*/
@Value
@Builder
@Builder(toBuilder = true)
Contributor

what does this do? and why are we doing it?
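For reference, here is a hand-written sketch of what Lombok's `@Builder(toBuilder = true)` generates, on a simplified stand-in class (the fields are illustrative, not the library's real ones). `toBuilder()` returns a builder pre-populated with the instance's current values, letting callers copy an existing object and change just one field instead of rebuilding it field by field:

```java
// Hand-written equivalent of Lombok's @Builder(toBuilder = true) on a
// simplified stand-in for OpenFileInformation. Field names are illustrative.
public final class OpenFileInfoSketch {
    private final String inputPolicy;
    private final long contentLength;

    private OpenFileInfoSketch(String inputPolicy, long contentLength) {
        this.inputPolicy = inputPolicy;
        this.contentLength = contentLength;
    }

    public String getInputPolicy() { return inputPolicy; }
    public long getContentLength() { return contentLength; }

    public static Builder builder() { return new Builder(); }

    // toBuilder = true adds this method: a builder pre-filled with the
    // current instance's values, for copy-and-modify use.
    public Builder toBuilder() {
        return new Builder().inputPolicy(inputPolicy).contentLength(contentLength);
    }

    public static final class Builder {
        private String inputPolicy;
        private long contentLength;
        public Builder inputPolicy(String p) { this.inputPolicy = p; return this; }
        public Builder contentLength(long l) { this.contentLength = l; return this; }
        public OpenFileInfoSketch build() { return new OpenFileInfoSketch(inputPolicy, contentLength); }
    }
}
```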

@@ -17,8 +17,7 @@

import static org.junit.jupiter.api.Assertions.*;
import static org.mockito.ArgumentMatchers.any;
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.when;
import static org.mockito.Mockito.*;
Contributor

*/
public PhysicalIOImpl(
@NonNull S3URI s3URI,
@NonNull MetadataStore metadataStore,
@NonNull BlobStore blobStore,
@NonNull Telemetry telemetry,
StreamContext streamContext)
OpenFileInformation openFileInformation)
Contributor

@rajdchak Feb 14, 2025

add the @NonNull check

throws IOException {
this.metadataStore = metadataStore;
this.blobStore = blobStore;
this.telemetry = telemetry;
this.streamContext = streamContext;
this.openFileInformation = openFileInformation;
this.streamContext = openFileInformation.getStreamContext();
Contributor

we can remove streamContext from here and only pass openFileInformation to blobStore, which will have the streamContext.

6 participants