ENH: Non-batched linear regression for high-dimensional problems #3058
base: main
Conversation
/intelci: run
cpp/daal/src/algorithms/linear_model/linear_model_train_normeq_update_impl.i
```diff
@@ -46,14 +46,14 @@ struct OpenBlas<double, cpu>
 {
     typedef DAAL_INT SizeType;

-    static void xsyrk(char * uplo, char * trans, DAAL_INT * p, DAAL_INT * n, double * alpha, double * a, DAAL_INT * lda, double * beta, double * ata,
-                      DAAL_INT * ldata)
+    static void xsyrk(const char * uplo, const char * trans, const DAAL_INT * p, const DAAL_INT * n, const double * alpha, const double * a,
```
I am OK with the change. But it's strange that it didn't trigger any compilation warnings.
…_update_impl.i
Co-authored-by: Victoriya Fedotova <[email protected]>
/intelci: run
Looks like this route generates some test failures in fp32 in sklearnex, but they are due to comparisons after the 5th decimal or so, which is to be expected for fp32.
/intelci: run
@Vika-F I've now made the thresholds for choosing the non-batched route configurable through the parameters system by adding 3 new parameters, but I'm not sure that I've added them correctly in all the places where they are necessary - could you take a look? Also, is there some way to expand the tests to execute the exact same linear regression tests with custom parameters, so that this route would be triggered?
/intelci: run
Description
This PR adds a preliminary non-batched version of the linear regression normal equations algorithm for high-dimensional problems, which calculates the aggregates $\mathbf{X}^T \mathbf{X}$ and $\mathbf{X}^T \mathbf{y}$ separately through corresponding calls to BLAS functions.
This code route is only meant for high-dimensional problems, for which the current approach does not deliver better performance. The idea is to make these thresholds configurable through the same parameters system used for e.g. the batch size in the regular route, but adding new configurable parameters appears to be a much trickier job, so this PR starts with hard-coded thresholds in the meantime.
A few notes:

- ~~The function that computes the aggregates takes a flag `initializeResult`, but if this flag is `false` and the results are not zeroed out beforehand, it leads to some bazel tests failing when this code route is used.~~ This was due to a bug in the PR code and has been fixed by now.
- There are methods that add these statistics in the `NumericTable` class in which the data is passed, but the inputs here have `const` qualifiers and those methods modify the inputs, so they cannot be used with `const` data.
- ~~I wasn't sure if there are dedicated procedures to allocate aligned arrays, so I'm just using a regular unique pointer for them.~~ Changed to use the dedicated `TArray` class.
- ~~While the new code route introduced here is meant for high-dimensional data only, for testing purposes I'm enabling it for all input sizes. It works only for row-major data though, as it requires contiguous arrays.~~ Modified the tests to cover both the regular route and this route for all linear regression cases on CPU.

Checklist to comply with before moving PR from draft:
PR completeness and readability
Testing
Tests will be left for a future PR once the thresholds for triggering this mode are made configurable.
Performance
Performance comparisons were shared internally, without sklearn_bench, as the situations that trigger this route are rather specific.