
ENH: Non-batched linear regression for high-dimensional problems #3058

Open · wants to merge 17 commits into base: main
Conversation

david-cortes-intel
Contributor

@david-cortes-intel david-cortes-intel commented Feb 3, 2025

Description

This PR adds a preliminary non-batched version of the linear regression (normal equations) algorithm for high-dimensional problems, which computes the aggregates $\mathbf{X}^T \mathbf{X}$ and $\mathbf{X}^T \mathbf{y}$ separately through corresponding calls to BLAS functions.
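To make the description concrete, here is a minimal sketch (not the oneDAL implementation) of the aggregates the PR describes: for a row-major n×p matrix X, it accumulates X^T X and X^T y the way single xsyrk/xgemv calls would, including the initializeResult semantics mentioned later in the PR notes. The function name and signature are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch only: accumulate the normal-equations aggregates X^T X (p x p,
// row-major) and X^T y (length p) for a row-major n x p matrix X.
// With initializeResult == true the outputs are zeroed first; otherwise
// the products are ADDED to the existing contents, which is why callers
// must pre-zero the buffers themselves in that case.
void computeAggregates(const std::vector<double>& X, const std::vector<double>& y,
                       std::size_t n, std::size_t p, bool initializeResult,
                       std::vector<double>& XtX, std::vector<double>& Xty)
{
    if (initializeResult)
    {
        XtX.assign(p * p, 0.0);
        Xty.assign(p, 0.0);
    }
    for (std::size_t r = 0; r < n; ++r)
    {
        for (std::size_t i = 0; i < p; ++i)
        {
            // Row i of X^T X: sum over observations of X[r][i] * X[r][j].
            for (std::size_t j = 0; j < p; ++j) XtX[i * p + j] += X[r * p + i] * X[r * p + j];
            // Component i of X^T y.
            Xty[i] += X[r * p + i] * y[r];
        }
    }
}
```

In the actual PR these loops would be replaced by BLAS calls (syrk for the symmetric X^T X product and gemv for X^T y); the sketch only illustrates what those calls compute.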

The code route here is only meant for high-dimensional problems, where the current approach does not lead to better performance. The idea is to make these thresholds configurable through the same parameters system used for e.g. the batch size in the regular route, but adding new configurable parameters appears to be a much trickier job, so this PR starts with hard-coded thresholds in the meantime.

A few notes:

  • The function that computes the aggregates takes a flag initializeResult; if this flag is false and the results are not zeroed out beforehand, some Bazel tests fail when this code route is used. This was due to a bug in the PR code and has since been fixed.
  • The NumericTable class in which the data is passed has methods to calculate statistics such as column sums, but the inputs carry const qualifiers while the methods that add those statistics modify the inputs, so they cannot be used with const data.
    • Hence, I am calculating those via vector-matrix products with an array of all ones. I wasn't sure whether there are dedicated procedures to allocate aligned arrays, so I initially used a regular unique pointer for them; this has since been changed to use the dedicated TArray class.
  • While the new code route introduced here is meant for high-dimensional data only, I'm enabling it for all input sizes for testing purposes. Note that it works only for row-major data, as it requires contiguous arrays. The tests have been modified to exercise both the regular route and this route for all linear regression cases on CPU.
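The column-sums workaround in the notes above can be sketched as follows: the sums are obtained as the vector-matrix product ones^T · X, which never needs to mutate the (const) input data. This is an illustrative sketch, not the oneDAL code; the function name is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Column sums of a const row-major n x p matrix X, computed as the
// vector-matrix product ones^T * X. This sidesteps the NumericTable
// statistics methods, which cannot accept const data.
std::vector<double> columnSums(const std::vector<double>& X, std::size_t n, std::size_t p)
{
    const std::vector<double> ones(n, 1.0); // the all-ones array from the PR notes
    std::vector<double> sums(p, 0.0);
    for (std::size_t r = 0; r < n; ++r)
        for (std::size_t j = 0; j < p; ++j) sums[j] += ones[r] * X[r * p + j];
    return sums;
}
```

In the PR this product would go through a BLAS gemv call; the nested loops above just spell out what that call computes.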

Checklist to comply with before moving PR from draft:

PR completeness and readability

  • I have reviewed my changes thoroughly before submitting this pull request.
  • I have commented my code, particularly in hard-to-understand areas.
  • Git commit message contains an appropriate signed-off-by string (see CONTRIBUTING.md for details).
  • I have added the respective label(s) to the PR if I have permission to do so.
  • I have resolved any merge conflicts that might occur with the base branch.

Testing

  • I have run it locally and tested the changes extensively.
  • All CI jobs are green or I have provided justification why they aren't.

Tests will be left for a future PR once the thresholds for triggering this mode are made configurable.

Performance

Performance comparisons were shared internally; sklearn_bench was not used, as the situations that trigger this route are rather specific.

@david-cortes-intel
Contributor Author

/intelci: run

@@ -46,14 +46,14 @@ struct OpenBlas<double, cpu>
{
typedef DAAL_INT SizeType;

static void xsyrk(char * uplo, char * trans, DAAL_INT * p, DAAL_INT * n, double * alpha, double * a, DAAL_INT * lda, double * beta, double * ata,
DAAL_INT * ldata)
static void xsyrk(const char * uplo, const char * trans, const DAAL_INT * p, const DAAL_INT * n, const double * alpha, const double * a,
Contributor

I am Ok with the change. But strange that it hadn't triggered any compilation warnings.
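On the lack of warnings: passing non-const data to a pointer-to-non-const parameter is legal C++, so the old signature compiles silently; the problem only surfaces when a caller holds const data. A minimal sketch of the distinction the diff introduces, with hypothetical function names (these are not the oneDAL wrappers):

```cpp
#include <cassert>

// Legacy BLAS-style signature: takes non-const pointers even though it
// only reads them. Callers holding const data would need a const_cast.
static double legacy_dot(double* a, double* b, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; ++i) s += a[i] * b[i];
    return s;
}

// Const-qualified signature, as in the diff above: accepts both const
// and non-const data with no cast.
static double const_dot(const double* a, const double* b, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; ++i) s += a[i] * b[i];
    return s;
}
```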

@david-cortes-intel
Contributor Author

/intelci: run

@david-cortes-intel
Contributor Author

Looks like this route generates some test failures in fp32 in sklearnex, but they are due to comparisons past the 5th decimal or so, which is to be expected for fp32.
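Differences past roughly the 5th decimal are on the order of single-precision round-off, so such comparisons call for a relative tolerance rather than exact equality. A small sketch of that kind of check (the function name and the 1e-5 tolerance are illustrative assumptions, not values from the PR):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Relative-tolerance comparison suitable for fp32 results: treats values
// as equal when their difference is within relTol of their magnitude.
bool approxEqualF32(float a, float b, float relTol = 1e-5f)
{
    return std::fabs(a - b) <= relTol * std::max(std::fabs(a), std::fabs(b));
}
```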

@david-cortes-intel
Contributor Author

/intelci: run

@david-cortes-intel
Contributor Author

@Vika-F I've now made the thresholds for choosing the non-batched route configurable through the parameters system by adding 3 new parameters, but I'm not sure I've added them correctly in all the places where they are necessary; could you take a look?

Also, is there some way that the tests could be expanded to execute the exact same linear regression tests but with custom parameters? (so that it'd trigger this route).

@david-cortes-intel
Contributor Author

/intelci: run

@david-cortes-intel david-cortes-intel changed the title [WIP] ENH: Preliminary version of non-batched linear regression ENH: Preliminary version of non-batched linear regression Feb 12, 2025
@david-cortes-intel david-cortes-intel marked this pull request as ready for review February 12, 2025 10:36
@david-cortes-intel david-cortes-intel changed the title ENH: Preliminary version of non-batched linear regression ENH: Non-batched linear regression for high-dimensional problems Feb 12, 2025