
DM-46248: Few small improvements for the new query system #1146

Merged 5 commits on Jan 31, 2025
4 changes: 3 additions & 1 deletion python/lsst/daf/butler/direct_query_driver/_driver.py
@@ -311,7 +311,9 @@ def materialize(
#
sql_select, _ = plan.finish_select(return_columns=False)
table = self._exit_stack.enter_context(
self.db.temporary_table(make_table_spec(plan.final_columns, self.db, plan.postprocessing))
self.db.temporary_table(
make_table_spec(plan.final_columns, self.db, plan.postprocessing, make_indices=True)
)
Member:
I recall you saying earlier that adding the indexes before doing the inserts sometimes made the inserts much more expensive. Did that get resolved by other changes, or is it just an overall win even if it's occasionally worse?

And would it be worth it to try to move index creation to just after the insert in this method?

Contributor Author:
That was a false alarm: after I added indices, filling the temp table became terribly slow, but the real reason was that the query planner went into a bad state on that exact query (this was before I dropped DISTINCT). With DISTINCT disabled I have not noticed any significant slowdown due to the indices.
It is true that for a bulk data insert it is better to load the data before adding indices. I tried to implement that, but it is not trivial with our code as it stands; the simple approach did not work, and I did not want to restructure a bunch of code just for that. Still, if we need it later, it should be possible to delay index creation until after the data is loaded.
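For reference, the load-then-index pattern described above is straightforward in standalone SQLAlchemy when the `Index` object is only constructed after the bulk load (a minimal sketch with made-up table and column names; this is not the butler code path, where the `Index` objects are created up front from the table spec):

```python
import sqlalchemy

engine = sqlalchemy.create_engine("sqlite://")
metadata = sqlalchemy.MetaData()
table = sqlalchemy.Table(
    "results", metadata, sqlalchemy.Column("detector", sqlalchemy.Integer)
)
table.create(engine)  # no Index objects exist yet, so none are created

# Bulk-load first...
with engine.begin() as conn:
    conn.execute(table.insert(), [{"detector": i} for i in range(1000)])

# ...then build the index afterwards, once the data is in place.
sqlalchemy.Index("ix_results_detector", table.c.detector).create(engine)
```

This avoids maintaining the index row by row during the insert, which is the cost the comment above is concerned about.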

Contributor Author:

Just in case we want to do index creation separately later: my initial idea was to run the same _convertTableSpec but not add the Index instances to the Table constructor arguments, returning the indices as a separate list instead so they could be iterated over and created manually. It turns out that even if you do not pass an Index to the Table() arguments, the metadata already knows about all indices for that table (since an Index is created for a specific table) and will create the index anyway. So simply delaying it did not work.
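The behavior described here can be reproduced in plain SQLAlchemy (a small sketch with made-up names, independent of _convertTableSpec): merely constructing an `Index` against a table's columns attaches it to that table, so `create_all()` emits it even though it was never passed to `Table()` and no reference to it was kept.

```python
import sqlalchemy

metadata = sqlalchemy.MetaData()
table = sqlalchemy.Table("t", metadata, sqlalchemy.Column("a", sqlalchemy.Integer))

# The Index is never passed to Table() and its reference is discarded,
# yet constructing it against table.c.a registers it with the table.
sqlalchemy.Index("ix_t_a", table.c.a)
assert {ix.name for ix in table.indexes} == {"ix_t_a"}

engine = sqlalchemy.create_engine("sqlite://")
metadata.create_all(engine)  # creates the index along with the table
```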

)
self.db.insert(table, select=sql_select)
if key is None:
26 changes: 24 additions & 2 deletions python/lsst/daf/butler/direct_query_driver/_sql_builders.py
@@ -37,6 +37,8 @@
import sqlalchemy

from .. import ddl
from ..dimensions import DimensionGroup
from ..dimensions._group import SortedSequenceSet
from ..nonempty_mapping import NonemptyMapping
from ..queries import tree as qt
from ._postprocessing import Postprocessing
@@ -638,7 +640,7 @@ def to_select_builder(


def make_table_spec(
columns: qt.ColumnSet, db: Database, postprocessing: Postprocessing | None
columns: qt.ColumnSet, db: Database, postprocessing: Postprocessing | None, *, make_indices: bool = False
) -> ddl.TableSpec:
"""Make a specification that can be used to create a table to store
this query's outputs.
@@ -652,18 +654,22 @@
postprocessing : `Postprocessing`
Struct representing post-query processing in Python, which may
require additional columns in the query results.
make_indices : `bool`, optional
If `True`, add indices for groups of columns.

Returns
-------
spec : `.ddl.TableSpec`
Table specification for this query's result columns (including
those from `postprocessing` and `SqlJoinsBuilder.special`).
"""
indices = _make_table_indices(columns.dimensions) if make_indices else []
results = ddl.TableSpec(
[
columns.get_column_spec(logical_table, field).to_sql_spec(name_shrinker=db.name_shrinker)
for logical_table, field in columns
]
],
indexes=indices,
)
if postprocessing:
for element in postprocessing.iter_missing(columns):
@@ -679,3 +685,19 @@
ddl.FieldSpec(name=SqlSelectBuilder.EMPTY_COLUMNS_NAME, dtype=SqlSelectBuilder.EMPTY_COLUMNS_TYPE)
)
return results


def _make_table_indices(dimensions: DimensionGroup) -> list[ddl.IndexSpec]:

index_columns: list[SortedSequenceSet] = []
for dimension in dimensions.required:
minimal_group = dimensions.universe[dimension].minimal_group.required

for idx in range(len(index_columns)):
if index_columns[idx] <= minimal_group:
index_columns[idx] = minimal_group
break
else:
index_columns.append(minimal_group)

return [ddl.IndexSpec(*columns) for columns in index_columns]
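The subset-merging loop in _make_table_indices above can be illustrated with plain Python sets (a standalone sketch; merge_minimal_groups is a hypothetical name for illustration, not part of the butler API). Whenever a previously collected group is a subset of a newly seen group, it is replaced in place, so only the larger column groups survive as index candidates.

```python
def merge_minimal_groups(groups: list[set[str]]) -> list[set[str]]:
    """Replace any collected group that is a subset of a newly seen group,
    mirroring the `<=` comparison loop in _make_table_indices."""
    merged: list[set[str]] = []
    for group in groups:
        for i, existing in enumerate(merged):
            if existing <= group:  # existing group is covered by the new one
                merged[i] = group
                break
        else:  # no collected group was a subset: record a new index candidate
            merged.append(group)
    return merged
```

For example, a lone {"instrument"} group is absorbed by a later {"instrument", "detector"} group, while an unrelated {"skymap"} group is kept as its own index.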