Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate data size concerns for adding a new index to solo_scores #17

Closed
Tracked by #15
peppy opened this issue Aug 31, 2023 · 3 comments
Closed
Tracked by #15

Investigate data size concerns for adding a new index to solo_scores #17

peppy opened this issue Aug 31, 2023 · 3 comments
Assignees

Comments

@peppy
Copy link
Member

peppy commented Aug 31, 2023

Database structure testing

As per #15, we need an index on (user_id, ruleset_id, beatmap_id) if using the database for operations like ppy/osu-queue-score-statistics#149 (and in the future, ranked score processing and probably more).

I want to test on production data (scaled down slightly) to get a real-world idea of how adding/changing indices affect the size of this table.

Test whether changing primary key composition affects index size

TL;DR it doesn't seem to.

-- original
ALTER TABLE `solo_scores`
DROP PRIMARY KEY,
ADD PRIMARY KEY (`id`, `preserve`, `ruleset_id`, `created_at`);

MySQL root@(none):osu> SELECT
                    ->     table_name AS `Table`,
                    ->     ROUND(data_length / 1024 / 1024) AS `Data MB`,
                    ->     ROUND(index_length / 1024 / 1024) AS `Index MB`
                    -> FROM
                    ->     information_schema.tables
                    -> WHERE
                    ->     table_schema = 'osu' AND
                    ->     table_name = 'solo_scores';
+-------------+---------+----------+
| Table       | Data MB | Index MB |
+-------------+---------+----------+
| solo_scores | 1868    | 237      |
+-------------+---------+----------+


MySQL root@(none):osu> SHOW TABLE STATUS LIKE 'solo_scores';
+-------------+--------+---------+------------+---------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+-------------+------------+--------------------+----------+----------------------------------------------------+---------+
| Name        | Engine | Version | Row_format | Rows    | Avg_row_length | Data_length | Max_data_length | Index_length | Data_free | Auto_increment | Create_time         | Update_time | Check_time | Collation          | Checksum | Create_options                                     | Comment |
+-------------+--------+---------+------------+---------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+-------------+------------+--------------------+----------+----------------------------------------------------+---------+
| solo_scores | InnoDB | 10      | Compressed | 9523175 | 205            | 1958215680  | 0               | 248758272    | 3932160   | 2406267336     | 2023-09-01 01:24:18 | <null>      | <null>     | utf8mb4_0900_ai_ci | <null>   | row_format=COMPRESSED KEY_BLOCK_SIZE=4 partitioned |         |
+-------------+--------+---------+------------+---------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+-------------+------------+--------------------+----------+----------------------------------------------------+---------+
-- remove ruleset_id and see if index size changes
ALTER TABLE `solo_scores`
DROP PRIMARY KEY,
ADD PRIMARY KEY (`id`, `preserve`, `created_at`)

MySQL root@(none):osu> SELECT
                    ->     table_name AS `Table`,
                    ->     ROUND(data_length / 1024 / 1024) AS `Data MB`,
                    ->     ROUND(index_length / 1024 / 1024) AS `Index MB`
                    -> FROM
                    ->     information_schema.tables
                    -> WHERE
                    ->     table_schema = 'osu' AND
                    ->     table_name = 'solo_scores';
+-------------+---------+----------+
| Table       | Data MB | Index MB |
+-------------+---------+----------+
| solo_scores | 1867    | 236      |
+-------------+---------+----------+

MySQL root@(none):osu> SHOW TABLE STATUS LIKE 'solo_scores';
+-------------+--------+---------+------------+---------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+-------------+------------+--------------------+----------+----------------------------------------------------+---------+
| Name        | Engine | Version | Row_format | Rows    | Avg_row_length | Data_length | Max_data_length | Index_length | Data_free | Auto_increment | Create_time         | Update_time | Check_time | Collation          | Checksum | Create_options                                     | Comment |
+-------------+--------+---------+------------+---------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+-------------+------------+--------------------+----------+----------------------------------------------------+---------+
| solo_scores | InnoDB | 10      | Compressed | 9577016 | 204            | 1957953536  | 0               | 247709696    | 5505024   | 2406267336     | 2023-09-01 02:58:21 | <null>      | <null>     | utf8mb4_0900_ai_ci | <null>   | row_format=COMPRESSED KEY_BLOCK_SIZE=4 partitioned |         |
+-------------+--------+---------+------------+---------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+-------------+------------+--------------------+----------+----------------------------------------------------+---------+

Test whether existing index needs id DESC at end (or whether it can use the implicity primary key)

TL;DR it doesn't, it can use the primary key

explain SELECT * FROM solo_scores WHERE user_id = 2 AND ruleset_id = 0 ORDER BY id DESC LIMIT 50;

DROP INDEX `user_ruleset_id_index`;
ADD INDEX `user_ruleset_id_index` (`user_id`,`ruleset_id`);

explain SELECT * FROM solo_scores WHERE user_id = 2 AND ruleset_id = 0 ORDER BY id DESC LIMIT 50;

Test how badly inserting beatmap_id into the existing index (KEY user_ruleset_id_index (user_id,ruleset_id,id DESC)) breaks things (100% it will)

TL;DR it does, as expected. We can get around this by changing the existing KEY beatmap_id to include user_id at the end. This comes at almost zero storage cost due to the high cardinality of users per beatmap (almost completely unique). So we're just changing the ordering of the index rather than adding new overheads.

explain SELECT * FROM solo_scores WHERE user_id = 2 AND ruleset_id = 0 ORDER BY id DESC LIMIT 50;


MySQL root@(none):osu> explain SELECT `id` FROM `solo_scores_p` WHERE `user_id` = 19743981 AND `ruleset_id` = 0 and preserve in (0,1) ORDER BY `id`limit 10;
+----+-------------+---------------+------------+------+---------------+---------------+---------+-------------+---------+----------+--------------------------+
| id | select_type | table         | partitions | type | possible_keys | key           | key_len | ref         | rows    | filtered | Extra                    |
+----+-------------+---------------+------------+------+---------------+---------------+---------+-------------+---------+----------+--------------------------+
| 1  | SIMPLE      | solo_scores_p | p0catch,p1 | ref  | user_id_index | user_id_index | 6       | const,const | 1700684 | 20.0     | Using where; Using index |
+----+-------------+---------------+------------+------+---------------+---------------+---------+-------------+---------+----------+--------------------------+

10 rows in set
Time: 0.005s


DROP INDEX `user_ruleset_id_index`;
ADD INDEX `user_ruleset_id_index` (`user_id`,`ruleset_id`, `beatmap_id`, `id` DESC);

explain SELECT * FROM solo_scores WHERE user_id = 2 AND ruleset_id = 0 ORDER BY id DESC LIMIT 50;

MySQL root@(none):osu> explain SELECT `id` FROM `solo_scores_p` WHERE `user_id` = 19743981 AND `ruleset_id` = 0 and preserve
                    -> in (0,1) ORDER BY `id`limit 10;
+----+-------------+---------------+------------+-------+---------------+---------+---------+--------+------+----------+-------------+
| id | select_type | table         | partitions | type  | possible_keys | key     | key_len | ref    | rows | filtered | Extra       |
+----+-------------+---------------+------------+-------+---------------+---------+---------+--------+------+----------+-------------+
| 1  | SIMPLE      | solo_scores_p | p0catch,p1 | index | user_id_index | PRIMARY | 16      | <null> | 62   | 3.18     | Using where |
+----+-------------+---------------+------------+-------+---------------+---------+---------+--------+------+----------+-------------+

10 rows in set
Time: 0.498s

Test whether including ruleset_id in the constructed partitions (as touched on in #16) means we can remove it from indices

TL;DR correct, but I'm avoiding this because of concerns for future extensibility (custom rulesets). Also there's no storage saving on disk.

explain select * from solo_scores where user_id = 4937439 and ruleset_id = 3 and beatmap_id = 75;

ALTER TABLE `solo_scores`
    PARTITION BY RANGE COLUMNS(`preserve`, `ruleset_id`, created_at)
        (
        PARTITION p0r0catch VALUES LESS THAN (0,0,MAXVALUE) ENGINE = InnoDB,
        PARTITION p0r1catch VALUES LESS THAN (0,1,MAXVALUE) ENGINE = InnoDB,
        PARTITION p0r2catch VALUES LESS THAN (0,2,MAXVALUE) ENGINE = InnoDB,
        PARTITION p0r3catch VALUES LESS THAN (0,3,MAXVALUE) ENGINE = InnoDB,
        PARTITION p1r0 VALUES LESS THAN (1,0,MAXVALUE) ENGINE = InnoDB,
        PARTITION p1r1 VALUES LESS THAN (1,1,MAXVALUE) ENGINE = InnoDB,
        PARTITION p1r2 VALUES LESS THAN (1,2,MAXVALUE) ENGINE = InnoDB,
        PARTITION p1r3 VALUES LESS THAN (1,3,MAXVALUE) ENGINE = InnoDB
        );

-- working alternative?
ALTER TABL�E `solo_scores`
 PARTITION BY list COLUMNS(`preserve`, `ruleset_id`)
     (
     PARTITION p0r0catch VALUES IN ((0,0)),
     PARTITION p0r1catch VALUES IN ((0,1)),
     PARTITION p0r2catch VALUES IN ((0,2)),
     PARTITION p0r3catch VALUES IN ((0,3)),
     PARTITION p1r0 VALUES IN ((1,0)),
     PARTITION p1r1 VALUES IN ((1,1)),
     PARTITION p1r2 VALUES IN ((1,2)),
     PARTITION p1r3 VALUES IN ((1,3))
     );

+-------------+----------------+---------+----------+
| Table       | AVG_ROW_LENGTH | Data MB | Index MB |
+-------------+----------------+---------+----------+
| solo_scores | 178            | 1635    | 342      |
+-------------+----------------+---------+----------+

MySQL root@(none):osu> explain select * from solo_scores where user_id = 4937439 and ruleset_id = 3 and beatmap_id = 75 and preserve = 1\G;
***************************[ 1. row ]***************************
id            | 1
select_type   | SIMPLE
table         | solo_scores
partitions    | p1r3
type          | ref
possible_keys | user_ruleset_id_index,beatmap_id
key           | user_ruleset_id_index
key_len       | 6
ref           | const,const
rows          | 1
filtered      | 5.0
Extra         | Using index condition; Using where

ALTER TABLE `solo_scores`
    DROP INDEX `user_ruleset_id_index`,
    ADD KEY `user_id_index` (`user_id`,`id` DESC);


+-------------+----------------+---------+----------+
| Table       | AVG_ROW_LENGTH | Data MB | Index MB |
+-------------+----------------+---------+----------+
| solo_scores | 178            | 1635    | 342      |
+-------------+----------------+---------+----------+

explain select * from solo_scores where user_id = 4937439 and ruleset_id = 3 and beatmap_id = 75;


Check why we have a beatmap_id index

  • Used when a beatmap changes approved state (osu-web).

Notes

  • Using LIST partitioning, mysql can't determine it should be using specific partitions when preserve= is not in the query:
MySQL root@(none):osu> explain select id from solo_scores where user_id = 19743981 and ruleset_id = 0 order by id desc limit 10\G;
***************************[ 1. row ]***************************
id            | 1
select_type   | SIMPLE
table         | solo_scores
partitions    | p0r0,p0r1,p0r2,p0r3,p1r0,p1r1,p1r2,p1r3
type          | index
possible_keys | user_id_index
key           | PRIMARY
key_len       | 16
ref           | <null>
rows          | 64
filtered      | 1.56
Extra         | Using where; Backward index scan


MySQL root@(none):osu> explain select id from solo_scores where user_id = 19743981 and ruleset_id = 0 and preserve in (0,1) order by id desc limit 10\G;
***************************[ 1. row ]***************************
id            | 1
select_type   | SIMPLE
table         | solo_scores
partitions    | p0r0,p1r0
type          | index
possible_keys | user_id_index
key           | PRIMARY
key_len       | 16
ref           | <null>
rows          | 56
filtered      | 0.35
Extra         | Using where; Backward index scan
  • Enabling partitioning seems to reduce DATA_SIZE by around 33%. But there's no documentation anywhere saying that this should be the case. What gives?
+-----------------+----------------+---------+----------+
| Table           | AVG_ROW_LENGTH | Data MB | Index MB |
+-----------------+----------------+---------+----------+
| solo_scores     | 300            | 2835    | 369      |
| solo_scores_p   | 178            | 1780    | 359      |
| solo_scores_p_r | 178            | 1782    | 359      |
+-----------------+----------------+---------+----------+

(fixed by running ALTER TABLE solo_scores FORCE...)

@peppy peppy changed the title Test data size concerns for adding a new index Investigate data size concerns for adding a new index to solo_scores Aug 31, 2023
@peppy peppy self-assigned this Aug 31, 2023
@peppy
Copy link
Member Author

peppy commented Sep 1, 2023

Based on investigations above, the change in structure I'm looking to apply is:

--- old.sql	2023-09-01 18:17:38
+++ new.sql	2023-09-01 18:18:47
@@ -8,9 +8,9 @@
   `preserve` tinyint(1) NOT NULL DEFAULT '0',
   `created_at` datetime NOT NULL,
   `updated_at` timestamp NULL DEFAULT NULL,
-  PRIMARY KEY (`id`,`ruleset_id`,`preserve`,`created_at`),
-  KEY `user_ruleset_id_index` (`user_id`,`ruleset_id`,`id` DESC),
-  KEY `beatmap_id` (`beatmap_id`)
+  PRIMARY KEY (`id`,`preserve`,`created_at`),
+  KEY `user_ruleset_index` (`user_id`,`ruleset_id`),
+  KEY `beatmap_user_index` (`beatmap_id`,`user_id`)
 ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=4
 /*!50500 PARTITION BY RANGE  COLUMNS(`preserve`,created_at)
 (PARTITION p0catch VALUES LESS THAN (0,MAXVALUE) ENGINE = InnoDB,
  • Index added to allow lookups of a user's scores on a specific beatmap (as required for Add support for increasing rank counts osu-queue-score-statistics#149). Note that the order of this index keeps beatmap_id first to aid in operations on a whole beatmap (ie. deleting all scores).
  • id DESC removed from user_ruleset_index – it's not required as it's in the primary key and implicitly available at the end of the index. This doesn't reduce the index size, so MySQL was likely doing this optimisation internally.
  • Removed ruleset_id from primary key. It wasn't being used and will not be used in partitioning scheme due to extensibility concerns (see OP for more commentary).

This is a first checkpoint, I still have some further smaller changes to test:

TODO:

@peppy
Copy link
Member Author

peppy commented Sep 1, 2023

Updated with TODO changes applied:

--- old.sql	2023-09-01 18:17:38
+++ new.sql	2023-09-01 20:03:04
@@ -6,12 +6,12 @@
   `data` json NOT NULL,
   `has_replay` tinyint(1) DEFAULT '0',
   `preserve` tinyint(1) NOT NULL DEFAULT '0',
-  `created_at` datetime NOT NULL,
-  `updated_at` timestamp NULL DEFAULT NULL,
-  PRIMARY KEY (`id`,`ruleset_id`,`preserve`,`created_at`),
-  KEY `user_ruleset_id_index` (`user_id`,`ruleset_id`,`id` DESC),
-  KEY `beatmap_id` (`beatmap_id`)
+  `created_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
+  `updated_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
+  PRIMARY KEY (`id`,`preserve`,`updated_at`),
+  KEY `user_ruleset_index` (`user_id`,`ruleset_id`),
+  KEY `beatmap_user_index` (`beatmap_id`,`user_id`)
 ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=4
-/*!50500 PARTITION BY RANGE  COLUMNS(`preserve`,created_at)
+/*!50500 PARTITION BY RANGE COLUMNS(`preserve`,`updated_at`)
 (PARTITION p0catch VALUES LESS THAN (0,MAXVALUE) ENGINE = InnoDB,
  PARTITION p1 VALUES LESS THAN (MAXVALUE,MAXVALUE) ENGINE = InnoDB) */

@peppy
Copy link
Member Author

peppy commented Sep 4, 2023

Bonus round: compression testing to ensure we have things fine tuned

Compression testing

Time values are from ALTER. In general this seems very optimised, so the actual overheads for INSERT etc. operations would be higher.

# key_size=16
# Time: 268.081s
-rw-r-----  1 dean  admin  6069157888 Sep  4 12:35 solo_scores_p#p#p0catch.ibd
-rw-r-----  1 dean  admin  3179282432 Sep  4 12:35 solo_scores_p#p#p1.ibd

# key_size=8
# Time: 195.931s

-rw-r-----  1 dean  admin  3036676096 Sep  4 11:33 solo_scores_p#p#p0catch.ibd
-rw-r-----  1 dean  admin  1598029824 Sep  4 11:33 solo_scores_p#p#p1.ibd

# key_size=4
# Time: 204.383s

-rw-r-----  1 dean  admin  1606418432 Sep  4 12:20 solo_scores_p#p#p0catch.ibd
-rw-r-----  1 dean  admin   872415232 Sep  4 12:20 solo_scores_p#p#p1.ibd

# key_size=2
# Time: 249.902s

-rw-r-----  1 dean  admin  1602224128 Sep  4 12:24 solo_scores_p#p#p0catch.ibd
-rw-r-----  1 dean  admin  1019215872 Sep  4 12:24 solo_scores_p#p#p1.ibd

# row_format=compact
# Time: 122.149s

-rw-r-----  1 dean  admin  6069157888 Sep  4 12:35 solo_scores_p#p#p0catch.ibd
-rw-r-----  1 dean  admin  3179282432 Sep  4 12:35 solo_scores_p#p#p1.ibd

# row_format=dynamic
# Time: 58.805s

-rw-r-----  1 dean  admin  6480199680 Sep  4 12:42 solo_scores_p#p#p0catch.ibd
-rw-r-----  1 dean  admin  3376414720 Sep  4 12:42 solo_scores_p#p#p1.ibd

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant