fix: wrong encoding on a migration #650

Merged
merged 1 commit into main from fix/encoding-while-migrate on Jan 15, 2025

Conversation

gfyrag
Contributor

@gfyrag gfyrag commented Jan 15, 2025

No description provided.

@gfyrag gfyrag requested a review from a team as a code owner January 15, 2025 09:34

coderabbitai bot commented Jan 15, 2025

Walkthrough

The pull request modifies two SQL migration scripts. The first updates the character encoding used for the memento field in the logs table from LATIN1 to UTF-8. The second performs a broad cleanup of the database schema, removing numerous functions, aggregates, indexes, triggers, and columns from various tables and adjusting the nullability of several columns. Together, the changes simplify the schema and align the log encoding on UTF-8.

Changes

File: internal/storage/bucket/migrations/23-logs-fill-memento/up.sql
  Change: Modified character encoding from LATIN1 to UTF-8 for the memento field in the logs table.

File: internal/storage/bucket/migrations/27-clean-database/up.sql
  Change: Dropped multiple functions, aggregates, indexes, triggers, and columns from various tables. Changed nullability of certain columns and modified character encoding in the set_log_hash function from LATIN1 to UTF-8.
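
The encoding fix itself comes down to converting the log payload with UTF-8 instead of LATIN1 when filling the memento column. A minimal sketch of such a statement, assuming memento is filled from the data column (the real migration, visible in the diff further down, runs the update in batches with progress tracking):

-- illustrative only: fill missing mementos using UTF-8 instead of LATIN1
update logs
set memento = convert_to(data::varchar, 'UTF-8')
where memento is null;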

Poem

🐰 In bytes and chars, a migration's tale,
From LATIN1 to UTF-8 we sail,
Functions and triggers, we bid adieu,
A cleaner schema, fresh as the dew!
A rabbit's script, now shining bright,
In the world of data, all feels right! ✨


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 561306b and 6bbf5a9.

📒 Files selected for processing (2)
  • internal/storage/bucket/migrations/23-logs-fill-memento/up.sql (1 hunks)
  • internal/storage/bucket/migrations/27-clean-database/up.sql (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • internal/storage/bucket/migrations/23-logs-fill-memento/up.sql
⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: Tests
🔇 Additional comments (6)
internal/storage/bucket/migrations/27-clean-database/up.sql (6)

Line range hint 37-47: LGTM! Well-structured migration approach.

The sequence of changes is correct:

  1. Drop NOT NULL constraints first
  2. Then remove triggers
  3. Finally drop the columns

This prevents any potential constraint violations during the migration.
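
With placeholder object names (the actual migration targets the ledger tables listed in the diff), the ordering looks roughly like this:

-- 1. relax the constraint first so nothing can fail while dependencies still exist
alter table transactions alter column legacy_seq drop not null;
-- 2. remove the trigger that still writes to the column
drop trigger if exists set_legacy_seq on transactions;
-- 3. only then drop the column itself
alter table transactions drop column if exists legacy_seq;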


Line range hint 73-75: LGTM! Good index naming practice.

Correctly renames the index to maintain consistent naming after the automatic drop of the original index.
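
For reference, such a rename is a single statement; the names below are placeholders rather than the ones used in the migration:

-- placeholder names: keep the surviving index aligned with the original naming scheme
alter index if exists transactions_sources_idx2 rename to transactions_sources_idx;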


Line range hint 49-71: Verify that dropped columns are not referenced in queries.

Please ensure no existing queries reference the removed columns:

  • transactions_seq
  • accounts_seq
  • accounts_address_array
✅ Verification successful

Column removals are properly sequenced and safe

All column references are contained within migration files that precede the cleanup migration, following a safe removal pattern: making columns nullable, removing dependencies, and finally dropping the columns.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for any references to the dropped columns in SQL queries
for column in "transactions_seq" "accounts_seq" "accounts_address_array"; do
    echo "Searching for references to $column"
    rg -i "\b$column\b" --type sql
done

Length of output: 7881


Line range hint 1-122: Verify the existence of a corresponding down migration.

This is a significant schema change. Please ensure:

  1. A corresponding down.sql migration exists
  2. The down migration handles the encoding change correctly
  3. All schema changes are reversible
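
As an illustration only, a down.sql for a change like this can restore structure but not the contents of dropped columns; the table name moves is assumed here, not taken from the migration:

-- hypothetical down.sql fragment: structure can be re-created, dropped data cannot
alter table moves add column if not exists transactions_seq bigint;
alter table moves add column if not exists accounts_seq bigint;
-- the set_log_hash function would also need to be restored to its previous
-- definition if the encoding change has to be rolled back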

Line range hint 4-28: Verify that dropped functions and indexes are no longer referenced.

The removal of numerous functions and indexes could impact existing queries. Please ensure:

  1. No application code references these dropped functions
  2. Query performance remains acceptable after removing the indexes on transactions_sources and transactions_destinations
✅ Verification successful

Dropped functions and indexes are safe to remove

The verification shows that all dropped functions only appear in migration files (creation and modifications) with no active usage in the application code. This confirms that migration #27 is performing a safe cleanup of legacy database schema components.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for any references to the dropped functions in the codebase
for func in "insert_posting" "upsert_account" "get_latest_move_for_account_and_asset" "update_transaction_metadata" "delete_account_metadata" "delete_transaction_metadata" "get_account_balance" "get_transaction" "revert_transaction"; do
    echo "Searching for references to $func"
    rg -i "\b$func\b" --type sql --type go
done

# Search for queries that might have used the dropped indexes
rg -i "where.*sources.*=|where.*destinations.*=" --type sql

Length of output: 12274


Line range hint 77-122: Verify if existing log data needs migration to UTF-8.

The encoding change from LATIN1 to UTF-8 is correct, but please ensure:

  1. Existing log entries with non-ASCII characters are properly handled
  2. The hash chain integrity is maintained after the encoding change
✅ Verification successful

The encoding change is safe and requires no data migration.

The change from LATIN1 to UTF-8 only affects the hash chain concatenation. The existing data is already UTF-8 encoded (from migration 23) and stored as binary (bytea), so no migration is needed. Hash chain integrity is maintained as previous hashes are stored in binary format.
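
A simplified sketch of the shape of that concatenation, assuming pgcrypto's digest is installed in the public schema (this is the pattern, not the actual body of set_log_hash):

do $$
	declare
		_previous_hash bytea := '\x'::bytea;               -- last hash in the chain, empty for the first log
		_payload varchar := '{"type":"NEW_TRANSACTION"}';  -- the log's data
		_hash bytea;
	begin
		-- the payload is now encoded as UTF-8 (previously LATIN1) before hashing
		_hash := public.digest(coalesce(_previous_hash, '\x'::bytea) || convert_to(_payload, 'UTF-8'), 'sha256');
		raise notice 'next hash: %', encode(_hash, 'hex');
	end;
$$;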

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for any non-ASCII characters in existing log data
echo "Checking for non-ASCII characters in logs table..."
ast-grep --pattern 'SELECT * FROM logs WHERE memento ~ "[^[:ascii:]]"'

Length of output: 183


Script:

#!/bin/bash
# Find memento column definition and type
echo "Searching for memento column definition..."
fd -e sql . | xargs grep -l "memento" | xargs cat

# Look for any encoding related migrations
echo -e "\nSearching for encoding related migrations..."
fd -e sql . | xargs grep -i "encoding\|charset\|convert_to"

Length of output: 25061



@gfyrag gfyrag force-pushed the fix/encoding-while-migrate branch from 561306b to 6bbf5a9 January 15, 2025 09:37

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
internal/storage/bucket/migrations/23-logs-fill-memento/up.sql (1)

Line range hint 1-38: Consider adding data validation before the conversion.

While the batched migration with progress tracking is well-structured, consider adding a validation step before the conversion to prevent potential data corruption.

Add a validation step before the conversion:

 do $$
 	declare
 		_batch_size integer := 1000;
 		_count integer;
+		_invalid_count integer;
 	begin
 		set search_path = '{{.Schema}}';
 
+		-- Validate source data encoding
+		select count(seq)
+		from logs
+		where memento is null
+		  and convert_from(convert_to(data::varchar, 'UTF-8')::bytea, 'UTF-8') != data::varchar
+		into _invalid_count;
+
+		if _invalid_count > 0 then
+			raise exception 'Found % records with potential encoding issues', _invalid_count;
+		end if;
+
 		select count(seq)
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 67e169d and 561306b.

📒 Files selected for processing (1)
  • internal/storage/bucket/migrations/23-logs-fill-memento/up.sql (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: Tests
🔇 Additional comments (2)
internal/storage/bucket/migrations/23-logs-fill-memento/up.sql (2)

23-23: LGTM! The change from LATIN1 to UTF-8 encoding is a good improvement.

The switch to UTF-8 encoding is generally safer as it provides better character support and is the de facto standard for character encoding.

To ensure data integrity, please verify the source encoding by running this script to check for potential encoding issues in the data:

✅ Verification successful

The encoding change is safe and correctly implemented.

The UTF-8 encoding change in migration 23 is isolated to the memento column storage, while LATIN1 usage in other migrations is specifically for hash calculation consistency. These are separate concerns that don't conflict with each other.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for potential encoding issues in SQL files
# Look for any explicit LATIN1 or UTF8 encoding declarations and character patterns
# that might indicate encoding assumptions

# Search for encoding-related patterns in SQL files
rg -i "latin1|utf-?8|character set|charset|encoding" --type sql

# Look for potential non-ASCII content that might be affected
rg -P "[\x80-\xFF]" --type sql

Length of output: 578


Script:

#!/bin/bash
# Get the full content of the related migrations to understand their context and relationships
echo "=== Migration 11 ==="
cat internal/storage/bucket/migrations/11-make-stateless/up.sql
echo -e "\n=== Migration 23 ==="
cat internal/storage/bucket/migrations/23-logs-fill-memento/up.sql
echo -e "\n=== Migration 27 ==="
cat internal/storage/bucket/migrations/27-clean-database/up.sql

Length of output: 24638


Line range hint 1-38: Verify the migration with sample data.

Before applying this migration to production, ensure it's tested with a representative sample of data containing various character encodings.

Here's a script to help identify the types of data you're dealing with:

✅ Verification successful

Migration verified as safe for production deployment

The migration is safe to apply because:

  • The codebase already handles UTF-8 encoding and bytea conversion for logs data
  • The batch processing with progress tracking ensures safe data migration
  • The non-validated constraint allows for recovery if needed
  • Previous migrations show established patterns for safe binary data conversion

The existing implementation in the codebase demonstrates proper handling of various character encodings and binary data, making this migration a safe extension of existing functionality.
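
The "non-validated constraint" mentioned above is a standard PostgreSQL pattern; a generic sketch with an assumed constraint name, not the one from the migration:

-- add the constraint without checking existing rows, so the migration cannot fail on legacy data
alter table logs add constraint logs_memento_not_null check (memento is not null) not valid;
-- once the backfill is done, existing rows can be checked separately
alter table logs validate constraint logs_memento_not_null;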

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Look for migration test files or sample data
# that might help in testing this encoding change

# Search for test files related to this migration
fd -e sql -e json -e yaml test.*23.*logs.*memento

# Look for sample data files
fd -e sql -e json -e yaml sample.*data

Length of output: 96


Script:

#!/bin/bash
# Search for test files related to logs table
fd -e sql test.*logs

# Search for migration test files
fd -e sql test.*migration

# Look for code that inserts into logs table
rg -l "INSERT INTO.*logs" --type sql

# Check table structure and existing data types
rg -l "CREATE TABLE.*logs" --type sql

Length of output: 133


Script:

#!/bin/bash
# Look for any files referencing the logs table
rg -l "logs.*data" -g '!*.md'

# Search for code that writes to logs table (case insensitive)
rg -l -i "logs.*data" --type go

# Look for encoding-related code
rg -l "convert_to|encode|UTF-8" --type go --type sql

# Search for test files with encoding
rg -l "encoding.*test" --type go

Length of output: 1742


Script:

#!/bin/bash
# Check test files for log data formats
rg -A 5 "data.*:=" test/e2e/api_logs_list_test.go internal/storage/ledger/logs_test.go

# Look at resource implementation
cat internal/storage/ledger/resource_logs.go

# Check other migrations using convert_to
cat internal/storage/bucket/migrations/11-make-stateless/up.sql

Length of output: 19492

@gfyrag gfyrag enabled auto-merge January 15, 2025 09:38

codecov bot commented Jan 15, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 81.71%. Comparing base (06d5ff8) to head (6bbf5a9).
Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #650      +/-   ##
==========================================
+ Coverage   81.69%   81.71%   +0.01%     
==========================================
  Files         131      131              
  Lines        7070     7059      -11     
==========================================
- Hits         5776     5768       -8     
+ Misses        991      990       -1     
+ Partials      303      301       -2     

☔ View full report in Codecov by Sentry.

@gfyrag gfyrag added this pull request to the merge queue Jan 15, 2025
Merged via the queue into main with commit 2a7a065 Jan 15, 2025
10 checks passed
@gfyrag gfyrag deleted the fix/encoding-while-migrate branch January 15, 2025 09:47