forked from apache/arrow
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
apacheGH-37096: [MATLAB] Add utility which makes valid MATLAB table variable names from an arbitrary list of strings (apache#37098)

### Rationale for this change

To make it possible to safely convert Arrow Schema field names to corresponding MATLAB `table` variable names, it would be helpful to add a utility which can take an arbitrary list of strings and return a set of valid MATLAB `table` variable names, which are (1) unique, (2) non-empty, and (3) do not conflict with the "reserved" variable names "Properties", "VariableNames", "RowNames", and ":". An additional restriction is that variable names must have 63 or fewer characters.

### What changes are included in this PR?

1. Added a new function called `arrow.tabular.internal.makeValidVariableNames` that accepts an arbitrary list of strings and returns valid MATLAB `table` variable names.

```matlab
>> originalVarNames = ["", "Properties", ":", "ValidVar", "ValidVar"];
>> validVarNames = arrow.tabular.internal.makeValidVariableNames(originalVarNames)

validVarNames =

  1×5 string array

    "Var1"    "Properties_1"    ":_1"    "ValidVar"    "ValidVar_1"
```

2. Added a new function called `arrow.tabular.internal.makeValidDimensionNames` that returns valid table dimension names with respect to a list of valid variable names. In MATLAB the default `table` dimension names are `"Row"` and `"Variables"`, but they must not conflict with any variable names. In other words, they must be unique with respect to the variable names.

```matlab
>> validVarNames = ["Row" "Test" "Variables"];
>> validDimNames = arrow.tabular.internal.makeValidDimensionNames(validVarNames)

validDimNames =

  1×2 string array

    "Row_1"    "Variables_1"
```

To summarize, MATLAB `table`s cannot have arbitrary variable names. For example, `"Properties"`, `"RowNames"`, `"VariableNames"`, and `":"` are all disallowed. Variable names must also be unique with respect to each other and must be between 1 and 63 characters in length.

### Are these changes tested?

Yes. Added the following new test classes:

1. `tMakeValidVariableNames.m`
2. `tMakeValidDimensionNames.m`

### Are there any user-facing changes?

No.

### Future Directions

1. In a follow-up PR, we will integrate `makeValidVariableNames` and `makeValidDimensionNames` into the `table()` and `toMATLAB()` methods of `arrow.tabular.RecordBatch`.

### Notes

Thanks to @kevingurney for help writing the test cases!

* Closes: apache#37096

Lead-authored-by: Sarah Gilmore <[email protected]>
Co-authored-by: Kevin Gurney <[email protected]>
Signed-off-by: Kevin Gurney <[email protected]>
1 parent d9fdb0a commit 7ecc709
Showing 4 changed files with 385 additions and 0 deletions.
28 changes: 28 additions & 0 deletions
matlab/src/matlab/+arrow/+tabular/+internal/makeValidDimensionNames.m
@@ -0,0 +1,28 @@
```matlab
%MAKEVALIDDIMENSIONNAMES Makes valid table dimension names with
% respect to the variable names.

% Licensed to the Apache Software Foundation (ASF) under one or more
% contributor license agreements. See the NOTICE file distributed with
% this work for additional information regarding copyright ownership.
% The ASF licenses this file to you under the Apache License, Version
% 2.0 (the "License"); you may not use this file except in compliance
% with the License. You may obtain a copy of the License at
%
% http://www.apache.org/licenses/LICENSE-2.0
%
% Unless required by applicable law or agreed to in writing, software
% distributed under the License is distributed on an "AS IS" BASIS,
% WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
% implied. See the License for the specific language governing
% permissions and limitations under the License.
function dimnames = makeValidDimensionNames(varnames)

    dimnames = ["Row" "Variables"];

    numvars = numel(varnames);
    indicesToUniqify = [numvars + 1 numvars + 2];

    strs = matlab.lang.makeUniqueStrings([varnames dimnames], indicesToUniqify);
    dimnames = strs(indicesToUniqify);
end
```
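The index-restricted call to `matlab.lang.makeUniqueStrings` used by this function can be tried in isolation. The following is a minimal sketch that uses only the built-in function; the names `defaults`, `idx`, and `combined` are illustrative and do not appear in the commit. The input and the expected dimension names are taken from the example in the PR description.

```matlab
% Variable names that collide with both default dimension names
% (same input as the example in the PR description).
varnames = ["Row" "Test" "Variables"];
defaults = ["Row" "Variables"];

% Restrict renaming to the two trailing elements (the dimension names),
% so the variable names themselves are never altered.
idx = numel(varnames) + (1:2);
combined = matlab.lang.makeUniqueStrings([varnames defaults], idx);

disp(combined(1:numel(varnames)))  % variable names, unchanged
disp(combined(idx))                % "Row_1" "Variables_1" per the PR description
```

Passing indices as the second argument is what keeps the variable names fixed: only the elements at those positions are candidates for a numeric suffix.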
36 changes: 36 additions & 0 deletions
matlab/src/matlab/+arrow/+tabular/+internal/makeValidVariableNames.m
@@ -0,0 +1,36 @@
```matlab
%MAKEVALIDVARIABLENAMES Makes valid table variable names.

% Licensed to the Apache Software Foundation (ASF) under one or more
% contributor license agreements. See the NOTICE file distributed with
% this work for additional information regarding copyright ownership.
% The ASF licenses this file to you under the Apache License, Version
% 2.0 (the "License"); you may not use this file except in compliance
% with the License. You may obtain a copy of the License at
%
% http://www.apache.org/licenses/LICENSE-2.0
%
% Unless required by applicable law or agreed to in writing, software
% distributed under the License is distributed on an "AS IS" BASIS,
% WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
% implied. See the License for the specific language governing
% permissions and limitations under the License.
function [varnames, modified] = makeValidVariableNames(varnames)
    arguments
        varnames(1, :) string
    end

    reservedNames = ["Properties", "VariableNames", "RowNames", ":"];

    [varnames, replacedVars] = replaceEmptyVariableNames(varnames);
    [varnames, madeUnique] = matlab.lang.makeUniqueStrings(varnames, reservedNames, 63);

    modified = replacedVars || any(madeUnique);
end

function [varnames, modified] = replaceEmptyVariableNames(varnames)
    emptyIndices = find(varnames == "");
    modified = any(emptyIndices);
    if modified
        varnames(emptyIndices) = compose("Var%d", emptyIndices);
    end
end
```
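The second step of this function relies on the excluded-strings and maximum-length arguments of `matlab.lang.makeUniqueStrings`. Below is a minimal sketch of that call on its own, assuming only the built-in function; the input list and the names `fixed` and `changed` are made up for illustration. The expected behavior noted in the comments follows from the PR description and the `VariableNameLengthMax` test in this commit.

```matlab
% Reserved table names, copied from makeValidVariableNames.
reserved = ["Properties", "VariableNames", "RowNames", ":"];

% Illustrative input: one reserved name, one duplicate pair, and one
% name that is 64 characters long (one more than the 63-character limit).
names = ["Properties", "ValidVar", "ValidVar", string(repmat('a', 1, 64))];

% A string array as the second argument is treated as a list of names to
% avoid; the third argument caps the length of each returned name.
[fixed, changed] = matlab.lang.makeUniqueStrings(names, reserved, 63);

disp(fixed(1:3))           % "Properties_1" "ValidVar" "ValidVar_1"
disp(strlength(fixed(4)))  % 63, the over-long name is truncated
disp(changed)              % logical flags marking the renamed or truncated elements
```

Collapsing `changed` with `any`, as the function does, gives callers a single flag telling them whether the returned names differ from the input.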
@@ -0,0 +1,81 @@
```matlab
%TMAKEVALIDDIMENSIONNAMES Unit tests for
% arrow.tabular.internal.makeValidDimensionNames.

% Licensed to the Apache Software Foundation (ASF) under one or more
% contributor license agreements. See the NOTICE file distributed with
% this work for additional information regarding copyright ownership.
% The ASF licenses this file to you under the Apache License, Version
% 2.0 (the "License"); you may not use this file except in compliance
% with the License. You may obtain a copy of the License at
%
% http://www.apache.org/licenses/LICENSE-2.0
%
% Unless required by applicable law or agreed to in writing, software
% distributed under the License is distributed on an "AS IS" BASIS,
% WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
% implied. See the License for the specific language governing
% permissions and limitations under the License.

classdef tMakeValidDimensionNames < matlab.unittest.TestCase

    methods(Test)

        function VariableNamedRow(testCase)
            % Verify the default dimension name "Row" is replaced with "Row_1"
            % if one of the variables is named "Row".
            import arrow.tabular.internal.*

            varnames = ["Row" "Var2"];
            dimnames = makeValidDimensionNames(varnames);
            testCase.verifyEqual(dimnames, ["Row_1", "Variables"]);
        end

        function VariableNamedVariables(testCase)
            % Verify the default dimension name "Variables" is replaced with
            % "Variables_1" if one of the variables is named "Variables".
            import arrow.tabular.internal.*

            varnames = ["Var1" "Variables"];
            dimnames = makeValidDimensionNames(varnames);
            testCase.verifyEqual(dimnames, ["Row", "Variables_1"]);
        end

        function VariablesWithConflictingNumericSuffix(testCase)
            % Verify that conflicting numeric suffixes (e.g. "Variables"
            % and "Variables_1") are resolved as expected.

            import arrow.tabular.internal.*

            varnames = ["A" "Variables_1" "Variables"];
            dimnames = makeValidDimensionNames(varnames);
            testCase.verifyEqual(dimnames, ["Row", "Variables_2"]);
        end

        function RowWithConflictingNumericSuffix(testCase)
            % Verify that conflicting numeric suffixes (e.g. "Row"
            % and "Row_1") are resolved as expected.

            import arrow.tabular.internal.*

            varnames = ["Row_1" "Row" "Row_3" "Test"];
            dimnames = makeValidDimensionNames(varnames);
            testCase.verifyEqual(dimnames, ["Row_2", "Variables"]);
        end

        function DefaultDimensionNamesOk(testCase)
            % Verify the dimension names are set to the default values
            % ("Row" and "Variables") if they are not one of the variable
            % names.

            import arrow.tabular.internal.*

            varnames = ["row" "variables"];
            dimnames = makeValidDimensionNames(varnames);
            testCase.verifyEqual(dimnames, ["Row", "Variables"]);

            varnames = ["A" "B" "C"];
            dimnames = makeValidDimensionNames(varnames);
            testCase.verifyEqual(dimnames, ["Row", "Variables"]);
        end
    end
end
```
@@ -0,0 +1,240 @@
```matlab
%TMAKEVALIDVARIABLENAMES Unit tests for
% arrow.tabular.internal.makeValidVariableNames.

% Licensed to the Apache Software Foundation (ASF) under one or more
% contributor license agreements. See the NOTICE file distributed with
% this work for additional information regarding copyright ownership.
% The ASF licenses this file to you under the Apache License, Version
% 2.0 (the "License"); you may not use this file except in compliance
% with the License. You may obtain a copy of the License at
%
% http://www.apache.org/licenses/LICENSE-2.0
%
% Unless required by applicable law or agreed to in writing, software
% distributed under the License is distributed on an "AS IS" BASIS,
% WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
% implied. See the License for the specific language governing
% permissions and limitations under the License.
classdef tMakeValidVariableNames < matlab.unittest.TestCase

    methods(Test)

        function Colon(testCase)
            % Verify that ":" becomes ":_1".
            import arrow.tabular.internal.*

            original = ":";
            expected = ":_1";

            [actual, modified] = makeValidVariableNames(original);

            testCase.verifyEqual(actual, expected);
            testCase.verifyTrue(modified);
        end

        function RowNames(testCase)
            % Verify that "RowNames" becomes "RowNames_1".
            import arrow.tabular.internal.*

            original = "RowNames";
            expected = "RowNames_1";
            [actual, modified] = makeValidVariableNames(original);
            testCase.verifyEqual(actual, expected);
            testCase.verifyTrue(modified);
        end

        function Properties(testCase)
            % Verify that "Properties" becomes "Properties_1".
            import arrow.tabular.internal.*

            original = "Properties";
            expected = "Properties_1";

            [actual, modified] = makeValidVariableNames(original);

            testCase.verifyEqual(actual, expected);
            testCase.verifyTrue(modified);
        end

        function VariableNames(testCase)
            % Verify that "VariableNames" becomes "VariableNames_1".
            import arrow.tabular.internal.*

            original = "VariableNames";
            expected = "VariableNames_1";

            [actual, modified] = makeValidVariableNames(original);

            testCase.verifyEqual(actual, expected);
            testCase.verifyTrue(modified);
        end

        function ValidVariableNames(testCase)
            % Verify that when all of the input strings
            % are valid table variable names, none of them
            % are modified.
            import arrow.tabular.internal.*

            original = ["A", "B", "C"];
            expected = original;

            [actual, modified] = makeValidVariableNames(original);

            testCase.verifyEqual(actual, expected);
            testCase.verifyFalse(modified);
        end

        function ValidVariableNamesUnicode(testCase)
            % Verify that when all of the input strings are valid Unicode
            % table variable names, none of them are modified.
            import arrow.tabular.internal.*

            smiley = "😀";
            tree = "🌲";
            mango = "🥭";

            original = [smiley, tree, mango];
            expected = original;

            [actual, modified] = makeValidVariableNames(original);

            testCase.verifyEqual(actual, expected);
            testCase.verifyFalse(modified);
        end

        function PropertiesWithConflictingNumericSuffix(testCase)
            % Verify that conflicting numeric suffixes (e.g. "Properties"
            % and "Properties_1") are resolved as expected.
            import arrow.tabular.internal.*

            original = ["Properties", "Properties_1"];
            expected = ["Properties_2", "Properties_1"];

            [actual, modified] = makeValidVariableNames(original);

            testCase.verifyEqual(actual, expected);
            testCase.verifyTrue(modified);

            original = ["Properties_1", "Properties", "Properties_4"];
            expected = ["Properties_1", "Properties_2", "Properties_4"];

            [actual, modified] = makeValidVariableNames(original);

            testCase.verifyEqual(actual, expected);
            testCase.verifyTrue(modified);
        end

        function RowNamesWithConflictingNumericSuffix(testCase)
            % Verify that conflicting numeric suffixes (e.g. "RowNames"
            % and "RowNames_1") are resolved as expected.
            import arrow.tabular.internal.*

            original = ["RowNames", "RowNames_1"];
            expected = ["RowNames_2", "RowNames_1"];

            [actual, modified] = makeValidVariableNames(original);

            testCase.verifyEqual(actual, expected);
            testCase.verifyTrue(modified);

            original = ["RowNames_1", "RowNames", "RowNames_4"];
            expected = ["RowNames_1", "RowNames_2", "RowNames_4"];

            [actual, modified] = makeValidVariableNames(original);

            testCase.verifyEqual(actual, expected);
            testCase.verifyTrue(modified);
        end

        function VariableNamesWithConflictingNumericSuffix(testCase)
            % Verify that conflicting numeric suffixes (e.g. "VariableNames"
            % and "VariableNames_1") are resolved as expected.
            import arrow.tabular.internal.*

            original = ["VariableNames", "VariableNames_1"];
            expected = ["VariableNames_2", "VariableNames_1"];

            [actual, modified] = makeValidVariableNames(original);

            testCase.verifyEqual(actual, expected);
            testCase.verifyTrue(modified);

            original = ["VariableNames_1", "VariableNames", "VariableNames_4"];
            expected = ["VariableNames_1", "VariableNames_2", "VariableNames_4"];

            [actual, modified] = makeValidVariableNames(original);

            testCase.verifyEqual(actual, expected);
            testCase.verifyTrue(modified);
        end

        function ColonWithConflictingSuffix(testCase)
            % Verify that conflicting suffixes (e.g. ":"
            % and ":_1") are resolved as expected.
            import arrow.tabular.internal.*

            original = [":", ":_1"];
            expected = [":_2", ":_1"];

            [actual, modified] = makeValidVariableNames(original);

            testCase.verifyEqual(actual, expected);
            testCase.verifyTrue(modified);

            original = [":_1", ":", ":_4"];
            expected = [":_1", ":_2", ":_4"];

            [actual, modified] = makeValidVariableNames(original);

            testCase.verifyEqual(actual, expected);
            testCase.verifyTrue(modified);
        end

        function EmptyStrings(testCase)
            % Verify that empty strings are mapped to Var1, ..., Vari, ...,
            % VarN as expected and that conflicting names are resolved as
            % expected.
            import arrow.tabular.internal.*

            original = "";
            expected = "Var1";

            [actual, modified] = makeValidVariableNames(original);

            testCase.verifyEqual(actual, expected);
            testCase.verifyTrue(modified);

            original = ["", "Var1", ""];
            expected = ["Var1", "Var1_1", "Var3"];

            [actual, modified] = makeValidVariableNames(original);

            testCase.verifyEqual(actual, expected);
            testCase.verifyTrue(modified);

            original = ["", "Var1", "Var1_1"];
            expected = ["Var1", "Var1_2", "Var1_1"];

            [actual, modified] = makeValidVariableNames(original);

            testCase.verifyEqual(actual, expected);
            testCase.verifyTrue(modified);
        end

        function VariableNameLengthMax(testCase)
            % Verify strings whose character length exceeds 63
            % are truncated to the max variable name length (63).
            import arrow.tabular.internal.*

            original = string(repmat('a', [1 64]));
            expected = extractBefore(original, 64);

            [actual, modified] = makeValidVariableNames(original);

            testCase.verifyEqual(actual, expected);
            testCase.verifyTrue(modified);
        end

    end

end
```
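The two test classes added in this commit can be run with the standard MATLAB unit test runner. This is a sketch only: it assumes the folder containing the test files and the folder containing the `+arrow` package are both on the MATLAB path, which the commit itself does not set up.

```matlab
% Assumes tMakeValidVariableNames.m, tMakeValidDimensionNames.m, and the
% +arrow package folder are all reachable on the MATLAB path.
results = runtests(["tMakeValidVariableNames", "tMakeValidDimensionNames"]);

% Summarize the outcome of each test method in a table.
disp(table(results))
```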