jupiter

Detecting correlated columns in DBMS systems

On a database with a large number of relations, there can exist many correlated columns or groups of correlated columns. Finding out such columns can provide useful business analytics and insight into the data. This can also help in data lineage to track the origin of the data and subsequent transformations. Some of these relationships are pretty obvious like primary keys, foreign keys, other indexes, etc. But the main challenges are to find out the hidden relationships. We picked the Pearson correlation as a measurement of correlation between numerical columns (numerical, date, timestamp, boolean types). For non-numerical columns (string type), we used LSH with minhashing to find strongly related columns with high probability. As LSH with minhashing is a probabilistic approach, there can be false positives and negatives as well. Similarly for Pearson correlation calculation, we have used simple random sampling to reduce the overall compute cost. However our solution is expected to find strongly correlated columns with high probability. We tried to scale it up to run on relations with up to 10000 columns.

Name		Name	Last commit message	Last commit date
Latest commit History 44 Commits
data		data
src		src
.gitignore		.gitignore
Jupyter-muhammad.ipynb		Jupyter-muhammad.ipynb
Jupyter.ipynb		Jupyter.ipynb
LICENSE		LICENSE
README.md		README.md
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

jupiter

Detecting correlated columns in DBMS systems

About

Releases

Packages

Contributors 2

Languages

License

amitkp57/dbms-correlated-columns-detection

Folders and files

Latest commit

History

Repository files navigation

jupiter

Detecting correlated columns in DBMS systems

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages