Handling missing data? #4

Open
kmundnic opened this issue May 16, 2020 · 3 comments
Labels
enhancement New feature or request good first issue Good for newcomers

Comments

@kmundnic

I'm trying to follow the examples with my data, which is incomplete, but the individual_uniqueness function doesn't handle Union{Int, Missing}. According to your paper, your method is able to handle missing data, so I'm wondering whether this has been implemented?

Here's a minimal working example of the code throwing an error:

using StatsBase
using DataFrames
using CorrectMatch: Copula, Uniqueness, Individual
using Distributions

function checkrows(df::DataFrame)
    for row in eachrow(df)
        @assert !all([ismissing(i) for i in row])
    end
end

function extract_marginal_ordered(row::AbstractVector)
  cm = collect(values(countmap(row; alg=:dict)))
  Categorical(cm / sum(cm))
end

N = 100; M = 3
df = DataFrame(a = rand(1:2, N), b = rand(1:10, N), c = rand(1:5, N))

# Hopefully you won't get an invalid row with all missing values
p = 0.95
mask = convert(Matrix{Union{Int, Missing}}, rand(Bernoulli(p), N, M))
replace!(mask, 0 => missing)

df = df .* mask
checkrows(df) # If assertion error, run again

data = convert(Matrix, df)

marginals = [extract_marginal_ordered(data[:, i]) for i=1:M];

G = fit_mle(Copula.GaussianCopula, marginals, data);

for indiv in eachrow(data)
    shifted_indiv = indiv - [minimum(collect(skipmissing(col))) for col in eachcol(data)] .+ 1
    println(Individual.individual_uniqueness(G, shifted_indiv, N))
end

which throws the following error:

ERROR: LoadError: MethodError: no method matching individual_uniqueness(::CorrectMatch.Copula.GaussianCopula, ::Array{Union{Missing, Int64},1}, ::Int64)
Closest candidates are:
  individual_uniqueness(::CorrectMatch.Copula.GaussianCopula, ::AbstractArray{Int64,1}, ::Int64; iter) at /Users/karel/.julia/packages/CorrectMatch/Hf9Rq/src/Individual.jl:49
Stacktrace:
 [1] top-level scope at /Users/karel/.julia/dev/CorrectMatch/examples/missing_data.jl:36
 [2] include(::String) at ./client.jl:439
 [3] top-level scope at REPL[76]:1
 [4] eval(::Module, ::Any) at ./boot.jl:331
 [5] eval_user_input(::Any, ::REPL.REPLBackend) at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.4/REPL/src/REPL.jl:86
 [6] run_backend(::REPL.REPLBackend) at /Users/karel/.julia/packages/Revise/AMRie/src/Revise.jl:1023
 [7] top-level scope at none:0
in expression starting at /Users/karel/.julia/dev/CorrectMatch/examples/missing_data.jl:34

I've installed the latest version using ] add CorrectMatch.

Thanks for your help!

@cynddl
Collaborator

cynddl commented May 16, 2020

Hi @kmundnic, thanks for the detailed reproducible example, much appreciated! I don't think we stated that the method works with missing data, but rather with incomplete samples (that is, a subsample from a larger population).

Indeed, the fit_mle function cannot currently estimate a correlation matrix with missing cells. However, you can fit the marginals with missing data (as you did), then drop rows that are not complete:

data = convert(Matrix, df[completecases(df), :])
G = fit_mle(Copula.GaussianCopula, marginals, data)

Since we only look at pairwise correlations, we could adapt the optimisation to run with Missing data easily. Rows with at least two non-missing cells will always provide more information than if they're dropped. Happy to help you with that if you want to contribute?
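
To illustrate the point about pairwise correlations, here is a minimal sketch in plain Julia that compares how many rows survive dropping incomplete rows with how many rows each pair of columns could use. The toy matrix and the helper name pairwise_counts are made up for illustration and are not part of CorrectMatch.jl.

```julia
# Toy data: `missing` marks an unobserved cell (values are invented for illustration).
data = Union{Int, Missing}[
    1        2        missing
    2        missing  3
    1        1        2
    missing  2        1
    2        2        2
]

# Listwise deletion: keep only rows with no missing cell at all.
n_complete = count(row -> !any(ismissing, row), eachrow(data))

# Pairwise deletion (hypothetical helper): for each column pair (i, j),
# count the rows where both cells are observed.
function pairwise_counts(data::AbstractMatrix)
    m = size(data, 2)
    counts = zeros(Int, m, m)
    for i in 1:m, j in 1:m
        counts[i, j] = count(k -> !ismissing(data[k, i]) && !ismissing(data[k, j]),
                             axes(data, 1))
    end
    return counts
end

counts = pairwise_counts(data)
println("complete rows: ", n_complete)
println("rows usable per column pair:")
display(counts)
```

Every entry of counts is at least n_complete, since a complete row is observed in every pair of columns; this is why a pairwise estimate never uses fewer rows than the complete-case one.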

@cynddl cynddl added enhancement New feature or request good first issue Good for newcomers labels May 16, 2020
@kmundnic
Author

Thanks for your quick reply @cynddl. The missing data/incomplete data misunderstanding is clear now :-) (I also re-read those portions of the paper and it makes sense).

My data has missing values for many subjects, so I can't afford to remove the rows with missing entries, and my sample is "small" (slightly over 200 subjects). Therefore, I need to adapt the code.

When you say:

the fit_mle function cannot currently estimate a correlation matrix with missing cells

do you mean that it can, but produces a biased estimate? If I understand correctly, you're using Mutual Information (MI) to estimate the pairwise correlations, so I would need to adapt the estimation of the MI to have a (hopefully) unbiased estimator of \Sigma. However, in my MWE fit_mle still works (in the sense that the code runs), so I'd expect the estimator to be biased.

When this is solved, my understanding is that Individual.individual_uniqueness would need to be changed as well to be able to handle missing values. Are you implementing the equation between (23) and (24) of your paper to estimate individual uniqueness?

Thanks for your help!

@cynddl
Collaborator

cynddl commented May 29, 2020

Sorry for the late reply. At the moment, the MI matrix is computed here:
https://github.com/cynddl/Discreet.jl/blob/904793f59e0dde96539dc986c597191a6be098fe/src/mutual_information.jl#L38

A simple trick would be, when iterating over pairs of columns (x, y), to drop the rows where either of the two has a missing value. The only issue I see is that we may compute the marginal entropies with more rows than the pairwise entropies. So in practice it may require using a better entropy estimator than the naive one. Fortunately, they are already implemented in Discreet.jl (Chao-Shen and Shrinkage).
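
A minimal sketch of that trick in plain Julia follows. It does not use Discreet.jl's API; the names naive_entropy and pairwise_mi are hypothetical, and the estimator is the naive plug-in one, not Chao-Shen or Shrinkage. As a simplifying assumption, the marginal entropies here are computed on the same reduced set of rows as the joint entropy, which sidesteps the row-count mismatch but throws away some marginal information.

```julia
# Naive (plug-in) entropy in nats from a vector of observed symbols.
function naive_entropy(xs)
    n = length(xs)
    n == 0 && return 0.0
    freq = Dict{eltype(xs), Int}()
    for x in xs
        freq[x] = get(freq, x, 0) + 1
    end
    return -sum(c / n * log(c / n) for c in values(freq))
end

# Hypothetical helper: mutual information between two columns, using only
# the rows where both x and y are observed (pairwise deletion).
function pairwise_mi(x::AbstractVector, y::AbstractVector)
    keep = .!ismissing.(x) .& .!ismissing.(y)
    xs, ys = x[keep], y[keep]
    joint = collect(zip(xs, ys))
    return naive_entropy(xs) + naive_entropy(ys) - naive_entropy(joint)
end

# Two columns that agree wherever both are observed.
x = [1, 2, 1, 2, missing, 1]
y = [1, 2, 1, 2, 1, missing]
println(pairwise_mi(x, y))
```

In this toy input the last two rows are dropped, and the remaining four rows are perfectly dependent, so the estimate equals the entropy of a fair binary variable, log(2).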


The plan would be:

  1. Modify Discreet.jl to handle missing values when computing entropy or mutual information.
  2. Make sure the types in CorrectMatch.jl cover missing values.
  3. Check what needs to be done for individual_uniqueness. It seems to me it should work as is.
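
For step 2, the MethodError reported above comes from the argument type: a Vector{Union{Missing, Int64}} is not an AbstractArray{Int64,1}. The sketch below shows one way a signature could be widened to accept both kinds of record. MaybeInt and observed_indices are hypothetical names for illustration, not part of CorrectMatch.jl, and what individual_uniqueness should actually do with the missing cells is left open.

```julia
# The reported record type is not a subtype of the accepted one,
# which is why dispatch fails with a MethodError.
println(Vector{Union{Missing, Int}} <: AbstractVector{Int})

# Hypothetical alias and helper, not part of CorrectMatch.jl.
const MaybeInt = Union{Missing, Int}

# This signature accepts Vector{Int} as well as Vector{Union{Missing, Int}}.
# It returns the positions of the observed (non-missing) cells of a record.
observed_indices(record::AbstractVector{<:MaybeInt}) = findall(!ismissing, record)

println(observed_indices([1, missing, 3]))
println(observed_indices([1, 2, 3]))
```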
