Handling missing data? #4

kmundnic · 2020-05-16T01:46:25Z

I'm trying to follow the examples with my data, which is incomplete, but the uniqueness function doesn't handle Union{Int, Missing}. According to your paper, your method is able to handle missing data, so I'm wondering whether this has been implemented?

Here's a minimal working example of the code throwing an error:

using StatsBase
using DataFrames
using CorrectMatch: Copula, Uniqueness, Individual
using Distributions

function checkrows(df::DataFrame)
    for row in eachrow(df)
        @assert !all([ismissing(i) for i in row])
    end
end

function extract_marginal_ordered(row::AbstractVector)
  cm = collect(values(countmap(row; alg=:dict)))
  Categorical(cm / sum(cm))
end

N = 100; M = 3
df = DataFrame(a = rand(1:2, N), b = rand(1:10, N), c = rand(1:5, N))

# Hopefully you won't get an invalid row with all missing values
p = 0.95
mask = convert(Matrix{Union{Int, Missing}}, rand(Bernoulli(p), N, M))
replace!(mask, 0 => missing)

df = df .* mask
checkrows(df) # If assertion error, run again

data = convert(Matrix, df)

marginals = [extract_marginal_ordered(data[:, i]) for i=1:M];

G = fit_mle(Copula.GaussianCopula, marginals, data);

for indiv in eachrow(data)
    shifted_indiv = indiv - [minimum(collect(skipmissing(col))) for col in eachcol(data)] .+ 1
    println(Individual.individual_uniqueness(G, shifted_indiv, N))
end

which throws the following error:

ERROR: LoadError: MethodError: no method matching individual_uniqueness(::CorrectMatch.Copula.GaussianCopula, ::Array{Union{Missing, Int64},1}, ::Int64)
Closest candidates are:
  individual_uniqueness(::CorrectMatch.Copula.GaussianCopula, ::AbstractArray{Int64,1}, ::Int64; iter) at /Users/karel/.julia/packages/CorrectMatch/Hf9Rq/src/Individual.jl:49
Stacktrace:
 [1] top-level scope at /Users/karel/.julia/dev/CorrectMatch/examples/missing_data.jl:36
 [2] include(::String) at ./client.jl:439
 [3] top-level scope at REPL[76]:1
 [4] eval(::Module, ::Any) at ./boot.jl:331
 [5] eval_user_input(::Any, ::REPL.REPLBackend) at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.4/REPL/src/REPL.jl:86
 [6] run_backend(::REPL.REPLBackend) at /Users/karel/.julia/packages/Revise/AMRie/src/Revise.jl:1023
 [7] top-level scope at none:0
in expression starting at /Users/karel/.julia/dev/CorrectMatch/examples/missing_data.jl:34

I've installed the latest version using ] add CorrectMatch.

Thanks for your help!


cynddl · 2020-05-16T11:10:42Z

Hi @kmundnic, thanks for the detailed reproducible example, much appreciated! I don't think we stated that the method works with missing data, but rather with incomplete samples (that is, a subsample from a larger population).

Indeed, the fit_mle function cannot currently estimate a correlation matrix with missing cells. However, you can fit the marginals with missing data (as you did) and then drop the rows that are not complete:

data = convert(Matrix, df[completecases(df), :])
G = fit_mle(Copula.GaussianCopula, marginals, data)

Since we only look at pairwise correlations, we could adapt the optimisation to run with Missing data easily. Rows with at least two non-missing cells will always provide more information than if they're dropped. Happy to help you with that if you want to contribute?
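To illustrate why pairwise handling keeps more information than dropping incomplete rows, here is a minimal sketch on a toy matrix. The helper pairwise_complete is hypothetical (it is not part of CorrectMatch.jl) and the data is made up for the example.

```julia
# Hypothetical helper (not part of CorrectMatch.jl): indices of the rows
# in which both column i and column j are observed.
pairwise_complete(data, i, j) =
    findall(r -> !ismissing(data[r, i]) && !ismissing(data[r, j]), 1:size(data, 1))

# Toy data, made up for illustration only.
data = Union{Int, Missing}[1 2 missing;
                           2 missing 1;
                           3 1 2;
                           missing 2 2;
                           2 2 1]

# Rows that survive complete-case deletion (no missing cell at all).
complete_rows = findall(r -> !any(ismissing, data[r, :]), 1:size(data, 1))

for (i, j) in [(1, 2), (1, 3), (2, 3)]
    rows = pairwise_complete(data, i, j)
    println("columns ($i, $j): ", length(rows), " usable rows vs ",
            length(complete_rows), " complete rows")
end
```

In this toy case each pair of columns has three usable rows, while only two rows are fully observed.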

kmundnic · 2020-05-16T20:04:29Z

Thanks for your quick reply @cynddl. The missing data/incomplete data misunderstanding is clear now :-) (I also re-read those portions of the paper and it makes sense).

My data has missing values for many subjects, so I can't afford to remove the rows with missing entries, and my sample is "small" (slightly over 200 subjects). Therefore, I need to adapt the code.

When you say:

the fit_mle function cannot currently estimate a correlation matrix with missing cells

do you mean that it can, but that it produces a biased estimate? If I understand correctly, you're using Mutual Information (MI) to estimate the pairwise correlations, so I would need to adapt the estimation of the MI to obtain a (hopefully) unbiased estimator of \Sigma. However, in my MWE fit_mle still works (in the sense that the code runs), so I'd expect the estimator to be biased.

When this is solved, my understanding is that Individual.individual_uniqueness would need to be changed as well to be able to handle missing values. Are you implementing the equation between (23) and (24) of your paper to estimate individual uniqueness?

Thanks for your help!

cynddl · 2020-05-29T10:12:42Z

Sorry for the late reply. At the moment, the MI matrix is computed here:
https://github.com/cynddl/Discreet.jl/blob/904793f59e0dde96539dc986c597191a6be098fe/src/mutual_information.jl#L38

A simple trick would be, when iterating over pairs of columns (x, y), to drop the rows where either of the two has a missing value. The only issue I see is that we may compute the marginal entropies with more rows than the pairwise entropies. So in practice it may require using a better entropy estimator than the naive one. Fortunately, such estimators are already implemented in Discreet.jl (Chao-Shen and Shrinkage).
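As a rough sketch of that trick, the following computes a naive (plug-in) mutual information after dropping the rows where either column is missing. The functions naive_entropy and pairwise_mi are hypothetical names written for this example and do not use the Discreet.jl API. Note the assumption made here: the marginal entropies are computed on the same pair-complete rows as the joint entropy, which sidesteps the row-count mismatch but means a column's marginal estimate can differ from pair to pair.

```julia
# Naive (plug-in) entropy in nats of a vector of discrete observations.
function naive_entropy(xs)
    counts = Dict{Any, Int}()
    for x in xs
        counts[x] = get(counts, x, 0) + 1
    end
    n = length(xs)
    -sum(c / n * log(c / n) for c in values(counts))
end

# Hypothetical helper: mutual information of two columns, using only the
# rows where both values are observed (pairwise deletion).
function pairwise_mi(x, y)
    keep = [!ismissing(a) && !ismissing(b) for (a, b) in zip(x, y)]
    xs, ys = x[keep], y[keep]
    # I(X; Y) = H(X) + H(Y) - H(X, Y), all on the same retained rows.
    naive_entropy(xs) + naive_entropy(ys) - naive_entropy(collect(zip(xs, ys)))
end

# Toy columns, made up for illustration: the last two rows are dropped.
x = [1, 1, 2, 2, missing, 1]
y = [1, 1, 2, 2, 2, missing]
println(pairwise_mi(x, y))
```

On the four retained rows the two columns are identical and balanced, so the result equals log(2).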

The plan would be:

1. Modify Discreet.jl to handle missing values when computing entropy or mutual information.
2. Make sure the types in CorrectMatch.jl cover missing values.
3. Check what needs to be done for individual_uniqueness. It seems to me it should work as is.
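On the type question, the MethodError in the original report arises because the closest candidate accepts an AbstractArray{Int64,1}, while a row taken from the masked matrix has element type Union{Missing, Int64} even when none of its values is actually missing. A minimal sketch of narrowing the element type for a fully observed record, using only Base Julia (the variable names are illustrative, and this is an assumption about how one might call the existing method, not a change to the package):

```julia
# A record stored with a Union element type, but with no missing value.
row = Union{Int, Missing}[1, 4, 2]

# Narrow the element type only when every value is observed;
# convert would throw on a vector that still contains missing.
narrowed = any(ismissing, row) ? nothing : convert(Vector{Int}, row)

println(typeof(narrowed))
```

A record that still contains missing values would need separate handling, which is what the third point of the plan is meant to check.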

cynddl added the "enhancement" (New feature or request) and "good first issue" (Good for newcomers) labels on May 16, 2020
