Inconsistencies in the database #180

Closed
danielfsbarreto opened this issue Jun 28, 2020 · 13 comments · Fixed by #181

@danielfsbarreto
Contributor

danielfsbarreto commented Jun 28, 2020

There was a recurring problem running the project's goodtables action, which was fixed in #178. Now we need to resolve all the inconsistencies that accumulated in the meantime.

Job: https://github.com/turicas/covid19-br/runs/814979053?check_suite_focus=true#step:3:911

2020-06-28T01:10:34.8665625Z DATASET
2020-06-28T01:10:34.8667294Z =======
2020-06-28T01:10:34.8668860Z {'error-count': 35,
2020-06-28T01:10:34.8670649Z  'preset': 'nested',
2020-06-28T01:10:34.8671296Z  'table-count': 10,
2020-06-28T01:10:34.8671754Z  'time': 54.346,
2020-06-28T01:10:34.8672417Z  'valid': False}
2020-06-28T01:10:34.8672565Z 
2020-06-28T01:10:34.8672771Z TABLE [1]
2020-06-28T01:10:34.8672978Z =========
2020-06-28T01:10:34.8673469Z {'datapackage': 'datapackage.json',
2020-06-28T01:10:34.8673924Z  'encoding': 'no',
2020-06-28T01:10:34.8674388Z  'error-count': 3,
2020-06-28T01:10:34.8674832Z  'format': 'inline',
2020-06-28T01:10:34.8675342Z  'headers': ['date', 'notes', 'state', 'url'],
2020-06-28T01:10:34.8675807Z  'resource-name': 'boletim',
2020-06-28T01:10:34.8676259Z  'row-count': 3310,
2020-06-28T01:10:34.8676716Z  'schema': 'table-schema',
2020-06-28T01:10:34.8677166Z  'scheme': 'inline',
2020-06-28T01:10:34.8677658Z  'source': '/app/data/output/boletim.csv',
2020-06-28T01:10:34.8678134Z  'time': 2.043,
2020-06-28T01:10:34.8678569Z  'valid': False}
2020-06-28T01:10:34.8678974Z ---------
2020-06-28T01:10:34.8679588Z [-,2] [non-matching-header] Header in column 2 doesn't match field name "state" in the schema
2020-06-28T01:10:34.8680249Z [-,3] [non-matching-header] Header in column 3 doesn't match field name "url" in the schema
2020-06-28T01:10:34.8680894Z [-,4] [non-matching-header] Header in column 4 doesn't match field name "notes" in the schema
2020-06-28T01:10:34.8681058Z 
2020-06-28T01:10:34.8681247Z TABLE [2]
2020-06-28T01:10:34.8681715Z =========
2020-06-28T01:10:34.8682221Z {'datapackage': 'datapackage.json',
2020-06-28T01:10:34.8684061Z  'encoding': 'no',
2020-06-28T01:10:34.8684569Z  'error-count': 0,
2020-06-28T01:10:34.8685012Z  'format': 'inline',
2020-06-28T01:10:34.8685457Z  'headers': ['date',
2020-06-28T01:10:34.8685887Z              'state',
2020-06-28T01:10:34.8686329Z              'city',
2020-06-28T01:10:34.8686785Z              'place_type',
2020-06-28T01:10:34.8687478Z              'confirmed',
2020-06-28T01:10:34.8687928Z              'deaths',
2020-06-28T01:10:34.8688404Z              'order_for_place',
2020-06-28T01:10:34.8688861Z              'is_last',
2020-06-28T01:10:34.8689345Z              'estimated_population_2019',
2020-06-28T01:10:34.8689826Z              'city_ibge_code',
2020-06-28T01:10:34.8690342Z              'confirmed_per_100k_inhabitants',
2020-06-28T01:10:34.8690812Z              'death_rate'],
2020-06-28T01:10:34.8691271Z  'resource-name': 'caso',
2020-06-28T01:10:34.8691732Z  'row-count': 263945,
2020-06-28T01:10:34.8692198Z  'schema': 'table-schema',
2020-06-28T01:10:34.8692627Z  'scheme': 'inline',
2020-06-28T01:10:34.8693119Z  'source': '/app/data/output/caso.csv',
2020-06-28T01:10:34.8693577Z  'time': 54.154,
2020-06-28T01:10:34.8694012Z  'valid': True}
2020-06-28T01:10:34.8694133Z 
2020-06-28T01:10:34.8694335Z TABLE [3]
2020-06-28T01:10:34.8694537Z =========
2020-06-28T01:10:34.8695006Z {'datapackage': 'datapackage.json',
2020-06-28T01:10:34.8695470Z  'encoding': 'no',
2020-06-28T01:10:34.8695908Z  'error-count': 0,
2020-06-28T01:10:34.8696350Z  'format': 'inline',
2020-06-28T01:10:34.8696779Z  'headers': ['state',
2020-06-28T01:10:34.8697247Z              'state_ibge_code',
2020-06-28T01:10:34.8697721Z              'city_ibge_code',
2020-06-28T01:10:34.8698169Z              'city',
2020-06-28T01:10:34.8698647Z              'estimated_population'],
2020-06-28T01:10:34.8699143Z  'resource-name': 'populacao-estimada',
2020-06-28T01:10:34.8699600Z  'row-count': 5571,
2020-06-28T01:10:34.8700050Z  'schema': 'table-schema',
2020-06-28T01:10:34.8700496Z  'scheme': 'inline',
2020-06-28T01:10:34.8701313Z  'source': '/app/data/populacao-estimada-2019.csv',
2020-06-28T01:10:34.8701799Z  'time': 0.907,
2020-06-28T01:10:34.8702229Z  'valid': True}
2020-06-28T01:10:34.8702344Z 
2020-06-28T01:10:34.8702547Z TABLE [4]
2020-06-28T01:10:34.8702750Z =========
2020-06-28T01:10:34.8703220Z {'datapackage': 'datapackage.json',
2020-06-28T01:10:34.8703681Z  'encoding': 'no',
2020-06-28T01:10:34.8704125Z  'error-count': 0,
2020-06-28T01:10:34.8704550Z  'format': 'inline',
2020-06-28T01:10:34.8705041Z  'headers': ['field_name', 'field_type'],
2020-06-28T01:10:34.8705528Z  'resource-name': 'schema-boletim',
2020-06-28T01:10:34.8706201Z  'row-count': 5,
2020-06-28T01:10:34.8706674Z  'schema': 'table-schema',
2020-06-28T01:10:34.8707120Z  'scheme': 'inline',
2020-06-28T01:10:34.8707606Z  'source': '/app/schema/boletim.csv',
2020-06-28T01:10:34.8708043Z  'time': 0.013,
2020-06-28T01:10:34.8708479Z  'valid': True}
2020-06-28T01:10:34.8708618Z 
2020-06-28T01:10:34.8708820Z TABLE [5]
2020-06-28T01:10:34.8709008Z =========
2020-06-28T01:10:34.8709479Z {'datapackage': 'datapackage.json',
2020-06-28T01:10:34.8709927Z  'encoding': 'no',
2020-06-28T01:10:34.8710364Z  'error-count': 0,
2020-06-28T01:10:34.8710806Z  'format': 'inline',
2020-06-28T01:10:34.8711296Z  'headers': ['field_name', 'field_type'],
2020-06-28T01:10:34.8711836Z  'resource-name': 'schema-caso',
2020-06-28T01:10:34.8712292Z  'row-count': 13,
2020-06-28T01:10:34.8712747Z  'schema': 'table-schema',
2020-06-28T01:10:34.8713192Z  'scheme': 'inline',
2020-06-28T01:10:34.8713667Z  'source': '/app/schema/caso.csv',
2020-06-28T01:10:34.8714116Z  'time': 0.091,
2020-06-28T01:10:34.8714551Z  'valid': True}
2020-06-28T01:10:34.8714663Z 
2020-06-28T01:10:34.8714862Z TABLE [6]
2020-06-28T01:10:34.8715063Z =========
2020-06-28T01:10:34.8715465Z {'datapackage': 'datapackage.json',
2020-06-28T01:10:34.8715672Z  'encoding': 'no',
2020-06-28T01:10:34.8715863Z  'error-count': 0,
2020-06-28T01:10:34.8716162Z  'format': 'inline',
2020-06-28T01:10:34.8716405Z  'headers': ['field_name', 'field_type'],
2020-06-28T01:10:34.8716645Z  'resource-name': 'schema-populacao-estimada',
2020-06-28T01:10:34.8716856Z  'row-count': 6,
2020-06-28T01:10:34.8717069Z  'schema': 'table-schema',
2020-06-28T01:10:34.8717278Z  'scheme': 'inline',
2020-06-28T01:10:34.8717510Z  'source': '/app/schema/populacao-estimada-2019.csv',
2020-06-28T01:10:34.8717796Z  'time': 0.075,
2020-06-28T01:10:34.8718000Z  'valid': True}
2020-06-28T01:10:34.8718064Z 
2020-06-28T01:10:34.8718145Z TABLE [7]
2020-06-28T01:10:34.8718242Z =========
2020-06-28T01:10:34.8718461Z {'datapackage': 'datapackage.json',
2020-06-28T01:10:34.8718671Z  'encoding': 'no',
2020-06-28T01:10:34.8718875Z  'error-count': 0,
2020-06-28T01:10:34.8719080Z  'format': 'inline',
2020-06-28T01:10:34.8719436Z  'headers': ['field_name', 'field_type'],
2020-06-28T01:10:34.8719633Z  'resource-name': 'schema-epidemiological-week',
2020-06-28T01:10:34.8719815Z  'row-count': 4,
2020-06-28T01:10:34.8720001Z  'schema': 'table-schema',
2020-06-28T01:10:34.8720184Z  'scheme': 'inline',
2020-06-28T01:10:34.8720388Z  'source': '/app/schema/epidemiological-week.csv',
2020-06-28T01:10:34.8720572Z  'time': 0.138,
2020-06-28T01:10:34.8720745Z  'valid': True}
2020-06-28T01:10:34.8720790Z 
2020-06-28T01:10:34.8720872Z TABLE [8]
2020-06-28T01:10:34.8720959Z =========
2020-06-28T01:10:34.8721148Z {'datapackage': 'datapackage.json',
2020-06-28T01:10:34.8721319Z  'encoding': 'no',
2020-06-28T01:10:34.8721497Z  'error-count': 0,
2020-06-28T01:10:34.8721676Z  'format': 'inline',
2020-06-28T01:10:34.8721872Z  'headers': ['field_name', 'field_type'],
2020-06-28T01:10:34.8722073Z  'resource-name': 'schema-obito_cartorio',
2020-06-28T01:10:34.8722254Z  'row-count': 35,
2020-06-28T01:10:34.8722624Z  'schema': 'table-schema',
2020-06-28T01:10:34.8722819Z  'scheme': 'inline',
2020-06-28T01:10:34.8723051Z  'source': '/app/schema/obito_cartorio.csv',
2020-06-28T01:10:34.8723261Z  'time': 0.03,
2020-06-28T01:10:34.8723468Z  'valid': True}
2020-06-28T01:10:34.8723520Z 
2020-06-28T01:10:34.8723615Z TABLE [9]
2020-06-28T01:10:34.8723712Z =========
2020-06-28T01:10:34.8723931Z {'datapackage': 'datapackage.json',
2020-06-28T01:10:34.8724136Z  'encoding': 'no',
2020-06-28T01:10:34.8724340Z  'error-count': 0,
2020-06-28T01:10:34.8724544Z  'format': 'inline',
2020-06-28T01:10:34.8724797Z  'headers': ['date', 'epidemiological_year', 'epidemiological_week'],
2020-06-28T01:10:34.8725048Z  'resource-name': 'epidemiological-week',
2020-06-28T01:10:34.8725261Z  'row-count': 3289,
2020-06-28T01:10:34.8725475Z  'schema': 'table-schema',
2020-06-28T01:10:34.8725681Z  'scheme': 'inline',
2020-06-28T01:10:34.8725918Z  'source': '/app/data/epidemiological-week.csv',
2020-06-28T01:10:34.8726129Z  'time': 2.168,
2020-06-28T01:10:34.8726318Z  'valid': True}
2020-06-28T01:10:34.8726385Z 
2020-06-28T01:10:34.8726482Z TABLE [10]
2020-06-28T01:10:34.8726582Z =========
2020-06-28T01:10:34.8726787Z {'datapackage': 'datapackage.json',
2020-06-28T01:10:34.8727572Z  'encoding': 'no',
2020-06-28T01:10:34.8727833Z  'error-count': 32,
2020-06-28T01:10:34.8728359Z  'format': 'inline',
2020-06-28T01:10:34.8728625Z  'headers': ['date',
2020-06-28T01:10:34.8728880Z              'state',
2020-06-28T01:10:34.8729220Z              'epidemiological_week_2019',
2020-06-28T01:10:34.8729509Z              'epidemiological_week_2020',
2020-06-28T01:10:34.8729789Z              'new_deaths_sars_2019',
2020-06-28T01:10:34.8730065Z              'new_deaths_pneumonia_2019',
2020-06-28T01:10:34.8730363Z              'new_deaths_respiratory_failure_2019',
2020-06-28T01:10:34.8730657Z              'new_deaths_septicemia_2019',
2020-06-28T01:10:34.8730947Z              'new_deaths_indeterminate_2019',
2020-06-28T01:10:34.8731229Z              'new_deaths_others_2019',
2020-06-28T01:10:34.8731511Z              'new_deaths_sars_2020',
2020-06-28T01:10:34.8731796Z              'new_deaths_pneumonia_2020',
2020-06-28T01:10:34.8732205Z              'new_deaths_respiratory_failure_2020',
2020-06-28T01:10:34.8732492Z              'new_deaths_septicemia_2020',
2020-06-28T01:10:34.8732782Z              'new_deaths_indeterminate_2020',
2020-06-28T01:10:34.8733062Z              'new_deaths_others_2020',
2020-06-28T01:10:34.8733340Z              'new_deaths_covid19',
2020-06-28T01:10:34.8733613Z              'deaths_sars_2019',
2020-06-28T01:10:34.8733892Z              'deaths_pneumonia_2019',
2020-06-28T01:10:34.8734260Z              'deaths_respiratory_failure_2019',
2020-06-28T01:10:34.8734530Z              'deaths_septicemia_2019',
2020-06-28T01:10:34.8734814Z              'deaths_indeterminate_2019',
2020-06-28T01:10:34.8735091Z              'deaths_others_2019',
2020-06-28T01:10:34.8735363Z              'deaths_sars_2020',
2020-06-28T01:10:34.8735641Z              'deaths_pneumonia_2020',
2020-06-28T01:10:34.8735932Z              'deaths_respiratory_failure_2020',
2020-06-28T01:10:34.8736216Z              'deaths_septicemia_2020',
2020-06-28T01:10:34.8736505Z              'deaths_indeterminate_2020',
2020-06-28T01:10:34.8736767Z              'deaths_others_2020',
2020-06-28T01:10:34.8737038Z              'deaths_covid19',
2020-06-28T01:10:34.8737315Z              'new_deaths_total_2019',
2020-06-28T01:10:34.8737595Z              'new_deaths_total_2020',
2020-06-28T01:10:34.8737866Z              'deaths_total_2019',
2020-06-28T01:10:34.8738138Z              'deaths_total_2020'],
2020-06-28T01:10:34.8738418Z  'resource-name': 'obito_cartorio',
2020-06-28T01:10:34.8738667Z  'row-count': 9882,
2020-06-28T01:10:34.8739041Z  'schema': 'table-schema',
2020-06-28T01:10:34.8739279Z  'scheme': 'inline',
2020-06-28T01:10:34.8739647Z  'source': '/app/data/output/obito_cartorio.csv',
2020-06-28T01:10:34.8739856Z  'time': 4.42,
2020-06-28T01:10:34.8740058Z  'valid': False}
2020-06-28T01:10:34.8740251Z ---------
2020-06-28T01:10:34.8740547Z [-,3] [non-matching-header] Header in column 3 doesn't match field name "new_deaths_pneumonia_2019" in the schema
2020-06-28T01:10:34.8740884Z [-,4] [non-matching-header] Header in column 4 doesn't match field name "new_deaths_pneumonia_2020" in the schema
2020-06-28T01:10:34.8741371Z [-,5] [non-matching-header] Header in column 5 doesn't match field name "new_deaths_respiratory_failure_2019" in the schema
2020-06-28T01:10:34.8741714Z [-,6] [non-matching-header] Header in column 6 doesn't match field name "new_deaths_respiratory_failure_2020" in the schema
2020-06-28T01:10:34.8742042Z [-,7] [non-matching-header] Header in column 7 doesn't match field name "new_deaths_covid19" in the schema
2020-06-28T01:10:34.8742369Z [-,8] [non-matching-header] Header in column 8 doesn't match field name "epidemiological_week_2019" in the schema
2020-06-28T01:10:34.8742691Z [-,9] [non-matching-header] Header in column 9 doesn't match field name "epidemiological_week_2020" in the schema
2020-06-28T01:10:34.8743001Z [-,10] [non-matching-header] Header in column 10 doesn't match field name "deaths_covid19" in the schema
2020-06-28T01:10:34.8743341Z [-,11] [non-matching-header] Header in column 11 doesn't match field name "deaths_respiratory_failure_2019" in the schema
2020-06-28T01:10:34.8743680Z [-,12] [non-matching-header] Header in column 12 doesn't match field name "deaths_respiratory_failure_2020" in the schema
2020-06-28T01:10:34.8744004Z [-,13] [non-matching-header] Header in column 13 doesn't match field name "deaths_pneumonia_2019" in the schema
2020-06-28T01:10:34.8744321Z [-,14] [non-matching-header] Header in column 14 doesn't match field name "deaths_pneumonia_2020" in the schema
2020-06-28T01:10:34.8744592Z [-,15] [extra-header] There is an extra header in column 15
2020-06-28T01:10:34.8744836Z [-,16] [extra-header] There is an extra header in column 16
2020-06-28T01:10:34.8745094Z [-,17] [extra-header] There is an extra header in column 17
2020-06-28T01:10:34.8745349Z [-,18] [extra-header] There is an extra header in column 18
2020-06-28T01:10:34.8745603Z [-,19] [extra-header] There is an extra header in column 19
2020-06-28T01:10:34.8745926Z [-,20] [extra-header] There is an extra header in column 20
2020-06-28T01:10:34.8746190Z [-,21] [extra-header] There is an extra header in column 21
2020-06-28T01:10:34.8746566Z [-,22] [extra-header] There is an extra header in column 22
2020-06-28T01:10:34.8746786Z [-,23] [extra-header] There is an extra header in column 23
2020-06-28T01:10:34.8747006Z [-,24] [extra-header] There is an extra header in column 24
2020-06-28T01:10:34.8747212Z [-,25] [extra-header] There is an extra header in column 25
2020-06-28T01:10:34.8747487Z [-,26] [extra-header] There is an extra header in column 26
2020-06-28T01:10:34.8747705Z [-,27] [extra-header] There is an extra header in column 27
2020-06-28T01:10:34.8747923Z [-,28] [extra-header] There is an extra header in column 28
2020-06-28T01:10:34.8748139Z [-,29] [extra-header] There is an extra header in column 29
2020-06-28T01:10:34.8748356Z [-,30] [extra-header] There is an extra header in column 30
2020-06-28T01:10:34.8748573Z [-,31] [extra-header] There is an extra header in column 31
2020-06-28T01:10:34.8748796Z [-,32] [extra-header] There is an extra header in column 32
2020-06-28T01:10:34.8749004Z [-,33] [extra-header] There is an extra header in column 33
2020-06-28T01:10:34.8749221Z [-,34] [extra-header] There is an extra header in column 34
@endersonmaia
Collaborator

The errors are in these tables:

  • boletim
  • obito_cartorio

The schemas used in the project are maintained here, which ends up creating duplication: whenever they are changed, datapackage.json has to be updated as well.

/cc @augusto-herrmann

@augusto-herrmann
Contributor

Right, ideally these schemas would be generated from datapackage.json, and not the other way around.

@endersonmaia
Collaborator

Right, ideally these schemas would be generated from datapackage.json, and not the other way around.

That deserves an issue or PR: identify where the code references schemas/*.csv. The datapackage library is already among the project's dependencies, so this could certainly be automated.
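
A minimal sketch of the automation suggested above, assuming each resource in datapackage.json carries an inline Table Schema with a `fields` list. The `field_name` and `field_type` column names come from the headers shown in the log; the helper names and the sample descriptor are hypothetical, and the type names used in the project's schema CSVs may need a mapping step that is not shown here. It uses only the standard library rather than the datapackage library.

```python
import csv
import io
import json


def schema_rows(resource):
    """Return (field_name, field_type) pairs for one data package resource.

    Assumes the resource has an inline schema (a dict), not a path or URL.
    """
    fields = resource.get("schema", {}).get("fields", [])
    # Table Schema treats a field without "type" as a string.
    return [(field["name"], field.get("type", "string")) for field in fields]


def write_schema_csv(resource, fobj):
    """Write the pairs as CSV under a field_name,field_type header."""
    writer = csv.writer(fobj, lineterminator="\n")
    writer.writerow(["field_name", "field_type"])
    writer.writerows(schema_rows(resource))


# Hypothetical descriptor fragment, for illustration only.
package = json.loads("""
{
  "resources": [
    {"name": "boletim",
     "schema": {"fields": [
       {"name": "date", "type": "date"},
       {"name": "state", "type": "string"},
       {"name": "url", "type": "string"},
       {"name": "notes"}
     ]}}
  ]
}
""")

buffer = io.StringIO()
write_schema_csv(package["resources"][0], buffer)
print(buffer.getvalue())
```

In a real script the loop would run over every resource and write to a file under schema/ instead of an in-memory buffer.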

@endersonmaia
Collaborator

Right, ideally these schemas would be generated from datapackage.json, and not the other way around.

turicas/brasil.io#204

@augusto-herrmann
Contributor

That issue is yet another path, different from what we are suggesting here.

Here:

  • datapackage.json -> schemas in a custom CSV format in the schema/*.csv folder
  • datapackage.json -> API documentation

There:

  • schemas in the database -> API documentation

The whole question comes down to the development process. Today the one who develops is @turicas, and it seems he prefers to start by defining the schema in the database. As long as that remains the case, the database would have to be the starting point.

@endersonmaia
Collaborator

endersonmaia commented Jul 2, 2020 via email

@turicas
Owner

turicas commented Jul 2, 2020

If datapackage.json meets the needs we have today (explained below), then I think the ideal would be to keep only datapackage.json in the repository. Brasil.IO could then consume that file, and the schema/*.csv files could be generated automatically from datapackage.json (or, once rows supports pgimport and csv2sqlite with data packages, they could be deleted).

The current needs are:

  • Specification of column names and types (as already exists in schema/*.csv)
  • General metadata, such as: column name (slug), column title (with accents, spaces etc.), column description
  • Brasil.IO-specific metadata, such as: which columns appear as filters in the interface, which columns are displayed in the frontend, which columns are used to build the full-text search index etc. (I choose these manually when I add a dataset to the platform)

I don't know the datapackage specification well, but if there is a way to embed custom metadata (the Brasil.IO ones), then we can start a migration process (it will be much better if everything is unified this way :).
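
A hedged sketch of what embedding such custom metadata could look like. To my knowledge the Data Package specification allows descriptors to carry properties beyond the ones it defines, but whether goodtables and the other tools in use tolerate them should be verified. In the example, `name`, `type`, `title` and `description` are standard field properties; the `brasilio` key, its flags, the descriptions and the helper function are hypothetical names made up for illustration.

```python
import json

# Hypothetical resource descriptor. The "brasilio" block is NOT part of any
# specification; it only illustrates where platform-specific flags could live.
descriptor = json.loads("""
{
  "name": "caso",
  "schema": {
    "fields": [
      {"name": "state", "type": "string", "title": "State",
       "description": "State abbreviation",
       "brasilio": {"filter": true, "show_in_frontend": true, "searchable": false}},
      {"name": "city", "type": "string", "title": "City",
       "description": "Municipality name",
       "brasilio": {"filter": true, "show_in_frontend": true, "searchable": true}},
      {"name": "confirmed", "type": "integer", "title": "Confirmed cases",
       "description": "Accumulated confirmed cases",
       "brasilio": {"filter": false, "show_in_frontend": true, "searchable": false}}
    ]
  }
}
""")


def columns_with(fields, flag):
    """Return the names of fields whose custom block enables the given flag."""
    return [
        field["name"]
        for field in fields
        if field.get("brasilio", {}).get(flag, False)
    ]


fields = descriptor["schema"]["fields"]
print(columns_with(fields, "filter"))
print(columns_with(fields, "searchable"))
```

Grouping the custom flags under a single namespaced key keeps them separate from the standard properties, so tools that only read the standard ones are less likely to be affected.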

@turicas
Owner

turicas commented Jul 2, 2020

@augusto-herrmann you know the data package specification better: do you think it meets the needs above? If so, shall we create an issue in the Brasil.IO repository to deal with this?

About generating the API documentation: since the metadata needs to be stored in the Brasil.IO database (and will not be exactly the same as the datapackage.json I proposed, because the dataset will not always be fully up to date with the repository), it makes sense for the API documentation to be generated automatically from the Brasil.IO database and not from the (future) datapackage.json.

@endersonmaia
Collaborator

endersonmaia commented Jul 2, 2020

I think we should be discussing this over in issue turicas/brasil.io#204

@turicas
Owner

turicas commented Jul 2, 2020

I think we should be discussing this over in issue turicas/brasil.io#204

Agreed. I pasted these comments of mine there.

@augusto-herrmann
Contributor

augusto-herrmann commented Jul 27, 2020

The tests are failing again. Should we reopen this issue or create a new one?

@endersonmaia
Collaborator

The tests are failing again. Should we reopen this issue or create a new one?

Create a new one.

Fields were probably added or their order was changed.
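
To tell those two cases apart, a simplified positional comparison like the one below can help. It is a sketch, not goodtables' actual implementation: the function names are hypothetical, and the labels simply mirror the error codes in the log. The sample data is the boletim case from the log, where the CSV headers are date, notes, state, url and the schema order implied by the error messages is date, state, url, notes.

```python
def header_mismatches(headers, schema_fields):
    """Compare CSV headers with schema field names position by position.

    Returns (column_number, label, detail) tuples, numbered from 1 like the log.
    """
    problems = []
    for position, header in enumerate(headers, start=1):
        if position > len(schema_fields):
            # The CSV has more columns than the schema declares.
            problems.append((position, "extra-header", header))
        elif header != schema_fields[position - 1]:
            expected = schema_fields[position - 1]
            problems.append((position, "non-matching-header", expected))
    # Schema fields with no corresponding CSV column.
    for position in range(len(headers) + 1, len(schema_fields) + 1):
        problems.append((position, "missing-header", schema_fields[position - 1]))
    return problems


def only_reordered(headers, schema_fields):
    """True when both sides have the same names and differ only in order."""
    return headers != schema_fields and sorted(headers) == sorted(schema_fields)


headers = ["date", "notes", "state", "url"]        # from the CSV, per the log
schema_fields = ["date", "state", "url", "notes"]  # order implied by the errors

for column, label, detail in header_mismatches(headers, schema_fields):
    print(f"[-,{column}] [{label}] {detail}")
print("only reordered:", only_reordered(headers, schema_fields))
```

If `only_reordered` is true, reordering the fields in datapackage.json (or the columns in the CSV) should be enough; otherwise fields were added or removed and the schema itself needs updating.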

@augusto-herrmann
Contributor

Created #193.
