
geoheader for 2009_5yr has an invalid character encoding #15

Open
peterwang998 opened this issue Nov 9, 2016 · 9 comments
Labels
historic data: Issues with historic data are currently lower priority

Comments

@peterwang998
Contributor

After unzipping the 2009_5yr data manually, it turns out one of the geoheader files has an incorrect encoding:

(I also removed the -q flag to see where the geoheader import failed.)

psql:create_geoheader.sql:1: NOTICE:  table "geoheader" does not exist, skipping
Creating geoheader comments
Creating temp geoheader
psql:create_tmp_geoheader.sql:1: NOTICE:  table "tmp_geoheader" does not exist, skipping
Creating temp import tables
Importing geoheader
COPY 4946
COPY 7117
COPY 14397
COPY 32732
COPY 2707
COPY 12598
COPY 1899
COPY 15952
COPY 11356
psql:import_geoheader.sql:10: ERROR:  invalid byte sequence for encoding "UTF8": 0xf1 0x6f 0x6e 0x63
CONTEXT:  COPY tmp_geoheader, line 965
Failed importing geoheader.
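
For what it's worth, the failing bytes decode cleanly as Latin-1: 0xf1 is "ñ", so 0xf1 0x6f 0x6e 0x63 reads as "ñonc", likely the middle of a Spanish place name. A quick way to eyeball the offending line (the g20095us.txt filename is an assumption following the usual geoheader naming; the log doesn't show which file failed):

sed -n '965p' g20095us.txt | iconv -f LATIN1 -t UTF-8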

@iandees
Member

iandees commented Nov 9, 2016

I'm pretty sure these files are all latin-1.
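
One way to sanity-check that claim with standard tools (the filename pattern is again an assumption; substitute the actual geoheader files):

file g20095us.txt                                # Latin-1 content is reported as "ISO-8859 text"
iconv -f UTF-8 -t UTF-8 g20095us.txt > /dev/null || echo "not valid UTF-8"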

@peterwang998
Contributor Author

peterwang998 commented Nov 10, 2016

Good call!

WARNING: CHANGING THE ENCODING PREVENTS LOADING THE METADATA (which is encoded in UTF-8)

Just found this for future reference:
https://www.census.gov/geo/maps-data/data/tiger/char_encoding.html

Until 2014, the character encoding was Latin-1.
Starting with 2015, the character encoding changed to UTF-8.

For future reference, if anyone else is hitting this issue, the safe way to resolve it is to drop the database and then create a new one with the correct encoding (LATIN1),

or use

sudo -u postgres psql -c "update pg_database set encoding = pg_char_to_encoding('LATIN1') where datname = 'census'"

if you're OK with an ad hoc solution.
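
A sketch of the drop-and-recreate route (the database name census comes from the command above; note that PostgreSQL only lets you pick a non-default encoding when creating from template0 with a compatible locale):

sudo -u postgres psql -c "DROP DATABASE census"
sudo -u postgres psql -c "CREATE DATABASE census ENCODING 'LATIN1' LC_COLLATE 'C' LC_CTYPE 'C' TEMPLATE template0"

Alternatively, leave the database as UTF-8 and override only the client encoding for the import session; PostgreSQL transcodes during COPY, and the UTF-8 metadata still loads fine afterwards:

PGCLIENTENCODING=LATIN1 psql -d census -f import_geoheader.sql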

@iandees
Member

iandees commented Nov 10, 2016

If you're trying to get at the data, you can just download the SQL dumps here: http://censusreporter.tumblr.com/post/73727555158/easier-access-to-acs-data

@peterwang998
Contributor Author

Unfortunately, not all the tables are available as a snapshot.

@peterwang998
Contributor Author

Now I'm getting a duplicate key error:

COPY 12012
COPY 47601
COPY 6296
COPY 5721
COPY 9645
COPY 2541
COPY 23230
COPY 1571
COPY 29396
COPY 14580
COPY 34314
COPY 17149
COPY 9524
COPY 19446
COPY 12131
COPY 15782
COPY 10689
COPY 12362
COPY 12445
COPY 37573
COPY 27692
Parsing geoheader
psql:parse_tmp_geoheader.sql:55: ERROR:  duplicate key value violates unique constraint "geoheader_pkey"
DETAIL:  Key (geoid)=(04000US49) already exists.
Failed parsing geoheader.
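
Something along these lines should surface the duplicates (this assumes tmp_geoheader exposes a geoid column; if it stores raw fixed-width lines instead, substitute the substring() expression that parse_tmp_geoheader.sql uses to build geoid):

psql -d census -c "SELECT geoid, count(*) FROM tmp_geoheader GROUP BY geoid HAVING count(*) > 1"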

@iandees
Member

iandees commented Nov 10, 2016

What tables aren't available in the snapshot?

@peterwang998
Contributor Author

acs2009_5yr

peterwang998 added a commit to peterwang998/census-postgres-scripts that referenced this issue Nov 10, 2016
Corrected the extraction behavior to extract at the correct location.
This updated bash script should unzip All_Geographies_Not_Tracts_Block_Groups.zip into a tmp path, extract the files from the tmp path into group1, and unzip Tracts_Block_Groups_Only.zip into group2.

Please note that censusreporter#15 might be encountered.
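
A rough sketch of the extraction flow described above (the paths and the nested-zips assumption are illustrative, not taken from the actual script):

mkdir -p /tmp/acs2009_5yr group1 group2
unzip All_Geographies_Not_Tracts_Block_Groups.zip -d /tmp/acs2009_5yr
# assuming the archive nests per-state zips that need a second extraction pass
for z in /tmp/acs2009_5yr/*.zip; do unzip -o "$z" -d group1; done
unzip Tracts_Block_Groups_Only.zip -d group2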
@iandees
Member

iandees commented Nov 11, 2016

I think you're running into the problems I ran into and didn't have time to work through earlier (since Census Reporter wasn't going to use 2009 data).

I would be extremely happy to review a pull request that fixes this, though!

@peterwang998
Contributor Author

I'll try my best; not sure if I can find the duplicate key, though.
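
One place to start: grep the unzipped source files for the offending geoid to see whether the same geography shows up in both extraction groups (group1/group2 per the commit above; the filename pattern is an assumption):

grep -l '04000US49' group1/g20095*.txt group2/g20095*.txt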

JoeGermuska added the "historic data" label Oct 24, 2018