Use types from pudo's 'typecast' library, PEP8 etc #171

Open
wants to merge 36 commits into
base: master
Changes from 35 commits
1390f09
Use `typecast` for type conversion.
pudo Aug 6, 2015
879dc69
Fix up type guessing tests.
pudo Aug 24, 2015
2fdaf25
Hide coverage results.
pudo Aug 24, 2015
d1d0972
Clean up imports.
pudo Aug 24, 2015
2f71c24
Get rid of old type names.
pudo Aug 24, 2015
f06a3c1
Clean out old aliases for XLSXTableSet
pudo Aug 24, 2015
92fb215
Further pieces of clean up.
pudo Aug 24, 2015
1108885
Start getting rid of the compatibility layer
pudo Aug 25, 2015
ed8cda1
Remove remaining awkward compatibility work-arounds.
pudo Aug 25, 2015
e87c774
avoid circular import
pudo Aug 25, 2015
3dd9bad
Clean up README.
pudo Aug 25, 2015
8a56e5d
fix py3 compat
pudo Aug 25, 2015
afca917
Don’t raise for 0 as a date.
pudo Jul 23, 2016
5f4d978
fix up test errors, attempt to make travis pass
pudo Jul 23, 2016
145e2ee
skip tests if en_GB is not supported
pudo Jul 23, 2016
dcdf21d
remove ambiguous var
pudo Jul 23, 2016
de3e840
dont score null values in type detection
pudo Jul 23, 2016
10576f3
Move test utilities to a specific module.
pudo Jul 23, 2016
7da15bf
Move the buffered reader to it’s own module.
pudo Jul 23, 2016
ccb094c
Move guesser class to typecast.
pudo Jul 23, 2016
2565632
Factor out CSV re-coder
pudo Jul 23, 2016
b63baeb
use cchardet
pudo Jul 23, 2016
2e4b96c
simplify the handling of CSV dialects
pudo Jul 23, 2016
f373325
try relative imports with py3
pudo Jul 23, 2016
96549a9
PEP8.
pudo Jul 23, 2016
910b6c2
Simplify JTS code.
pudo Jul 23, 2016
a4c22f3
pep8
pudo Jul 23, 2016
b7b4851
Move stuff around.
pudo Jul 23, 2016
b8f15ed
Formatting.
pudo Jul 23, 2016
3c96240
Replace CSV reader with a fully streaming implementation.
pudo Jul 24, 2016
ce3627c
Fix up Python 3 support
pudo Jul 24, 2016
7dd9e5b
confirm at least python 3.5 is working
pudo Jul 24, 2016
6cd1222
Readd Python 3.4 to Travis
StevenMaude Oct 4, 2016
506269e
Fix missing comma in setup.py
StevenMaude Oct 4, 2016
6638e58
Fix byte concatenation in Python 3.4
StevenMaude Oct 4, 2016
126630d
Merge branch 'master' into cleanup-mt2-redux
Jul 5, 2019
5 changes: 5 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -1,7 +1,12 @@
*.swp
*.egg-info
*.pyc
*.eggs
*.DS_Store
*/_build/*
*.py~
*.~lock.*#
.coverage
dist/*
.tox/*
pyenv3
2 changes: 1 addition & 1 deletion .travis.yml
@@ -1,8 +1,8 @@
language: python
python:
- "2.6"
- "2.7"
- "3.4"
- "3.5"
install:
# Fix for html5lib, probably can be removed after the version after
# 0.999999999/1.0b10 is released.
30 changes: 0 additions & 30 deletions Dockerfile

This file was deleted.

12 changes: 3 additions & 9 deletions Makefile
@@ -1,10 +1,4 @@
-run: build
-	@docker run \
-		--rm \
-		-ti \
-		messytables
+test:
+	nosetests --with-coverage --cover-package=messytables --cover-erase

-build:
-	@docker build -t messytables .
-
-.PHONY: run build
+.PHONY: run build test
12 changes: 2 additions & 10 deletions README.md
@@ -1,19 +1,11 @@
-# Parsing for messy tables
-
-[![Build Status](https://travis-ci.org/okfn/messytables.png?branch=master)](https://travis-ci.org/okfn/messytables)
-[![Coverage Status](https://coveralls.io/repos/okfn/messytables/badge.png?branch=master)](https://coveralls.io/r/okfn/messytables?branch=master)
-[![Latest Version](https://pypip.in/version/messytables/badge.svg)](https://pypi.python.org/pypi/messytables/)
-[![Downloads](https://pypip.in/download/messytables/badge.svg)](https://pypi.python.org/pypi/messytables/)
-[![Supported Python versions](https://pypip.in/py_versions/messytables/badge.svg)](https://pypi.python.org/pypi/ckanserviceprovider/)
-[![Development Status](https://pypip.in/status/messytables/badge.svg)](https://pypi.python.org/pypi/messytables/)
-[![License](https://pypip.in/license/messytables/badge.svg)](https://pypi.python.org/pypi/messytables/)
+# Parsing for messy tables [![Build Status](https://travis-ci.org/okfn/messytables.png?branch=master)](https://travis-ci.org/okfn/messytables) [![Coverage Status](https://coveralls.io/repos/okfn/messytables/badge.png?branch=master)](https://coveralls.io/r/okfn/messytables?branch=master)

 A library for dealing with messy tabular data in several formats, guessing types and detecting headers.

 See the documentation at: https://messytables.readthedocs.io

 Find the package at: https://pypi.python.org/pypi/messytables

-See CONTRIBUTING.md for how to send patches, run tests.
+See ``CONTRIBUTING.md`` for how to send patches, run tests.

 **Contact**: Open Knowledge Labs - http://okfnlabs.org/contact/. We especially recommend the forum: http://discuss.okfn.org/category/open-knowledge-labs/
22 changes: 9 additions & 13 deletions messytables/__init__.py
@@ -1,25 +1,21 @@

from messytables.util import offset_processor, null_processor
from messytables.headers import headers_guess, headers_processor, headers_make_unique
from messytables.headers import headers_guess, headers_processor
from messytables.headers import headers_make_unique
from messytables.types import type_guess, types_processor
from messytables.types import StringType, IntegerType, FloatType, \
DecimalType, DateType, DateUtilType, BoolType
from messytables.error import ReadError

from messytables.core import Cell, TableSet, RowSet, seekable_stream
from messytables.commas import CSVTableSet, CSVRowSet
from messytables.buffered import seekable_stream
from messytables.core import Cell, TableSet, RowSet
from messytables.commas import CSVTableSet, CSVRowSet, TSVTableSet
from messytables.ods import ODSTableSet, ODSRowSet
from messytables.excel import XLSTableSet, XLSRowSet

# XLSXTableSet has been deprecated and its functionality is now provided by
# XLSTableSet. This is to retain backwards compatibility with anyone
# constructing XLSXTableSet directly (rather than using any_tableset)
XLSXTableSet = XLSTableSet
XLSXRowSet = XLSRowSet

from messytables.zip import ZIPTableSet
from messytables.html import HTMLTableSet, HTMLRowSet
from messytables.pdf import PDFTableSet, PDFRowSet
from messytables.any import any_tableset, AnyTableSet
from messytables.any import any_tableset

from messytables.jts import rowset_as_jts, headers_and_typed_as_jts

import warnings
warnings.filterwarnings('ignore', "Coercing non-XML name")
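A note on the `warnings.filterwarnings('ignore', "Coercing non-XML name")` line that closes this hunk: the second argument is a regular expression matched case-insensitively against the start of the warning message. The standalone sketch below shows that behaviour with the standard library only. The two warning texts are made up for illustration and are not taken from messytables or its dependencies.

```python
import warnings

with warnings.catch_warnings(record=True) as caught:
    # Record everything by default so the effect of the filter is visible.
    warnings.simplefilter("always")
    # Same call as in messytables/__init__.py: ignore by message prefix.
    warnings.filterwarnings("ignore", "Coercing non-XML name")

    warnings.warn("Coercing non-XML name: 0abc")   # matches the filter, dropped
    warnings.warn("Something else worth seeing")    # no match, recorded

messages = [str(w.message) for w in caught]
print(messages)
```

Because the filter is installed at import time, it applies process-wide to any warning whose message starts with that text, not only to warnings raised from within the package.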
25 changes: 9 additions & 16 deletions messytables/any.py
@@ -1,8 +1,10 @@
from messytables import (ZIPTableSet, PDFTableSet, CSVTableSet, XLSTableSet,
HTMLTableSet, ODSTableSet)
import messytables
import re

from messytables import ZIPTableSet, PDFTableSet, CSVTableSet, XLSTableSet
from messytables import HTMLTableSet, ODSTableSet, TSVTableSet
from messytables.buffered import seekable_stream
from messytables.error import ReadError


MIMELOOKUP = {'application/x-zip-compressed': 'ZIP',
'application/zip': 'ZIP',
@@ -29,10 +31,8 @@
'application/x-vnd.oasis.opendocument.spreadsheet': 'ODS',
}

def TABTableSet(fileobj):
return CSVTableSet(fileobj, delimiter='\t')

parsers = {'TAB': TABTableSet,
parsers = {'TAB': TSVTableSet,
'ZIP': ZIPTableSet,
'XLS': XLSTableSet,
'HTML': HTMLTableSet,
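The `parsers` change swaps the local `TABTableSet` wrapper for a `TSVTableSet` imported from the package. Its definition is not part of the hunks shown here, so the sketch below only illustrates the general pattern with the standard library: a named tab-delimited reader registered in a dispatch table next to the comma-delimited one. `read_delimited`, `read_tsv` and `PARSERS` are hypothetical names, not messytables APIs.

```python
import csv
import io
from functools import partial


def read_delimited(fileobj, delimiter=","):
    """Hypothetical stand-in for a table-set constructor: returns all rows."""
    return list(csv.reader(fileobj, delimiter=delimiter))


# A reusable, importable TSV entry point instead of an ad-hoc wrapper
# defined inside the dispatch module.
read_tsv = partial(read_delimited, delimiter="\t")

# Dispatch table keyed by a short format label, as in any.py.
PARSERS = {
    "CSV": read_delimited,
    "TAB": read_tsv,
}

rows = PARSERS["TAB"](io.StringIO("a\tb\n1\t2\n"))
print(rows)
```

Keeping the tab-delimited variant next to the CSV code means callers can import it directly, and the dispatch module no longer needs to know how it is configured.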
@@ -62,7 +62,7 @@ def get_mime(fileobj):
import magic
# Since we need to peek the start of the stream, make sure we can
# seek back later. If not, slurp in the contents into a StringIO.
fileobj = messytables.seekable_stream(fileobj)
fileobj = seekable_stream(fileobj)
header = fileobj.read(4096)
mimetype = magic.from_buffer(header, mime=True)
fileobj.seek(0)
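`get_mime` reads the first 4096 bytes, hands them to `magic.from_buffer`, and seeks back to 0 so that the parser chosen later still sees the whole stream. The sketch below shows the same peek-and-rewind step without python-magic, using two well-known file signatures in place of a real detector. `sniff` and `SIGNATURES` are illustrative names and cover far fewer formats than libmagic.

```python
import io

# Well-known leading byte signatures; a real detector knows many more.
SIGNATURES = [
    (b"PK\x03\x04", "ZIP"),
    (b"%PDF-", "PDF"),
]


def sniff(fileobj, size=4096):
    """Peek at the start of a seekable binary stream, then rewind it."""
    header = fileobj.read(size)
    fileobj.seek(0)  # leave the stream where the caller expects it
    for magic_bytes, label in SIGNATURES:
        if header.startswith(magic_bytes):
            return label
    return None


buf = io.BytesIO(b"%PDF-1.4\nrest of the document")
print(sniff(buf), buf.tell())
```

The rewind only works on seekable streams, which is why the hunk wraps `fileobj` with `seekable_stream` before reading.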
@@ -160,13 +160,6 @@ def any_tableset(fileobj, mimetype=None, extension='', auto_detect=True, **kw):
mimetype=magic_mime))

if error:
raise messytables.ReadError('any: \n'.join(error))
raise ReadError('any: \n'.join(error))
else:
raise messytables.ReadError("any: Did not attempt any detection.")


class AnyTableSet:
'''Deprecated - use any_tableset instead.'''
@staticmethod
def from_fileobj(fileobj, mimetype=None, extension=None):
return any_tableset(fileobj, mimetype=mimetype, extension=extension)
raise ReadError("any: Did not attempt any detection.")
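A note on `'any: \n'.join(error)`, which this hunk carries over with only the exception name changed: `str.join` uses the string as a separator, so `'any: \n'` lands between messages rather than in front of them. The snippet below demonstrates that with made-up messages and shows one possible alternative, assuming the intent was a single `any: ` prefix. That intent is a guess, not something the PR states.

```python
# Hypothetical error messages, for illustration only.
errors = ["CSV: no rows", "XLS: not a workbook"]

# What the expression in any.py produces: the literal acts as a separator.
joined = 'any: \n'.join(errors)

# One possible alternative if a leading prefix was intended.
prefixed = "any: " + "\n".join(errors)

print(repr(joined))
print(repr(prefixed))
```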
89 changes: 89 additions & 0 deletions messytables/buffered.py
@@ -0,0 +1,89 @@
import io

BUFFER_SIZE = 4096


def seekable_stream(fileobj):
    try:
        fileobj.seek(0)
        # if we got here, the stream is seekable
        return fileobj
    except:
        # otherwise seek failed, so slurp in stream and wrap
        # it in a BytesIO
        return BufferedFile(fileobj)


class BufferedFile(object):
    """A buffered file that preserves the beginning of a stream."""

    def __init__(self, fp, buffer_size=BUFFER_SIZE + 2):
        self.data = io.BytesIO()
        self.fp = fp
        self.offset = 0
        self.len = 0
        self.fp_offset = 0
        self.buffer_size = buffer_size

    def _next_line(self):
        try:
            return self.fp.readline()
        except AttributeError:
            return next(self.fp)

    def _read(self, n):
        return self.fp.read(n)

    @property
    def _buffer_full(self):
        return self.len >= self.buffer_size

    def readline(self):
        if self.len < self.offset < self.fp_offset:
            raise BufferError('Line is not available anymore')
        if self.offset >= self.len:
            line = self._next_line()
            self.fp_offset += len(line)

            self.offset += len(line)

            if not self._buffer_full:
                self.data.write(line)
                self.len += len(line)
        else:
            line = self.data.readline()
            self.offset += len(line)
        return line

    def read(self, n=-1):
        if n == -1:
            # if the request is to do a complete read, then do a complete
            # read.
            self.data.seek(self.offset)
            return self.data.read(-1) + self.fp.read(-1)

        if self.len < self.offset < self.fp_offset:
            raise BufferError('Data is not available anymore')
        if self.offset >= self.len:
            byte = self._read(n)
            self.fp_offset += len(byte)

            self.offset += len(byte)

            if not self._buffer_full:
                self.data.write(byte)
                self.len += len(byte)
        else:
            byte = self.data.read(n)
            self.offset += len(byte)
        return byte

    def tell(self):
        return self.offset

    def seek(self, offset):
        if self.len < offset < self.fp_offset:
            raise BufferError('Cannot seek because data is not buffered here')
        self.offset = offset
        if offset < self.len:
            self.data.seek(offset)
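To make the intent of `seekable_stream` easier to follow, here is a simplified, self-contained sketch of the same idea: try `seek(0)`, and if the stream cannot seek, wrap it in an object that remembers what has been read so the caller can rewind. This is not the `BufferedFile` class from the diff. Unlike that class, the sketch keeps every byte it has read rather than capping the buffer, and it has no `readline`. `NonSeekable`, `ReplayHead` and `ensure_seekable` are hypothetical names.

```python
import io


class NonSeekable:
    """Hypothetical stand-in for a socket-like stream: read() only."""

    def __init__(self, payload):
        self._inner = io.BytesIO(payload)

    def read(self, n=-1):
        return self._inner.read(n)


class ReplayHead:
    """Remembers every byte read so far so that seek() can replay it."""

    def __init__(self, fp):
        self.fp = fp
        self.head = b""   # all bytes pulled from fp so far
        self.pos = 0      # current read position within the logical stream

    def read(self, n=-1):
        # Serve remembered bytes first, then pull the remainder from fp.
        end = None if n < 0 else self.pos + n
        chunk = self.head[self.pos:end]
        missing = -1 if n < 0 else n - len(chunk)
        if missing != 0:
            fresh = self.fp.read(missing)
            self.head += fresh
            chunk += fresh
        self.pos += len(chunk)
        return chunk

    def seek(self, offset):
        if offset > len(self.head):
            raise ValueError("cannot seek past data that has been read")
        self.pos = offset


def ensure_seekable(fileobj):
    """Return fileobj if it can seek, otherwise a replaying wrapper."""
    try:
        fileobj.seek(0)
        return fileobj
    except (AttributeError, OSError):
        return ReplayHead(fileobj)


stream = ensure_seekable(NonSeekable(b"a,b\n1,2\n"))
peek = stream.read(3)   # sniff the first bytes
stream.seek(0)          # rewind before handing the stream to a parser
print(peek, stream.read())
```

The sketch catches `AttributeError` and `OSError` explicitly, which covers streams without a `seek` method and streams that raise `io.UnsupportedOperation`. The version in the diff uses a bare `except`, which also catches anything else `seek` might raise.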