Sprint Day One report - Brisbane #2

Open
richyvk opened this issue Jun 1, 2017 · 4 comments

Comments

@richyvk
Collaborator

richyvk commented Jun 1, 2017

Hi all

So, we've had a day of talking! We've deliberated a lot. We in Brisbane have concluded that the existing lesson is too Pandas-heavy, but the Software Carpentry gapminder lesson could work really well as the basis for the LC Python lesson.

So, handing over to whoever wants to take this up, otherwise we'll be working on it tomorrow. We've imported the SC lesson into the data-lessons account; the repo URL is: https://github.com/data-lessons/library-python-intro

Stuff we think needs doing:

  • Remove the Pandas stuff from the lesson - we've deemed Pandas out of scope for this lesson.
  • Change wording and examples to be more library relevant in the rest of the lesson.

But, we figure a lot of it can pretty much stay as it is!

We have failed to come up with one single compelling 'superpower' example to run through the lesson. But, some more ideas we've had for examples (a lot of these might be useful for certain episodes):

  • Deleting rogue punctuation from an Excel spreadsheet.
  • Comparing two sets of data for differences, eg two sets of article IDs - one locally and one on a vendor database - and you want to know which are missing from each - ie sets.
  • Cleaning webpages of non-preferred language - eg you have a list of preferred terms for things (think branding etc) and you identify pages that use non-preferred alternatives to this language.
  • Cleaning dates in Excel.
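
The article ID comparison in the list above could look something like this with Python sets. This is only a minimal sketch: the function name `compare_ids` and the sample IDs are made up for illustration, and it assumes the IDs from both sources are already loaded as plain strings in the same format.

```python
def compare_ids(local_ids, vendor_ids):
    """Report which IDs appear in only one of the two sources."""
    # Converting to sets drops duplicates and makes the difference cheap to compute.
    local, vendor = set(local_ids), set(vendor_ids)
    return {
        "missing_from_vendor": sorted(local - vendor),  # held locally only
        "missing_from_local": sorted(vendor - local),   # held by the vendor only
    }


# Hypothetical sample data, not taken from any real catalogue.
local_ids = ["A100", "A101", "A102", "A104"]
vendor_ids = ["A101", "A102", "A103"]

result = compare_ids(local_ids, vendor_ids)
print(result["missing_from_vendor"])  # ['A100', 'A104']
print(result["missing_from_local"])   # ['A103']
```

Set difference only finds exact mismatches, so IDs would need consistent case and whitespace before comparing.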

That's pretty much it from us for today. We'll get stuck in again tomorrow with editing the lesson. But go for it in the meantime if you want to!

@libADS

libADS commented Jun 1, 2017

"Comparing two sets of data for differences, eg two sets of article IDs - one locally and one on a vendor database - and you want to know which are missing from each - ie sets."

I just had to do this in the past few days, so this is a reasonable use case to me :)

@drjwbaker

Pad for this at http://pad.software-carpentry.org/lc-new-python

@richyvk
Collaborator Author

richyvk commented Jun 2, 2017

@libADS Can I ask how you did your comparing? Excel? Manually?

@libADS

libADS commented Jun 2, 2017

@richyvk Initially in Python, after loading the CSV files into memory. The tricky part came from not necessarily having exact matches, for instance titles might be spelled slightly differently between two files of articles. In the end I imported the data into a local Postgres instance, which lets me try different queries much faster. For fuzzy matching in Python I used:

from difflib import SequenceMatcher

def similar(a, b):
    # Similarity ratio from 0.0 (nothing in common) to 1.0 (identical)
    return SequenceMatcher(None, a, b).ratio()

Postgres has an extension to do this too
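
For anyone wanting to try the fuzzy comparison across two whole lists of titles, here is a rough sketch. It is not the code used above: it swaps in `difflib.get_close_matches` from the standard library (which relies on the same `SequenceMatcher` ratio), and the function name `unmatched_titles`, the sample titles and the 0.9 cutoff are all assumptions for illustration.

```python
from difflib import get_close_matches


def unmatched_titles(local_titles, vendor_titles, cutoff=0.9):
    """Return local titles with no sufficiently similar vendor title."""
    # Normalise case and surrounding whitespace so trivial differences don't count.
    vendor_norm = [t.strip().lower() for t in vendor_titles]
    missing = []
    for title in local_titles:
        # n=1: we only need to know whether at least one close match exists.
        candidates = get_close_matches(
            title.strip().lower(), vendor_norm, n=1, cutoff=cutoff
        )
        if not candidates:
            missing.append(title)
    return missing


# Hypothetical sample data with one slightly different spelling.
local_titles = ["Library Carpentry in practice", "Data cleaning for librarians"]
vendor_titles = ["Library carpentry in practise", "Metadata basics"]

print(unmatched_titles(local_titles, vendor_titles))
```

The cutoff is a judgment call: lower values tolerate more spelling variation but risk pairing unrelated titles. Every local title is compared against every vendor title, so this gets slow on large files, which is one reason moving the data into a database can help.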

3 participants