auto-tagger

Text classifier that suggests tags for Stack Overflow questions.

Requirements, Assumptions

This software has been tested on a relatively clean installation of Mac OS X 10.9.3 with the following components installed:

Python 2.7
Node.js/npm (0.10/1.4)

If these are not available, the easiest way to obtain them is with Homebrew.

For the front-end, we'll need the Bower package manager, which can be installed with npm install -g bower.

Additional Installation

The project depends on several Python and front-end packages, as well as a locally running Solr server. These can be obtained as follows:

pip install -r requirements.txt installs the Python packages. You may want to install them into a Python virtualenv for this project.
cd src/main/python/static/ followed by bower install downloads all front-end dependencies into a bower_components directory.
brew install solr installs the Solr server. If you are not using Homebrew, Solr can be installed manually from Apache.

Getting Stack Overflow questions into Solr

The project contains a script for downloading questions from the Stack Overflow API. Running

python get-so-data.py

from src/main/python will download the questions into the file question_data.json. Currently, the download consists of 250k questions and may take up to an hour. To change the download size, modify the value of questions_per_tag in the script.
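To get a feel for how questions_per_tag drives the number of API requests, here is a minimal sketch. It is not taken from get-so-data.py, which may work differently. It assumes the public Stack Exchange API endpoint (version 2.3) and its limit of 100 items per page; the function name page_urls is hypothetical.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2.7

# Assumption: the public Stack Exchange API, not necessarily what the script uses.
API_URL = "https://api.stackexchange.com/2.3/questions"
PAGE_SIZE = 100  # the API returns at most 100 items per page


def page_urls(tag, questions_per_tag):
    """Build the request URLs needed to fetch questions_per_tag questions for one tag."""
    # Round up so a partial last page is still requested.
    pages = (questions_per_tag + PAGE_SIZE - 1) // PAGE_SIZE
    urls = []
    for page in range(1, pages + 1):
        params = [
            ("site", "stackoverflow"),
            ("tagged", tag),
            ("pagesize", PAGE_SIZE),
            ("page", page),
        ]
        urls.append(API_URL + "?" + urlencode(params))
    return urls


if __name__ == "__main__":
    for url in page_urls("python", 250):
        print(url)
```

Under these assumptions, each tag costs roughly questions_per_tag / 100 requests, so lowering questions_per_tag shortens the download proportionally.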

Once the download is finished, it's time to start Solr. Solr requires a directory for reading its configuration and writing its data. In this case, it should be the absolute path to the src/main/solr-config directory inside this project. As such, starting Solr should look like

solr [absolute_path_to_auto_tagger_project]/src/main/solr-config
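If you are unsure of the absolute path, a small helper can print the full command for you. This is only a convenience sketch; the function name solr_start_command is hypothetical, and it assumes you pass the location of the project root.

```python
import os


def solr_start_command(project_root):
    """Return the solr command line with an absolute path to the config directory."""
    config_dir = os.path.abspath(
        os.path.join(project_root, "src", "main", "solr-config")
    )
    return "solr " + config_dir


if __name__ == "__main__":
    # Run from the project root; prints the command to paste into a shell.
    print(solr_start_command("."))
```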

Confirm that Solr is running by checking http://localhost:8983/solr/.
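The same check can be scripted, for example before kicking off a long load. The following is a minimal sketch that only tests whether the URL above answers with HTTP 200; the function name solr_is_running and the timeout value are assumptions, not part of this project.

```python
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2.7


def solr_is_running(url="http://localhost:8983/solr/", timeout=5):
    """Return True if the given URL answers with HTTP 200, False otherwise."""
    try:
        response = urlopen(url, timeout=timeout)
    except IOError:
        # Connection refused, timeouts, and HTTP error statuses all end up here.
        return False
    try:
        return response.getcode() == 200
    finally:
        response.close()


if __name__ == "__main__":
    print("Solr is up" if solr_is_running() else "Solr is not reachable")
```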

Finally, the questions should be loaded into Solr by running

python load_questions.py question_data.json

For 250k questions, this should run without (excessive) swapping on a fairly modern system with 8 GB of RAM available. Confirm that the questions are loaded by going to the Solr admin page, selecting the collection "stackoverflow-collection", and checking the number of documents in the "statistics" pane. The number of documents will be smaller than 250k because of duplicate questions with overlapping tags.
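One plausible reading of the duplicate effect: if questions are fetched per tag, a question carrying two of the requested tags is downloaded twice but can only be stored once. The sketch below illustrates this with made-up data. The layout of the sample dictionary and the function name unique_questions are hypothetical; the actual format of question_data.json may differ.

```python
def unique_questions(questions_by_tag):
    """Collapse per-tag download results into one entry per question ID."""
    seen = {}
    for questions in questions_by_tag.values():
        for question in questions:
            # A later copy of the same question replaces the earlier one.
            seen[question["question_id"]] = question
    return list(seen.values())


# Hypothetical sample: question 1 is tagged with both "python" and "django",
# so it shows up in both per-tag result lists.
downloaded = {
    "python": [
        {"question_id": 1, "tags": ["python", "django"]},
        {"question_id": 2, "tags": ["python"]},
    ],
    "django": [
        {"question_id": 1, "tags": ["python", "django"]},
    ],
}

if __name__ == "__main__":
    fetched = sum(len(qs) for qs in downloaded.values())
    print("fetched: %d, unique: %d" % (fetched, len(unique_questions(downloaded))))
```

Here three questions are fetched but only two are distinct, which mirrors why the document count in Solr ends up below the download size.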

Running the classifier

The classifier can be started with

python classifier_app.py

Open a browser pointing to http://localhost:5000/static/predict.html and start typing some hypothetical questions...
