srccat reads files from source code repositories, stored either in a folder or
a tar archive, and concatenates them into large tar archives.
This is especially useful for processing on Hadoop or Spark: a block on the
Hadoop Distributed File System (HDFS) is at least 64MB, so data-intensive jobs
with Spark or Hadoop's MapReduce perform better when many small files are
concatenated into larger, more suitably sized ones.
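To make the Hadoop/Spark angle concrete, the sketch below shows one hedged way
such archives could be consumed from a Spark job, assuming the Spark Java API
and Apache Commons Compress. The HDFS path, class name, and application name
are illustrative and not part of srccat.

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
    import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    import scala.Tuple2;

    public class ListArchivedSources {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("list-srccat-archives");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                // One input per archive: each tar roughly fills an HDFS block,
                // so the scheduler handles a handful of large inputs instead of
                // millions of tiny source files.
                JavaPairRDD<String, Long> sizes = sc
                    .binaryFiles("hdfs:///data/srccat-output/*.tar")  // hypothetical path
                    .flatMapToPair(archive -> {
                        List<Tuple2<String, Long>> entries = new ArrayList<>();
                        try (TarArchiveInputStream tar =
                                 new TarArchiveInputStream(archive._2.open())) {
                            TarArchiveEntry entry;
                            while ((entry = tar.getNextTarEntry()) != null) {
                                if (entry.isFile()) {
                                    // Record each archived source file and its size.
                                    entries.add(new Tuple2<>(entry.getName(), entry.getSize()));
                                }
                            }
                        }
                        return entries.iterator();
                    });
                System.out.println("source files found: " + sizes.count());
            }
        }
    }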
srccat walks through the files of the repositories and filters out all files
that are not text or that are too large to be human-readable code. It then
creates tar archives of at least 128MB from those files.
All file paths in the resulting tar archives are relative to REPO_ROOT.
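The following is not srccat's actual implementation, only a minimal sketch of
that approach, assuming Apache Commons Compress for tar output. The 1MB
per-file limit and the NUL-byte text heuristic are illustrative assumptions;
only the 128MB minimum archive size comes from the description above.

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.stream.Stream;

    import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
    import org.apache.commons.compress.archivers.tar.TarArchiveOutputStream;

    public class ConcatSketch {
        // Illustrative thresholds; the real tool's limits may differ.
        static final long MAX_FILE_SIZE = 1L << 20;        // skip files > 1MB
        static final long MIN_ARCHIVE_SIZE = 128L << 20;   // roll over after ~128MB

        public static void main(String[] args) throws IOException {
            Path repoRoot = Paths.get(args[0]);
            Path outDir = Paths.get(args[1]);
            Files.createDirectories(outDir);

            long written = 0;
            int archiveIndex = 0;
            TarArchiveOutputStream tar = newArchive(outDir, archiveIndex++);

            try (Stream<Path> files = Files.walk(repoRoot)) {
                for (Path p : (Iterable<Path>) files.filter(Files::isRegularFile)::iterator) {
                    long size = Files.size(p);
                    // Drop empty, oversized, and binary-looking files.
                    if (size == 0 || size > MAX_FILE_SIZE || !looksLikeText(p)) continue;

                    // Store the entry under a path relative to REPO_ROOT.
                    TarArchiveEntry entry = new TarArchiveEntry(repoRoot.relativize(p).toString());
                    entry.setSize(size);
                    tar.putArchiveEntry(entry);
                    Files.copy(p, tar);
                    tar.closeArchiveEntry();

                    written += size;
                    if (written >= MIN_ARCHIVE_SIZE) {     // archive is big enough: start the next one
                        tar.close();
                        tar = newArchive(outDir, archiveIndex++);
                        written = 0;
                    }
                }
            }
            tar.close();
        }

        // Crude text heuristic (assumption): no NUL byte in the first 8KB.
        static boolean looksLikeText(Path p) throws IOException {
            byte[] head = new byte[8192];
            try (InputStream in = Files.newInputStream(p)) {
                int n = in.read(head);
                for (int i = 0; i < n; i++) if (head[i] == 0) return false;
            }
            return true;
        }

        static TarArchiveOutputStream newArchive(Path outDir, int index) throws IOException {
            OutputStream out = Files.newOutputStream(outDir.resolve("part-" + index + ".tar"));
            TarArchiveOutputStream tar = new TarArchiveOutputStream(out);
            tar.setLongFileMode(TarArchiveOutputStream.LONGFILE_POSIX);  // allow long entry names
            return tar;
        }
    }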
srccat assumes the following directory structure, which is the one used by
crawld:
REPO_ROOT
└── Language Folder
└── Github User
└── Repository
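For instance, with a made-up language folder, GitHub user, and repository
name, a crawled project would sit at:

REPO_ROOT
└── Go
    └── octocat
        └── Hello-World
            ├── main.go
            └── README.md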
srccat is built with make and takes the repositories root and an output folder
as arguments; the optional -j flag sets the number of jobs:

make build
java -jar srccat.jar [-j=<numJobs>] <REPO_ROOT> <OUTPUT_FOLDER>
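For example, with illustrative paths and a job count of four:

java -jar srccat.jar -j=4 /data/crawld/repos /data/srccat/archives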