When populating CouchDB databases, often the source of the data is initially some JSON documents in a file, or some structured CSV/TSV data from another database's export.
couchimport is designed to assist with importing such data into CouchDb efficiently. Simply pipe a file full of JSON documents into couchimport, telling the URL and database to send the data to.
Note:
couchimport
used to handle the CSV to JSON conversion, but this part is now handled by csvtojsonlines, keeping this package smaller and easier to maintain. The[email protected]
package is the last version to support CSV/TSV natively - from 2.0 onwards,couchimport
is only for pouring JSONL files into CouchDB.
Also note: the companion CSV export utility (couchexport) is now hosted at couchcsvexport.
Install using npm or another Node.js package manager:
npm install -g couchimport
couchimport can either read JSON docs (one per line) from stdin e.g.
cat myfile.json | couchimport
or by passing a filename as the last parameter:
couchimport myfile.json
couchimport's configuration parameters can be stored in environment variables or supplied as command line arguments.
Simply set the COUCH_URL
environment variable e.g. for a hosted Cloudant database
export COUCH_URL="https://myusername:[email protected]"
and define the name of the CouchDB database to write to by setting the COUCH_DATABASE
environment variable e.g.
export COUCH_DATABASE="mydatabase"
Simply pipe the text data into "couchimport":
cat mydata.jsonl | couchimport
Supply the --url
and --database
parameters as command-line parameters instead:
couchimport --url "http://user:password@localhost:5984" --database "mydata" mydata.jsonl
or by piping data into stdin:
cat mydata.jsonl | couchimport --url "http://user:password@localhost:5984" --database "mydata"
We can use another package csvtojsonlines to convert CSV/TSV files into a JSONL stream acceptable to couchimport
:
# CSV file ----> JSON lines ---> CouchDB
cat transactions.csv | csvtojsonlines --delimiter ',' | couchimport --db ledger
couchimport can be paired with datamaker to generate any amount of sample data:
# template ---> datamaker ---> 100 JSON docs ---> couchimport ---> CouchDB
echo '{"_id":"{{uuid}}","name":"{{name}}","email":"{{email true}}","dob":"{{date 1950-01-01}}"}' | datamaker -f json -i 100 | couchimport --db people
written {"docCount":100,"successCount":1,"failCount":0,"statusCodes":{"201":1}}
written {"batch":1,"batchSize":100,"docSuccessCount":100,"docFailCount":0,"statusCodes":{"201":1},"errors":{}}
Import complete
or with the template as a file:
cat template.json | datamaker -f json -i 10000 | couchimport --db people
We know if we get an HTTP 4xx/5xx response, then all of the documents failed to be written to the database. But as couchimport is writing data in bulk, the bulk request may get an HTTP 201 response that doesn't mean that all of the documents were written. Some of the document ids may have been in the database already. So the couchimport output includes counts of the number of documents that were written successfully and the number that failed, and a tally of the HTTP response codes and individual document error messages:
e.g.
written {"batch":10,"batchSize":1,"docSuccessCount":4,"docFailCount":6,"statusCodes":{"201":10},"errors":{"conflict":6}}
The above log line shows that after the tenth batch of writes, we have written 4 documents and failed to write 6 others. There were six "conflict" errors, meaning that there was a clash of document id or id/rev combination.
Older versions of couchimport supported the ability to have multiple HTTP requests in flight at any one time, but the new simplified couchimport does not. To achieve the same thing, simply split your file of JSON docs into smaller pieces and run multiple couchimport jobs:
# split large file into files 1m lines each
# this will create files xaa, xab, xac etc
split -l 1000000 massive.txt
# find all files starting with x and using xargs,
# spawn a max of 2 process at once running couchimport,
# one for each file
find . -name "x*" | xargs -t -I % -P 2 couchimport --db test %
- COUCH_URL - the url of the CouchDB instance (required, or to be supplied on the command line)
- COUCH_DATABASE - the database to deal with (required, or to be supplied on the command line)
- COUCH_BUFFER_SIZE - the number of records written to CouchDB per bulk write (defaults to 500, not required)
- IAM_API_KEY - to use IBM IAM to do authentication, set the IAM_API_KEY to your api key and a bearer token will be used in the HTTP requests.
You can also configure couchimport
using command-line parameters:
--help
- show help--url
/-u
- the url of the CouchDB instance (required, or to be supplied in the environment)--database
/--db
/-d
- the database to deal with (required, or to be supplied in the environment)--buffer
/-b
- the number of records written to CouchDB per bulk write (defaults to 500, not required)
In your project, add couchimport
into the dependencies of your package.json or run npm install --save couchimport
. In your code, require the library with
const couchimport = require('couchimport')
and your options are set in an object whose keys are the same as the command line paramters:
e.g.
const opts = { url: "http://localhost:5984", database: "mydb", rs: fs.createReadStream('myfile.json') }
await couchimport(opts)
Note:
rs
is the readstream where data will be read (default:stdin
) andws
is the write stream where the output will be written (default:stdout
)