Chinese blog about this project: 量化系列2 - 众包数据集 (Quant Series 2: A Crowdsourced Dataset)
- Download the tarball from the latest release page on GitHub
- Extract the tar file into the default Qlib data directory
wget https://github.com/chenditc/investment_data/releases/download/2023-04-20/qlib_bin.tar.gz
tar -zxvf qlib_bin.tar.gz -C ~/.qlib/qlib_data/cn_data --strip-components=2
If you want to contribute to the scripts or the data, here is how to set up a development environment.
Install dolt by following the instructions at https://github.com/dolthub/dolt
Raw data hosted on dolt: https://www.dolthub.com/repositories/chenditc/investment_data
To download as dolt database:
dolt clone chenditc/investment_data
docker run -v /<some output directory>:/output -it --rm chenditc/investment_data bash dump_qlib_bin.sh && cp ./qlib_bin.tar.gz /output/
You will need a Tushare token to use the Tushare API. Get one from https://tushare.pro/
export TUSHARE=<Token>
bash daily_update.sh
docker run -v /<some output directory>:/output -it --rm chenditc/investment_data bash daily_update.sh && bash dump_qlib_bin.sh && cp ./qlib_bin.tar.gz /output/
tar -zxvf qlib_bin.tar.gz -C ~/.qlib/qlib_data/cn_data --strip-components=2
- Try to fill in missing data by combining multiple data sources, e.g. data for delisted companies.
- Try to correct data by cross-validating against multiple data sources.
The database tables on DoltHub are named with a prefix indicating the data source, for example ts_a_stock_eod_price. The meaning of each prefix:
- w (wind): high-quality static data source. Only available up to 2019.
- c (caihui): high-quality static data source. Only available up to 2019.
- ts: Tushare data source
- ak: Akshare data source
- yahoo: Qlib's Yahoo collector https://github.com/microsoft/qlib/tree/main/scripts/data_collector/yahoo
- final: merged final data with validation and correction
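As a sketch, the prefix convention above can be resolved programmatically, e.g. when iterating over tables. The mapping below mirrors the list; the helper function itself is hypothetical, not part of the repo:

```python
# Hypothetical helper: map a DoltHub table name to its data source
# based on the prefix convention described above.
SOURCE_BY_PREFIX = {
    "w": "wind",
    "c": "caihui",
    "ts": "tushare",
    "ak": "akshare",
    "yahoo": "yahoo",
    "final": "final (merged)",
}

def table_source(table_name: str) -> str:
    """Return the data source for a table like 'ts_a_stock_eod_price'."""
    prefix = table_name.split("_", 1)[0]
    return SOURCE_BY_PREFIX.get(prefix, "unknown")
```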
- w (wind): use one_time_db_scripts to import the w_a_stock_eod_price table; used as the initial price standard
- c (caihui): SQL import into the c_a_stock_eod_price table
- ts (tushare):
  - Use tushare/update_stock_list.sh to load the stock list
  - Use tushare/update_stock_price.sh to load stock prices
- yahoo:
  - Use the Yahoo collector to load stock prices
Currently the daily update uses only the Tushare data source and is triggered by a GitHub Action.
- I maintain an offline job which runs daily_update.sh every 30 minutes to collect data and push it to DoltHub.
- A GitHub Action (.github/workflows/upload_release.yml) is triggered daily; it calls bash dump_qlib_bin.sh to generate the daily tar file and upload it to the release page.
- Use the w data source as the baseline, and validate the other data sources against it.
- Since w data's adjclose differs from ts data's adjclose, we use a "link date" to calculate a ratio that maps ts adjclose onto w adjclose. The link date can be the latest first-valid date across the data sources. We don't use a fixed link date because some stocks might not be trading on a specific date, and listing and delisting dates all differ. The link date and adj_ratio are stored in link_table, where adj_ratio = link_adj_close / w_adj_close.
- Append ts data to the final dataset; its adjclose becomes ts_adj_close / ts_adj_ratio.
- Generate the final data by concatenating the w data and the ts data.
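A minimal sketch of the link-date merge described above, assuming each source is a simple date-to-adjclose mapping (function and variable names are illustrative, not the repo's actual code):

```python
def link_and_merge(w_series: dict, ts_series: dict) -> dict:
    """Merge ts adjclose onto the w adjclose scale via a link date.

    w_series / ts_series: dict mapping ISO date string -> adj_close.
    """
    # Link date: the latest first-valid date across the two sources,
    # so both series are guaranteed to have a value on that date.
    link_date = max(min(w_series), min(ts_series))
    # adj_ratio = link_adj_close / w_adj_close (as stored in link_table).
    adj_ratio = ts_series[link_date] / w_series[link_date]
    merged = dict(w_series)
    last_w_date = max(w_series)
    for date, ts_close in ts_series.items():
        if date > last_w_date:
            # Appended ts rows are rescaled: ts_adj_close / ts_adj_ratio.
            merged[date] = ts_close / adj_ratio
    return merged
```

With w data ending in 2019 and ts data continuing afterwards, the ts tail is appended on the w price scale.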
- Run validation on each pair of data sources:
  - Compare the absolute values of high, low, open, close, and volume.
  - Calculate the adjclose conversion ratio using a link date for each stock.
  - Calculate the w data adjclose using the link date's ratio and compare it with the final data.
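The pairwise comparison could look roughly like this. This is a simplified sketch; the column handling and tolerances in the real validation scripts may differ:

```python
def find_mismatches(series_a: dict, series_b: dict, rel_tol: float = 1e-4) -> list:
    """Return dates where two price series disagree beyond rel_tol.

    series_a / series_b: dict mapping date -> value (e.g. close or volume).
    Only dates present in both series are compared.
    """
    common_dates = sorted(set(series_a) & set(series_b))
    return [
        d for d in common_dates
        if abs(series_a[d] - series_b[d])
           > rel_tol * max(abs(series_a[d]), abs(series_b[d]))
    ]
```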
To add a new stock index, we need to:
- Add an index weight download script. Change the tushare/dump_index_eod_price.py script to dump the index info. If the index is not available in Tushare, write a new script and add it to daily_update.sh. Example commit
- Add a price download script. Change tushare/dump_index_eod_price.py to add the index price. Example commit
- Modify the export script. Change the Qlib dump script qlib/dump_index_weight.py#L13 so that the index is dumped and renamed to a txt file for use. Example commit