feat: add PySpark ETL example project #1

Open
wants to merge 4 commits into
base: main
10 changes: 10 additions & 0 deletions .github/release.yml
@@ -0,0 +1,10 @@
changelog:
  exclude:
    labels:
      - ignore-for-release
    authors:
      - octocat
  categories:
    - title: Breaking Changes 🛠
      labels:
        - Semver-Major
178 changes: 178 additions & 0 deletions automotive_data_etl/.gitignore
@@ -0,0 +1,178 @@
# Created by .ignore support plugin (hsz.mobi)

### Windows template
# Windows thumbnail cache files
Thumbs.db
ehthumbs.db
ehthumbs_vista.db

# Dump file
*.stackdump

# Folder config file
[Dd]esktop.ini

# Recycle Bin used on file shares
$RECYCLE.BIN/

# Windows Installer files
*.cab
*.msi
*.msix
*.msm
*.msp

# Windows shortcuts
*.lnk
### macOS template
# General
.DS_Store
.AppleDouble
.LSOverride

# Icon must end with two \r
Icon

# Thumbnails
._*

# Files that might appear in the root of a volume
.DocumentRevisions-V100
.fseventsd
.Spotlight-V100
.TemporaryItems
.Trashes
.VolumeIcon.icns
.com.apple.timemachine.donotpresent

# Directories potentially created on remote AFP share
.AppleDB
.AppleDesktop
Network Trash Folder
Temporary Items
.apdisk
### Python template
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/
.pytest_cache/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# pyenv
.python-version

# celery beat schedule file
celerybeat-schedule

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
### Linux template
*~

# temporary files which can be created if a process still has a handle open of a deleted file
.fuse_hidden*

# KDE directory preferences
.directory

# Linux trash folder which might appear on any partition or disk
.Trash-*

# .nfs files are created when an open file is removed but is still being accessed
.nfs*

### VisualStudioCode template
.vscode

### JetBrains template
.idea
21 changes: 21 additions & 0 deletions automotive_data_etl/LICENSE
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2022 ming

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
89 changes: 89 additions & 0 deletions automotive_data_etl/README.md
@@ -0,0 +1,89 @@
# pyspark-etl-template

---

[![python](https://img.shields.io/badge/python-3.10-blue)](https://www.python.org/)
[![pyspark](https://img.shields.io/badge/pyspark-3.3.0-brightgreen)](https://spark.apache.org/docs/latest/api/python/)

An engineering template for PySpark ETL projects. For reference, see the documentation: [Pyspark-etl-doc](https://pyloong.github.io/pythonic-project-guidelines/bigdata/basis/init/)

## Usage

### 1. Prerequisites
- Install Python 3.9 or 3.10
```angular2html
https://www.python.org/downloads/
```
- Install Java 8
```angular2html
https://www.oracle.com/java/technologies/downloads/#java8
```
- Install Hadoop 3.0+
```bash
# https://archive.apache.org/dist/hadoop/common/
tar -zxvf hadoop-3.1.1.tar.gz
```
- Install [poetry](https://python-poetry.org/docs/)
```bash
pip install poetry
```
- Install [cookiecutter](https://github.com/cookiecutter/cookiecutter)
```bash
# Install or upgrade cookiecutter
pip install -U cookiecutter
```

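### 2. Create the project skeleton

Use [cookiecutter](https://github.com/cookiecutter/cookiecutter) to load the project template. Its interactive prompts let you choose which features to include.

Run this command in a terminal: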

```bash
cookiecutter https://github.com/pyloong/cookiecutter-pythonic-bigdata
```

### 3. Initialize the environment
```bash
# Run from the project root directory (pyspark-project-template)
poetry install
```

## Project structure
```angular2html
├─src
│  │  __init__.py
│  │
│  └─pyspark_etl_template
│      │  cmdline.py
│      │  executor.py
│      │  constants.py
│      │  __init__.py
│      │
│      ├─configs
│      │  │  dev.toml
│      │  │  global.toml
│      │  │  prod.toml
│      │  └─ test.toml
│      │
│      ├─dependecies
│      │  │  log.py
│      │  │  config.py
│      │  └─ __init__.py
│      │
│      ├─tasks
│      │  │  __init__.py
│      │  │
│      │  └─abstract
│      │      │  task.py
│      │      │  transform.py
│      │      └─ __init__.py
│      │
│      └─utils
│          │  exception.py
│          └─ __init__.py
│
└─tests
    │  conftest.py
    │  test_version.py
    └─ __init__.py
```
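
The tree lists `tasks/abstract/task.py` and `tasks/abstract/transform.py`, but this page does not show their contents. As a rough illustration of the kind of base classes such modules might hold, here is a minimal sketch. Every name in it (`AbstractTask`, `AbstractTransform`, `DemoTask`, `UppercaseTransform`) is a hypothetical stand-in, not the template's actual API, and plain Python lists stand in for Spark DataFrames so that the example runs without a Spark installation.

```python
from abc import ABC, abstractmethod
from typing import Any, List


class AbstractTransform(ABC):
    """Hypothetical base class: one reusable data transformation."""

    @abstractmethod
    def transform(self, data: Any) -> Any:
        """Return a transformed version of data."""


class AbstractTask(ABC):
    """Hypothetical base class: one ETL job as extract -> transform -> load."""

    @abstractmethod
    def _extract(self) -> Any:
        """Read the input data."""

    @abstractmethod
    def _transform(self, data: Any) -> Any:
        """Apply the business logic."""

    @abstractmethod
    def _load(self, data: Any) -> None:
        """Write the result to its destination."""

    def run(self) -> None:
        # Fixed pipeline order; subclasses only fill in the three steps.
        self._load(self._transform(self._extract()))


class UppercaseTransform(AbstractTransform):
    """Toy transform: upper-case every string in a list."""

    def transform(self, data: List[str]) -> List[str]:
        return [item.upper() for item in data]


class DemoTask(AbstractTask):
    """Toy task that uses in-memory lists instead of Spark DataFrames."""

    def __init__(self) -> None:
        self.sink: List[str] = []

    def _extract(self) -> List[str]:
        return ["audi", "bmw"]

    def _transform(self, data: List[str]) -> List[str]:
        return UppercaseTransform().transform(data)

    def _load(self, data: List[str]) -> None:
        self.sink.extend(data)


task = DemoTask()
task.run()
```

Keeping `run()` in the base class means each concrete task only decides what to read, how to change it, and where to write it.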
73 changes: 73 additions & 0 deletions automotive_data_etl/docs/development.md
@@ -0,0 +1,73 @@
# Begin

## Init project environment

- git init
- git config
- poetry install
- git commit

## Develop

- code
- git commit
- tox

## Delivery

### Run tox

Run tox to format the code and run the tests.

```shell script
tox
```

### Git tag

Update the package version, then commit.

Add a tag:

```shell script
git tag -a v0.1.0
```
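
Because the version is edited by hand before tagging, the two values can drift apart. A minimal sketch of a consistency check follows. The helper name `tag_matches_version` is hypothetical, and it assumes tags follow the `v<version>` pattern used in the example above.

```python
def tag_matches_version(tag: str, version: str) -> bool:
    """Return True when a tag such as 'v0.1.0' names the given version."""
    # Assumption: tags are the package version with an optional leading 'v'.
    bare_tag = tag[1:] if tag.startswith("v") else tag
    return bare_tag == version


matches = tag_matches_version("v0.1.0", "0.1.0")
mismatch = tag_matches_version("v0.1.0", "0.2.0")
```

A check like this could run before `poetry build` so that a package is never built from a tag that disagrees with its declared version.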

### Build

Build the distribution package for this tag.

```shell script
poetry build
```

### Upload to an index server

Upload to PyPI, or pass `--repository <name>` to publish to another index server. The name must refer to a repository already registered with `poetry config repositories.<name> <url>`.

```shell script
poetry publish
```

## Development guide

### PyCharm Configuration

Open the project in PyCharm.

#### Modules in src cannot be imported

From the menu bar, click `File` --> `Settings` --> `Project Settings` --> `Project Structure`.
Mark the `src` and `tests` directories as sources.

#### Enable pytest

Click `File` --> `Settings` --> `Tools` --> `Python Integrated Tools` --> `Testing` --> `Default runner`, then select
`pytest`.

If you previously ran the tests with `Unittests`, delete the old run configuration: open the
`Edit Run/Debug configurations` dialog in the upper right corner of the PyCharm window, then delete the configuration.

### Others

Confirm that the `src` directory is in `sys.path`. If it is missing, add it with `sys.path.extend(['/tmp/demo/src'])`.
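
The check and the fix can be combined so that the directory is only added when it is actually missing. This is a minimal sketch: the helper name `ensure_on_sys_path` is hypothetical, and `/tmp/demo/src` is just the example path from the text above, so substitute your own project's `src` directory.

```python
import sys
from pathlib import Path


def ensure_on_sys_path(directory: str) -> bool:
    """Add directory to sys.path if missing; return True if it was added."""
    resolved = str(Path(directory).resolve())
    if resolved in sys.path:
        return False
    sys.path.append(resolved)
    return True


# Example path taken from the text; replace it with your project's src directory.
added = ensure_on_sys_path("/tmp/demo/src")
```

Calling the helper a second time with the same path returns `False` and leaves `sys.path` unchanged, so it is safe to call from more than one entry point.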