Skip to content

murillo-ro-silva/pyspark_test1

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

8 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Data Engineering Challenge

The Problem: digital media effectiveness

The solve problem: digital media effectiveness

In summary, my analysis should answer the 4 questions as described below:

  1. What was the most expensive campaign?
  2. What was the most profitable campaign?
  3. Which ad creative is the most effective in terms of clicks?
  4. Which ad creative is the most effective in terms of generating leads?

Docker Preparation:

Jupyter + Spark:

1 - Download: Here

2 - Create dns network:

$ sudo docker network create ponte

3 - After download completed, go to load docker image.

$ sudo docker image load -i "your download directory"/jupyter_spark_murillo.tar.gz

4 - Entry in clone project and then run the command below to start an image with jupyter / spark.

$ sudo docker run --rm -p 10000:8888 --net=ponte --user root -e JUPYTER_ENABLE_LAB=yes -e GRANT_SUDO=yes -v "$PWD":/home/jovyan/work jupyter/spark

Obs:

  • It necessary to enter into project cloned for better navigation.
  • It will necessary inserts the generate key, as the sample image below:

jupyter-example


Mysql:

1 - Download: Here

2 - After download completed, go to load docker image.

$ sudo docker image load -i "your download directory"/mysql.tar.gz

3 - Run Docker

$ sudo docker run --name=mysql_murillo --net=ponte -v "clone project diretory"/mysql:/var/lib/mysql -e MYSQL_ROOT_PASSWORD=password -d mysql/mysql-server

4 - Exec Docker

$ sudo docker exec -it mysql_murillo mysql -uroot -p"$MYSQL_ROOT_PASSWORD"

Obs: Maybe, it will going to insert the password, if went requested, please inform the pass password

After load, run e exec. Its possible acess the mysql database and see all objects, such as Schema=creditas and all tables(customer_lead_funnel,facebook_ads,google_ads,pageviews) and own registers, as image below: jupyter-example


Notebook Analysis:

Brief explanation.

  • Into folder jupyter, your can find the archive AnalisysResult.ipynb, all the explorations went realized into, as well all the processing and anwers of the questions.

  • The analysis went divided in:

    • Read original files.
    • Structuring dataframes.
    • Save dataframes in parquets (Raw and Refined visions).
    • Save dataframe em Mysql Database.

Detailed explanations:

Data exploration and structuring work has been done, which can be followed in detail by accessing the jupyter AnalisysResult.ipynb. After exploring data, context-specific dataframes were created, after structuring the data without filters was saved in the Raw directory within this project.

The following is a view of the Raw structure of the data.


1 - Structure Process.

1.1 -Post structure datadrame Google: Google

1.2 - Post structure datadrame Facebook: Facebook

1.3 - Post structure datadrame Pageviwes: page

1.4 - Post structure datadrame Customer Lead Funnel: clf

After data saved in Raw was persisted in a MySql Local database (docker), being possible to access them through DataViz tools.


2 - Persist Process.

2.1 - google_ads Table. Google

2.2 - facebook_ads Table. facebook

2.3 - pageviews Table.
page

2.4 - customer_lead_funnel Table. clf

obs: Both registers can be consulting in MySql Docker Database.


3 - Create and save visions in refined zone.

3.1 - Most Expensive Vision.
most-expens

3.2 - Most Profitable Vision.
most-profitable

3.3 - Most Effective Clicks
most-effective-clicks

3.4 - Most Effective Leads
most-effetive-leads

3.5 - Save in parquets in Refined. refined

Conclusion.

The entire process was performed locally as well as processing and persistence using docker images. As we know there are several ways to propose a solution to a problem, I tried to think of a simpler one.

The process can be streamlined by running on some cloud service (Dataproc, EMR, Hortonworks, Cloudera and whatever) and scheduling via Airflow.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published