Merge pull request #9 from statisticsnorway/bucket-migration
Fixed bucket paths
bjornandre authored Nov 2, 2023
2 parents 5ab8930 + 1ec0c79 commit 0a28193
Showing 10 changed files with 88 additions and 88 deletions.
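The hunks below all make the same substitution: the shared bucket is renamed from `ssb-prod-dapla-felles-data-delt` to `ssb-dapla-felles-data-delt-prod`, moving the environment marker from prefix to suffix. A minimal sketch of that substitution as a helper (the function name and structure are illustrative, not part of this PR):

```python
# Old and new bucket names, as seen throughout the diff below.
OLD_BUCKET = "ssb-prod-dapla-felles-data-delt"
NEW_BUCKET = "ssb-dapla-felles-data-delt-prod"

def migrate_path(path: str) -> str:
    """Rewrite an old-style GCS path to the new bucket name; other paths pass through unchanged."""
    return path.replace(OLD_BUCKET, NEW_BUCKET)

print(migrate_path("gs://ssb-prod-dapla-felles-data-delt/temp4/_delta_log"))
# -> gs://ssb-dapla-felles-data-delt-prod/temp4/_delta_log
```

Applying this mechanically to every `gs://` string is exactly what the 88 additions / 88 deletions amount to.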

Large diffs are not rendered by default.


@@ -1,7 +1,7 @@
{
"hash": "6815a21efbf30f4ad1ce04adbb0125ba",
"result": {
"markdown": "---\nfreeze: true\ntitle: Introduksjon til SparkR\n---\n\n\n\n\n\nAkkurat som PySpark så gir [SparkR](https://spark.apache.org/docs/latest/sparkr) oss et grensesnitt mot Apache Spark fra R. I denne notebooken viser vi noen eksempler hvordan du gjøre vanlige operasjoner med SparkR. \n\n## Oppsett\n\nEksemplene i notebooken bruker `SparkR (k8s cluster)` på <https://jupyter.dapla.ssb.no/>. Det vil si at den kan distribuere kjøringene på flere maskiner i Kubernetes. \n\n\n::: {.cell tags='[]' execution_count=1}\n``` {.r .cell-code}\nspark\n```\n\n::: {.cell-output .cell-output-display}\n```\nJava ref type org.apache.spark.sql.SparkSession id 1 \n```\n:::\n:::\n\n\n## Lese inn fil\n\n::: {.cell tags='[]' execution_count=2}\n``` {.r .cell-code}\nfile = read.parquet(\"gs://ssb-prod-dapla-felles-data-delt/temp/timeseries.parquet\")\n```\n:::\n\n\n::: {#print-cell .cell tags='[]' execution_count=3}\n``` {.r .cell-code}\nselectedColumns <- select(file, \"Date\", \"Year\", \"Quarter\", \"Month\", \"serie00\", \"serie01\")\nshowDF(selectedColumns, numRows = 5)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n+----------+----+-------+-----+------------------+------------------+\n| Date|Year|Quarter|Month| serie00| serie01|\n+----------+----+-------+-----+------------------+------------------+\n|2000-01-01|2000| 1| 01| 9.495232388801012| 19.016168503192|\n|2000-02-01|2000| 1| 02| 10.70952411634649|21.404467063442723|\n|2000-03-01|2000| 1| 03|11.118293927071951| 21.25035527677261|\n|2000-04-01|2000| 2| 04| 9.346911680164684|19.982136698759238|\n|2000-05-01|2000| 2| 05| 9.663303382177363|19.925236690504494|\n+----------+----+-------+-----+------------------+------------------+\nonly showing top 5 rows\n```\n:::\n:::\n\n\n# Skrive ut fil\n\nUnder skriver vi ut en fil og spesifiserer at vi overskriver evt filer med samme navn. 
\n\n::: {.cell tags='[]' execution_count=4}\n``` {.r .cell-code}\nwrite.parquet(file,\n \"gs://ssb-prod-dapla-felles-data-delt/temp/timeseries_copy.parquet\",\n mode = \"overwrite\")\n```\n:::\n\n\n",
"markdown": "---\nfreeze: true\ntitle: Introduksjon til SparkR\n---\n\n\n\n\n\nAkkurat som PySpark så gir [SparkR](https://spark.apache.org/docs/latest/sparkr) oss et grensesnitt mot Apache Spark fra R. I denne notebooken viser vi noen eksempler hvordan du gjøre vanlige operasjoner med SparkR. \n\n## Oppsett\n\nEksemplene i notebooken bruker `SparkR (k8s cluster)` på <https://jupyter.dapla.ssb.no/>. Det vil si at den kan distribuere kjøringene på flere maskiner i Kubernetes. \n\n\n::: {.cell tags='[]' execution_count=1}\n``` {.r .cell-code}\nspark\n```\n\n::: {.cell-output .cell-output-display}\n```\nJava ref type org.apache.spark.sql.SparkSession id 1 \n```\n:::\n:::\n\n\n## Lese inn fil\n\n::: {.cell tags='[]' execution_count=2}\n``` {.r .cell-code}\nfile = read.parquet(\"gs://ssb-dapla-felles-data-delt-prod/temp/timeseries.parquet\")\n```\n:::\n\n\n::: {#print-cell .cell tags='[]' execution_count=3}\n``` {.r .cell-code}\nselectedColumns <- select(file, \"Date\", \"Year\", \"Quarter\", \"Month\", \"serie00\", \"serie01\")\nshowDF(selectedColumns, numRows = 5)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n+----------+----+-------+-----+------------------+------------------+\n| Date|Year|Quarter|Month| serie00| serie01|\n+----------+----+-------+-----+------------------+------------------+\n|2000-01-01|2000| 1| 01| 9.495232388801012| 19.016168503192|\n|2000-02-01|2000| 1| 02| 10.70952411634649|21.404467063442723|\n|2000-03-01|2000| 1| 03|11.118293927071951| 21.25035527677261|\n|2000-04-01|2000| 2| 04| 9.346911680164684|19.982136698759238|\n|2000-05-01|2000| 2| 05| 9.663303382177363|19.925236690504494|\n+----------+----+-------+-----+------------------+------------------+\nonly showing top 5 rows\n```\n:::\n:::\n\n\n# Skrive ut fil\n\nUnder skriver vi ut en fil og spesifiserer at vi overskriver evt filer med samme navn. 
\n\n::: {.cell tags='[]' execution_count=4}\n``` {.r .cell-code}\nwrite.parquet(file,\n \"gs://ssb-dapla-felles-data-delt-prod/temp/timeseries_copy.parquet\",\n mode = \"overwrite\")\n```\n:::\n\n\n",
"supporting": [
"sparkr-intro_files"
],
52 changes: 26 additions & 26 deletions dapla-manual/notebooks/spark/deltalake-intro.ipynb
@@ -142,7 +142,7 @@
"source": [
"%%time\n",
"data.write.format(\"delta\").mode(\"overwrite\").save(\n",
" \"gs://ssb-prod-dapla-felles-data-delt/temp4\"\n",
" \"gs://ssb-dapla-felles-data-delt-prod/temp4\"\n",
")"
]
},
@@ -165,11 +165,11 @@
{
"data": {
"text/plain": [
"['ssb-prod-dapla-felles-data-delt/temp4/_delta_log',\n",
" 'ssb-prod-dapla-felles-data-delt/temp4/_delta_log/',\n",
" 'ssb-prod-dapla-felles-data-delt/temp4/_delta_log/00000000000000000000.json',\n",
" 'ssb-prod-dapla-felles-data-delt/temp4/part-00000-9b3b81a9-2771-4fb4-9f0f-659fd160d643-c000.snappy.parquet',\n",
" 'ssb-prod-dapla-felles-data-delt/temp4/part-00001-0f2f8ba5-3161-41e8-b5d1-2084128a5bed-c000.snappy.parquet']"
"['ssb-dapla-felles-data-delt-prod/temp4/_delta_log',\n",
" 'ssb-dapla-felles-data-delt-prod/temp4/_delta_log/',\n",
" 'ssb-dapla-felles-data-delt-prod/temp4/_delta_log/00000000000000000000.json',\n",
" 'ssb-dapla-felles-data-delt-prod/temp4/part-00000-9b3b81a9-2771-4fb4-9f0f-659fd160d643-c000.snappy.parquet',\n",
" 'ssb-dapla-felles-data-delt-prod/temp4/part-00001-0f2f8ba5-3161-41e8-b5d1-2084128a5bed-c000.snappy.parquet']"
]
},
"execution_count": 30,
@@ -182,7 +182,7 @@
"\n",
"fs = FileClient.get_gcs_file_system()\n",
"\n",
"fs.glob(\"gs://ssb-prod-dapla-felles-data-delt/temp4/**\")"
"fs.glob(\"gs://ssb-dapla-felles-data-delt-prod/temp4/**\")"
]
},
{
@@ -245,7 +245,7 @@
}
],
"source": [
"deltaTable = DeltaTable.forPath(spark, \"gs://ssb-prod-dapla-felles-data-delt/temp4\")\n",
"deltaTable = DeltaTable.forPath(spark, \"gs://ssb-dapla-felles-data-delt-prod/temp4\")\n",
"deltaTable.toDF().show()"
]
},
@@ -308,7 +308,7 @@
}
],
"source": [
"deltaTable2 = DeltaTable.forPath(spark, \"gs://ssb-prod-dapla-felles-data-delt/temp4\")\n",
"deltaTable2 = DeltaTable.forPath(spark, \"gs://ssb-dapla-felles-data-delt-prod/temp4\")\n",
"deltaTable2.toDF().show()"
]
},
@@ -411,7 +411,7 @@
"outputs": [],
"source": [
"new_df.write.format(\"delta\").mode(\"append\").save(\n",
" \"gs://ssb-prod-dapla-felles-data-delt/temp4\"\n",
" \"gs://ssb-dapla-felles-data-delt-prod/temp4\"\n",
")"
]
},
@@ -467,16 +467,16 @@
{
"data": {
"text/plain": [
"['ssb-prod-dapla-felles-data-delt/temp4/_delta_log',\n",
" 'ssb-prod-dapla-felles-data-delt/temp4/_delta_log/',\n",
" 'ssb-prod-dapla-felles-data-delt/temp4/_delta_log/00000000000000000000.json',\n",
" 'ssb-prod-dapla-felles-data-delt/temp4/_delta_log/00000000000000000001.json',\n",
" 'ssb-prod-dapla-felles-data-delt/temp4/_delta_log/00000000000000000002.json',\n",
" 'ssb-prod-dapla-felles-data-delt/temp4/part-00000-73e5052f-1b82-48da-ab37-2cbc01bb46c1-c000.snappy.parquet',\n",
" 'ssb-prod-dapla-felles-data-delt/temp4/part-00000-9b3b81a9-2771-4fb4-9f0f-659fd160d643-c000.snappy.parquet',\n",
" 'ssb-prod-dapla-felles-data-delt/temp4/part-00000-d04d0ca2-8e8b-42e9-a8a3-0fed9a0e4e41-c000.snappy.parquet',\n",
" 'ssb-prod-dapla-felles-data-delt/temp4/part-00001-0f2f8ba5-3161-41e8-b5d1-2084128a5bed-c000.snappy.parquet',\n",
" 'ssb-prod-dapla-felles-data-delt/temp4/part-00001-30d707e4-dd9a-4bfd-a4c7-7fbb1933e9ae-c000.snappy.parquet']"
"['ssb-dapla-felles-data-delt-prod/temp4/_delta_log',\n",
" 'ssb-dapla-felles-data-delt-prod/temp4/_delta_log/',\n",
" 'ssb-dapla-felles-data-delt-prod/temp4/_delta_log/00000000000000000000.json',\n",
" 'ssb-dapla-felles-data-delt-prod/temp4/_delta_log/00000000000000000001.json',\n",
" 'ssb-dapla-felles-data-delt-prod/temp4/_delta_log/00000000000000000002.json',\n",
" 'ssb-dapla-felles-data-delt-prod/temp4/part-00000-73e5052f-1b82-48da-ab37-2cbc01bb46c1-c000.snappy.parquet',\n",
" 'ssb-dapla-felles-data-delt-prod/temp4/part-00000-9b3b81a9-2771-4fb4-9f0f-659fd160d643-c000.snappy.parquet',\n",
" 'ssb-dapla-felles-data-delt-prod/temp4/part-00000-d04d0ca2-8e8b-42e9-a8a3-0fed9a0e4e41-c000.snappy.parquet',\n",
" 'ssb-dapla-felles-data-delt-prod/temp4/part-00001-0f2f8ba5-3161-41e8-b5d1-2084128a5bed-c000.snappy.parquet',\n",
" 'ssb-dapla-felles-data-delt-prod/temp4/part-00001-30d707e4-dd9a-4bfd-a4c7-7fbb1933e9ae-c000.snappy.parquet']"
]
},
"execution_count": 38,
@@ -487,7 +487,7 @@
"source": [
"# Lister ut filene i bøtta\n",
"fs = FileClient.get_gcs_file_system()\n",
"fs.glob(\"gs://ssb-prod-dapla-felles-data-delt/temp4/**\")"
"fs.glob(\"gs://ssb-dapla-felles-data-delt-prod/temp4/**\")"
]
},
{
@@ -565,7 +565,7 @@
"fs = FileClient.get_gcs_file_system()\n",
"\n",
"# Filsti\n",
"path = \"gs://ssb-prod-dapla-felles-data-delt/temp4/_delta_log/00000000000000000002.json\"\n",
"path = \"gs://ssb-dapla-felles-data-delt-prod/temp4/_delta_log/00000000000000000002.json\"\n",
"\n",
"with fs.open(path, \"r\") as f:\n",
" for line in f:\n",
@@ -720,7 +720,7 @@
"outputs": [],
"source": [
"# Leser inn filen\n",
"df = spark.read.format(\"delta\").load(\"gs://ssb-prod-dapla-felles-data-delt/temp4\")"
"df = spark.read.format(\"delta\").load(\"gs://ssb-dapla-felles-data-delt-prod/temp4\")"
]
},
{
Expand Down Expand Up @@ -780,7 +780,7 @@
" df.write.format(\"delta\")\n",
" .mode(\"append\")\n",
" .option(\"userMetadata\", json.dumps(metadata)) # Serialize metadata to a string\n",
" .save(\"gs://ssb-prod-dapla-felles-data-delt/temp4\")\n",
" .save(\"gs://ssb-dapla-felles-data-delt-prod/temp4\")\n",
")"
]
},
@@ -794,7 +794,7 @@
"outputs": [],
"source": [
"# Laster inn tabellen\n",
"deltaTable = DeltaTable.forPath(spark, \"gs://ssb-prod-dapla-felles-data-delt/temp4\")\n",
"deltaTable = DeltaTable.forPath(spark, \"gs://ssb-dapla-felles-data-delt-prod/temp4\")\n",
"\n",
"# Henter ut historien\n",
"history_df = deltaTable.history()"
@@ -933,7 +933,7 @@
"fs = FileClient.get_gcs_file_system()\n",
"\n",
"# Filsti\n",
"path = \"gs://ssb-prod-dapla-felles-data-delt/temp4/_delta_log/00000000000000000003.json\"\n",
"path = \"gs://ssb-dapla-felles-data-delt-prod/temp4/_delta_log/00000000000000000003.json\"\n",
"\n",
"with fs.open(path, \"r\") as f:\n",
" for line in f:\n",
20 changes: 10 additions & 10 deletions dapla-manual/notebooks/spark/pyspark-intro.ipynb
@@ -275,7 +275,7 @@
"\n",
"```python\n",
"df.write.parquet(\n",
" \"gs://ssb-prod-dapla-felles-data-delt/temp/timeseries.parquet\"\n",
" \"gs://ssb-dapla-felles-data-delt-prod/temp/timeseries.parquet\"\n",
")\n",
"```\n",
"\n",
@@ -294,7 +294,7 @@
"outputs": [],
"source": [
"df.write.mode(\"overwrite\").parquet(\n",
" \"gs://ssb-prod-dapla-felles-data-delt/temp/timeseries.parquet\"\n",
" \"gs://ssb-dapla-felles-data-delt-prod/temp/timeseries.parquet\"\n",
")"
]
},
@@ -317,12 +317,12 @@
{
"data": {
"text/plain": [
"['ssb-prod-dapla-felles-data-delt/temp/',\n",
" 'ssb-prod-dapla-felles-data-delt/temp/timeseries.parquet',\n",
" 'ssb-prod-dapla-felles-data-delt/temp/timeseries.parquet/',\n",
" 'ssb-prod-dapla-felles-data-delt/temp/timeseries.parquet/_SUCCESS',\n",
" 'ssb-prod-dapla-felles-data-delt/temp/timeseries.parquet/part-00000-b32e7299-0590-4b31-bcc2-dc3d58725529-c000.snappy.parquet',\n",
" 'ssb-prod-dapla-felles-data-delt/temp/timeseries.parquet/part-00001-b32e7299-0590-4b31-bcc2-dc3d58725529-c000.snappy.parquet']"
"['ssb-dapla-felles-data-delt-prod/temp/',\n",
" 'ssb-dapla-felles-data-delt-prod/temp/timeseries.parquet',\n",
" 'ssb-dapla-felles-data-delt-prod/temp/timeseries.parquet/',\n",
" 'ssb-dapla-felles-data-delt-prod/temp/timeseries.parquet/_SUCCESS',\n",
" 'ssb-dapla-felles-data-delt-prod/temp/timeseries.parquet/part-00000-b32e7299-0590-4b31-bcc2-dc3d58725529-c000.snappy.parquet',\n",
" 'ssb-dapla-felles-data-delt-prod/temp/timeseries.parquet/part-00001-b32e7299-0590-4b31-bcc2-dc3d58725529-c000.snappy.parquet']"
]
},
"execution_count": 7,
@@ -335,7 +335,7 @@
"\n",
"fs = FileClient.get_gcs_file_system()\n",
"\n",
"fs.glob(\"gs://ssb-prod-dapla-felles-data-delt/temp/**\")"
"fs.glob(\"gs://ssb-dapla-felles-data-delt-prod/temp/**\")"
]
},
{
@@ -386,7 +386,7 @@
],
"source": [
"df_ts = spark.read.parquet(\n",
" \"gs://ssb-prod-dapla-felles-data-delt/temp/timeseries.parquet\"\n",
" \"gs://ssb-dapla-felles-data-delt-prod/temp/timeseries.parquet\"\n",
")\n",
"df_ts.select(\"Date\", \"Year\", \"Quarter\", \"Month\", \"serie66\", \"serie55\").show(5)"
]
4 changes: 2 additions & 2 deletions dapla-manual/notebooks/spark/sparkr-intro.ipynb
@@ -63,7 +63,7 @@
},
"outputs": [],
"source": [
"file = read.parquet(\"gs://ssb-prod-dapla-felles-data-delt/temp/timeseries.parquet\")"
"file = read.parquet(\"gs://ssb-dapla-felles-data-delt-prod/temp/timeseries.parquet\")"
]
},
{
@@ -117,7 +117,7 @@
"outputs": [],
"source": [
"write.parquet(file,\n",
" \"gs://ssb-prod-dapla-felles-data-delt/temp/timeseries_copy.parquet\",\n",
" \"gs://ssb-dapla-felles-data-delt-prod/temp/timeseries_copy.parquet\",\n",
" mode = \"overwrite\")"
]
}
6 changes: 3 additions & 3 deletions dapla-manual/statistikkere/altinn3.qmd
@@ -100,7 +100,7 @@ from dapla import FileClient
fs = FileClient.get_gcs_file_system()

# Sett inn filstien din her
file = "gs://ssb-prod-dapla-felles-data-delt/altinn3/form_dc551844cd74.xml"
file = "gs://ssb-dapla-felles-data-delt-prod/altinn3/form_dc551844cd74.xml"

dom = parseString(fs.cat_file(file))
pretty_xml = dom.toprettyxml(indent=" ")
@@ -227,7 +227,7 @@
fs = FileClient.get_gcs_file_system()

from_path = "gs://ssb-prod-arbmark-skjema-data-kilde/ledstill/altinn/2022/11/21/"
to_path = "gs://ssb-prod-dapla-felles-data-delt/altinn3/"
to_path = "gs://ssb-dapla-felles-data-delt-prod/altinn3/"
fs.copy(from_path, to_path, recursive=True)

```
@@ -247,7 +247,7 @@ xml_files = fs.glob("gs://ra0678-01-altinn-data-prod-e17d-ssb-altinn/2023/3/10/*

# Stien du ønsker å kopiere til.
# Koden under foutsetter at du har med gs:// først
to_folder = "gs://ssb-prod-dapla-felles-data-delt/"
to_folder = "gs://ssb-dapla-felles-data-delt-prod/"

# Kopierer over filene
for file in xml_files:
Expand Down
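The truncated hunk above copies globbed Altinn XML files into the shared bucket one by one. A minimal sketch of what such a copy loop looks like, assuming a gcsfs-style filesystem object with a `copy(src, dst)` method (the helper name and structure here are illustrative, not taken from the repository):

```python
def copy_all(fs, files, to_folder):
    """Copy each source file into to_folder, keeping only the file's base name.

    fs        -- a filesystem object exposing copy(src, dst), e.g. from
                 FileClient.get_gcs_file_system() (assumed interface)
    files     -- iterable of full gs:// source paths
    to_folder -- destination folder ending in "/"
    """
    copied = []
    for file in files:
        # Keep only the last path component as the target file name.
        target = to_folder + file.rstrip("/").split("/")[-1]
        fs.copy(file, target)
        copied.append(target)
    return copied
```

With `to_folder = "gs://ssb-dapla-felles-data-delt-prod/"`, a source file `.../form.xml` lands as `gs://ssb-dapla-felles-data-delt-prod/form.xml`.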