updated assessment ipynb
- notebooks/assesment.ipynb +38 -0
notebooks/assesment.ipynb
CHANGED
@@ -0,0 +1,38 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# PySpark Data Engineering Assessment\n",
+    "\n",
+    "## Tasks\n",
+    "\n",
+    "1. Read the CSV data (in `../data/titanic.csv`) into:\n",
+    " - a Pandas DataFrame\n",
+    " - a Spark DataFrame\n",
+    "\n",
+    "2. Perform some data cleaning (e.g., drop rows with nulls in `Age` or `Fare`).\n",
+    "\n",
+    "3. Run basic aggregations:\n",
+    " - Find the average Fare by Pclass\n",
+    " - Find survival rate by Sex and Pclass\n",
+    " - etc.\n",
+    "\n",
+    "4. Write the cleaned Spark DataFrame to a Parquet file.\n",
+    "\n",
+    "5. Bonus tasks:\n",
+    " - Create a temporary Spark SQL table/view, query it with SQL syntax.\n",
+    " - Provide quick EDA (e.g., distribution of Ages).\n",
+    "\n"
+   ]
+  }
+ ],
+ "metadata": {
+  "language_info": {
+   "name": "python"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
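The Pandas half of tasks 1 to 3 in the notebook could be sketched roughly as below. This is only an illustration under stated assumptions, not the intended solution: the inline sample rows and the `PassengerId` column are hypothetical stand-ins for `../data/titanic.csv`, which is not shown in this commit, and the helper names (`load`, `clean`, `avg_fare_by_pclass`, `survival_rate`) are made up for the example. The Spark DataFrame, Parquet, and SQL steps are left out here.

```python
import io

import pandas as pd

# Hypothetical sample rows standing in for ../data/titanic.csv.
# Only the column names Survived, Pclass, Sex, Age and Fare come from the task list.
SAMPLE_CSV = """PassengerId,Survived,Pclass,Sex,Age,Fare
1,0,3,male,22,7.25
2,1,1,female,38,71.25
3,1,3,female,26,7.75
4,1,1,female,35,53.25
5,0,3,male,,8.00
6,0,1,male,54,
7,0,3,male,2,21.00
"""


def load(source):
    """Task 1 (Pandas part): read CSV data from a path or file-like object."""
    return pd.read_csv(source)


def clean(df):
    """Task 2: drop rows that have a null in Age or Fare."""
    return df.dropna(subset=["Age", "Fare"])


def avg_fare_by_pclass(df):
    """Task 3a: mean Fare for each passenger class."""
    return df.groupby("Pclass")["Fare"].mean()


def survival_rate(df):
    """Task 3b: mean of the 0/1 Survived flag for each (Sex, Pclass) group."""
    return df.groupby(["Sex", "Pclass"])["Survived"].mean()


raw = load(io.StringIO(SAMPLE_CSV))
cleaned = clean(raw)
fare_by_class = avg_fare_by_pclass(cleaned)
survival = survival_rate(cleaned)

print(fare_by_class)
print(survival)
```

To run this against the real file, `load("../data/titanic.csv")` would replace the `io.StringIO` call, assuming the file uses the same column names.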