deagar/spark_sandbox

deagar committed on Jan 21

Commit e19a510

1 Parent(s): 0643282

updated assessment ipynb

Files changed (1)

notebooks/assesment.ipynb CHANGED

+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# PySpark Data Engineering Assessment\n",
+    "\n",
+    "## Tasks\n",
+    "\n",
+    "1. Read the CSV data (in `../data/titanic.csv`) into:\n",
+    "   - a Pandas DataFrame\n",
+    "   - a Spark DataFrame\n",
+    "\n",
+    "2. Perform some data cleaning (e.g., drop rows with nulls in `Age` or `Fare`).\n",
+    "\n",
+    "3. Run basic aggregations:\n",
+    "   - Find the average Fare by Pclass\n",
+    "   - Find survival rate by Sex and Pclass\n",
+    "   - etc.\n",
+    "\n",
+    "4. Write the cleaned Spark DataFrame to a Parquet file.\n",
+    "\n",
+    "5. Bonus tasks:\n",
+    "   - Create a temporary Spark SQL table/view, query it with SQL syntax.\n",
+    "   - Provide quick EDA (e.g., distribution of Ages).\n",
+    "\n"
+   ]
+  }
+ ],
+ "metadata": {
+  "language_info": {
+   "name": "python"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
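
The notebook added in this commit only lists the tasks and contains no code cells. As a rough illustration of what tasks 1 to 3 might look like on the Pandas side, here is a minimal sketch. It is not part of the commit and not a reference solution. The inline `SAMPLE_CSV` rows are made up so the example runs without `../data/titanic.csv`, the `Survived` column is assumed to be a 0/1 flag, and the helper names (`clean`, `average_fare_by_class`, `survival_rate`) are hypothetical.

```python
import io

import pandas as pd

# Hypothetical sample rows standing in for ../data/titanic.csv.
# Pclass, Sex, Age and Fare are the columns the task list mentions;
# Survived is ASSUMED to be a 0/1 flag.
SAMPLE_CSV = """PassengerId,Survived,Pclass,Sex,Age,Fare
1,0,3,male,22,7.25
2,1,1,female,38,71.0
3,1,3,female,26,8.0
4,1,1,female,35,53.0
5,0,3,male,,8.05
6,0,1,male,54,
7,0,3,male,30,8.75
"""


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Task 2: drop rows with a null in Age or Fare."""
    return df.dropna(subset=["Age", "Fare"])


def average_fare_by_class(df: pd.DataFrame) -> pd.Series:
    """Task 3a: mean Fare per Pclass."""
    return df.groupby("Pclass")["Fare"].mean()


def survival_rate(df: pd.DataFrame) -> pd.Series:
    """Task 3b: mean of the 0/1 Survived flag per (Sex, Pclass) group."""
    return df.groupby(["Sex", "Pclass"])["Survived"].mean()


# Task 1 (Pandas half): read the CSV. With the real file this would be
# pd.read_csv("../data/titanic.csv") instead of the in-memory sample.
raw = pd.read_csv(io.StringIO(SAMPLE_CSV))

cleaned = clean(raw)
avg_fare = average_fare_by_class(cleaned)
rates = survival_rate(cleaned)

print(avg_fare)
print(rates)
```

On the Spark side, the same steps would likely map onto `spark.read.csv(path, header=True, inferSchema=True)`, `DataFrame.dropna(subset=[...])`, `groupBy(...).agg(...)`, `DataFrame.write.parquet(path)` for task 4, and `DataFrame.createOrReplaceTempView(name)` followed by `spark.sql(...)` for the bonus task. Those are left out of the sketch so that it runs without a Spark installation.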