Spaces:

deagar
/

spark_sandbox


deagar committed on Jan 21

Commit

cfbd02f

1 Parent(s): 309b3f6

Updated assessment notebook, added solutions

Files changed (2)

notebooks/assesment.ipynb +314 -15
notebooks/solutions.ipynb +308 -0

notebooks/assesment.ipynb CHANGED

@@ -4,27 +4,326 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# PySpark Data Engineering Assessment\n",
     "\n",
-    "## Tasks\n",
     "\n",
-    "1. Read the CSV data (in `../data/titanic.csv`) into:\n",
-    "   - a Pandas DataFrame\n",
-    "   - a Spark DataFrame\n",
     "\n",
-    "2. Perform some data cleaning (e.g., drop rows with nulls in `Age` or `Fare`).\n",
     "\n",
-    "3. Run basic aggregations:\n",
-    "   - Find the average Fare by Pclass\n",
-    "   - Find survival rate by Sex and Pclass\n",
-    "   - etc.\n",
     "\n",
-    "4. Write the cleaned Spark DataFrame to a Parquet file.\n",
     "\n",
-    "5. Bonus tasks:\n",
-    "   - Create a temporary Spark SQL table/view, query it with SQL syntax.\n",
-    "   - Provide quick EDA (e.g., distribution of Ages).\n",
-    "\n"
    ]
   }
  ],

    "cell_type": "markdown",
    "metadata": {},
    "source": [
+    "# PySpark Data Engineering Assessment (Extended)\n",
     "\n",
+    "Welcome! In this notebook, you'll practice:\n",
     "\n",
+    "1. Reading the **Titanic CSV** in **Pandas** and **PySpark**.\n",
+    "2. **Splitting** a single dataset into two DataFrames and **merging** them back together in both Pandas and Spark.\n",
+    "3. Data cleaning and aggregations in Pandas and Spark.\n",
+    "4. Writing and reading **Parquet** files.\n",
+    "5. Creating a **PySpark UDF** that leverages a **lightweight transformer model** to compute embeddings for passenger names.\n",
     "\n",
+    "---\n",
     "\n",
+    "## Dataset\n",
     "\n",
+    "- **`titanic.csv`**: This file is in the `../data/` directory, containing columns such as:\n",
+    "  - `PassengerId`, `Name`, `Sex`, `Age`, `Fare`, `Survived`, etc.\n",
     "\n",
+    "We will:\n",
+    "1. Read `titanic.csv` into Pandas and Spark.\n",
+    "2. Split the original DataFrame into two subsets (simulating two “tables”).\n",
+    "3. Demonstrate merges/joins in Pandas and Spark using these subsets.\n",
+    "4. Perform data cleaning and transformations.\n",
+    "5. Write to Parquet.\n",
+    "6. Implement a Spark UDF to generate embeddings for passenger names.\n",
+    "\n",
+    "---\n",
+    "\n",
+    "## Instructions\n",
+    "\n",
+    "Throughout the notebook, you'll see `TODO` sections. Please fill in the required code. Feel free to add extra cells or explanations as needed.\n",
+    "\n",
+    "When finished, please save or export this notebook and submit according to your instructions.\n",
+    "\n",
+    "Let's begin!\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# 1. Imports and Spark Setup\n",
+    "\n",
+    "import os\n",
+    "import pandas as pd\n",
+    "\n",
+    "# PySpark imports\n",
+    "from pyspark.sql import SparkSession\n",
+    "from pyspark.sql import functions as F\n",
+    "from pyspark.sql.types import *\n",
+    "\n",
+    "# Create/initialize Spark session\n",
+    "spark = SparkSession.builder \\\n",
+    "    .appName(\"TitanicAssessmentExtended\") \\\n",
+    "    .getOrCreate()\n",
+    "\n",
+    "print(\"Spark version:\", spark.version)\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# 2. Read the Titanic CSV (Pandas & Spark)\n",
+    "# ========================================\n",
+    "\n",
+    "# Path to the CSV file\n",
+    "titanic_csv_path = os.path.join(\"..\", \"data\", \"titanic.csv\")\n",
+    "\n",
+    "# 2.1 TODO: Read 'titanic.csv' into a Pandas DataFrame (pd_df)\n",
+    "# pd_df = ?\n",
+    "\n",
+    "# Inspect the shape and first few rows\n",
+    "# print(\"pd_df shape:\", pd_df.shape)\n",
+    "# display(pd_df.head())\n",
+    "\n",
+    "# 2.2 TODO: Read 'titanic.csv' into a Spark DataFrame (spark_df)\n",
+    "# spark_df = ?\n",
+    "\n",
+    "# Check schema and row count\n",
+    "# spark_df.printSchema()\n",
+    "# print(\"spark_df count:\", spark_df.count())\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# 3. Split Data into Two Subsets for Merging/Joining\n",
+    "# ==================================================\n",
+    "# Instead of using a second CSV, we'll simulate it by splitting the original dataset\n",
+    "# into two DataFrames:\n",
+    "#   df_part1: subset of columns -> PassengerId, Name, Sex, Age\n",
+    "#   df_part2: subset of columns -> PassengerId, Fare, Survived, Pclass\n",
+    "#\n",
+    "# We then merge these two separate DataFrames in both Pandas and Spark.\n",
+    "\n",
+    "# 3.1 Pandas Split\n",
+    "# ----------------\n",
+    "\n",
+    "# TODO: Create two new DataFrames from pd_df:\n",
+    "#    pd_part1 = pd_df[[\"PassengerId\", \"Name\", \"Sex\", \"Age\"]]\n",
+    "#    pd_part2 = pd_df[[\"PassengerId\", \"Fare\", \"Survived\", \"Pclass\"]]\n",
+    "\n",
+    "# pd_part1 = ?\n",
+    "# pd_part2 = ?\n",
+    "\n",
+    "# display(pd_part1.head())\n",
+    "# display(pd_part2.head())\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# 3.2 Spark Split\n",
+    "# ---------------\n",
+    "# TODO: Create two new DataFrames from spark_df:\n",
+    "#    spark_part1 = spark_df.select(\"PassengerId\", \"Name\", \"Sex\", \"Age\")\n",
+    "#    spark_part2 = spark_df.select(\"PassengerId\", \"Fare\", \"Survived\", \"Pclass\")\n",
+    "\n",
+    "# spark_part1 = ?\n",
+    "# spark_part2 = ?\n",
+    "\n",
+    "# spark_part1.show(5)\n",
+    "# spark_part2.show(5)\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# 4. Merging / Joining the Split DataFrames\n",
+    "# =========================================\n",
+    "\n",
+    "# 4.1 Merge in Pandas\n",
+    "# -------------------\n",
+    "# TODO: Merge pd_part1 and pd_part2 on \"PassengerId\"\n",
+    "# We'll call the merged DataFrame \"pd_merged\".\n",
+    "#\n",
+    "# pd_merged = pd_part1.merge(pd_part2, on=\"PassengerId\", how=\"inner\")\n",
+    "\n",
+    "# pd_merged = ?\n",
+    "# print(\"pd_merged shape:\", pd_merged.shape)\n",
+    "# display(pd_merged.head())\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# 4.2 Join in Spark\n",
+    "# -----------------\n",
+    "# TODO: Join spark_part1 with spark_part2 on \"PassengerId\"\n",
+    "# We'll call the joined DataFrame \"spark_merged\".\n",
+    "#\n",
+    "# spark_merged = spark_part1.join(spark_part2, on=\"PassengerId\", how=\"inner\")\n",
+    "\n",
+    "# spark_merged = ?\n",
+    "# print(\"spark_merged count:\", spark_merged.count())\n",
+    "# spark_merged.show(5)\n",
+    "# spark_merged.printSchema()\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# 5. Data Cleaning\n",
+    "# ================\n",
+    "# We'll focus on the merged DataFrames. For instance, drop rows that have missing\n",
+    "# values in certain columns like 'Age' or 'Fare'.\n",
+    "\n",
+    "# 5.1 TODO: Pandas DataFrame cleaning\n",
+    "# Create a cleaned version, 'pd_merged_clean',\n",
+    "# dropping nulls in [\"Age\", \"Fare\"].\n",
+    "\n",
+    "# pd_merged_clean = ?\n",
+    "\n",
+    "# print(\"Before dropna:\", pd_merged.shape)\n",
+    "# print(\"After dropna:\", pd_merged_clean.shape)\n",
+    "# pd_merged_clean.head()\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# 5.2 TODO: Spark DataFrame cleaning\n",
+    "# Create a cleaned version, 'spark_merged_clean',\n",
+    "# dropping nulls in [\"Age\", \"Fare\"].\n",
+    "\n",
+    "# spark_merged_clean = ?\n",
+    "\n",
+    "# print(\"spark_merged count BEFORE dropna:\", spark_merged.count())\n",
+    "# print(\"spark_merged_clean count AFTER dropna:\", spark_merged_clean.count())\n",
+    "# spark_merged_clean.show(5)\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# 6. Basic Aggregations\n",
+    "# =====================\n",
+    "# Let's do a couple of group-by queries to glean insights.\n",
+    "\n",
+    "# 6.1 TODO: Pandas - Average fare by Pclass\n",
+    "# e.g. group by 'Pclass' and compute mean fare in pd_merged_clean\n",
+    "\n",
+    "# pd_avg_fare = ?\n",
+    "# pd_avg_fare\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# 6.2 TODO: Spark - Survival rate by Sex and Pclass\n",
+    "# e.g. groupBy(\"Sex\", \"Pclass\").agg(F.avg(\"Survived\"))\n",
+    "#\n",
+    "# spark_survival_rate = ?\n",
+    "# spark_survival_rate.show()\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# 7. Writing to Parquet\n",
+    "# =====================\n",
+    "# We'll write the cleaned Spark DataFrame to a Parquet file (e.g. \"titanic_merged_clean.parquet\").\n",
+    "\n",
+    "# 7.1 TODO: Write spark_merged_clean to Parquet\n",
+    "# e.g., spark_merged_clean.write.mode(\"overwrite\").parquet(\"titanic_merged_clean.parquet\")\n",
+    "\n",
+    "# 7.2 TODO: Read it back into a new Spark DataFrame called 'spark_parquet_df'\n",
+    "# spark_parquet_df = ?\n",
+    "\n",
+    "# print(\"spark_parquet_df count:\", spark_parquet_df.count())\n",
+    "# spark_parquet_df.show(5)\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# 8. Bonus 1: Create a Temp View and Query\n",
+    "# ========================================\n",
+    "# 8.1 TODO: Create a temp view with 'spark_merged_clean' (e.g. \"titanic_merged\")\n",
+    "# spark_merged_clean.createOrReplaceTempView(\"titanic_merged\")\n",
+    "\n",
+    "# 8.2 TODO: Spark SQL query example\n",
+    "# result_df = spark.sql(\"SELECT ... FROM titanic_merged GROUP BY ...\")\n",
+    "# result_df.show()\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# 9. Bonus 2: Transformer Embeddings UDF\n",
+    "# ======================================\n",
+    "# We'll demonstrate a simple approach using a lightweight transformer model to embed passenger names.\n",
+    "# This is optional, but shows advanced usage of Spark UDFs.\n",
+    "\n",
+    "# Requirements: e.g. \"transformers\" or \"sentence-transformers\" in your environment.\n",
+    "# from transformers import pipeline\n",
+    "# embedding_pipeline = pipeline(\"feature-extraction\", model=\"distilbert-base-uncased\")\n",
+    "# OR\n",
+    "# from sentence_transformers import SentenceTransformer\n",
+    "# model = SentenceTransformer(\"all-MiniLM-L6-v2\")\n",
+    "\n",
+    "# 9.1 TODO: import / load the model/pipeline\n",
+    "# e.g.\n",
+    "# from transformers import pipeline\n",
+    "# embedding_pipeline = pipeline(\"feature-extraction\", model=\"distilbert-base-uncased\")\n",
+    "\n",
+    "# 9.2 Define a Python function that takes a passenger name (string) -> returns a list of floats\n",
+    "# from typing import List  # needed for the return annotation below\n",
+    "# def get_name_embedding(name: str) -> List[float]:\n",
+    "#     # TODO: use embedding_pipeline or model to produce an embedding\n",
+    "#     # embedding = ?\n",
+    "#     # NOTE: verify shape (embedding might be list of lists)\n",
+    "#     return ???\n",
+    "\n",
+    "# 9.3 Wrap that function in a PySpark UDF\n",
+    "# from pyspark.sql.functions import udf\n",
+    "# from pyspark.sql.types import ArrayType, FloatType\n",
+    "# udf_get_name_embedding = udf(get_name_embedding, ArrayType(FloatType()))\n",
+    "\n",
+    "# 9.4 Apply the UDF to create a new column 'NameEmbedding' in spark_merged_clean\n",
+    "# spark_embedded = spark_merged_clean.withColumn(\"NameEmbedding\", udf_get_name_embedding(F.col(\"Name\")))\n",
+    "\n",
+    "# spark_embedded.select(\"Name\", \"NameEmbedding\").show(truncate=False)\n"
    ]
   }
  ],
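
For readers without a Spark install, the pandas half of sections 3 to 6 (split, merge, clean, aggregate) can be tried on a few made-up rows. The sketch below is a stand-in under stated assumptions, not part of the commit: the sample values are invented, and `validate="one_to_one"` is an optional extra check that the notebook itself does not use.

```python
import pandas as pd

# Tiny stand-in for titanic.csv; the values are invented for illustration.
pd_df = pd.DataFrame(
    {
        "PassengerId": [1, 2, 3, 4],
        "Name": ["A", "B", "C", "D"],
        "Sex": ["male", "female", "female", "male"],
        "Age": [22.0, 38.0, None, 35.0],
        "Fare": [7.25, 71.28, 7.92, None],
        "Survived": [0, 1, 1, 0],
        "Pclass": [3, 1, 3, 1],
    }
)

# Section 3: split into two "tables" that share the PassengerId key.
pd_part1 = pd_df[["PassengerId", "Name", "Sex", "Age"]]
pd_part2 = pd_df[["PassengerId", "Fare", "Survived", "Pclass"]]

# Section 4: merge them back. validate= is an added assumption: it raises
# pandas.errors.MergeError if PassengerId repeats on either side.
pd_merged = pd_part1.merge(
    pd_part2, on="PassengerId", how="inner", validate="one_to_one"
)

# Section 5: drop rows with a missing Age or Fare.
pd_merged_clean = pd_merged.dropna(subset=["Age", "Fare"])

# Section 6: average fare per class, and survival rate per (Sex, Pclass).
# The mean of a 0/1 Survived column is the fraction of survivors.
pd_avg_fare = pd_merged_clean.groupby("Pclass")["Fare"].mean()
pd_survival_rate = pd_merged_clean.groupby(["Sex", "Pclass"])["Survived"].mean()

print(pd_merged.shape, pd_merged_clean.shape)
print(pd_avg_fare)
print(pd_survival_rate)
```

Because both parts come from the same DataFrame, the inner merge should return exactly as many rows as the input; a smaller count would point to a problem with the key.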

notebooks/solutions.ipynb ADDED

@@ -0,0 +1,308 @@

+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Solutions Guide"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "import pandas as pd\n",
+    "\n",
+    "# PySpark imports\n",
+    "from pyspark.sql import SparkSession\n",
+    "from pyspark.sql import functions as F\n",
+    "from pyspark.sql.types import *\n",
+    "\n",
+    "# Create or get Spark session\n",
+    "spark = SparkSession.builder \\\n",
+    "    .appName(\"TitanicAssessmentExtended\") \\\n",
+    "    .getOrCreate()\n",
+    "\n",
+    "print(\"Spark version:\", spark.version)\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Explanation:\n",
+    "\n",
+    "We import `os`, `pandas`, and the `pyspark.sql` modules, then create (or reuse) a Spark session named \"TitanicAssessmentExtended\".\n",
+    "Printing `spark.version` confirms which version of Spark is running."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Read in data\n",
+    "titanic_csv_path = os.path.join(\"..\", \"data\", \"titanic.csv\")\n",
+    "\n",
+    "# 2.1 Read into a Pandas DataFrame\n",
+    "pd_df = pd.read_csv(titanic_csv_path)\n",
+    "\n",
+    "print(\"pd_df shape:\", pd_df.shape)\n",
+    "display(pd_df.head())\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "- `pd.read_csv(...)` reads the Titanic data into a `pd.DataFrame`.\n",
+    "- `.shape` returns a `(rows, columns)` tuple.\n",
+    "- `.head()` shows the first five rows by default."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# 2.2 Read into a Spark DataFrame\n",
+    "spark_df = spark.read.csv(titanic_csv_path, header=True, inferSchema=True)\n",
+    "\n",
+    "spark_df.printSchema()\n",
+    "print(\"spark_df count:\", spark_df.count())\n",
+    "spark_df.show(5)\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "- `header=True` tells Spark that the first row holds the column names.\n",
+    "- `inferSchema=True` makes Spark take an extra pass over the data to detect column types; without it, every column is read as a string.\n",
+    "- `.printSchema()` prints the inferred schema; `.count()` and `.show()` report the row count and a sample of rows."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Split data into subsets\n",
+    "\n",
+    "pd_part1 = pd_df[[\"PassengerId\", \"Name\", \"Sex\", \"Age\"]]\n",
+    "pd_part2 = pd_df[[\"PassengerId\", \"Fare\", \"Survived\", \"Pclass\"]]\n",
+    "\n",
+    "display(pd_part1.head())\n",
+    "display(pd_part2.head())\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "spark_part1 = spark_df.select(\"PassengerId\", \"Name\", \"Sex\", \"Age\")\n",
+    "spark_part2 = spark_df.select(\"PassengerId\", \"Fare\", \"Survived\", \"Pclass\")\n",
+    "\n",
+    "spark_part1.show(5)\n",
+    "spark_part2.show(5)\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Merge the split DataFrames in Pandas\n",
+    "\n",
+    "pd_merged = pd_part1.merge(pd_part2, on=\"PassengerId\", how=\"inner\")\n",
+    "print(\"pd_merged shape:\", pd_merged.shape)\n",
+    "display(pd_merged.head())\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "- `on=\"PassengerId\"` merges the two tables on the `PassengerId` key.\n",
+    "- `how=\"inner\"` keeps only rows whose key appears in both subsets. Because both subsets come from the same DataFrame, every row should match."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Join in Spark\n",
+    "\n",
+    "spark_merged = spark_part1.join(spark_part2, on=\"PassengerId\", how=\"inner\")\n",
+    "print(\"spark_merged count:\", spark_merged.count())\n",
+    "spark_merged.show(5)\n",
+    "spark_merged.printSchema()\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
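+    "In Spark, the equivalent call is `.join(df2, on=\"PassengerId\", how=\"inner\")`. Passing the column name as a string to `on` keeps a single `PassengerId` column in the result.\n",
+    "`spark_merged.show(5)` and `.printSchema()` confirm the join result."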
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Pandas data cleaning\n",
+    "\n",
+    "pd_merged_clean = pd_merged.dropna(subset=[\"Age\", \"Fare\"])\n",
+    "print(\"Before dropna:\", pd_merged.shape)\n",
+    "print(\"After dropna:\", pd_merged_clean.shape)\n",
+    "pd_merged_clean.head()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Spark data cleaning\n",
+    "spark_merged_clean = spark_merged.dropna(subset=[\"Age\", \"Fare\"])\n",
+    "print(\"spark_merged count BEFORE dropna:\", spark_merged.count())\n",
+    "print(\"spark_merged_clean count AFTER dropna:\", spark_merged_clean.count())\n",
+    "spark_merged_clean.show(5)\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Basic aggregations: Pandas average fare by Pclass\n",
+    "\n",
+    "pd_avg_fare = pd_merged_clean.groupby(\"Pclass\")[\"Fare\"].mean()\n",
+    "pd_avg_fare"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Spark survival rate by Sex and Pclass\n",
+    "\n",
+    "spark_survival_rate = (\n",
+    "    spark_merged_clean\n",
+    "    .groupBy(\"Sex\", \"Pclass\")\n",
+    "    .agg(F.avg(\"Survived\").alias(\"survival_rate\"))\n",
+    ")\n",
+    "spark_survival_rate.show()\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Write the Spark DataFrame to Parquet\n",
+    "\n",
+    "spark_merged_clean.write.mode(\"overwrite\").parquet(\"titanic_merged_clean.parquet\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Read the Parquet file back in\n",
+    "\n",
+    "spark_parquet_df = spark.read.parquet(\"titanic_merged_clean.parquet\")\n",
+    "print(\"spark_parquet_df count:\", spark_parquet_df.count())\n",
+    "spark_parquet_df.show(5)\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Bonus: create a temp view and query it\n",
+    "\n",
+    "spark_merged_clean.createOrReplaceTempView(\"titanic_merged\")\n",
+    "\n",
+    "result_df = spark.sql(\n",
+    "    \"\"\"\n",
+    "    SELECT Pclass,\n",
+    "        COUNT(*) AS passenger_count,\n",
+    "        AVG(Age) AS avg_age\n",
+    "    FROM titanic_merged\n",
+    "    GROUP BY Pclass\n",
+    "    ORDER BY Pclass\n",
+    "    \"\"\")\n",
+    "result_df.show()\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Example imports (make sure 'transformers' is installed)\n",
+    "from transformers import pipeline\n",
+    "embedding_pipeline = pipeline(\"feature-extraction\", model=\"distilbert-base-uncased\")\n",
+    "\n",
+    "# Example function to get the name embedding\n",
+    "def get_name_embedding(name: str):\n",
+    "    # Spark passes None for null names; keep the column null in that case.\n",
+    "    if name is None:\n",
+    "        return None\n",
+    "    # For one string, the pipeline returns nested lists shaped (1, sequence_length, hidden_size).\n",
+    "    output = embedding_pipeline(name)\n",
+    "    # output[0] has shape [sequence_length, hidden_size]\n",
+    "    token_embeddings = output[0]\n",
+    "    # Mean-pool: average each hidden dimension across all tokens.\n",
+    "    mean_embedding = [float(sum(x) / len(x)) for x in zip(*token_embeddings)]\n",
+    "    return mean_embedding\n",
+    "\n",
+    "# Convert this Python function to a Spark UDF\n",
+    "from pyspark.sql.functions import udf\n",
+    "from pyspark.sql.types import ArrayType, FloatType\n",
+    "\n",
+    "udf_get_name_embedding = udf(get_name_embedding, ArrayType(FloatType()))\n",
+    "\n",
+    "# Apply it to add a new column\n",
+    "spark_embedded = spark_merged_clean.withColumn(\n",
+    "    \"NameEmbedding\",\n",
+    "    udf_get_name_embedding(F.col(\"Name\"))\n",
+    ")\n",
+    "\n",
+    "spark_embedded.select(\"Name\", \"NameEmbedding\").show(truncate=False)\n"
+   ]
+  }
+ ],
+ "metadata": {
+  "language_info": {
+   "name": "python"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
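
The averaging step in `get_name_embedding` can be checked without downloading a model. The sketch below is illustrative only: `mean_pool` and `fake_output` are hypothetical names, and the nested list merely imitates the `(1, sequence_length, hidden_size)` shape that the notebook comments describe for the feature-extraction pipeline.

```python
from typing import List, Sequence


def mean_pool(token_embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Average a [sequence_length][hidden_size] nested list over the token axis."""
    if not token_embeddings:
        # No tokens means nothing to average; return an empty vector.
        return []
    n_tokens = len(token_embeddings)
    # zip(*rows) transposes the list, so each `column` holds one hidden
    # dimension's values across all tokens.
    return [float(sum(column) / n_tokens) for column in zip(*token_embeddings)]


# Hypothetical stand-in for the pipeline output: 1 input, 2 tokens, 3 dimensions.
fake_output = [[[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]]

pooled = mean_pool(fake_output[0])
print(pooled)  # one value per hidden dimension
```

The result has one float per hidden dimension regardless of how many tokens the name produced, which is why it fits a fixed `ArrayType(FloatType())` column.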