Added a jupyter notebook for UCS classification.

Jamie Hardt
2025-08-04 21:21:54 -07:00
parent daa714a9b4
commit be76ca3201


@@ -0,0 +1,393 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "a8130738-4d14-4d27-9720-0110e202cdcb",
"metadata": {},
"source": [
"# Finding UCS categories with sentence embedding\n",
"\n",
"This is a brief example of using LLM [sentence embedding](https://en.wikipedia.org/wiki/Sentence_embedding)\n",
"to decide the UCS category for a sound, based on a text description."
]
},
{
"cell_type": "markdown",
"id": "02310d78-21bc-405e-b94d-28698fc82c12",
"metadata": {},
"source": [
"### Background: UCS\n",
"\n",
"[UCS][ucs] is a standard category/subject-name schedule for libraries of general sound effects, and a corresponding file naming system to allow files categornized with UCS can be organized and collated.\n",
"\n",
"There are 762 UCS categories with two levels of heirarchy: general _Categories_ such as AMBIENCE, FOLEY or WOOD, and then _SubCategories_ within these. Subcategories of FOLEY, for example, include CLOTH, PROP and HANDS. Individual categories are identified with a concatenation of these, the HANDS subcategory of the FOLEY category becomes \"FOLYHand\", the subcategory's _CatID_.\n",
"\n",
"The definition document for UCS contains _Explanations_ for each subcategory to aid a categorizer in selecting the appropriate subcategory for a particular sound, along with a long list of _Synonyms_. For example, the \"FOLYHand\" subcategory has the following explanation:\n",
"\n",
"> Used for 'performed' official and very clean Foley done to picture. For your own wild recordings use FOOTSTEPS and OBJECT categories.\n",
"\n",
"And synonymns are:\n",
"\n",
"> \"clapping\", \"flicking\", \"grab\", \"grasping\", \"handle\", \"pat\", \"patting\", \"rubbing\", \"scratching\", \"set\", \"shaking\", \"slapping\", \"snapping\", \"touching\"\n",
"\n",
"### The Problem: Choosing the appropriate UCS category for a sound\n",
"\n",
"Explanations and Synonyms are meant to be human-readable. If a file's name or metadata description contains a synonymn this __may__ mean the file falls within that category but often does not. For example, the synonym \"fire\" appears in the subcategory ALARM-BELL, ALARM-BUZZER, ALARM-SIREN, AMBIENCE-EMERGENCY and several others. The plain-English meaning of the sound, as given by its name and Explanations, must be considered.\n",
"\n",
"This is an obvious application for a sentence embedding, which can reduce the explanations for an individual subcategory to a tensor which can then be compared with a corresponding tensor derived from the sound's text description. \n",
"\n",
"[ucs]: https://universalcategorysystem.com/"
]
},
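{
"cell_type": "markdown",
"id": "3f6c1a2e-9b7d-4c3e-8a15-2d4f6e8a0b1c",
"metadata": {},
"source": [
"Before laying out the plan, here is a toy sketch of that comparison (the numbers are made up; they are not real embeddings). Each piece of text becomes a vector, and categories are ranked by cosine similarity between the description's vector and each category's vector. The real vectors will come from the sentence-embedding model loaded below, whose `similarity` method performs essentially this calculation by default."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8d2e4f6a-1b3c-4d5e-9f70-a1b2c3d4e5f6",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# Hypothetical 3-dimensional vectors standing in for real sentence embeddings.\n",
"description = np.array([0.9, 0.1, 0.0])\n",
"category_vectors = {\n",
"    \"FOLYHand\": np.array([0.8, 0.2, 0.1]),\n",
"    \"FIREBrst\": np.array([0.1, 0.9, 0.3]),\n",
"}\n",
"\n",
"def cosine(a, b):\n",
"    'Cosine similarity: the dot product of the two vectors, normalized.'\n",
"    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))\n",
"\n",
"# Rank the categories by similarity to the description, best match first.\n",
"ranked = sorted(category_vectors, key=lambda c: cosine(description, category_vectors[c]), reverse=True)\n",
"for cat_id in ranked:\n",
"    print(cat_id, round(cosine(description, category_vectors[cat_id]), 3))"
]
},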
{
"cell_type": "markdown",
"id": "e11d04dd-bde1-4d3c-80ed-9a96ad231977",
"metadata": {},
"source": [
"### The Implementation Plan\n",
"\n",
"We'll create a function that accepts the text description of a sound file and returns a list of CatIDs sorted in order of how much each one is most appropriate, given the description.\n",
"\n",
"In order to do this we need to do the following:\n",
"\n",
"1. Get the UCS category list\n",
"2. Get a sentence embedding LLM\n",
"3. Calculate an embedding for each UCS category that vectorizes its lingustic meaning for the model\n",
"4. Calculate the embedding for the file to categorize\n",
"5. Calculate the similarity score for each category embedding with regard to the given file description\n",
"6. Sort the CatIDs by this score and return them\n"
]
},
{
"cell_type": "markdown",
"id": "7527cf48-2a3f-48bd-abb0-c12e01c61513",
"metadata": {},
"source": [
"### Get UCS category list\n",
"\n",
"I've made JSON lists of the UCS categories, explanations and synonyms [on GitHub](https://github.com/iluvcapra/ucs-community), based on Tim Nielsen's Excel database in all of the available languages. We just need English for this example so we can download it directly:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "b3559105-949b-446a-a1e1-ca0544c5c841",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" % Total % Received % Xferd Average Speed Time Time Time Current\n",
" Dload Upload Total Spent Left Speed\n",
"100 426k 100 426k 0 0 929k 0 --:--:-- --:--:-- --:--:-- 931k\n"
]
}
],
"source": [
"! curl -o \"ucs-en.json\" \"https://raw.githubusercontent.com/iluvcapra/ucs-community/refs/heads/master/json/en.json\""
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "c84d5d99-ff4b-40ca-851c-ab89fe2263fd",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CatID: AIRBrst\n",
"Explanations: Sharp air releases, pressure releases, a tennis call can popping open, a fire extinguisher\n"
]
}
],
"source": [
"import json\n",
"\n",
"with open(\"ucs-en.json\") as f:\n",
" ucs = json.load(f)\n",
"\n",
"print(\"CatID:\", ucs[0]['CatID'])\n",
"print(\"Explanations:\", ucs[0]['Explanations'])"
]
},
{
"cell_type": "markdown",
"id": "b6bd7561-12d8-4273-b868-dba2bbc96a23",
"metadata": {},
"source": [
"Now that we've downloaded the list, we can load the categories and their definitions. The definitions are stored in the JSON as a list of dictionaries."
]
},
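{
"cell_type": "markdown",
"id": "4b8a2c6e-0d1f-4e2a-9b3c-5d7e9f1a3b5c",
"metadata": {},
"source": [
"As a quick illustration of that structure, the cell below (not executed here) prints the fields we'll rely on for each entry: `CatID`, `Category`, `SubCategory`, `Explanations` and `Synonyms`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6e0a2b4c-8d1e-4f3a-b5c7-d9e1f3a5b7c9",
"metadata": {},
"outputs": [],
"source": [
"# Peek at the fields of the first entry; these are the keys used when\n",
"# building the category embeddings later on.\n",
"for key in (\"CatID\", \"Category\", \"SubCategory\", \"Explanations\", \"Synonyms\"):\n",
"    print(key, \"->\", ucs[0].get(key))"
]
},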
{
"cell_type": "markdown",
"id": "a681c209-c9a1-4906-b01e-58eca229a2b7",
"metadata": {},
"source": [
"## Get a sentence embedding LLM\n",
"\n",
"We use the `sentence_transformers` module and select a model."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "0f6a2ba0-a06b-4166-a4a3-35f0a714ed1d",
"metadata": {},
"outputs": [],
"source": [
"from sentence_transformers import SentenceTransformer\n",
"\n",
"MODEL_NAME = \"paraphrase-multilingual-mpnet-base-v2\"\n",
"\n",
"model = SentenceTransformer(MODEL_NAME)\n",
"\n",
"# sentence_transformers will emit a deprecation warning in PyTorch, we can suppress it:\n",
"\n",
"import warnings\n",
"warnings.simplefilter(action='ignore', category=FutureWarning)"
]
},
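{
"cell_type": "markdown",
"id": "9c3e5a7b-2d4f-4681-a0b2-c4d6e8f0a2b4",
"metadata": {},
"source": [
"As a quick sanity check (a sketch, not executed here), we can encode a short phrase and look at what comes back: a fixed-length NumPy vector of floats, whose exact dimensionality depends on the model. These are the vectors we'll be comparing below."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1a3b5c7d-9e0f-4123-8456-789abcdef012",
"metadata": {},
"outputs": [],
"source": [
"# Encode one example phrase and inspect the result: a 1-D float vector whose\n",
"# length is fixed by the model. The phrase here is arbitrary.\n",
"example_embedding = model.encode(\"creaky wooden door opening\", convert_to_numpy=True)\n",
"print(example_embedding.shape, example_embedding.dtype)"
]
},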
{
"cell_type": "markdown",
"id": "1badfc0b-d62c-4392-a0e0-64e19067966b",
"metadata": {},
"source": [
"## Calculate an embedding for each UCS category that vectorizes its lingustic meaning for the model\n",
"\n",
"We generate an embedding for a category by using the _Explanations_, _Category_, _SubCategory_, and _Synonyms_."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "42eb70a4-2abf-4868-9aa0-7ab74c7ee936",
"metadata": {},
"outputs": [],
"source": [
"def build_category_embedding(cat_info: list[dict] ):\n",
" 'Create an embedding for a single category'\n",
" components = [cat_info[\"Explanations\"], cat_info[\"Category\"], cat_info[\"SubCategory\"]] + cat_info.get('Synonyms', [])\n",
" composite_text = \". \".join(components)\n",
" return model.encode(composite_text, convert_to_numpy=True)\n",
"\n",
"def create_embeddings() -> list:\n",
" 'Create embeddings for the entire UCS list'\n",
" embeddings_list = []\n",
" for info in ucs:\n",
" embeddings_list += [{'CatID': info['CatID'], \n",
" 'Embedding': build_category_embedding(info)\n",
" }]\n",
"\n",
" return embeddings_list"
]
},
{
"cell_type": "markdown",
"id": "da2820ad-851b-4cb7-a496-b124275eef58",
"metadata": {},
"source": [
"Calculating embeddings for the entire UCS will be very time consuming (many seconds or perhaps a minute on a modern laptop), we should save them to disk once they're created so we can reuse them after calculated them the first time.\n",
"\n",
"We can cache the categories in a file named `EMBEDDING_NAME.cache` so multiple runs don't have to recalculate the entire emebddings table. If this file doesn't exist we create it by creating the embeddings and pickling the result, and if it does we read it.\n",
"\n",
"We'll save the embeddings in a list, `embeddings_list`, so we can refer to them later."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "d4c1bc74-5c5d-4714-b671-c75c45b82490",
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Loading cached category emebddings...\n",
"Loaded 752 category embeddings...\n"
]
}
],
"source": [
"import pickle\n",
"import os\n",
"\n",
"EMBEDDING_CACHE_NAME = MODEL_NAME + \".cache\"\n",
"\n",
"if not os.path.exists(EMBEDDING_CACHE_NAME):\n",
" print(\"Cached embeddings unavailable, recalculating...\")\n",
"\n",
" print(f\"Loaded {len(ucs)} categories...\")\n",
" \n",
" embeddings_list = create_embeddings(ucs)\n",
"\n",
" with open(EMBEDDING_CACHE_NAME, \"wb\") as g:\n",
" print(\"Writing embeddings to file...\")\n",
" pickle.dump(embeddings_list, g)\n",
"\n",
"else:\n",
" print(\"Loading cached category emebddings...\")\n",
" with open(EMBEDDING_CACHE_NAME, \"rb\") as g:\n",
" embeddings_list = pickle.load(g)\n",
"\n",
"print(f\"Loaded {len(embeddings_list)} category embeddings...\")"
]
},
{
"cell_type": "markdown",
"id": "29738074-3930-46d2-be45-39eaf0abc6e5",
"metadata": {},
"source": [
"### Calculate the embedding for the file to categorize, and sort\n",
"\n",
"The remaining steps can be expressed economically using NumPy operations natively on the tensors:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "c98d1af1-b4c5-478c-b051-0f8f33399dfd",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"def classify_text_ranked(text):\n",
" # First we obtain the embedding of the text description\n",
" text_embedding = model.encode(text, convert_to_numpy=True)\n",
"\n",
" # We collect the embeddings into an np.array in order to do the similarity calculation\n",
" # in one shot.\n",
" embeddings = np.array([info['Embedding'] for info in embeddings_list])\n",
" sim = model.similarity(text_embedding, embeddings)\n",
"\n",
" # `similarity` returns a tensor of rank 2 but it only has one member\n",
" sim = sim[0]\n",
"\n",
" # argsort gives us the indicies into `sim` in ascending order of their value. Grabbing the last\n",
" # five gives us the five highest values, in ascending order.\n",
" maxinds = np.argsort(sim)[-5:]\n",
"\n",
" # We look up the CatIDs using maxinds, and reverse the list so they're now in descending order,\n",
" # giving the best match first and each worse match following in order.\n",
" catids = [embeddings_list[x]['CatID'] for x in reversed(maxinds)]\n",
"\n",
" # And then print.\n",
" print(\" ⇒ Top 5: \" + \", \".join(catids))\n"
]
},
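{
"cell_type": "markdown",
"id": "5d7f9a1b-3c5e-4792-b1c3-d5e7f9a1b3c5",
"metadata": {},
"source": [
"The function above prints its results, which is convenient for the examples that follow. If you'd rather have a function that returns the ranked CatIDs for further processing, a small variant might look like the sketch below; `rank_catids` and its `top_n` parameter are illustrative names, not from any library."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7f1a3b5c-9d0e-4864-a2b4-c6d8e0f2a4b6",
"metadata": {},
"outputs": [],
"source": [
"def rank_catids(text, top_n=5):\n",
"    'Return the top_n CatIDs for a text description, best match first (sketch).'\n",
"    text_embedding = model.encode(text, convert_to_numpy=True)\n",
"    embeddings = np.array([info['Embedding'] for info in embeddings_list])\n",
"    sim = model.similarity(text_embedding, embeddings)[0]\n",
"    # Indices of the scores in descending order, truncated to the top_n best.\n",
"    order = np.argsort(sim)[::-1][:top_n]\n",
"    return [embeddings_list[int(i)]['CatID'] for i in order]"
]
},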
{
"cell_type": "markdown",
"id": "436d534b-ffc8-4931-a09c-c37a23115ffe",
"metadata": {},
"source": [
"We can now feed some possible file descriptions into our function and obtain a result:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "c894b72d-7590-444b-99ac-dff9c6246c1e",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Text: Black powder explosion with loud report\n",
" ⇒ Top 5: EXPLMisc, AIRBrst, METLCrsh, EXPLReal, FIREBrst\n",
"\n",
"Text: Steam enging chuff\n",
" ⇒ Top 5: TRNSteam, FIRESizz, FIREGas, WATRFizz, GEOFuma\n",
"\n",
"Text: Playing card flick onto table\n",
" ⇒ Top 5: GAMEMisc, GAMEBoard, GAMECas, PAPRFltr, GAMEArcd\n",
"\n",
"Text: BMW 228 out fast\n",
" ⇒ Top 5: MOTRMisc, AIRHiss, VEHTire, VEHMoto, VEHAntq\n",
"\n",
"Text: City night skyline atmosphere\n",
" ⇒ Top 5: AMBUrbn, AMBTraf, AMBCele, AMBAir, AMBTran\n",
"\n",
"Text: Civil war 12-pound gun cannon\n",
" ⇒ Top 5: GUNCano, GUNArtl, GUNRif, BLLTMisc, WEAPMisc\n",
"\n",
"Text: Domestic combination boiler - pump switches off & cooling\n",
" ⇒ Top 5: MACHHvac, MACHFan, MACHPump, MOTRTurb, MECHRelay\n",
"\n",
"Text: Cello bow on cactus, animal screech\n",
" ⇒ Top 5: MUSCStr, CERMTonl, MUSCShake, MUSCPluck, MUSCWind\n",
"\n",
"Text: Electricity Generator And Arc Machine Start Up\n",
" ⇒ Top 5: ELECArc, MOTRElec, ELECSprk, BOATElec, TOOLPowr\n",
"\n",
"Text: Horse, canter One Horse: Canter Up, Stop\n",
" ⇒ Top 5: VOXScrm, WEAPWhip, FEETHors, MOVEAnml, VEHWagn\n",
"\n"
]
}
],
"source": [
"texts = [\n",
" \"Black powder explosion with loud report\",\n",
" \"Steam enging chuff\",\n",
" \"Playing card flick onto table\",\n",
" \"BMW 228 out fast\",\n",
" \"City night skyline atmosphere\",\n",
" \"Civil war 12-pound gun cannon\",\n",
" \"Domestic combination boiler - pump switches off & cooling\",\n",
" \"Cello bow on cactus, animal screech\",\n",
" \"Electricity Generator And Arc Machine Start Up\",\n",
" \"Horse, canter One Horse: Canter Up, Stop\"\n",
"]\n",
"\n",
"for text in texts:\n",
" print(f\"Text: {text}\")\n",
" classify_text_ranked(text)\n",
" print(\"\")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.13.5"
}
},
"nbformat": 4,
"nbformat_minor": 5
}