Reworking

2025-08-09 20:58:53 -07:00
parent 6c424099fd
commit 53a91f6103
2 changed files with 277 additions and 114 deletions
--- a/Classification.ipynb
+++ b/Classification.ipynb
@@ -2,43 +2,186 @@
 "cells": [
  {
   "cell_type": "markdown",
-   "id": "eb04426a-cfb8-4f6f-9348-4308438fb9a5",
+   "id": "a8130738-4d14-4d27-9720-0110e202cdcb",
   "metadata": {},
   "source": [
    "# Finding UCS categories with sentence embedding\n",
    "\n",
-    "In this brief example we use sentence embedding to decide the UCS category for a sound, based on a text description.\n",
+    "This is a brief example of using LLM [sentence embedding](https://en.wikipedia.org/wiki/Sentence_embedding)\n",
+    "to decide the UCS category for a sound, based on a text description."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "02310d78-21bc-405e-b94d-28698fc82c12",
+   "metadata": {},
+   "source": [
+    "### Background: UCS\n",
    "\n",
-    "## Step 1: Creating embeddings for UCS categories\n",
+    "[UCS][ucs] is a standard category/subject-name schedule for libraries of general sound effects, and a corresponding file naming system so that files categorized with UCS can be organized and collated.\n",
    "\n",
-    "We first select a SentenceTransformer model and establish a method for generating embeddings that correspond with each category by using the  _Explanations_, _Category_, _SubCategory_, and _Synonyms_ from the UCS spreadsheet.\n",
+    "There are 762 UCS categories with two levels of hierarchy: general _Categories_ such as AMBIENCE, FOLEY or WOOD, and then _SubCategories_ within these. Subcategories of FOLEY, for example, include CLOTH, PROP and HANDS. Individual categories are identified with a concatenation of these: the HANDS subcategory of the FOLEY category becomes \"FOLYHand\", the subcategory's _CatID_.\n",
    "\n",
-    "`model.encode` is a slow process so we can write this as an async function so the client can parallelize it if it wants to."
+    "The definition document for UCS contains _Explanations_ for each subcategory to aid a categorizer in selecting the appropriate subcategory for a particular sound, along with a long list of _Synonyms_. For example, the \"FOLYHand\" subcategory has the following explanation:\n",
+    "\n",
+    "> Used for 'performed' official and very clean Foley done to picture. For your own wild recordings use FOOTSTEPS and OBJECT categories.\n",
+    "\n",
+    "And its synonyms are:\n",
+    "\n",
+    "> \"clapping\", \"flicking\", \"grab\", \"grasping\", \"handle\", \"pat\", \"patting\", \"rubbing\", \"scratching\", \"set\", \"shaking\", \"slapping\", \"snapping\", \"touching\"\n",
+    "\n",
+    "### The Problem: Choosing the appropriate UCS category for a sound\n",
+    "\n",
+    "Explanations and Synonyms are meant to be human-readable. If a file's name or metadata description contains a synonym this __may__ mean the file falls within that category, but often it does not. For example, the synonym \"fire\" appears in the subcategories ALARM-BELL, ALARM-BUZZER, ALARM-SIREN, AMBIENCE-EMERGENCY and several others. The plain-English meaning of the sound, as given by its name and Explanations, must be considered.\n",
+    "\n",
+    "This is an obvious application for a sentence embedding, which can reduce the explanations for an individual subcategory to a tensor that can then be compared with a corresponding tensor derived from the sound's text description.\n",
+    "\n",
+    "[ucs]: https://universalcategorysystem.com/"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e11d04dd-bde1-4d3c-80ed-9a96ad231977",
+   "metadata": {},
+   "source": [
+    "### The Implementation Plan\n",
+    "\n",
+    "We'll create a function that accepts the text description of a sound file and returns a list of CatIDs sorted from most to least appropriate, given the description.\n",
+    "\n",
+    "In order to do this we need to do the following:\n",
+    "\n",
+    "1. Get the UCS category list\n",
+    "2. Get a sentence embedding LLM\n",
+    "3. Calculate an embedding for each UCS category that vectorizes its linguistic meaning for the model\n",
+    "4. Calculate the embedding for the file to categorize\n",
+    "5. Calculate the similarity score for each category embedding with regard to the given file description\n",
+    "6. Sort the CatIDs by this score and return them\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "7527cf48-2a3f-48bd-abb0-c12e01c61513",
+   "metadata": {},
+   "source": [
+    "### Get UCS category list\n",
+    "\n",
+    "I've made JSON lists of the UCS categories, explanations and synonyms [on GitHub](https://github.com/iluvcapra/ucs-community), based on Tim Nielsen's Excel database in all of the available languages. We just need English for this example so we can download it directly:"
   ]
  },
  {
   "cell_type": "code",
-   "execution_count": 23,
-   "id": "ef63fc07-c0d7-4616-9be1-1f0c2f275a69",
+   "execution_count": 1,
+   "id": "b3559105-949b-446a-a1e1-ca0544c5c841",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current\n",
+      "                                 Dload  Upload   Total   Spent    Left  Speed\n",
+      "100  426k  100  426k    0     0   929k      0 --:--:-- --:--:-- --:--:--  931k\n"
+     ]
+    }
+   ],
+   "source": [
+    "! curl -o \"ucs-en.json\" \"https://raw.githubusercontent.com/iluvcapra/ucs-community/refs/heads/master/json/en.json\""
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "c84d5d99-ff4b-40ca-851c-ab89fe2263fd",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "CatID: AIRBrst\n",
+      "Explanations: Sharp air releases, pressure releases, a tennis call can popping open, a fire extinguisher\n"
+     ]
+    }
+   ],
+   "source": [
+    "import json\n",
+    "\n",
+    "with open(\"ucs-en.json\") as f:\n",
+    "    ucs = json.load(f)\n",
+    "\n",
+    "print(\"CatID:\", ucs[0]['CatID'])\n",
+    "print(\"Explanations:\", ucs[0]['Explanations'])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b6bd7561-12d8-4273-b868-dba2bbc96a23",
+   "metadata": {},
+   "source": [
+    "Now that we've downloaded the list, we can load the categories and their definitions. The definitions are stored in the JSON as a list of dictionaries."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "a681c209-c9a1-4906-b01e-58eca229a2b7",
+   "metadata": {},
+   "source": [
+    "## Get a sentence embedding LLM\n",
+    "\n",
+    "We use the `sentence_transformers` module and select a model."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "id": "0f6a2ba0-a06b-4166-a4a3-35f0a714ed1d",
   "metadata": {},
   "outputs": [],
   "source": [
-    "import json\n",
-    "import os.path\n",
-    "\n",
    "from sentence_transformers import SentenceTransformer\n",
-    "import numpy as np\n",
-    "from numpy.linalg import norm\n",
    "\n",
    "MODEL_NAME = \"paraphrase-multilingual-mpnet-base-v2\"\n",
    "\n",
    "model = SentenceTransformer(MODEL_NAME)\n",
    "\n",
+    "# sentence_transformers will emit a deprecation warning in PyTorch; we can suppress it:\n",
+    "\n",
+    "import warnings\n",
+    "warnings.simplefilter(action='ignore', category=FutureWarning)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "1badfc0b-d62c-4392-a0e0-64e19067966b",
+   "metadata": {},
+   "source": [
+    "## Calculate an embedding for each UCS category that vectorizes its linguistic meaning for the model\n",
+    "\n",
+    "We generate an embedding for a category by using the _Explanations_, _Category_, _SubCategory_, and _Synonyms_."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "id": "42eb70a4-2abf-4868-9aa0-7ab74c7ee936",
+   "metadata": {},
+   "outputs": [],
+   "source": [
    "def build_category_embedding(cat_info: list[dict] ):\n",
-    "    # print(f\"Building embedding for {cat_info['CatID']}...\")\n",
+    "    'Create an embedding for a single category'\n",
    "    components = [cat_info[\"Explanations\"], cat_info[\"Category\"], cat_info[\"SubCategory\"]] + cat_info.get('Synonyms', [])\n",
    "    composite_text = \". \".join(components)\n",
-    "    return model.encode(composite_text, convert_to_numpy=True)"
+    "    return model.encode(composite_text, convert_to_numpy=True)\n",
+    "\n",
+    "def create_embeddings(ucs: list) -> list:\n",
+    "    'Create embeddings for the entire UCS list'\n",
+    "    embeddings_list = []\n",
+    "    for info in ucs:\n",
+    "        embeddings_list += [{'CatID': info['CatID'], \n",
+    "                            'Embedding': build_category_embedding(info)\n",
+    "                           }]\n",
+    "\n",
+    "    return embeddings_list"
   ]
  },
  {
@@ -46,49 +189,39 @@
   "id": "da2820ad-851b-4cb7-a496-b124275eef58",
   "metadata": {},
   "source": [
-    "We now generate an embeddings for each category using the `ucs-community` repository, which conveniently has JSON versions of all of the UCS category descriptions and languages.\n",
+    "Calculating embeddings for the entire UCS is time consuming (many seconds or perhaps a minute on a modern laptop), so we should save them to disk once they're created and reuse them on later runs.\n",
    "\n",
-    "We cache the categories in a file named `EMBEDDING_NAME.cache` so multiple runs don't have to recalculate the entire emebddings table. If this file doesn't exist we create it by creating the embeddings and pickling the result, and if it does we read it."
+    "We can cache the embeddings in a file named `MODEL_NAME` with a `.cache` suffix so multiple runs don't have to recalculate the entire embeddings table. If this file doesn't exist we create it by creating the embeddings and pickling the result, and if it does we read it.\n",
+    "\n",
+    "We'll save the embeddings in a list, `embeddings_list`, so we can refer to them later."
   ]
  },
  {
   "cell_type": "code",
-   "execution_count": 24,
+   "execution_count": 5,
   "id": "d4c1bc74-5c5d-4714-b671-c75c45b82490",
-   "metadata": {},
+   "metadata": {
+    "scrolled": true
+   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
-      "Cached embeddings unavailable, recalculating...\n",
-      "Loaded 752 categories...\n",
-      "Writing embeddings to file...\n",
+      "Loading cached category emebddings...\n",
      "Loaded 752 category embeddings...\n"
     ]
    }
   ],
   "source": [
    "import pickle\n",
-    "\n",
-    "def create_embeddings(ucs: list) -> list:\n",
-    "    embeddings_list = []\n",
-    "    for info in ucs:\n",
-    "        embeddings_list += [{'CatID': info['CatID'], \n",
-    "                            'Embedding': build_category_embedding(info)\n",
-    "                           }]\n",
-    "\n",
-    "    return embeddings_list\n",
+    "import os\n",
    "\n",
    "EMBEDDING_CACHE_NAME = MODEL_NAME + \".cache\"\n",
    "\n",
    "if not os.path.exists(EMBEDDING_CACHE_NAME):\n",
    "    print(\"Cached embeddings unavailable, recalculating...\")\n",
    "\n",
-    "    # for lang in ['en']:\n",
-    "    with open(\"ucs-community/json/en.json\") as f:\n",
-    "        ucs = json.load(f)\n",
-    "    \n",
    "    print(f\"Loaded {len(ucs)} categories...\")\n",
    "    \n",
    "    embeddings_list = create_embeddings(ucs)\n",
@@ -98,41 +231,76 @@
    "        pickle.dump(embeddings_list, g)\n",
    "\n",
    "else:\n",
-    "    print(f\"Loading cached category emebddings...\")\n",
+    "    print(\"Loading cached category emebddings...\")\n",
    "    with open(EMBEDDING_CACHE_NAME, \"rb\") as g:\n",
    "        embeddings_list = pickle.load(g)\n",
    "\n",
-    "\n",
    "print(f\"Loaded {len(embeddings_list)} category embeddings...\")"
   ]
  },
  {
-   "cell_type": "code",
-   "execution_count": 29,
-   "id": "c98d1af1-b4c5-478c-b051-0f8f33399dfd",
+   "cell_type": "markdown",
+   "id": "29738074-3930-46d2-be45-39eaf0abc6e5",
   "metadata": {},
-   "outputs": [],
   "source": [
-    "def classify_text(text):\n",
-    "    text_embedding = model.encode(text, convert_to_numpy=True)\n",
-    "    sim = model.similarity(text_embedding, [info['Embedding'] for info in embeddings_list])\n",
-    "    maxind = np.argmax(sim)\n",
-    "    print(f\" ⇒ Category: {embeddings_list[maxind]['CatID']}\")\n",
+    "### Calculate the embedding for the file to categorize, and sort\n",
    "\n",
-    "\n",
-    "def classify_text_ranked(text):\n",
-    "    text_embedding = model.encode(text, convert_to_numpy=True)\n",
-    "    embeddings = np.array([info['Embedding'] for info in embeddings_list])\n",
-    "    sim = model.similarity(text_embedding, embeddings)[0]\n",
-    "    maxinds = np.argsort(sim)[-5:]\n",
-    "    # print(maxinds)\n",
-    "    print(\" ⇒ Top 5: \" + \", \".join([embeddings_list[x]['CatID'] for x in reversed(maxinds)]))\n"
+    "The remaining steps can be expressed economically using NumPy operations natively on the tensors:"
   ]
  },
  {
   "cell_type": "code",
-   "execution_count": 30,
-   "id": "21768c01-d75f-49be-9f47-686332ba7921",
+   "execution_count": 6,
+   "id": "c98d1af1-b4c5-478c-b051-0f8f33399dfd",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import numpy as np\n",
+    "\n",
+    "def classify_text_ranked(text):\n",
+    "    # First we obtain the embedding of the text description\n",
+    "    text_embedding = model.encode(text, convert_to_numpy=True)\n",
+    "\n",
+    "    # We collect the embeddings into an np.array in order to do the similarity calculation\n",
+    "    # in one shot.\n",
+    "    embeddings = np.array([info['Embedding'] for info in embeddings_list])\n",
+    "    sim = model.similarity(text_embedding, embeddings)\n",
+    "\n",
+    "    # `similarity` returns a tensor of rank 2, but it only has one row\n",
+    "    sim = sim[0]\n",
+    "\n",
+    "    # argsort gives us the indices into `sim` in ascending order of their value. Grabbing the last\n",
+    "    # five gives us the five highest values, in ascending order.\n",
+    "    maxinds = np.argsort(sim)[-5:]\n",
+    "\n",
+    "    # We look up the CatIDs using maxinds, and reverse the list so they're now in descending order,\n",
+    "    # giving the best match first and each worse match following in order.\n",
+    "    catids = [embeddings_list[x]['CatID'] for x in reversed(maxinds)]\n",
+    "\n",
+    "    # And then print.\n",
+    "    print(\" ⇒ Top 5: \" + \", \".join(catids))\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "3a58e41f-7ddd-4b6f-8c29-be55e9d49ab5",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "markdown",
+   "id": "436d534b-ffc8-4931-a09c-c37a23115ffe",
+   "metadata": {},
+   "source": [
+    "We can now feed some possible file descriptions into our function and obtain a result:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "id": "c894b72d-7590-444b-99ac-dff9c6246c1e",
   "metadata": {},
   "outputs": [
    {
@@ -189,7 +357,7 @@
    "for text in texts:\n",
    "    print(f\"Text: {text}\")\n",
    "    classify_text_ranked(text)\n",
-    "    print(\"\")\n"
+    "    print(\"\")"
   ]
  },
  {
--- a/ucsinfer/main.py
+++ b/ucsinfer/main.py
@@ -29,7 +29,10 @@ def load_ucs_categories() -> list:
    return cats

 def encoode_category(cat_defn: dict, model: SentenceTransformer) -> np.ndarray:
-    sentence_components = [cat_defn['Explanations'], cat_defn['Category'], cat_defn['SubCategory']]
+    sentence_components = [cat_defn['Explanations'], 
+                           cat_defn['Category'], 
+                           cat_defn['SubCategory']
+                           ]
    sentence_components += cat_defn['Synonyms']
    sentence = ", ".join(sentence_components)
    return model.encode(sentence, convert_to_numpy=True)
@@ -60,8 +63,9 @@ def load_embeddings(ucs: list, model) -> list:


 def description(path: str) -> Optional[str]:
-    result = subprocess.run(['ffprobe', '-show_format', '-of', 'json', path], capture_output=True)
-    # print(result)
+    result = subprocess.run(['ffprobe', '-show_format', '-of', 
+                             'json', path], capture_output=True)
+
    try:
        result.check_returncode()
    except:
@@ -93,8 +97,9 @@ def recommend_category(path, embeddings, model) -> Tuple[str, list]:

    return desc, classify_text_ranked(desc, embeddings, model)

-def lookup_cat(catid: str, ucs: list) -> Optional[tuple[str,str]]:
-    return next( ((x['Category'], x['SubCategory']) for x in ucs if x['CatID'] == catid) , None)
+def lookup_cat(catid: str, ucs: list) -> tuple[str,str]:
+    return next(((x['Category'], x['SubCategory'])
+                 for x in ucs if x['CatID'] == catid))


 class Commands(cmd.Cmd):
@@ -103,39 +108,20 @@ class Commands(cmd.Cmd):
                 stdout: IO[str] | None = None) -> None:
        super().__init__(completekey, stdin, stdout)
        self.file_list = []
-        self.model = None
-        self.embeddings = None
-        self.catlist = None
-        self._rec_list = []
-        self._file_cursor = 0
+        self.model: Optional[SentenceTransformer] = None
+        self.embeddings: Optional[list] = None
+        self.catlist: Optional[list] = None
+        self.model_rec_list = []
+        self.history = []

-    @property
-    def file_cursor(self):
-        return self._file_cursor
-
-    @file_cursor.setter
-    def file_cursor(self, val):
-        self._file_cursor = val 
-        self.onecmd('file')
-
-    @property 
-    def rec_list(self):
-        return self._rec_list
-
-    @rec_list.setter
-    def rec_list(self, value):
-        self._rec_list = value
-        if isinstance(self.rec_list, list) and self.catlist:
-            for i, cat_id in enumerate(self.rec_list):
-                cat, subcat = lookup_cat(cat_id, self.catlist)
-                print(f"  [ {i+1} ]: {cat_id} ({cat} / {subcat})")

    def default(self, line):     
-        if len(self.rec_list) > 0:
        try:
            rec = int(line)
-                if rec < len(self.rec_list):
+            if 1 <= rec <= len(self.model_rec_list):
                print(f"Accept option {rec}")
+                ind = rec - 1
+                self.history = [self.model_rec_list[ind]] + self.history[0:4]
                self.onecmd("next")
            else:
                pass
@@ -143,8 +129,6 @@ class Commands(cmd.Cmd):
        except ValueError:
            super().default(line)

-        else:
-            super().default(line)

    def preloop(self) -> None:
        self.file_cursor = 0
@@ -152,45 +136,56 @@ class Commands(cmd.Cmd):
        return super().preloop()

    def postcmd(self, stop: bool, line: str) -> bool:
+        self.update_prompt()
        return super().postcmd(stop, line)

    def update_prompt(self):
        self.prompt = f"(ucsinfer:{self.file_cursor}/{len(self.file_list)}) "
 
-    def do_file(self, args):
+    def do_file(self, _):
        'Print info about the current file'
        if self.file_cursor < len(self.file_list):
            self.update_prompt()
            path = self.file_list[self.file_cursor]
            f = os.path.basename(path)
            print(f"  > {f}")
-            desc, recs = recommend_category(path, self.embeddings, self.model)
-            print(f"  >> {desc}")
-            self.rec_list = recs
        else:
            print( "  > No file")

-
-
-    def do_addcontext(self, args):
-        'Add the argument to all file descriptions before searching for '
-        'similar. Enter a blank value to reset.'
-        pass
+    def do_rec(self, _):
+        if self.file_cursor < len(self.file_list):
+            self.update_prompt()
+            path = self.file_list[self.file_cursor]
+            desc, recs = recommend_category(path, self.embeddings, self.model)
+            print(f"  >> {desc}")
+            self.model_rec_list = recs
+        else:
+            self.model_rec_list = []

    def do_lookup(self, args):
        'print a list of UCS categories similar to the argument'
-        self.rec_list = classify_text_ranked(args, self.embeddings, self.model)
+        self.model_rec_list = classify_text_ranked(args, self.embeddings,
+                                                   self.model)
+
+    def do_ls(self, _):
+        'Print list of all files in the buffer'
+        for file in self.file_list[self.file_cursor:] + \
+                self.file_list[0:self.file_cursor]:
+            f = os.path.basename(file)
+            print(f"  > {f}")

    def do_next(self, _):
        'go to next file'
        self.file_cursor += 1
+        self.file_cursor = self.file_cursor % max(len(self.file_list), 1)

    def do_prev(self, _):
        'go to previous file'
        self.file_cursor -= 1
+        self.file_cursor = self.file_cursor % max(len(self.file_list), 1)

-    def do_quit(self, _):
-        'exit'
+    def do_bye(self, _):
+        'exit the program'
        print("Exiting...")
        return True