Finding UCS categories with sentence embedding

This is a brief example of using LLM sentence embedding to decide the UCS category for a sound, based on a text description.

Background: UCS

UCS is a standard category/subject-name schedule for general sound effects libraries, along with a corresponding file-naming system, which together allow files categorized with UCS to be organized and collated.

There are 752 UCS subcategories arranged in two levels of hierarchy: general Categories such as AMBIENCE, FOLEY or WOOD, and then SubCategories within these. Subcategories of FOLEY, for example, include CLOTH, PROP and HANDS. An individual subcategory is identified by a concatenation of the two names: the HANDS subcategory of the FOLEY category becomes "FOLYHand", the subcategory's CatID.

The definition document for UCS contains Explanations for each subcategory to aid a categorizer in selecting the appropriate subcategory for a particular sound, along with a long list of Synonyms. For example, the "FOLYHand" subcategory has the following explanation:

Used for 'performed' official and very clean Foley done to picture. For your own wild recordings use FOOTSTEPS and OBJECT categories.

And its synonyms are:

"clapping", "flicking", "grab", "grasping", "handle", "pat", "patting", "rubbing", "scratching", "set", "shaking", "slapping", "snapping", "touching"

The Problem: Choosing the appropriate UCS category for a sound

Explanations and Synonyms are meant to be human-readable. If a file's name or metadata description contains a synonym, this may mean the file falls within that category, but often it does not. For example, the synonym "fire" appears in the subcategories ALARM-BELL, ALARM-BUZZER, ALARM-SIREN, AMBIENCE-EMERGENCY and several others. The plain-English meaning of the sound's description must be weighed against each subcategory's name and Explanations.
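
To make the ambiguity concrete, a naive synonym lookup might scan every record's Synonyms list (a hypothetical sketch; `ucs` here is the list of category records we download and load below):

fire_matches = [rec['CatID'] for rec in ucs if 'fire' in rec.get('Synonyms', [])]
print(fire_matches)  # many CatIDs across alarms, ambiences and other unrelated subcategories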

This is an obvious application for sentence embeddings: the explanations for an individual subcategory can be reduced to a tensor, which can then be compared with a corresponding tensor derived from the sound's text description.
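
Under the hood this comparison is just vector similarity; the model.similarity call we use later defaults to cosine similarity for sentence-transformers models, which we could also compute by hand. A minimal sketch:

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # 1.0 means the two embeddings point in the same direction;
    # values near 0 mean the texts are unrelated for this model.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))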

The Implementation Plan

We'll create a function that accepts the text description of a sound file and returns a list of CatIDs, sorted by how appropriate each one is, given the description.

In order to do this we need to do the following:

  1. Get the UCS category list
  2. Get a sentence embedding LLM
  3. Calculate an embedding for each UCS category that vectorizes its linguistic meaning for the model
  4. Calculate the embedding for the file to categorize
  5. Calculate the similarity score for each category embedding with regard to the given file description
  6. Sort the CatIDs by this score and return them

Get UCS category list

I've made JSON lists of the UCS categories, explanations and synonyms available on GitHub, based on Tim Nielsen's Excel database, in all of the available languages. We only need English for this example, so we can download it directly:

In [1]:
! curl -o "ucs-en.json" "https://raw.githubusercontent.com/iluvcapra/ucs-community/refs/heads/master/json/en.json"
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  426k  100  426k    0     0   929k      0 --:--:-- --:--:-- --:--:--  931k
In [2]:
import json

with open("ucs-en.json") as f:
    ucs = json.load(f)

print("CatID:", ucs[0]['CatID'])
print("Explanations:", ucs[0]['Explanations'])
CatID: AIRBrst
Explanations: Sharp air releases, pressure releases, a tennis call can popping open, a fire extinguisher

Now that we've downloaded the list, we can load the categories and their definitions. The definitions are stored in the JSON as a list of dictionaries.

Get a sentence embedding LLM

We use the sentence_transformers module and select a model.

In [3]:
from sentence_transformers import SentenceTransformer

MODEL_NAME = "paraphrase-multilingual-mpnet-base-v2"

model = SentenceTransformer(MODEL_NAME)

# sentence_transformers will emit a deprecation warning in PyTorch; we can suppress it:

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

Calculate an embedding for each UCS category that vectorizes its linguistic meaning for the model

We generate an embedding for a category by using the Explanations, Category, SubCategory, and Synonyms.

In [4]:
def build_category_embedding(cat_info: dict):
    'Create an embedding for a single category'
    components = [cat_info["Explanations"], cat_info["Category"], cat_info["SubCategory"]] + cat_info.get('Synonyms', [])
    composite_text = ". ".join(components)
    return model.encode(composite_text, convert_to_numpy=True)

def create_embeddings(ucs: list[dict]) -> list[dict]:
    'Create embeddings for the entire UCS list'
    embeddings_list = []
    for info in ucs:
        embeddings_list.append({'CatID': info['CatID'],
                                'Embedding': build_category_embedding(info)})

    return embeddings_list
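
As a quick sanity check (hypothetical, not part of the pipeline), we can embed a single record and inspect the result; paraphrase-multilingual-mpnet-base-v2 produces 768-dimensional vectors:

vec = build_category_embedding(ucs[0])
print(ucs[0]['CatID'], vec.shape)  # AIRBrst (768,)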

Calculating embeddings for the entire UCS is time-consuming (many seconds, or perhaps a minute, on a modern laptop), so we should save them to disk once they're created and reuse them on later runs.

We can cache the embeddings in a file named after the model (EMBEDDING_CACHE_NAME below) so multiple runs don't have to recalculate the entire embeddings table. If this file doesn't exist, we create it by building the embeddings and pickling the result; if it does, we read it back.

We'll save the embeddings in a list, embeddings_list, so we can refer to them later.

In [5]:
import pickle
import os

EMBEDDING_CACHE_NAME = MODEL_NAME + ".cache"

if not os.path.exists(EMBEDDING_CACHE_NAME):
    print("Cached embeddings unavailable, recalculating...")

    print(f"Loaded {len(ucs)} categories...")
    
    embeddings_list = create_embeddings(ucs)

    with open(EMBEDDING_CACHE_NAME, "wb") as g:
        print("Writing embeddings to file...")
        pickle.dump(embeddings_list, g)

else:
    print("Loading cached category emebddings...")
    with open(EMBEDDING_CACHE_NAME, "rb") as g:
        embeddings_list = pickle.load(g)

print(f"Loaded {len(embeddings_list)} category embeddings...")
Loading cached category embeddings...
Loaded 752 category embeddings...

Calculate the embedding for the file to categorize, and sort

The remaining steps can be expressed economically with NumPy operations applied directly to the embedding tensors:

In [6]:
import numpy as np

def classify_text_ranked(text):
    # First we obtain the embedding of the text description
    text_embedding = model.encode(text, convert_to_numpy=True)

    # We collect the embeddings into an np.array in order to do the similarity calculation
    # in one shot.
    embeddings = np.array([info['Embedding'] for info in embeddings_list])
    sim = model.similarity(text_embedding, embeddings)

    # `similarity` returns a rank-2 tensor, but since we passed a single query it has only one row
    sim = sim[0]

    # argsort gives us the indices into `sim` in ascending order of their value. Grabbing the last
    # five gives us the five highest values, in ascending order.
    maxinds = np.argsort(sim)[-5:]

    # We look up the CatIDs using maxinds, and reverse the list so they're now in descending order,
    # giving the best match first and each worse match following in order.
    catids = [embeddings_list[x]['CatID'] for x in reversed(maxinds)]

    # And then print.
    print(" ⇒ Top 5: " + ", ".join(catids))

We can now feed some possible file descriptions into our function and obtain a result:

In [7]:
texts = [
    "Black powder explosion with loud report",
    "Steam enging chuff",
    "Playing card flick onto table",
    "BMW 228 out fast",
    "City night skyline atmosphere",
    "Civil war 12-pound gun cannon",
    "Domestic combination boiler - pump switches off & cooling",
    "Cello bow on cactus, animal screech",
    "Electricity Generator And Arc Machine Start Up",
    "Horse, canter One Horse: Canter Up, Stop"
]

for text in texts:
    print(f"Text: {text}")
    classify_text_ranked(text)
    print("")
Text: Black powder explosion with loud report
 ⇒ Top 5: EXPLMisc, AIRBrst, METLCrsh, EXPLReal, FIREBrst

Text: Steam enging chuff
 ⇒ Top 5: TRNSteam, FIRESizz, FIREGas, WATRFizz, GEOFuma

Text: Playing card flick onto table
 ⇒ Top 5: GAMEMisc, GAMEBoard, GAMECas, PAPRFltr, GAMEArcd

Text: BMW 228 out fast
 ⇒ Top 5: MOTRMisc, AIRHiss, VEHTire, VEHMoto, VEHAntq

Text: City night skyline atmosphere
 ⇒ Top 5: AMBUrbn, AMBTraf, AMBCele, AMBAir, AMBTran

Text: Civil war 12-pound gun cannon
 ⇒ Top 5: GUNCano, GUNArtl, GUNRif, BLLTMisc, WEAPMisc

Text: Domestic combination boiler - pump switches off & cooling
 ⇒ Top 5: MACHHvac, MACHFan, MACHPump, MOTRTurb, MECHRelay

Text: Cello bow on cactus, animal screech
 ⇒ Top 5: MUSCStr, CERMTonl, MUSCShake, MUSCPluck, MUSCWind

Text: Electricity Generator And Arc Machine Start Up
 ⇒ Top 5: ELECArc, MOTRElec, ELECSprk, BOATElec, TOOLPowr

Text: Horse, canter One Horse: Canter Up, Stop
 ⇒ Top 5: VOXScrm, WEAPWhip, FEETHors, MOVEAnml, VEHWagn
