Example applications and performance

The following is a walkthrough of how to apply the occupation coding tools to some example data, with full code examples. We also compare tool predictions against manually assigned codes in the example data, to serve as basic accuracy tests.

Further background to the process and techniques used is set out in these pages. For details on the example data used, see here.

All the following examples assume you have a working Python install on your system.

occupationcoder-international

Initial set up

occupationcoder-international can be installed directly from its GitHub repository using pip, as set out here. Alternatively, installing the requirements.txt for the present repository will also install everything required.
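
For reference, a pip install directly from GitHub typically looks like the following. The repository URL is inferred from the data URL used later in this walkthrough; treat the repository's own installation instructions as authoritative:

```shell
# Assumed repository URL - check the occupationcoder-international
# repository's own installation instructions before relying on this.
pip install git+https://github.com/datasciencecampus/occupationcoder-international.git
```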

The package comes with a “dictionary” for the ISCO-08 coding scheme, so there is no need to set this up manually.

Coding the examples

While occupationcoder-international comes with a script that allows it to be run on an input file directly from the command line, we here demonstrate its use in Python code, as a module. We assume this is the most likely approach for embedding this tool into existing (Python) pipelines.

To load the required modules, set up the coder class, and load the example data, you can do the following:

import pandas as pd
from oc3i import Coder

# Initialise the coder with the desired coding scheme (here ISCO):
coder = Coder(scheme = "isco")

# Load the example data (please adjust the file path as needed on your machine)
example_data = pd.read_csv(
    "../data/isco_benchmark_data.csv",
    dtype={'MANUAL_ISCO1': str, 'MANUAL_ISCO2': str, 'MANUAL_ISCO3': str}
)
#example_data.head()
# Code the example using all text in the TITLE, TASKS and INDUSTRY columns.
example_coded = coder.code_data_frame(example_data, 
                                        title_column = "TITLE", 
                                        description_column = "TASKS", 
                                        sector_column = "INDUSTRY")
#example_coded.head()
Coding 85 records in dataframe...
Warning: Column 'INDUSTRY' contains 34 missing values. These will be interpreted as empty strings.
Warning: Column 'TASKS' contains 6 missing values. These will be interpreted as empty strings.

We can now assess the performance of occupationcoder-international by calculating “accuracy” as the number of cases where the top predicted code exactly matches the manually assigned code:

# Accuracy 1: number of cases where top prediction matches the preferred manually assigned code:
match1 = (example_coded["MANUAL_ISCO1"] == example_coded["prediction 1"]).sum()
#print(match1)

Alternatively, we can be less conservative and also count cases where any of the top 3 predictions matches the manually assigned code (i.e. assuming a user uses the tool to suggest potential alternative options):

# Accuracy 2: number of cases where the manually assigned code is included in
# that record's own top 3 predictions (checked row by row):
match123 = example_coded[["prediction 1", "prediction 2", "prediction 3"]].eq(
               example_coded["MANUAL_ISCO1"], axis=0
           ).any(axis=1).sum()
#print(match123)
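
As an aside, note that this membership check should be evaluated per record: a manual code should only count as matched if it appears in that same record's predictions, not in the pooled predictions of all records. A self-contained illustration of a row-wise check, using toy codes rather than the example data:

```python
import pandas as pd

# Toy data with made-up codes, purely for illustration.
toy = pd.DataFrame({
    "MANUAL_ISCO1": ["2111", "5223", "3322"],
    "prediction 1": ["2111", "9999", "1111"],
    "prediction 2": ["2112", "5223", "2222"],
    "prediction 3": ["2113", "5224", "3333"],
})
preds = ["prediction 1", "prediction 2", "prediction 3"]

# Compare each prediction column to the manual code, row by row, and
# count rows where at least one of the three predictions matches:
row_wise = toy[preds].eq(toy["MANUAL_ISCO1"], axis=0).any(axis=1).sum()
print(row_wise)  # 2 - the first two toy records match, the third does not
```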

To avoid needing to re-run the same calculation every time, we can wrap the above in a short function that also calculates the percentage of the total (note: the following code is deliberately verbose for demonstration purposes - we suggest optimising and expanding on it in production settings):

def matches(dat, preds, output = "both"):

    # Calculate number of matches to "best" prediction:
    match1 = (dat["MANUAL_ISCO1"] == dat[preds[0]]).sum()

    # Calculate number of matches to any of that record's top 3 predictions
    # (checked row by row, so a code only counts if it appears in the same
    # record's own predictions):
    match123 = dat[preds].eq(dat["MANUAL_ISCO1"], axis=0).any(axis=1).sum()

    # Calculate proportions:
    match1p = (match1/len(dat))*100
    match123p = (match123/len(dat))*100

    # Output match count, proportion or both - formatted as strings:
    if output == "both":
        return([f"{match1} ({match1p:.1f}%)", f"{match123} ({match123p:.1f}%)"])
    if output == "abs":
        return([f"{match1}", f"{match123}"])
    if output == "prop":
        return([f"{match1p:.1f}%", f"{match123p:.1f}%"])

We can now use this function to calculate performance on different subsets of the data.

For example, we first split the full example data set by TYPE (for a more detailed explanation of these data TYPES, see here). Specifically, we split by (1) cases where we expect an exact match; (2) cases that should be codeable but not an exact match; (3) cases we expect to be challenging due to ambiguity:

t1 = example_coded[example_coded["TYPE"] == "Exact match"]
t2 = example_coded[example_coded["TYPE"].isna()]
t3 = example_coded[example_coded["TYPE"].str.contains('ambigui', case=False, na=False)]

Now we can use our function above on the full example data set, as well as the subsets, and collate a summary table (note that the code below does not follow best practice; this is deliberate to ensure maximum accessibility):

preds1 = ["prediction 1", "prediction 2", "prediction 3"]
summary1 = [
    ["All examples", len(example_coded), matches(example_coded, preds1)[0], matches(example_coded, preds1)[1]],
    ["Expected exact match", len(t1), matches(t1, preds1)[0], matches(t1, preds1)[1]],
    ["Expected codeable", len(t2), matches(t2, preds1)[0], matches(t2, preds1)[1]],
    ["Expected ambiguity", len(t3), matches(t3, preds1)[0], matches(t3, preds1)[1]]
]
summary1 = pd.DataFrame(summary1, columns = ["Type", "Total", "Match to 1", "Match in 1-3"])

The output of the above is shown in the following section.

occupationcoder-international accuracy

Match counts and rates between manually assigned codes and those produced by occupationcoder-international are given in the table below.

Table 1: occupationcoder-international 'accuracy', estimated as match rates between manually assigned occupation codes and either the highest ranked match ('Match to 1'), or as a match to any of the best 3 matches ('Match in 1-3'), grouped by either all examples or examples split by type.
Type                  Total  Match to 1   Match in 1-3
All examples             85  42 (49.4%)   66 (77.6%)
Expected exact match     17  17 (100.0%)  17 (100.0%)
Expected codeable        40  17 (42.5%)   30 (75.0%)
Expected ambiguity       25   6 (24.0%)   14 (56.0%)

The above results show that:

  • When considering all of the example data, the top prediction from occupationcoder-international agrees with the manually assigned code in 49.4% of cases. When interpreting a “match” as the manual code being included in the top 3 predictions, this increases to 77.6% of cases. It is worth stressing that this includes all example cases we a priori expected to be challenging.
  • As expected, direct match rates are much lower for known ambiguous cases (24.0%), but this improves considerably when also considering the top 3 predictions (56.0%).
  • Less informative (as this is a feature of occupationcoder-international by design, but nevertheless a good sense check): all cases where we expected an exact match are matched correctly.

classifai

Initial set up

Classifai can be installed either by following the instructions provided with its repository (the recommended method), or, at the time of writing, by installing the requirements for the present repository (noting that this may change as Classifai is under active development).

Note that for the purposes of these examples and tests, we use a Hugging Face model as the vectoriser - Classifai can use other models, and performance may vary. We also stress that vectorising the entire coding scheme takes some time, so it is worth doing this once and re-using a saved copy when running queries. This minimises processing overhead.

Processing the data

We first need to retrieve the raw ISCO coding scheme data and convert it into a format that can be vectorised by Classifai.
The following sets the download target for the scheme (note this uses a copy included with occupationcoder-international; it can also be sourced from the ILO website) and local save locations.

ISCO_DATA_SOURCE = "https://raw.githubusercontent.com/datasciencecampus/occupationcoder-international/main/data/ISCO-08%20EN%20Structure%20and%20definitions.xlsx"
ISCO_DATA_FILE = "../data/ISCO-08-scheme.xlsx" # Please adjust this path as required if running this code on your machine
PROCESSED_ISCO_DATA = "../data/ISCO-08-processed.csv" # Please adjust this path as required if running this code on your machine

We can now download and save the ISCO file using a simple helper function included with this repository:

import os
from src.utils import get_isco_scheme_data

if not os.path.exists(ISCO_DATA_FILE):
    get_isco_scheme_data(source_url=ISCO_DATA_SOURCE, local_file_path=ISCO_DATA_FILE)

The ISCO-08 scheme consists of separate columns containing the ISCO code, the corresponding title, and various description fields. We here chose simply to concatenate the latter, resulting in a raw input file with two columns: the ISCO code, and all descriptions for that code. For the sake of simplicity, we provide a helper function that does this, used as follows:

from src.utils import process_excel_to_csv
process_excel_to_csv(ISCO_DATA_FILE, PROCESSED_ISCO_DATA)
Filtered and processed data saved to ../data/ISCO-08-processed.csv
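
For context, the concatenation this helper performs can be sketched in pandas along the following lines (the column names below are illustrative stand-ins, not the actual headers of the ISCO-08 file):

```python
import pandas as pd

# Illustrative stand-in for the ISCO-08 structure data; the real file
# has different column headers and more description fields.
scheme = pd.DataFrame({
    "code": ["2111", "2112"],
    "title": ["Physicists and Astronomers", "Meteorologists"],
    "definition": ["Conduct research on physical phenomena.",
                   "Prepare weather forecasts."],
})

# Concatenate all description fields into a single text column per code,
# treating any missing values as empty strings:
desc_cols = ["title", "definition"]
scheme["description"] = scheme[desc_cols].fillna("").agg(" ".join, axis=1)
processed = scheme[["code", "description"]]
# processed.to_csv(...) would then produce the two-column input file
```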

Create vector store

Now that we have the scheme data processed in a format that can be used easily by Classifai, we download our chosen model for this example (all-mpnet-base-v2).

from classifai.vectorisers import HuggingFaceVectoriser
from classifai.indexers import VectorStore
hf_vectoriser = HuggingFaceVectoriser(model_name="sentence-transformers/all-mpnet-base-v2")

We can now use this model and the Classifai functions to create a local vector database from the scheme data. As above, we only need to do this once on a given machine - so the code below first checks whether we already have a previously created store, and loads this if available. Otherwise, it creates a new one.

if not os.path.exists("../data/hf_vectoriser"):
    hf_vector_store = VectorStore(
        file_name="../data/ISCO-08-processed.csv",
        data_type="csv",
        vectoriser=hf_vectoriser,
        output_dir="../data/hf_vectoriser",
        overwrite=True
    )
else:
    hf_vector_store = VectorStore.from_filespace(folder_path="../data/hf_vectoriser",vectoriser=hf_vectoriser)
INFO - Processing file: ../data/ISCO-08-processed.csv...
INFO - Gathering metadata and saving vector store / metadata...
INFO - Vector Store created - files saved to ../data/hf_vectoriser

Coding the examples

Now that we have the local vector database from the scheme, we can use Classifai to code the example data. For the purposes of this example, we simply extract all available input text (i.e. job title, tasks description and industry description) from the example data, and combine this into a single string.

# TASKS and INDUSTRY contain missing values (see the warnings above), so
# fill these with empty strings before concatenating:
jobs = (example_coded["TITLE"].fillna("") + " "
        + example_coded["TASKS"].fillna("") + " "
        + example_coded["INDUSTRY"].fillna(""))
classifai_coded = hf_vector_store.search(jobs.tolist(), n_results=3)

Classifai outputs results in a long format, where “rank” 0-2 indicates the first, second and third best match respectively. For convenience, we here extract each of these in turn and add them as columns to our example coded data set.

example_coded["classifai_p0"] = classifai_coded[classifai_coded["rank"]==0]["doc_id"].values
example_coded["classifai_p1"] = classifai_coded[classifai_coded["rank"]==1]["doc_id"].values
example_coded["classifai_p2"] = classifai_coded[classifai_coded["rank"]==2]["doc_id"].values
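
Note that the assignment above relies on the search results being ordered consistently with the input rows. If Classifai's long-format output carries a query identifier column (the name query_id below is an assumption - check the actual output columns), pivoting on it is a more robust way to widen the results:

```python
import pandas as pd

# Mock long-format results mimicking the structure described above;
# "query_id" is a hypothetical identifier column.
long_df = pd.DataFrame({
    "query_id": [0, 0, 0, 1, 1, 1],
    "rank":     [0, 1, 2, 0, 1, 2],
    "doc_id":   ["2111", "2112", "2113", "5223", "5222", "5221"],
})

# Reshape to one row per query, one column per rank:
wide = long_df.pivot(index="query_id", columns="rank", values="doc_id")
wide.columns = [f"classifai_p{r}" for r in wide.columns]
# wide can then be joined back onto the example data by index
```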

We can now compare the manually assigned code to the top 3 matches provided by Classifai, following the same approach as for occupationcoder-international above. We first extract subsets of the data for expected exact matches, expected codeable, and expected ambiguous examples:

t1a = example_coded[example_coded["TYPE"] == "Exact match"]
t2a = example_coded[example_coded["TYPE"].isna()]
t3a = example_coded[example_coded["TYPE"].str.contains('ambigui', case=False, na=False)]

We then use the same matching function defined above, but this time with the Classifai predictions.

preds2 = ["classifai_p0", "classifai_p1", "classifai_p2"]
summary2 = [
    ["All examples", len(example_coded),matches(example_coded, preds2)[0], matches(example_coded, preds2)[1]],
    ["Expected exact match", len(t1a), matches(t1a, preds2)[0],matches(t1a, preds2)[1]],
    ["Expected codeable", len(t2a), matches(t2a, preds2)[0],matches(t2a,preds2)[1]],
    ["Expected ambiguity", len(t3a), matches(t3a, preds2)[0],matches(t3a, preds2)[1]]
]
summary2 = pd.DataFrame(summary2, columns = ["Type","Total", "Match to 1","Match in 1-3",])

Classifai accuracy

Match counts and rates between manually assigned codes and those produced by Classifai are given in the table below.

Table 2: Classifai 'accuracy', estimated as match rates between manually assigned occupation codes and either the highest ranked match ('Match to 1'), or as a match to any of the best 3 matches ('Match in 1-3'), grouped by either all examples or examples split by type.
Type                  Total  Match to 1   Match in 1-3
All examples             85  47 (55.3%)   75 (88.2%)
Expected exact match     17  13 (76.5%)   16 (94.1%)
Expected codeable        40  23 (57.5%)   37 (92.5%)
Expected ambiguity       25  10 (40.0%)   16 (64.0%)
The above results show that:

  • Across all of the example data, the top prediction from Classifai agrees with the manually assigned code in 55.3% of cases. When interpreting a “match” as the manual code being included in the top 3 predictions, this increases to 88.2% of cases. It is worth stressing that this includes all example cases we a priori expected to be challenging.
  • Match rates to the top prediction are lower for known ambiguous cases (40.0%), but this is much improved by also considering the top 3 predictions (64.0%).

Summary & comparison

Based on a synthetic benchmark data set (N = 85), and defining “accuracy” as a direct match between the preferred manually assigned code and the best predicted match:

  • Accuracy for occupationcoder-international is 49.4% across the full example data set (i.e. including known ambiguous cases).
  • Accuracy for Classifai is 55.3% across the same full example data set.
  • Both tools perform better when all top 3 predictions are considered, with occupationcoder-international at 77.6% and Classifai at 88.2%.
  • Comparing occupationcoder-international and Classifai:
    • Classifai performs better for a priori ambiguous cases (40.0% accuracy), compared to occupationcoder-international (24.0%).
    • Whereas occupationcoder-international correctly matches all expected exact matches (as expected, given its design), Classifai misses a relatively small number of these.