Introduction

With this repository, we aim to provide a demonstration of:

  1. How different coding tools can be used to code descriptions of occupations to a standardised coding scheme like ISCO-08.

  2. The (relative) performance of such coding tools, benchmarked against a common example data set.

We compare two tools that can support the occupation coding process: occupationcoder-international and Classifai. Other tools are available. It is also worth stressing that, as used here, these tools should be seen as just a small part of a wider coding “pipeline”. For example, in production settings it will be important to embed the workflow set out here in a wider process that might include initial data processing as well as post-processing steps such as confirming final classifications.

Background and context

What is “occupation coding”?

National Statistics Organisations (NSOs) collect data on respondents’ occupations through surveys such as Labour Force Surveys and Censuses. Enumerators record brief text answers to questions like “What job have you done in the last X months?” and “What tasks did it involve?”

Using these short descriptions, enumerators or later clerical coders assign one or more classes from an occupational coding scheme (e.g., ISCO-08). These schemes link job descriptions to standardised numeric codes, enabling consistent reporting and international comparability. While ISCO-08 is the international standard, many countries use adapted national versions (e.g., the UK’s SOC or Namibia’s NASCO).

Coding schemes are hierarchical, with each numeric code representing a class that includes a title (e.g., “Psychologists”), a task description, and common job titles. Higher-level groups encompass more specific ones, with hierarchy reflected in the number of digits in the code. If a description is ambiguous, coders may assign multiple possible codes or fall back to a higher-level category.
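The digit-based hierarchy means a unit-group code contains its parent groups as prefixes. A minimal sketch of this (the helper function is illustrative, not part of either tool; the sample titles are from ISCO-08):

```python
# Illustrative sketch: ISCO-08 encodes the hierarchy in the digits of
# each code, so truncating a unit-group code yields its parent groups.
def parent_codes(unit_code: str) -> list[str]:
    """Return the code at every level of the hierarchy, broadest first."""
    return [unit_code[:n] for n in range(1, len(unit_code) + 1)]

# "2634" is the ISCO-08 unit group for Psychologists:
#   "2"    Professionals                              (major group)
#   "26"   Legal, social and cultural professionals   (sub-major group)
#   "263"  Social and religious professionals         (minor group)
#   "2634" Psychologists                              (unit group)
print(parent_codes("2634"))  # ['2', '26', '263', '2634']
```

Falling back to a higher-level category when a description is ambiguous simply means assigning one of these shorter prefix codes.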

Common challenges

Accurate classification of job descriptions depends on coders correctly applying the coding scheme and understanding its scope and limitations. This requires substantial training and ongoing skill maintenance, which is often under-resourced, leading to data quality issues.

Enumerators must probe for sufficient detail to enable precise coding, while clerical coders need a clear grasp of the data’s constraints, the scheme, and the dataset’s intended use. Coding frequently involves trade-offs, selecting multiple plausible codes, and maintaining consistent procedures.

Consistency is both essential and difficult to achieve. Even well-trained coders may classify ambiguous descriptions differently or struggle when the scheme lacks detail. Ideally, validation—such as double-blind coding with reconciliation of discrepancies—would be widespread, but this can be impractical for large surveys with millions of records due to the resources required.

Purpose of coding tools

Occupation coding tools aim to reduce the workload of coding, especially during validation. They should be seen as assistive tools, not fully autonomous coders: limited input data and coarse coding schemes mean that ambiguity faced by humans also affects any coding tool. Thus, keeping a human in the loop is crucial, particularly for business-critical outputs.

In the application demonstrated here, we focus on workflows where manually assigned codes already exist (i.e., after data collection), with the tools helping to review and potentially revise them. Although the tools could be integrated into data-collection software (e.g., CSPro), the focus is on their role in data processing, not collection. Used in this way, they can significantly increase efficiency by (1) filtering out cases that need no revision because the manual codes agree with the tool predictions; and (2) for the cases that do need review, providing ranked alternative choices.
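This review workflow can be sketched as follows. The record structure and field names here are hypothetical, chosen only to illustrate the triage logic, and are not part of either tool’s API:

```python
# Hypothetical post-collection triage: accept records whose manual code
# appears among the tool's top-k suggestions; queue the rest for human
# review with the ranked alternatives attached.
def triage(records, k=3):
    """Split records into (accepted, needs_review).

    Each record is a dict with a 'manual_code' string and a
    'predictions' list of (code, score) pairs, best first.
    """
    accepted, needs_review = [], []
    for rec in records:
        top_k = [code for code, _ in rec["predictions"][:k]]
        if rec["manual_code"] in top_k:
            accepted.append(rec)
        else:
            rec["alternatives"] = top_k  # ranked suggestions for the coder
            needs_review.append(rec)
    return accepted, needs_review

records = [
    {"manual_code": "2634", "predictions": [("2634", 0.91), ("2633", 0.40)]},
    {"manual_code": "2221", "predictions": [("8322", 0.77), ("9621", 0.30)]},
]
accepted, needs_review = triage(records)
print(len(accepted), len(needs_review))  # 1 1
```

Only the second record reaches a human coder, who sees the tool’s ranked alternatives alongside the original manual code.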

Coding tools used

Tools to support occupation coding have been available for decades; established examples include CASCOT and GCode. However, most such tools are closed-source and proprietary. As a result, it is not trivial to build them into existing pipelines, and it can be challenging to adapt them to bespoke schemes. In addition, the growing availability of context-driven language models provides potential for further development.

While we by no means argue against using other established tools, here we compare the application and performance of two open-source solutions: occupationcoder-international and Classifai. The two take distinctly different technical approaches, and because both are Python-based they can be embedded directly in existing pipelines. In part, the pages presented here serve as much as a demonstration of their application as a comparison of their performance.

occupationcoder-international

occupationcoder-international is a string “fuzzy matching” tool, extending an existing tool for the UK SOC scheme to the international ISCO one.

In brief, occupationcoder-international compares a given text input to each of the class descriptions in the ISCO scheme, and identifies which are most similar from a pure text perspective, suggesting the closest matches as the most appropriate class.

More specifically, it uses TF-IDF to turn the text descriptions of scheme classes into numeric vectors that identify words particularly associated with different classes. By doing the same with the text input, and ranking its similarity to the vectors for the scheme classes, it returns the “most suitable” matches.
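A minimal sketch of this TF-IDF matching idea, using scikit-learn rather than occupationcoder-international’s actual implementation, and with toy stand-ins for the scheme’s class descriptions:

```python
# Sketch of TF-IDF matching: vectorise the class descriptions, vectorise
# the input text the same way, and rank classes by cosine similarity.
# (Illustrative only; the class descriptions are abbreviated stand-ins.)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

classes = {
    "2634": "psychologists research and study mental processes and behaviour",
    "2221": "nursing professionals provide treatment and care for the sick",
    "8322": "car taxi and van drivers drive cars taxis and vans",
}

vectoriser = TfidfVectorizer()
class_matrix = vectoriser.fit_transform(classes.values())

def best_matches(text, top_n=2):
    """Rank scheme classes by cosine similarity to the input text."""
    sims = cosine_similarity(vectoriser.transform([text]), class_matrix)[0]
    return sorted(zip(classes, sims), key=lambda p: p[1], reverse=True)[:top_n]

print(best_matches("I drive a taxi"))  # "8322" ranks first on word overlap
```

Note how the ranking here is driven entirely by shared vocabulary between the input and the class descriptions, which is exactly the limitation discussed below.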

Importantly, it should be stressed that, at its core, this process works purely with word frequency and commonality. There is no (explicit) consideration or “understanding” of context, and the accuracy of matches relies purely on sufficient detail being available in both the text input and the class descriptions in the scheme, and on these being sufficiently distinct.

Obvious limitations aside, this approach is well established (TF-IDF is a widely used technique in Natural Language Processing), is quick to implement, and in most scenarios is very fast.

Application and performance

For a demonstration of the use of occupationcoder-international, and performance tests, see HERE.

Classifai

Classifai is a general vector search tool using (Large) Language Models to vectorise scheme and input text. It can be used to identify the best match(es) between input text and specific sections of a corpus, such as a coding scheme. The overall logic is very similar to occupationcoder-international, in that both the scheme and input text are turned into numerical vectors, compared and ranked by similarity.

The key difference is that Classifai uses pre-trained (Large) Language Models to do this vectorisation. Such models produce context-aware embeddings, which means that when used as vectorisers, they can provide some level of “understanding” of context.
For example, a good pre-trained Language Model shown the phrase “tuktuk” would likely associate it with “motorised rickshaw driver” or similar. By contrast, unless “tuktuk” appears explicitly in the coding scheme, a TF-IDF-based approach would fail, because there is simply no word-based link between this input phrase and any class in the scheme.
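The matching logic can be sketched as below. This is illustrative only and is not Classifai’s actual API; in practice `embed()` would call a pre-trained sentence-embedding model, whereas here a toy “bag-of-synonyms” embedding stands in so the example stays self-contained while mimicking how such a model places related words close together:

```python
# Sketch of embedding-based matching: embed the scheme classes and the
# input, then rank classes by cosine similarity. The embed() function is
# a toy stand-in for a pre-trained language-model encoder.
import numpy as np

# Synonyms share a dimension, mimicking an LM's context-aware vectors.
VOCAB = {"tuktuk": 0, "rickshaw": 0, "driver": 1, "nurse": 2, "care": 3}

def embed(text: str) -> np.ndarray:
    vec = np.zeros(4)
    for word in text.lower().split():
        if word in VOCAB:
            vec[VOCAB[word]] += 1.0
    return vec

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    denom = np.linalg.norm(a) * np.linalg.norm(b) or 1.0
    return float(a @ b / denom)

scheme = {"8322": "rickshaw driver", "2221": "nurse care"}
query = embed("tuktuk driver")
ranked = sorted(scheme, key=lambda c: cosine(query, embed(scheme[c])), reverse=True)
print(ranked[0])  # "8322": the embedding links "tuktuk" to "rickshaw"
```

A plain TF-IDF vectoriser sees “tuktuk” and “rickshaw” as unrelated tokens, so the match above would score zero; the shared embedding dimension is what rescues it.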

While the ability to consider context may represent a significant advantage, it should be noted that the use of pre-trained models as vectorisers can be restricted in certain circumstances, and can be more computationally intensive than TF-IDF (although, depending on the implementation, this can be managed effectively).

Application and performance

For a demonstration of the use of Classifai (specifically for classifying jobs to ISCO-08), and performance tests, see HERE.