Synthetic example data for occupation coding

Background and context

This repository includes a synthetic data file containing example text descriptions of occupations (including tasks, industry, etc.), coded to the ISCO-08 International Standard Classification of Occupations scheme.

Data included here

The example data combine job titles, descriptions, and industry information to assign one or more ISCO codes as accurately as possible, with comments added where relevant.

These synthetic cases were built from varied sources, such as job adverts and example job descriptions, including some inspired by real-world census or survey data we have seen in a range of English-speaking countries (including the UK and numerous African countries). Note that, for reasons of data confidentiality, no examples are taken directly from such datasets.

Each case was manually coded using the ISCO-08 scheme, focusing on listed example jobs, class titles, and task descriptions. This requires interpretation and contextual understanding (e.g., “senior dev at gaming company” implies “Software Developer”). Although up to three codes can be stored, most cases currently have a single manual code. Currently, all examples were compiled and coded by one (non-specialist) coder, so the dataset has limitations; contributions and review are encouraged.

Each record is also assigned a “Type” to indicate potential coding issues—such as ambiguity or spelling errors—capturing real-world challenges and enabling tests of coding tools under different conditions.

Why synthetic data?

To compare coding accuracy between tools, a shared benchmark is essential.

Ideally, this would be real-world data coded and thoroughly validated by clerical coders, but we are not aware of publicly accessible, rigorously cross-checked datasets coded to ISCO-08 that meet these requirements. Even if such data were available, they would not allow specific input-data issues (e.g. known ambiguous cases; see below) to be isolated and analysed.

Purpose

The dataset included here serves two purposes: (1) to provide a common benchmark of text inputs pre-coded to ISCO for comparing coding tools; and (2) to enable analysis of tool performance, including under specific challenges (e.g., ambiguous input).

Contributing examples

We welcome suggested additions to these example data. Any proposed additions or changes should be made by editing the example file and submitting the changes as a formal Pull Request to the repository.

As a minimum, when adding new examples, please include a unique sequential ID number, a job TITLE, and at least one expected ISCO code which should be assigned to it (MANUAL_ISCO1). For further detail, see “Data format & structure” below.

Because these data are used to test and compare coding approaches, it is vital such data (1) adhere to the core purpose/aims specified above; and (2) are formatted correctly. When proposing additions, please ensure you carefully consider the purpose and take note of the format requirements set out below.
Any additions will be reviewed before inclusion in the main branch of the repository; the authors reserve the right to reject additions and do not commit to providing reasons for doing so.

Data format & structure

The example data file is in CSV format to maximise accessibility and interchangeability while minimising potential cross-platform formatting issues.
The file has strict formatting requirements: these are enforced using validation steps, and any changes or additions to the data require validation to pass before further code is run. For details see the sections below (an illustrative sketch of the checks follows the list); in general:

  • A fixed number of expected columns (see below): validation fails if extra columns are found or any are missing.
  • Column names must match exactly those given below.
  • Some columns cannot have missing values.
  • Any ISCO codes given must be valid (i.e. included in the coding scheme).
  • TYPE values must be one of a fixed set of categories, or left blank if no significant coding issues are expected.
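
The snippet below is a minimal sketch of the kind of checks such validation might perform, written with pandas for illustration. It is not the repository's actual validation code; the column names and allowed TYPE values are taken from the tables below, and the source of valid ISCO codes (`valid_isco_codes`) is assumed to be supplied separately.

```python
import pandas as pd

EXPECTED_COLUMNS = [
    "ID", "TITLE", "TASKS", "INDUSTRY",
    "MANUAL_ISCO1", "MANUAL_ISCO2", "MANUAL_ISCO3", "TYPE", "COMMENT",
]
NO_MISSING_ALLOWED = ["ID", "TITLE", "MANUAL_ISCO1"]
ALLOWED_TYPES = {
    "Exact match", "Exact match fail", "Semantic ambiguity",
    "Scheme ambiguity", "Deeper ambiguity", "Input issue",
}

def validate_examples(path: str, valid_isco_codes: set[str]) -> list[str]:
    """Return a list of problems found; an empty list means the file passes."""
    df = pd.read_csv(path, dtype=str)
    problems: list[str] = []

    # Fixed number of columns with exact names
    if list(df.columns) != EXPECTED_COLUMNS:
        problems.append(f"Unexpected column set: {list(df.columns)}")

    # Certain columns may not contain missing values
    for col in NO_MISSING_ALLOWED:
        if col in df.columns and df[col].isna().any():
            problems.append(f"Missing values in required column: {col}")

    # Any ISCO code given must exist in the coding scheme
    for col in ["MANUAL_ISCO1", "MANUAL_ISCO2", "MANUAL_ISCO3"]:
        if col in df.columns:
            codes = df[col].dropna()
            bad = codes[~codes.isin(valid_isco_codes)]
            if not bad.empty:
                problems.append(f"Invalid ISCO codes in {col}: {sorted(bad.unique())}")

    # TYPE must be blank or one of the fixed categories
    if "TYPE" in df.columns:
        types = df["TYPE"].dropna()
        bad_types = types[~types.isin(ALLOWED_TYPES)]
        if not bad_types.empty:
            problems.append(f"Unknown TYPE values: {sorted(bad_types.unique())}")

    return problems
```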
Table 1: Example data columns and data types

| Column | Description | Missing allowed? | Value |
| --- | --- | --- | --- |
| ID | Unique sequential ID number for example | No | Integer |
| TITLE | Text input: short job title | No | Free text |
| TASKS | Text input: longer description of job and tasks involved | Yes | Free text |
| INDUSTRY | Text input: description of industry for the job | Yes | Free text |
| MANUAL_ISCO1 | First candidate ISCO-08 code (most preferred) | No | Valid ISCO code (1-4 digits) |
| MANUAL_ISCO2 | Second candidate ISCO-08 code | Yes | Valid ISCO code (1-4 digits) |
| MANUAL_ISCO3 | Third candidate ISCO-08 code | Yes | Valid ISCO code (1-4 digits) |
| TYPE | Type of example (see detail below) | Yes | One of a fixed number of types (see below), or missing (empty) |
| COMMENT | Comments | Yes | Free text |
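
For illustration, a hypothetical record satisfying this structure might look like the row below. The ID, TASKS, INDUSTRY, and COMMENT values are invented for this example and the row is not taken from the actual data file; the code 1112 corresponds to the "Ambassador" exact-match example discussed later in this document.

```
ID,TITLE,TASKS,INDUSTRY,MANUAL_ISCO1,MANUAL_ISCO2,MANUAL_ISCO3,TYPE,COMMENT
101,Ambassador,Represents the country abroad and negotiates with foreign governments,Diplomatic service,1112,,,Exact match,Title listed under 1112 in ISCO-08
```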

Example types

In the example data file, the TYPE value should be one of a fixed number of classes (see table below). It can also be left blank/empty, implying that no issues are expected with the “codeability” of the example.

TYPE is included in the example file to allow analyses of potential mismatches between the manually assigned codes and the predictions made by coding tools. It is important to stress that such mismatches can occur for a variety of reasons: they may reflect genuine mistakes on the part of the enumerators or clerical coders; they may reflect ambiguity in the text inputs and therefore imply a level of subjectivity in the choice of code; or they may be due to ambiguity in the coding scheme itself.

While the example dataset is intended to be coded as accurately as possible (i.e. excluding straightforward coding errors), the TYPE value flags cases where we think mismatches between manual and predicted codes are still likely to occur (due to, for example, ambiguity in the input; see below). In addition, the different TYPE classes provide some information on why we think that is the case.
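
For example, once a tool's predictions have been joined onto the example data, agreement with the manual codes can be tabulated by TYPE. The sketch below assumes a hypothetical PREDICTED_ISCO column (not part of the example file) holding one predicted code per row.

```python
import pandas as pd

def match_rate_by_type(df: pd.DataFrame) -> pd.Series:
    """Share of examples where the predicted code equals MANUAL_ISCO1, by TYPE.

    Assumes `df` is the example data with an added, hypothetical
    PREDICTED_ISCO column produced by the tool being evaluated.
    """
    df = df.copy()
    df["TYPE"] = df["TYPE"].fillna("No issue expected")  # blank TYPE = no issues expected
    df["match"] = df["MANUAL_ISCO1"].astype(str) == df["PREDICTED_ISCO"].astype(str)
    return df.groupby("TYPE")["match"].mean().sort_values()
```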

Table 2: Example TYPE values and descriptions

| Type | Description |
| --- | --- |
| Exact match | Expecting an 'exact match': the job title is explicitly included in the ISCO-08 scheme |
| Exact match fail | Expecting an exact match as above, but not coded as such (i.e. strictly speaking, a coding error) |
| Semantic ambiguity | The combined text inputs are semantically ambiguous (e.g. lack of detail means different codes could apply) |
| Scheme ambiguity | Ambiguity caused (predominantly) by structure/detail in the scheme, rather than the text inputs per se |
| Deeper ambiguity | A combination of different types of ambiguity in the inputs and the scheme |
| Input issue | Issues such as misspellings in the text inputs |

A blank TYPE value means no coding issues are expected.

“Exact” matches

The ISCO-08 scheme includes examples of job titles that are expected to be directly associated with specific classes/codes. For example, “Ambassador” is included as a match to “Senior Government Officials (1112)”; and “Environmental analyst” is included as a match to “Environmental Engineers (2143)”.
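
As a rough illustration only (not how any particular tool is implemented), an exact match amounts to a direct lookup of the TITLE text against titles listed in the scheme. The toy index below contains just the two examples mentioned above, not the real ISCO-08 index.

```python
# Toy "exact match" lookup; this index holds only the two examples above,
# not the actual ISCO-08 index of job titles.
ISCO_TITLE_INDEX = {
    "ambassador": "1112",             # Senior Government Officials
    "environmental analyst": "2143",  # Environmental Engineers
}

def exact_match(title: str) -> str | None:
    """Return the ISCO-08 code if the title is listed verbatim, else None."""
    return ISCO_TITLE_INDEX.get(title.strip().lower())

print(exact_match("Ambassador"))       # 1112
print(exact_match("Senior diplomat"))  # None: not listed in this toy index
```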

Exact match. Cases where an exact match is expected, based on the given TITLE and the specification in the coding scheme, should be classed as TYPE “Exact match” in the example data.

Exact match fail. Cases (reflecting real-world examples) where enumerators or clerical coders have not assigned the expected exact match correctly.

Ambiguous cases

In many practical cases, text input describing occupations can be ambiguous in terms of classification to the scheme.

This is most apparent where detail is lacking in the text inputs. Extreme examples are “farmer” or “teacher”: in both cases, the scheme requires significantly more detail to allow confident coding to any level of granularity. For “farmer”, coding to 4 digits requires differentiation between, for example, livestock and crop farmers, or subsistence and commercial farmers. For “teacher”, 4-digit coding requires detail on the level of education, e.g. primary or secondary.
In our experience, such data limitations and the resulting ambiguity are very common in real-world data, where, for many different reasons, enumerators often record only very limited written text (e.g. 2-3 words), even if survey guidelines specify that more information should be included.

Similar to the ‘exact match’ example types above, the TYPE classes in the examples attempt to distinguish between three (partly overlapping and somewhat subjective) subtypes, listed below. Note that their assignment is to some extent open to interpretation, but an attempt is made to distinguish them to support deeper analyses and comparisons.

Semantic ambiguity. These are cases where, given the text input, some level of semantic “understanding” is required to code to the correct class.
For example, “Kapana seller” is understood in some African countries as a street food seller, i.e. 5212 in ISCO-08. However, as the scheme does not include any specific reference to this, the input requires broader semantic and contextual understanding of the meaning of the phrase. Some coding tools (e.g. ClassifAI) may use language models that provide such context, whereas others (e.g. occupationcoder-international) rely on string similarity and would fail on this match because they lack that broader context.
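
As a rough illustration of the gap a string-similarity approach faces here, the snippet below uses Python's standard-library difflib (not the method any particular tool uses); the class-title string compared against is illustrative.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1], as computed by difflib."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Wording that overlaps the class title scores well...
print(similarity("Street food seller", "Street food salespersons"))
# ...but "Kapana seller" scores much lower, even though it means the same job:
print(similarity("Kapana seller", "Street food salespersons"))
```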

Scheme ambiguity. These are cases where the ambiguity is primarily due to limitations of the coding scheme, rather than the detail in the text input.
For example, “data scientist” as a title, even alongside a fairly detailed description of the job, is challenging to assign to a specific code in the scheme. This is because this class of job is simply not included in the scheme and potentially overlaps multiple classes, so the assignment has to be a compromise depending on the nuance in the text input. As with semantic ambiguity, tools based on language models (as opposed to fuzzy string matchers) may help in some of these cases, but their potential for disambiguation is still likely to be limited.

Deeper ambiguity. These are cases where there is ambiguity of the types described above, but it is unclear which applies specifically, or where the issue is a combination of factors. Such cases are likely to be challenging to code regardless of the approach used: in some cases, strict and correct application of the coding scheme may require limiting coding to three- or two-digit levels (less granularity).
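
Since ISCO-08 is hierarchical, with each additional digit adding detail, one way to handle such cases when comparing tools is to measure agreement at a coarser level by truncating codes to their leading digits. A minimal sketch follows; the specific 4-digit codes are used purely to illustrate the prefix comparison.

```python
def agrees_at(manual: str, predicted: str, digits: int) -> bool:
    """Compare two ISCO-08 codes on their first `digits` digits only."""
    return str(manual)[:digits] == str(predicted)[:digits]

print(agrees_at("2143", "2142", digits=3))  # True: same 3-digit group (214)
print(agrees_at("2143", "2142", digits=4))  # False: different 4-digit codes
```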

Input issues

These are cases where the likely coding issue is due to input errors such as spelling mistakes, for example “barbar” instead of “barber”. While such cases should be codeable with some contextual understanding of language, more basic tools such as fuzzy or plain string matchers will struggle.