The removal or reduction of something. Typically used in the context of an ablation study in ML, where some element of the data or model is removed and the resulting performance is compared to that of the original model (without ablation). The change in performance (usually a drop) is taken as an indication of the contribution of the element or feature which has been ablated (i.e. removed).

Bag-Of-Words (BOW)

A collection of counts of how many times each word appears in a document (raw term frequencies). Information about the order or structure of the words in the sequence is lost. CBOW (Continuous BOW) is a related word2vec approach in which a model is trained to predict a centre word from its surrounding context; the dense word vectors learned by that model are then used as representations.
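As a rough illustration of the counting described above, the sketch below builds a bag-of-words with the standard library. The function name `bag_of_words` and the lowercase, whitespace-based tokenisation are assumptions made for this example only.

```python
from collections import Counter


def bag_of_words(document: str) -> Counter:
    """Count how often each token occurs in the document.

    Assumption: tokens are lowercased and split on whitespace, which is
    a deliberately naive tokeniser.
    """
    return Counter(document.lower().split())


bow = bag_of_words("the cat sat on the mat")
print(bow["the"], bow["cat"])

# Word order is discarded: both documents give the same representation.
print(bag_of_words("dog bites man") == bag_of_words("man bites dog"))
```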

See also: https://machinelearningmastery.com/gentle-introduction-bag-words-model/.

CCG Supertagging

Combinatory Categorial Grammar (CCG) Supertagging is a sequence tagging task in NLP in which each token is assigned a CCG lexical category (or supertag). The standard parsing model of Clark and Curran (2007) uses over $400$ lexical categories, compared to about $50$ part-of-speech (POS) tags for typical context-free grammar (CFG) parsers (Xu et al., 2017). As an example, take the following sequence of tokens as an input: [I, saw, squirrels, with, nuts], with output: [NP, (S\NP)/NP, NP, (NP\NP)/NP, NP].

See also: https://nlpprogress.com/english/ccg_supertagging.html, https://aclweb.org/anthology/P15-2041, https://www.mitpressjournals.org/doi/abs/10.1162/coli.2007.33.4.493, https://aclweb.org/anthology/N/N16/N16-1026.pdf.

Cloze Questions

Fill-in-the-blank questions with multiple choice. Fill-in-the-blank questions without multiple choice are commonly referred to as open cloze questions. E.g. “John went for a ____ in the park.”

Cosine Similarity

A measure of similarity between two vectors based on their orientation. Two vectors with the same orientation have a cosine similarity of $1$ while two orthogonal vectors have a cosine similarity of $0$, irrespective of their magnitudes. It is derived from the Euclidean dot product formula and is given as: $\text{similarity} = \cos(\theta) = \frac{\mathbf{A}\cdot\mathbf{B}}{||\mathbf {A}|| \ ||\mathbf {B} ||}$. If both vectors $\mathbf{A}$ and $\mathbf{B}$ are normalised, then their magnitudes are both $1$ and the product of their magnitudes ($||\mathbf {A}|| \ ||\mathbf {B} ||$) is also $1$. It therefore follows that, for normalised vectors, the cosine similarity is equal to the dot product between them.
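The formula above can be sketched in plain Python as follows. The helper names (`dot`, `norm`, `normalise`, `cosine_similarity`) are chosen for this example, and vectors are assumed to be non-zero lists of numbers of equal length.

```python
import math


def dot(a, b):
    """Euclidean dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    """Euclidean magnitude ||a||."""
    return math.sqrt(dot(a, a))


def cosine_similarity(a, b):
    """cos(theta) = (a . b) / (||a|| ||b||); assumes non-zero vectors."""
    return dot(a, b) / (norm(a) * norm(b))


def normalise(a):
    """Scale a vector to unit length."""
    n = norm(a)
    return [x / n for x in a]


a, b = [1.0, 2.0], [3.0, 1.0]
print(cosine_similarity(a, b))
# For normalised vectors the plain dot product gives the same value
# (up to floating-point rounding).
print(dot(normalise(a), normalise(b)))
```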

See also: https://en.wikipedia.org/wiki/Cosine_similarity.

End-to-end learning

End-to-end refers to asking the learning algorithm to go directly from the input to the desired output, i.e. the learning algorithm directly connects the input end of the system to the output end, rather than passing through a pipeline of separately built intermediate stages.

F Score

Commonly refers to the $F_1$ score. However, the F measure can also be weighted to favour recall over precision (e.g. the $F_2$ score) or precision over recall (e.g. $F_{0.5}$). The general formula is $F_\beta = (1+\beta^2) \cdot \frac{precision \cdot recall}{(\beta^2 \cdot precision) + recall}$, where $\beta > 1$ gives more weight to recall and $\beta < 1$ gives more weight to precision.
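A minimal sketch of the general formula, assuming precision and recall have already been computed. The function name `f_beta` and the convention of returning `0.0` when both inputs are zero are choices made for this example.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """F_beta = (1 + beta^2) * P * R / (beta^2 * P + R)."""
    if precision == 0 and recall == 0:
        # Assumed convention: avoid dividing by zero and report 0.
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# With beta = 2 the score rewards high recall more than high precision.
print(f_beta(0.5, 1.0, beta=2), f_beta(1.0, 0.5, beta=2))
# With beta = 1 the two cases are scored identically (harmonic mean).
print(f_beta(0.5, 1.0), f_beta(1.0, 0.5))
```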

See also: https://en.wikipedia.org/wiki/F1_score.

F1 Score

The harmonic mean of precision and recall. $F_1 = 2 \cdot \frac{precision \cdot recall}{precision + recall}$.

See also: https://en.wikipedia.org/wiki/F1_score.

Hamming Distance

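For two strings of equal length, the Hamming distance is the number of positions at which the corresponding characters differ, i.e. the minimum number of substitutions required to change one string into the other.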
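A short sketch of this definition; the function name `hamming_distance` and the choice to raise `ValueError` for strings of unequal length are assumptions made for the example.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


print(hamming_distance("karolin", "kathrin"))
```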

Hashing Trick (Feature Hashing)

Allows you to use variable-size feature vectors with standard learning algorithms. Essentially, instead of using a vocabulary to define a BOW one-hot vector representation (so each word added to the vocabulary changes the vector size), define a large vector (say $\text{vector_size} = 2^{28}$) and a hashing function taken modulo the vector size ($\% \text{vector_size}$). Then, pass each word into the hashing function and increment the value at the resulting index. In this way, the vector size is fixed for any number of words, including new words. It also means that there are fewer issues with words not seen in the training data (UNKs), provided that a suitable hashing function and vector size are chosen so that few distinct words collide at the same index.
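The procedure above could be sketched as below. This is only an illustration: the names `hashed_index` and `hashed_bow` are hypothetical, MD5 is used simply because it is a deterministic hash available in the standard library (Python's built-in `hash()` for strings varies between runs), and the vector is kept small and dense for readability, whereas a vector of size $2^{28}$ would normally be stored sparsely.

```python
import hashlib

VECTOR_SIZE = 2 ** 10  # small, assumed size for illustration only


def hashed_index(word: str, vector_size: int = VECTOR_SIZE) -> int:
    """Map a word to a fixed range of indices: hash(word) % vector_size."""
    digest = hashlib.md5(word.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % vector_size


def hashed_bow(tokens, vector_size: int = VECTOR_SIZE):
    """Build a fixed-size count vector without any vocabulary."""
    vector = [0] * vector_size
    for token in tokens:
        vector[hashed_index(token, vector_size)] += 1
    return vector


seen = hashed_bow(["the", "cat", "sat", "on", "the", "mat"])
# A previously unseen word still maps to a valid index, so the
# vector size does not change.
unseen = hashed_bow(["aardvark"])
print(len(seen), len(unseen), sum(seen))
```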

See also: https://medium.com/value-stream-design/introducing-one-of-the-best-hacks-in-machine-learning-the-hashing-trick-bf6a9c8af18f.

Hierarchical Clustering

Can be agglomerative (bottom-up, repeatedly merging the closest clusters) or divisive (top-down, repeatedly splitting clusters). The standard agglomerative algorithm has $O(n^3)$ time complexity and requires $O(n^2)$ memory.

See also: https://en.wikipedia.org/wiki/Hierarchical_clustering.

Hierarchical Softmax

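Approximating the softmax function by arranging the output classes (e.g. words) as the leaves of a binary tree. The probability of a leaf is the product of the branch probabilities along the path from the root to that leaf. Probabilities are normalised because the two branch probabilities at every internal node sum to $1$, so the probabilities of all leaves also sum to $1$.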

See also: http://ruder.io/word-embeddings-softmax/index.html#hierarchicalsoftmax.

Imputing Missing Values

Filling missing values in the data with approximate values. Common choices include the expected value (mean) for numerical features, the most common value for categorical features, or using an ML model to predict the most likely values given some other features in the data.
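The two simplest strategies mentioned above might look as follows. This is a sketch under the assumption that missing values are represented by `None`; the function names `impute_mean` and `impute_mode` are made up for the example.

```python
from collections import Counter
from statistics import mean


def impute_mean(values):
    """Replace None with the mean of the observed numerical values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def impute_mode(values):
    """Replace None with the most common observed categorical value."""
    observed = [v for v in values if v is not None]
    fill = Counter(observed).most_common(1)[0][0]
    return [fill if v is None else v for v in values]


print(impute_mean([1.0, None, 3.0]))
print(impute_mode(["red", None, "blue", "red"]))
```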


Latent

[adjective] Describes something which is hidden or not directly observable.

Levenshtein Distance

A string metric for measuring the difference between two strings defined as the minimum number of single-character edits (insertions, deletions or substitutions) required to change one string into the other. For example, the Levenshtein distance between “kitten” and “sitting” is 3, since the following three edits change one into the other, and there is no way to do it with fewer than three edits:

  1. kitten → sitten (substitution of “s” for “k”)
  2. sitten → sittin (substitution of “i” for “e”)
  3. sittin → sitting (insertion of “g” at the end).
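One common way to compute this distance is dynamic programming over prefixes of the two strings. The sketch below keeps only two rows of the table; the function name `levenshtein` is chosen for the example, and this is one of several possible implementations.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions or substitutions
    needed to turn string a into string b."""
    # previous[j] holds the distance between the first i-1 characters
    # of a and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into "" takes i deletions
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))
```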

See also: https://en.m.wikipedia.org/wiki/Levenshtein_distance.

Low-rank Approximation

In mathematics, low-rank approximation is a minimization problem, in which the cost function measures the fit between a given matrix (the data) and an approximating matrix (the optimization variable), subject to a constraint that the approximating matrix has reduced rank.


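Manifold

A manifold is a space “modeled” on Euclidean space, i.e. a topological space that locally resembles Euclidean space near each point. In 1D, examples are lines and circles (but not figures of 8, because they have crossing points). In 2D, examples are surfaces such as the sphere.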


NP-complete

An NP-complete decision problem is one belonging to both the NP and the NP-hard complexity classes. In this context, NP stands for “nondeterministic polynomial time”.

Order of Magnitude

In mathematics, if one amount is an order of magnitude larger than another, it is $10$ times larger than the other. If it is two orders of magnitude larger, it is $100$ ($10^2$) times larger.

See also: https://www.collinsdictionary.com/dictionary/english/order-of-magnitude.


Perplexity

Commonly used as an evaluation metric for language models. Perplexity is a measurement of how well a probability distribution or probability model (e.g. a language model) predicts a sample. A low perplexity indicates that the probability distribution is good at predicting the sample (therefore lower perplexity is better). Where $H(p)$ is the entropy of the distribution $p$ in bits, perplexity is $2^{H(p)} = 2^{-\sum_x{p(x)\log_2p(x)}}$.
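As a small worked example of the formula above, the sketch below computes the perplexity of a discrete distribution given as a list of probabilities. The function names are chosen for the example, and zero-probability outcomes are skipped on the usual convention that $0 \log 0 = 0$.

```python
import math


def entropy_bits(p):
    """H(p) = -sum_x p(x) log2 p(x), skipping zero-probability outcomes."""
    return -sum(px * math.log2(px) for px in p if px > 0)


def perplexity(p):
    """Perplexity of a discrete distribution: 2 ** H(p)."""
    return 2 ** entropy_bits(p)


# A uniform distribution over 4 outcomes is as uncertain as a fair
# 4-sided die; a peaked distribution has lower perplexity.
print(perplexity([0.25, 0.25, 0.25, 0.25]))
print(perplexity([0.9, 0.05, 0.03, 0.02]))
```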

See also: https://en.wikipedia.org/wiki/Perplexity.


Polysemy

[noun] The existence of several meanings in a single word. For example, the word “play” has different meanings in the following sentences: “I like to play football” and “I went to watch a play”.

See also: https://www.collinsdictionary.com/dictionary/english/polysemy.


Precision

Precision (also positive predictive value) is the ratio of True Positives ($TP$) to the total number of examples predicted as positive (true positives and false positives, $TP + FP$). $\text{precision} = \frac{TP}{TP + FP}$. A model with perfect precision is one for which every example predicted as positive is actually positive (i.e. no false positives).

See also: https://en.wikipedia.org/wiki/Precision_and_recall, https://towardsdatascience.com/beyond-accuracy-precision-and-recall-3da06bea9f6c.


Recall

Recall (also sensitivity or true positive rate) is the ratio of True Positives ($TP$) to the total number of actual positive examples in the data (true positives and false negatives, $TP + FN$). $\text{recall} = \frac{TP}{TP + FN}$. A model with perfect recall is one that predicts every actual positive example as positive (i.e. no false negatives).
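To make the counts concrete, the sketch below derives $TP$, $FP$ and $FN$ from binary labels and predictions (with $1$ as the positive class) and computes both precision and recall. The function name and the convention of returning `0.0` for an empty denominator are assumptions made for this example.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels, 1 = positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    # Assumed convention: report 0.0 when a denominator is zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# A cautious model: its single positive prediction is right (perfect
# precision), but it misses three of the four actual positives.
print(precision_recall([1, 1, 1, 1, 0], [1, 0, 0, 0, 0]))
```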

See also: https://en.wikipedia.org/wiki/Precision_and_recall, https://towardsdatascience.com/beyond-accuracy-precision-and-recall-3da06bea9f6c.


Salient

[adjective] Most important or notable. E.g. “He read the salient facts quickly”.

See also: https://www.collinsdictionary.com/dictionary/english/salient.


Semantic

[adjective] Semantic is used to describe things that deal with the meanings of words and sentences. E.g. “He did not want to enter into a semantic debate”.

See also: https://www.collinsdictionary.com/dictionary/english/semantic.

Semantic Role Labeling

“Who did what to whom”. A sequence labeling task which aims to model the predicate-argument structure of a sentence.

See also: https://paperswithcode.com/task/semantic-role-labeling.


Seq2Seq

“Sequence-to-sequence”. The task involves taking a sequence of tokens (usually characters, bytes, words, word pieces, etc.) as input and producing a sequence of predictions as output. The output sequence need not have the same length as the input. Commonly implemented with an encoder-decoder setup.

Sequence Labeling

A task which takes in a sequence of tokens as input and predicts a categorical label corresponding to each input token. Examples of sequence labeling tasks are part-of-speech (POS) tagging and semantic role labeling (SRL).


Sharding

A type of database partitioning that separates very large databases into smaller, faster, more easily managed parts called data shards. The word shard means a small part of a whole.

See also: https://www.citusdata.com/blog/2018/01/10/sharding-in-plain-english/.

Soft Cosine Similarity

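A generalisation of cosine similarity that takes the similarity between individual features (e.g. words) into account, rather than treating every pair of distinct features as unrelated. The dot products in the cosine similarity formula are re-weighted by a feature similarity matrix, whose entries can be derived from some other measure such as Levenshtein distance.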


Softmax

Normalized exponential function ($\text{softmax}(\mathbf{x})_i = \frac{e^{x_i}}{\sum_j e^{x_j}}$). Takes a vector as an input and outputs a vector of the same size in which the highest input values have been pushed towards $1$ and the lowest input values have been pushed towards $0$. The sum of all the elements in the softmaxed vector is $1$ (i.e. normalized).
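A minimal sketch of the function in plain Python. Subtracting the maximum input before exponentiating is an addition made here for numerical stability; it does not change the result, because the common factor cancels between numerator and denominator.

```python
import math


def softmax(x):
    """Normalised exponential of a list of numbers."""
    shift = max(x)  # stability trick (an addition for this example)
    exps = [math.exp(v - shift) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([1.0, 2.0, 5.0])
# The largest input receives most of the probability mass, and the
# outputs sum to 1 (up to floating-point rounding).
print(probs, sum(probs))
```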


Syntactic

[adjective] Syntactic means relating to syntax, i.e. the arrangement of and relationships among words, phrases, and clauses forming sentences; sentence structure.

See also: https://www.collinsdictionary.com/dictionary/english/syntactic.

Wasserstein Distance

Also referred to as “earth mover’s” distance. A distance between two probability distributions defined over a metric space $M$. If each distribution is viewed as a unit amount of dirt piled on $M$, the metric is the minimum cost of turning one pile into the other, where the cost is assumed to be the amount of dirt that needs to be moved multiplied by the distance it has to travel.