A Brief History of Computational Linguistics
Kristian Berg
Jun 11, 2024

The history of Natural Language Processing (NLP) dates back to the 1950s.
Introduction
With the advent of ChatGPT in the winter of 2022, the world awakened to the AI Spring thawed by the groundbreaking paper "Attention Is All You Need," which had proposed the Transformer neural network architecture (the T in GPT) five years earlier.
Computational linguists have been developing methods to help computers understand natural language since the field's origin, but they drastically underestimated the complexity of the task at the outset. Indeed, the field matured alongside, and helped propel, Linguistics and Computer Science, but it was not until the stars aligned with the right training paradigm and an intuitive user interface that the world would understand the implications of this research.
This post was written to help you understand how we arrived at the technology in our hands today.
Robot brain translates Russian into King’s English
"Mi pyeryedayem mislyi posryedstvom ryechi -> We transmit thoughts by means of speech."
The first public demonstration of Natural Language Processing (NLP) was the Georgetown-IBM experiment on January 7, 1954, in which "more than 60" Russian sentences, drawn primarily from the domain of organic chemistry, were automatically translated into English on the IBM 701, a machine about the size of a tennis court with roughly 9,216 bytes of memory.
The solution relied on six syntactic rules and a bilingual dictionary of 250 carefully selected words, and had an input/output rate of about 800 words per second (take that, ChatGPT).
By today's standards the experiment would be considered at best trivial and at worst a hack, but in 1954 it required the most advanced computing resources available.
From Hutchins' reporting on the event:
"Every aspect of the process sent the programmers into unknown territory. Decisions had to be made about the coding of alphabetic characters, how the Russian letters were to be transliterated, how the Russian vocabulary was to be stored on the magnetic drum, how the ‘syntactic’ codes were to operate and how they were to be stored, how much information was to go on each punched card, etc..."
The bilingual dictionary was not just a simple mapping of {"Source Word" : "Target Word"}. Each entry was broken down into a word stem or suffix and carried a hand-coded set of syntactic rules that the word part should follow during the translation process.
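To make that concrete, here is a rough sketch of what such an entry might have looked like. The keys, field names, and rule codes below are invented for illustration; the real system stored transliterated Russian word parts on a magnetic drum along with numeric codes selecting among its six syntactic rules.

```python
# Hypothetical sketch of a Georgetown-IBM style dictionary entry.
# The keys, field names, and rule codes are invented for illustration;
# the real system stored transliterated Russian word parts on a magnetic
# drum along with numeric codes selecting among its six syntactic rules.
bilingual_dictionary = {
    "ryech-": {                 # Russian stem (transliterated)
        "translation": "speech",
        "part": "stem",
        "rule_codes": [2, 5],   # which of the six rules govern this entry
    },
    "-i": {                     # a suffix: contributes grammar, not meaning
        "translation": None,
        "part": "suffix",
        "rule_codes": [1],
    },
}
```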

We won't list every obvious drawback of this approach here, but it is a good time to introduce some of the core obstacles these researchers ran into, obstacles that persist within NLP to this day:
- The concept of breaking a word down into its component parts is non-trivial. This process is known in the field as "Word Tokenization", and varies between languages (see the sketch after this list).
- Words suffer from collision; multiple meanings can map to the same word token. Differentiating the correct meaning from the surrounding context words became known as "Word Sense Disambiguation".
- The syntactic structure (i.e., the rules) of language is almost entirely subconscious, which makes it complicated to teach to a computer.
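To make the tokenization obstacle concrete, here is a minimal sketch of naive stem/suffix splitting. The suffix list is invented for illustration; real tokenizers need language-specific rules (and modern systems learn subwords from data).

```python
# Minimal sketch of naive word tokenization: split on whitespace, then peel
# off a known suffix so each word maps to a stem plus a suffix token.
# The suffix list is invented for illustration only.
KNOWN_SUFFIXES = ["ing", "ed", "s"]

def tokenize(text: str) -> list[str]:
    tokens = []
    for word in text.lower().split():
        for suffix in KNOWN_SUFFIXES:
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                tokens.extend([word[: -len(suffix)], "-" + suffix])
                break
        else:
            tokens.append(word)
    return tokens

print(tokenize("We transmit thoughts by means of speech"))
# ['we', 'transmit', 'thought', '-s', 'by', 'mean', '-s', 'of', 'speech']
```

Note how "means" gets split into "mean" + "-s", a reasonable guess that happens to be wrong for the idiom "by means of"; that is exactly the kind of trap that makes tokenization (and disambiguation) non-trivial.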
The Georgetown-IBM experiment succeeded in sparking academic and commercial interest in Computational Linguistics. Optimism about "cracking" Machine Translation was exuberant, and funding began to flow to universities.
As We May Think
"Consider a future device … in which an individual stores all his books, records, and communications, and which is mechanized so that it may be consulted with exceeding speed and flexibility. It is an enlarged intimate supplement to his memory."
While researchers in Machine Translation continued to build out "rules-based" approaches, the concept of using a computer to search for information2 crossed from the realm of speculative fiction into reality.
In 1965, researchers at Cornell University led by Gerard Salton introduced the "System for the Mechanical Analysis and Retrieval of Text" (SMART3) Information Retrieval (IR) system, which helped lay the groundwork for the cornerstone technology within NLP: the Vector Space Model.
An early iteration of the Vector Space Model within IR was what Salton called the "Incidence Matrix", which recorded which words were present in each document (rows were documents, columns were words, a very sparse matrix indeed).
The system would use syntactic parsing4 to determine which words were important in documents and search queries, and use the incidence matrix to match documents to queries with high word "overlap".
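Here is a minimal sketch of the incidence-matrix idea, assuming plain whitespace tokenization and binary word presence (the documents and query are invented for illustration):

```python
# Minimal sketch of incidence-matrix retrieval: rows are documents, columns
# are vocabulary words, and a cell marks whether the word appears in the
# document. A query is matched to the documents with the most word overlap.
documents = [
    "salt and water form a saline solution",
    "the ibm 701 translated russian sentences",
    "vector space models represent text as coordinates",
]
query = "russian sentences translated by the ibm 701"

vocabulary = sorted({w for doc in documents for w in doc.split()})
incidence = [[1 if w in doc.split() else 0 for w in vocabulary] for doc in documents]
query_vec = [1 if w in query.split() else 0 for w in vocabulary]

overlaps = [sum(d * q for d, q in zip(row, query_vec)) for row in incidence]
best = max(range(len(documents)), key=lambda i: overlaps[i])
print(documents[best])  # "the ibm 701 translated russian sentences" wins on overlap
```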
It took another decade, but by 1975 Salton had fleshed out Term Frequency-Inverse Document Frequency (TF-IDF) vectors5, which integrated a notion of the statistical rarity of the words present in a document or search query relative to the rest of the corpus.
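One common formulation of the TF-IDF weight for a term $t$ in a document $d$, over a corpus of $N$ documents (the exact weighting scheme varies between implementations), is:

$$w_{t,d} = \mathrm{tf}_{t,d} \cdot \log\frac{N}{\mathrm{df}_t}$$

where $\mathrm{tf}_{t,d}$ is how often $t$ appears in $d$ and $\mathrm{df}_t$ is the number of documents containing $t$: terms that are frequent in this document but rare across the corpus get the highest weights.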
With TF-IDF vectors, documents and search queries could be plotted as coordinates in a high-dimensional space. The retrieval model searches that space for documents that lie close to the search query's coordinates, as measured by some distance metric, typically cosine similarity.
$$\text{similarity}(a, b) = \cos(\theta) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$$
Cosine similarity calculates the cosine of the angle between two vectors, which makes it a measure of orientation rather than magnitude. This is particularly useful in text analysis because it captures the similarity of documents based on their direction in the vector space, regardless of their length. That is, similar documents "point in the same direction", while documents that have no relation are called orthogonal (the cosine of 90 degrees is zero).
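Putting the pieces together, here is a minimal sketch of TF-IDF retrieval ranked by cosine similarity, assuming whitespace tokenization and the raw-count * log(N / df) weighting shown above (the documents and query are invented for illustration):

```python
import math
from collections import Counter

# Minimal sketch: build TF-IDF vectors for a tiny corpus and rank documents
# against a query by cosine similarity. Whitespace tokenization only.
documents = [
    "we transmit thoughts by means of speech",
    "the machine translated russian speech into english",
    "retrieval systems match queries to documents",
]
query = "machine translation of russian"

def tokenize(text):
    return text.lower().split()

vocab = sorted({w for doc in documents for w in tokenize(doc)})
N = len(documents)
df = {w: sum(w in tokenize(doc) for doc in documents) for w in vocab}

def tfidf_vector(text):
    counts = Counter(tokenize(text))
    # Query terms outside the corpus vocabulary are simply ignored here.
    return [counts[w] * math.log(N / df[w]) for w in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

query_vec = tfidf_vector(query)
ranked = sorted(documents, key=lambda d: cosine(tfidf_vector(d), query_vec), reverse=True)
print(ranked[0])  # the Russian-translation document ranks first
```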
Sounds like a great system! In fact, in 2015 83% of production IR systems6 still relied on TF-IDF models.
But TF-IDF suffers from a serious drawback: if you want to take synonyms into consideration, you're kinda S.O.L. A query for "automobile" will never match a document that only ever says "car".
Elasticsearch built its whole business model on selling software to support TF-IDF indexing (but you still needed to provide the synonyms yourself).
If only there were a way to "learn" a representation of a word based on the words that commonly appear next to it...
Footnotes
1. Hutchins, W. J., "The first public demonstration of machine translation: the Georgetown-IBM system, 7th January 1954."