Deep Learning for Natural Language Processing and Information Retrieval

Back in 1988, during the second revival of neural networks, I became increasingly interested in using neural networks for language-related human cognitive functions. Not speech recognition or speech synthesis, but the higher-level linguistics: syntactic and semantic tasks in language processing, and information retrieval in particular. In the years before, I had done a research project on machine learning for Optical Character Recognition (OCR), supervised by Jaap van den Herik, and had worked for almost two years on an information retrieval project in one of the intelligence departments of the Royal Dutch Navy. Looking for a PhD project, neural networks for natural language processing and information retrieval was a logical topic. In 1989, my PhD project started at the University of Amsterdam under the supervision of Remko Scha, leading to my thesis defense in January 1993: Neural Networks for Natural Language Processing and Information Retrieval (feel free to download it!).

Neural Networks in NLP and IR

In 1995, the European Commission awarded ZyLAB a grant to extend the research and publish a book on the topic: Artificial neural networks for information retrieval in a libraries context, which you can download here.

So, you may understand that probably nobody is happier than I am with the third revival of neural networks, or deep learning as we now call it. Let me explain this with a number of quotes from the above publications.

My research in the late 1980s and early 1990s yielded a number of interesting findings:

“During the training process, self-organizing neural networks show very interesting recurrent interactions, which enable the models to derive organizations and classifications that are based on more than just simple adjacent context dependencies”

to continue:

“… neural networks are able to integrate knowledge from different sources, thereby disambiguating language more easily. The model successfully integrated word, syntax category, syntax structure and case role information. Other additions to the model can be implemented easily. Moreover, neural networks are able to process corrupted sentences and they can even correct them to proper sentences by using the generalization capabilities of the neural network. It can even be argued that the model shows creative behavior as it replaces words by more probable ones within that context.”

We had never observed such cognitive behaviors in rule-based AI models before!

However, we also ran into several problems that we could not solve with the hardware we had in those days:

“… neural technology is not really scalable. Almost all experiments are based on sequential simulations of parallel algorithms. Parallel hardware is rare and very expensive. Moreover, many of the models tend to collapse if they are applied to large problems.”

 “It is difficult to use the models in larger applications. As soon as the feature maps grow larger, they tend to entangle.”

We even proposed larger and more computationally complex models that, at the time, we could only dream of simulating:

“ … a possible solution could be to try to combine many smaller maps into a hierarchical system.”

Which is exactly one of the differentiating principles of multi-layer Convolutional Neural Networks (CNNs).

At the end of seven years of research, unfortunately, the inevitable conclusion was:

“As neural nets still show considerable drawbacks for real-world applications and as solutions for these problems are not to appear shortly, it may be wise to investigate other implementation techniques such as advanced statistical (re-estimation) techniques.”

Which is what we did ever since, using Support Vector Machines, Conditional Random Fields, Non-Negative Matrix Factorization, Maximum Entropy Modeling and other more understandable and scalable machine learning methods, building AI and machine learning algorithms for a variety of tasks in eDiscovery and related applications.

But the story continued: after 25 years, computing power has increased by approximately 2 to the power of 17 (about 131,000 times!), and there is more electronic data available than ever, including many manually tagged training sets that can be used to train neural networks. The new architecture of Convolutional Neural Networks has been a great success in applying deep learning to visual and audio classification problems, and even to gaming (although part of the skill needed to be successful at the game of Go can also be considered a visual classification skill).

Despite all these successes of deep learning in human cognitive functions such as speech and vision, there was no real solution for applying these networks to written language. Until 2013, when Tomas Mikolov proposed a neural text representation scheme for written language named Word2Vec. Word2Vec is based on the analysis of 3 billion running words (!) and provided the research field with a great tool to represent written language in neural network architectures.
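To make the idea concrete, here is a minimal sketch (my own illustration, not taken from the publications above) of how such word embeddings can be trained and queried with the gensim library in Python. The toy corpus and the hyperparameters are assumptions purely for demonstration; a real setup would train on billions of running words.

```python
# Minimal Word2Vec sketch using gensim (illustrative only).
from gensim.models import Word2Vec

# Toy corpus: each "document" is a list of tokens. A realistic corpus would
# contain millions of documents rather than three short sentences.
corpus = [
    ["court", "orders", "disclosure", "of", "privileged", "documents"],
    ["the", "court", "reviews", "privileged", "email", "messages"],
    ["custodian", "email", "collections", "are", "reviewed", "for", "disclosure"],
]

# Skip-gram model (sg=1) with 100-dimensional vectors; window and min_count
# are illustrative values, not the settings from Mikolov's experiments.
model = Word2Vec(sentences=corpus, vector_size=100, window=5, min_count=1, sg=1)

print(model.wv["court"].shape)          # -> (100,): each word becomes a dense vector
print(model.wv.most_similar("court"))   # words closest in the embedding space
```

The point of the representation is that words appearing in similar contexts end up close together in the vector space, which is what makes it usable as input for downstream neural network architectures.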

Recent research into a variety of machine learning problems related to eDiscovery at ZyLAB and the University of Maastricht shows that using Deep Learning and Word2Vec for written-language classification problems consistently outperforms not only all other machine learning algorithms, but also humans.
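As a rough illustration of what such a pipeline can look like (a hypothetical sketch, not ZyLAB's actual models: the Keras API choice, the layer sizes and the binary classification task are my own assumptions), a small one-dimensional convolutional network over word-embedding inputs might be set up like this:

```python
# Hypothetical sketch: a 1-D CNN over word-embedding inputs for binary
# document classification (e.g. responsive vs. non-responsive).
import numpy as np
from tensorflow.keras import layers, models

max_len, embed_dim = 200, 100   # tokens per document, embedding dimension

model = models.Sequential([
    # In practice this input would be sequences of pre-trained Word2Vec vectors.
    layers.Input(shape=(max_len, embed_dim)),
    layers.Conv1D(filters=128, kernel_size=5, activation="relu"),
    layers.GlobalMaxPooling1D(),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # probability of the positive class
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Dummy data just to show the expected shapes; real input would be documents
# mapped to sequences of Word2Vec vectors, with human labels as targets.
X = np.random.rand(32, max_len, embed_dim).astype("float32")
y = np.random.randint(0, 2, size=(32,))
model.fit(X, y, epochs=1, batch_size=8)
```

The convolution layer plays the role of the "many smaller maps combined into a hierarchical system" proposed above: it detects local word patterns, which the pooling and dense layers then combine into a document-level decision.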

Calculation times are still a bit problematic, as they are 200 times slower (without using GPUs or other special hardware) than other machine learning algorithms. But the superior results, announcements of dedicated Deep Learning hardware chips, and scientific research focused on finding more efficient algorithms to train CNNs give us hope that this problem will be solved soon.

So, is the Deep Learning approach superior and can we forget about everything else? Not quite yet: there are still a few conceptual problems related to CNNs that have not gone away. I quote from my thesis from 1993:

“If one applies neural technology to an application, it either works, or it does not work. It is impossible to patch the model for certain exceptions, which can be done in a sequential algorithm. … it is hard to say which neuron is responsible for which action (the credit assignment problem).”

Therefore, there is still plenty of room for research into better understanding how and why deep learning works. Especially for the acceptance of Artificial Intelligence solving important tasks in our society, this kind of understanding is essential.

But I strongly believe that this third revival of neural networks will bring us what we have been looking for since the early 1940s. There are many problems in eDiscovery that rely on machine learning techniques and that we can probably solve better with Deep Learning; some examples are Technology Assisted Review, Concept Search, Semantic Clustering, Privilege Detection, Automatic Redaction (also for GDPR purposes), Information Extraction for better internal investigations, and maybe even a few other applications we have not even thought about yet. These are (again) exciting times!

Let me close with one of my postulations from 1993, which follows a quote from Frederick Jelinek, one of the pioneers of data-driven machine learning in the late 1970s:

“The only good data is more data”

Which is exactly what we have in abundance these days!
