
This post discusses semi-supervised learning algorithms that learn from proxy labels assigned to unlabelled data. Unsupervised learning constitutes one of the main challenges for current machine learning models and one of the key elements that is missing for general artificial intelligence. While unsupervised learning on its own is still elusive, researchers have made a lot of progress in combining unsupervised learning with supervised learning. This branch of machine learning research is called semi-supervised learning.

Semi-supervised learning has a long history. For a slightly outdated overview, refer to Zhu [1] and Chapelle et al. Recently in particular, semi-supervised learning has seen notable success, considerably reducing error rates on important benchmarks. In this blog post, I will focus on a particular class of semi-supervised learning algorithms that produce proxy labels on unlabelled data, which are used as targets together with the labelled data.

These proxy labels are produced by the model itself or variants of it without any additional supervision; they thus do not reflect the ground truth but might still provide some signal for learning.


In a sense, these labels can be considered noisy or weak. I will highlight the connection to learning from noisy labels and weak supervision, as well as other related topics, at the end of this post. This class of models is of particular interest in my opinion, as (a) deep neural networks have been shown to be good at dealing with noisy labels and (b) these models have achieved state-of-the-art results in semi-supervised learning for computer vision. Note that many of these ideas are not new and many related methods have been developed in the past.

In one half of this post, I will thus cover classic methods and discuss their relevance for current approaches; in the other half, I will discuss techniques that have recently achieved state-of-the-art performance. Some of the following approaches have been referred to as self-teaching or bootstrapping algorithms; I am not aware of a term that captures all of them, so I will simply refer to them as proxy-label methods.

I will divide these methods into three groups, which I will discuss in the following: (1) self-training, which uses a model's own predictions as proxy labels; (2) multi-view learning, which uses the predictions of models trained with different views of the data; and (3) self-ensembling, which ensembles variations of a model's own predictions and uses these as feedback for learning.

I will show pseudo-code for the most important algorithms. You can find the LaTeX source here. There are many interesting and equally important directions for semi-supervised learning that I will not cover in this post.

Self-training (Yarowsky; McClosky et al.)

As the name implies, self-training leverages a model's own predictions on unlabelled data in order to obtain additional information that can be used during training.

Typically, the most confident predictions are taken at face value, as detailed next. This process is generally repeated for a fixed number of iterations or until no more predictions on unlabelled examples are confident. This instantiation is the most widely used and is shown in Algorithm 1. Classic self-training has shown mixed success. In parsing, it proved successful with small datasets (Reichart and Rappoport; Huang and Harper) [6] [7] or when a generative component is used together with a reranker when more data is available (McClosky et al.).
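As a rough illustration of Algorithm 1, here is a minimal self-training loop sketched against scikit-learn-style classifiers and numpy feature matrices; the confidence threshold, iteration cap, and function name are illustrative choices rather than part of the original algorithm.

```python
import numpy as np
from sklearn.base import clone

def self_train(model, X_labeled, y_labeled, X_unlabeled,
               threshold=0.9, max_iter=10):
    """Minimal self-training loop: repeatedly fit on the labelled set,
    pseudo-label the unlabelled examples the model is most confident about,
    and move them into the training set."""
    X_train, y_train = X_labeled.copy(), y_labeled.copy()
    X_pool = X_unlabeled.copy()
    for _ in range(max_iter):
        model = clone(model).fit(X_train, y_train)
        if len(X_pool) == 0:
            break
        probs = model.predict_proba(X_pool)
        confident = probs.max(axis=1) >= threshold
        if not confident.any():  # stop once no prediction is confident enough
            break
        pseudo_labels = model.classes_[probs[confident].argmax(axis=1)]
        X_train = np.vstack([X_train, X_pool[confident]])
        y_train = np.concatenate([y_train, pseudo_labels])
        X_pool = X_pool[~confident]
    return model
```

In practice, the selection criterion (a fixed threshold here) and how many examples are added per iteration vary between implementations.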

Some success was achieved with careful task-specific data selection (Petrov and McDonald) [9], while others report limited success on a variety of NLP tasks (He and Zhou; Plank; Van Asch and Daelemans; van der Goot et al.).

The main downside of self-training is that the model is unable to correct its own mistakes. If the model's predictions on unlabelled data are confident but wrong, the erroneous data is nevertheless incorporated into training and the model's errors are amplified.

This effect is exacerbated if the domain of the unlabelled data is different from that of the labelled data; in this case, the model's confidence will be a poor predictor of its performance.

Multi-view training

Multi-view training aims to train different models with different views of the data.

Ideally, these views complement each other and the models can collaborate in improving each other's performance. These views can differ in several ways, such as in the features they use, in the architectures of the models, or in the data on which the models are trained.

In co-training, two models are trained on separate, ideally conditionally independent views of the data, and each model's confident predictions on unlabelled examples are added to the training set of the other model. One model thus provides the labels for the inputs on which the other model is uncertain. Co-training is shown in Algorithm 2. In the original co-training paper (Blum and Mitchell), co-training is used to classify web pages, using the text on the page as one view and the anchor text of hyperlinks on other pages pointing to the page as the other view.
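A minimal sketch of co-training in the same scikit-learn style as above; the threshold and the rule for removing examples from the unlabelled pool are simplifications, and variants differ in exactly how confident examples are exchanged.

```python
import numpy as np
from sklearn.base import clone

def co_train(model_a, model_b, Xa, Xb, y, Ua, Ub, threshold=0.9, max_iter=10):
    """Co-training sketch: model_a sees view A (Xa/Ua), model_b sees view B
    (Xb/Ub). Each model's confident pseudo-labels on the unlabelled pool are
    added to the *other* model's training set."""
    ya, yb = y.copy(), y.copy()
    Xa, Xb = Xa.copy(), Xb.copy()
    for _ in range(max_iter):
        model_a = clone(model_a).fit(Xa, ya)
        model_b = clone(model_b).fit(Xb, yb)
        if len(Ua) == 0:
            break
        probs_a, probs_b = model_a.predict_proba(Ua), model_b.predict_proba(Ub)
        conf_a = probs_a.max(axis=1) >= threshold
        conf_b = probs_b.max(axis=1) >= threshold
        if not (conf_a.any() or conf_b.any()):
            break
        # A teaches B on the examples A is confident about, and vice versa.
        Xb = np.vstack([Xb, Ub[conf_a]])
        yb = np.concatenate([yb, model_a.classes_[probs_a[conf_a].argmax(axis=1)]])
        Xa = np.vstack([Xa, Ua[conf_b]])
        ya = np.concatenate([ya, model_b.classes_[probs_b[conf_b].argmax(axis=1)]])
        keep = ~(conf_a | conf_b)
        Ua, Ub = Ua[keep], Ub[keep]
    return model_a, model_b
```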

As two conditionally independent views are not always available, Chen et al. propose pseudo-multiview regularization, which constrains the models so that at least one of them has a zero weight for each feature.
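To make that constraint concrete, one way to encode "at least one model has a zero weight for each feature" as a soft penalty is sketched below; this is an illustrative formulation, not necessarily the exact regularizer used by Chen et al.

```python
import torch

def pseudo_multiview_penalty(w_a, w_b):
    """Soft version of the constraint that, for every feature j, at least one
    of the two weight vectors is zero: the penalty vanishes exactly when
    w_a[j] == 0 or w_b[j] == 0 for all j."""
    return torch.sum((w_a * w_b) ** 2)
```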

Recently, natural language processing models, such as BERT and T5, have shown that it is possible to achieve good results with few class labels by first pretraining on a large unlabeled dataset and then fine-tuning on a smaller labeled dataset. These methods fall under the umbrella of self-supervised learning, which is a family of techniques for converting an unsupervised learning problem into a supervised one by creating surrogate labels from the unlabeled dataset.
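For example, with the Hugging Face transformers library, fine-tuning a pretrained model on a small labelled classification set looks roughly like the sketch below; the model name and label count are placeholders, and a real setup would loop over a labelled dataset with an optimizer.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Start from weights pretrained on large unlabelled corpora, then fine-tune
# them on a (much smaller) labelled dataset.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

batch = tokenizer(["a small labelled example"], return_tensors="pt",
                  padding=True, truncation=True)
labels = torch.tensor([1])
loss = model(**batch, labels=labels).loss  # supervised fine-tuning loss
loss.backward()                            # one gradient step of fine-tuning
```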

However, current self-supervised techniques for image data are complex, requiring significant modifications to the architecture or the training procedure, and have not seen widespread adoption. Our proposed framework, called SimCLR, significantly advances the state of the art on self-supervised and semi-supervised learning and achieves a new record for image classification with a limited amount of class-labeled data. The simplicity of our approach means that it can be easily incorporated into existing supervised learning pipelines.

The SimCLR framework

SimCLR first learns generic representations of images on an unlabeled dataset, and then it can be fine-tuned with a small amount of labeled images to achieve good performance for a given classification task. The generic representations are learned by simultaneously maximizing agreement between differently transformed views of the same image and minimizing agreement between transformed views of different images, following a method called contrastive learning.

To begin, SimCLR randomly draws examples from the original dataset, transforming each example twice using a combination of simple augmentations (random cropping, random color distortion, and Gaussian blur), creating two sets of corresponding views.
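A rough sketch of this augmentation step using torchvision; the specific probabilities and parameter values are illustrative, not the exact ones used in the SimCLR paper.

```python
from torchvision import transforms

# Two random augmented "views" of each image: crop, color distortion, blur.
simclr_augment = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.RandomApply([transforms.ColorJitter(0.8, 0.8, 0.8, 0.2)], p=0.8),
    transforms.RandomGrayscale(p=0.2),
    transforms.GaussianBlur(kernel_size=23),
    transforms.ToTensor(),
])

def two_views(pil_image):
    """Return two independently augmented views of the same image."""
    return simclr_augment(pil_image), simclr_augment(pil_image)
```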

SimCLR then computes the image representation using a convolutional neural network variant based on the ResNet architecture. Afterwards, SimCLR computes a non-linear projection of the image representation using a fully-connected network (i.e., an MLP). We use stochastic gradient descent to update both the CNN and the MLP in order to minimize the loss function of the contrastive objective; the CNN and MLP layers are trained simultaneously to yield projections that are similar for augmented versions of the same image, while being dissimilar for different images, even if those images are of the same class of object. After pre-training on the unlabeled images, we can either directly use the output of the CNN as the representation of an image, or we can fine-tune it with labeled images to achieve good performance for downstream tasks.
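A minimal PyTorch sketch of a contrastive (NT-Xent-style) objective, assuming z1 and z2 hold the projection-head outputs for the two augmented views of a batch; the temperature value is illustrative.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    """Normalized temperature-scaled cross-entropy loss over a batch of
    projections. z1 and z2 have shape (N, dim); row i of z1 and row i of z2
    are the two views of the same image (a positive pair)."""
    N = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2N, dim)
    sim = z @ z.t() / temperature                        # scaled cosine similarities
    sim.fill_diagonal_(float('-inf'))                    # exclude self-pairs
    # For row i, the positive is the other view of the same image.
    targets = torch.cat([torch.arange(N, 2 * N), torch.arange(0, N)])
    return F.cross_entropy(sim, targets)
```

Minimizing this loss pulls the two views of each image together and pushes apart views of different images, which is the agreement objective described above.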

The trained model not only does well at identifying different transformations of the same image, but also learns representations of similar concepts.

Performance

Despite its simplicity, SimCLR greatly advances the state of the art in self-supervised and semi-supervised learning on ImageNet. A linear classifier trained on top of self-supervised representations learned by SimCLR substantially outperforms linear classifiers trained on representations learned with previous self-supervised methods pretrained on ImageNet.

Understanding Contrastive Learning of Representations

The improvement SimCLR provides over previous methods is not due to any single design choice, but to their combination. Several important findings are summarized below.

Finding 1: The combinations of image transformations used to generate corresponding views are critical. As SimCLR learns representations via maximizing agreement of different views of the same image, it is important to compose image transformations to prevent trivial forms of agreement, such as agreement of the color histograms.

To understand this better, we explored different types of transformations. We found that while no single transformation that we studied suffices to define a prediction task that yields the best representations, two transformations stand out: random cropping and random color distortion.

Although neither cropping nor color distortion leads to high performance on its own, composing these two transformations leads to state-of-the-art results. To understand why combining random cropping with random color distortion is important, consider the process of maximizing agreement between two crops of the same image. This naturally encompasses two types of prediction tasks that enable effective representation learning: (a) predicting a local view of an image from a more global view, and (b) predicting one view from a neighboring view of the same image. However, different crops of the same image usually look very similar in color space.

If the colors are left intact, a model can maximize agreement between crops simply by matching the color histograms. In this case, the model might focus solely on color and ignore other more generalizable features. By independently distorting the colors of each crop, these shallow clues can be removed, and the model can only achieve agreement by learning useful and generalizable representations.

A second important finding concerns the projection head. In SimCLR, an MLP-based nonlinear projection is applied before the contrastive loss is computed, which helps to identify the invariant features of each input image and maximize the ability of the network to identify different transformations of the same image.
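For concreteness, the projection head is a small MLP on top of the encoder output; the layer sizes below are illustrative assumptions (a ResNet-style encoder output of 2048 and a 128-dimensional projection).

```python
import torch.nn as nn

class ProjectionHead(nn.Module):
    """Non-linear projection applied to the encoder output before the
    contrastive loss. The representations *before* this projection are what
    get reused for downstream tasks."""
    def __init__(self, in_dim=2048, hidden_dim=2048, out_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, h):
        return self.net(h)
```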

Interestingly, a comparison between the representations used as input to the MLP projection module and the output of the projection reveals that the earlier-stage representations perform better when measured by a linear classifier. Since the loss function for the contrastive objective is based on the output of the projection, it is somewhat surprising that the representation before the projection is better.
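The linear evaluation referred to here can be sketched as follows: freeze the encoder, extract features, and fit a plain linear classifier on top. This is a minimal scikit-learn version, with function and variable names of my own choosing.

```python
from sklearn.linear_model import LogisticRegression

def linear_probe(train_features, train_labels, test_features, test_labels):
    """Linear evaluation protocol: train a linear classifier on top of frozen
    representations and report its accuracy on held-out data."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(train_features, train_labels)
    return clf.score(test_features, test_labels)
```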



Semi-Supervised Learning for NLP

Lecture notes on semi-supervised learning for NLP ask why semi-supervised learning is useful for NLP and why deep learning has been so successful recently, and cover several techniques. Pre-training is one of the tricks that started to make neural networks successful, from word2vec to contextual representations such as ELMo (Peters et al.). Self-training has the model label the unlabeled data and takes some of the examples the model is most confident about; in online self-training, the model-produced label is converted to a one-hot target vector. Virtual adversarial training (Miyato et al.) builds on adversarial examples: small, imperceptible-to-humans tweaks to a neural network's inputs that can change its output. Creating an adversarial example amounts to computing the gradient of the loss with respect to the input, adding epsilon times the gradient to the input, and possibly repeating this multiple times. Related techniques include word dropout and cross-view consistency (Clark et al.).
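A minimal PyTorch sketch of that adversarial-example recipe; the model, loss function, and epsilon are placeholders, and common variants normalize the gradient or use its sign instead of the raw gradient.

```python
import torch

def adversarial_example(model, loss_fn, x, y, epsilon=0.01, steps=1):
    """Follow the recipe above: compute the gradient of the loss with respect
    to the input, add epsilon times the gradient, and optionally repeat.
    For virtual adversarial training, y would be the model's own prediction
    rather than a gold label."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = loss_fn(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        x_adv = (x_adv + epsilon * grad).detach()  # some variants use grad.sign()
    return x_adv
```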




