Integration of collective knowledge and computer vision for annotation of multimedia objects

01 January 2013 → 31 December 2016
Regional and community funding: IWT/VLAIO
Research disciplines
  • Engineering and technology
    • Computer hardware
    • Computer theory
    • Scientific computing
    • Other computer engineering, information technology and mathematical engineering
annotation of multimedia objects, natural language processing, language modeling
Project description

Having a natural conversation with an artificially intelligent agent such as a computer, without being able to distinguish it from a human, is one of the major goals of building artificially intelligent agents. Natural Language Processing (NLP) algorithms that are able to analyze, understand, and generate text play an important role in realizing this goal. These NLP algorithms are already deployed in many practical applications we use in our daily lives, for example when we search for information using Google's search engine, when we ask a virtual assistant like Alexa what the weather is like today, or when spam is automatically removed from our e-mail inbox. In this dissertation, four NLP tasks are considered that could potentially be used as part of an NLP application: (1) Part-of-Speech (PoS) tagging, (2) morphological tagging, (3) language modeling, and (4) Named Entity Recognition (NER). In all four tasks, a label is predicted for an input word. In PoS tagging, a PoS tag such as noun, verb, or adjective is predicted for each word. Morphological tagging is related to PoS tagging but more fine-grained: for each word, multiple labels are predicted based on the morphemes contained in that word, such as the stem and different affixes. In language modeling, the label predicted is the next word in a sentence, given the last predicted word of that sentence. Finally, NER is about identifying named entities such as persons, locations, and organizations. In this case, a label is predicted for each word in a sentence, indicating whether or not this word is part of a named entity and, if so, which type of entity. Neural networks are used to realize the aforementioned NLP tasks. Neural networks are a type of machine learning algorithm able to learn a mapping between input and output without having to be explicitly programmed.
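To make the "one label per word" framing concrete, the following minimal sketch shows a sentence with one PoS tag and one NER tag per word, and groups the NER tags into entities. The sentence, the tag names, and the BIO encoding (B- for the first word of an entity, I- for a continuation, O for no entity) are illustrative assumptions and are not taken from the dissertation.

```python
# Illustrative only: sentence, tag sets, and the BIO scheme are assumptions.
words = ["Alice", "works", "at", "Ghent", "University"]
pos_tags = ["PROPN", "VERB", "ADP", "PROPN", "PROPN"]  # one PoS tag per word
ner_tags = ["B-PER", "O", "O", "B-ORG", "I-ORG"]       # one NER tag per word


def extract_entities(words, ner_tags):
    """Group BIO-tagged words into (entity_text, entity_type) pairs."""
    entities = []
    current_words, current_type = [], None
    for word, tag in zip(words, ner_tags):
        if tag.startswith("B-"):
            # A new entity starts; close the previous one, if any.
            if current_words:
                entities.append((" ".join(current_words), current_type))
            current_words, current_type = [word], tag[2:]
        elif tag.startswith("I-") and current_words and tag[2:] == current_type:
            # Continuation of the entity that is currently open.
            current_words.append(word)
        else:
            # "O", or an I- tag with no matching open entity: treat as outside.
            if current_words:
                entities.append((" ".join(current_words), current_type))
            current_words, current_type = [], None
    if current_words:
        entities.append((" ".join(current_words), current_type))
    return entities


if __name__ == "__main__":
    for word, pos, ner in zip(words, pos_tags, ner_tags):
        print(word, pos, ner)
    print(extract_entities(words, ner_tags))
```

A neural tagger would predict the per-word tags; the grouping step afterwards is plain post-processing.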
Three different aspects of neural network-based modeling are considered in this dissertation, orthogonal to the four NLP tasks: (1) techniques for improving the input representation, (2) adaptations for optimizing the internal workings of the neural network models themselves, and (3) explaining the output of the model.
The first tasks considered in this dissertation are PoS tagging and NER in Twitter microposts, for which the focus is on the input representations in particular. Because Twitter microposts are a relatively new type of data, little labeled data is available. Moreover, existing approaches mainly focus on news article data and perform significantly worse on Twitter data. In addition, these existing approaches mainly use manual feature engineering to create input representations for each word, which requires a lot of effort and domain knowledge. Therefore, in this dissertation, a solution is proposed that does not require a lot of domain knowledge and that is able to automatically learn good features, called word representations. To learn word representations, two learning algorithms are considered: Word2vec and FastText. Word2vec considers words as atomic units and only uses context words to learn a word representation. FastText also takes into account (groups of) characters contained within a word. While using Word2vec word representations yields competitive results for PoS tagging, using FastText word representations matches the performance of the best-performing system for PoS tagging of microposts without using manual feature engineering. The performance difference between Word2vec and FastText word representations can be attributed to the following observation: FastText word representations can be generated for words that are not part of the vocabulary, whereas with Word2vec, every such word is mapped to a single special unknown token.
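The out-of-vocabulary difference can be illustrated with a minimal sketch. It assumes the usual FastText-style scheme of character n-grams taken over a word wrapped in boundary markers `<` and `>`, with n ranging from 3 to 6; the helper names and the toy vocabulary are hypothetical, and no trained vectors are involved.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word wrapped in boundary markers '<' and '>'."""
    marked = f"<{word}>"
    return [
        marked[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(marked) - n + 1)
    ]


def word2vec_style_key(word, vocabulary):
    """Word-level lookup: every unseen word collapses to one unknown token."""
    return word if word in vocabulary else "<unk>"


def shared_ngrams(word_a, word_b):
    """N-grams two words have in common; a subword model can reuse these."""
    return set(char_ngrams(word_a)) & set(char_ngrams(word_b))


if __name__ == "__main__":
    vocabulary = {"good", "morning"}      # toy vocabulary (assumption)
    unseen = "goood"                      # micropost-style spelling variant
    print(word2vec_style_key(unseen, vocabulary))
    print(sorted(shared_ngrams("good", unseen)))
```

In a subword model, the representation of an unseen word is composed from the vectors of its n-grams, so a spelling variant that shares n-grams with a known word still receives an informative representation.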
Finally, the same approach is also applied to NER in Twitter microposts, where it again outperforms other approaches that only rely on hand-engineered features. However, unlike for PoS tagging, the use of Word2vec word representations for NER yields better results than the use of FastText word representations.
The next task considered is language modeling. Two different approaches are proposed for improving the structure of the neural network for predicting the next word in a sentence. The first approach introduces a new neural network architecture. Specifically, the idea is to densely connect all layers in the neural network stack by adding skip connections between all layer pairs within the neural network. With a skip connection, the signal traveling from input to output does not need to pass through each layer but can skip one or more layers. By densely connecting all layers, perplexity scores similar to those of the traditional stacked approach can be obtained, but with six times fewer parameters. One reason is that adding skip connections avoids the need for gradients to pass through every layer, thus mitigating the vanishing gradient problem. Another reason is that every higher layer has direct access to the outputs of all lower layers, thus making it possible to create more expressive inputs with different time scales.
The second approach proposed in the context of language modeling is a new activation function called Dual Rectified Linear Unit (DReLU). The activation functions used in Recurrent Neural Networks (RNNs) are prone to vanishing gradients. Therefore, inspired by Rectified Linear Units (ReLUs), which are commonplace in Convolutional Neural Networks (CNNs) and which are less prone to vanishing gradients, DReLUs are proposed as a replacement for units that use the tanh activation function, which are commonly used in RNNs.
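This description does not spell out the DReLU formula. The following minimal sketch assumes the definition f(a, b) = max(0, a) - max(0, b) for two pre-activation inputs a and b, and illustrates the properties attributed to DReLUs in this description: signed outputs, exact zeros, and a gradient of magnitude one when a unit is active. The function names are hypothetical.

```python
import math


def relu(x):
    """Standard rectified linear unit."""
    return max(0.0, x)


def drelu(a, b):
    """Dual ReLU, assuming the definition max(0, a) - max(0, b)."""
    return relu(a) - relu(b)


def drelu_grad(a, b):
    """Partial derivatives of drelu with respect to a and b."""
    return (1.0 if a > 0 else 0.0, -1.0 if b > 0 else 0.0)


def tanh_grad(x):
    """Derivative of tanh, which is below 1 for every non-zero input."""
    return 1.0 - math.tanh(x) ** 2


if __name__ == "__main__":
    print(drelu(2.0, 0.5))       # positive activation
    print(drelu(-1.0, 3.0))      # negative activation
    print(drelu(-1.0, -2.0))     # both inputs inactive: exactly zero
    print(drelu_grad(2.0, 0.5))  # active inputs pass gradients at magnitude 1
    print(tanh_grad(2.0))        # tanh shrinks the gradient
```

Because the output is exactly zero whenever both inputs are non-positive, many units in a layer can be inactive at once, which is where the sparsity comes from.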
Similar to tanh units, DReLUs have both positive and negative activations. In addition, DReLUs have several advantages over tanh units: (1) they do not decrease the magnitude of gradients when they are active, (2) they can be exactly zero, making them robust to noise, and (3) they induce sparse output activations in each layer. This new activation function is tested as part of a Quasi-Recurrent Neural Network (QRNN), a specific type of RNN that avoids the heavy matrix-to-matrix multiplications typically used in RNNs. This approach performs as well as or better than QRNNs using tanh units, Long Short-Term Memory (LSTM) networks, and other state-of-the-art methods for language modeling, while at the same time being less computationally intensive.
Finally, in the last part of this dissertation, a new technique is introduced for explaining the output of a CNN, based on the principles outlined for contextual decomposition of LSTMs. The word-level prediction task of interest is morphological tagging, because linguists can attribute the presence of many morphological labels to a specific set of characters within a word. For three morphologically different languages, CNN and bidirectional LSTM models are trained for morphological word tagging based on character-level inputs, and subsequently decomposed using contextual decomposition. It is shown that the patterns learned by these models generally coincide with the morphological segments as defined by linguists, but that sometimes other linguistically plausible patterns are learned. Additionally, it is shown that this technique is able to explain why the model makes a wrong or correct prediction, which implies that this technique could potentially be used for understanding and debugging neural network predictions. In the first part of this dissertation, hand-engineered input representations are replaced by automatically learned word representations that make the input less interpretable.
In the middle part, techniques are introduced that improve neural networks but also make them more complex. In contrast, the last part of this dissertation focuses on making neural networks and their input more interpretable again, counterbalancing the opaqueness and the increased complexity of neural networks.
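The contextual decomposition of a CNN mentioned above splits a model's output into the contribution of a chosen set of characters and the contribution of everything else. The following minimal sketch covers only the linear convolution step, where this split is exact. The scalar "character features", the kernel values, and the choice to assign the bias to the irrelevant part are simplifying assumptions; the handling of non-linearities and pooling, which the full technique requires, is omitted.

```python
def conv1d(signal, kernel, bias=0.0):
    """Valid 1D convolution (cross-correlation) over a list of floats."""
    width = len(kernel)
    return [
        sum(kernel[j] * signal[i + j] for j in range(width)) + bias
        for i in range(len(signal) - width + 1)
    ]


def decompose_conv(signal, relevant_positions, kernel, bias):
    """Split the convolution output into relevant and irrelevant parts."""
    relevant = [x if i in relevant_positions else 0.0 for i, x in enumerate(signal)]
    irrelevant = [0.0 if i in relevant_positions else x for i, x in enumerate(signal)]
    beta = conv1d(relevant, kernel)            # contribution of chosen characters
    gamma = conv1d(irrelevant, kernel, bias)   # the rest; bias assigned here (assumption)
    return beta, gamma


if __name__ == "__main__":
    signal = [1.0, 2.0, 0.5, -1.0]   # one toy feature value per character
    kernel = [0.5, -1.0]
    bias = 0.25
    beta, gamma = decompose_conv(signal, {1, 2}, kernel, bias)
    full = conv1d(signal, kernel, bias)
    print(beta, gamma, full)
```

Because convolution is linear, the two parts always add up to the full output at every position; the difficulty addressed by contextual decomposition lies in carrying such a split through the non-linear layers that follow.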