3RD INTERNATIONAL CONGRESS ON TECHNOLOGY - ENGINEERING & SCIENCE - Kuala Lumpur - Malaysia (2017-02-09)

Lexical Normalisation For Malay Short Forms In Social Media

Social media has become highly popular, with activities ranging from status updates to recording videos, taking pictures and uploading animations on platforms such as Facebook, Twitter and Instagram. It is widely used not only in users' daily lives but also for business purposes such as advertising and promotional news, including content that goes viral, and by politicians and popular personal users. In this paper, a method is proposed for identifying and lexically normalising ill-formed words by targeting out-of-vocabulary words in Malay-language messages on Twitter. Twitter contains a large volume of very noisy data, which hinders its use for Natural Language Processing (NLP). This study used one thousand Malaysian Twitter messages. However, due to the racial and cultural diversity in Malaysia, the native language is sometimes mixed with foreign languages in messages. For example, Malay and English are often mixed in a phrase, or a sentence may combine Chinese, Malay and English words. The noisiness of these data degrades the performance of downstream applications that rely on existing NLP tools to process data. Non-standard words also cause inaccurate frequency estimates for keywords and consequently reduce the utility of keyword-based systems. Most NLP tools are designed to prioritise accuracy over efficiency, so processing texts from Twitter is a challenge: the processing speed cannot match the data generation speed, resulting in an efficiency gap between data generation and consumption.
To overcome this, the normalisation proceeds in three steps: the pre-processing of ill-formed words; the creation of a rule for dialects, focusing on the dialect of northern Malaysia; and the normalisation of the text, moving from a cascaded token-based approach to a type-based approach that uses a combined lexicon, based on an analysis of existing and developed text normalisation methods. The impact of the normalisation on a downstream Twitter POS tagging task is also evaluated, and the results reveal the effectiveness of the text normalisation, although its boost in accuracy is not comparable in the Twitter POS tagger domain. It is hoped that the evaluation of the normalised words will increase the effectiveness of normalisation for the Malay language and remove difficulties for NLP applications.
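The three steps described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the tiny in-vocabulary list, the dialect mappings and the short-form lexicon entries are all hypothetical placeholders chosen for the example, and the dialect "rule" is reduced to a simple word-level lookup.

```python
import re

# Hypothetical, illustrative resources; none of these entries come from the paper.
IN_VOCABULARY = {"saya", "yang", "dengan", "untuk", "kamu", "mereka", "pergi", "makan"}
DIALECT_RULES = {"hang": "kamu", "depa": "mereka"}  # assumed northern-dialect mappings
SHORT_FORM_LEXICON = {"yg": "yang", "dgn": "dengan", "utk": "untuk", "sy": "saya"}


def preprocess(message):
    """Step 1: lowercase, drop URLs and @mentions, collapse letter runs, tokenise."""
    text = message.lower()
    text = re.sub(r"https?://\S+|@\w+", " ", text)
    # Collapse any character repeated three or more times ("makannnn" -> "makan").
    text = re.sub(r"(.)\1{2,}", r"\1", text)
    return re.findall(r"[a-z0-9]+", text)


def normalise_token(token):
    """Steps 2 and 3: only out-of-vocabulary tokens are candidates for normalisation."""
    if token in IN_VOCABULARY:
        return token
    if token in DIALECT_RULES:          # step 2: dialect rule
        return DIALECT_RULES[token]
    if token in SHORT_FORM_LEXICON:     # step 3: type-based lexicon lookup
        return SHORT_FORM_LEXICON[token]
    return token                        # unknown tokens are left unchanged


def normalise(message):
    return " ".join(normalise_token(t) for t in preprocess(message))


print(normalise("Hang pergi dgn depa http://t.co/x @ali"))
```

The lookup is type-based in the sense that each word type maps to one standard form regardless of its context in the message; a real combined lexicon would be far larger and would have to handle mixed-language tokens, which this sketch leaves untouched.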
NURUL SYAFIQAH ABDUL RAHMAN, NAZLIA OMAR