Datasets/Corpora

Keywords: Vietnamese datasets, Vietnamese corpora, Vietnamese corpus, Vietnamese textual resources.

UIT-ViQuAD (version 1.0) - A Vietnamese Dataset for Evaluating Machine Reading Comprehension. Bộ Dữ liệu Đọc hiểu Tự động cho Tiếng Việt.

Abstract: Over 97 million people worldwide speak Vietnamese as their native language. However, there are few research studies on machine reading comprehension (MRC) for Vietnamese, the task of understanding a text and answering questions related to it. Due to the lack of benchmark datasets for Vietnamese, we present the Vietnamese Question Answering Dataset (UIT-ViQuAD), a new dataset for evaluating MRC models on a low-resource language such as Vietnamese. This dataset comprises over 23,000 human-generated question-answer pairs based on 5,109 passages of 174 Vietnamese articles from Wikipedia. In particular, we propose a new process of dataset creation for Vietnamese MRC. Our in-depth analyses illustrate that our dataset requires abilities beyond simple reasoning like word matching and demands single-sentence and multiple-sentence inferences. Besides, we conduct experiments on state-of-the-art MRC methods for English and Chinese as the first experimental models on UIT-ViQuAD. We also estimate human performance on the dataset and compare it to the experimental results of powerful machine learning models. The substantial gap between human performance and the best model performance on the dataset indicates that improvements can be made on UIT-ViQuAD in future research. Our dataset is freely available on our website to encourage the research community to overcome challenges in Vietnamese MRC.

Cross-Lingual Machine Reading Comprehension: SQuAD (for English), UIT-ViQuAD (for Vietnamese), KorQuAD (for Korean), FQuAD (for French), and SberQuAD (for Russian).

Paper: Kiet Van Nguyen, Duc-Vu Nguyen, Anh Gia-Tuan Nguyen, Ngan Luu-Thuy Nguyen. A Vietnamese Dataset for Evaluating Machine Reading Comprehension. COLING 2020. Link.

Please contact us via email: kietnv@uit.edu.vn (Mr. Kiet Nguyen) to sign the corpus user agreement and then receive the corpus.


UIT-ViNewsQA: New Vietnamese Corpus for Machine Reading Comprehension of Health News Articles

Large-scale, high-quality corpora are necessary for evaluating machine reading comprehension models on a low-resource language like Vietnamese. In addition, machine reading comprehension for the health domain offers great potential for practical applications; however, there is still very little machine reading comprehension research in this domain. In this study, we present UIT-ViNewsQA, a new corpus for the Vietnamese language to evaluate models of healthcare reading comprehension. The corpus comprises 22,077 human-generated question-answer pairs. Crowd-workers create the questions and their answers based on a set of over 4,419 online Vietnamese healthcare news articles, where the answers comprise spans extracted from the corresponding articles. In particular, we develop a process for creating a corpus for Vietnamese machine reading comprehension. Comprehensive evaluations demonstrate that our corpus requires abilities beyond simple reasoning such as word matching, and demands difficult reasoning such as inferences based on single-sentence or multiple-sentence information. We conduct experiments using state-of-the-art methods for machine reading comprehension to obtain the first baseline performance measures, against which future models can be compared. We measure human performance on the corpus and compare it with several strong neural network-based models. Our experiments show that the best model is BERT, which achieves an exact match score of 57.57% and an F1-score of 76.90% on our corpus. The significant gap between human performance and the best model (15.93% in F1-score) on the test set of our corpus indicates that improvements on UIT-ViNewsQA can be explored in future research. Our corpus is freely available on our website to encourage the research community to make these improvements.
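The exact match and F1 scores reported above are the usual metrics for span-extraction reading comprehension. The following is a minimal sketch of how such scores are commonly computed for a single prediction, assuming whitespace tokenization and lowercasing only; the function names are hypothetical, and the authors' actual evaluation script (including any punctuation handling or Vietnamese word segmentation) is not described on this page.

```python
from collections import Counter


def normalize(text: str) -> list[str]:
    # Assumption: lowercase and split on whitespace. Real evaluation
    # scripts often also strip punctuation before comparing.
    return text.lower().split()


def exact_match(prediction: str, gold: str) -> float:
    # 1.0 if the normalized token sequences are identical, else 0.0.
    return float(normalize(prediction) == normalize(gold))


def f1_score(prediction: str, gold: str) -> float:
    # Token-level F1: harmonic mean of precision and recall over the
    # multiset of tokens shared by the prediction and the gold answer.
    pred_tokens = normalize(prediction)
    gold_tokens = normalize(gold)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

A corpus-level score is typically the average of these per-question values; when a question has several gold answers, the maximum over the gold answers is usually taken.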

Paper: Kiet Van Nguyen, Duc-Vu Nguyen, Anh Gia-Tuan Nguyen, Ngan Luu-Thuy Nguyen. New Vietnamese Corpus for Machine Reading Comprehension of Health News Articles. Link.

ViMMRC (version 1.0) - Vietnamese Multiple-choice Machine Reading Comprehension Corpus

Abstract: Machine Reading Comprehension (MRC) is a natural language processing task that studies the ability to read and understand unstructured texts and then find the correct answers to questions. Until now, there has been no MRC dataset for a low-resource language such as Vietnamese. In this paper, we introduce ViMMRC, a challenging machine comprehension corpus with multiple-choice questions, intended for research on the machine comprehension of Vietnamese text. This corpus includes 2,783 multiple-choice questions and answers based on a set of 417 Vietnamese texts used for teaching reading comprehension to 1st to 5th graders. Answers may be extracted from the contents of single or multiple sentences in the corresponding reading text. A thorough analysis of the corpus and the experimental results in this paper illustrate that ViMMRC demands reasoning abilities beyond simple word matching. We propose the Boosted Sliding Window (BSW) method, which improves accuracy by 5.51% over the best baseline method. We also measure human performance on the corpus and compare it to our MRC models. The performance gap between humans and our best experimental model indicates that significant progress can be made on Vietnamese machine reading comprehension in further research. The corpus is freely available on our website for research purposes.

Paper: Kiet Van Nguyen, Khiem Vinh Tran, Son T. Luu, Anh Gia-Tuan Nguyen, Ngan Luu-Thuy Nguyen. Enhancing lexical-based approach with external knowledge for Vietnamese multiple-choice reading comprehension. Link.

Please contact us via email: kietnv@uit.edu.vn (Mr. Kiet Nguyen) to sign the corpus user agreement and then receive the corpus.

ViCTSD (Vietnamese Constructive and Toxic Speech Detection dataset)

Abstract: The rise of social media has led to an increasing number of comments on online forums. However, some of these comments are not informative for users. Moreover, such comments can also be quite toxic and harmful to people. In this paper, we create a dataset for constructive and toxic speech detection, named UIT-ViCTSD (Vietnamese Constructive and Toxic Speech Detection dataset), with 10,000 human-annotated comments. For these tasks, we propose a system for constructive and toxic speech detection built on PhoBERT, the state-of-the-art transfer learning model in Vietnamese NLP. With this system, we achieve F1-scores of 78.59% and 59.40% for identifying constructive and toxic comments, respectively. Besides, to assess the dataset objectively, we implement a variety of baseline models, including traditional machine learning and deep neural network-based models. With these results, we can address some problems in online discussions and develop a framework for automatically identifying the constructiveness and toxicity of Vietnamese social media comments.

Paper: Luan Thanh Nguyen, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen. Constructive and Toxic Speech Detection for Open-domain Social Media Comments in Vietnamese. The 34th International Conference on Industrial, Engineering & Other Applications of Applied Intelligent Systems (IEA/AIE 2021). Link.

Please contact us via email: 17520721@gm.uit.edu.vn (Mr. Luan Nguyen) to sign the corpus user agreement and then receive the corpus.

UIT-VSFC (version 1.0) - Vietnamese Students’ Feedback Corpus

Abstract: Students’ feedback is a vital resource for interdisciplinary research that combines two fields: sentiment analysis and education. The Vietnamese Students’ Feedback Corpus (UIT-VSFC) is a resource consisting of over 16,000 sentences that are human-annotated for two tasks: sentiment-based and topic-based classification. To assess the quality of our corpus, we measure annotator agreement and classification performance on the UIT-VSFC corpus. As a result, we obtained inter-annotator agreement of over 91% for sentiments and over 71% for topics. In addition, we built a baseline model with the Maximum Entropy classifier and achieved approximately 88% sentiment F1-score and over 84% topic F1-score.

Paper: Kiet Van Nguyen, Vu Duc Nguyen, Phu Xuan-Vinh Nguyen, Tham Thi-Hong Truong, Ngan Luu-Thuy Nguyen, UIT-VSFC: Vietnamese Students' Feedback Corpus for Sentiment Analysis, 2018 10th International Conference on Knowledge and Systems Engineering (KSE 2018), November 1-3, 2018, Ho Chi Minh City, Vietnam. Link.

Please download this dataset/corpus here.

UIT-SPC (version 1.0)

Abstract: This is a corpus of scientific papers (namely UIT-SPC) that we collected and processed. The UIT-SPC corpus contains 1,565 papers from top NLP/CL conferences such as ACL (2014, 2015, and 2016), CoNLL 2015, EACL 2014, NAACL 2015, and EMNLP 2015. First, the papers are pre-processed by removing unnecessary information (e.g., formulas and tables). Then, we format them as .xml files that include the paper title, sections, and sub-sections according to each paper's structure.

Paper: Dang Van Thin, Nguyen Van Kiet, Nguyen Luu-Thuy Ngan. Ứng dụng hỗ trợ tra cứu cụm từ trong bài báo khoa học tiếng Anh [An application supporting phrase lookup in English scientific papers], Proceedings of The 10th National Conference on Fundamental and Applied IT Research – FAIR’10, Da Nang, 17-18/8/2017.

Please contact us via email: thindv@uit.edu.vn (Mr. Thin Dang) to sign the corpus user agreement and then receive the corpus.

UIT-VSMEC (version 1.0) - Vietnamese Social Media Emotion Corpus

Emotion recognition is a finer-grained form, or special case, of sentiment analysis. In this task, the result is expressed neither as a polarity (positive or negative) nor as a rating (from 1 to 5), but at a more detailed level, with emotions such as sadness, enjoyment, anger, disgust, fear, and surprise. Emotion recognition plays a critical role in measuring the brand value of a product by recognizing specific emotions in customers’ comments. In this study, we achieved two targets. First, we built a standard Vietnamese Social Media Emotion Corpus (UIT-VSMEC) with 6,927 human-annotated sentences with six emotion labels, contributing to emotion recognition research in Vietnamese, which is a low-resource language in Natural Language Processing (NLP). Second, we evaluated machine learning and deep neural network models on UIT-VSMEC. As a result, the Convolutional Neural Network (CNN) model achieved the highest performance, with an F1-score of 57.61%.

Paper: Vong Ho, Duong Nguyen, Danh Nguyen, Linh Pham, Kiet Nguyen and Ngan Nguyen, Emotion Recognition for Vietnamese Social Media Text, 2019 16th International Conference of the Pacific Association for Computational Linguistics (PACLING 2019), October 11-13, 2019, Ha Noi, Vietnam. Link.

Please download this dataset/corpus here.

UIT-ABSA (version 1.0) 

Aspect-based sentiment analysis (ABSA) is an important task in sentiment analysis (also known as opinion mining) that provides valuable information to providers and customers from user comments. It aims to identify the aspects of entities mentioned in a review and the sentiment polarity corresponding to those aspects for a certain domain. ABSA can be divided into three main sub-tasks (Pontiki et al., 2016): Aspect Category, Opinion Target Expression, and Sentiment Polarity. In this work, we introduce a benchmark corpus for aspect detection and aspect polarity tasks in Vietnamese at the sentence level. Our corpus consists of 9,737 sentences and is divided into three datasets in a 7/1/2 ratio.
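To make the 7/1/2 ratio concrete, here is a minimal sketch of one way to partition a list of sentences into three parts with those proportions. The function name, the fixed seed, and the use of random shuffling are assumptions for illustration; this page does not say how the actual split was drawn or which part serves which purpose.

```python
import random


def split_7_1_2(items, seed=0):
    # Shuffle a copy so the caller's list is left untouched.
    # Assumption: a plain random shuffle; the real split may be
    # stratified or otherwise constructed differently.
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)

    n = len(shuffled)
    n_first = n * 7 // 10   # roughly 70% of the items
    n_second = n // 10      # roughly 10% of the items
    first = shuffled[:n_first]
    second = shuffled[n_first:n_first + n_second]
    third = shuffled[n_first + n_second:]  # the remainder, roughly 20%
    return first, second, third


# With 9,737 items this yields parts of 6,815, 973, and 1,949 items.
parts = split_7_1_2(range(9737))
print([len(p) for p in parts])
```

Integer division means the last part absorbs any rounding remainder, so the three sizes always sum to the corpus size.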

Paper: Updating....

Please contact us via email: thindv@uit.edu.vn (Mr. Thin Dang) to sign the corpus user agreement and then receive the corpus.

UIT-ViIC (version 1.0) - Vietnamese Image Captioning Dataset

Automatic generation of image captions has attracted attention in recent years from researchers in various fields of computer science, such as computer vision, natural language processing, and machine learning. This paper contributes to the image captioning problem by extending image captioning datasets to a different language. In particular, we concentrate on generating Vietnamese captions for images, as no image captioning dataset exists for Vietnamese. We propose a dataset called UIT-ViIC, which was annotated manually in Vietnamese with images from the MS-COCO dataset. In addition, we built a web-based annotation tool to improve annotators' performance. UIT-ViIC in this scope consists of 19,250 captions for 3,850 images of sports played with balls. UIT-ViIC is then evaluated on existing image captioning deep neural network models. Our dataset in this scope will be published on our lab website for research purposes.

Paper: Quan Hoang Lam, Quang Duy Le, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen. UIT-ViIC: A Dataset for the First Evaluation on Vietnamese Image Captioning. Link.

Please download the corpus: here.

UIT-ViNames (version 1.0) - Vietnamese Name Dataset

Abstract: As biological gender is one aspect of how an individual is presented, much work has been done on gender classification based on people's names. Many approaches have been proposed for English and Chinese; still, little work has been done for Vietnamese so far. We propose a new dataset for gender prediction based on Vietnamese names. This dataset comprises over 26,000 full names annotated with genders. This dataset is available on our website for research purposes. In addition, this paper describes six machine learning algorithms (Support Vector Machine, Multinomial Naive Bayes, Bernoulli Naive Bayes, Decision Tree, Random Forest, and Logistic Regression) and a deep learning model (LSTM) with fastText word embeddings for gender prediction on Vietnamese names. We also investigate the impact of each name component on detecting gender. As a result, the best F1-score that we have achieved is up to 96%, with the LSTM model, and we provide a web API based on our trained model.

Paper: Huy Quoc To, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen, and Anh Gia-Tuan Nguyen. Gender Prediction Based on Vietnamese Names with Machine Learning Techniques. Link.

A trial API for the Vietnamese Name Dataset is available here: Link.

Please contact us via email: huytq@uit.edu.vn (Mr. Huy To) to sign the corpus user agreement and then receive the corpus.


UIT-ViOCD: Vietnamese Open-domain Complaint Detection Dataset

Customer product reviews play a role in improving the quality of products and services for organizations or brands. Complaining is an attitude that expresses dissatisfaction with an event or a product not meeting customer expectations. In this paper, we build a Vietnamese dataset (UIT-ViOCD) of 5,485 human-annotated product reviews in four categories, collected from e-commerce sites. After the data collection phase, we proceed to the annotation task and achieve Am = 87% by Fleiss' Kappa. Then, we present an extensive methodology for research purposes and achieve an F1-score of 92.16% for identifying complaints. Building on these results, we want to develop a system for open-domain complaint detection on e-commerce websites in the future.
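For readers unfamiliar with the agreement measure mentioned above, the following is a minimal sketch of the standard Fleiss' Kappa formula. The function name and the input layout (one row per item, one count per category) are assumptions for illustration; the number of annotators and the exact annotation setup used for UIT-ViOCD are not given on this page.

```python
def fleiss_kappa(ratings):
    # ratings[i][j] = number of annotators who put item i in category j.
    # Assumption: every item was rated by the same number of annotators
    # (at least two).
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_categories = len(ratings[0])

    # Share of all assignments that went to each category.
    p_category = [
        sum(row[j] for row in ratings) / (n_items * n_raters)
        for j in range(n_categories)
    ]

    # Observed agreement per item: fraction of annotator pairs that agree.
    p_item = [
        (sum(count * count for count in row) - n_raters)
        / (n_raters * (n_raters - 1))
        for row in ratings
    ]

    p_observed = sum(p_item) / n_items
    p_expected = sum(p * p for p in p_category)

    # Undefined (division by zero) when every assignment falls in a
    # single category, because expected agreement is then 1.
    return (p_observed - p_expected) / (1 - p_expected)
```

A value of 1 means perfect agreement, while values at or below 0 mean agreement no better than chance.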

Paper: Nhung Thi-Hong Nguyen, Phuong Ha-Dieu Phan, Luan Thanh Nguyen, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen. Vietnamese Open-domain Complaint Detection in E-Commerce Websites. Link.

Please contact us via email: 18521218@gm.uit.edu.vn (Ms. Nhung) to sign the corpus user agreement and then receive the corpus.

UIT-ViHSD - Vietnamese Hate Speech Detection Dataset

Abstract: In recent years, Vietnam has witnessed a massive growth in social network users on platforms such as Facebook, YouTube, Instagram, and TikTok. On social media, hate speech has become a critical problem for social network users. To solve this problem, we introduce ViHSD, a human-annotated dataset for automatically detecting hate speech on social networks. This dataset contains over 30,000 comments, and each comment has one of three labels: CLEAN, OFFENSIVE, or HATE. Besides, we introduce the data creation process for annotating and evaluating the quality of the dataset. Finally, we evaluate the dataset with deep learning models and transformer models.

Please contact us via email: sonlt@uit.edu.vn (Mr. Son Luu) to sign the corpus user agreement and then receive the corpus.