First published by: AINLPer WeChat Official Account (follow for shared resources)
Editor: ShuYini
Proofreader: ShuYini
Date: 2023-04-24
Introduction
Named Entity Recognition (NER) is an important research direction in Natural Language Processing (NLP). Its goal is to identify named entities in text and classify them into the corresponding entity types. NER work depends on datasets, so after a week of effort I have compiled all the datasets I could access and am sharing them here. Calling this the most comprehensive collection online might be an exaggeration, but I have done my best. While organizing it, I drew heavily on the article "Liu Cong NLP: Chinese NER Dataset Compilation". All dataset download links are placed at the end; feel free to use them as needed.
Entertainment NER — Youku
The Entertainment NER dataset is built mainly from Youku video titles. It covers 3 major categories (entertainment star names, film and TV names, music names) and 9 entity subcategories (e.g., anime, movies, TV shows, variety shows). It contains 8,001 training examples, 1,000 validation examples, and 1,001 test examples, and is jointly provided by Alibaba DAMO Academy and the Singapore University of Technology and Design. The GitHub repository was last updated in 2022.
Related paper: https://aclanthology.org/N19-1079.pdf
GitHub: https://github.com/allanj/ner_incomplete_annotation
E-commerce NER — Taobao
The E-commerce NER dataset is built mainly from Taobao e-commerce data. It covers 4 major categories (product name, product model, person name, location) and 9 entity subcategories (computer, automobile, daily necessities, etc.). It contains 6,000 training samples, 998 validation samples, and 1,000 test samples, and is jointly provided by Alibaba DAMO Academy and the Singapore University of Technology and Design. The GitHub repository was last updated in 2022.
Related paper: https://aclanthology.org/N19-1079.pdf
GitHub: https://github.com/allanj/ner_incomplete_annotation
Resume NER — Sina Finance
This dataset is built from Sina Finance and consists of resumes of executives at companies listed on the Chinese stock market. 1,027 resume summaries were randomly selected and manually annotated with 8 named entity types using the YEDDA system: Nationality (CONT), Education Background (EDU), Location (LOC), Name (NAME), Organization (ORG), Profession (PRO), Ethnicity (RACE), and Title (TITLE). The dataset includes 3,821 training samples, 463 validation samples, and 477 test samples. The text is relatively well-formed, and entity recognition models usually achieve an F1 score above 90%.
Related paper: https://arxiv.org/pdf/1805.02023.pdf
GitHub: https://github.com/jiesutd/LatticeLSTM
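NER F1 scores like the one mentioned for this dataset are normally computed at the entity level: a predicted entity counts as correct only when both its span and its type match a gold entity. Below is a minimal sketch of that calculation. The `(start, end, type)` tuple representation, the function name `entity_f1`, and the sample spans are assumptions made for illustration; they are not taken from the LatticeLSTM code.

```python
from typing import Iterable, Tuple

Entity = Tuple[int, int, str]  # (start, end, type); this representation is an assumption


def entity_f1(gold: Iterable[Entity], pred: Iterable[Entity]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) using exact span-and-type matching."""
    gold_set, pred_set = set(gold), set(pred)
    correct = len(gold_set & pred_set)
    precision = correct / len(pred_set) if pred_set else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical gold and predicted entities for one resume sentence.
gold = [(0, 3, "NAME"), (5, 9, "ORG"), (10, 13, "TITLE")]
pred = [(0, 3, "NAME"), (5, 9, "ORG"), (10, 12, "TITLE")]  # last span boundary is wrong
precision, recall, f1 = entity_f1(gold, pred)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

In this toy case two of the three predictions match exactly, so precision, recall, and F1 all come out at about 0.667. A partially overlapping span earns no credit under exact matching.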
Weibo - NER
This dataset is a Weibo corpus annotated for NER. Compared with MSRA-NER, its annotation is finer-grained and mainly covers person names (specific and generic), addresses (specific and generic), administrative regions, and organizations (specific and generic). The corpus consists of 1,890 messages sampled from Weibo between November 2013 and December 2014 and then annotated (1,350 training samples, 270 development samples, 270 test samples), so it is smaller than MSRA-NER. The GitHub repository was last updated in 2018.
Related paper: Named Entity Recognition for Chinese Social Media with Jointly Trained Embeddings
GitHub: https://github.com/hltcoe/golden-horse
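As a quick sanity check on the Weibo split sizes quoted above, the short sketch below confirms that the three splits add up to the 1,890 sampled messages and prints each split's share. The dictionary name `splits` is only illustrative.

```python
# Split sizes for the Weibo NER corpus as stated in the text.
splits = {"train": 1350, "dev": 270, "test": 270}
total = sum(splits.values())
assert total == 1890  # matches the number of sampled messages

for name, count in splits.items():
    # Show each split's count and its share of the whole corpus.
    print(f"{name:<5} {count:>5} {count / total:.1%}")
```

The training set is five times the size of the development or test set, which makes the split roughly 5:1:1.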
People’s Daily (1998/2014) - NER
This NER dataset is built from the 1998 and 2014 editions of the People’s Daily corpus. It contains three common entity types: Person Name (PER), Location (LOC), and Organization (ORG). The 1998 corpus includes over 20,000 training samples, over 2,300 development samples, and over 4,600 test samples. The GitHub repository was last updated in 2018. I have not found a paper associated with this dataset; if you know of one, please message me.
GitHub: https://github.com/zjy-ucas/ChineseNER
MSRA-NER
This Chinese NER dataset is provided by Microsoft Research Asia (MSRA). It annotates locations, organizations, and person names using BIO tagging. The training set has 45,000 sentences, 36,000+ location entities, 20,000+ organization entities, and 17,000+ person names; the test set is roughly one-tenth the size, with 3,400+ sentences, 2,800+ locations, 1,300+ organizations, and 1,900+ person names. The GitHub repository was last updated in 2018.
Related paper: The Third International Chinese Language Processing Bakeoff: Word Segmentation and Named Entity Recognition
GitHub: https://github.com/bytetopia/nlp_datasets
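Under BIO tagging, each character carries a tag: `B-` marks the first character of an entity, `I-` marks a character that continues it, and `O` marks a non-entity character. The sketch below shows one way to turn such a tag sequence back into entity spans. The tag names (`B-ORG`, `I-LOC`, and so on), the helper name `bio_to_spans`, and the sample sentence are assumptions for illustration; check the downloaded files for the exact tag spelling.

```python
from typing import List, Tuple


def bio_to_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Convert a BIO tag sequence into (start, end_exclusive, type) spans."""
    spans = []
    start, current = None, None
    for i, tag in enumerate(tags + ["O"]):  # trailing sentinel flushes the last entity
        inside = tag.startswith("I-") and current == tag[2:]
        if not inside:
            if current is not None:
                spans.append((start, i, current))
                start, current = None, None
            if tag.startswith("B-"):
                start, current = i, tag[2:]
            # An "I-" tag with no matching open entity is ignored in this sketch.
    return spans


# Hypothetical character-level example, not taken from the dataset.
chars = list("微软亚洲研究院在北京")
tags = ["B-ORG"] + ["I-ORG"] * 6 + ["O", "B-LOC", "I-LOC"]
for start, end, label in bio_to_spans(tags):
    print("".join(chars[start:end]), label)
```

Ignoring stray `I-` tags is one design choice among several; some evaluation scripts instead treat a stray `I-` as the start of a new entity.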
Boson-NER
Boson provides an NER dataset encoded in UTF-8 that consists of 2,000 paragraphs, with one annotated paragraph per line. It includes six entity types: Time, Location, Person, Organization, Company, and Product. The dataset is usually attributed to the website https://bosonnlp.com/, which currently appears to be inaccessible, but the dataset can still be found online.
CLUENER Fine-grained - NER
This dataset is a fine-grained NER annotation of a subset of THUCTC, Tsinghua University’s open-source text classification dataset. It contains 10,748 training samples and 1,343 validation samples, with 10 label categories: address, book, company, game, government, movie, name, organization, position, and scene.
Related paper: https://arxiv.org/abs/2001.04351
GitHub: https://github.com/CLUEbenchmark/CLUENER2020
Electronic Medical Records - NER
These datasets are released by the China Conference on Knowledge Graph and Semantic Computing (CCKS), which held four clinical named entity recognition (CNER) competitions on electronic medical records from 2017 to 2020. The task is to identify and extract clinically relevant entities from plain-text electronic medical record documents and classify them into predefined categories such as symptoms, drugs, and surgeries. The datasets are CCKS2017-NER, CCKS2018-NER, CCKS2019-NER, and CCKS2020-NER.
CCKS2017-NER: 2,229 samples, 5 categories: symp, dise, chec, body, and cure.
CCKS2018-NER: 797 samples, 5 categories: symptoms & signs, checks & tests, treatments, diseases & diagnoses, body parts.
CCKS2019-NER: 1,379 samples, 6 categories: anatomical parts, surgeries, diseases & diagnoses, drugs, lab tests, imaging examinations.
CCKS2020-NER: 1,887 samples.
2017: https://www.biendata.xyz/competition/CCKS2017_2/
2018: https://www.biendata.xyz/competition/CCKS2018_1/
2019: http://openkg.cn/dataset/yidu-s4k
2020: https://www.biendata.xyz/competition/ccks_2020_8/
GitHub: https://github.com/hy-struggle/ccks_ner
Military Equipment Testing and Evaluation - NER
This dataset comes from the military equipment testing and evaluation NER challenge organized by the Military Academy of System Engineering at CCKS 2020. The training and test sets each have 400 samples, with an average length of 150 and a maximum length of 358. Entity types fall into four main categories: test elements (e.g., RS-24 ballistic missile, SPY-1D phased array radar), performance indicators (e.g., measurement accuracy, circular error probable, failure distance), system components (e.g., mid-wave infrared seeker head, booster, fairing), and mission scenarios (e.g., French navy, missile warning, terrorist attack).
GitHub: https://github.com/hy-struggle/ccks_ner
Chinese Medical CMeEE-NER
The CMeEE dataset is part of the Chinese Medical Language Understanding Evaluation benchmark (CBLUE). Medical text entities are divided into nine categories: Disease (dis), Clinical Manifestation (sym), Drug (dru), Medical Equipment (equ), Medical Procedure (pro), Body (bod), Medical Test Item (ite), Microorganism (mic), and Department (dep). Articles are automatically segmented before annotation, and all medical entities are segmented correctly. CMeEE-V2 is a supplement to CMeEE.
Paper: https://arxiv.org/pdf/2106.08087.pdf
GitHub: https://github.com/CBLUEbenchmark/CBLUE
Chinese Literature - NER
This dataset is annotated on 726 Chinese literary articles. Seven entity types are defined: Object, Task, Address, Event, Measurement Unit, Organization, and Source.
Related paper: https://arxiv.org/pdf/1711.07010.pdf
GitHub: https://github.com/lancopku/Chinese-Literature-NER-RE-Dataset
Bank Loan 2021-NER
This dataset contains 10,000 samples with 4 entity types: BANK, COMMENTS_ADJ, COMMENTS_N, and PRODUCT.
https://www.heywhale.com/mw/dataset/617969ec768f3b0017862990/file
Task-oriented Dialogue 2018-NER
Released for the NLPCC 2018 competition, this dataset corresponds to Task 4 (task-oriented dialogue). It contains 21,352 samples and 15 entity types: language, origin, theme, custom_destination, style, phone_num, destination, contact_name, age, singer, song, instrument, toplist, scene, and emotion.
Link: http://tcci.ccf.org.cn/conference/2018/taskdata.php
CCIR2021-NER
The China Conference on Information Retrieval (CCIR), jointly organized by the Chinese Information Processing Society and the China Computer Federation, initiated the CCIR Cup technical evaluation competition. The CCIR2021 dataset aims to improve the robustness of Chinese NER algorithms. It contains 15,723 samples and 4 categories: LOC, GPE, ORG, and PER.
https://www.datafountain.cn/competitions/510
Ruijin MCC2018-NER
This dataset was released by Shanghai Ruijin Hospital and Alibaba Cloud during an AI competition. Its main task is to mine diabetes-related literature and construct a diabetes knowledge graph using diabetes-related textbooks and research papers. The dataset contains 3,498 samples and 18 categories: Level, Method, Disease, Drug, Frequency, Amount, Operation, Pathogenesis, Test_items, Anatomy, Symptom, Duration, Treatment, Test_Value, ADE, Class, Test, and Reason.
Traditional Chinese Medicine Application 2020-NER
This dataset was published for the 2020 Intelligent Traditional Chinese Medicine Application Innovation Challenge, hosted mainly by Alibaba and Vanke. The goal of the challenge was to select excellent AI and big data application solutions for Traditional Chinese Medicine. The dataset contains 1,255 samples and 13 categories: drug formulation, disease grouping, population, drug classification, traditional Chinese medicine efficacy, symptoms, diseases, drug components, drug properties and tastes, food grouping, food, syndromes, and drugs.
Product Title 2022-NER
The GAIIC2022 dataset comes from the Global Artificial Intelligence Innovation Competition 2022. Background: JD.com product titles contain a large amount of key product information, and product title NER is a core fundamental NLP task that can be reused in many downstream scenarios. Accurately extracting product-related entities from title text can improve user experience and platform efficiency in retrieval, recommendation, and other business scenarios. There are about 40,000 labeled training samples and 1 million unlabeled samples. The 52 entity types are anonymized and represented by numeric codes from 1 to 54 (excluding 27 and 45), and “O” denotes a non-entity character. The label “B” marks the beginning of an entity, and “I” marks the interior or end of an entity. The number after “-” indicates the entity type of the character.
https://www.heywhale.com/home/competition/620b34ed28270b0017b823ad/content/2
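The label scheme described above can be enumerated directly: 52 type codes, each with a B- and an I- tag, plus "O". The sketch below builds that label set and a label-to-index mapping of the kind a tagging model typically needs. The variable names and the index order are assumptions; the competition data itself may order labels differently.

```python
# Entity type codes run from 1 to 54, skipping 27 and 45 (per the description above).
type_ids = [i for i in range(1, 55) if i not in (27, 45)]
assert len(type_ids) == 52

# One B- and one I- tag per type, plus "O" for non-entity characters.
labels = ["O"] + [f"{prefix}-{i}" for i in type_ids for prefix in ("B", "I")]

# Index order is an arbitrary choice made for this sketch.
label2id = {label: idx for idx, label in enumerate(labels)}

print(len(labels))  # 105 = 52 * 2 + 1
```

A classifier head for this task would therefore need 105 output classes under plain BIO tagging.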
Diagnosis and Treatment Dialogue 2021-NER
Online consultation platforms are gradually emerging. Online consultation refers to doctors communicating with patients about their conditions, diagnosing diseases, and providing medical advice through dialogue. Medical dialogue understanding performs information extraction on consultation text and mainly includes two tasks: named entity recognition and symptom check identification. The NER task is to identify five important types of medical entities (Operation, Drug_Category, Medical_Examination, Symptom, and Drug) in medical dialogue text. The dataset contains over 2,000 dialogue pairs, totaling 98,452 samples.
Link: http://www.fudan-disc.com/sharedtask/imcs21/index.html
FNED Dataset - NER
The FNED dataset contains 13,000 sentences with event information, each sentence describing one event. The data is sourced from publicly available military news websites (such as Sina Military, Phoenix Military, and NetEase Military). The annotations include event mentions (trigger words, event types, and event elements), entity mentions (entities), and relation mentions (head entity, tail entity, and relation type). In total there are 8 event types, 7 entity types, and 8 relation types.
