KGPT: Knowledge-Grounded Data-to-Text Pretraining
CODE DATA
AACL Tutorial on Self-Supervised Learning for NLP with Xin Wang
SLIDES VIDEO
ProQA: Progressively Pretrained Dense Corpus Index for Open-Domain Question Answering
CODE
SSCR: Iterative Language-Based Image Editing via Self-Supervised Counterfactual Reasoning
CODE
Logic2Text: High-Fidelity Natural Language Generation from Logical Forms
CODE DATA
HybridQA: A Dataset of Multi-Hop Question Answering over Tabular and Textual Data
WEBSITE
Counterfactual Vision-and-Language Navigation via Adversarial Path Sampler
MODEL
Logical Natural Language Generation from Open-Domain Tables
CODE DATA
Few-Shot NLG with Pre-Trained Language Model
CODE
TabFact: A Large-scale Dataset for Table-based Fact Verification
WEBSITE
REVERIE: Remote Embodied Visual Referring Expression in Real Indoor Environments
DATA
Generative Adversarial Zero-Shot Relational Learning for Knowledge Graphs
CODE DATA
Fakeddit: A New Multimodal Benchmark Dataset for Fine-grained Fake News Detection
DATA
DOLORES: Deep Contextualized Knowledge Graph Embeddings
CODE
A Benchmark Dataset for Learning to Intervene in Online Hate Speech
DATA
VATEX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research
WEBSITE
Knowledge-Aware Reader
PyTorch implementation of the ACL 2019 paper "Improving Question Answering over Incomplete KBs with Knowledge-Aware Reader".
CODE
Self-Supervised Extractive Summarization (ACL 2019)
Code and Data for ACL 2019 "Self-Supervised Learning for Contextualized Extractive Summarization".
CODE
Hierarchically Disentangled Self Attention
Code and Data for ACL 2019 "Semantically Conditioned Dialog Response Generation via Hierarchical Disentangled Self-Attention".
CODE
Lifelong Relation Extraction
Code for our NAACL 2019 paper: Sentence Embedding Alignment for Lifelong Relation Extraction.
CODE
Riemannian Normalizing Flow for Variational Wasserstein Autoencoder
PyTorch implementation for our NAACL 2019 paper "Riemannian Normalizing Flow on Variational Wasserstein Autoencoder for Text Modeling".
CODE
Variational Vocabulary Reduction
Code for NAACL19 Paper "How Large a Vocabulary Does Text Classification Need? A Variational Approach to Vocabulary Selection".
CODE
Extremely Fine-Grained Entity Typing
PyTorch implementation of our paper "Imposing Label-Relational Inductive Bias for Extremely Fine-Grained Entity Typing" (NAACL19).
CODE
Deep Adversarial Learning for NLP
I gave a tutorial on Deep Adversarial Learning for NLP at the NAACL 2019 conference with Sameer Singh (UCI). Slides are available here.
XL-NBT: A Cross-lingual Neural Belief Tracking Framework
Arxiv preprint:
PDF CODE
One-Shot Relational Learning for Knowledge Graphs
Arxiv preprint:
PDF CODE
WikiHow: A Large Scale Text Summarization Dataset
Arxiv preprint:
PDF DATA
CIPS Summer School Slides
PART 1: Recent Advances in Distant Supervision IE
PDF
PART 2: Recent Advances in Knowledge Graph Embeddings
PDF
PART 3: Recent Advances in Knowledge Graph Reasoning
PDF
MOJITALK: Generating Emotional Responses at Scale
Xianda Zhou and William Yang Wang, "MOJITALK: Generating Emotional Responses at Scale", to appear in
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL 2018), full paper, Melbourne, Australia, July 15-20, 2018, ACL. Preprint arxiv
PDF BIB CODE and DATA
ACL 2018 Tutorial on Deep Reinforcement Learning for NLP
William Wang, Jiwei Li, and Xiaodong He.
PDF
Scheduled Policy Optimization
Wenhan Xiong, Xiaoxiao Guo, Mo Yu, Shiyu Chang, Bowen Zhou, and William Yang Wang, "Scheduled Policy Optimization for Natural Language Communication with Intelligent Agents", to appear in
Proceedings of the 27th International Joint Conference on Artificial Intelligence and the 23rd European Conference on Artificial Intelligence (IJCAI-ECAI 2018), full oral paper, Stockholm, Sweden, July 13-19, 2018, IJCAI. Preprint arxiv
PDF BIB CODE
Deep Reinforcement Learning for Chinese Zero Pronoun Resolution
Qingyu Yin, Yu Zhang, Wei-Nan Zhang, Ting Liu, and William Yang Wang, "Deep Reinforcement Learning for Chinese Zero Pronoun Resolution", to appear in
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL 2018), full paper, Melbourne, Australia, July 15-20, 2018, ACL. Preprint arxiv
PDF BIB CODE
NAACL 2018 Tutorial on Knowledge Construction and Reasoning
Part 1: Xiang Ren (USC)
Part 2: Nanyun Peng (USC)
Part 3: William Wang (UCSB)
PDF
Simple Models for Word Formation in Slang
Vivek Kulkarni and William Yang Wang, "Simple Models for Word Formation in Slang", to appear in
Proceedings of The 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT 2018), long paper, New Orleans, LA, USA, June 1 - June 6, 2018, ACL.
PDF BIB CODE
KBGAN: Adversarial Learning for Knowledge Graph Embeddings
Liwei Cai and William Yang Wang, "KBGAN: Adversarial Learning for Knowledge Graph Embeddings", to appear in
Proceedings of The 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT 2018), long oral paper, New Orleans, LA, USA, June 1 - June 6, 2018, ACL. Preprint arxiv
PDF CODE
CHARADES-Caption Dataset for Video Captioning
Xin Wang, Wenhu Chen, Jiawei Wu, Yuan-Fang Wang, and William Yang Wang, "Video Captioning via Hierarchical Reinforcement Learning", preprint arxiv
PDF DATA
DeepPath: Reinforcement Learning for Knowledge Graph Reasoning
See Wenhan Xiong's code and his prepared NELL-995 dataset from the paper "DeepPath: A Reinforcement Learning Method for Knowledge Graph Reasoning".
PDF | CODE | NELL-995 DATASET
Learning to Generate Explanations
Ke Ni and William Yang Wang, "Learning to Explain Non-Standard English Words and Phrases", to appear in
Proceedings of the 8th International Joint Conference on Natural Language Processing (IJCNLP 2017), short paper, Taipei, Taiwan, Nov. 27-Dec. 1, AFNLP.
PDF BIB DATA
Deep Residual Learning for Weakly-Supervised Relation Extraction
See Darren Huang's code and his EMNLP 2017 paper.
PDF BIB CODE
Liar: a benchmark dataset for fake news detection
William Yang Wang, ""Liar, Liar Pants on Fire": A New Benchmark Dataset for Fake News Detection", to appear in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017), short paper, Vancouver, BC, Canada, July 30-August 4, ACL.
DATA PDF
How to Do Research?
I gave a short talk on how to do research with my undergraduate students.
NAACL 2016 Tutorial on Statistical Relational Learning for NLP
Part 1: Overview of logic, probability, MLNs, and probabilistic DDBs
Part 2: ProPPR and applications
Part 3: TensorLog and other recent and current work
Annotated Annoying Behaviors from Twitter
William Yang Wang and Diyi Yang, "That's So Annoying!!!: A Lexical and Frame-Semantic Embedding Based Data Augmentation Approach to Automatic Categorization of Annoying Behaviors using #petpeeve Tweets", in
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP 2015), short paper, Lisbon, Portugal, Sept. 17-21, ACL.
PDF BIB DATA
Information Extraction Tutorial at Peking University
CIPS Summer School IE Course Homepage Slides:
PPTX PDF July 25, 2015
Three Wikipedia Datasets for Joint IE and Reasoning
William Yang Wang and William W. Cohen, "Joint Information Extraction and Reasoning: A Scalable Statistical Relational Learning Approach", to appear in
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and The 7th International Joint Conference of the Asian Federation of Natural Language Processing (ACL-IJCNLP 2015), long paper for oral presentation, Beijing, China, July 26-31, ACL.
PDF BIB DATA
ProPPR: a scalable probabilistic first-order logic
William Yang Wang, Kathryn Mazaitis, Ni Lao, and William W. Cohen, "Efficient Inference and Learning in a Large Knowledge Base: Reasoning with Extracted Information using a Locally Groundable First-Order Probabilistic Logic", to appear in
Machine Learning Journal (MLJ 2015), Springer. Preprint version:
PDF BIB CODE
A large European family dataset for relational learning
William Yang Wang, Kathryn Mazaitis, and William W. Cohen, "A Soft Version of Predicate Invention Based on Structured Sparsity", in
Proceedings of the 24th International Joint Conference on Artificial Intelligence (IJCAI 2015), full paper for oral presentation, Buenos Aires, Argentina, July 25-31, IJCAI.
Preprint version:
PDF BIB DATA
The meme descriptions dataset
William Yang Wang and Miaomiao Wen, "I Can Has Cheezburger? A Nonparanormal Approach to Combining Textual and Visual Information for Predicting and Generating Popular Meme Descriptions", to appear in
the 2015 Conference of the North American Chapter of the Association for Computational Linguistics – Human Language Technologies (NAACL HLT 2015), long paper, Denver, CO, USA, May 31-June 5, ACL. Preprint version:
PDF BIB DATA
The earnings calls dataset
William Yang Wang and Zhenhao Hua, "A Semiparametric Gaussian Copula Regression Model for Predicting Financial Risks from Earnings Calls", in
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL 2014), long paper, Baltimore, MD, June 22-27, ACL.
Preprint version:
PDF BIB DATA
The Yelp computational branding analytics (CBA) data
William Yang Wang, Ed Lin, and John Kominek,
"This Text has the Scent of Starbucks: A Laplacian Structured Sparsity Model for Computational Branding Analytics", in
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP 2013),
full paper, Seattle, WA, USA, Oct. 18-21, ACL.
PDF BIB DATA
The Columbia Summarization Corpus (CSC)
William Yang Wang, Kapil Thadani, and Kathleen R. McKeown,
"Identifying Event Descriptions using Co-training with Online News Summaries",
in Proceedings of the 5th International Joint Conference on Natural Language Processing (IJCNLP 2011),
Chiang Mai, Thailand, Nov. 8-13, ACL-AFNLP.
PDF BIB
The Columbia Summarization Corpus (CSC) was retrieved from the output of the Newsblaster online news summarization system, which crawls the Web for news articles, clusters them by topic, and produces a multidocument summary for each cluster.
We collected a total of 166,435 summaries containing 2.5 million sentences and covering 2,129 days in the 2003-2011 period. Additional references for the Columbia Newsblaster summarizer can be found on the Columbia NLP group's publication page.
The CSC corpus can be used in, but is not limited to, the following areas:
* Event Mining
* Language generation
* Summarization
* Information retrieval
* Information extraction
* Sentiment analysis and opinion mining
* Question answering
* Text mining and natural language processing applications
* Language modeling for text processing
* Lexicon and ontology development
* Machine learning (supervised, semi-supervised, and unsupervised learning)
Click here to download the CSC corpus.