The project counts for 50% of the course grade. Each group should have 3 or 4 students. Project groups can differ from the lab exercise and paper presentation groups.
You will find a list of projects below. They are mostly links to datasets (sometimes with a scientific paper). You can choose to do something that is not in the list. All projects must be discussed with me by email. The goal is to build an ML algorithm for an NLP task, to explore the performance and limits of the network, and to analyse and discuss the results.
- Choose a project before December 9, 2019. To choose a project, send me one email per group with all group members among the recipients. Projects will be assigned on a first-come, first-served basis. We can discuss the project by email beforehand.
- Send me a short interim report before January 23, 2020, in which you describe your project, your plan, what kind of neural architecture you want to build, what limits you may face, etc. The report should be 3-5 pages.
- January 29, 2020: Project defense. You don’t need to have finished the project for this defense. You need to prepare a 10-15 minute presentation of your project, your plan, etc. You can use the blackboard or slides.
- End of February: You have to send me a 10-page report and your code!
- https://www.clips.uantwerpen.be/conll2000/chunking/ for Chunking
- https://github.com/UniversalDependencies/UD_English for Dependency Parsing
- https://www.kaggle.com/c/quora-question-pairs, Quora Question Pairs
- https://nlp.stanford.edu/sentiment/treebank.html, Sentence-Level Sentiment Analysis
- http://ai.stanford.edu/~amaas/data/sentiment/, Document-Level Sentiment Analysis
- https://nlp.stanford.edu/projects/snli/, Textual Entailment
- https://www.yelp.com/dataset/challenge, Yelp Reviews
- https://www.salesforce.com/products/einstein/ai-research/the-wikitext-dependency-language-modeling-dataset/ for WikiText Language Modeling
- https://github.com/FakeNewsChallenge/fnc-1, Fake News Challenge
- https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge, Toxic Comment Classification
- https://arxiv.org/abs/1704.04683, the RACE corpus (ask me for data)
- Automated essay scoring: https://www.kaggle.com/c/asap-aes
- Character prediction for contemporary English: article http://cs224d.stanford.edu/reports/mwlow.pdf
- Stack Overflow questions and tags: https://github.com/dgrtwo/StackLite + article (http://cs224d.stanford.edu/reports/lefrankl.pdf)
- MPQA Opinion Corpus: http://mpqa.cs.pitt.edu/ (Opinion Tagging)
- DailyMail/CNN Reading Comprehension task: https://github.com/danqi/rc-cnn-dailymail
- Blog authorship corpus: http://u.cs.biu.ac.il/~koppel/BlogCorpus.htm, which can be used for gender detection, for example (http://aclweb.org/anthology/D/D10/D10-1021.pdf)
- Beer ratings: article http://www.aclweb.org/anthology/D16-1011 + data http://jmcauley.ucsd.edu/cse255/data/beer/