Resources

Corpora: Purchased/Acquired

Access: NLP Lab Only (See the CSE departmental office for copies of license agreements)

  1. DUC Summarization evaluation, 2005-2007, National Institute of Standards and Technology (NIST)
  2. TAC Summarization evaluation, 2008-2011, National Institute of Standards and Technology (NIST)
  3. Wikipedia Corpus, 1.9 billion words of English, 2014, Mark Davies
  4. Penn Discourse Treebank (PDTB) 3.0, 53600 tokens, 2018, Bonnie Webber

Access: Linguistic Data Consortium Membership (LDC)

  • Co-subscriber units (contact name): CSE (Becky Passonneau, rjp49@psu.edu); IST (Ting-Hao Kenneth Huang, txh710@psu.edu); Center for Social Data Analytics: C-SoDA (Burt Monroe, burtmonroe@psu.edu); Center for Language Acquisition (Kevin McManus, kmcmanus@psu.edu)
  • Other units are welcome to co-subscribe with us (contact: rjp49@psu.edu).

LDC Corpora received under the current subscription (2017-present). Any PSU student or instructor can request access to these corpora, which PSU has already paid for.

All LDC Corpora PSU has rights to are listed here.

 

Corpus Name                               Requester                                         Path in the Server
Conll-formatted-ontonotes-5.0             IST (Sarah Rajtmajer)                             /home/nlp/corpora
Concretely_annotated_gigaword             SoDA (Burt Monroe)                                /home/nlp/corpora
RST_discourse_treebank                    CSE (Becky Passonneau)                            /home/nlp/corpora
The New York Times Annotated Corpus       SoDA (Burt Monroe); Applied Ling (Susan Strauss)  /home/nlp/corpora
ETS Corpus of Non-Native Written English  IST (Prasenjit Mitra)                             /home/nlp/corpora
TAC Relation Extraction Dataset           IST (Sarah Rajtmajer)

Annotation guidelines we create or help create

Access: NLP Lab only

  • Annotationguidelines.pdf

GitHub: Penn State NLP Group

Access: Open source for anyone; repositories marked Private are for NLP Lab only

  1. EasyCCG-Tree-Categorization (Private)
  2. RL-Reading-Group-Fall18, Reinforcement Learning Reading Group (Fall ’18 to Spring ’19)
  3. Docx2Txt (Private), an REU project on an NSF CyberLearning project: EAGER: Collab: Automated Instruction Assistant.
  4. SEAView, a tool for annotating content in two-part essays, which contain a summary and an argument.
  5. DucView, a tool for creating and using pyramids, a method for summary content annotation.