research text

Henry Dowd
2025-11-29 14:54:32 +00:00
parent fb68bc869a
commit 02cdc7bac6
7 changed files with 447 additions and 210 deletions


@@ -0,0 +1,109 @@
--- Datasets/Corpus ---
-- Microsoft Research Paraphrase Corpus --
|- Primary Usage -
Paraphrase Identification
|- Content -
~6,000 sentence pairs from online news sources, labeled (1,0)
Relatively small and limited to news articles
|- Best For -
Initial development, benchmarking
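A minimal sketch of what "benchmarking" on binary-labeled pairs could look like, assuming a simple token-overlap (Jaccard) baseline. The example pairs, the 0.5 threshold and all function names are hypothetical and only illustrate the (sentence, sentence, label) shape described above; they are not taken from the corpus.

```python
import re

def tokens(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Token-set overlap between two sentences, in [0, 1]."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0  # two empty sentences are treated as identical
    return len(ta & tb) / len(ta | tb)

def predict(a, b, threshold=0.5):
    """Return 1 (paraphrase) if overlap reaches the assumed threshold, else 0."""
    return 1 if jaccard(a, b) >= threshold else 0

def accuracy_and_f1(pairs, threshold=0.5):
    """Score the baseline against (sentence_a, sentence_b, label) triples."""
    tp = fp = fn = tn = 0
    for a, b, label in pairs:
        pred = predict(a, b, threshold)
        if pred == 1 and label == 1:
            tp += 1
        elif pred == 1 and label == 0:
            fp += 1
        elif pred == 0 and label == 1:
            fn += 1
        else:
            tn += 1
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, f1

# Made-up pairs for illustration only (not real corpus data).
PAIRS = [
    ("The cat sat on the mat.", "The cat sat on the mat today.", 1),
    ("Stocks rose sharply on Monday.", "The weather was cold in Paris.", 0),
]

if __name__ == "__main__":
    acc, f1 = accuracy_and_f1(PAIRS)
    print(f"accuracy={acc:.2f} f1={f1:.2f}")
```

A baseline like this is only a reference point: any real model should be compared against it on the same labeled pairs.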
-- PAN Plagiarism Detection Corpus --
|- Primary Usage -
Plagiarism Detection Research
|- Content -
Different years of competitions with various text types
PAN-PC-10/11 External plagiarism detection
PAN-SS-13 Single source plagiarism with complexity levels
Academic texts, web content, student essays
|- Best For -
Advanced evaluation on realistic plagiarism
-- Quora Question Pairs --
|- Primary Usage -
Duplicate Questions - (Paraphrased, likely unintentional)
|- Content -
400,000+ question pairs from Quora
Labeled duplicate or not duplicate
Focuses on questions, not general statements
|- Best For -
Training data-intensive models
-- SemEval Paraphrase Datasets --
|- Primary Usage -
Paraphrase and semantic similarity
|- Content -
Datasets from SemEval competitions
SemEval-2012 Task 6: Semantic textual similarity
SemEval-2015 Task 1: Paraphrase & Semantic similarity
SemEval-2017 Task 1: Semantic similarity
News, headlines, image captions
Well annotated, multiple languages
Fragmented across "Tasks"
|- Best For -
Multi-domain evaluation (different criteria)
-- P4P (Paraphrase for Plagiarism) Corpus --
|- Primary Usage -
Plagiarism detection with paraphrasing
|- Content -
Academic texts with paraphrased plagiarism
Source-plagiarism mappings, paraphrase types
Academic writing
Limited availability + academic focus
Access available on request only
|- Best For -
Paraphrase-specific plagiarism research
-- ParaBank 2.0 --
|- Primary Usage -
Paraphrase generation and evaluation
|- Content -
Large-scale paraphrase pairs generated from parallel text
Multiple paraphrase candidates per sentence
Machine generated, may contain noise
|- Best For -
Large-scale training and data augmentation
-- Twitter Paraphrase Corpus --
|- Primary Usage -
Short text paraphrase detection
|- Content -
Tweet pairs annotated for paraphrase relationship
Paraphrase scores and binary labels
~20,000 pairs
Informal language, real-world usage
Short text, informal grammar (difficult to parse)
|- Best For -
Informal language and social media applications
-- UW-Stanford Paraphrase Corpus --
|- Primary Usage -
Paraphrase detection
|- Content -
Sentence pairs from news & web text
Paraphrase judgments
~3,000 pairs (Very small dataset)
High quality human judgments (good test set)
|- Best For -
High-precision evaluation, testing
-- Sheffield Plagiarism Corpus --
|- Primary Usage -
Academic plagiarism detection research
|- Content -
Original academic texts and publications
Modified documents with various types of plagiarism
Detailed markup of plagiarised sections with source mappings
Plagiarism types
Verbatim copying, Paraphrasing, Structural plagiarism
Academic writing and student essays
Realistic plagiarism + obfuscation types
|- Best For -
Evaluating real academic plagiarism detection
-- New PAN25 --
|- 3 parts
spot_check
train
validation