

Sunderland Repository records the research produced by the University of Sunderland including practice-based research and theses.

Text Mining Legal Documents for Clause Extraction

Vidler, Tony, McGarry, Kenneth and Baglee, David (2023) Text Mining Legal Documents for Clause Extraction. In: The 19th International Conference on Data Science (ICDATA'23), 24-27 Jul 2023, Las Vegas, USA.

Item Type: Conference or Workshop Item (Paper)


Natural Language Processing (NLP) solutions for legal contracts have been the preserve of large law firms and other well-resourced industries (e.g., investment banks), which have both the volume and range of legal documents and the manpower to label the training data. Our findings suggest that it is possible to use a smaller volume of training contracts and still generate results that are within an acceptable range. Our results show that a pre-trained language model trained on just 120 contracts can generate results that are within 10% of the same model trained on 3.3 times that volume. In conclusion, smaller law firms could benefit from machine learning NLP solutions for clause extraction.

CSCE23-vidler v4.pdf - Accepted Version


More Information

Uncontrolled Keywords: NLP, Text Mining, Legal Clauses, Deep Learning, BERT.
Depositing User: Kenneth McGarry


Item ID: 16508

Catalogue record

Date Deposited: 21 Aug 2023 10:18
Last Modified: 14 Sep 2023 15:02


Author: Kenneth McGarry
Author: David Baglee
Author: Tony Vidler

University Divisions

Faculty of Technology > School of Computer Science


Computing > Data Science
Computing > Artificial Intelligence
