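A regex that matches sentence endings, for splitting a block of text into sentences. It matches a terminator (., ? or !), any closing quotes or brackets after it, and the whitespace that follows. Negative lookbehinds skip periods that end an initial (J.), a short capitalized abbreviation (Dr., Mrs.), or a dotted form such as e.g.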

([\.\?!][\'\"\u2018\u2019\u201c\u201d\)\]]*\s*(?<!\w\.\w.)(?<![A-Z][a-z][a-z]\.)(?<![A-Z][a-z]\.)(?<![A-Z]\.)\s+)

Python Code

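import re
# Python 3 form: the ur'' prefix is Python 2 only, and re resolves the \uXXXX escapes itself.
sentence_regex = r'([\.\?!][\'\"\u2018\u2019\u201c\u201d\)\]]*\s*(?<!\w\.\w.)(?<![A-Z][a-z][a-z]\.)(?<![A-Z][a-z]\.)(?<![A-Z]\.)\s+)'
regex = re.compile(sentence_regex, flags=re.UNICODE)
# TEXT_BLOCK is the input string; the capturing group makes split() return the terminators too.
sentences = regex.split(TEXT_BLOCK)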
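
Because the whole pattern is one capturing group, split() returns the text chunks and the matched terminators as alternating list items, so the punctuation has to be glued back on. Below is a minimal sketch of one way to do that in Python 3. The helper name split_sentences and the sample text are assumptions made for illustration and are not part of the original snippet.

```python
import re

# Same pattern as above, written as two adjacent raw strings for readability.
SENTENCE_REGEX = re.compile(
    r'([\.\?!][\'\"\u2018\u2019\u201c\u201d\)\]]*\s*'
    r'(?<!\w\.\w.)(?<![A-Z][a-z][a-z]\.)(?<![A-Z][a-z]\.)(?<![A-Z]\.)\s+)'
)


def split_sentences(text):
    """Hypothetical helper: rejoin each text chunk with its terminator."""
    # split() yields: chunk, terminator, chunk, terminator, ..., chunk
    parts = SENTENCE_REGEX.split(text)
    sentences = []
    for i in range(0, len(parts), 2):
        terminator = parts[i + 1] if i + 1 < len(parts) else ''
        sentence = (parts[i] + terminator).strip()
        if sentence:
            sentences.append(sentence)
    return sentences


sample = 'Dr. Smith arrived at 5 p.m. sharp. Was anyone there? Nobody answered!'
print(split_sentences(sample))
# ['Dr. Smith arrived at 5 p.m. sharp.', 'Was anyone there?', 'Nobody answered!']
```

One quirk to be aware of: the \s* sits before the lookbehinds, so the abbreviation checks only take effect when exactly one whitespace character follows the period. With two spaces, as in 'Mr.  Jones left.', the pattern still splits after 'Mr.'.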
