A regex that matches the ends of sentences, used to split a block of text into sentences.
([\.\?!][\'\"\u2018\u2019\u201c\u201d\)\]]*\s*(?<!\w\.\w.)(?<![A-Z][a-z][a-z]\.)(?<![A-Z][a-z]\.)(?<![A-Z]\.)\s+)
Python Code
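import re

# Python 3: the ur'' prefix is Python 2 only; a raw string works because re itself understands \uXXXX escapes.
sentence_regex = r'([\.\?!][\'\"\u2018\u2019\u201c\u201d\)\]]*\s*(?<!\w\.\w.)(?<![A-Z][a-z][a-z]\.)(?<![A-Z][a-z]\.)(?<![A-Z]\.)\s+)'
regex = re.compile(sentence_regex, flags=re.UNICODE)
sentences = regex.split(TEXT_BLOCK)  # TEXT_BLOCK is the string to split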
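Because the whole pattern sits inside a capturing group, split() returns the matched terminators (punctuation plus trailing whitespace) as separate list items between the sentence bodies. Here is a minimal sketch of one way to glue each terminator back onto its sentence, assuming Python 3. The helper name split_sentences and the sample text are illustrative and not part of the original snippet.

```python
import re

# Same pattern as in the post, written as a Python 3 raw string.
SENTENCE_END = re.compile(
    r'([\.\?!][\'\"\u2018\u2019\u201c\u201d\)\]]*\s*'
    r'(?<!\w\.\w.)(?<![A-Z][a-z][a-z]\.)(?<![A-Z][a-z]\.)(?<![A-Z]\.)\s+)'
)


def split_sentences(text):
    """Split text into sentences, keeping each terminator with its sentence."""
    parts = SENTENCE_END.split(text)
    # parts alternates: [body, terminator, body, terminator, ..., tail]
    sentences = []
    for i in range(0, len(parts), 2):
        terminator = parts[i + 1] if i + 1 < len(parts) else ''
        sentence = (parts[i] + terminator).strip()
        if sentence:
            sentences.append(sentence)
    return sentences


sample = 'Dr. Smith arrived at 5 p.m. sharp. Was he late? No one knew.'
print(split_sentences(sample))
# ['Dr. Smith arrived at 5 p.m. sharp.', 'Was he late?', 'No one knew.']
```

The lookbehinds are what keep "Dr." and "p.m." from ending a sentence here. Note that the last sentence has no trailing whitespace, so it never matches the pattern and simply comes back as the final list item.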