Jason Tian

Suck out the marrow of data

Self-Defined Classification Functions in Python (Part 1)

In this post, I write down some regex rules that I used in my finance NLP program.

Find surnames and other names in the Hong Kong name list

In the Hong Kong name list, surnames are written entirely in capital letters, while each of the other names starts with a capital letter followed by lower-case letters.

Three examples are D'AQUINO Thomas Paul, DOWSLEY James William D'Altera, and E Meng.

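import re

# Surname: all capitals with an optional apostrophe (D'AQUINO), or one capital letter at the start (E)
find_surname = re.compile(r'\b[A-Z]+\'?[A-Z]+\b|^[A-Z]{1}\b')
# Other names: D'Altera style, capitalized words, or a single initial;
# (?!\') keeps the initial rule from splitting D'Altera into " D" and "Altera"
find_othername = re.compile(r'\b[A-Z]{1}\'[A-Z]{1}[a-z]+\b|\b[A-Z]{1}[a-z]+\b|\b [A-Z]{1}-?\b(?!\')')
fullname = "D'AQUINO Thomas Paul"
print(find_surname.findall(fullname))    # ["D'AQUINO"]
print(find_othername.findall(fullname))  # ['Thomas', 'Paul']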
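
To check the two patterns against all three example names, they can be wrapped in a small helper. This is only a sketch: split_name is a hypothetical name, not something from the original program. The other-name pattern here carries an added (?!') lookahead after the single-initial rule. That lookahead is an adjustment on my part; without it, the initial rule matches " D" in front of D'Altera and splits that name in two.

```python
import re

# Surname pattern as given in the post.
find_surname = re.compile(r"\b[A-Z]+'?[A-Z]+\b|^[A-Z]{1}\b")
# Other-name pattern with an added (?!') lookahead (an assumption, see above).
find_othername = re.compile(
    r"\b[A-Z]{1}'[A-Z]{1}[a-z]+\b|\b[A-Z]{1}[a-z]+\b|\b [A-Z]{1}-?\b(?!')"
)


def split_name(fullname):
    """Hypothetical helper: return (surnames, other_names) for one entry."""
    surnames = find_surname.findall(fullname)
    # The single-initial alternative includes a leading space, so strip it.
    other_names = [part.strip() for part in find_othername.findall(fullname)]
    return surnames, other_names


examples = [
    "D'AQUINO Thomas Paul",
    "DOWSLEY James William D'Altera",
    "E Meng",
]
for entry in examples:
    print(entry, "->", split_name(entry))
```

Each entry should come back with exactly one surname, and D'Altera should stay in one piece among the other names.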

Issues

Find surname in LIE-A-CHEONG Tai Chong David

When I added a pattern for this case to the previous regex, I found one problem.

find_surname = re.compile(r'\b[A-Z]+\'?[A-Z]+\b|^[A-Z]{1}\b|\b-[A-Z]{1}-\b')
find_surname.findall("LIE-A-CHEONG Tai Chong David")

The output will be ['', '', '', '-A-', '']

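The conflict is caused by using | and () together. When a pattern contains a capturing group, re.findall returns the text of the group instead of the whole match, so every match that comes from an alternative outside the group shows up as an empty string. Writing the group as a non-capturing group, (?:...), avoids this.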
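
A minimal sketch of how a group changes what re.findall returns. The grouped pattern below is an assumed variant with parentheses around the new alternative; it is not necessarily the exact pattern that produced the output above, so it shows the same kind of empty strings rather than the same list. The last pattern, whole_surname, is also hypothetical: one possible way to match a hyphenated surname in a single piece.

```python
import re

name = "LIE-A-CHEONG Tai Chong David"

# Assumed variant: the new alternative sits inside a capturing group ( ).
capturing = re.compile(r"\b[A-Z]+'?[A-Z]+\b|^[A-Z]\b|\b(-[A-Z]-)\b")
# Same pattern with a non-capturing group (?: ).
non_capturing = re.compile(r"\b[A-Z]+'?[A-Z]+\b|^[A-Z]\b|\b(?:-[A-Z]-)\b")

# With a capturing group, findall returns only the group's text; matches
# that come from the other alternatives leave the group empty.
print(capturing.findall(name))      # ['', '-A-', '']

# Without a capturing group, findall returns each whole match.
pieces = non_capturing.findall(name)
print(pieces)                       # ['LIE', '-A-', 'CHEONG']
print("".join(pieces))              # LIE-A-CHEONG

# Hypothetical alternative: runs of capitals joined by ' or -, not touching
# any other word character, apostrophe, or hyphen on either side.
whole_surname = re.compile(r"(?<![\w'-])[A-Z]+(?:['-][A-Z]+)*(?![\w'-])")
print(whole_surname.findall(name))  # ['LIE-A-CHEONG']
```

Joining the pieces only works when every match is a fragment of the surname. The single-pattern variant avoids the join, but it treats a standalone capital letter anywhere in the string as a surname, so it would need tightening if the other names can contain single initials.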