In this post, I write down some regex rules that I used in my finance NLP program.
Find surnames and other names from Hong Kong name list
In the Hong Kong name list, surnames are written entirely in capital letters, while every other name starts with a capital letter followed by lower-case letters.
Three examples are D'AQUINO Thomas Paul, DOWSLEY James William D'Altera, and E Meng.
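import re
# Surname: an all-capital word (optionally containing an apostrophe), or a single capital letter at the start.
find_surname = re.compile(r'\b[A-Z]+\'?[A-Z]+\b|^[A-Z]{1}\b')
# Other names: capitalised words. The (?!\') keeps the single-letter alternative from matching " D" in " D'Altera".
find_othername = re.compile(r'\b[A-Z]{1}\'[A-Z]{1}[a-z]+\b|\b[A-Z]{1}[a-z]+\b|\b [A-Z]{1}-?\b(?!\')')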
fullname = "D'AQUINO Thomas Paul"
find_surname.findall(fullname)    # ["D'AQUINO"]
find_othername.findall(fullname)  # ['Thomas', 'Paul']
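As a quick check, the two patterns can be run over all three example names. This is only a sketch: `split_name` is a hypothetical helper made up for this illustration, and the other-name pattern here ends its single-letter alternative with a negative lookahead `(?!')`, an assumption added so that it does not match ` D` at the start of `D'Altera`.

```python
import re

# Patterns restated so the sketch runs on its own ({1} is dropped, which does
# not change what they match). The trailing (?!') is an assumption: it stops
# the single-letter alternative from matching " D" in " D'Altera".
find_surname = re.compile(r"\b[A-Z]+'?[A-Z]+\b|^[A-Z]\b")
find_othername = re.compile(r"\b[A-Z]'[A-Z][a-z]+\b|\b[A-Z][a-z]+\b|\b [A-Z]-?\b(?!')")


def split_name(fullname):
    """Hypothetical helper: return (surname matches, other-name matches)."""
    return find_surname.findall(fullname), find_othername.findall(fullname)


examples = ["D'AQUINO Thomas Paul", "DOWSLEY James William D'Altera", "E Meng"]
results = {name: split_name(name) for name in examples}
for name, (surnames, others) in results.items():
    print(f"{name!r}: surname={surnames}, other={others}")
```

Keeping the examples in a list like this makes it easy to re-run the check whenever a new alternative is added to either pattern.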
Issues
Find surname in LIE-A-CHEONG Tai Chong David
When I added this pattern to the previous regex, I found one problem.
find_surname = re.compile(r'\b[A-Z]+\'?[A-Z]+\b|^[A-Z]{1}\b|\b-[A-Z]{1}-\b')
find_surname.findall("LIE-A-CHEONG Tai Chong David")
The output will be ['', '', '', '-A-', '']
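The conflict is caused by using | and () together. When a pattern contains a capture group, re.findall returns only the text captured by that group, so every match that comes from an alternative outside the group is reported as an empty string. Making the group non-capturing with (?:...), or removing the parentheses, returns the full matches instead.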
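A minimal sketch of how | and () interact in `re.findall`, assuming a variant of the surname regex in which the new alternative is wrapped in parentheses. The names `with_group` and `without_group` are made up for this illustration, and the exact list you get depends on where the parentheses sit in your own pattern.

```python
import re

name = "LIE-A-CHEONG Tai Chong David"

# Hypothetical variant: the added alternative sits inside a capture group.
with_group = re.compile(r"\b[A-Z]+'?[A-Z]+\b|^[A-Z]\b|\b(-[A-Z]-)\b")
# findall returns the group's text for each match, so matches that came
# from the other alternatives show up as empty strings.
grouped_result = with_group.findall(name)

# Same alternation with a non-capturing group: findall returns whole matches.
without_group = re.compile(r"\b[A-Z]+'?[A-Z]+\b|^[A-Z]\b|\b(?:-[A-Z]-)\b")
plain_result = without_group.findall(name)

print(grouped_result)  # ['', '-A-', '']
print(plain_result)    # ['LIE', '-A-', 'CHEONG']
```

With the non-capturing version the surname comes back in three pieces, which can be joined with `"".join(plain_result)` to recover `LIE-A-CHEONG`.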