I am facing an unusual problem.
I have a large set of mail data (10,000 mails) and I need to extract registration numbers from them.
The catch is that the numbers follow no particular pattern: they can be of any length and can be alphanumeric, purely numeric, purely alphabetic, or contain symbols.
Which models or approaches can I use to build a machine learning application that extracts these numbers from the mail body?
I am already applying the usual text preprocessing to remove unnecessary words (stopwords). I don't think lemmatization or stemming will be useful here.
Will NER work?
Thanks for the help.
Example mail:
HI MY NAME IS XYZ FOR SOMETIME I HAVE BEEN CALLING FOR SOMETIME BUT TO NO EFFECT THIS IS MY REGISTRATION NUMBER 122431AIB.
Rather than training a model, a better solution may be to extract every token that contains digits with a regex, like the one discussed at https://stackoverflow.com/questions/44187078/regex-to-get-words-containing-letters-and-numbers-certain-special-but-not-o
Matching such a pattern against the example mail returns ‘122431AIB’.
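As a rough sketch of the regex idea in Python: the two patterns below are my own assumptions for illustration, not the exact expression from the linked answer. The first keeps any letter/digit token that contains at least one digit. The second is a hypothetical fallback for registration numbers with no digits at all: it takes whatever token follows the phrase "registration number", which only helps if senders actually write that phrase. The function names are made up as well.

```python
import re

# Assumed pattern: a run of letters/digits that contains at least one digit.
TOKEN_WITH_DIGIT = re.compile(r"\b(?=[A-Za-z]*\d)[A-Za-z0-9]+\b")

# Assumed pattern: the token right after "registration number" / "registration no.",
# optionally followed by "is" or ":". Useful when the number has no digits.
AFTER_KEYWORD = re.compile(
    r"registration\s+(?:number|no\.?)\s*(?:is\b|:)?\s*(\S+)",
    re.IGNORECASE,
)


def extract_candidates(text):
    """Return every letter/digit token that contains at least one digit."""
    return TOKEN_WITH_DIGIT.findall(text)


def extract_after_keyword(text):
    """Return the token following the keyword phrase, or None if it is absent."""
    match = AFTER_KEYWORD.search(text)
    if match is None:
        return None
    # Drop sentence punctuation stuck to the end of the token.
    return match.group(1).rstrip(".,;:!?")


mail = (
    "HI MY NAME IS XYZ FOR SOMETIME I HAVE BEEN CALLING FOR SOMETIME "
    "BUT TO NO EFFECT THIS IS MY REGISTRATION NUMBER 122431AIB."
)

print(extract_candidates(mail))     # ['122431AIB']
print(extract_after_keyword(mail))  # 122431AIB
```

On real mails the first pattern will also pick up phone numbers, dates and order IDs, so the candidates would probably still need filtering, for example by keeping only tokens that appear near the keyword phrase.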