AI Email Scraper

NER, Data Scraping

I was involved in the AI email scraper project while working at OFE, a maritime tech startup based in Singapore. The AI email scraper is a system that automates the extraction of relevant information from emails. We categorized emails as structured or unstructured: unstructured scraping applies when an email contains only body text, while structured scraping applies when the email has an Excel, PDF, or image attachment.
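The structured/unstructured split described above can be illustrated with a minimal sketch using Python's standard email module. This is not the project's code: the function name classify_email and the STRUCTURED_SUFFIXES list are assumptions made for illustration, and a real system would likely inspect content types as well as file names.

```python
from email.message import EmailMessage

# Hypothetical list of attachment suffixes treated as "structured" input.
STRUCTURED_SUFFIXES = (".xls", ".xlsx", ".pdf", ".png", ".jpg", ".jpeg")


def classify_email(msg: EmailMessage) -> str:
    """Return "structured" if the email carries an Excel, PDF, or image
    attachment, otherwise "unstructured" (body text only)."""
    for part in msg.iter_attachments():
        filename = (part.get_filename() or "").lower()
        if filename.endswith(STRUCTURED_SUFFIXES):
            return "structured"
    return "unstructured"


def build_example(with_attachment: bool) -> EmailMessage:
    """Build a small in-memory email, optionally with a PDF attachment."""
    msg = EmailMessage()
    msg["Subject"] = "Port call details"
    msg.set_content("Please find the details in this email.")
    if with_attachment:
        msg.add_attachment(
            b"%PDF-1.4 placeholder bytes",
            maintype="application",
            subtype="pdf",
            filename="details.pdf",
        )
    return msg


if __name__ == "__main__":
    print(classify_email(build_example(with_attachment=False)))
    print(classify_email(build_example(with_attachment=True)))
```

Matching on the file name keeps the sketch short; it would misclassify attachments with missing or misleading names, which is one reason a production classifier might need more signals.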

The scraping flow consists of four stages: data retrieval, data classification, data scraping, and data cleaning. Data retrieval fetches the emails that clients send to the company, and the result is saved as a dictionary in MongoDB. In the data classification stage, we classified each email as structured or unstructured and also classified the body text into the categories the product team needed. The result is again saved as a dictionary in MongoDB. The actual extraction happens in the data scraping stage. Unstructured scraping is a Named Entity Recognition (NER) problem, which we solved with a state-of-the-art LSTM-CNN-CRF model built with the TensorFlow library.
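To make the NER step more concrete, here is a minimal sketch of what typically happens after a sequence tagger such as LSTM-CNN-CRF has produced one tag per token: the tags are grouped into entity spans. It assumes the common BIO tagging scheme, which the project may or may not have used, and the labels VESSEL and PORT are hypothetical examples. The model itself is not shown.

```python
from typing import List, Optional, Tuple


def decode_bio(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (label, text) entity pairs.

    A "B-X" tag starts a new entity of type X, "I-X" continues the
    current entity if its type matches, and anything else closes it.
    A stray "I-X" with no open entity of type X is ignored.
    """
    entities: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_label: Optional[str] = None

    def flush() -> None:
        # Emit the entity collected so far, if any, and reset the state.
        nonlocal current_tokens, current_label
        if current_tokens and current_label is not None:
            entities.append((current_label, " ".join(current_tokens)))
        current_tokens = []
        current_label = None

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            flush()
            current_label = tag[2:]
            current_tokens = [token]
        elif tag.startswith("I-") and current_label == tag[2:]:
            current_tokens.append(token)
        else:
            flush()
    flush()
    return entities


if __name__ == "__main__":
    # Hypothetical tokens and tags, as a tagger might output them.
    tokens = ["MV", "Ocean", "Star", "arrives", "Singapore", "on", "Monday"]
    tags = ["B-VESSEL", "I-VESSEL", "I-VESSEL", "O", "B-PORT", "O", "O"]
    print(decode_bio(tokens, tags))
```

The resulting pairs are the kind of dictionary-friendly output that could then be passed on to the data cleaning stage.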

The data scraping stage also handles structured scraping of Excel, image, plain-text, and PDF files. Ideally, structured information would be easier to scrape, but in practice that is not always the case. We received Excel, image, plain-text, and PDF files in many different formats, so we had to build algorithms that accommodate all possible cases. Because the project is not public, I can only give the high-level explanation above. Hope it helps.