I’ve always had a strong interest in chemistry, and it has played a big role in shaping both my academic and professional journey. As a data professional with a background in chemistry, I’ve found many ways to apply my scientific and research skills, like creativity, curiosity, patience, keen observation, and analysis, to data projects. In this article, I’ll walk you through the development of a simple Named Entity Recognition (NER) model that I’ve dubbed ChemNER. This model can identify chemical compounds within text and classify them into categories such as alkanes, alkenes, alkynes, alcohols, aldehydes, ketones, or carboxylic acids.
TL;DR
If you just want to play around with the ChemNER model and/or use the Streamlit app I made, you can access them via the links below:
HuggingFace link: https://huggingface.co/victormurcia/en_chemner
Streamlit App: ChemNER Link
NER approaches can generally be classified into one of the following 3 categories:
- Lexicon-based: Define a dictionary of classes and terms
- Rule-based: Define rules that describe the terms belonging to each class
- Machine Learning (ML)-based: Let the model learn the naming rules from a training corpus
Each of these approaches has its strengths and limitations and, as always, a more complicated and sophisticated model isn’t necessarily the best approach.
In this case, the lexicon-based approach would be limiting in terms of scope, since for every class of compounds we’re interested in classifying we’d have to manually define ALL the compounds that fall within that class. In other words, for this approach to be all-encompassing you’d need to manually enter every chemical compound for every compound class.
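To make that limitation concrete, here is a minimal sketch of a lexicon-based tagger in plain Python. The lexicon contents, the label names, and the `lexicon_ner` function are all hypothetical and chosen only for illustration; they are not taken from ChemNER.

```python
import re

# Hypothetical, deliberately tiny lexicon: class label -> known compound names.
LEXICON = {
    "ALKANE": {"methane", "ethane", "propane"},
    "ALKENE": {"ethene", "propene"},
    "ALCOHOL": {"methanol", "ethanol"},
}

# Invert the dictionary so each known term maps directly to its class.
TERM_TO_CLASS = {
    term: label for label, terms in LEXICON.items() for term in terms
}


def lexicon_ner(text):
    """Return (token, label) pairs for every token found in the lexicon."""
    tokens = re.findall(r"[A-Za-z]+", text)
    return [
        (tok, TERM_TO_CLASS[tok.lower()])
        for tok in tokens
        if tok.lower() in TERM_TO_CLASS
    ]


print(lexicon_ner("Ethanol and propane were detected, along with butane."))
# [('Ethanol', 'ALCOHOL'), ('propane', 'ALKANE')]
```

Butane is an alkane, but it goes unrecognized simply because nobody typed it into the dictionary. Every compound the tagger should find has to be listed by hand.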
The ML approach could be the most powerful way to go. However, annotating a dataset can be quite laborious (spoiler alert: I’ll end up training a model, but I want to show the entire process for educational purposes). Instead, how about we start with some predefined naming rules?
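As a rough illustration of what naming rules can look like, the sketch below matches simplified IUPAC suffixes (-ane, -ene, -yne, -anol, -anal, -anone, -anoic acid) attached to the stems for one to ten carbons. This is an assumption-laden toy rather than the rule set used in ChemNER: the `STEM` pattern, the `RULES` table, and the `rule_based_ner` function are hypothetical, and the sketch ignores branched names and trivial names such as acetone.

```python
import re

# Stems for unbranched chains of 1 to 10 carbons (simplifying assumption).
STEM = r"(?:meth|eth|prop|but|pent|hex|hept|oct|non|dec)"

# Hypothetical rule table: class label -> stem plus characteristic suffix.
RULES = [
    ("CARBOXYLIC_ACID", STEM + r"anoic acid"),
    ("KETONE", STEM + r"anone"),
    ("ALDEHYDE", STEM + r"anal"),
    ("ALCOHOL", STEM + r"anol"),
    ("ALKYNE", STEM + r"yne"),
    ("ALKENE", STEM + r"ene"),
    ("ALKANE", STEM + r"ane"),
]

# Word boundaries keep a rule from firing inside a longer word.
COMPILED = [
    (label, re.compile(r"\b" + pattern + r"\b", re.IGNORECASE))
    for label, pattern in RULES
]


def rule_based_ner(text):
    """Return (compound, label) pairs in order of appearance in the text."""
    found = []
    for label, pattern in COMPILED:
        for match in pattern.finditer(text):
            found.append((match.start(), match.group(0), label))
    return [(name, label) for _, name, label in sorted(found)]


print(rule_based_ner("Butane, ethanol, propanone and ethanoic acid were detected."))
# [('Butane', 'ALKANE'), ('ethanol', 'ALCOHOL'), ('propanone', 'KETONE'), ('ethanoic acid', 'CARBOXYLIC_ACID')]
```

Unlike a lexicon, a handful of patterns covers a whole family of names without listing each compound, at the cost of missing anything that does not follow the simplified rules.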