
Building Custom Classifier using Amazon Comprehend.

A quick rundown on how to perform NLP tasks with Amazon S3, Boto3, and Amazon Comprehend using Python.

For the past few days, I've had the pleasure of working with Amazon Comprehend, a fantastic service provided by Amazon Web Services, or as we all know it, AWS.

Amazon Comprehend is one of the most amazing technological marvels I've ever seen. I was astounded by how effective it is while still being so simple to use. I began as a complete novice and went on to discover some incredible features that captivated me.

Introduction

In 2006, Amazon Web Services (AWS) began offering IT infrastructure services to businesses in the form of web services -- which is now commonly known as “cloud computing”.

AWS is the world's most comprehensive and widely used cloud platform, with a mixture of infrastructure as a service (IaaS), platform as a service (PaaS), and packaged software as a service (SaaS) offerings.


Organizations of every size these days are using AWS to lower costs, become more agile, and innovate faster.

With 81 Availability Zones spanning 25 geographic regions, and announced plans for 21 more Availability Zones and 7 more AWS Regions, AWS has the most extensive global cloud infrastructure.

AWS offers on-demand service delivery with pay-as-you-go pricing. To put it another way, you only pay for what you use.

Natural Language Processing (NLP)

Natural Language Processing or NLP is a field of Artificial Intelligence (AI) that gives machines the ability to read, understand and derive meaning from human languages. It is an important component in a wide range of software applications that we use in our daily lives.

Amazon Comprehend.

Amazon Comprehend is a natural-language processing (NLP) service that uses machine learning to uncover information in unstructured data.


The service can identify critical elements in data, including references to language, people, and places, and it can categorize text files by relevant topic.

Comprehend not only locates content that contains personally identifiable information (PII), it can also redact or mask that content.

Benefits:

With Amazon Comprehend we can:

  1. Uncover valuable insights from our text.
  2. Organize documents by topic.
  3. Train models on our own data.
  4. Identify industry-specific insights from unstructured text and documents, like emails.

Amazon Comprehend provides features like Keyphrase Extraction, Sentiment Analysis, Entity Recognition, Topic Modeling, and Language Detection APIs so we can easily integrate natural language processing into our applications.

We simply call the Amazon Comprehend APIs in our application and provide the location of the source document or text. The APIs will output entities, key phrases, sentiment, and language in JSON format, which we can use in our application.
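
To make that concrete, here is a minimal sketch using Boto3 (the AWS SDK for Python, covered later in this article). The region name and sample text are just examples:

import json
import boto3

# Low-level Comprehend client; the region is an example.
comprehend = boto3.client("comprehend", region_name="us-east-1")

text = "Amazon Comprehend makes it easy to analyze customer emails from Seattle."

# Each call returns a plain JSON-style dictionary.
entities = comprehend.detect_entities(Text=text, LanguageCode="en")
key_phrases = comprehend.detect_key_phrases(Text=text, LanguageCode="en")
sentiment = comprehend.detect_sentiment(Text=text, LanguageCode="en")
languages = comprehend.detect_dominant_language(Text=text)

print(json.dumps(entities["Entities"], indent=2))
print(json.dumps(key_phrases["KeyPhrases"], indent=2))
print(sentiment["Sentiment"], sentiment["SentimentScore"])
print(languages["Languages"])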

Amazon Comprehend Features.

1. Keyphrase Extraction.

The Keyphrase Extraction API returns the key phrases or talking points and a confidence score to support that this is a key phrase.

2. Sentiment Analysis.

The Sentiment Analysis API returns the overall sentiment of a text (Positive, Negative, Neutral, or Mixed).


3. Syntax Analysis.

The Amazon Comprehend Syntax API enables customers to analyze text using tokenization and Parts of Speech (PoS), and identify word boundaries and labels like nouns and adjectives within the text.
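
A minimal sketch of such a call with Boto3 (the sentence is made up):

import boto3

comprehend = boto3.client("comprehend", region_name="us-east-1")  # example region

response = comprehend.detect_syntax(
    Text="The quick brown fox jumps over the lazy dog.",
    LanguageCode="en",
)

# Each token carries its text plus a part-of-speech tag and a confidence score.
for token in response["SyntaxTokens"]:
    print(token["Text"], token["PartOfSpeech"]["Tag"], token["PartOfSpeech"]["Score"])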

4. Entity Recognition.

The Entity Recognition API returns the named entities ("People," "Places," "Locations," etc.) that are automatically categorized based on the provided text.

5. Comprehend Medical.

Medical Named Entity and Relationship Extraction (NERe).

The Medical NERe API returns medical information such as medication, medical condition, test, treatment, and procedure (TTP), anatomy, and Protected Health Information (PHI). It also identifies relationships between extracted sub-types associated with medications and TTP. Contextual information is also provided as entity "traits".
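
Comprehend Medical sits behind its own Boto3 client; a minimal sketch, where the clinical note is invented:

import boto3

medical = boto3.client("comprehendmedical", region_name="us-east-1")  # example region

note = "Patient was prescribed 40 mg of ibuprofen for knee pain."
response = medical.detect_entities_v2(Text=note)

# Each entity reports its category (e.g. MEDICATION), type, traits, and score.
for entity in response["Entities"]:
    print(entity["Text"], entity["Category"], entity["Type"], entity.get("Traits", []))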

6. Custom Entities.

Custom Entities allows us to customize Amazon Comprehend to identify terms that are specific to our domain. Using AutoML, Comprehend will learn from a small private index of examples (for example, a list of policy numbers and the text in which they are used) and then train a private, custom model to recognize those terms in any other block of text.
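
Training a custom recognizer is an asynchronous call; a rough sketch, where the recognizer name, entity type, S3 URIs, and IAM role ARN are all placeholders to replace with your own:

import boto3

comprehend = boto3.client("comprehend", region_name="us-east-1")  # example region

response = comprehend.create_entity_recognizer(
    RecognizerName="policy-number-recognizer",  # hypothetical name
    DataAccessRoleArn="arn:aws:iam::123456789012:role/ComprehendS3Access",  # placeholder
    LanguageCode="en",
    InputDataConfig={
        "EntityTypes": [{"Type": "POLICY_NUMBER"}],  # example custom entity type
        "Documents": {"S3Uri": "s3://my-bucket/entity-training/docs.txt"},
        "EntityList": {"S3Uri": "s3://my-bucket/entity-training/entity_list.csv"},
    },
)
print(response["EntityRecognizerArn"])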

7. Language Detection.

The Language Detection API automatically identifies text written in over 100 languages and returns the dominant language with a confidence score to support that a language is dominant.

8. Custom Classification.

The Custom Classification API enables you to easily build custom text classification models using your business-specific labels without learning ML.
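
Training such a classifier is walked through in the Steps section below. Once a trained model is deployed behind an endpoint, a synchronous prediction looks roughly like this (the endpoint ARN and sample text are placeholders):

import boto3

comprehend = boto3.client("comprehend", region_name="us-east-1")  # example region

response = comprehend.classify_document(
    Text="Transformer-based models for abstractive summarization of scientific papers.",
    EndpointArn="arn:aws:comprehend:us-east-1:123456789012:document-classifier-endpoint/my-endpoint",  # placeholder
)

# Each candidate label comes back with a confidence score.
for label in response["Classes"]:
    print(label["Name"], label["Score"])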

9. Topic Modeling.

Topic Modeling identifies relevant terms or topics from a collection of documents stored in Amazon S3. It will identify the most common topics in the collection and organize them into groups and then map which documents belong to which topic.
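
Topic detection runs as an asynchronous job over documents stored in S3; a rough sketch, with placeholder bucket paths and IAM role ARN:

import boto3

comprehend = boto3.client("comprehend", region_name="us-east-1")  # example region

response = comprehend.start_topics_detection_job(
    JobName="example-topics-job",  # hypothetical name
    NumberOfTopics=10,
    DataAccessRoleArn="arn:aws:iam::123456789012:role/ComprehendS3Access",  # placeholder
    InputDataConfig={
        "S3Uri": "s3://my-bucket/documents/",
        "InputFormat": "ONE_DOC_PER_FILE",
    },
    OutputDataConfig={"S3Uri": "s3://my-bucket/topics-output/"},
)
print(response["JobId"], response["JobStatus"])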

Amazon Comprehend Pricing.

Natural Language Processing: Amazon Comprehend APIs for entity recognition, sentiment analysis, syntax analysis, key phrase extraction, and language detection can be used to extract insights from natural language text. These requests are measured in units of 100 characters (1 unit = 100 characters), with a 3 unit (300 characters) minimum charge per request.
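
That unit arithmetic is easy to sanity-check; a tiny sketch of how billable units would be counted under this scheme:

import math

def billable_units(text, unit_size=100, minimum_units=3):
    # 1 unit = 100 characters, with a 3-unit (300-character) minimum per request.
    return max(minimum_units, math.ceil(len(text) / unit_size))

print(billable_units("short text"))  # 10 characters -> billed at the 3-unit minimum
print(billable_units("x" * 550))     # 550 characters -> 6 units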

PII: The detect PII API finds locations of chosen Personally Identifiable Information (“PII”) entities inside a document and can be used to create redacted versions of documents. The PII API tells us if a document contains the chosen PII or not. These requests are also measured in units of 100 characters (1 unit = 100 characters), with a 3 unit (300 characters) minimum charge per request.
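
Both checks are available through Boto3; a minimal sketch with an invented example:

import boto3

comprehend = boto3.client("comprehend", region_name="us-east-1")  # example region

text = "My name is Jane Doe and my phone number is 555-0100."

# Document-level check: which PII types appear in the text at all?
labels = comprehend.contains_pii_entities(Text=text, LanguageCode="en")
print([label["Name"] for label in labels["Labels"]])

# Entity-level check: where exactly is the PII, so it can be redacted?
entities = comprehend.detect_pii_entities(Text=text, LanguageCode="en")
for entity in entities["Entities"]:
    print(entity["Type"], text[entity["BeginOffset"]:entity["EndOffset"]])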


Custom Comprehend: The Custom Classification and Entities APIs can train a custom NLP model to categorize text and extract custom entities.

Asynchronous inference requests are measured in units of 100 characters, with a 3 unit (300 characters) minimum charge per request. You are charged $3 per hour for model training (billed by the second) and $0.50 per month for custom model management.

For synchronous Custom Classification and Entities inference requests, you provision an endpoint with the appropriate throughput. You are charged from the time that you start your endpoint until it is deleted.

Topic Modeling: You are charged based on the total size of the documents processed per job. The first 100 MB is charged at a flat rate; above 100 MB, you are charged per MB.

Boto3

Boto3 is the AWS SDK for Python. Boto3 makes it easy to integrate our Python applications, libraries, or scripts with AWS services. The SDK provides an object-oriented API as well as low-level access to AWS services.
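
For example, S3 can be used through either interface; a quick sketch (the bucket name is a placeholder):

import boto3

# Low-level client: a thin wrapper over the service API that returns plain dictionaries.
s3_client = boto3.client("s3")
print(s3_client.list_buckets()["Buckets"])

# Object-oriented resource interface for the same service.
s3_resource = boto3.resource("s3")
for obj in s3_resource.Bucket("my-example-bucket").objects.limit(5):  # placeholder bucket
    print(obj.key)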

Topic modeling

A topic model is a form of statistical model used in machine learning and natural language processing to find abstract "topics" that appear in a collection of documents.

Topic Modeling is an unsupervised learning method for clustering documents and identifying topics based on their contents. In spirit it is similar to clustering algorithms such as K-Means and Expectation-Maximization. Because we are clustering texts, we have to evaluate the individual words in each document and assign each document a weight for each topic based on the distribution of those terms.


Example:

Let's say I have 5 documents:

Document 1: I like mango and apple.

Document 2: Crab and fish live in water.

Document 3: Puppies and kittens are fluffy.

Document 4: I had spinach and berry smoothie.

Document 5: My pup loves mango.


The output should look something like this:

Document 1 is talking about Food.

Document 2 is talking about Animals.

Document 3 is talking about Animals.

Document 4 is talking about Food.

Document 5 is talking about Food + Animals.
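
To make the mechanics concrete outside of Comprehend, here is a tiny local sketch using scikit-learn's LDA implementation on those five documents. With this little text the topics are noisy, but it shows the word-distribution idea described above; it is only an illustration, not what Comprehend does internally:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

documents = [
    "I like mango and apple.",
    "Crab and fish live in water.",
    "Puppies and kittens are fluffy.",
    "I had spinach and berry smoothie.",
    "My pup loves mango.",
]

# Bag-of-words counts, then a 2-topic LDA model.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(documents)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)

# Top words per topic.
words = vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(lda.components_):
    print(f"Topic {topic_idx}:", [words[i] for i in topic.argsort()[-3:]])

# Per-document topic mixture (each row sums to 1).
for doc, mixture in zip(documents, doc_topics):
    print(doc, mixture.round(2))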

Task Performed: 

Given the abstract and title for a set of documents, Comprehend has to predict the topics for each document included in the test set.

Steps:

1. Data Pre-Processing:
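
A rough sketch of this step, assuming the raw data is a CSV with title, abstract, and topic columns (the file and column names here are hypothetical). Comprehend's custom classifier expects a headerless CSV with the label in the first column and the document text in the second:

import pandas as pd

# Hypothetical input file with one row per document.
raw = pd.read_csv("documents_raw.csv")  # assumed columns: title, abstract, topic

# Combine title and abstract into a single text field and tidy the whitespace.
raw["text"] = (raw["title"].fillna("") + ". " + raw["abstract"].fillna(""))
raw["text"] = raw["text"].str.replace(r"\s+", " ", regex=True).str.strip()

# Training file: label first, text second, no header.
raw[["topic", "text"]].to_csv("comprehend_train.csv", index=False, header=False)

# Test file for predictions: one document per line, no labels.
test_docs = raw["text"].sample(frac=0.2, random_state=42)
with open("comprehend_test.txt", "w") as f:
    for doc in test_docs:
        f.write(doc.replace("\n", " ") + "\n")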

2. Creating S3 Bucket:

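A sketch of this step with Boto3; the bucket name is a placeholder and has to be globally unique (outside us-east-1 a CreateBucketConfiguration with a LocationConstraint is also required):

import boto3

s3 = boto3.client("s3", region_name="us-east-1")  # example region
bucket = "my-comprehend-demo-bucket"              # placeholder, must be globally unique

# In us-east-1 no CreateBucketConfiguration is needed; other regions require one.
s3.create_bucket(Bucket=bucket)

# Upload the training and test files prepared in the previous step.
s3.upload_file("comprehend_train.csv", bucket, "train/comprehend_train.csv")
s3.upload_file("comprehend_test.txt", bucket, "test/comprehend_test.txt")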

3. Training:

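A rough sketch of kicking off training; the classifier name, S3 path, and IAM role ARN (which must allow Comprehend to read the bucket) are placeholders:

import time
import boto3

comprehend = boto3.client("comprehend", region_name="us-east-1")  # example region

response = comprehend.create_document_classifier(
    DocumentClassifierName="document-topic-classifier",  # hypothetical name
    DataAccessRoleArn="arn:aws:iam::123456789012:role/ComprehendS3Access",  # placeholder
    LanguageCode="en",
    InputDataConfig={"S3Uri": "s3://my-comprehend-demo-bucket/train/comprehend_train.csv"},
)
classifier_arn = response["DocumentClassifierArn"]

# Training is asynchronous; poll until it finishes.
while True:
    props = comprehend.describe_document_classifier(
        DocumentClassifierArn=classifier_arn
    )["DocumentClassifierProperties"]
    if props["Status"] in ("TRAINED", "IN_ERROR"):
        break
    time.sleep(60)
print(props["Status"])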

4. Predicting:

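A sketch of batch prediction against the documents uploaded earlier; the job name, ARNs, and S3 paths are placeholders:

import boto3

comprehend = boto3.client("comprehend", region_name="us-east-1")  # example region

# ARN returned by create_document_classifier in the training step (placeholder here).
classifier_arn = "arn:aws:comprehend:us-east-1:123456789012:document-classifier/document-topic-classifier"

response = comprehend.start_document_classification_job(
    JobName="topic-prediction-job",  # hypothetical name
    DocumentClassifierArn=classifier_arn,
    DataAccessRoleArn="arn:aws:iam::123456789012:role/ComprehendS3Access",  # placeholder
    InputDataConfig={
        "S3Uri": "s3://my-comprehend-demo-bucket/test/comprehend_test.txt",
        "InputFormat": "ONE_DOC_PER_LINE",
    },
    OutputDataConfig={"S3Uri": "s3://my-comprehend-demo-bucket/predictions/"},
)
job_id = response["JobId"]

# The job is asynchronous; its status can be polled the same way as training.
status = comprehend.describe_document_classification_job(JobId=job_id)
print(status["DocumentClassificationJobProperties"]["JobStatus"])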

5. Validating:

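A sketch of validation. Comprehend reports evaluation metrics it computed on a held-out slice of the training data, and the batch prediction job writes a compressed JSON-lines file to the output location; the ARN and job id are placeholders, and the exact output layout may differ slightly:

import json
import tarfile
import boto3

comprehend = boto3.client("comprehend", region_name="us-east-1")  # example region
s3 = boto3.client("s3")

# Placeholders carried over from the previous steps.
classifier_arn = "arn:aws:comprehend:us-east-1:123456789012:document-classifier/document-topic-classifier"
job_id = "0123456789abcdef"

# Accuracy, precision, recall, and F1 computed by Comprehend during training.
classifier = comprehend.describe_document_classifier(DocumentClassifierArn=classifier_arn)
print(classifier["DocumentClassifierProperties"]["ClassifierMetadata"]["EvaluationMetrics"])

# The classification job reports where it wrote its output archive.
job = comprehend.describe_document_classification_job(JobId=job_id)
output_uri = job["DocumentClassificationJobProperties"]["OutputDataConfig"]["S3Uri"]
bucket, key = output_uri.replace("s3://", "").split("/", 1)
s3.download_file(bucket, key, "output.tar.gz")

with tarfile.open("output.tar.gz") as tar:
    tar.extractall("predictions")

# Each line of the output holds the input line number and the scored classes.
with open("predictions/predictions.jsonl") as f:
    for line in f:
        record = json.loads(line)
        best = max(record["Classes"], key=lambda c: c["Score"])
        print(record["Line"], best["Name"], round(best["Score"], 3))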

Source Code:

Conclusion:

AWS is a powerful technology that is making a name for itself. In my opinion, AWS will be employed in a variety of fields in the near future, since it has the potential to revolutionize the world.

Boto3 with Amazon Comprehend enables ML engineers and data scientists to complete various jobs that would otherwise take hours. Of course, given domain knowledge and time to understand the problem, a custom model will perform far more precisely.

Presentation:

Reference

Amazon provides the best documentation.

https://aws.amazon.com/comprehend

https://aws.amazon.com/comprehend/features

https://aws.amazon.com/comprehend/pricing

https://towardsdatascience.com/sentiment-analysis-entity-extraction-with-aws-comprehend-618a7bec60b8

Youtube.

Images - Internet.

