Blog post

What is Text Annotation in Machine Learning?

Giving machines a deeper understanding of sentiment, intent and technical concepts.  

Reading a sentence such as: “You are killing it!”, a human would understand the rich meaning and context behind that simple statement: the person writing it is complimenting someone on doing something exceptionally.  

However, a natural language processing model might have more trouble understanding the sentence fully: for example, it may mislabel it with a negative sentiment, or misunderstand the sentence completely. Correct text annotations within the data used to train that model can help correct the misunderstanding, leading to a more accurate interpretation of the text at hand.  

As a type of data annotation, text annotation is the machine learning process of assigning meaning to blocks of text: whether they are short phrases, longer sentences or full paragraphs. This is done by providing AI models with additional information in the form of definitions, meaning and intent to supplement the text as written.  

Here’s a closer look at why text annotation is important, what the different types of text annotation are and how to annotate text.  

Why is Text Annotation Important? 

The richness of human languages is the main reason why text annotation is so important within NLP. As smart as they are becoming, machines still have a lot to learn when it comes to context and deeper meaning. It’s annotation that gives them that information.    

Take as an example a chatbot – chatbots are among the most recognizable applications of natural language processing around today, and there are hundreds of examples of chatbots gone wrong. Chatbot fails can be entertaining. On the other hand, badly trained chatbots, especially those in customer service, can affect company reputation, user experience and ultimately customer loyalty.   

Lending a human perspective 

If the point is to make AI learn and comprehend in the same way that humans do, then we need to give machines perspective, feeling and understanding: three concepts that stay very firmly in the human realm – for now.  

Given the right input from correctly annotated text, machines will eventually be writing poetry – it’s just a matter of time. 

Types of Text Annotation 

Here are three types of text annotation, and a few sample use cases for each:  

Named Entity Tagging (NET):  

Also commonly referred to as Named Entity Recognition, NET is assigning labels to words or phrases within a text from predefined categories such as “actor” or “city”. This type of annotation is useful for machines to understand the subject matter of the text.  

Named Entity Tagging has a variety of applications in the real world, including:  

— Customer Service: in chatbots, and other automated processes, it ensures machines understand the meaning of queries and comments. From routing customer service complaints and comments to the right department, to understanding emails, NET allows for the automation of more of the customer service pipeline.  

— Screening Processes: Within hiring and recruitment, NET identifies keywords, skills and experience within user profiles for a faster, more efficient recruitment process.   

— Medical Records: In healthcare, NET is used for processing patient information and records more effectively, such as classifying documents, filing patient records and amplifying medical research.  

Sentiment Annotation 

Sentiment Tagging goes beyond simple definitions and helps machines understand the sentiment behind a block of text. Is the sentiment positive, negative or neutral?  

Here are some examples of where Sentiment Annotation is used:  

— Brand Social Listening: Brands can analyze social media posts and comments with the aid of Sentiment Annotation to help understand public opinion of them and which platforms tend to be more positive or negative. They can use this data to help shift communication strategies.  

— Detailed Customer Insights: Through sentiment annotation, companies can understand sentiment behind customer interactions, including reviews, emails and other comments. This information allows for a more relevant response to customer service and support.  

— Employee engagement: In human resources, using Sentiment Annotation to analyze employee feedback can help shape an effective employee engagement strategy – especially when receiving high volume feedback in a short time frame.  

Semantic Annotation 

Semantic Annotation attaches additional information to words and phrases that further explain user intent or domain specific definitions, such as industry jargon. This type of annotation is used in virtual assistants and chatbots.  

How to annotate text: the text annotation process 

The process for manually annotating a text starts with identifying the right people to complete the annotation. For very general use cases and datasets, annotators with general cultural and technical knowledge can be used. For more specialized and technical subjects, identify subject matter experts for better results. Here’s a closer look at the process for each of the three annotation techniques discussed above.  

Named Entity Tagging:  

NET starts with defining the categories that will be used. This would include “City” “Country” “Profession” and so on. The annotator is then shown a block of text and possible tags. They will choose parts of the sentence and assign relevant tags to each part. If they are doing Named Entity Tagging for an entertainment app, for example, they would assign “Anthony Hopkins” as an Actor and “Lady Gaga” as a musician.   

Sentiment Annotation:  

The annotator is shown a text and will label it as positive, negative or neutral. Multiple annotations are especially important in this type of annotation, as sentiment is subjective, depending on the individual.  

Semantic Annotation:  

The annotator will be shown a text and will label it with domain specific information (for example, banking specific terminology), or with intent (accounting for the use of sarcasm or irony).  

Ensuring Quality Text Annotations 

There are several ways to keep an eye on quality throughout the text annotation process:  

  • Collect multiple annotations on the same text. The more annotations that a text goes through, the more accurate those annotations should be – which is why the generally accepted ideal number of annotators per block of text is 3.  
  • Use inter-annotator agreements. These are a measurement of agreement between annotators, and can help determine which annotation is correct when there is a discrepancy between multiple annotators.   
  • Use gold standard datasets. Also called ground truth data, these are well-annotated datasets that can be used as a reference for other data annotations.  

Crowdsourcing Text Annotation  

While tools exist for automatic text annotation, some of the highest quality annotations come from human annotators. From being able to understand complex sentiments to expertly annotating highly technical subjects, human annotators produce superior results.   

With that in mind, what is a company to do when faced with a high-volume project? Crowdsourcing is a simple solution to completing annotation jobs at scale. At DefinedCrowd, we leverage the power of a crowd of people to ensure high-quality, large-scale annotations amongst our larger data collection services. For more on how to annotate a text, and how we do it at DefinedCrowd, have a look here.  

2

Leave a comment

Your email address will not be published. Required fields are marked *

Terms of Use agreement

When contributing, do not post any material that contains:

  • hate speech
  • profanity, obscenity or vulgarity
  • comments that could be considered prejudicial, racist or inflammatory
  • nudity or offensive imagery (including, but not limited to, in profile pictures)
  • defamation to a person or people
  • name calling and/or personal attacks
  • comments whose main purpose are commercial in nature and/or to sell a product
  • comments that infringe on copyright or another person’s intellectual property
  • spam comments from individuals or groups, such as the same comment posted repeatedly on a profile
  • personal information about you or another individual (including identifying information, email addresses, phone numbers or private addresses)
  • false representation of another individual, organisation, government or entity
  • promotion of a product, business, company or organisation

We retain the right to remove any content that does not comply with these guidelines or we deem inappropriate.
Repeated violations may cause the author to be blocked from our channels.

Thank you for your comment!

Please allow several working hours for the comment to be moderated before it is published.