3. Web Content Mining
3.1 Introduction to Sentiment Analysis / Opinion Mining
Detection of stances and opinions towards people, companies, and products/services has tremendous business value: improving products and services, targeted advertising, revealing trends in election campaigns, …
Sentiment analysis or opinion mining is the computational study of people’s opinions, appraisals, attitudes, and emotions towards entities, individuals, issues, events, topics, and their attributes (aspects).
A general sentiment analysis framework aims to answer:
- Who is the opinion holder? -> Opinion holder
- Towards whom or what is opinion/sentiment expressed? -> Target
- What is the polarity and intensity of the opinion?
- Is an opinion associated with a time-span?
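The four questions above can be read as the fields of a single opinion record. A minimal sketch, assuming a hypothetical `Opinion` class; the field names, the polarity labels, and the [0, 1] intensity scale are illustrative choices, not part of the framework itself:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Opinion:
    """One extracted opinion (hypothetical record layout)."""
    holder: str                 # who expresses the opinion
    target: str                 # entity or aspect the opinion is about
    polarity: str               # e.g. "positive", "negative", "neutral"
    intensity: float            # strength of the opinion, assumed in [0, 1]
    time: Optional[str] = None  # when the opinion was expressed, if known


example = Opinion(
    holder="reviewer_42",
    target="phone battery",
    polarity="negative",
    intensity=0.8,
    time="2023-05-01",
)
```

Keeping the time field optional reflects that not every opinion is tied to a time-span.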
3.2 Constructing Sentiment Lexicons
Sentiment clues (opinion words, sentiment-bearing words) – words and phrases used to express some desired or undesired state
Positive clues: good, amazing, beautiful
Negative clues: bad, awful, terrible, poor
Sentiment clues are often domain-dependent => Separate sentiment lexicons need to be constructed for different domains
Example: Quiet speaker phone vs. quiet car engine
3.2.1 Automated acquisition of sentiment lexicons
Automated acquisition of sentiment lexicons is most often semi-supervised (or weakly supervised)
- Start from a small seed lexicon of sentiment words
- Iteratively augment the lexicon based on links between words already in the lexicon and words in a large general lexicon or a large corpus
- Stop when there are no more reliable candidate words to be added to the lexicon
Approaches for constructing sentiment lexicons are either dictionary-based or corpus-based
Often there is a final step of manual cleansing of automatically derived sentiment lexicons
3.2.1.1 Dictionary-Based Sentiment Lexicon Acquisition
Bootstrapping using a small seed sentiment lexicon, e.g., 10 positive and 10 negative sentiment words
Idea: exploit semantic links between words in the general lexicon, e.g., synonymy and antonymy links in WordNet. The procedure is typically iterative
Additional information can be used to make better lists: WordNet glosses or machine learning (classification based on concept definitions)
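The bootstrapping idea can be sketched with a toy lexicon. This is a minimal illustration, assuming hand-made `SYNONYMS` and `ANTONYMS` tables as a stand-in for WordNet links; the function name `expand_lexicon` and the rule of discarding words that receive both labels are assumptions made for the example:

```python
# Toy stand-in for a general lexicon such as WordNet (hypothetical data).
SYNONYMS = {
    "good": {"fine", "great"},
    "great": {"superb"},
    "bad": {"poor", "awful"},
    "awful": {"dreadful"},
}
ANTONYMS = {
    "good": {"bad"},
    "great": {"terrible"},
    "bad": {"good"},
}


def expand_lexicon(pos_seeds, neg_seeds, max_iters=10):
    """Grow positive/negative word sets by following synonym and antonym links."""
    pos, neg = set(pos_seeds), set(neg_seeds)
    for _ in range(max_iters):
        new_pos, new_neg = set(), set()
        for w in pos:
            new_pos |= SYNONYMS.get(w, set())   # synonyms keep the polarity
            new_neg |= ANTONYMS.get(w, set())   # antonyms flip it
        for w in neg:
            new_neg |= SYNONYMS.get(w, set())
            new_pos |= ANTONYMS.get(w, set())
        # Keep only unseen words; drop candidates that received both labels.
        conflict = new_pos & new_neg
        new_pos -= pos | neg | conflict
        new_neg -= pos | neg | conflict
        if not new_pos and not new_neg:
            break  # no more candidates to add
        pos |= new_pos
        neg |= new_neg
    return pos, neg


positive, negative = expand_lexicon({"good"}, {"bad"})
```

With real WordNet data the conflict rule matters more, since polysemous words often pick up links from both sides; that is one reason a manual cleansing step usually follows.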
Cons:
- Limited Coverage: they may miss out on nuanced or domain-specific sentiments.
- Lack of Context Understanding: These approaches often treat words in isolation without considering their context.
- Difficulty Handling Negations and Modifiers: Sentiment analysis dictionaries may struggle with handling negations (e.g., “not good”) or modifiers (e.g., “very good”)
- Limited Adaptability: Dictionary-based approaches may not easily adapt to new domains or languages without significant manual effort to update or create new sentiment lexicons.
- Vulnerability to Ambiguity: Some words may have multiple meanings or sentiments depending on the context, making it challenging for dictionary-based approaches to accurately capture their sentiment.
- Difficulty with Sarcasm and Irony: Sentiment dictionaries may struggle to detect sarcasm, irony, or other forms of figurative language, which can lead to misinterpretations of sentiment.
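The negation problem is easy to reproduce. A minimal sketch, assuming a tiny hand-made `LEXICON` and a simple "flip the next sentiment word" heuristic; both the heuristic and the list of negators are illustrative assumptions, not a recommended method:

```python
# Tiny hand-made lexicon (hypothetical scores: +1 positive, -1 negative).
LEXICON = {
    "good": 1, "amazing": 1, "beautiful": 1,
    "bad": -1, "awful": -1, "terrible": -1, "poor": -1,
}


def naive_score(text):
    """Sum lexicon scores word by word, ignoring context."""
    return sum(LEXICON.get(tok, 0) for tok in text.lower().split())


def negation_aware_score(text, negators=("not", "never", "no")):
    """Same as naive_score, but a negator flips the next sentiment word."""
    score, flip = 0, False
    for tok in text.lower().split():
        if tok in negators:
            flip = True
            continue
        value = LEXICON.get(tok, 0)
        if value:
            score += -value if flip else value
            flip = False
    return score


sentence = "the screen is not good"
naive = naive_score(sentence)            # counts "good" as positive
aware = negation_aware_score(sentence)   # flips "good" after "not"
```

The naive scorer treats the sentence as positive, while the heuristic version flips it. Even so, the heuristic does nothing for sarcasm, irony, or ambiguous words.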
SentiWordNet is a general sentiment lexicon derived from WordNet. It contains automated annotations of all WordNet synsets with sentiment scores.
3.2.1.2 Corpus-Based Sentiment Lexicon Acquisition
Methodologically, corpus-based induction of sentiment lexicons resembles the dictionary-based approach: semi-supervised learning from small initial seed sets and graph-based propagation of positive and negative sentiment
Difference:
The graph for label propagation is computed from word co-occurrences in a large corpus
The resulting lexicon is specific to the domain of the corpus
Some (simple) approaches:
(1) Sentiment consistency, conjunction of adjectives (Hatzivassiloglou & McKeown, 1997)
Adjectives conjoined by “and” have same polarity. Adjectives conjoined by “but” do not.
Step 1: Label seed set of 1336 adjectives
Step 2: Expand seed set to conjoined adjectives (look in the corpus)
Step 3: Supervised classifier assigns “polarity similarity” to each word pair
Step 4: Clustering partitions the graph into two groups
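The conjunction rule alone can be illustrated on a toy corpus. This is a simplified sketch, not the published method: it replaces the supervised classifier and the clustering step with plain breadth-first propagation from labelled seeds. The regular expression, the sentences, and the adjective list are all assumptions made for the example:

```python
import re
from collections import defaultdict, deque

# Matches "<word> and <word>" or "<word> but <word>".
PATTERN = re.compile(r"\b(\w+) (and|but) (\w+)\b")


def conjunction_links(sentences, adjectives):
    """Collect same-polarity ("and") and opposite-polarity ("but") links."""
    links = defaultdict(list)
    for s in sentences:
        for a, conj, b in PATTERN.findall(s.lower()):
            if a in adjectives and b in adjectives:
                same = conj == "and"
                links[a].append((b, same))
                links[b].append((a, same))
    return links


def propagate(seeds, links):
    """Spread seed polarities (+1 / -1) along the conjunction links."""
    polarity = dict(seeds)
    queue = deque(seeds)
    while queue:
        w = queue.popleft()
        for other, same in links.get(w, []):
            if other not in polarity:
                polarity[other] = polarity[w] if same else -polarity[w]
                queue.append(other)
    return polarity


sentences = [
    "The room was clean and spacious",
    "The staff were friendly but slow",
    "Service was slow and rude",
]
adjectives = {"clean", "spacious", "friendly", "slow", "rude"}
polarity = propagate({"clean": 1, "friendly": 1},
                     conjunction_links(sentences, adjectives))
```

On real corpora the links are noisy (for example, "and" sometimes joins adjectives of opposite polarity), which is why the original approach scores each pair with a classifier and then clusters instead of trusting every link.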
(2) Pointwise mutual information (PMI) of candidate words with seed set words (Turney & Littman, 2002)
Step 1: Extract a phrasal lexicon from reviews
Step 2: Learn polarity of each phrase
Step 3: Rate a review by the average polarity of its phrases
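Steps 2 and 3 can be sketched with co-occurrence counts. One common formulation scores a phrase by its PMI with a positive seed word minus its PMI with a negative seed word, which reduces to a log-ratio of counts. In the sketch below the counts are made up, and the function names and the smoothing constant `eps` are assumptions for illustration:

```python
import math


def semantic_orientation(near_pos, near_neg, hits_pos, hits_neg, eps=0.01):
    """PMI(phrase, positive seed) - PMI(phrase, negative seed).

    near_pos / near_neg: how often the phrase co-occurs with each seed word.
    hits_pos / hits_neg: how often each seed word occurs overall.
    eps: small smoothing constant to avoid division by zero (assumed value).
    """
    return math.log2(((near_pos + eps) * hits_neg) /
                     ((near_neg + eps) * hits_pos))


def rate_review(phrases, counts, hits_pos, hits_neg):
    """Average the semantic orientation of the phrases found in a review."""
    scores = [semantic_orientation(*counts[p], hits_pos, hits_neg)
              for p in phrases]
    return sum(scores) / len(scores)


# Hypothetical counts: phrase -> (co-occurrences with positive seed,
#                                 co-occurrences with negative seed).
counts = {"low fees": (40, 5), "long wait": (4, 32)}
so_low_fees = semantic_orientation(*counts["low fees"], 1000, 1000)
so_long_wait = semantic_orientation(*counts["long wait"], 1000, 1000)
review_score = rate_review(["low fees"], counts, 1000, 1000)
```

A positive average suggests a favourable review and a negative one an unfavourable review. Because the counts come from the corpus, the learned polarities are specific to its domain.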
(3) PMI-induced graph with PageRank label propagation and supervised learning (Glavaš and Šnajder, 2012)