Web Mining Review Notes 3: Web Content Mining

2024-06-02 22:36

3. Web Content Mining

3.1 Introduction to Sentiment Analysis / Opinion Mining

Detecting stances and opinions towards people, companies, and products/services has tremendous business value: improving products and services, targeted advertising, revealing trends in election campaigns, …

Sentiment analysis or opinion mining is the computational study of people’s opinions, appraisals, attitudes, and emotions towards entities, individuals, issues, events, topics, and their attributes (aspects).

A general sentiment analysis framework aims to answer:

Who is the opinion holder? -> Opinion holder
Towards whom or what is opinion/sentiment expressed? -> Target
What is the polarity and intensity of the opinion?
Is an opinion associated with a time-span?

Opinion

3.2 Constructing Sentiment Lexicons

Sentiment clues (opinion words, sentiment-bearing words) – words and phrases used to express some desired or undesired state
Positive clues: good, amazing, beautiful
Negative clues: bad, awful, terrible, poor

Sentiment clues are often domain-dependent => Separate sentiment lexicons need to be constructed for different domains
Example: “quiet” is a negative clue for a speakerphone but a positive clue for a car engine

3.2.1 Automated acquisition of sentiment lexicons

Automated acquisition of sentiment lexicons is most often semi-supervised (or weakly supervised)

Start from a small seed lexicon of sentiment words
Iteratively augment the lexicon based on links between words already in the lexicon and words in a large general lexicon or a large corpus
Stop when there are no more reliable candidate words to be added to the lexicon

Approaches for constructing sentiment lexicons are either dictionary-based or corpus-based

Often there is a final step of manual cleansing of automatically derived sentiment lexicons

3.2.1.1 Dictionary-Based Sentiment Lexicon Acquisition

Bootstrapping from a small seed sentiment lexicon, e.g., 10 positive and 10 negative sentiment words
Idea: exploit semantic links between words in the general lexicon, e.g., synonymy and antonymy links in WordNet. The procedure is typically iterative
Additional information can be used to build better lists: WordNet glosses or machine learning (classification based on concept definitions)
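
The bootstrapping idea above can be sketched in a few lines. This is only a minimal illustration under stated assumptions: the SYNONYMS and ANTONYMS tables are tiny hypothetical stand-ins for the links a real general lexicon such as WordNet would provide, and expand_lexicon is an illustrative name, not part of any library. Synonyms inherit the polarity of the word they are linked to, antonyms receive the opposite polarity, and the loop stops when no new words are found.

```python
# Hypothetical toy links standing in for a general lexicon such as WordNet.
SYNONYMS = {
    "good": ["fine", "great"],
    "great": ["excellent"],
    "bad": ["poor", "awful"],
    "awful": ["terrible"],
}
ANTONYMS = {
    "good": ["bad"],
    "excellent": ["terrible"],
}


def expand_lexicon(seeds, synonyms, antonyms, max_iterations=5):
    """Propagate polarity (+1 / -1) from seed words along synonym and antonym links."""
    lexicon = dict(seeds)
    frontier = list(seeds)
    for _ in range(max_iterations):
        new_words = []
        for word in frontier:
            polarity = lexicon[word]
            # Synonyms keep the polarity of the word they are linked to.
            for syn in synonyms.get(word, []):
                if syn not in lexicon:
                    lexicon[syn] = polarity
                    new_words.append(syn)
            # Antonyms get the opposite polarity.
            for ant in antonyms.get(word, []):
                if ant not in lexicon:
                    lexicon[ant] = -polarity
                    new_words.append(ant)
        if not new_words:
            break  # no more candidates to add
        frontier = new_words
    return lexicon


lexicon = expand_lexicon({"good": 1}, SYNONYMS, ANTONYMS)
for word, polarity in sorted(lexicon.items()):
    print(word, polarity)
```

In this sketch the first label a word receives wins; a real procedure would need some way to resolve conflicting evidence, which is one reason a manual cleansing step often follows.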

Cons:

Limited Coverage: Dictionary-based lexicons may miss nuanced or domain-specific sentiments.
Lack of Context Understanding: These approaches often treat words in isolation without considering their context.
Difficulty Handling Negations and Modifiers: Sentiment dictionaries may struggle with negations (e.g., “not good”) or modifiers (e.g., “very good”).
Limited Adaptability: Dictionary-based approaches may not easily adapt to new domains or languages without significant manual effort to update or create new sentiment lexicons.
Vulnerability to Ambiguity: Some words may have multiple meanings or sentiments depending on the context, making it challenging for dictionary-based approaches to accurately capture their sentiment.
Difficulty with Sarcasm and Irony: Sentiment dictionaries may struggle to detect sarcasm, irony, or other forms of figurative language, which can lead to misinterpretations of sentiment.
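
The negation and modifier problem can be made concrete with a small sketch. Everything here is an assumption for illustration: the LEXICON, NEGATORS, and INTENSIFIERS tables are hypothetical toy data, and context_aware_score is one simple heuristic (look at the one or two preceding tokens), not a method described in these notes. A plain word-by-word lookup scores “not good” as positive, while the heuristic flips the sign.

```python
# Hypothetical toy resources, for illustration only.
LEXICON = {"good": 1.0, "amazing": 1.0, "bad": -1.0, "terrible": -1.0}
NEGATORS = {"not", "never", "no"}
INTENSIFIERS = {"very": 1.5, "extremely": 2.0}


def naive_score(text):
    """Sum lexicon scores word by word, ignoring all context."""
    return sum(LEXICON.get(tok, 0.0) for tok in text.lower().split())


def context_aware_score(text):
    """Same lookup, but scale by a preceding intensifier and flip on a preceding negator."""
    tokens = text.lower().split()
    score = 0.0
    for i, tok in enumerate(tokens):
        if tok not in LEXICON:
            continue
        value = LEXICON[tok]
        prev = tokens[i - 1] if i > 0 else ""
        prev2 = tokens[i - 2] if i > 1 else ""
        if prev in INTENSIFIERS:
            value *= INTENSIFIERS[prev]
            prev = prev2  # look one token further back for a negator
        if prev in NEGATORS:
            value = -value
        score += value
    return score


for phrase in ["good", "not good", "very good", "not very good"]:
    print(phrase, naive_score(phrase), context_aware_score(phrase))
```

Such window-based rules only cover the simplest cases; sarcasm, irony, and long-range negation remain out of reach for this kind of lookup.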

SentiWordNet is a general sentiment lexicon derived from WordNet. It contains automated annotations of all WordNet synsets with sentiment scores.

3.2.1.2 Corpus-Based Sentiment Lexicon Acquisition

Methodologically, corpus-based induction of sentiment lexicons resembles the dictionary-based approach: semi-supervised learning from small initial seed sets and graph-based propagation of positive and negative sentiment
Differences:
The graph for label propagation is computed from word co-occurrences in a large corpus
The resulting lexicon is specific to the domain of the corpus

Some (simple) approaches:
(1) Sentiment consistency, conjunction of adjectives (Hatzivassiloglou & McKeown, 1997)
Adjectives conjoined by “and” tend to have the same polarity. Adjectives conjoined by “but” typically do not.

Step 1: Label a seed set of 1336 adjectives
Step 2: Expand the seed set to conjoined adjectives (look in the corpus)
Step 3: A supervised classifier assigns a “polarity similarity” to each word pair
Step 4: Cluster to partition the graph into two groups
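
The conjunction heuristic can be illustrated with a deliberately simplified sketch. Note the assumptions: the CORPUS and ADJECTIVES are hypothetical toy data, the regular expression replaces proper part-of-speech tagging, and the supervised classifier and clustering of steps 3 and 4 are swapped for a plain breadth-first propagation over signed edges (“and” gives +1, “but” gives -1). It shows the sentiment-consistency idea, not the original method.

```python
import re
from collections import defaultdict, deque

# Hypothetical toy data; a real system would use a large corpus and a POS tagger.
ADJECTIVES = {"fair", "legitimate", "corrupt", "brutal", "simple", "elegant"}
CORPUS = [
    "The ruling was fair and legitimate.",
    "The regime was corrupt and brutal.",
    "The process was fair but brutal.",
    "The design is simple and elegant.",
]

PATTERN = re.compile(r"\b(\w+) (and|but) (\w+)\b")


def build_graph(sentences, adjectives):
    """Link conjoined adjectives: 'and' -> same polarity (+1), 'but' -> opposite (-1)."""
    graph = defaultdict(list)
    for sentence in sentences:
        for left, conj, right in PATTERN.findall(sentence.lower()):
            if left in adjectives and right in adjectives:
                sign = 1 if conj == "and" else -1
                graph[left].append((right, sign))
                graph[right].append((left, sign))
    return graph


def propagate(graph, seeds):
    """Breadth-first propagation of seed polarities along the signed edges."""
    polarity = dict(seeds)
    queue = deque(seeds)
    while queue:
        word = queue.popleft()
        for neighbour, sign in graph[word]:
            if neighbour not in polarity:
                polarity[neighbour] = polarity[word] * sign
                queue.append(neighbour)
    return polarity


graph = build_graph(CORPUS, ADJECTIVES)
polarity = propagate(graph, {"fair": 1})
print(polarity)
```

Words that are never conjoined with anything reachable from the seeds (here “simple” and “elegant”) stay unlabeled, which is why the size and coverage of the seed set matter.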

(2) Pointwise mutual information (PMI) of candidate words with seed set words (Turney & Littman, 2002)

Step 1: Extract a phrasal lexicon from reviews
Step 2: Learn polarity of each phrase
Step 3: Rate a review by the average polarity of its phrases
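
A rough sketch of steps 2 and 3, under several assumptions: pointwise mutual information is computed as PMI(x, y) = log2(p(x, y) / (p(x) p(y))), the polarity of a phrase is taken as its PMI with a positive seed minus its PMI with a negative seed, probabilities are estimated from document-level co-occurrence in a tiny hypothetical corpus (DOCS), and the seed words “excellent” and “poor”, the smoothing constant, and all function names are illustrative choices rather than details given in these notes.

```python
import math

# Hypothetical toy reviews; each one is treated as a single co-occurrence window.
DOCS = [
    "excellent battery life and a bright screen",
    "battery life is excellent",
    "poor customer service",
    "customer service was poor and slow",
    "bright screen but poor customer service",
]


def count(docs, *terms):
    """Number of documents that contain every given term (whole-word match)."""
    return sum(all(f" {t} " in f" {d} " for t in terms) for d in docs)


def pmi(docs, x, y, smoothing=0.01):
    """PMI from document counts; assumes x and y each occur at least once.

    The joint count is smoothed to avoid log(0) when x and y never co-occur.
    """
    n = len(docs)
    joint = count(docs, x, y) + smoothing
    return math.log2(n * joint / (count(docs, x) * count(docs, y)))


def semantic_orientation(docs, phrase, pos="excellent", neg="poor"):
    """Polarity of a phrase: association with the positive seed minus the negative seed."""
    return pmi(docs, phrase, pos) - pmi(docs, phrase, neg)


def rate_review(docs, phrases):
    """Step 3: average polarity of the phrases extracted from one review."""
    scores = [semantic_orientation(docs, p) for p in phrases]
    return sum(scores) / len(scores)


so_battery = semantic_orientation(DOCS, "battery life")
so_service = semantic_orientation(DOCS, "customer service")
print(so_battery, so_service, rate_review(DOCS, ["battery life", "bright screen"]))
```

A positive score means the phrase co-occurs more with the positive seed than with the negative one; with a corpus this small the magnitudes are meaningless and only the signs are worth reading.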

Step 1