Sentiment Analysis: Teaching AI to Read Market Emotions

John Maynard Keynes attributed much of economic decision-making to "animal spirits," a phrase now widely used for the waves of optimism and pessimism that push prices above and below fundamental value. Measuring these spirits was, for most of financial history, an art rather than a science. Sentiment analysis, the computational extraction of opinion and mood from text, has changed that: it gives quantitative analysts a systematic way to measure how investors feel about a company, an asset class, or the economy at large.

Sources of Market Sentiment Data

The raw material for sentiment analysis is text: news articles, financial press releases, earnings call transcripts, analyst reports, regulatory filings, and, increasingly, social media. Each source has different characteristics. News articles are generally well-sourced and professionally written, but market-moving information is often already reflected in prices by the time an article is published. Social media platforms like Twitter/X, Reddit, and Stocktwits provide near-real-time signals from retail investors, but they are noisy, prone to misinformation, and open to manipulation.

Earnings call transcripts have proven particularly rich in predictive signal. The language that CEOs and CFOs use when discussing results (their certainty or hedging, their emphasis on risks versus opportunities, the specific vocabulary they choose) has been found to correlate with subsequent stock performance. Studies have also found that managers' vocal tone during earnings calls carries incremental predictive information beyond the words alone.

Methods: From Lexicons to Transformers

Early sentiment analysis used lexicon-based approaches: pre-compiled dictionaries of positive and negative words, whose occurrences are counted and aggregated into a sentiment score. The Loughran-McDonald dictionary, developed specifically for financial text, corrects a weakness of general-purpose word lists: words like "liability," "tax," and "cost" are tagged as negative in everyday-language dictionaries but are usually neutral in a financial filing.
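
As a rough illustration of the counting step, the sketch below scores a text as the share of positive words minus the share of negative words. The two word sets are tiny, hypothetical stand-ins chosen for this example; they are not taken from the Loughran-McDonald lists, and the function name lexicon_score is likewise an assumption.

```python
import re

# Hypothetical mini-lexicons for illustration only; a real financial
# dictionary contains far more entries.
POSITIVE = {"improved", "strong", "gain", "profitable", "exceeded"}
NEGATIVE = {"impairment", "litigation", "decline", "weak", "restated"}


def lexicon_score(text: str) -> float:
    """Return (positive count - negative count) / total words, in [-1, 1]."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return (pos - neg) / len(words)
```

Dividing by the total word count keeps long and short documents comparable, though other normalisations (for example, dividing by the number of matched words only) are equally defensible.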

Pure lexicon methods were first refined with rules and then largely superseded by machine learning. VADER (Valence Aware Dictionary and sEntiment Reasoner), designed for social media text, combines lexicon scoring with rules for handling negation, intensifiers, and punctuation. Supervised classifiers such as Naive Bayes and SVMs go further by learning sentiment from annotated training data; trained on labelled financial texts, they have outperformed lexicon baselines on domain-specific datasets.
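
To make the supervised approach concrete, here is a minimal sketch of a multinomial Naive Bayes classifier with add-one smoothing, written in plain Python. The class name NaiveBayesSentiment and the four training sentences are invented for this example; a real system would train on a large labelled corpus and would normally use an established library rather than hand-rolled code.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayesSentiment:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.word_counts = {}              # label -> Counter of word frequencies
        self.doc_counts = Counter(labels)  # label -> number of training documents
        self.vocab = set()
        for text, label in zip(texts, labels):
            counts = self.word_counts.setdefault(label, Counter())
            for w in tokenize(text):
                counts[w] += 1
                self.vocab.add(w)
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_logp = None, -math.inf
        for label, counts in self.word_counts.items():
            # Start from the log prior, then add smoothed log likelihoods.
            logp = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + len(self.vocab)
            for w in tokenize(text):
                if w in self.vocab:  # ignore words never seen in training
                    logp += math.log((counts[w] + 1) / denom)
            if logp > best_logp:
                best_label, best_logp = label, logp
        return best_label


# Toy, made-up training sentences (not a real dataset).
train_texts = [
    "profit rose and margins improved",
    "revenue beat guidance on strong demand",
    "profit fell and margins narrowed",
    "revenue missed guidance on weak demand",
]
train_labels = ["positive", "positive", "negative", "negative"]
model = NaiveBayesSentiment().fit(train_texts, train_labels)
```

With this toy data, model.predict("strong demand and improved margins") returns "positive", because "strong" and "improved" appear only in the positive training sentences.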

Today, Transformer-based models, particularly FinBERT (a version of BERT fine-tuned on financial text), report some of the strongest results on financial sentiment classification benchmarks. These models capture context-dependent meanings that word-counting methods miss. "The company's losses were less severe than anticipated" is negative in absolute terms but positive relative to expectations: a lexicon sees only "losses" and "severe," whereas a contextual model can weigh the sentence as a whole.

From Sentiment Score to Trading Signal

Converting a sentiment score into a trading signal requires careful methodology. Sentiment signals decay rapidly — particularly those derived from news, which markets digest within minutes or hours. Combining sentiment with price momentum, value signals, and risk measures typically produces more robust strategies than using sentiment alone.
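
One simple way to model this decay is an exponentially weighted average in which each observation loses half its weight after a fixed interval. The sketch below assumes timestamps expressed in hours and a two-hour half-life; both are illustrative choices, not values taken from any study, and the function name decayed_sentiment is hypothetical.

```python
import math


def decayed_sentiment(observations, now, half_life_hours=2.0):
    """Exponentially weighted mean of (timestamp_hours, score) pairs.

    A score loses half its weight every `half_life_hours`.
    Observations timestamped after `now` are skipped.
    """
    decay_rate = math.log(2) / half_life_hours
    weighted_sum = 0.0
    weight_total = 0.0
    for timestamp, score in observations:
        age = now - timestamp
        if age < 0:
            continue  # never use information from the future
        weight = math.exp(-decay_rate * age)
        weighted_sum += weight * score
        weight_total += weight
    return weighted_sum / weight_total if weight_total else 0.0
```

In practice the half-life would be estimated from data and would likely differ by source, with news decaying faster than, say, earnings call language.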

Cross-sectional sentiment strategies rank stocks by their recent sentiment trend and trade long-short portfolios based on these ranks. Event-driven strategies monitor specific sentiment triggers — sudden spikes in negative sentiment around a company, for instance — and trade around them. Factor-based models incorporate sentiment as one feature among many in a broader predictive framework.
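
The cross-sectional case can be sketched as follows: sort tickers by sentiment, go long the top slice, and short the bottom slice with equal weights. The 20% cut-off, the equal weighting, and the function name long_short_weights are assumptions made for illustration; real strategies typically add risk, liquidity, and turnover constraints.

```python
def long_short_weights(sentiment_by_ticker, quantile=0.2):
    """Rank tickers by sentiment; go long the top slice, short the bottom.

    Returns equal weights summing to +1 on the long side and -1 on the
    short side, with 0.0 for every ticker in between.
    """
    ranked = sorted(sentiment_by_ticker, key=sentiment_by_ticker.get)
    n = max(1, int(len(ranked) * quantile))
    if len(ranked) < 2 * n:
        return {}  # too few names to form both legs without overlap
    shorts, longs = ranked[:n], ranked[-n:]
    weights = {ticker: 0.0 for ticker in ranked}
    for ticker in longs:
        weights[ticker] = 1.0 / n
    for ticker in shorts:
        weights[ticker] = -1.0 / n
    return weights


# Hypothetical tickers and scores, for illustration only.
scores = {"AAA": 0.8, "BBB": -0.5, "CCC": 0.1, "DDD": 0.3, "EEE": -0.9}
weights = long_short_weights(scores)
```

Because the long and short legs carry equal and opposite total weight, the resulting portfolio is dollar-neutral by construction.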

Social Media and the Reddit Effect

The January 2021 GameStop short squeeze brought retail investor sentiment to the attention of the entire financial industry. Users of the Reddit community r/WallStreetBets piled into heavily shorted stocks, driving GameStop's price from under $20 at the start of the month to an intraday peak above $480 on January 28. Funds that were monitoring social media sentiment in real time were, at least in principle, positioned to see the squeeze building before it peaked.

Since then, monitoring retail investor platforms has become standard practice at many hedge funds and market risk management desks. NLP systems parse millions of posts per day, flagging unusual patterns of interest in specific tickers, tracking sentiment shifts, and measuring the reach of viral investment narratives before they move prices.
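
A basic version of the "unusual interest" flag compares today's mention count for each ticker with that ticker's own history, using a z-score. The sketch below is one possible approach under simple assumptions: daily counts are already extracted, a threshold of three standard deviations is used, and the function name flag_unusual_mentions and the sample tickers are hypothetical.

```python
from statistics import mean, stdev


def flag_unusual_mentions(history, today, z_threshold=3.0):
    """Return tickers whose mention count today is far above their own history.

    history: dict mapping ticker -> list of past daily mention counts
    today:   dict mapping ticker -> today's mention count
    """
    flagged = {}
    for ticker, count in today.items():
        past = history.get(ticker, [])
        if len(past) < 2:
            continue  # not enough history to estimate a standard deviation
        mu, sigma = mean(past), stdev(past)
        if sigma == 0:
            continue  # flat history; a z-score is undefined
        z = (count - mu) / sigma
        if z >= z_threshold:
            flagged[ticker] = round(z, 2)
    return flagged


# Made-up counts for two hypothetical tickers.
history = {"AAA": [10, 12, 11, 9, 13], "BBB": [100, 110, 90, 105, 95]}
today = {"AAA": 60, "BBB": 104}
flagged = flag_unusual_mentions(history, today)
```

Here only "AAA" is flagged: its count of 60 sits far outside its usual range of 9 to 13, while "BBB" at 104 is well within normal variation. Mention counts are heavy-tailed, so a production system might prefer a more robust statistic such as the median absolute deviation.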

Limitations and Responsible Use

Sentiment signals are subject to the same risks as all alternative data: they can be manipulated (pump-and-dump schemes on social media), may reflect information already priced in, and can change character rapidly as market participants adapt to their use. Firms must also navigate legal and ethical boundaries around the use of non-public information and of personal data scraped from public platforms. Robust signal research, proper anonymisation of personal data in the underlying sources, and legal review of new data contracts are essential components of responsible sentiment-based trading.