Saturday, April 23, 2011

Technology: About the Carbon Capture Report



The Technology of the Carbon Capture Report

The Carbon Capture Report builds on more than a decade of work at the University of Illinois and the National Center for Supercomputing Applications (birthplace of the modern web browser), leveraging some of the world's most powerful supercomputers to synthesize the global news media and extract new meaning from it on behalf of industry and government.


Current Approaches

There are over 10,000 news clipping services in existence today. Yet the vast majority are built either by technologists applying trendy software techniques without an understanding of how news functions, or by media services overhyping simplistic approaches.

As electronic news archives become more accessible, technologists are increasingly applying new software techniques to them, yet often lack the understanding of how news functions needed to properly vet their work. In 2009, computer scientists at a top university published a study examining the interaction between the mainstream media and the blogosphere. From a computational perspective the project was extremely novel: it used genetic typing algorithms to compute story signatures across a news datastore of tens of millions of articles, required algorithmic advances to make the computation feasible, and used diffusion modeling to estimate the lead/lag interaction of blogs and the mainstream press. Once communications and journalism scholars began exploring the study, however, they recognized a fundamental flaw in its logic: it relied on identifying and tracking the diffusion of "soundbites" across news coverage, on the assumption that every article about a story would use the same soundbite. Yet journalists are trained to find their own unique angle on each story and to avoid reusing text and soundbites that others have already used, and soundbites are shortened and permuted over time. Thus, while groundbreaking from a computational standpoint, the study's media findings are limited by its authors' poor understanding of how the news ecosystem functions.
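To see why the scholars objected, consider a deliberately tiny Python sketch (the sentences are invented, not data from the study). The 2009 project clustered approximate variants of phrases across millions of articles, but any approach that depends on a soundbite surviving recognizably intact loses the shortened and fully paraphrased retellings that journalists actually produce.

    # Minimal sketch, hypothetical sentences: exact soundbite matching undercounts.
    articles = [
        'The minister said the plan will "cut emissions by half within a decade".',
        'She promised to "cut emissions by half" -- a pledge analysts doubt.',   # shortened
        'Officials vowed that emissions would be halved within ten years.',      # paraphrased
    ]

    soundbite = "cut emissions by half within a decade"
    hits = [a for a in articles if soundbite in a]
    print(len(hits), "of", len(articles), "articles matched")
    # -> 1 of 3: the shortened and paraphrased versions vanish from the count,
    #    which is the blind spot the media scholars identified.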

Google News, one of the best-known news aggregators, offers basic clustering, grouping news articles from multiple outlets about the same story. Yet Google's approach is designed for coarse-scale clustering over the corpus of all news topics on the entire Internet in a given day. All stories about a company's revised efforts to clean up an environmental spill might be grouped together, preventing an analyst from exploring the distinct lineage and trajectory of the myriad substories involved. From the standpoint of all Internet news coverage, Google News' approach is ideal, since it groups content by core themes, but within a more constrained topic such as CCS or nuclear power, those trying to sort through the complexities of a large "metaevent" need the ability to filter coverage at a very fine resolution. Instead of traditional coarse statistical clustering, the Carbon Capture Report relies on sophisticated language models of each day's coverage and uses them to identify core "storylines" at very high precision, separating coverage of a proposed CCS bill in Canada's Senate from one moving through the US Senate at the same time.
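The difference in resolution can be illustrated with a minimal Python sketch (the headlines are invented, and the Report's actual storyline models are far richer than this). Coarse term overlap happily merges the two Senate stories into one cluster; even a crude check on the jurisdiction entities in each headline keeps them apart.

    # Minimal sketch, hypothetical headlines: coarse clustering vs. entity-aware storylines.
    def tokens(text):
        return set(text.lower().split())

    def jaccard(a, b):
        return len(a & b) / len(a | b)

    canada = tokens("senate committee advances carbon capture and storage bill in canada")
    us     = tokens("us senate debates carbon capture and storage bill amendments")

    sim = jaccard(canada, us)
    print("coarse overlap:", round(sim, 2), "-> merged?", sim > 0.4)   # merged into one cluster

    # A storyline-level model also checks which jurisdiction each story is about.
    JURISDICTIONS = {"canada", "us", "ottawa", "washington"}
    same_story = sim > 0.4 and (canada & JURISDICTIONS) == (us & JURISDICTIONS)
    print("entity-aware merge?", same_story)                           # False: two storylines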

Similarly, seemingly advanced capabilities like "geographic insights" are often far more limited than the marketing hype suggests. For example, the "blogosphere geocoding" service offered by one major media monitoring company only extracts the author's self-listed location (relying on the standardized author profile formats used by the major blogging platforms) and recognizes only major capital city and country references. A random sample of 100 energy-sector blogs hosted on the platforms supported by the company's product found that not a single one included geographic information. More importantly, more than 80% of the blogs discussing key energy sectors are not hosted on the core blogging platforms that provide a common geographic format. Applying this company's technology to energy coverage would therefore yield almost no usable geographic information. The Carbon Capture Report, on the other hand, takes a multi-layered approach to geographic information. In addition to using domain records and intelligent machine-learning approaches to automatically examine an outlet or Twitter account for geographic assignment information, it examines the full text of each article or post and uses a hyperlocal global geocoding system (capable of resolving the small city and landmark names often used in local news outlets and blogs) to identify every locative reference in the text, contextually disambiguate it (is it Cairo, Egypt or Cairo, Illinois?), and resolve it to latitude/longitude coordinates. This uniquely permits the Carbon Capture Report to offer insight into both where coverage originates and where it focuses.
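The disambiguation step can be sketched in a few lines of Python. The gazetteer below is a toy with approximate coordinates, and the shared-context heuristic only illustrates the idea; it is not the Report's actual geocoder.

    # Minimal sketch, toy gazetteer: resolving "Cairo" from surrounding context words.
    GAZETTEER = {
        "cairo": [
            {"name": "Cairo, Egypt",    "lat": 30.04, "lon": 31.24,
             "context": {"egypt", "nile", "giza"}},
            {"name": "Cairo, Illinois", "lat": 37.01, "lon": -89.18,
             "context": {"illinois", "ohio", "mississippi"}},
        ],
    }

    def resolve(place, passage):
        words = set(passage.lower().split())
        candidates = GAZETTEER.get(place.lower(), [])
        # Prefer the candidate sharing the most context words with the passage.
        return max(candidates, key=lambda c: len(c["context"] & words), default=None)

    text = "The plant near Cairo sits on the Illinois side of the Mississippi river."
    hit = resolve("Cairo", text)
    print(hit["name"], (hit["lat"], hit["lon"]))   # Cairo, Illinois (37.01, -89.18)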


Insights, Not Ad Buys

Finally, even the largest vendors rely primarily on the overall volume of coverage for their core metrics. For example, a typical "influencer" report is based on a rank ordering of the number of stories from each outlet that have gone on to be picked up by other outlets. Such a domain histogram will show that only 24% of New York Times coverage of carbon capture and sequestration (CCS) over the past 12 months has been covered elsewhere, while 69% of Oil Week's has, or that the same story is far more likely to be picked up in social media when it runs in a British paper than in a United States paper. Such patterns are all that the customers of these tools are looking for: they want to know which papers to target with advertising campaigns to help sell their brands.
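A domain-level histogram of this kind is straightforward to produce, which is part of the point. Here is a minimal Python sketch with made-up records:

    # Minimal sketch, hypothetical data: for each outlet, the share of its stories
    # later picked up elsewhere -- the core of a typical "influencer" report.
    from collections import defaultdict

    stories = [                                   # (outlet, was it picked up elsewhere?)
        ("nytimes.com", True), ("nytimes.com", False), ("nytimes.com", False),
        ("oilweek.example", True), ("oilweek.example", True), ("oilweek.example", False),
    ]

    totals, pickups = defaultdict(int), defaultdict(int)
    for outlet, picked_up in stories:
        totals[outlet] += 1
        pickups[outlet] += picked_up

    for outlet in sorted(totals, key=lambda o: pickups[o] / totals[o], reverse=True):
        print(f"{outlet}: {pickups[outlet] / totals[outlet]:.0%} of stories picked up")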

Few CCS practitioners, climate scientists, or wind power companies are in the market for large advertising campaigns: they are trying to understand the popular perception of their industry at large and predict how coverage will impact them. A simple domain-level histogram, even with a few formulaic twists, may yield insight into the top 10 outlets whose stories tend to go viral (and help guide advertising expenditures), but it does little to help understand the broader industry: the average mainstream outlet in CCS accounts for less than five hundredths of a percent of overall coverage, while 84% of CCS coverage comes from outlets that publish fewer than two articles a month on the subject. Modeling the news environment in such a field requires far more advanced methodologies. The Carbon Capture Report is leveraging new advances in language modeling at the University of Illinois to develop a model that, in early tests, correctly identifies nearly a third of all coverage each day as having no chance of going viral, helping to filter out articles that will have no further impact on the media sphere. Network diffusion models built to understand how news exchange across national borders has changed over the last half-century are being leveraged to model news interchange among energy sector outlets. Finally, actor-based interaction models, developed for business intelligence applications, are being applied to news to extract both structural and semantic (hidden) connections among the people and firms in the field.
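The virality-filtering step can be sketched with a small text classifier. The example below uses scikit-learn and a handful of invented training articles; the Report's production language models are far larger, so this shows only the shape of the filtering idea, not its actual implementation.

    # Minimal sketch, hypothetical training data: flag articles unlikely to be
    # picked up anywhere else so they can be dropped from further tracking.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    texts = [
        "Government announces billion dollar carbon capture demonstration plant",
        "Major utility cancels flagship CCS project amid cost overruns",
        "Local firm updates office hours for the holiday season",
        "Company reposts last year's press release about routine maintenance",
    ]
    went_viral = [1, 1, 0, 0]          # 1 = picked up elsewhere, 0 = no further impact

    model = make_pipeline(TfidfVectorizer(), LogisticRegression())
    model.fit(texts, went_viral)

    new_article = "Regional supplier announces updated office hours"
    p_viral = model.predict_proba([new_article])[0][1]
    print(f"estimated pickup probability: {p_viral:.2f}")
    # Articles falling below a chosen threshold would be filtered from tracking.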

Because the Carbon Capture Report is a research platform, used for more than a decade to continually prototype capabilities for government and industry that are not yet available commercially, there simply are no commercial toolkits that provide the same level of capability as its systems. Each component of the Carbon Capture Report relies on unique software libraries, methods, and algorithms developed at the University of Illinois.


Name Recognition

The Carbon Capture Report uses software algorithms to read through each news article, blog posting, and tweet, and identify mentions of people or organizations. Within the realm of Computer Science, this is known as Information Extraction (IE), or, more specifically, Named Entity Recognition (NER). All current systems use one of two techniques: Lexicons or Statistical Context Models.

Lexicon systems simply use a huge database of common given names and surnames and search for every capitalized phrase that contains one of these words. Their lexicons are usually built from readily available databases like US Census reports, so they tend to perform adequately on American and Western European names but fail miserably on African, Asian, and Middle Eastern names. Non-English names are often transliterated when they appear in English text, while lexicons usually include only a minimal set of alternative spellings and so miss most of them. Some systems augment their lexicons with a list of everyone who has a biography page on Wikipedia or another encyclopedia, but this only helps with world figures and celebrities.
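A toy lexicon tagger, sketched in Python with a deliberately tiny name list, shows both how the approach works and why it silently misses names outside its lexicon.

    # Minimal sketch, toy lexicon: keep capitalized runs containing a known name.
    import re

    NAME_LEXICON = {"john", "smith", "mary", "johnson"}   # real systems use census-scale lists

    def lexicon_names(text):
        candidates = re.findall(r"(?:[A-Z][a-z]+ ?)+", text)   # capitalized runs of words
        return [c.strip() for c in candidates
                if any(w.lower() in NAME_LEXICON for w in c.split())]

    sentence = "John Smith met Nguyen Thi Minh Khai and Dmitri Voronkov in Springfield."
    print(lexicon_names(sentence))   # ['John Smith'] -- the other names are missed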

Statistical Context Models involve rooms full of trained analysts reading through tens of thousands of news articles and highlighting every person and company referenced in the collection. This labeled material is then passed to a computer modeling system that builds statistical tables of which words appear far more often immediately before and after person names than next to other words. For example, "Mr." or "President" tend to appear in front of a person's name more often than in front of other words, while "said" or "announced" commonly appear adjacent to person or company names. Such models can perform well on the material they were originally trained on, but less accurately on other material, because they often tune themselves to the distinct stylistic conventions of their training data rather than to language itself: a system trained on New York Times material will heavily tune itself to the style guidelines of the New York Times and work well for that paper, but more poorly for others. Nearly all statistical systems in production today were trained on formal news articles and require sentences with perfect grammatical construction: blogs and Twitter posts, with their more informal writing, tend to fail under such systems.
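The context-cue idea can be miniaturized in a few lines of Python (the hand-labeled examples are invented): tally which neighboring words tend to flank names in the labeled sample, then score new capitalized candidates by those cues.

    # Minimal sketch, hypothetical labels: learn context cues for person names.
    from collections import Counter

    labeled = [                          # (previous word, candidate, next word, is a name?)
        ("Mr.", "Jones", "said", True),
        ("President", "Alvarez", "announced", True),
        ("Spokeswoman", "Chen", "said", True),
        ("the", "Department", "issued", False),
    ]

    before_name, after_name = Counter(), Counter()
    for prev, _, nxt, is_name in labeled:
        if is_name:
            before_name[prev] += 1
            after_name[nxt] += 1

    def name_score(prev, nxt):
        return before_name[prev] + after_name[nxt]

    print(name_score("Mr.", "announced"))   # 2 -> strong evidence of a person name
    print(name_score("the", "plant"))       # 0 -> weak evidence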

The Carbon Capture Report uses highly specialized technology developed over 15 years of working with informal material such as emails, newsgroups, social media, and translated news from across the world. Rather than relying on limited dictionaries or source-specific statistical tables, the Carbon Capture Report system "reads" a document the way a human does, using the same cues people rely on to identify names in text. Incorporating approaches from linguistics, psychology, and neuroscience, it models the multichannel information transference coding process the brain's language centers undergo when comprehending a document, and uses this to resolve name references even in semantically ambiguous cases, much as a human reader does.


Geocoding

Some news services display a map beside each article with a pinpoint based on the article's "byline", the city in which the article was filed. For example, a newswire article filed in Washington, DC about Texas energy regulations would carry a DC byline, not a Texas one. Bylines rarely capture the actual geographic focus of an article and cannot reflect stories that span several regions (an article about a partnership between Texas and Illinois could feature only one of the states in its byline), so they offer little real insight. To capture the full geographic range of an article, the Carbon Capture Report reads the entire article, using machine learning models to automatically identify references to every known city and geographic landmark on the planet, resolve them to precise points on a map, and cross-link them to the contents of the article. The Carbon Capture Report is the only news intelligence platform in the world that offers this level of deep geographic insight.
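The contrast can be sketched in Python with a toy place list and approximate coordinates: the byline pins the Washington-filed story to DC, while scanning the full text also recovers the Texas and Illinois references the article is actually about.

    # Minimal sketch, toy place list: byline-only mapping vs. full-text geocoding.
    PLACES = {
        "washington": (38.9, -77.0),
        "texas":      (31.0, -99.0),
        "illinois":   (40.0, -89.0),
    }

    article = {
        "byline": "WASHINGTON",
        "text": "Regulators in Washington weighed new rules for a carbon capture "
                "partnership between Texas and Illinois utilities.",
    }

    byline_point = PLACES[article["byline"].lower()]
    fulltext_points = {name: coords for name, coords in PLACES.items()
                       if name in article["text"].lower()}

    print("byline only:", byline_point)       # a single pin, in DC
    print("full text:  ", fulltext_points)    # DC, Texas, and Illinois all recovered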


Real Ranking

Search engines use a complex formula called a "ranking algorithm" to sort and deliver the most relevant results first. While these algorithms can incorporate thousands of measures of a given page, a key factor in nearly every modern search engine is some form of "score" assigned to each web site that determines how "authoritative" content from that site is considered overall. In Google's PageRank™ model, sites that are more heavily linked-to are considered more authoritative.
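The core link-authority idea can be sketched in a few lines of Python: on a toy link graph, simple power iteration gives the highest score to the most heavily linked-to site. (This is the textbook formulation of the idea, not Google's production algorithm.)

    # Minimal sketch, toy link graph: PageRank-style scoring by power iteration.
    links = {                  # who links to whom (hypothetical sites)
        "a": ["b", "c"],
        "b": ["c"],
        "c": ["a"],
        "d": ["c"],
    }

    damping, sites = 0.85, list(links)
    rank = {s: 1.0 / len(sites) for s in sites}

    for _ in range(50):
        new = {s: (1 - damping) / len(sites) for s in sites}
        for src, targets in links.items():
            for t in targets:
                new[t] += damping * rank[src] / len(targets)
        rank = new

    for site, score in sorted(rank.items(), key=lambda kv: -kv[1]):
        print(site, round(score, 3))   # "c", the most linked-to site, scores highest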

Yet under all of these systems, each website has only a single score: its authority relative to the entire web, with its billions of pages and topics. A niche website, no matter how heavily followed within its topical community, will receive a very low overall score and be ranked below general-purpose sites in search results.

The Carbon Capture Report introduces the concept of "industry-specific site ranks." Instead of a global ranking that attempts to score a site against the billions of topics on the web, the Carbon Capture Report ranks each site on its standing within a single industry. Advanced machine learning algorithms continually "learn" about the people, organizations, places, and outlets in each monitored industry and use this information to map out and model how news flows among all of these entities. By constructing massive network models of global news flow, the Carbon Capture Report platform predicts which stories and news outlets are likely to receive further attention, identifies those that have already "resonated" with the industry, and develops rankings of news outlets based on their importance within a specific industry, helping users understand the media landscape and prioritize the news most likely to have an impact.
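One way to picture the difference is a short Python sketch with hypothetical numbers. Given industry-specific outlet ranks (computed, say, by running the same power iteration as the earlier sketch over the industry's own news-flow graph rather than the whole web), a breaking story's early pickups can be weighted by the rank of the outlets carrying it, which is the flavor of "resonance" scoring described above.

    # Minimal sketch, hypothetical ranks: weight a story's early pickups by the
    # industry-specific rank of each outlet that carried it.
    ccs_rank = {
        "tradeweekly.example": 0.41,
        "wireservice.example": 0.22,
        "bigpaper.example":    0.19,
        "nicheblog.example":   0.08,
    }

    def resonance(picked_up_by):
        return sum(ccs_rank.get(outlet, 0.0) for outlet in picked_up_by)

    story_a = ["nicheblog.example"] * 3                        # three low-rank pickups
    story_b = ["tradeweekly.example", "wireservice.example"]   # two high-rank pickups

    print(round(resonance(story_a), 2))   # 0.24
    print(round(resonance(story_b), 2))   # 0.63 -> more likely to keep spreading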
