About

Lexicalist reads through millions of words of chatter on the internet to analyze how certain demographics talk and what kinds of things they talk about. We currently break this information down into three kinds of demographics: age, gender, and geography.

For all of these examples, we'll look at a search for the keyword "twilight" on April 11, 2010 (the demographics change over time).

Age and Gender

The word "twilight" can refer to a couple of different things - the time of day just before nightfall, a figurative gradual decline - but many today use it in the context of Stephenie Meyer's book series of the same name. By applying natural language processing techniques to the information we find online, we've gathered that the largest demographic using this word is 12-17 year old females, followed by 25-34 year olds and 18-24 year olds. Women make up 66.5% of this population.

Geography

Using the same analytical methods, we've found that the use of "twilight" is fairly dispersed throughout the country, though it's especially prominent in Alabama and Washington (more than twice as popular there as in any other state). Contrast this with keywords like "Cubs" or "Brett Favre", which are used very heavily in specific geographical areas.

Related Words

The "Related Words" section automatically finds words and phrases that are semantically similar to the original keyword, either by virtue of appearing very close to the original or by showing up in similar contexts (e.g., "oranges" and "apples" are semantically similar because they both appear in the company of words such as "tree" and "eat"). Here's the "Related Words" section for "twilight":

Here "twilight" is semantically related to the titles of the eponymous books and movies ("saga", "eclipse") and to characters within them ("edward", "bella"). The words that appear here are not judged to be similar by any editor, but are chosen completely algorithmically by analyzing millions of words of recent chatter.
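
To make the "similar contexts" idea concrete, here is a minimal sketch in Python. It is not Lexicalist's actual algorithm, which isn't described here; it assumes a simple approach in which each word is represented by counts of its neighboring words and two words are compared with cosine similarity. The function names (context_vectors, cosine), the window size, and the toy corpus are all made up for illustration.

```python
from collections import Counter, defaultdict
import math


def context_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy corpus, invented for this example only.
corpus = [
    "we eat apples from the tree",
    "we eat oranges from the tree",
    "the cubs play baseball in chicago",
]

vectors = context_vectors(corpus)
sim_fruit = cosine(vectors["apples"], vectors["oranges"])  # shared neighbors
sim_other = cosine(vectors["apples"], vectors["cubs"])     # few shared neighbors
print(sim_fruit > sim_other)
```

In this toy corpus "apples" and "oranges" have identical neighbors, so they score higher against each other than "apples" does against "cubs". A real system working over millions of words would need far more careful tokenization and weighting than this sketch uses.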

Methods

Lexicalist works by analyzing rich sources of information online, including blog posts, news sources, and social networking sites like Twitter. Each bit of information is subjected to rigorous natural language processing, which includes estimating a likelihood distribution over all geographic, age, and gender demographics for who authored it.

All of the statistical results displayed here are then normalized against the volume of information coming from each demographic to see what words are most commonly associated with certain populations. The result is a descriptive snapshot of language as it's used today.
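
As a rough illustration of why this normalization matters, the sketch below assumes each document arrives with a probability distribution over demographic groups, and divides each group's share of keyword mentions by that group's share of all words. This is one plausible reading of the description above, not the actual Lexicalist pipeline; the function name normalized_usage, the group labels, and the sample documents and probabilities are invented for the example.

```python
from collections import defaultdict


def normalized_usage(documents, keyword):
    """Return keyword mentions per word of text, per demographic group.

    `documents` is a list of (text, group_probs) pairs, where group_probs
    maps a group label to the estimated probability that the author
    belongs to that group.
    """
    keyword_mass = defaultdict(float)  # expected keyword mentions per group
    total_mass = defaultdict(float)    # expected total words per group
    for text, group_probs in documents:
        tokens = text.lower().split()
        hits = tokens.count(keyword)
        for group, prob in group_probs.items():
            keyword_mass[group] += prob * hits
            total_mass[group] += prob * len(tokens)
    # Dividing by total volume keeps a talkative group from dominating
    # simply because it writes more.
    return {
        group: keyword_mass[group] / total_mass[group]
        for group in total_mass
        if total_mass[group] > 0
    }


# Hypothetical documents and author-likelihood estimates.
docs = [
    ("twilight was great", {"12-17": 0.8, "25-34": 0.2}),
    ("the meeting ran long today", {"12-17": 0.1, "25-34": 0.9}),
    ("twilight premiere tonight", {"12-17": 0.5, "25-34": 0.5}),
]

rates = normalized_usage(docs, "twilight")
for group, rate in sorted(rates.items()):
    print(group, round(rate, 3))
```

Because the counts are divided by each group's overall volume, the result reflects how characteristic the keyword is of a group rather than how much that group writes in total.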