Statistically Improbable Phrase

A statistically improbable phrase (SIP) is a phrase or set of words that occurs more frequently in a document (or collection of documents) than in some larger corpus.[1][2][3] Amazon.com uses this concept in determining keywords for a given book or chapter, since keywords of a book or chapter are likely to appear disproportionately within that section.[4][5] Christian Rudder has also used this concept with data from online dating profiles and Twitter posts to determine the phrases most characteristic of a given race or gender in his book Dataclysm.

https://en.m.wikipedia.org/wiki/Statistically_improbable_phrase

Amazon.com's Statistically Improbable Phrases, or "SIPs", are the most distinctive phrases in the text of books in the Search Inside! program. To identify SIPs, our computers scan the text of all books in Search Inside. If they find a phrase that occurs a large number of times in a particular book relative to all Search Inside books, that phrase is a SIP in that book. SIPs are not necessarily improbable within a particular book, but they are improbable relative to all books in Search Inside. For example, most SIPs for a book on taxes are tax related. But because we display SIPs in order of their improbability score, the first SIPs will be on tax topics that this book mentions more often than other tax books. For works of fiction, SIPs tend to be distinctive word combinations that often hint at important plot elements.
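Amazon does not publish how the improbability score is computed, so the following is only a minimal sketch of the general idea: rank each phrase in one book by how over-represented it is relative to the whole collection. The function names (`ngrams`, `improbability_scores`), the simple frequency-ratio score, the add-one smoothing, and the `min_count` cutoff are all assumptions made for illustration, not Amazon's method.

```python
from collections import Counter
import re


def ngrams(text, n=2):
    """Lowercase the text, split it into word tokens, and return all n-word phrases."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def improbability_scores(book, corpus, n=2, min_count=2):
    """Rank phrases in `book` by over-representation relative to `corpus`.

    score = (relative frequency in book) / (smoothed relative frequency in corpus)

    This ratio is an illustrative assumption, not a published formula.
    `corpus` is a list of document strings and should include `book` itself.
    """
    book_counts = Counter(ngrams(book, n))
    corpus_counts = Counter()
    for doc in corpus:
        corpus_counts.update(ngrams(doc, n))

    book_total = sum(book_counts.values())
    corpus_total = sum(corpus_counts.values())
    vocab = len(set(book_counts) | set(corpus_counts))

    scores = {}
    for phrase, count in book_counts.items():
        if count < min_count:
            continue  # skip phrases too rare in the book to be meaningful
        p_book = count / book_total
        # Add-one smoothing keeps the denominator nonzero for unseen phrases.
        p_corpus = (corpus_counts[phrase] + 1) / (corpus_total + vocab)
        scores[phrase] = p_book / p_corpus

    # Highest score first, mirroring "in order of their improbability score".
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Calling `improbability_scores(book_text, all_book_texts)` returns `(phrase, score)` pairs, most distinctive first. A phrase that is frequent in the book but also common across the collection ends up with a low ratio, which matches the description above of SIPs being improbable relative to all books rather than within the book itself.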

https://stackoverflow.com/questions/2009498/how-does-amazons-statistically-improbable-phrases-work

https://www.wired.com/2009/05/web-semantics-statistically-impossible-phrases-a-literary-view/winamp/

Using NLTK: it does not actually provide this; it just sets it as an exercise!

