Tag clouds are cool little tools, no doubt. What better way to determine what themes are most popular on Flickr?
But maybe tag clouds can provide us even more useful information than we think. A great example is Chirag Mehta’s Tagline Generator, which he demoed in 2006 by showing US Presidential Speeches Tag Clouds over the decades. It is elucidating to see how the most prevalent words in Jefferson’s 1776 Debate on Independence were “independence” and “colonies” and the most prevalent words in Bush’s 2007 State of the Union Address were “terrorists” and “Iraq”.
In just seconds anyone is able to view these clouds and surmise what was most important to a given politician at a given point in time (or rather, what he/she wanted us to believe was important at a given point in time).
But what else can we do with such tools that can capture collectively determined emphasis in virtually any space? Why hasn’t this approach caught on in mainstream media and in schools of data analytics?
I don’t know. I didn’t even know tag clouds existed until two years ago, and I am just now beginning to realize their potential importance in helping us summarize and organize and make decisions concerning the ever-growing database of the world.
There is an immense opportunity here. Here are some of the more offbeat applications of tag clouds that I can think of (please take them with a grain of salt):
- VC PowerPoint presentations should have a tag cloud on the cover or in place of an executive summary. This could, of course, apply to the executive summary (or appendix) of any document.
- Job hunters should tag cloud their resumes or CVs. Employers should run tag cloud checks in addition to googling a person. Every individual should have a tag cloud (in addition to a state-issued ID)! I’d love to see this on dating sites in particular…
- Analysts at consulting firms and investment banks and PR firms should run longitudinal and cross-media tag clouds on all clients and all competitors of clients. Tag clouds should live alongside other analytical approaches in the business world.
On that last point, let’s take an example. You are writing an analyst report for some bank in NYC and you need to answer the question: What type of company is Google really? A recent iinovate podcast interview with Google’s CEO Eric Schmidt shows he is currently thinking of the firm in the following way:
Google is an infrastructure company that enables content. Google is not in the content business. We have many partners that produce content. We are a distribution mechanism and a monetization mechanism for our partners. This is an important line that we’ve decided not to cross.
Okay fine. Let’s couple that with what he told Wired’s Fred Vogelstein in an interview around the same time:
[How should we think about Google today?] One is as an advertising system. Another one is as this end-user system (the search, email, and other applications Google delivers to users through an Internet browser). A third way to think of Google is as a giant supercomputer. And then a fourth way is to think of Google as a social phenomenon involving the company, the people, the brand, the mission, the values – all that kind of stuff.
Ungh now I am confused.
There are many ways of approaching the answer to the question – just one of them is listening to the CEO in public interviews. Another one would be analyzing Google’s publicly available documents. We could scrape their corporate website (too time-consuming for me), or we could scrape their SEC filings (now that sounds easier).
So let’s take a look at a tag cloud representing Google’s 2006 10K (which I made with Daniel Steinbock’s wonderful little TagCrowd program):

Interesting. The words “ads” and “advertisers” together appear 415 times in the 10K, accounting for 0.75% of the document’s 55,285 total words and collectively making them the second-most prevalent non-trivial term in the document (after “Google” itself). We also have “search” near the top with 220 mentions (0.40%), followed closely by “users” with 214 mentions (0.39%) and “contents” with 213 mentions (0.39%).
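For anyone who wants to reproduce this kind of count at home, here is a rough Python sketch of the arithmetic behind a tag cloud. To be clear, this is not how TagCrowd works internally (I have no idea what it does under the hood). The function names, the tiny stopword list, and the `groups` mapping that folds “advertisers” into “ads” are all my own assumptions, made up for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch);
# a real tool would use a much longer one.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "for", "on", "our", "we"}

def tag_counts(text, stopwords=STOPWORDS, groups=None):
    """Return (Counter mapping tag -> count, total number of words in the text)."""
    groups = groups or {}  # e.g. {"advertisers": "ads"} merges related words into one tag
    words = re.findall(r"[a-z]+", text.lower())
    total = len(words)  # the denominator includes the trivial words too
    counts = Counter(groups.get(w, w) for w in words if w not in stopwords)
    return counts, total

def top_tags(text, n=5, **kwargs):
    """Return the n most frequent tags as (tag, count, percent of all words)."""
    counts, total = tag_counts(text, **kwargs)
    return [(tag, c, round(100.0 * c / total, 2)) for tag, c in counts.most_common(n)]

# Made-up sample sentence, not taken from any filing.
sample = "Advertisers buy ads. Users search for content and ads pay for search."
print(top_tags(sample, n=3, groups={"advertisers": "ads"}))
```

The percentages are taken against every word in the text, trivial ones included, which matches how I computed the figures above (415 out of 55,285 total words, not out of the non-trivial words only).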
In summary, this tag cloud tells me that Google is primarily an advertising company. Oh and it has search. Which users use to find content.
But yeah Google is mostly an advertising company.
At least, this is the message Google is sending to Wall Street. So who knows what Google really is. We now know a little bit more about how the company portrays itself to its shareholders. And yes, I intentionally got a little cocky in the title of this post…
There is nothing earth-shattering in my very incomplete analysis, but hopefully it begins to show that there is interesting data in them there SEC filings (and pretty much every single document out there that features text). Maybe it’s time for us all to take a closer look.
Before I end, allow me to share the 10K tag clouds from a few other companies out there.
Yahoo? Yeah, they portray themselves to Wall Street (in their 2006 10K) as being more “user”-centric than Google and less focused on “ads” or “advertising”. That sounds about right:

Microsoft? It’s all about the “software” and “services” and “products” according to their 2006 10K (with not much at all about the “Internet” or “advertising”… DoubleClick sure would have been a nice acquisition, eh Steve?):
