Thinking About Networks // Midterm
Dennis Crowley // dens@dodgeball.com

What I wanted to do

It seems that no matter how hard I try, everything ends up coming back to dodgeball.com. Faced with the task of creating a network from scratch, I opted again to examine the data that is constantly being collected by dodgeball.

For those who haven't heard my rambles before, dodgeball.com is a cityguide based around the concept of user-generated content - think Zagat without the editors. Users of dodgeball.com have the ability to both rank and review venues. As users participate, they gain "experience points" - a way to keep track of which users are most worldly about the bars and restaurants in a given area. The average dodgeball.com user has 6 experience points, which means they've actively ranked or reviewed six unique venues.

The dodgeball.com website is completely database driven. Every time a user writes a review, searches for an address or ranks a venue, the action is recorded in the database. By querying this database, I can tell you how many people read reviews of Irish bars on St. Patrick's day or what bars are generating the most traffic in a given week.
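As a rough illustration of that kind of query, here is a minimal sketch using Python's built-in sqlite3 module. The `actions` table, its columns, the sample rows and the `traffic_by_venue` helper are all hypothetical stand-ins for this example, not the actual dodgeball.com schema.

```python
import sqlite3

# Hypothetical schema: one row per logged action (not the real dodgeball tables).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE actions (username TEXT, venue TEXT, action TEXT, logged_at TEXT)"
)
conn.executemany(
    "INSERT INTO actions VALUES (?, ?, ?, ?)",
    [
        ("alice", "Bar A", "read_review", "2003-03-17 21:05:00"),
        ("bob", "Bar A", "read_review", "2003-03-17 22:40:00"),
        ("alice", "Bar B", "write_review", "2003-03-18 10:00:00"),
        ("carol", "Bar B", "read_review", "2003-03-19 19:30:00"),
    ],
)


def traffic_by_venue(conn, start, end):
    """Count logged actions per venue with start <= logged_at < end."""
    rows = conn.execute(
        "SELECT venue, COUNT(*) AS hits FROM actions "
        "WHERE logged_at >= ? AND logged_at < ? "
        "GROUP BY venue ORDER BY hits DESC, venue",
        (start, end),
    )
    return rows.fetchall()


# Traffic for one (made-up) week, busiest venues first.
weekly_traffic = traffic_by_venue(conn, "2003-03-17", "2003-03-24")
```

Narrowing the date range to a single day, or adding a filter on the `action` column, would give the "reviews read on St. Patrick's day" style of answer.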

My intention for this project was to take the data collected by the dodgeball.com database and use it to generate a visual representation of venue popularity. I was hoping to do to bars and restaurants what orgnet.com (http://www.orgnet.com/leftright.html) did with political books and Amazon's recommendation engine - to show the connections between venues based upon the venues that individual users have visited.



Network Characteristics

Nodes: In this case, the venues are serving as nodes. Some nodes are more popular than others as they see more traffic - more users - pass through them. I'm hoping that my research will identify which venues serve as "supernodes" - gathering places of large numbers of people who eventually venture off in many different directions.

Content: If the venues are acting as nodes, then the people passing through these nodes are the content. Since I am only tracking the people who are using dodgeball.com, we'll refer to them as "users".

Protocols: Users are able to locate "nodes" by using the standard street addressing system - e.g. "132 Ludlow Street", or "corner of 5th Street and Avenue A". Other "rules" apply to the database structure - in the example I'll be showing below, I am only querying those dodgeball.com users who have reviewed at least 3 but no more than 25 venues.

Transport: Users (or "content") are able to move from one node to the next through a number of ways - walking, taking cabs, jumping on the subway, perhaps even riding their bikes. Users are taking advantage of existing public transportation infrastructure to move from node to node.

Package: Packages can be described as either (a) a number of people moving as one group from node to node or (b) newly formed groups of people, who arrived as individuals but leave as members of a group.

Addresses: Again, users are able to locate these "nodes" by using the standard street addressing system - e.g. 132 Ludlow Street, or corner of 5th Street and Avenue A.



Network Stack

Now that we've defined the components of the network, let me explain what I'm actually trying to do…

Application Layer: The network is designed to show the connections between the bars that people frequent within a given area (e.g. East Village). By collecting data from users as they rank and comment on venues, and then comparing this data to the data of other users, I am able to extract a visualization of how multiple venues are connected to one another.

It's important to know that I am not trying to track the order in which people travel throughout a given area, but rather to identify the venues that are most popular (read: most commonly reviewed) and then to make a connection between these venues and the venues that are not as heavily trafficked.
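One way to picture the comparison described above is to treat two venues as connected whenever the same user has reviewed both, and to rank venues by how many distinct users reviewed them. The sketch below assumes the data arrives as simple (username, venue) pairs; the function names and the sample data are made up for illustration and are not the code behind the actual visualization.

```python
from collections import defaultdict
from itertools import combinations


def venue_connections(reviews):
    """Map each pair of venues to the number of users who reviewed both."""
    venues_by_user = defaultdict(set)
    for username, venue in reviews:
        venues_by_user[username].add(venue)

    pair_counts = defaultdict(int)
    for venues in venues_by_user.values():
        # Sort so that (A, B) and (B, A) count as the same connection.
        for a, b in combinations(sorted(venues), 2):
            pair_counts[(a, b)] += 1
    return dict(pair_counts)


def reviewer_counts(reviews):
    """Map each venue to the number of distinct users who reviewed it."""
    users_by_venue = defaultdict(set)
    for username, venue in reviews:
        users_by_venue[venue].add(username)
    return {venue: len(users) for venue, users in users_by_venue.items()}


# Made-up sample data: three users, three venues.
sample_reviews = [
    ("u1", "Bar A"), ("u1", "Bar B"),
    ("u2", "Bar A"), ("u2", "Bar B"), ("u2", "Bar C"),
    ("u3", "Bar A"),
]
connections = venue_connections(sample_reviews)
popularity = reviewer_counts(sample_reviews)
```

In this toy data set "Bar A" has the most reviewers, so it would be the candidate supernode, and it is tied to "Bar B" more strongly than to "Bar C". Note that nothing here records the order in which anyone visited the venues.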

Transmission Layer: Every time a user submits a ranking or a review to dodgeball.com, that information is logged in the database. I am recording such information as the username, the venue the person is commenting on, what they're actually saying about the venue as well as the date and time the feedback was given.

I am not querying all users and all venues across all neighborhoods. Due to problems I was encountering with Graphviz, I scaled back my all-inclusive plan to only include bars in the East Village as nodes and only include users who have written at least three and no more than 25 reviews.
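The scaled-back selection can be sketched in a few lines. This assumes each review carries a neighborhood and a venue type, and that the 3-to-25 limit is counted over all of a user's reviews rather than only their East Village bar reviews (the write-up doesn't say which). The `Review` record and `select_reviews` function are hypothetical names for this example.

```python
from collections import Counter
from typing import NamedTuple


class Review(NamedTuple):
    # Hypothetical record layout, not the real dodgeball.com table.
    username: str
    venue: str
    neighborhood: str
    venue_type: str


def select_reviews(reviews, neighborhood="East Village", venue_type="bar",
                   min_reviews=3, max_reviews=25):
    """Keep reviews of matching venues written by users within the review limits."""
    # Assumption: the limits apply to a user's total number of reviews.
    per_user = Counter(r.username for r in reviews)
    return [
        r for r in reviews
        if r.neighborhood == neighborhood
        and r.venue_type == venue_type
        and min_reviews <= per_user[r.username] <= max_reviews
    ]


# Made-up sample data.
sample = [
    Review("ann", "Bar A", "East Village", "bar"),
    Review("ann", "Bar B", "East Village", "bar"),
    Review("ann", "Cafe C", "East Village", "restaurant"),
    Review("ben", "Bar A", "East Village", "bar"),
    Review("cat", "Bar A", "East Village", "bar"),
    Review("cat", "Bar D", "West Village", "bar"),
    Review("cat", "Bar B", "East Village", "bar"),
]
selected = select_reviews(sample)
```

Here "ben" drops out for having only one review, and the restaurant and West Village rows are excluded, leaving four East Village bar reviews.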

Physical Layer: While the nodes in this network are brick and mortar venues, I am relying on software to do all of the data collection. When a user pulls up a page on the web site, the site talks to a database and

If you want to talk geography, we can probably say that most of the dodgeball.com users are browsing the site from NYC. While they browse, pages and collected data are either served or sent from a machine in Connecticut.



Visual Explanation

Below is the visual mapping of the mined dodgeball data.



This data represents...

Neighborhood: East Village
Type of Venue: Bars only
User query: Users who have written at least 3, but no more than 25, reviews




Predictions & Discussion

Although I had a good idea of what I wanted to do for this project early on, my idea of how to actually implement it went through several iterations. At first I was hoping to recruit people from ITP and have them use their cell phones to keep a record of the venues they visited within a given week. I quickly realized that there was no way this was going to fly, since participating would have been far too complicated for the users.

I then planned on having people keep journals of the places they visited over a seven day period, but by the time spring break rolled around and the number of ITP'ers on the floor dwindled, I realized the only people I was going to be able to recruit would be my non-ITP friends. If I had gone this route, the data would have been severely skewed since we all visit the same venues, often to catch up with one another.

I finally decided to take what was probably the easiest route of all - mining the constantly growing database of data from dodgeball users. By using the data that is already being collected, I widen the audience of participants and automatically make it anonymous.

In any case, while working from my original idea to the actual execution, I came up with the following predictions.

Prediction #1: In my visualization of the data, there will be a few venues that will stand out as supernodes.

Just as Milena and Daniel Hirschmann were the supernodes in our in-class experiment that illustrated how people interact in social environments, I expected to see certain East Village venues appear as obvious branching out points. This turned out to be true - although I underestimated the number of venues that would appear as these supernodes.

Prediction #2: I expected to see clearly defined paths connecting venues.

In hindsight, this was fairly stupid of me. After viewing the nice, clean visual representation of the in-class social network, I expected my visualizations to look the same despite the fact that I'm dealing with a much larger set of venues across many more users (as opposed to a set of 20 students and their connection to 20 other students). My first pass at coming up with a visualization was chaos - it took many iterations of redefining my data set before I was able to come up with something coherent.

When first attempting to visualize the data, I started with a dataset that involved all venues (bars and restaurants) in Manhattan across all dodgeball users who had commented on at least one venue. After much frustration and lots of confusing visualizations, I modified my data set to include only bars in the East Village that were visited by users who have been to at least 3 other venues, but no more than 25.

Prediction #3: The visualization will be the coolest thing I've built!

The first time I started running data through Graphviz, this is exactly what I was thinking… "this is the coolest thing in the world". The problem is that when I rely upon user-generated reviews to gauge which venues are related to one another, I'm ignoring what the user thought about the venue. Is it their favorite place, or would they avoid it at all costs? A much more powerful version of this project would filter the content based on whether the reviews were positive or negative.
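As a sketch of what that filtering might look like, suppose each review came with a numeric ranking. The 1-to-5 scale, the threshold of 4 and the `positive_reviews` function below are all assumptions made for the example; the write-up doesn't describe how dodgeball.com stores rankings.

```python
def positive_reviews(reviews, threshold=4):
    """Keep (username, venue) pairs whose ranking is at or above the threshold.

    Assumes each review is a (username, venue, ranking) tuple on a
    hypothetical 1-to-5 scale.
    """
    return [
        (username, venue)
        for username, venue, ranking in reviews
        if ranking >= threshold
    ]


# Made-up sample data.
ranked = [
    ("u1", "Bar A", 5),
    ("u1", "Bar B", 2),
    ("u2", "Bar A", 4),
    ("u2", "Bar C", 1),
]
liked = positive_reviews(ranked)
```

Connecting venues using only the `liked` pairs would link places that the same people enjoyed, rather than places they merely had an opinion about.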



A Few Notes on Using Graphviz

The visualization shown above was created using Graphviz, an open source graph visualization program spawned from AT&T's research labs. Graphviz allows you to import .dot files - essentially plain text marked up in Graphviz's own DOT syntax - and export the data in that .dot file as an image. For this network experiment, I was creating .dot files on the fly by exporting user data from the dodgeball database. I saved those dynamically generated web pages as text files, fed them into Graphviz (using the "dot" layout engine) and exported the data as a .jpg.
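For anyone curious what generating such a file might look like, here is a minimal sketch that turns a list of venue pairs into DOT text. It is not the script used for this project: the `quote_id` and `to_dot` names, the choice of an undirected graph and the sample venues are assumptions for the example. The DOT language does allow node names to be written as double-quoted strings, which is what lets names contain spaces and punctuation.

```python
def quote_id(name):
    """Wrap a name in double quotes, escaping any embedded double quotes."""
    return '"' + name.replace('"', '\\"') + '"'


def to_dot(edges, graph_name="venues"):
    """Build the text of an undirected DOT graph from (venue, venue) pairs."""
    lines = [f"graph {quote_id(graph_name)} {{"]
    for a, b in edges:
        lines.append(f"    {quote_id(a)} -- {quote_id(b)};")
    lines.append("}")
    return "\n".join(lines) + "\n"


# Made-up venue names, including spaces, punctuation and a number.
dot_text = to_dot([("Bar A", "Bar B"), ("Bar A", "Pub No. 9")])
print(dot_text)
```

Saving the output to a file such as venues.dot (a hypothetical name) and running something like `dot -Tjpg venues.dot -o venues.jpg` from the command line should render it with the "dot" layout engine.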

Just for the record, I ran into a lot of problems trying to run data sets through Graphviz. For example, Graphviz crashes if you attempt to import values that contain spaces, punctuation or numbers, or if the file you are importing is too large.

Also, to prevent overlapping of nodes in the visualization, make sure to set the following properties.

Scope: Node
Name: Layer
Value: no



Links

  • Graphviz Overview

  • Download Graphviz

  • .dot file which was imported into Graphviz

  • Original spec for this midterm project

    (c) 2003, dennis crowley