Data Expedition: Where does your NIKE shirt come from?

“Can you find and visualize some interesting insights about the garment industry within one hour?”

That was the challenge we were given at my first “data expedition”, which I joined while I was at OKCon (an international conference on open data).

The concept of a data expedition is simple but effective: bring together a small group of smart people, such as designers, analysts, storytellers, researchers, and IT people. Then give them some initial data, a challenging quest, and very little time.

In our case, we were given some basic data on garment production sites: address, number of employees, product types, and retailer. Our team decided to narrow the quest to NIKE, and we immediately started searching for additional data. I always knew that there is a lot of information available on the web, but I was still surprised by how much we found within 10 minutes: a list of all sports teams sponsored by NIKE, all NIKE shops worldwide, US and international tax reports, and much more.

We decided to compare production locations with sales activities and visualize them on a map. Our “expedition guides” recommended a free map service – CartoDB – and within 20 minutes we had the first data on the map. We used the next 30 minutes to clean and combine the data and to bring everything onto the same map.

When the data expedition ended after one hour, we had two separate maps showing our data; we hadn’t managed to bring everything onto the same map. So we decided to finish it “as homework” and completed the task a few days later. In fact, we invested a few more hours, added data about sports sponsoring, and improved the layout.

NIKE Activities - click to start!

My conclusion: I was really impressed by how much you can achieve within one hour! Starting with almost nothing, we decided what to do, found the necessary data, and produced a first draft of our interactive map about NIKE.



Corpora for Sentiment Analysis

Our recent paper on “Potential and Limitations of Commercial Sentiment Detection Tools” (see this blog post) received a lot of attention in the community. In fact, we got several requests to provide access to our data and the test corpora.

You can find our results and data at our sentiment analysis site. Unfortunately, we cannot provide the corpora directly for legal reasons. But you can find and download them from the following sources:

Hope this simplifies your work!


Sentiment Analysis Tools are Good – but not Perfect

How good are commercial sentiment analysis tools? We recently tackled this question in our research team and evaluated the quality of 9 state-of-the-art commercial sentiment detection tools. We applied them to 30,000 short texts from various sources (tweets, news headlines, reviews, etc.). The best tools reach an accuracy of 75% for some document types (tweets), but the average accuracy over all documents is at best 60%. This means that even with the best tool, 4 out of 10 documents will be misclassified.
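To make the arithmetic concrete, here is a minimal sketch of how accuracy per document type and overall accuracy can be computed. The function name `accuracy_by_type` and the ten toy records are made up for illustration; they are not taken from our corpora, and they are merely chosen so that the toy numbers land on 75% and 60%.

```python
from collections import defaultdict


def accuracy_by_type(records):
    """Return (accuracy per document type, overall accuracy).

    `records` is an iterable of (doc_type, gold_label, predicted_label).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for doc_type, gold, predicted in records:
        total[doc_type] += 1
        correct[doc_type] += int(gold == predicted)
    per_type = {t: correct[t] / total[t] for t in total}
    overall = sum(correct.values()) / sum(total.values())
    return per_type, overall


# Hypothetical toy records (NOT real evaluation data).
toy_records = [
    ("tweet", "pos", "pos"),
    ("tweet", "neg", "neg"),
    ("tweet", "neu", "neu"),
    ("tweet", "pos", "neg"),
    ("headline", "neg", "neu"),
    ("headline", "pos", "pos"),
    ("review", "neg", "pos"),
    ("review", "neu", "neu"),
    ("review", "pos", "neu"),
    ("review", "neg", "neg"),
]

per_type, overall = accuracy_by_type(toy_records)

# An overall accuracy of 0.6 means (1 - 0.6) * 10 = 4 errors per 10 documents.
errors_per_ten = round((1 - overall) * 10)

print(per_type)
print(overall, errors_per_ten)
```

Note that a tool can look good on one document type and still have a much lower average, which is why we report both numbers.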

Since we were convinced that there is still some “potential” for improvement, we combined all tools with a meta-classifier. It turned out that using a random forest classifier can improve accuracy by up to 9 percentage points compared with the best single tool.
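The idea of a meta-classifier can be sketched roughly as follows: the labels predicted by the individual tools become the input features, and a random forest learns to map them to the true label. This is only an illustration, not the setup from our paper. It assumes scikit-learn's `RandomForestClassifier`, three sentiment classes, and synthetic tool outputs in which each of 9 simulated tools is right 60% of the time with independent errors. Real tools make correlated errors, so the gain printed here is much larger than what you can expect in practice.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for real data (assumption): 2000 documents,
# 9 simulated tools, 3 classes (0 = negative, 1 = neutral, 2 = positive).
rng = np.random.default_rng(0)
n_docs, n_tools, n_classes = 2000, 9, 3
gold = rng.integers(0, n_classes, size=n_docs)

# Each simulated tool returns the gold label with probability 0.6,
# otherwise one of the two wrong labels (errors are independent).
is_right = rng.random((n_docs, n_tools)) < 0.6
offset = rng.integers(1, n_classes, size=(n_docs, n_tools))
tool_preds = np.where(is_right, gold[:, None], (gold[:, None] + offset) % n_classes)

# Features = tool outputs, target = gold label.
X_train, X_test, y_train, y_test = train_test_split(
    tool_preds, gold, test_size=0.3, random_state=0
)

meta = RandomForestClassifier(n_estimators=100, random_state=0)
meta.fit(X_train, y_train)

meta_acc = meta.score(X_test, y_test)
best_single_acc = max(
    float((X_test[:, j] == y_test).mean()) for j in range(n_tools)
)

# Improvement in percentage points over the best single tool.
gain_pp = 100 * (meta_acc - best_single_acc)
print(f"best single tool: {best_single_acc:.3f}")
print(f"meta-classifier:  {meta_acc:.3f} ({gain_pp:+.1f} percentage points)")
```

The comparison is made on a held-out test split, so the meta-classifier is not rewarded for memorizing the training documents.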

Our results were published at ESSEM 2013. For more details, please see our paper.