Small Data

Steven Pemberton, CWI, Amsterdam

Happy birthday internet Europe!

Twenty-five years ago this November, the internet first started flowing into, and out of, NL and Europe.

At the CWI, in Amsterdam, in an office next to mine.

I was probably the 4th or 5th user of the internet. Maybe the 6th.

It was a "Skunk works" project.

There was a 64kb/s connection between the whole of Europe and the whole of the USA. A year later, to much rejoicing, it was increased to 128kb/s.

Moore's Law

We all know Moore's Law.

But often people don't understand its true effects.

Take a piece of paper, divide it in two, and write this year's date in one half:

Paper: 2013

Now divide the other half in two vertically, and write the date 18 months ago in one half:

Paper: 2013 | 2012

Now divide the remaining space in half, and write the date 18 months earlier (or in other words 3 years ago) in one half:

Paper: 2013 | 2012 | 2010

Repeat until your pen is thicker than the space you have to divide in two:

Paper: 2013 | 2012 | 2010 | 2009 | 2007 | 2006 | 2004 | 2003 | 01 | 00 | 99 | 96 | 95 | 81

This demonstrates that your current computer is more powerful than all other computers you have had put together.

(You can use this diagram to demonstrate other things too).
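The arithmetic behind the paper diagram can be checked with a short sketch. It assumes, purely for illustration, that you got a new computer at every doubling period and that each one was exactly twice as powerful as its predecessor; the function name and the idea of a single "power" number are made up here.

```python
def relative_powers(generations: int) -> list[float]:
    """Power of each computer relative to the newest one, newest first.

    Assumption: one new computer per doubling period, each exactly
    twice as powerful as the one before (an idealised Moore's Law).
    """
    return [1 / 2**n for n in range(generations)]


powers = relative_powers(14)      # 14 strips, as on the sheet of paper
current = powers[0]               # the newest computer
all_previous = sum(powers[1:])    # every earlier computer combined

print(f"current: {current}, all previous together: {all_previous:.4f}")

# 1/2 + 1/4 + 1/8 + ... never quite reaches 1, so the newest machine
# always outweighs all of its predecessors put together.
assert current > all_previous
```

However many strips you add, the sum of the older ones stays just below the newest, which is exactly what the remaining sliver of paper shows.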

The need for speed

Networks get faster even faster than Moore's Law! They double in speed (at constant cost) every year!

Well, if that is true, then after 25 yearly doublings the speed at Amsterdam should be 64kb/s × 2^25, which is about 2Tb/s.
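That calculation is easy to check. A minimal sketch, assuming that "kb" means 1000 bits and that the doubling is exactly once a year from 1988 to 2013; the names are only for illustration.

```python
START_SPEED_BPS = 64_000   # the 64kb/s link of 1988, taking 1 kb = 1000 bits
YEARS = 25                 # 1988 to 2013


def projected_speed(start_bps: int, years: int) -> int:
    """Speed after doubling once a year for the given number of years."""
    return start_bps * 2**years


speed = projected_speed(START_SPEED_BPS, YEARS)
print(f"{speed / 1e12:.2f} Tb/s")   # prints 2.15 Tb/s
```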

So let's have a look at AMSIX's statistics from yesterday:

AMSIX Yesterday

AMSIX daily stats

Yep! 2Tb/s!

Information

Well, that was 1988. That was the internet.

There was no web until a few years later.

I organised a couple of workshops at the first Web conference in 1994, where about 300 of us turned up.

And it was from there that Tim Berners-Lee asked me to get involved with the World Wide Web Consortium, which was then just starting up.

Information

It wasn't until 1995 that we began to get an idea that the web was going to be so successful.

In those days the message was "if you have information, it should be on the web". It was surprising how many people didn't understand that.

Now with the coming of the Open Linked Data movement, that call has become "If you have data, it should be in machine-readable form on the internet".

Open Data

So the world has basically understood the idea of open information.

But now for open data.

I hope you have all already heard the story of the Amsterdam Fire Service, which couldn't get hold of the data about which roads were being dug up.

That is one sort of reason why we need open data.

But there are other reasons that we just don't know about until we do it: making data available enables new applications.

An Example: Tube Trains

Transport for London released live data of where its Tube Trains are.

And before you knew it, there was this, built at a Science Hackday: http://traintimes.org.uk/map/tube/

A live tube map

And then he did it again

http://traintimes.org.uk/map/#stp

Live train map

Open Source Data

Another thing that the trains example has used is OpenStreetMap. We all know Wikipedia; well, this is the Wikipedia of maps.

Utrecht on OSM

Much better than the alternatives, at least in cities, and getting better elsewhere. BUT: it is data, not just images.

Big Data

Open data is usually mentioned in the same breath as big data.

And indeed, many applications of Open Data are on large data sets.

But I want to talk about the Cinderella of open data: Small Data.

And I want to talk about a technology, RDFa, that can be used to take Cinderella to the ball.

Small Data

There is a lot of small data on the web that, made properly available, can be combined in new and interesting ways.

And it can be used to help the user.

Especially with some small level of integration in the browser, the user's life on the web can be greatly facilitated.

Small Data

Imagine you stumble upon a web page for an event. You see it is somewhere in Turin, on 23-24 April 2014, about Open Data on the Web. You want to go!

So the first step is to add it to your agenda.

Then you Google the address to try and find where it is, and with that information you can look it up with a maps site, to get a feel for where it is.

Now you can go to a number of hotel websites, to look up locations and prices of hotels.

Next, how to get there. Which airport is easiest for that location? Maybe there's more than one. Who flies to those airports? Is the train a possibility?

So...

So you start trawling the web, maybe going to Rome2Rio.com to get advice about the travel possibilities, then a number of airport websites, followed by a number of airline websites and travel websites. Each time, over and over again, you type in where you are coming from, where you are going, the dates involved, the class that you want to travel, and so on and so on.

Why? Because nothing knows what you are doing, and nor can it.

How different it could be...

RDFa

If the website had included a couple of snippets of RDFa, then the moment you arrived at the website, the browser could immediately have noticed that the page was about an event.

RDFa is an addition to the markup of a page (and to ODF files by the way) that allows you to indicate what certain parts of it are: that "Amsterdam" is a city, that a certain string is a date, and so on.
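To give a feel for what such markup looks like and how little it takes to read it, here is a minimal sketch. The page fragment is hypothetical, and the schema.org vocabulary and property names are my choice for illustration, not taken from a real event page. The collector is deliberately tiny: it is not a conforming RDFa processor, and it ignores nesting, prefixes and most of the processing rules.

```python
from html.parser import HTMLParser

# Hypothetical event page fragment with RDFa attributes
# (vocab, typeof, property); the values come from the example event.
PAGE = """
<div vocab="http://schema.org/" typeof="Event">
  <h1 property="name">Open Data on the Web</h1>
  <time property="startDate" datetime="2014-04-23">23</time>-
  <time property="endDate" datetime="2014-04-24">24 April 2014</time>
  <span property="location">Turin</span>
</div>
"""


class PropertyCollector(HTMLParser):
    """Collects `typeof` and `property` values from an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.types = []
        self.properties = {}
        self._open = None   # property whose text content is being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "typeof" in attrs:
            self.types.append(attrs["typeof"])
        if "property" in attrs:
            name = attrs["property"]
            if "datetime" in attrs:
                # Machine-readable value given in an attribute.
                self.properties[name] = attrs["datetime"]
            elif "content" in attrs:
                self.properties[name] = attrs["content"]
            else:
                # Otherwise the value is the element's text.
                self.properties[name] = ""
                self._open = name

    def handle_data(self, data):
        if self._open is not None:
            self.properties[self._open] += data

    def handle_endtag(self, tag):
        self._open = None


collector = PropertyCollector()
collector.feed(PAGE)
print(collector.types)
print(collector.properties)
```

The point is that nothing here is guessed: the page says outright that it describes an event, and which strings are its name, dates and location.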

So, with machine-readable webpages

The browser could offer to add the event to your agenda for you automatically.

It would have known the exact location of the event, and could have offered to show you a map of the location, using the map service that you have entered in your settings as your preferred mapping service.

It could have looked up hotels for you using a number of hotel services, and already have a list ready; it could overlay them over a map.
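The first of these steps, adding the event to an agenda, could look roughly like this. It is a sketch under assumptions: the event data is taken to have been extracted from the page already, the dictionary keys and the function name are made up for illustration, and the output is a minimal iCalendar (RFC 5545) entry with no escaping of special characters in the text fields.

```python
from datetime import date, datetime, timedelta, timezone

# Assumed to have been read from the page's markup already;
# the values are those of the example event.
event = {
    "name": "Open Data on the Web",
    "startDate": "2014-04-23",
    "endDate": "2014-04-24",
    "location": "Turin",
}


def to_icalendar(event: dict, uid: str, stamp: datetime) -> str:
    """Render an all-day event as a minimal iCalendar entry.

    `stamp` should be a UTC datetime; `uid` any unique identifier.
    """
    start = date.fromisoformat(event["startDate"])
    # DTEND is exclusive for all-day events: one day after the last day.
    end = date.fromisoformat(event["endDate"]) + timedelta(days=1)
    lines = [
        "BEGIN:VCALENDAR",
        "VERSION:2.0",
        "PRODID:-//example//small-data-sketch//EN",
        "BEGIN:VEVENT",
        f"UID:{uid}",
        f"DTSTAMP:{stamp:%Y%m%dT%H%M%SZ}",
        f"DTSTART;VALUE=DATE:{start:%Y%m%d}",
        f"DTEND;VALUE=DATE:{end:%Y%m%d}",
        f"SUMMARY:{event['name']}",
        f"LOCATION:{event['location']}",
        "END:VEVENT",
        "END:VCALENDAR",
    ]
    # iCalendar lines are separated by CRLF.
    return "\r\n".join(lines) + "\r\n"


print(to_icalendar(event, "turin-2014@example.org", datetime.now(timezone.utc)))
```

The same dictionary could equally be handed to a map, hotel or travel service, which is why the details only need to be stated once.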

But there's more

Since it knows where you are based, it can then offer to look up flights from your preferred airport to the airports with the best connections to the location where you want to go.

You no longer need to input all your details over and over again.

It can locate car hire deals, or find public transport connections.

Without too much extra work, and with the use of some extra services, the browser could even locate restaurants and other facilities nearby that friends had recommended, as well as recommend what clothing to take based on typical weather at that time of year.

Advantage

This is only one use case of how small data can be used to ease the life of the user, but there are many more, including online shopping, holiday planning, and academic reference tracing. The list is endless.

The advantage of this sort of data is that you don't have to repeat yourself. The data is both human-readable, and machine-readable.

But also that, as the English say "Many a mickle makes a muckle".

Usage

Browsers don't yet use this sort of data (though there are some plugins).

There are some web services available that will extract all the data from a page for you (for instance http://www.w3.org/2012/pyRdfa/).

This is like scraping, but without heuristics, and without the concomitant errors.

And Google, for instance, has started to use it.

Google reviews

For example

If you extract the data from my home page, you can find data such as my telephone numbers and an image of me (and you can also actually see that this is my home page):

# FOAF namespace, needed for the foaf: prefix to resolve
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

<http://www.cwi.nl/Steven.Pemberton.jpg> foaf:img
    [ foaf:name "Steven Pemberton"@en;
      foaf:phone <tel:+1-617-395-1252>,
                 <tel:+31-20-5924138>,
                 <tel:+31-624-671668>;
      foaf:primaryTopicOf <http://www.cwi.nl/~steven> ] .

An Example of Government Use

The London Gazette is one of the official journals of record of the British government.

First published on 7 November 1665.

Published daily.

Data published in a reusable form.

London Gazette

A Standard

RDF and RDFa are both W3C standards.

Usable for all web formats, as well as ODF documents.

RDFa was recently reported to be currently the fastest-growing format on the web.

Conclusion

Open Data is important: for transparency, for usability, and for both the users and the providers.

Don't forget to make the small data open as well.