This site was developed to support the Government 2.0 Taskforce, which operated from June to December 2009. The Government responded to the Government 2.0 Taskforce's report on 3 May 2010. As such, comments are now closed but you are encouraged to continue the conversation at agimo.govspace.gov.au.

Making Government Data More “Hack”able

2009 October 28

At Google, we think it’s pretty awesome that the government is holding a contest to mash up government data. As a company with a lot of APIs, we love it when people use them to make mashups, and as a company with a mission of making data universally accessible and useful, we love to see governments opening up their data. So we’ve arranged a few events in support of the contest. We held a 3-hour “MashupAustralia HackNight” on October 14th, we’re holding another one tonight, and we’re hosting the OpenAustralia HackFest on November 7-8.

At our first hack night, we started off with talks on the contest, mashups and APIs, and putting data on maps. Then, since we conveniently had a representative from data.australia.gov.au at the event, we took the opportunity to search through their database and find useful datasets. We found a couple of really good ones (the NSW Crime set and the Victoria Internet locations set), but we also found a lot of sets that were really hard to use. Since part of the goal of this contest is to figure out what characteristics define a useful dataset, and to encourage governments to adopt those, I thought I’d take this opportunity to give a few basic tips:

  • Format: It is generally not a good idea to share data in a binary format. It is more compact, but it is less accessible to developers. The best format is an API (REST or XML-RPC) or, more simply, an RSS feed with all the entries. The next-best format is a well-structured CSV or spreadsheet, as many database systems can easily import those. If you are going to use a more obscure format, provide tips on how to use it. (This is something that the data.australia.gov.au site could also provide.)
  • Size: Some data sources provided zip files that were around 300 megabytes. Most developers aren’t going to download 300 megabytes if they don’t know what the data looks like, and what makes up that size. If you are going to provide a large file, I suggest also providing a preview file.
  • Geo data: The vast majority of the data sources relate to geographic regions or points, but most of them did not provide enough geographic data. If possible, you should provide both the address and the latitude/longitude coordinates. If the data describes a region, provide an array of coordinates outlining it. A great example of this is the NSW fire feed – it provides an address, a point, and a polygon.
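
To make the preview-file suggestion concrete, here is a minimal Python sketch. It assumes the dataset is a CSV file whose first row is a header; the function name `write_preview` and the file names in the usage note are made up for illustration, and a real dataset may need different handling (other encodings, other formats).

```python
import csv
from itertools import islice


def write_preview(src_path, dst_path, max_rows=20):
    """Copy the header row and the first `max_rows` data rows of a CSV file.

    Assumes the first row of the source file is a header.
    Returns the number of data rows written to the preview.
    """
    with open(src_path, newline="", encoding="utf-8") as src, \
         open(dst_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)

        header = next(reader, None)
        if header is None:
            return 0  # empty source file: nothing to preview

        writer.writerow(header)
        written = 0
        # islice stops after max_rows, so the rest of the file is never read.
        for row in islice(reader, max_rows):
            writer.writerow(row)
            written += 1
        return written
```

Calling something like `write_preview("full_dataset.csv", "preview.csv")` and publishing the small file next to the big download would let developers see the columns and a few sample rows before committing to 300 megabytes.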

These are simple suggestions, but they can make a world of difference in terms of making data useful. We hope to see more government agencies opening up their data for developers and evaluating how they’re doing so. But we also hope to see developers using the current data as much as possible, and coming up with more ideas. Please join us at one of our future events!

10 Responses
  1. 2009 October 28

    I think you’ve hit most of the basics. The other issue concerns metadata about the data itself – people have raised concerns numerous times about assurances of data quality, licensing, ownership, etc., but this information is missing in some instances. Even a relatively stable list of office locations needs this, because things do change over time. I don’t think this matters so much right now while we’re in a beta phase, but later down the track this will become more important.

  2. 2009 October 30
    Alf Ingham

    For those that can’t make these events can you explain exactly what this is..?

    Access to all government databases..?

    A bit more explanation please

  3. 2009 October 30
    Bert Coupar

    At MSN, we think competition in the public data arena is a good thing. We don’t think that all your data being held by one organisation is healthy.

  4. 2009 October 30
    Gordon Grace

    Guidance and Better Practice from AGIMO regarding publishing spatial data online:

    Spatial Data (Web Publishing Guide)

    Spatial Data on the Internet (AGIMO Better Practice Checklist)

  5. 2009 October 30
    Hugh Barnes

    A great thing to talk about.

    Format: the most important consideration is actually openness – can I, without license encumbrance, unpack the files and work with open source tools to do so? The next is suitability – certainly a rich and RESTful web service is ideal. I take issue with RSS as a general format. It’s OK for syndication (though its entry content is just a text blob of CDATA – Atom allows richer markup within an entry), but not always suited to the data being provided. By all means use it to notify consumers of changes to the datasets.

    Size: good plan to provide a preview or maybe a data format walk-through. Perhaps also publish diff files when the datasets are updated so that developers updating their dataset copies don’t need all that bandwidth every time. Consider seeding a torrent file for distribution or hosting a Metalink.

    Spot on about geo data. GeoRSS, which your NSW fire example uses, is a little harder to parse than native XML geo formats like GPX, GML, or KML. Further, users can readily load those native formats onto GPS devices or overlay them on online maps. Oh, and some of the geographic datasets I’ve looked at on data.australia.gov.au (was it really necessary to put “.australia” in there?) are in ESRI format, which I can’t easily use, which reinforces the point about openness I made earlier on.

    Good luck at the events. Wish I could be there.

    • 2009 October 30
      Hugh Barnes

      For so many concepts, my comment was rushed. This might seem like “goes without saying” stuff, but I forgot to mention some points I would add to the original post (which, to be fair, never claimed it was anything more than a few tips):

      * Datasets must be posted at persistent addresses (URLs – and all that that entails). If you are lucky, applications will be built (desktop/mashup etc) that pull data from your dataset, and this must be able to happen without risk of breakage. It might be a good idea to post reassurances about your intention to persist your dataset URLs.

      * Use autodiscovery links from related resources where possible and appropriate. For example, add <link rel="alternate"… in the head of HTML pages which textually express or describe roughly the same information. This helps intelligent discovery agents know that the resources are related or contain the same content in different formats. Something like that :)

  6. 2009 November 4

    I read the excellent article in the SMH today and followed the links to the mashups page. The attention that has been paid to the layout of the catalogues is very refreshing – the information about the datasets is very clear and user friendly, organised by category such as education.
    Also, the frog atlas was all clearly documented within the spreadsheet.

    I agree with James that ‘down the track’ we must have processes in place to keep the data up to date, but that will come with a groundswell of demand. Having a link simply to the source organisation’s home page doesn’t help a user who wants to report an error or get an update.

    This is an exciting initiative, well done the Taskforce for kicking it off.

  7. 2009 November 23
    Eddie

    This competition is a complete disaster.

    1. Voting can be exploited.
    2. Entries can be entered past the closed off date.

    These guys need a BIG education on how to run a proper competition.

  8. 2009 November 24

    Hi Eddie,

    To answer your points:

    1. Yes, there were voting irregularities – but the judges are aware of these and will take them into account in the judging of the People’s Choice Award.

    2. No late entries have been accepted into the contest. Arts on the Map was submitted before the November 13th closing date, but was not received by our system for technical reasons. After investigating and discussing this issue with the affected entrant and the judges, the decision was made to put this entry on the site, where it appeared online on November 23rd.

Comments are closed.