Civil Society and Spending Data: Who is mapping the money?

January 12, 2012 in Contribute, OSF

This post is by Lucy Chambers, Community Coordinator on the OpenSpending project at the Open Knowledge Foundation. The post is cross-posted on the Open Knowledge Foundation blog.

We’re excited to announce that, thanks to the generous support of the Open Society Foundations, OKFN’s activities around financial transparency will expand to include a second pillar: next to the OpenSpending platform, we have just started a 6 month project to map the technology needs of Civil Society Organisations in relation to public spending and budget information.

We’re going to be working on…

  • Identifying CSOs around the world who are interested in working with spending data – building on the existing network of contacts from the OpenSpending.org project.

  • Connecting these CSOs with each other, with open data communities and with other key stakeholders to exchange knowledge, experiences and best practices in relation to spending data

  • Establishing how CSOs currently work with spending data, how they would like to use it, and what they would like to achieve – including:

  1. what existing tools are being used
  2. what current technical needs are unmet
  3. what would be required to meet these needs and how feasible is it to tackle them
  • Creating a registry of spending datasets, from official and unofficial sources in theDataHub.org
  • A Spending Data Manual – A wiki-like, community driven manual on acquiring, working with, publishing and archiving spending data, based on input and exchanges with CSOs we talk to. This will augment and reference existing publications from numerous organisations as well as channelling the results of our research into two areas:
    • A section to help CSOs clarify their demands towards governments: e.g. guidance on open licensing and structured data formats, applicable for spending data.
    • A section focused on best practice for CSOs when using and reusing spending data: for example collaborative processes such as data-sharing.  
  • Running Spending Analysis Sessions with CSOs, both in person and virtually. We’re interested in learning what data people are trying to acquire or are having difficulty acquiring, how they plan to use the data to further their mission, and what barriers (legal, technical and otherwise) could be removed to make their jobs easier.

  • Getting Spending Data from numerous countries loaded into OpenSpending.org – with the support of CSOs, OKFN developers, and volunteers from the open data community. We’re interested in how CSOs are using the OpenSpending.org tools, and want to collect input from them on how these could be improved to meet their needs.

Vision: Improved Spending Data Literacy, Sharing and Re-use amongst CSOs around the world

We are very keen to help more groups and individuals around the world to use and work with spending data more effectively to do the things they care about – whether this is investigative journalism, evidence based policy-making, political campaigning, budgeting or creating new useful applications and services.

In particular, we would like to document and spread best practices in the legal and technical aspects of reusing public information, and enabling re-use and better collaboration around this material.

Ultimately we would like to:

  • Build stronger, broader communities of groups and individuals who work together to acquire, use, and openly share spending data
  • Increase ‘literacy’ around spending data – enabling more CSOs to understand and work with large and complex spending datasets to help them to pursue their objectives
  • Encourage more CSOs to publish datasets which they acquire, use or create in machine readable formats, under open licenses, to avoid duplication of effort and enable CSOs to build on each others’ work, to harness external expertise more effectively and to facilitate stronger collaboration between different organisations who are interested in spending information

How can I get involved?

  • Join the Working Group on Spending Data. The working group will bring together data experts and CSOs who will help to weave a community of best practice around spending data, collect and provide feedback on material for the manual and help to develop the network of those collaborating around and sharing spending data. More details about the working group can be found on this wiki page.

  • Write for the Spending Data Blog – we’re interested in posts by and about CSOs who work with spending data, and in observations on the current state of data release in your area: anything from short comment pieces to full proposals for what could be done, legal, technical or otherwise, to improve the situation in the sphere where you work. Contact details are below.

If you would like to get started, or know of organisations we should extend the invitation to, drop us an email via the mailing list or contact me directly via info [at] openspending.org.

Data = Seized, Sanitised and Sanity-checked. Open Data Day 2011

December 12, 2011 in events

This post is by Mark Brough, Research Officer at Publish What You Fund, Lucy Chambers, Community Coordinator for OpenSpending, and Irina Bolychevsky, Product Owner for CKAN. It is cross-posted on the OpenSpending Blog and the Open Knowledge Foundation Blog and Mark Brough’s contribution is also featured on aidinfolabs.org.

Saturday, December 3rd was Open Data Day, and London took the challenge to throw a hackday to help data be opened, cleaned and shown off to the world…

Fuelled only by enthusiasm, caffeine and 5 packets of ready-made popcorn, the CKAN, OpenSpending and IATI teams, along with some new faces, joined forces to liberate as much data as they could…

OpenSpending + IATI + CKAN

As part of the IATI Open Data Day challenges, Mark Brough did some work to get the existing IATI Data into OpenSpending. David Read, from the CKAN team, and a new face to the data wrangling crew, Johannes, scraped data on aid donations from France and Austria that were locked-up in web apps in order to help fill in the gaps in the global aid data jigsaw puzzle. You can see the results on OpenSpending.

The French (AFD) and Austrian (ADA) aid data appears to be incomplete: the AFD’s 2010 Annual Report (http://www.afd.fr/jahia/webdav/site/afd/shared/PUBLICATIONS/Colonne-droite/Rapport-annuel-AFD-VF.pdf) suggests that South Africa is the biggest recipient country, receiving €403 million, but in the data, Morocco is the biggest recipient and there are no transactions in South Africa.

The Austrian Development Agency data was carefully cleaned by Johannes, with region and country codes being added for all entries to create a tidier dataset. However, the original data contained, for example, four different spellings of Bosnia and Herzegovina, suggesting that countries are being manually entered rather than selected from an existing list. For 2010 (http://openspending.org/ada/?_time=2010&_view=country), the second biggest recipient of the Austrian Development Agency’s aid (after aid not going to a specific country) appears to be Austria.

Nevertheless, despite the issues surrounding data quality, it was a useful exercise to show both the value of open data – that if you release your data, you can do pretty cool things with it – and the costs of keeping it locked away, namely that the data then has to be scraped from sites in quite a labour-intensive way.

These, along with many other datasets discovered on the day via tweets and emails have been added to the Open Data Day Group on theDataHub.org.

On the same day, we worked to get the data released as part of the International Aid Transparency Initiative into OpenSpending. You can see the results of the IATI wrangling process on OpenSpending.org/iati. The following section is written by Mark.

1. Getting the data

Downloading the existing IATI data has already become quite a big task; with 19 publishers so far, the data currently amounts to over 750MB with 1169 packages. Fortunately this is made easier by the IATI Registry, which provides an API to access all existing datasets, and a simple script (links at end) can retrieve all of the data.

2. Extracting the data

Extracting the data from the XML files is more complicated. Although IATI data uses a standard schema, there are a few cases where publishers have either used the markup incorrectly, or else interpreted the definitions slightly differently. These can be simple problems, such as stating that an organisation is “implementing” rather than “Implementing”, or placing the date within the text of the tag and not the “iso-date” attribute of that tag, or more significant problems, such as placing implementing organisations in the “accountable” organisation field.
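The two simpler problems can be absorbed at parse time. The following is only a minimal sketch, not the script used on the day: the sample record is made up and heavily simplified, and the function names (normalise_role, read_date, extract) are hypothetical.

```python
import xml.etree.ElementTree as ET

# Made-up, heavily simplified activity record for illustration only.
SAMPLE = """
<iati-activity>
  <participating-org role="implementing">Example NGO</participating-org>
  <activity-date type="start-actual">2010-03-01</activity-date>
  <activity-date type="end-planned" iso-date="2011-02-28"/>
</iati-activity>
"""


def normalise_role(role):
    """Capitalise the role so 'implementing' and 'Implementing' compare equal."""
    role = (role or "").strip()
    return role[:1].upper() + role[1:].lower()


def read_date(element):
    """Prefer the iso-date attribute; fall back to the text of the tag."""
    value = element.get("iso-date") or (element.text or "")
    return value.strip() or None


def extract(xml_text):
    """Return the organisations and dates of a single activity."""
    activity = ET.fromstring(xml_text.strip())
    orgs = [
        (normalise_role(org.get("role")), (org.text or "").strip())
        for org in activity.findall("participating-org")
    ]
    dates = {d.get("type"): read_date(d) for d in activity.findall("activity-date")}
    return {"orgs": orgs, "dates": dates}


result = extract(SAMPLE)
```

Misplaced organisations (implementing bodies filed under the accountable role) cannot be fixed this way, because the markup alone does not say which role was intended.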

However, these problems are still fairly limited and follow fairly regular patterns, so they are not too hard to overcome. There are more significant problems when some donors have for example used three-letter (ISO-3) country codes, rather than two-letter (ISO-2) country codes. (This is considered below in “next steps”.)
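For the country code mismatch, a user of the data could normalise everything to two-letter codes before aggregating. This is a sketch under assumptions: the lookup table is deliberately tiny, and the name to_iso2 is hypothetical. A real table would cover the full ISO 3166-1 list.

```python
# Tiny illustrative lookup; a complete table would list every country.
ISO3_TO_ISO2 = {"ESP": "ES", "UGA": "UG", "MAR": "MA", "ZAF": "ZA"}


def to_iso2(code):
    """Return a two-letter country code, or None if the code is not recognised."""
    code = (code or "").strip().upper()
    if len(code) == 2:
        return code
    if len(code) == 3:
        return ISO3_TO_ISO2.get(code)
    return None


def total_by_country(rows):
    """Sum amounts per country after normalising the codes."""
    totals = {}
    for code, amount in rows:
        key = to_iso2(code) or "unknown"
        totals[key] = totals.get(key, 0) + amount
    return totals


totals = total_by_country([("UG", 100), ("UGA", 50), ("XXX", 5)])
```

Unrecognised codes are kept under "unknown" rather than dropped, so that the gap stays visible in the totals.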

3. Wrangling the data

OpenSpending is designed to show spending data, and has a powerful aggregation system to show large collections of transactions in a meaningful way. However, IATI data is organised by activities, with transactions nested within activities (projects), and – reflecting the business models of funders – activities sit within other activities (e.g., projects within programs), although they are not nested in the actual XML. Furthermore, one of the significant advantages of IATI compared to other aid data formats is that it permits multiple sectoral classifications, allowing you to assign a proportion of the value of an activity to each sector. So, you might have an activity that is 50% related to health and 50% to education.

To prepare the data for OpenSpending, each transaction inherits the properties of its activity (and, if that activity has a parent, that parent activity’s title and description). Then, the transaction is broken out into mini transactions, with the proportion of the activity assigned to each sector used to assign a proportion of the value of the transaction to each sector. So, from transactions, you get mini “sector-transactions”.
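The splitting step can be sketched as follows. The dictionary layout and the name split_by_sector are assumptions made for illustration; they are not the structures used in the actual conversion script.

```python
from decimal import Decimal


def split_by_sector(transaction, activity):
    """Break one transaction into one "sector-transaction" per sector."""
    rows = []
    for sector in activity["sectors"]:
        share = Decimal(sector["percentage"]) / Decimal(100)
        rows.append({
            # Properties inherited from the activity (and its parent, if any).
            "activity_title": activity["title"],
            "parent_title": activity.get("parent_title"),
            "sector": sector["name"],
            # The sector's share of the transaction value.
            "value": Decimal(transaction["value"]) * share,
            "date": transaction["date"],
        })
    return rows


activity = {
    "title": "School clinics",
    "parent_title": "Social programme",
    "sectors": [
        {"name": "Health", "percentage": "50"},
        {"name": "Education", "percentage": "50"},
    ],
}
transaction = {"value": "1000", "date": "2010-06-30"}
rows = split_by_sector(transaction, activity)
```

Here a single transaction of 1000 becomes two rows of 500 each, one for health and one for education, so the aggregated totals per sector still add up to the original amount.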

This takes about 40 minutes to compile, and then one final step remains: to convert the currencies to a single currency. Currently, USD, EUR and GBP amounts are used in the IATI data. All data is converted to USD using the average for 2010 from the OECD’s Financial Indicators (MEI) dataset. (This is also considered below in “next steps”.)
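A conversion with a single annual rate per currency might look like the sketch below. The rates are placeholders chosen only to make the example run; they are not the OECD figures, and the name to_usd is hypothetical.

```python
from decimal import Decimal

# PLACEHOLDER rates (USD per one unit of the currency), not the OECD averages.
USD_PER_UNIT = {
    "USD": Decimal("1"),
    "EUR": Decimal("1.30"),
    "GBP": Decimal("1.50"),
}


def to_usd(amount, currency, rates=USD_PER_UNIT):
    """Convert an amount to USD using one annual average rate per currency."""
    try:
        rate = rates[currency]
    except KeyError:
        raise ValueError(f"no rate for currency {currency!r}")
    return (Decimal(amount) * rate).quantize(Decimal("0.01"))


converted = to_usd("100", "EUR")
```

Failing loudly on an unknown currency seems safer than silently passing the amount through, since a mixed-currency total would be meaningless.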

4. Loading the data

OpenSpending’s new web-based loading interface makes it relatively easy to load data in, although you currently also have to write a model and views (links at end).

Results

The results can be viewed in the OpenSpending IATI dataset. You can explore the data by recipient country, sectors, funding organisation, and drill down through the data to see the data for an individual country.

Problems with the data

So far I’ve noticed the following problems:

  • “Unknown” recipient location is incorrectly marked as “South Sudan”
  • Recipient countries are listed twice, as Spain has used ISO3 rather than ISO2 country codes.
  • Sweden is listed as “Ministry of Foreign Affairs” (this is how they have listed themselves as the Funding Organisation in the data)
  • Sweden’s implementing organisations have been lost as they placed them in the accountable organisation field.

Please let me know if you see anything else problematic, if you have any criticisms of or feedback on the way the data has been presented, or if you think there are other ways you’d like to be able to explore the data, based on the available dimensions.

Next steps

As mentioned above, there are some problems with the data which should properly be dealt with at the level of the donor agency. But there are others that will probably have to be dealt with by users of the data:

  • Mapping between different sector vocabularies, so that you can see all “Health” projects, and not only the health projects according to a single vocabulary
  • Mapping between countries and regions, so that every project in a country has a related region
  • Correctly converting currencies using the “value-date” column to get a more precise (at least month-specific) conversion.

What else have you noticed with the data? Is there anything else that should be changed? Anything interesting?

You can contact Mark about this data via the OpenSpending mailing list

Useful Links

How Spending Stories Spots Errors in Public Spending

December 5, 2011 in Data Journalism, Spending Stories

This article was originally published on MediaShift Idea Lab and was co-written by Martin Keegan, project lead for Spending Stories and Lucy Chambers, Community Coordinator for OpenSpending.

How public funds should be spent is often controversial. Information about how that money has already been spent should not be ambiguous at all. People arguing about the future will care about the present, and if data about past or present public spending is available, many will certainly look at it. When they do, occasionally they will find errors, or believe themselves to have found errors.

OpenSpending, which aims to track every (public) government and corporate financial transaction across the world, encourages users to:

  • augment the existing spending database with additional sources of data
  • use that data — e.g., to write evidence-based articles and formulate informed decisions about how their society is financed.

Spending Stories is our effort to make OpenSpending a natural way to do data journalism about public spending.


The Problem

FACT 1: Errors occur in data, no matter how official the source.

FACT 2: Data wrangling (manipulating or restructuring datasets to correct inaccuracies, remix with other datasets to augment the data, or perform calculations on the data), generally improves data quality, for example, through reconciling entities and flagging amounts that are obviously incorrect.

FACT 3: Data wrangling can also introduce errors if not tackled correctly.

Crucial to ensuring the use of this data in articles or ensuring re-use by concerned citizens is the ability to show that the data is valid. In addition, maintaining a good relationship with public bodies who are confident that they are not being misrepresented in the data is vital to ensuring the data continues to be released in the first place. In practice, this means that the provenance of the data has to be clear including:

  • where the data originally came from (preferably a URL)
  • whether anyone (e.g., government, community data wrangler, or OpenSpending) has worked on the data since it was published, and what steps they took to change the data (i.e., these steps should be reproducible to produce the same result)

The OpenSpending team has gone to lengths to retain enough information to say who was responsible for both of the above.

OpenSpending is a system, somewhat like a wiki, which allows you to track back through the data wrangling process and work out what changes were made to the data, when and by whom.
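As an illustration of what such a record needs to hold, here is a minimal sketch. It is not OpenSpending's actual data model: the class names, fields and the example URL are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    who: str          # e.g. government, community data wrangler, OpenSpending
    when: str         # ISO date of the change
    description: str  # what was done, in enough detail to reproduce it


@dataclass
class Provenance:
    source_url: str   # where the data originally came from
    steps: List[Step] = field(default_factory=list)

    def record(self, who, when, description):
        """Append one wrangling step to the history."""
        self.steps.append(Step(who, when, description))

    def history(self):
        """Return the steps in order, as readable lines."""
        return [f"{s.when} {s.who}: {s.description}" for s in self.steps]


prov = Provenance("http://example.org/spending-2010.csv")
prov.record("community wrangler", "2011-11-02", "merged monthly files")
prov.record("OpenSpending", "2011-11-10", "reconciled supplier names")
```

Keeping the steps as an ordered list is what makes it possible to walk back from a disputed figure to the change that produced it.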

Error reporting in practice

OpenSpending recently received a pointed inquiry from the U.K. Treasury disputing the claims we were making about the payment of British public money to a private company. Believing that an error had been introduced, we attempted to retrace our steps and find out where this had occurred, and who was responsible.

As we discovered, the payment had actually taken place, but the OpenSpending descriptions used to label the transaction were not sufficiently detailed to accurately reflect the item in question.

With Spending Stories, we were able to retrace our steps because we had preserved a copy of the software tools we used for collecting the data (the data is published by about 50 public bodies, and must be downloaded, stitched together, and firmly molded into shape). These tools had also been made available to the public, so the Treasury and other concerned citizens could have checked our work themselves; the availability of this kind of check keeps all participants in the fiscal debate honest.

What had gone wrong was a problem of terminology: The transactions existed, but ambiguous language had been used to describe them, glossing over the distinction between the government department reporting what money had been spent and the government agency which actually spent the money. The bodies in question were the Department of Health and a regional health care trust; this distinction is certainly one which a concerned citizen would expect to be made clearly — so we should make sure our system makes it easy to know which question is being asked.

Checkpoints in OpenSpending

In the short term, we are mitigating the problem of data errors as follows:

  • Data provenance – is the source identifiable and the process reproducible? OpenSpending encourages people to add modified datasets to a “package” in the Data Hub. This allows other users to see the original document alongside any modified documents and track the chain of changes made, to see clearly at which points errors could have been introduced.
  • Crowdsourcing feedback on spending data.
  • Permitting re-use of the structured data we present, so that it can inform decisions in other fact-checking systems.

Ultimately, we will build our part of the ecosystem to provide feedback to the political process, by improving democratic discourse about the public finances.

Lucy Chambers is a community coordinator at the Open Knowledge Foundation. She works on the OKF’s OpenSpending project and coordinates the data-driven-journalism activities of the foundation, including running training sessions and helping to streamline the production of a collaboratively written handbook for data journalists.

Martin Keegan is a software engineer and linguist, currently leading the Open Knowledge Foundation’s OpenSpending project. He is also on the Open Knowledge Foundation’s board, and has worked for SRI, Citrix, University of Cambridge and co-founded and worked for various civil society organizations.

OpenSpending visualisations featured in the Guardian

November 28, 2011 in Coverage

This post is by Lucy Chambers, Community Coordinator on OpenSpending.

On Friday, the Guardian Poverty Matters blog published a piece on the Uganda visualisation that the OpenSpending team had been working on with Publish What You Fund.

From the article

“The Publish What You Fund campaign group and the Open Knowledge Foundation have now produced a visualisation of Uganda’s aid and budget data for 2003-2006, billed as the first time both sets of data have been displayed together in a way that is easy to explore. A quick look shows just how big a piece of the puzzle aid spending is – more than 50% of overall resources available in Uganda for 2005-2006. The vast majority of this $1.1bn in aid was spent directly by donors on various projects, with only a third given to the government to spend along with its domestic resources. Interestingly, aid money made up only a small proportion of resources for education, while accounting for the majority of resources for health, agriculture, water and the environment.”

Busan Aid effectiveness meeting

The release of the visualisation comes ahead of the Busan aid effectiveness meeting and highlights some of the key benefits of opening up spending data, both to the donor organisations and the governments of the recipient countries themselves:

“Four years ago, researchers at the London-based Overseas Development Institute took up the enormous task of trying to figure out how dozens of donors were spending aid in Uganda, and how that compared with where the government was allocating its own resources. The results were striking: it turned out the Ugandan government was only aware of half the aid being spent in the country, despite routinely requesting this information from donors.”

It is hoped that visualisations such as these will make it easier to digest complex datasets of this type, where a government receives support from multiple sources. It is also hoped that discussions around the topic will result in the more timely and regular release of data to help highlight practices that will lead to aid money being most effectively spent.

Read the full Article in the Guardian Poverty Matters blog.

Do you have similar data that you would like to visualise in this way? Drop us an email via the OpenSpending mailing list.

OpenSpending v0.11 Released

November 16, 2011 in Releases

We are happy to announce the release of the latest version of OpenSpending. Most of our work has been to improve how we store and organise spending data. Users will notice that the web frontend has been refreshed and is now much better integrated.

New features include

  • New backend using a conventional relational database allowing clean separation of datasets, and better scalability. The database backend is also much more familiar to developers than the previous backend
  • Lots of documentation for API users and visualization hackers
  • New theme based on twitter’s bootstrap framework
  • Initial support for i18n/translation of the frontend
  • Better validation of input data and model.

Feedback on the new site and features is welcome. Please drop us a line via the mailing list.

New Translation Documentation for OpenSpending

November 11, 2011 in Contribute

There’s been a lot of demand for us to document the translation procedure for OpenSpending, so this is now up and live on the wiki.

For reference, I’ve also briefly included the steps here:

In order to translate OpenSpending, please follow the following steps:

  • Create an account on Transifex
  • Email info [at] openspending.org with your Transifex Username and ask to be added to the group for your language of translation.
  • Proceed to the following link
  • Click ‘Add a translation’ and follow the instructions on-screen.

When you have finished your translation…

  • Drop us an email to info [at] openspending.org and we will include it into our next release.

Things to be aware of

  • With each new release of the code, you may need to update your translation, to make sure all the new commands are accounted for. We are currently building up to a big code release and will inform the list when the strings are stable. If you are eager to get going, you may start translating; most of your translation should be preserved, but there will be a little additional work to do before the release.

Happy translating! If you have any questions, drop me an email via the mailing list.

Thoughts from the Global Investigative Journalism Conference

October 27, 2011 in Data Journalism, Spending Stories

This post is by Lucy Chambers, community coordinator at the Open Knowledge Foundation, and Friedrich Lindenberg, Developer on OpenSpending. They recently attended the Global Investigative Journalism Conference 2011 in Kyiv, Ukraine, and in this post, bring home their thoughts on journalist-programmer collaboration…

The conference

The Global Investigative Journalism Conference must be one of the most intense yet rewarding events either of us has attended since joining the OKF. With topics ranging from human trafficking to offshore companies, the meeting highlighted the importance of long-term, investigative reporting with great clarity.

With around 500 participants from all over the globe with plenty of experience in evidence gathering, we used this opportunity to ask many of them how platforms like OpenSpending can contribute, not only to the way in which data is presented, but also to how it is gathered and analyzed in the course of an investigation.

Spending Stories – the brainstorm

As many of you will be aware, earlier this year we won a Knight News Challenge award to help journalists contextualise and build narratives around spending data. Research for the project, Spending Stories, was one of the main reasons for our trip to Ukraine…

During the data clinic session as well as over drinks in the bar of “Hotel President” we asked the investigators what they would like to see in a spend analysis platform targeted at data journalists. Cutting to the chase, they immediately raised the key questions:

How will it support my work?

It was clear that the platform should support the existing journalistic workflow through publishing embargos, private datasets and note making. At the same time, the need for statistical and analytical heuristics to dissect the data, find outliers and visualize distributions was highlighted as a means to enable truly data-driven investigations of datasets. The goal in this is to distinguish anomalies from errors and patterns of corruption from policies.
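As one example of such a heuristic, a modified z-score based on the median absolute deviation can flag amounts that sit far from the rest. This is a sketch only, not an OpenSpending feature: the function name, the sample amounts and the 3.5 cut-off (a common rule of thumb) are assumptions.

```python
from statistics import median


def flag_outliers(amounts, threshold=3.5):
    """Return amounts whose modified z-score exceeds the threshold."""
    med = median(amounts)
    # Median absolute deviation: a spread measure that outliers barely affect.
    mad = median(abs(a - med) for a in amounts)
    if mad == 0:
        return []  # no spread to measure against
    return [a for a in amounts if 0.6745 * abs(a - med) / mad > threshold]


payments = [100, 120, 110, 105, 115, 5000]
flagged = flag_outliers(payments)
```

A flagged amount is only a prompt for a closer look: whether it is a typing error, a legitimate large payment or something worse is exactly the judgement the journalists described.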

What’s in it for my readers?

With the data loaded and analyzed, the next question is what value can be added to published articles. Just like DocumentCloud enabled the easy embedding of source documents and excerpts, OpenSpending should allow journalists to visualize distributions of funds, embed search widgets and data links, as well as information about how the data was acquired and cleaned.

What do I need to learn to do it?

Many of those we spoke to were concerned about the complexity required to contribute data. The recurring question was: should I even try myself or hire help? It’s clear that for the platform to be accessible to journalists, a large variety of data cleansing tutorials, examples and tools need to be at their disposal.

We’ve listed the full brainstorm on the OpenSpending wiki

You can also see the mind map with concrete points below:

Hacks & Scrapers – How technical need data journalists be?

In a second session, “Data Camp”, we went through the question of how to generate structured data from unstructured sources such as web pages and PDF documents. We tried to emphasize the value of easily machine-readable data over less structured information by pointing to some examples on ScraperWiki.

As we went through basic steps needed to scrape a web page, the questions began turning towards the purpose of the exercise:

“So why do we need to learn how to scrape? Can’t we just hire someone to do this for us?”

Our answer went something like…

“Well, yes – you can actually, but…”

…it may be a good idea to have some understanding of which data can be easily retrieved and what difficulties and errors you might encounter in the extraction process. This includes:

  1. Understanding the possibilities and limitations of various data structures on the web to understand how to approach programmers and what to ask for (and importantly, what it is reasonable to pay).
  2. Understanding how to quality-check data extracted from the internet and where errors could be introduced.
  3. Appreciating that programmers are expensive and that having a basic understanding of some of the principles behind screen scraping yourself could save your organisation quite a lot of money for simpler tasks
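To give a feel for the scale of a simple task, here is a minimal sketch of scraping one HTML table and quality-checking the result, using only Python's standard library. The page content, column names and function names are invented for illustration; real pages are usually messier.

```python
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Invented page fragment standing in for a real web page.
PAGE = """
<table>
  <tr><th>Recipient</th><th>Amount</th></tr>
  <tr><td>Project A</td><td>1,200</td></tr>
  <tr><td>Project B</td><td>n/a</td></tr>
</table>
"""


def scrape(html):
    """Turn the table into a list of dictionaries keyed by the header row."""
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


def check(records):
    """Quality check: return records whose Amount is not a plain number."""
    bad = []
    for record in records:
        try:
            float(record["Amount"].replace(",", ""))
        except ValueError:
            bad.append(record)
    return bad


records = scrape(PAGE)
problems = check(records)
```

The check step matters as much as the scrape: the second row parses without complaint, and only the explicit test on the amount shows that it carries no usable figure.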

The notes from the scraping session are available on this pad

So how do I hire a hacker?

The final thing that became blatantly apparent in sessions such as “Journalist or Programmer? Do Reporters Need to become Coders?” was that there is a huge void that needs to be bridged between the hacker and journalist worlds. If I had a pound for every time someone at the conference asked me how they could find a hacker, I would be mighty happy. We pointed people in the direction of Hacks and Hackers meetings, but there is clearly a need for a more extensive ‘address book’ of reliable contacts.

I will attempt to pull together some of the thoughts we had about how to find (and trust!) your hacker in a separate post to address some of these needs. If you have further advice or anecdotes on this subject, please don’t hesitate to get in contact via the OpenSpending mailing list.

Uploading Data to OpenSpending

March 20, 2011 in Contribute

The number of datasets available on OpenSpending.org is growing fast, and we want more! Currently the process looks like this:

  1. You give us data.
  2. We look at it, try to understand it, possibly ask you some more questions.
  3. We write a custom loader script to load the data.

To make this process easier for us and faster for everybody, we offer an alternative process that requires a bit more work from you. But if you know how to transform your data to our CSV format, you will have your spending data online on OpenSpending more quickly and we can spend more time developing features! Here is how it works:

  1. You create a CSV file that is formatted according to our CSV schema. Here is a really simple example of a CSV file.
  2. You use our new web based uploader that automatically checks your CSV file for errors and stores it along with some meta data.
  3. Contact us and we will do the final step and load the data into OpenSpending.org.
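Before uploading, you could run a similar check yourself. The sketch below is not the uploader's code and does not reproduce our CSV schema: the required column names are hypothetical, chosen only to show the kind of errors an automatic check can catch.

```python
import csv
import io

# HYPOTHETICAL column names; consult the real CSV schema for the actual ones.
REQUIRED = ["from", "to", "amount", "time"]


def validate(csv_text):
    """Return a list of human-readable errors; an empty list means no errors found."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = [c for c in REQUIRED if c not in (reader.fieldnames or [])]
    if missing:
        return [f"missing column(s): {', '.join(missing)}"]
    errors = []
    # The header is line 1, so data rows start at line 2.
    for line_no, row in enumerate(reader, start=2):
        try:
            float(row["amount"])
        except (TypeError, ValueError):
            errors.append(f"line {line_no}: amount {row['amount']!r} is not a number")
    return errors


GOOD = "from,to,amount,time\nDept A,Supplier B,1200.50,2010\n"
BAD = "from,to,amount,time\nDept A,Supplier B,lots,2010\n"
good_errors = validate(GOOD)
bad_errors = validate(BAD)
```

Reporting the line number with each error makes it quick to find and fix the offending row in a spreadsheet before trying again.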

The schema and this alternative process are by no means set in stone: any feedback is appreciated! Most important: if you have spending data, but can’t provide it in our CSV format, don’t worry and just contact us. We always prefer some data over no data!

‘Where Does My Money Go?’ Goes international. Welcome to OpenSpending.

March 1, 2011 in Uncategorized

This post is by Friedrich Lindenberg, one of the developers working on OpenSpending.

Our primary goal has to be to grow WDMMG as an open platform, similar to Open Street Map: while on OSM you sketch out your local streets, WDMMG should become the place to upload and analyze your local or state government’s spending. Therefore, our priority has to be providing the right tools to allow people to contribute to this effort themselves: either by loading data, annotating spending or visualizing it in custom ways.

As such transparency is needed not only in the UK but all over the world, we want to re-label the data part of the site (what is now data.wdmmg.org) to the more international OpenSpending. This would serve both as an accessible means of handling financial data and as a backend to more specific sites, such as the UK’s WhereDoesMyMoneyGo visualizations and Germany’s OffenerHaushalt.

I’d like to invite all of you to follow up on the remainder of our discussion, which is archived at http://wiki.openspending.org/Status_2011-02-10 and to contribute your own thoughts.