
How Spending Stories Fact Checks Big Brother, the Wiretappers’ Ball

February 24, 2012 in Data Journalism, Spending Stories

This piece was co-written with Eric King of Privacy International and comes as Privacy International launches a huge new data release about companies selling surveillance technologies. It is cross-posted on the MediaShift PBS IDEA LAB.

Today, the global surveillance industry is estimated at around $5 billion a year. But which companies are selling? Which governments are buying? And why should we care?

We show how the OpenSpending platform can be used to speed up fact checking, showing which of these companies have government contracts, and, most interestingly, with which departments…

The Background

Big Brother is now indisputably big business, yet until recently the international trade in surveillance technologies remained largely under the radar of regulators and civil society. Buyers and suppliers meet, mingle and transact at secretive trade conferences around the world, and the details of their dealings are often shielded from public scrutiny by the ubiquitous defence of ‘national security’. Perhaps unsurprisingly, this environment has bred a widespread disregard for ethics and a culture in which the single-minded pursuit of profit is commonplace.

For years, European and American companies have been quietly selling surveillance equipment and software to dictatorships across the Middle East and North Africa – products that have allowed these regimes to maintain a stranglehold over free expression, smother the flames of political dissent and target individuals for arrest, torture and execution.

They include devices that intercept mobile phone calls and text messages in real time on a mass scale, malware and spyware that gives the purchaser complete control over a target’s computer and trojans that allow the camera and microphone on a laptop or mobile phone to be remotely switched on and operated. These technologies are also being bought by Western law enforcement, including small police departments in which the ability of officers to understand the legal parameters, levels of accuracy and limits of acceptability is highly questionable.

The data that has just been released on the Privacy International Website includes the following:

  1. An updated list of companies selling surveillance technology, and
  2. A list of all the government agencies attending an international surveillance trade show known as the Wiretappers’ Ball.

Some names are predictable enough: the FBI, the US Drug Enforcement Administration, the UK Serious Organised Crime Agency and Interpol, for example. The presence of others is deeply disturbing: the national security agencies of Bahrain and Yemen, the embassies of Belarus and the Democratic Republic of Congo and the Kenyan intelligence agency, to name but a few. A few are downright baffling, like the US Department of Commerce, the US Fish & Wildlife Service and the Clark County School District Police Department.

Now, with the aid of OpenSpending, anyone can cross reference which contracts these companies hold with governments around the world. The investigation continues…

Using OpenSpending to speed up fact-checking

Privacy International approached the Spending Stories team to ask for a widget that could search across all of the government spending datasets for contracts held between governments and these companies (until this point, it had only been possible to search one database at a time).

The Spending Browser is now live at http://opendatalabs.org/spendbrowser. And, as the URLs correspond to the queries, individual searches can be passed on for further examination and, importantly, embedded in articles directly. Try it yourself against the list of companies listed in the Surveillance Section of the Privacy International Site (Just enter a company e.g. ‘Endace Accelerated’ into the search bar).

The Spending Browser will become increasingly powerful as more data is loaded into the system.

Want to help make this tool even more powerful? Get involved and help to build up the data bank.

Coverage

You can read more about the background to these stories on the Privacy International Site and recent coverage by the International Media:

Hakuna My Data: NBO Data Bootcamp

January 30, 2012 in Data Journalism, events

This post is by Friedrich Lindenberg, developer on OpenSpending.

“My Name is XXXX, I am a member of the Kenyan parliament for the constituency of XXXX in the 2007-2012 election cycle. During my time in parliament, I have positioned myself against taxes for MPs.

Of the Development Funds allocated to my constituency, I have spent 12mn KSH in 2010 and 8mn KSH in 2009. Since 2007, I’ve funded 201 projects, of which 72 (9mn KSH) related to Education, 56 (7.2mn KSH) related to Health and 20 (4.2mn KSH) to Infrastructure.

The largest projects I have funded include… “

Auto-generated, spending data-driven campaign speeches like this are just one of the many ideas of the Data Bootcamp that took place in Nairobi last week. Invited by the African Media Initiative and the World Bank Institute, about 70 participants – both journalists and developers – met on Strathmore University’s campus to learn and practise both the skills and tools required for data-driven reporting.

The four-day programme combined tools training with practical work in small groups. Elena Egawhary (BBC NewsNight) gave a workshop on data analysis in Excel, and Sreeram Balakrishnan (Google Fusion Tables) introduced both Refine and Fusion Tables. Team members from both the Kenya data portal and the World Bank finance site presented their respective offerings, while Gregor and I from the OpenSpending team gave intros to web scraping and advanced map visualisation.

During group work, journalists and developers teamed up to try their newly learned skills in different domains ranging from sports (football player profiles) to education (missing toilets in schools, “The Shit Ordeal”) and the financial transparency story-telling mentioned above.

The workshop also served as a community-building event for Kenya’s young and impressive Open Data initiative. Future events, aimed at civil society organisations and political actors, will help to further promote the re-use of government information released through the initiative.

All this is happening in a place where transparency is an essential tool to be developed: Not only is the access to information now guaranteed by the 2010 Kenyan constitution, there are also major political issues that deserve close attention from local and international watchdogs. These include not only the ongoing incursion of Kenyan troops into Somalia in an effort to fight Al-Shebab terrorist groups, but also the upcoming nationwide elections in December 2012. The elections will instate a new bicameral system of government, with many previously unknown candidates standing for office. In the previous 2007 vote, bad polling station data had quite literally led to widespread unrest and thousands of deaths across the nation.

In all, it was fantastic to get in touch with the Kenyan participants of the workshop and to see how the organizers of the event – a brilliant team including Craig Hammer, Justin Arenstein and Jay Bhalla – are working to foster an open data community in this bustling developing nation. Given the great ideas generated during the team sessions, I’m sure this work will soon bear its first fruits.

Transparency and technology in Brazil: linking politicians to bad entrepreneurs

January 23, 2012 in Data Journalism, Spending Stories

This story by Fabiano Angélico, who formerly worked at Transparencia Brasil, is about how technology and the help of coders can be used to highlight links between politicians and corrupt entrepreneurs. It is followed by a brief “Behind the News” interview which shows some of the time costs of datawrangling and problems faced when getting the story out.

How can transparency and technology point out connections between politicians and bad entrepreneurs? Well, first of all you will need some information about the politicians and about the entrepreneurs.

In Brazil, in spite of the historical lack of transparency in governments (Brazil’s freedom of information law was sanctioned just late last year), the Electoral Court has been proactively providing information on political candidates since 2002. One piece of information is the record of financial donations to the candidates, showing who is donating to whom and how much. Although this database is released only after the elections (the information would surely be more powerful if it were released DURING the political campaigns), one must admit this is a rich source of information.

January, 2010. Elections for President and for the Parliament, as well as for State Governors and State Parliaments, would happen in only 9 months’ time, in October. However, many people were already discussing them.

At that time, when 2010 had just begun, I was at work, thinking of how to find rich and useful information on the candidates. Then I was reminded of the so-called “Dirty List”: a list regularly published by the Ministry of Labour which names the companies and farmers caught by government officials keeping workers in very poor conditions, similar to slavery.

The list published on the Ministry’s website is in not-so-friendly PDF format, but it has a plus: it gives not only the name of the company or the entrepreneur/farmer, but also their registry numbers within the government. I remembered that the Electoral Court data also contains these numbers. That was important because having the registry numbers would avoid ambiguities.

I had both lists: the donors to the previous elections (2008, 2006, 2004 and 2002) and the “Dirty” companies. But I had a problem: I did not know how to match up the datasets. My tech knowledge allowed me to transform the PDFs into CSV, but I could not go further without help.

I then sent the datasets, in CSV format, to Transparencia Hacker, a Google Groups list which now gathers over 800 people interested in the connections between transparency and politics/public administration.

Within 2 days, the guys made the datasets talk, and we found that 16 politicians had been elected with the help of “Dirty” money in the 4 previous elections. Another 13 politicians had received donations from companies on the “Dirty List” but had not succeeded in winning the elections.

A local newspaper told the story.

In October 2012, there are local elections in Brazil. I hope we can shed even more light on the candidates.

Behind the news:

Roughly how long did it take you to extract the data from the PDFs? Do you know how long the guys from Transparencia Hacker spent working on the data?

This was kind of easy. It took me just some minutes. The “Dirty List” is a 20-page PDF. I always use a website to convert it into xls or csv (I like Cometdocs for this work).

Here is the Dirty List, in PDF (last updated on the 8th of November, 2011; the list we used is in CSV but it is very outdated because it dates from January 2010). Here are the Electoral Court pages for the list of donors: 2002, 2004, 2006, 2008 and 2010.

What I asked the Transparencia Hacker community was to check whether the CNPJs (the companies’ registry numbers within the government) in the CSV would match any item on the Electoral Court webpages. The guys worked on the data for 2 days.
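The matching step described above can be sketched in a few lines of Python. This is only an illustration, not the code the Transparencia Hacker volunteers wrote: the column names (`cnpj`, `donor_cnpj`, `candidate`, `amount`) and the sample rows are made up, and the real files would need their own headers. The idea is to reduce each registry number to its digits, so that differences in punctuation do not hide a match, and then keep every donation whose donor number appears on the Dirty List.

```python
import csv
import io
import re


def normalise_cnpj(raw):
    """Keep only the digits, so '12.345.678/0001-90' equals '12345678000190'."""
    return re.sub(r"\D", "", raw)


def match_donations(dirty_list_file, donations_file):
    """Return the donation rows whose donor registry number is on the Dirty List.

    The column names 'cnpj' and 'donor_cnpj' are hypothetical; adjust them to
    whatever headers the converted CSV files actually have.
    """
    dirty = {normalise_cnpj(row["cnpj"]) for row in csv.DictReader(dirty_list_file)}
    dirty.discard("")  # ignore Dirty List rows where the number is missing
    return [
        row
        for row in csv.DictReader(donations_file)
        if normalise_cnpj(row["donor_cnpj"]) in dirty
    ]


# Made-up sample data standing in for the two converted CSV files.
DIRTY_CSV = "company,cnpj\nFazenda Exemplo,12.345.678/0001-90\n"
DONATIONS_CSV = (
    "candidate,donor_cnpj,amount\n"
    "Candidate A,12345678000190,5000\n"
    "Candidate B,98765432000110,7000\n"
)

matches = match_donations(io.StringIO(DIRTY_CSV), io.StringIO(DONATIONS_CSV))
for row in matches:
    print(row["candidate"], row["amount"])
```

Matching on the registry number rather than on the company name is what avoids the ambiguities mentioned earlier: two spellings of the same name would not match, but the same number always will.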

Is sufficient data available to visualise the total amount lobbyists donated to political campaigns, and would it be useful to do so? If you were to visualise the info – what would the priorities be to show? Would any tools be useful to explore the data?

Yes, there is enough data. And YES, it would be very useful to visualize those links. I would prioritise the presidential and governor candidates as well as some Congressmen who hold top-positions in both Houses of Congress. Also, the donations to political parties (not to individual politicians) would be a plus.

A search form would be very useful. The search could have filters for position (Presidential candidate, governor candidate, political party etc), geography (Brazil, states) and donors (with no filters, just a blank field for writing).

In your ideal world, in time for the impending elections – what would be done differently from last time? Any additional data you would like to see released?

I’d have to think more carefully to answer that, but concerning additional data: the number which identifies the market (the field) in which the companies work.

Interested in writing a “Behind the News” piece for the OpenSpending blog? Get in touch via our twitter account or email info [at] openspending.org.

Some useful links (mainly in Portuguese):

How Spending Stories Spots Errors in Public Spending

December 5, 2011 in Data Journalism, Spending Stories

This article was originally published on MediaShift Idea Lab and was co-written by Martin Keegan, project lead for Spending Stories and Lucy Chambers, Community Coordinator for OpenSpending.

How public funds should be spent is often controversial. Information about how that money has already been spent should not be ambiguous at all. People arguing about the future will care about the present, and if data about past or present public spending is available, many will certainly look at it. When they do, occasionally they will find errors, or believe themselves to have found errors.

OpenSpending, which aims to track every (public) government and corporate financial transaction across the world, encourages users to:

  • augment the existing spending database with additional sources of data
  • use that data — e.g., to write evidence-based articles and formulate informed decisions about how their society is financed.

Spending Stories is our effort to make OpenSpending a natural way to do data journalism about public spending.

The Problem

FACT 1: Errors occur in data, no matter how official the source.

FACT 2: Data wrangling (manipulating or restructuring datasets to correct inaccuracies, remix with other datasets to augment the data, or perform calculations on the data), generally improves data quality, for example, through reconciling entities and flagging amounts that are obviously incorrect.

FACT 3: Data wrangling can also introduce errors if not tackled correctly.

Crucial to ensuring the use of this data in articles or ensuring re-use by concerned citizens is the ability to show that the data is valid. In addition, maintaining a good relationship with public bodies who are confident that they are not being misrepresented in the data is vital to ensuring the data continues to be released in the first place. In practice, this means that the provenance of the data has to be clear including:

  • where the data originally came from (preferably a URL)
  • whether anyone (e.g., government, community data wrangler, or OpenSpending) has worked on the data since it was published, and what steps they took to change the data (i.e., these steps should be reproducible to produce the same result)

The OpenSpending team has gone to lengths to retain enough information to say who was responsible for both of the above.

OpenSpending is a system, somewhat like a wiki, which allows you to track back through the data wrangling process and work out what changes were made to the data, when and by whom.
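To make the idea of tracking back through the wrangling process concrete, here is a minimal sketch of a provenance log in Python. It is an assumption-laden illustration and not how OpenSpending itself stores this information: the class names, the example URL, the dates and the sample data are all hypothetical. Each step records who changed the data, when, what they did, and a fingerprint of the result, so that a later re-run of the same steps can be compared against the log.

```python
import hashlib
from dataclasses import dataclass, field
from typing import List, Optional


def fingerprint(data: bytes) -> str:
    """SHA-256 digest used to recognise an exact copy of a dataset."""
    return hashlib.sha256(data).hexdigest()


@dataclass
class Step:
    who: str     # person or organisation that produced this version
    when: str    # date of the change (made-up dates in this example)
    what: str    # human-readable description of the change
    sha256: str  # fingerprint of the data after the change


@dataclass
class ProvenanceLog:
    source_url: str  # where the data originally came from
    steps: List[Step] = field(default_factory=list)

    def record(self, who: str, when: str, what: str, data: bytes) -> None:
        self.steps.append(Step(who, when, what, fingerprint(data)))

    def first_divergence(self, snapshots: List[bytes]) -> Optional[int]:
        """Index of the first step whose re-run output differs from the log."""
        for index, (step, data) in enumerate(zip(self.steps, snapshots)):
            if fingerprint(data) != step.sha256:
                return index
        return None


# Hypothetical data: an original release and a cleaned-up version of it.
raw = b"body,amount\nDoH,1000\n"
cleaned = b"body,amount\nDepartment of Health,1000\n"

log = ProvenanceLog(source_url="https://example.org/spending.csv")
log.record("publishing department", "2011-11-01", "original release", raw)
log.record("community data wrangler", "2011-11-15", "expanded body names", cleaned)

for step in log.steps:
    print(step.when, step.who, step.what)
```

Calling `log.first_divergence([raw, cleaned])` returns `None` when a re-run reproduces every step, and the index of the first step that differs otherwise, which is the point at which an error could have been introduced.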

Error reporting in practice

OpenSpending recently received a pointed inquiry from the U.K. Treasury disputing the claims we were making about the payment of British public money to a private company. Believing that an error had been introduced, we attempted to retrace our steps and find out where this had occurred, and who was responsible.

As we discovered, the payment had actually taken place, but the OpenSpending descriptions used to label the transaction were not sufficiently detailed to accurately reflect the item in question.

With Spending Stories, we were able to retrace our steps because we had preserved a copy of the software tools we used for collecting the data (the data is published by about 50 public bodies, and must be downloaded, stitched together, and firmly molded into shape). These tools had also been made available to the public, so the Treasury and other concerned citizens could have checked our work themselves; the availability of this kind of check keeps all participants in the fiscal debate honest.

What had gone wrong was a problem of terminology: The transactions existed, but ambiguous language had been used to describe them, glossing over the distinction between the government department reporting what money had been spent and the government agency which actually spent the money. The bodies in question were the Department of Health and a regional health care trust; this distinction is certainly one which a concerned citizen would expect to be made clearly — so we should make sure our system makes it easy to know which question is being asked.

Checkpoints in OpenSpending

In the short term, we are mitigating the problem of data errors as follows:

  • Data provenance – is the source identifiable and the process reproducible? OpenSpending encourages people to add modified datasets to a “package” in the Data Hub. This allows other users to see the original document alongside any modified documents and track the chain of changes made, to see clearly at which points errors could have been introduced.
  • Crowdsourcing feedback on spending data.
  • Permitting re-use of the structured data we present, so that it can inform decisions in other fact-checking systems.

Ultimately, we will build our part of the ecosystem to provide feedback to the political process, by improving democratic discourse about the public finances.

Lucy Chambers is a community coordinator at the Open Knowledge Foundation. She works on the OKF’s OpenSpending project and coordinates the data-driven-journalism activities of the foundation, including running training sessions and helping to streamline the production of a collaboratively written handbook for data journalists.

Martin Keegan is a software engineer and linguist, currently leading the Open Knowledge Foundation’s OpenSpending project. He is also on the Open Knowledge Foundation’s board, and has worked for SRI, Citrix, University of Cambridge and co-founded and worked for various civil society organizations.

Thoughts from the Global Investigative Journalism Conference

October 27, 2011 in Data Journalism, Spending Stories

This post is by Lucy Chambers, community coordinator at the Open Knowledge Foundation, and Friedrich Lindenberg, Developer on OpenSpending. They recently attended the Global Investigative Journalism Conference 2011 in Kyiv, Ukraine, and in this post, bring home their thoughts on journalist-programmer collaboration…

The conference

The Global Investigative Journalism Conference must be one of the most intense yet rewarding experiences either of us has attended since joining the OKF. With topics ranging from human trafficking to offshore companies, the meeting highlighted the importance of long-term, investigative reporting with great clarity.

With around 500 participants from all over the globe with plenty of experience in evidence gathering, we used this opportunity to ask many of them how platforms like OpenSpending can contribute, not only to the way in which data is presented, but also to how it is gathered and analyzed in the course of an investigation.

Spending Stories – the brainstorm

As many of you will be aware, earlier this year we won a Knight News Challenge award to help journalists contextualise and build narratives around spending data. Research for the project, Spending Stories, was one of the main reasons for our trip to Ukraine…

During the data clinic session as well as over drinks in the bar of “Hotel President” we asked the investigators what they would like to see in a spend analysis platform targeted at data journalists. Cutting to the chase, they immediately raised the key questions:

How will it support my work?

It was clear that the platform should support the existing journalistic workflow through publishing embargos, private datasets and note making. At the same time, the need for statistical and analytical heuristics to dissect the data, find outliers and visualize distributions was highlighted as a means to enable truly data-driven investigations of datasets. The goal in this is to distinguish anomalies from errors and patterns of corruption from policies.
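As an illustration of the kind of heuristic the journalists asked for, here is a small Python sketch that flags unusually large or small payments. It is not a feature of the platform: the function name, the sample amounts and the cut-off of 3.5 are assumptions, and the method used is the modified z-score, which is built on the median and the median absolute deviation so that a single huge payment does not distort the baseline it is compared against.

```python
from statistics import median


def flag_outliers(amounts, threshold=3.5):
    """Return the amounts whose modified z-score exceeds the threshold.

    The modified z-score is 0.6745 * (x - median) / MAD, where MAD is the
    median absolute deviation from the median.
    """
    centre = median(amounts)
    mad = median([abs(amount - centre) for amount in amounts])
    if mad == 0:
        return []  # more than half the values are identical; nothing to compare against
    return [
        amount
        for amount in amounts
        if abs(0.6745 * (amount - centre) / mad) > threshold
    ]


# Made-up payments: six similar amounts and one that stands out.
payments = [1200, 1350, 1100, 1280, 1225, 1190, 125000]
print(flag_outliers(payments))
```

A flagged payment is only a starting point. Whether it is a typing error, a legitimate one-off purchase or something worth a story is exactly the distinction between errors and anomalies that still needs a journalist.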

What’s in it for my readers?

With the data loaded and analyzed, the next question is what value can be added to published articles. Just like DocumentCloud enabled the easy embedding of source documents and excerpts, OpenSpending should allow journalists to visualize distributions of funds, embed search widgets and data links, as well as information about how the data was acquired and cleaned.

What do I need to learn to do it?

Many of those we spoke to were concerned about the complexity required to contribute data. The recurring question was: should I even try myself or hire help? It’s clear that for the platform to be accessible to journalists, a large variety of data cleansing tutorials, examples and tools need to be at their disposal.

We’ve listed the full brainstorm on the OpenSpending wiki.

You can also see the mind map with concrete points below:

Hacks & Scrapers – How technical need data journalists be?

In a second session, “Data Camp”, we went through the question of how to generate structured data from unstructured sources such as web pages and PDF documents. We tried to emphasize the value of easily machine-readable data over less structured information by pointing to some examples on ScraperWiki.
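For readers who have never seen a scraper, here is a minimal sketch of the idea in Python, using only the standard library. It is not the code shown in the session or hosted on ScraperWiki, and the HTML snippet and supplier name are made up. It walks through a page, collects the text of each table cell row by row, and writes the result out as CSV, which is the step from a web page to machine-readable data.

```python
import csv
import io
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None  # drop any cell left open by malformed HTML


# Made-up page fragment standing in for a real spending table.
HTML = """
<table>
  <tr><th>Supplier</th><th>Amount</th></tr>
  <tr><td>Example Ltd</td><td>1,200</td></tr>
</table>
"""

parser = TableExtractor()
parser.feed(HTML)

buffer = io.StringIO()
csv.writer(buffer).writerows(parser.rows)
print(buffer.getvalue())
```

Real pages are messier than this: nested tables, merged cells and amounts formatted as text (such as the `1,200` above) are typical places where extraction errors creep in and where the output needs checking against the source.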

As we went through basic steps needed to scrape a web page, the questions began turning towards the purpose of the exercise:

“So why do we need to learn how to scrape? Can’t we just hire someone to do this for us?”

Our answer went something like…

“Well, yes – you can actually, but…”

…it may be a good idea to have some understanding of which data can be easily retrieved and what difficulties and errors you might encounter in the extraction process. This includes:

  1. Understanding the possibilities and limitations of various data structures on the web to understand how to approach programmers and what to ask for (and importantly, what it is reasonable to pay).
  2. Understanding how to quality-check data extracted from the internet and where errors could be introduced.
  3. Appreciating that programmers are expensive and that having a basic understanding of some of the principles behind screen scraping yourself could save your organisation quite a lot of money for simpler tasks.

The notes from the scraping session are available on this pad.

So how do I hire a hacker?

The final thing that became blatantly apparent in sessions such as “Journalist or Programmer? Do Reporters Need to become Coders?” was that there is a huge void that needs to be bridged between the hacker and journalist worlds. If I had a pound for every time someone at the conference asked me how they could find a hacker, I would be mighty happy. We pointed people in the direction of Hacks and Hackers meetings, but there is clearly a need for a more extensive ‘address book’ of reliable contacts.

I will attempt to pull together some of the thoughts we had about how to find (and trust!) your hacker in a separate post to address some of these needs. If you have further advice or anecdotes on this subject, please don’t hesitate to get in contact via the OpenSpending mailing list.
