FUMI YAMAZAKI: OK. Hello everyone. Thank you for coming. I'm super excited that we're having Joel Gurin, the author of this book "Open Data Now," at Google. JOEL GURIN: OK. Thank you all so much for coming. I want to say a couple of quick things before we get started. You can see on this slide I have a website as well as a book. The website is also open data now, just for the sake of simplicity. I use @joelgurin for Twitter. I also use the hashtag #opendatanow. There is a pattern there. I'm very happy to be speaking to you today. Also, if you didn't see it on the way in, on the way out, there is a sign-up sheet if you're interested in getting free email updates from my website or from the GovLab. Please sign up and we'll keep in touch, because there is a lot to talk about. So why open data and how did I get into this particular area? I have to start by saying I am probably by a couple of orders of magnitude the least technical person in this room right now. So what you're going to hear from me-- and it is a little humbling, to say the least, to come to talk about data to Google-- but what I hope I can bring to this is a sort of sense of overall perspective and context from the work that I've done in government and non-profits, as a journalist and now in academia. I really have tried to get a sense and sort of paint a picture of how open data is being seen and used in society today that I hope will be helpful to all of you. And I certainly hope we have a little time for questions. So my background very briefly-- as Fumi told you, I began as a science journalist. I was editorial director, and then executive vice president of "Consumer Reports" when we launched consumerreports.org, which is the largest paid information subscription site on the web now with about 3 million active paid subscribers. Shortly after that, I went to the Federal Communications Commission, began as head of the Consumer Bureau.
And at that point our chairman, Julius Genachowski, was very interested in figuring out how we can give consumers help in a simple decision like choosing a cellphone plan. Well, choosing a cellphone plan ends up being kind of like solving some difficult problem in topology or some such thing, or at least in statistics, because there are about 1,000 different cellphone plans offered by a company like Verizon. You multiply that by the number of companies, you factor in the fact that every consumer has different needs, so it became pretty clear as I was looking at this that this is a problem that is more complicated than it looks at first. It also turns out to be very similar to problems that other government agencies face in trying to advise consumers on things like financial services, housing, mortgages, education, and so on. So I began talking to people in other agencies about consumer information, generally. Out of that I was invited to chair the White House Task Force on Smart Disclosure. Smart disclosure being the term that we developed to describe giving data to consumers that they can use to make complex decisions. That report came out last May. And from that work I became more involved in open data and open government more generally. I met Beth Noveck, who some of you may know as the head of the Open Government Initiative during President Obama's first term and a real pioneer in open government and open data. She has now invited me to come work at the GovLab that she founded at NYU. And I'll tell you a few things about that. And I also have this website and this book on open data now, so I am sort of running the open data practice for the GovLab and looking at the implications of open data in many ways. Just a couple words about the GovLab.
I won't read what you can see on the screen, but our basic hypothesis and our mission is to figure out how to use technology and collaborative platforms and basically 21st century approaches to help improve governance and government, and the way that citizens and government interact. We think that people should interact with government more than when they vote once a year or when they happen to make a comment on the We the People petition website or something like that. We're looking at ways to really develop a different level of engagement that is good both for citizens and for government as well. And this model of collaborative democracy we feel has three major modes of operation- the first one is sharing responsibility, where a government can take a piece of what has been a government responsibility and delegate that to citizens. And here the paradigm is participatory budgeting, where in 1,500 cities around the world now the city government is saying, you take a chunk of the budget and spend it as you wish. We think that can be done in many other kinds of governance situations and that would be very productive. The second modality is getting knowledge and expertise in. Figuring out ways that not just the traditional government advisers, but people with technical abilities, technical skills, insight into community issues, and so on, can advise government at the federal, state, and city level, and we're seeing a lot of models for that. And then the third modality is getting open data out, which is what I work on and what I'm going to talk about today. So what is open data? There are a number of good definitions that have been done by different groups like the Open Knowledge Foundation and the Sunlight Foundation.
What I did in writing this book was to choose a fairly general definition- that open data is accessible public data that people, companies, and organizations can use to launch new ventures, analyze patterns and trends, make data-driven decisions, and solve complex problems. This definition incorporates not only open data from government, which is where a lot of the focus has been, but also open data from sources like social media, from many sources that are accessible to you at Google, and from other kinds of data that companies themselves may choose to release in different ways, as well as scientific data. So what you're going to hear me talk about today is open data in all of those forms and how they relate to each other, and how they relate to social and business goals. I do think-- and I'm certainly convinced having now worked in this area for a couple of years-- that we're talking about a phenomenon that has tremendous implications and tremendous impact potentially not only for business, but also for scientists, for journalists, for consumers, and for government. And in many ways, we're starting to see a convergence of the civic and the commercial uses of open data, where we're seeing some ventures that may start as non-profits that turn out to have a sustainable business model. And we're seeing businesses that turn out to be actually extremely mission driven in their use of open data. And you'll see many examples today. What is open data not? Open data is not the same as big data. And it's not the same as open government, and it's not even really a blending of big data and open government. It's a different kind of animal. Big data also has many definitions. I think the only thing everybody agrees on, at least when I ask them, is that when you're talking about what you mean by big data, we mean like really, really a lot of data. Really big data sets, which is not too surprising.
I think you can more accurately say that big data involves data sets that are at the current limit of our ability to analyze and use, but, of course, that limit changes every day. I do think there are real ways in which the quantity of data has a qualitative impact. In the same way that when I came here from New York, I theoretically could have ridden a bicycle, or even walked, and taking a plane is really more than an accelerated way of doing that, it's a whole different kind of travel. So I think that big data does have that kind of impact. But it's not philosophically different, in my view anyway, from smaller data problems in the way that open data is philosophically its own thing. Open government is very closely related to the concept of open data, but it's broader. Open government includes all kinds of government transparency. It also includes the kinds of collaboration and things that I just showed on the GovLab's slide. So part of it is data related but part of it is really other kinds of citizen engagement. So the book does present the grand unified theory of what is open data in a simple Venn diagram that you will find in appendix A in the book, and you can also find on my website opendatanow.com in a fairly lengthy blog post. I won't go through all this to analyze it, but the most important thing to notice here is that big data, open data, and open government have several points and areas of intersection. They are distinct, but they overlap. And when they overlap, it gets really interesting. The point in the middle, sector six there, which is all three things-- these are large public government data sets like weather, GPS, Securities and Exchange Commission data, et cetera. This is where we're going to see some of the highest economic value and some of the highest potential civic value. But it is by no means the only thing that's important about open data, or the only kind of open data that's important. So that's the terrain.
Having said that, I'm going to take you very quickly, believe it or not, through what I see as nine open data trends. I'm going to pose three open open-data questions that I don't have the answers to, but I think we can all probably discuss and think about. And I'm going to describe a study that we're now doing at the GovLab that I think will be of interest to you, called the Open Data 500, that I think is going to really help advance this field. So let's get started. So the first trend is liberating government data. It's undeniable that governments at all levels, not only in the US, but in countries around the world, are now focusing on ways that they can take data that they control and make it available to the public as open data. And in the US, we've seen a major step forward. Most recently last May, when President Obama announced the new Open Data Policy. This policy has been called the biggest change in how we deal with federal information since the Freedom of Information Act in the 1960s. It's potentially that big. There are a lot of questions about how we implement it, which I'll talk about. But it is a very ambitious and I think a very right-thinking kind of program to make government data open by default. Meaning that unless there's a security reason or privacy reason or something like that to keep it hidden, it's open and anybody ought to use it. Now one thing that's really significant is that when the president announced this policy, he used these words, which you could see are very business focused, and he actually chose a technology center in Austin, Texas to do it. So the administration is really positioning open data as a job creator and as a business driver. That's partly why we at the GovLab are studying it through that lens, because we want to see to what extent that is really a defensible proposition. I think it is, but I think it's still a work in progress.
So the whole question of why should government go to the trouble and the expense, and the time of making data open, part of the answer, people think, is that it's going to have an economic benefit as well as being a social good. This Open Data Policy, which was announced last May, talks about presumption of openness or open by default, making data machine readable, reusable, timely. This was based in many ways on the definitions that the Open Knowledge Foundation and the Sunlight Foundation developed several years ago. One really interesting difference is that those definitions said, and the data has got to be free, and the government definition doesn't quite say that. So there's still some room for agencies to charge for data, but I think the direction is very much towards free open data. In addition to this policy, there is now something called the DATA Act, which may be the only thing in the known universe that Ralph Nader and Grover Norquist actually agree on. It's an extremely bipartisan movement. It stands for Digital Accountability and Transparency Act. This is another part of open data. So one part of open data, like the Open Data Policy, is let's release data we have on weather, satellite data, GPS, health data, et cetera-- data that government collects that is useful to the public. The other part is data that government has about itself. The goal of the DATA Act is to make government spending data more thorough, more transparent, more usable than it's ever been by a lot. To be able to make it go all the way down, not just to the contractors to government, but subcontractors, sub-subcontractors, and to do it in a way that is really accurate. There is a website called usaspending.gov. It was intended to do this. The Sunlight Foundation recently calculated that it is inaccurate to the tune of $1.55 trillion a year. Otherwise, it's perfect.
So the DATA Act would automate this in a way that would really solve that kind of problem, and there is a lot of push now in Congress to pass the DATA Act, which I think would be another major step forward. So this is just at the federal level, but you're seeing similar kinds of activity in cities, in states, in the 60 countries that now belong to the Open Government Partnership. All of which are making similar kinds of commitments to open data for both civic and job-creating reasons. That's one trend. The next trend, which comes right out of that, is that we are actually seeing open data begin to drive business growth in a number of ways. And you can find examples all over the place- health, education, transportation. My book has a number of examples. Somebody tweeted recently, there's so many apps and businesses in here. I can't even count them. So I figured I would count them. There are, to the best of my knowledge, 183 of them. So, happy reading. You will find companies in all of these sectors and they're doing some very creative things with open data that are showing that you don't have to own the data in a proprietary way to make a thriving business out of it. I'll just show you a couple of examples. So the Climate Corporation, based here in San Francisco, has become in many ways sort of the poster child for the commercial use of open data. I like to say I sort of knew them when. They've gotten a fair amount of publicity over the years. I was fortunate to have a long interview with their CEO David Friedberg last April. It's in the book. And there's actually a longer podcast with him on my website. And if you're really interested in this stuff I would encourage you to check out the podcast, because it's a fascinating story. And the punch line is they were recently bought by Monsanto for a billion dollars. They've been profiled in "The New Yorker," so they've emerged as everyone's favorite example, and I think rightly so, of what this kind of data can do.
Their story is fascinating. They began by saying that they wanted to sell better weather insurance. And they quickly focused on farming and farmers as their target. They figured that if they could get all this data from the National Oceanic and Atmospheric Administration, from NASA weather data, et cetera, and they applied really extremely smart analytics. And the guy who started it just hired brilliant people. I think he actually used to work at Google. And I'm sure he hired some people from here. But what they figured they could do was do risk calculations that would enable them to calculate the risk that they bore as an insurer more accurately, so that they could both help farmers and also make a business out of it. Well, what happened as they got into this is that they found that there were open data sources that they could use that were much better and that would give them a much better result than the commonly used sources. So the first iteration of this is, let's use data from weather stations. Well, if you're a farmer, even if you look at every weather station in the US, it might be 30 miles away from your farm and it's not helpful to you. So long story short, they ended up getting data so that they could look at a piece of farmland roughly the size of this mid-sized auditorium or even smaller. They can calculate rainfall to one hundredth of an inch. They can look at soil quality in a way that they know exactly how the soil is going to respond to that amount of rain. And they're doing all of this with data that, almost all of it, with a couple of small exceptions, is public open data that anybody, any one of us, theoretically could access, but we wouldn't know what to do with it. And knowing what to do with it and knowing how to analyze it, and bringing together both data analysts and subject matter experts to create this new kind of tool is how they have created a billion dollars worth of value.
They also believe that they can now increase profitability for farmers worldwide by 20% to 30% and help farmers understand how to deal with climate change by changing the crops they grow and the seasons in which they grow them. So this is huge. This goes from we're insurance salesmen to we're leading the next Green Revolution. It's a direct application of free open data and it's a stunning demonstration of how even data that is free and public can be an incredibly important business driver. A lot of people think health care will be the next big frontier. This is a picture of Todd Park, who was the Chief Technology Officer for Health and Human Services and for the last couple years has been CTO for the United States. He runs this event in Washington every year called The Health Datapalooza. Datapalooza, as somebody pointed out, could be literally defined as an all out crazy party of data, and that is pretty much what these things are. They get about 2,000 people a year. And we are seeing a lot of activity in the health care sector. iTriage is an example that uses the public registry of health care providers. So that if you're traveling and you have some symptoms, it can immediately tell you if those symptoms are serious. And if they are, it'll tell you how to get to the nearest emergency room very quickly, even if you're in a strange city. In finance we're seeing a lot of companies like this one. This is CapitalCube, which is now owned by AnalytixInsight. There are about 40,000 publicly traded companies in the world for which there is enough information to say anything intelligent about them. These guys figured out algorithms to analyze all 40,000 of them, update their information every single day, put their results into a prose form that any investor can read, provide graphs that show the relative risk and the expected return for a given company compared to its competitors, et cetera. Again this is not actually necessarily using a new data source.
They're using SEC data that's been available for a while, but they're applying a level of analytics that probably was not possible before fairly recently. This is becoming-- and we're seeing a lot of businesses in the financial sector. There's stuff happening in energy. Opower is a company that's now working with utilities. It will give you back not only your own energy usage data, but an aggregate summary of your neighbors' energy usage data, which is apparently the most powerful motivator to clean up your own act-- the fact that you've got to do as well as your neighbors. They're using this together with a lot of open data about energy and energy usage and energy efficiency to help people save energy and ultimately, hopefully help fight climate change. So there's many, many examples, but those just give you a sense of how this goes. Now the interesting thing, in a segue from Opower, is that what they are ultimately about is helping consumers choose how they're going to use energy. So this gets back to what I told you was the problem that got me into this whole area in the first place- how do you choose a cellphone plan? Well, this whole area of smart disclosure is about open data. It's almost like a sort of subset of open data. It's about figuring out how to get data that's going to be useful to average people to improve their lives and put it out there in a usable form. Who here has read "Nudge" by Cass Sunstein and Richard Thaler? It's a great book if you're at all interested in behavioral economics. It's a perfect read and it's also interesting, because it inspired a ton of work in the Obama administration. So it's very much about how understanding collective behavior and psychology can help you make policy decisions. It was actually tested during the first Obama campaign.
One simple example is they found that if they planted-- not planted, that's too strong a word-- if they promoted news stories before every state primary election that there was going to be huge voter turnout, there would in fact be huge voter turnout, because nobody wants to be left out when there's going to be huge voter turnout. So it became a kind of self-fulfilling prophecy, because they knew that more voter turnout would be helpful to them. Actually that may have been in the election itself, not the primaries-- correcting myself. So anyway Cass Sunstein, who was the regulatory czar for the Obama administration, is a big thinker in this area. Richard Thaler, who's an economist at the University of Chicago, is as well. Their book "Nudge" was about how you can create behavioral cues and use information in ways that nudge people to make choices that are better for them. Well, so here's an example- so while Cass was regulatory czar, one of the things that they did is they reformed the label that you see on cars around energy efficiency. And you can see very clearly the most obvious change here. So they go from the small type saying, estimated fuel cost 2000 something a year, to you save $1,850 in fuel costs over five years. So it's a very simple example but a pretty compelling one of how the way you present information affects what people get from it and how they make decisions. OK, that was very much the basis of the Smart Disclosure Task Force. And what we set out to do was to say, how do we use these kinds of principles at a time when most people are getting information either on their smartphones or on the web and where we're really trying to figure out how to give people information that is personalized to them? So think about how Kayak works. I mean this is a pretty amazing tool that allows you to go online and choose the flight that you want to take tomorrow to wherever you want to go out of literally thousands of flights, and you can do it in about 10 minutes.
So the question we start to ask is what if there was a Kayak for everything? What would that look like? There is a lot of work now to try to figure out, how do you do this for financial services? How do you do this for health care insurance? How do you do it for mortgages, credit cards-- all these decisions that frankly drive most of us completely nuts every day, either that or you just sort of pick one and hope you're right. Going back to cell phones as the paradigm here, it's been calculated that Americans lose something like $13 billion a year collectively, because we're not using the most efficient cellphone plans. So this is real money and in many cases, like health insurance, it's also safety, and quality of care, and quality of service. So there have been a couple of successful experiments here. One of the ones I like a lot is a site called greatschools.org. This is a nonprofit. They use state data to analyze the quality of public schools and help people make those choices. This thing is now used by more than 40% of all K through 12 households in the US, which is just kind of fantastic and shows you how much hunger there is for this kind of information. Another success-- this one is from the UK-- I always like this because it's just sort of so bizarre. So this is a site called comparethemarket.com. One night, probably after a couple of vodkas, somebody must have been kidding around. They were trying on Russian accents and somebody said it's like, comparethemeerkat.com. Somebody then said, that is a brilliant idea. They decided that their symbol should be a meerkat. And there is now the spokes-meerkat in the UK called Aleksandr Orlov, who is the spokes-thing for comparethemarket.com. This thing became so popular that Harrods was going to-- yes, you can collect all six exclusive meerkat toys. This is like a car insurance shopping site. This is like as if the Geico gecko was sextuplets or something. I don't know what it's like.
But this thing became so popular that they were going to sell these one year, one Christmas, at Harrods, and the CEO apparently said, we can't do that, there's going to be a run on the store. We're just going to give them all to charity. It has also made them a very successful company. Now what this shows-- beyond the fact that people like fuzzy stuffed animals and that marketers have bizarre, but successful ideas-- what this shows is that it is also possible to build a successful business doing comparisons of car insurance, home insurance, life insurance, energy, credit cards, travel insurance, et cetera. Nobody has yet made this model really successful in the US, but it is a huge consumer need. And I think one of the things that we're going to see in the years ahead is that smart disclosure people are going to figure out how to really do smart disclosure the right way and it'll be both a consumer service and a successful business model. AUDIENCE: [INAUDIBLE]? JOEL GURIN: Why hasn't it made it in the US? I think there is a couple of reasons. I think one is that people haven't quite found the right business model yet that will do it in an honest way and yet also be successful. A lot of this works off of lead generation. Lead generation gives you the incentive to game the system, which is unfortunate. So that's been a bit of a problem. I think also-- I don't actually really have a good explanation. I think for some reason this started culturally in the UK with smaller companies about 10 years ago and it doesn't seem to have caught on here in the same way. And there are a lot of inherent challenges in trying to do comparisons for 10 different things at once. Like the fact that people generally shop for any one of those only once every couple of years. But one way or another, I think it's still a model that ought to be applicable here, because this is actually one and only one of several sites in the UK that have been operating successfully.
Anyway somebody should figure this out. I think it's an interesting challenge. Next trend- we're seeing a lot of use of open data in an investment context, which I think can be good for society as well. This is a British company-- there's a lot of work going on in London-- that is making open data available about small to medium size enterprises. Private companies that have had trouble attracting investment because the investors don't want to go to all the trouble of analyzing whether or not they're a good risk. These guys are providing enough information that they believe they can get about $250 billion more dollars invested in these companies by simply providing the information that lets investors invest with confidence. So that's a good thing for business. But a lot of the potential I think is in what used to be called corporate responsibility-- what's now being called environmental social governance measures, because we're seeing more and more investors who consider good sustainable practices to be a sign of good corporate governance. So for example, the Carbon Disclosure Project collects data on carbon footprint from most of the major companies from the Fortune 500 and other companies. They represent institutional investors who collectively have about $87 trillion to invest. So we're seeing some real interest from that community. We're seeing the same kind of thing being applied to the consumer field, particularly by a company in San Francisco called GoodGuide, which provides a lot of information to consumers about the environmental impact of the products and services they buy. Much of this is based on EPA and other open data. Companies are now becoming more and more interested in this because they want to see if they have a good profile that consumers will like.
And then finally, the Securities and Exchange Commission has begun to demand that companies that report to them include information on things like whether or not they use conflict minerals, which are minerals that are mined under pretty horrible conditions in the Democratic Republic of the Congo. That kind of thing, which happened under Dodd-Frank, could be the beginning of the SEC demanding more and more environmental social governance measures. If that were to happen, we could see some real changes in corporate practices. So I think this is a case where open data, because it's of interest not only to citizens, but also to the investor community, can have a lot of leverage in improving corporate behavior. We're seeing open data shape reputation and brand in some powerful ways. Part of this is public complaints and what happens when you make complaints about a company public. So these two people founded a company called PublikDemand, which takes complaints from consumers, amplifies them through social media to an extent that a company like AT&T or United Airlines has to immediately pay attention. And in many cases they've gotten very rapid solutions to problems that otherwise would have gone back and forth with customer service for months. Well, this is a strategy that regulatory agencies are also following. The Consumer Financial Protection Bureau in particular has made its complaint database public. And banks are now paying much more attention to customer complaints and customer satisfaction than they ever would have because of this open data. Both "Forbes" and "American Banker" have written about how this is really changing the banking industry, because they have to listen collectively to consumers whereas they could ignore people one at a time. The next stage of this, I think, is analyzing social media. Since we are now at this stage of 2 billion tweets a week which is-- I don't know about you-- I find that somewhat terrifying.
But not only through the kinds of reviews and comments people do on Google, but these other sites as well. We're seeing a whole huge amount of social media commentary and you would think that if you could actually figure out how to analyze this and do something with it, you would have a very powerful form of open data that has huge business relevance. Well, one company that is working on this is reputation.com, which is in the business of helping people improve their online reputations mostly by promoting more positive and genuinely positive feelings about what they have to say. But there is a whole other level of this-- of sentiment analysis-- which many of you may be familiar with. So I always like to ask how many people know who the woman on the left is? How many people know who the guy on the right is? OK, at least a couple. Generally in every tech audience more people recognize Alan Turing than recognize Jane Austen, but that's who they are. And if Jane Austen and Alan Turing had a love child, it would be sentiment analysis. Because sentiment analysis essentially is this technique of doing text analysis to figure out what people feel about brands, celebrities, TV shows, specific products, specific services, et cetera. There is an annual conference now held in New York-- well, I think it's usually New York-- every March where people get together to talk about this stuff. It's a chapter in my book. I've also done a podcast with a guy named Seth Grimes, who's a guru in this area. That's on my website. It's absolutely fascinating. It's not yet a mature technology, but ultimately you can see where this is going. This is going towards treating all of social media as an analyzable, quantifiable form of open data that can have a lot of implications in a lot of areas. Personal data is a specific kind of open data in that this is about making data about my medical records available to me, or like Opower, my energy usage available to me.
It doesn't really fit the classic definition of open data. It's not available to everybody for free, but it's a very important part of the ecosystem. Partly because opening data to me is a different kind of thing than me not being able to access my own data. And also because in many applications of big open data, having the ability to match it up with personal data is an important part of the puzzle. This is actually the diagram from a report by the World Economic Forum. They've now done a couple of reports on unlocking the value of personal data. The basic idea that people are talking about is, what if you could establish a data vault? So I see this as probably a concept that many of you have thought about a lot. It's been kicking around for a while. It may or may not be getting to a point of applicability or maturity. There are companies like reputation.com, personal.com in DC, and others that are looking at this. But the basic idea is, if you had access to your personal data, if you could hold it securely, and if you could then release it selectively to other people or to marketers, what would happen? Well, one model, which is being called vendor relationship management by Doc Searls, who talks about it in his book "The Intention Economy," is that instead of marketers targeting you, you target them. It is worth about $2,000 for a Mercedes-Benz dealer to get a qualified buyer on the lot, based on the probability that they're going to buy a car. So it might be worth a couple of dollars for that dealer to find you, if you wanted to release demographic or whatever kind of information that made you look like a good customer, and actually pay you to make a visit. That's a kind of simple form, but some of the people working in this area think there's a lot of economic potential there. I think it's still hypothetical, but it at least points towards a greater degree of consumer control over how we are all marketed to.
On the other end of the spectrum, there is potentially tremendous public value in sharing personal data. This is the app PulsePoint. Essentially, if you are a person who knows CPR, you tell them that. If there's somebody who is having cardiac arrest, they then immediately send a message to everybody nearby who knows CPR. Those people can get to the victim faster than an ambulance can. They can potentially save a life. So this is a use of personal data that I'm not sure anybody would have thought of a couple of years ago, but it's the kind of thing where, when you start thinking of personal data as a form of open data on a voluntary basis, some really interesting things can happen. They talk about themselves as enabling citizen superheroes, and I think that's actually pretty accurate. Open data and research-- this is another area where I think we're going to see potentially huge benefits. We're seeing more and more interest and more and more pressure for particularly biomedical, but potentially other kinds of, scientific research to be more open. Now a couple of things are happening here. One is the open access movement, which, of course, Aaron Swartz was very involved in promoting, very tragically in the end. But that's very much about this: once data is in a published journal, we shouldn't all have to pay thousands of dollars to get at that data or to get at that report. And the federal government announced just a couple of weeks ago that about half of all federally funded research will now have to be made publicly available for free online within a year of its publication in a journal. That's sort of after the fact. What gets even more interesting is data sharing while the work is in progress. So a lot of this is coming from patients and from funders. Kathy Giusti was a corporate CEO in her 30s when she discovered she had multiple myeloma. She quickly discovered that there was very little research being done. She started a foundation to fund that research.
And a condition was, if you take their money, you have to make your data openly available as you make new discoveries. This is in many ways the model that the Human Genome Project worked on very successfully. It's now being followed in Alzheimer's research and Parkinson's research and in other ways as well. It is potentially a transformational change in how we do science, if the business models are worked out and if there's enough cooperation from scientists and from drug companies and others to really make this the norm. We're also seeing a lot of very successful experiments in crowd-sourcing science. One of the most famous was done at the University of Washington a couple of years ago. They had been working for a decade trying to solve the structure of a protein related to the AIDS virus. They decided to put it on the site Foldit and asked gamers to solve it. Gamers solved it within a couple of weeks. They published in "Nature" and they thanked the gamers publicly. This was, I think, eye-opening for a lot of people. Another example: do any of you know Galaxy Zoo or Zooniverse? This is one of the great citizen science projects, and it's really a model for many of them. This thing got started in Oxford because some poor PhD student had to look at images of the structure of spiral galaxies, which apparently computers cannot assess very well. And he had hundreds of thousands of these to look at. He looked at 50,000 in a week and he said, there's got to be a better way. They decided that the better way was posting these images online and inviting just ordinary people to look at them. They can do it with a high degree of accuracy. They've now taken on other scientific projects, like cancer cells, as you see here. They have tapped 800,000 volunteers to help them do skilled human work in the interest of science. And then finally, SkyTruth is applying the same kind of thing to the environment.
This is a nonprofit in the Washington area that is now using crowd-sourcing to look at things like maps of areas of Pennsylvania where fracking is going on, and to look for signs that fracking is damaging the environment. So this becomes environmental protection through open data from the satellites and crowd-sourcing applied to that open data. Data-driven cities is a huge movement right now. Right at NYU we have the Center for Urban Science and Progress, which is doing a lot of work in this area. The idea is to put sensors all over cities, to instrument cities, to see what can be learned, to improve operations, public health, emergency management, all kinds of things. You're also seeing a lot of use of data for accountability in cities like Chicago. Palo Alto has been a leader here. And there are a couple of interesting things to come out of this. These range from applications like NextBus, which is now all over the country, where city traffic data can be used to help you figure out when your next bus is coming so you're not waiting endlessly in the rain, to things like this experiment in Washington, where they have actually solicited public input about how the different government agencies are doing. So they've actually been able to grade government agencies on the basis of both survey data that they collect and sentiment analysis of what people are saying on social media. When they first did this, four out of five agencies got a C minus; one got a C plus. They were not very happy with the mayor for doing this, but they have gone public with it. It's a really interesting feedback loop, and over time the grades have gone up. So now the last trend-- and this is one where we're really intensely focused at NYU-- is trying to figure out, when you look at all of this together, what is open data worth? And this is an important question, because it is not a slam dunk or particularly easy to take data that has traditionally been siloed and open it to the public.
So there have been a number of studies on this. The most recent was a McKinsey study last October that says that open data is worth $3 trillion a year worldwide. That's by far the highest estimate anybody has come up with, as you can see from some of the other ones on the screen. But generally the estimates run pretty high. So it's a very interesting challenge. We all think there's potential there, but what we have done at the GovLab at NYU is we've set out to do this thing called the Open Data 500, which is about figuring out exactly where the value is. We're beginning in the business sector, but ultimately want to look at the nonprofit sector as well. We're looking at US-based companies. We have actually contacted more than 500 of them. We're in the process of finalizing the list. If this interests you, I would urge you to please go to opendata500.com, because we are really seeking public comment on everything from individual companies to our methodology to the whole goals of the study. Or you can tweet to the hashtag #OD500 if you have suggestions for us. We have this now on a website as a work in progress, where you can filter by state or by category and see where some of these open data companies are. Not surprisingly, the greatest numbers are in California, but I'm glad to say, as a New Yorker, that New York is not far behind. And we're beginning to see some interesting patterns here that I think are really going to be meaningful. So one of those patterns, which I'll show you in a second, helps answer some of the open questions about open data. So having shown you all these trends, and shown you all the stuff that's happening that I talk about in my book and the website and my other work, there's still a lot we don't know. And I would say there are three major questions that we're now all looking to answer. So the first one is, OK, if we think open data has value, which sectors are the most promising?
Well, from the Open Data 500-- even though this is preliminary, and I can't stress that enough, because this is not a final list, et cetera-- we're starting to see some hints of that. So the first tier, the sectors that have the most companies in them, are what we're calling data/technology and finance and investment. Finance and investment probably because there's so much interest, and because SEC data and other kinds of business data have been out there for a long time and are a very rich source. Data/technology because there is a whole huge emerging sector in helping figure out how to take really unwieldy government data sets and turn them into usable open data. So this is companies like Socrata, Junar, OpenGov here in Palo Alto, or nearby, and many others whose business is making open data business-friendly. And one of the interesting questions is whether this is something that's going to be around forever, which I think it probably will, as we were talking about a little bit before, or how much this sector may change as governments get better at releasing open data. Next we have health care-- which I think is emerging-- transportation, and energy, and then the third tier, where only about a couple of percent of the companies we have are in each of these areas. This includes a number of things, many of which are really quite significant, like education, scientific research, environment, and food and agriculture. The Climate Corporation, for example, is somewhere in this tier. So we just have a couple of initial observations and caveats about this. One is that sectors that don't have a lot of companies in them, like weather and agriculture, may have a Climate Corporation in there, a very significant company. So simply the number of companies per sector doesn't necessarily tell you the importance of the sector, but it does tell you at least where a lot of the entrepreneurial activity is.
We still have to do more work. We are getting information on the number of employees per company, which is going to be an important metric. We're trying to get information on financial metrics. And, as I said, the data/technology category: I was very interested to see that so high up, and I think it says something about the sorry state of government data. Which leads directly to question number two-- how do we improve the open data ecosystem? Having worked in the federal government for a while and talked to people at a lot of agencies, I can tell you a lot of those government data sets are a mess, and the people who run them know it. And they are trying to fix it, but it's not going to happen overnight. So there are a couple of things that are happening. On a city level, OpenGov, which is a company right near here, has developed what they like to call kind of a SimCity for actual cities. So they have this thing you can see in the upper right, where they can take budget and other city data from any city and put it on a platform that makes it usable, and that also makes it comparable to other cities' data. And they can then make town meetings much more productive. They can tell you why Palo Alto has a certain rate of police overtime and how that compares to San Mateo, and they can learn things about city governance from that. So this is one of those data/technology companies that's beginning to make the data more useful. Another one, which is a couple of blocks from us at NYU, is called Enigma. They won TechCrunch Disrupt in New York last May. And what was significant about that was not just that they got this nice large $50,000 check, but that I think there was a recognition of how important data companies are. Their whole thing is taking really unwieldy federal government data sets and making them usable on a common platform and in interoperable ways. And they're getting a ton of attention right now, because this is something that everybody who has ever worked with federal data has wanted.
And the reason for that is that right now federal data looks something like this. For those of you who have seen "Raiders of the Lost Ark." The pathetic thing about this is not only that the federal data is this bad, but that this is the metaphor that everybody in the federal government who works with data uses to describe the state of federal data. It is that bad and we know it. There's good stuff in there somewhere, but good luck finding it. A lot of the work going on in government and in third parties like Enigma and OpenGov is to make this stuff more useful. And the open question is, how can we really make this work? And this is an area where I certainly think Google has done a lot and has a huge role to play. I think also part of this is going to be what I'm calling demand-driven data disclosure. The way it's worked in the past, government agencies have largely released open data when they've identified data sets they think are of interest, or where they're just doing it to comply with a government mandate. One of the things we want to get out of the Open Data 500 is to create a kind of round table, where data users can give much more ongoing feedback to data holders in the government agencies. We think this is going to really improve the quality of data, the availability of data, and the ecosystem as a whole. And then finally, how can developing countries use open data? This is a huge question that the World Bank, among others, is putting a lot of effort into. I'm going to be on a panel for them next Wednesday afternoon in DC. And there are a couple of areas here. One is fighting corruption. This is the website that has, I think, the best name of any website I've seen. It is called ipaidabribe.com. This is a website in India where you can go and, through crowd-sourcing, report if you had to pay a bribe, in a way that makes corruption transparent and ultimately decreases corruption.
But we're also looking to go beyond transparency to economic development, and a lot of people are asking whether developing countries can use open data as a business resource in the same way that we're seeing in the US, the UK, France, and places like that. So one of my colleagues at the World Bank, Prasanna Lal Das, recently did a very good blog post summarizing some of the things that are needed. As you can see, it's going to take some work, but the rewards may be great. And I'm seeing a lot of interest right now in figuring out how to make that work, and whether we can make it work. So that's the open data universe, at least as I've come to see it through the work I've done here. I would recommend a couple of sources to you for more information. One is thegovlab.org, where you can see our wiki and subscribe to our digest. You can also sign up outside to get a digest subscription. It comes out every week. It's a curated collection of material in this area. There's opendatanow.com, where I report on this stuff on a regular basis, largely with interviews with people in the field and podcasts when I can do them, hopefully as a resource to the community. And there is of course this book, now available, which I see many of you have already purchased-- thank you-- and which I hope you find useful in one way or another. So we have a couple of minutes for questions, and thank you very much. Thank you. AUDIENCE: Thank you very much. I have two questions. One is, if you develop your business using open data, what are the caveats? What is the license on all this data? And the second question: suppose I want some data. Where would I find it? For example, I would like to see some data on education broken down by gender. Where would I go? Where do I even start? JOEL GURIN: OK, so two questions. One is in terms of the license to use data. It is only open data if it's released under an open license that makes it usable by anybody, and reusable, and re-publishable.
So by definition, that's pretty much built in to the Open Data Policy, at least of the US government, and more and more governments recognize that. In terms of where you go, if you're looking for federal data, you should go to data.gov. This is the central repository of federal data. It was originally built in a way that I think a lot of people found not as user-friendly as they wanted. They've just relaunched it. It's much better. It's getting better all the time. But that's where you would find data like that, segmented by agency and by area of interest. As a start, in any case. AUDIENCE: Thank you. Obviously, you logically focused on the US. Are there any other nations that you'd point to as best in class, who are really leading the field in terms of leveraging the infrastructures developed in their governments or in their societies? JOEL GURIN: Yes, the UK is really the other world leader, and in many ways they're doing things in a more advanced way than the US. For example, their equivalent of data.gov is all done with linked data and is really very beautifully and very well designed. So they are in some ways ahead of us, in some ways learning from us. They also have an institute there called the Open Data Institute, funded partly by the government and by other sources as well, that's doing a ton of work and really showing global leadership in this area. Beyond that, we're seeing a lot of interest all over the world, on every continent. As I mentioned, there are now 60 countries or so in the Open Government Partnership, which is committed to these open government principles, part of which is open government data. And it's rapidly emerging as an international movement. I think different countries, depending on their stage of development, will figure out what is the most important and most appropriate form of data for them to release. AUDIENCE: So all the applications of open data that you mentioned are all vertical.
They're trying to solve a particular problem. Do you see a need or an opportunity for more horizontal plays or products that could be usable by many applications that use open data? JOEL GURIN: Well, I think probably the best examples of those are these data/technology companies like Enigma or OpenGov, because what they're essentially trying to do is to make data of all kinds more usable. And I think what they're hoping to do is to make possible the kinds of mash-ups or interoperability of data that can make a lot of those more complex applications possible. Right now, at least if you're working with US federal data, it's very difficult. We did a project at the GovLab simply to try to mash up EPA and OSHA data about factories and facilities that both agencies regulate. You would think this was dead easy. It's not. I mean, even on that basic level, it takes work to make these data sets work and play nicely together. So companies that are making that happen, I think, are definitely taking that kind of broad horizontal view, where they're going to be helpful to a lot of other companies. Yes? AUDIENCE: Yes. Do you have any comments about Aaron Swartz, who tried to liberate some common government data but got sued by the government? JOEL GURIN: Yeah, I write about Aaron and that in my book. I think everybody pretty much recognizes now that MIT was not a good path there, to say the least. And it gets into some complexities, but I think the short answer is that what Aaron was really fighting for was open access, and access to material that has already been published, in a way that the public can use it. That's now just become federal government policy for about half, as I said, of the research that the federal government funds. So I think there's a greater and greater recognition that he was right about that and that we should start getting on board.
AUDIENCE: How do you think about accuracy, or even just not necessarily accuracy, but knowing what's in the data set, like keeping track of the metadata, like what's actually being counted? Who was excluded, who wasn't, how data was collected, and that kind of information that can change what the data means? JOEL GURIN: Yeah, that's a great question. I would say right now that's very hard. Part of the Open Data Policy is to actually publicly release information about the quality of data. I think this is going to be one of the parts of the policy that federal agencies absolutely hate the most. But there are some really interesting examples of agencies facing up to this problem and dealing with it. So one great example is USAID-- international development-- which knew that they had lousy geospatial data on the organizations they were giving grants to. They put on a hack-a-thon, but a very careful one. They found people who are sort of geospatial hackers in the Washington area. They invited about 100 people in. They said, we're going to give you special access to our data. We want you to fix it. We'll give you all weekend. They were done in about 15 or 16 hours. So this idea of kind of crowd-sourcing quality control is one that a couple of government agencies have become interested in. But simply knowing is very hard. And that's one of the reasons that I think establishing feedback loops-- really good feedback loops between data users and the agencies that hold the data-- is going to be a critical next step. So that we can ask those questions of government agencies. We can see what the response is. And where there is a really serious flaw in a really important data set, they can prioritize that as something that stakeholders need fixed. AUDIENCE: And so you showed a lot of great examples. I was wondering if you think that we can leverage mobile in a specific way, as opposed to the desktop sites. JOEL GURIN: Yes.
I tend to show desktops because they look better on PowerPoint, but absolutely, most of these things that I showed either are mobile apps or could be mobile apps as well. I think the one caveat on mobile apps is that we are a little bit at risk of app mania with open data. There have been all these hack-a-thons of apps for this or apps for that, which is great, but I think there are probably some limitations in what is easy to do in that mobile environment. And there are some more sophisticated things that can be done if you, I believe, look more broadly. But definitely, pretty much anything that I showed you has a mobile application attached to it. FUMI YAMAZAKI: OK. Thank you very much. I think we're running out of time, but Joel will be staying here for us. Thank you very much. JOEL GURIN: Thanks so much for coming. And thank you for the work you're all doing.
Joel Gurin: "Open Data Now" | Talks at Google