The use of open data is becoming more popular. Data management professionals should develop familiarity with the concept, use, and challenges of open data.
The concept of open data is not new, but its use is becoming more prevalent as the availability of sources, and the validity of those sources, grows. Problems often arise because these sources are commercially valuable or can be aggregated into works of value. It is important to explore the essential concepts of open data and examine the benefits and limitations of its use, especially in the context of data management.
What is Open Data?
According to the Open Data Handbook website, open data is data that can be freely used, re-used, and redistributed by anyone.
In today’s data-driven environment, it is no surprise that more open data sets exist. This is especially true in government, where agencies collect many kinds of data paid for with taxpayer dollars; to ensure transparency, that data should be accessible to all citizens.
Recent generations have seen a socially instigated push toward greater government transparency at the federal, state, county, and municipal levels. Because of this push, more government entities are opening previously inaccessible data sets to the public. As a result, anyone can access the data, analyze it, and use it to make better-informed decisions.
To be considered “open,” however, data must be both available and accessible. All data management professionals should understand the difference between accessible data and available data, and how to use open data effectively.
Data Availability and Accessibility
For a data set to truly be considered “open,” the data must be both available and accessible. While the terms data availability and data accessibility might be considered to be synonyms in general parlance, each term has its own separate and distinct definition when applied to the concept of open data.
While there is a more technical definition of data availability, a plain-language definition of the term will suffice and is much easier to understand. Data availability refers to whether a data set is located on a website or server on which it can be used or obtained. In short, the data sets are easy to find or locate. For example, can any person obtain the data to be used, regardless of who they are? Can a soccer mom obtain the data as easily as a data scientist? These are some simple questions to ask to determine if a data set is truly available to as many people as possible.
Data accessibility refers to the ability to retrieve or download data sets once they are found or located. If, for example, a data set is located on a website that is easy to find, it would be available. But if that same data set could only be viewed or downloaded with a password that only a few people know, then the data set would not be accessible to the general public.
Some open data sites might ask for a potential end user’s name, email address, and other basic contact information before granting access to data sets. In addition, some open data providers, especially those from academia, might request that end users agree to a set of terms and conditions before downloading their data. While these requests might be viewed as a data accessibility issue, they are not. These requirements help open data providers know who is using their data, and they give end users an incentive to credit the providing entity if the data is used in a publication, at a conference, etc.
Another component of data accessibility is whether data retrieval methods comply with the Americans with Disabilities Act (ADA). ADA compliance refers more to the medium used to store the data, which is usually a website, than to the data itself. How, for example, would someone who is blind be able to access a website on which a data set is stored? ADA compliance is especially relevant for government-sponsored websites that must follow accessibility legislation under penalty of law. For more information on accessibility requirements for government websites, review Section 508 of the Rehabilitation Act of 1973.
The Benefits and Limitations of Open Data
One of the major benefits of using open data is that it is free. In previous decades, obtaining access to data was often extremely time-consuming, cumbersome, and expensive. Many researchers and statisticians are reluctant to share their data with others. That is less often the case with government data, since most of it is considered to be for the public’s benefit. And, since government data is a public good, it should be available and accessible to citizens.
Ideally, open data generated by public entities also leads to more accountability from public officials. If, for example, the public knows exactly what a state governor’s salary is, a reporter or citizen might wonder how that same governor could afford an expensive car or watch. Open data also empowers each person to become more involved in government legislation and policies, giving people the ability to become more engaged in their communities.
However, end users should also be aware of some of the drawbacks of using open data. First, open data sets might not have the content that someone is specifically looking for; what you see is what you get, so to speak. Furthermore, the content might be “there,” but not in a consistent format. For example, a date formatted as “10/11/2018” could be interpreted as October 11, 2018 or the 10th of November 2018. Without knowing how the date field is formatted, a user might not understand what is included in the data set. Thus, it is important for organizations that publish open data to provide information on how the data is formatted. Read Data Management University’s article on practical data management tips for more information on the importance of data formatting. In addition, learn more about the importance of data file code books that should accompany raw data sets, and how they provide information about data and how it is formatted.
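The date ambiguity described above can be illustrated with a minimal Python sketch. It uses only the standard library's `datetime.strptime`; the two format strings are assumptions about how a publisher might have encoded the field, not something the value itself reveals.

```python
from datetime import datetime

raw = "10/11/2018"  # the ambiguous value from the example above

# Assumption: the publisher used month/day/year.
as_month_first = datetime.strptime(raw, "%m/%d/%Y").date()

# Assumption: the publisher used day/month/year.
as_day_first = datetime.strptime(raw, "%d/%m/%Y").date()

print(as_month_first.isoformat())  # 2018-10-11
print(as_day_first.isoformat())    # 2018-11-10
```

Both calls succeed without error, so the value alone cannot tell the two readings apart. Only documentation from the publisher, such as a code book, settles which one is correct.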
Another major concern with open data is whether the data includes personally identifiable information (PII) or protected health information (PHI). Any organization that publishes open data should scrub the data sets thoroughly so that they do not include these information types before they are published for public use.
This is especially important for government entities and healthcare organizations, since these organizations would violate the Health Insurance Portability and Accountability Act (HIPAA), for example, if their open data sets contained PHI. An organization that provides open data should have data governance policies and guiding principles that determine the data types that are either for only private use by the organization or are also for public use by both the organization and the public. To learn more about data governance policies, read this Data Management University article on the four foundational data governance policies.
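As a rough illustration of the scrubbing step, the following Python sketch removes restricted columns from a CSV file before publication. The column names, the sample row, and the `drop_restricted_columns` helper are all hypothetical; a real governance policy would define which fields are restricted. Dropping columns is only a first step and does not by itself guarantee that a data set is free of PII or PHI.

```python
import csv
import io

# Hypothetical column names that a governance policy might flag as PII or PHI.
RESTRICTED_COLUMNS = {"name", "email", "diagnosis"}

def drop_restricted_columns(csv_text, restricted=RESTRICTED_COLUMNS):
    """Return CSV text with the restricted columns removed."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Keep only the columns that are not on the restricted list.
    kept = [c for c in (reader.fieldnames or []) if c.lower() not in restricted]
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=kept, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        writer.writerow(row)  # restricted keys are ignored on write
    return out.getvalue()

# Made-up sample data for illustration only.
sample = "name,email,county,visits\nJane Doe,jane@example.com,Adams,3\n"
public_csv = drop_restricted_columns(sample)
print(public_csv)
```

The remaining columns can still identify individuals when combined, so a check like this complements, rather than replaces, a review under the organization's data governance policies.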
The previous point leads to the final, and most important, caution about using open data: know who is providing it. Only use open data from trusted and reputable sources. The following is a short list of open data sites that are trusted and reputable:
- Gapminder, http://www.gapminder.org/data, a Swedish foundation that provides data on social, economic, and environmental development at local, national, and global levels
- Data.gov USA, http://www.data.gov, the home of the United States Government’s open data
- Data.gov UK, https://data.gov.uk, the home of the United Kingdom’s open data
- GitHub public data sets, https://github.com/caesar0301/awesome-public-datasets
- Open Data Network, https://www.opendatanetwork.com
- Data.world, https://data.world, a social networking site for data people
- US Census Bureau, https://www.census.gov/data.html
- Pew Research Center, http://www.pewinternet.org/datasets, a nonpartisan fact tank that conducts public opinion polling
Using open data can enlighten, empower, and inform people in ways that were not possible in previous decades and generations. Even though the data is open and “free,” let the “buyer,” or end user, beware of some of the limitations and challenges of using open data. To learn more about open data, visit Open Knowledge International’s website.