Andreas Weigend | Social Data Revolution | Fall 2014
School of Information | University of California at Berkeley | INFO 290A-03
Audio:
https://www.dropbox.com/s/9bbgi2pmp6gnlel/weigend_ischool2014_8.mp3?dl=0
Video:
http://youtu.be/N_C00zQpcqw
Transcript:
https://www.dropbox.com/s/sgvikvugfxecp6y/weigend_ischool2014_8.docx?dl=0

8_Data Ownership and the Future of Data


Timeline Nov 18, 2014

3:30 Setting up
3:40 CLASS BEGINS
Summary of previous class, relationship to today’s class.

6:00 SUMMARY OF THE COURSE
6:30 END

7:00 DINNER with the students responsible for this wiki page.
  1. Introduction (Linda)
    1. Data ownership (Linda)
    2. Data stewardship (William)
    3. Data governance (William)
  2. Data Exhaust (Michael)
    1. Definition
    2. Significance
    3. Archive.org
  3. How Companies Use Personal Data (Sophie)
    1. Insurance companies
    2. Online tracking (Amazon/Facebook/etc.) → Potential tracking in real life
    3. “Leaked” data not in consumer’s best interest (Uber)
  4. Mitigating Harm From Data (Holly)
    1. Rubenstein’s two types of data ownership
    2. Dangers for consumer
    3. Problems and questions to consider
  5. Customization & Privacy Settings According to Preference (Noelle)
    1. Potential for being more rewarding through personalization
    2. Methods
    3. Transparency in what consequences certain settings have
    4. Discussion Questions





Data Ownership and the Future of Data


Introduction

Humans have used devices to aid computation and storage of data for thousands of years (Wikipedia, Information Technology). The Sumerians, who lived in the historical region of Sumer in southern Mesopotamia (Wikipedia, Sumer), for example, used clay tablets to document land ownership and access to resources such as water. These records were extremely valuable, because whoever owned them controlled the distribution of resources. Thus, as early as 3000 BC, data ownership was directly associated with power. Today, it is far easier to collect, store, and process data than it was for the ancient Sumerians, yet the question of data ownership is more important than ever. The aim of this document is to provide an overview of the concept of data ownership and to discuss the importance of data and privacy in today’s society.


What is data ownership?

According to techopedia.com, data ownership is defined as “the act of having legal rights and complete control over a single piece or set of data elements” (Techopedia, Data ownership). Being in control of data means having the ability to access, create, modify, share, sell, or remove data, as well as having the right to give these privileges to trusted others (Loshin, 2001). Assigning data ownership is not always easy and, as described in Loshin (2001), there are multiple paradigms for defining the owner of data.
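The idea of an owner who holds every privilege and can pass individual privileges to trusted others can be sketched as a small data structure. This is only an illustration under stated assumptions: the class `DataRecord`, its methods, and the patient/doctor/insurer names are hypothetical and do not come from Loshin (2001) or Techopedia.

```python
from dataclasses import dataclass, field

# The six abilities named in the text; the identifiers are illustrative only.
PRIVILEGES = {"access", "create", "modify", "share", "sell", "remove"}


@dataclass
class DataRecord:
    """A piece of data with one owner and optional delegated privileges."""
    owner: str
    content: str
    # Maps a trusted party to the set of privileges the owner granted them.
    grants: dict = field(default_factory=dict)

    def grant(self, actor: str, grantee: str, privilege: str) -> None:
        """Only the owner may hand a privilege to someone else."""
        if actor != self.owner:
            raise PermissionError(f"{actor} is not the owner")
        if privilege not in PRIVILEGES:
            raise ValueError(f"unknown privilege: {privilege}")
        self.grants.setdefault(grantee, set()).add(privilege)

    def can(self, actor: str, privilege: str) -> bool:
        """The owner holds every privilege; others hold only what was granted."""
        if actor == self.owner:
            return privilege in PRIVILEGES
        return privilege in self.grants.get(actor, set())


record = DataRecord(owner="patient", content="blood test results")
record.grant("patient", "doctor", "access")

print(record.can("doctor", "access"))   # True
print(record.can("doctor", "sell"))     # False
print(record.can("insurer", "access"))  # False
```

In this sketch a grantee cannot pass a privilege on to a third party. Whether delegated rights should themselves be delegable is one of the design questions the ownership paradigms answer differently.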

Ownership paradigms

From a business perspective, the owner can be a legal person, such as an individual (e.g. a general practitioner) or an organization (e.g. an enterprise), that collects personal data from people. In such a scenario, one often speaks of data controllers. For example, a general practitioner is the controller of their patients’ data; an enterprise is the controller of client and employee data. Such data ownership comes with responsibilities. The controller, for example, is responsible for maintaining and delivering the data to its users, for ensuring that only authorized persons have access to the data, and for taking the necessary steps to manage data risks. Alternatively, it is also possible to see the creator of the data as the owner. For example, a geographic data consortium might collect geographical data and store it in a database. Similarly, one could argue that users of social networks are creators of data. Since the data shared on social networks is very personal, it is likely that users will claim ownership of this data. Another interesting paradigm is the model of global data ownership, according to which data should be available to all without restrictions. This model is often used in scientific communities, where the main goal is to share and increase common knowledge.



What is data stewardship?
A data steward is defined as “a person responsible for the management of data elements (also known as critical data elements) - both the content and metadata. Data stewards have a specialist role that incorporates processes, policies, guidelines and responsibilities for administering organizations' entire data in compliance with policy and/or regulatory obligations.” Systematic data stewardship can foster 1) consistent use of data management resources, 2) easy mapping of data between computer systems and exchange documents, and 3) lower costs associated with migration to other information architectures. Studies have shown that a data steward encourages users to make use of data, since they can ask the data steward about a specific data element.



What is data governance?

According to Wikipedia, data governance is a control that ensures that data entry by an operations team member or by an automated process meets precise standards, such as a business rule, a data definition, and data integrity constraints in the data model. The data governor uses data quality monitoring against production data in the golden source to communicate errors in data entry back to the operations team members or to technology for corrective action. Through data governance, organizations are looking to exercise positive control over the processes and methods used by their data stewards and data custodians to handle data.
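As a rough illustration of the three kinds of standards named above, the following sketch checks incoming records against a data definition, a business rule, and an integrity constraint, and collects the errors that would be reported back for correction. The field names, the rules, and the functions `check_record` and `monitor` are assumptions made up for this example; they are not taken from Wikipedia or from any particular governance tool.

```python
# Data definition (hypothetical): required fields and their types.
REQUIRED_FIELDS = {"patient_id": str, "visit_date": str, "charge": float}


def check_record(record, known_patients):
    """Return a list of error messages for one data-entry record."""
    errors = []
    # 1) Data definition: every required field is present with the right type.
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in record:
            errors.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            errors.append(f"{name} should be {expected_type.__name__}")
    if errors:
        return errors  # the remaining checks need well-formed fields
    # 2) Business rule (assumed): a charge can never be negative.
    if record["charge"] < 0:
        errors.append("charge must not be negative")
    # 3) Integrity constraint: the record must refer to a known patient.
    if record["patient_id"] not in known_patients:
        errors.append(f"unknown patient_id: {record['patient_id']}")
    return errors


def monitor(records, known_patients):
    """Map the index of each failing record to its errors."""
    report = {}
    for index, record in enumerate(records):
        errors = check_record(record, known_patients)
        if errors:
            report[index] = errors
    return report


known_patients = {"p-001", "p-002"}
entries = [
    {"patient_id": "p-001", "visit_date": "2014-11-18", "charge": 120.0},
    {"patient_id": "p-009", "visit_date": "2014-11-18", "charge": -5.0},
    {"patient_id": "p-002", "charge": 80.0},
]

report = monitor(entries, known_patients)
for index, errors in report.items():
    print(index, errors)
```

The report plays the role of the feedback loop in the definition: it names the failing entries so that the operations team can correct them.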


As we discussed in class, data governance has become one of the most controversial topics with the rise of social media and other data platforms, such as electronic health records. The traditional governors of the data, in this case Facebook users and patients, no longer seem to have strict and complete governance of their data. However, as the topic is brought up in legislation more and more often, for example in the HITECH Act, better-defined data governance is under way.




Data Exhaust


In the digital age, humans create, share, and record more data in a single day than we did from the beginning of history to the year 2000. The important questions therefore become how we manage all this data and who owns it. With a growing fear of a surveillance state, questions of data ownership draw attention to the individual’s data exhaust.







According to Techopedia, data exhaust is the “data generated as trails or information byproducts resulting from all digital or online activities.” The choices we make online contribute to storable data such as log files, cookies, temporary files, and records of other digital processes or transactions. These pieces of data are collected and can be used to personalize the user experience through targeted advertisements and unique recommendations. Data exhaust gives a more accurate picture of an individual’s preferences, likes, and habits, which companies can use to provide better services and products that people are more likely to consume.
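How byproduct data turns into a picture of someone's preferences can be shown with a toy example. The sketch below assumes a made-up log format (`timestamp user=… action=… category=…`) and invented users; real log files look different, and the functions `parse_line` and `build_profiles` are hypothetical.

```python
from collections import Counter, defaultdict

# Invented click log; each line is a byproduct of one page view.
LOG_LINES = [
    "2014-11-18T15:40:02 user=alice action=view category=books",
    "2014-11-18T15:41:10 user=alice action=view category=shoes",
    "2014-11-18T15:42:55 user=bob action=view category=cameras",
    "2014-11-18T15:44:31 user=alice action=view category=books",
]


def parse_line(line):
    """Split one log line into a dictionary of its fields."""
    timestamp, *pairs = line.split()
    entry = dict(pair.split("=", 1) for pair in pairs)
    entry["timestamp"] = timestamp
    return entry


def build_profiles(lines):
    """Count, per user, how often each category was viewed."""
    profiles = defaultdict(Counter)
    for line in lines:
        entry = parse_line(line)
        profiles[entry["user"]][entry["category"]] += 1
    return profiles


profiles = build_profiles(LOG_LINES)
print(profiles["alice"].most_common(1))  # [('books', 2)]
```

None of these users filled in a form about their interests; the profile falls out of the log alone, which is what makes data exhaust both useful and easy to overlook.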


Consequently, the question arises of how individuals and companies will manage this plethora of data. As the amount of information grows, even information that some people would consider irrelevant can prove to be important and, in some cases, dangerous. In class, Pete Warden, the co-founder of Jetpac, described how his company collected massive amounts of public pictures and used image processing techniques to find out what was happening in the photos. For example, through people’s Instagram pictures, they could find out where the gay bars in San Francisco are. In another case, however, Pete ran across pictures of gay people in Tehran, Iran, enjoying themselves at bars. In Iran, homosexuality is a crime punishable by law. This is an example of how individuals may not be fully aware of their own digital exhaust, and of the repercussions that something as simple as posting a picture can have.







Archive.org


A great example of the massive amounts of stored information on the web is realized through the Internet Archive, or archive.org. This is a non-profit digital library with the goal of providing a universally-free internet and access to knowledge. Brad Rubenstein joked during class that we should look into Professor Weigend’s website (weigend.com) through the archive. So that’s just what we did.






archiveorg.png




Archive.org can give us insight into how often the Professor edits the website, when it started, and what was previously posted on it. This is a great tool for Andreas because if he loses data or a page that he wrote on his website, he can go back through the archive and find it by looking through his own edits, shown through snapshots. We can also see how his activity has changed over the years, with peaks in 2005 and 2008, when the most snapshots were saved.


So, what does this tell us about the data? It shows the incredible amount of data that is available if one actively searches for it. Andreas’ data may not be important to you, because he is only a single individual. But as Pete Warden pointed out, while a single person’s data might not be very significant, individuals’ data can be extremely insightful when it is tied together with others’ in the aggregate.

Seeing all the things that we can learn about ourselves and, probably of more concern, the things that others can learn about us through our data exhaust, the question always comes back to: who owns this data? As has been underlined throughout the class, data exhaust is important for products and services to best cater to you, but the use of this data must be transparent. Transparency creates a more balanced, symmetric relationship in which some of the power moves towards the consumer, which is very important in the digital economy.


How Companies Use Personal Data


Insurance companies are no longer allowed to use data about pre-existing conditions for their health insurance policies, but they are looking to use new data sources. According to Robert Hunter, the Consumer Federation of America's director of insurance, insurance companies use a "data mining tool that lets insurance companies figure out which groups of customers are more likely to accept a price increase and which are more likely to shop around for a new policy."

Insurance companies have traditionally used complicated equations to calculate risk, using the information that their clients willingly offer. Now, they may have alternate methods of getting more data to help them calculate risk. Consumers have more data made available to them, but they are also generating more data that can be mined by companies, which may not always be in their favor.

Other companies, and especially web retailers, track online clicks to learn more about their consumers. Amazon is known for its personalized recommendations, which it generates by analyzing which products its consumers look at. Other companies have since adopted similar methods in order to provide recommendations for their consumers.
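One simple way to turn "which products consumers look at" into recommendations is to count which products are viewed together in the same session. The sketch below is a generic co-occurrence approach with invented sessions and hypothetical function names; it is not a description of Amazon's actual algorithm.

```python
from collections import Counter
from itertools import permutations

# Invented browsing sessions: the products one visitor looked at in one visit.
sessions = [
    ["camera", "tripod", "memory card"],
    ["camera", "memory card"],
    ["tripod", "camera bag"],
    ["camera", "camera bag", "memory card"],
]


def co_view_counts(sessions):
    """For each product, count how often every other product shared a session."""
    counts = {}
    for session in sessions:
        for a, b in permutations(set(session), 2):
            counts.setdefault(a, Counter())[b] += 1
    return counts


def recommend(product, counts, k=2):
    """Return the k products most often viewed together with `product`."""
    together = counts.get(product, Counter())
    # Sort by count (descending), then by name so that ties are deterministic.
    ranked = sorted(together.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:k]]


counts = co_view_counts(sessions)
print(recommend("camera", counts))  # ['memory card', 'camera bag']
```

Even this small example needs nothing beyond click data, which is why retailers can build recommendations without ever asking customers what they like.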

Facebook also tracks clicks very carefully, both on its own website and on thousands of other websites. Whenever you are logged into Facebook and visit an external website that has the Facebook "Like" button embedded, Facebook knows you have visited that website. It will then commonly serve you an ad from that website, or even for the product you were looking at, shortly afterwards.
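The mechanism behind an embedded button can be modeled in a few lines: to display the button, the browser requests it from the third party's server, and that request carries the third party's own cookie together with the address of the page being viewed. The simulation below is a deliberately simplified model with hypothetical classes (`WidgetServer`, `Browser`) and made-up URLs; it makes no network requests and does not describe Facebook's actual systems or any particular browser's cookie policy.

```python
from collections import defaultdict


class WidgetServer:
    """Stands in for the third party that hosts an embeddable button."""

    def __init__(self):
        # Maps a session cookie to the pages on which the button was loaded.
        self.visits = defaultdict(list)

    def serve_button(self, cookie, referer):
        # The request reveals who is asking (cookie) and from where (referer).
        if cookie is not None:
            self.visits[cookie].append(referer)
        return "<button>Like</button>"


class Browser:
    """A browser that may or may not hold a login cookie for the third party."""

    def __init__(self, widget_cookie=None):
        self.widget_cookie = widget_cookie

    def load_page(self, url, embeds_widget, widget_server):
        # Loading a page with the embedded button triggers a request to the
        # third party, even if the user never clicks the button.
        if embeds_widget:
            widget_server.serve_button(cookie=self.widget_cookie, referer=url)


server = WidgetServer()
logged_in = Browser(widget_cookie="session-123")
logged_out = Browser()

logged_in.load_page("https://shoes.example/red-sneakers", True, server)
logged_in.load_page("https://news.example/article", False, server)
logged_out.load_page("https://shoes.example/red-sneakers", True, server)

print(dict(server.visits))
# {'session-123': ['https://shoes.example/red-sneakers']}
```

In this model the logged-in visit is tied to an account without any click on the button, while the page without the button leaves no trace at the third party.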

Recently, Uber has been in the media for a variety of ethical violations. One of them involves a "god view" of the app, which shows passengers' personally identifying information and tracks their Uber rides live. Passengers who agree to share their personal information in order to use a service like Uber reasonably expect that it will be handled with care, and Uber failed to uphold this standard.


Mitigating harm from data

According to Brad Rubenstein, your presence in a data set is a danger to you whether you know it or not. Your information can always be used to exploit you in ways that you might not agree with. Once the information is out there, it’s impossible to get it back. Because of this, companies and individuals collecting data have a responsibility to think of the ways in which the data they are collecting could be used to harm individuals. They also have a
  1. Rubenstein’s two types of data ownership


  2. Dangers for consumer


  3. Problems and questions to consider
It is difficult to approach the problem of data collection because data collection is hard to stop at the source. In our current society, we cannot stop collecting data: it has become an automatic process and is an essential part of the way some companies operate. We can, however, educate the people collecting data on how to mitigate harm. Journalists are starting to monitor and keep track of the outputs and results. The focus should shift from looking at the outcomes to looking at how the data is used and how people are manipulating it. For example, I want my medical information to be immediately available to the hospital and doctors treating me, but I don’t want insurance companies to get this information about me. There need to be limitations on what actions companies can take once they have that information.




Personalization of Privacy

An individual’s notion of privacy and their desire to protect or disperse it ultimately depends on their preference. Factors that affect this include knowledge of the technology and familiarity with the means of controlling how their data is being used. Brad Rubenstein describes how consumers must be responsible for actively monitoring how companies use our data, rather than just telling them that they cannot have it.

Current methods are often limited to one’s own initiative to change the default security settings of social media or search engines. For example, Google Chrome’s Incognito mode keeps personal browsing information from a session from being stored locally; yet the servers of the websites you visit still have access to your location and usage. How can this be modified to protect the identity of a user one step further? Or, on the other side, why do users feel that features such as Incognito are necessary at all?

In markets where vast amounts of data are collected about each individual, how can one accurately filter and monitor the way that one's data is being received and used? The ideal solution would combine consumers who are educated about how useful their data can be to both parties with a complete and universal commitment by the companies that use our data to remain transparent, even making the extra effort to ensure that their consumers are aware.

If the means of customizing how and what kind of data companies collect from users become easier and more openly discussed, it might be possible to convince users to overcome their aversion to freely distributing their data. However, there are currently whole markets in search engine optimization (SEO) and reputation management that aim to control user access to information and present it in a skewed manner. If users can opt in and opt out at will, have reasonable incentives, and partake in an open discussion about data use, perhaps trust and rapport can be built to finally develop a mutually rewarding co-ownership of data and its experiences.



Discussion questions:


What are our options for personalization as of now?


Do people know of these options?


How clear are these options, and do we really understand how they will affect us in both the short and the long term?


Why don’t browsers behave like Google Chrome’s Incognito mode by default? Should it be the other way around?





References



Loshin, D. (2001). Enterprise knowledge management: The data quality approach. Morgan Kaufmann.


Techopedia: Data ownership
http://www.techopedia.com/definition/29059/data-ownership


Wikipedia: History of writing
http://en.wikipedia.org/wiki/History_of_writing


Wikipedia: Information technology
http://en.wikipedia.org/wiki/Information_technology


Wikipedia: Sumer
http://en.wikipedia.org/wiki/Sumer


EU Justice: Who can collect and process personal data?
http://ec.europa.eu/justice/data-protection/data-collection/index_en.htm

Techopedia: Data Exhaust
http://www.techopedia.com/definition/30319/data-exhaust

CNBC: Data mining is now used to set insurance rates
http://www.cnbc.com/id/101586404

Forbes: 'God View': Uber Allegedly Stalked Users
http://www.forbes.com/sites/kashmirhill/2014/10/03/god-view-uber-allegedly-stalked-users-for-party-goers-viewing-pleasure/