Andreas Weigend | Social Data Revolution | Fall 2014
School of Information | University of California at Berkeley | INFO 290A-03






In a single day, humans today create and record more data than all of mankind managed to produce from its beginnings to the year 2000. The early generation of Internet companies such as Amazon and Google pioneered algorithms, including item-based collaborative filtering and PageRank, that refine these data to help users make better decisions, changing how a billion people buy items and find information. In this class, we will examine the birth of the Social Data Revolution in the cradle of e-commerce and online advertising, and consider its present and future.

The physical world today is permeated with sensors: mobile phones, wireless routers, payment systems, traffic cameras, electronic door keys. The class focuses on what can be learned about people through the massive amounts of data these connected devices feed into the cloud, allowing the same analytical techniques built for a numeric world to be applied to our analog world and turning academic exercises into daily reality. We look into the ways in which data have entered, been created by, and changed our lives, and consider a future filled with networked sensors.

Greg Tanaka, CEO of Bay Sensors, will share how physical stores are using cameras, microphones, and other sensors for decisions ranging from staffing (how many people are needed in the store, taking into account everything from the scheduled national TV advertising campaign to the weather forecast and the football schedule) to planograms (where to put what on the shelves) and pricing. If you had all the data available, how would you use the micro-emotions on the customer's face that the camera is picking up?

This page created by:
Carlos Miguel Lasa
Nikhil Mane
Daniel Brenners
Elle Wang
Derek Kuo

1. Social Data

If you had all of the data in the world, readily available at your fingertips, how would you use it?

Imagine a world where data augments your entire day. The alarm clock wakes you up thirty minutes later than normal because you spent too much time last night reading tweets. Plus there isn’t an important meeting on your calendar, so you can be a few minutes late. Your wearable devices recommend a breakfast based on your biological data. A few job offers are sent to your phone based on your social network and the MOOC you just finished, but you like the job you have so you head to work.

You get to work and take a look at the visualizations on the wall of the lobby, indicating work performance of everyone on that floor. At your desk, sensors attached to your shirt buzz once you slouch in your chair or start to doze off. It’s five o’clock, and it’s time to leave work and spend some of that hard earned cash, so you head to the mall.

A store clerk presents a few shirts, recommended specifically for you (you didn’t notice the cameras in the corner that identified you as you walked in). The clerk knows that they will go great with your current wardrobe, specifically that new pair of jeans you bought last month that you still haven’t worn. A special outfit is presented to you. That girl/guy you’ve been constantly looking at on Facebook would really like the shirt that has been picked out for you. The clerk convinces you that with that outfit, you’ll definitely get some attention. He also whispers that the data indicates that she/he has been dying to go to that new Thai restaurant downtown.

Afterwards you head to the bar to grab a drink, but your credit card keeps getting denied. Apparently your bank knows you have an important presentation tomorrow morning, so less drink and more sleep. Time to head home and tweet about your day before you go to bed.

Social data is data that you share with the public. This includes tweets and Facebook updates, but also information from wearables, purchases, or other data meant to be shared with corporations or with friends in your social graph. From this data, you and the people you share it with can gain valuable information and translate it into actionable wisdom. Recommending clothes, calibrating health choices, and giving visual feedback on performance are all ways that social data can augment our lives and help us make better decisions.


Your Social Graph is the network of connections you have with other individuals. Facebook, LinkedIn, and Twitter all represent different ways of looking at your social graph.
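
As a rough illustration of a social graph as a data structure, the sketch below stores connections as an undirected adjacency map and looks up mutual connections. The names and edges are hypothetical, and this is not how any particular social network represents its graph.

```python
from collections import defaultdict

def build_graph(edges):
    """Build an undirected social graph as {person: set of connections}."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph

def mutual_connections(graph, a, b):
    """Return the people connected to both a and b."""
    return graph[a] & graph[b]

# Hypothetical connections, for illustration only.
edges = [("ana", "ben"), ("ana", "chris"), ("ben", "chris"), ("chris", "dee")]
graph = build_graph(edges)
print(sorted(mutual_connections(graph, "ana", "ben")))  # ['chris']
```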

The Social Data Revolution is the prevalence of social networks facilitating the sharing of social data. Did the inherent need for social data create these applications? Or did these applications, such as Facebook, create the social data revolution? Either way, social data has become a vital aspect of our lives and the world around us.


2. The Evolution of Data in Influencing Decision Making

As the Information Age has progressed, we have seen the usage of data evolve as it is used by humans in making decisions. When faced with decisions, such as making an online purchase or even accepting a friend request on Facebook, one has to look at the underlying and supporting data surrounding it to ensure that the right choice is made in an informed, well-balanced manner. Here we explore how this data has evolved over the passage of time, as online systems have matured and more data has become available through the smartphones and sensors we encounter in our daily lives.

Manual data

Prior to the age of computers, most data was manually created, collected and processed into information used by humans in our day-to-day lives. This highly manual process involved subject matter experts who had extensive experience in various fields, and this could be seen in the form of expert reviews - for movies, electronics, restaurants, and basically anything that can be consumed in the mass market. These experts would go out into the field, survey what was on the market, compare and contrast items and lend their expertise in helping consumers make the right decisions. While the upside here is that you get to learn from the experts, this model of processing data is highly effort intensive and definitely not scalable across geographies. Magazines and newspapers would have experts scattered across cities and regions, and the overhead of managing that many resources has been significantly reduced by the network effects of the Internet.

Explicit data

As society has shifted to online networked systems, data collection and processing have evolved to distinguish between two key types of data: explicit and implicit. Explicit data is contributed voluntarily by users. This can involve posts on social media, survey answers, online registration forms, or even firsthand product reviews on ecommerce websites. Many systems in the early years of the Internet processed explicit data, as they were not yet advanced enough to analyze the implicit data generated by users.

Implicit data

Implicit data refers to data not provided intentionally by users but gathered through available data sources. A good example is the analysis of clicks and activity gathered about users as they browse a particular website. This can include the pages the user visits, the sections the user lingers on, and the items the user looks at and purchases. If a person has a history of purchasing technology gadgets on an ecommerce website, it is probably safe to infer that the person is technologically literate. Beyond the digital world, implicit data is also produced by systems operating in the physical world: when you swipe your ID to enter a building, there is a log of the time you entered and the entrance you came through.
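
The purchase-history inference described above can be sketched in a few lines. The click log, the event names, and the `min_purchases` threshold are all assumptions made for illustration; a real system would use far richer signals.

```python
from collections import Counter

# Hypothetical click log: (user, event, category). Not real data.
click_log = [
    ("u1", "view", "gadgets"),
    ("u1", "purchase", "gadgets"),
    ("u1", "view", "books"),
    ("u1", "purchase", "gadgets"),
    ("u2", "view", "shoes"),
]

def purchase_profile(log, user):
    """Count purchases per category for one user (an implicit signal)."""
    return Counter(cat for u, event, cat in log if u == user and event == "purchase")

def likely_interest(log, user, min_purchases=2):
    """Return the most purchased category if it meets the assumed threshold, else None."""
    profile = purchase_profile(log, user)
    if not profile:
        return None
    category, count = profile.most_common(1)[0]
    return category if count >= min_purchases else None

print(likely_interest(click_log, "u1"))  # gadgets
print(likely_interest(click_log, "u2"))  # None
```

Note that the user never states an interest in gadgets; it is inferred from behavior, which is what separates implicit from explicit data.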

Contextual data

Contextual data refers to data that is not so much generated by the user as collected about the surroundings and environment of a device, or about where an activity takes place. Examples of contextual data include those collected by sensors, such as GPS coordinates, WiFi and Bluetooth connections, and the like.

Social data

Social data is the type of data that can be collected from your social graph. Examples of social data include referral programs or reviews created by your friends.

3. Where does Data come from?

Phygital systems

  • Phygital systems situate digital devices in physical environments to aid in the collection and processing of data
  • Guest speaker: Greg Tanaka of Bay Sensors
  • Bay Sensors uses multi-sensor technology to optimize the offline shopping experience
  • How can we bring the same online analytical techniques to the brick and mortar shopping world?
    • Nine out of ten transactions still happen in the physical world, so this is still relevant!

Web vs. Real

Let’s bridge the physical and the digital (phygital) by determining digital analogs in the physical world
  • Site visits vs. Foot traffic
  • Conversion rates vs. Purchases
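
To make the web/store analogy concrete, the same conversion-rate calculation can be applied to both settings. This is a minimal sketch; the daily numbers are invented, and counting foot traffic as "visits" is an assumption about how a store would instrument its entrances.

```python
def conversion_rate(visits, purchases):
    """Share of visits that ended in a purchase; 0.0 when there were no visits."""
    if visits == 0:
        return 0.0
    return purchases / visits

# Hypothetical numbers for one day.
web = conversion_rate(visits=2000, purchases=50)     # site visits -> orders
store = conversion_rate(visits=400, purchases=100)   # foot traffic -> transactions
print(f"web: {web:.1%}, store: {store:.1%}")  # web: 2.5%, store: 25.0%
```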

Active vs. Passive Data Sources

[Figure not reproduced. Labels: Capital Expense, Participant Acquisition Cost, User-centric Data, Local Centric Data]

Data Source Types
Data from user participation
  • iBeacon, Facebook, Twitter, POS, CRM
  • Low number of participants
  • People centric
Data from sensors
  • Visual, WLAN/BT, Cellular, Audio
  • Location centric, many participants
  • Anonymous

Mobile Fingerprints

  • Will get cheaper to sense
  • Device location should become more accurate (inch level) in 3-4 years with future WiFi standards; MAC address scrambling is present in iOS 8
  • Getting more popular because of wearable devices


Privacy is one of the most prevalent concerns people have. What are our expectations? How much information are we willing to expose? Do the benefits of sharing social data outweigh our own privacy concerns?

Recently, private photos of several famous female celebrities were exposed online. Lawsuits were filed against Google for refusing to take down the search links. The incident sparked another discussion about what the legal restrictions on personal data should be. Should laws be established, or is it just a moral issue? With security cameras almost everywhere, are we even still conscious of our privacy being violated? Is there a boundary that has already been crossed without us, as consumers, being aware of it?

Maybe privacy never existed and was always an illusion. After all, with the government monitoring our phone records, browsing history, shopping history, and more in the name of national security, how much privacy we can really have remains an open question.
  • Security cameras already common
  • Best practices: aggregated, anonymous
  • The history of privacy -- "It never existed. Privacy is an illusion."
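
One way to read the "aggregated, anonymous" best practice is sketched below: raw device identifiers are replaced by salted hashes, only per-hour unique counts are kept, and hours with very few devices are suppressed. The salt value, the `min_count` threshold, and the sighting format are assumptions for illustration. A salted hash is pseudonymization rather than strong anonymity, so this should be taken as a starting point only.

```python
import hashlib
from collections import defaultdict

def pseudonymize(device_id, salt):
    """Replace a raw device identifier with a salted SHA-256 digest."""
    return hashlib.sha256((salt + device_id).encode("utf-8")).hexdigest()

def hourly_unique_counts(sightings, salt, min_count=3):
    """Count distinct devices per hour, dropping hours with fewer than min_count devices."""
    seen = defaultdict(set)
    for hour, device_id in sightings:
        seen[hour].add(pseudonymize(device_id, salt))
    return {hour: len(ids) for hour, ids in seen.items() if len(ids) >= min_count}

# Hypothetical sightings: (hour of day, device identifier).
sightings = [(9, "aa:01"), (9, "aa:02"), (9, "aa:03"), (9, "aa:01"), (10, "aa:04")]
print(hourly_unique_counts(sightings, salt="rotate-me-daily"))  # {9: 3}
```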


Considerations in handling data

  1. Does your customer understand the value they get when they give you data?

  2. Does your product or service get better over time with data?

  3. How much information are you willing to give up?

  4. How do you evaluate the rate of return on your information investment? Is there a standard metrics system?

How could you use all of the data in the world to empower the consumer?

  • Efficiency: Once you have identified an item you want to purchase, an app can use your location data to lead you to the aisle where the item is located. Data can save consumers’ time and improve efficiency.

  • Staffing: A retail store can gather real-time data on how many items each shopper is purchasing and how many people are in each check-out line, and adjust staff and check-out counters accordingly to save consumers’ time.

  • Experience: Consumers can track past prices and past discounts for an item. For example, by looking at past data you may find that a particular shirt you love goes on sale every Friday at 3 pm; data can thus save you money and provide convenience.

  • Decision making: You are looking for a TV. Before you step into any store, data can show you which store has inventory, which has discounts going on, which brand has the best customer service, and so on. Data thus helps with better decision making.

In the world of massive data, what does the power dynamic between corporations or governments and consumers look like? Are we better off with information asymmetry or with complete transparency? What is the best balance, and how can we reach that point? We’ve all encountered this situation at some point in our lives: an airline ticket website says “2 seats left.” You rush to purchase the ticket right away, only to find out afterwards that there are still “2 seats left.” Do we still fall for these marketing tricks? Some people believe that “Transparency is the new privacy.” With information symmetry, corporations can no longer leverage the information gap to make a profit. However, will corporations and governments be willing to expose their data? What are the legal liabilities? In addition, from the consumer side, are we willing to give up all of our data, and what are the implications of doing that?
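
The "Experience" example above (spotting that a shirt goes on sale every Friday at 3 pm) can be sketched as a simple count over a price history. The history, the regular price, and the function name are hypothetical; a real tool would need much more data before trusting such a pattern.

```python
from collections import Counter
from datetime import datetime

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

def most_common_sale_slot(price_history, regular_price):
    """Return the (weekday, hour) at which the price was most often below regular."""
    slots = Counter(
        (DAYS[ts.weekday()], ts.hour)
        for ts, price in price_history
        if price < regular_price
    )
    return slots.most_common(1)[0][0] if slots else None

# Hypothetical observations: (timestamp, observed price).
history = [
    (datetime(2014, 9, 5, 15), 19.99),   # a Friday, 3 pm
    (datetime(2014, 9, 8, 15), 29.99),   # a Monday, regular price
    (datetime(2014, 9, 12, 15), 19.99),  # a Friday, 3 pm
    (datetime(2014, 9, 17, 10), 24.99),  # a Wednesday, 10 am
]
print(most_common_sale_slot(history, regular_price=29.99))  # ('Friday', 15)
```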

In addition, let us look at some of the existing tools that use social data to empower consumers.
Retailers use big data to target consumers. Sometimes the data is even generated by the consumers themselves. Thus, it makes sense for consumers to have ready access to the kind of information that empowers them to make smart purchase decisions. To avoid price discrimination by the seller, the consumer must know the going price of a particular product or service. Apps and websites provide the tools to research products thoroughly, make comparisons, and buy based on the best offers. Reviews and recommendations from buyers, peers, and experts add another layer of information before making that buying decision.

Although such tools are quite powerful and useful at putting information in the hands of the consumer, they can’t ensure that the consumer doesn't get fleeced. They must be used by the discerning consumer as a reference point in terms of the cost of consuming an equivalent product or service.
  • Comparison sites - Want to buy that new book? Popular price comparison sites such as Pronto, Nextag, and PriceGrabber are useful for consumers looking to compare prices across multiple merchants. The ability to check millions of products from around 25,000 merchants makes it easier to find the right price.

  • FlightAware Insight - Want to know what is the least you can spend on a flight? Tool that can help frequent flyers find the best price for their flights; enables consumers to find out the range of prices that others have paid on particular routes.

  • MakeMyTrip’s Fare Calendar - Travelling to India? Indian travel portal MakeMyTrip’s fare calendar; displays the cheapest airfares available across the year for all flights on India’s domestic routes.

  • Cost of Wedding - Planning a wedding? Tool that tells you how much you should be spending on a wedding; it even includes wedding themes and specifics such as location, gifts, or food.

  • Homewyse - Home improvement on your mind? Acts as an independent reference for home product, installation and service decisions. You can even work out the specifics for home improvement such as framing and insulation based on market prices and your requirements. Also see HomeAdvisor for the average national cost (US) for a particular repair.

  • Babycenter - Planning for your kids' future? Tool which allows parents to calculate the approximate costs based on geography and household incomes to ascertain the amount of money needed for raising a child. Incorporates factors such as housing, food, transportation, school, college, clothing and healthcare in the calculation.

  • College Savings - Thinking of your kids' college education? Tool that allows estimation of the projected cost that will be needed at the time a student becomes eligible for college education; allows budgeting for inflation percentage for getting realistic and more conservative estimates.

  • Repairpal - Car repair time? Tool that allows you to figure out how much you should be paying for repair or maintenance of your vehicle in terms of the cost of labor and the cost of spares. Estimates are specific to your car’s model and ZIP code for precision.

  • TaxiFareFinder - Getting a taxi? Quickly find the optimum fare that you should be paying for a taxi ride in your city. Also takes care of the level of traffic and alternative routes while approximating the costs.
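
The inflation adjustment mentioned for the College Savings tool can be illustrated with simple compound growth. This is a sketch of the general idea, not the tool's actual method; the cost, inflation rate, and time horizon are made-up inputs.

```python
def projected_cost(current_cost, annual_inflation, years):
    """Project a cost forward assuming a constant annual inflation rate."""
    return current_cost * (1 + annual_inflation) ** years

# Hypothetical inputs: $20,000 per year today, 5% inflation, 10 years until college.
print(round(projected_cost(20000, 0.05, 10), 2))  # 32577.89
```

Raising the assumed inflation rate gives a more conservative (higher) estimate, which is the budgeting effect the tool description refers to.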

4. What is a Data Scientist?

Data scientists solve complex data problems by employing deep expertise in some scientific discipline. They find and interpret rich data sources; manage large amounts of data despite hardware, software, and bandwidth constraints; merge data sources; ensure the consistency of data sets; create visualizations to aid in understanding data; build mathematical models using the data; and present and communicate their insights and findings to the specialists and scientists on their team and, if required, to a non-expert audience. Data scientists have several characteristics:
  1. Data literate: Data scientists know what data is and how to analyze it. They incorporate varying elements and build on techniques and theories from many fields, such as mathematics, probability models, computer programming, data engineering, visualization, data warehousing, and high-performance computing, with the goal of extracting meaning from data and creating data products. These data literacy skills are crucial to a data scientist.

  2. Able to handle large data sets: Data scientists can apply many tools and techniques to analyze “Big Data”. In the era of information explosion, data scientists must know how to handle gigabytes or even terabytes of data in order to aggregate them and extract critical insights. Popular tools for big data include distributed computing frameworks such as MapReduce and distributed file systems such as the one in Hadoop (HDFS).

  3. Understands domain and modeling: A domain at center stage of data science is the explosion of new data generated from smart devices, web, mobile, and social media. Data science requires a versatile skill set. Many practicing data scientists specialize in specific domains such as marketing, medicine, security, fraud, finance, and advertising. Data scientists also rely heavily on techniques such as machine learning, optimization, and natural language processing to analyze data and build models.

  4. Wants to communicate and collaborate: A data scientist is most likely to be an expert in only one or two of these disciplines and proficient in another two or three. Data science is therefore practiced as a team whose members have a variety of expertise. Because they rely heavily on teamwork, data scientists need to collaborate with stakeholders and teammates more than software engineers do. Presentation skills are also crucial; for example, information visualization is the skill of interpreting data and making it understandable to an audience using charts and graphs.

  5. Curious with “can-do” attitude: A data scientist is a scientist. They must be very curious about their data and about how to design experiments to verify assumptions. This experimental “can-do” attitude is the basis of data-driven innovation, and it is a must-have for data scientists.
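
To give a feel for the MapReduce model mentioned in item 2, here is a single-machine word count written in the map, shuffle, and reduce style. It is a toy sketch that runs in one Python process; it does not use Hadoop or any distributed framework, and the function names are chosen for illustration.

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}

documents = ["social data", "Social graph", "data data"]
mapped = chain.from_iterable(map_phase(d) for d in documents)
counts = reduce_phase(shuffle(mapped))
print(counts)  # {'social': 2, 'data': 3, 'graph': 1}
```

In a real cluster, the map and reduce functions run in parallel on many machines and the shuffle step moves data between them; the structure of the computation stays the same.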

Business Intelligence vs. Data Science

Data Scientist
Role In Organization
Externally Driven (Cost Center)
Internally Driven (Profit Center)
Excel, R, SQL
Cloudera, Palantir
Performance Alerts
Task Time Scales
Daily (Iterate Fast)
Data Sources
Social Data
Statistics, Business
Physics, Engineering, CS

