Data is knowledge; it’s power, it’s insight. At its core, though, data is information. Specifically, data is a unit of information, just like a mile is a unit of distance. Collected through observation, data is stored in notebooks, on hard drives, and even in your own mind. That’s right: every memory, idea, recollection, and association you can think up is a result of data you’ve collected. Data comes in two types: qualitative and quantitative. Qualitative data describes the quality of something: what is it, and what is it like? Quantitative data measures the quantity of something: how many are there? Together, these collections of numbers, text, and facts are used to calculate, analyze, predict, or solve for some sort of unknown variable.
Unprocessed data—loose collections of numbers and facts without any clear structure or intention—is known as raw data. Raw data by itself offers little value, but when that data is refined, analyzed, and measured, analysts and data scientists can use that information to make informed decisions.
Companies can access your data in a variety of ways. Sometimes it’s first-party data that belongs to and is used by a company you already frequent, and other times advertisers purchase this data from a brokerage to market their products to you. Specifically, data can include anything from personal information about your past purchases and online browsing history to your gender and even your physical location. One thing’s for sure: as long as you have a smartphone or online presence, there will be companies collecting your data and using it to advertise their products.
Personal data is information about a person that can identify them. The Internet has made it easy to collect large amounts of personally identifiable information (PII) on people. Names, social security numbers, dates or places of birth, phone numbers, IP addresses, and internet cookies are all considered types of PII. The collection and analysis of personal data created the multi-billion-dollar data brokerage industry where companies collect and resell personal data, primarily for advertising purposes. The personal data market is so profitable that many social media platforms and apps maintain their “free” status through the massive amounts of user data they collect and sell.
While there is no single global standard or definition of PII, many regulatory and standards bodies, including the National Institute of Standards and Technology (NIST), have come up with their own definitions of what constitutes personally identifiable information.
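To make the idea of PII concrete, here is a minimal Python sketch that masks a few of the identifier types mentioned above (Social Security numbers, phone numbers, and IP addresses) in a piece of text. The patterns and the `redact_pii` function are illustrative assumptions, not a standard; real PII detection has to handle many more formats and far more context.

```python
import re

# Hypothetical, deliberately simplified patterns (US-style formats only).
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "ip_address": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def redact_pii(text: str) -> str:
    """Replace each match with a placeholder naming the kind of PII found."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label.upper()}]", text)
    return text


if __name__ == "__main__":
    print(redact_pii("Call 555-123-4567 from 192.168.0.1, SSN 123-45-6789"))
```

Masking like this is only one small piece of handling personal data responsibly; it does nothing for names, birthplaces, or cookies, which need different techniques.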
Your personal data dividend is your cut of the profits made from the data that you have created. It’s exactly what Invisibly advocates for, and helps you to claim.
Data brokerages are businesses that compile raw data from numerous sources and then sort and analyze it for meaning. These brokerages then license the analyzed data to other organizations. Data brokers can also directly license another company’s data or help companies process their data to uncover more valuable insights.
Data brokers source their data based on the products their customers sell. Generally, that information is gathered through website cookies and free apps, which collect mountains of information just by people using their connected devices.
More limits and regulations mean fewer resources for marketers, developers, sales teams, and any other division reliant on data-driven customer insights. Simply put, many of the tools companies have relied on in recent years may be forced out of existence either through legislation or the court of public opinion. This shift will force data workers to revisit more traditional engagement and development processes to create a new era of audience identification strategies.
Despite GDPR policies, many companies outside of the EU have huge swaths of customer data just stored away collecting digital dust. Companies didn’t have a plan for their customer data; they just knew that they wanted as much of it as they could scavenge. As regulations continue to redefine what is and isn’t acceptable data practice, what brands do with these data graveyards matters a great deal to their customers. Not only are these wastelands a significant financial burden for companies to structure and maintain, but they’re also an obvious target for bad actors looking to steal customer data.
Data misuse is the inappropriate use of data, where data is used in ways or by people beyond its stated intention. Every region has its own laws and policies that shape data use protocols. Generally, when data is collected, the collector is expected to outline that data’s specific intended and acceptable use.
Today, data misuse is more common thanks to employees and third-party partners that may have access to sensitive company information. Not to be confused with data theft, data misuse is rarely due to malicious intent or collection without consent; it’s more often a result of ignoring specific permissions and allowable use cases for personal data.
For example, a credit card company employee might peek at a friend’s balance, or a person working for a ride-sharing app might track a specific customer’s location. Both are no-nos by all standards. Even something as simple as using company software for personal use can be considered data misuse.
Data misuse is a huge threat to privacy and security and often comes with specific penalties outlined in company policies. Unfortunately, many companies don’t have clearly defined cyber policies to prevent data misuse other than terminating an employee.
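One way to picture the "stated intention" idea is a record that carries its approved uses with it, so any other use is refused. The Python sketch below is a hypothetical illustration; `DataRecord`, `DataMisuseError`, and `access` are made-up names, and real systems enforce this through access controls, audit logs, and policy rather than a single function.

```python
from dataclasses import dataclass


class DataMisuseError(Exception):
    """Raised when data is requested for a purpose beyond its stated intent."""


@dataclass(frozen=True)
class DataRecord:
    """A piece of collected data plus the uses stated at collection time."""
    subject_id: str
    value: str
    allowed_purposes: frozenset


def access(record: DataRecord, purpose: str) -> str:
    """Return the value only if the purpose was approved at collection."""
    if purpose not in record.allowed_purposes:
        raise DataMisuseError(f"'{purpose}' is not an approved use for this record")
    return record.value


if __name__ == "__main__":
    balance = DataRecord("customer-1", "balance: 120", frozenset({"billing"}))
    print(access(balance, "billing"))       # approved use
    try:
        access(balance, "curiosity")        # an employee peeking at a friend's balance
    except DataMisuseError as err:
        print("Blocked:", err)
```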
Data poisoning occurs when the training data of an AI or machine learning algorithm is corrupted, creating an inaccurate final output. Data poisoning is often a direct criminal attack on the integrity of the model. The difference between data poisoning and other cyber-attacks is that eventually, the poisoning becomes an accepted part of the AI. Attackers learn how the system learns and feed it the wrong information in order to exploit the model.
Another way to corrupt data is to introduce the attack before the machine learning can begin. Compared to corrupting an existing system, this approach gives bad actors an easier way in because it reduces the security protocols they have to bypass: criminals can poison the learning process before it starts. By the time a developer or engineer realizes something’s wrong, it’s already too late.
Data poisoning takes much longer than most other cyber-attacks, so it’s difficult to pinpoint exactly when the data was corrupted. AIs are constantly learning and updating to make the most correct predictions based on their inputs. The only way to fix the corruption is by retraining the system from the ground up. Realistically, the only way to avoid data poisoning is by preventing it in the first place through validity checks, regression training, rate limits, and various other measures. General digital hygiene can help as well—limiting who has access to the machine learning system, not sharing passwords, etc.
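As a rough illustration of two of the preventive measures named above, validity checks and rate limits, the Python sketch below screens incoming training samples before they reach a model. The sample layout (`source_id`, `value`, `label`), the allowed range, and the per-source cap are all assumptions made for the example; real defenses are tuned to the specific data and model.

```python
from collections import Counter


def filter_training_samples(samples, valid_range=(0.0, 1.0), max_per_source=2):
    """Keep samples that pass basic validity checks and a per-source rate limit.

    Each sample is a (source_id, value, label) tuple.
    """
    low, high = valid_range
    accepted = []
    per_source = Counter()
    for source_id, value, label in samples:
        if not (low <= value <= high):
            continue  # validity check: feature value outside the expected range
        if label not in (0, 1):
            continue  # validity check: label is not one of the known classes
        if per_source[source_id] >= max_per_source:
            continue  # rate limit: one source cannot flood the training set
        per_source[source_id] += 1
        accepted.append((source_id, value, label))
    return accepted


if __name__ == "__main__":
    incoming = [
        ("a", 0.2, 1), ("a", 0.4, 0), ("a", 0.5, 1),  # third sample exceeds the cap
        ("b", 7.0, 1),                                # out-of-range value
        ("c", 0.9, 3),                                # unknown label
        ("b", 0.1, 0),
    ]
    print(filter_training_samples(incoming))
```

Checks like these will not catch carefully crafted poison that looks valid, which is why they are usually combined with access limits and regular testing of the trained model.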
Data pooling is the process of consolidating data from a large number of sources in a single, centralized database where it can be analyzed and compiled into a standardized format. Database software then verifies and synchronizes that information.
Likewise, a data pool is a related set of values obtained from a single source. A data pool can be any data set meant for analysis: employee records, patient medical information, Global Trade Item Numbers (GTINs), etc. Data pools can be private or shared; a private pool cannot be seen by or shared with anyone except the administrator. Most web-based data pools are shared between different sources and can be accessed by anyone with permission.
The key attributes within each data pool allow trading partners to synchronize information easily. While most data is collected through automation, how that information is collected can affect its accuracy and, in many cases, its usability.
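The pooling process described above can be sketched in a few lines of Python: records from different sources are renamed into one shared format and then merged on a key attribute. The source names, field names, and the choice of `gtin` as the key are hypothetical, picked only to mirror the GTIN example; real data pools rely on database software and agreed-upon schemas.

```python
def standardize(record: dict, field_map: dict) -> dict:
    """Rename source-specific fields to the pool's shared field names."""
    return {shared: record[original] for original, shared in field_map.items()}


def pool(sources: dict, field_maps: dict, key: str = "gtin") -> dict:
    """Merge records from every source into one dict keyed by a shared attribute."""
    pooled = {}
    for source_name, records in sources.items():
        for record in records:
            row = standardize(record, field_maps[source_name])
            # Records that share the key are combined into a single entry.
            pooled.setdefault(row[key], {}).update(row)
    return pooled


if __name__ == "__main__":
    sources = {
        "retailer": [{"GTIN": "0001", "Name": "Soap"}],
        "supplier": [{"gtin_code": "0001", "price_usd": 2.5}],
    }
    field_maps = {
        "retailer": {"GTIN": "gtin", "Name": "name"},
        "supplier": {"gtin_code": "gtin", "price_usd": "price"},
    }
    print(pool(sources, field_maps))
```

Because both sources describe the same key, the pooled entry ends up with the name from one and the price from the other, which is the kind of synchronization trading partners depend on.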
A data subject is any person whose personal data has been collected. Data subjects may be identified, either directly or indirectly, through the data that has been collected on them. Companies use this data for various reasons, which should be clearly communicated to the subject: specifically, what data is being collected, why it is being collected, and how it is being used.