What Is Data Collection?
Data collection, also called data ingestion, is the first step in the data pipeline: the process of gathering information from a variety of sources. The purpose of data collection is to provide the information necessary for business analytics, research, and decision-making. In many cases, data-informed decisions can take place at the point of data generation, as in the case of smart manufacturing that uses AI vision data to verify output quality on the production line. In other cases, analysis can take much longer and act on petabytes or more of data to support the most challenging computational problems, like genomic sequencing. As IoT, edge, and data center technologies evolve, data collection methods and solutions have become more diverse than ever.
Structured vs. Unstructured Data
The two primary types of data are structured and unstructured, with some experts using semistructured to describe data that has aspects of both.
- Structured data is specific and organized, allowing it to be read and understood easily by relational databases. Typically, this information is hierarchical and can be easily compared. Examples of structured data include financial transaction data, customer relationship management (CRM) information, enterprise resource planning (ERP) data, and health records.
- Unstructured data is more qualitative in nature, with less inherent organization or structure. Because it is difficult to fit into a hierarchy, unstructured data has long been collected faster than it can be analyzed, leaving much of it “dark,” or unanalyzed by the organization producing and storing it. Typically, nonrelational databases are used to store and access unstructured data. Examples of unstructured data include audio files, PDFs, social media posts, customer feedback, and historical paper documents.
Both structured and unstructured data may be collected alongside metadata, or data about the data itself. For example, a digital camera collects metadata on time/date and camera equipment, which is then transmitted as part of the digital photograph file itself.
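To make the distinction concrete, here is a minimal sketch (the schema, field names, and values are all hypothetical) that stores structured records in a relational table, while an unstructured note is kept as an opaque payload paired with its metadata:

```python
import sqlite3

# In-memory relational database for structured data (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)
conn.execute("INSERT INTO transactions (customer, amount) VALUES ('alice', 42.50)")

# Structured data is easy to query and compare.
total = conn.execute("SELECT SUM(amount) FROM transactions").fetchone()[0]

# Unstructured data (free-form text) has no fixed schema; it travels as an
# opaque payload, with metadata describing when and where it was captured.
note = {
    "payload": "Customer called about a delayed shipment...",
    "metadata": {"captured_at": "2023-05-01T10:15:00Z", "source": "call-center"},
}

print(total)                       # sum over the structured rows
print(note["metadata"]["source"])  # metadata rides along with the raw payload
```

The structured rows can be aggregated with a single SQL query, while the unstructured note would need text analytics before it yields comparable values, which is why so much of it stays “dark.”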
Data Collection Sources and Methods
Data collection can refer to one of two processes: data scientists may collect and curate information in databases and bring it into the data center or cloud environment for processing; or IoT sensors, cameras, and other devices can collect data at the edge. In many edge IoT deployments, this data is processed in near-real time on edge servers to enable use cases like automated defect detection in smart factories or intelligent traffic management in smart cities. Data collected at the edge can also travel upstream into the cloud for further processing and analysis.
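As a simple illustration of the edge pattern described above, the sketch below (the threshold, batch size, and simulated readings are all hypothetical) makes immediate decisions locally and forwards only compact summaries upstream:

```python
from statistics import mean

TEMP_LIMIT = 75.0  # hypothetical local alert threshold


def process_at_edge(readings, batch_size=4):
    """Check each reading in near-real time; forward only batch summaries."""
    alerts, batches, buffer = [], [], []
    for value in readings:
        if value > TEMP_LIMIT:          # immediate local decision, no cloud round trip
            alerts.append(value)
        buffer.append(value)
        if len(buffer) == batch_size:   # send only an aggregate upstream
            batches.append({"mean": mean(buffer), "max": max(buffer)})
            buffer = []
    return alerts, batches


# Simulated sensor stream standing in for an IoT device.
stream = [70.2, 71.0, 76.5, 69.8, 72.1, 70.4, 77.3, 71.9]
alerts, batches = process_at_edge(stream)
print(len(alerts))   # readings that triggered a local alert
print(len(batches))  # summaries forwarded to the cloud
```

The local check captures the near-real-time advantage, while the batched summaries capture the bandwidth savings of not sending every raw reading upstream.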
Data collection sources and methods have diversified to include:
- IoT devices and sensors: With the development of edge technologies, data can now be collected through automated processes from more sources than ever: sensors on industrial machines, sewer pipes, bridges, and patient monitoring devices, just to name a few.
- Audiovisual data collection: As solutions have evolved to analyze unstructured data like audio, image, and video files, collecting this data has become more important than ever. These types of unstructured data often use much larger file types, requiring more processing power and storage space for ingestion.
- Real-time analytics: With real-time analytics, data is collected and analyzed while the collection stream is ongoing. For example, capacity sensors can help retailers comply with public health requirements by triggering real-time alerts when safe occupancy numbers are approached or exceeded.
- Anonymized data collection: Privacy concerns have created a need for some data to be analyzed without a direct connection to the specific individual generating the data point. Data collection and processing may now include demographic groupings without an ability to access specific personal information.
- Data curation: Data scientists specialize in organizing structured data sources to support complex analysis of things like genomic sequencing, climate science, and financial forecasting. These data sets are typically of a magnitude that requires HPC infrastructure to analyze.
A modern data collection strategy is likely to include a diverse range of these techniques and sources.
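For instance, the anonymized data collection technique above can be sketched as pseudonymizing the direct identifier and keeping only a coarse demographic grouping. This is a minimal illustration with hypothetical field names and salt; real deployments need stronger, audited privacy guarantees:

```python
import hashlib

SALT = b"rotate-me-regularly"  # hypothetical secret salt


def anonymize(record):
    """Replace the direct identifier with a salted hash; keep coarse demographics."""
    token = hashlib.sha256(SALT + record["user_id"].encode()).hexdigest()[:16]
    return {
        "token": token,                              # stable pseudonym for grouping
        "age_band": f"{record['age'] // 10 * 10}s",  # e.g., 34 becomes "30s"
        "event": record["event"],
    }


raw = {"user_id": "alice@example.com", "age": 34, "event": "checkout"}
clean = anonymize(raw)
print(clean["age_band"])       # coarse grouping instead of exact age
print("user_id" in clean)      # the direct identifier is gone
```

The salted hash lets analysts count events per user without being able to recover who that user is, and the age band supports demographic analysis without exposing exact personal details.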
Data Collection Devices at the Edge
The technology requirements of a data collection strategy depend on where the data is being generated and what a business wants to accomplish with that data. There are two key advantages to processing data at the point where it is collected or generated. First, workloads don’t need to travel upstream to the cloud, so businesses save on bandwidth and network infrastructure costs. Second, processing data at the point of generation allows for near-real-time analytics.
IoT devices can benefit from Intel Atom® processors or Intel® Movidius™ Myriad™ X vision processing units (VPUs) to deliver the performance needed for audiovisual or sensor streams at the edge. Depending on the use case, these processors are also well suited to the thermal requirements for smaller enclosures or even outdoor environments. For edge workloads that are more data intensive, such as supporting AI inference for multiple video streams, AI appliances and edge servers with 11th Gen Intel® Core™ processors or 3rd Gen Intel® Xeon® Scalable processors provide more data throughput than edge sensors alone. These servers also allow for more connectivity with PCIe expansion slots, so that system integrators can add in accelerators for specific deployments.
Data Collection Technology for the Cloud and Data Center
It doesn’t always make sense to move compute to the edge. If an implementation needs to scale up resources quickly beyond what is available in an edge appliance, then ingesting your data into the cloud is a more effective option. Also, some workloads are so compute, memory, or storage intensive that they need data center or HPC infrastructure to generate results in a timely manner. In these cases, data collection technology will have the most impact in a balanced configuration that combines key upgrades for compute, storage, and networking to achieve higher levels of platform utilization and data availability.
- Processing: 3rd Gen Intel® Xeon® Scalable processors are the ideal choice for data collection workloads in the cloud or data center. These processors deliver up to 1.92x better analytics performance vs. a five-year-old four-socket platform1 and, with Intel® DL Boost with BF16, up to 1.93x better AI image classification performance (ResNet50 throughput) vs. the prior generation.2
- Network: Intel® Ethernet 800 Series Network Adapters support up to 100GbE speeds with multiple form factors, broad OS support, and flexible port configuration. Embedded technology like Dynamic Device Personalization (DDP) helps reduce latency with programmable behaviors for packet processing.
- Storage: Intel® Optane™ Data Center SSDs offer incredibly fast read/write speeds, high volumes for better storage density, and options for PCIe interfaces that position data closer to the processor.
Your End-to-End Data Collection Strategy
From edge to core to cloud, the extensive Intel portfolio delivers the performance, bandwidth, and data availability needed to support fast, consistent, and reliable data collection and ingestion. Intel offers an end-to-end foundation for your data pipeline, enabling intelligent edge devices, high-bandwidth networking solutions, and compute performance in multiple entry points and form factors. Intel® solutions help businesses move their data fast, driving actionable insights and high value.