The biggest roadblock to implementing a proof of concept for machine learning or deep learning is sourcing, organizing, and feeding the right kind of data into your model.
That’s why the organizations leading the AI charge today are those that have built their strategies around data.
Build Lakes, Not Swamps
Many organizations have ‘dirty’ or bad data: it is incomplete, siloed in different areas of the business, fraught with privacy issues, or mislabeled.
Training an AI model on these so-called ‘data swamps’ is difficult and produces poor results. Well-sourced, consistent, and labeled data sets, known as ‘data lakes’, are the key to a successful AI implementation.
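As a rough illustration of what separates a swamp from a lake, the sketch below counts three of the problems mentioned above (incomplete records, mislabeled records, and duplicates) in a small batch of data. The field names, the allowed labels, and the audit_records helper are all hypothetical and chosen only for this example; a real audit would follow your own schema.

```python
from collections import Counter

# Hypothetical schema: the fields every record must carry and the allowed labels.
REQUIRED_FIELDS = ("id", "text", "label")
ALLOWED_LABELS = {"positive", "negative"}


def audit_records(records):
    """Count common data-quality problems in a list of dict records."""
    issues = Counter()
    seen_ids = set()
    for record in records:
        # A field counts as missing if it is absent, None, or an empty string.
        if any(record.get(field) in (None, "") for field in REQUIRED_FIELDS):
            issues["incomplete"] += 1
        # A label that is present but not in the allowed set is likely mislabeled.
        label = record.get("label")
        if label not in (None, "") and label not in ALLOWED_LABELS:
            issues["unknown_label"] += 1
        # Repeated ids often indicate the same record arriving from two silos.
        record_id = record.get("id")
        if record_id is not None:
            if record_id in seen_ids:
                issues["duplicate_id"] += 1
            seen_ids.add(record_id)
    return dict(issues)


sample = [
    {"id": 1, "text": "great service", "label": "positive"},
    {"id": 2, "text": "", "label": "negative"},              # empty text field
    {"id": 2, "text": "slow delivery", "label": "negatve"},  # repeated id, misspelled label
]
print(audit_records(sample))
```

Running a check like this before training gives an early, inexpensive signal of how much cleaning a data set needs.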
So, what are the top questions enterprises should be asking themselves while formulating their AI data strategy?
Understand whether you need structured or unstructured data, or data that’s in voice, text or photographic formats. Once the size and extent of the data required are clear, start investigating where to find it.
Collecting, annotating, and curating relevant and compliant data can be difficult and costly. Many organizations are investing in tools and processes, while others are purchasing data sets or pre-trained models from third parties.
Training data often comes from archived data and training is run on cloud infrastructure, whereas inference may run on live data from sensors at the network edge. This has implications for the size of the data set used and the infrastructure required.
The Internet of Things (IoT) will include a projected 200bn devices by 2020, and the data produced is expected to total 40 zettabytes by that time. Developing AI applications that can mine this massive amount of data will require advanced infrastructure, effective job scheduling and storage management technology.
Building a New Enterprise Data Strategy to Help Find Missing Children
The nonprofit organization that serves as a clearinghouse for information that can help find missing children received millions of tips, which previously had to be reviewed and prioritized by a team of just 25 analysts.
To help solve this problem, Intel is furnishing high-performance computing to make analyzing the tips more efficient and more manageable. A new enterprise data strategy was implemented to integrate the hundreds of databases and sources the organization deals with, which also required modernized IT infrastructure.
Bob Rogers, Intel’s chief data scientist, notes the importance of data – the incredible number of tips received per day – to the project:
"The numbers are growing dramatically, and it’s driven by these advanced technologies… A tech coalition of online service providers has been getting better and better at pooling their resources and getting tips to the organization. They are casting a wider net, with finer mesh, more frequently."
Standardizing Data Sources to Improve Predictive Accuracy
Different systems often have vastly different naming conventions or quality of metadata available. This makes stitching disparate data sources together to create a set large enough to train an AI model costly and time-consuming.
To tackle this problem, Montefiore Health System in New York has developed the Frame Data Language, powered by Intel® Xeon® Scalable processors. The technology is used in PALM (Patient-centered Analytical Learning Machine), a model that predicts respiratory failure in patients, to translate healthcare databases into one language.
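To make the general idea of translating differently named sources into one vocabulary concrete, here is a minimal sketch. It is not the Frame Data Language or any part of PALM. The source names, field names, and the harmonize and merge_sources helpers are hypothetical and exist only for illustration.

```python
# Hypothetical mapping from each source system's field names to shared names.
FIELD_ALIASES = {
    "hospital_a": {"pt_id": "patient_id", "resp_rate": "respiratory_rate"},
    "hospital_b": {"PatientID": "patient_id", "RR": "respiratory_rate"},
}


def harmonize(record, source):
    """Rename a record's fields to the shared schema for the given source."""
    aliases = FIELD_ALIASES[source]
    # Fields without an alias keep their original name so nothing is silently dropped.
    return {aliases.get(name, name): value for name, value in record.items()}


def merge_sources(batches):
    """Combine (source, records) pairs into one list of harmonized records."""
    merged = []
    for source, records in batches:
        for record in records:
            row = harmonize(record, source)
            row["source"] = source  # keep provenance for later auditing
            merged.append(row)
    return merged


combined = merge_sources([
    ("hospital_a", [{"pt_id": "A-1", "resp_rate": 18}]),
    ("hospital_b", [{"PatientID": "B-7", "RR": 24}]),
])
print(combined)
```

Keeping the mapping in a plain table, separate from the merging logic, means a new source can be added by extending the table instead of rewriting code.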
Intel® Xeon® Scalable processors: your data foundation
From data ingestion and preparation to model tuning, Intel® Xeon® Scalable processors act as a flexible platform for all the analytics and AI requirements in the enterprise data center.
Able to handle everything from scale-up applications with the largest in-memory requirements to the most massive data sets distributed across a myriad of clustered systems, they serve as an agile foundation for organizations ready to begin their AI journeys.