Typical unstructured data sources include web pages, emails, documents, PDFs, social media, scanned text, mainframe reports, spool files, and multimedia files. Extracting data from these sources has grown into a considerable technical challenge. Whereas data extraction historically had to deal with changes in physical hardware formats, most current data extraction deals with unstructured data sources and with differing software formats. Extracting data from the web in particular is referred to as "Web data extraction" or "Web scraping".
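As a rough illustration of web data extraction, the sketch below turns a small HTML fragment into a list of structured records using only Python's standard `html.parser` module. The class name `LinkExtractor`, the helper `extract_links`, and the sample page are hypothetical and chosen for illustration; real scraping tools typically add fetching, error handling, and more robust parsing.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect one record per <a href=...> element (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished records, in document order
        self._href = None    # href of the anchor currently open, if any
        self._text = []      # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an anchor with an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = "".join(self._text).strip()
            self.links.append({"href": self._href, "text": text})
            self._href = None


def extract_links(html):
    """Return a list of {"href", "text"} dicts found in the HTML string."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


# Made-up sample page standing in for an unstructured web source.
SAMPLE_PAGE = """
<html><body>
  <h1>Reports</h1>
  <p>See the <a href="/q1.pdf">first quarter</a> and
     <a href="/q2.pdf">second <b>quarter</b></a> reports.</p>
</body></html>
"""

records = extract_links(SAMPLE_PAGE)
for record in records:
    print(record["href"], "->", record["text"])
```

The free-form page is reduced to uniform records with named fields, which can then be loaded into a table or database like any other structured data.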
The act of adding structure to unstructured data takes a number of forms.