Tools are needed to manage all available information, including the Web, subscription services, and internal data stores. Without an extraction tool (a product specifically designed to find, organize, and output the data you want), your options for getting information are poor:
Use search engines Search engines help find some Web information, but they do not pinpoint information, cannot fill out the Web forms they encounter to get you the information you need, are perpetually behind in indexing content, and at best can only go two or three levels deep into a Web site. Nor can they search file directories on your network.
Manually surf the Web and file directories Aside from being labor intensive, this work is tedious, costly, error prone, and very time consuming. Humans have to read the content of each page to see if it matches their criteria, whereas a computer simply matches patterns, which is far faster.
Create custom programming Custom programming is costly, can be buggy, requires maintenance, and takes time to develop. Plus the programs must be constantly updated as the location of information frequently changes.
Inefficient methods mean the information analyst spends time finding, collecting, and aggregating data instead of analyzing data and gaining a competitive edge. This also affects the application programmer, who has to spend time developing extraction tools instead of developing tools for the core business.
New solutions improve productivity
Extraction tools using a concise notation to define precise navigation and extraction rules greatly reduce the time spent on systematic collection efforts. Tools that support a variety of format options provide a single development platform for all collection needs regardless of electronic information source.
Early software tools for “Web harvesting” and unstructured data mining emerged and began to attract the attention of information professionals. These products did a reasonable job of finding and extracting Web information for intelligence gathering purposes. But this was not enough. Organizations needed to reach the “deep Web” and other electronic information sources, capabilities beyond simplistic Web content clipping.
A new generation of information extraction tools is markedly improving productivity for information analysts and application developers.
Uses for extraction tools
The most popular applications for information extraction tools remain competitive intelligence gathering and market research, but there are some new applications emerging as organizations learn how to better use the functionality in the new generation of tools.
Deep Web price gathering The explosion of e-tailing, e-business, and e-government makes a plethora of competitive pricing information available on Web sites and government information portals. Unfortunately, price lists are difficult to extract without selecting product categories or filling out Web forms. Also, some prices are buried deep in .pdf documents. Automated forms completion and automated downloading are necessary features to retrieve prices from the deep Web.
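At its core, automated forms completion means submitting the same POST request a browser would send when a user fills in a form. The sketch below shows this with Python's standard library; the URL and field names ("category", "max_results") are illustrative assumptions, since real field names must be discovered by inspecting the target page's HTML.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical form fields for a product-pricing page; an extraction
# tool would discover the real names from the page's <form> markup.
form_fields = {
    "category": "laptops",
    "max_results": "50",
}

# Encode the fields exactly as a browser would for a POST submission.
body = urlencode(form_fields).encode("ascii")

request = Request(
    "https://example.com/prices",   # placeholder URL
    data=body,
    headers={"Content-Type": "application/x-www-form-urlencoded"},
)

print(request.get_method())  # POST, because data= is set
```

Sending the request (with `urllib.request.urlopen`) and parsing the returned price list are the steps an extraction tool automates at scale.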
Primary research Message boards, e-pinion sites, and other Web forums provide a wealth of public opinion and user experience information on consumer products, air travel, test drives, experimental drugs, etc. While much of this information can be found with a search engine, features like simultaneous board crawling, selective content extraction, task scheduling, and custom output reformatting are only available with extraction tools.
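Selective content extraction means pulling only the elements that matter (the posts) while ignoring navigation, ads, and boilerplate. A minimal sketch using Python's standard-library HTML parser follows; the `class="post"` marker is an assumption, since every message board uses different markup.

```python
from html.parser import HTMLParser

class PostExtractor(HTMLParser):
    """Collect only the text of <div class="post"> elements."""

    def __init__(self):
        super().__init__()
        self.in_post = False
        self.posts = []

    def handle_starttag(self, tag, attrs):
        # Open a new post when the target element starts.
        if tag == "div" and ("class", "post") in attrs:
            self.in_post = True
            self.posts.append("")

    def handle_endtag(self, tag):
        if tag == "div" and self.in_post:
            self.in_post = False

    def handle_data(self, data):
        # Keep text only while inside a post; skip nav, ads, etc.
        if self.in_post:
            self.posts[-1] += data.strip()

sample = ('<div class="nav">Home</div>'
          '<div class="post">Great battery life.</div>'
          '<div class="post">Screen is too dim.</div>')
parser = PostExtractor()
parser.feed(sample)
print(parser.posts)  # ['Great battery life.', 'Screen is too dim.']
```

The "nav" element is discarded entirely; only the opinion content survives, which is exactly the filtering a search engine cannot do for you.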
Content aggregation for information portals Content is exploding and available from Web and non-Web sources. Extraction tools can crawl the Web, internal information sources, and subscription services to automatically populate portals with pertinent content such as competitive information, news, and financial data.
Supporting CRM systems The Web is a valuable source of external data to selectively populate a data warehouse or a CRM database. To date, most organizations focus on aggregating internal data for their data warehouses and CRM systems. Now, however, some organizations are realizing the value of adding external data as well. In the book Web Farming for the Data Warehouse from Morgan Kaufmann Publishers, Dr. Richard Hackathorn writes, “It is the synergism of external market information with internal customer data that creates the greatest business benefit.”
Scientific research Scientific information on a given topic (such as a gene sequence) is available on multiple Web sites and subscription services. An effective extraction tool can automate the location and extraction of this information and aggregate it into a single presentation format or portal. This saves scientific researchers countless hours of searching, reading, copying, and pasting.
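The aggregation step amounts to merging partial records about the same entity (here, a gene) from multiple sources into one presentation record. A toy sketch, with two hypothetical sources keyed by gene symbol:

```python
# Records about the same gene arriving from two hypothetical sources.
source_a = [{"gene": "BRCA1", "chromosome": "17"}]
source_b = [{"gene": "BRCA1", "function": "DNA repair"}]

# Merge all records sharing a gene symbol into one combined record.
merged = {}
for record in source_a + source_b:
    merged.setdefault(record["gene"], {}).update(record)

print(merged["BRCA1"])
# {'gene': 'BRCA1', 'chromosome': '17', 'function': 'DNA repair'}
```

A production tool would add source attribution and conflict handling, but the principle is the same: one record per entity, assembled from many sites.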
Business activity monitoring Extraction tools can continuously monitor dynamically changing information sources to provide real time alerts and to populate information portals and dashboards.
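One common way such monitoring works is to hash each fetched snapshot of a source and raise an alert when the digest changes. The sketch below shows the idea with Python's standard library; the fetch itself is omitted, and the list of snapshot strings stands in for successive page downloads.

```python
import hashlib

def digest(content: str) -> str:
    """Fingerprint a snapshot of a monitored page."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()

# Stand-ins for three successive downloads of the same page.
snapshots = ["Price: $100", "Price: $100", "Price: $95"]

last = None
alerts = []
for snap in snapshots:
    d = digest(snap)
    if last is not None and d != last:
        alerts.append(snap)   # in a real tool: notify or update a dashboard
    last = d

print(alerts)  # ['Price: $95'] -- only the change triggers an alert
```

The unchanged second snapshot produces no alert, which is what keeps continuous monitoring cheap enough to run in near real time.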