The Web Extractor offers a wealth of features to handle the toughest requirements and the most complex websites. Below is a sampling of the more interesting features that address these issues and make the Web Extractor not only powerful but also easy to use.
Logins, Forms & Search Fields
Performing a login or a search, or submitting a form, involves either a GET or a POST request. Both are mimicked through the “FORM / Build URL” box in the web extractor. Once the details required for the GET or POST are entered, the extractor can log in, traverse the specified website, and fill in forms or perform searches.
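Conceptually, the GET and POST mechanics described above come down to URL query strings and encoded form bodies. The sketch below illustrates this with Python's standard library; the URL, field names, and credentials are hypothetical examples, not part of the extractor itself.

```python
from urllib.parse import urlencode

def build_get_url(base_url, params):
    """A search is typically a GET with the query appended to the URL."""
    return f"{base_url}?{urlencode(params)}"

def build_post_body(params):
    """A login typically POSTs an application/x-www-form-urlencoded body."""
    return urlencode(params).encode("ascii")

# Hypothetical search as a GET request:
search_url = build_get_url("https://example.com/search", {"q": "news"})

# Hypothetical login as a POST request body:
login_body = build_post_body({"username": "alice", "password": "secret"})
```

The “FORM / Build URL” box gathers exactly this kind of information (target URL, parameter names, and values) so the extractor can replay the request on the user's behalf.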
Lists, such as news pages detailing multiple articles on the same page, pose no problem for the 30 Digits extractor. The user can define a “List Area”, which enables the extractor to treat each of these list areas as if it were an individual page.
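The idea behind a “List Area” can be sketched as splitting one page into repeated blocks and processing each block independently. The HTML snippet and the `item` class below are invented for illustration.

```python
import re

# Hypothetical news-list page: each <div class="item"> is one "list area".
page = """
<div class="item"><h2>Story A</h2><p>First article</p></div>
<div class="item"><h2>Story B</h2><p>Second article</p></div>
"""

def split_list_areas(html, pattern=r'<div class="item">(.*?)</div>'):
    """Treat each repeated block as if it were an individual page."""
    return re.findall(pattern, html, flags=re.DOTALL)

areas = split_list_areas(page)
# Each area can now be run through the same document template.
```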
Using the power of the “FORM / Build URL” box, the extractor is able to simulate JavaScript functions. It can submit parameters, either found on the page or static, to JavaScript functions to get the desired response from the page. This functionality also allows users to specify the content type and the raw data to be submitted. The whole process is made easier with the help of the “Link Tester”, which shows the user not only what link is generated but also the parameters and the response from the page.
Normalization / Transformation
Manipulating parsed data is made possible with functions like “must/must not contain”, “min/max length”, replace, merge, and scripts. With the help of regular expressions and JavaScript, these functions give the user greater control over their data.
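A minimal sketch of how such normalization rules might combine, assuming a pipeline where a failed rule drops the value: the function names and rule parameters here are illustrative, not the extractor's actual API.

```python
import re

def normalize(value, must_contain=None, min_len=0, max_len=None, replace=None):
    """Apply must-contain, regex-replace, and length rules; None means rejected."""
    if must_contain and must_contain not in value:
        return None                       # "must contain" rule failed
    if replace:
        pattern, repl = replace
        value = re.sub(pattern, repl, value)  # regex-based "replace" rule
    if len(value) < min_len or (max_len is not None and len(value) > max_len):
        return None                       # "min/max length" rule failed
    return value

# e.g. keep only values containing a price and strip the currency symbol:
price = normalize("$199", must_contain="$", replace=(r"\$", ""))
```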
The extractor’s de-duplication process prevents the same data from being delivered twice or duplicates from being fed into an index. This process assigns a unique ID to every document extracted. The user can specify the criteria used to create this ID, e.g. content, date, or document reference number fields extracted from the page.
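One common way to derive such an ID, sketched below under the assumption that it is a hash over the user-chosen fields (the source does not specify the actual algorithm): two documents with the same chosen fields produce the same ID and can be recognized as duplicates.

```python
import hashlib

def document_id(fields):
    """Derive a stable ID from the user-chosen fields (e.g. content, date,
    or a document reference number). Sorting keys keeps the ID order-independent."""
    joined = "\x1f".join(f"{k}={fields[k]}" for k in sorted(fields))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

seen = set()

def is_duplicate(fields):
    """Return True if a document with the same ID was already extracted."""
    doc_id = document_id(fields)
    if doc_id in seen:
        return True
    seen.add(doc_id)
    return False
```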
Information on the internet is always changing, and existing documents online are constantly updated. Through the use of various unique checks, the Spider is able to identify that a previously extracted document has changed and extracts it again as a new document, giving the user the latest, most up-to-date content.
Business Rules – data enhancement
Every business and website is different; some jobs require certain conditions to be met before data should be extracted. These business rules / requirements can be set up in the extractor. If you are only interested in products on a website that fall within a certain price range (e.g. $100–$200), you can set up a business rule specifying that only products within this range are extracted.
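The price-range example above amounts to a simple predicate applied to each extracted record. A sketch, using the $100–$200 range from the text and invented product data:

```python
def within_price_range(product, low=100.0, high=200.0):
    """Business rule: keep only products priced between low and high, inclusive."""
    return low <= product["price"] <= high

# Hypothetical extracted records:
products = [
    {"name": "Widget A", "price": 150.0},
    {"name": "Widget B", "price": 250.0},
]

# Only records that satisfy the business rule are extracted:
extracted = [p for p in products if within_price_range(p)]
```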
The “Job Wizard” helps first-time users set up a basic job. It assists users in configuring a job to suit their system and walks them through the details of crawling the selected website and extracting data from the page. The Wizard is very intuitive; it even suggests regular expressions to extract the correct content.
Enterprise Security – LDAP / AD
When using the web spider in an infrastructure that supports Windows Active Directory, access to the web extractor can be controlled through security groups. Simply add users to the Windows user group assigned to the extractor, and the security protocols used in your Windows Active Directory will also be applied to the spider.
Whether for security, running scripts remotely, or building web extractor controls into your own system (GUI), there are several web extractor APIs available to suit your needs, e.g. start, stop, status, and many more.
Whether you have 10 jobs or 100, you can get a schedule overview, including start time and frequency, all on one screen. This is helpful when multiple schedules need to be run and managed.
Configuring the schedule itself is done inside each job. Setting the start time and frequency only needs to be done for the first job in the desired schedule; the remaining jobs just need to be told to start after another job has finished. This is done through a “starts after” option that lists all the jobs created. For ease of use, a “next in list” option was added so that jobs can start one after another seamlessly.
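The “starts after” chaining described above can be sketched as resolving each job's predecessor into a sequential run order. The job names below are invented, and this is a conceptual model, not the scheduler's actual implementation (it assumes the chain contains no cycles).

```python
# Each job names the job it "starts after"; None means it has its own start time.
jobs = {
    "crawl-news":   None,            # first in the chain, scheduled directly
    "crawl-prices": "crawl-news",    # starts after crawl-news finishes
    "build-index":  "crawl-prices",  # "next in list": follows the previous job
}

def run_order(jobs):
    """Resolve the 'starts after' chain into the order the jobs will run."""
    order, remaining = [], dict(jobs)
    while remaining:
        for name, after in list(remaining.items()):
            if after is None or after in order:  # predecessor already ran
                order.append(name)
                del remaining[name]
    return order
```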
A useful feature for all users, from beginner to expert, is the detailed logging provided by the spider. Data from each run is logged in the history and can be viewed through the user interface. The history breaks down the job to show everything from how many links were found and documents extracted to whether those documents were processed or failed to meet the business rules set. This feature gives the user in-depth knowledge of what happened during the run.
Sampling / Testing
Built into the extractor are multiple testing options, giving the user the ability to test their work. Link rules, FORM / Build URLs, and document templates all have testing functions to ensure the final job runs smoothly.
Multiplatform (all Java)
The web spider is 100% Java, making it multiplatform software. It can be installed on everything from Linux to Windows, as long as the machine has a Java Virtual Machine installed.
Through the use of plugins, users can tag extracted documents with specific terms or keywords, either from a list or a database.
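A tagging plugin of this kind can be sketched as matching a keyword list against each document's content. The keyword list and document below are invented examples; a real deployment might load the list from a database instead.

```python
# Hypothetical keyword list; could equally be loaded from a database.
KEYWORDS = ["merger", "acquisition", "ipo"]

def tag_document(doc):
    """Plugin-style step: attach matching keywords as tags to a document."""
    text = doc["content"].lower()
    doc["tags"] = [kw for kw in KEYWORDS if kw in text]
    return doc

tagged = tag_document({"content": "The IPO follows a merger announcement."})
```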