WebExtractor360

WebExtractor360 is a free and open source web data extractor. It allows you to extract Images, Phrases, HTML Headers, HTML Tables, URLs (Links), URLs (Keywords), Emails, Phone, Fax and ANY other information on the web by specifying a Regular Expression.

Web Crawler

The web extractor starts by crawling the specified web URL or local file resource. All data that matches the Match (Regular Expression) field is returned as a result. Once matching completes for the specified URL, the crawler goes on to process the other URLs that the specified URL links to. This is shown in the diagram below. The entire process repeats until the Maximum URL limit has been reached or there are no more URLs to process.
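The crawl loop described above can be sketched in a few lines of Python. This is only an illustration of the idea, not WebExtractor360's actual code: the `crawl` function, its `fetch` callback, the `max_urls` parameter and the in-memory `PAGES` dictionary are all hypothetical names chosen for this example, and the link pattern is a simplified stand-in.

```python
import re
from collections import deque


def crawl(start_url, fetch, match_pattern, max_urls=10):
    """Breadth-first crawl: match each page, then queue the pages it links to."""
    # Simplified link finder (an assumption, not the tool's own URL pattern).
    link_re = re.compile(r'href\s*=\s*["\']?([^"\'\s>]+)')
    match_re = re.compile(match_pattern)
    queue = deque([start_url])
    seen = {start_url}
    results = []
    processed = 0
    # Stop when the URL limit is reached or there is nothing left to process.
    while queue and processed < max_urls:
        url = queue.popleft()
        html = fetch(url)
        processed += 1
        if html is None:
            continue
        # Collect everything on this page that matches the pattern.
        results.extend(m.group(0) for m in match_re.finditer(html))
        # Queue links that have not been seen yet.
        for link in link_re.findall(html):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return results


# Hypothetical in-memory "site" standing in for real HTTP or file access.
PAGES = {
    "a.html": '<a href="b.html">B</a> <a href="c.html">C</a> id-1',
    "b.html": '<a href="a.html">A</a> id-2',
    "c.html": "id-3",
}

found = crawl("a.html", PAGES.get, r"id-\d+", max_urls=2)
print(found)  # ['id-1', 'id-2']
```

With `max_urls=2` the crawl stops after the start page and the first linked page, so `c.html` is never visited. Raising the limit lets the loop run until the queue is empty.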

Regular Expressions

WebExtractor360 extracts information from the web using Regular Expressions. A regular expression is a text string that describes a search pattern; it can be thought of as a special kind of wildcard. WebExtractor360 provides many commonly used Regular Expressions for extracting data on the web.

    URLs - "(?:href\s*=)(?:[\s""']*)(?!#|mailto|location.|javascript)(?<PARAM1>.*?)(?:[\s>""'])"
    Images - "(?:src\s*=)(?:[\s""']*)(?<PARAM1>.*?\.(jpg|png|gif|emf|bmp|wmf))(?:[\s>""'])"
    Phrases - "(?<PARAM1>(.*keyword.*)?)"
    URL Title - "(?:href\s*=)(?:[\s""']*)(?!#|mailto|location.|javascript)(?:.*)(?:[>])(?<PARAM1>.*?)(?:""'])"
    HTML Tables - "(?:<table>)(?<PARAM1>(.*\r\n)*?)(?:</table>)"
    HTML Headers - "(?:<head>)(?<PARAM1>(.*\r\n)*?)(?:</head>)"
    Phone - "(?:(phone|tel|telephone)[\S\s]*?)(?<PARAM1>[\+\(\{\[\'\d][\'\(\)\{\}\[\]\.\+\*\-\'\d ]{4,22})"
    Fax - "(?:fax[\S\s]*?)(?<PARAM1>[\+\(\{\[\'\d][\'\(\)\{\}\[\]\.\+\*\-\'\d ]{4,22})"
    Emails - "(?:mailto\s*:)(?<PARAM1>.*?)(?:[\s>""'])"
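To try patterns like these outside the tool, they can be ported to another regex engine. The sketch below uses Python and rests on two assumptions that the text above does not state: that the doubled quote `""` in the listed patterns is an escaped single `"` character, and that the `(?<PARAM1>...)` named group corresponds to Python's `(?P<PARAM1>...)` syntax. The sample HTML and the variable names are made up for illustration.

```python
import re

# Python ports of the Images and Emails patterns listed above.
# Assumption: "" in the original stands for one literal double quote.
IMAGES = re.compile(
    r"""(?:src\s*=)(?:[\s"']*)(?P<PARAM1>.*?\.(jpg|png|gif|emf|bmp|wmf))(?:[\s>"'])"""
)
EMAILS = re.compile(r"""(?:mailto\s*:)(?P<PARAM1>.*?)(?:[\s>"'])""")

# Made-up sample page.
html = (
    '<img src="logo.png"> '
    '<a href="mailto:sales@example.com">Mail us</a> '
    "<img src='photos/team.jpg'>"
)

# Only the PARAM1 part of each match is kept as the result.
images = [m.group("PARAM1") for m in IMAGES.finditer(html)]
emails = [m.group("PARAM1") for m in EMAILS.finditer(html)]

print(images)  # ['logo.png', 'photos/team.jpg']
print(emails)  # ['sales@example.com']
```

Note that the surrounding `src=`, `mailto:` and quote characters are matched but not returned, because they sit in non-capturing `(?:...)` groups outside PARAM1.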

The regular expressions are provided for convenience. You can also overwrite them to match any data that you find on the web. PARAM1 is a variable/placeholder that stores the part of the matching data that you would like returned as results. In other words, if we have a phrase in HTML Bold as follows:

    <b>Test Phrase</b>

We can use PARAM1 to store "Test Phrase" instead of the entire "<b>Test Phrase</b>". This allows us to extract more meaningful data without any further processing. WebExtractor360 supports up to 10 variables/placeholders:

PARAM1
PARAM2
PARAM3
PARAM4
PARAM5
PARAM6
PARAM7
PARAM8
PARAM9
PARAM10

This allows you to use multiple variables in a regular expression and get multiple pieces of matching data as results.
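As a rough illustration of how the placeholders behave, the Python sketch below captures the bold phrase with PARAM1, then uses PARAM1 and PARAM2 together in one expression. Both patterns are hypothetical examples written for this sketch, not patterns shipped with WebExtractor360, and Python's `(?P<name>...)` syntax is assumed to be equivalent to the `(?<name>...)` form shown earlier.

```python
import re

# One placeholder: keep the text inside <b>...</b>, not the tags themselves.
bold = re.compile(r"<b>(?P<PARAM1>.*?)</b>")
page = "<p><b>Test Phrase</b> and <b>Another One</b></p>"
phrases = [m.group("PARAM1") for m in bold.finditer(page)]
print(phrases)  # ['Test Phrase', 'Another One']

# Two placeholders: PARAM1 holds the link target, PARAM2 the link text.
link = re.compile(r'<a\s+href="(?P<PARAM1>[^"]*)"[^>]*>(?P<PARAM2>.*?)</a>')
match = link.search('<a href="/docs">Documentation</a>')
fields = match.groupdict()
print(fields)  # {'PARAM1': '/docs', 'PARAM2': 'Documentation'}
```

Each match then yields one value per placeholder, which is what makes it possible to return several related pieces of data from a single expression.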

Search Options

Report "Links Found" during processing
Report all the interim hyperlinks found during the crawl.

Ignore Valid Hyperlinks Criterion
WebExtractor360 has a set of definitions of what a valid hyperlink is during a crawl. If this option is checked, the definitions will be ignored and all hyperlinks will be treated as valid.

Allow MAD CRAWL
By default, WebExtractor360 will stay within the website specified in the URL during a crawl. This option allows the extractor to 'wander off' to external sites.
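One way to picture this option is as a scope check applied to every link before it is queued. The Python sketch below assumes that "within the website" means "same host name as the starting URL"; the rule WebExtractor360 actually applies is not documented here, and `in_scope` and `allow_mad_crawl` are hypothetical names.

```python
from urllib.parse import urljoin, urlparse


def in_scope(start_url, link, allow_mad_crawl=False):
    """Return True if the crawler may follow `link` from `start_url`."""
    if allow_mad_crawl:
        # MAD CRAWL: every link is followed, including external sites.
        return True
    # Resolve relative links against the starting URL before comparing hosts.
    target = urljoin(start_url, link)
    return urlparse(target).netloc.lower() == urlparse(start_url).netloc.lower()


start = "https://example.com/index.html"
print(in_scope(start, "/about.html"))                  # True
print(in_scope(start, "https://other.example.org/"))   # False
print(in_scope(start, "https://other.example.org/", allow_mad_crawl=True))  # True
```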

Include Page URL in Results
Include the URL of the page where each match was found as part of the results.

Copyright(c) 2009-2021 ConnectCode Pte Ltd. All Rights Reserved.