Datacol – scraper for collecting information from sites

    Do you often run into the problem of needing to quickly download a list of products from an online store or collect information from a website, saving specific data such as the name, price, picture, and product link? Imagine how much time it takes to do all of this manually! If this sounds familiar, this review is for you: it covers Datacol, a parser for collecting information from websites. With it, you can learn how to parse a site properly and automate data extraction without resorting to the help of specialists.

    The program is very useful and has powerful features, but it is quite difficult to understand without seeing the parser work on a real example. I hope this review solves that problem. The support forum, video tutorials, and online help will also help you get comfortable with the program.

    What is a site parser and what is parsing?

    If you have ever wondered what a site parser is: it is a program that extracts specific data from websites. Parsing, in turn, is the process of obtaining information from any open web resource.
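
    To make the idea concrete, here is a minimal sketch of what a parser does under the hood, assuming a hypothetical store URL and hypothetical page markup (Datacol wraps the same idea in a GUI):

        import requests
        from lxml import html

        # Hypothetical category page; substitute a real URL.
        URL = "https://example-store.com/catalog/sofas"

        # Fetch the page and build an HTML tree from it.
        response = requests.get(URL, timeout=30)
        tree = html.fromstring(response.content)

        # Pull every product name from the page. The XPath is an assumption
        # about the markup, not taken from a real site.
        names = tree.xpath('//div[@class="product"]/h3/text()')
        print(names)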

    Now let’s move on to the Datacol parser itself. You can download the demo version from the official site. Its main difference from the full-featured version is the amount of data collected: the demo returns up to 25 results, while the full version has no restrictions. The demo also lacks:

    • access to the closed section of the forum;
    • paid consultations on the use;
    • ordering paid settings;
    • ordering paid plugins.

    So, we launch the program for parsing sites and let’s go!

    Before moving on to configuring the parser, I’ll note that the developers have made setup much easier by creating the “Auto-Tuning” tool.

    Launching the program

    After launch, a window appears with a menu bar at the top and three blocks below:

    • a bonus in the form of a list of campaigns already configured to perform certain tasks;
    • FAQ;
    • work statistics and project news.

    You can examine the settings of the existing campaigns yourself. Now let’s see how the site parser works on a specific example. In this case, I worked with the Chance online furniture store.

    We create a new campaign. Click the “Add Campaign” button and the Add Campaign Wizard appears. Enter a name (call it whatever you like; mine is marketshans) and click “Next”.

    Adding a campaign

    Next, you need to enter the input data – the URL from which the Datacol parser will start its work. In the “Input data” field, specify the pages from which you want to collect information. For my example, I took one of the categories of the online store and put a link to it in this field.

    Input data

    If you forget to specify all the pages you want to parse, these settings can be edited later.

    Next, you need to configure “Collecting Links”. To do this, we will use the “Picker” tool, located to the right of the Xpath input field.

    Collecting links

    It works very simply: open the Picker, and the page we specified in the input data should load automatically (as in a browser).

    Now we need to get the Xpath expression with which the program will collect product links. To do this, left-click one of the products and the Xpath is generated automatically. You can see it in the bottom line of the window, in the “Xpath Selection” field.

    Xpath selection

    If the expression worked correctly, then on the right in the “Links” block you should see the result of the work.
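
    For reference, here is roughly what such an expression does outside Datacol; a sketch in Python, where both the saved page and the XPath are hypothetical:

        from lxml import html

        # A category page saved to disk; any fetched HTML works the same way.
        tree = html.fromstring(open("category_page.html", encoding="utf-8").read())

        # @href returns the link targets themselves, one per product card.
        # The class name is an assumed example, not the store's real markup.
        links = tree.xpath('//div[@class="product-item"]/a/@href')
        for link in links:
            print(link)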

    But I must warn you: in some cases the Xpath does not work correctly and has to be refined, which is a problem if you have never dealt with the Xpath query language and don’t know what it is.

    There are 2 options to solve this problem:

    • You will still have to work out Xpath on your own (you can google it).
    • You can also use regular expressions to extract the links (“Help” will help you figure this out; see the sketch right after this list).
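
    Here is a minimal sketch of the second option, assuming the product URLs contain a “/product/” segment (the pattern has to be adapted to the real markup):

        import re

        page_source = open("category_page.html", encoding="utf-8").read()

        # Grab every href whose target looks like a product page. The
        # "/product/" segment is an assumption about the URL structure.
        links = re.findall(r'href="([^"]*/product/[^"]*)"', page_source)
        print(links)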

    I got 9 links, and that is correct, because there are 9 products on the page. Copy the resulting Xpath expression and paste it into the “Xpath for collecting links” field.

    Xpath for collecting links

    But we need the program to collect products from all categories, so you need to write another Xpath expression to collect links from the subsequent pages. The principle is the same: open the Picker again, left-click the next page (one is enough, since the rule works the same for all pages) and get the Xpath.

    The resulting Xpath expression

    The resulting expression is also copied and pasted into a new line of the “Xpath for collecting links” field.

    The “Xpath for collecting links” field
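
    To illustrate how the two rules cooperate, here is a sketch with two hypothetical expressions: Datacol follows the pagination links for navigation and keeps the product links for data collection:

        from lxml import html

        tree = html.fromstring(open("category_page.html", encoding="utf-8").read())

        # Rule 1: product links on the current page (assumed markup).
        product_links = tree.xpath('//div[@class="product-item"]/a/@href')

        # Rule 2: links to the other pages of the category (assumed markup).
        next_pages = tree.xpath('//div[@class="pagination"]/a/@href')

        # A hand-rolled crawler would queue next_pages and revisit each one,
        # collecting product_links from every page it lands on.
        print(product_links, next_pages)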

    The next step is to set up the fields for collecting information. Click the “Add Data Field” button, enter, for example, “name”, and click “Save”.

    For each field, you also need to set up an expression, as we did before. Only now we load a page with a specific product from the category into the Picker and collect the expressions there. Click on the name, get the expression, and copy it. Then return to the settings, select the “Xpath Cut” item, and paste the expression into the “String Collection Editor”.
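
    In code terms, an “Xpath cut” for a single field looks something like this sketch, where the h1-based expression is an assumption about the product page’s markup:

        from lxml import html

        # A single product page, saved to disk for the example.
        tree = html.fromstring(open("product_page.html", encoding="utf-8").read())

        # Cut out the "name" field; many product pages keep it in the heading.
        name = tree.xpath('//h1/text()')
        print(name[0].strip() if name else "empty - the XPath needs refining")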

    Please note that at the bottom of the window there is a “Save a link to the page” checkbox. If it is checked, a “URL” field will be added to the fields we configured.

    String Collection Editor

    But at this stage it is impossible to check whether the selected expression works correctly. So click “Next” and complete the campaign setup; after that, the campaign will appear in the campaign tree.

    We haven’t set everything up yet, though, so there is no point in launching it. Right-click the created campaign and open “Settings”. The following window appears.

    Settings

    Don’t be put off: at first glance it looks intimidating, but in fact there is nothing complicated about it. For most tasks, only 3 tabs are used: navigation, data collection, and export. We have in fact already configured the “Navigation” tab; it remains to configure “Data collection” and “Export”.

    Let’s check how the “Name” data field we added works. Go to the “Data collection” → “Data fields” tab. Note that the fields have already been created: “Name” and “URL”.

    The “URL” field was generated automatically (unless you unchecked “Save a link to the page”). You don’t need to write an Xpath for it; just make sure the “Special Values” tab is set to “URL”.

    Next, select “Name” and check that the “Xpath cut” field is not empty. If it turns out to be empty, configure the Xpath again using the Picker. Once that is done, use the “Test Data Acquisition” tool at the bottom of the window: enter the link we set up (a link to a specific product) and click the “Test” button (later, when checking the other fields, you can use the Ctrl + T shortcut).

    If everything works correctly, then you should see the name and link of the product.

    Product name and link

    Hooray! Everything is working!
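
    As a point of comparison, a rough stand-in for the “Test Data Acquisition” step could look like the sketch below: fetch one product URL and run every field’s XPath against it, so empty fields stand out immediately (the URL and expressions are hypothetical):

        import requests
        from lxml import html

        # Field name -> XPath, mirroring the fields configured in the campaign.
        FIELDS = {
            "name": "//h1/text()",
            "price": '//span[@class="price"]/text()',
        }

        def test_fields(product_url: str) -> None:
            tree = html.fromstring(requests.get(product_url, timeout=30).content)
            for field, xpath in FIELDS.items():
                values = tree.xpath(xpath)
                print(field, "->", values[0].strip() if values else "EMPTY")

        test_fields("https://example-store.com/product/123")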

    If you are 100% sure that the expression works correctly and don’t want to test it, you need to know one nuance: in this case you must click the “Apply” button at the very top of the window, otherwise your settings will not be saved.

    All the other required fields are created in the same way. I’ll describe just one more, which is set up a little differently.

    This is “picture”. A few points here require explanation. First, everything is done just like in the previous settings: add the “picture” field (or “image”, whichever you prefer) and select the Xpath expression. Your next step depends on what you want to do with the resulting images:

    • upload to local disk;
    • save the virtual path;
    • upload files to …

    I need to save virtual paths, so I go to the “Upload files” tab, check the “Upload files” box, specify the virtual path (to the folder where the pictures are located), and set the “Return virtual paths” marker.
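
    My working assumption is that a “virtual path” here means the image URL rewritten into a site-relative path (handy for a later CMS import); under that assumption, the transformation looks like this sketch, with a hypothetical target folder:

        import os
        from urllib.parse import urlparse

        VIRTUAL_ROOT = "/images/products"  # assumed target folder

        def to_virtual_path(image_url: str) -> str:
            # Keep only the file name and re-root it under the virtual folder.
            filename = os.path.basename(urlparse(image_url).path)
            return f"{VIRTUAL_ROOT}/{filename}"

        print(to_virtual_path("https://example-store.com/media/img/sofa-1.jpg"))
        # -> /images/products/sofa-1.jpg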

    Return virtual paths

    If everything is configured correctly, the test will be successful.

    Testing

    If the “picture” field is empty after testing, you need to figure out what went wrong.

    I created several campaigns in Datacol, and in one of them I ran into a problem: each image sits in its own folder, which means a single virtual path cannot be specified for all images. I am still studying this problem, but I am sure there is a solution.

    If everything works, then the last, but no less important, step remains: configuring the file export. Go to the “Export” tab. As you can see, there is a choice of export formats. Choose the file type you need and set the path for saving the file in the “Export formats” tab.
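
    For a sense of what the exported file contains, here is a sketch that writes collected rows to CSV, one of the offered formats; the field names mirror the ones configured earlier, and the row is sample data:

        import csv

        # One collected record per product; the values here are made up.
        rows = [
            {"name": "Sofa Verona",
             "url": "https://example-store.com/product/123",
             "picture": "/images/products/sofa-1.jpg"},
        ]

        with open("marketshans.csv", "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=["name", "url", "picture"])
            writer.writeheader()
            writer.writerows(rows)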

    Now feel free to click “Save and Exit”. The best part remains: click “Start” and watch the parsing process begin. It may take some time, depending on the number of pages processed. Parsing results are shown at the bottom of the window. The campaign cannot be edited while it is running; if you want to change something, click “Stop” first and only then change the settings.

    The file is exported automatically, so as soon as parsing completes you can open the folder where you saved the file and see the result of your work with Datacol.

    Solving other tasks using the Datacol website parser

    With the help of this website scraping program you can solve other tasks as well. For example, you can parse:

    • ads;
    • prices and products of the online store;
    • information from the forums;
    • SEO parameters of sites;
    • search engine results (SERPs);
    • a site’s position in search results for specific queries;
    • email addresses;
    • essentially any content on websites;
    • site parsing with export to WordPress and much more.

    The official website currently offers discounts of up to 55% on licenses. For professionals who often face tasks like parsing products from arbitrary sites, I would advise considering the purchase of a Datacol license.

    Advantages and disadvantages of the Datacol website parser

    Advantages:

    • solving a large number of tasks;
    • relatively low price (considering that analogues cost an order of magnitude more and handle only one task);
    • various export formats;
    • saving time.

    Flaws:

    • automatically generated Xpath expressions do not always work correctly (so you spend extra time refining them yourself);
    • help documentation that is hard to follow.

    If the drawback of complex configuration has scared you off after all, I recommend taking a look at alternative products:

      • Pritraxer – price parser from any online stores.
      • A-Parser – collects HTML content from any part of a site: any metadata, page text. The parser works with all search engines and with various services and sites. 90+ ready-made parsers, plus 200+ additional parsers in the catalog.