web crawler

One possible solution is to retrieve the available photo previews into a temporary local folder (keep possible privacy issues in mind), much as web browsers store web content locally in order to display it. Wget is a powerful yet simple tool for downloading content from web servers, and it supports pattern matching of the files being retrieved.

1) Determine the (potential) URL of the photos from the event of interest on FinisherPix (e.g. the Vienna City Marathon 2017 is apparently the event with ID 1700, with photos stored at URLs of the form https://fp-zoom-eu.s3.amazonaws.com/1700/1700_000064.JPG).

2) Construct a web content (resource) pattern for Wget (e.g. https://fp-zoom-eu.s3.amazonaws.com/1700/1700_*.JPG).

3) Download all available photos for the event of interest into a local preview folder at once using Wget (e.g. with a command of the form wget -r -l1 -np "https://fp-zoom-eu.s3.amazonaws.com/1700" -P /tmp -A "1700_*.JPG"); a PHP sketch of the same idea follows below.
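
For completeness, the same workaround can be expressed without Wget as a short PHP loop that walks a range of sequential photo numbers and saves whatever exists into a temporary folder. The event ID and file-name pattern are taken from the example above; the number range and the six-digit zero-padding are assumptions, and missing photos are simply skipped.

<?php
// Sketch: fetch sequentially numbered FinisherPix previews into a temporary
// folder. Event ID and name pattern follow the example URL above; the range
// below is only a placeholder.

$eventId = '1700';
$baseUrl = "https://fp-zoom-eu.s3.amazonaws.com/$eventId/";
$target  = sys_get_temp_dir() . "/$eventId";

if (!is_dir($target)) {
    mkdir($target, 0777, true);
}

for ($i = 1; $i <= 200; $i++) {                    // sample range only
    $name = sprintf('%s_%06d.JPG', $eventId, $i);  // e.g. 1700_000064.JPG
    $data = @file_get_contents($baseUrl . $name);  // @: tolerate missing photos
    if ($data !== false) {
        file_put_contents("$target/$name", $data);
    }
}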

Again, be aware of the possible privacy implications resulting from such a workaround.

Content crawling - Amazon

What do I need to think about when programming an application that will download content from Amazon and store it in a database? I have a program with its own web crawler for downloading web content. The problem is that this program cannot recognize the important part of each web page, and we need to help it with that recognition somehow. This issue is especially important to solve for Amazon pages.
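
One common way to help a crawler with "important part" recognition is to parse the fetched HTML and pull out a few named elements with XPath; a minimal PHP sketch is shown below. The selectors (productTitle, a-price, feature-bullets) are assumptions about Amazon's current product-page markup, which changes often, so they illustrate the technique rather than a stable interface.

<?php
// Sketch: isolate the "important part" of a downloaded product page with
// DOMXPath. The element ids/classes below are assumptions about the markup,
// not a stable API.

function extract_product_fields(string $html): array
{
    $doc = new DOMDocument();
    @$doc->loadHTML($html);               // tolerate real-world, non-valid markup
    $xpath = new DOMXPath($doc);

    $pick = function (string $query) use ($xpath): ?string {
        $node = $xpath->query($query)->item(0);
        return $node ? trim($node->textContent) : null;
    };

    return [
        'title'       => $pick('//span[@id="productTitle"]'),
        'price'       => $pick('//span[contains(@class,"a-price")]//span[@class="a-offscreen"]'),
        'description' => $pick('//div[@id="feature-bullets"]'),
    ];
}

// $fields = extract_product_fields(file_get_contents('product_page.html'));

The extracted fields can then be written to the database as usual; the selectors themselves have to be maintained as the page layout evolves.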

Extracting textual information from RTF documents

In order to provide indexing of office documents in a large company and to enable a web crawler to gather the necessary information, a tool is required to extract the textual information from non-plain-text sources such as PDF, DOCX, ODT, RTF, etc. Another requirement is to use PHP without third-party tools such as antiword, xpdf, or at least without OLE under Windows. This requirement is grounded in the fact that, for example, OLE is incredibly slow, even if the task can be solved with it. Another reason is to have an independent solution that does not rely on existing tools and does not depend on the platform used. Here the task is to study the Rich Text Format, whose specification has grown over its evolution to more than 300 pages in the current 1.9.1 version, which certainly does not help with parsing this format.
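
As an illustration of what such a parser has to deal with, below is a naive PHP sketch that strips control words and simple groups from an RTF file. It ignores nested destination groups (e.g. full font tables), Unicode escapes and code pages, so it only handles plain documents; it is a starting point under those assumptions, not a full RTF 1.9.1 parser.

<?php
// Minimal sketch: naive plain-text extraction from a simple RTF file.

function rtf_to_text(string $rtf): string
{
    // Drop starred destination groups such as {\*\generator ...} and flat
    // header groups; nested groups are not fully handled.
    $rtf = preg_replace('/\{\\\\\*[^{}]*\}/', '', $rtf);
    $rtf = preg_replace('/\{\\\\(fonttbl|colortbl|stylesheet|info)[^{}]*\}/', '', $rtf);

    // Decode hex-escaped characters like \'e9.
    $rtf = preg_replace_callback("/\\\\'([0-9a-fA-F]{2})/", function ($m) {
        return chr(hexdec($m[1]));
    }, $rtf);

    // Turn paragraph and line controls into newlines.
    $rtf = preg_replace('/\\\\(par|line)\b/', "\n", $rtf);

    // Strip the remaining control words (e.g. \b0, \fs24), braces and backslashes.
    $rtf = preg_replace('/\\\\[a-zA-Z]+-?\d*\s?/', '', $rtf);
    $rtf = str_replace(['{', '}', '\\'], '', $rtf);

    return trim($rtf);
}

echo rtf_to_text(file_get_contents('sample.rtf'));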

Extracting textual information from DOCX and ODT documents

In order to provide indexing of office documents in a large company and to enable a web crawler to gather the necessary information, a tool is required to extract the textual information from non-plain-text sources such as PDF, DOCX, ODT, RTF, etc.

Another requirement is to use PHP without third-party tools such as antiword, xpdf, or at least without OLE under Windows. This requirement is grounded in the fact that, for example, OLE is incredibly slow, even if the task can be solved with it. Another reason is to have an independent solution that does not rely on existing tools and does not depend on the platform used.

Here the task is to study the Office Open XML format, also known as Microsoft's DOCX, and another similar format, the OpenDocument Format (ODT) from the ODF Alliance.

The first format, Office Open XML (DOCX), can be a real problem even when you are simply working with a document management system within one company. As this format is not compatible with old versions of Microsoft Word, it can be a genuine issue if you receive such a document from a client while running something like Microsoft Office 2003. So the ability to extract the important textual information from such a document without purchasing a new Office licence would be very welcome.

The same applies to the ODT format, although this problem is easier to solve, since OpenOffice is open-source software: if you receive a document in this format, you just have to download the free software. But the indexing software still cannot install all the necessary applications, so this task is really important.
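
Unlike RTF, both DOCX and ODT are ZIP containers holding well-defined XML parts, so a rough extraction needs only PHP's bundled zip extension (not a third-party tool). The sketch below reads the body part of each format (word/document.xml for DOCX, content.xml for ODT) and strips the markup; styles, headers/footers and embedded objects are ignored, and the file names are placeholders.

<?php
// Minimal sketch: pull plain text out of DOCX and ODT files via ZipArchive.

function office_to_text(string $path): string
{
    $entry = (strtolower(pathinfo($path, PATHINFO_EXTENSION)) === 'odt')
        ? 'content.xml'          // ODT body
        : 'word/document.xml';   // DOCX body

    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        throw new RuntimeException("Cannot open $path as a ZIP archive");
    }
    $xml = $zip->getFromName($entry);
    $zip->close();

    if ($xml === false) {
        throw new RuntimeException("$entry not found in $path");
    }

    // Insert line breaks at paragraph/heading end tags, then drop all markup.
    $xml = preg_replace('/<\/(w:p|text:p|text:h)>/', "\n", $xml);
    return trim(html_entity_decode(strip_tags($xml), ENT_QUOTES | ENT_XML1));
}

echo office_to_text('report.docx');
echo office_to_text('report.odt');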

download whole website

A small company has a pretty old website. They outsourced the administration of the site and only have access to the content via a web interface; no database or FTP login data is available anymore, because the company to which the administration was outsourced no longer exists and the login data was lost. A new website will be put in place, and they want the old website to stay in a subdomain; static HTML pages are enough for this purpose. The goal is to download the whole website, including all graphics, documents etc., without following external links to other sites, for backup and archive reasons. Scriptable solutions are preferred (no GUI apps).
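
Wget's recursive mode (as used in the photo example above) is the usual scriptable answer for mirroring a site. Where only PHP is available, the sketch below shows the same idea in miniature: a breadth-first crawl restricted to one host that saves pages and images under a local directory. It does not rewrite links, does not follow CSS/JS references, and skips relative links for brevity; the start URL and output directory are placeholders.

<?php
// Sketch of a single-host mirror in PHP: crawl breadth-first from a start
// URL, save each fetched resource under a local directory, and queue links
// and images that live on the same host.

$start  = 'https://old.example.com/';   // placeholder
$outDir = __DIR__ . '/mirror';          // placeholder
$scheme = parse_url($start, PHP_URL_SCHEME);
$host   = parse_url($start, PHP_URL_HOST);

$queue = [$start];
$seen  = [$start => true];

while ($queue) {
    $url  = array_shift($queue);
    $body = @file_get_contents($url);   // requires allow_url_fopen
    if ($body === false) {
        continue;                       // skip unreachable resources
    }

    // Map the URL path to a local file, treating directories as index.html.
    $path = parse_url($url, PHP_URL_PATH) ?: '/';
    if (substr($path, -1) === '/') {
        $path .= 'index.html';
    }
    $file = $outDir . $path;
    if (!is_dir(dirname($file))) {
        mkdir(dirname($file), 0777, true);
    }
    file_put_contents($file, $body);

    // Only HTML pages are parsed for further links.
    if (!preg_match('/\.(html?|php)$/i', $path)) {
        continue;
    }

    $doc = new DOMDocument();
    @$doc->loadHTML($body);             // tolerate sloppy real-world markup
    foreach (['a' => 'href', 'img' => 'src'] as $tag => $attr) {
        foreach ($doc->getElementsByTagName($tag) as $el) {
            $link = $el->getAttribute($attr);
            if ($link === '' || strpos($link, '#') === 0) {
                continue;
            }
            if (strpos($link, '//') === 0) {            // protocol-relative
                $link = $scheme . ':' . $link;
            } elseif (strpos($link, '/') === 0) {       // root-relative
                $link = "$scheme://$host$link";
            } elseif (!preg_match('#^https?://#i', $link)) {
                continue;               // relative paths and mailto: skipped
            }
            if (parse_url($link, PHP_URL_HOST) === $host && empty($seen[$link])) {
                $seen[$link] = true;
                $queue[] = $link;
            }
        }
    }
}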