[Tutor] Question about python code that is not working
dn
PythonList at DancesWithMice.info
Mon Jun 19 18:39:14 EDT 2023
Apologies if it feels as if I keep telling you what to do: please answer
to the list so that (a) others can jump-in and assist you, and (b) if
anyone else is suffering a similar problem, they can gain (almost) as
much as you - also a reason for selecting a meaningful subject-line for
email messages!
Now, let's talk:-
On 20/06/2023 09.56, Arthur Kolbe wrote:
> Hey again! I've been working on my code for some more, many things that
> needed to be improved. This is the code as of right now:
...
>
> What I want is this:
>
> Software
>
> Enter websites
The way you have broken-down the problem into smaller sub-problems (and
they into smaller ...) is good analysis and design!
When ready to code, what I do is take that narrative specification and
turn it into Python function-names and docstrings. This is workable if
the sub-problems have been broken-down sufficiently - and works on the
grounds that each sub-problem will require one (or more) functions in
which to code its solution.
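By way of illustration, here is a rough sketch of what that first step might look like for the spec quoted above. All of the function-names and parameters are my own invention (hypothetical), not taken from your code, and the bodies are deliberately left unwritten:

```python
# Hypothetical skeleton: names and signatures are assumptions,
# derived only from the narrative specification.


def read_websites(source):
    """Return the list of web-site URLs to be checked."""
    raise NotImplementedError


def crawling_allowed(url):
    """Return True if the site's robots.txt permits crawling this URL."""
    raise NotImplementedError


def crawl_website(url):
    """Visit every page of the site; return {page URL: status code}."""
    raise NotImplementedError


def summarise(pages):
    """Return the one-line verdict for a site, eg '404 pages found on website'."""
    raise NotImplementedError


def write_report(summaries, pages, destination):
    """Write both tables (site verdicts, page statuses) to the destination."""
    raise NotImplementedError
```

Each docstring is one line of the spec; each function can then be filled-in and tested on its own.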
In this fashion, the top-down design becomes a bottom-up construction.
As the sub-problems are solved, the slightly-larger sub-problems can be
addressed - often a matter of ensuring that the sub-problem solutions
"integrate" correctly (hence the term: "integration testing"). Continuing
until it's 'all done'. Ah, would that life were so easy...
Thus, starting from those function-names and docstring solution-methods,
I code those sub-problem solutions, one at a time. This means that I can
also* build a (set of) test(s) to ensure that I've got that (little) bit
correct - it's so much easier to see where things have gone-wrong if
there is only one function in-play!
* I (try to - but am human/lazy/often trying to work quickly) use a
technique called "TDD" (Test-Driven Development) which suggests that one
should use the spec to write the test *first*, and then write code which
will *deliver* to spec.
[however such is possibly a distraction at this moment. So, tuck it
behind your ear, and come back to it when you're inclined]
Thus, if this sub-problem: "enter websites", is built as a
self-contained function, can the function be given a URL as argument,
and respond with the page-header and/or content? There we go - first
test written, and (making assumption) first sub-problem solved!
Get the idea?
(see also "Modular Programming")
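A minimal sketch of that idea, assuming the sub-problem is reduced to "given a URL, report its status code". The names fetch_status and FakeResponse are hypothetical, and the 'opener' parameter is my own addition so that the tests can run without touching the network:

```python
import urllib.error
import urllib.request


def fetch_status(url, opener=urllib.request.urlopen):
    """Return the HTTP status code for url (hypothetical helper).

    opener is injectable so that tests need not use the network.
    """
    try:
        with opener(url) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urlopen reports 404, 410, etc by raising HTTPError
        return error.code


class FakeResponse:
    """Stand-in for the object returned by urlopen (test use only)."""

    def __init__(self, status):
        self.status = status

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        return False


def test_fetch_status_reports_200():
    def present(url):
        return FakeResponse(200)

    assert fetch_status("https://example.com/", opener=present) == 200


def test_fetch_status_reports_404():
    def missing(url):
        raise urllib.error.HTTPError(url, 404, "Not Found", None, None)

    assert fetch_status("https://example.com/gone", opener=missing) == 404
```

The two test_ functions can be run by pytest, or simply called by hand. If one fails, there is only one function to suspect.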
> Software checks website if crawling forbidden or not
Good practice!
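The PSL has urllib.robotparser for this. A minimal sketch, assuming the robots.txt content has already been fetched and split into lines (the function-name and the sample rules are hypothetical):

```python
from urllib.robotparser import RobotFileParser


def crawling_allowed(robots_lines, url, user_agent="*"):
    """Return True if the given robots.txt lines permit fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_lines)  # accepts an iterable of text lines
    return parser.can_fetch(user_agent, url)


# Made-up rules, purely for illustration
SAMPLE_RULES = [
    "User-agent: *",
    "Disallow: /private/",
]
```

Against live sites, RobotFileParser also offers set_url() and read() to fetch the file itself; keeping the fetch separate, as here, makes the decision logic testable off-line.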
> If allowed, crawls every page on website, looks for 404/410 pages that
> were once present on the website (status code 200)
There are Linux tools which cover some of this (curl and wget). wget,
for example, has options (--wait and --random-wait) to create/vary a
pause between making requests of a site - to avoid 'hammering' the
server. May be worth a perusal...
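If staying in Python, the same courtesy can be approximated with time.sleep. A small sketch, in which the function-name, the default timings, and the injectable 'sleep' parameter are all assumptions of mine:

```python
import random
import time


def polite_pause(base_seconds=1.0, jitter_seconds=0.5, sleep=time.sleep):
    """Pause for base_seconds plus a random extra; return the delay used.

    sleep is injectable so that tests need not actually wait.
    """
    delay = base_seconds + random.uniform(0, jitter_seconds)
    sleep(delay)
    return delay
```

Call it between page requests to the same site, eg once per iteration of the crawl loop.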
> Creates CSV.
>
> Two tables. Both two columns. Table 1 Left C: All websites one
> entered, Right C: EITHER "404 pages found on website", "no 404 pages
> found on website", "Scraping not allowed" or "Website can't be reached".
> Remember, the entire websites are supposed to be crawled for 404 pages.
> So in the second table in the left column all pages that were found.
> right column status code. This second table in the csv is so that I can
> Make sure the program did or didnt find 404 pages.
Business folk can't seem to get enough of spreadsheets (although this is
a .CSV file, cf using openpyxl (or some-such) to build a spreadsheet
directly).
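If the .CSV is to stay, both tables can go into one file, written once at the end. A rough sketch using the csv module: the function-name, the column headings, and the blank row between the tables are my assumptions, not part of the spec:

```python
import csv


def write_report(summaries, pages, stream):
    """Write both tables to one already-open text stream.

    summaries: iterable of (website, result) pairs
    pages:     iterable of (page URL, status code) pairs
    """
    writer = csv.writer(stream)
    writer.writerow(["website", "result"])
    writer.writerows(summaries)
    writer.writerow([])  # blank row separating the two tables
    writer.writerow(["page", "status code"])
    writer.writerows(pages)
```

When writing to a real file, open it with newline="" as the csv documentation recommends, eg open(path, "w", newline="").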
Whereas a web-site verifier/monitor like this, and Python program[me]s
which run 'in the background', are often better-off tracking progress
and results in a "log". There is even a logging library in the PSL
(Python Standard Library)!
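A minimal sketch of a single log-file shared by the whole run. The logger-name "crawler", the default file-name, and the message format are assumptions chosen for illustration:

```python
import logging


def make_logger(log_path="crawler.log"):
    """Return a logger which appends to one file for the whole run."""
    logger = logging.getLogger("crawler")
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid duplicate lines if called twice
        handler = logging.FileHandler(log_path, encoding="utf-8")
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger
```

Usage would be along the lines of: log = make_logger() once at start-up, then log.info("%s returned %s", page_url, status_code) for each page visited. Because FileHandler appends by default, results for the second web-site follow those of the first rather than replacing them.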
> and what is happening right now is this:
> the code when running, creates one file with all pages it finds to the
> first website I enter, then when done with crawling that website,
> creates another file with the same name in the same directory,
Oops!
This problem wouldn't happen if a single log-file were being employed -
similarly a single workbook (although would still find same issue if
delivered as a separate work-sheet for each web-site).
The "websites" list of URLs to be inspected needs to be accompanied by a
'destination' file-name (for this purpose). Alternatively, if you can
guarantee unique naming, perhaps use urllib.parse to split each URL into
components and use the web-site (netloc) - with or without TLD - as the
file-name?
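A sketch of the second option, assuming one report per web-site. The function-name, the ".csv" suffix, and the replacing of dots by underscores are my own (hypothetical) choices:

```python
from urllib.parse import urlsplit


def report_name(url, suffix=".csv"):
    """Build a per-site file-name from the host part of url."""
    host = urlsplit(url).hostname  # lower-cased, without any port
    if not host:
        raise ValueError(f"no host found in {url!r}")
    return host.replace(".", "_") + suffix
```

Thus "https://www.example.com/a/b" yields "www_example_com.csv", and a different site yields a different name, so one site's results can no longer overwrite another's. Note the URL must include its scheme ("https://"), else urlsplit finds no host.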
> overwriting the old one, where all the pages of the second website I
> entered are, then the third, and when its done it creates one last file
> where all the websites that were entered are shown in a table in the
> left column but for some reason it says "website cant be reached" for
> them in the right column although like I just said all the pages of the
> websites were found in the files created before. First two files are
> just the second table so to say but only for one website, the last file
> is only the first table so to say but not correct.
Re-read this and note how it is difficult to locate exactly where the
error starts - because it is only revealed at the reporting stage.
> Thanks for helping in advance!
The 'help' is non-specific. Should you decide to change the delivery
method or implement the naming-scheme, maybe the problem is solved.
However, better practice will help. If smaller units are tested
in-isolation, the problem will (likely) be revealed sooner, ie calling a
function and *not* gaining the result desired (tested for). If you are
able to narrow-down the code the way you have narrowed-down the
problem-description, I think you will have it beaten (and the next bug,
and those in the next program[me]...)
--
Regards,
=dn