
Copying an old site


Systreg
I've got a domain where the previous site on it was a Crown Copyright site, which is archived on the nationalarchives.gov site.

I've been in touch with the National Archives to ask about re-using the old site under the Open Government Licence (OGL), and they've replied saying that I can re-use Crown copyright information for the site in question under the terms of the OGL.

The only option I know of is to go to the archives site and do a "Page >> Save As" for each page, then add a note at the bottom of each stating that the site is reproduced with permission of the Gov/National Archive under the OGL.

Does anyone know of any other way to capture a whole archived site in one go, rather than page by page? The site is plain HTML.
 
I'd be more inclined to do it via PHP & curl. There's some code on the AD site somewhere for a basic curl scraper.
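The core of it is only a few lines of PHP. A rough sketch (the URL and filename below are placeholders, not the real archive address):

<?php
// Fetch one page with curl and save it to disk.
// The URL and filename are placeholders - swap in the archive page you want.
$url    = 'http://example.com/somepage.html';
$saveAs = 'somepage.html';

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body as a string
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow any redirects
$html = curl_exec($ch);

if ($html === false) {
    die('curl error: ' . curl_error($ch));
}
curl_close($ch);

file_put_contents($saveAs, $html);
echo 'Saved ' . strlen($html) . " bytes to $saveAs\n";

Loop that over a list of page URLs (from the sitemap, say) and you've pulled the whole site.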
 
Thanks Grant, just downloaded it. It might be a bit techy for me, but I at least have it scanning pages; I've not figured out where those pages will be on my laptop once it's finished though :)

[edit]

On the HTTrack site it says:

Simply open a page of the "mirrored" website in your browser, and you can browse the site from link to link, as if you were viewing it online.

OK, I was at the bit where it says to put in the URL of the site to download, as per the quote. I thought I would need to go from page to page and click something to download each one, but it appears to be copying everything. On the scanning page it says:

Links Scanned: 18/152 (+135) and that's increasing quickly
 
Yep, be careful with it. I haven't used it myself, but basically if you put the root folder in, i.e. nationalarchives.gov/yourdomain or however it is on the archive site, then it should download everything within that folder, which should be the whole site.

Grant
 
Yep, the bit you quoted is basically saying once you've downloaded the entire site to your PC you can open the saved files up in your browser and browse it as if you were online.

Grant
 
Yep, be careful with it

You've got me worried now :) Careful in what way?

if you put the root folder in, i.e. nationalarchives.gov/yourdomain or however it is on the archive site, then it should download everything within that folder, which should be the whole site.

Yes, that's how I entered the URL. I had a look at the sitemap on the site I'm copying and there are a couple of hundred pages and dozens of PDFs. It's still copying; that's all going to take a lot of editing :(

Yep, the bit you quoted is basically saying once you've downloaded the entire site to your PC you can open the saved files up in your browser and browse it as if you were online.

I assume that means there will be a folder somewhere on my laptop showing each individual HTML page file downloaded, which I can edit and upload to my webhost.
 
I just meant be careful that you enter the correct starting url or you could end up downloading half the internet :)

Yep, there will be an option somewhere to set the folder you want the files saved in, so find that and see what it's already set to.

When you come to editing, there are bulk search and replace tools out there that will do the job in one go by searching/replacing all matches in every file in a folder.

For example, you could search for:

</body>

And replace it with:

reproduced with permission of the Gov/National Archive</body>



to add your message at the bottom of every HTML file.
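If you'd rather script it than install a tool, a few lines of PHP will do the same job across a whole folder. A rough sketch - the folder path and footer text are placeholders:

<?php
// Add a footer line before </body> in every .html file under a
// folder, recursively. The folder path and footer text are placeholders.
$folder  = '/path/to/mirrored/site';
$search  = '</body>';
$replace = 'Reproduced with permission of the Gov/National Archive under the OGL</body>';

$files = new RecursiveIteratorIterator(new RecursiveDirectoryIterator($folder));
foreach ($files as $file) {
    if ($file->isFile() && strtolower($file->getExtension()) === 'html') {
        $page = file_get_contents($file->getPathname());
        file_put_contents($file->getPathname(), str_replace($search, $replace, $page));
    }
}

Run it once over the mirrored folder and every page gets the footer.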

Grant
 
Grant, it felt like it was downloading half the internet :lol:

It finally stopped downloading after around 500-plus files. There's a tree menu on the left-hand side of the HTTrack application that shows loads of site pages under each of the various + sign headings; I can double-click those and it opens that page in my browser.

I can't see the actual files anywhere on my computer, or any way to download them from the HTTrack program. I think I'll just uninstall it, as it looks like it's only for viewing sites offline rather than for getting files to upload to a webhost :confused:
 
It finally stopped downloading after around 500-plus files. There's a tree menu on the left-hand side of the HTTrack application that shows loads of site pages under each of the various + sign headings; I can double-click those and it opens that page in my browser.

I think the tree on the left is your hard drive showing the location of the files???

Grant
 
Post a screenshot of the app, mate, showing the directory tree with the files you just downloaded.

Grant
 
I seem to have found the files, but it's a pain in the arse: I had to go to Computer >> Vista (C:) >> My Websites >> then seven more clicks through other folders to find the folder containing the site files.

Because they were downloaded from the National Archives site, each page has the big thick red header bar with text all over it about when the snapshot was taken etc., so each of those would have to be removed from the code in every HTML file. And the HTML isn't even orderly; it's all joined together in one mass. A right mess, grrrrrrrr
 
I seem to have found the files, but it's a pain in the arse: I had to go to Computer >> Vista (C:) >> My Websites >> then seven more clicks through other folders to find the folder containing the site files.

Because they were downloaded from the National Archives site, each page has the big thick red header bar with text all over it about when the snapshot was taken etc., so each of those would have to be removed from the HTML as well, grrrrrrrr

Not exactly a massive pain, as once you get there you're there :)

Go to the folder containing the files, right-click it and choose Send to >> Desktop (create shortcut). Basics, mate :)

Grant
 
Good idea. Just thinking this might be a bit too much of a task lol; there are 343 website pages and 157 PDF files, not to mention images and other stuff.

Maybe I could do a cut-down version and leave loads of the pages off. Parked on Sedo it's currently showing as PR 6, but would it lose a lot of PR if I left a lot of pages off? Or would the PR juice from backlinks simply mean it stays the same, as the links are many and good?
 
Good idea. Just thinking this might be a bit too much of a task lol; there are 343 website pages and 157 PDF files, not to mention images and other stuff.

I'd go for it mate seeing as you've got the permission. It may take a bit of effort with the search and replace but may well be worth it.

You could add the footer as I suggested, and remove the header thing by replacing its HTML with nothing.
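The header should be strippable the same way, since every saved page came from the same archive template. A rough PHP sketch - note the div id below is made up, so open one saved file first and copy the banner's real opening and closing markup:

<?php
// Strip the archive header banner from one page's HTML.
// 'webArchiveBanner' is a made-up id - open a saved file and copy the
// banner's real opening and closing markup before running this.
$file = 'somepage.html'; // placeholder filename
$html = file_get_contents($file);

// Non-greedy match from the banner's opening div to the next </div>;
// the /s modifier lets . span newlines, so the whole multi-line block
// is caught. Caveat: this breaks if the banner contains nested <div>s -
// in that case, match on unique start/end markers instead.
$clean = preg_replace('#<div id="webArchiveBanner">.*?</div>#s', '', $html);

file_put_contents($file, $clean);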

I've used this bulk search and replace before and it did a good job:

http://www.k-free.co.uk/find-replace-software.html

Grant
 
Just out of curiosity, how would G or other search engines view this? I mean, if a site is no longer active but you use the exact same content, will your position in the search engines suffer?
 
Grant, I'm not sure whether that K-Free tool will work, as the example image on the site shows a small input field for the code to be changed, and the code that needs changing in the files is 120 lines of HTML and JavaScript.
 
@Nova, I think all things being equal, if you rebuild a site with the same content it had before, on the same domain, and with all the existing backlinks to it, I don't see why it wouldn't rank normally again once indexed properly.

I've made sites before on caught domains which ranked very high and retained their PageRank. Having said that, I'm most definitely not an expert on things to do with Google, so maybe things have changed; I can't see why they would have, though.
 
Grant, I'm not sure whether that K-Free tool will work, as the example image on the site shows a small input field for the code to be changed, and the code that needs changing in the files is 120 lines of HTML and JavaScript.

I've entered quite a bit of text into it before; you'll have to give it a go to find out.

You're only searching for the parts that need changing, though. Surely 120 lines is the whole page, and not just the part that needs replacing/removing?

Grant
 