Project Idea

I stumbled upon a treasure trove of a wallpaper site: Desktop wallpapers hd, free desktop backgrounds (wallpaperscraft.com)

It offers plenty of beautiful wallpapers for free download, and a mischievous thought struck me: only children make choices — I want them all.

Examining the Site Structure

Spotting the URL Pattern

Open one of the site's pages, for example Anime wallpapers 4k ultra hd 16:10, desktop backgrounds hd, pictures and images (wallpaperscraft.com)

Look at the URL first: the pages follow the pattern https://wallpaperscraft.com/catalog/anime/3840x2400/page1, with page2, page3, and so on incrementing in sequence.
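Given that pattern, the catalog page URLs can be generated programmatically. A minimal sketch — the base URL is taken from the page above; the function name and page count are my own choices:

```python
# Generate catalog page URLs following the page1, page2, ... pattern.
BASE = "https://wallpaperscraft.com/catalog/anime/3840x2400"

def page_urls(count):
    """Return the URLs of the first `count` catalog pages."""
    return [f"{BASE}/page{num}" for num in range(1, count + 1)]

urls = page_urls(3)
print(urls[0])  # https://wallpaperscraft.com/catalog/anime/3840x2400/page1
```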

Getting the Detail-Page URLs

Right-click and choose Inspect Element (or press F12).

[Screenshot: inspect element]

We can see that each wallpaper sits inside an element like this:

```html
<a class="wallpapers__link" href="/download/girl_smile_fish_1005833/3840x2400">
  <span class="wallpapers__canvas">
    <img class="wallpapers__image" src="https://images.wallpaperscraft.com/image/single/girl_smile_fish_1005833_300x188.jpg" alt="Preview wallpaper girl, smile, fish, anime, colorful">
  </span>
  <span class="wallpapers__info">
    <span class="wallpapers__info-rating">
      <span class="gui-icon gui-icon_rating"></span>&nbsp;9.5
    </span>
    3840x2400
    <span class="wallpapers__info-downloads">
      <span class="gui-icon gui-icon_download"></span>&nbsp;38295
    </span>
  </span>
  <span class="wallpapers__info">girl, smile, fish</span>
</a>
```

This only shows a thumbnail of the wallpaper. To download the full-size image we have to visit the detail page, i.e. the address shown in <a class="wallpapers__link" href="/download/girl_smile_fish_1005833/3840x2400">: the site root plus that href, giving https://wallpaperscraft.com/download/girl_smile_fish_1005833/3840x2400
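As a side note, the standard library's urllib.parse.urljoin can perform this root-plus-href concatenation for us. A quick sketch using the href from the snippet above:

```python
from urllib.parse import urljoin

# Relative href taken from the <a class="wallpapers__link"> tag above.
href = "/download/girl_smile_fish_1005833/3840x2400"
detail_url = urljoin("https://wallpaperscraft.com", href)
print(detail_url)  # https://wallpaperscraft.com/download/girl_smile_fish_1005833/3840x2400
```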

Downloading the Image from the Detail Page

In the previous step we obtained the detail page: https://wallpaperscraft.com/download/girl_smile_fish_1005833/3840x2400

Open it and inspect the element:

[Screenshot: detail page, inspect element]

We find that the image address is contained in this element:

```html
<div class="wallpaper__placeholder">
  <a class="JS-Popup" href="https://images.wallpaperscraft.com/image/single/girl_smile_fish_1005833_3840x2400.jpg">
    <img class="wallpaper__image" src="https://images.wallpaperscraft.com/image/single/girl_smile_fish_1005833_3840x2400.jpg" alt="3840x2400 Wallpaper girl, smile, fish, anime, colorful">
  </a>
</div>
```

Writing the Scraper

Sharpening the axe doesn't delay the woodcutting: after all this observation, we can finally get to the fun part and write some code.

Importing the Libraries

We only need two libraries: urllib and BeautifulSoup.

```python
from urllib.request import urlopen
from bs4 import BeautifulSoup
```

Fetching the Detail-Page Addresses

Write a function:

```python
def getDetail(url):
    html = urlopen(url)
    bsObj = BeautifulSoup(html.read(), features="lxml")
    detailLst = bsObj.findAll("a", {"class": "wallpapers__link"})
    return ["https://wallpaperscraft.com" + i.attrs["href"] for i in detailLst]
```

bsObj.findAll("a", {"class": "wallpapers__link"}) collects every element containing a detail-page address into the list detailLst.

The href attribute of each tag object is the detail-page address we need; we read it via .attrs["href"], then prepend "https://wallpaperscraft.com" to return the full URL.

Getting the Full-Size Image Address from the Detail Page

```python
def getImg(url):
    html = urlopen(url)
    bsObj = BeautifulSoup(html.read(), features="lxml")
    imgUrl = bsObj.findAll("img", {"class": "wallpaper__image"})[0].attrs["src"]
    return imgUrl, imgUrl.split("/")[-1]
```

This works just like the previous function; only minor changes are needed.

imgUrl.split("/")[-1] returns the filename of the downloaded image (extension included).
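As a quick sanity check of that expression: splitting the full-size image URL from the detail page on "/" and taking the last segment does yield the filename with its extension.

```python
# Image URL taken from the detail-page snippet earlier in the article.
img_url = "https://images.wallpaperscraft.com/image/single/girl_smile_fish_1005833_3840x2400.jpg"
name = img_url.split("/")[-1]
print(name)  # girl_smile_fish_1005833_3840x2400.jpg
```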

Downloading the Images

Now everything is in place, and downloading an image is simple.

We just need:

```python
from urllib.request import urlretrieve
```

Let's try downloading a single image first:

```python
m, n = getImg("https://wallpaperscraft.com/download/girl_smile_fish_1005833/3840x2400")
print("Downloading:", n)
urlretrieve(m, n)
print("Done.")
```

The Finale: Scraping in Bulk

With a loop, the whole thing becomes easy.

There are 52 pages in total, but there's no need to download them all; ten pages or so will do.

```python
for num in range(1, 11):
    try:
        target = f"https://wallpaperscraft.com/catalog/anime/3840x2400/page{num}"
        detailLst = getDetail(target)
        print("\033[91mDownloading from:\033[0m", target)
        for details in detailLst:
            address, name = getImg(details)
            print("Downloading:", name)
            urlretrieve(address, "downloaded\\" + name)
        print("Done.")
    except Exception:
        # Skip a page if any request fails, rather than aborting the whole run.
        pass
```
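One caveat with the loop above: urlretrieve cannot create directories, so the "downloaded" folder has to exist before the run. A portable way to ensure that (the folder name matches the one used in the loop):

```python
from pathlib import Path

# Create the output directory if it does not already exist.
out_dir = Path("downloaded")
out_dir.mkdir(exist_ok=True)
print(out_dir.is_dir())  # True
```

Building the target path as out_dir / name would also avoid the Windows-only backslash used in the loop.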
[Screenshot: results]