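Project Idea

I stumbled upon a treasure trove of a wallpaper site: Desktop wallpapers hd, free desktop backgrounds (wallpaperscraft.com)

It offers plenty of good-looking wallpapers to download for free. That gave me a mischievous idea: only kids make choices; I want them all.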

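Examining the Site Structure

Examining the URL Pattern

Open any listing page on the site, for example Anime wallpapers 4k ultra hd 16:10, desktop backgrounds hd, pictures and images (wallpaperscraft.com)

Look at the URL first: every listing page is the base address https://wallpaperscraft.com/catalog/anime/3840x2400/ followed by page1, page2, page3, and so on, with the number increasing by one each time.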
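The URL pattern can be sketched as a small helper. The name `page_urls` and its parameters are hypothetical and are not part of the script developed later; the base address is the one shown above.

```python
BASE = "https://wallpaperscraft.com/catalog/anime/3840x2400/"


def page_urls(first, last, base=BASE):
    """Return the listing-page URLs from `first` to `last`, inclusive."""
    return [f"{base}page{num}" for num in range(first, last + 1)]


# Print the first three listing pages
for url in page_urls(1, 3):
    print(url)
```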

Getting the Detail Page URLs

Right-click and choose Inspect (or press F12).

Each wallpaper turns out to sit inside an

<a class="wallpapers__link" href="/download/girl_smile_fish_1005833/3840x2400">
      <span class="wallpapers__canvas">
  <img class="wallpapers__image" src="https://images.wallpaperscraft.com/image/single/girl_smile_fish_1005833_300x188.jpg" alt="Preview wallpaper girl, smile, fish, anime, colorful">
</span>
<span class="wallpapers__info">
  <span class="wallpapers__info-rating">
    
     <span class="gui-icon gui-icon_rating"></span>&nbsp;9.5
    
  </span>
  3840x2400
  <span class="wallpapers__info-downloads">
     <span class="gui-icon gui-icon_download"></span>&nbsp;38295
  </span>
</span>
<span class="wallpapers__info">girl, smile, fish</span>

    </a>

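element.

Only a thumbnail of the wallpaper is shown here. To download the full-size wallpaper we have to open its detail page, whose address is the href in <a class="wallpapers__link" href="/download/girl_smile_fish_1005833/3840x2400">. Prefixed with the site domain, it becomes https://wallpaperscraft.com/download/girl_smile_fish_1005833/3840x2400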
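Combining the site root with the relative href can be sketched with `urljoin` from the standard library. The variable names `SITE`, `href`, and `detail_url` are illustrative only; the two strings are copied from the element shown earlier.

```python
from urllib.parse import urljoin

SITE = "https://wallpaperscraft.com"
href = "/download/girl_smile_fish_1005833/3840x2400"

# urljoin resolves the site-relative href against the site root
detail_url = urljoin(SITE, href)
print(detail_url)
```

Plain string concatenation gives the same result here; `urljoin` is simply more forgiving if an href ever turns out to be absolute already.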

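Downloading the Image from the Detail Page

In the previous step we obtained the detail page: https://wallpaperscraft.com/download/girl_smile_fish_1005833/3840x2400

Open it and inspect the elements:

The image address turns out to be contained in the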

<div class="wallpaper__placeholder">
          <a class="JS-Popup" href="https://images.wallpaperscraft.com/image/single/girl_smile_fish_1005833_3840x2400.jpg">
            <img class="wallpaper__image" src="https://images.wallpaperscraft.com/image/single/girl_smile_fish_1005833_3840x2400.jpg" alt="3840x2400 Wallpaper girl, smile, fish, anime, colorful">
          </a>
        </div>

element.
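As an aside, in this one example the full-size image URL has the same file name as the thumbnail with only the size suffix changed, and that name can be assembled from the two parts of the detail-page href. The sketch below encodes that observation. It is an assumption drawn from a single example, not a rule the site guarantees, and `guess_image_url` and `IMAGE_ROOT` are hypothetical names. Reading the address from the detail page remains the safer route.

```python
IMAGE_ROOT = "https://images.wallpaperscraft.com/image/single/"


def guess_image_url(detail_href):
    """Guess the full-size image URL from a detail-page href.

    Assumes the href looks like '/download/<name>/<resolution>' and that
    the image is a .jpg, as in the single example shown above.
    """
    parts = detail_href.strip("/").split("/")
    if len(parts) != 3 or parts[0] != "download":
        raise ValueError(f"unexpected href: {detail_href!r}")
    _, name, resolution = parts
    return f"{IMAGE_ROOT}{name}_{resolution}.jpg"


print(guess_image_url("/download/girl_smile_fish_1005833/3840x2400"))
```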

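Starting the Scraper

Sharpening the axe does not delay the woodcutting. After all this observation, we can finally enjoy writing some code.

Importing the Required Libraries

The libraries we need are simple: urllib and BeautifulSoup. Because the code below passes features="lxml" to BeautifulSoup, the lxml parser also has to be installed.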

from urllib.request import urlopen
from bs4 import BeautifulSoup

Fetching the Detail Page Addresses

Write the function:

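def getDetail(url):
    # Fetch a listing page and parse it with the lxml parser
    html = urlopen(url)
    bsObj = BeautifulSoup(html.read(), features="lxml")
    # Every wallpaper on the page is wrapped in <a class="wallpapers__link">
    detailLst = bsObj.findAll("a", {"class": "wallpapers__link"})
    # The hrefs are site-relative, so prepend the domain
    return ["https://wallpaperscraft.com" + i.attrs["href"] for i in detailLst]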

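bsObj.findAll("a", {"class": "wallpapers__link"}) collects the elements that contain the detail page addresses into the list detailLst.

The href attribute of each tag object is the detail page address we need. We read it with .attrs["href"], prepend the prefix "https://wallpaperscraft.com", and return the complete addresses.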
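The extraction logic can be checked offline against a saved fragment of the page. The sketch below swaps BeautifulSoup for the standard library's `html.parser`, purely so that it runs without network access or third-party packages; it is not the approach used by getDetail. `LinkCollector` and `SAMPLE` are hypothetical names, and the second link in `SAMPLE` is made up to show that other anchors are ignored.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags that carry a given class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if self.css_class in classes and attr_map.get("href"):
            self.hrefs.append(attr_map["href"])


# First link copied from the page; the second one is invented for contrast
SAMPLE = """
<a class="wallpapers__link" href="/download/girl_smile_fish_1005833/3840x2400">
  <span class="wallpapers__canvas"></span>
</a>
<a class="other" href="/ignored">x</a>
"""

collector = LinkCollector("wallpapers__link")
collector.feed(SAMPLE)
detail_urls = ["https://wallpaperscraft.com" + h for h in collector.hrefs]
print(detail_urls)
```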

Getting the Full-Size Image Address from the Detail Page

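def getImg(url):
    # Fetch and parse a detail page
    html = urlopen(url)
    bsObj = BeautifulSoup(html.read(), features="lxml")
    # The first <img class="wallpaper__image"> holds the full-size image URL
    imgUrl = bsObj.findAll("img", {"class": "wallpaper__image"})[0].attrs["src"]
    # Return the URL together with its file name (the last path segment)
    return imgUrl, imgUrl.split("/")[-1]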

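The idea is the same as in the previous part; only a small modification is needed.

imgUrl.split("/")[-1] returns the file name of the image to download (including the extension).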
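To make the file-name step concrete, here is a minimal sketch using the image URL from the detail page above. It also shows a slightly more defensive variant that ignores a query string. The helper name `file_name_from_url` and the `?v=2` suffix are hypothetical; the article's URLs carry no query string, so the simple split is enough for them.

```python
import posixpath
from urllib.parse import urlsplit


def file_name_from_url(url):
    """Return the last path segment of `url`, ignoring any query or fragment."""
    return posixpath.basename(urlsplit(url).path)


img_url = "https://images.wallpaperscraft.com/image/single/girl_smile_fish_1005833_3840x2400.jpg"

simple_name = img_url.split("/")[-1]                # the approach used in getImg
robust_name = file_name_from_url(img_url + "?v=2")  # hypothetical query string

print(simple_name)
print(robust_name)
```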

Downloading the Image

Now everything is in place, so downloading the image is easy:

All we need is:

from urllib.request import urlretrieve

Let's try downloading a single image first:

m, n = getImg("https://wallpaperscraft.com/download/girl_smile_fish_1005833/3840x2400")
print("Downloading:", n)
urlretrieve(m, n)
print("Done.")

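Finale: Batch Scraping

With a loop, it all becomes much simpler:

There are 52 pages in total, and there is no need to download them all. The first 10 pages are about enough.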

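import os

os.makedirs("downloaded", exist_ok=True)  # urlretrieve does not create the folder

for num in range(1, 11):
    try:
        target = f"https://wallpaperscraft.com/catalog/anime/3840x2400/page{num}"
        detailLst = getDetail(target)
        print("\033[91mDownloading from:\033[0m", target)  # red text via ANSI escape
        for details in detailLst:
            address, name = getImg(details)
            print("Downloading:", name)
            urlretrieve(address, os.path.join("downloaded", name))
            print("Done.")
    except Exception as err:
        # Report the failure and move on to the next page instead of hiding it
        print("Failed on page", num, "-", err)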
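One possible refinement of the loop above is to handle errors per image, so that a single failure does not skip the rest of its page, and to pause briefly between requests. The sketch below is only an illustration under those assumptions. The function `crawl`, its parameters, and the stand-in functions are hypothetical; the fetching and saving steps are passed in as arguments so the sketch can run without network access.

```python
import time


def crawl(page_urls, get_detail, get_img, save, delay=1.0):
    """Download every image reachable from `page_urls`.

    get_detail(page_url) -> list of detail-page URLs
    get_img(detail_url)  -> (image_url, file_name)
    save(image_url, file_name) stores the image
    Returns (saved, failed): saved file names and the URLs that raised an error.
    """
    saved, failed = [], []
    for page in page_urls:
        try:
            details = get_detail(page)
        except Exception as err:
            print("Skipping page:", page, err)
            failed.append(page)
            continue
        for detail in details:
            try:
                address, name = get_img(detail)
                save(address, name)
                saved.append(name)
            except Exception as err:
                print("Skipping image:", detail, err)
                failed.append(detail)
            time.sleep(delay)  # pause between requests
    return saved, failed


# Stand-in functions so the sketch runs offline
def fake_detail(page):
    return [page + "-a", page + "-b"]


def fake_img(detail):
    if detail == "p2-a":
        raise ValueError("broken page")
    return "img://" + detail, detail + ".jpg"


saved, failed = crawl(["p1", "p2"], fake_detail, fake_img,
                      lambda address, name: None, delay=0)
print(saved, failed)
```

With the article's functions, the call would pass getDetail and getImg as the second and third arguments, together with a small function that calls urlretrieve to save each image into the download folder.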