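- Scenario: list pagination on a school's official website, plus batch image download
- Tools: requests + BeautifulSoup4 + random delays to avoid anti-scraping measures
- Output: images saved automatically to a specified folder, named `index-title.jpg`

I. Project Background

I came across an assignment from my student days and, on a whim, wrote it again from scratch. The key is still locating the right tags: once the page has been turned into a BeautifulSoup object, identify what distinguishes the target tags and drill down with successive `find` calls.

I split the work into three functions: fetching the page, parsing the page, and downloading the images. As long as there is a next page, the script keeps downloading. While looking for the next-page link, I noticed that the next page's URL is just the current URL with its last path segment replaced, which saved a big step. All that remains is to keep looping and checking whether a next page exists.

II. Overall Approach

| Step | Function | Purpose |
| --- | --- | --- |
| 1️⃣ | `getCpageNpage(url)` | Request the current page, parse the HTML, and extract the next-page link |
| 2️⃣ | `getImageUrl(soup)` | Parse the URL and title of every image on the current page |
| 3️⃣ | `downloadImage(page_url_dict, folder)` | Iterate over the dict and download each image into the given folder |
| Loop | `while url:` | Keep turning pages until there is no next page |

III. Full Code

```python
import os
import random
from time import sleep

import requests
from bs4 import BeautifulSoup


def getCpageNpage(url):
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/148.0.0.0 Safari/537.36 Edg/148.0.0.0'
    }
    page = requests.get(url=url, headers=headers)
    # Set the encoding explicitly, otherwise the text comes out garbled
    page.encoding = 'utf-8'
    soup = BeautifulSoup(page.text, 'html.parser')
    try:
        # A bare string as the second argument to find() is matched against the class attribute
        next_page_href = (soup.find('div', class_='right n_tupian')
                          .find('div', class_='pb_sys_common pb_sys_normal pb_sys_style1')
                          .find('span', 'p_next p_fun')
                          .find('a')['href'])
        # Relationship between the current URL and the next one: replacing
        # everything after the last "/" of the current URL gives the next-page URL
        replace_str = url.split('/')[-1]
        next_page_url = url.replace(replace_str, next_page_href)
    except Exception:
        # No next-page link could be found, so treat this as the last page
        next_page_url = None
    return soup, next_page_url


def getImageUrl(soup):
    div = soup.find('div', class_='right n_tupian')
    div_ul_li = div.find('ul').find_all('li')
    page_url_dict = {}
    for li in div_ul_li:
        title = li.find(class_='img').find('a')['title']
        src = li.find(class_='img').find('img')['src']
        page_url = 'https://www.gzgs.edu.cn/' + src
        # print(title, page_url)
        page_url_dict[page_url] = title  # image URL -> title
    return page_url_dict


def downloadImage(page_url_dict, folder='./images'):
    global index  # running image counter shared across pages
    os.makedirs(folder, exist_ok=True)
    for img in page_url_dict:
        response = requests.get(img)
        # print(response.content)
        print(f'Downloading image {index}: {page_url_dict[img]}, image URL: {img}')
        image_name = folder + '/' + str(index) + '-' + page_url_dict[img] + '.jpg'
        # Images must be written as a binary byte stream
        with open(image_name, 'wb') as f:
            f.write(response.content)
        index += 1
        # Short random pause between images
        sleep(round(random.uniform(0.5, 1), 2))


if __name__ == '__main__':
    url = 'school website URL'  # placeholder: the first list page
    folder = 'save path'        # placeholder: the output folder
    index = 1
    while url:
        print(url)
        try:
            soup, url = getCpageNpage(url)
        except Exception as e:
            print('Page request failed')
            print(e)
            # url was not updated, so looping again would retry the same page forever
            break
        try:
            page_url_dict = getImageUrl(soup)
        except Exception as e:
            page_url_dict = {}
            print('Failed to get image URLs')
            print(e)
        try:
            downloadImage(page_url_dict, folder)
        except Exception as e:
            print('Download failed')
            print(e)
        # Longer random pause between pages
        sleep(round(random.uniform(2, 4), 1))
```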
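Two steps in the script depend on string handling that can break on unusual input: building the next-page URL by replacing the last path segment, and using the image title directly as a file name. The sketch below shows one possible way to harden both. The helper names `build_next_url` and `safe_image_name` and the `example.com` URLs are hypothetical and not part of the original script, and it assumes the next-page `href` is relative to the current page, which matches the observation above but has not been checked against the real site.

```python
import re
from urllib.parse import urljoin


def build_next_url(current_url, next_href):
    """Resolve a next-page href against the current page URL.

    Assumption: the href is relative to the current page. urljoin then
    swaps out the last path segment, as the manual replacement does.
    """
    return urljoin(current_url, next_href)


def safe_image_name(index, title, ext=".jpg"):
    """Build an 'index-title.jpg' file name.

    Runs of characters that Windows forbids in file names, and runs of
    whitespace, are collapsed into a single underscore.
    """
    cleaned = re.sub(r'[\\/:*?"<>|\s]+', "_", title).strip("_")
    # Fall back to a fixed word if nothing usable is left of the title
    return f"{index}-{cleaned or 'untitled'}{ext}"


if __name__ == "__main__":
    # Hypothetical URLs, for illustration only
    print(build_next_url("https://example.com/xwzx/tpxw.htm", "tpxw/2.htm"))
    print(safe_image_name(3, "a/b:c"))
```

In the script, `build_next_url(url, next_page_href)` could stand in for the `split`/`replace` pair, and `os.path.join(folder, safe_image_name(index, title))` could stand in for the string concatenation that builds `image_name`.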