import asyncio

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    # Create a crawler instance
    crawler = PlaywrightCrawler(
        # headless=False,
        # browser_type='firefox',
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        data = {
            "request_url": context.request.url,
            "page_url": context.page.url,
            "page_title": await context.page.title(),
            "page_content": (await context.page.content())[:10000],
        }
        await context.push_data(data)

    await crawler.run(["https://crawlee.dev"])


if __name__ == "__main__":
    asyncio.run(main())

Example: scraping a website's title and content with Crawlee's PlaywrightCrawler

2. Requests

The first step of every web scraping job is to send a request to the target website and retrieve its content, usually as HTML. Python's Requests library was built for exactly this: true to its "HTTP for humans" motto, it greatly simplifies the process, which has made it the most downloaded Python package.

Features

Pros

Cons

Alternatives

httpx, urllib3, http.client, aiohttp

Installing Requests

To install the Requests library, use the Python package manager pip:

pip install requests

Code example

import requests

response = requests.get('https://api.example.com/data')
if response.status_code == 200:
    data = response.json()  # Parse the JSON response
    print(data)
else:
    print(f"Request failed with status code: {response.status_code}")

3. HTTPX

HTTPX is a newer-generation HTTP library that offers advanced features Requests lacks, such as async support and HTTP/2. Its core API stays closely aligned with Requests, so it is easy to pick up. Given its strong performance and extensibility, HTTPX is recommended not only for large projects but also for small ones, leaving room for requirements to grow later.

Features

Pros

Cons

Alternatives

Requests, aiohttp, urllib3, http.client

Installing HTTPX

To install the HTTPX library, use the Python package manager pip:

pip install httpx

Code example

import httpx
import asyncio

async def fetch_data():
    async with httpx.AsyncClient() as client:
        response = await client.get('https://api.example.com/data')
        if response.status_code == 200:
            data = response.json()  # Parse the JSON response
            print(data)
        else:
            print(f"Request failed with status code: {response.status_code}")

# Run the asynchronous function
asyncio.run(fetch_data())
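One of HTTPX's headline features is optional HTTP/2 support, which has to be enabled explicitly. A minimal sketch, assuming the optional extra is installed (pip install "httpx[http2]") and that the target server actually negotiates HTTP/2:

import httpx

# HTTP/2 needs the optional extra: pip install "httpx[http2]"
with httpx.Client(http2=True) as client:
    response = client.get("https://example.com")
    # Prints "HTTP/2" only if the server negotiated it; otherwise falls back to HTTP/1.1
    print(response.http_version)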

4. Beautiful Soup

Once you have the HTML content, you need a way to parse it and extract the data you are after. This is where Beautiful Soup, a popular Python HTML parser, comes in. It lets you easily navigate and search the HTML tree to pull out the data you care about. Its simple syntax and easy setup also make Beautiful Soup a great choice for small to medium web scraping projects and for web scraping beginners.

Features

Pros

Cons

Alternatives

lxml, html5lib

Installing Beautiful Soup

To install Beautiful Soup, use pip to install the package. For better parsing, we also recommend installing lxml or html5lib; these libraries work alongside beautifulsoup4 and noticeably improve how HTML content is parsed.

pip install beautifulsoup4 lxml
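Once installed, the parser is selected by name when constructing the soup. A minimal illustration, using deliberately malformed markup to show that different parsers may repair it differently:

from bs4 import BeautifulSoup

html = "<p>Hello<p>World"  # deliberately malformed markup

# The second argument names the parser to use
print(BeautifulSoup(html, "html.parser").prettify())  # Python's built-in parser
print(BeautifulSoup(html, "lxml").prettify())         # faster and more lenient, if installed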

Code example

from bs4 import BeautifulSoup
import httpx

# Send an HTTP GET request to the specified URL using the httpx library
response = httpx.get("https://news.ycombinator.com/news")

# Save the content of the response
yc_web_page = response.content

# Use the BeautifulSoup library to parse the HTML content of the webpage
soup = BeautifulSoup(yc_web_page, "lxml")  # use the lxml parser installed above

# Find all elements with the class "athing" (which represent articles on Hacker News) using the parsed HTML
articles = soup.find_all(class_="athing")

# Loop through each article and extract relevant data, such as the URL, title, and rank
for article in articles:
    data = {
        "URL": article.find(class_="titleline").find("a").get("href"),   # URL from the first <a> inside the "titleline" element
        "title": article.find(class_="titleline").getText(),             # title text of the "titleline" element
        "rank": article.find(class_="rank").getText().replace(".", ""),  # rank text with the trailing period removed
    }
    # Print the extracted data for the current article
    print(data)
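Beautiful Soup also accepts CSS selectors through select() and select_one(), which some readers find more compact than chained find() calls. A minimal sketch of the same extraction (URL and title only), under the same assumptions about the Hacker News markup:

import httpx
from bs4 import BeautifulSoup

# Same extraction as above, expressed with CSS selectors
soup = BeautifulSoup(httpx.get("https://news.ycombinator.com/news").content, "lxml")
for article in soup.select("tr.athing"):
    link = article.select_one(".titleline a")
    print({"URL": link.get("href"), "title": link.get_text()})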

5. MechanicalSoup

MechanicalSoup is a Python library that provides a higher-level abstraction over the Requests and Beautiful Soup libraries. It simplifies web scraping by combining the ease of use of Requests with Beautiful Soup's HTML parsing capabilities.

Features

Pros

Cons

Alternatives

Selenium, Playwright, Beautiful Soup

Installing MechanicalSoup

To install MechanicalSoup, run the following command in your terminal or command prompt:

pip install MechanicalSoup

Code example

import mechanicalsoup

# Create a MechanicalSoup browser instance
browser = mechanicalsoup.StatefulBrowser()

# Perform a GET request to a webpage
browser.open("https://example.com")

# Extract data using BeautifulSoup methods
page_title = browser.get_current_page().title.text

print("Page Title:", page_title)

6. Selenium

Selenium is a widely used web automation tool that lets developers interact with web browsers programmatically. It is typically used for testing web applications, but it is also a powerful option for web scraping, especially on JavaScript-rendered sites that load content dynamically.

Features

Pros

Cons

Alternatives

Playwright, MechanicalSoup, Crawlee, Scrapy

Installing Selenium

To install Selenium, run the following command in your terminal or command prompt:

pip install selenium

Code example

from selenium import webdriver
from selenium.webdriver.common.by import By

# Set up the WebDriver (using Chrome in this example)
driver = webdriver.Chrome()

# Navigate to a web page
driver.get("https://example.com")

# Interact with the page (e.g., click a button)
button = driver.find_element(By.ID, "submit")
button.click()

# Extract data
content = driver.page_source

# Close the browser
driver.quit()
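Grabbing page_source works once the page has finished loading, but for JavaScript-heavy sites it is safer to wait explicitly for the element you need. A minimal sketch using Selenium's explicit waits; the element id "submit" is the same placeholder as in the example above:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://example.com")

# Wait up to 10 seconds for the element to be present before clicking it
button = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "submit"))
)
button.click()

driver.quit()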

7. Playwright

Playwright is a modern web automation framework developed by Microsoft. It supports multiple browsers (Chromium, Firefox, and WebKit) through a single API and provides powerful capabilities for interacting with web pages. Playwright is popular for testing and automation thanks to its speed, reliability, and ability to handle complex web applications. Like Selenium, it is also a strong web scraping tool for sites that load content dynamically, as the multi-browser sketch after the code example shows.

Features

Pros

Cons

Alternatives

Selenium, Crawlee, Scrapy

Installing Playwright

To install Playwright, run the following command in your terminal or command prompt:

pip install playwright

Then install the required browser binaries:

playwright install

Code example

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com")

    # Interact with the page
    page.click('button#submit')

    # Extract data
    content = page.content()

    browser.close()
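Because the same API drives all three bundled engines, switching browsers only changes the launcher. A small sketch illustrating this; example.com is just a placeholder target:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # The same code drives all three bundled engines; only the launcher changes
    for browser_type in (p.chromium, p.firefox, p.webkit):
        browser = browser_type.launch(headless=True)
        page = browser.new_page()
        page.goto("https://example.com")
        print(browser_type.name, page.title())
        browser.close()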

8. Scrapy

Scrapy is a powerful and highly flexible Python framework built specifically for web scraping. Unlike Selenium and Playwright, which are often used for web automation, Scrapy is designed to scrape large amounts of data from websites in a structured and scalable way.

Features

Pros

Cons

Alternatives

Crawlee, Beautiful Soup, Selenium, Playwright

Installing Scrapy

To install Scrapy, run the following command in your terminal or command prompt:

pip install scrapy

Code example

import scrapy

class HackernewsSpiderSpider(scrapy.Spider):
    name = 'hackernews_spider'
    allowed_domains = ['news.ycombinator.com']
    start_urls = ['http://news.ycombinator.com/']

    def parse(self, response):
        articles = response.css('tr.athing')
        for article in articles:
            yield {
                "URL": article.css(".titleline a::attr(href)").get(),
                "title": article.css(".titleline a::text").get(),
                "rank": article.css(".rank::text").get().replace(".", "")
            }
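Spiders like this usually live inside a Scrapy project and are started with the scrapy crawl command, but they can also be run standalone with CrawlerProcess. A minimal sketch, assuming the spider class above is defined in the same file; the feed file name is arbitrary:

from scrapy.crawler import CrawlerProcess

# Run the spider without a full Scrapy project and write the yielded items
# to a JSON feed
process = CrawlerProcess(settings={"FEEDS": {"items.json": {"format": "json"}}})
process.crawl(HackernewsSpiderSpider)
process.start()  # blocks until the crawl finishes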

Which Python scraping library is right for you?

So which options should you weigh when picking a library for your web scraping project? The table below summarizes the features, main use cases, notable pros, and potential cons of every library covered in this article:

Library | Use case | Ease of use | Features | Pros | Cons | Alternatives
Crawlee | Large-scale scraping and browser automation | Easy | Automatic parallel crawling, proxy rotation, persistent queues | Easy setup, clean code, integrated features | New, limited tutorials | Scrapy, Playwright, Beautiful Soup
Requests | Making HTTP requests | Very easy | Simple API, SSL/TLS support, streaming | Large community, well documented | No async support, slower for performance-sensitive tasks | httpx, urllib3, aiohttp
HTTPX | HTTP requests with async support | Easy | Async support, HTTP/2, customizable transports | Non-blocking requests, modern standards | Steeper learning curve, smaller community | Requests, aiohttp, urllib3
Beautiful Soup | HTML/XML parsing | Very easy | Tree traversal, encoding handling, multiple parser support | Simple syntax, great for beginners | Limited scalability, no JavaScript support | lxml, html5lib
MechanicalSoup | Form handling, simple web scraping | Easy | Requests + Beautiful Soup integration, form submission | Simplified interface, session handling | Limited advanced features | Selenium, Playwright
Selenium | Browser automation, JavaScript-heavy sites | Moderate | Cross-browser handling of dynamic content | Simulates complex interactions, multi-language support | Slower, resource-intensive | Playwright, Crawlee, Scrapy
Playwright | Advanced browser automation | Moderate | Multi-browser support, auto-waiting, parallel execution | Handles JS-heavy sites, advanced features | Steeper learning curve, smaller community | Selenium, Crawlee, Scrapy
Scrapy | Large-scale web scraping | | Async, distributed crawling, extensibility | Efficient, handles complex scenarios | Steeper learning curve, heavier setup | Crawlee, Playwright, Selenium

Each of the tools covered here has its own place in an expert scraper's toolkit. Learning which one fits which task will give you the flexibility to pick the best tool for the job, so feel free to try every one of them before you decide!

Original article: https://blog.apify.com/what-are-the-best-python-web-scraping-libraries/
