
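A translated database of every article in the Journal of Rural Studies

(3,007 articles in total)

With a little typesetting afterwards, it is easy to read.

① Skim titles, keywords, etc. → ② find the papers you need → ③ click the link → ④ read the article closely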

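Inspiration

Picking papers quickly: Chinese vs. English

To find a paper relevant to your own research, you should first screen broadly to find suitable articles, then read them closely to study their arguments, research methods, and so on. Compared with English, we read Chinese faster and can pick out the articles we need more effectively from key information. This avoids spending a long time finishing an article only to discover that it has little to do with your research.

Hence the idea: if English papers could be batch-translated automatically into a database, that would not only make reading easier but also let you take in more information in the same amount of time and widen the breadth of literature screening, all without network latency, which keeps your train of thought flowing. I tried it at home over the winter break and the results seemed good; the implementation is described below.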

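Logical design

Path & framework

Realizing this idea takes two main steps overall: first, use Python to scrape the relevant data; second, call the Baidu Translate API to translate it automatically. The detailed flow is laid out in the figure below: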

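Physical design

Source code & implementation

1. Scraping the literature data

This run uses the Journal of Rural Studies as the test case; its URL is given below. The task is to scrape the information for every article the journal has published, from its founding to the present.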

https://www.journals.elsevier.com/journal-of-rural-studies/

# Import libraries
import requests as re  # note: this alias shadows the standard-library re module
from lxml import etree
import pandas as pd
import time

# Build the request headers
headers = {'user-agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'}

First, test on a single page to see what the XPath query returns.

url = 'https://www.sciencedirect.com/journal/journal-of-rural-studies/issues'
res = re.get(url, headers=headers).text
res = etree.HTML(res)
# Select the href of every issue link on the page
testdata = res.xpath("//a[@class='anchor js-issue-item-link text-m']/@href")
testdata

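It turned out that parsing the first-level page yielded only one sub-link, whereas in principle it should have yielded all of them. After several attempts I found that, by design, the first sub-link on the page is loaded with a GET request while the rest are loaded with POST requests. The issue list spans 2 pages, so for convenience I expanded every link in the browser, saved each page as an HTML file, and loaded those files instead.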
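To make the selector concrete, here is a minimal sketch of what that class-based query picks out of a saved page, written with only the standard library's `html.parser`. The HTML fragment, the `href` values, and the names `IssueLinkParser` and `extract_issue_links` are made up for illustration; the real page markup may differ.

```python
from html.parser import HTMLParser

# Hypothetical fragment standing in for a saved issue-list page.
SAMPLE_HTML = """
<ul>
  <li><a class="anchor js-issue-item-link text-m" href="/journal/example/vol/73">Volume 73</a></li>
  <li><a class="anchor js-issue-item-link text-m" href="/journal/example/vol/72">Volume 72</a></li>
  <li><a class="anchor other-link" href="/about">About</a></li>
</ul>
"""

ISSUE_CLASS = "anchor js-issue-item-link text-m"


class IssueLinkParser(HTMLParser):
    """Collect href values of <a> tags whose class attribute matches exactly."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if (tag == "a"
                and attributes.get("class") == ISSUE_CLASS
                and "href" in attributes):
            self.hrefs.append(attributes["href"])


def extract_issue_links(html_text):
    """Return the relative issue links found in html_text, in page order."""
    parser = IssueLinkParser()
    parser.feed(html_text)
    return parser.hrefs


issue_links = extract_issue_links(SAMPLE_HTML)
print(issue_links)
```

Like the XPath predicate `@class='...'`, this compares the whole class string, so a link carrying a different set of classes is skipped.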

html1 = etree.parse('G:\\Pythontest\\practice\\test1.html1', etree.HTMLParser())
html2 = etree.parse('G:\\Pythontest\\practice\\test2.html1', etree.HTMLParser())
data1 = html1.xpath("//a[@class='anchor js-issue-item-link text-m']/@href")
data2 = html2.xpath("//a[@class='anchor js-issue-item-link text-m']/@href")

# Merge the two lists of relative links (extend, not append, so LINKS stays flat)
LINKS = []
LINKS.extend(data1)
LINKS.extend(data2)

# Prefix each relative link with the site root
TLINKS = []
for i in LINKS:
    link = 'https://www.sciencedirect.com' + i
    TLINKS.append(link)

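TLINKS now holds the links to all first-level pages. Its length is 158, so the data was retrieved correctly. Next comes fetching all the second-level links; this is a good moment to go watch a livestream or something, since accessing an overseas site is a bit slow. When it finishes, there are 3,007 sub-links in total (that is, 3,007 articles).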

SUBLINKS = []
for n, link in enumerate(TLINKS):
    subres = re.get(link, headers=headers).text
    subres = etree.HTML(subres)
    sublinks = subres.xpath("//a[@class = 'anchor article-content-title u-margin-xs-top u-margin-s-bottom']/@href")
    # extend keeps SUBLINKS a flat list of relative article links
    SUBLINKS.extend(sublinks)
    print("Item", n, "OK")
    time.sleep(0.2)
print('ALL IS OK')

# Turn the relative article links into absolute URLs
LINKS = []
for i in SUBLINKS:
    link = 'https://www.sciencedirect.com' + i
    LINKS.append(link)
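Both link-building steps above depend on the collected hrefs forming one flat list of strings, because each XPath call returns a list per page. The sketch below shows one way to flatten such per-page lists and resolve them against the site root with `urllib.parse.urljoin`. The helper name `to_absolute_links` and the sample paths are hypothetical.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.sciencedirect.com"


def to_absolute_links(pages_of_hrefs, base_url=BASE_URL):
    """Flatten per-page href lists and resolve each href against base_url."""
    links = []
    for hrefs in pages_of_hrefs:      # one list of hrefs per scraped page
        for href in hrefs:            # each href is a relative path
            links.append(urljoin(base_url, href))
    return links


# Hypothetical XPath results: two pages with links, one page with none.
per_page = [
    ["/science/article/pii/AAA", "/science/article/pii/BBB"],
    [],
    ["/science/article/pii/CCC"],
]
absolute = to_absolute_links(per_page)
print(absolute)
```

Compared with plain string concatenation, `urljoin` also copes with hrefs that are already absolute, which it returns unchanged.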

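With the second-level links in hand, the next step is to analyze the structure of the third-level (article) pages, pick out the information needed, and store it in a dictionary.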

allinfo = []
for n, LINK in enumerate(LINKS):
    info = {}
    res = re.get(LINK, headers=headers).text
    res = etree.HTML(res)

    # Each XPath query returns a list of matching text nodes or attributes
    vol = res.xpath("//a[@title = 'Go to table of contents for this volume/issue']/text()")
    datainfo = res.xpath("//div[@class = 'text-xs']/text()")
    timu = res.xpath("//span[@class = 'title-text']/text()")  # timu = article title
    givenname = res.xpath("//span[@class='text given-name']/text()")
    surname = res.xpath("//span[@class='text surname']/text()")
    web = res.xpath("//a[@class='doi']/@href")
    abstract = res.xpath("//p[@id='abspara0010']/text()")
    keywords = res.xpath("//div[@class='keyword']/span/text()")
    highlights = res.xpath("//dd[@class='list-description']/p/text()")

    info['vol'] = vol
    info['datainfo'] = datainfo
    info['timu'] = timu
    info['givenname'] = givenname
    info['surname'] = surname
    info['web'] = web
    info['abstract'] = abstract
    info['keywords'] = keywords
    info['highlights'] = highlights
    allinfo.append(info)
    print("Item", n, "IS FINISHED, overall progress:", (n + 1) / len(LINKS))

df = pd.DataFrame(allinfo)
df
# Note: newer pandas releases no longer write the legacy .xls format;
# use an .xlsx path if this call fails
df.to_excel(r'G:\PythonStudy\practice1\test.xls', sheet_name='sheet1')

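This completes the scraping, giving a DataFrame that holds the information for every article.

2. Data cleaning

Remove superfluous characters from the data and split apart information that was merged during scraping, producing a DataFrame ready for translation.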

# Remove superfluous characters
data = df.copy()
# Each cell holds a list, so convert it to text first, then strip the
# brackets and quotes as literal characters (regex=False)
for col in ['abstract', 'datainfo', 'givenname', 'highlights', 'keywords',
            'surname', 'timu', 'vol', 'web']:
    data[col] = (data[col].astype(str)
                 .str.replace('[', '', regex=False)
                 .str.replace(']', '', regex=False)
                 .str.replace('\'', '', regex=False))

# Split the merged information
data['date'] = data['datainfo'].str.split(',').str.get(1)
data['page'] = data['datainfo'].str.split(',').str.get(2)
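As a hedged alternative to stripping brackets from stringified lists, the scraped lists can be joined into text directly, and the issue line split into named parts. The sketch below uses plain Python on one made-up record. The helper names `join_items` and `split_issue_info`, and the assumed "volume, date, pages" layout of `datainfo`, are illustrative assumptions and not confirmed by the source.

```python
def join_items(items, sep="; "):
    """Join scraped text nodes into one string, trimming and dropping empties."""
    return sep.join(s.strip() for s in items if s and s.strip())


def split_issue_info(datainfo):
    """Split an assumed 'volume, date, pages' string into labelled parts."""
    parts = [p.strip() for p in datainfo.split(",")]
    parts += [""] * (3 - len(parts))   # pad so short inputs do not raise
    return {"volume": parts[0], "date": parts[1], "page": parts[2]}


# Hypothetical record shaped like one row of the scraped data.
record = {
    "keywords": ["Rural development", " Migration ", ""],
    "datainfo": "Volume 73, January 2020, Pages 1-9",
}

keywords = join_items(record["keywords"])
issue = split_issue_info(record["datainfo"])
print(keywords)
print(issue)
```

Joining at the list level keeps apostrophes and brackets that belong to the text itself, which character stripping would delete.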

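3. Batch-translating the key fields

Once the DataFrame holds all the article information, the Baidu Translate API is called to translate it in batches. It is worth reading the official technical documentation, which describes the required request parameters in detail.

https://api.fanyi.baidu.com/doc/21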

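Field	Type	Required	Description	Notes
q	TEXT	Y	Query text to translate	UTF-8 encoded
from	TEXT	Y	Source language	zh = Chinese, en = English
to	TEXT	Y	Target language	zh = Chinese, en = English
salt	TEXT	Y	Random number	
appid	TEXT	Y	APP ID	Apply for your own
sign	TEXT	Y	Signature	MD5 of appid+q+salt+secret key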
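The signature row is the least obvious part of the table, so here is a minimal sketch of how the parameters fit together, following the concatenation order the table gives (appid + q + salt + secret key, then MD5). The helper names `make_sign` and `build_params` and the credential values are placeholders, not real keys.

```python
import hashlib


def make_sign(appid, query, salt, secret_key):
    """MD5 hex digest of appid + q + salt + secret key, encoded as UTF-8."""
    raw = f"{appid}{query}{salt}{secret_key}"
    return hashlib.md5(raw.encode("utf-8")).hexdigest()


def build_params(appid, query, salt, secret_key, from_lang="en", to_lang="zh"):
    """Assemble the six request parameters listed in the table."""
    return {
        "q": query,
        "from": from_lang,
        "to": to_lang,
        "salt": salt,
        "appid": appid,
        "sign": make_sign(appid, query, salt, secret_key),
    }


# Placeholder credentials for illustration only.
params = build_params("my-app-id", "rural development", "12345", "my-secret")
print(params["sign"])
```

The signature is computed from the raw query text; URL encoding is applied only when the request is sent, which `requests` does automatically for a `params` dictionary.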

# Import the required libraries
import hashlib
import random
import requests as re

# Build the automatic translation function translateBaidu
def translateBaidu(content):
    appid = '20200119000376***'
    secretKey = 'd7SAX0xhIHEEYQ7qp***'
    url = 'http://api.fanyi.baidu.com/api/trans/vip/translate'
    fromLang = 'en'
    toLang = 'zh'
    salt = str(random.randint(32555,65333))
    try:
        # Signature: MD5 of appid + q + salt + secret key
        sign = appid + content + salt + secretKey
        sign = hashlib.md5(sign.encode('utf-8')).hexdigest()
        params = {
            'appid' : appid,
            'q' : content,
            'from' : fromLang,
            'to' : toLang,
            'salt' : salt,
            'sign' : sign
        }
        res = re.get(url, params=params)
        jres = res.json()
        # After parsing the JSON, inspect its structure and pull out the translation
        dst = str(jres['trans_result'][0]['dst'])
        return dst

    except Exception as e:
        # On failure, print the error and return None
        print(e)
        return None

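After building the function, a quick test shows that results come back correctly. When the input is empty, it outputs 'trans_result', which is the KeyError raised because the response contains no such key.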
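The empty-input case suggests checking the response before indexing into it. Below is a minimal sketch of such a check on already-parsed response dictionaries. The helper name `extract_translation` is hypothetical, and the error payload shape (`error_code`, `error_msg`) and the possibility of several `trans_result` entries are assumptions based on my reading of the Baidu documentation, so verify them against the official page.

```python
def extract_translation(payload):
    """Return the translated text from a parsed response, or None on failure."""
    if not isinstance(payload, dict) or "error_code" in payload:
        return None
    results = payload.get("trans_result") or []
    # Join every returned segment instead of keeping only the first one.
    pieces = [item.get("dst", "") for item in results]
    return "\n".join(pieces) if pieces else None


# Sample payloads written by hand for illustration.
ok_payload = {
    "from": "en",
    "to": "zh",
    "trans_result": [{"src": "apple", "dst": "苹果"}],
}
error_payload = {"error_code": "54001", "error_msg": "Invalid Sign"}

print(extract_translation(ok_payload))
print(extract_translation(error_payload))
```

Returning None for every failure makes it easy to find untranslated rows later and retry only those.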

Everything is now in place: all that remains is to run the scraped article data through translateBaidu and build a new DataFrame.

# Create the new columns in the DataFrame
data['trans-timu'] = 'NULL'
data['trans-keywords'] = 'NULL'
data['trans-abstract'] = 'NULL'
data['trans-highlights'] = 'NULL'

# Translate and assign (.loc avoids chained-assignment problems)
for i in range(len(data)):
    data.loc[i, 'trans-timu'] = translateBaidu(data.loc[i, 'timu'])
    data.loc[i, 'trans-keywords'] = translateBaidu(data.loc[i, 'keywords'])
    data.loc[i, 'trans-abstract'] = translateBaidu(data.loc[i, 'abstract'])
    data.loc[i, 'trans-highlights'] = translateBaidu(data.loc[i, 'highlights'])
    # Per the documentation, no more than 10 requests per second
    time.sleep(0.5)
print('ALL FINISHED')
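The loop above sends four requests back to back and then pauses. If the 10-requests-per-second limit matters, one option is to space out every single call instead. The following is a minimal sketch of such a throttle; the `Throttle` class and its injectable `clock` and `sleep` parameters are illustrative names, not part of any library.

```python
import time


class Throttle:
    """Enforce a minimum interval between successive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# At most 10 calls per second means at least 0.1 s between calls.
throttle = Throttle(0.1)
for text in ["title", "keywords", "abstract"]:
    throttle.wait()
    print("would translate:", text)
```

Calling `throttle.wait()` immediately before each `translateBaidu` call would keep the request rate under the limit no matter how many fields are translated per row.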

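Let's take a look at the translation results.

Finally, the data is written to a database through an ODBC interface, and the save succeeds. From now on, running the program every so often before going to bed keeps the literature database up to date. The same recipe can be applied to the other journals you read regularly, making it easy to keep track of new publications.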
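The source does not show the database code, so the following is only a sketch of the incremental-update idea. It swaps ODBC for the standard library's `sqlite3` to stay self-contained, and the table layout, the use of the DOI as primary key, and the sample rows are all assumptions.

```python
import sqlite3


def save_articles(conn, articles):
    """Insert articles keyed by DOI, skipping existing ones; return rows added."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS articles ("
        "doi TEXT PRIMARY KEY, title TEXT, title_zh TEXT)"
    )
    before = conn.execute("SELECT COUNT(*) FROM articles").fetchone()[0]
    conn.executemany(
        "INSERT OR IGNORE INTO articles (doi, title, title_zh) VALUES (?, ?, ?)",
        [(a["doi"], a["title"], a["title_zh"]) for a in articles],
    )
    conn.commit()
    after = conn.execute("SELECT COUNT(*) FROM articles").fetchone()[0]
    return after - before


# Placeholder rows; the DOIs and titles are invented for illustration.
a = {"doi": "10.0000/example.1", "title": "Paper one", "title_zh": "论文一"}
b = {"doi": "10.0000/example.2", "title": "Paper two", "title_zh": "论文二"}
c = {"doi": "10.0000/example.3", "title": "Paper three", "title_zh": "论文三"}

conn = sqlite3.connect(":memory:")
first_run = save_articles(conn, [a, b])
second_run = save_articles(conn, [b, c])   # b is already stored
print(first_run, second_run)
```

Because already-stored DOIs are ignored, a later run only needs to scrape and translate the issues published since the previous run.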

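Quality assessment

Machine translation vs. human translation

After the translation finished, I was still a little worried about the quality of Baidu's machine translation (the Google API is somewhat hard to get working), so I randomly sampled a few records to check it. Well, after a rough look, it seems better than what I would have produced myself (joking).

[Keyword translation accuracy > title translation accuracy > abstract > highlights]

Even so, for a quick skim it holds up fine: the general meaning comes through and comprehension is not affected.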
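The spot check described above can be made repeatable by drawing the sample with a fixed seed. This is a small sketch, assuming the 3,007 rows are indexed from 0; the function name `sample_rows_for_review` and the seed value are arbitrary choices.

```python
import random


def sample_rows_for_review(n_rows, k, seed=0):
    """Pick k distinct row indices out of n_rows, reproducibly for a given seed."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_rows), k))


picked = sample_rows_for_review(3007, 5)
print(picked)
```

The returned indices can then be used to pull the original and translated fields side by side for manual comparison.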

Consolidated code

# Import the required libraries
import requests as re  # note: this alias shadows the standard-library re module
from lxml import etree
import pandas as pd
import time
import hashlib
import random

# Build the request headers
headers = {'user-agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'}



# Get the first-level page links
html1 = etree.parse('G:\\Pythontest\\practice\\test1.html1', etree.HTMLParser())
html2 = etree.parse('G:\\Pythontest\\practice\\test2.html1', etree.HTMLParser())
data1 = html1.xpath("//a[@class='anchor js-issue-item-link text-m']/@href")
data2 = html2.xpath("//a[@class='anchor js-issue-item-link text-m']/@href")
# extend (not append) keeps LINKS a flat list of relative links
LINKS = []
LINKS.extend(data1)
LINKS.extend(data2)
TLINKS = []
for i in LINKS:
    link = 'https://www.sciencedirect.com' + i
    TLINKS.append(link)



# Get the second-level page links
SUBLINKS = []
for n, link in enumerate(TLINKS):
    subres = re.get(link, headers=headers).text
    subres = etree.HTML(subres)
    sublinks = subres.xpath("//a[@class = 'anchor article-content-title u-margin-xs-top u-margin-s-bottom']/@href")
    # extend keeps SUBLINKS a flat list of relative article links
    SUBLINKS.extend(sublinks)
    print("Item", n, "OK")
    time.sleep(0.2)
print('ALL IS OK')

# Turn the relative article links into absolute URLs
LINKS = []
for i in SUBLINKS:
    link = 'https://www.sciencedirect.com' + i
    LINKS.append(link)



# Get the data from the third-level pages
allinfo = []
for n, LINK in enumerate(LINKS):
    info = {}
    res = re.get(LINK, headers=headers).text
    res = etree.HTML(res)

    # Each XPath query returns a list of matching text nodes or attributes
    vol = res.xpath("//a[@title = 'Go to table of contents for this volume/issue']/text()")
    datainfo = res.xpath("//div[@class = 'text-xs']/text()")
    timu = res.xpath("//span[@class = 'title-text']/text()")  # timu = article title
    givenname = res.xpath("//span[@class='text given-name']/text()")
    surname = res.xpath("//span[@class='text surname']/text()")
    web = res.xpath("//a[@class='doi']/@href")
    abstract = res.xpath("//p[@id='abspara0010']/text()")
    keywords = res.xpath("//div[@class='keyword']/span/text()")
    highlights = res.xpath("//dd[@class='list-description']/p/text()")

    # Organize the fields in the dictionary
    info['vol'] = vol
    info['datainfo'] = datainfo
    info['timu'] = timu
    info['givenname'] = givenname
    info['surname'] = surname
    info['web'] = web
    info['abstract'] = abstract
    info['keywords'] = keywords
    info['highlights'] = highlights
    allinfo.append(info)
    print("Item", n, "IS FINISHED, overall progress:", (n + 1) / len(LINKS))

# Save the data to an Excel file
df = pd.DataFrame(allinfo)
df
# Note: newer pandas releases no longer write the legacy .xls format;
# use an .xlsx path if this call fails
df.to_excel(r'G:\PythonStudy\practice1\test.xls', sheet_name='sheet1')



# Initial data cleaning
data = df.copy()
# Each cell holds a list, so convert it to text first, then strip the
# brackets and quotes as literal characters (regex=False)
for col in ['abstract', 'datainfo', 'givenname', 'highlights', 'keywords',
            'surname', 'timu', 'vol', 'web']:
    data[col] = (data[col].astype(str)
                 .str.replace('[', '', regex=False)
                 .str.replace(']', '', regex=False)
                 .str.replace('\'', '', regex=False))

# Split the merged information
data['date'] = data['datainfo'].str.split(',').str.get(1)
data['page'] = data['datainfo'].str.split(',').str.get(2)



# Build the automatic translation function translateBaidu
def translateBaidu(content):
    appid = '20200119000376***'
    secretKey = 'd7SAX0xhIHEEYQ7qp***'
    url = 'http://api.fanyi.baidu.com/api/trans/vip/translate'
    fromLang = 'en'
    toLang = 'zh'
    salt = str(random.randint(32555,65333))

    try:
        # Signature: MD5 of appid + q + salt + secret key
        sign = appid + content + salt + secretKey
        sign = hashlib.md5(sign.encode('utf-8')).hexdigest()
        params = {
            'appid' : appid,
            'q' : content,
            'from' : fromLang,
            'to' : toLang,
            'salt' : salt,
            'sign' : sign
        }
        res = re.get(url, params=params)
        jres = res.json()
        # After parsing the JSON, inspect its structure and pull out the translation
        dst = str(jres['trans_result'][0]['dst'])
        return dst

    except Exception as e:
        # On failure, print the error and return None
        print(e)
        return None



# Create the new columns in the DataFrame
data['trans-timu'] = 'NULL'
data['trans-keywords'] = 'NULL'
data['trans-abstract'] = 'NULL'
data['trans-highlights'] = 'NULL'

# Translate and assign (.loc avoids chained-assignment problems)
for i in range(len(data)):
    data.loc[i, 'trans-timu'] = translateBaidu(data.loc[i, 'timu'])
    data.loc[i, 'trans-keywords'] = translateBaidu(data.loc[i, 'keywords'])
    data.loc[i, 'trans-abstract'] = translateBaidu(data.loc[i, 'abstract'])
    data.loc[i, 'trans-highlights'] = translateBaidu(data.loc[i, 'highlights'])
    # Per the documentation, no more than 10 requests per second
    time.sleep(0.5)
print('ALL FINISHED')

# Save the file
data.to_excel(r'G:\PythonStudy\practice1\test.xls', sheet_name='sheet1')

This article is reposted from the WeChat official account @OCD Planners.
