Scraping CNN News - UUpython

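This Python script scrapes news links from the CNN website, uses OpenAI's GPT-3.5 model (gpt-3.5-turbo) to summarize each article, and saves the summaries to a text file. It uses the requests library to send HTTP requests, BeautifulSoup to parse the HTML, and the OpenAI API to generate the summaries.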

The main steps of the script are as follows:

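Import the required modules:

os: works with file system paths.
datetime: gets the current date.
requests: sends HTTP requests and retrieves page content.
BeautifulSoup (from the bs4 package): parses the HTML of each page.
urljoin (from urllib.parse): resolves relative links into absolute URLs.
openai: calls the OpenAI API.
time: throttles the request rate.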
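Of the modules listed, urljoin is probably the least self-explanatory, so here is a minimal sketch of what it does with different kinds of href values. The helper name resolve and the example paths are hypothetical, chosen only for illustration; they are not taken from the CNN page.

```python
from urllib.parse import urljoin

BASE_URL = "https://edition.cnn.com/"


def resolve(href, base=BASE_URL):
    """Turn an href found on the page into an absolute URL."""
    return urljoin(base, href)


# Hypothetical href values of the kinds a home page can contain
print(resolve("/world/example-story"))       # root-relative path
print(resolve("https://example.com/story"))  # already absolute, returned unchanged
print(resolve("//example.com/video"))        # scheme-relative, inherits https
```

Because absolute links pass through unchanged, the script can call urljoin on every href without first checking which kind it is.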

Set the OpenAI API key:

Assign your OpenAI API key to openai.api_key near the top of the script.
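The script writes the key directly into the source file. As an alternative that keeps the key out of the code, the sketch below reads it from an environment variable. The helper load_api_key is hypothetical, and the variable name OPENAI_API_KEY is an assumption (it is a commonly used name, but any name works as long as you set it yourself).

```python
import os


def load_api_key(env_var="OPENAI_API_KEY"):
    """Return the API key stored in the given environment variable.

    Raises RuntimeError when the variable is missing or blank, so the
    script stops before it sends any request.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key
```

In the script this would replace the hard-coded assignment with openai.api_key = load_api_key().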

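Get the current date and create the folder:

Call datetime.date.today() to get the current date.
Build the folder path from a base directory plus the date, then create it with os.makedirs (exist_ok=True makes reruns on the same day safe).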
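The folder step can be tried in isolation. The sketch below assumes a temporary directory as the base instead of the script's hard-coded desktop path, and the helper name make_daily_folder is hypothetical.

```python
import datetime
import os
import tempfile


def make_daily_folder(base_dir, day):
    """Create (if needed) and return base_dir/<YYYY-MM-DD>."""
    # str() on a date yields its ISO form, e.g. 2024-01-15
    folder = os.path.join(base_dir, str(day))
    os.makedirs(folder, exist_ok=True)  # no error if it already exists
    return folder


base = tempfile.mkdtemp()  # stand-in for the real base directory
folder = make_daily_folder(base, datetime.date(2024, 1, 15))
print(os.path.basename(folder))  # 2024-01-15
```

Calling make_daily_folder a second time with the same date returns the same path without raising, which is why the script can be run more than once per day.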

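Build the file path and the initial URL:

Build the path of the output file inside the dated folder.
Set the start URL to the CNN home page, https://edition.cnn.com/.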

Send the request and parse the HTML:

Fetch the HTML of the CNN home page with requests.get.
Parse it with BeautifulSoup and locate the container element that holds the news links.

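Iterate over the links and generate summaries:

Loop over the news links found in the container.
For each link, send an HTTP request, fetch the article page, and parse its HTML.
Extract the article text, build the user prompt, and call OpenAI's ChatCompletion interface to generate the summary.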
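The prompt-building step can be sketched on its own. The version below goes slightly beyond the script: it collapses whitespace, skips articles with no text, and truncates long articles. These are additions for illustration, not part of the original code. The helper name build_prompt and the 6000-character limit are assumptions; the prompt wording is the script's own.

```python
import re

# The script's prompt: "Summarize the following article: ... Summary:"
PROMPT_TEMPLATE = "摘要以下文章内容：\n{content}\n摘要："


def build_prompt(paragraphs, max_chars=6000):
    """Join article text blocks into one prompt, or return None if empty.

    max_chars is an assumed safety limit that keeps very long articles
    from exceeding the model's context window.
    """
    text = " ".join(p for p in paragraphs if p)
    text = re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace
    if not text:
        return None  # nothing to summarize, caller should skip the link
    return PROMPT_TEMPLATE.format(content=text[:max_chars])
```

Returning None for an empty article lets the loop skip the link before it spends an API call on a prompt with no content.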

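Throttle the request rate:

Call time.sleep before each API request. The script waits 60 / 3 = 20 seconds, which keeps it within 3 requests per minute.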
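A fixed sleep waits the full 20 seconds even when fetching the article has already taken part of that time. As a variation (not what the script does), the sketch below waits only for whatever remains of the interval. The class name RateLimiter and its clock and sleep parameters are hypothetical; the parameters exist so the behaviour can be checked without real waiting.

```python
import time


class RateLimiter:
    """Space calls at least 60 / requests_per_minute seconds apart."""

    def __init__(self, requests_per_minute, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / requests_per_minute
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first

    def wait(self):
        """Block until the minimum interval since the last call has passed."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

In the loop, you would create limiter = RateLimiter(3) once before iterating and call limiter.wait() immediately before each API request.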

Write the summaries to the file and print them:

Write each generated summary to the text file.
Print each summary to the console.

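Before running the script, install the required libraries (requests, beautifulsoup4, which provides the bs4 module, and openai) and fill in your API key. The script calls openai.ChatCompletion, which was removed in version 1.0 of the openai package, so install an earlier release (for example with pip install "openai<1"). Also avoid hitting the website or the OpenAI API too frequently, so that you stay within their terms of use.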

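import os
import datetime
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import openai
import time

# Set your OpenAI API key here
openai.api_key = ''

# Get the current date
current_date = datetime.date.today()

# Build the output folder path ("C:/桌面/每日新闻" means "C:/Desktop/Daily News")
folder_path = os.path.join("C:/桌面/每日新闻", str(current_date))

# Create the folder if it does not already exist
os.makedirs(folder_path, exist_ok=True)

# Build the output file path ("CNN新闻.txt" means "CNN News.txt")
file_path = os.path.join(folder_path, "CNN新闻.txt")

url = "https://edition.cnn.com/"

# Fetch the home page and fail early on an error status code
response = requests.get(url, timeout=30)
response.raise_for_status()
html_content = response.content

soup = BeautifulSoup(html_content, "html.parser")

# A multi-word class_ string matches the exact class attribute value,
# so update it if CNN changes its markup
container = soup.find(class_="container__field-links container_ribbon__field-links")

if container:
    links = container.find_all("a")

    # Open the output file for writing
    with open(file_path, "w", encoding="utf-8") as file:
        # Visit each link in turn
        for link in links:
            href = link.get("href")
            if not href:
                continue  # skip anchors without an href
            full_link = urljoin(url, href)

            try:
                response = requests.get(full_link, timeout=30)
                response.raise_for_status()  # raise on 4xx/5xx status codes
                html = BeautifulSoup(response.content, "html.parser")

                articles = html.find_all(class_="article__content")

                # find_all returns an empty list (never None) when nothing matches
                if not articles:
                    continue

                content = ' '.join([article.get_text() for article in articles])

                # Prompt text: "Summarize the following article: ... Summary:"
                user_input = f"摘要以下文章内容：\n{content}\n摘要："

                # Throttle the request rate
                time_between_requests = 60 / 3  # 3 RPM
                time.sleep(time_between_requests)

                # Legacy interface: requires openai<1.0
                summary_response = openai.ChatCompletion.create(
                    model="gpt-3.5-turbo",
                    messages=[
                        {"role": "system", "content": "You are a helpful assistant."},
                        {"role": "user", "content": user_input}
                    ],
                    temperature=1,
                    max_tokens=256,
                )

                summary = summary_response.choices[0].message['content'].strip()

                # Write the summary to the file
                file.write(summary + "\n\n")

                # Print the summary
                print(summary)
                print('---------------------------------------------------------------------------------')

            except requests.RequestException as e:
                print(f"Request failed: {e}")
            except openai.error.OpenAIError as e:
                # Keep going if a single summary request fails
                print(f"OpenAI API error: {e}")

    print("Finished writing the file!")
else:
    print("Link container not found; the page markup may have changed.")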
