Please write code that scrapes listed companies' annual report files for 2006–2023 from a website, converts the doc and PDF files to txt, counts the total frequency of each keyword, and writes the results to an Excel sheet, where the first column is the company's stock code, the second column is the year (from 2006 to 2023), and the remaining columns hold the frequency of each keyword.
Posted: 2024-12-24 17:18:06
This task involves several steps: web scraping to fetch the data, file conversion, text processing, and aggregation. Since this is a complex request, here is a Python sketch that uses the BeautifulSoup library to scrape the page, `pdfplumber` to extract text from PDFs, `python-docx` to handle Word documents, and finally pandas to store the results. Note that real-world use may require a legitimate API or permission from the site; this is only a demonstration:
```python
import requests
from bs4 import BeautifulSoup
import pdfplumber
from docx import Document
import pandas as pd

KEYWORDS = ['keyword1', 'keyword2']  # replace with the actual keywords

# --- web scraping ---
def get_annual_report(url):
    response = requests.get(url, timeout=30)
    soup = BeautifulSoup(response.text, 'html.parser')
    # Assumes the report links share a CSS class; adjust to the real page structure
    return [a.get('href') for a in soup.find_all('a', class_='annual-report-link')]

# --- file conversion ---
def convert_to_txt(file_path, output_file):
    if file_path.endswith('.pdf'):
        with pdfplumber.open(file_path) as pdf:
            # extract_text() is per page and can return None for image-only pages
            text = '\n'.join(page.extract_text() or '' for page in pdf.pages)
    elif file_path.endswith(('.doc', '.docx')):
        # python-docx only reads .docx; convert legacy .doc files first (e.g. via LibreOffice)
        doc = Document(file_path)
        text = '\n'.join(para.text for para in doc.paragraphs)
    else:
        raise ValueError(f"Unsupported file format: {file_path}")
    with open(output_file, 'w', encoding='utf-8') as f:
        f.write(text)

# --- keyword counting ---
def count_keywords(txt_file, keywords):
    with open(txt_file, 'r', encoding='utf-8') as f:
        text = f.read()
    # Substring counting; Chinese text has no spaces, so whitespace tokenizers don't apply
    return {kw: text.count(kw) for kw in keywords}

# --- main program ---
years = range(2006, 2024)
stock_codes = ['SH600001']  # replace with the real stock codes
output_excel = 'output.xlsx'

rows = []
for code in stock_codes:
    for year in years:
        report_url = f"https://example.com/{code}-report-{year}"
        for link in get_annual_report(report_url):
            if link.endswith(('.pdf', '.doc', '.docx')):
                # Download the report before converting it
                local_file = 'report' + link[link.rfind('.'):]
                with open(local_file, 'wb') as f:
                    f.write(requests.get(link, timeout=60).content)
                convert_to_txt(local_file, 'temp.txt')
                freqs = count_keywords('temp.txt', KEYWORDS)
                rows.append({'股票代码': code, '年份': year, **freqs})

# One row per (stock code, year), one column per keyword
df = pd.DataFrame(rows, columns=['股票代码', '年份'] + KEYWORDS)
df.to_excel(output_excel, index=False)
```
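The scraper returns raw links, and only PDF/Word links should reach the converter. A minimal stdlib sketch of that filtering step, assuming the file name can be read off the URL path (the URLs and helper names here are illustrative, not from the original site):

```python
import os
from urllib.parse import urlparse

def local_name(url):
    """Derive a local filename from a report URL (hypothetical URL scheme)."""
    return os.path.basename(urlparse(url).path)

def is_report(url):
    """Keep only links that point at PDF or Word files."""
    return local_name(url).lower().endswith(('.pdf', '.doc', '.docx'))

links = [
    'https://example.com/files/SH600001-2006.pdf',
    'https://example.com/about.html',
    'https://example.com/files/SH600001-2007.docx',
]
report_links = [u for u in links if is_report(u)]
print(local_name(report_links[0]))  # prints SH600001-2006.pdf
```

The check is case-insensitive, so links ending in `.PDF` are also kept.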
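For Chinese filings, whitespace tokenizers such as `nltk.word_tokenize` will not split words, so counting each keyword as a substring is a simpler baseline. A stdlib sketch of the counting step and the requested table layout (the sample text and keywords below are made up for illustration):

```python
def count_keywords(text, keywords):
    """Count raw substring occurrences of each keyword in the text."""
    return {kw: text.count(kw) for kw in keywords}

keywords = ['创新', '研发']
sample = '公司坚持创新驱动,加大研发投入;研发团队持续创新。'
freqs = count_keywords(sample, keywords)
print(freqs)  # prints {'创新': 2, '研发': 2}

# One row per (stock code, year), one column per keyword,
# matching the Excel layout the question asks for
row = {'股票代码': 'SH600001', '年份': 2006, **freqs}
```

Substring counting will also match a keyword inside a longer word; if that matters, a Chinese segmenter such as `jieba` could be used instead.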