df =pd.read_excel("D:\HBVdata\HBV_P.xlsx") smile_list =df['Smiles'].tolist() mols =[Chem.MolFromSmiles(smile) for smile in smile_list] fingerprints =[Chem.RDKFingerprint(mol) for mol in mols] dg =pd.read_csv("D:\HBVdata\hbvfrag_recap_delete_duplicate.csv") smi_list =dg['mol'].tolist() submols =[Chem.MolFromSmarts(smi) for smi in smi_list] count_dict = {} for submol_idx, submol in enumerate(submols): count = 0 for mol in mols: if mol.HasSubstructMatch(submol): count += 1 smi = smi_list[submol_idx] count_dict[smi] = count counts =[count_dict.values()] total_count =sum(count_dict.values()) freq =[num / total_count for num in counts]
For this code, how do I divide each value in the list by the sum of all the values?
Posted: 2024-02-28 12:56:15
The code can be written as follows:
```python
counts = list(count_dict.values())
total_count = sum(counts)
freq = [num / total_count for num in counts]
```
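The first line extracts the dictionary values (the occurrence counts) into a flat list, `counts`. `sum()` then adds up all the elements of that list, and the list comprehension divides each element by the total to produce the frequency list `freq`. In the original code, `counts =[count_dict.values()]` wraps the entire values view in a one-element list, so the comprehension tries to divide the view object itself by a number and raises a `TypeError`; `list(count_dict.values())` avoids that.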
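As a slightly more defensive variant of the same normalization, the sketch below keeps each fragment paired with its frequency and guards against a zero total. The helper name `normalize_counts` and the sample counts are made up for illustration, and returning `0.0` when nothing matched is an assumption about the desired behavior, not something the original code specifies.

```python
def normalize_counts(count_dict):
    """Return {key: count / total}; all zeros if the total is zero."""
    total = sum(count_dict.values())
    if total == 0:
        # Assumption: report 0.0 everywhere instead of raising ZeroDivisionError
        return {key: 0.0 for key in count_dict}
    return {key: count / total for key, count in count_dict.items()}


# Hypothetical fragment counts, standing in for the real count_dict
example_counts = {"c1ccccc1": 6, "C(=O)N": 3, "CO": 1}
freq_by_fragment = normalize_counts(example_counts)
print(freq_by_fragment)
```

Dividing by the sum of all counts gives each fragment's share of the total matches. If the quantity you want is instead the fraction of molecules that contain each fragment, the divisor would be `len(mols)`, which is a different statistic.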
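Related questions
df =pd.read_excel("D:\HBVdata\HBV_P.xlsx") smile_list =df['Smiles'].tolist() mols =[Chem.MolFromSmiles(smile) for smile in smile_list] fingerprints =[Chem.RDKFingerprint(mol) for mol in mols] dg =pd.read_csv("D:\HBVdata\hbvfrag_recap_delete_duplicate.csv") smi_list =dg['mol'].tolist() submols =[Chem.MolFromSmarts(smi) for smi in smi_list] count =0 for submol in submols: for mol in mols: if mol.HasSubstructMatch(submol): count +=1 print(count)
How can this code be changed so that it outputs a count for each molecule?
You can replace the single counter `count` with a dictionary whose keys are the molecules' SMILES strings and whose values are the number of substructures matched by each molecule. To do that, make the loop over molecules the outer loop and reset the counter for every molecule: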
```python
import pandas as pd
from rdkit import Chem

# Raw strings keep the backslashes in the Windows paths literal
df = pd.read_excel(r"D:\HBVdata\HBV_P.xlsx")
smile_list = df['Smiles'].tolist()
mols = [Chem.MolFromSmiles(smile) for smile in smile_list]
fingerprints = [Chem.RDKFingerprint(mol) for mol in mols]
dg = pd.read_csv(r"D:\HBVdata\hbvfrag_recap_delete_duplicate.csv")
smi_list = dg['mol'].tolist()
submols = [Chem.MolFromSmarts(smi) for smi in smi_list]

count_dict = {}
for mol in mols:
    # Count how many fragment patterns occur in this molecule
    count = 0
    for submol in submols:
        if mol.HasSubstructMatch(submol):
            count += 1
    # Mol objects have no ToSmiles() method; use Chem.MolToSmiles(mol)
    count_dict[Chem.MolToSmiles(mol)] = count
print(count_dict)
```
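This prints a dictionary in which each key is a molecule's SMILES string and each value is the number of substructures that molecule matches. Because the dictionary is keyed by SMILES, two input rows that describe the same molecule share a single entry.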
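The per-molecule and per-fragment counts differ only in which loop index the tally is attached to, so both can be collected in one pass. The sketch below shows this with a pluggable predicate. The function `count_matches` and the toy data are hypothetical, and plain substring containment is used purely as a stand-in for a real substructure search; it is not chemically meaningful.

```python
def count_matches(molecules, patterns, matches):
    """Tally matches per molecule and per pattern in a single pass."""
    per_molecule = [0] * len(molecules)
    per_pattern = [0] * len(patterns)
    for i, mol in enumerate(molecules):
        for j, patt in enumerate(patterns):
            if matches(mol, patt):
                per_molecule[i] += 1  # patterns found in this molecule
                per_pattern[j] += 1   # molecules containing this pattern
    return per_molecule, per_pattern


# Stand-in data: strings, with substring containment as the predicate
toy_mols = ["CCOC(=O)C", "CCN", "c1ccccc1O", "CCCC"]
toy_patterns = ["CC", "O", "N"]
per_mol, per_patt = count_matches(toy_mols, toy_patterns, lambda m, p: p in m)
print(per_mol)
print(per_patt)
```

With RDKit objects, the same function could be called with `lambda mol, patt: mol.HasSubstructMatch(patt)` as the predicate, passing `mols` and `submols` from the code above.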
decomp =[Recap.RecapDecompose(mol)for mol in mol_list] smiles = [] for dec in decomp:#'rdkit.Chem.Recap.RecapHierarchyNode'> value type smi =dec.GetAllChildren() smile =smi.keys() smiles.append(smile) results = [] for fragment in smiles: results.append([fragment]) df_results = pd.DataFrame(results,columns=['fragment']) with pd.ExcelWriter('D:\HBVdata\hbvrecap.xlsx') as writer: df_results.to_excel(writer, index=False)
How can this code be improved?
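Based on the code you provided, I would suggest the following improvements:
1. Split the code into several functions to improve readability and maintainability.
2. Simplify the loops where possible, and use `enumerate` to keep track of which molecule each fragment came from.
3. Use a `with` statement to manage the file handle so that the file is closed properly once writing is finished.
4. Build the path with `os.path.join` instead of hard-coding it, to make the code more portable. On Windows the drive component needs its trailing separator for the result to be an absolute path.
An improved version of the code is shown below: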
```
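import os

import pandas as pd
from rdkit import Chem
from rdkit.Chem import Recap


def get_smiles(mol):
    """Return the SMILES of every fragment in the RECAP hierarchy of mol."""
    # RecapDecompose returns a single root RecapHierarchyNode, not a list
    root = Recap.RecapDecompose(mol)
    # GetAllChildren() maps fragment SMILES to nodes; the keys are the fragments
    return list(root.GetAllChildren().keys())


def process_mols(mol_list):
    """Build one row per (molecule index, fragment SMILES) pair."""
    results = []
    for i, mol in enumerate(mol_list):
        for fragment in get_smiles(mol):
            results.append([i, fragment])
    return pd.DataFrame(results, columns=['index', 'fragment'])


def save_results(df, filepath):
    """Write the DataFrame to an Excel file without the row index."""
    with pd.ExcelWriter(filepath) as writer:
        df.to_excel(writer, index=False)


if __name__ == '__main__':
    # Two tiny placeholder molecules; they may yield few or no fragments
    mol_list = [Chem.MolFromSmiles('CCO'), Chem.MolFromSmiles('CCN')]
    df_results = process_mols(mol_list)
    # 'D:\\' keeps the path absolute; joining on 'D:' alone is drive-relative
    filepath = os.path.join('D:\\', 'HBVdata', 'hbvrecap.xlsx')
    save_results(df_results, filepath)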
```
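In the code above, `get_smiles` returns the SMILES strings of the RECAP fragments of a given molecule. `process_mols` processes a list of molecules and returns the results as a DataFrame with one row per molecule index and fragment. `save_results` writes that DataFrame to an Excel file.
In the `__main__` block, `mol_list` is defined with two simple molecules and passed to `process_mols`; `save_results` then writes the result to the specified Excel file. Replace the two placeholder molecules with your own list to reproduce the original workflow.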
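On the path-handling suggestion, one detail is easy to miss: joining onto a bare drive letter such as `'D:'` yields a drive-relative path on Windows, not `D:\HBVdata\...`. The sketch below illustrates this using `ntpath`, the Windows implementation of `os.path`, so it behaves the same on any operating system. The `pathlib` line is an optional alternative, not something the original answer uses.

```python
import ntpath
from pathlib import PureWindowsPath

# A bare drive letter gives a path relative to the current directory on D:
relative = ntpath.join('D:', 'HBVdata', 'hbvrecap.xlsx')
# Including the root separator gives an absolute path
absolute = ntpath.join('D:\\', 'HBVdata', 'hbvrecap.xlsx')
print(relative, ntpath.isabs(relative))
print(absolute, ntpath.isabs(absolute))

# pathlib builds the same absolute path with the / operator
as_pathlib = str(PureWindowsPath('D:/') / 'HBVdata' / 'hbvrecap.xlsx')
print(as_pathlib)
```

When the output location is fixed, a raw string literal such as `r'D:\HBVdata\hbvrecap.xlsx'` is a simpler way to avoid backslash-escape surprises.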