scheduler Unhealthy Get "http://127.0.0.1:10251/healthz": dial tcp 127.0.0.1:10251: connect: connection refused
时间: 2024-04-18 10:24:20 浏览: 21
这个错误通常是由于调度器(scheduler)无法与 kubelet 通信导致的。kubelet 是运行在每个节点上的 Kubernetes 组件,负责与容器进行交互。你可以尝试以下几个步骤来解决问题:
1. 检查 kubelet 是否正在运行并且没有出现任何错误。你可以使用以下命令检查 kubelet 状态:
```
systemctl status kubelet
```
2. 如果 kubelet 正在运行但出现了错误,你可以尝试重启 kubelet 服务:
```
systemctl restart kubelet
```
3. 检查调度器的日志以获取更多详细信息。调度器的日志通常位于 `/var/log/kubernetes/scheduler.log` 或者 `/var/log/pods/kube-system_kube-scheduler-<node-name>_<pod-id>/scheduler/0.log`。
4. 如果你使用的是 kube-proxy,也可以尝试重启 kube-proxy 服务:
```
systemctl restart kube-proxy
```
如果以上步骤都没有解决问题,你可能需要进一步检查集群的网络配置和组件之间的通信。
相关问题
W0225 17:09:39.827262 43878 manifests.go:214] the default kube-apiserver authorization-mode is "Node,RBAC"; using "Node,RBAC" [control-plane] Creating static Pod manifest for "kube-scheduler" W0225 17:09:39.827977 43878 manifests.go:214] the default kube-apiserver authorization-mode is "Node,RBAC"; using "Node,RBAC" [etcd] Creating static Pod manifest for local etcd in "/etc/kubernetes/manifests" [wait-control-plane] Waiting for the kubelet to boot up the control plane as static Pods from directory "/etc/kubernetes/manifests". This can take up to 4m0s [kubelet-check] Initial timeout of 40s passed. [kubelet-check] It seems like the kubelet isn't running or healthy. [kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get http://localhost:10248/healthz: dial tcp 127.0.0.1:10248: connect: connection refused.
根据您提供的信息,kubeadm正在创建kube-scheduler的静态Pod清单,并在"/etc/kubernetes/manifests"目录下创建本地etcd的静态Pod清单。
然后,kubeadm等待kubelet以静态Pod的形式从"/etc/kubernetes/manifests"目录启动控制平面。这个过程可能需要最多4分钟。
但是,在等待过程中,出现了一些问题。警告信息"W0225 17:09:39.827262 43878 manifests.go:214] the default kube-apiserver authorization-mode is "Node,RBAC"; using "Node,RBAC""和"W0225 17:09:39.827977 43878 manifests.go:214] the default kube-apiserver authorization-mode is "Node,RBAC"; using "Node,RBAC""表示默认的kube-apiserver授权模式为"Node,RBAC",并且将使用此模式。
警告信息"[kubelet-check] Initial timeout of 40s passed."和"[kubelet-check] It seems like the kubelet isn't running or healthy."表示在初始超时时间内无法连接到kubelet,可能是kubelet未运行或出现了健康问题。
警告信息"[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get http://localhost:10248/healthz: dial tcp 127.0.0.1:10248: connect: connection refused."表示尝试通过HTTP请求检查kubelet的健康状态失败,连接被拒绝。
这些警告表明kubelet可能在启动过程中遇到了问题。您可以检查kubelet的日志以获取更多详细信息,可能需要查看kubelet的日志文件以确定具体的问题所在。
如果您需要进一步的帮助,请提供更多信息或具体的错误日志,以便我们能够更好地帮助您解决问题。
https://github.com/weizhepei/CasRel中run.py解读
`run.py` 是 `CasRel` 项目的入口文件,用于训练和测试模型。以下是 `run.py` 的主要代码解读和功能说明:
### 导入依赖包和模块
首先,`run.py` 导入了所需的依赖包和模块,包括 `torch`、`numpy`、`argparse`、`logging` 等。
```python
import argparse
import logging
import os
import random
import time
import numpy as np
import torch
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler
from casrel import CasRel
from dataset import RE_Dataset
from utils import init_logger, load_tokenizer, set_seed, collate_fn
```
### 解析命令行参数
接下来,`run.py` 解析了命令行参数,包括训练数据路径、模型保存路径、预训练模型路径、学习率等参数。
```python
def set_args():
parser = argparse.ArgumentParser()
parser.add_argument("--train_data", default=None, type=str, required=True,
help="The input training data file (a text file).")
parser.add_argument("--dev_data", default=None, type=str, required=True,
help="The input development data file (a text file).")
parser.add_argument("--test_data", default=None, type=str, required=True,
help="The input testing data file (a text file).")
parser.add_argument("--model_path", default=None, type=str, required=True,
help="Path to save, load model")
parser.add_argument("--pretrain_path", default=None, type=str,
help="Path to pre-trained model")
parser.add_argument("--vocab_path", default=None, type=str, required=True,
help="Path to vocabulary")
parser.add_argument("--batch_size", default=32, type=int,
help="Batch size per GPU/CPU for training.")
parser.add_argument("--gradient_accumulation_steps", default=1, type=int,
help="Number of updates steps to accumulate before performing a backward/update pass.")
parser.add_argument("--learning_rate", default=5e-5, type=float,
help="The initial learning rate for Adam.")
parser.add_argument("--num_train_epochs", default=3, type=int,
help="Total number of training epochs to perform.")
parser.add_argument("--max_seq_length", default=256, type=int,
help="The maximum total input sequence length after tokenization. Sequences longer "
"than this will be truncated, sequences shorter will be padded.")
parser.add_argument("--warmup_proportion", default=0.1, type=float,
help="Linear warmup over warmup_steps.")
parser.add_argument("--weight_decay", default=0.01, type=float,
help="Weight decay if we apply some.")
parser.add_argument("--adam_epsilon", default=1e-8, type=float,
help="Epsilon for Adam optimizer.")
parser.add_argument("--max_grad_norm", default=1.0, type=float,
help="Max gradient norm.")
parser.add_argument("--logging_steps", type=int, default=500,
help="Log every X updates steps.")
parser.add_argument("--save_steps", type=int, default=500,
help="Save checkpoint every X updates steps.")
parser.add_argument("--seed", type=int, default=42,
help="random seed for initialization")
parser.add_argument("--device", type=str, default="cuda" if torch.cuda.is_available() else "cpu",
help="selected device (default: cuda if available)")
args = parser.parse_args()
return args
```
### 加载数据和模型
接下来,`run.py` 加载了训练、验证和测试数据,以及 `CasRel` 模型。
```python
def main():
args = set_args()
init_logger()
set_seed(args)
tokenizer = load_tokenizer(args.vocab_path)
train_dataset = RE_Dataset(args.train_data, tokenizer, args.max_seq_length)
dev_dataset = RE_Dataset(args.dev_data, tokenizer, args.max_seq_length)
test_dataset = RE_Dataset(args.test_data, tokenizer, args.max_seq_length)
train_sampler = RandomSampler(train_dataset)
train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=args.batch_size,
collate_fn=collate_fn)
dev_sampler = SequentialSampler(dev_dataset)
dev_dataloader = DataLoader(dev_dataset, sampler=dev_sampler, batch_size=args.batch_size,
collate_fn=collate_fn)
test_sampler = SequentialSampler(test_dataset)
test_dataloader = DataLoader(test_dataset, sampler=test_sampler, batch_size=args.batch_size,
collate_fn=collate_fn)
model = CasRel(args)
if args.pretrain_path:
model.load_state_dict(torch.load(args.pretrain_path, map_location="cpu"))
logging.info(f"load pre-trained model from {args.pretrain_path}")
model.to(args.device)
```
### 训练模型
接下来,`run.py` 开始训练模型,包括前向传播、反向传播、梯度更新等步骤。
```python
optimizer = torch.optim.Adam([{'params': model.bert.parameters(), 'lr': args.learning_rate},
{'params': model.subject_fc.parameters(), 'lr': args.learning_rate},
{'params': model.object_fc.parameters(), 'lr': args.learning_rate},
{'params': model.predicate_fc.parameters(), 'lr': args.learning_rate},
{'params': model.linear.parameters(), 'lr': args.learning_rate}],
lr=args.learning_rate, eps=args.adam_epsilon, weight_decay=args.weight_decay)
total_steps = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs
warmup_steps = int(total_steps * args.warmup_proportion)
scheduler = torch.optim.lr_scheduler.LambdaLR(
optimizer,
lr_lambda=lambda epoch: 1 / (1 + 0.05 * (epoch - 1))
)
global_step = 0
best_f1 = 0
for epoch in range(args.num_train_epochs):
for step, batch in enumerate(train_dataloader):
model.train()
batch = tuple(t.to(args.device) for t in batch)
inputs = {
"input_ids": batch[0],
"attention_mask": batch[1],
"token_type_ids": batch[2],
"subj_pos": batch[3],
"obj_pos": batch[4],
"subj_type": batch[5],
"obj_type": batch[6],
"subj_label": batch[7],
"obj_label": batch[8],
"predicate_label": batch[9],
}
outputs = model(**inputs)
loss = outputs[0]
loss.backward()
if (step + 1) % args.gradient_accumulation_steps == 0:
torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm)
optimizer.step()
scheduler.step()
optimizer.zero_grad()
global_step += 1
if global_step % args.logging_steps == 0:
logging.info(f"Epoch:[{epoch + 1}]/[{args.num_train_epochs}] Step:[{global_step}] "
f"Train loss:{loss.item():.6f}")
if global_step % args.save_steps == 0:
f1 = evaluate(model, dev_dataloader, args)
if f1 > best_f1:
best_f1 = f1
torch.save(model.state_dict(), os.path.join(args.model_path, "best_model.bin"))
logging.info(f"Save model at step [{global_step}] with best f1 {best_f1:.4f}")
```
### 测试模型
最后,`run.py` 对模型进行测试,输出模型在测试集上的预测结果。
```python
model.load_state_dict(torch.load(os.path.join(args.model_path, "best_model.bin"), map_location="cpu"))
logging.info(f"load best model from {os.path.join(args.model_path, 'best_model.bin')}")
f1, precision, recall = evaluate(model, test_dataloader, args)
logging.info(f"Test f1:{f1:.4f} precision:{precision:.4f} recall:{recall:.4f}")
```
以上就是 `run.py` 的主要代码解读和功能说明。