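Java: use SequenceFile to generate files of random (integer, string) records as (key, value), read the keys in each file (random integers from 1 to 100000), and count the distribution of the data by digit position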
Posted: 2024-01-28 08:05:05
First, we need to generate the SequenceFile files, where each key is a random integer and each value is a string. The following code does this:
```java
import java.io.IOException;
import java.util.Random;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileGenerator {
    public static void main(String[] args) throws IOException {
        String path = "data/sequencefile";
        int numFiles = 3;
        int numRecordsPerFile = 100000;
        Configuration conf = new Configuration();
        Random rand = new Random();
        for (int i = 0; i < numFiles; i++) {
            Path file = new Path(path + "/file_" + i);
            // try-with-resources closes the writer even if append() throws
            try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(file),
                    SequenceFile.Writer.keyClass(IntWritable.class),
                    SequenceFile.Writer.valueClass(Text.class))) {
                for (int j = 0; j < numRecordsPerFile; j++) {
                    // nextInt(100000) returns 0..99999, so the key is 1..100000
                    IntWritable key = new IntWritable(rand.nextInt(100000) + 1);
                    Text value = new Text("record_" + j);
                    writer.append(key, value);
                }
            }
        }
    }
}
```
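The code above generates 3 SequenceFile files with 100,000 records each. Each key is a random integer between 1 and 100000 (inclusive), and each value is the string "record_" followed by the record's index within its file. The files are saved under the "data/sequencefile" directory.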
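The key range can be checked without Hadoop. The sketch below reuses the generator's expression `rand.nextInt(100000) + 1`; the class name `KeyRangeDemo`, the method `nextKey`, and the fixed seed are assumptions made for illustration only.

```java
import java.util.Random;

// Hypothetical helper, not part of the original generator.
final class KeyRangeDemo {
    // Same expression as the generator: nextInt(100000) yields 0..99999,
    // so adding 1 shifts the range to 1..100000 (both ends inclusive).
    static int nextKey(Random rand) {
        return rand.nextInt(100000) + 1;
    }

    public static void main(String[] args) {
        Random rand = new Random(42L); // fixed seed only to make the run repeatable
        int min = Integer.MAX_VALUE;
        int max = Integer.MIN_VALUE;
        for (int j = 0; j < 100000; j++) {
            int key = nextKey(rand);
            min = Math.min(min, key);
            max = Math.max(max, key);
        }
        // Both values must fall inside 1..100000.
        System.out.println("min=" + min + " max=" + max);
    }
}
```

Because 100000 is a possible key, the data contains six-digit values as well as shorter ones.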
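Next, we write a MapReduce program that counts how the digits of the keys are distributed by decimal position. The output key is an IntWritable that encodes both the position and the digit found there (position × 10 + digit in the code below). The output value is an IntWritable holding the number of times that digit occurred at that position.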
```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BitDistribution {
    public static class BitMapper extends Mapper<IntWritable, Text, IntWritable, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final IntWritable outputKey = new IntWritable();

        @Override
        protected void map(IntWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            int num = key.get();
            // Count the five low-order decimal positions (ones through ten-thousands).
            for (int i = 0; i < 5; i++) {
                int digit = num % 10;
                // Encode (position, digit) into one key: position * 10 + digit.
                outputKey.set(i * 10 + digit);
                context.write(outputKey, ONE);
                num /= 10;
            }
        }
    }

    public static class BitReducer extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {
        private final IntWritable outputValue = new IntWritable();

        @Override
        protected void reduce(IntWritable key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Sum the occurrences of this (position, digit) pair.
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            outputValue.set(sum);
            context.write(key, outputValue);
        }
    }

    public static void main(String[] args)
            throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "bit distribution");
        job.setJarByClass(BitDistribution.class);
        // The input is SequenceFile data; the default TextInputFormat would pass
        // LongWritable keys to the mapper and fail with a ClassCastException.
        job.setInputFormatClass(SequenceFileInputFormat.class);
        job.setMapperClass(BitMapper.class);
        job.setReducerClass(BitReducer.class);
        job.setMapOutputKeyClass(IntWritable.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```
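The code above counts only the five low-order decimal positions of each key (ones through ten-thousands); a key with fewer digits contributes a 0 at the higher positions. Because the keys can reach 100000, which has six digits, raise the loop bound in BitMapper to 6 to cover the hundred-thousands place as well. Note also that the job sums over every file in the input directory. To get a separate distribution for each file, run the job once per file, passing that file's path as the input and a different output directory each time.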
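The digit extraction and the key scheme can be tried locally in plain Java before running a job. This is a minimal sketch under stated assumptions: the class `DigitKeyDemo` and its methods are hypothetical names, the number of positions is a parameter so that 5 or 6 can be compared, and a `TreeMap` stands in for the shuffle and reduce steps.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical local stand-in for the mapper/reducer pair; no Hadoop needed.
final class DigitKeyDemo {
    // Same scheme as the mapper: position * 10 + digit.
    static int encode(int position, int digit) {
        return position * 10 + digit;
    }

    static int positionOf(int encodedKey) {
        return encodedKey / 10;
    }

    static int digitOf(int encodedKey) {
        return encodedKey % 10;
    }

    // Counts each (position, digit) pair over the given numbers,
    // looking at the `positions` low-order decimal positions.
    static Map<Integer, Integer> countDigits(int[] nums, int positions) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (int num : nums) {
            int rest = num;
            for (int pos = 0; pos < positions; pos++) {
                counts.merge(encode(pos, rest % 10), 1, Integer::sum);
                rest /= 10;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample keys; 100000 needs six positions to show its leading 1.
        Map<Integer, Integer> counts = countDigits(new int[] {7, 42, 100000}, 6);
        for (Map.Entry<Integer, Integer> e : counts.entrySet()) {
            System.out.println("position " + positionOf(e.getKey())
                    + ", digit " + digitOf(e.getKey()) + ": " + e.getValue());
        }
    }
}
```

With five positions the key 100000 contributes only zeros; its leading 1 appears under encoded key 51 only when six positions are examined.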
Finally, run the program from the command line:
```bash
hadoop jar BitDistribution.jar data/sequencefile output
```
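Here, BitDistribution.jar is the compiled jar, data/sequencefile is the directory containing the SequenceFile files, and output is the output directory, which must not already exist. The command assumes the jar's manifest names BitDistribution as its main class; if it does not, pass the class name right after the jar name. With the default single reducer, the results are written to the part-r-00000 file in the output directory.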
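Each line of the result file holds an encoded key and a count, separated by a tab (the default for Hadoop's text output). The sketch below turns such lines back into a readable position, digit, and count. `BitOutputParser` is a hypothetical helper, and it assumes the position * 10 + digit key scheme used by the mapper.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for reading the job's text output; not part of the original answer.
final class BitOutputParser {
    // Parses one "key<TAB>count" line into {position, digit, count}.
    static int[] parseLine(String line) {
        String[] fields = line.trim().split("\t");
        if (fields.length != 2) {
            throw new IllegalArgumentException("expected key<TAB>count, got: " + line);
        }
        int encodedKey = Integer.parseInt(fields[0]);
        int count = Integer.parseInt(fields[1]);
        return new int[] {encodedKey / 10, encodedKey % 10, count};
    }

    public static void main(String[] args) throws IOException {
        // Reads result lines from standard input and prints them decoded.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(System.in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.trim().isEmpty()) {
                    continue; // skip blank lines
                }
                int[] r = parseLine(line);
                System.out.println("position " + r[0] + ", digit " + r[1] + ": " + r[2]);
            }
        }
    }
}
```

One way to feed it is to pipe the result file in, for example with `hadoop fs -cat output/part-r-00000`. As a sanity check, keys drawn uniformly from 1 to 100000 put each digit at each of the five low-order positions with probability exactly 1/10, so with 300,000 records in total every count for those positions should land near 30,000.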