Hadoop key mismatch

Greenhorn

Posts: 25

posted 14 years ago

Hello,

I hope this is the correct forum for a Hadoop question.

I have a file with a bunch of lines like this:

It continues for all 50 states, then moves on to another word, like politics:30 Virginia ... etc.

I want to do a distributed sort on this using MapReduce. I know MapReduce sorts by key between the map and reduce stages, so I just want to emit each record from map and then from reduce without any processing, but it is not working. Here are my map and reduce functions:

Here is my main class:

And here is the InputFormat class I wrote, since FileInputFormat would always fail:

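import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// Input format that turns each line into a (Text key, Text value) pair.
class CountInputFormat extends FileInputFormat<Text, Text>
{
	@Override
	public RecordReader<Text, Text> createRecordReader(InputSplit is, TaskAttemptContext tac) throws IOException, InterruptedException
	{
		// The framework calls initialize() on the returned reader itself,
		// so calling it here as well would open the split twice.
		return new CountReader();
	}
}

// Wraps a LineRecordReader and splits each line on whitespace:
// the first token becomes the key, the second token becomes the value.
class CountReader extends RecordReader<Text, Text>
{
	private LineRecordReader lineReader;
	private Text lineKey;
	private Text lineValue;

	@Override
	public void initialize(InputSplit is, TaskAttemptContext tac) throws IOException, InterruptedException
	{
		lineReader = new LineRecordReader();
		lineReader.initialize(is, tac);
	}

	@Override
	public boolean nextKeyValue() throws IOException, InterruptedException
	{
		if (!lineReader.nextKeyValue())
		{
			return false;
		}

		String line = lineReader.getCurrentValue().toString();
		String[] parts = line.split("\\s");

		// Fail with a readable message instead of an ArrayIndexOutOfBoundsException.
		if (parts.length < 2)
		{
			throw new IOException("Expected \"key value\" but got: " + line);
		}

		lineKey = new Text(parts[0]);
		lineValue = new Text(parts[1]);

		return true;
	}

	@Override
	public Text getCurrentKey() throws IOException, InterruptedException
	{
		return lineKey;
	}

	@Override
	public Text getCurrentValue() throws IOException, InterruptedException
	{
		return lineValue;
	}

	@Override
	public float getProgress() throws IOException, InterruptedException
	{
		return lineReader.getProgress();
	}

	@Override
	public void close() throws IOException
	{
		if (lineReader != null)
		{
			lineReader.close();
		}
	}
}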

Here is the error:

Feb 24, 2010 2:48:40 PM org.apache.hadoop.metrics.jvm.JvmMetrics init
INFO: Initializing JVM Metrics with processName=JobTracker, sessionId=
Feb 24, 2010 2:48:40 PM org.apache.hadoop.metrics.jvm.JvmMetrics init
INFO: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
Feb 24, 2010 2:48:40 PM org.apache.hadoop.mapred.JobClient configureCommandLineOptions
WARNING: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
Feb 24, 2010 2:48:40 PM org.apache.hadoop.mapred.JobClient configureCommandLineOptions
WARNING: No job jar file set.  User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
Feb 24, 2010 2:48:40 PM org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
INFO: Total input paths to process : 1
Feb 24, 2010 2:48:40 PM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO: Running job: job_local_0001
Feb 24, 2010 2:48:40 PM org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
INFO: Total input paths to process : 1
Feb 24, 2010 2:48:40 PM org.apache.hadoop.mapred.MapTask$MapOutputBuffer <init>
INFO: io.sort.mb = 100
Feb 24, 2010 2:48:40 PM org.apache.hadoop.mapred.MapTask$MapOutputBuffer <init>
INFO: data buffer = 79691776/99614720
Feb 24, 2010 2:48:40 PM org.apache.hadoop.mapred.MapTask$MapOutputBuffer <init>
INFO: record buffer = 262144/327680
Feb 24, 2010 2:48:40 PM org.apache.hadoop.mapred.LocalJobRunner$Job run
WARNING: job_local_0001
java.io.IOException: Type mismatch in key from map: expected org.apache.hadoop.io.LongWritable, recieved org.apache.hadoop.io.Text
	at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:807)
	at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:504)
	at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
	at Test3$SortMap.map(Test3.java:88)
	at Test3$SortMap.map(Test3.java:1)
	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:583)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
	at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:176)
Feb 24, 2010 2:48:41 PM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO:  map 0% reduce 0%
Feb 24, 2010 2:48:41 PM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO: Job complete: job_local_0001
Feb 24, 2010 2:48:41 PM org.apache.hadoop.mapred.Counters log
INFO: Counters: 0

Thanks

Larry Homes

Greenhorn

Posts: 25

posted 14 years ago

I thought I would post the solution I found. It was an incredibly dumb error on my part. In my main class, I named the Job instance sort, but when setting the map output key, map output value, output key and output value classes, I used the identifier job. That identifier belonged to a previous MapReduce job in the chain; I had copied and pasted the code without remembering to change it. As a result, sort kept the default map output key class, LongWritable, which is why the Text key from my mapper was rejected.
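
For anyone who lands here with the same exception, here is a rough stand-alone sketch of why configuring the wrong object gives exactly this message. It does not use Hadoop at all: FakeJob, collect and emit are hypothetical stand-ins made up for illustration, Long.class plays the role of the default LongWritable key class, and String plays the role of Text. It only models the check that rejects a map output key whose class differs from the configured one.

```java
import java.util.ArrayList;
import java.util.List;

public class JobMixupDemo {

    /** Hypothetical stand-in for a Hadoop Job: it only tracks the map output key class. */
    static class FakeJob {
        final String name;
        // Stand-in for the default map output key class (LongWritable in Hadoop).
        Class<?> mapOutputKeyClass = Long.class;
        final List<Object> collected = new ArrayList<>();

        FakeJob(String name) {
            this.name = name;
        }

        void setMapOutputKeyClass(Class<?> keyClass) {
            mapOutputKeyClass = keyClass;
        }

        /** Models the collector's check: the key must be exactly the configured class. */
        void collect(Object key) {
            if (key.getClass() != mapOutputKeyClass) {
                throw new IllegalStateException("Type mismatch in key from map: expected "
                        + mapOutputKeyClass.getName() + ", received " + key.getClass().getName());
            }
            collected.add(key);
        }
    }

    /** Emits a String key to the given job; returns the error message, or null on success. */
    static String emit(FakeJob job, String key) {
        try {
            job.collect(key);
            return null;
        } catch (IllegalStateException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        FakeJob job = new FakeJob("previous step in the chain");
        FakeJob sort = new FakeJob("sort");

        // The copy-and-paste mistake: the setter is called on `job`, not on `sort`,
        // so `sort` still has the default key class.
        job.setMapOutputKeyClass(String.class);
        System.out.println("buggy: " + emit(sort, "politics:30"));

        // The fix: configure the instance that is actually being run.
        sort.setMapOutputKeyClass(String.class);
        System.out.println("fixed: " + emit(sort, "politics:30"));
    }
}
```

In the real driver the equivalent fix is to call setMapOutputKeyClass, setMapOutputValueClass, setOutputKeyClass and setOutputValueClass on sort rather than on job.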
