I used Nutch some time ago. It is a very nice project and you can easily adapt it to different purposes (I know that some big companies use it). Bixo (openbixo) also looks very nice, though I haven't tested it yet. Depending on your purpose and how much time you have, I would suggest writing your own crawler using some parallel programming (there are a lot of details in this part, such as using a Bloom filter to track the URLs already fetched) and a database (Cassandra, for example) to store what you fetch.
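To give an idea of the Bloom filter part, here is a minimal sketch in Python. It is not taken from Nutch or Bixo; the class name `BloomFilter`, its parameters, and the choice of SHA-256 with double hashing are my own assumptions for illustration. A Bloom filter never forgets a URL it has seen, but it can occasionally report an unseen URL as seen, so a crawler using it may skip a small fraction of pages.

```python
import hashlib
import math


class BloomFilter:
    """Illustrative Bloom filter for remembering already-fetched URLs."""

    def __init__(self, expected_items: int, false_positive_rate: float = 0.01):
        # Standard sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        self.num_bits = max(
            8,
            math.ceil(-expected_items * math.log(false_positive_rate) / math.log(2) ** 2),
        )
        self.num_hashes = max(1, round(self.num_bits / expected_items * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, url: str):
        # Derive k bit positions from one SHA-256 digest (double hashing).
        digest = hashlib.sha256(url.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # keep the step non-zero
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, url: str) -> None:
        # Set every bit position associated with this URL.
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, url: str) -> bool:
        # True only if all of the URL's bit positions are set.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(url)
        )


# Hypothetical usage inside a crawl loop.
seen = BloomFilter(expected_items=1_000_000)
url = "http://example.com/page"
if url not in seen:
    seen.add(url)
    # fetch the page here
```

If several threads share the filter, calls to `add` and the membership check would need a lock around them (or each worker would need its own partition of the URL space), since the check-then-add sequence is not atomic.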