Ohio Bronies - Forums

#1 2014-04-25 22:42:10

Sarteck
Best Pony

/mlp/ Image downloader

Updated my image downloader since 4chan updated shit a few days ago.

Also fixed a few bugs (none particularly important).

So what's this DO?

Basically, it's an easy way to download all images from a thread on /mlp/.

This has been tested on Mac OS X, Ubuntu 12.04 and CentOS 5.3.  This will obviously NOT work on Windows.



#!/bin/bash

# Programmed by Sarteck
# Feel free to use, modify, sell, claim as your own, or whatever, I don't give a shit



PWD="$(pwd)" ;
SCRIPTNAME="$(basename "$0")" ;
VERBOSE=false ;
ZIP=false ;
PERSISTENT=false ;
DELAY=1 ;

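## Outname is Thread ID number by default.
## Scan every argument, since $1 may be an option flag rather than the URL.
tdir=$(printf '%s\n' "$@" | egrep -o 'https?://boards\.4chan\.org/mlp/thread/[0-9]+' | egrep -o '[0-9]+$' | head -n 1) ;
OUTNAME="$tdir" ;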


function print_usage()
{
  echo "  Usage: $SCRIPTNAME [-hvz] [-p DELAY] [-o OUTNAME] <Thread URI>" ;
  echo "           -h         : Print this menu and exit." ;
  echo "           -v         : Verbose" ;
  echo "           -z         : Tar and Zip output after downloading " ;
  echo "           -p DELAY   : 'Persistent' (will continue running, checking every DELAY seconds for new images, if not downloading)" ;
  echo "           -o OUTNAME : Saves images to OUTNAME directory (thread ID number if unspecified)." ;
  echo "                        If -z flag set, tar/zips file to that name." ;
  echo "                        If directory (or tar/zip file if -z) with OUTNAME exists, uses that directory(/file)." ;
}

function imglist_init()
{
  ## If for some reason the image list exists, empty it.  Else, create it.
  if [ -f imglist ] ; then >imglist ; else touch imglist ; fi ;
  ## Populate image list
  curl --silent "$threadID" | egrep -o '<a href="[^"]*"' | egrep -o 'i\.4cdn\.org/[a-zA-Z0-9]+/[0-9]+\.(jpg|png|gif)' | sort -u > imglist ;
}

function create_directory()
{
  ## Create directory if it doesn't exist.
  ## If -z, check for zipped file, unzip it.
  if $ZIP ; then
    if [ -f "${OUTNAME}.tar.gz" ] ; then
      tar -zxf "${OUTNAME}.tar.gz" ;
      rm "${OUTNAME}.tar.gz" ;
    else
      if [ ! -d "$OUTNAME" ]; then
        mkdir "$OUTNAME" ;
        if $VERBOSE ; then echo "$OUTNAME Created." ; fi
      fi
    fi ;
  else
    if [ ! -d "$OUTNAME" ]; then
      mkdir "$OUTNAME" ;
      if $VERBOSE ; then echo "$OUTNAME Created." ; fi
    else
      if $VERBOSE ; then echo "$OUTNAME exists" ; fi
    fi
  fi ;
}

function fetch_images()
{
  imglist_init ;
  nimgs=$(cat imglist | wc -l); #Number of images in thread
  if $VERBOSE ; then echo "$nimgs in thread." ; fi
  dimgs=0; #Number of images actually downloaded
  while read line ; do
    filename=$(echo "$line" | cut -d/ -f3) ;
    if [ -f "$filename" ]; then
      if $VERBOSE ; then echo "$filename exists, skipping..." ; fi
    else
      curl --silent "$line" -o "$filename" ;
      if $VERBOSE ; then echo "$filename downloaded." ; fi
      let dimgs++
    fi
  done<imglist ;
  echo "$dimgs Images downloaded."
  rm imglist ;
}


function updateThreadStatus()
{
  local PAGE=$(curl --silent "$threadID") ;
  if [[ $PAGE == *'<title>4chan - 404 Not Found</title>'* ]] ; then threadUp=false ; fi
}


### Arguments:
###  -v         : Verbose
###  -z         : Tar/Gzip (Will attempt to untar if previous file exists as well)
###  -p         : Persistent -- will run until thread 404's
###  -o OUTNAME : Name of directory (or tar file) to save images to
###  -h         : usage

OPTIND=1
while getopts ":hvzp:o:" opt; do
  case $opt in
    h)
      print_usage ;
      exit 0 ;
      ;;
    v)
      VERBOSE=true ;
      ;;
    z)
      ZIP=true ;
      ;;
    p)
      PERSISTENT=true ;
      if [[ $OPTARG = *[!0-9]* ]]; then echo "Persistent DELAY '$OPTARG' needs to be an integer." >&2 ; print_usage ; exit 1; fi
      DELAY="$OPTARG" ;
      ;;
    o)
      OUTNAME="$OPTARG" ;
      ;;
    \?)
      echo "Invalid option: -$OPTARG" >&2
      print_usage ;
      exit 1
      ;;
    :)
      echo "Option -$OPTARG requires an argument." >&2
      print_usage ;
      exit 1
      ;;
  esac
done
shift $(($OPTIND - 1)) ;
threadID="$1"
if [ "$threadID" = "" ]; then echo "No thread specified: $* " >&2 ; print_usage ; exit 1 ; fi

## Moved this bit up to the top.
## Need to think of a more elegant way.
## Outname is Thread ID number by default
#tdir=$(echo "$1" | egrep -o '([0-9]*$)' ) ;
#OUTNAME="$tdir" ;

create_directory ;
## Bail out if the directory can't be entered, so nothing lands in the wrong place.
cd "$OUTNAME" || exit 1 ;
pwd ;
cimgs=$(($(ls -l | wc -l )-1)); #Current number of images in directory
if $VERBOSE ; then echo "$cimgs in directory." ; fi
if $PERSISTENT ; then
  threadUp=true
  updateThreadStatus ;
  while $threadUp ; do
    fetch_images ;
    echo "Sleeping for $DELAY seconds..." ;
    sleep $DELAY ;
    updateThreadStatus ;
    if $threadUp ; then
      echo "Thread still live, continuing." ;
    else
      echo "Thread 404'd, exiting." ;
    fi ;
  done ;
else
  fetch_images ;
fi
if $ZIP ; then
  cd .. ;
  tar cf - "$OUTNAME" | gzip -9 - > "${OUTNAME}.tar.gz" ;
  rm -rf "$OUTNAME" ;
fi ;

exit 0;

Run it like this:

./mlp.sh -vo test http://boards.4chan.org/mlp/thread/1742 … cronomicon

The "slug" (the little SEO bullshit at the end of the URL) is not needed.  You can use HTTP or HTTPS, doesn't matter.
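For the curious, here's roughly how you could pull the thread number out of a URL like that, slug or no slug, HTTP or HTTPS. This is only a sketch in plain sh, not what the script above actually does, and `thread_id_from_url` is a made-up helper name.

```bash
# Hypothetical helper (not part of the script above): print the numeric
# thread ID from a /mlp/ thread URL, ignoring any trailing slug.
thread_id_from_url() {
  # Accept only http or https /mlp/ thread URLs that start with a digit.
  case "$1" in
    http://boards.4chan.org/mlp/thread/[0-9]*|https://boards.4chan.org/mlp/thread/[0-9]*) ;;
    *) return 1 ;;
  esac
  # Drop everything up to and including "/thread/", e.g. "12345/some-slug".
  rest="${1#*/thread/}"
  # Cut from the first non-digit onward, leaving just the ID.
  printf '%s\n' "${rest%%[!0-9]*}"
}
```

Calling `thread_id_from_url "$1"` prints just the digits, and returns non-zero if the argument isn't a /mlp/ thread URL.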


-v: VERBOSE, gives more shit for you to see what's going on.
-h: HELP, prints a help menu and then quits without downloading anything.
-z: Tar and Gzip output.
-p NUM: (NUM is required, must be integer), persistently check thread, and attempt the checks every NUM seconds.  (E.G., -p 30 if you want to check every 30 seconds for new images).
-o OUTFILE: (OUTFILE required), specify the name of the directory you want to output to.  (If not specified, uses the Thread ID number.)

Some things to note:

If you already HAVE some images from the thread and run this a second (third, fourth, etc.) time, as long as the OUTFILE is the same, it will not re-download the images--it will simply add new ones.  No worries about bandwidth.

If you specify to ZIP output, and you already have some images (whether in the OUTFILE directory OR in the OUTFILE zip), it will also just add to what you got (unzipping/untarring first if necessary).
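The "only grab what's new" part boils down to checking whether a file with that name is already sitting in the output directory. Here's a minimal sketch of that check, assuming one image URL per line on stdin (same shape as the imglist file). `missing_images` is a hypothetical helper for illustration, not something in the script above.

```bash
# Hypothetical helper: read image URLs on stdin and print only those
# whose file name does not already exist in the given directory.
missing_images() {
  dir="$1"
  while IFS= read -r line; do
    # ${line##*/} strips everything up to the last slash, leaving the file name.
    [ -f "$dir/${line##*/}" ] || printf '%s\n' "$line"
  done
}
```

Something like `missing_images test < imglist` would then list only the images still left to fetch.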



#2 2014-04-26 10:35:19

Star ★
Pony
Starshine Trotter

Re: /mlp/ Image downloader

Sarteck wrote:

This will obviously NOT work on Windows.

horseapples

and also I enjoy bikeshed moments, so:

import urllib.request, urllib.parse, lxml.html, posixpath, time, sys, os
out = sys.stdout

for url in sys.argv[1:]:
    out.write('Fetching %s ...\n' % url)

    imgs = (lxml.html.document_fromstring(urllib.request.urlopen(url).read())
              .cssselect(','.join('.fileText a[href^="//"][href$="%d.%s"]' % (d, s)
              for d in range(10) for s in ['jpg','gif','png'])))

    print('%d %s found' % (len(imgs), ('image' if len(imgs) == 1 else 'images')))

    for a in imgs:
        href = urllib.parse.urljoin(url, a.attrib['href'])
        base = posixpath.basename(href)

        if not os.path.exists(base):
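            def r(count, block, total):
                # total is -1 when the server sends no Content-Length, and
                # count * block can overshoot on the last block, so clamp to [0, 1]
                p = min(1.0, count * block / total) if total > 0 else 0.0
                b = int(48 * p)
                out.write('\r%-23s %3d%% [%s%s]' % (base, 100 * p, '#' * b, ' ' * ((48 - b))))
                out.flush()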

            urllib.request.urlretrieve(href, base, r)
            out.write('\n')
            time.sleep(1)

could be improved with argparse, but for scripts this small I prefer to hack the code to do something different
also doesn't really error check, but whatevs


#3 2014-04-28 20:14:57

Sarteck
Best Pony

Re: /mlp/ Image downloader

/me still doesn't know a thing about Python.

Seems that everyone is loving it...  I don't want to switch from my good ol' perl and PHP buddies, though. (And BASH, obviously.)

As for mingw, I don't really know what it might or might not support.  I'll have to see if I can run this on my parent's old computer or something.  Heh.



#4 2014-04-29 00:38:45

Star ★
Pony
Starshine Trotter

Re: /mlp/ Image downloader

Well okay, then how about Lua? I have been playing with it a lot lately.

local socket = require 'socket'
socket.http = require 'socket.http'
socket.url = require 'socket.url'
local htmlparser = require 'htmlparser'

for _, url in ipairs(arg) do
    print('Fetching ' .. url .. ' ...')
    local data, status = socket.http.request(url)
    assert(status == 200)

    -- htmlparser's css support is rather limited, so we can only hit one selector at a time
    local doc = htmlparser.parse(data)
    local imgs = {}
    for d = 0, 9 do
        for _, s in ipairs({'jpg', 'gif', 'png'}) do
            for _, a in ipairs(doc:select(string.format('.fileText a[href^="//"][href$="%d.%s"]', d, s))) do
                table.insert(imgs, a)
            end
        end
    end

    print(#imgs .. ' ' .. (#imgs == 1 and 'image' or 'images') .. ' found')

    for _, a in ipairs(imgs) do
        local href = socket.url.absolute(url, a.attributes.href)
        local base = href:sub(href:find('[^/]+/*$'))

        local f = io.open(base)
        if f == nil then -- doesn't exist
            -- (no spiffy progress bar. meh)
            print(base)
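            -- fetch first, so a failed request doesn't leave an empty file behind
            local data = assert(socket.http.request(href))
            f = assert(io.open(base, 'wb')) -- binary mode, so Windows doesn't mangle the image bytes
            f:write(data)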
            socket.select(nil, nil, 1) -- delay
        end
        f:close()
    end
end

I've successfully built Unix stuff on MingW, with autoconf/automake and the like, and had to make few (if any) changes.


#5 2014-04-30 21:25:41

Jinzo
The veteran of many pretend wars

Re: /mlp/ Image downloader

Lol it will work on windows... if you want to run a VM or do Cygwin yikes


"The world needs more Bronies" - John de Lancie
Twitter @JinzoDefiler

#6 2014-05-01 11:35:34

Star ★
Pony
Starshine Trotter

Re: /mlp/ Image downloader

Jinzo wrote:

Lol it will work on windows... if you want to run a VM or do Cygwin yikes

why are you making it complicated oh okay complicating things is fun sometimes.

but seriously mingw is all you need most of the time for windows portability


#7 2014-05-03 11:17:53

Jinzo
The veteran of many pretend wars

Re: /mlp/ Image downloader

Starshine ★ wrote:
Jinzo wrote:

Lol it will work on windows... if you want to run a VM or do Cygwin yikes

why are you making it complicated oh okay complicating things is fun sometimes.

but seriously mingw is all you need most of the time for windows portability


i forgot about mingw to be honest..



#8 2014-05-03 12:21:46

Star ★
Pony
Starshine Trotter

Re: /mlp/ Image downloader

... it's been mentioned like 4 times in this topic lol
