Monday 14 September 2015

Loading data into HBase - bulk and non-bulk loading

Copy the input file to HDFS (despite its .tsv extension, the bulk-load command in section 1 reads it as comma-separated):

hadoop fs -put test.tsv /tmp/
hadoop fs -ls /tmp/
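
The ImportTsv commands in this post map three fields per line to HBASE_ROW_KEY, cf1:c1 and cf1:c2. As a rough sketch of what a matching input file could look like, the snippet below writes a small comma-separated test.tsv and checks the field count. The row keys and values are made-up sample data, not the contents of the original file.

```shell
# Create a small sample input file (hypothetical data) with three
# comma-separated fields per line: row key, cf1:c1 value, cf1:c2 value.
cat > test.tsv <<'EOF'
row1,a1,b1
row2,a2,b2
row3,a3,b3
EOF

# Sanity check before uploading: count lines that do not have exactly
# three fields, since they would not match the column mapping.
awk -F',' 'NF != 3 { bad++ } END { printf "%d bad line(s)\n", bad + 0 }' test.tsv
```

Once the check reports no bad lines, the file can be uploaded with the hadoop fs -put command shown above.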

1. BULK LOADING

a) Preparing StoreFiles

hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns="HBASE_ROW_KEY,cf1:c1,cf1:c2" -Dimporttsv.separator="," -Dimporttsv.bulk.output="/tmp/hbaseoutput" t1 /tmp/test.tsv

b) Upload the data from the HFiles located at /tmp/hbaseoutput to the HBase table t1

hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /tmp/hbaseoutput t1
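
Steps a) and b) can be chained so that the load only runs if the HFiles were written successfully. The following is a minimal sketch, not a tested production script: the run helper and the DRY_RUN variable are hypothetical names introduced here, and the table t1 with column family cf1 is assumed to exist already. By default the script only prints the two commands.

```shell
# Settings taken from the commands in this post.
TABLE="t1"
INPUT="/tmp/test.tsv"
HFILE_DIR="/tmp/hbaseoutput"
COLUMN_MAP="HBASE_ROW_KEY,cf1:c1,cf1:c2"

# Hypothetical helper: print the command when DRY_RUN is 1 (the default),
# otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

bulk_load() {
  # a) write HFiles to HFILE_DIR, then b) load them into the table;
  # '&&' skips step b) if step a) fails.
  run hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
      "-Dimporttsv.columns=${COLUMN_MAP}" \
      "-Dimporttsv.separator=," \
      "-Dimporttsv.bulk.output=${HFILE_DIR}" \
      "${TABLE}" "${INPUT}" &&
  run hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles \
      "${HFILE_DIR}" "${TABLE}"
}

bulk_load
```

Setting DRY_RUN=0 in the environment makes the same script execute the commands instead of printing them.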


2. NON-BULK LOADING

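Load the data from the file in HDFS into HBase via Puts. No separator is passed here, so ImportTsv falls back to its default, a tab; add -Dimporttsv.separator="," if the file is comma-separated, as in the bulk example.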

hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns="HBASE_ROW_KEY,cf1:c1,cf1:c2" t1 /tmp/test.tsv

Sunday 25 May 2014

Postgres tablespace creation

Create a tablespace in PostgreSQL in two simple steps:

1) Make a tablespace directory (it must be empty and owned by the PostgreSQL operating system user)
mkdir -p /var/lib/pgsql/tablespaces/<tablespace_name>
cd /var/lib/pgsql/tablespaces/
chmod -R 700 <tablespace_name>

2) Create tablespace
psql test
test=# create tablespace <tablespace_name> location '/var/lib/pgsql/tablespaces/<tablespace_name>';

After creating the tablespace, set it as the default in the DDL so that new tables are created in it:
SET default_tablespace = <tablespace_name>;
CREATE TABLE mytable(id integer);

The table 'mytable' now belongs to the newly created tablespace.
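
To confirm where the table ended up, the pg_tables view has a tablespace column. The sketch below is one way to script the SQL from this post together with that check. The function name tablespace_sql and the tablespace name ts_data are hypothetical, chosen only for illustration; the function just prints SQL and does not connect to a database.

```shell
# Print the SQL for creating a tablespace, creating a table in it and
# checking which tablespace the table was placed in.
# Arguments: tablespace name, parent directory, table name.
tablespace_sql() {
  cat <<EOF
CREATE TABLESPACE $1 LOCATION '$2/$1';
SET default_tablespace = $1;
CREATE TABLE $3(id integer);
SELECT tablename, tablespace FROM pg_tables WHERE tablename = '$3';
EOF
}

# Example with a hypothetical tablespace name; to execute the statements,
# pipe the output into psql:
#   tablespace_sql ts_data /var/lib/pgsql/tablespaces mytable | psql test
tablespace_sql ts_data /var/lib/pgsql/tablespaces mytable
```

The final SELECT should list the new tablespace name next to mytable; an empty tablespace value would mean the table sits in the database's default tablespace.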

Thursday 8 May 2014

Slony - number of records yet to be processed in sl_log tables

Two tables in Slony, sl_log_1 and sl_log_2, store the changes that need to be propagated to the subscriber nodes. Slony switches logging between these two tables and truncates the inactive one once all of its changes have been propagated to the subscribers. The tables can grow very large when a big table or a large data set is being synced. You may also notice in the logs that SYNC events take a long time:

remoteWorkerThread_4: SYNC 8002311133 done in 12.30 seconds


You may also see this notice in the Slony log on the master:
NOTICE: Slony-I: could not lock sl_log_1 - sl_log_1 not truncated

In these situations it helps to know how many records Slony has yet to process.

Query to find the number of records in sl_log_1 yet to be processed by Slony:
select count(*) from sl_log_1 where log_txid>(select split_part(cast(ev_snapshot as text),':',1)::bigint from sl_event where ev_seqno=(select st_last_event from sl_status));

Similarly, you can find the number of records yet to be processed in sl_log_2 using:

select count(*) from sl_log_2 where log_txid>(select split_part(cast(ev_snapshot as text),':',1)::bigint from sl_event where ev_seqno=(select st_last_event from sl_status));
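
Both counts can also be fetched in one go. The sketch below generates a single UNION ALL query over the two log tables, using the same condition as the queries above. The function name pending_rows_sql and the cluster schema _mycluster are hypothetical; Slony keeps its tables in a schema named after the cluster with a leading underscore, so pass that schema name, or pass nothing if search_path already includes it.

```shell
# Print one SQL query reporting pending rows for sl_log_1 and sl_log_2.
# Optional argument: the Slony cluster schema (e.g. "_mycluster").
pending_rows_sql() {
  prefix=""
  [ -n "${1:-}" ] && prefix="$1."
  first=1
  for t in sl_log_1 sl_log_2; do
    # Join the per-table queries with UNION ALL.
    [ "$first" = 1 ] || echo "UNION ALL"
    first=0
    cat <<EOF
SELECT '${t}' AS log_table, count(*) AS pending FROM ${prefix}${t}
 WHERE log_txid > (SELECT split_part(cast(ev_snapshot AS text), ':', 1)::bigint
                     FROM ${prefix}sl_event
                    WHERE ev_seqno = (SELECT st_last_event FROM ${prefix}sl_status))
EOF
  done
  echo ";"
}

# Example with a hypothetical cluster schema; to execute, pipe into psql:
#   pending_rows_sql _mycluster | psql test
pending_rows_sql _mycluster
```

Running the generated query periodically gives a rough view of whether the backlog in the log tables is shrinking.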